new updates

vivamoto committed May 12, 2021
1 parent 6d3e7d7 commit 9cd49d4
Showing 28 changed files with 465 additions and 174 deletions.
Binary file modified docs/build/doctrees/custom_module.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/build/doctrees/index.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/slurm.doctree
Binary file not shown.
118 changes: 115 additions & 3 deletions docs/build/html/_sources/custom_module.rst.txt
@@ -1,7 +1,7 @@
Working with modules
====================

Modules are a convenient way to manage environment variables for applications use. Unless you use the default installation of Anaconda available in HPC, you'll need to modify custom modules. This section briefly explains how to work with modules and provides a custom module for Miniconda. See the references [#ref]_ to learn more about modules.
Modules are a convenient way to manage the environment variables that applications need. Unless you use the default installation of Anaconda available on the HPC cluster, you'll need to create custom modules. This section briefly explains how to work with modules and provides a custom module for Miniconda. See the references [#ref]_ to learn more about modules.

Environment modules set environment variables with specific values for each application. Run **module avail** to list all modules available::

@@ -39,12 +39,29 @@ Environment modules set environment variables with specific values for each appl
Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

Notice that default modules have a ``(D)`` beside the name and loaded modules are marked with an ``(L)``. You can load a module with ``module load``::

$ module load Anaconda/3-2019.03

You can load a module with ``module load``::
Unload all loaded modules with ``module purge``::

$ module load Anaconda/3-2019.03
$ module purge
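
To verify the result, ``module list`` prints the currently loaded modules; after a purge it reports none (the output below is illustrative)::

    $ module list
    No modules loaded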

Run ``module show`` to list the commands executed in the module::

$ module show Anaconda/3-2020.11
----------------------------------------------------------------------------------------------------------------------------------
/opt/ohpc/pub/modulefiles/Anaconda/3-2020.11.lua:
----------------------------------------------------------------------------------------------------------------------------------
help([[This module loads /scratch/apps/gnu/anaconda3
]])
conflict("Anaconda","Anaconda3","anaconda","python")
setenv("INSTALL_DIR","/scratch/apps/gnu/anaconda3")
prepend_path("LD_LIBRARY_PATH","/scratch/apps/gnu/anaconda3")
prepend_path("LD_LIBRARY_PATH","/scratch/apps/gnu/anaconda3/libexec")
prepend_path("INCLUDE","/scratch/apps/gnu/anaconda3/include")
prepend_path("PATH","/scratch/apps/gnu/anaconda3/sbin")
prepend_path("PATH","/scratch/apps/gnu/anaconda3/bin")

Create custom module
--------------------
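
As a starting point, here is a minimal Lua modulefile for Miniconda modeled on the Anaconda module shown above. This is a sketch, not the cluster's actual module: the install path ``/scratch/<user>/miniconda3`` and the file location are placeholders you must adapt to your own setup::

    -- Sketch of a user modulefile, e.g. ~/modulefiles/Miniconda/1.0.lua
    -- (path and install directory below are placeholders, not cluster defaults)
    help([[This module loads Miniconda from a user install directory
    ]])
    conflict("Anaconda","Anaconda3","anaconda","python")
    setenv("INSTALL_DIR","/scratch/<user>/miniconda3")
    prepend_path("LD_LIBRARY_PATH","/scratch/<user>/miniconda3/lib")
    prepend_path("PATH","/scratch/<user>/miniconda3/bin")

Register the directory that contains it with ``module use --append`` and the module becomes loadable as ``Miniconda/1.0``, matching the load line used in the SLURM script later in this guide.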
@@ -90,6 +107,101 @@ You also need to add these lines in your SLURM schedule script to load the envir
module load Miniconda/1.0


Module usage
------------

Just run ``module`` to list all available options::

$ module

Modules based on Lua: Version 7.8.15 2019-01-16 12:46 -06:00
by Robert McLay mclay@tacc.utexas.edu

module [options] sub-command [args ...]

Help sub-commands:
------------------
help prints this message
help module [...] print help message from module(s)

Loading/Unloading sub-commands:
-------------------------------
load | add module [...] load module(s)
try-load | try-add module [...] Add module(s), do not complain if not found
del | unload module [...] Remove module(s), do not complain if not found
swap | sw | switch m1 m2 unload m1 and load m2
purge unload all modules
refresh reload aliases from current list of modules.
update reload all currently loaded modules.

Listing / Searching sub-commands:
---------------------------------
list List loaded modules
list s1 s2 ... List loaded modules that match the pattern
avail | av List available modules
avail | av string List available modules that contain "string".
spider List all possible modules
spider module List all possible version of that module file
spider string List all module that contain the "string".
spider name/version Detailed information about that version of the module.
whatis module Print whatis information about module
keyword | key string Search all name and whatis that contain "string".

Searching with Lmod:
--------------------
All searching (spider, list, avail, keyword) support regular expressions:


-r spider '^p' Finds all the modules that start with `p' or `P'
-r spider mpi Finds all modules that have "mpi" in their name.
-r spider 'mpi$'   Finds all modules that end with "mpi" in their name.

Handling a collection of modules:
--------------------------------
save | s Save the current list of modules to a user defined "default" collection.
save | s name Save the current list of modules to "name" collection.
reset The same as "restore system"
restore | r Restore modules from the user's "default" or system default.
restore | r name Restore modules from "name" collection.
restore system Restore module state to system defaults.
savelist List of saved collections.
describe | mcc name Describe the contents of a module collection.
disable name Disable a collection.

Deprecated commands:
--------------------
getdefault [name] load name collection of modules or user's "default" if no name given.
===> Use "restore" instead <====
setdefault [name] Save current list of modules to name if given, otherwise save as the default list for the user.
===> Use "save" instead. <====

Miscellaneous sub-commands:
---------------------------
is-loaded modulefile return true if module is loaded
is-avail modulefile return true if module can be loaded
show modulefile show the commands in the module file.
use [-a] path Prepend or Append path to MODULEPATH.
unuse path remove path from MODULEPATH.
tablelist output list of active modules as a lua table.

Important Environment Variables:
--------------------------------
LMOD_COLORIZE If defined to be "YES" then Lmod prints properties and warning in color.

--------------------------------------------------------------------------------------------------------------------------------

Lmod Web Sites

Documentation: http://lmod.readthedocs.org
Github: https://github.com/TACC/Lmod
Sourceforge: https://lmod.sf.net
TACC Homepage: https://www.tacc.utexas.edu/research-development/tacc-projects/lmod

To report a bug please read http://lmod.readthedocs.io/en/latest/075_bug_reporting.html
--------------------------------------------------------------------------------------------------------------------------------
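
For example, the collection sub-commands listed above support a save-and-restore workflow like the following (the collection name ``conda`` is illustrative)::

    $ module load Anaconda/3-2020.11   # load the modules you want to keep
    $ module save conda                # save them as collection "conda"
    $ module purge                     # later, start from a clean state
    $ module restore conda             # bring the saved set back
    $ module savelist                  # list all saved collections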


.. [#ref] References:
https://researchcomputing.princeton.edu/support/knowledge-base/modules
2 changes: 1 addition & 1 deletion docs/build/html/_sources/index.rst.txt
@@ -63,7 +63,7 @@ Cluster **lince**:
:maxdepth: 2
:caption: Code Development

debug_config
code_development
tensorflow_settings
gpucheck
linux
87 changes: 32 additions & 55 deletions docs/build/html/_sources/slurm.rst.txt
@@ -3,7 +3,7 @@ Slurm

Slurm Workload Manager is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. It is used by many of the world's supercomputers and computer clusters.

The Slurm manages the amount of resources allocated to each job. The number of nodes, CPU cores, memory, GPUs and period are examples of resources one can allocate to a particular job. This is ideal for distributed computing among several nodes.
Slurm manages the resources allocated to each job. The number of nodes, CPU cores, memory, GPUs, and run time are examples of resources one can allocate to a particular job. This is ideal for distributed computing across several nodes.

Machine learning and deep learning models can be trained on HPC clusters with Tensorflow, PyTorch, Dask, or other distributed computing libraries.

@@ -21,36 +21,47 @@ This introductory video shows some useful commands.

Here's a list of some commonly used user commands. See Slurm `man pages <https://slurm.schedmd.com/man_index.html>`_ for a complete list of commands or download the :download:`command summary PDF<_static/summary.pdf>`. Note that all Slurm commands start with **'s'**.

**sbatch** is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

**scancel** is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.

**scontrol** is the administrative tool used to view and/or modify Slurm state. Note that many scontrol commands can only be executed as user root.

**squeue** reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.

**srun** is used to submit a job for execution or initiate job steps in real time. **srun** has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared resources within the job's node allocation.
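
For instance, a quick interactive test of an allocation might look like this (the resource values here are arbitrary)::

    $ srun --nodes=1 --ntasks=2 --time=00:05:00 hostname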





+-----------------------+----------------------------------------------------+
| Command | Description |
+=======================+====================================================+
| sbatch <slurm_script> | Submit a job script for later execution. |
+-----------------------+----------------------------------------------------+
| scancel <jobid> | Cancel a pending or running job or job step |
+-----------------------+----------------------------------------------------+
| srun | Parallel job launcher (Slurm analog of mpirun) |
+-----------------------+----------------------------------------------------+
| squeue | Show all jobs in the queue |
+-----------------------+----------------------------------------------------+
| squeue -u <username> | Show jobs in the queue for a specific user |
+-----------------------+----------------------------------------------------+
| squeue --start | Report the expected start time for pending jobs |
+-----------------------+----------------------------------------------------+
| squeue -j <jobid> | Show the nodes allocated to a running job |
+-----------------------+----------------------------------------------------+
| scontrol show config | View default parameter settings |
+-----------------------+----------------------------------------------------+
| sinfo | Show cluster status |
+-----------------------+----------------------------------------------------+


Job schedule
------------
`UiT The Arctic University of Norway <https://hpc-uit.readthedocs.io/en/latest/jobs/examples.html>`_ provides a list of job script examples.

::
Submit a script to the queue with ``sbatch <script>``::

$ sbatch script.sh

The options of the ``sbatch`` command may be embedded in the script itself, one per line, prefixed with the ``#SBATCH`` directive::

$ cat script.sh
#!/bin/bash -v
#SBATCH --partition=GPUSP4 # partition name, always 'GPUSP4'
#SBATCH --partition=GPUSP4 # partition name. lince = 'GPUSP4', aguia = 'SP2'
#SBATCH --job-name=tr-ae # job name
#SBATCH --nodes=1 # number of nodes allocated for this job
#SBATCH --ntasks=2 # total number of tasks / mpi processes
#SBATCH --cpus-per-task=8 # number of OpenMP threads per process
#SBATCH --time=08:00:00 # total run time limit ([[D]D-]HH:MM:SS). Default is 8 hours, maximum 80 hours.
#SBATCH --time=08:00:00 # total run time limit ([[D]D-]HH:MM:SS)
#SBATCH --gres=gpu:tesla:2 # number of GPUs
# Get email notification when job begins, finishes or fails
#SBATCH --mail-type=ALL # type of notification: BEGIN, END, FAIL, ALL
@@ -76,46 +87,12 @@ Job schedule
module use --append /scratch/11568881/modulefiles/
module load Miniconda/1.0

# Define directories used in this script
export PROJ=$HOME/project/ # project directory
export LOG=$HOME/project/log/ # log directory
export PYTHON=$HOME/miniconda3/bin/ # path to Python executable

# Debug mode loads small dataset and runs for few epochs.
export DEBUG_MODE=FALSE #

# System info (optional). Get hardware, Linux and Python libraries information
bash $PROJ/system_info.sh >> $LOG/system_info_$SLURM_JOB_NODELIST\.log

# Run the application.
# These examples train a neural network with different architectures passed
# as environment variables.
export AE_ARCH=1_2_4_6_8_10_12 # Number of filters in each layer
export AE_KERNEL=5_5_3_3_3_3_3 # Filter size
echo [`date '+%Y-%m-%d %H:%M:%S'`] Running $AE_ARCH
srun $PYTHON/python3 $PROJ/autoencoder.py >> $LOG/autoencoder_$AE_ARCH\.log 2>&1

export AE_ARCH=1_2_4_6_8_10 # Number of filters in each layer
export AE_KERNEL=5_5_3_3_3_3 # Filter size
echo [`date '+%Y-%m-%d %H:%M:%S'`] Running $AE_ARCH
srun $PYTHON/python3 $PROJ/autoencoder.py >> $LOG/autoencoder_$AE_ARCH\.log 2>&1
srun <train_model.py>

export AE_ARCH=1_2_4_6_8 # Number of filters in each layer
export AE_KERNEL=5_5_3_3_3 # Filter size
echo [`date '+%Y-%m-%d %H:%M:%S'`] Running $AE_ARCH
srun $PYTHON/python3 $PROJ/autoencoder.py >> $LOG/autoencoder_$AE_ARCH\.log 2>&1

export AE_ARCH=1_2_4_6 # Number of filters in each layer
export AE_KERNEL=5_5_3_3 # Filter size
echo [`date '+%Y-%m-%d %H:%M:%S'`] Running $AE_ARCH
srun $PYTHON/python3 $PROJ/autoencoder.py >> $LOG/autoencoder_$AE_ARCH\.log 2>&1

export AE_ARCH=1_2_4 # Number of filters in each layer
export AE_KERNEL=5_5_3 # Filter size
echo [`date '+%Y-%m-%d %H:%M:%S'`] Running $AE_ARCH
srun $PYTHON/python3 $PROJ/autoencoder.py >> $LOG/autoencoder_$AE_ARCH\.log 2>&1
`UiT The Arctic University of Norway <https://hpc-uit.readthedocs.io/en/latest/jobs/examples.html>`_ provides additional job script examples.

See more examples in `HPC-UiT documentation <https://hpc-uit.readthedocs.io/en/latest/jobs/examples.html>`_.
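
To inspect the full record of a submitted job (allocated nodes, time limit, working directory, and so on), ``scontrol`` can be used as below; the job ID is a placeholder::

    $ scontrol show job 123456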



1 change: 1 addition & 0 deletions docs/build/html/about.html
@@ -95,6 +95,7 @@
</ul>
<p class="caption"><span class="caption-text">Code Development</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="code_development.html">Code development</a></li>
<li class="toctree-l1"><a class="reference internal" href="tensorflow_settings.html">Tensorflow configuration</a></li>
<li class="toctree-l1"><a class="reference internal" href="gpucheck.html">GPU Health Check</a></li>
<li class="toctree-l1"><a class="reference internal" href="linux.html">Useful Linux Commands</a></li>
1 change: 1 addition & 0 deletions docs/build/html/client_configuration.html
@@ -106,6 +106,7 @@
</ul>
<p class="caption"><span class="caption-text">Code Development</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="code_development.html">Code development</a></li>
<li class="toctree-l1"><a class="reference internal" href="tensorflow_settings.html">Tensorflow configuration</a></li>
<li class="toctree-l1"><a class="reference internal" href="gpucheck.html">GPU Health Check</a></li>
<li class="toctree-l1"><a class="reference internal" href="linux.html">Useful Linux Commands</a></li>
15 changes: 13 additions & 2 deletions docs/build/html/code_development.html
@@ -40,7 +40,9 @@

<link rel="author" title="About these documents" href="about.html" />
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Tensorflow configuration" href="tensorflow_settings.html" />
<link rel="prev" title="Install Google Drive" href="install_gdrive.html" />
</head>

<body class="wy-body-for-nav">
@@ -92,7 +94,12 @@
<li class="toctree-l1"><a class="reference internal" href="install_gdrive.html">Install Google Drive</a></li>
</ul>
<p class="caption"><span class="caption-text">Code Development</span></p>
<ul>
<ul class="current">
<li class="toctree-l1 current"><a class="current reference internal" href="#">Code development</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#test-and-debug">Test and debug</a></li>
<li class="toctree-l2"><a class="reference internal" href="#google-colab-setup">Google Colab setup</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="tensorflow_settings.html">Tensorflow configuration</a></li>
<li class="toctree-l1"><a class="reference internal" href="gpucheck.html">GPU Health Check</a></li>
<li class="toctree-l1"><a class="reference internal" href="linux.html">Useful Linux Commands</a></li>
@@ -246,6 +253,10 @@ <h2>Google Colab setup<a class="headerlink" href="#google-colab-setup" title="Pe

</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="tensorflow_settings.html" class="btn btn-neutral float-right" title="Tensorflow configuration" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
<a href="install_gdrive.html" class="btn btn-neutral float-left" title="Install Google Drive" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
</div>

<hr/>

