new updates

vivamoto committed May 12, 2021
1 parent 6d3e7d7 commit 9cd49d4
Showing 28 changed files with 465 additions and 174 deletions.
Binary file modified docs/build/doctrees/custom_module.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/build/doctrees/index.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/slurm.doctree
Binary file not shown.
118 changes: 115 additions & 3 deletions docs/build/html/_sources/custom_module.rst.txt
@@ -1,7 +1,7 @@
Working with modules
====================

Modules are a convenient way to manage environment variables for applications use. Unless you use the default installation of Anaconda available in HPC, you'll need to modify custom modules. This section briefly explains how to work with modules and provides a custom module for Miniconda. See the references [#ref]_ to learn more about modules.
Modules are a convenient way to manage the environment variables that applications need. Unless you use the default installation of Anaconda available on the HPC cluster, you'll need to create custom modules. This section briefly explains how to work with modules and provides a custom module for Miniconda. See the references [#ref]_ to learn more about modules.

Environment modules set environment variables with specific values for each application. Run **module avail** to list all modules available::

@@ -39,12 +39,29 @@ Environment modules set environment variables with specific values for each appl
Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

Notice that default modules have a ``(D)`` beside the name and loaded modules are marked with an ``(L)``. You can load a module with ``module load``::

$ module load Anaconda/3-2019.03

You can load a module with ``module load``::
Unload all loaded modules with ``module purge``::

$ module load Anaconda/3-2019.03
$ module purge
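
To verify the result, ``module list`` prints the currently loaded modules; after a purge it reports none (the output below is illustrative)::

    $ module list
    No modules loaded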

Run ``module show`` to list the commands executed in the module::

$ module show Anaconda/3-2020.11
----------------------------------------------------------------------------------------------------------------------------------
/opt/ohpc/pub/modulefiles/Anaconda/3-2020.11.lua:
----------------------------------------------------------------------------------------------------------------------------------
help([[This module loads /scratch/apps/gnu/anaconda3
]])
conflict("Anaconda","Anaconda3","anaconda","python")
setenv("INSTALL_DIR","/scratch/apps/gnu/anaconda3")
prepend_path("LD_LIBRARY_PATH","/scratch/apps/gnu/anaconda3")
prepend_path("LD_LIBRARY_PATH","/scratch/apps/gnu/anaconda3/libexec")
prepend_path("INCLUDE","/scratch/apps/gnu/anaconda3/include")
prepend_path("PATH","/scratch/apps/gnu/anaconda3/sbin")
prepend_path("PATH","/scratch/apps/gnu/anaconda3/bin")

Create custom module
--------------------
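
As a starting point, here is a minimal Lua modulefile for Miniconda modeled on the Anaconda module shown above. This is a sketch, not the cluster's actual module: the install path ``/scratch/<user>/miniconda3`` and the file location are placeholders you must adapt to your own setup::

    -- Sketch of a user modulefile, e.g. ~/modulefiles/Miniconda/1.0.lua
    -- (path and install directory below are placeholders, not cluster defaults)
    help([[This module loads Miniconda from a user install directory
    ]])
    conflict("Anaconda","Anaconda3","anaconda","python")
    setenv("INSTALL_DIR","/scratch/<user>/miniconda3")
    prepend_path("LD_LIBRARY_PATH","/scratch/<user>/miniconda3/lib")
    prepend_path("PATH","/scratch/<user>/miniconda3/bin")

Register the directory that contains it with ``module use --append`` and the module becomes loadable as ``Miniconda/1.0``, matching the load line used in the SLURM script later in this guide.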
@@ -90,6 +107,101 @@ You also need to add these lines in your SLURM schedule script to load the envir
module load Miniconda/1.0


Module usage
------------

Just run ``module`` to list all available options::

$ module

Modules based on Lua: Version 7.8.15 2019-01-16 12:46 -06:00
by Robert McLay mclay@tacc.utexas.edu

module [options] sub-command [args ...]

Help sub-commands:
------------------
help prints this message
help module [...] print help message from module(s)

Loading/Unloading sub-commands:
-------------------------------
load | add module [...] load module(s)
try-load | try-add module [...] Add module(s), do not complain if not found
del | unload module [...] Remove module(s), do not complain if not found
swap | sw | switch m1 m2 unload m1 and load m2
purge unload all modules
refresh reload aliases from current list of modules.
update reload all currently loaded modules.

Listing / Searching sub-commands:
---------------------------------
list List loaded modules
list s1 s2 ... List loaded modules that match the pattern
avail | av List available modules
avail | av string List available modules that contain "string".
spider List all possible modules
spider module List all possible version of that module file
spider string List all module that contain the "string".
spider name/version Detailed information about that version of the module.
whatis module Print whatis information about module
keyword | key string Search all name and whatis that contain "string".

Searching with Lmod:
--------------------
All searching (spider, list, avail, keyword) support regular expressions:


-r spider '^p' Finds all the modules that start with `p' or `P'
-r spider mpi Finds all modules that have "mpi" in their name.
-r spider 'mpi$'   Finds all modules that end with "mpi" in their name.

Handling a collection of modules:
--------------------------------
save | s Save the current list of modules to a user defined "default" collection.
save | s name Save the current list of modules to "name" collection.
reset The same as "restore system"
restore | r Restore modules from the user's "default" or system default.
restore | r name Restore modules from "name" collection.
restore system Restore module state to system defaults.
savelist List of saved collections.
describe | mcc name Describe the contents of a module collection.
disable name Disable a collection.

Deprecated commands:
--------------------
getdefault [name] load name collection of modules or user's "default" if no name given.
===> Use "restore" instead <====
setdefault [name] Save current list of modules to name if given, otherwise save as the default list for the user.
===> Use "save" instead. <====

Miscellaneous sub-commands:
---------------------------
is-loaded modulefile return true if module is loaded
is-avail modulefile return true if module can be loaded
show modulefile show the commands in the module file.
use [-a] path Prepend or Append path to MODULEPATH.
unuse path remove path from MODULEPATH.
tablelist output list of active modules as a lua table.

Important Environment Variables:
--------------------------------
LMOD_COLORIZE If defined to be "YES" then Lmod prints properties and warning in color.

--------------------------------------------------------------------------------------------------------------------------------

Lmod Web Sites

Documentation: http://lmod.readthedocs.org
Github: https://github.com/TACC/Lmod
Sourceforge: https://lmod.sf.net
TACC Homepage: https://www.tacc.utexas.edu/research-development/tacc-projects/lmod

To report a bug please read http://lmod.readthedocs.io/en/latest/075_bug_reporting.html
--------------------------------------------------------------------------------------------------------------------------------
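
For example, the collection sub-commands listed above support a save-and-restore workflow like the following (the collection name ``conda`` is illustrative)::

    $ module load Anaconda/3-2020.11   # load the modules you want to keep
    $ module save conda                # save them as collection "conda"
    $ module purge                     # later, start from a clean state
    $ module restore conda             # bring the saved set back
    $ module savelist                  # list all saved collections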


.. [#ref] References:
https://researchcomputing.princeton.edu/support/knowledge-base/modules
2 changes: 1 addition & 1 deletion docs/build/html/_sources/index.rst.txt
@@ -63,7 +63,7 @@ Cluster **lince**:
:maxdepth: 2
:caption: Code Development

debug_config
code_development
tensorflow_settings
gpucheck
linux
87 changes: 32 additions & 55 deletions docs/build/html/_sources/slurm.rst.txt
@@ -3,7 +3,7 @@ Slurm

Slurm Workload Manager is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. It is used by many of the world's supercomputers and computer clusters.

The Slurm manages the amount of resources allocated to each job. The number of nodes, CPU cores, memory, GPUs and period are examples of resources one can allocate to a particular job. This is ideal for distributed computing among several nodes.
Slurm manages the resources allocated to each job. The number of nodes, CPU cores, memory, GPUs, and run time are examples of resources one can allocate to a particular job. This is ideal for distributed computing across several nodes.

Machine learning and deep learning models can be trained on HPC clusters with Tensorflow, PyTorch, Dask, or other distributed computing libraries.

@@ -21,36 +21,47 @@ This introductory video shows some useful commands.

Here's a list of some commonly used user commands. See Slurm `man pages <https://slurm.schedmd.com/man_index.html>`_ for a complete list of commands or download the :download:`command summary PDF<_static/summary.pdf>`. Note that all Slurm commands start with **'s'**.

**sbatch** is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

**scancel** is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.

**scontrol** is the administrative tool used to view and/or modify Slurm state. Note that many scontrol commands can only be executed as user root.

**squeue** reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.

**srun** is used to submit a job for execution or initiate job steps in real time. **srun** has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared resources within the job's node allocation.
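
For instance, a quick interactive test of an allocation might look like this (the resource values here are arbitrary)::

    $ srun --nodes=1 --ntasks=2 --time=00:05:00 hostname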





+-----------------------+----------------------------------------------------+
| Command | Description |
+=======================+====================================================+
| sbatch <slurm_script> | Submit a job script for later execution. |
+-----------------------+----------------------------------------------------+
| scancel <jobid> | Cancel a pending or running job or job step |
+-----------------------+----------------------------------------------------+
| srun | Parallel job launcher (Slurm analog of mpirun) |
+-----------------------+----------------------------------------------------+
| squeue | Show all jobs in the queue |
+-----------------------+----------------------------------------------------+
| squeue -u <username> | Show jobs in the queue for a specific user |
+-----------------------+----------------------------------------------------+
| squeue --start | Report the expected start time for pending jobs |
+-----------------------+----------------------------------------------------+
| squeue -j <jobid> | Show the nodes allocated to a running job |
+-----------------------+----------------------------------------------------+
| scontrol show config | View default parameter settings |
+-----------------------+----------------------------------------------------+
| sinfo | Show cluster status |
+-----------------------+----------------------------------------------------+


Job schedule
------------
`UiT The Arctic University of Norway <https://hpc-uit.readthedocs.io/en/latest/jobs/examples.html>`_ provides a list of job script examples.

::
Submit a script to the queue with ``sbatch <script>``::

$ sbatch script.sh

The options of the ``sbatch`` command may be embedded in the script itself, one per line, prefixed with the ``#SBATCH`` directive::

$ cat script.sh
#!/bin/bash -v
#SBATCH --partition=GPUSP4 # partition name, always 'GPUSP4'
#SBATCH --partition=GPUSP4 # partition name. lince = 'GPUSP4', aguia = 'SP2'
#SBATCH --job-name=tr-ae # job name
#SBATCH --nodes=1 # number of nodes allocated for this job
#SBATCH --ntasks=2 # total number of tasks / mpi processes
#SBATCH --cpus-per-task=8 # number of OpenMP threads per process
#SBATCH --time=08:00:00 # total run time limit ([[D]D-]HH:MM:SS). Default is 8 hours, maximum 80 hours.
#SBATCH --time=08:00:00 # total run time limit ([[D]D-]HH:MM:SS)
#SBATCH --gres=gpu:tesla:2 # number of GPUs
# Get email notification when job begins, finishes or fails
#SBATCH --mail-type=ALL # type of notification: BEGIN, END, FAIL, ALL
@@ -76,46 +87,12 @@ Job schedule
module use --append /scratch/11568881/modulefiles/
module load Miniconda/1.0

# Define directories used in this script
export PROJ=$HOME/project/ # project directory
export LOG=$HOME/project/log/ # log directory
export PYTHON=$HOME/miniconda3/bin/ # path to Python executable

# Debug mode loads small dataset and runs for few epochs.
export DEBUG_MODE=FALSE #

# System info (optional). Get hardware, Linux and Python libraries information
bash $PROJ/system_info.sh >> $LOG/system_info_$SLURM_JOB_NODELIST\.log

# Run the application.
# These examples train a neural network with different architectures passed
# as environment variables.
export AE_ARCH=1_2_4_6_8_10_12 # Number of filters in each layer
export AE_KERNEL=5_5_3_3_3_3_3 # Filter size
echo [`date '+%Y-%m-%d %H:%M:%S'`] Running $AE_ARCH
srun $PYTHON/python3 $PROJ/autoencoder.py >> $LOG/autoencoder_$AE_ARCH\.log 2>&1

export AE_ARCH=1_2_4_6_8_10 # Number of filters in each layer
export AE_KERNEL=5_5_3_3_3_3 # Filter size
echo [`date '+%Y-%m-%d %H:%M:%S'`] Running $AE_ARCH
srun $PYTHON/python3 $PROJ/autoencoder.py >> $LOG/autoencoder_$AE_ARCH\.log 2>&1
srun <train_model.py>

export AE_ARCH=1_2_4_6_8 # Number of filters in each layer
export AE_KERNEL=5_5_3_3_3 # Filter size
echo [`date '+%Y-%m-%d %H:%M:%S'`] Running $AE_ARCH
srun $PYTHON/python3 $PROJ/autoencoder.py >> $LOG/autoencoder_$AE_ARCH\.log 2>&1

export AE_ARCH=1_2_4_6 # Number of filters in each layer
export AE_KERNEL=5_5_3_3 # Filter size
echo [`date '+%Y-%m-%d %H:%M:%S'`] Running $AE_ARCH
srun $PYTHON/python3 $PROJ/autoencoder.py >> $LOG/autoencoder_$AE_ARCH\.log 2>&1

export AE_ARCH=1_2_4 # Number of filters in each layer
export AE_KERNEL=5_5_3 # Filter size
echo [`date '+%Y-%m-%d %H:%M:%S'`] Running $AE_ARCH
srun $PYTHON/python3 $PROJ/autoencoder.py >> $LOG/autoencoder_$AE_ARCH\.log 2>&1
`UiT The Arctic University of Norway <https://hpc-uit.readthedocs.io/en/latest/jobs/examples.html>`_ provides additional job script examples.

See more examples in `HPC-UiT documentation <https://hpc-uit.readthedocs.io/en/latest/jobs/examples.html>`_.
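
To inspect the full record of a submitted job (allocated nodes, time limit, working directory, and so on), ``scontrol`` can be used as below; the job ID is a placeholder::

    $ scontrol show job 123456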



1 change: 1 addition & 0 deletions docs/build/html/about.html
@@ -95,6 +95,7 @@
</ul>
<p class="caption"><span class="caption-text">Code Development</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="code_development.html">Code development</a></li>
<li class="toctree-l1"><a class="reference internal" href="tensorflow_settings.html">Tensorflow configuration</a></li>
<li class="toctree-l1"><a class="reference internal" href="gpucheck.html">GPU Health Check</a></li>
<li class="toctree-l1"><a class="reference internal" href="linux.html">Useful Linux Commands</a></li>
1 change: 1 addition & 0 deletions docs/build/html/client_configuration.html
@@ -106,6 +106,7 @@
</ul>
<p class="caption"><span class="caption-text">Code Development</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="code_development.html">Code development</a></li>
<li class="toctree-l1"><a class="reference internal" href="tensorflow_settings.html">Tensorflow configuration</a></li>
<li class="toctree-l1"><a class="reference internal" href="gpucheck.html">GPU Health Check</a></li>
<li class="toctree-l1"><a class="reference internal" href="linux.html">Useful Linux Commands</a></li>
15 changes: 13 additions & 2 deletions docs/build/html/code_development.html
@@ -40,7 +40,9 @@

<link rel="author" title="About these documents" href="about.html" />
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Tensorflow configuration" href="tensorflow_settings.html" />
<link rel="prev" title="Install Google Drive" href="install_gdrive.html" />
</head>

<body class="wy-body-for-nav">
@@ -92,7 +94,12 @@
<li class="toctree-l1"><a class="reference internal" href="install_gdrive.html">Install Google Drive</a></li>
</ul>
<p class="caption"><span class="caption-text">Code Development</span></p>
<ul>
<ul class="current">
<li class="toctree-l1 current"><a class="current reference internal" href="#">Code development</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#test-and-debug">Test and debug</a></li>
<li class="toctree-l2"><a class="reference internal" href="#google-colab-setup">Google Colab setup</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="tensorflow_settings.html">Tensorflow configuration</a></li>
<li class="toctree-l1"><a class="reference internal" href="gpucheck.html">GPU Health Check</a></li>
<li class="toctree-l1"><a class="reference internal" href="linux.html">Useful Linux Commands</a></li>
@@ -246,6 +253,10 @@ <h2>Google Colab setup<a class="headerlink" href="#google-colab-setup" title="Pe

</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="tensorflow_settings.html" class="btn btn-neutral float-right" title="Tensorflow configuration" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
<a href="install_gdrive.html" class="btn btn-neutral float-left" title="Install Google Drive" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
</div>

<hr/>

