Merge pull request #71 from Danzelot/slurm
Restructures Job/SLURM section
bast committed Nov 10, 2019
2 parents e8f723e + 73912d2 commit b00254e
Showing 11 changed files with 220 additions and 258 deletions.
2 changes: 1 addition & 1 deletion help/faq.rst
@@ -274,7 +274,7 @@ For instance::

**QOSMaxWallDurationPerJobLimit**

QOSMaxWallDurationPerJobLimit means that MaxWallDurationPerJobLimit has been exceeded. Basically, you have asked for more time than allowed for the given QOS/Partition. Please have a look at :doc:`/jobs/partitions`.
QOSMaxWallDurationPerJobLimit means that MaxWallDurationPerJobLimit has been exceeded. Basically, you have asked for more time than allowed for the given QOS/Partition. Please have a look at :ref:`label_partitions`.
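
As a quick check, the maximum walltime per partition can also be listed on the cluster itself; this is a sketch using standard SLURM tooling (the ``TIMELIMIT`` column is part of the default ``sinfo`` output)::

    sinfo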


**Priority vs. Resources**
23 changes: 17 additions & 6 deletions index.rst
@@ -26,6 +26,7 @@ HPC-UiT Services User Documentation
news/news


.. Getting help section
.. toctree::
:maxdepth: 1
:caption: Getting help
@@ -37,6 +38,7 @@ HPC-UiT Services User Documentation
help/hpc-cafe


.. General stallo sections
.. toctree::
:maxdepth: 1
:caption: Stallo
@@ -45,6 +47,7 @@ HPC-UiT Services User Documentation
stallo/uit-guidelines


.. Account section
.. toctree::
:maxdepth: 1
:caption: Account
@@ -53,24 +56,30 @@ HPC-UiT Services User Documentation
account/login
account/accounting


.. Job section
.. toctree::
:maxdepth: 1
:caption: Jobs

jobs/dos_and_donts
jobs/batch
jobs/examples
jobs/dos_and_donts

.. toctree::
:maxdepth: 2

jobs/slurm_parameter
jobs/process-count
jobs/partitions

.. toctree::
:maxdepth: 1

jobs/interactive
jobs/job_management
jobs/monitoring
jobs/running_mpi_jobs
jobs/environment-variables
jobs/torque_slurm_table


.. Software section
.. toctree::
:maxdepth: 1
:caption: Software
@@ -89,6 +98,7 @@ HPC-UiT Services User Documentation
applications/sw_guides


.. Storage section
.. toctree::
:maxdepth: 1
:caption: Storage
@@ -98,6 +108,7 @@ HPC-UiT Services User Documentation
storage/lustre-performance


.. Code development section
.. toctree::
:maxdepth: 1
:caption: Code development
65 changes: 4 additions & 61 deletions jobs/batch.rst
@@ -1,4 +1,4 @@

.. _batch_system:

Batch system
============
@@ -10,9 +10,6 @@ a batch system that will execute the applications on the available resources.
The batch system on Stallo is `SLURM <https://slurm.schedmd.com/>`_ (Simple
Linux Utility for Resource Management.)

If you are already used to Torque/Maui (the previous queue system used on
Stallo), but not SLURM, you might find this :ref:`torque_slurm_table` useful.


Creating a job script
---------------------
@@ -35,6 +32,7 @@ It is sometimes convenient if you do not have to edit the job script every time
to change the input file. Or perhaps you want to submit hundreds of jobs and
loop over a range of input files. For this it is handy to pass command-line
parameters to the job script.
For an overview of the different possible parameters, see :ref:`slurm_parameter`.

In SLURM you can do this::

@@ -52,36 +50,11 @@ And then you can pick the parameters up inside the job script::
# argument 2 is myoutput
mybinary.x < ${1} > ${2}
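
Submitting such a script with the two file names as arguments could then look like this (a sketch; the script name is a placeholder and the file names match the comments above)::

    sbatch myscript.sh myinput myoutput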


How to set the account in your job script
-----------------------------------------

You can set it like this::

#SBATCH --account=nn1234k


Managing jobs
=============

The lifecycle of a job can be managed with as little as three different
commands:

#. Submit the job with ``sbatch <script_name>``.
#. Check the job status with ``squeue``. (to limit the display to only
your jobs use ``squeue -u <user_name>``.)
#. (optional) Delete the job with ``scancel <job_id>``.

You can also hold the start of a job:

scontrol hold <job_id>
Put a hold on the job. A job on hold will not start or block other jobs from starting until you release the hold.
scontrol release <job_id>
Release the hold on a job.
For recommended sets of parameters see also :ref:`slurm_recommendations`.


Walltime
========
--------

We recommend that you be as precise as you can when specifying the
parameters, as they influence how fast your jobs will start to run.
@@ -99,33 +72,3 @@ To find out whether all users within one project share the same priority, run::
For a given account (project) consider the column "RawShares". If the RawShares
for the users is "parent", they all share the same fairshare priority. If it is
a number, they have individual priorities.
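
The exact command falls outside the lines shown in this hunk; on a SLURM system the fairshare tree, including the RawShares column, can typically be inspected with ``sshare``, for example (a sketch; the account name is a placeholder)::

    sshare -a -A nn1234k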


Job status descriptions in squeue
=================================

When you run ``squeue`` (probably limiting the output with ``squeue -u <user_name>``), you will get a list of all jobs currently running or waiting to start. Most of the columns should be self-explaining, but the *ST* and *NODELIST (REASON)* columns can be confusing.

*ST* stands for *state*. The most important states are listed below. For a more comprehensive list, check the `squeue help page section Job State Codes <https://slurm.schedmd.com/squeue.html#lbAG>`_.

R
The job is running
PD
The job is pending (i.e. waiting to run)
CG
The job is completing, meaning that it will be finished soon

The column *NODELIST (REASON)* will show you a list of computing nodes the job is running on if the job is actually running. If the job is pending, the column will give you a reason why it still pending. The most important reasons are listed below. For a more comprehensive list, check the `squeue help page section Job Reason Codes <https://slurm.schedmd.com/squeue.html#lbAF>`_.

Priority
There is another pending job with higher priority
Resources
The job has the highest priority, but is waiting for some running job to finish.
QOS*Limit
This should only happen if you run your job with ``--qos=devel``. In developer mode you may only have one single job in the queue.
launch failed requeued held
Job launch failed for some reason. This is normally due to a faulty node. Please contact us via support@metacenter.no stating the problem, your user name, and the jobid(s).
Dependency
Job cannot start before some other job is finished. This should only happen if you started the job with ``--dependency=...``
DependencyNeverSatisfied
Same as *Dependency*, but that other job failed. You must cancel the job with ``scancel JOBID``.
2 changes: 1 addition & 1 deletion jobs/dos_and_donts.rst
@@ -4,6 +4,6 @@ Dos and don'ts
==============

- Never run calculations on the home disk
- Always use the queueing system
- Always use the SLURM queueing system
- The login nodes are only for editing files and submitting jobs
- Do not run calculations interactively on the login nodes
41 changes: 0 additions & 41 deletions jobs/environment-variables.rst

This file was deleted.

7 changes: 3 additions & 4 deletions jobs/files/slurm-OMP.sh
@@ -11,14 +11,13 @@
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20

# exclusive makes all memory available
#SBATCH --exclusive

# run for five minutes
# d-hh:mm:ss
#SBATCH --time=0-00:05:00

# 500MB memory per core
# this is a hard limit
#SBATCH --mem-per-cpu=500MB

# turn on all mail notification
#SBATCH --mail-type=ALL
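
# Not shown in this hunk: the script body typically exports the thread count
# from the SLURM allocation before launching the threaded program. A sketch
# (SLURM_CPUS_PER_TASK and OMP_NUM_THREADS are standard environment variables;
# ./my_openmp_program is a placeholder binary):
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_openmp_program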

49 changes: 49 additions & 0 deletions jobs/job_management.rst
@@ -0,0 +1,49 @@
.. _job_management:

Managing jobs
=============

The lifecycle of a job can be managed with as few as three commands:

#. Submit the job with ``sbatch <script_name>``.
#. Check the job status with ``squeue`` (to limit the display to your own
   jobs, use ``squeue -u <user_name>``).
#. (optional) Delete the job with ``scancel <job_id>``.

You can also hold the start of a job:

scontrol hold <job_id>
Put a hold on the job. A job on hold will neither start nor block other jobs from starting until you release the hold.
scontrol release <job_id>
Release the hold on a job.
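
A typical lifecycle might then look like this (a sketch; the script name, user name, and job id are placeholders)::

    sbatch myjob.sh            # prints e.g. "Submitted batch job 123456"
    squeue -u myusername       # check the state of your jobs
    scontrol hold 123456       # keep the pending job from starting
    scontrol release 123456    # allow it to start again
    scancel 123456             # remove the job from the queue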


Job status descriptions in squeue
---------------------------------

When you run ``squeue`` (probably limiting the output with ``squeue -u <user_name>``), you will get a list of all jobs currently running or waiting to start. Most of the columns should be self-explanatory, but the *ST* and *NODELIST (REASON)* columns can be confusing.

*ST* stands for *state*. The most important states are listed below. For a more comprehensive list, check the `squeue help page section Job State Codes <https://slurm.schedmd.com/squeue.html#lbAG>`_.

R
The job is running
PD
The job is pending (i.e. waiting to run)
CG
The job is completing, meaning that it will be finished soon

The column *NODELIST (REASON)* will show you a list of computing nodes the job is running on if the job is actually running. If the job is pending, the column will give you a reason why it is still pending. The most important reasons are listed below. For a more comprehensive list, check the `squeue help page section Job Reason Codes <https://slurm.schedmd.com/squeue.html#lbAF>`_.

Priority
There is another pending job with higher priority
Resources
The job has the highest priority, but is waiting for some running job to finish.
QOS*Limit
This should only happen if you run your job with ``--qos=devel``. In developer mode you may only have a single job in the queue.
launch failed requeued held
Job launch failed for some reason. This is normally due to a faulty node. Please contact us via support@metacenter.no stating the problem, your user name, and the jobid(s).
Dependency
Job cannot start before some other job is finished. This should only happen if you started the job with ``--dependency=...``.
DependencyNeverSatisfied
Same as *Dependency*, but that other job failed. You must cancel the job with ``scancel JOBID``.
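
A dependency of this kind is typically declared at submission time; a sketch (the ``afterok`` condition, job id, and script name are illustrative)::

    sbatch --dependency=afterok:123456 next_step.sh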
70 changes: 0 additions & 70 deletions jobs/partitions.rst

This file was deleted.
