Merge pull request #32 from mathiasbockwoldt/faq_and_intro
FAQ and minor changes
bast committed Mar 2, 2018
2 parents 3384e46 + e586004 commit b4ca3d2
Showing 4 changed files with 63 additions and 100 deletions.
148 changes: 50 additions & 98 deletions help/faq.rst
@@ -17,29 +17,24 @@ You can reset it here: https://www.metacenter.no/user/
How do I change my password on Stallo?
--------------------------------------

"The ``passwd`` command does not seem to work. My password is reset back to
the old one after a while. Why is this happening?"

The Stallo system uses a centralised database for user management, which
overrides any password change done locally on Stallo, so the ``passwd``
command known from other Linux systems does not work here.

Instead, change your password on the
`password metacenter page <https://www.metacenter.no/user/password/>`_;
log in using your username on Stallo and the NOTUR domain.


Installing software
===================

I need Python package X but the one on Stallo is too old or I cannot find it
----------------------------------------------------------------------------

"I need a newer version of SciPy, NumPy, etc. Can you install it?"

You can choose between different Python versions with the module system;
see :doc:`/software/modules`. We often have newer versions of software
packages installed which may not be visible with the default user settings.
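For example, you can list the available Python modules and load one of them
(the module names and versions below are only an illustration; check the
output of ``module avail`` on Stallo for what is actually installed)::

$ module avail Python
$ module load Python/3.6.4-intel-2018a
$ python --version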

If this still doesn't solve your problem, or you would like to install a package yourself, please read the next section below about installing without sudo rights.

If we don't have it installed, and installing it yourself is not a good solution for you, please contact us and we will do our best to help you.

@@ -49,9 +44,10 @@ Can I install Python software as a normal user without sudo rights?

Yes. The recommended way to achieve this is using `virtual environments <https://docs.python.org/3/tutorial/venv.html>`_.

As an example, we install the Biopython package::

$ module load GCC/6.4.0-2.28  # Load a modern compiler with Python 2 (not necessary with Python 3).
$ virtualenv venv
$ source venv/bin/activate
$ pip install biopython
@@ -61,8 +57,14 @@ the virtual environment::

$ source venv/bin/activate

If you want to leave the virtual environment again, type::

$ deactivate

And you do not have to call it "venv"; it is no problem to have many
virtual environments in your home directory. Each will start as a clean
Python setup which you can then modify. This is also a great way to have
different versions of the same package installed side by side.
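As a sketch (the environment names and package versions here are arbitrary),
you could keep two environments with different versions of the same package
next to each other::

$ virtualenv venv-biopython-1.70
$ source venv-biopython-1.70/bin/activate
$ pip install biopython==1.70
$ deactivate

$ virtualenv venv-biopython-1.71
$ source venv-biopython-1.71/bin/activate
$ pip install biopython==1.71
$ deactivate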

If you want to inherit system site packages into your virtual
environment, do this instead::
@@ -72,37 +74,6 @@

$ virtualenv --system-site-packages venv
$ source venv/bin/activate
$ pip install biopython


Running software
================

Why is a specific node so incredibly slow compared to others?
-------------------------------------------------------------

The node is probably swapping.


What does swapping mean and why should I care?
----------------------------------------------

If the jobs consume more memory than the node physically has, the node starts
to swap out memory to the disk. This typically means a significant slowdown of
the calculation. And this is why you need to care about swapping: your
calculation will grind to a halt. You can also crash the node,
which is bad for us.


How can I check whether my calculation is swapping?
---------------------------------------------------

Option 1 (inside the university network) is to check
http://stallo-login2.uit.no/slurmbrowser/html/squeue.html. Click on "nodes" and
then on the node in question. In the right-hand panel you see "Memory last hour".
If memory is above the red mark, the node will swap.

Option 2 is to log into the node and run ``top``. At the top you see how much
memory is consumed and whether the node is swapping.
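As a small sketch of option 2 (the node name ``c1-2`` is only a placeholder;
use the node your job actually runs on, which you can find with ``squeue``)::

$ squeue -u $USER      # find out which node(s) your job runs on
$ ssh c1-2             # log into that node
$ free -m              # a non-zero "used" value in the Swap line means the node is swapping
$ top                  # shows memory use per process; press q to quit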


Compute and storage quota
=========================

@@ -142,8 +113,7 @@ Connecting via ssh
How can I export the display from a compute node to my desktop?
---------------------------------------------------------------

If you need to export the display from a compute node to your desktop, you should:

#. First log in to Stallo with display forwarding.
#. Then you should reserve a node, with display forwarding, through the
@@ -178,14 +148,14 @@ My ssh connections are dying / freezing
How to prevent your ssh connections from dying / freezing.

If your ssh connections are dying / freezing more or less randomly, try
to add the following to your *local* ``~/.ssh/config`` file:

::

ServerAliveCountMax 3
ServerAliveInterval 10

(*local* means that you need to make these changes on your own computer,
not on Stallo)
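If you prefer to apply these settings to Stallo only, you can use a host block
instead (the host alias, host name and user name below are placeholders;
adapt them to your own account)::

Host stallo
    HostName stallo.uit.no
    User myusername
    ServerAliveCountMax 3
    ServerAliveInterval 10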

The above config is for `OpenSSH <https://www.openssh.org>`_. If you're
@@ -260,74 +230,65 @@ To find out how to monitor your jobs and check their status see :ref:`monitoring
Below are a few common reasons why jobs don't start, and error messages you might get:


**Memory per core**

"When I try to start a job with 2 GB of memory per core, I get the following error:
``sbatch: error: Batch job submission failed: Requested node configuration is not available``
With 1 GB/core it works fine. What might be the cause of this?"

On Stallo we have two different configurations available: 16-core and 20-core nodes, both with a
total of 32 GB of memory/node. If you ask for full nodes by
specifying both number of nodes and cores/node together with 2 GB of memory/core, you will ask
for 20 cores/node and 40 GB of memory. This configuration does not exist on Stallo. If you ask
for 16 cores, still with 2 GB/core, there is a sort of buffer within SLURM not allowing you
to consume absolutely all memory available (the system needs some to work). 2000 MB/core works
fine, but not 2 GB for 16 cores/node.

The solution we want to push in general is this::

#SBATCH --ntasks=80    # (number of nodes * number of cores, e.g. 5*16 or 4*20 = 80)

If you then ask for 2000 MB of memory/core, you will be given 16 cores/node and a total
of 5 nodes. 4000 MB will give you 8 cores/node - everybody is happy. Just note the
info about PE :ref:`accounting`; mem-per-cpu 4000 MB will cost you twice as much as
mem-per-cpu 2000 MB.
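As a minimal sketch (the job name and time limit are arbitrary and only for
illustration), the 2000 MB/core variant could look like this in your job script::

#SBATCH --job-name=myjob
#SBATCH --ntasks=80
#SBATCH --mem-per-cpu=2000MB
#SBATCH --time=0-02:00:00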

You can find an example here: :ref:`first_time_gaussian`

Please also note that if you want to use the whole memory on a node, do not ask
for 32GB, but for 31GB or 31000MB as the node needs some memory for the system itself.
For an example, see here: :ref:`allocated_entire_memory`



**Step memory limit**

"Why do I get ``slurmstepd: Exceeded step memory limit`` in my log/output?"
"Why do I get ``slurmstepd: Exceeded step memory limit`` in my log/output?"

For slurm, the memory flag seems to be a hard limit, meaning that when each core
For slurm, the memory flag is a hard limit, meaning that when each core
tries to utilize more than the given amount of memory, it is killed by the slurm-deamon.
For example ``$SBATCH --mem-per-cpu=2GB`` means that you maximum can use 2 GB of memory pr
core. With memory intensive applications like comsol or VASP, your job will likely be
terminated. The solution to this problem is to, like we have said elsewhere, specify the
For example ``$SBATCH --mem-per-cpu=2GB`` means that you maximum can use 2 GB of memory per
core. With memory intensive applications like Comsol or VASP, your job will likely be
terminated. The solution to this problem is to specify the
number of tasks irrespectively of cores/node and ask for as much memory you will need.

For instance::

#SBATCH --ntasks=20
#SBATCH --time=0-24:05:00
#SBATCH --mem-per-cpu=6000MB


**QOSMaxWallDurationPerJobLimit**

QOSMaxWallDurationPerJobLimit means that MaxWallDurationPerJobLimit has been exceeded. Basically, you have asked for more time than allowed for the given QOS/Partition. Please have a look at :doc:`/jobs/partitions`.


**Priority vs. Resources**

Priority means that resources are in principle available, but someone else has higher priority in the queue. Resources means that at the moment the requested resources are not available.


CPU vs. core
-------------

In this documentation we frequently use the term *CPU*, which in
most cases is equivalent to the more precise term *processor core* /
*core*. The *multi-core age* is here now :-)


How can I customize emails that I get after a job has completed?
----------------------------------------------------------------

@@ -336,20 +297,15 @@ that you send the email via the login node.

As an example, add and adapt the following line at the end of your script::

echo "email content" | ssh stallo-1.local 'mail -s "job finished in /global/work/${USER}/${SLURM_JOBID}" firstname.lastname@uit.no'
echo "email content" | ssh stallo-1.local 'mail -s "Job finished: ${SLURM_JOBID}" firstname.lastname@uit.no'


How can I run many short tasks?
-------------------------------

The overhead in the job start and cleanup makes it impractical to run
thousands of short tasks as individual jobs on Stallo.


Background
----------

The queueing setup on Stallo, or rather, the accounting system generates
overhead in the start and finish of a job of about 1 second at each end
of the job. This overhead is insignificant when running large parallel
@@ -360,11 +316,9 @@ unparallelizable part of the job. This is because the queuing system can
only start and account one job at a time. This scaling problem is
described by `Amdahl's Law <https://en.wikipedia.org/wiki/Amdahl's_law>`_.

If the tasks are extremely short, you can use the example below. If you want to
spawn many jobs without polluting the queueing system, please have a look at
:ref:`job_arrays`.

By using some shell trickery one can spawn and load-balance multiple
independent tasks running in parallel within one node, just background
@@ -373,10 +327,8 @@ next:

.. literalinclude:: files/multiple.sh
:language: bash
:linenos:

And here is the ``dowork.sh`` script:

.. literalinclude:: files/dowork.sh
:language: bash
:linenos:
4 changes: 2 additions & 2 deletions help/files/multiple.sh
@@ -10,7 +10,7 @@
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20

# We assume we will (in total) be done in 10 minutes:
#SBATCH --time=0-00:10:00

# Let us use all CPUs:
@@ -32,7 +32,7 @@ for t in $tasks; do
./dowork.sh $t &

# You should leave the rest alone...
#

# count the number of background tasks we have spawned
# the jobs command print one line per task running so we only need
# to count the number of lines.
4 changes: 4 additions & 0 deletions jobs/batch.rst
@@ -23,6 +23,10 @@ memory, etc., that will be interpreted by the batch system upon submission.

You can find job script examples in :ref:`job_script_examples`.

After you have written your job script as shown in the examples, you can start it with::

sbatch jobscript.sh


How to pass command-line parameters to the job script
-----------------------------------------------------
7 changes: 7 additions & 0 deletions jobs/examples.rst
@@ -19,11 +19,17 @@ by typing::

$ sbatch run.sh

Please note that all values that you define with SBATCH directives are hard
limits. When you, for example, ask for 6000 MB of memory (``--mem=6000MB``) and
your job uses more than that, the job will be automatically killed by the resource manager.


.. literalinclude:: files/slurm-blueprint.sh
:language: bash


.. _job_arrays:

Running many sequential jobs in parallel using job arrays
---------------------------------------------------------

@@ -90,6 +96,7 @@ each:
The ``wait`` commands are important here - the run script will only continue
once all commands started with ``&`` have completed.
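As a minimal sketch of that pattern (``./myprog`` and its input/output names
are placeholders)::

./myprog input1 > output1 &
./myprog input2 > output2 &
./myprog input3 > output3 &

# continue only when all three background tasks have finished
wait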

.. _allocated_entire_memory:

Example on how to allocate entire memory on one node
----------------------------------------------------
