# How to queue your jobs on a shared compute system: SLURM

Our HPC systems are large and support a large number of users.
This means you cannot immediately access the hardware to run your code.
All jobs must be submitted to a queue, which prioritizes them on a number of factors: your priority, the amount of resource requested, and how best to maximize facility utilization.
In essence, a short single-node job from a user with high priority will start quickly, as it can fit in around other jobs. In contrast, a long multi-node job from a user with low priority will be at the back of the queue.
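
On a cluster you can watch this prioritization at work from the command line. The sketch below assumes the standard SLURM client tools are installed; `queue_status` is a hypothetical helper name, and `sprio` only reports useful values where the site uses multifactor priority, which is an assumption here.

```bash
#!/bin/bash
# queue_status is a hypothetical helper, not part of SLURM itself.
queue_status() {
    if ! command -v squeue >/dev/null 2>&1; then
        echo "squeue not found: run this on a cluster login node"
        return 0
    fi
    # Your own pending and running jobs
    squeue -u "$USER"
    # Estimated start times for your pending jobs (only an estimate)
    squeue -u "$USER" --start
    # Priority factors for your pending jobs, where sprio is available
    if command -v sprio >/dev/null 2>&1; then
        sprio -u "$USER"
    fi
}

queue_status
```

The start times reported by `squeue --start` change as other jobs finish early or new jobs arrive, so treat them as a rough guide.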

This system is hugely preferable to every person having their own system, a large system with no queue, or departmental servers.
It gives all users the potential to run much larger jobs, it is economically very effective as we can ensure high utilization, and it is thus more environmentally friendly.
To address the 'no queue' idea: if all users could launch jobs to start immediately, it is likely that a few users would have jobs constantly running. Your job's run time would depend on the number of other jobs running, and overall the extra work of context switching between jobs would make the total time taken to finish all jobs longer.

## SLURM concepts (HPC)


Creation of SLURM submission scripts requires some understanding of the terminology used to refer to elements of the system hardware. It also requires some understanding of how a computational workload is constructed in terms of processes and often threads.

* **Job**: When a user submits a job script to SLURM this creates a job, which is given a unique ID and placed into a queue until the requested resources are available. 

* **Node**: SLURM refers to a single server within the cluster as a *node*.

* **Partition**: A group of nodes to which jobs can be submitted, sometimes referred to as a queue for legacy reasons.  The default partition in SCRTP HPC clusters is called `compute` and is available to all users. Depending on the system there may also be `gpu` (GPU accelerators) and `hmem` (high memory) partitions available.

* **Socket**: SLURM refers to the node's compute processors (which contain multiple processing cores) by the *socket* they are plugged into. For example, in Avon the `compute` nodes each contain two Intel Xeon processors, one per socket.

* **CPU**: SLURM refers to each processor core as a *CPU*. Again, using Avon as an example, each Xeon processor in the `compute` partition contains `24` processor cores, and hence there are `48` CPUs total per node.

* **Task**: A task represents an instance of a running program/executable. Many HPC jobs will launch multiple tasks collectively performing a single calculation using [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) or other mechanisms.

In turn, each task might make use of multiple CPUs via threading, e.g., [OpenMP](https://www.openmp.org/), or by spawning child processes, e.g., [Python multiprocessing](https://docs.python.org/3/library/multiprocessing.html).

The maximum number of CPUs that can be used by any one task (without oversubscribing the node) is the number of CPUs in a node.

Jobs that launch multiple tasks may make use of resources across multiple nodes (servers) simultaneously, with tasks communicating over the cluster network to implement a single calculation. The high-speed, low-latency networking in the HPC clusters makes this viable, in contrast to the `taskfarm`, where such calculations are discouraged and unlikely to be performant.
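
The relationship between sockets, CPUs and tasks can be checked with a little shell arithmetic. This sketch uses the Avon `compute` figures quoted above; the variable names and the choice of 4 CPUs per task are illustrative assumptions.

```bash
#!/bin/bash
sockets_per_node=2      # Avon compute nodes: two Xeon processors
cores_per_socket=24     # processor cores per Xeon
cpus_per_node=$(( sockets_per_node * cores_per_socket ))

cpus_per_task=4         # assumed: each task runs 4 threads
# Largest number of such tasks that fit on one node without oversubscribing
max_tasks_per_node=$(( cpus_per_node / cpus_per_task ))

echo "CPUs per node:      $cpus_per_node"
echo "Max tasks per node: $max_tasks_per_node"
```

On a real cluster, `scontrol show node` reports the socket and core counts for a node, so the hard-coded numbers can be replaced with the values for the partition you intend to use.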


## Writing SLURM Scripts

In the previous section we used a SLURM script to launch a job that ran the test on the environment we created.

As part of the course prep, you could bring a piece of software that you want to run on the HPC, preferably something that uses some form of parallelization or GPUs.

If you do not have this, we provide a few scripts that you can submit with `sbatch`; they simply check that your submission was correct and then terminate. You can use these to practice.

For this session we encourage you to look at [the documentation](https://docs.scrtp.warwick.ac.uk/hpc-pages/hpc-jobscript.html).
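
As a starting point, a job script is an ordinary shell script whose `#SBATCH` comment lines carry the resource request. The sketch below writes a minimal one and submits it if SLURM is present; the file name `hello.slurm` and the resource values are assumptions for illustration, and your site may require further options such as a partition.

```bash
#!/bin/bash
# Write a minimal job script to a (hypothetical) file called hello.slurm
cat > hello.slurm <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1000
#SBATCH --time=00:01:00

srun hostname
EOF

# Submit only if SLURM is available on this machine
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello.slurm
else
    echo "sbatch not found: run this on a cluster login node"
fi
```

Because the `#SBATCH` lines are comments as far as bash is concerned, all directives must appear before the first executable command in the script.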

### Worked Examples

These worked examples will be submitted to development queues. The requested times are only a few minutes so that the jobs pass through the queue quickly; for real work you will usually need to estimate the runtime of your code.
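
One rough way to arrive at a time request is to time a small trial run, scale it up, and add a safety margin. The sketch below does that arithmetic and formats the result for `--time`; the measured time, scale factor and margin are hypothetical numbers, and linear scaling is an assumption that will not hold for every code.

```bash
#!/bin/bash
# Convert a number of seconds to the HH:MM:SS form accepted by --time
to_walltime() {
    local s=$1
    printf '%02d:%02d:%02d\n' $(( s / 3600 )) $(( s % 3600 / 60 )) $(( s % 60 ))
}

measured=95       # hypothetical: seconds taken by a small trial run
scale=10          # hypothetical: the real problem is about 10x larger
margin_pct=20     # hypothetical safety margin

estimate=$(( measured * scale * (100 + margin_pct) / 100 ))
echo "#SBATCH --time=$(to_walltime "$estimate")"
```

With these numbers the estimate is 1140 seconds, so the script prints `#SBATCH --time=00:19:00`.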

First we need to copy the test scripts to your home directory. You are welcome to read the scripts but there is no need to. The scripts are located in `/home/shared/csc/Teaching/slurm_tests/` on Godzilla; use `cp` to copy them over. (Note: to submit jobs in the taskfarm you need to be in `/storage/<department>/<username>/`.)

<details>
<summary> > Here is the command if you need it</summary>

```bash
cp -r /home/shared/csc/Teaching/slurm_tests/. ./slurm_tests
```

</details>

#### Test 1

You need to write a script that loads the following module:

* OpenMPI version 4.1.4

It should request:

* 2 nodes
* 4 tasks per node
* 8 CPUs in total
* 2500 megabytes of RAM per CPU
* 2 minutes of time

It should then run `test1.sh`.

<details>
<summary> > Solution</summary>

```bash
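#!/bin/bash
# 2 nodes x 4 tasks per node x 1 CPU per task = 8 CPUs in total
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
# Memory is requested per CPU, in megabytes
#SBATCH --mem-per-cpu=2500
# Wall-clock limit in HH:MM:SS
#SBATCH --time=00:02:00

# Start from a clean environment, then load the toolchain and MPI library
module purge
module load GCC/11.3.0 OpenMPI/4.1.4

# srun launches one copy of the script per requested task
srun bash test1.sh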
```

</details>
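
To try the solution, save it to a file and pass that file to `sbatch`. The sketch below assumes the script was saved as `test1.slurm` (a hypothetical name) and relies on two standard SLURM behaviours: `sbatch` prints `Submitted batch job <jobid>`, and by default the job's output goes to `slurm-<jobid>.out` in the submission directory. The job ID `12345` is a placeholder used when SLURM is not available.

```bash
#!/bin/bash
if command -v sbatch >/dev/null 2>&1 && [ -f test1.slurm ]; then
    msg=$(sbatch test1.slurm)
else
    msg="Submitted batch job 12345"   # placeholder for illustration
fi

# The job ID is the fourth word of the message printed by sbatch
jobid=$(printf '%s\n' "$msg" | awk '{print $4}')
outfile="slurm-${jobid}.out"

echo "Job ID:      $jobid"
echo "Output file: $outfile"
```

Once the job has finished, `cat "$outfile"` shows whatever the tasks printed.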

#### Test 2

Run the Python script `test2-mpi.py`, which needs mpi4py, using 18 tasks and 2000 MB of memory per CPU for 3 minutes. Hint: use the SCRTP docs.

<details>
<summary> > Solution</summary>

See one of our many examples, such as the one linked below, and adapt it to fit the specifications given above.

https://docs.scrtp.warwick.ac.uk/hpc-pages/avon-slurm-examples/avon-mp4py.html#.

<details>
<summary> > Full solution</summary>

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=18
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2000
#SBATCH --time=00:03:00

module purge
module load GCC/11.3.0 OpenMPI/4.1.4
module load SciPy-bundle/2022.05

# srun will launch as many tasks as requested above
srun python test2-mpi.py
```

</details>

</details>
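
Note how the memory request in this solution scales. `--mem-per-cpu` is multiplied by the number of CPUs allocated, whereas SLURM's `--mem` option sets a total per node. A quick calculation with the values from the solution makes the difference concrete; the variable names are illustrative.

```bash
#!/bin/bash
ntasks_per_node=18
cpus_per_task=1
mem_per_cpu_mb=2000     # value passed to --mem-per-cpu

cpus_on_node=$(( ntasks_per_node * cpus_per_task ))
mem_on_node_mb=$(( cpus_on_node * mem_per_cpu_mb ))

echo "--mem-per-cpu=$mem_per_cpu_mb reserves $mem_on_node_mb MB on the node"
echo "a roughly equivalent per-node request would be --mem=$mem_on_node_mb"
```

Here the 18 single-CPU tasks reserve 36000 MB in total, which is far more than `--mem=2000` would give the whole node.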


#### Test 3

This is a two-part workflow:

* Run `test3-a.sh` using `LAME/3.100`
* Run `test3-b.sh` using `PIP-PyTorch/2.4.0`

Use only one submission, and ensure the PyTorch environment is kept separate from the LAME environment.

As in Test 1, request the following:

* 2 nodes
* 4 tasks per node
* 8 CPUs in total
* 2500 megabytes of RAM per CPU
* 2 minutes of time

If you see `The following have been reloaded with a version change` then your script is wrong and your environment is potentially damaged.

<details>
<summary> > Solution</summary>

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2500
#SBATCH --time=00:02:00

# Part 1: clean environment with only the modules LAME needs
module purge
module load GCCcore/11.3.0 LAME/3.100

srun bash test3-a.sh

# Part 2: purge again so nothing from the LAME toolchain leaks into PyTorch
module purge
module load GCC/13.2.0 OpenMPI/4.1.6 PIP-PyTorch/2.4.0

srun bash test3-b.sh
```

</details>
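
Purging between the two steps is what the solution relies on. Another common shell technique, shown here as an alternative and not as the course's method, is to run each step in a parenthesised subshell so that environment changes are discarded when the step ends. The sketch below uses an exported variable as a stand-in for `module load`; `run_step` and `STEP_ENV` are hypothetical names.

```bash
#!/bin/bash
run_step() {
    # The parentheses start a subshell: anything exported inside
    # disappears when the subshell exits.
    (
        export STEP_ENV="$1"    # stand-in for: module purge; module load ...
        echo "running step with $STEP_ENV"
    )
}

run_step "LAME"
run_step "PyTorch"

# Back in the parent shell the variable was never set
echo "after both steps: STEP_ENV=${STEP_ENV:-unset}"
```

In a job script, the body of each subshell would hold the `module` commands and the `srun` line for that step.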

#### Additional

Go and run the example from 07-Modules and try to get into a Jupyter notebook.