# Slurm

## Background

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

[slurm.schedmd.com](https://slurm.schedmd.com/overview.html)


## Directives (#SBATCH)

Slurm decides how to allocate cluster resources to your job (how many compute nodes, how many CPUs, for how long, and so on) from directives placed at the top of your job script. Each directive is a line starting with `#SBATCH`, and all directives must appear before the first executable command in the script.

### Basic Example

Replace the words in all caps with your own values.

```bash
#!/bin/bash
#SBATCH --partition=PARTITION
# TIME format: DAYS-HOURS:MINUTES:SECONDS
#SBATCH --time=TIME
#SBATCH --nodes=NODES
#SBATCH --ntasks=NTASKS
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --job-name=JOBNAME
```
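
The `--time` value uses the `DAYS-HOURS:MINUTES:SECONDS` format. When a limit is known in seconds, a small helper can produce that string. This is a minimal sketch; `to_slurm_time` is a hypothetical helper name, not a Slurm command.

```bash
#!/bin/bash
# Convert a number of seconds into DAYS-HOURS:MINUTES:SECONDS.
# to_slurm_time is a hypothetical helper, not part of Slurm.
to_slurm_time() {
    local total=$1
    local days=$(( total / 86400 ))
    local hours=$(( (total % 86400) / 3600 ))
    local minutes=$(( (total % 3600) / 60 ))
    local seconds=$(( total % 60 ))
    printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$minutes" "$seconds"
}

to_slurm_time 5400     # 90 minutes
to_slurm_time 93784    # 1 day, 2 hours, 3 minutes, 4 seconds
```

The result can also be passed on the command line, for example `sbatch --time="$(to_slurm_time 5400)" job.slurm`.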

### GPU Example

```bash
#!/bin/bash
#SBATCH --partition=GPU_PARTITION
# TIME format: DAYS-HOURS:MINUTES:SECONDS
#SBATCH --time=TIME
#SBATCH --nodes=NODES
#SBATCH --ntasks=NTASKS
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --job-name=JOBNAME
#SBATCH --gres=gpu:1
```
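
Inside a GPU job it can be useful to confirm what was actually allocated. The sketch below assumes the cluster exports `CUDA_VISIBLE_DEVICES` as a comma-separated list for jobs that request `--gres=gpu:N`; this is common but depends on site configuration. `count_gpus` is a hypothetical helper.

```bash
#!/bin/bash
# Count the entries in a CUDA_VISIBLE_DEVICES-style string such as "0,1".
# Assumption: the scheduler sets CUDA_VISIBLE_DEVICES for GPU jobs.
count_gpus() {
    local devices=$1
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    local -a ids
    IFS=',' read -r -a ids <<< "$devices"
    echo "${#ids[@]}"
}

echo "GPUs visible to this job: $(count_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```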

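### Exclusive Example

`--exclusive` reserves whole nodes for the job, and `--mem=0` requests all of the memory on each node.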

```bash
#!/bin/bash
#SBATCH --partition=PARTITION
# TIME format: DAYS-HOURS:MINUTES:SECONDS
#SBATCH --time=TIME
#SBATCH --nodes=NODES
#SBATCH --ntasks=NTASKS
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --job-name=JOBNAME
#SBATCH --mem=0
#SBATCH --exclusive
```


## Slurm Commands

Slurm provides its own command-line tools for inspecting the cluster and for submitting, monitoring, and cancelling jobs.

### sinfo

Get information about the partitions and nodes that make up the HPC cluster, including their current state

```bash
sinfo             # All partitions and nodes
sinfo | grep idle # Idle nodes only
```
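
For scripting, counting idle nodes from a fixed output format is more robust than matching text with `grep`. This is a sketch; `count_idle_nodes` is a hypothetical helper that expects `COUNT STATE` lines, the shape produced by `sinfo -h -o "%D %t"`. It matches the state `idle` exactly, so flagged variants such as `idle~` are not counted.

```bash
#!/bin/bash
# Sum the node counts of idle nodes from "COUNT STATE" lines on stdin.
# count_idle_nodes is a hypothetical helper, not a Slurm command.
count_idle_nodes() {
    awk '$2 == "idle" { total += $1 } END { print total + 0 }'
}

# Query the live cluster only where sinfo is installed.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -h -o "%D %t" | count_idle_nodes
    sinfo -t idle    # built-in state filter, an alternative to grep
fi
```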

### sbatch

Submit a batch script to Slurm for processing

```bash
sbatch job.slurm
# After you submit the batch script, Slurm prints the job ID:
# Submitted batch job JOB_ID
```
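
To reuse the job ID in later commands, it can be captured from that message. In this sketch, `parse_job_id` is a hypothetical helper that keeps the last word of the line. `sbatch --parsable`, which prints the job ID without the surrounding text, is an alternative.

```bash
#!/bin/bash
# Extract the job ID from sbatch's "Submitted batch job JOB_ID" message.
# parse_job_id is a hypothetical helper, not a Slurm command.
parse_job_id() {
    local message=$1
    echo "${message##* }"    # keep everything after the last space
}

# Submit only where sbatch is installed and the script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f job.slurm ]; then
    job_id=$(parse_job_id "$(sbatch job.slurm)")
    echo "Submitted job ${job_id}"
fi
```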

### scancel

Cancel a job that is pending or running

```bash
scancel JOB_ID
```

### squeue

Show the Slurm queue, either for all jobs or only for your own

```bash
squeue             # All jobs in the queue
squeue -u USERNAME # Only USERNAME's jobs
```
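
When many jobs are queued, a count per state is easier to read than the full listing. This is a sketch; `summarize_states` is a hypothetical helper that expects one state name per line, the shape produced by `squeue -h -o "%T"`.

```bash
#!/bin/bash
# Count jobs per state from one state name per line on stdin.
# summarize_states is a hypothetical helper, not a Slurm command.
summarize_states() {
    awk 'NF { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Query the live queue only where squeue is installed.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$(id -un)" -o "%T" | summarize_states
fi
```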

### scontrol

Show detailed information about a job, such as its state and allocated resources

```bash
scontrol show job JOB_ID
```
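
The output of `scontrol show job` is a series of `KEY=VALUE` pairs, so a single field can be pulled out in a script. In this sketch, `job_field` is a hypothetical helper. It splits on spaces, so it only suits values that contain no spaces, such as `JobState`.

```bash
#!/bin/bash
# Print the value of one KEY from "KEY=VALUE" pairs read on stdin.
# job_field is a hypothetical helper, not a Slurm command.
job_field() {
    local key=$1
    tr ' ' '\n' |
        awk -F= -v k="$key" '$1 == k { print substr($0, length(k) + 2); exit }'
}

# Inside a running job, SLURM_JOB_ID holds the job's own ID.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
    scontrol show job "$SLURM_JOB_ID" | job_field JobState
fi
```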


## Module 

The HPC cluster uses a module system to load most software into a user’s environment. Most software is not accessible by default and must be loaded first. This allows Research Computing to provide multiple versions of the same software concurrently and lets users switch between versions easily.

### Commands

List all available modules

```bash
module avail # Can be shortened to: module av
```

Load a module

```bash
module load NAME
```

Remove a module 

```bash
module remove NAME
```

Remove ALL modules, cleaning your environment

```bash
module purge
```
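
In a job script these commands are often combined: purge first so the job does not inherit whatever was loaded in the login shell, then load exactly what is needed. This is a sketch; `load_clean` is a hypothetical helper, and the module names in the usage comment are examples only, so check `module avail` for the names on your cluster.

```bash
#!/bin/bash
# Reset the environment, load the requested modules, then show the result.
# load_clean is a hypothetical helper built on the module commands above.
load_clean() {
    module purge
    local name
    for name in "$@"; do
        module load "$name"
    done
    module list
}

# Example usage (module names are assumptions, not taken from this cluster):
# load_clean gcc openmpi
```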