
SLURM Basics: Running Jobs

jkmcgann edited this page Nov 13, 2024 · 32 revisions

Introduction

The HPC clusters use the SLURM Resource Manager and Scheduler. Below is some basic information.

Slurm Basics

The most common way to submit jobs is to write a SLURM batch script and submit it to SLURM with the sbatch command. Here is a sample job script:

#!/bin/bash
#SBATCH --job-name=my_job_name         # Job name
#SBATCH --partition=haswell            # Partition/Queue name
#SBATCH --mail-type=END,FAIL           # Mail events
#SBATCH --mail-user=email@maine.edu    # Where to send mail
#SBATCH --ntasks=1                     # Run a single task
#SBATCH --cpus-per-task=4              # Run with 4 threads
#SBATCH --mem=60gb                     # Job memory request
#SBATCH --time=24:00:00                # Time limit hrs:min:sec
#SBATCH --output=test_%j.log           # Standard output and error log

module load module_name ...

srun program param1 ...

The ntasks and cpus-per-task values change depending on whether you have a multi-threaded program (one process that uses multiple threads) or a multi-process program (multiple processes, as with MPI). In the MPI case, you would set --ntasks=50 to run with 50 processes. Other directives control how the job is laid out:

#SBATCH --ntasks-per-node=4 # Run 4 tasks per node

#SBATCH --nodes=2 # Run on 2 nodes

Together, these would suit, for example, an MPI job running with 8 processes (4 per node across 2 nodes).
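As a sketch, a complete batch script for such an 8-process MPI job might look like the following. The partition, module name, and program name are placeholders; substitute whatever your code actually needs:

```shell
#!/bin/bash
#SBATCH --job-name=mpi_example      # Job name
#SBATCH --partition=haswell         # Partition/Queue name
#SBATCH --nodes=2                   # Run on 2 nodes
#SBATCH --ntasks-per-node=4         # 4 MPI processes per node (8 total)
#SBATCH --mem=8gb                   # Memory per node
#SBATCH --time=01:00:00             # Time limit hrs:min:sec
#SBATCH --output=mpi_%j.log         # %j expands to the job ID

# "openmpi" is an assumed module name; check "module avail" on your cluster
module load openmpi

# my_mpi_program is a placeholder for your own MPI executable
srun ./my_mpi_program
```

srun launches one copy of the program per task, so no mpirun invocation is needed here.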

Useful SLURM Commands

sbatch: Command to submit a job:

sbatch script-name

The --mail-type and --mail-user directives in the script above are optional.
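Note that most #SBATCH directives can also be overridden on the sbatch command line without editing the script. For example (the partition name here is just one from the list later on this page):

```shell
# Command-line options take precedence over #SBATCH directives in the script
sbatch --partition=skylake --time=02:00:00 script-name
```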

squeue: Command to check all jobs in the queue:

squeue

or to check only the jobs belonging to a particular user:

squeue -u user-name

Another form of this command is sq, which does the same thing in a slightly different format. The main additional information it shows is the total number of cores for each job, in the second-to-last column.
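If you want to tailor the columns yourself, squeue accepts a format string via -o. The format codes below are standard SLURM ones: job ID, partition, job name, state, elapsed time, and CPU count:

```shell
# Show one user's jobs with custom columns
squeue -u user-name -o "%.10i %.9P %.20j %.8T %.10M %.5C"
```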

sinfo: Command to get the status of all of the Slurm partitions:

sinfo

PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug           up   infinite      1    mix node-153
haswell*        up   infinite      4  down* node-[127-129,139]
haswell*        up   infinite     15    mix node-[55-58,61,81,90,122-123,125-126,130,140-142]
haswell*        up   infinite     67   idle node-[59,63-80,82-89,91-121,124,131-138]
haswell-test    up   infinite      1   idle node-62
skylake         up   infinite      1 drain* node-148
skylake         up   infinite      4    mix node-[143-144,149-150]
skylake         up   infinite      3   idle node-[145-147]
gpu             up   infinite      5    mix node-[g101,g102,g103,g104,g105]
epyc            up   infinite      5    mix node-[151,153,155,158,163]
epyc            up   infinite      1  alloc node-152
epyc            up   infinite      7   idle node-[154,156-157,159-162,164]
epyc-hm         up   infinite      2  alloc node-[167-168]
epyc-hm         up   infinite      2   idle node-[169-170]

scancel: Delete a job:

scancel JOB_ID
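scancel also accepts a user filter via the standard -u option, which is handy for clearing out all of your own jobs at once:

```shell
# Cancel every job belonging to a user
scancel -u user-name
```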

checkjob: The checkjob command can be used to get more information about a job or to check on the status of a job:

checkjob JOB_ID

This command mimics the command of the same name in the Moab scheduler that we used previously. Sample output:

[blackbear@penobscot pi_MPI]$ checkjob 962658
JobId=962658 JobName=parallel_pi_test
UserId=blackbear(1028) GroupId=blackbear(1003) MCS_label=N/A
Priority=10010 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=2:0
RunTime=00:00:10 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2022-09-14T12:29:08 EligibleTime=2022-09-14T12:29:08
AccrueTime=2022-09-14T12:29:08
StartTime=2022-09-14T12:29:08 EndTime=2022-09-14T12:34:08 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2022-09-14T12:29:08
Partition=haswell AllocNode:Sid=penobscot:9998
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node-[140-141]
BatchHost=node-140
NumNodes=2 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=2G,node=2,billing=8
Socks/Node=* NtasksPerN:B:S:C=4:0:*:* CoreSpec=*
   Nodes=node-[140-141] CPU_IDs=16-19 Mem=0 GRES_IDX=
MinCPUsNode=4 MinMemoryNode=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/blackbear/pi_MPI/go.slurm
WorkDir=/blackbear/abol/pi_MPI
StdErr=/blackbear/abol/pi_MPI/parallel_pi_962658.log
StdIn=/dev/null
StdOut=/home/blackbear/pi_MPI/parallel_pi_962658.log
Power=

seff: Command to check the memory and CPU efficiency of a job. This command is mostly useful after a job has completed:

seff JOB_ID

for instance:

[root@penobscot slurm]# seff 962658
Job ID: 962658
Cluster: penobscot
User/Group: blackbear/blackbear
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 4
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:32:08 core-walltime
Job Wall-clock time: 00:04:01
Memory Utilized: 508.00 KB
Memory Efficiency: 0.02% of 2.00 GB
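For more detailed per-step accounting after a job finishes, the standard sacct command provides similar information. The field list here is just a reasonable starting set:

```shell
# Per-step accounting for a finished job
sacct -j JOB_ID --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State
```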

tail -f: Since you are logged into Katahdin while your job runs in the background on a compute node, you can watch the job's output file as it is written:

tail -f test_JOBID.log

(substitute the actual name of your job's output file)

Partitions

In Slurm, the term "Partition" refers to a set of nodes. Other Resource Managers refer to these as Queues. Currently, the list of partitions on the Penobscot cluster is:

  • debug: general debugging of code. Currently just a single node

  • haswell: The largest partition in terms of both nodes and cores. Around 90 nodes, each with Intel Haswell or Broadwell CPUs with either 24 or 28 cores and 64 GB or 128 GB of RAM

  • skylake: 8 Intel Skylake nodes, each with 36 cores and 256 GB of RAM

  • epyc: a partition for the AMD EPYC3 nodes. These 14 nodes each have 96 cores and 512 GB of RAM

  • epyc-hm: these four high-memory nodes have AMD EPYC3 CPUs with 32 cores each and 1 TB of RAM

  • gpu: Penobscot only. Five NVIDIA nodes with a variety of GPUs and up to 1 TB of RAM

The list of partitions, along with the current state of the nodes in the partitions can be retrieved with the sinfo command.

Interactive Jobs

When you submit a job to SLURM, the job is sent to a node or set of nodes and runs in the background. After you submit the job, you are returned to the shell prompt and can continue with what you were doing. Occasionally, you may want to interact with the job directly on the node where it is running. You can do this with the srun command, like this:

srun --partition=gpu --ntasks=1 --cpus-per-task=4 --mem=64gb --gres=gpu:1 --time=10:00:00 --pty /bin/bash

This asks for an interactive job in the gpu partition with 4 CPU cores, one GPU, and 64 GB of RAM for 10 hours. Once the resources are available, your terminal is logged into the node running the job and you are left at a shell prompt. From there, you can run commands directly. This is particularly helpful when you want to use the nvcc command on a GPU node to compile with CUDA.
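Once the interactive shell opens on the compute node, you work as you would on any Linux machine. For example (nvidia-smi and nvcc are the standard NVIDIA tools, which should be present on the GPU nodes; hello.cu is a placeholder for your own source file):

```shell
# Confirm the GPU allocated to this job is visible
nvidia-smi

# Compile a CUDA source file (hello.cu is a placeholder)
nvcc -o hello hello.cu

# When finished, leave the node and release the allocation
exit
```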