# Hello World in `srun`
Below is the absolute simplest way to run a job on Cheaha.

In [31]:
!srun --pty echo "Hello World!"

Hello World!


While `srun` is great for setting up an interactive job using `--pty`, it doesn't scale well to repeated use or to intricate jobs. A long `srun` command line is hard to remember, share, and modify. A job created with `srun` will also terminate if you lose your connection to Cheaha!

Prefer using `sbatch` over `srun`. With `sbatch` you write a script with one resource request per line, and can sequence multiple tasks. These tasks can take the form of trackable job steps using `srun`, or just using bare shell commands. It is also possible to use job arrays to submit many similar jobs at the same time. Using `sbatch` makes it easy to

- share with collaborators
- keep track of versions -- repeatability!
- read and modify
- run multiple commands in one job
- run many of the same type of job with a single submission
- keep jobs running even if you lose your connection

Remember, `srun` is only useful for interactive jobs, one-off commands, and sub-tasks inside an `sbatch` job. In contrast, `sbatch` is meant for repeatable, collaborative Research Computing!

Let's take a look at how to use `sbatch`.

# `sbatch`
Before we start learning how to write `sbatch` scripts, there are some good practices to consider. As the [Zen of Python](https://www.python.org/dev/peps/pep-0020/) puts it, "explicit is better than implicit." For `sbatch` scripts this means: don't rely on default values. Instead, be explicit about your intent for the job submission. Then other people, and you at a later date, will be able to understand what you meant. Always...

- give your jobs meaningful names with `--job-name`
- specify your output logs with `--output` and `--error`
- choose partition and resources carefully and explicitly

### Hello World
Here is a sample script which waits a few seconds after submission, then `echo`s a couple of lines to an output file.

The cell below uses the `ipython` magic `%%bash` to run the contents of the cell in the `bash` shell, as though typed at a terminal. The `cat` command copies its input to its output. The `> "hello_world.sh"` redirects the output of `cat` to the file `hello_world.sh`. The `<<EOT` starts a `heredoc` and feeds it to `cat` as input. A `heredoc` is what it sounds like, a "fake" file we are making up as we go. Our heredoc starts on the line after `cat` and ends on the line before the closing `EOT`.

Basically, we're writing what you see into a file that we can use later. When you write a script on your own, you won't need to do this. Instead, open your favorite text editor and type everything from the line `#! /bin/bash` up to, but not including, the closing `EOT`. We have the extra parts here so you can see the contents of the file directly in Jupyter without having to open the file.
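
One detail is worth knowing before you write heredocs of your own: with an unquoted delimiter such as `<<EOT`, the shell expands `$variables` while the file is being written. The script in this section contains no `$`, so it is unaffected. Here is a minimal sketch of the difference; the file names `expanded.sh` and `literal.sh` and the variable `GREETING` are made up for illustration.

```shell
GREETING="hello"

# Unquoted delimiter: $GREETING is expanded now, while the file is written.
cat <<EOT > "expanded.sh"
echo "$GREETING"
EOT

# Quoted delimiter: the text is written literally and expanded only when the script runs.
cat <<'EOT' > "literal.sh"
echo "$GREETING"
EOT

cat "expanded.sh"   # echo "hello"
cat "literal.sh"    # echo "$GREETING"
```

Quoting the delimiter starts to matter once a script refers to variables such as `$SLURM_JOB_ID`, which should be expanded when the job runs rather than when the file is created.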

In [61]:
%%bash
cat <<EOT > "hello_world.sh"
#! /bin/bash

## BOOKKEEPING
#SBATCH --job-name=hello_world
# %x means "put the job-name here"
#SBATCH --output=%x.log
#SBATCH --error=%x.log

## RESOURCES
#SBATCH --partition=express
#SBATCH --time=00:01:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=256M

# PAYLOAD
sleep 15
echo "hello world"
echo "hi again"
EOT

We can verify that the contents of `hello_world.sh` are what we expect using `cat` again. The `!` symbol is `ipython` magic that runs the line as a shell command. Make sure the output matches the cell above!

So `%%bash` runs an entire cell in `bash` and `!` runs a single line in `bash`.

In [62]:
!cat "hello_world.sh"

#! /bin/bash

## BOOKKEEPING
#SBATCH --job-name=hello_world
# %x means "put the job-name here"
#SBATCH --output=%x.log
#SBATCH --error=%x.log

## RESOURCES
#SBATCH --partition=express
#SBATCH --time=00:01:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=256M

# PAYLOAD
sleep 15
echo "hello world"
echo "hi again"


Now that we are sure our `hello_world.sh` file is prepared, we can submit the script to the Slurm queue using `sbatch`.

In [42]:
!sbatch "hello_world.sh"

Submitted batch job 10147563


We can check that the job made it into the queue using `squeue`. You'll want to do this quickly because the job will only be around for about 15 seconds (see the line with `sleep 15`).

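You should also see the `ood-jupyter` job you are using to run these cells. If `hello_world` has already finished by the time you check, as in the sample output below, only the `ood-jupyter` job is listed.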

In [63]:
!squeue -u $USER

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          10146968     short ood-jupy    wwarr  R    2:18:22      1 c0066


We can verify the output of the script using `cat` again.

In [64]:
!cat "hello_world.log"

hello world
hi again


Congratulations, you now know the workflow for using `sbatch` scripts! There are a couple more commands that you may find useful. Next let's see what happens if you submit a job by mistake and need to cancel it to free up resources.

### How to cancel a running job
Below we'll run the exact same workflow as we did for `hello_world.sh`. We'll call this one `cancelme.sh` because we're going to learn how to cancel a job. We accidentally made the job sleep for 3600 seconds (1 hour) before getting to the good part. We meant to only wait 15 seconds, but we didn't realize it until after we submitted the job. Oops!

In [67]:
%%bash --out t
cat <<EOT > "cancelme.sh"
#! /bin/bash

## BOOKKEEPING
#SBATCH --job-name=cancelme
# %x means "put the job-name here"
#SBATCH --output=%x.log
#SBATCH --error=%x.log

## RESOURCES
#SBATCH --partition=express
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=256M

# PAYLOAD
sleep 3600  ## Oops we meant 15 seconds!
echo "finally!"
EOT

In [68]:
!cat "cancelme.sh"

#! /bin/bash

## BOOKKEEPING
#SBATCH --job-name=cancelme
# %x means "put the job-name here"
#SBATCH --output=%x.log
#SBATCH --error=%x.log

## RESOURCES
#SBATCH --partition=express
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=256M

# PAYLOAD
sleep 3600  ## Oops we meant 15 seconds!
echo "finally!"


In [69]:
!sbatch cancelme.sh

Submitted batch job 10148017


In [70]:
!squeue -u $USER

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          10148017   express cancelme    wwarr  R       0:00      1 c0094
          10146968     short ood-jupy    wwarr  R    2:22:41      1 c0066


Our `cancelme` job will take an hour to clear from the queue. In the meantime it ties up shared resources and worsens our job priority. Let's be good HPC citizens and cancel that mistaken job!

To cancel the job you'll need to modify the cell below. Where you see `[jobid]` below, replace it with the job id from our submission. Hint: it is the number printed after `Submitted batch job` when we ran `sbatch cancelme.sh`.

In [71]:
!scancel [jobid]
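
If you would rather not copy the job id by hand, it can be pulled out of the message that `sbatch` prints. This is a minimal sketch, assuming the message has the form shown in the sample output above; the variable names are illustrative.

```shell
# Sample message copied from the sbatch output above.
submit_message="Submitted batch job 10148017"

# Remove everything up to and including the last space, leaving only the id.
job_id="${submit_message##* }"

# Print the cancel command rather than running it, so this is safe to try anywhere.
echo "scancel $job_id"
```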

Let's check that we actually canceled the job. We can do that by making sure it's not in the queue any longer.

In [60]:
!squeue -u $USER

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          10146968     short ood-jupy    wwarr  R    1:12:54      1 c0066


### How to see past jobs
- `sacct`
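
`sacct` reports accounting records for jobs, including those that have already left the queue. This is a minimal sketch, assuming the job id from the earlier sample output (substitute your own) and a field list chosen for illustration. The guard is only there so the cell does not fail on a machine without Slurm.

```shell
job_id=10147563   # placeholder: use the id printed by your own sbatch call
fields="JobID,JobName,Partition,State,Elapsed,MaxRSS"

if command -v sacct >/dev/null 2>&1; then
    # One row per job and per job step, limited to the chosen columns.
    sacct --jobs="$job_id" --format="$fields"
else
    echo "sacct not found: run this on a machine with Slurm installed"
fi
```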

### How to be efficient
- `seff`
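
`seff` summarizes how much of the requested CPU and memory a finished job actually used, which helps right-size the next request. The sketch below first shows the kind of ratio involved, using a made-up usage figure, and then calls `seff` if it is present. `seff` is a contributed Slurm tool, so its availability on any given system is an assumption.

```shell
job_id=10147563     # placeholder: use one of your own finished jobs
requested_mb=256    # matches --mem=256M in hello_world.sh
used_mb=64          # hypothetical peak usage, for illustration only

# Integer percentage of the memory request that was actually used.
mem_efficiency=$(( 100 * used_mb / requested_mb ))
echo "memory efficiency: ${mem_efficiency}%"

if command -v seff >/dev/null 2>&1; then
    seff "$job_id"
else
    echo "seff not found: run this on a machine with Slurm installed"
fi
```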

### Hello World -- Arrays!
`--array` flag
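
A job array runs the same script many times, and each copy reads its own index from `SLURM_ARRAY_TASK_ID`. This is a minimal sketch, assuming a hypothetical script name `hello_array.sh` and an arbitrary index range of 0 to 3. In the log file name, `%A` is the job id of the array and `%a` is the task index.

```shell
# The quoted 'EOT' keeps ${SLURM_ARRAY_TASK_ID} literal in the file.
cat <<'EOT' > "hello_array.sh"
#! /bin/bash

## BOOKKEEPING
#SBATCH --job-name=hello_array
#SBATCH --output=%x_%A_%a.log
#SBATCH --error=%x_%A_%a.log

## RESOURCES
#SBATCH --partition=express
#SBATCH --time=00:01:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=256M
#SBATCH --array=0-3

# PAYLOAD
echo "hello from array task ${SLURM_ARRAY_TASK_ID}"
EOT

# Dry run outside Slurm: set the index by hand to see what one task would print.
SLURM_ARRAY_TASK_ID=2 bash "hello_array.sh"
```

Submitting the file with `sbatch hello_array.sh` would queue four tasks, one per index.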

### How to request GPUs
`--gres` flag
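
GPUs are requested as generic resources. This is a minimal sketch, assuming a hypothetical script name `hello_gpu.sh`. The partition name `pascalnodes` and the count of one GPU are assumptions, so check which GPU partitions your account can use before submitting.

```shell
cat <<'EOT' > "hello_gpu.sh"
#! /bin/bash

## BOOKKEEPING
#SBATCH --job-name=hello_gpu
#SBATCH --output=%x.log
#SBATCH --error=%x.log

## RESOURCES
# Assumed partition name; replace it with a GPU partition available to you.
#SBATCH --partition=pascalnodes
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
# Request one GPU on the node.
#SBATCH --gres=gpu:1

# PAYLOAD
# List the GPUs visible to the job.
nvidia-smi
EOT

# Check the script for shell syntax errors without running it.
bash -n "hello_gpu.sh" && echo "syntax ok"
```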

### Different ways to request resources
- `--ntasks` vs `--nodes` + `--ntasks-per-node`
- `--cpus-per-task`
- `--mem-per-cpu` vs `--mem`
- Where to find up-to-date `--partition` info
- Time formats for `--time`
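
The memory flags differ in what they scale with: `--mem` is a per-node amount, while `--mem-per-cpu` grows with the number of CPUs allocated. Here is a rough worked example with made-up numbers, assuming all tasks land on a single node.

```shell
ntasks=4
cpus_per_task=2
mem_per_cpu_mb=1024

# CPUs allocated = tasks x CPUs per task.
total_cpus=$(( ntasks * cpus_per_task ))

# Memory implied by a per-CPU request = CPUs x memory per CPU.
total_mem_mb=$(( total_cpus * mem_per_cpu_mb ))

echo "CPUs allocated: $total_cpus"
echo "memory implied by --mem-per-cpu=${mem_per_cpu_mb}M: ${total_mem_mb}M"
echo "equivalent single-node request: --mem=${total_mem_mb}M"
```

For `--time`, Slurm accepts `minutes`, `minutes:seconds`, `hours:minutes:seconds`, `days-hours`, `days-hours:minutes`, and `days-hours:minutes:seconds`.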

### Other `sbatch` flags that may be useful
- `--mail-type` and `--mail-user`
- `-D` or `--chdir`
- `--export` and `--export-file` control which environment variables are passed to the job
- `--no-requeue` keeps the job from being requeued if it is terminated; contrast with `--requeue`
- `--parsable` prints only the job id, which helps with automation
- `--test-only` validates the script and estimates when the job would start, without submitting it
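
With `--parsable`, `sbatch` prints only the job id, optionally followed by `;` and a cluster name, which is easier to capture in a script than the full message. This is a minimal sketch: the helper name `parse_job_id` is hypothetical, and the sample strings stand in for real `sbatch --parsable` output.

```shell
# Keep only the part before the first ';' (the job id).
parse_job_id() {
    printf '%s\n' "${1%%;*}"
}

parse_job_id "10147563"           # prints 10147563
parse_job_id "10147563;cluster"   # prints 10147563

# On a machine with Slurm, a real submission could be captured like this.
if command -v sbatch >/dev/null 2>&1 && [ -f "hello_world.sh" ]; then
    job_id="$(parse_job_id "$(sbatch --parsable hello_world.sh)")"
    echo "submitted job $job_id"
fi
```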

### Useful environment variables to use in a script
- `SLURM_ARRAY_JOB_ID`, `SLURM_ARRAY_TASK_COUNT`, `SLURM_ARRAY_TASK_ID`
- `SLURM_CPUS_PER_TASK`, `SLURM_CPUS_ON_NODE`
- `SLURM_JOB_NAME`, `SLURM_JOB_ID`
- `SLURM_MEM_PER_CPU`, `SLURM_MEM_PER_NODE`
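
Slurm sets these variables inside a job; in a plain shell they are normally unset. The sketch below reads a few of them with fallback values, so the same lines also run outside a job. `OMP_NUM_THREADS` is the standard OpenMP thread-count variable, used here only as an example of something that might consume the CPU count.

```shell
# ${VAR:-default} substitutes the default when VAR is unset or empty.
job_name="${SLURM_JOB_NAME:-interactive}"
job_id="${SLURM_JOB_ID:-none}"
cpus="${SLURM_CPUS_PER_TASK:-1}"

# Match the thread count to the CPUs that were actually requested.
export OMP_NUM_THREADS="$cpus"

echo "job name: $job_name"
echo "job id:   $job_id"
echo "threads:  $OMP_NUM_THREADS"
```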

### Using `srun` inside `sbatch`
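
Inside a batch script, each `srun` call launches a job step that Slurm tracks separately, while bare commands run as part of the batch script itself. This is a minimal sketch, assuming a hypothetical script name `hello_steps.sh`.

```shell
cat <<'EOT' > "hello_steps.sh"
#! /bin/bash

## BOOKKEEPING
#SBATCH --job-name=hello_steps
#SBATCH --output=%x.log
#SBATCH --error=%x.log

## RESOURCES
#SBATCH --partition=express
#SBATCH --time=00:02:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=256M

# PAYLOAD
# Bare shell command: runs as part of the batch script.
echo "starting"

# Each srun call becomes its own job step.
srun echo "step one"
srun echo "step two"
EOT

# Check the script for shell syntax errors without running it.
bash -n "hello_steps.sh" && echo "syntax ok"
```

After the job finishes, `sacct --jobs=[jobid]` should list the steps as separate rows.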