# A tour of the Dask-Jobqueue docs

The Dask-Jobqueue docs are a concise resource covering most of the specifics of deploying and running Dask clusters on HPC systems.  They are here: <https://jobqueue.dask.org/>

As Dask-Jobqueue closely follows the development of Dask.distributed, you'll often find yourself looking at the docs of distributed as well.  They are here: <https://distributed.dask.org/>

## Installation

There are different ways of installing Dask-Jobqueue, all of which are outlined in the docs.

Note that Dask-Jobqueue is under heavy development, with functionality being added frequently.  You might want to install directly from the Git `master` branch rather than waiting for a release that has the feature you need.  For this, run:
```shell
python -m pip install git+https://github.com/dask/dask-jobqueue@<git-ref>
``` 
with `<git-ref>` being the [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) you want to install.  While this can be `master`, it is recommended to pin an explicit commit so that the installation is reproducible.
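
After installing from a Git reference, it can be useful to confirm which version actually ended up in your environment.  The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is made up for this illustration, and the exact format of the version string for a development install may vary.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="dask-jobqueue"):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if Dask-Jobqueue is not installed here.
print(installed_version())
```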

## Getting help

- A good first step is _reading the [manual](http://jobqueue.dask.org)_.  This will get you accustomed to the jargon and the specific terms used, and give you a broad overview of what to expect (and what not to expect) from the package. (You can read the entire docs, including skimming the API definitions, in less than an hour.)

- Secondly, the [`[dask]` tag on Stack Overflow](https://stackoverflow.com/questions/tagged/dask) might be a good resource.

- Furthermore, the [dask-jobqueue issue tracker](https://github.com/dask/dask-jobqueue/issues) is a good place for usage questions.  (Often, it's advisable to search existing issues / pull requests. Somebody else might have had the same problem.)

## How this works

(There's example code for most of the cluster classes supported by Dask-Jobqueue [in the docs](http://jobqueue.dask.org/en/stable/examples.html).)

In short, to set up a Dask scheduler and be ready to start jobs providing workers, run something like:
```python
from dask_jobqueue import SLURMCluster

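# `cores` is the total number of cores per job; split across 4 worker
# processes, each worker gets 2 threads.  (Recent releases use `cores` and
# `account`; older ones used `threads` per worker process and `project`.)
cluster = SLURMCluster(cores=8,
                       processes=4,
                       memory="16GB",
                       account="project_id",
                       walltime="01:00:00",
                       queue="batch")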
```

This will create a Dask scheduler and wait for you to scale the cluster up using, e.g.,
```python
cluster.scale(8)
```

This will lead to the submission of a sufficient number of SLURM jobs to provide 8 workers.
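
To make "a sufficient number of jobs" concrete, here is a rough sketch of the arithmetic, assuming each submitted job starts `processes` worker processes (4 in the example above).  The helper `jobs_needed` is hypothetical and only illustrates the rounding; it is not part of Dask-Jobqueue.

```python
import math


def jobs_needed(n_workers, processes_per_job):
    """Number of batch jobs required to provide at least `n_workers` workers."""
    if n_workers < 0 or processes_per_job < 1:
        raise ValueError("need n_workers >= 0 and processes_per_job >= 1")
    # Round up: a partially used job still has to be submitted as a whole.
    return math.ceil(n_workers / processes_per_job)


print(jobs_needed(8, 4))  # 2 jobs for 8 workers
print(jobs_needed(9, 4))  # 3 jobs, the last one only partially needed
```

If the requested number of workers is not a multiple of the processes per job, you should expect the scheduler to be asked for one more job than the plain division suggests.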

## Configuration

While it is perfectly possible to pass all keyword arguments directly when initializing the cluster, it is much easier to create a [configuration file](http://jobqueue.dask.org/en/stable/configuration.html) containing all or most of this information.  To get the best performance on your cluster, ask an admin for advice: they might provide hints on the best network **`interface`** to use for inter-worker communication, or tell you what to use as a high-bandwidth **`local_directory`**.
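
As a rough illustration of what such a configuration contains, the sketch below builds the nested structure as a plain Python dictionary and prints it.  The values are hypothetical placeholders (`ib0` and `/tmp` are simply taken from the sample job script further down), and the key names are assumed to follow the layout of the `jobqueue.yaml` file described in the linked docs, so check them against the documentation for your version.

```python
import json

# Hypothetical settings for a SLURM cluster; adjust every value to your site.
jobqueue_config = {
    "jobqueue": {
        "slurm": {
            "cores": 8,                  # total cores per job
            "processes": 4,              # worker processes per job
            "memory": "16GB",            # memory per job
            "queue": "batch",
            "walltime": "01:00:00",
            "interface": "ib0",          # network interface suggested by your admin
            "local-directory": "/tmp",   # fast local scratch space
        }
    }
}

# JSON is also valid YAML, so the printed text can serve as a starting
# point for a configuration file.
print(json.dumps(jobqueue_config, indent=2))
```

With the settings in a configuration file, the cluster can then be created with far fewer keyword arguments.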

## Debugging

There's [info in the docs](http://jobqueue.dask.org/en/stable/debug.html).

One thing that might help you initially is the ability to see the job script that will be submitted to the job scheduler:
```python
print(cluster.job_script())
```

A typical job script will look similar to this:
```shell
#!/usr/bin/env bash

#SBATCH -J dask-worker
#SBATCH -p batch
#SBATCH -n 1
#SBATCH --cpus-per-task=24
#SBATCH --mem=94G
#SBATCH -t 00:30:00
JOB_ID=${SLURM_JOB_ID%;*}

/path/to/python -m distributed.cli.dask_worker tcp://10.11.12.13:38814 \
    --nthreads 12 --nprocs 2 --memory-limit 50.00GB --name dask-worker--${JOB_ID}-- \
    --death-timeout 15s --local-directory /tmp --interface ib0
    
```

If you have trouble spinning up a cluster at all, your admins might be the best people to help debug this job script.
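
When discussing a job script with your admins, it can help to pull out just the scheduler directives.  The following sketch does this with plain string handling; the function name `sbatch_directives` is made up for this example, and the sample text is the header of the job script shown above.

```python
def sbatch_directives(job_script):
    """Return the option strings of all `#SBATCH` lines in a job script, in order."""
    directives = []
    for line in job_script.splitlines():
        line = line.strip()
        if line.startswith("#SBATCH"):
            # Keep only the part after the `#SBATCH` marker.
            directives.append(line[len("#SBATCH"):].strip())
    return directives


sample = """#!/usr/bin/env bash

#SBATCH -J dask-worker
#SBATCH -p batch
#SBATCH -n 1
#SBATCH --cpus-per-task=24
#SBATCH --mem=94G
#SBATCH -t 00:30:00
"""

for option in sbatch_directives(sample):
    print(option)
```

With a real cluster object you would pass `cluster.job_script()` instead of `sample`.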

## Monitoring your jobs

To see what your Dask-Jobqueue cluster is doing to the HPC job scheduler, a few shell commands might come in handy:

See all your jobs with:

```shell
squeue | grep ${USER}
```

or wrap it in a `watch` (but make sure to be patient and set a responsible interval with the `-n` flag):

```shell
watch -n 30 "squeue | grep ${USER}"
```

Get a quick overview of how many jobs are pending (PD), running (R), completing (CG), etc. with
```shell
watch -n 10 "squeue | grep $USER | awk '{print \$5}' | sort | uniq -c | paste -s"
```
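
The same per-state count can also be produced from Python, e.g. from within a notebook.  This is a minimal sketch, assuming `squeue` is available on your `PATH`; the function names are made up for this example.  The counting is demonstrated on hypothetical state codes so that the snippet runs without a job scheduler.

```python
import getpass
import subprocess
from collections import Counter


def count_states(state_codes):
    """Count job state codes such as 'PD' or 'R', ignoring blank entries."""
    return Counter(code.strip() for code in state_codes if code.strip())


def my_job_states():
    """Ask squeue for the current user's jobs and count them by state.

    `-h` suppresses the header and `-o %t` prints only the compact state code.
    """
    result = subprocess.run(
        ["squeue", "-h", "-u", getpass.getuser(), "-o", "%t"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_states(result.stdout.splitlines())


# Demonstration on hypothetical state codes (no scheduler required).
print(count_states(["PD", "R", "R", "CG", ""]))
```

On a login node, calling `my_job_states()` would query the scheduler once; as with `watch`, avoid calling it in a tight loop.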

Monitor all your processes with either
```shell
top -u ${USER}
```
or

```shell
htop -u ${USER}
```