# Working with the Python Dask library

Chris Want -- chris.want@ualberta.ca

![](images/logos.png)

## Notes about the slides ...

They are in a Jupyter notebook using the RISE extension.

https://github.com/ualberta-rcg/wg-dask-webinar

RISE stands for Reveal.js - Jupyter/IPython Slideshow Extension

https://rise.readthedocs.io/en/maint-5.5/

## Goals

* Brief intro to Dask
* Show some possibilities for:
    * Parallelizing programs on your laptop
    * Working with Compute Canada
* Share some gotchas

Dask is a library for parallel computing in Python. It provides:

* Tools to create task graphs
* Schedulers/workers/threads to run task graphs
* Data collections
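
To make the first two bullets concrete, here is a minimal sketch of a hand-written task graph. It relies on Dask's documented convention that a graph is a plain dictionary and that a tuple starting with a callable is a task. The key names `x`, `y`, `z` and `w` are made up for illustration.

```python
from operator import add

from dask.threaded import get  # the threaded single-machine scheduler

# Keys name results. A tuple whose first element is callable is a task:
# (function, *arguments). Arguments that are keys get replaced by results.
graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),          # z = add(x, y)
    "w": (sum, ["x", "y", "z"]),   # w = sum([x, y, z])
}

z = get(graph, "z")
w = get(graph, "w")
print("z is", z, "and w is", w)  # prints: z is 3 and w is 6
```

In practice you rarely write graphs by hand; `dask.delayed` and the data collections build them for you.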

## `dask.delayed`

### The World's 2nd worst adding function ...
(... but at least it's predictable and easy to understand!)

In [None]:
from time import sleep

def slow_add(x, y):
    sleep(1)
    return x + y

In [None]:
%%time
# This takes three seconds to run because we call each
# function sequentially, one after the other

x = slow_add(1, 2)
y = slow_add(2, 3)

z = slow_add(x, y)

print("z is", z)

In [None]:
%%time
# Question: if we swap the first two lines, does the timing change?

y = slow_add(2, 3)
x = slow_add(1, 2)

z = slow_add(x, y)

print("z is", z)
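
Swapping the lines changes nothing: both versions take about three seconds, because `x` and `y` do not depend on each other but still run one after the other. As a point of comparison before bringing in Dask, the sketch below runs the two independent calls concurrently using only the standard library. The choice of `ThreadPoolExecutor` is an illustration, not something from the original slides.

```python
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter, sleep


def slow_add(x, y):
    sleep(1)
    return x + y


start = perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    # x and y are independent, so both can be in flight at once
    future_x = pool.submit(slow_add, 1, 2)
    future_y = pool.submit(slow_add, 2, 3)
    # z needs both results, so it has to wait for them
    z = slow_add(future_x.result(), future_y.result())
elapsed = perf_counter() - start

print("z is", z, "after about", round(elapsed), "seconds")
```

This should take roughly two seconds instead of three, but the dependency bookkeeping is done by hand. That bookkeeping is what a task graph automates.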

### Parallelize with `dask.delayed`...

In [None]:
from dask import delayed

In [None]:
%%time
# This returns immediately; all it does is build a task graph

x = delayed(slow_add)(1, 2)
y = delayed(slow_add)(2, 3)

z = delayed(slow_add)(x, y)
z

In [None]:
%%time
# This actually runs our computation
# using a local thread pool

z.compute()

In [None]:
# Look at the task graph for `z`
z.visualize()
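
The same graph can also be built with `delayed` used as a decorator, and several delayed objects can be evaluated together with `dask.compute`. A small sketch follows; the name `slow_add_lazy` is hypothetical and is chosen only to avoid shadowing the plain `slow_add` above.

```python
from time import sleep

import dask
from dask import delayed


@delayed
def slow_add_lazy(x, y):
    sleep(1)
    return x + y


# Each call records a task instead of running the function
x = slow_add_lazy(1, 2)
y = slow_add_lazy(2, 3)
z = slow_add_lazy(x, y)

# dask.compute evaluates all three in one pass, so x and y are computed
# only once, and returns a tuple of results
x_val, y_val, z_val = dask.compute(x, y, z)
print(x_val, y_val, z_val)  # prints: 3 5 8
```

Calling `x.compute()`, `y.compute()` and `z.compute()` separately would give the same values, but the work for `x` and `y` would be repeated when computing `z`.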

In [None]:
# How about a for loop on a list?

data = [1, 2, 3, 4, 5, 6, 7, 8]

In [None]:
%%time
# Sequential code

results = []

for x in data:
    y = slow_add(x, 1)
    results.append(y)
    
total = sum(results)
print("total is", total)

In [None]:
results = []

for x in data:
    y = delayed(slow_add)(x, 1)
    results.append(y)
    
total = delayed(sum)(results)

# Let's see what type of thing total is
print("Printing total: ", total)

In [None]:
%%time

# Computing ...
result = total.compute()
print("Printing result from computing total:", result)

In [None]:
total.visualize()
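
One gotcha worth flagging with loops like the one above (picked here as a likely stumbling block, not taken from the original slides) is the placement of the parentheses. `delayed(f)(x)` records a task, whereas `delayed(f(x))` runs `f` right away and only wraps the finished value. The sketch below uses a shorter sleep than the slides to keep it quick.

```python
from time import perf_counter, sleep

from dask import delayed


def slow_add(x, y):
    sleep(0.5)
    return x + y


# Wrong: slow_add(1, 2) runs immediately; delayed() just wraps the result
start = perf_counter()
eager = delayed(slow_add(1, 2))
eager_build_time = perf_counter() - start

# Right: wrap the function first, then call the wrapper to record a task
start = perf_counter()
lazy = delayed(slow_add)(1, 2)
lazy_build_time = perf_counter() - start

print("eager build:", round(eager_build_time, 2), "seconds")
print("lazy build: ", round(lazy_build_time, 2), "seconds")
print("results:", eager.compute(), lazy.compute())
```

Both give the same answer, but the eager version has already paid for the sleep while the graph was being built, so nothing is left to run in parallel.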

## Schedulers

Something has to execute these task graphs!

Two families of schedulers:
* Single machine
* Distributed

In [None]:
# Single thread ...

%time total.compute(scheduler='synchronous')

In [None]:
# Local threads
# Uses multiprocessing.pool.ThreadPool

# Use all the processors
%time total.compute(scheduler='threads')

In [None]:
# Or only some
%time total.compute(scheduler='threads', num_workers=2)

In [None]:
# Local processes
# Uses multiprocessing.Pool

# Use all the processors
%time result = total.compute(scheduler='processes')

In [None]:
%%time

# Or only some
result = total.compute(scheduler='processes',
                       num_workers=2)
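
Instead of passing `scheduler=` to every `compute()` call, the scheduler can be chosen for a whole block of code with `dask.config.set`. A minimal sketch follows, using a hypothetical `add_one` helper without the sleep so that it runs instantly.

```python
import dask
from dask import delayed


def add_one(x):
    return x + 1


data = [1, 2, 3, 4, 5, 6, 7, 8]
total = delayed(sum)([delayed(add_one)(x) for x in data])

# Everything computed inside the block uses the single-threaded scheduler,
# which is handy for debugging with pdb or reading tracebacks
with dask.config.set(scheduler="synchronous"):
    sync_result = total.compute()

# Same graph, now on the local thread pool
with dask.config.set(scheduler="threads"):
    threaded_result = total.compute()

print(sync_result, threaded_result)  # prints: 44 44
```

Calling `dask.config.set(scheduler=...)` outside a `with` block sets the default for the rest of the session.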

In [None]:
# Distributed scheduler, running on the local machine
# Gives a nice dashboard
# (requires the Python package 'bokeh')

from dask.distributed import Client
client = Client()
client

## The World's worst adding function ...
(... easy to understand, but unpredictable)


In [None]:
from random import randrange

def random_slow_add(x, y):
    sleep(randrange(8,15))
    return x + y

In [None]:
results = []

for x in data:
    y = delayed(random_slow_add)(x, 1)
    results.append(y)
    
total = delayed(sum)(results)

In [None]:
%time result = total.compute()

In [None]:
client.close()

In [None]:
client = Client(processes=False)
client
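
Once a `Client` exists, work can also be submitted directly as futures, without building a `delayed` graph first. The sketch below assumes a throwaway in-process client; `add_one` is a hypothetical helper, and `dashboard_address=None` is passed only to skip the dashboard so the example needs no extra port.

```python
from dask.distributed import Client


def add_one(x):
    return x + 1


# In-process scheduler and workers (threads only), no dashboard
client = Client(processes=False, dashboard_address=None)

data = [1, 2, 3, 4, 5, 6, 7, 8]

# One future per input; the work starts as soon as it is submitted
futures = client.map(add_one, data)

# Futures can be passed to other tasks; the data stays on the workers
total = client.submit(sum, futures)

result = total.result()  # block until the answer is ready
print("total is", result)  # prints: total is 44

client.close()
```

Unlike `delayed`, nothing here waits for a `compute()` call, which suits workloads where new tasks depend on results as they arrive.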

## Data collections

* Dask Dataframe
* Dask Array
* Dask Bag
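
As a small taste of the collections, here is a sketch of the earlier "add one to each item and sum" loop written with Dask Array. It assumes NumPy is installed, and the chunk size of 4 is an arbitrary choice for illustration.

```python
import dask.array as da
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# Split the array into chunks of 4 elements; each chunk becomes its own task
x = da.from_array(data, chunks=4)
print("number of chunks:", x.numblocks)

# Looks like NumPy, but only builds a task graph
total = (x + 1).sum()

result = total.compute()
print("total is", result)  # prints: total is 44
```

Dask DataFrame and Dask Bag follow the same pattern: a familiar interface (pandas, or lists of Python objects) backed by a task graph over partitions.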

# Running on a Cluster

I'll be borrowing a few demos that I teach in my Dask workshop.

### Setup: Log into cluster

`ssh $USER@graham.computecanada.ca`

### Setup: Make a working directory

```
cd $SCRATCH
mkdir dask-cluster-examples
cd dask-cluster-examples
```

### Setup: Grab demo files

```
wget https://raw.githubusercontent.com/ualberta-rcg/python-dask/master/cluster-examples/dask-workers-via-slurm.ipynb
wget https://raw.githubusercontent.com/ualberta-rcg/python-dask/master/cluster-examples/run-dask.py
wget https://raw.githubusercontent.com/ualberta-rcg/python-dask/master/cluster-examples/run-dask-submit.sh
```


### Setup: Create a python virtual environment

```
module load python/3.7
virtualenv --no-download ~/virtualenv/dask
source ~/virtualenv/dask/bin/activate
pip install jupyter dask dask-jobqueue distributed graphviz bokeh dask-mpi mpi4py
```

## Example 1: Running a Jupyter notebook and using SLURM to create workers

We follow the advice on this page: https://docs.computecanada.ca/wiki/Jupyter

### Example 1: Tunneling

Tunneling will allow the browser on your computer to access the Jupyter notebooks and Dask dashboard running on a cluster node.

First, in a separate terminal, open a tunnel from your computer to the cluster, e.g.,

```
sshuttle --dns -Nr $USER@graham.computecanada.ca
```

If you are on Windows, `sshuttle` probably won't work, so check the Compute Canada Jupyter documentation linked above for alternatives.

### Example 1: Provisioning a node to run a notebook on

**NOTE**: you will need to modify the account listed below for your own project.

We use the SLURM scheduler to get an interactive node to run Jupyter on:

```
salloc --account=cc-debug --ntasks=1
```

### Example 1: Running Jupyter

Once we have a prompt on the interactive node, we can run the notebook server:

```
cd $SCRATCH/dask-cluster-examples
source ~/virtualenv/dask/bin/activate
jupyter-notebook --ip `hostname -f` --no-browser &
```

This will tell us the address to use in our browser to access the notebooks, e.g.,

`http://gra284.graham.sharcnet:8888/?token=924667fa08c3baefa62c29e10d8c8fedcf70406b88a06177`

### Example 1: Run the example

We can now run the example `dask-workers-via-slurm.ipynb` and follow the instructions there.
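
For orientation, here is a rough sketch of how a notebook can ask SLURM for workers using `dask-jobqueue`, which the virtual environment above installs. This is an assumption about the general approach, not a copy of the notebook: the function name `start_slurm_workers` and the resource values are hypothetical. The accounting group is left out on purpose, because the keyword for it appears to have been renamed between `dask-jobqueue` releases (`project` in older ones, `account` in newer ones), so check the version you have installed.

```python
import shutil


def start_slurm_workers(jobs=2):
    """Submit `jobs` SLURM jobs that each host a Dask worker."""
    # Imported here so the file can be loaded without dask-jobqueue present
    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(
        cores=1,              # hypothetical: cores per SLURM job
        memory="2GB",         # hypothetical: memory per SLURM job
        walltime="00:15:00",  # hypothetical: time limit per SLURM job
    )
    # Show the batch script that will be submitted for each job
    print(cluster.job_script())
    cluster.scale(jobs=jobs)
    return Client(cluster)


if shutil.which("sbatch") is None:
    print("sbatch not found: run this on a cluster node")
else:
    client = start_slurm_workers(jobs=2)
    print(client)
```

The workers only join once SLURM starts the jobs, so computations submitted early will wait in the scheduler until then.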

## Example 2: Running a python script via SLURM to create a Dask network via MPI

We will use `dask-mpi` to automatically create our Dask network (scheduler, workers).

Before submitting, edit the `dask-mpi-submit.sh` SLURM script so that it uses your own accounting group.
Then go to the example directory and submit the job:

```
cd $SCRATCH/dask-cluster-examples
sbatch dask-mpi-submit.sh
```

This will set up the Dask network using MPI in the script `dask-mpi.py`. The code executed is very similar to the Jupyter example.

## Example 3: Running a python script via SLURM to manually create Dask network

We will manually create a scheduler, some workers, and run a python script that uses this infrastructure.

Before submitting, edit the `run-dask-submit.sh` SLURM script so that it uses your own accounting group.
Then go to the example directory and submit the job:

```
cd $SCRATCH/dask-cluster-examples
sbatch run-dask-submit.sh
```

This will set up the Dask network using command-line tools and run the script `run-dask.py` with Python. The code executed is very similar to the Jupyter example.
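
A common way to wire a manually started network together is a scheduler file: `dask-scheduler --scheduler-file <file>` writes its address to the file, `dask-worker --scheduler-file <file>` reads it, and the Python script connects through the same file. The sketch below shows the client side only. It assumes this is roughly how the example is set up; the file name `scheduler.json` and the helpers `connect` and `add_one` are hypothetical.

```python
from pathlib import Path


def add_one(x):
    return x + 1


def connect(scheduler_file):
    """Connect to a running scheduler via the file it wrote at startup."""
    from dask.distributed import Client

    return Client(scheduler_file=scheduler_file)


scheduler_file = Path("scheduler.json")  # hypothetical file name

if scheduler_file.exists():
    client = connect(str(scheduler_file))
    futures = client.map(add_one, [1, 2, 3, 4, 5, 6, 7, 8])
    print("total is", client.submit(sum, futures).result())
    client.close()
else:
    print("no scheduler file found: start dask-scheduler first")
```

Putting the file on a shared filesystem such as `$SCRATCH` lets the scheduler, the workers and the client find each other even when they land on different nodes.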

## Thanks! Questions?

Slides: https://github.com/ualberta-rcg/wg-dask-webinar

Workshop: https://ualberta-rcg.github.io/python-dask/