# Jupyter + Dask: local vs cluster

This notebook demonstrates:
* How to start a Dask cluster from a notebook on TREX
* How to start a local Dask cluster (simulating a single computer)
* How to manipulate time series with the dask.dataframe module
* How to use dataframes and the client API to submit a series of long operations (>100 ms)
* That the code stays the same between a local cluster and a cluster on TREX

## Choosing a Dask subcluster

We can choose between two types of Dask subcluster:
* A local cluster with 4 cores, simulating a computer
* A Slurm cluster connected to the TREX cluster

If you decide to use the local cluster, run the cell that defines the LocalCluster.

If you decide to use the Slurm cluster, run the cell that defines the SLURMCluster.
The cells that follow it can be run if you want to increase the size of the Dask cluster or make its size adaptive.
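The choice between the two cluster types can also be wrapped in a small helper. The sketch below is only an illustration: `make_cluster` and its `kind` argument are hypothetical names that do not appear elsewhere in this notebook, and the Slurm settings are expected to be passed in by the caller.

```python
def make_cluster(kind="local", **slurm_kwargs):
    """Return a Dask cluster of the requested kind (hypothetical helper)."""
    if kind == "local":
        # Imported lazily so that only the backend actually used is loaded.
        from dask.distributed import LocalCluster
        return LocalCluster(n_workers=1, threads_per_worker=4)
    if kind == "slurm":
        from dask_jobqueue import SLURMCluster
        # slurm_kwargs carries the site-specific settings (account, queue, ...).
        return SLURMCluster(**slurm_kwargs)
    raise ValueError(f"unknown cluster kind: {kind!r}")
```

Because both branches return an object with the same interface, the rest of the notebook does not need to know which one was chosen.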

## Start and configure a local cluster

We start a local cluster with a configuration similar to a 4-core PC.

In [2]:
from distributed import LocalCluster, Client

cluster = LocalCluster(n_workers=1, threads_per_worker=4)

In [None]:
!squeue --me

In [4]:
cluster.close()
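As a variation on the explicit `close()` call, a local cluster can be used as a context manager so that it is shut down even if a cell fails. This is a minimal sketch, not part of the original notebook: `processes=False` (workers run as threads of the current process) and `dashboard_address=None` (no dashboard) are choices made here only to keep the example light.

```python
from dask.distributed import Client, LocalCluster

# One worker with 4 threads, similar to the 4-core PC configuration above.
with LocalCluster(n_workers=1, threads_per_worker=4,
                  processes=False, dashboard_address=None) as cluster:
    with Client(cluster) as client:
        # nthreads() maps each worker address to its number of threads.
        total_threads = sum(client.nthreads().values())
        # Submit a trivial task to check that the cluster answers.
        answer = client.submit(sum, [1, 2, 3]).result()

print(total_threads, answer)
```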

## Start and configure a Slurm Cluster

Thanks to the dask_jobqueue module, it is possible to start a Dask sub-cluster on the TREX cluster, with a few lines from a notebook.

### Initial imports and creation of the cluster

Here we import the most commonly used classes, then create a cluster on Slurm.
This cluster is composed of Dask workers launched via Slurm jobs. Each Slurm job uses 2 cores and 8 GB of memory, and hosts 1 worker process. This is what is defined in the constructor below.  
Each job is then launched independently, either when we request it or automatically if we have enabled adaptive scaling.

In [None]:
from dask_jobqueue import SLURMCluster
from dask.distributed import Client, progress

account_Trex = 'formation_isae'
partition_Trex = "cpu19_rh8"
qos_Trex = "--qos=cpu_2019_40"

cluster = SLURMCluster(
    # Dask-worker specific keywords
    n_workers=2,                # start 2 workers
    cores=2,                    # each worker runs on 2 cores
    memory="8GB",               # each job gets 8 GB of memory (on TREX g2019: nb_cores * 8 GB, on g2019: nb_cores * 4 GB)
    processes=1,                # number of worker processes per job
    local_directory='$TMPDIR',  # Location to put temporary data if necessary
    account=account_Trex,
    queue=partition_Trex,
    walltime='01:00:00',
    interface='ib0',
    log_directory='../dask-logs',
    job_extra_directives=[qos_Trex] # qos to use
)        
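The comment on `memory` gives a per-core rule of thumb. A small helper can derive the memory string from the number of cores; `job_memory` is a hypothetical name used only for this sketch, and the per-core figure still has to be taken from the node type you target.

```python
def job_memory(cores, gb_per_core):
    """Return the memory string for one job, given a per-core allowance in GB."""
    return f"{cores * gb_per_core}GB"

print(job_memory(2, 4))  # 8GB
print(job_memory(2, 8))  # 16GB
```

The result can be passed directly as the `memory` argument, for example `memory=job_memory(2, 4)`.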



We can print the job script equivalent to this cluster.

In [None]:
print(cluster.job_script())

Displaying the cluster in the notebook should show a widget that lets us change the size of the cluster manually or automatically (adaptive cluster). The same settings can also be applied with a few lines of code, as shown below.

In [None]:
cluster

Manual specification of cluster size. The given parameter is the number of Dask workers.

In [7]:
cluster.scale(3)

Alternatively, request an adaptive cluster, whose size varies with the load between the given minimum and maximum.

Running the next (or previous) cell should change the information displayed in the widget.

In [None]:
cluster.adapt(minimum=2, maximum=4)
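The effect of `scale` can be tried without Slurm, since a local cluster exposes the same method. The sketch below is an assumption-laden illustration: it uses thread-based workers (`processes=False`) and no dashboard, and it waits with `wait_for_workers` so that the worker count is read only after the new workers have registered.

```python
from dask.distributed import Client, LocalCluster

with LocalCluster(n_workers=1, threads_per_worker=1,
                  processes=False, dashboard_address=None) as cluster:
    with Client(cluster) as client:
        cluster.scale(3)                         # request 3 workers in total
        client.wait_for_workers(3, timeout=60)   # block until they have registered
        n_after_scale = len(client.nthreads())   # one entry per worker

print(n_after_scale)
```

On Slurm the same call submits or cancels jobs instead, so new workers appear only once the scheduler has started them.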

In [None]:
!squeue --me

## Client creation

For all the Dask APIs (dataframe, delayed, bag ...) to use the Dask cluster that we have started, we must initialize a client.

This client can also be used to submit tasks in the remainder of this example.

Displaying the client should show the current cluster size and give a link to the Dask Dashboard. This link does not necessarily work from your browser. However, a proxy has been deployed, so you should be able to reach the Dask dashboard through the following URL, replacing `username` with your own user name:

https://jupyterhub.sis.cnes.fr/user/username/proxy/8787/status

The port may differ from the default, which is 8787.

In [None]:
from dask.distributed import Client

client = Client(cluster)
client
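The proxy URL can be assembled from the user name and the port. The helper below is a hypothetical convenience, not something provided by Dask or JupyterHub; it only reproduces the URL pattern given above. If the port is not 8787, it can be read from the link the client displays (also available as `client.dashboard_link`).

```python
def dashboard_proxy_url(username, port=8787,
                        hub="https://jupyterhub.sis.cnes.fr"):
    """Build the proxied dashboard URL following the pattern shown above."""
    return f"{hub}/user/{username}/proxy/{port}/status"

print(dashboard_proxy_url("username"))
```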

## Using the Client API to Submit Simulations

A typical use case for the cluster is running a costly calculation (from a few seconds to several minutes) once for each of many sets of input parameters. Monte Carlo studies are one example, but the pattern applies to other cases too.

The principle is therefore to generate or read all the parameter sets, then launch the calculation for each of them. This demo is of course simplified: the calculation function is a pure Python function.

### Generating/reading input data

Here the input parameters are held in a pandas dataframe. pandas can read them from a CSV file, but we can also generate them ourselves as below (ideally with more interesting data ...).  
We run 500 simulations here; feel free to change that!

In [None]:
# Generate the input parameters
import pandas as pd
import numpy as np
# The parameters are random here, but they could be chosen more carefully or read from a CSV file
input_params = pd.DataFrame(np.random.randint(low=0, high=1000, size=(500, 4)),
               columns=['a', 'b', 'c', 'd'])
input_params.head()
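Two variations on this cell may be useful: seeding the random generator so that the parameter table is reproducible, and reading the parameters from CSV. The sketch below is illustrative only. The seed value, the variable names and the CSV content are made up for the example, and an in-memory string stands in for a real file.

```python
import io

import numpy as np
import pandas as pd

# Reproducible generation: the same seed always gives the same table.
rng = np.random.default_rng(seed=42)
seeded_params = pd.DataFrame(rng.integers(low=0, high=1000, size=(500, 4)),
                             columns=['a', 'b', 'c', 'd'])

# Reading parameters from CSV; pass a file path instead of StringIO for a real file.
csv_text = "a,b,c,d\n1,2,3,4\n5,6,7,8\n"
csv_params = pd.read_csv(io.StringIO(csv_text))

print(seeded_params.shape, csv_params.shape)
```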

### Definition of the calculation method


The function below simply simulates a calculation lasting between 0 and 2 seconds. We can of course adapt it, and possibly call a much longer external process. Note that it is important to return the result as a Python value, not to write it to a file!

In [12]:
# Simulate a costly calculation on one row of parameters
def my_costly_simulation(line):
    #print(line)
    import time
    import random
    time.sleep(random.random() * 2)
    return sum(line)
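When the calculation is an external process, the function can capture the process output and return it as a Python value. The sketch below is one possible shape under stated assumptions: `my_external_simulation` is a hypothetical name, and a child Python interpreter that sums its arguments stands in for the real executable.

```python
import subprocess
import sys

def my_external_simulation(line):
    """Run a child process on one parameter row and return its parsed output."""
    args = [str(int(value)) for value in line]
    # Stand-in for a real executable: a child Python process that sums its arguments.
    command = [sys.executable, "-c",
               "import sys; print(sum(int(a) for a in sys.argv[1:]))", *args]
    # check=True raises if the process fails, so errors surface in the Dask future.
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return int(completed.stdout.strip())

print(my_external_simulation([1, 2, 3, 4]))  # 10
```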

### Submission of calculation

All that remains is to submit the calculation and, if needed, wait for it to finish.  
The submission is not blocking: the calculation is carried out in the background on the Dask cluster. Do not hesitate to open the Dask Dashboard to watch the progress and, if you are in adaptive mode, the elasticity of the cluster.

In [13]:
futures = client.map(my_costly_simulation, input_params.values)

You can also look at the jobs in progress on the cluster, and watch them change, by executing the following cell several times.
Feel free to run the above simulation several times to experiment. It may be necessary to rerun the cell that defines the simulation function first; otherwise Dask may assume that it is being asked for the same calculation again and do nothing.

In [None]:
!squeue --me
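Another way to force a fresh run, offered here as a suggestion rather than as what this notebook does, is the `pure=False` argument of `client.map`. It tells Dask that the function is not deterministic, so each submission gets new task keys. The sketch uses a throwaway local cluster with thread-based workers, and `noisy_simulation` is a hypothetical stand-in for the simulation function.

```python
from dask.distributed import Client, LocalCluster

def noisy_simulation(line):
    import random
    return sum(line) + random.random()  # result differs on every call

rows = [[1, 2, 3, 4], [5, 6, 7, 8]]

with LocalCluster(n_workers=1, threads_per_worker=2,
                  processes=False, dashboard_address=None) as cluster:
    with Client(cluster) as client:
        # pure=False gives every submission its own keys, so nothing is reused.
        first = client.map(noisy_simulation, rows, pure=False)
        second = client.map(noisy_simulation, rows, pure=False)
        keys_differ = all(f.key != s.key for f, s in zip(first, second))
        first_results = client.gather(first)
        second_results = client.gather(second)

print(keys_differ)
```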

### Gathering the results

For now, the results are stored on the Dask workers. We can retrieve them with the gather method.
Here we also merge the input parameters and the results into a single table, which we can save in a tabular format: CSV, HDF5 ...

In [None]:
# Block until all results are available, then add them as a column of the input table
results = client.gather(futures)
output = input_params.copy()
output['result'] = pd.Series(results, index=output.index)
output.sample(5)
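Saving the merged table to CSV takes one call. The sketch below uses a small made-up table (`demo_output`) and an in-memory buffer so that it runs anywhere; with the real `output` dataframe, a file path would replace the buffer.

```python
import io

import pandas as pd

# Small stand-in for the merged parameters-and-results table.
demo_output = pd.DataFrame({"a": [1, 5], "b": [2, 6], "c": [3, 7],
                            "d": [4, 8], "result": [10, 26]})

buffer = io.StringIO()
demo_output.to_csv(buffer, index=False)  # index=False keeps only the data columns
buffer.seek(0)

# Reading the text back gives the same table.
reloaded = pd.read_csv(buffer)
print(reloaded.equals(demo_output))
```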

In [19]:
client.close()
cluster.close()