# Tuning adaptivity of Dask clusters

In the previous examples, we saw basic adaptivity and manual scaling. Here, we'll see how to adapt the Dask cluster so that it approximately meets a target duration. This allows switching between truly interactive work (where a user wants to see results immediately to decide on the next steps) and workloads where response time has lower priority.
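
To make the idea of a target duration concrete, here is a minimal sketch under a deliberately simplified model: the wanted number of workers is the total estimated work divided by the target duration, clamped to a minimum and a maximum. The function name `workers_for_target` and the model itself are assumptions for illustration; Dask's actual adaptive logic takes more into account (for example memory and task-duration estimates). The numbers are the ones used later in this notebook.

```python
import math


def workers_for_target(total_work_seconds, target_duration_seconds,
                       minimum, maximum):
    """Estimate a worker count under a simplified, illustrative model."""
    # Workers needed so that each one holds about one target duration of work.
    wanted = math.ceil(total_work_seconds / target_duration_seconds)
    # Respect the lower and upper bounds on the cluster size.
    return max(minimum, min(maximum, wanted))


# 1000 tasks of roughly 3 s each, target of 360 s, bounds 2..40
print(workers_for_target(1000 * 3.0, 360, minimum=2, maximum=40))  # 9
```

A shorter target duration asks for more workers for the same amount of work, which is the knob for trading resources against response time.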

## Set up a Slurm cluster

In [1]:
from dask.distributed import Client
from dask_jobqueue import SLURMCluster



In [2]:
cluster = SLURMCluster(
    cores=24,
    processes=2,
    memory="100GB",
    shebang='#!/usr/bin/env bash',
    queue="batch",
    walltime="00:30:00",
    local_directory='/tmp',
    death_timeout="15s",
    interface="ib0",
    log_directory="$SCRATCH_cecam/$USER/dask_jobqueue_logs/",
    project="ecam")
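
In dask-jobqueue, `cores` and `memory` describe one batch job, and `processes` says into how many worker processes that job is split. The following sketch (the helper name `per_worker_resources` is hypothetical) works out what each worker gets with the settings above. The result agrees with the client summary shown further down, where 40 workers report 480 cores and 2.00 TB of memory.

```python
def per_worker_resources(cores, processes, memory_gb):
    """Split one job's resources evenly across its worker processes."""
    threads_per_worker = cores // processes
    memory_per_worker_gb = memory_gb / processes
    return threads_per_worker, memory_per_worker_gb


threads, memory_gb = per_worker_resources(cores=24, processes=2, memory_gb=100)
print(threads, memory_gb)  # 12 50.0
```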

In [3]:
client = Client(cluster)
client

Client: Scheduler: tcp://10.80.32.36:43926, Dashboard: http://10.80.32.36:8787/status
Cluster: Workers: 0, Cores: 0, Memory: 0 B


## Scale the cluster

In [4]:
cluster.scale(2)

## Some artificial workload

In [5]:
from random import random
from time import sleep

In [6]:
def run_for_approx_n_seconds(n):
    n += random() / 5
    sleep(n)
    return n
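
The jitter `random() / 5` is uniform between 0 and 0.2 seconds, so a task asked to run for `n` seconds takes `n + 0.1` seconds on average. A quick check without actually sleeping (the helper `sample_duration` is a stand-in for the function above, introduced only for this illustration):

```python
from random import random, seed


def sample_duration(n):
    # Same arithmetic as run_for_approx_n_seconds, but without sleeping.
    return n + random() / 5


seed(0)  # fixed seed so the check is repeatable
samples = [sample_duration(1.0) for _ in range(10_000)]
mean = sum(samples) / len(samples)
print(f"mean duration: {mean:.3f} s")  # close to 1.1
```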

In [7]:
from dask import bag as db

In [8]:
%%time

N = 200

ns = db.from_sequence((1.0 for n in range(N)), npartitions=N)
ns = ns.map(run_for_approx_n_seconds).compute();

CPU times: user 6.99 s, sys: 517 ms, total: 7.51 s
Wall time: 1min 4s


## More detailed adaptivity

In [9]:
# Check docstring of distributed.Adaptive for keywords
ca = cluster.adapt(
    minimum=2, maximum=40,
    target_duration="360s",  # measured in CPU time per worker
    scale_factor=1);

sleep(4)  # Allow for scale-down

In [10]:
%%time

N = 1000

ns = db.from_sequence((3.0 for n in range(N)), npartitions=N)
ns = ns.map(run_for_approx_n_seconds).compute();

JobQueueCluster.scale_up was called with a number of workers lower that what is already running or pending

CPU times: user 13.4 s, sys: 1.3 s, total: 14.6 s
Wall time: 43.2 s
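
One rough way to compare runs is the effective parallelism: total task time divided by wall time. The sketch below assumes a mean task duration of `n + 0.1` seconds (from the uniform jitter) and uses the wall times reported above. The wall times include job start-up and scheduling overhead, so these numbers are only indicative.

```python
def effective_parallelism(n_tasks, mean_task_seconds, wall_seconds):
    """Average number of tasks that must have been running concurrently."""
    return n_tasks * mean_task_seconds / wall_seconds


# (number of tasks, assumed mean task duration in s, measured wall time in s)
runs = {
    "N=200, 1 s tasks, fixed size": (200, 1.1, 64.0),
    "N=1000, 3 s tasks, adaptive": (1000, 3.1, 43.2),
}

for label, args in runs.items():
    print(f"{label}: {effective_parallelism(*args):.1f}")
```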


In [11]:
sleep(4)  # allow for scale-down

In [12]:
%%time

N = 500

ns = db.from_sequence((3.0 for n in range(N)), npartitions=N)
ns = ns.map(run_for_approx_n_seconds).compute();

JobQueueCluster.scale_up was called with a number of workers lower that what is already running or pending

CPU times: user 10.5 s, sys: 865 ms, total: 11.4 s
Wall time: 1min 7s


In [13]:
sleep(4)  # allow for scale-down

In [14]:
%%time

N = 300

ns = db.from_sequence((3.0 for n in range(N)), npartitions=N)
ns = ns.map(run_for_approx_n_seconds).compute();

CPU times: user 3.32 s, sys: 280 ms, total: 3.6 s
Wall time: 16.3 s


In [15]:
sleep(4)  # allow for scale-down

In [16]:
%%time

N = 5000

ns = db.from_sequence((1.0 for n in range(N)), npartitions=N // 5)
ns = ns.map(run_for_approx_n_seconds).compute();

JobQueueCluster.scale_up was called with a number of workers lower that what is already running or pending

CPU times: user 14.8 s, sys: 1.33 s, total: 16.1 s
Wall time: 51.1 s


In [17]:
from dask import array as da

In [18]:
%%time

# Array sizes and chunk sizes must be integers, so convert the float expressions explicitly
print((da.random.normal(size=(int(5000e9 / 8),), chunks=(int(500e6 / 8),)) ** 2).mean().compute())

1.0000013221436481
CPU times: user 1min 2s, sys: 5.48 s, total: 1min 8s
Wall time: 1min 36s
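
The sizes in the cell above are written as bytes divided by 8, which suggests 8-byte `float64` values. Under that assumption, this sketch (the helper name `chunk_layout` is hypothetical) shows how many elements and chunks the array has: 5 TB of data in 10000 chunks of 500 MB each.

```python
BYTES_PER_FLOAT64 = 8  # assumption: the array holds float64 values


def chunk_layout(total_bytes, chunk_bytes, itemsize=BYTES_PER_FLOAT64):
    """Return (elements, elements per chunk, number of chunks)."""
    n_elements = int(total_bytes / itemsize)
    chunk_elements = int(chunk_bytes / itemsize)
    # Ceiling division: a final partial chunk still counts as a chunk.
    n_chunks = -(-n_elements // chunk_elements)
    return n_elements, chunk_elements, n_chunks


print(chunk_layout(5000e9, 500e6))  # (625000000000, 62500000, 10000)
```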


In [19]:
client

Client: Scheduler: tcp://10.80.32.36:43926, Dashboard: http://10.80.32.36:8787/status
Cluster: Workers: 40, Cores: 480, Memory: 2.00 TB


In [20]:
# Check docstring of distributed.Adaptive for keywords
ca = cluster.adapt(
    minimum=2, maximum=60,
    target_duration="360s",  # measured in CPU time per worker
    scale_factor=2);

sleep(4)  # Allow for scale-down

In [21]:
x = da.random.normal(size=(int(200e9 / 8),), chunks=(int(200e6 / 8),))
x = x.persist()
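
Because `persist` keeps the whole array in the memory of the workers, it is worth checking how many workers that needs at the very least. A rough lower bound, assuming 50 GB per worker (100 GB per job split over 2 processes) and ignoring any overhead or spilling to disk; the helper name `min_workers_for_memory` is hypothetical:

```python
import math


def min_workers_for_memory(total_bytes, worker_memory_bytes):
    """Smallest worker count whose combined memory can hold the data."""
    return math.ceil(total_bytes / worker_memory_bytes)


# 200 GB of persisted data, 50 GB of memory per worker
print(min_workers_for_memory(200e9, 50e9))  # 4
```

This is above the adaptive `minimum=2`, so the cluster has to grow before the array can be held in memory.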

In [22]:
from dask.distributed import wait

In [23]:
wait(x);

JobQueueCluster.scale_up was called with a number of workers lower that what is already running or pending

In [24]:
%%time

print(x.mean().compute())

KeyboardInterrupt: 

## Complete listing of software used here

In [25]:
%pip list

Package            Version          
------------------ -----------------
asciitree          0.3.3            
aspy.yaml          1.2.0            
backcall           0.1.0            
bokeh              1.1.0            
certifi            2019.3.9         
cfgv               1.6.0            
cftime             1.0.3.4          
Click              7.0              
cloudpickle        1.0.0            
cycler             0.10.0           
cytoolz            0.9.0.1          
dask               1.2.0            
dask-jobqueue      0.4.1+32.g9c3371d
decorator          4.4.0            
dist

In [26]:
%conda list --explicit

# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT


https://conda.anaconda.org/conda-forge/linux-64/git-lfs-2.7.2-0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2019.3.9-hecc5488_0.tar.bz2
https://repo.anaconda.com/pkgs/main/linux-64/libgcc-ng-8.2.0-hdf63c60_1.tar.bz2
https://repo.anaconda.com/pkgs/main/linux-64/libgfortran-ng-7.3.0-hdf63c60_0.tar.bz2
https://repo.anaconda.com/pkgs/main/linux-64/libstdcxx-ng-8.2.0-hdf63c60_1.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/bzip2-1.0.6-h14c3975_1002.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/expat-2.2.5-hf484d3e_1002.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/icu-58.2-hf484d3e_1000.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/jpeg-9c-h14c3975_1001.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/libffi-3.2.1-he1b5a44_1006.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/libiconv-1.15-h516909a_1005.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/libsodium-1.0.16-h14c3975_1001.tar.bz2
https://conda