Exercises - Dask Delayed
========================

**Author:** Steffen Schober



## Motivation



We start with a simple example:



In [1]:
from time import sleep

def inc(x):
    sleep(1)
    return x + 1

def add(x, y):
    sleep(1)
    return x + y

In [2]:
%%time
# This takes three seconds to run because we call each
# function sequentially, one after the other
x = inc(1)
y = inc(2)
z = add(x, y)

CPU times: user 214 µs, sys: 917 µs, total: 1.13 ms
Wall time: 3 s


The running time could clearly be improved by running `inc(1)` and `inc(2)` in parallel,
since neither call depends on the other.
Let's implement this with Dask.



## Dask delayed



First some imports.



In [3]:
import dask
from dask import delayed

### First example



To make a function lazy, we wrap the Python function with `dask.delayed`:



In [4]:
%%time

x = delayed(inc)(1)
y = delayed(inc)(2)
z = delayed(add)(x, y)

CPU times: user 444 µs, sys: 0 ns, total: 444 µs
Wall time: 369 µs


Note that no computation has been performed so far; only
the task graph has been created. The following requires `graphviz` (or another supported rendering engine) to be installed:



In [5]:
z.visualize()

CytoscapeWidget(cytoscape_layout={'name': 'dagre', 'rankDir': 'BT', 'nodeSep': 10, 'edgeSep': 10, 'spacingFact…
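
To make the laziness visible, here is a minimal sketch. The helper `traced_inc` and the list `calls` are hypothetical names introduced only for this illustration; they record when the wrapped function really runs.

```python
from dask import delayed

calls = []  # records every real invocation of traced_inc


def traced_inc(x):
    # Hypothetical helper: behaves like inc (without the sleep),
    # but logs that it was actually executed
    calls.append(x)
    return x + 1


lazy = delayed(traced_inc)(1)    # builds a graph node, runs nothing

print(type(lazy).__name__)       # Delayed
calls_before = list(calls)       # still empty: traced_inc has not run
result = lazy.compute()          # now the function is executed
calls_after = list(calls)

print(calls_before, result, calls_after)   # [] 2 [1]
```

The function body only runs once `compute` is called; until then the `Delayed` object is just a description of the work.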

To trigger the computation we call the method `compute`:



In [6]:
%%time
# This actually runs our computation using a local thread pool
z.compute()

CPU times: user 147 ms, sys: 11.3 ms, total: 158 ms
Wall time: 2.16 s


5
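
The scheduler can also be chosen explicitly through the `scheduler` keyword of `compute`. The sketch below compares the single-threaded `"synchronous"` scheduler with the `"threads"` scheduler. The functions `slow_inc` and `slow_add` are hypothetical stand-ins for `inc` and `add` with a shorter sleep, and the measured times will vary from machine to machine.

```python
from time import perf_counter, sleep

from dask import delayed


def slow_inc(x):
    # stand-in for inc with a shorter sleep (assumption for this sketch)
    sleep(0.2)
    return x + 1


def slow_add(x, y):
    # stand-in for add with a shorter sleep (assumption for this sketch)
    sleep(0.2)
    return x + y


z = delayed(slow_add)(delayed(slow_inc)(1), delayed(slow_inc)(2))

results = {}
timings = {}
for scheduler in ("synchronous", "threads"):
    start = perf_counter()
    results[scheduler] = z.compute(scheduler=scheduler)
    timings[scheduler] = perf_counter() - start
    print(scheduler, results[scheduler], round(timings[scheduler], 2))
```

Both schedulers return the same value. With more than one worker thread available, the threaded run is expected to be about one sleep shorter, because the two `slow_inc` calls can overlap.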

### Second example



Here is another example, using `delayed` as a decorator:



In [7]:
from numpy import random
import numpy as np

@delayed
def func1(x):
    # processing item x takes a random time between 0 and 1 second
    duration = random.rand()
    sleep(duration)
    # return the processed value
    return 2*x

Before you execute the next cell, make a guess for the processing time:



In [8]:
%%time
[func1(i).compute() for i in range(10)]

CPU times: user 8.79 ms, sys: 0 ns, total: 8.79 ms
Wall time: 5.43 s


[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

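Maybe not what you expected… Each call to `.compute()` blocks until its single task has finished, so the ten tasks run one after the other.

Here is how to trigger the tasks in parallel: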



In [9]:
%%time
dask.compute(*[func1(i) for i in range(10)])

CPU times: user 4.57 ms, sys: 2.83 ms, total: 7.4 ms
Wall time: 741 ms


(0, 2, 4, 6, 8, 10, 12, 14, 16, 18)

This should be much faster, because all ten tasks are handed to the scheduler at once.
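
The difference between the two calling styles can be reproduced with fixed sleep times. This is a small sketch, not a benchmark: `slow_double` is a hypothetical variant of `func1` with a constant 0.2 s delay, and `num_workers=4` is an assumption so that all four tasks can overlap.

```python
from time import perf_counter, sleep

import dask
from dask import delayed


@delayed
def slow_double(x):
    # hypothetical variant of func1 with a fixed delay
    sleep(0.2)
    return 2 * x


# One compute() per task: each call blocks until its task is finished
start = perf_counter()
one_by_one = [slow_double(i).compute() for i in range(4)]
t_one_by_one = perf_counter() - start

# One compute() for all tasks: the scheduler sees the whole batch at once
start = perf_counter()
together = dask.compute(
    *[slow_double(i) for i in range(4)],
    scheduler="threads",
    num_workers=4,
)
t_together = perf_counter() - start

print(one_by_one, round(t_one_by_one, 2))
print(together, round(t_together, 2))
```

The first variant needs at least four sleeps in a row, while the second can overlap them. Note that `dask.compute` returns a tuple, whereas the list comprehension returns a list.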



## Tasks



### Parallelizing a for-loop



In the example below we iterate through a list of inputs. If the input is even, we want to call `double`; if it is odd, we want to call `inc`. This `is_even` decision has to be made immediately (not lazily) so that our graph-building Python code can proceed.



In [10]:
def double(x):
    sleep(1)
    return 2 * x

def is_even(x):
    return not x % 2

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [14]:
%%time
# Sequential code

results = []
for x in data:
    if is_even(x):
        y = double(x)
    else:
        y = inc(x)
    results.append(y)

total = sum(results)
print(total)

90
CPU times: user 1.62 ms, sys: 536 µs, total: 2.16 ms
Wall time: 10 s


**Task**: parallelize the sequential code above using `dask.delayed`.
You will need to delay some functions, but not all.



In [17]:
%%time

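def process(x):
    if is_even(x):
        # NOTE: double(x) is called directly, so it runs eagerly (and
        # sequentially) while the graph is being built; use
        # delayed(double)(x) to defer it like the inc branch
        return double(x)
    else:
        return delayed(inc)(x)

# even inputs are already plain numbers here, odd inputs are Delayed objects
delayed_results = [process(x) for x in data]
total = delayed(sum)(delayed_results)

# dask.compute returns a tuple, hence result[0]
result = dask.compute(total)
print("total:", result[0])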

total: 90
CPU times: user 2.33 ms, sys: 1.33 ms, total: 3.66 ms
Wall time: 6 s
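
The 6 s wall time above is consistent with `double` being called eagerly for the five even inputs (about 5 s while the graph is built), followed by the `inc` tasks running in parallel. A sketch of a version that delays both branches is shown below. It redefines `inc` and `double` with a shorter, hypothetical `PAUSE` so that it runs quickly; `is_even` stays undelayed because its result is needed while the graph is being built.

```python
from time import sleep

from dask import delayed

PAUSE = 0.1  # hypothetical shortened sleep; the notebook uses 1 second


def inc(x):
    sleep(PAUSE)
    return x + 1


def double(x):
    sleep(PAUSE)
    return 2 * x


def is_even(x):
    return not x % 2


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

results = []
for x in data:
    # The branch is decided immediately; only the slow calls are deferred
    if is_even(x):
        y = delayed(double)(x)
    else:
        y = delayed(inc)(x)
    results.append(y)

total = delayed(sum)(results)
total_value = total.compute()
print("total:", total_value)   # total: 90
```

With the original one-second sleeps and enough worker threads, all ten calls can overlap, so the wall time should be much closer to one second than to six.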


In [20]:
%%time

import dask.bag as db

# Create a Dask bag from the list
bag = db.from_sequence(data)

# Use Dask's map function to apply the functions in parallel
bag = bag.map(lambda x: double(x) if is_even(x) else inc(x))

# Use Dask's sum function to calculate the total
total = bag.sum().compute()

print(total)

90
CPU times: user 19.3 ms, sys: 5.44 ms, total: 24.7 ms
Wall time: 6.22 s
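
For a bag, the partitions are the units of parallel work, so it can help to control them explicitly. The following sketch makes two assumptions that are not part of the cell above: it passes `npartitions=len(data)` so that every element becomes its own partition, and it selects the threaded scheduler with `scheduler="threads"`. The sleeps are left out to keep it quick.

```python
import dask.bag as db


def inc(x):
    return x + 1


def double(x):
    return 2 * x


def is_even(x):
    return not x % 2


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# One element per partition, so each element is a separate task
bag = db.from_sequence(data, npartitions=len(data))
n_parts = bag.npartitions
print(n_parts)   # 10

mapped = bag.map(lambda x: double(x) if is_even(x) else inc(x))
total = mapped.sum().compute(scheduler="threads")
print(total)     # 90
```

How much speed-up this gives with the sleeping versions of `inc` and `double` depends on the scheduler and the number of workers available.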


### Reading data



We start by preparing some data.

1.  Make sure that `03_prep.py` is in the same directory as this notebook.
2.  Create a directory `data` and run the following cell:



In [22]:
%run 03_prep.py -d accounts

In [23]:
import pandas as pd

import os
filenames = [os.path.join('data', 'accounts.%d.csv' % i) for i in [0, 1, 2]]
filenames

['data/accounts.0.csv', 'data/accounts.1.csv', 'data/accounts.2.csv']

In [25]:
%%time

# normal, sequential code
a = pd.read_csv(filenames[0])
b = pd.read_csv(filenames[1])
c = pd.read_csv(filenames[2])

na = len(a)
nb = len(b)
nc = len(c)

total = sum([na, nb, nc])
print("total:", total)

total: 3000000
CPU times: user 452 ms, sys: 17.1 ms, total: 469 ms
Wall time: 471 ms


**Task**: Recreate this computation as a task graph by applying the `delayed` function to the original Python code.
The three functions you want to delay are `pd.read_csv`, `len` and `sum`.



In [34]:
%%time

# Use Dask delayed for parallel execution
@delayed
def read_and_compute_length(filename):
    df = pd.read_csv(filename)
    return len(df)

# Use Dask's delayed functions to load and compute the lengths in parallel
a = read_and_compute_length(filenames[0])
b = read_and_compute_length(filenames[1])
c = read_and_compute_length(filenames[2])

total = delayed(sum)([a, b, c])

# Perform the computation
total_result = total.compute()
print("total:", total_result)

total: 3000000
CPU times: user 509 ms, sys: 31.7 ms, total: 541 ms
Wall time: 255 ms
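
The solution above wraps reading and counting in a single delayed function. As an alternative sketch that follows the task statement literally, `pd.read_csv`, `len` and `sum` can each be delayed on their own. To keep the example self-contained it writes three tiny CSV files into a temporary directory; the column names `id` and `amount` and the row counts are made up for this illustration and do not reflect the real `accounts` data.

```python
import os
import tempfile

import pandas as pd
from dask import delayed

# Create three small stand-in files (hypothetical content)
tmpdir = tempfile.mkdtemp()
filenames = []
for i, n_rows in enumerate([3, 5, 7]):
    path = os.path.join(tmpdir, "accounts.%d.csv" % i)
    frame = pd.DataFrame({"id": range(n_rows), "amount": range(n_rows)})
    frame.to_csv(path, index=False)
    filenames.append(path)

# Delay each of the three functions separately
frames = [delayed(pd.read_csv)(fn) for fn in filenames]
lengths = [delayed(len)(df) for df in frames]   # len(df) on a Delayed would fail
total = delayed(sum)(lengths)

total_rows = total.compute()
print("total:", total_rows)   # total: 15
```

This produces a graph with one `read_csv` node and one `len` node per file, all feeding into a single `sum` node.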
