# Parallelize code with Dask

### Why parallelize your code? 
Nowadays even a simple desktop computer has more than one CPU core (typically 4 to 8).
Multiple cores allow the computer to execute more than one instruction at a time. This concept is called *parallelization*.
Parallelizing the instructions lets us work faster and exploit the full power of modern computers.

Parallelization is also the concept behind clusters and distributed computation.
Parallelizing an application's source code is the first step towards handling long or complex computations whose parts can be executed concurrently; clustering and distributed computation are the next steps, needed to deal with large amounts of data (*Big Data*) and with even more complex computations.

### Basics

First let's make some simple functions, *increment* and *add*, that sleep for a while to simulate work. We'll then time how long these functions take when run normally.


In [None]:
from time import sleep


def increment(x):
    """
    take a number x and return x+1
    sleep for 1s
    """
    sleep(1)
    return x + 1

def add(x, y):
    """
    add y to x and return the result
    sleep for 1s
    """
    sleep(1)
    return x + y

Now let's run a simple snippet that uses those functions and see how long it takes.

In [None]:
%%time

x = increment(1)
y = increment(2)
z = add(x, y)

This takes three seconds to run because we call each function sequentially: x is computed first, y second and z third. Each instruction is executed only after the previous one has finished.
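Outside a notebook, where the `%%time` magic is not available, the same measurement can be sketched with `time.perf_counter`. This is only an illustration: the `DELAY` constant and the `*_fast` function names are made up for this sketch, and the sleep is shortened to 0.1 s so that it finishes quickly.

```python
from time import perf_counter, sleep

DELAY = 0.1  # assumption: shortened from the 1 s used in the cells above


def increment_fast(x):
    """Return x + 1 after a short simulated delay."""
    sleep(DELAY)
    return x + 1


def add_fast(x, y):
    """Return x + y after a short simulated delay."""
    sleep(DELAY)
    return x + y


start = perf_counter()
x = increment_fast(1)       # runs first
y = increment_fast(2)       # starts only after x is done
z = add_fast(x, y)          # starts only after y is done
elapsed = perf_counter() - start

print(f"z = {z}, elapsed = {elapsed:.2f}s")
```

The elapsed time should come out close to three times `DELAY`, mirroring the three seconds measured in the notebook.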

### Parallelize the computation

In this step we parallelize the execution of the *increment* function. We do this by wrapping the function with *dask.delayed* before invoking it. This lets us describe the computation to be parallelized without actually executing it.

The invocation returns a *delayed object*, which is a placeholder for the original computation. Until the *z.compute()* instruction is reached, the code runs almost instantaneously because no real work is done.
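The idea of a placeholder can be illustrated with a small toy class. This is a sketch of the concept only, not how Dask implements it: `LazyCall`, `inc` and `plus` are hypothetical names invented here, and unlike Dask this toy evaluates its inputs one after another rather than in parallel.

```python
class LazyCall:
    """Toy placeholder that records a function call without running it."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve arguments that are themselves placeholders, then call.
        resolved = [a.compute() if isinstance(a, LazyCall) else a
                    for a in self.args]
        return self.func(*resolved)


calls = []  # records which functions have actually run


def inc(x):
    calls.append("inc")
    return x + 1


def plus(x, y):
    calls.append("plus")
    return x + y


a = LazyCall(inc, 1)
b = LazyCall(inc, 2)
c = LazyCall(plus, a, b)

print(calls)             # still empty: building placeholders runs nothing
result = c.compute()     # the recorded calls run only now
print(result, calls)
```

Building `a`, `b` and `c` only records what should happen; the functions are invoked when `compute()` is called on the final placeholder.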

Let's try:


In [None]:
%%time

# delayed takes the function to be run lazily as its argument and returns a wrapped version of it.
# calling the wrapped function with the original arguments records the call instead of executing it.
from dask import delayed

x = delayed(increment)(1)
y = delayed(increment)(2)
z = delayed(add)(x, y)

In [None]:
%%time

z.compute() 

As we can see, the total execution time of the parallelized code is about 2 seconds, while the non-parallelized code took about 3 seconds.

The parallelized version is 1 second faster because some instructions were executed concurrently. Let's see what happened.

In [None]:
z.visualize()

As we can see from the computation graph, the two invocations of the *increment* function are independent of each other, so they were computed in parallel; only *add* had to wait for both results.
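The same dependency structure can be written out by hand with the standard library's `concurrent.futures`. This is a swapped-in technique used purely for illustration, not Dask's API; the `slow_*` names and the shortened `DELAY` are assumptions made for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter, sleep

DELAY = 0.2  # assumption: shortened from the 1 s used in the notebook


def slow_increment(x):
    sleep(DELAY)
    return x + 1


def slow_add(x, y):
    sleep(DELAY)
    return x + y


start = perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    # The two increments do not depend on each other, so both are
    # submitted at once and run concurrently on separate threads.
    fx = pool.submit(slow_increment, 1)
    fy = pool.submit(slow_increment, 2)
    # The addition needs both results, so it waits for them first.
    answer = slow_add(fx.result(), fy.result())
elapsed = perf_counter() - start

print(f"answer = {answer}, elapsed = {elapsed:.2f}s")
```

The elapsed time should be close to two delays rather than three, which is the same saving observed with *dask.delayed*. With Dask the graph is built for us, so we do not have to decide by hand which calls can overlap.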

### Exercise 1: Parallelize a for loop

Let's try to parallelize a for loop.

The exercise consists of reading some numbers from a source, incrementing them, and putting them in a new list. At the end, all the new numbers must be summed.


In [None]:
data = [1, 2, 3, 4, 5, 6, 7, 8]


Let's look at the non-parallelized code:

In [None]:
%%time
# Sequential code

results = []
for x in data:
    y = increment(x)
    results.append(y)
    
result = sum(results)
print("After computing :", result)

Here is the partially written parallelized code; fill in the blanks yourself.


In [None]:
%%time

results = []

for x in data:
    y = # put your code here.
    results.append(y)
    
total = #put your code here.
result = total.compute()
print("After computing :", result)  # After it's computed

### Exercise 2: Parallelize a for loop with control flow


This exercise is quite similar to the first one.

The exercise consists of reading some numbers from a source and checking whether each number is even or odd with the ```is_even``` function. If the number is even, it must be doubled with the ```double``` function; otherwise it must be incremented with the ```increment``` function. Once the number has been processed, it must be put in a new list. At the end, all the new numbers must be summed.


In [None]:
def double(x):
    sleep(1)
    return 2 * x

def is_even(x):
    return not x % 2

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


Let's look at the non-parallelized code:

In [None]:
%%time
# Sequential code

results = []
for x in data:
    if is_even(x):
        y = double(x)
    else:
        y = increment(x)
    results.append(y)
    
result = sum(results)
print("After computing :", result)

In [None]:
results = []
for x in data:
    if is_even(x):  # even
        y =  # your code goes here 
    else:          # odd
        y =  # your code goes here 
    results.append(y)
    
total = #put your code here.
result = total.compute()
print("After computing :", result)