# Code Coffee
## Joblib Python Package
**8 Nov 2017**

From the joblib documentation ([found here](https://pythonhosted.org/joblib/index.html)):

Joblib is a set of tools to provide lightweight pipelining in Python. In particular, joblib offers:

   1. transparent disk-caching of the output values and lazy re-evaluation (memoize pattern)
   2. easy simple parallel computing
   3. logging and tracing of the execution
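
As a minimal sketch of the first item (disk-caching), the example below wraps a slow function with `joblib.Memory`. The function `slow_square`, the `calls` list, and the temporary cache directory are hypothetical names chosen for illustration. The `location` argument is the spelling used by recent joblib releases (older releases called it `cachedir`), so treat this as an assumption about your installed version.

```python
import tempfile
import time

from joblib import Memory

# Hypothetical cache location: a throwaway temporary directory
cache_dir = tempfile.mkdtemp()
memory = Memory(location=cache_dir, verbose=0)

calls = []  # records every time the function body actually runs


@memory.cache
def slow_square(x):
    calls.append(x)
    time.sleep(0.1)  # stand-in for an expensive computation
    return x ** 2


first = slow_square(4)   # computed, then written to the disk cache
second = slow_square(4)  # same argument: read back from the cache
third = slow_square(5)   # new argument: computed again

print(first, second, third)
print("function body ran", len(calls), "times")
```

The cached result is looked up by the function's arguments, so only calls with new arguments pay the full cost.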


### Embarrassingly Parallel Pipelining

Joblib provides an easy way to turn a simple for loop into a parallel loop. This lets you make full use of the computing power available to you. If you run this code on a quad-core machine, you'll get slightly less than a 4x speedup. If you run it on an Ocelote interactive node, you'll get about a 28x speedup, almost for free and with very little coding.

_Note that this only works for embarrassingly parallel jobs, meaning that individual steps in the loop can't depend on any other step in the loop._

If your loop iterations are very cheap, joblib may not be for you. Creating and dispatching the parallel jobs carries overhead, so to maximize efficiency make sure each job runs for much longer than that overhead. (Typically, if each iteration in a loop takes a few seconds, you'll get a performance increase with joblib.)
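
To see the overhead for yourself, here is a rough sketch that times a trivially cheap loop both ways. The function `tiny_task` and the loop size are made up for illustration, and `backend="threading"` is used only so the sketch starts quickly; process-based backends add process start-up and data transfer costs on top. Timings vary by machine, so none are shown.

```python
import time

from joblib import Parallel, delayed


def tiny_task(i):
    # Far too cheap to be worth handing to a worker
    return i * 2


n = 2000

start = time.perf_counter()
serial = [tiny_task(i) for i in range(n)]
t_serial = time.perf_counter() - start

start = time.perf_counter()
parallel = Parallel(n_jobs=2, backend="threading")(
    delayed(tiny_task)(i) for i in range(n)
)
t_parallel = time.perf_counter() - start

print(f"serial:   {t_serial:.4f} s")
print(f"parallel: {t_parallel:.4f} s")
print("same results:", serial == parallel)
```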

**The components of a Parallel job**

`output = Parallel(n_jobs=num_cores, backend="multiprocessing")(delayed(mydef)(i, j, k) for i in range(10))`

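* `output` is a list of the values returned by your function, in the same order as the inputs.
* `n_jobs` is the maximum number of jobs you want executing at a time; `n_jobs=-1` uses all available cores.
* `backend` selects how the jobs run: "multiprocessing" runs them in separate worker processes (suited to CPU-bound Python code), while "threading" runs them in threads inside a single process (suited to I/O-bound work or code that releases the GIL, as many NumPy operations do).
* `delayed` wraps your function so that the call is recorded instead of being executed immediately; `Parallel` then runs the recorded calls on its workers.
* `mydef` is the name of the function (definition) you've created.
* `(i,j,k)` are the arguments you pass into your function.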
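
The pieces listed above can be put together in a small runnable sketch. The body of `mydef` and the fixed values for `j` and `k` are invented for illustration. Printing the object returned by `delayed(mydef)(...)` shows that nothing has been computed yet; its exact layout is a joblib implementation detail, so don't rely on it in real code.

```python
from joblib import Parallel, delayed


def mydef(i, j, k):
    return i * j + k


# delayed(mydef)(...) does not call mydef; it only records the call
task = delayed(mydef)(2, 10, 1)
print(task)  # a (function, args, kwargs) tuple

# Parallel executes the recorded calls and returns results in input order
output = Parallel(n_jobs=2, backend="threading")(
    delayed(mydef)(i, 10, 1) for i in range(5)
)
print(output)  # [1, 11, 21, 31, 41]
```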


In [1]:
from joblib import Parallel, delayed
import multiprocessing
import numpy as np
import time

In [2]:
# As an example, here's a loop:

output=[]
for i in range(10):
    output.append(i**2.)
    
print(output)

# We can convert this into a definition
def mydef(i):
    var=i**2.
    return var

# And then we can pass this to joblib Parallel
num_cores = multiprocessing.cpu_count() # use one job per core, so the code adapts to whatever machine it runs on
output = Parallel(n_jobs=num_cores)( delayed(mydef)(i) for i in range(10) )
print(output)

[0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0, 81.0]
[0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0, 81.0]


In [3]:
# Note that output is a list of all of our returned results. If the function returns several values,
# each element of the list is a tuple, so you may need to post-process it.
def mydef(i):
    var=i**2.
    return var,i

num_cores = multiprocessing.cpu_count() 
output = Parallel(n_jobs=num_cores)( delayed(mydef)(i) for i in range(10) )
print(output)

[(0.0, 0), (1.0, 1), (4.0, 2), (9.0, 3), (16.0, 4), (25.0, 5), (36.0, 6), (49.0, 7), (64.0, 8), (81.0, 9)]
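
One way to do that post-processing is to transpose the list of tuples with `zip(*output)`, giving one sequence per returned value. This is a small sketch on a shorter range; `backend="threading"` and `n_jobs=2` are arbitrary choices to keep it light.

```python
from joblib import Parallel, delayed


def mydef(i):
    var = i**2.
    return var, i


output = Parallel(n_jobs=2, backend="threading")(
    delayed(mydef)(i) for i in range(5)
)

# zip(*output) turns a list of (var, i) pairs into one tuple per return value
squares, indices = zip(*output)
print(squares)  # (0.0, 1.0, 4.0, 9.0, 16.0)
print(indices)  # (0, 1, 2, 3, 4)
```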


The above problem doesn't show any speedup (and indeed, may show a slowdown due to the overhead). Now let's look at an example that shows the benefits of parallel computing.

In [4]:
A = np.random.randint(0,high=10,size=(2000,2000))
B = np.random.randint(0,high=10,size=(2000,2000))

Let's take the dot product (here, the matrix product of A and B) serially.

In [11]:
start=time.time()
np.dot(A,B)
end=time.time()
print("The serial job took",round(end-start,3),"seconds")
t1=end-start

The serial job took 30.158 seconds


That took a bit. Let's chunk our job and apply a parallel loop.

In [10]:
# Make this into a definition
def compute(a,b):
    return np.dot(a,b)

start=time.time()
num_cores = multiprocessing.cpu_count() 
print("We have",num_cores,"cores")
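# np.array_split (unlike np.split) also works when the rows of A don't divide evenly into chunks
chunks = np.array_split(A,num_cores*2)
output = Parallel(n_jobs=num_cores*2,backend="threading")( delayed(compute)(chunk,B) for chunk in chunks )
output=np.vstack(output) # stack the row blocks back into one matrix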
end=time.time()
print("The parallel job took",round(end-start,3),"seconds")
t2=end-start
print("We got a factor of ",round(t1/t2,3),"speedup")

We have 4 cores
The parallel job took 23.671 seconds
We got a factor of  1.298 speedup


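As you can see, the parallel version ran faster: about a 1.3x speedup on 4 cores. That is well short of the 4x ceiling, so it is worth timing your own problem before assuming a large gain.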
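
When you chunk a computation like this, it is worth checking that the stitched-together result matches the serial one. Below is a small sketch of that check on deliberately tiny matrices. The shapes, seed, and chunk count are arbitrary, and the row count is chosen so that it does not divide evenly, which is why `np.array_split` is used.

```python
import numpy as np
from joblib import Parallel, delayed

rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(203, 50))
B = rng.integers(0, 10, size=(50, 40))


def compute(a, b):
    return np.dot(a, b)


n_chunks = 8
# 203 rows do not divide evenly by 8; array_split allows uneven chunks
chunks = np.array_split(A, n_chunks)
output = Parallel(n_jobs=4, backend="threading")(
    delayed(compute)(chunk, B) for chunk in chunks
)
result = np.vstack(output)  # stack the row blocks in their original order

print(result.shape)                          # (203, 40)
print(np.array_equal(result, np.dot(A, B)))  # True
```

Because each chunk is a block of rows of `A`, multiplying every block by the full `B` and stacking the blocks reproduces the full product.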

Ideas for using joblib parallel processes:
* Making plots for a movie
* Reading many small files
* Doing computations on large arrays
* Running many methods that take 5 minutes each.
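
As a sketch of the "reading many small files" idea, the example below writes a handful of throwaway files and then reads them in parallel. The file names, their contents, and the helper `read_total` are invented for illustration. The threading backend is a reasonable guess for I/O-bound work like this, but measure before relying on it.

```python
import tempfile
from pathlib import Path

from joblib import Parallel, delayed

# Create a few small files in a throwaway directory
tmp_dir = Path(tempfile.mkdtemp())
paths = []
for n in range(20):
    path = tmp_dir / f"data_{n:02d}.txt"
    path.write_text(f"{n}\n{n + 1}\n")
    paths.append(path)


def read_total(path):
    # Read one file and sum the integers it contains
    return sum(int(line) for line in path.read_text().split())


totals = Parallel(n_jobs=4, backend="threading")(
    delayed(read_total)(p) for p in paths
)
print(totals[:3])  # [1, 3, 5]
```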