## Optimizing Python

#### For any numerical analysis, it is best to use the highly optimized numpy library. 

In [36]:
import numpy as np

### Let's generate a toy dataset of random number vectors

In [37]:
num_vector = 100
num_measures = 100

data = np.random.uniform(size=(num_vector,num_measures))
print(type(data[0][0]))
print(data.shape)

<class 'numpy.float64'>
(100, 100)


#### Here we define a silly all-by-all distance metric:

In [38]:
def pairwise_distance(X):

    num_vectors = X.shape[0]
    num_measurements = X.shape[1] 
    D = np.empty((num_vectors, num_vectors), dtype=np.float64)
    
    for i in range(num_vectors):
        for j in range(num_vectors):
            d = np.square( np.subtract(X[i], X[j]) )
            D[i, j] = np.sqrt(np.sum(d))
    return(D)

For the 100\*100 matrix, calculating pairwise distances this way means 100\*\*2\*100 = 1000000 elementwise calculations.
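As a point of comparison with the double loop above, the same distance matrix can be computed in a single NumPy expression using broadcasting. This is only a sketch: the name `pairwise_distance_broadcast` is made up for this illustration, and it is not timed here.

```python
import numpy as np

def pairwise_distance_broadcast(X):
    """Euclidean distance between every pair of rows of X, without Python loops."""
    # X[:, None, :] has shape (n, 1, m) and X[None, :, :] has shape (1, n, m);
    # subtracting them broadcasts to an (n, n, m) array of differences.
    diff = X[:, None, :] - X[None, :, :]
    # Sum the squared differences over the last axis, then take the square root.
    return np.sqrt(np.sum(np.square(diff), axis=-1))

rng = np.random.default_rng(0)
small = rng.uniform(size=(5, 3))
D = pairwise_distance_broadcast(small)
print(D.shape)
print(np.allclose(D, D.T))  # the distance matrix is symmetric
```

The trade-off is memory: the intermediate array holds n\*n\*m values, which is 1000000 floats (about 8 MB) for the 100\*100 dataset and grows quickly with the number of vectors.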

### Pure Python (and numpy)

In [39]:
%%time
result = pairwise_distance(data)

CPU times: user 69.6 ms, sys: 63 µs, total: 69.7 ms
Wall time: 68.7 ms


In [40]:
print(np.asarray(result))

[[0.         4.76827164 4.46169428 ... 4.1498546  3.70344501 3.56266852]
 [4.76827164 0.         4.37844496 ... 4.08085492 4.62398251 4.24708855]
 [4.46169428 4.37844496 0.         ... 3.7878381  3.96606605 4.00293712]
 ...
 [4.1498546  4.08085492 3.7878381  ... 0.         4.31309973 3.88091059]
 [3.70344501 4.62398251 3.96606605 ... 4.31309973 0.         3.67060857]
 [3.56266852 4.24708855 4.00293712 ... 3.88091059 3.67060857 0.        ]]


### Cython

In [41]:
%load_ext cython

The cython extension is already loaded. To reload it, use:
  %reload_ext cython


In [42]:
%%cython
import numpy as np
cimport numpy as np
cimport cython

def pairwise_distance_cython(double[:, ::1] X):
    
    cdef int num_vectors = X.shape[0]
    cdef int num_measurements = X.shape[1]
    cdef double[:, ::1] D = np.empty((X.shape[0], X.shape[0]), dtype=np.float64)
    
    for i in range(num_vectors):
        for j in range(num_vectors):
            D[i, j] = np.sqrt ( np.sum( np.square( np.subtract(X[i], X[j]) ) ) )
    return(D)

In [43]:
%%time
result = pairwise_distance_cython(data)

CPU times: user 110 ms, sys: 4.1 ms, total: 114 ms
Wall time: 113 ms


In [44]:
print(np.asarray(result))

[[0.         4.76827164 4.46169428 ... 4.1498546  3.70344501 3.56266852]
 [4.76827164 0.         4.37844496 ... 4.08085492 4.62398251 4.24708855]
 [4.46169428 4.37844496 0.         ... 3.7878381  3.96606605 4.00293712]
 ...
 [4.1498546  4.08085492 3.7878381  ... 0.         4.31309973 3.88091059]
 [3.70344501 4.62398251 3.96606605 ... 4.31309973 0.         3.67060857]
 [3.56266852 4.24708855 4.00293712 ... 3.88091059 3.67060857 0.        ]]


### Numba

In [45]:
from numba import jit

In [46]:
@jit
def pairwise_distance_numba(X):

    num_vectors = X.shape[0]
    num_measurements = X.shape[1] 
    D = np.empty((num_vectors, num_vectors), dtype=np.float64)
    
    for i in range(num_vectors):
        for j in range(num_vectors):
            d = 0.
            for k in range(num_measurements):
                d += np.square( np.subtract(X[i][k], X[j][k])  )
            D[i, j] = np.sqrt(d)
    return(D)

In [47]:
%%time
result = pairwise_distance_numba(data)

CPU times: user 131 ms, sys: 33 µs, total: 131 ms
Wall time: 130 ms


> The first time the function is executed, it is compiled. Run the function again to get the execution time without compilation.

In [48]:
%%time
result = pairwise_distance_numba(data)

CPU times: user 806 µs, sys: 39 µs, total: 845 µs
Wall time: 847 µs
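The two timings above separate compilation from execution by hand. A small helper can make that pattern reusable; the sketch below is not tied to Numba, and the name `time_calls` and its `repeats` parameter are assumptions made for this illustration.

```python
import time

def time_calls(func, *args, repeats=3):
    """Return (first_call_seconds, best_later_call_seconds) for func(*args)."""
    # First call: for a JIT-compiled function this includes compilation.
    start = time.perf_counter()
    func(*args)
    first = time.perf_counter() - start

    # Later calls: the compiled code is reused, so only execution is measured.
    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        later.append(time.perf_counter() - start)
    return first, min(later)

# Demonstration with a plain Python callable; any function can be passed.
first, warm = time_calls(sum, range(100_000))
print(f"first call: {first:.6f} s, best later call: {warm:.6f} s")
```

In this notebook it could be called as `time_calls(pairwise_distance_numba, data)`. For a plain Python function such as `sum`, the two numbers are expected to be similar, since nothing is compiled on the first call.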


In [49]:
print(np.asarray(result))

[[0.         4.76827164 4.46169428 ... 4.1498546  3.70344501 3.56266852]
 [4.76827164 0.         4.37844496 ... 4.08085492 4.62398251 4.24708855]
 [4.46169428 4.37844496 0.         ... 3.7878381  3.96606605 4.00293712]
 ...
 [4.1498546  4.08085492 3.7878381  ... 0.         4.31309973 3.88091059]
 [3.70344501 4.62398251 3.96606605 ... 4.31309973 0.         3.67060857]
 [3.56266852 4.24708855 4.00293712 ... 3.88091059 3.67060857 0.        ]]


In [50]:
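# alternative syntax: wrap an existing function instead of using the decorator.
# Note that this rebinds the name pairwise_distance_numba to a compiled version
# of pairwise_distance, replacing the explicit-loop version defined above.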
import numba
pairwise_distance_numba = numba.jit(pairwise_distance)

In [51]:
%%time
result = pairwise_distance_numba(data)

CPU times: user 215 ms, sys: 4.01 ms, total: 219 ms
Wall time: 217 ms


In [52]:
print(np.asarray(result))

[[0.         4.76827164 4.46169428 ... 4.1498546  3.70344501 3.56266852]
 [4.76827164 0.         4.37844496 ... 4.08085492 4.62398251 4.24708855]
 [4.46169428 4.37844496 0.         ... 3.7878381  3.96606605 4.00293712]
 ...
 [4.1498546  4.08085492 3.7878381  ... 0.         4.31309973 3.88091059]
 [3.70344501 4.62398251 3.96606605 ... 4.31309973 0.         3.67060857]
 [3.56266852 4.24708855 4.00293712 ... 3.88091059 3.67060857 0.        ]]


### Comparison with replicates

##### This is an extreme example that favors optimization by numba. In some other cases Cython may perform better.

In [53]:
%timeit -n 1 -r 3 result = pairwise_distance(data)

72.6 ms ± 4.97 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [54]:
%timeit -n 1 -r 3 result = pairwise_distance_cython(data)

93.9 ms ± 2.06 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [55]:
%timeit -n 1 -r 3 result = pairwise_distance_numba(data)

2 ms ± 61.2 µs per loop (mean ± std. dev. of 3 runs, 1 loop each)


### Multiprocessing (and refactoring)

We can take advantage of multiple cores using the multiprocessing module. In this approach, separate __processes__ are used, __not threads__. Threads exist in Python, but the "Global Interpreter Lock" (GIL) lets only one thread execute Python bytecode at a time, so threads generally do not speed up CPU-bound Python code. The GIL is a design trade-off tied to the flexibility of Python's memory management. Separate processes do not share memory by default, so the individual tasks must be independent.

Multiprocessing generally works well with lists, where one maps a function to each element of the list and the elements are processed by separate processes running on separate cores. To do this, we'll need to refactor our silly distance function. One approach would be to populate a list containing each of the vector pairs. The drawback here is the memory overhead of this list object:

In [56]:
data = np.random.uniform(size=(100,100))

In [57]:
def pairwise_list(X):

    list_of_tuples = list()
    
    num_vectors = X.shape[0]
    num_measurements = X.shape[1] 
    
    for i in range(num_vectors):
        for j in range(num_vectors):
            list_of_tuples.append((X[i],X[j]))
            
    return(list_of_tuples)

In [58]:
%%time
list_of_tuples = pairwise_list(data)

CPU times: user 27.5 ms, sys: 0 ns, total: 27.5 ms
Wall time: 27.4 ms


In [59]:
print(type(list_of_tuples))
print(len(list_of_tuples))

<class 'list'>
10000
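To get a feel for the overhead of this list, the sketch below looks at a single pair. It relies on two facts: indexing a row of a 2D array returns a view, and `multiprocessing` sends arguments to worker processes by pickling them. The total is a rough estimate that assumes every pair is serialized independently; the variable names are chosen for this illustration only.

```python
import pickle
import numpy as np

X = np.random.uniform(size=(100, 100))
pair = (X[0], X[1])

# Row indexing returns views, so inside one process the pair shares X's buffer.
print(np.shares_memory(pair[0], X))

# Pickling a pair serializes the data of both rows.
bytes_per_pair = len(pickle.dumps(pair))
num_pairs = X.shape[0] ** 2
print(bytes_per_pair, "bytes per pickled pair")
print(bytes_per_pair * num_pairs / 1e6, "MB if every pair is pickled separately")
```

So within a single process the list is mostly a collection of lightweight views, while the larger cost is likely to appear once the pairs have to be shipped to other processes.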


Now we'll need to refactor our function for computing distances:

In [60]:
def pairwise_distance_rf(X):
    
    assert(len(X) == 2)
    assert(len(X[0]) == len(X[1]))
    
    d=0.
    for k in range(len(X[1])):
        d += np.square( np.subtract(X[0][k], X[1][k]))
    
    return( np.sqrt(np.sum(d) ) )

In [61]:
%timeit -n 1 -r 3  result = list(map(pairwise_distance_rf, list_of_tuples))

3.35 s ± 65.1 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


Let's now repeat this with multiprocessing:

In [62]:
import multiprocessing as mp

In [63]:
# we use the with statement so that the pool is always shut down when we are done (it calls pool.terminate() on exit)
with mp.Pool(2) as pool :
    %timeit -n 1 -r 3  result = pool.map(pairwise_distance_rf, list_of_tuples)

1.81 s ± 46.9 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [64]:
for NP in [1,2,4,8]:
    print(NP)
    with mp.Pool(NP) as pool :
        %timeit -n 1 -r 3  result = pool.map(pairwise_distance_rf, list_of_tuples)

1
3.39 s ± 61.9 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
2
1.89 s ± 32.5 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
4
1.31 s ± 46 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
8
1.34 s ± 14.6 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
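The timings above can be turned into a speedup (time with 1 process divided by time with N processes) and a parallel efficiency (speedup divided by N). The snippet below is plain arithmetic on the mean values printed by `%timeit`; the variable names are illustrative.

```python
# Mean wall times in seconds, copied from the %timeit output above.
timings = {1: 3.39, 2: 1.89, 4: 1.31, 8: 1.34}

baseline = timings[1]
# Speedup: how many times faster than the single-process run.
speedups = {n: baseline / t for n, t in timings.items()}
# Efficiency: fraction of the ideal linear scaling that is achieved.
efficiencies = {n: speedups[n] / n for n in timings}

for n in timings:
    print(f"{n} processes: speedup {speedups[n]:.2f}, efficiency {efficiencies[n]:.2f}")
```

With these numbers the speedup is about 1.8 with 2 processes and levels off around 2.5 to 2.6 with 4 and 8 processes, so adding processes beyond 4 brings no further gain on this machine.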


However, `multiprocessing` does not always lead to speedups: when each task is already fast, as with a vectorized numpy operation, the cost of sending the data to the worker processes can outweigh the gain from running in parallel.

For instance, let's see how `multiprocessing` works on the vectorized version of our function:

In [65]:
def pairwise_distance_rf_2(X):
    
    assert(len(X) == 2)
    assert(len(X[0]) == len(X[1]))
    
    d = np.square( np.subtract(X[0], X[1]))
    
    return( np.sqrt(np.sum(d) ) )

In [66]:
data = np.random.uniform(size=(300,100)) # I make the data a bit larger
list_of_tuples = pairwise_list(data)

In [67]:
%timeit -n 1 -r 3  result = list(map(pairwise_distance_rf_2, list_of_tuples))

702 ms ± 30 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [68]:

for NP in [2,4,8]:
    print(NP)
    with mp.Pool(NP) as pool :
        %timeit -n 1 -r 3  result = pool.map(pairwise_distance_rf_2, list_of_tuples)

2
1.06 s ± 44 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
4
903 ms ± 28.7 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
8
893 ms ± 29.3 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
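The runs above dispatch one task per pair of vectors, which is 90000 small tasks for the 300-row data. One possible way to reduce the per-task overhead, not benchmarked here, is to make each task compute a whole row of the distance matrix. The names `row_distances`, `pairwise_distance_by_rows` and the `map_func` parameter are assumptions made for this sketch.

```python
from functools import partial
import numpy as np

def row_distances(X, i):
    """Return the distances from row i of X to every row of X."""
    diff = X - X[i]  # broadcasting: shape (num_vectors, num_measurements)
    return np.sqrt(np.sum(np.square(diff), axis=1))

def pairwise_distance_by_rows(X, map_func=map):
    """Assemble the full distance matrix from one task per row.

    map_func defaults to the built-in map; a function with the same
    (func, iterable) call shape, such as a pool's map method, could be
    passed instead.
    """
    rows = map_func(partial(row_distances, X), range(X.shape[0]))
    return np.vstack(list(rows))

X = np.random.uniform(size=(300, 100))
D = pairwise_distance_by_rows(X)
print(D.shape)
```

Inside a `with mp.Pool(NP) as pool:` block, the call would become `pairwise_distance_by_rows(X, map_func=pool.map)`. Whether this is actually faster depends on how much time goes into shipping `X` to the workers, so it should be timed before drawing conclusions.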


### numba and parallelization 

It is possible to provide a `numba` function to `mp.Pool`, but `numba` already provides what's necessary to parallelize your code:

In [69]:
from numba import njit, prange

@njit(parallel=True)
def pairwise_distance_numba_prange(X):

    num_vectors = X.shape[0]
    num_measurements = X.shape[1] 
    D = np.empty((num_vectors, num_vectors), dtype=np.float64)
    
    for i in prange(num_vectors):
        for j in range(num_vectors):
            d = 0.
            for k in range(num_measurements):
                d += np.square( np.subtract(X[i][k], X[j][k])  )
            D[i, j] = np.sqrt(d)
    return(D)

toydata = np.random.uniform(size=(10,10)) # I make toy data to launch the function once and compile it
toyresult = pairwise_distance_numba_prange( toydata ) 

In [70]:
# No numba, vectorized
%timeit -n 1 -r 3  result = list(map(pairwise_distance_rf_2, list_of_tuples))
# numba
%timeit -n 1 -r 3 result = pairwise_distance_numba( data )
# numba parallel=True
%timeit -n 1 -r 3 result = pairwise_distance_numba_prange( data )

643 ms ± 41.4 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
17.5 ms ± 380 µs per loop (mean ± std. dev. of 3 runs, 1 loop each)
1.61 ms ± 337 µs per loop (mean ± std. dev. of 3 runs, 1 loop each)


Again a significant improvement with Numba, which really shines when working with nested loops.

**In the end**, which of these different methods you should use will depend on the particulars of the computations you have to make.

In [95]:
import pandas as pd

df = pd.DataFrame({"x": range(1,6), "y": range(6,11)})
df2 = df.loc[(df["x"] <= 3), :] 
print(type(df2))
print(df2)
print(df)

<class 'pandas.core.frame.DataFrame'>
   x  y
0  1  6
1  2  7
2  3  8
   x   y
0  1   6
1  2   7
2  3   8
3  4   9
4  5  10


In [77]:
df.loc[df['x']<=3 , ] 

   x  y
0  1  6
1  2  7
2  3  8
