# Optimizing Python 
----------------------------------------------


In [1]:
%load_ext autoreload
%autoreload 2

<br>

# Table of Content <a id='toc'></a>


1. [numpy](#4)
   
2. [Numba](#6)

3. [Cython](#5)



<br>
<br>
<br>

[back to the toc](#toc)



Now that we have seen the tools to measure our code resource usage, we will review a couple of tricks that can help you speedup your python code tremendously.

The firsts are basic:
 1. **apply standard good-sense** : does your code reads/write to the disk more than it need to ? Do you spend a lot of time searching for items in lists instead of dictionnaries ?
 2. **switch to numpy** : vectorized operations are great (as we have seen). 
 


# 1. numpy <a id="4"></a>

If you have not done it already, a very good first step is to use numpy strucutres and functions wherever possible.

Indeed, numpy implements efficient (it is all C++ under the hood) and vectorized operations, within a fairly easy to approach interface.

It base struture is the **array**, which can be multi-dimensional, and can contains a single type of object (eg, all floats).


In [329]:
import numpy as np


L= [1,3,45,2,3]

A = np.array(L)

print('list',L)
print('array',A)

list [1, 3, 45, 2, 3]
array [ 1  3 45  2  3]


there are many array creation routines, some of which we have already seen:

In [330]:
np.zeros( (5,5) )# create a 5x5 array of 0s

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [331]:
np.random.randn(10)# 10 values randomly droawn from a standard normal distribution

array([-0.6557239 , -1.60084612, -0.05888957,  0.50653951,  0.62609221,
        0.43040314, -1.38031598,  1.2416588 ,  1.24270522,  1.59892489])

But the nicest is that you can perform operations on whole arrays at once, and fast:

In [332]:
A = np.random.randn(10**6)
L = list(A) # for comparison

# multiply all elements by 13:
%timeit -n 3 -r 7 A*13

%timeit -n 3 -r 7  [x*13 for x in L]


798 µs ± 373 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)
150 ms ± 2.84 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)


That is a speedup of ~200!

The same thing works if you want to do operation between arrays:

In [333]:
A1 = np.random.randint(low=1,high=6,size=3) ## 3 random numbers
A2 = np.random.randint(low=1,high=6,size=3) ## 3 other random numbers
print(A1,'+',A2,'->',A1+A2)

[4 5 5] + [4 3 3] -> [8 8 8]


It also has a number of nice common functions :

In [335]:
print("sum")
%timeit -n 3 -r 7 A.sum()
%timeit -n 3 -r 7 sum(L) #  compare with builtin sum
print("***")
print("mean")
%timeit -n 3 -r 7 A.mean()
%timeit -n 3 -r 7 sum(L)/len(L) 
print("***")
print("standard deviation")
%timeit -n 3 -r 7 A.sum()

## we have to build a little function here
def std(L):
    m = sum(L)/len(L) 
    s = 0
    for i in L:
        s+= (i-m)**2
    return (s/len(L))**0.5
%timeit -n 3 -r 7 std(L) 

print('***')
print('sorting')
%timeit -n 3 -r 7 np.sort(A)
%timeit -n 3 -r 7 sorted(L)

sum
522 µs ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)
32.3 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
***
mean
329 µs ± 53.4 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)
31.8 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)
***
standard deviation
306 µs ± 46.4 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)
239 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
***
sorting
66.7 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
520 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)


Of course, that's just scratching the surface, but you can see how even a few lines of code here can make you code much faster.

If you are not familiar with numpy, we recommend you take some time to practice with it as it is somewhat ubiquitous in scientific python. Their [absolute beginner's guide](https://numpy.org/doc/stable/user/absolute_beginners.html) is a good (and actually fairly thorough) starting point.


remember, in the previous section we rewrote the `pairwise_distance` function in numpy:

In [336]:
def pairwise_distance(X):

    num_vectors = len(X)
    num_measurements = len(X[0])
    D = [[0]*num_vectors for x in range(num_vectors)]
    
    for i in range(num_vectors):
        for j in range(num_vectors):
            d = []
            for k in range(num_measurements):
                d.append( ( X[i][k] - X[j][k] )**2 )
            
            D[i][j] = sum(d) **0.5
    return(D)


def pairwise_distance_numpy(X):

    num_vectors = X.shape[0]
    num_measurements = X.shape[1] 
    D = np.empty((num_vectors, num_vectors), dtype=np.float64)
    
    for i in range(num_vectors):
        for j in range(num_vectors):
            d = np.square( np.subtract(X[i], X[j]) )
            D[i, j] = np.sqrt(np.sum(d))
    return(D)

You can play *spot the differences* between these 2 implementations

**Optional micro-exercise:** consider the following native python code, which computes the integral of $x^2-x$.

In [351]:
def integrate_f_native(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        x = a + i * dx
        s += x ** 2 - x
    return s * dx

print( integrate_f_native(0,2,100) )
%timeit -n 3 -r 7 _=integrate_f_native(0,2,1000000)

0.6467999999999999
114 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)


Make it faster using numpy. **remember to make sure that you get (almost) the same results**

**hint:** `np.arange(start,stop,step)` is a function that creates array from `start` to `stop` by increments of `step`

In [41]:
np.arange(1,1.5,0.075)

array([1.   , 1.075, 1.15 , 1.225, 1.3  , 1.375, 1.45 ])

Uncomment the following to look at the solution:

In [377]:
# %load solutions/02_integrate_numpy.py

<br>
<br>

[back to the toc](#toc)


## 2. Numba <a id='6'></a>

**[Numba](https://numba.pydata.org/)** is a nice library which provide a number of optimization routines for python code, the most well know being **`@jit`** for **just-in-time** compilation

In [357]:
from numba import jit

In [358]:
# Unchanged code 
# the option nopython=True makes so that there will be an error if numba failed to convert to full C
@jit(nopython=True) 
def pairwise_distance_numba(X):

    num_vectors = X.shape[0]
    num_measurements = X.shape[1] 
    D = np.empty((num_vectors, num_vectors), dtype=np.float64)
    
    for i in range(num_vectors):
        for j in range(num_vectors):
            d = np.square( np.subtract(X[i], X[j]) )
            D[i, j] = np.sqrt(np.sum(d))
    return(D)

In [359]:
num_vector = 200
num_measures = 100

data = np.random.uniform(size=(num_vector,num_measures))
print(type(data[0][0]))
print(data.shape)

<class 'numpy.float64'>
(200, 100)


In [361]:
%time result = pairwise_distance_numba(data)

CPU times: user 22.4 ms, sys: 4 ms, total: 26.4 ms
Wall time: 25.4 ms


> the first time it is executed the function is compiled. Run the function again to get the execution time without compilation

In [362]:
%timeit -n 5 -r 7 result = pairwise_distance_numpy(data)
%timeit -n 5 -r 7 result = pairwise_distance_numba(data)

219 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
7.16 ms ± 600 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


**Woosh!** that is quite a gain

In [363]:
# alternative syntax
import numba
pairwise_distance_numba = numba.jit(pairwise_distance_numpy)

Here it is pretty bluffing, but sometimes it can be a bit difficult to get this level of performance.

Most external libraries are missing from numba, and [not all of numpy's code has been ported as well](https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html).

> Note : a lot of function in external libraries (such as the ones of sklearn) have already been optimized and compiled, so there would not necessarily be much to gain there anyway...

All-in-all, it depends quite a lot on the particulars of what you want to optimize : [here are some tips](https://numba.pydata.org/numba-doc/latest/user/performance-tips.html)


> there also exists ways to [compile numba code ahead of time](https://numba.pydata.org/numba-doc/dev/user/pycc.html)

Although it is usually a good idea to rely on `numpy` vectorized operations, `numba` copes very well with loops and vectorizes them when it can , and sometimes ends up even better for it:


In [364]:
@jit(nopython=True)
def pairwise_distance_numba2(X):

    num_vectors = X.shape[0]
    num_measurements = X.shape[1] 
    D = np.empty((num_vectors, num_vectors), dtype=np.float64)
    
    for i in range(num_vectors):
        for j in range(num_vectors):
            d = 0.
            for k in range(num_measurements):
                d += np.square( np.subtract(X[i][k], X[j][k])  )
            D[i, j] = np.sqrt(d)
    return(D)

_=pairwise_distance_numba2(data)

In [366]:
%timeit -n 5 -r 7 result = pairwise_distance_numba(data)
%timeit -n 5 -r 7 result = pairwise_distance_numba2(data)

7.64 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
3.79 ms ± 22 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


<br>
<br>

[back to the toc](#toc)

## 3. Cython <a id='5'></a>

**[Cython](https://cython.org/)** provides way to transform a python code into C compiled code failry seamlessly.

By default, Cython retains Python flexibility by creating the ugliest of C-codes. This comes at the cost of a lot of efficiency, but already it is enough to speed your code some.

The "command-line" flavor of cython involves either calling `cython` or writing a little `setup.py` file for your code. It is a bit of work at the start but actually quite easy once you have done it a couple of time : see [here for examples](https://cython.readthedocs.io/en/latest/src/quickstart/build.html)

The jupyter way :

In [367]:
%load_ext cython

The cython extension is already loaded. To reload it, use:
  %reload_ext cython


In [368]:
## pure python 
def f_native(x):
    return x ** 2 - x


def integrate_f_native(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_native(a + i * dx)
    return s * dx

In [369]:
%%cython
## cython, without changing a single thing

def f(x):
    return x ** 2 - x


def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

In [370]:
print("native")
%timeit -n 3 -r 5 result = integrate_f_native(0,1,1000000)
print("simple cython")
%timeit -n 3 -r 5 result = integrate_f(0,1,1000000)

native
155 ms ± 5.35 ms per loop (mean ± std. dev. of 5 runs, 3 loops each)
simple cython
105 ms ± 2.23 ms per loop (mean ± std. dev. of 5 runs, 3 loops each)


Ok, so a speedup of about a third, fairly nice for a single line change.

But, let's look how Cython performed with our code :

In [371]:
%%cython --annotate
## cython, without changing a single thing

def f(x):
    return x ** 2 - x


def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

We can give some hints to Cython, to help it compile the code better :

In [372]:
%%cython --annotate
## cython, typing 

def f_typed( double x ):
    return x ** 2 - x


def integrate_f_typed( double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_typed(a + i * dx)
    return s * dx

That's better, but there is still a lot of yellow. 
In particular when the two functions interact. 
Which is not ideal because they should both in C, their interaction should happen without any python element.


In [373]:
%%cython --annotate
## cython, more typing 

# this function is only called inside function wich are cythonized
# so we can tell cython to try to compile is as pure C
cdef double f_fullTyped( double x ):
    return x ** 2 - x


def integrate_f_fullTyped( double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_fullTyped(a + i * dx)
    return s * dx

In [374]:
print("native")
%timeit -r 5 -n 3 result = integrate_f_native(0,1,1000000)
print("cython - simple")
%timeit -r 5 -n 3 result = integrate_f(0,1,1000000)
print("cython - some typing")
%timeit -r 5 -n 3 result = integrate_f_typed(0,1,1000000)
print("cython - more typing")
%timeit -r 5 -n 3 result = integrate_f_fullTyped(0,1,1000000)

native
167 ms ± 21.1 ms per loop (mean ± std. dev. of 5 runs, 3 loops each)
cython - simple
104 ms ± 2.28 ms per loop (mean ± std. dev. of 5 runs, 3 loops each)
cython - some typing
31.1 ms ± 94.3 µs per loop (mean ± std. dev. of 5 runs, 3 loops each)
cython - more typing
971 µs ± 14.7 µs per loop (mean ± std. dev. of 5 runs, 3 loops each)


Woohoo! that's more like it.

Of course, there is more things we could do, like typing the return type of the functions and so on, as shown in this [quickstart tutorial](https://cython.readthedocs.io/en/latest/src/quickstart/cythonize.html) (which this example is grabbed from). 

<br>

These compiling tools usually won't work with external libraries, but a cool thing about Cython is that it works very well with numpy structures (Although the code is somewhat ugly, and they use a deprecated API, which they are currently working on changing...).

So le'ts see what we can get with our `pairwise_distance`:

In [397]:
%%cython --annotate
import numpy as np
cimport numpy as np
cimport cython
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
def pairwise_distance_cython(double[:, ::1] X):
    
    cdef int num_vectors = X.shape[0]
    cdef int num_measurements = X.shape[1]
    cdef double d
    cdef double[:, ::1] D = np.empty((num_vectors, num_vectors), dtype=DTYPE)
    
    for i in range(num_vectors):
        for j in range(num_vectors):
            d=0
            for k in range(num_measurements):
                
                d += ( X[i][k] - X[j][k] )**2

            D[i, j] = d**0.5
    return(D)

In file included from /home/wandrille/Installed_software/anaconda3/envs/py38/lib/python3.8/site-packages/numpy/core/include/numpy/ndarraytypes.h:1969:0,
                 from /home/wandrille/Installed_software/anaconda3/envs/py38/lib/python3.8/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
                 from /home/wandrille/Installed_software/anaconda3/envs/py38/lib/python3.8/site-packages/numpy/core/include/numpy/arrayobject.h:4,
                 from /home/wandrille/.cache/ipython/cython/_cython_magic_6710633bfd4889588f06d4493fddfe9a.c:784:
  ^~~~~~~


In [398]:
print(data.shape)
print('numpy:')
%timeit -n 3 -r 5 D = pairwise_distance_numpy(data)
print('cython:')
%timeit -n 3 -r 5 result = pairwise_distance_cython(data)

(400, 100)
numpy:
878 ms ± 16.8 ms per loop (mean ± std. dev. of 5 runs, 3 loops each)
cython:
22.7 ms ± 154 µs per loop (mean ± std. dev. of 5 runs, 3 loops each)


Okay, so now we have really gotten down.

So cython is really great, although it does take some practice to get it to work the way you want. 
They do have a [nice tutorial](https://cython.readthedocs.io/en/latest/index.html) though.

> note : cython is also a great way to [interface python and C code](https://cython.readthedocs.io/en/stable/src/userguide/external_C_code.html).

> it is also fairly easy to do [profiling on cython code](https://cython.readthedocs.io/en/latest/src/tutorial/profiling_tutorial.html)

<br>

### Comparison between the different implementations

In [380]:
num_vector = 400
num_measures = 100

data = np.random.uniform(size=(num_vector,num_measures))
print(type(data[0][0]))
print(data.shape)

<class 'numpy.float64'>
(400, 100)


In [381]:
%timeit -n 1 -r 10 result = pairwise_distance_numpy(data)

904 ms ± 40.1 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)


In [401]:
%timeit -n 1 -r 10 result = pairwise_distance_cython(data)

26 ms ± 8.14 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)


In [383]:
%timeit -n 1 -r 10 result = pairwise_distance_numba2(data)

15.9 ms ± 605 µs per loop (mean ± std. dev. of 10 runs, 1 loop each)


> This is an example which tends to favors optimization by numba. In some other cases Cython may perform better.

# Exercise: 

try to optimize the following code:

In [500]:
def compute_sequence_similarity(seqA  ,seqB):
    """compute similarity between 2 sequence as the fraction of position where they have the same value"""

    l = len(seqA)
    similar = 0
    for i in range(l):
        if seqA[i] == seqB[i]:
            similar += 1
    return similar/l


def compute_sequence_similarity_Mat(Lseq):
    # compute similarity between all sequence pair
    sim = np.zeros( ( len(Lseq),len(Lseq) ) )
    for i,s1 in enumerate(Lseq):
        for j,s2 in enumerate(Lseq):
            sim[i,j] = compute_sequence_similarity( s1 , s2 )
    return sim

In [486]:
## generate some data to play with
Lseq = [ ''.join(np.random.choice(list("ATGC"), 500)) for x in range(100) ]

In [487]:
%timeit -n 3 -r 7 _=compute_sequence_similarity_Mat(Lseq)

274 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)


**Warning:** this exercise is not necessarily very easy. 

You will likely have to try different things and delve a bit in the libraries online documentations to get good results.

**numpy hint** : to transform string `s` to an array: `np.array(list(s))`


**cython hint** : 
 * **simple**: the typing of string is `str`. 
 * **complex**: we can use C stuff such as `char*`, but then you need to convert the a python `str` to unicode, using for instance something like:
 ```python 
c_compatible_string = python_string.encode('UTF-8')
 ```

Numba solution:

In [454]:
# %load -r -22 solutions/02_sequence_similarity_numba.py

In [None]:
# %load -r 23- solutions/02_sequence_similarity_numba.py

numpy solution:

In [None]:
# %load -r -25 solutions/02_sequence_similarity_numpy.py

In [461]:
# %load -r 26- solutions/02_sequence_similarity_numpy.py

Cython solution "simple":

In [538]:
# %load -r -35 solutions/02_sequence_similarity_cython1.py

In [537]:
# %load -r 36- solutions/02_sequence_similarity_cython1.py

Cython solution "complex":

In [None]:
# %load -r -38 solutions/02_sequence_similarity_cython2.py

In [None]:
# %load -r 39- solutions/02_sequence_similarity_cython2.py