# Notebook 2 - Optimizing Python 
----------------------------------------------


In [None]:
%load_ext autoreload
%autoreload 2

<br>

# Table of Contents <a id='toc'></a>


1. **[Numpy](#4)**

2. **[Numba](#6)**
   
3. **[Cython](#5)**


## Introduction

Now that we have seen the tools to measure the resource usage of our code, we will review a couple of tricks that can help you speed up your Python code tremendously.

The first ones are basic:
 1. **Apply standard good sense**: does your code read from or write to the disk more than it needs to?
    Do you spend a lot of time searching for items in lists instead of dictionaries?
 2. **Switch to numpy**: vectorized operations are great (as we have seen).
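
The list-versus-dictionary point can be illustrated with a minimal sketch using only the standard library. The names `items_list`, `items_dict` and `target`, as well as the sizes, are arbitrary choices for this illustration, and the exact timings will vary from machine to machine.

```python
import timeit

n = 100_000
items_list = list(range(n))
items_dict = dict.fromkeys(items_list)  # keys only, all values are None
target = n - 1  # last element: worst case for the list scan

# `in` on a list scans the elements one by one (linear time).
t_list = timeit.timeit(lambda: target in items_list, number=100)
# `in` on a dict uses a hash lookup (constant time on average).
t_dict = timeit.timeit(lambda: target in items_dict, number=100)

print(f"list lookup: {t_list:.5f} s for 100 searches")
print(f"dict lookup: {t_dict:.5f} s for 100 searches")
```

The gap widens as the collection grows, since only the list lookup time scales with the number of items.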
 


In [None]:
import numpy
import numba
import cython

print("All modules loaded successfully!")

> Note: if you are missing some of the above modules, you should install them.
>
>    * Installation with **pip**: `pip install --user numpy numba cython` 
>    * Installation with **conda**: `conda install -c conda-forge numpy numba cython`


<br>
<br>

## 1. Numpy <a id="4"></a>

If you have not done it already, a very good first step is to **use numpy structures and functions** wherever possible.  
Indeed, `numpy` implements efficient (it is mostly compiled C under the hood) and vectorized operations, behind a fairly approachable interface.  
Its base structure is the **array**, which can be multi-dimensional, and contains a single type of object (e.g. all floats).
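
To make the "single type" point concrete, here is a small sketch (the variable names are only illustrative): when the input values mix integers and floats, numpy converts them all to one common type, which is exposed through the `dtype` attribute.

```python
import numpy as np

mixed = np.array([1, 2.5, 3])   # integers and a float together
print(mixed, mixed.dtype)       # every element is stored as a float

matrix = np.zeros((2, 3))       # a 2-dimensional array
print(matrix.ndim, matrix.shape, matrix.dtype)
```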


In [None]:
import numpy as np

L = [1, 3, 45, 2, 3]
A = np.array(L)

print("list", L)
print("array", A)

<br>

There are many numpy array creation routines, some of which we have already seen:

In [None]:
# Create a 5x5 array of 0s.
np.zeros((5,5))

In [None]:
# 10 values randomly drawn from a standard normal distribution.
np.random.randn(10)

<br>

But the nicest part is that you can **perform operations on whole arrays at once**, and fast:  
* Let's compare the speed of using numpy vs a regular Python list.

In [None]:
A = np.random.randn(10 ** 6)
L = list(A) # for comparison

# Multiply all elements by 13.
%timeit -n 10 -r 3 A * 13
%timeit -n 10 -r 3 [x * 13 for x in L]

That is a speedup of roughly 200x (the exact factor depends on your machine)!

The same thing works if you want to do operations between arrays:

In [None]:
A1 = np.random.randint(low=1, high=6, size=3)  # 3 random numbers.
A2 = np.random.randint(low=1, high=6, size=3)  # 3 other random numbers.
print(f"{A1} + {A2} -> {A1+A2}")

<br>

Numpy also has a number of nice common functions:

In [None]:
print("sum")
%timeit -n 5 -r 3 A.sum()
%timeit -n 5 -r 3 sum(L)    # Compare with builtin sum.
print("***")

print("mean")
%timeit -n 5 -r 3 A.mean()
%timeit -n 5 -r 3 sum(L)/len(L) 
print("***")

print("standard deviation")
%timeit -n 5 -r 3 A.std()

# We have to build a little function here.
def std(L):
    m = sum(L) / len(L) 
    s = 0
    for i in L:
        s+= (i - m)**2
    return (s / len(L))**0.5
%timeit -n 5 -r 3 std(L) 

print("***")
print("sorting")
%timeit -n 5 -r 3 np.sort(A)
%timeit -n 5 -r 3 sorted(L)
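
When comparing a numpy routine with a hand-written equivalent, it is worth checking that both actually compute the same quantity. A minimal sketch, assuming the population formula used by the `std` function above: numpy's `std` defaults to `ddof=0`, so the two should agree up to floating-point rounding. The seed and array size are arbitrary.

```python
import numpy as np

def std(values):
    # Population standard deviation, same formula as the function above.
    m = sum(values) / len(values)
    s = 0.0
    for v in values:
        s += (v - m) ** 2
    return (s / len(values)) ** 0.5

rng = np.random.default_rng(0)
A = rng.standard_normal(10_000)
L = A.tolist()

print("numpy :", A.std())
print("python:", std(L))
print("match :", np.isclose(A.std(), std(L)))
```

Passing `ddof=1` to `A.std()` would give the sample standard deviation instead, which is slightly larger.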

Of course, that's just scratching the surface, but you can see how even a few lines of code here can make your code much faster.

If you are not familiar with numpy, we recommend you take some time to practice with it as it is somewhat ubiquitous in scientific python. Their [absolute beginner's guide](https://numpy.org/doc/stable/user/absolute_beginners.html) is a good (and actually fairly thorough) starting point.

<br>

Remember, in the previous section we rewrote the `pairwise_distance` function in numpy:

In [None]:
def pairwise_distance(X):

    num_vectors = len(X)
    num_measurements = len(X[0])
    D = [[0]*num_vectors for x in range(num_vectors)]
    
    for i in range(num_vectors):
        for j in range(num_vectors):
            d = []
            for k in range(num_measurements):
                d.append( ( X[i][k] - X[j][k] )**2 )
            
            D[i][j] = sum(d) **0.5
    return(D)


def pairwise_distance_numpy(X):

    num_vectors = X.shape[0]
    num_measurements = X.shape[1] 
    D = np.empty((num_vectors, num_vectors), dtype=np.float64)
    
    for i in range(num_vectors):
        for j in range(num_vectors):
            d = np.square( np.subtract(X[i], X[j]) )
            D[i, j] = np.sqrt(np.sum(d))
    return(D)

You can play *spot the differences* between these 2 implementations.
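
Whenever a function is rewritten for speed, it is a good habit to confirm that the result did not change. The sketch below is one possible check: `pairwise_distance_reference` is a hypothetical compact restatement of the pure-Python logic, the numpy version is repeated so that the block is self-contained, and the test matrix size and seed are arbitrary.

```python
import numpy as np

def pairwise_distance_reference(X):
    # Pure-Python Euclidean distances between all pairs of rows.
    n = len(X)
    return [[sum((a - b) ** 2 for a, b in zip(X[i], X[j])) ** 0.5
             for j in range(n)] for i in range(n)]

def pairwise_distance_numpy(X):
    n = X.shape[0]
    D = np.empty((n, n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            D[i, j] = np.sqrt(np.sum(np.square(np.subtract(X[i], X[j]))))
    return D

rng = np.random.default_rng(42)
X = rng.uniform(size=(20, 5))

D_ref = np.array(pairwise_distance_reference(X.tolist()))
D_np = pairwise_distance_numpy(X)

# Both implementations should agree up to floating-point rounding.
print("same result:", np.allclose(D_ref, D_np))
# A distance matrix is symmetric with zeros on the diagonal.
print("symmetric  :", np.allclose(D_np, D_np.T))
print("zero diag  :", np.allclose(np.diag(D_np), 0.0))
```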

<br>
<br>

<div class="alert alert-block alert-success">

## Exercise 2.1 - Numpy optimization

See the dedicated `exercises_course2.ipynb` notebook.

</div>

<br>
<br>
<br>

[Back to ToC](#toc)

## 2. Numba <a id='6'></a>

**[Numba](https://numba.pydata.org/)** is a nice library which provides a number of optimization routines for Python code, the best known being **`@jit`** for **just-in-time** compilation.

In [None]:
from numba import jit

In [None]:
# Unchanged code.
# The option nopython=True raises an error if numba cannot compile the
# whole function to machine code without falling back to Python objects.
@jit(nopython=True) 
def pairwise_distance_numba(X):

    num_vectors = X.shape[0]
    num_measurements = X.shape[1] 
    D = np.empty((num_vectors, num_vectors), dtype=np.float64)
    
    for i in range(num_vectors):
        for j in range(num_vectors):
            d = np.square( np.subtract(X[i], X[j]) )
            D[i, j] = np.sqrt(np.sum(d))
    return(D)

In [None]:
num_vector = 200
num_measures = 100

data = np.random.uniform(size=(num_vector,num_measures))
print(type(data[0][0]))
print(data.shape)

In [None]:
%time result = pairwise_distance_numba(data)

<div class="alert alert-block alert-warning">
    
**Important:** The first time the function is called, Numba needs to compile it to machine code, which creates an overhead and slows that first call down. This is no longer needed on subsequent calls, so try running the function again to get the execution time without compilation.

</div>

In [None]:
%timeit -n 5 -r 7 result = pairwise_distance_numpy(data)
%timeit -n 5 -r 7 result = pairwise_distance_numba(data)

**Woosh!** that is quite a gain.

In [None]:
# Alternative syntax
import numba

pairwise_distance_numba = numba.jit(pairwise_distance_numpy , nopython=True)

Here the result is pretty impressive, but it can sometimes be difficult to reach this level of performance.

Numba does not support most external libraries, and [not all of numpy's code has been ported either](https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html).

> Note: a lot of functions in external libraries (such as the ones in sklearn) have already been optimized and compiled, so there would not necessarily be much to gain there anyway...

All in all, it depends quite a lot on the particulars of what you want to optimize: [here are some tips](https://numba.pydata.org/numba-doc/latest/user/performance-tips.html).


> Note: there are also ways to [compile numba code ahead of time](https://numba.pydata.org/numba-doc/dev/user/pycc.html).

Although it is usually a good idea to rely on `numpy` vectorized operations, `numba` copes very well with explicit loops and vectorizes them when it can, and sometimes ends up even faster for it:


In [None]:
@jit(nopython=True)
def pairwise_distance_numba2(X):

    num_vectors = X.shape[0]
    num_measurements = X.shape[1] 
    D = np.empty((num_vectors, num_vectors), dtype=np.float64)
    
    for i in range(num_vectors):
        for j in range(num_vectors):
            d = 0.
            for k in range(num_measurements):
                d += np.square( np.subtract(X[i][k], X[j][k])  )
            D[i, j] = np.sqrt(d)
    return(D)

_ = pairwise_distance_numba2(data)

In [None]:
%timeit -n 10 -r 3 result = pairwise_distance_numba(data)
%timeit -n 10 -r 3 result = pairwise_distance_numba2(data)

<br>
<br>

[Back to ToC](#toc)

## 3. Cython <a id='5'></a>

**[Cython](https://cython.org/)** provides a way to transform python code into C compiled code fairly seamlessly.

By default, Cython retains Python's flexibility by generating rather ugly C code that still goes through the Python runtime for most operations. This costs a lot of efficiency, but is already enough to speed up your code somewhat.

The "command-line" flavor of Cython involves either calling `cython` or writing a little `setup.py` file for your code. It is a bit of work at the start but actually quite easy once you have done it a couple of times: see [here for examples](https://cython.readthedocs.io/en/latest/src/quickstart/build.html).

The jupyter way:

In [None]:
%load_ext cython

In [None]:
# Pure python 
def f_native(x):
    return x ** 2 - x

def integrate_f_native(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_native(a + i * dx)
    return s * dx
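
As a quick sanity check of what `integrate_f_native` computes: it is a left Riemann sum approximating the integral of $x^2 - x$, which between 0 and 1 equals $1/3 - 1/2 = -1/6$. The sketch below repeats the two functions so that it is self-contained; the value of `N` is an arbitrary choice.

```python
def f_native(x):
    return x ** 2 - x

def integrate_f_native(a, b, N):
    # Left Riemann sum of f_native over [a, b] with N sub-intervals.
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_native(a + i * dx)
    return s * dx

approx = integrate_f_native(0, 1, 1000)
exact = -1 / 6

print(f"approximation: {approx:.8f}")
print(f"exact value:   {exact:.8f}")
```

The same check can be applied to each Cython variant to confirm that adding types did not change the result.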

In [None]:
%%cython
# cython, without changing a single thing

def f(x):
    return x ** 2 - x

def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

In [None]:
print("Native python:")
%timeit -n 10 -r 3 result = integrate_f_native(0, 1, 1000000)

print("Simple cython:")
%timeit -n 10 -r 3 result = integrate_f(0, 1, 1000000)

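OK, so a speedup of about a third, which is fairly nice for a single-line change.

But let's look at how Cython handled our code. With the `--annotate` option, lines are highlighted in yellow, and the darker the yellow, the more the line interacts with the Python interpreter: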

In [None]:
%%cython --annotate
# cython, without changing a single thing.

def f(x):
    return x ** 2 - x

def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

<br>

**We can give some hints to Cython**, to help it compile the code better:

In [None]:
%%cython --annotate
# cython, with manually added typing.

def f_typed(double x):
    return x ** 2 - x


def integrate_f_typed(double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_typed(a + i * dx)
    return s * dx

That's better, but there is still a lot of yellow, in particular where the two functions interact.  
This is not ideal: since both functions should be in C, their interaction should happen without any Python element.


In [None]:
%%cython --annotate
# cython, more typing.

# This function is only called inside functions which are cythonized.
# So we can tell cython to try to compile it as pure C.
cdef double f_full_typed(double x):
    return x ** 2 - x


def integrate_f_full_typed(double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_full_typed(a + i * dx)
    return s * dx

<br>

**Let's compare the speed** of our different implementations:

In [None]:
print("Native python:")
%timeit -n 10 -r 3 result = integrate_f_native(0, 1, 1_000_000)

print("Cython - simple:")
%timeit -n 10 -r 3 result = integrate_f(0, 1, 1_000_000)

print("Cython - some typing:")
%timeit -n 10 -r 3 result = integrate_f_typed(0, 1, 1_000_000)

print("Cython - more typing:")
%timeit -n 10 -r 3 result = integrate_f_full_typed(0, 1, 1_000_000)

Woohoo! that's more like it.

Of course, there are more things we could do, like typing the return type of the functions and so on, as shown in this [quick-start tutorial](https://cython.readthedocs.io/en/latest/src/quickstart/cythonize.html) (from which this example was taken). 

<br>

These compiling tools usually won't work with external libraries, but a cool thing about Cython is that it works very well with numpy structures (although the code is somewhat ugly, and they use a deprecated API, which they are currently working on changing...).

So let's see what we can get with our `pairwise_distance`:

In [None]:
%%cython --annotate
# distutils: define_macros=NPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION
import numpy as np
cimport numpy as np
cimport cython
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)  # Turn-off bounds-checking for entire function.
@cython.wraparound(False)   # Turn-off negative index wrapping for entire function.
def pairwise_distance_cython(double[:, ::1] X):
    
    cdef int num_vectors = X.shape[0]
    cdef int num_measurements = X.shape[1]
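    cdef int i, j, k  # Typed loop indices, so the loops can compile to plain C.
    cdef double d
    cdef double[:, ::1] D = np.empty((num_vectors, num_vectors), dtype=DTYPE)
    
    for i in range(num_vectors):
        for j in range(num_vectors):
            d = 0
            for k in range(num_measurements):
                # Index as X[i, k] rather than X[i][k], which goes through an intermediate slice.
                d += ( X[i, k] - X[j, k] )**2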

            D[i, j] = d**0.5
    return(D)

In [None]:
print(data.shape)
print("Numpy:")
%timeit -n 10 -r 3 D = pairwise_distance_numpy(data)

print("Cython:")
%timeit -n 10 -r 3 result = pairwise_distance_cython(data)

Okay, so now we have really gotten a lot faster.

So cython is really great, although it does take some practice to get it to work the way you want. 
They do have a [nice tutorial](https://cython.readthedocs.io/en/latest/index.html) though.

> Notes:
> * `cython` is also a great way to
>   [interface python and C code](https://cython.readthedocs.io/en/stable/src/userguide/external_C_code.html).
> * It is also fairly easy to do
>   [profiling on cython code](https://cython.readthedocs.io/en/latest/src/tutorial/profiling_tutorial.html).

<br>

### Comparison between the different implementations

In [None]:
# Create a test data matrix of size 400 x 100.
data = np.random.uniform(size=(400, 100))
print("Test matrix type:", type(data[0][0]))
print("Test matrix size:", data.shape)

# Benchmark the different implementations of the pairwise distance computing.
print("\nNative python:")
%time result = pairwise_distance(data)

print("\nNumpy:  ", end="")
%timeit -n 1 -r 10 result = pairwise_distance_numpy(data)

print("Cython: ", end="")
%timeit -n 1 -r 10 result = pairwise_distance_cython(data)

print("Numba:  ", end="")
%timeit -n 1 -r 10 result = pairwise_distance_numba2(data)

> This is an example which tends to favor optimization by numba. In some other cases Cython may perform better.

<br>
<br>

<div class="alert alert-block alert-success">

## Exercise 2.2 - Numba/Numpy/Cython optimization

See the dedicated `exercises_course2.ipynb` notebook.

</div>