![cython](https://upload.wikimedia.org/wikipedia/en/thumb/c/ce/Cython-logo.svg/1200px-Cython-logo.svg.png)
![numba](https://numba.pydata.org/_static/numba_blue_icon_rgb.png)

## Objective
Python is a fantastic language. It's quick to prototype, but __can__ be slow for computationally intensive tasks.
In this tutorial we will learn how to write performant Python code by using Cython and Numba.
> When we say performant, we mean C-level performance.

### Overview
1. Learn why Python can be slow
2. See the performance gains of Cython
3. Learn how to write Cython code
4. Learn about Numba and how to use it
5. Use Cython and Numba for machine learning

# Python: speed of development vs. execution speed

![](https://slideplayer.com/slide/8720100/26/images/4/Programmer+Efficiency+&+Performance.jpg)




## Why does Python have a bad reputation for being slow?

Python is an interpreted, dynamically typed language.
The interpreter checks types at run time for every operation and allocates objects on the heap, which adds overhead to each line.
Names are resolved at run time too: global names and most object attributes are stored in dictionaries, so variable access and method/function calls frequently involve dictionary lookups.
This makes Python great for fast prototyping, but slow to execute.
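To make this dynamic behavior concrete, here is a small sketch. The `Point` class and `add` function are hypothetical names used only for illustration. It shows that instance attributes live in a dictionary and that the meaning of `+` is decided from the operand types each time the line runs.

```python
class Point:
    """Hypothetical example class; not part of any library."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


def add(a, b):
    # The interpreter decides what `+` means only when this line runs,
    # based on the run-time types of `a` and `b`.
    return a + b


point = Point(1, 2)
point.z = 3                      # attributes can be added at run time
attributes = dict(vars(point))   # copy of the instance's attribute dictionary

int_sum = add(1, 2)
str_sum = add("a", "b")
list_sum = add([1], [2])
print(attributes, int_sum, str_sum, list_sum)
```

The same `add` function works for integers, strings, and lists, which is convenient, but it also means the interpreter cannot skip the type checks.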

Slow Python is usually pure, `non-pythonic` Python code, for example:
* using explicit loops instead of comprehensions
* not using iterators
* etc.

Developers who are new to Python make these mistakes often, and it slows down their code.
Python has many optimized built-in functions. Use them.
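As a rough illustration of the advice to use built-ins, the sketch below times a hand-written loop against the built-in `sum`. The function names `loop_sum` and `builtin_sum` are made up for this example, and the exact timings will vary by machine and Python version.

```python
import timeit

values = list(range(100_000))


def loop_sum(numbers):
    # Explicit loop: every iteration runs in the interpreter.
    total = 0
    for number in numbers:
        total += number
    return total


def builtin_sum(numbers):
    # The built-in does the same iteration in compiled code.
    return sum(numbers)


loop_result = loop_sum(values)
builtin_result = builtin_sum(values)

loop_time = timeit.timeit(lambda: loop_sum(values), number=20)
builtin_time = timeit.timeit(lambda: builtin_sum(values), number=20)
print(f"loop: {loop_time:.4f}s  built-in sum: {builtin_time:.4f}s")
```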

Performant Python uses libraries with compiled code to speed up execution.
Tools like NumPy and TensorFlow have dramatically increased the performance of Python.
They are fast enough that Python is now the industry standard for machine learning.


## Why are NumPy and TensorFlow fast?

NumPy isn't pure Python.
It uses `Fortran` and `C` extensions for performance-critical code.
If NumPy ran pure Python code, it would be much slower.


TensorFlow is a tool for creating neural networks. It uses highly optimized compiled code (written in C++) to run efficiently on either a CPU or a GPU. It is incredibly difficult to write code that trains neural networks faster than TensorFlow. MXNet claims to be faster, but it is another Python package built on compiled code.

With the Python API, it is relatively easy to implement deep networks in TensorFlow.

## Where does that leave us when we need custom fast code?

Python natively supports linking to other languages.
You can read the [docs here](https://wiki.python.org/moin/IntegratingPythonWithOtherLanguages).
Writing these extensions by hand is difficult.
The code that bridges Python types and native types is verbose and notoriously hard to debug.
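To give a feel for the type bridging involved, here is a minimal sketch using the standard library's `ctypes` module. This is only one of several ways to interface with native code, chosen here because it needs no compiler. It converts a Python list to a fixed-size C array and shows that `ctypes` silently wraps a value that does not fit in the target C type, which is the kind of bug that is hard to track down.

```python
import ctypes

python_values = [1, 2, 3]

# Build a C array type of 64-bit integers sized to the list, then fill it.
CInt64Array = ctypes.c_int64 * len(python_values)
c_values = CInt64Array(*python_values)

# Converting back gives ordinary Python integers again.
round_trip = list(c_values)

# ctypes does no overflow checking: 2**31 does not fit in a signed
# 32-bit integer, so the stored value wraps around to a negative number.
wrapped = ctypes.c_int32(2**31).value

print(round_trip, wrapped)
```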


## Cython

Instead of writing `C` extensions, we can use `Cython`.

> Cython is a superset of the Python programming language, designed to give C-like performance with code that is written mostly in Python. Cython is a compiled language that generates CPython extension modules.

Cython translates your code to `C`, which a `C` compiler then compiles into a binary extension module.
Cython also performs many checks at compile time, which helps prevent issues that are easy to overlook when writing `C` code by hand.

Cython can also call `C` code.
If there is a `C` library you want to use, declare its functions in your Cython file and call them directly.
We won't review this in our tutorial, but you can read about it [here](https://cython.readthedocs.io/en/latest/src/tutorial/clibraries.html).

Both `scikit-learn` and `spaCy` use `Cython` for performant Python code.

> Note: Jupyter compiles Cython cells automatically before running them. If you aren't running in Jupyter, there is a separate compile step before you can run the code.

### scikit-learn and Cython

scikit-learn recommends the following steps for writing performant code.


 1. Profile the Python implementation to find the main bottleneck and isolate it in a dedicated module level function. This function will be reimplemented as a compiled extension module.
 2. If there exists a well maintained BSD or MIT C/C++ implementation of the same algorithm that is not too big, you can write a Cython wrapper for it and include a copy of the source code of the library in the scikit-learn source tree: this strategy is used for the classes svm.LinearSVC, svm.SVC and linear_model.LogisticRegression (wrappers for liblinear and libsvm).
 3. Otherwise, write an optimized version of your Python function using Cython directly. This strategy is used for the linear_model.ElasticNet and linear_model.SGDClassifier classes for instance.
 4. Move the Python version of the function in the tests and use it to check that the results of the compiled extension are consistent with the gold standard, easy to debug Python version.
 5. Once the code is optimized (not simple bottleneck spottable by profiling), check whether it is possible to have coarse grained parallelism that is amenable to multi-processing by using the joblib.Parallel class.
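Step 1 of the list above can be sketched with the standard library's `cProfile` and `pstats` modules. The `slow_dot` and `pipeline` functions are hypothetical stand-ins for your own code, isolated at module level as the recommendation suggests.

```python
import cProfile
import io
import pstats


def slow_dot(data, weights):
    # Candidate bottleneck, isolated in a module-level function.
    total = 0
    for i in range(len(data)):
        total += data[i] * weights[i]
    return total


def pipeline(n=200_000):
    data = list(range(n))
    weights = list(range(n))
    return slow_dot(data, weights)


profiler = cProfile.Profile()
profiler.enable()
result = pipeline()
profiler.disable()

# Collect the five entries with the highest cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

In the report, the function with a large share of the cumulative time is the candidate to reimplement as a compiled extension; the pure Python version can then stay in the tests as the reference, as step 4 describes.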

## Numba

Numba is a JIT (just-in-time) compiler.
At run time, Numba uses the LLVM compiler to generate machine code with performance similar to `C` or `Fortran`.

Much like `Cython`, Numba can compile Python functions into `C` callbacks that you pass to `C` library functions that take a callback parameter.

It isn't as widely used as Cython, but because it's a JIT compiler that works on existing Python functions, it can be easier to refactor code for Numba than for Cython.
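A minimal sketch of what using Numba can look like, assuming the `numba` package is installed. The `sum_squares` function is a made-up example. The `try`/`except` fallback is not something Numba requires; it is only there so the sketch still runs (uncompiled) when Numba is missing.

```python
try:
    from numba import njit
except ImportError:
    # Assumption for this sketch only: without Numba, fall back to a
    # decorator that returns the plain Python function unchanged.
    def njit(func):
        return func


@njit
def sum_squares(n):
    # A plain Python loop; with Numba it is compiled to machine code
    # the first time the function is called.
    total = 0
    for i in range(n):
        total += i * i
    return total


first_call = sum_squares(1000)   # with Numba, this call includes compile time
second_call = sum_squares(1000)  # later calls reuse the compiled code
print(first_call, second_call)
```

Note that the function body is unchanged Python; the decorator is the only addition.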


## Example

We already learned that NumPy is much faster than pure Python for matrix math.
Let's see an example of writing a dot product in Cython and compare its speed to NumPy.

In the next section, [02_Cython_Language](02_Cython_Language.ipynb), we will review the basics of Cython syntax.
This example is just to show the performance gains from using Cython.

In [1]:
%load_ext Cython
array_len = 1000001

In [2]:
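%%cython
import cython

# Skip bounds checks on indexing; both arrays must have the same length.
@cython.boundscheck(False)
cpdef dot_product(long[:] data_array, long[:] weights_array):
    # Type the accumulator, index, and length so the loop compiles to plain C.
    cdef long total = 0
    cdef Py_ssize_t i
    cdef Py_ssize_t array_len = len(weights_array)
    for i in range(array_len):
        total += data_array[i] * weights_array[i]
    return total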

In [3]:
import array

# Create arrays (not lists) of C longs ('l') to match the long[:] parameters
data = array.array('l', range(array_len))
weights = array.array('l', range(array_len))

In [None]:
dot_product(data, weights)

In [None]:
%%timeit

dot_product(data, weights)

Let's compare the speed to NumPy.

In [None]:
import numpy as np

In [None]:
# Named data_array so it does not shadow the imported `array` module
data_array = np.arange(array_len, dtype=np.int64)
weights_array = np.arange(array_len, dtype=np.int64)

In [None]:
%%timeit

numpy_dot_product = np.dot(weights_array, data_array)
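For reference, here is a sketch of a pure Python baseline for the same computation, timed with the standard library's `timeit` module instead of the `%%timeit` magic so it also runs outside Jupyter. The name `python_dot_product` is made up for this example, and timings will vary by machine.

```python
import timeit

array_len = 1000001  # same length as in the cells above


def python_dot_product(data, weights):
    # Every multiplication and addition goes through the interpreter.
    total = 0
    for data_value, weight in zip(data, weights):
        total += data_value * weight
    return total


data_list = list(range(array_len))
weights_list = list(range(array_len))

baseline_result = python_dot_product(data_list, weights_list)

# Best of three single runs, in seconds.
best_time = min(
    timeit.repeat(
        lambda: python_dot_product(data_list, weights_list),
        number=1,
        repeat=3,
    )
)
print(f"pure Python: {best_time:.3f}s per call")
```

Comparing this number with the two `%%timeit` results gives a sense of how much the compiled versions gain.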

## Conclusion
We can see that NumPy and Cython are comparable in speed for the dot product.

NumPy is very fast for many matrix operations,
but like every other module, it only has specific functionality.

Cython, on the other hand, is general purpose.
We can implement any algorithm; we just have to write it out.

