In [2]:
%run ../../DataFiles_and_Notebooks/00_AdvancedPythonConcepts/talktools.py

# Speeding up scientific Python code using Cython

### [UC Berkeley AY 250 'Python Computing for Data Science'](https://github.com/profjsb/python-seminar)

materials prepared by [Paul Ivanov](http://pirsquared.org/blog) (2013-11-21); Stefan van der Walt; JBloom  

## Motivation 

<img src="sketse/lang_speed.png">

Cython allows us to cross the gap


## [Cython](http://docs.cython.org/)


Cython is good for two things:

1. Wrapping legacy code
2. Making your python code go faster

This is good news because
   - we get to keep coding in Python (or, at least, a superset)
   - but with the speed advantage of C
   - You can’t have your cake and eat it. Or can you?
   - Conditions / loops: approx. 2–8x speed increase, 30% overall; with type annotations: hundreds of times faster
 

# Let's take a step back...

Python is a C program (yes, there are other implementations, but CPython is the dominant one)

- you can write "C extensions" that can be imported as python modules

In [3]:
import math
math??

Deep introspection of `math` will not show us the source code, because it is actually implemented in C, compiled in a special way using the Python C API, and exposed to us (mere mortals) using a python interface.

In [4]:
import os
os.path??

`os.path`, on the other hand, is written in pure Python, so we can see the source code.
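The same distinction can be checked programmatically. The sketch below is one possible approach, not part of the original notebook: `has_python_source` is a hypothetical helper that simply asks `inspect.getsource` for the source and reports whether that worked.

```python
import inspect
import math
import os.path


def has_python_source(module):
    """Return True if `inspect` can retrieve Python source for `module`.

    Hypothetical helper for illustration: C extensions and built-in
    modules have no Python source, so `inspect.getsource` raises.
    """
    try:
        inspect.getsource(module)
        return True
    except (TypeError, OSError):
        # TypeError: built-in module; OSError: source not available
        return False


for mod in (math, os.path):
    print(mod.__name__, "has Python source:", has_python_source(mod))
```

On a typical CPython install this reports no source for `math`, which is what `math??` showed interactively.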

In [5]:
!cython --version

Cython version 0.25.1


## What's Cython?

*Cython 
is a programming language that makes writing C extensions for the Python language as easy as Python itself.*

Cython is a superset of Python (i.e. all Python programs are valid Cython programs).

Cython allows you to:

   1. get convenient handles on C libraries, objects, and functions for use in your Python code.
   2. "sprinkle in" type annotations into Python code to get speedups

### Use Cases 

 - Optimize execution of Python code (profile, if possible! – demo)
 - Wrap existing C and C++ code
 - Breaking out of the Global Interpreter Lock; OpenMP
 - Mixing C and Python, but without the pain of the Python C API

## Overview 

For this introduction, we’ll take the following approach:

   1. Take a piece of pure Python code and benchmark (we’ll find that it is too slow)
   2. Run the code through Cython, compile and benchmark (we’ll find that it is somewhat faster)
   3. Annotate the types and benchmark (we’ll find that it is much faster)

Then we’ll look at how Cython allows us to
  - Work with NumPy arrays
  - Use multiple threads from Python
  - Wrap native C libraries

## Benchmark Python Code

We want to approximate the integral:
$$
\int_a^b f(x) dx
$$

<img src="sketse/LeftRiemann2.png" width="50%">

Or, with a finer grid...

<img src="sketse/LeftRiemann.png" width="50%">
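Written out, the left Riemann sum sketched in the figures is the following (the symbol $\Delta x$ is introduced here for the rectangle width; the code in this section calls it `dx`):

$$
\int_a^b f(x)\,dx \;\approx\; \sum_{i=0}^{N-1} f(a + i\,\Delta x)\,\Delta x,
\qquad \Delta x = \frac{b - a}{N}
$$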


In [6]:
cd demos/integrate

/Users/jbloom/Classes/python-seminar/Lectures/11_Cython/demos/integrate


In [None]:
# %load integrate.py
from __future__ import division

def f(x):
    return x**4 - 3 * x

def integrate_f(a, b, N):
    """Rectangle integration of a function.

    Parameters
    ----------
    a, b : ints
        Interval over which to integrate.
    N : int
        Number of intervals to use in the discretisation.

    """
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

In [7]:
import integrate

In [10]:
integrate.integrate_f(1,10,10000)

19846.812869729973
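As a sanity check, the numerical result can be compared with the exact integral, since $x^4 - 3x$ has the antiderivative $x^5/5 - 3x^2/2$. The sketch below repeats `f` and `integrate_f` so that it is self-contained; `exact_integral` is a helper added here for illustration and is not part of `integrate.py`.

```python
def f(x):
    return x**4 - 3 * x


def integrate_f(a, b, N):
    """Left-rectangle integration, same algorithm as integrate.py."""
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


def exact_integral(a, b):
    """Exact integral of x**4 - 3*x via its antiderivative (illustrative helper)."""
    def F(x):
        return x**5 / 5 - 3 * x**2 / 2
    return F(b) - F(a)


exact = exact_integral(1, 10)
print("exact:", exact)
for N in (1000, 10000, 100000):
    approx = integrate_f(1, 10, N)
    # The left-rectangle rule is first order: 10x more intervals
    # should shrink the error by roughly 10x.
    print(N, approx, approx - exact)
```

The approximation sits below the exact value because the left-rectangle rule underestimates a function that is increasing over most of the interval.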

In [11]:
%timeit integrate.integrate_f(1,10,100000)

10 loops, best of 3: 53.1 ms per loop


Let's compile the code with Cython

```bash
cython filename.[py|pyx]
```

In [12]:
!cython integrate.py

In [13]:
!ls -lat |more

total 1328
drwxr-xr-x+ 24 jbloom  staff     816 Dec  2 14:36 .
-rw-r--r--+  1 jbloom  staff  126353 Dec  2 14:36 integrate.c
-rw-r--r--+  1 jbloom  staff     513 Nov 30 21:49 integrate_types.pyx
-rwxr-xr-x+  1 jbloom  staff   29096 Nov 30 21:48 integrate_types.cpython-35m-darwin.so
-rw-r--r--+  1 jbloom  staff  103246 Nov 30 21:48 integrate_types.c
-rw-r--r--+  1 jbloom  staff     268 Nov 30 21:47 setup_types.py
-rw-r--r--+  1 jbloom  staff     274 Nov 30 21:46 setup_types.py~
-rw-r--r--+  1 jbloom  staff   39879 Nov 30 21:23 integrate.html
drwxr-xr-x+  2 jbloom  staff      68 Nov 30 21:00 .ipynb_checkpoints
drwxr-xr-x+  3 jbloom  staff     102 Nov 30 20:54 __pycache__
-rw-r--r--+  1 jbloom  staff  104313 Nov 30 20:53 integrate_compiled.c
-rw-r--r--+  1 jbloom  staff     406 Nov 30 20:53 integrate_compiled.pyx
-rw-r--r--+  1 jbloom  staff     274 Nov 30 20:53 setup.py
drwxr-xr-x+  9 jbloom  staff     306 Nov 29 18:19 ..
-rw-r--r--+  1 jbloom  staff     242 Nov 29 18:16 setup.py~
-rw-

What is happening behind the scenes? 

Cython translates Python to C, using the Python C API (let’s have a look)
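To get a feel for what "using the Python C API" means, the sketch below calls one C API function, `PyNumber_Add`, directly from Python through `ctypes`. This is only an illustration and assumes a CPython interpreter; it is not how Cython itself works internally, but the C code Cython generates for untyped Python objects typically contains calls of this kind.

```python
import ctypes

# PyNumber_Add(o1, o2) is the C API equivalent of the Python expression o1 + o2.
# ctypes.pythonapi exposes the C API of the running CPython interpreter.
py_add = ctypes.pythonapi.PyNumber_Add
py_add.argtypes = [ctypes.py_object, ctypes.py_object]
py_add.restype = ctypes.py_object  # the function returns a new reference

# The same C function handles any pair of objects; the type-specific
# behaviour is looked up at run time.
print(py_add(2, 3))
print(py_add("ab", "cd"))
```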

In [14]:
!cat integrate.c

/* Generated by Cython 0.25.1 */

#define PY_SSIZE_T_CLEAN
#include "Python.h"
#ifndef Py_PYTHON_H
    #error Python headers needed to compile C extensions, please install development version of Python.
#elif PY_VERSION_HEX < 0x02060000 || (0x03000000 <= PY_VERSION_HEX && PY_VERSION_HEX < 0x03020000)
    #error Cython requires Python 2.6+ or Python 3.2+.
#else
#define CYTHON_ABI "0_25_1"
#include <stddef.h>
#ifndef offsetof
  #define offsetof(type, member) ( (size_t) & ((type*)0) -> member )
#endif
#if !defined(WIN32) && !defined(MS_WINDOWS)
  #ifndef __stdcall
    #define __stdcall
  #endif
  #ifndef __cdecl
    #define __cdecl
  #endif
  #ifndef __fastcall
    #define __fastcall
  #endif
#endif
#ifndef DL_IMPORT
  #define DL_IMPORT(t) t
#endif
#ifndef DL_EXPORT
  #define DL_EXPORT(t) t
#endif
#ifndef HAVE_LONG_LONG
  #if PY_VERSION_HEX >= 0x03030000 || (PY_MAJOR_VERSION == 2 && PY_VERSION_HEX >= 0x02070000)
    #define HAVE_LONG_LONG
  #endif
#endif

## Three ways of making use of Cython

### 1. Compiling

This C code can be compiled directly (e.g. using gcc), provided we pass the right include and library paths. A `setup.py` file can take care of that for us.

I copied `integrate.py` to `integrate_compiled.pyx` just to make a distinction between the compiled version and the pure python version.

In [16]:
%%writefile setup.py
from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension

setup(
    ext_modules=cythonize(
       [Extension("integrate_compiled", ["integrate_compiled.pyx"])], 
        compiler_directives={'language_level': 3}
    )
)

Overwriting setup.py


In [17]:
!python setup.py build_ext --inplace

  "Cython.Distutils.old_build_ext does not properly handle dependencies "
running build_ext
building 'integrate_compiled' extension
/usr/bin/clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/jbloom/anaconda/envs/seminar/include -arch x86_64 -I/Users/jbloom/anaconda/envs/seminar/include/python3.5m -c integrate_compiled.c -o build/temp.macosx-10.6-x86_64-3.5/integrate_compiled.o
      [-Wunused-function]
static CYTHON_INLINE char* __Pyx_PyObject_AsString(PyObject* o) {
                           ^
      '__Pyx_PyUnicode_FromString' [-Wunused-function]
static CYTHON_INLINE PyObject* __Pyx_PyUnicode_FromString(const char* c_str) {
                               ^
      [-Wunused-function]
static CYTHON_INLINE int __Pyx_PyObject_IsTrue(PyObject* x) {
                         ^
      [-Wunused-function]
static CYTHON_INLINE Py_ssize_t __Pyx_PyIndex_AsSsize_t(PyObject* b) {

In [18]:
!ls integrate_compiled*

integrate_compiled.c
integrate_compiled.cpython-35m-darwin.so
integrate_compiled.pyx


In [19]:
import integrate_compiled
integrate_compiled.integrate_f(1,10,10000)

19846.812869729973

In [20]:
%timeit integrate_compiled.integrate_f(1,10,10000)

100 loops, best of 3: 3.84 ms per loop


In [21]:
!rm integrate_compiled.*.so

In [22]:
cd demos/integrate/

[Errno 2] No such file or directory: 'demos/integrate/'
/Users/jbloom/Classes/python-seminar/Lectures/11_Cython/demos/integrate


### 2. pyximport

A little bit magical: `pyximport` finds and compiles `.pyx` files the first time they are imported, if compilation is necessary.

In [23]:
import pyximport
pyximport.install()

import integrate_compiled

In [24]:
integrate_compiled.integrate_f(1,10,10000)

19846.812869729973

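Benchmark the new code with the same `N` as the pure Python timing (the 3.84 ms figure above used `N = 10000`, so it is not directly comparable):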

In [25]:
%timeit integrate_compiled.integrate_f(1,10,100000)

10 loops, best of 3: 37.4 ms per loop


In [26]:
import integrate

In [27]:
%timeit integrate.integrate_f(1,10,100000)

10 loops, best of 3: 52.9 ms per loop


**caveat emptor!**  C extensions **cannot** be reloaded in Python, see [this bug marked as WON'T FIX](http://bugs.python.org/issue1144263), so you have to either *restart the kernel* (i.e. quit python and start it again), or you could use the `%cython` magic...

### 3. %cython magic

The most magical of all

In [28]:
%load_ext Cython

In [29]:
%%cython

from __future__ import division

def f(x):
    return x**4 - 3 * x

def inline_integrate_f(a, b, N):
    """Rectangle integration of a function.

    Parameters
    ----------
    a, b : ints
        Interval over which to integrate.
    N : int
        Number of intervals to use in the discretisation.

    """
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

In [30]:
inline_integrate_f(1,10,1000)

19806.452972998024

A slight speed increase (≈ 1.4×, from 52.9 ms to 37.4 ms): probably not worth it.

- Can we help Cython to do even better?
  - Yes—by giving it some clues.
  - Cython has a basic type inferencing engine, but it is very conservative for safety reasons.
- Why does type information allow such vast speed increases?
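One way to approach that last question is to look at what the interpreter has to handle when it knows nothing about the types. The sketch below is an illustration added here, not part of the original notebook: the same `add` function must work for ints, floats, strings, and lists, so the meaning of `+` is looked up at run time on every call, and even a plain float is a full heap-allocated object.

```python
import sys


def add(a, b):
    # The interpreter cannot know in advance what `+` means here;
    # it inspects the operand types each time the line runs.
    return a + b


for args in [(1, 2), (1.5, 2.5), ("ab", "cd"), ([1], [2])]:
    print(type(args[0]).__name__, "->", add(*args))

# A Python float is a boxed object carrying a type pointer and a
# reference count, so it occupies more memory than the 8 bytes
# of a C double.
print("size of a Python float in bytes:", sys.getsizeof(1.0))
```

With a declaration such as `cdef double`, Cython can skip the run-time lookup and the boxing and emit a plain C addition instead.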


# Python is great, but slow at some things...

![Dinner with Guido van Rossum, 2009](http://pirsquared.org/bocanova/vga/vga_dinner_final.jpg)

Slow things worth knowing about:

1. for loops
2. function calls
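The cost of function calls can be measured directly in pure Python. The sketch below is a rough illustration (the function names `loop_with_call` and `loop_inlined` are made up for this example): both loops compute the same sum, but one calls `f` on every iteration while the other inlines the expression. Actual timings depend on the machine and Python version, so none are quoted here.

```python
import timeit


def f(x):
    return x**4 - 3 * x


def loop_with_call(n):
    s = 0.0
    for i in range(n):
        s += f(i)            # one Python function call per iteration
    return s


def loop_inlined(n):
    s = 0.0
    for i in range(n):
        s += i**4 - 3 * i    # same arithmetic, no function call
    return s


n = 100_000
t_call = timeit.timeit(lambda: loop_with_call(n), number=5)
t_inline = timeit.timeit(lambda: loop_inlined(n), number=5)
print("with call: %.3f s, inlined: %.3f s" % (t_call, t_inline))
```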

## Making your code go faster

### "Premature optimization is the root of all evil" -- *Don Knuth*

0. Use version control and have a test suite
0. Profile code (don't just add type declarations everywhere)
0. Sprinkle in optimizations (refactoring code as necessary)
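As a minimal sketch of the profiling step, the standard-library `cProfile` module can be pointed at the pure Python integrator from earlier (repeated here so that the example is self-contained). This is one possible workflow, not the only one.

```python
import cProfile
import pstats


def f(x):
    return x**4 - 3 * x


def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


profiler = cProfile.Profile()
profiler.enable()
integrate_f(1, 10, 100000)
profiler.disable()

# Sort by cumulative time and show the top entries.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)
```

The report shows `f` being called once per interval, which is a hint that both the loop and the call to `f` are worth optimizing.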

Let's tell Cython about the types...

In [31]:
%%writefile integrate_types.pyx
from __future__ import division

def f(double x):
    return x**4 - 3 * x

def types_integrate_f(double a, double b, int N):
    """Rectangle integration of a function.

    Parameters
    ----------
    a, b : float
        Interval over which to integrate.
    N : int
        Number of intervals to use in the discretisation.

    """
    cdef:
        double s = 0
        double dx = (b - a) / N
        int i

    for i in range(N):
        s += f(a + i * dx)
    return s * dx

Overwriting integrate_types.pyx


In [32]:
%%cython

from __future__ import division

def f(double x):
    return x**4 - 3 * x

def types_integrate_f(double a, double b, int N):
    """Rectangle integration of a function.

    Parameters
    ----------
    a, b : float
        Interval over which to integrate.
    N : int
        Number of intervals to use in the discretisation.

    """
    cdef:
        double s = 0
        double dx = (b - a) / N
        int i

    for i in range(N):
        s += f(a + i * dx)
    return s * dx

In [33]:
%timeit types_integrate_f(1,10,100000)

100 loops, best of 3: 9.93 ms per loop


In [34]:
%timeit integrate_compiled.integrate_f(1,10,100000)

10 loops, best of 3: 37.2 ms per loop


In [35]:
%timeit integrate.integrate_f(1,10,100000)

10 loops, best of 3: 54.6 ms per loop


There are even more bottlenecks:

<img src="sketse/code_flow_python_vs_C.png" width="70%">

Each call to `f` still goes through the Python function-call machinery, so we need to define `f(x)` as a C function.

In [None]:
# %load integrate_cy.pyx
# cython: cdivision=True

# ^^^ Could also use @cython.cdivision(True) decorator

cdef double f(double x):
    return x*x*x*x - 3 * x

def integrate_f(double a, double b, int N):
    cdef:
        double s = 0
        double dx = (b - a) / N
        size_t i

    for i in range(N):
        s += f(a + i * dx)

    return s * dx


In [36]:
%%cython
# cython: cdivision=True

# ^^^ Could also use @cython.cdivision(True) decorator

cdef double f(double x):
    return x*x*x*x - 3 * x

def cy_integrate_f(double a, double b, int N):
    cdef:
        double s = 0
        double dx = (b - a) / N
        size_t i

    for i in range(N):
        s += f(a + i * dx)

    return s * dx

In [37]:
%timeit cy_integrate_f(1,10,100000)

1000 loops, best of 3: 259 µs per loop


In [38]:
%timeit integrate.integrate_f(1,10,100000)

10 loops, best of 3: 52.9 ms per loop


In [39]:
52.9/0.259

204.24710424710423

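Cython can also annotate the source to show where it falls back to the Python C API, which gives us a nice way to find bottlenecks. Use the `-a` flag (`%%cython -a` in the notebook, `cython -a` on the command line).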

In [40]:
%load_ext Cython

The Cython extension is already loaded. To reload it, use:
  %reload_ext Cython


In [41]:
%%cython -a
def omg(c):
    wtf = 0
    for a in range(c):
        wtf += a
        
    return wtf

def lol(x):
    for i in range(x):
        omg(x*2)

The color shows how much Python C API code was generated for each line of Python: the darker the yellow, the more interaction with the interpreter. Those lines will generally take longer to run, but we need not eliminate every unoptimized line to get good performance. We'll talk about profiling soon to help us find which lines are worth the effort.

In [42]:
%timeit lol(5000)

1 loop, best of 3: 2.6 s per loop


In [45]:
%%cython -a
cdef int omg(int c):
    cdef int wtf = 0
    cdef int a
    for a in range(c):
        wtf += a
        
    return wtf

# def (not cdef) so that lola can be called from Python, e.g. by %timeit
def lola(int x):
    cdef int i
    for i in range(x):
        omg(x*2)

In [47]:
%timeit lola(5000)

10000 loops, best of 3: 108 µs per loop


### Exercise

1. Here's a Python function that performs an elementwise computation:

   $y_i = x_i^3 - 4x_i + 2$
   
2. Write a Cython equivalent. 
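The pure Python starting point is not reproduced in these notes. A minimal reference version might look like the sketch below; the name `poly_python` is made up here, and the function accepts any indexable sequence of numbers (a list or a 1-D NumPy array).

```python
def poly_python(x):
    """Elementwise y_i = x_i**3 - 4*x_i + 2 using an explicit loop."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = x[i] ** 3 - 4 * x[i] + 2
    return out


print(poly_python([0.0, 1.0, 2.0]))  # prints [2.0, -1.0, 2.0]
```

A reference like this is also handy for checking that a Cython version returns the same values.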

In [None]:
%%cython
import numpy as np
cdef my_poly(double [:] x):
    out = np.zeros_like(x)
    L = x.shape[0]

    for i in range(L):
        out[i] = x[i] * x[i] * x[i] - 4 * x[i] + 2

    return np.asarray(out)

def other_poly(x):
    return my_poly(x)

In [None]:
import numpy as np
x = np.linspace(-10,10,100000)
# my_poly is a cdef function and is not visible from Python; time the def wrapper instead
%timeit other_poly(x)

In [None]:
%%cython -a
import numpy as np

def blah(float y):
    return y * y * y - 4 * y + 2
    
def my_poly(x):
    out = np.zeros_like(x)
    L = x.shape[0]

    for i in range(L):
        out[i] = blah(x[i])

    return np.asarray(out)

In [None]:
import numpy as np
x = np.linspace(-10,10,100000)
%timeit my_poly(x)


## Exercise

1. With a partner, have one of you log into GitHub and fork [this gist](http://git.io/Pq37pQ).

1. Modify it together so that all functions perform some computation, but the whole thing doesn't take too long to run (1-5 seconds). 

   * You can use numpy in some of them, if you'd like, but be sure to include some for loops, as well. 
   * The template is there just to get you started, feel free to modify the parameters, or add new functions as you see fit.

1. Update your gist with this new code, and hand it off to another pair (who will give you theirs).

1. Fork and clone the other pair's program.

1. Now, without looking at the code, profile this code to identify the bottleneck. 

1. Put a comment next to the function that eats up the most time. 

1. Time permitting, numpy-fy and Cython-ize the offending function

1. Time permitting, change the parameters used in main to make a *different* function the bottleneck.



### Now you know enough about how to wrap C code

See also [Calling C functions](http://docs.cython.org/src/tutorial/external.html) and [Using C Libraries](http://docs.cython.org/src/tutorial/clibraries.html) in the Cython documentation.

### What about Fortran?

[Export the Fortran code to C](http://fortran90.org/src/best-practices.html#interfacing-with-c) and then [tell Cython to pretend it's calling C code](http://fortran90.org/src/best-practices.html#interfacing-with-python):

*Notice that we didn’t write any C code — we only told fortran to use the C calling convention when producing the ".o" files, and then we pretended in Cython, that the function is implemented in C, but in fact, it is linked in from Fortran directly*

# References:

[Cython Profiling Tutorial](http://docs.cython.org/src/tutorial/profiling_tutorial.html) : from Cython's official docs
