# Using Cython to get the speed of C in Python

In the earlier "Optimization" notebook we have demonstrated a number of techniques for optimizing python code, including using numpy to perform operations on arrays of input. We showed that with these techniques one can observe a significant speed up in the total run time of the program.

However, there are some instances in which numpy operations cannot be applied directly. Maybe there is no way to write what you are trying to do in terms of numpy operations. In these cases Cython can be used.

Cython code can be used to supplement python code. Cython code (like C or Fortran code) needs to be compiled before it can be run, but python's support for Cython will allow this to happen automatically. Cython code appears very similar to python code but a few different rules apply:

 * You will have to write out all for loops explicitly. There are no numpy-like operations that act on an entire array at once.
 * To achieve optimal performance you need to think in terms of what the underlying C code, into which your Cython is automatically translated, will do. This is difficult without a whole other class dedicated to that language (we'll leave that until the 4th year!). However, the important things to consider are the need to declare the types of variables (is this variable going to store integers, floating point numbers, complex numbers, ...?) and the fact that creating a new storage array takes a little time, so reuse arrays (reusing memory) when possible.
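The advice about reusing memory applies in plain numpy as well. As a minimal sketch (the names `cos_into` and `work_buffer` are made up for illustration), numpy's `out=` argument writes a result into an array that already exists instead of allocating a new one:

```python
import numpy

def cos_into(values, work_buffer):
    """Write cos(values) into work_buffer and return it (no new array is created)."""
    numpy.cos(values, out=work_buffer)
    return work_buffer

values = numpy.linspace(0.0, 1.0, 5)
work_buffer = numpy.empty_like(values)  # allocated once, can be reused on every call
result = cos_into(values, work_buffer)
print(result is work_buffer)
```

The same idea carries over to Cython: allocate the output array once and fill it inside the loop.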

But let's begin at the start. How do we use Cython within a Jupyter notebook?

## Before we start

We need to enable Cython within our notebook. This is done with

In [2]:
%load_ext cython

## Example: Computing cos(x) on an array

In the optimization lecture we began with profiling a set of functions to compute an integral. Let's start with one of those functions here. Here we write a function to compute the cos(x) for a timeseries. We also provide a numpy optimized function for computing cos(x) as we showed last week.

In [3]:
import numpy

def generate_time_series(tmin, tmax, delta_t):
    """
    Generates a times series between tmin and tmax sampled at delta_t
    """
    tseries = numpy.arange(tmin, tmax, delta_t)
    # We shift tseries by delta_t / 2 to ensure that we are using the midpoint rule (see wikipedia page)
    tseries = tseries + delta_t / 2.
    return tseries


def compute_cosx(tseries):
    """
    Computes cos(t) for all values in tseries
    """
    cosx = numpy.zeros(len(tseries))
    for idx, tval in enumerate(tseries):
        cosx[idx] = numpy.cos(tseries[idx])
    return cosx

def compute_cosx_numpy(tseries):
    """
    Computes cos(t) for all values in tseries
    """
    return numpy.cos(tseries)

Remember that the numpy version is considerably faster than the non-vectorized version.

In [4]:
tseries = generate_time_series(1., 1000., 1./100.)
%timeit compute_cosx(tseries)
%timeit compute_cosx_numpy(tseries)

31.7 ms ± 385 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
223 µs ± 722 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
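Before comparing speeds it is worth confirming that the two versions give the same answer. A small self-contained check along these lines would do; `loop_cos` and `tseries_small` are illustrative names, and the input is kept short so the check runs quickly:

```python
import numpy

def loop_cos(tseries):
    """Element-by-element cos, mirroring the non-vectorized function above."""
    out = numpy.zeros(len(tseries))
    for idx in range(len(tseries)):
        out[idx] = numpy.cos(tseries[idx])
    return out

# Midpoint-shifted samples, as in generate_time_series
tseries_small = numpy.arange(1.0, 10.0, 0.01) + 0.005
agree = numpy.allclose(loop_cos(tseries_small), numpy.cos(tseries_small))
print(agree)
```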


Now let's try to do this with cython. As a first cut we can just pass the code through the Cython compiler with no changes. In jupyter we begin a cell with `%%cython` to achieve this, and add the `-a` option to give some useful output. **NOTE** The cython cell does not have access to any imports made in any other cell, so you will need to import any modules in here again.

In [5]:
%%cython -a
import math, numpy
def compute_cosx_cython(tseries):
    """
    Computes cos(t) for all values in tseries
    """
    cosx = numpy.zeros(len(tseries))
    for idx, tval in enumerate(tseries):
        cosx[idx] = math.cos(tseries[idx])
    return cosx

def compute_cosx_numpy_cython(tseries):
    """
    Computes cos(t) for all values in tseries
    """
    return numpy.cos(tseries)

When this compiles it will give you a sense of how fast this will be. Lines that are dark yellow will not be particularly fast. You do *not* want time-critical lines to be yellow. However, it is nice to be able to write essentially *unedited* python code in this way. Our slow unoptimized version of the code is *an order of magnitude* faster when called in this way ... and it took no real effort on our part to do that! (Note that this version calls `math.cos` rather than `numpy.cos`, which is itself quicker for single values, so not all of the gain comes from Cython.) Though the numpy code is still much faster ... for now!

In [6]:
%timeit compute_cosx_cython(tseries)
%timeit compute_cosx_numpy_cython(tseries)


How can we go about making this faster? Let's focus on the first function, with the for loop. The second version uses a numpy function and there's not much point trying to optimize that. There are a few things we can do to optimize this:

* We use C's math library to call cos/sin. If we call math.cos or math.sin we're back in python code, and we must avoid that to be fast!
* We declare the type of all variables. This is done using `cdef` followed by the type of variable, followed by its name. So `cdef int idx` says that idx is going to be an integer.
* Inputs to the function are declared similarly. Adding `[::1]` declares a one-dimensional typed memoryview that is contiguous in memory, which is what a standard numpy array (or other array-like object in the standard configuration) provides. So `double [::1]` can be read as a contiguous numpy array of floats. (Python floats are 64-bit precision by default, which is called `double` in C.)

Declaring variable types explicitly makes the function a little less flexible but allows the compiled code to be significantly faster as it knows more precisely ahead of time what it will be asked to do! Note that I also declare the length of the array, and the idx used in the for loop before starting the for loop.

In [None]:
%%cython -a
import numpy

from libc.math cimport cos # This imports C's cos function from the math library

def compute_cosx_cython(double [::1] timeseries):
    """
    Computes cos(t) for all values in tseries
    """
    cdef int n = timeseries.size # How many values in the timeseries
    cdef int idx
    cdef double[::1] cosx = numpy.zeros(n) # Create an array to store the cos(x) values
    for idx in range(n):
        cosx[idx] = cos(timeseries[idx])
    return cosx


In [None]:
%timeit compute_cosx_cython(tseries)
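One consequence of declaring the input as `double [::1]` is that the function becomes pickier about what it accepts: an array that is not contiguous in memory, or not of 64-bit float type, will typically be rejected with an error. The sketch below (plain numpy; `full`, `every_other` and `fixed` are illustrative names) shows how to check an array and how to convert it if needed:

```python
import numpy

full = numpy.arange(10, dtype=numpy.float64)
every_other = full[::2]  # a strided view: its elements are not adjacent in memory

print(full.flags['C_CONTIGUOUS'])
print(every_other.flags['C_CONTIGUOUS'])

# Copy into a fresh, contiguous float64 array that a `double [::1]` argument can accept
fixed = numpy.ascontiguousarray(every_other, dtype=numpy.float64)
print(fixed.flags['C_CONTIGUOUS'], fixed.dtype)
```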

Finally, we can add a few options at the top of the function to turn off some python sanity checks, which are useful, but can slow code down. In this case these don't help much, but they can be useful in other cases. Note that turning these off can cause your code to fail in weird ways with no reason given for the failure (what's called a "segmentation fault"). If this happens, remove these options so that the checks run again, and see if you get a warning/error about things being wrong!
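For reference, this is what two of those checks do in ordinary Python. The snippet is only an illustration (`arr` is a made-up array): negative indices wrap around to the end of the array, and an index past the end raises an error rather than reading whatever happens to be in memory.

```python
import numpy

arr = numpy.array([10.0, 20.0, 30.0])

# Wraparound: a negative index counts back from the end of the array.
last = arr[-1]

# Bounds check: an index past the end raises IndexError.
try:
    arr[3]
    out_of_bounds_caught = False
except IndexError:
    out_of_bounds_caught = True

print(last, out_of_bounds_caught)
```

With `boundscheck(False)` and `wraparound(False)` the compiled code skips both behaviours, so an index such as `-1` or `3` here would no longer be handled safely.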

In [None]:
%%cython -a
import numpy
from cython import wraparound, boundscheck, cdivision

from libc.math cimport cos # This imports C's cos function from the math library

@boundscheck(False)
@wraparound(False)
@cdivision(True)
def compute_cosx_cython(double [::1] timeseries):
    """
    Computes cos(t) for all values in tseries
    """
    cdef int n = timeseries.size # How many values in the timeseries
    cdef int idx
    cdef double[::1] cosx = numpy.zeros(n) # Create an array to store the cos(x) values
    for idx in range(n):
        cosx[idx] = cos(timeseries[idx])
    return cosx


In [None]:
%timeit compute_cosx_cython(tseries)

This is now about as fast as the numpy function. Don't forget that numpy is itself compiled C-code, so it's often hard to beat that. Indeed in this case just using the numpy function would be the best choice. However, the point is that if there *wasn't* a numpy cos function, you would be able to achieve a function that's basically as fast using cython.

It is worth emphasizing though that making cython fast is also not trivial but the big things to change with respect to python code are illustrated here:

 * Declare variable types (arrays especially can be complicated here ... I did use numpy.zeros here to create the output array, and I recommend doing this to avoid memory management in C.).
 * Use built-in C functions for things like cos or sin or exp.

## Example 2

As in the optimization class, here's an example code which integrates

$\cos(x) \times \frac{1}{x}$

from $x = 1$ to $x=1000$.

We will do this by using the simple rectangular method for numerically integrating. https://en.wikipedia.org/wiki/Riemann_sum
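For orientation, the midpoint version of this rule approximates the integral as $\Delta t \sum_i f(t_i + \Delta t/2)$, where the $t_i$ run from the lower limit in steps of $\Delta t$. A compact numpy sketch of the whole calculation is given here as a reference value to compare against; `midpoint_integral` is an illustrative name, not a function from the optimization notebook:

```python
import numpy

def midpoint_integral(tmin, tmax, delta_t):
    """Midpoint-rule estimate of the integral of cos(x)/x between tmin and tmax."""
    midpoints = numpy.arange(tmin, tmax, delta_t) + delta_t / 2.0
    return numpy.sum(numpy.cos(midpoints) / midpoints) * delta_t

coarse = midpoint_integral(1.0, 1000.0, 1.0 / 300.0)
fine = midpoint_integral(1.0, 1000.0, 1.0 / 3000.0)
print(coarse, fine)
```

If the step size is small enough, shrinking it further should barely change the answer, which is a quick sanity check on any of the implementations below.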

In [None]:
# NOTE: We write this as a set of functions. Functions are better to isolate different parts of
#       the code and to be able to check each component individually. A `class` would also be a
#       good way of doing this.
import numpy
import math

def compute_cosx(tseries):
    """
    Computes cos(t) for all values in tseries
    """
    cosx = numpy.zeros(len(tseries))
    for idx, tval in enumerate(tseries):
        cosx[idx] = math.cos(tseries[idx])
    return cosx

def compute_invx(tseries):
    """
    Computes 1/x for all values in tseries
    """
    invx = numpy.zeros(len(tseries))
    for idx, tval in enumerate(tseries):
        invx[idx] = 1 / tseries[idx]
    return invx

def compute_seriesproduct(series1, series2):
    """
    Multiplies each element in series1 with the corresponding element in series2.
    This returns an array of the multiplied elements.
    """
    # Ensure the two arrays are the same length
    assert(len(series1)==len(series2))
    seriessum = numpy.zeros(len(series1))
    for idx in range(len(series1)):
        seriessum[idx] = series1[idx] * series2[idx]
    return seriessum

def compute_seriessum(series):
    """
    Computes the sum of all values in series
    """
    sumvals = 0
    for idx in range(len(series)):
        sumvals = sumvals + series[idx]
    return sumvals


class Integrator():
    def generate_integral(self):
        """
        Integral function goes here
        """
        cosx = compute_cosx(self.tseries)
        invx = compute_invx(self.tseries)
        prod = compute_seriesproduct(cosx, invx)
        summed_prod = compute_seriessum(prod)
        return summed_prod * self.delta_t


    def __init__(self, tmin, tmax, delta_t):
        """
        Initializes the class and timeseries
        """
        self.tmin = tmin
        self.tmax = tmax
        self.delta_t = delta_t
        tseries = numpy.arange(self.tmin, self.tmax, self.delta_t)
        # We shift tseries by delta_t / 2 to ensure that we are using the midpoint rule (see wikipedia page)
        tseries = tseries + self.delta_t / 2.
        self.tseries = tseries


def main_function():
    intgr = Integrator(1, 1000, 1./300.)
    return intgr.generate_integral()

print (main_function())

In [None]:
%timeit main_function()

From before we already know that `compute_cosx`, `compute_invx`, `compute_seriesproduct` and `compute_seriessum` are the slow parts of this function. Let's rewrite these in cython

In [None]:
%%cython -a
import math
import numpy

from libc.math cimport cos # This imports C's cos function from the math library
from cython import wraparound, boundscheck, cdivision

@boundscheck(False)
@wraparound(False)
@cdivision(True)
def compute_cosx_cython(double [::1] timeseries):
    """
    Computes cos(t) for all values in tseries
    """
    cdef int n = timeseries.size # How many values in the timeseries
    cdef int idx
    cdef double[::1] cosx = numpy.zeros(n) # Create an array to store the cos(x) values
    for idx in range(n):
        cosx[idx] = cos(timeseries[idx])
    return cosx

@boundscheck(False)
@wraparound(False)
@cdivision(True)
def compute_invx_cython(double [::1] timeseries):
    """
    Computes 1/x for all values in tseries
    """
    cdef int idx
    cdef int n = timeseries.size # How many values in the timeseries
    cdef double[::1] invx = numpy.zeros(n) # Create an array to store the 1/x values
    for idx in range(n):
        invx[idx] = 1. / timeseries[idx]
    return invx

@boundscheck(False)
@wraparound(False)
@cdivision(True)
def compute_seriesproduct_cython(double [::1] series1, double [::1] series2):
    """
    Multiplies each element in series1 with the corresponding element in series2.
    This returns an array of the multiplied elements.
    """
    cdef int idx
    cdef int n = series1.size # How many values in the timeseries
    cdef double[::1] seriesprod = numpy.zeros(n)
    for idx in range(n):
        seriesprod[idx] = series1[idx] * series2[idx]
    return seriesprod

@boundscheck(False)
@wraparound(False)
@cdivision(True)
def compute_seriessum_cython(double [::1] series):
    """
    Computes the sum of all values in series
    """
    cdef int idx
    cdef int n = series.size
    cdef double sumvals = 0.
    for idx in range(n):
        sumvals += series[idx]
    return sumvals



In [None]:
class Integrator():
    def generate_integral(self, cython=False):
        """
        Integral function goes here
        """
        if cython:
            cosx = compute_cosx_cython(self.tseries)
            invx = compute_invx_cython(self.tseries)
            prod = compute_seriesproduct_cython(cosx, invx)
            summed_prod = compute_seriessum_cython(prod)
        else:
            cosx = compute_cosx(self.tseries)
            invx = compute_invx(self.tseries)
            prod = compute_seriesproduct(cosx, invx)
            summed_prod = compute_seriessum(prod)
        return summed_prod * self.delta_t


    def __init__(self, tmin, tmax, delta_t):
        """
        Initializes the class and timeseries
        """
        self.tmin = tmin
        self.tmax = tmax
        self.delta_t = delta_t
        tseries = numpy.arange(self.tmin, self.tmax, self.delta_t)
        # We shift tseries by delta_t / 2 to ensure that we are using the midpoint rule (see wikipedia page)
        tseries = tseries + self.delta_t / 2.
        self.tseries = tseries


def main_function(cython=True):
    intgr = Integrator(1, 1000, 1./300.)
    return intgr.generate_integral(cython=cython)

print (main_function())

In [None]:
%timeit main_function(cython=True)
%prun -l 10 -q -T prun0 main_function(cython=True)
print(open('prun0', 'r').read())
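The `%prun` magic only exists inside IPython/Jupyter. In a plain Python script roughly the same report can be produced with the standard library's `cProfile` and `pstats` modules. A minimal sketch, where `slow_sum` is a stand-in for whatever function you want to profile:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """A deliberately loop-heavy function to give the profiler something to measure."""
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100000)
profiler.disable()

# Collect the ten most expensive entries, sorted by cumulative time, into a string
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```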


We now have a code that is actually *faster* than our numpy optimized code from before. Nevertheless, given the added complexity of writing this, and the fact the speed differential will be less noticeable for larger arrays, in most cases the numpy code is more than good enough.

To remind again:

 * The use-case of Cython is primarily to optimize code for which there is no numpy optimized version.
 * Writing (fast) Cython code does require more effort (and more Googling) than writing python code. However, a moderate speed increase can sometimes be achieved without this.
 * Our Cython magic function is converting our code written in Cython into C and then compiling it. It is possible to write the C-code directly, but that is more effort. It is also possible to approach this from the other side: write pure C or C++ code, and then use cython to interface directly with the C code (this can even be possible with fortran). This can be used if you have a pre-existing C-code, or library, that you want to use in python.

## Exercise 1

As with the first optimization lecture, here are four functions written using python. Rewrite these functions using Cython. Compare the speed of each function to the first optimization notebook, where these functions were replaced with numpy calls.

### Exercise 1.1

Rewrite this function to compute the sin of an input timeseries

In [None]:
# EXERCISE 1.1 - Our sin series from before
import numpy, math

def compute_sin_tseries(timeseries):
    sin_tseries = numpy.zeros(len(timeseries))
    for i in range(len(timeseries)):
        sin_tseries[i] = numpy.sin(timeseries[i])
    return sin_tseries

# Here is the optimized numpy solution for reference
def compute_sin_tseries_numpy(timeseries):
    return numpy.sin(timeseries)



In [None]:
%%cython -a
import numpy
from cython import wraparound, boundscheck
from libc.math cimport sin # This imports c's sin function from the math library

@boundscheck(False)
@wraparound(False)
def compute_sinx_cython(double [::1] timeseries):
    # You'll need to fill this in. The cos example is probably a good starting place!


In [None]:
# Compare time taken to run each of these functions here.
def generate_time_series(tmin, tmax, delta_t):
    """
    Generates a times series between tmin and tmax sampled at delta_t
    """
    tseries = numpy.arange(tmin, tmax, delta_t)
    # We shift tseries by delta_t / 2 to ensure that we are using the midpoint rule (see wikipedia page)
    tseries = tseries + delta_t / 2.
    return tseries

tseries = generate_time_series(1., 1000., 1./100.)
%timeit compute_sin_tseries(tseries)
%timeit compute_sin_tseries_numpy(tseries)
%timeit compute_sinx_cython(tseries)

### Exercise 1.2

Rewrite this function to compute the exponential of an input time series.

In [None]:
# EXERCISE 1.2
import math, numpy

def compute_exp_tseries(timeseries):
    exp_tseries = numpy.zeros(len(timeseries))
    for i in range(len(timeseries)):
        exp_tseries[i] = math.e ** timeseries[i]
    return exp_tseries

# Numpy optimized function from before
def compute_exp_tseries_numpy(timeseries):
    return numpy.exp(timeseries)

In [None]:
%%cython -a
# Cython goes here
import math
import numpy
from cython import wraparound, boundscheck
from libc.math cimport exp # This imports c's exp function from the math library

@boundscheck(False)
@wraparound(False)
def compute_expx_cython(double [::1] timeseries):
    # You need to complete this, as before the cos example is a good starting point.

In [None]:
# Compare time taken to run each of these functions here.
tseries = generate_time_series(1., 100., 1./10000.)
%timeit compute_exp_tseries(tseries)
%timeit compute_exp_tseries_numpy(tseries)
%timeit compute_expx_cython(tseries)

943 ms ± 9.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
40.2 ms ± 75.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
41.4 ms ± 164 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Exercise 1.3

Rewrite this code to compute a Fourier transform.

In [None]:
# Exercise 1.3
import numpy as np

def compute_fourier_transform(data_time_domain):
    math_i = 1j # This is how to write i in python.
    N = len(data_time_domain) # How many points in the data
    k = np.arange(N)
    n = np.arange(N)
    data_frequency_domain = np.zeros(N,dtype=np.complex128)
    for i in n:
        for j in k:
            data_frequency_domain[j] += data_time_domain[i] * \
                (np.cos(2 * np.pi * j  * i / N) - math_i * np.sin(2 * np.pi * j * i / N))
    return data_frequency_domain

# Numpy function for reference
def compute_fourier_transform_numpy(data_time_domain):
    math_i = 1j # This is how to write i in python.
    N = len(data_time_domain) # How many points in the data
    k = np.arange(N)
    n = np.arange(N)
    data_frequency_domain = np.zeros(N,dtype=np.complex128)
    for i in n:
        data_frequency_domain += data_time_domain[i] * \
            (np.cos(2 * np.pi * k  * i / N) - math_i * np.sin(2 * np.pi * k * i / N))
    return data_frequency_domain



In [None]:
%%cython -a
# Cython goes here
import math
import numpy as np
from cython import wraparound, boundscheck
from libc.math cimport cos, sin, pi

@boundscheck(False)
@wraparound(False)
def compute_fourier_transform_cython(double[::1] data_time_domain):
    # N is the length of the array, use cdef to define it here as in previous examples

    cdef double complex [::1] data_frequency_domain = np.zeros(N, dtype=np.complex128) # Here's how to define our complex output array
    cdef double complex temp_value # I use a temporary value here to make it absolutely clear to cython that we must use complex values!
    # Here you'll need to add cdef statements for undefined variables (i, j) used in the for loop

    for i in range(N):
        for j in range(N):
            temp_value = data_time_domain[i] * cos(2 * pi * j * i / N)
            temp_value += # Add the imaginary part here
            data_frequency_domain[j] += temp_value
    return data_frequency_domain


In [None]:
data_time_domain = numpy.random.random(1000)
# This first one is *very* slow. Make a cup of tea while it runs!
%timeit compute_fourier_transform(data_time_domain)
%timeit compute_fourier_transform_numpy(data_time_domain)
%timeit compute_fourier_transform_cython(data_time_domain)

# Note that numpy's FFT will still be much quicker, because the FFT algorithm itself, and not just the implementation,
# is much more efficient. On top of that, numpy's FFT runs in compiled code which is *highly* optimized, and beating
# that would be almost impossible for cases with non-negligible length.

55.6 s ± 73.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
283 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
156 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
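Speed only matters if the answer is right, so it is worth checking any of these implementations against `np.fft.fft`, which uses the same sign convention as the functions above. The sketch below does this for a compact reference implementation; `dft_reference` and `samples` are illustrative names, and you would substitute your own Cython function in the comparison:

```python
import numpy as np

def dft_reference(data_time_domain):
    """Direct O(N^2) discrete Fourier transform as a matrix product, for checking only."""
    N = len(data_time_domain)
    n = np.arange(N)
    # phase[j, i] = exp(-2*pi*1j * j * i / N) = cos(...) - 1j * sin(...)
    phase = np.exp(-2j * np.pi * np.outer(n, n) / N)
    return phase @ data_time_domain

rng = np.random.default_rng(0)
samples = rng.random(64)
matches_fft = np.allclose(dft_reference(samples), np.fft.fft(samples))
print(matches_fft)
```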


### Exercise 1.4

Rewrite this code to compute a cross correlation

In [None]:
# EXERCISE 1.4 - Again, you might have seen this before!

def compute_cross_correlation(signal, data):
    cross_correlation = []
    for i in range(len(data) - len(signal)):
        curr_cross_corr = 0
        for j in range(len(signal)):
            curr_cross_corr += signal[j] * data[i+j]
        cross_correlation.append(curr_cross_corr)
    return cross_correlation

# Numpy version from last week
def compute_cross_correlation_numpy(signal, data):
    cross_correlation = []
    for i in range(len(data) - len(signal)):
        curr_cross_corr = (signal * data[i:i+len(signal)]).sum()
        cross_correlation.append(curr_cross_corr)
    return cross_correlation


In [None]:
%%cython -a
# Cython goes here
import math
import numpy as np
from cython import wraparound, boundscheck

@boundscheck(False)
@wraparound(False)
def compute_cross_correlation_cython(double [::1] signal, double [::1] data):
    cdef int sigsize = signal.size
    cdef int datasize = data.size
    cdef int N = # How big is the output vector? Fill that in here.
    cdef double [::1] cross_correlation = np.empty(N, dtype=np.float64)

    # Fill in the gap here to compute cross_correlation
    # NOTE: You will probably need *two* for loops here! In python you avoid for loops, in cython you embrace them!

    return cross_correlation


In [None]:
signal = numpy.random.random(1024)
data = numpy.random.random(1024*10)
%timeit compute_cross_correlation(signal, data)
%timeit compute_cross_correlation_numpy(signal, data)
%timeit compute_cross_correlation_cython(signal, data)

10.6 s ± 185 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
241 ms ± 13.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
50.1 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
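To check the result, numpy's `np.correlate` in `"valid"` mode computes the same sums. It returns one more element than the loop above, because the loop stops at `len(data) - len(signal)`, so the final element is dropped before comparing. A small sketch with short arrays (`cross_correlation_loop`, `signal_small` and `data_small` are illustrative names):

```python
import numpy as np

def cross_correlation_loop(signal, data):
    """Same double loop as compute_cross_correlation, returned as an array."""
    result = []
    for i in range(len(data) - len(signal)):
        result.append(sum(signal[j] * data[i + j] for j in range(len(signal))))
    return np.array(result)

rng = np.random.default_rng(1)
signal_small = rng.random(8)
data_small = rng.random(40)

# "valid" mode gives len(data) - len(signal) + 1 values; drop the last to match the loop
reference = np.correlate(data_small, signal_small, mode="valid")[:-1]
matches = np.allclose(cross_correlation_loop(signal_small, data_small), reference)
print(matches)
```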


We notice that for functions where the function can be replaced with a single numpy call (np.cos or np.exp) the numpy code is just as fast as the cython code. However, for more involved examples such as the Fourier transform or cross_correlation the Cython code is noticeably faster. I saw a factor of 3-4 speed up for the cross-correlation code in Cython vs numpy.

## Exercise 2

Here we give two more example problems, except that this time there is no python version. You must write a code to solve the problem, both in plain python, and in Cython. Compare the speed of both (and check that it works!)

### Exercise 2.1

Write a code to return the nth number in the Fibonacci sequence. This sequence goes:
```
0,1,1,2,3,5,8,13,21,34,55,89,144,...
```
**Warning!** While python can handle arbitrarily large integers, C cannot. Counting from a 0th value of 0, a signed 32-bit C `int` can store the sequence correctly only up to the 46th value; the 47th, 2971215073, no longer fits. The `long int` type, which is 64 bits on most Linux and macOS systems, extends the range, but it will still return incorrect values above the 92nd value.
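The overflow can be seen without writing any Cython. The sketch below uses fixed-width integers from the standard `ctypes` module to mimic what a signed 32-bit or 64-bit C integer would typically end up storing. The four Fibonacci values are hardcoded rather than computed so as not to give away the exercise, and they are indexed so that the 0th value is 0:

```python
import ctypes

# Known Fibonacci values, indexed so that the 0th value is 0
fib_45, fib_46 = 1134903170, 1836311903
fib_91, fib_92 = 4660046610375530309, 7540113804746346429

# The 47th value no longer fits in a signed 32-bit integer
exact_47 = fib_45 + fib_46
wrapped_47 = ctypes.c_int32(exact_47).value

# The 93rd value no longer fits in a signed 64-bit integer
exact_93 = fib_91 + fib_92
wrapped_93 = ctypes.c_int64(exact_93).value

print(exact_47, wrapped_47)
print(exact_93, wrapped_93)
```

In both cases the stored value comes out negative, which is an easy symptom to look for when testing your Cython function.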

In [None]:
# Write python code here
def fib(n):
    """
    Return the nth value of the Fibonacci sequence
    """
    # Write the code

In [None]:
%%cython -a
def fib_cython(int n):
    """Return the nth value of the Fibonacci sequence."""
    # Write the code, use examples above, and Google, to help.


In [None]:
# Test and time the codes here: Note that the current cell output is the performance I achieved with my code
print(fib(1), fib(2), fib(10), fib(50))
print(fib_cython(1), fib_cython(2), fib_cython(10), fib_cython(50))

%timeit fib(48)
%timeit fib_cython(48)

1 1 55 12586269025
1 1 55 12586269025
8.07 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
196 ns ± 1.18 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


### Exercise 2.2

Write a code to decide if the input number `n` is a prime or not. If it is a prime return True, if not return False.


In [None]:
# Write python code here
def is_prime(n):
    # Add python code here


In [None]:
%%cython -a
def is_prime_cython(int n):
    # Add Cython code here


In [None]:
# Test and time the codes here
# Current cell output is what I achieved.
print(is_prime(1), is_prime(2), is_prime(34), is_prime(1071), is_prime(123123123), is_prime(123123137))
print(is_prime_cython(1), is_prime_cython(2), is_prime_cython(34), is_prime_cython(1071), is_prime_cython(123123123), is_prime_cython(123123137))

%timeit is_prime(123123137)
%timeit is_prime_cython(123123137)


False True False False False True
False True False False False True
784 µs ± 42.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
22.3 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


## Acknowledgements

* https://ipython-books.github.io
* https://cython.readthedocs.io
* https://stackoverflow.com/questions/15285534/isprime-function-for-python-language
