# Python for High Performance Computing
# <span style="font-family: Courier New, Courier, monospace;">Cython</span> extensions
<hr style="border: solid 4px green">
<br>
<center> <img src="images/arc_logo.png"; alt="Logo" style="float: center; width: 20%"></center>
<br>
## http://www.arc.ox.ac.uk
## support@arc.ox.ac.uk

## Overview
<hr style="border: solid 4px green">

### <span style="font-family: Courier New, Courier, monospace;">Cython</span>: you can have the cake and eat it... almost
* enjoy the benefits of fast-executing C code, practically without leaving Python
* requires less Pythonic and more C-like thinking (familiarity with C pays off)
* bonus feature: multithreaded parallelism
<br><br>

### <span style="font-family: Courier New, Courier, monospace;">Cython</span> is
* an open-source project (http://cython.org)
* a Python compiler (nearly)
* an extension to the Python language for
  * writing fast-executing extension modules
  * interfacing Python with C libraries
* best used to target the performance-critical parts of the code
<br><br>

### <span style="font-family: Courier New, Courier, monospace;">Cython</span> is stable & mature
* establishing itself as a pillar of the scientific Python ecosystem

## Example
<hr style="border: solid 4px green">

### Fibonacci sequence

In [2]:
import numpy

def fib (n):
    """Compute Fibonacci sequence"""
    s = numpy.zeros(n, dtype=numpy.float64)
    if n >= 1:
        s[0] = 1.0

    if n >= 2:
        s[1] = 1.0

    for i in range (2, n):
        s[i] = s[i-1] + s[i-2]

    return s

In [3]:
%timeit fib(1024)

1000 loops, best of 3: 485 µs per loop


In [4]:
%load_ext Cython
%load_ext cythonmagic



In [5]:
%%cython

import numpy
cimport numpy

# same function as before but with data types
cpdef double[:] cfib (int n):
    """Compute Fibonacci sequence"""

    # declare types of output data
    cdef double[:] s
    # declare types of function local data
    cdef int i

    # the rest is the same
    s = numpy.zeros(n, dtype=numpy.float64)
    if n >= 1:
        s[0] = 1.0

    if n >= 2:
        s[1] = 1.0

    for i in range (2, n):
        s[i] = s[i-1] + s[i-2]

    return s

In [6]:
%timeit cfib(1024)

The slowest run took 318.13 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.96 µs per loop


###  A speedup of ~100 times!
* 3 extra lines of code
* 2 types added in function definition
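
The timings above come from IPython's `%timeit`. Outside a notebook, a similar best-of-3 measurement can be sketched with the standard `timeit` module. This is only an illustration: `fib_list` is a hypothetical NumPy-free variant of `fib`, so its absolute timings will not match the numbers quoted above.

```python
import timeit

def fib_list(n):
    """Fibonacci sequence as a list of floats (NumPy-free stand-in for fib)."""
    s = [0.0] * n
    if n >= 1:
        s[0] = 1.0
    if n >= 2:
        s[1] = 1.0
    for i in range(2, n):
        s[i] = s[i - 1] + s[i - 2]
    return s

# 3 repeats of 1000 calls each; report the fastest repeat, as %timeit does
times = timeit.repeat(lambda: fib_list(1024), number=1000, repeat=3)
best_per_call = min(times) / 1000
print("1000 loops, best of 3: %.1f us per loop" % (best_per_call * 1e6))
```

Taking the minimum over repeats reduces the influence of one-off slow runs, such as the one flagged in the `cfib` timing output.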

## Overview
<hr style="border: solid 4px green">

### Pros
* 99% Python (versions 2 and 3 compatible)
* supports functionality both ways
  * running C extensions from Python
  * using `Cython` functions from C
* incremental development
  * standard Python is valid `Cython`
  * speed code up by adding C features
<br><br>

### Cons
* needs compilation
* CPython specific (does not work with other implementations, *e.g.* PyPy)

> *Documentation*:
> * http://cython.readthedocs.io/en/latest/
> * http://cython.readthedocs.io/en/latest/src/userguide/parallelism.html

## How Cython works
<hr style="border: solid 4px green">

### It writes C code so you don't have to
* Cython generates C code from Python-like code
* a C compiler compiles the C code generated
  * all major compilers supported on all major platforms
* most of Python syntax can be Cythonized
  * top-level classes and functions
  * control structures: loops, with, try-except/finally, ...
  * object operations, arithmetic, ...
<br><br>

### Help <span style="font-family: Courier New, Courier, monospace;">Cython</span> achieve greater performance than it can on its own
* the more specific you are about the types of variables and functions
  * the more the generated C code uses C types and libraries instead of the Python API and
  * the more the compiler can apply optimisations
* **aim** to move away from Python's safety checks towards C-optimised performance

## Cython workflow
<hr style="border: solid 4px green">

### Step 1: write <span style="font-family: Courier New, Courier, monospace;">Cython</span> code
* `.pyx` files: Python-like code
* `.pxd` files: Cython header files (optional)
<br><br>

### Step 2: <span style="font-family: Courier New, Courier, monospace;">Cython</span> translates this into C code and a compiler builds a shared library
* Option #1: use a translate-compile two step procedure, *e.g.*
  * `cython mycode.pyx`
  * `gcc -O2 -Wall -shared -fPIC $(pkg-config --cflags --libs python) -o mycode.so mycode.c`
* <span style="background-color:#ffcc00">Option #2: build and install using distutils</span>
  * `python setup.py build_ext --inplace`
  * `python setup.py install --prefix=./`
<br><br>
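
Option #2 relies on a `setup.py` file next to the source. A minimal sketch, assuming `setuptools` and Cython are installed, the source file is called `mycode.pyx`, and a GCC-style compiler is used. The file and module names are placeholders, and the `-fopenmp` flags are only needed when the module uses `prange`.

```python
# setup.py -- minimal sketch; module and file names are placeholders
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    name="mycode",                           # name used in "import mycode"
    sources=["mycode.pyx"],                  # Cython source, translated to C
    extra_compile_args=["-O2", "-fopenmp"],  # GCC-style flags (assumption)
    extra_link_args=["-fopenmp"],            # link against the OpenMP runtime
)

# cythonize() translates the .pyx file to C; setup() compiles and links it
setup(name="mycode", ext_modules=cythonize([ext]))
```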

> *Note*: run-time build is an option
> * `pyximport` -- builds and imports `.pyx` files on the fly (for experiments)
> * `cython.inline()` -- compiles a string of code at run time
<br><br>

### Step 3: Python imports the module and uses it
```python
import mycode
mycode.func ()
```

## Remove overheads: 4 steps to performance
<hr style="border: solid 4px green">

### Remove Python object overheads
* tell `Cython` the types of variables
```python
cdef int i, j, k
cdef float x, y[10]
cdef double *z
```
<br><br>

### Remove Python function call overheads
* tell `Cython`
  * how to turn functions into C or
  * how to use C functions directly
```python
# Python function with typed arguments, callable from Python
def foo (int i, char *s):
    return i
# C function, not visible to Python code that imports the module
cdef int foo2 (int i, char *s):
    return i
# use a function from the C math library directly
from libc.math cimport sin
cdef double x = 0.5
cdef double y = sin(x)
```

### Multithreaded parallelism
* release the GIL
* use data parallelism constructs
  * replace `range` with `prange`
<br><br>

### Remove Python check overheads
* use *compiler directives*: tell `Cython` about any Python checks that can be skipped
```python
# True = use C division semantics and skip division checks (no ZeroDivisionError)
@cython.cdivision(True)
# False = skip bounds checks on indexing operations (no IndexError is raised)
@cython.boundscheck(False)
# False = do not handle negative indices (using one may cause segfaults or data corruption)
@cython.wraparound(False)
```

## Example
<hr style="border: solid 4px green">

### Task
* compute the $p$-norm of a vector
$$||x||_p=\left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$
* reference implementation is `numpy.linalg.norm`
<br><br>
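
As a quick check of the definition, the formula can be evaluated by hand for a two-element vector. The helper name `p_norm_reference` is hypothetical and the code is a plain-Python sketch, not one of the implementations developed in this example.

```python
def p_norm_reference(u, p):
    """p-norm of a sequence of numbers, straight from the definition."""
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

u = [3.0, -4.0]
print(p_norm_reference(u, 1))   # |3| + |-4|
print(p_norm_reference(u, 2))   # square root of 9 + 16
```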

### Steps
* write `cython` code
* build module
* test module
<br><br>

### Incremental Cython code development
* start with pure Python
* type the variables
* use external C functions
* remove the GIL and add multithreading
* add Cython compiler directives to eliminate checks

## Example (cont'd)
<hr style="border: solid 4px green">

### Start with a pure Python function

```python
import math

def p_norm (u, p):
    n = u.size
    s = 0.0
    for i in range (n):
        s += math.pow (math.fabs(u[i]), p)

    return math.pow (s, 1.0 / float(p))
```
<br><br>

### Notes
* Python code is legal `Cython` code

## Example (cont'd)
<hr style="border: solid 4px green">

### Add types

```python
import math
import cython
cimport cython

cpdef double p_norm_types (double [:] u, int n, int p):
    cdef:
        int i
        double s

    s = 0.0
    for i in range (n):
        s += math.pow (math.fabs(u[i]), p)

    return math.pow (s, 1.0 / float (p))
```
<br><br>

### Notes
* the `cimport` statement imports the external C declarations from another module via a `.pxd` file from that module
* function arguments and internal function variables are given C types
```python
double [:] u, int n, int p
```
* these are translated directly to variables of the same type in the generated C code, avoiding the expensive use of Python objects
<br><br>

> *Remark*: recent Cython versions allow the typed memoryview syntax
```python
cpdef double[:] func (double[:] u)
```
instead of the older NumPy buffer syntax
```python
cpdef numpy.ndarray[numpy.double_t, ndim=1] func (numpy.ndarray[numpy.double_t, ndim=1] u)
```
Higher dimensional arrays are equally easy, *e.g.* `double[:,:] u`

## Example (cont'd)
<hr style="border: solid 4px green">

### Add external C functions 

```python
from libc.math cimport pow, fabs

cpdef double p_norm_types_better (double [:] u, int n, int p):
    cdef:
        int i
        double s

    s = 0.0

    for i in range (n):
        s += pow (fabs(u[i]), p)

    return pow (s, 1.0 / float (p))
```
<br><br>

### Notes
* the `Cython` installation ships `.pxd` files for the C standard library, which makes importing declarations easy (see remark below)
* `pow` is called on every loop iteration, so calling the C function directly, rather than through the Python `math` module, removes significant overhead

<br><br>
> *Remark*: The use of the `pow` math function from the external C library
```python
from libc.math cimport pow
```
is equivalent to the longer explicit header declaration
```
cdef extern from "math.h" nogil:
    double pow (double x, double y)
```

## Example (cont'd)
<hr style="border: solid 4px green">

### Add multithreading and release the GIL

```python
from cython.parallel import prange, parallel
cimport openmp

cpdef double p_norm_openmp (double [:] u, int n, int p, int nt=2):
    cdef:
        int i
        double s

    s = 0.0

    openmp.omp_set_num_threads (nt)

    with nogil:
        for i in prange (n):
            s += pow (fabs (u[i]), p)

    return pow (s, 1.0 / float (p))
```
<br><br>

### Notes
* parallel region defined via `prange`
* the parallel region releases the GIL
* the number of threads used in the parallel region is set
  * via the function argument `nt`
  * which is passed on to the OpenMP `omp_set_num_threads ()` function
* the reduction on variable `s` is inferred from context
  * `s +=` works
  * `s = s +` does not work
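
What a sum reduction means can be sketched in plain Python: each thread accumulates a private partial sum over its share of the indices, and the partial sums are added at the end. This only illustrates the semantics under assumed names (`partial_sum`, `p_norm_chunked`) and a simple contiguous chunking. It is not the code Cython generates, and Python threads give no speedup here because the GIL is held.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(u, p, start, stop):
    """Sum of |u[i]|**p over one contiguous chunk of indices."""
    s = 0.0
    for i in range(start, stop):
        s += abs(u[i]) ** p
    return s

def p_norm_chunked(u, p, nt=2):
    """p-norm computed as a reduction over nt chunks (illustration only)."""
    n = len(u)
    # split range(n) into nt nearly equal contiguous chunks
    bounds = [(k * n // nt, (k + 1) * n // nt) for k in range(nt)]
    with ThreadPoolExecutor(max_workers=nt) as pool:
        partials = pool.map(lambda b: partial_sum(u, p, b[0], b[1]), bounds)
        s = sum(partials)   # the "+" reduction combines the private sums
    return s ** (1.0 / p)

print(p_norm_chunked([1.0, -2.0, 3.0, -4.0, 5.0], 3, nt=2))
```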

## OpenMP
<hr style="border: solid 4px green">

### OpenMP is
* an API for explicit shared-memory parallelism
* supported by all major compilers for C/C++/Fortran
<br><br>

### Multi-threaded parallelism
* the programmer tells the compiler where (and how) to multithread
* particularly targeting loops and particularly good for data parallelism
<br><br>

### Fork-join multithreading
<img src="./images/openmp_fork_join.gif"; style="float: center; width: 60%"; >
<br><br>

### Example
```c
# pragma omp parallel for \
  default (none)          \
  private (i)             \
  shared (n,p,u)          \
  reduction(+:s)
  for (i=0; i<n; i++) {
     s += pow (fabs (u[i]), p);
  }
```
<br><br>


### API components:
* compiler directives
* run-time library routines
* environment variables

## Example (cont'd)
<hr style="border: solid 4px green">

### Final touches: remove checks

```python
@cython.cdivision(True)
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef double p_norm_openmp_better (double [:] u, int n, int p, int nt=2):

    cdef:
        int i
        double s

    s = 0.0

    openmp.omp_set_num_threads (nt)

    with nogil:
        for i in prange (n):
            s += pow (fabs (u[i]), p)

    return pow (s, 1.0 / float (p))
```
<br><br>

### Notes
* compiler directives give the Cython code generator extra guarantees, so that it can emit simpler and faster C code
  * `boundscheck=False` -- the programmer guarantees that array bounds are respected, so no bounds checks are generated
  * `wraparound=False` -- the programmer guarantees that negative indices are not used
  * `cdivision=True` -- the programmer guarantees that division is safe, so expensive checks (*e.g.* division by 0) are skipped
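
The Python behaviours that these directives switch off can be seen in plain Python. The sketch below only demonstrates the checked behaviour itself (it is ordinary Python, not Cython), and the variable names are illustrative.

```python
u = [1.0, 2.0, 3.0]

# wraparound: a negative index counts from the end of the sequence
last = u[-1]                      # same element as u[len(u) - 1]

# boundscheck: an out-of-range index raises instead of reading stray memory
try:
    u[len(u)]
    out_of_range_raised = False
except IndexError:
    out_of_range_raised = True

# cdivision: Python floors integer division, C truncates towards zero
python_quotient = -7 // 2         # floor of -3.5
c_style_quotient = int(-7 / 2)    # truncation of -3.5, as C does for -7 / 2

# Python also checks for a zero divisor; C does not
try:
    1 // 0
    zero_division_raised = False
except ZeroDivisionError:
    zero_division_raised = True

print(last, out_of_range_raised, python_quotient, c_style_quotient)
```

With the checks disabled, code that relies on any of these behaviours (negative indices, `IndexError`, floor division of negative integers, `ZeroDivisionError`) no longer behaves like Python.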

## Example (cont'd)
<hr style="border: solid 4px green">

### Build module

In [1]:
! python setup.py install --prefix=$PWD

running install
running build
running build_ext
cythoning ./src/matnorm.pyx to ./src/matnorm.c
building 'matnorm' extension
creating build
creating build/temp.linux-x86_64-2.7
creating build/temp.linux-x86_64-2.7/src
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/usr/local/anaconda2/lib/python2.7/site-packages/numpy/core/include -I/usr/local/anaconda2/include/python2.7 -c ./src/matnorm.c -o build/temp.linux-x86_64-2.7/./src/matnorm.o -fopenmp
creating build/lib.linux-x86_64-2.7
gcc -pthread -shared -L/usr/local/anaconda2/lib -Wl,-rpath=/usr/local/anaconda2/lib,--no-as-needed build/temp.linux-x86_64-2.7/./src/matnorm.o -L/usr/local/anaconda2/lib -lpython2.7 -o build/lib.linux-x86_64-2.7/matnorm.so -fopenmp
running install_lib
creating /home/mihai/python-for-hpc/python-hpc/notebooks/lecture06-cython/lib
creating /home/mihai/python-for-hpc/python-hpc/notebooks/lecture06-cython/lib/python2.7
creating /home/mihai/python-for-hpc/python-hpc

## Example (cont'd)
<hr style="border: solid 4px green">

### Test
* first, test accuracy against `numpy.linalg.norm`
* then, test performance

In [2]:
! python test.py

norm is  6.17168473681
relative errors [%] [  7.19559130e-14   7.19559130e-14   2.87823652e-14   2.87823652e-14]

 performance ...
                                     linalg.norm:  1.595858
                       pure python code cythonized 8.873584
                                    adding C types 5.472727
                                adding C functions 1.504336
                   using types + OpenMP (1 thread) 1.976157
                  using types + OpenMP (2 threads) 1.013240
                  using types + OpenMP (4 threads) 0.593100
      using types + OpenMP + no checks (4 threads) 0.427184
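
The script `test.py` is not reproduced here. A minimal sketch of the kind of accuracy check it reports, a relative error in percent against a reference value, is given below. The function names are hypothetical, and an independent standard-library computation of the 2-norm stands in for `numpy.linalg.norm` so that the sketch has no dependencies.

```python
import math

def p_norm(u, p):
    """Plain-Python p-norm, standing in for the compiled variants."""
    s = 0.0
    for x in u:
        s += math.pow(math.fabs(x), p)
    return math.pow(s, 1.0 / p)

def relative_error_percent(value, reference):
    """Relative error of value with respect to reference, in percent."""
    return 100.0 * abs(value - reference) / abs(reference)

u = [0.5, -1.5, 2.0, 4.0]
reference = math.sqrt(math.fsum(x * x for x in u))   # independent 2-norm
err = relative_error_percent(p_norm(u, 2), reference)
print("relative error [%]", err)
```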


## Three types of function declarations
<hr style="border: solid 4px green">

### <span style="font-family: Courier New, Courier, monospace;">def</span>
* Python, basically
  * called directly from Python
  * Python objects as arguments
  * returns a Python object
<br><br>

### <span style="font-family: Courier New, Courier, monospace;">cdef</span>: pure C functions
* effectively C code; arguments and return values should be given C types
* `Cython` optimises aggressively
* *Pros*: fastest executing code
* *Cons*: declared functions not visible to code that imports the module
<br><br>

### <span style="font-family: Courier New, Courier, monospace;">cpdef</span>: both C and Python
* gets compiled to two functions
  * a `cdef` for C types (for fast execution)
  * a `def` for Python types (for compatibility)
* *Pros*: visible to code that imports the module
* *Cons*: version using Python objects can be as slow as `def` version

## Functions and type coercion
<hr style="border: solid 4px green">

### Argument type checks are automatic where variables are typed

In [9]:
%load_ext Cython
%load_ext cythonmagic

The Cython extension is already loaded. To reload it, use:
  %reload_ext Cython
The cythonmagic extension is already loaded. To reload it, use:
  %reload_ext cythonmagic


In [10]:
%%cython

def func (x):
    return x + 1

def func_int (int x):
    return x + 1

# low-level C function, callable from C
cdef cdef_func_int (int x):
    return x + 1

# low-level C function, callable from C + wrapper, callable from Python
cpdef cpdef_func_int (int x):
    return x + 1

In [11]:
# this works as intended
print func (1)

2


In [12]:
# this raises a type error from the "add" operation itself
print func ("abc")

TypeError: cannot concatenate 'str' and 'int' objects

In [13]:
# this function has a typed arg and the error is from an argument type check
print func_int ("abc")

TypeError: an integer is required

In [14]:
# low-level C function, callable from C, not visible from Python
print cdef_func_int ("abc")

NameError: name 'cdef_func_int' is not defined

In [15]:
# hybrid function, visible from Python, arg type check
print cpdef_func_int ("abc")

TypeError: an integer is required

## Functions and type coercion (cont'd)
<hr style="border: solid 4px green">

### Similar tests using the "norm" functions
* eliminating the Python-specific tests (*e.g.* bounds check) does not prevent the C type checks

In [16]:
import sys
sys.path.append("./lib/python2.7/site-packages")
import matnorm as mn
import numpy as np

M, N = 3, 4
u = np.random.rand (M*N)
v = np.arange(M*N)
w = u.reshape((M, N))

print "u is ", u.shape, u.dtype
print "v is ", v.shape, v.dtype
print "w is ", w.shape, w.dtype

u is  (12,) float64
v is  (12,) int64
w is  (3, 4) float64


In [17]:
# call function the normal way
r = mn.p_norm_openmp_better (u, M*N, 3)

In [18]:
# if input is wrong data type, an error is automatically raised
r = mn.p_norm_openmp_better (v, M*N, 3)

ValueError: Buffer dtype mismatch, expected 'double' but got 'long'

In [19]:
# if input has the wrong number of dimensions, an error is automatically raised
r = mn.p_norm_openmp_better (w, M*N, 3)

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
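
The two errors above refer to properties that an array exposes through Python's buffer protocol: the element format and the number of dimensions. A small sketch using the standard-library `array` type instead of NumPy (an assumption made to keep it dependency-free) shows the information such a check can inspect; the variable names are illustrative.

```python
from array import array

doubles = array("d", [0.0] * 12)      # C double elements
longs = array("l", range(12))         # C long elements

view_d = memoryview(doubles)
view_l = memoryview(longs)
# reinterpret the same 12 doubles as a 3 x 4 array (cast via bytes first)
view_2d = view_d.cast("B").cast("d", [3, 4])

for name, view in [("doubles", view_d), ("longs", view_l), ("2d", view_2d)]:
    print(name, "format =", view.format, "ndim =", view.ndim)
```

A function declared with a `double [:]` argument expects format `d` and one dimension, which is why `v` (integers) and `w` (two-dimensional) are rejected before any computation starts.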

## Summary
<hr style="border: solid 4px green">

### Benefits of <span style="font-family: Courier New, Courier, monospace;">Cython</span>
* Python code is enhanced with
  * variable types
  * external library functions
  * multi-threading constructs
  * extra directives for the Cython compiler
* nevertheless, code remains mostly Python
<br><br>

### Enhanced Python code is cythonized and compiled
* code-to-code transformation: `Cython` source file (`.pyx` file) $\longrightarrow$ C source (`.c` file)
* source compilation: C source $\longrightarrow$ Python extension module (`.so` file)

## Summary (cont'd)
<hr style="border: solid 4px green">

### Best practices
* since pure Python is valid Cython, code development can be incremental
* work methodically, stripping away Python's safe but expensive objects and checks
* `Cython` annotation helps
  * `cython -a mycode.pyx` generates `mycode.html`
  * the HTML file provides information about Python call outs
  * useful and *recommended* while developing code -- minimise the size of the *yellow regions*

## Summary (cont'd)
<hr style="border: solid 4px green">

### `Cython` performance achieved by
* compiling code
  * applying standard compiler optimisation (which the interpreter cannot)
  * *note*: cythonizing pure Python code, unchanged, cuts runtime by 20-50%
* pruning the original Python down to essentials
  * eliminate dynamic typing
  * eliminate checks (*e.g.* bound checks)
  * *note*: 300x speedup over cythonized pure Python
* overcoming the GIL and using OpenMP to multithread
  * *note*: very easy programming

<img src="../../images/reusematerial.png"; style="float: center; width: 90"; >
<br>
<br>