#### faster: fewer instructions to compiler
Compilers
- **Cython**: the most commonly used tool for compiling to C, covering both numpy and normal Python code
- **Shed Skin**: an automatic Python-to-C++ converter for non-numpy code
- **Numba**: a new compiler specialized for numpy code
- **Pythran**: a new compiler for both numpy and non-numpy code
- **PyPy**: a stable JIT compiler for non-numpy code

Without numpy: Cython, Shed Skin, and PyPy
With numpy: Cython, Numba, and Pythran

JIT vs AOT (Ahead of Time) Compilers
- JIT: Numba, PyPy
 - compiles just the right parts of the code at the time of use.
 - **cold start problem**: code starts very slowly while it compiles
- AOT: Cython, Shed Skin, Pythran
 - compiles ahead of time into a static library specialized to the machine; `numpy`, `scipy`, and `scikit-learn` build parts of themselves this way with Cython when installed from source.

In [1]:
# Keeping the code generic makes it run more slowly.
# The `abs` function works differently depending on the underlying datatype.
# - integer or float: simply turns a negative value into a positive one
# - complex: takes the square root of the sum of the squared components

v = -1.0
print(type(v), abs(v))

v = 1-1j
print(type(v), abs(v))

<class 'float'> 1.0
<class 'complex'> 1.4142135623730951


In [None]:
# Reviewing the Julia function's CPU-bound code
def calculate_z_serial_purepython(maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n
    return output

#### Cython
- converts type-annotated Python into a compiled extension module
- a fork of Pyrex with expanded capabilities, used by libraries like `scipy`, `scikit-learn`, `lxml` and `zmq`
- used via a `setup.py` script to compile a module.

#### Compiling a Pure-Python Version Using Cython
Writing a compiled extension module involves three files:
- The calling Python code
- The function to be compiled in a new `.pyx` file
- A `setup.py` that contains the instructions for calling Cython to make the extension module

For the Julia example
- *julia1.py*: build the input lists and call calculation function
- *cythonfn.pyx*: contains CPU-bound function
- *setup.py*: contains build instructions
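
The role of *julia1.py* can be sketched as follows. This is a minimal stand-in, not the original script: the grid bounds, the constant `c`, and the helper names `build_inputs` and `run` are assumptions made for illustration, and `calculate_z` is defined inline in pure Python so that the sketch runs without a compiled module. In the compiled variant the function would instead be imported from the extension module built from *cythonfn.pyx*.

```python
import time


def calculate_z(maxiter, zs, cs):
    """Pure-Python stand-in for the function compiled from cythonfn.pyx."""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n
    return output


def build_inputs(width, lo=-1.8, hi=1.8, c=complex(-0.62772, -0.42193)):
    """Build flat lists of start points (zs) and constants (cs) for a width x width grid.

    The bounds and the constant c are illustrative assumptions.
    """
    step = (hi - lo) / width
    coords = [lo + k * step for k in range(width)]
    zs = [complex(x, y) for y in coords for x in coords]
    cs = [c] * len(zs)
    return zs, cs


def run(width=50, maxiter=300):
    """Build the inputs, time the CPU-bound call, and print a summary."""
    zs, cs = build_inputs(width)
    start = time.time()
    output = calculate_z(maxiter, zs, cs)
    elapsed = time.time() - start
    print("Length of x:", width)
    print("Total elements:", len(zs))
    print("Took", elapsed, "seconds")
    print("Total sum of elements (for validation):", sum(output))
    return output


if __name__ == "__main__":
    run()
```

Keeping the input construction and timing in the calling script means only `calculate_z` has to move into the `.pyx` file when the function is compiled.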

#### Running setup.py to build a new compiled module
```
> python setup.py build_ext --inplace
running build_ext
skipping 'cythonfn.c' Cython extension (up-to-date)
building 'calculate' extension
creating build/temp.linux-x86_64-3.5
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -c cythonfn.c -o build/temp.linux-x86_64-3.5/cythonfn.o
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.5/cythonfn.o -o /home/luno/Workspace/high_performance_python/cython/lists/1/calculate.cpython-35m-x86_64-linux-gnu.so
```

#### Pure Python
```
python3.5 julia1.py
Length of x: 1000
Total elements: 1000000
Took 7.0345377922058105 seconds
Total sum of elements (for validation): 33219980
```

#### Cython 
```
python3.5 julia1.py
Length of x: 1000
Total elements: 1000000
Took 5.442601680755615 seconds
Total sum of elements (for validation): 33219980
```

#### Cython Annotations
```
cython -a cythonfn.pyx
```

In [2]:
%load_ext Cython

In [4]:
%%cython --annotate
def calculate_z(maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n
    return output

Each line can be expanded with a double-click to show the generated C code.
- **4 ~ 8**: make the most calls back into the Python virtual machine (line 8 is called over 30 million times in pure Python)
- **9, 10, 11**: the tight inner loop, responsible for the bulk of the execution time of this function.
- **6, 7**: have a much smaller effect on the final speed (called 1 million times in pure Python); speeding them up requires replacing the `list` objects with `numpy` arrays.
    
In summary, the lines that tend to consume the most CPU time are those:
- Inside tight inner loops
- Dereferencing `list`, `array`, or `np.array` items
- Performing mathematical operations
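
The imbalance between the outer loop and the inner `while` loop can be made concrete by counting iterations. The sketch below is a hypothetical instrumented variant of the Julia function (the name `count_loop_work` is introduced here for illustration); it returns counts instead of the output list.

```python
def count_loop_work(maxiter, zs, cs):
    """Count how often the outer loop and the inner while-loop body execute."""
    outer_iterations = 0
    inner_iterations = 0
    for z, c in zip(zs, cs):
        outer_iterations += 1
        n = 0
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
            inner_iterations += 1
    return outer_iterations, inner_iterations


# One point that never escapes (z stays at 0) and one that starts outside the radius
print(count_loop_work(5, [0j, 3 + 0j], [0j, 0j]))  # (2, 5)
```

Any point that stays inside the radius costs `maxiter` inner iterations, which is why the inner loop is the first place to look for savings.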

#### Adding Some Type Annotations
Type annotations make our compiled function run faster by doing more work in C and less via the Python virtual machine.

Examples of C types that can be declared:
- `int`
- `unsigned int`
- `double complex`
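
One caveat when choosing `unsigned int` is that C unsigned types wrap around instead of going negative. The standard-library `ctypes` module exposes the same C types, so the behaviour can be seen without compiling anything. This sketch illustrates C integer semantics only; it is an assumption-free demonstration of `ctypes`, not of Cython's own conversion, which may raise an error instead when a negative Python integer is passed. The helper name `as_unsigned_int` is hypothetical.

```python
import ctypes


def as_unsigned_int(value):
    """Store a Python int in a C unsigned int and read it back."""
    return ctypes.c_uint(value).value


# Largest value representable in a C unsigned int on this platform
max_unsigned = 2 ** (8 * ctypes.sizeof(ctypes.c_uint)) - 1

print(as_unsigned_int(5))
# -1 wraps around to the largest unsigned value
print(as_unsigned_int(-1) == max_unsigned)
```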

In [None]:
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n
    return output

After compiling it takes 3.07 seconds to compute.
```
python3.5 julia1.py
Length of x: 1000
Total elements: 1000000
Took 3.0724616050720215 seconds
Total sum of elements (for validation): 33219980
```

As a simplified version, squaring both sides of the test `abs(z) < 2` removes the square root:
$\sqrt{z.real^2 + z.imag^2} < \sqrt{4}$ becomes $z.real^2 + z.imag^2 < 4$
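
A quick check in plain Python shows the two forms of the test agreeing on a few sample points. The function names and the sample values are illustrative assumptions; points that lie extremely close to the radius could in principle differ through floating-point rounding, so the samples here avoid that edge.

```python
def inside_by_abs(z):
    """Original test: magnitude computed with a square root."""
    return abs(z) < 2


def inside_by_squares(z):
    """Expanded test: compare the squared magnitude against 4."""
    return z.real * z.real + z.imag * z.imag < 4


samples = [0j, 1 + 1j, 2 + 0j, -1.5 + 1.5j, 0.5 - 1.9j]
for z in samples:
    print(z, inside_by_abs(z), inside_by_squares(z))
```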

In [None]:
# Expanding the abs function using Cython
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output

A dramatic effect, obtained by reducing the number of Python calls in the innermost loop.
```
python3.5 julia1.py
Length of x: 1000
Total elements: 1000000
Took 0.15447497367858887 seconds
Total sum of elements (for validation): 33219980
```

In [10]:
%%cython --annotate
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output

#### Shed Skin 
- An experimental Python-to-C++ compiler that works with Python 2.4-2.7
- Uses type inference to inspect a Python program and annotate the types used for each variable (automatic introspection)
- The annotated code is translated into C++ code.
- Runs in 0.3 seconds when built with the `--nobounds --nowrap` flags.
- Extra compile-time flags like `-ffast-math` or `-O3` can also be tried.

In [None]:
# Examining the annotated output from Shed Skin
def calculate_z(maxiter, zs, cs):        # maxiter: [int], zs: [list(complex)], cs: [list(complex)]
    """Calculate output list using Julia update rule"""
    output = [0] * len(zs)               # [list(int)]
    for i in range(len(zs)):             # [__iter(int)]
        n = 0                            # [int]
        z = zs[i]                        # [complex]
        c = cs[i]                        # [complex]
        while n < maxiter and abs(z) < 2: # [int]
            z = z * z + c                # [complex]
            n += 1                       # [int]
        output[i] = n                    # [int]
    return output                        # [list(int)]


if __name__ == "__main__":               # []
    # make a trivial example using the correct types to enable type inference
    # call the function so ShedSkin can analyze the types
    output = calculate_z(1, [0j], [0j])  # [list(int)]

#### Cython and numpy
- list (heterogeneous container): has an overhead for each dereference, as the objects it references can live anywhere in memory.
- array (homogeneous container): stores primitive types in contiguous blocks of RAM, which enables faster addressing.

- The `numpy` version without any Cython annotations takes about 71 seconds to run
- This is because of the overhead of dereferencing individual elements in the `numpy` arrays
- **`memoryview`**
 - allows the same low-level access to any object that implements the buffer interface, including `numpy` arrays and Python arrays.
 - can be shared with other C libraries without converting from Python objects to another form
 - can be used in any context (function parameters, module-level, cdef class attribute, etc) 
 - [Typed MemoryViews](http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html)
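
The buffer interface itself can be explored from plain Python. The sketch below uses the built-in `memoryview` over a standard-library `array`; this is not a Cython typed memoryview, but it sits on the same buffer protocol and shows the homogeneous, contiguous storage that makes fast addressing possible. The function name `demo_buffer` is a hypothetical helper.

```python
from array import array


def demo_buffer():
    """Show that a memoryview shares memory with the underlying array."""
    samples = array("d", [0.0, 1.5, 2.5, 4.0])  # homogeneous C doubles
    view = memoryview(samples)

    # Metadata exposed through the buffer protocol
    info = (view.format, view.itemsize == samples.itemsize, view.contiguous)

    view[1] = 10.0   # writes through to the array, no copy involved
    tail = view[2:]  # slicing a memoryview does not copy either
    tail[0] = -1.0
    return info, list(samples)


print(demo_buffer())
```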

In [None]:
# Annotated numpy version of the Julia calculation function
# `double complex[:] zs` - a one-dimensional buffer of double-precision complex numbers, accessed via the buffer protocol
import numpy as np
cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    # Assign a 1D numpy array to the memoryview via `empty`; the call to `empty`
    # allocates a block of memory but does not initialize it with sane values.
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        # expanded the call to `abs` using the faster, more mathematical version.
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output

In [None]:
# Adding prange to enable parallelization using OpenMP
#cython: boundscheck=False
from cython.parallel import prange
import numpy as np
cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, length
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)
    length = len(zs)
    with nogil:
        for i in prange(length, schedule="guided"):
            z = zs[i]
            c = cs[i]
            output[i] = 0
            while output[i] < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                output[i] += 1
    return output

#### Parallelizing with OpenMP on One Machine
- OpenMP (Open Multi-Processing) is a well-defined cross-platform API that supports parallel execution and memory sharing for C, C++ and Fortran.
- With Cython, OpenMP can be added by using the `prange` (parallel range) operator and adding the `-fopenmp` compiler and linker flag to `setup.py`.
- **nogil**: the GIL is released; inside this block we use `prange` to enable an OpenMP parallel for loop that calculates each `i` independently. While the GIL is released we must not operate on regular Python objects (e.g. lists).
- **schedule**: with `static`, the work is split evenly up front, so threads that draw cheap iterations finish early and sit idle; the `dynamic` and `guided` options hand out chunks at runtime so the work is distributed more evenly.
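
Why the schedule matters for uneven workloads can be sketched with a small simulation. This is a simplified model written in plain Python, not OpenMP itself: it assumes a chunk size of one iteration for the dynamic case, ignores scheduling overhead, and does not model `guided`. The function names and the cost list are illustrative assumptions.

```python
import heapq


def static_makespan(costs, n_workers):
    """Split the iterations into equal contiguous blocks, one per worker."""
    block = -(-len(costs) // n_workers)  # ceiling division
    loads = [sum(costs[w * block:(w + 1) * block]) for w in range(n_workers)]
    return max(loads)


def dynamic_makespan(costs, n_workers):
    """Hand each iteration to whichever worker becomes free first."""
    finish_times = [0] * n_workers
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Six cheap iterations followed by two expensive ones, shared by two workers
costs = [1, 1, 1, 1, 1, 1, 10, 10]
print(static_makespan(costs, 2))   # 22: one worker gets both expensive iterations
print(dynamic_makespan(costs, 2))  # 13: expensive iterations end up on different workers
```

In the Julia calculation the cost per point varies in a similar way, since some points escape after a few iterations while others run to `maxiter`.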

In [None]:
#cython: boundscheck=False
from cython.parallel import prange
import numpy as np
cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, length
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)
    length = len(zs)
    # GIL is disabled
    with nogil:
        for i in prange(length, schedule="guided"):
            z = zs[i]
            c = cs[i]
            output[i] = 0
            while output[i] < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                output[i] += 1
    return output

In [None]:
# Adding the OpenMP compiler and linker flags to setup.py for Cython

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

# for notes on compiler flags e.g. using
# export CFLAGS=-O2
# so gcc has -O2 passed (even though it doesn't make the code faster!)
# http://docs.python.org/install/index.html

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("calculate", ["cython_np.pyx"], extra_compile_args=[
                           '-fopenmp'], extra_link_args=['-fopenmp'])]
)

### Numba (Skipped)
### Pythran (Skipped)
### PyPy
- an alternative implementation of the Python language that includes a JIT compiler.
- With CPython it takes 11 sec, with PyPy it takes 0.3 sec.
- `numpy` is not supported directly (see NumPyPy, https://bitbucket.org/pypy/numpy)
- C extension libraries probably won't work in a useful way.
- PyPy advice: try to remove any C extension code if possible.
- can use a lot of RAM; it may use more RAM than CPython.
- PyPy3.3 is considered alpha/beta software, so all of its binaries are labelled “alpha”. It is known to be sometimes much slower than PyPy 2.
- http://speed.pypy.org/
- https://bitbucket.org/pypy/compatibility/wiki/Home

#### Tools
- [jitviewer](https://bitbucket.org/pypy/jitviewer)
- [vmprof](http://vmprof.com)


#### Garbage Collection Difference
- CPython uses reference counting, PyPy uses a modified mark and sweep.
- https://pypy.readthedocs.io/en/latest/cpython_differences.html
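
The practical consequence of the two strategies can be observed with a weak reference. In the sketch below (the `Node` class and `observe_collection` function are hypothetical names), the first flag is expected to be `True` under CPython, where the object is freed as soon as its reference count drops to zero; under a tracing collector such as PyPy's, the object may only disappear after an explicit collection.

```python
import gc
import weakref


class Node:
    """Trivial object used only to observe when it is reclaimed."""


def observe_collection():
    obj = Node()
    ref = weakref.ref(obj)
    del obj  # drop the only strong reference

    # With reference counting the object is already gone at this point
    freed_immediately = ref() is None

    # A full collection reclaims it under a tracing collector as well
    gc.collect()
    freed_after_collect = ref() is None
    return freed_immediately, freed_after_collect


print(observe_collection())
```

Code that relies on immediate finalization, for example closing files by simply dropping the last reference, is therefore worth reviewing before moving to PyPy.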

```
pypy julia1.py
Length of x: 1000
Total elements: 1000000
Took 0.645111083984 seconds
Total sum of elements (for validation): 33219980
```

#### Others
- https://github.com/dropbox/pyston
- [Cython vs Pyston vs PyPy](https://lincolnloop.com/blog/speed-comparison-cpython-pypy-pyston/)

### Foreign Function Interfaces
Sample C code for solving the 2D diffusion problem

```c
void evolve(double in[][512], double out[][512], double D, double dt) {
    int i, j;
    double laplacian;
    for (i=1; i<511; i++) {
        for (j=1; j<511; j++) {
            laplacian = in[i+1][j] + in[i-1][j] + in[i][j+1] + in[i][j-1] - 4 * in[i][j];
            out[i][j] = in[i][j] + D * dt * laplacian;
        }
    }
}
```

Create a .so file
```bash
gcc -O3 -std=gnu99 -c diffusion.c
gcc -shared -o diffusion.so diffusion.o
```

#### ctypes
- the most basic foreign function interface in CPython

#### cffi
- simplifies many of the standard operations
- simply write the C declarations that define the structure of the library, then cffi does the rest of the work
- can JIT-compile inline C code using the `verify` function.

#### f2py
- a dead simple way of importing Fortran code into Python

#### CPython Module
- requires writing code the same way that CPython itself is developed
- requires us to take care of all of the interactions between our code and the implementation of CPython

In [None]:
# ctypes 2D diffusion code
import ctypes

grid_shape = (512, 512)
_diffusion = ctypes.CDLL("./diffusion.so")

# Create references to the C types that we will need to simplify future code
TYPE_DOUBLE = ctypes.c_double
TYPE_DOUBLE_SS = ctypes.POINTER(ctypes.POINTER(ctypes.c_double))

# Initialize the signature of the evolve function to match the C code above:
# void evolve(double in[][512], double out[][512], double D, double dt)
_diffusion.evolve.argtypes = [
    TYPE_DOUBLE_SS,
    TYPE_DOUBLE_SS,
    TYPE_DOUBLE,
    TYPE_DOUBLE,
]
_diffusion.evolve.restype = None

def evolve(grid, out, dt, D=1.0):
    # `grid` and `out` are expected to be C-contiguous float64 numpy arrays
    # of shape `grid_shape`.
    # First we convert the Python types into the relevant C types
    cdt = TYPE_DOUBLE(dt)
    cD = TYPE_DOUBLE(D)
    pointer_grid = grid.ctypes.data_as(TYPE_DOUBLE_SS)
    pointer_out = out.ctypes.data_as(TYPE_DOUBLE_SS)

    # Now we can call the function; the argument order follows the C signature
    _diffusion.evolve(pointer_grid, pointer_out, cD, cdt)
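
Passing a single data pointer for a two-dimensional grid works because a C array such as `double in[][512]` is stored row by row in one contiguous block. The sketch below illustrates that layout with `ctypes` alone, using a small 3 x 4 grid instead of 512 x 512; the names `ROWS`, `COLS`, and `flat_value` are assumptions made for illustration.

```python
import ctypes

ROWS, COLS = 3, 4
Grid = (ctypes.c_double * COLS) * ROWS  # same layout as `double grid[3][4]` in C


def flat_value(grid, i, j):
    """Read grid[i][j] through a flat pointer to the first element."""
    flat = ctypes.cast(grid, ctypes.POINTER(ctypes.c_double))
    # Row-major layout: element (i, j) sits at offset i * COLS + j
    return flat[i * COLS + j]


grid = Grid()
grid[1][2] = 5.0
print(flat_value(grid, 1, 2))  # 5.0
print(ctypes.sizeof(grid) == ROWS * COLS * ctypes.sizeof(ctypes.c_double))
```

A C-contiguous `numpy` array uses the same row-major layout, which is what the wrapper relies on when it hands the array's data address to the C function.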

In [None]:
# cffi 2D diffusion code
from cffi import FFI

ffi = FFI()

# Declare the signature of the C function defined above
ffi.cdef(r'''
    void evolve(
        double **in, double **out,
        double D, double dt);
''')
lib = ffi.dlopen("./diffusion.so")


def evolve(grid, dt, out, D=1.0):
    # cast nonnative Python objects (the numpy data addresses) to C pointers
    pointer_grid = ffi.cast('double**', grid.ctypes.data)
    pointer_out = ffi.cast('double**', out.ctypes.data)
    lib.evolve(pointer_grid, pointer_out, D, dt)

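Compiling the CPython module version failed with the error "called object is not a function or function pointer". The log points at line 40 of `cdiffusion/python_interface.c`, where `PyArray_Type(next_grid)` is called: `PyArray_Type` expands to the array type object rather than a function-like macro, while the accessor used on the left-hand side of the same comparison is spelled `PyArray_TYPE`.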
```
python setup.py build_ext --inplace
running build_ext
building 'cdiffusion' extension
C compiler: x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC

compile options: '-I/usr/local/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c'
extra options: '-O3 -std=c99 -Wall -p -pg'
Warning: Can't read registry to find the necessary compiler setting
Make sure that Python modules _winreg, win32api or win32con are installed.
x86_64-linux-gnu-gcc: cdiffusion/python_interface.c
In file included from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarraytypes.h:1777:0,
                 from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h:18,
                 from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h:4,
                 from cdiffusion/python_interface.c:2:
/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
 #warning "Using deprecated NumPy API, disable it by " \
  ^
In file included from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h:27:0,
                 from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h:4,
                 from cdiffusion/python_interface.c:2:
cdiffusion/python_interface.c: In function ‘py_evolve’:
/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/__multiarray_api.h:636:22: error: called object is not a function or function pointer
 #define PyArray_Type (*(PyTypeObject *)PyArray_API[2])
                      ^
cdiffusion/python_interface.c:40:28: note: in expansion of macro ‘PyArray_Type’
  if (PyArray_TYPE(data) != PyArray_Type(next_grid))
                            ^
cdiffusion/python_interface.c: At top level:
cdiffusion/python_interface.c:83:18: warning: initialization from incompatible pointer type [-Wincompatible-pointer-types]
    {"evolve",    py_evolve,  METH_VARARGS,   cdiffusion_evolve_docstring},
                  ^
cdiffusion/python_interface.c:83:18: note: (near initialization for ‘module_methods[0].ml_meth’)
In file included from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarraytypes.h:1777:0,
                 from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h:18,
                 from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h:4,
                 from cdiffusion/python_interface.c:2:
/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
 #warning "Using deprecated NumPy API, disable it by " \
  ^
In file included from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h:27:0,
                 from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h:4,
                 from cdiffusion/python_interface.c:2:
cdiffusion/python_interface.c: In function ‘py_evolve’:
/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/__multiarray_api.h:636:22: error: called object is not a function or function pointer
 #define PyArray_Type (*(PyTypeObject *)PyArray_API[2])
                      ^
cdiffusion/python_interface.c:40:28: note: in expansion of macro ‘PyArray_Type’
  if (PyArray_TYPE(data) != PyArray_Type(next_grid))
                            ^
cdiffusion/python_interface.c: At top level:
cdiffusion/python_interface.c:83:18: warning: initialization from incompatible pointer type [-Wincompatible-pointer-types]
    {"evolve",    py_evolve,  METH_VARARGS,   cdiffusion_evolve_docstring},
                  ^
cdiffusion/python_interface.c:83:18: note: (near initialization for ‘module_methods[0].ml_meth’)
error: Command "x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/local/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c cdiffusion/python_interface.c -o build/temp.linux-x86_64-2.7/cdiffusion/python_interface.o -O3 -std=c99 -Wall -p -pg" failed with exit status 1

```