#### Faster: fewer instructions via compilers
Compilers
- **Cython**: the most commonly used tool for compiling to C, covering both numpy and normal Python code
- **Shed Skin**: An automatic Python-to-C++ converter for non-numpy code
- **Numba**: A new compiler specialized for numpy code
- **Pythran**: A new compiler for both numpy and non-numpy code
- **PyPy**: A stable JIT compiler for non-numpy code

Without numpy: Cython, Shed Skin, and PyPy
With numpy: Cython, Numba, and Pythran

JIT vs AOT (Ahead of Time) Compilers
- JIT: Numba, PyPy
 - compile just the right parts of the code at the time of use.
 - **cold start problem**: code starts very slowly while it compiles
- AOT: Cython, Shed Skin, Pythran
 - compile ahead of time into a static library specialized to your machine; for example, parts of `numpy`, `scipy`, and `scikit-learn` are compiled this way with Cython

In [1]:
# Keeping the code generic makes it run more slowly.
# The `abs` function works differently depending on the underlying datatype.
# - integer or float: simply turns a negative value into a positive one
# - complex: takes the square root of the sum of the squared components

v = -1.0
print(type(v), abs(v))

v = 1-1j
print(type(v), abs(v))

<class 'float'> 1.0
<class 'complex'> 1.4142135623730951


In [None]:
# Reviewing the Julia function's CPU-bound code
def calculate_z_serial_purepython(maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n
    return output

#### Cython
- converts type-annotated Python into a compiled extension module
- a fork of Pyrex with expanded capabilities, used by libraries like `scipy`, `scikit-learn`, `lxml`, and `zmq`
- used via a `setup.py` script to compile a module.

#### Compiling a Pure-Python Version Using Cython
Writing a compiled extension module requires three files:
- The calling Python code
- The function to be compiled, in a new `.pyx` file
- A `setup.py` that contains the instructions for calling Cython to build the extension module

For the Julia example
- *julia1.py*: build the input lists and call calculation function
- *cythonfn.pyx*: contains CPU-bound function
- *setup.py*: contains build instructions
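
As a rough sketch of what the calling side might look like, the script below builds the input lists and passes them to the calculation function. The names `calculate` and `calculate_z` follow the build log and code in these notes; the helper `build_inputs`, the grid bounds, and the constant `c` are assumptions chosen for illustration. A pure-Python fallback is included so the sketch still runs before the extension has been built.

```python
import time

try:
    # Compiled extension built from cythonfn.pyx (assumed module name: calculate)
    from calculate import calculate_z
except ImportError:
    # Fallback so the sketch runs without the compiled module
    def calculate_z(maxiter, zs, cs):
        """Calculate output list using Julia update rule"""
        output = [0] * len(zs)
        for i in range(len(zs)):
            n = 0
            z = zs[i]
            c = cs[i]
            while n < maxiter and abs(z) < 2:
                z = z * z + c
                n += 1
            output[i] = n
        return output


def build_inputs(width, x1=-1.8, x2=1.8, y1=-1.8, y2=1.8,
                 c=complex(-0.62772, -0.42193)):
    """Build coordinate lists for a width x width grid (illustrative values)."""
    x_step = (x2 - x1) / width
    y_step = (y2 - y1) / width
    xs = [x1 + i * x_step for i in range(width)]
    ys = [y1 + j * y_step for j in range(width)]
    zs = [complex(x, y) for y in ys for x in xs]  # one z per grid point
    cs = [c] * len(zs)                            # same constant everywhere
    return xs, zs, cs


def run(width=50, maxiter=300):
    """Build the inputs, time the calculation, and print a validation sum."""
    xs, zs, cs = build_inputs(width)
    print("Length of x:", len(xs))
    print("Total elements:", len(zs))
    start = time.time()
    output = calculate_z(maxiter, zs, cs)
    print("Took", time.time() - start, "seconds")
    print("Total sum of elements (for validation):", sum(output))
    return output


if __name__ == "__main__":
    run()
```

Keeping the calling code unchanged between the pure-Python and compiled versions means the validation sum can be compared directly across runs.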

#### Running setup.py to build a new compiled module
```
> python setup.py build_ext --inplace
running build_ext
skipping 'cythonfn.c' Cython extension (up-to-date)
building 'calculate' extension
creating build/temp.linux-x86_64-3.5
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -c cythonfn.c -o build/temp.linux-x86_64-3.5/cythonfn.o
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.5/cythonfn.o -o /home/luno/Workspace/high_performance_python/cython/lists/1/calculate.cpython-35m-x86_64-linux-gnu.so
```

#### Pure Python
```
python3.5 julia1.py
Length of x: 1000
Total elements: 1000000
Took 7.0345377922058105 seconds
Total sum of elements (for validation): 33219980
```

#### Cython 
```
python3.5 julia1.py
Length of x: 1000
Total elements: 1000000
Took 5.442601680755615 seconds
Total sum of elements (for validation): 33219980
```

#### Cython Annotations
```
cython -a cythonfn.pyx
```

In [2]:
%load_ext Cython

In [4]:
%%cython --annotate
def calculate_z(maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n
    return output

Each line can be expanded with a double-click to show the generated C code.
- **4 ~ 8**: make the most calls back into the Python virtual machine (line 8 is called over 30 million times in pure Python)
- **9, 10, 11**: the tight inner loop, responsible for the bulk of this function's execution time
- **6, 7**: a much smaller effect on the final speed (called 1 million times in pure Python); speeding them up requires replacing the `list` objects with `numpy` arrays
    
In summary, the lines that tend to consume the most CPU time are those:
- Inside tight inner loops
- Dereferencing `list`, `array`, or `np.array` items
- Performing mathematical operations
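
To see why the inner loop dominates, the executions can be tallied directly. The following is a minimal pure-Python sketch on a small grid; the helper name `count_loop_executions` and the input values are illustrative assumptions, not part of the Julia script in these notes. It counts how often the per-element setup runs versus how often the `while` body runs.

```python
def count_loop_executions(maxiter, zs, cs):
    """Run the Julia update rule and count outer vs. inner loop executions."""
    outer_count = 0  # times the per-element setup lines run
    inner_count = 0  # times the while-loop body runs
    output = [0] * len(zs)
    for i in range(len(zs)):
        outer_count += 1
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
            inner_count += 1
        output[i] = n
    return output, outer_count, inner_count


# Small illustrative input: a 20 x 20 grid with a fixed c (assumed values)
width = 20
step = 3.6 / width
zs = [complex(-1.8 + x * step, -1.8 + y * step)
      for y in range(width) for x in range(width)]
cs = [complex(-0.62772, -0.42193)] * len(zs)

output, outer_count, inner_count = count_loop_executions(300, zs, cs)
print("outer loop executions:", outer_count)
print("inner loop executions:", inner_count)
print("inner / outer ratio:", inner_count / outer_count)
```

The inner count always equals `sum(output)`, so the validation sum of 33219980 printed by the runs above also says the loop body ran roughly 33 million times for 1 million elements.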

#### Adding Some Type Annotations
Type annotations make our compiled function run faster by doing more work in C and less via the Python virtual machine.

Example C types:
- `int`
- `unsigned int`
- `double complex`

In [None]:
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n
    return output

After compiling it takes 3.07 seconds to compute.
```
python3.5 julia1.py
Length of x: 1000
Total elements: 1000000
Took 3.0724616050720215 seconds
Total sum of elements (for validation): 33219980
```

The `abs(z) < 2` test can be simplified by squaring both sides, which removes the square root:
$\sqrt{z.real^2 + z.imag^2} < \sqrt{4}$ becomes $z.real^2 + z.imag^2 < 4$
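
A quick numerical check of this simplification, written as a plain-Python sketch (the helper names `within_abs` and `within_squared` are made up for illustration): since both sides are non-negative, comparing the magnitude against 2 and comparing the squared magnitude against 4 give the same answer, but the squared form needs no square root.

```python
import random


def within_abs(z):
    """Original test: magnitude via abs(), which computes a square root."""
    return abs(z) < 2


def within_squared(z):
    """Expanded test: compare the squared magnitude against 4 instead."""
    return (z.real * z.real + z.imag * z.imag) < 4


random.seed(0)  # fixed seed so the check is repeatable
samples = [complex(random.uniform(-3, 3), random.uniform(-3, 3))
           for _ in range(10000)]

# Count how many samples get the same verdict from both tests
agree = sum(within_abs(z) == within_squared(z) for z in samples)
print("agreeing samples:", agree, "of", len(samples))
```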

In [None]:
# Expanding the abs function using Cython
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output

This has a dramatic effect: it reduces the number of Python calls in the innermost loop.
```
python3.5 julia1.py
Length of x: 1000
Total elements: 1000000
Took 0.15447497367858887 seconds
Total sum of elements (for validation): 33219980
```
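
The timings reported so far can be put side by side as speedups relative to the pure-Python run. The snippet below is only arithmetic on the numbers printed in the runs above, rounded to four decimal places; the labels are informal names for each step.

```python
# Wall-clock times (seconds) copied from the runs in these notes, rounded
timings = {
    "pure Python": 7.0345,
    "Cython, no annotations": 5.4426,
    "Cython, typed variables": 3.0725,
    "Cython, typed + expanded abs": 0.1545,
}

baseline = timings["pure Python"]
for label, seconds in timings.items():
    speedup = baseline / seconds  # how many times faster than pure Python
    print(f"{label:30s} {seconds:8.4f} s  {speedup:6.1f}x")
```

Most of the gain comes from the last step: going from about 3.07 s to about 0.15 s is roughly a 20x improvement on its own.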

In [10]:
%%cython --annotate
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output

#### Shed Skin 
- An experimental Python-to-C++ compiler that works with Python 2.4-2.7
- Uses type inference to inspect a Python program and annotate the types used for each variable (automatic introspection)
- The annotated code is translated into C++ code.
- Gets a 0.3 second execution time using the `--nobounds --nowrap` flags.
- Extra compile-time flags like `-ffast-math` or `-O3` can also be tried

In [None]:
# Examining the annotated output from Shed Skin
def calculate_z(maxiter, zs, cs):        # maxiter: [int], zs: [list(complex)], cs: [list(complex)]
    """Calculate output list using Julia update rule"""
    output = [0] * len(zs)               # [list(int)]
    for i in range(len(zs)):             # [__iter(int)]
        n = 0                            # [int]
        z = zs[i]                        # [complex]
        c = cs[i]                        # [complex]
        while n < maxiter and abs(z) < 2: # [int]
            z = z * z + c                # [complex]
            n += 1                       # [int]
        output[i] = n                    # [int]
    return output                        # [list(int)]


if __name__ == "__main__":               # []
    # make a trivial example using the correct types to enable type inference
    # call the function so ShedSkin can analyze the types
    output = calculate_z(1, [0j], [0j])  # [list(int)]

#### Cython and numpy
- The `numpy` version without any Cython annotations takes about 71 seconds to run
- This is because of the overhead of dereferencing individual elements in the `numpy` arrays
- `memoryview`
 - allows the same low-level access to any object that implements the buffer interface, including `numpy` arrays and Python arrays.
 - can be shared with other C libraries without casting from Python objects to another form
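
The buffer interface is not specific to Cython; Python's built-in `memoryview` exposes the same idea. As a small pure-Python sketch (using the standard-library `array` module rather than `numpy`, so this is an analogy to Cython's typed memoryviews and not the Cython syntax itself), a view shares memory with the underlying object instead of copying it:

```python
from array import array

# A Python array of C doubles implements the buffer interface
data = array("d", [0.0, 1.0, 2.0, 3.0])

# memoryview gives low-level access to the same memory, without copying
view = memoryview(data)
print(view.format, view.itemsize, view.shape)

# Writing through the view changes the original array
view[0] = 42.0
print(data[0])

# Slicing a view creates another view onto the same memory, not a copy
tail = view[2:]
tail[0] = -1.0
print(data.tolist())

# Release the views when finished so the array can be resized again
tail.release()
view.release()
```

In Cython, a declaration such as `double complex[:] zs` plays the role of `view` here, with the element type fixed at compile time.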

In [None]:
# Annotated numpy version of the Julia calculation function
# `double complex[:] zs` - a double-precision complex object using the buffer protocol
import numpy as np
cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    # Assign a 1D numpy array to `output` via `empty`; the call to `empty` allocates
    # a block of memory but does not initialize it with sane values.
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        # expanded the call to `abs` using the faster, more mathematical version.
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output