# Fast Python

## Contents

- How fast is Python?
- Standard tricks
- *Numba*
- *Cython*

Next talk:
- Deeper dive into *Cython* | *Numba*

## Pure Python factorial

In [1]:
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)

In [2]:
%timeit factorial(100)

The slowest run took 4.02 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 13.3 µs per loop
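
One of the standard tricks is to avoid Python-level recursion altogether. The following is a minimal sketch, assuming only the value of n! is needed: `factorial_recursive`, `factorial_iterative` and `compare` are illustrative names, while `math.factorial` is the C implementation from the standard library. Timings depend on the machine, so none are quoted here.

```python
import math
import timeit


def factorial_recursive(n):
    # Same algorithm as the cell above: one Python-level call per step.
    if n == 0:
        return 1
    return n * factorial_recursive(n - 1)


def factorial_iterative(n):
    # A plain loop avoids the per-call overhead of recursion.
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result


def compare(n=100, number=10_000):
    """Return seconds per call for each variant."""
    variants = {
        "recursive": lambda: factorial_recursive(n),
        "iterative": lambda: factorial_iterative(n),
        "math.factorial": lambda: math.factorial(n),
    }
    return {
        name: timeit.timeit(func, number=number) / number
        for name, func in variants.items()
    }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name:>15}: {seconds * 1e6:.2f} µs per call")
```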


## Trivial function call

In [3]:
def identity(n):
    return n

In [4]:
%timeit identity(100)

The slowest run took 24.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 63.1 ns per loop


In C, such a function call would take about *3 ns*.
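
Outside IPython, the per-call cost can be estimated with the standard `timeit` module by subtracting the cost of an empty timing loop. This is a rough sketch rather than a rigorous benchmark; `call_overhead_ns` is a hypothetical helper name.

```python
import timeit


def identity(n):
    return n


def call_overhead_ns(number=1_000_000, repeat=5):
    """Estimate the per-call cost of identity() in nanoseconds."""
    # Best-of-N reduces noise from other processes, much like %timeit does.
    with_call = min(
        timeit.repeat(
            "identity(100)",
            globals={"identity": identity},
            number=number,
            repeat=repeat,
        )
    )
    # Baseline: the timing loop itself with no call in it.
    baseline = min(timeit.repeat("pass", number=number, repeat=repeat))
    return (with_call - baseline) / number * 1e9


if __name__ == "__main__":
    print(f"identity(): about {call_overhead_ns():.1f} ns per call")
```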

## Numba

- JIT compiler module
- requires llvm 4
- almost as fast as Cython
- can't compile everything

In [5]:
import numba

@numba.jit
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)

In [6]:
%timeit factorial(100)

The slowest run took 325948.27 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 572 ns per loop
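
The huge spread between the slowest and the fastest run above is most likely the JIT compilation triggered by the first call. A hedged sketch of timing the first call separately follows. `sum_to` and `time_call` are illustrative names, and the fallback decorator is an assumption added so that the snippet also runs without Numba installed (in which case nothing is compiled and both calls cost about the same).

```python
import time

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    def njit(func):
        # Fallback so the sketch still runs without Numba (no speed-up).
        return func


@njit
def sum_to(n):
    total = 0
    for k in range(n):
        total += k
    return total


def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, first = time_call(sum_to, 1000)   # includes compilation under Numba
    _, second = time_call(sum_to, 1000)  # reuses the compiled code
    print(f"first call: {first * 1e6:.1f} µs, second call: {second * 1e6:.1f} µs")
```

Calling the function once before `%timeit` keeps the compilation cost out of the measurement.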


## Inter-function calls in Numba

In [7]:
import numba

@numba.jit(nopython = True)
def sum_of_factorials(amount, n):
    sum = 0
    for _ in range(amount):
        sum += factorial(n)
    return sum

In [8]:
%timeit sum_of_factorials(100, 0)

The slowest run took 175374.77 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 459 ns per loop


### JIT-ing functions
- `numba.jit('int32(int32)', nopython = True, nogil = True, cache = True, debug = True)`
- `numba.cuda.jit()`
- type annotations


## Unsupported features in Numba

- try block
- with block
- comprehensions
- coroutines
- explicit \*\*kwargs
- global variables are fixed during compilation

In [35]:
import numba

@numba.jit
def try_except():
    try:
        pass
    except:
        pass

In [36]:
%timeit try_except()

AssertionError: Failed at object (analyzing bytecode)
SETUP_EXCEPT(arg=4, lineno=5)
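
A common workaround is to keep exception handling in a thin pure-Python wrapper and compile only the numeric kernel. This is a sketch under assumptions: `checked_sqrt` and `safe_sqrt` are hypothetical names, raising an exception with a constant message is assumed to be supported in nopython mode by the installed Numba version, and the fallback decorator only exists so that the snippet runs without Numba.

```python
import math

try:
    from numba import njit
except ImportError:
    def njit(func):
        # Fallback so the sketch still runs without Numba.
        return func


@njit
def checked_sqrt(x):
    # No try/except inside the compiled kernel: it only raises.
    if x < 0.0:
        raise ValueError("negative input")
    return math.sqrt(x)


def safe_sqrt(x, default=float("nan")):
    # Exception handling stays in ordinary, uncompiled Python.
    try:
        return checked_sqrt(x)
    except ValueError:
        return default


if __name__ == "__main__":
    print(safe_sqrt(4.0))   # 2.0
    print(safe_sqrt(-1.0))  # nan
```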

## C extensions in Numba

In [11]:
import numba
import numpy as np

@numba.jit(nopython = True)
def np_sum():
    a = np.arange(100)
    return np.sum(a)

@numba.jit(nopython = True)
def np_array_init():
    return np.asarray([1, 2])

In [12]:
%timeit np_sum()

The slowest run took 45878.30 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 5.63 µs per loop


In [13]:
np_array_init()

UntypedAttributeError: Failed at nopython (nopython frontend)
Unknown attribute 'asarray' of type Module(<module 'numpy' from '/home/skodajan/fast-python/env/lib/python3.5/site-packages/numpy/__init__.py'>)
File "<ipython-input-11-5336e01eb75b>", line 11
[1] During: typing of get attribute at <ipython-input-11-5336e01eb75b> (11)

## Numba & classes

In [14]:
import numpy as np
import numba
from numba import int32, float32

spec = [
    ('value', int32),
    ('array', float32[:]),
]

@numba.jitclass(spec)
class Bag(object):
    def __init__(self, value):
        self.value = value
        self.array = np.zeros(value, dtype=np.float32)

    @property
    def size(self):
        return self.array.size

    def increment(self, val):
        for i in range(self.size):
            self.array[i] += val
        return self.array

## LRU cache Fibonacci

In [15]:
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

In [16]:
%timeit -n 1 -r 1 fibonacci(30)

1 loop, best of 1: 279 ms per loop


In [17]:
from functools import lru_cache

@lru_cache(maxsize=32000)
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

In [32]:
%timeit -n 1 -r 1 fibonacci(30)

1 loop, best of 1: 18.8 µs per loop
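
To see where the speed-up comes from, `cache_info()` reports how many calls were answered from the cache. This is a small sketch: `fib_cached` is an illustrative name, and `maxsize=None` (an unbounded cache) is a choice made here, not the setting used above.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib_cached(n):
    if n < 2:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)


if __name__ == "__main__":
    print(fib_cached(30))           # 832040
    print(fib_cached.cache_info())  # hits, misses, maxsize, currsize
    # Reset the cache before timing again; otherwise later runs
    # only measure a single cache lookup.
    fib_cached.cache_clear()
```

Because the cache persists between calls, timing with a single loop and a single run (`-n 1 -r 1`) is what keeps the measurement about the first, uncached evaluation.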


## Cython factorial

In [19]:
%load_ext Cython

In [20]:
%%cython -a

def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)

In [21]:
%timeit factorial(100)

100000 loops, best of 3: 7 µs per loop


In [22]:
%%cython -a

cimport cython
from libc.stdint cimport int32_t

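# Note: 100! does not fit in a C int, so the result overflows;
# the timing below measures call overhead, not a correct factorial.
@cython.boundscheck(False)
cpdef int factorial(int n):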
    if n == 0:
        return 1
    return n * factorial(n - 1)

In [23]:
%timeit factorial(100)

10000000 loops, best of 3: 113 ns per loop


## Empty Cython function for reference

In [31]:
%%cython -a

cpdef int linear(int n):
    return n

cpdef int sum_of_10_linears(int n):
    cdef int sum = 0  # C locals are not zero-initialized automatically
    for i in range(10):
        sum += linear(n)
    return sum

In [25]:
%timeit linear(100)

The slowest run took 37.71 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 33.4 ns per loop


In [30]:
%timeit sum_of_10_linears(100)

The slowest run took 29.15 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 42.9 ns per loop


## Cython syntax

- `import`/`cimport`
- `with (no)gil:`
- `cdef type identifier`
- `def`/`cdef`/`cpdef`
- using 3rd-party code

## How to create a Cython module

- pyx -- Cython source (Python code with optional C type declarations)
- pxd -- *header* file for accessing C code
- setup.py

In [None]:
from setuptools import setup  # distutils was removed in Python 3.12
from Cython.Build import cythonize

setup(
    name = "My hello app",
    ext_modules = cythonize('hello.pyx'),
)

## List vs Dict

Which container with three ints 1--3 is faster?
- insertion (at the beginning in case of list)
- access
- iteration over the whole container

In [None]:
l = [1, 2, 3]
d = {1: None, 2: None, 3: None}

In [None]:
%timeit l.insert(0, 0)

In [None]:
%timeit d[0] = None

In [None]:
%timeit l[0]

In [None]:
%timeit d[0]

In [None]:
%timeit [x for x in l]

In [None]:
%timeit [x for x in d]
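
The same comparison can be scripted with the standard `timeit` module so that every statement starts from fresh containers. A hedged sketch: `SETUP`, `STATEMENTS` and `run` are illustrative names, and the dict access uses key `1` so that it does not depend on an earlier insertion of key `0`. No timings are quoted because they vary by machine and Python version.

```python
import timeit

SETUP = "l = [1, 2, 3]; d = {1: None, 2: None, 3: None}"

STATEMENTS = {
    "list insert at front": "l.insert(0, 0)",
    "dict insert": "d[0] = None",
    "list access": "l[0]",
    "dict access": "d[1]",
    "list iteration": "[x for x in l]",
    "dict iteration": "[x for x in d]",
}


def run(number=10_000, repeat=3):
    """Return the best time per loop, in nanoseconds, for each statement."""
    results = {}
    for label, stmt in STATEMENTS.items():
        # The setup runs once per repeat, so during an insert run the list
        # keeps growing and front insertion gets slower with every loop.
        best = min(timeit.repeat(stmt, setup=SETUP, number=number, repeat=repeat))
        results[label] = best / number * 1e9
    return results


if __name__ == "__main__":
    for label, ns in run().items():
        print(f"{label:>22}: {ns:8.1f} ns per loop")
```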

## PyPy

- alternative Python interpreter 
- 3.5 in beta state
- GIL-less branch in alpha state
- supports C extensions, although they are not binary compatible

In [None]:
%%pypy 

def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)

1000000 loops, best of 1: 4.731 µs per loop

## C(++)

 - *pybind11* for C++
 - *ctypes* for C
 - Python C API (`#include <Python.h>`) for calling Python from C
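
For the *ctypes* route, a minimal sketch of calling a C function from Python is shown here. It assumes a POSIX system where the C standard library can be loaded; the choice of `abs` is only for illustration.

```python
import ctypes
import ctypes.util

# find_library may return None; on POSIX, CDLL(None) then exposes the
# symbols already loaded into the current process, which include libc.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature explicitly: int abs(int)
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

print(libc.abs(-5))  # 5
```

Declaring `argtypes` and `restype` matters: without them ctypes assumes `int` arguments and return values, which silently breaks functions that take or return pointers or doubles.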
 
### PyBind11 binding code example

```
    py::class_<OrderSuggestion> order_suggestion(m, "OrderSuggestion");
    order_suggestion\
        .def(py::init<const std::string &, int, unsigned int>())
        .def("__repr__", &OrderSuggestion::__repr__)
        .def_readwrite("side", &OrderSuggestion::side)
        .def_readwrite("price_level", &OrderSuggestion::price_level)
        .def_readwrite("qty", &OrderSuggestion::qty);
```

# Comparison

- Cython 
 - very fast
 - 'python' code, python libraries
 - build required
 - separate modules
 - differs from Python more than it might seem
 - most mature and time-proven
 
- Numba
 - easy to use just where necessary
 - easy to debug (just disable JIT)
 - llvm dependency
 - some caveats and limitations

# Comparison

 - Pure C(++)
     - fastest
     - two separate worlds lead to too much duplication (build, CI, lint, test, debug...)
     - requires another new skillset
     - slow to write 
     - typical C issues (memory leaks, crashes, security holes)
     - *pybind*'s template magic is hard to debug

# Q&A

### Sources
- http://github.com/leftys/fast-python

### References
- http://cython.readthedocs.io/en/latest/index.html
- http://numba.pydata.org/
- http://apt.llvm.org/
- http://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/
- https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/