# Lunch Time Python

## Lunch 6: numba

<img style="width: 600px; float: right;" src="https://numba.pydata.org/_static/numba-blue-horizontal-rgb.svg">

[numba](https://numba.pydata.org/) is a just-in-time (JIT) compiler for Python. With a few simple annotations, array-oriented and math-heavy Python code can be compiled just in time to reach performance similar to C, C++ and Fortran, without having to switch languages or Python interpreters.

*Press `Spacebar` to go to the next slide (or `?` to see all navigation shortcuts)*

[Lunch Time Python](https://ssciwr.github.io/lunch-time-python/), [Scientific Software Center](https://ssc.iwr.uni-heidelberg.de), [Heidelberg University](https://www.uni-heidelberg.de/)

# Motivation

- Many reasons to use Python, but performance not one of them
- What to do when a Python function is too slow?
- Ideally, find a library (e.g. numpy) with an equivalent function
- Otherwise:
    - use PyPy instead of CPython (if all your libraries are available)
    - write a Fortran function and compile with f2py or fortranmagic
    - write a C function and compile with Cython
    - write a C++ function and compile using pybind11 or ipybind
    - magically make your slow Python function faster (numba)

# numba installation

- Conda: `conda install numba`
- Pip: `python -m pip install numba`

# Vector reduction example

Toy example: implement a vector reduction operation:

$r(x, y) = \sum_i \cos(x_i) \sin(y_i)$
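As a quick sanity check of the formula, here is a minimal sketch with two short, hand-picked vectors. The input values and the names `x_small`, `y_small` and `r_reference` are illustrative and not part of the original slides:

```python
import math

# Tiny hand-checkable inputs (illustrative values)
x_small = [0.0, 0.0]
y_small = [math.pi / 2, math.pi / 6]


def r_reference(x_vec, y_vec):
    # Sum of cos(x_i) * sin(y_i) over all index pairs
    return sum(math.cos(a) * math.sin(b) for a, b in zip(x_vec, y_vec))


result = r_reference(x_small, y_small)
print(result)  # cos(0)*sin(pi/2) + cos(0)*sin(pi/6) = 1 + 0.5
```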

Some random vectors to benchmark our functions:

In [None]:
import numpy as np

x = np.random.uniform(low=-1, high=1, size=5000000)
y = np.random.uniform(low=-1, high=1, size=5000000)

# Python

In [None]:
import math


def r_python(x_vec, y_vec):
    s = 0
    for x, y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s

In [None]:
r_python(x, y)

In [None]:
%timeit r_python(x,y)
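`%timeit` is an IPython magic and only works in a notebook or IPython shell. In a plain Python script, a rough equivalent can be sketched with the standard-library `timeit` module. The vectors here are plain lists and much smaller than in the slides so that the sketch runs quickly; the names `xs`, `ys`, `times` and `best` are illustrative:

```python
import math
import random
import timeit


def r_python(x_vec, y_vec):
    s = 0.0
    for x, y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s


# Smaller inputs than in the slides (assumption, to keep the run time short)
xs = [random.uniform(-1, 1) for _ in range(100_000)]
ys = [random.uniform(-1, 1) for _ in range(100_000)]

# Run the function 5 times, one call per measurement, and keep the fastest run
times = timeit.repeat(lambda: r_python(xs, ys), number=1, repeat=5)
best = min(times)
print(f"best of {len(times)}: {best * 1e3:.1f} ms per call")
```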

# numpy

In [None]:
def r_numpy(x_vec, y_vec):
    return np.dot(np.cos(x_vec), np.sin(y_vec))

In [None]:
r_numpy(x, y)

In [None]:
%timeit r_numpy(x,y)

# Cython

In [None]:
# pip install cython
%load_ext cython

In [None]:
%%cython

import math

def r_cython(x_vec, y_vec):
    s = 0
    for x,y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s

In [None]:
r_cython(x, y)

In [None]:
%timeit r_cython(x,y)

In [None]:
%%cython

# use C math functions
from libc.math cimport sin, cos

# use C types instead of Python types
def r_cython(double[:] x_vec, double[:] y_vec):
    cdef double s = 0
    cdef Py_ssize_t i  # C index type, large enough for any array length
    for i in range(len(x_vec)):
        s += cos(x_vec[i])*sin(y_vec[i])
    return s

In [None]:
r_cython(x, y)

In [None]:
%timeit r_cython(x,y)

# Fortran

In [None]:
if "google.colab" in str(get_ipython()):
    !pip install fortran-magic -qqq
%load_ext fortranmagic

In [None]:
%%fortran

subroutine r_fortran(x_vec, y_vec, res)
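    ! double precision matches the float64 NumPy arrays, avoiding a lossy copy to single precision
    double precision, intent(in) :: x_vec(:), y_vec(:)
    double precision, intent(out) :: res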
    integer :: i, n
    n = size(x_vec)
    res = 0
    do i=1,n
        res = res + cos(x_vec(i))*sin(y_vec(i))
    enddo
endsubroutine r_fortran

In [None]:
r_fortran(x, y)

In [None]:
%timeit r_fortran(x,y)

# C++ / pybind11

In [None]:
if "google.colab" in str(get_ipython()):
    !pip install git+https://github.com/aldanor/ipybind.git -qqq
%load_ext ipybind

In [None]:
%%pybind11

#include <pybind11/numpy.h>
#include <cmath>
PYBIND11_PLUGIN(example) {
    py::module m("example");
    m.def("r_pybind", [](const py::array_t<double>& x, const py::array_t<double>& y) {
        double sum{0};
        auto rx{x.unchecked<1>()};
        auto ry{y.unchecked<1>()};
        for (py::ssize_t i = 0; i < rx.shape(0); i++){
            sum += std::cos(rx[i])*std::sin(ry[i]);
        }
        return sum;
    });
    return m.ptr();
}

In [None]:
r_pybind(x, y)

In [None]:
%timeit r_pybind(x, y)

# numba

In [None]:
from numba import jit


@jit
def r_numba(x_vec, y_vec):
    s = 0
    for x, y in zip(x_vec, y_vec):
        s += math.cos(x) * math.sin(y)
    return s

In [None]:
r_numba(x, y)

In [None]:
# pure python with numba JIT
%timeit r_numba(x,y)

## Numba compilation

Two compilation modes

- `nopython` mode (default)
  - Fast because it doesn't access the Python C API
  - Needs to be able to infer the native (C) types of all values
- `object` mode
  - Slow because it uses Python objects and the Python C API
  - In Numba versions before 0.59, `@jit` fell back to this mode if `nopython` mode was not possible; newer versions raise an error instead and only use object mode on request (`forceobj=True`)
  - To raise an error instead of falling back in any version, set `nopython=True` or use `@njit`

## Numba function signatures

You can optionally explicitly specify the function signature. Use cases:

- you want the function to be compiled when it is defined rather than when it is first called
- you need fine-grained control over types (e.g. if you want 32-bit floats)

In [None]:
from numba import float32


@jit(float32(float32, float32))
def add_f32(a, b):  # named add_f32 to avoid shadowing the built-in sum
    return a + b

In [None]:
add_f32(1, 0.99999999)

## Numba options

- `nopython=True`: disable object mode fallback
- `nogil=True`: release the Python Global Interpreter Lock (GIL)
- `cache=True`: cache the compiled functions on disk
- `parallel=True`: enable automatic parallelization

# Parallelization

- set `parallel=True` option to enable
- use `prange` to explicitly parallelize a loop over a `range`

In [None]:
from numba import jit, prange


@jit(parallel=True)
def r_numba(x_vec, y_vec):
    s = 0
    for i in prange(len(x_vec)):
        # index the function arguments, not the global arrays x and y
        s += math.cos(x_vec[i]) * math.sin(y_vec[i])
    return s

In [None]:
r_numba(x, y)

In [None]:
%timeit r_numba(x,y)

# NumPy universal functions

- a NumPy `ufunc` applies a function written for scalars element-wise to whole arrays
- you can create one using `@numba.vectorize` and use it like the built-in NumPy ufuncs

In [None]:
from numba import vectorize, float64


@vectorize([float64(float64, float64)], target="parallel")
def r(x, y):
    return np.cos(x) * np.sin(y)

In [None]:
r(2, 3)

In [None]:
r(x, y)

In [None]:
np.sum(r(x, y))

In [None]:
%timeit np.sum(r(x,y))

## Advanced features

- Ahead of Time (AoT) compilation
  - the compiled module only depends on NumPy
- Flexible specializations
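  - `@generated_jit` decorator for compile-time logic, e.g. type specializations (deprecated in Numba 0.57 and removed in later releases, where `@overload` from `numba.extending` is the suggested replacement)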
- Stencil
  - `@stencil` decorator for creating a stencil to apply to an array
- C callbacks
  - `@cfunc` decorator to generate a C-callback (e.g. to pass to scipy.integrate)
- CUDA support
  - compile CUDA kernels to run on a GPU
- see [numba.readthedocs.io](https://numba.readthedocs.io/) for more