# Numba

## Contents

  - [Overview](#Overview)  
  - [Compiling Functions](#Compiling-Functions)   
  - [Vectorization](#Vectorization)
  - [Parallelization](#Parallelization)

## Overview

In our lecture on NumPy we learned one method, called **vectorization**, to improve speed and efficiency in numerical work. Vectorization involves sending array processing operations in batch to efficient low level code.  

Unfortunately, vectorization has several weaknesses:
- One is that it is highly memory intensive when working with large amounts of data.
- Another is that not all algorithms can be vectorized. 

A Python library called [Numba](http://numba.pydata.org/) solves many of these problems. It does so through something called **just-in-time (JIT) compilation**. 
The key idea is to compile functions to native machine code instructions on the fly. When it succeeds, the compiled code is extremely fast. 
It can also do other tricks, such as facilitating *multithreading* (a form of parallelization well suited to numerical work).


## Compiling Functions

As stated above, Numba's primary use is compiling functions to fast native machine code at runtime.

Let's start with some imports:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import time

### An Example

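Let’s consider a problem that is difficult to vectorize: 
generating the trajectory of a difference equation given an initial condition. Each value depends on the one before it, so the trajectory cannot be computed in a single batch operation.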
Let’s take the difference equation to be the quadratic map: $ x_{t+1} = 4 x_t (1 - x_t) $.
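
As a quick hand check of the recursion, taking $ x_0 = 0.1 $ as the starting value, the first few iterates work out to:

$$
x_1 = 4(0.1)(0.9) = 0.36, \quad
x_2 = 4(0.36)(0.64) = 0.9216, \quad
x_3 = 4(0.9216)(0.0784) \approx 0.2890
$$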

Here’s the plot of a typical trajectory, starting from $ x_0 = 0.1 $, with $ t $ on the x-axis.

In [None]:
def qm(x0, n):
    x = np.empty(n+1)
    x[0] = x0
    for t in range(n):
        x[t+1] = 4 * x[t] * (1 - x[t])
    return x

x = qm(0.1, 250)
fig, ax = plt.subplots()
ax.plot(x, 'b-', lw=2, alpha=0.8)
ax.set_xlabel('$t$', fontsize=12)
ax.set_ylabel('$x_{t}$', fontsize=12)
plt.show()

To speed the function `qm` up using Numba, the first step is:

In [None]:
from numba import jit

qm_numba = jit(qm)  # qm_numba is now a 'compiled' version of qm

The function `qm_numba` is a version of `qm` that is "targeted" for JIT compilation. 

Let’s time and compare identical function calls across these two versions:

In [None]:
start = time.time() # record start time

### original qm function
qm(0.1, int(10**7))

end = time.time() # record end time

elapsed_time = end - start # test timing
print(elapsed_time)  

Let's try `qm_numba`. 

In [None]:
start = time.time() # record start time

### first call of jit version of qm function
qm_numba(0.1, int(10**7))

end = time.time() # record end time

elapsed_time = end - start # test timing
print(elapsed_time)  

This is already a massive speed gain. 

In fact, the first call includes the time spent compiling. The next call and all subsequent calls run even faster, because the function has already been compiled and is held in memory.

In [None]:
start = time.time() # record start time

### subsequent call of jit version of qm function
qm_numba(0.1, int(10**7))

end = time.time() # record end time

elapsed_time = end - start # test timing
print(elapsed_time)  

### Decorator Notation

To target a function for JIT compilation we can just put `@jit` before the function.

In [None]:
@jit
def qm(x0, n):
    x = np.empty(n+1)
    x[0] = x0
    for t in range(n):
        x[t+1] = 4 * x[t] * (1 - x[t])
    return x

This is equivalent to `qm = jit(qm)`. 

The following now uses the jitted version:

In [None]:
qm(0.1, 10)

### How and When it Works

Numba attempts to generate fast machine code using the infrastructure provided by the [LLVM Project](http://llvm.org/). 
It does this by inferring type information on the fly.
This is easier for simple Python objects (simple scalar data types, such as floats, integers, etc.). Numba also plays well with NumPy arrays, which it treats as typed memory regions. 

In an ideal setting, Numba can infer all necessary type information.
This allows it to generate native machine code, without having to call the Python runtime environment.
In such a setting, Numba will be on par with machine code from low level languages.

When Numba cannot infer all type information, some Python objects are given generic `object` status, and some code is generated using the Python runtime.
When this happens, Numba provides only minor speed gains or none at all.

We generally prefer to force an error when this occurs, so we know effective compilation is failing. This is done by using either `@jit(nopython=True)` or, equivalently, `@njit` instead of `@jit`.

In [None]:
from numba import njit

@njit
def qm(x0, n):
    x = np.empty(n+1)
    x[0] = x0
    for t in range(n):
        x[t+1] = 4 * x[t] * (1 - x[t])
    return x

Moreover, for larger routines, or for routines using external libraries, inferring type information can easily fail. Hence, it is prudent when using Numba to focus on speeding up small, time-critical snippets of code. This will give better performance than blanketing your Python programs with `@jit` statements.

## Vectorization

As mentioned before, many functions provided by NumPy are ufuncs (universal functions): they operate elementwise on whole arrays in fast compiled code rather than in Python loops.

For example, let's return to the maximization problem discussed before.

In [None]:
def f(x, y):
    return np.cos(x**2 + y**2) / (1 + x**2 + y**2)

grid = np.linspace(-3, 3, 5000)
x, y = np.meshgrid(grid, grid)

Because `f` is built from NumPy ufuncs and `np.max` is itself vectorized, this computation runs much faster than the equivalent pure Python loops. 

In [None]:
start = time.time() # record start time

### vectorized NumPy version
np.max(f(x, y))

end = time.time() # record end time

elapsed_time = end - start # test timing
print(elapsed_time)  

Numba can also be used to create custom ufuncs with the [@vectorize](http://numba.pydata.org/numba-doc/dev/user/vectorize.html) decorator. 

In [None]:
from numba import vectorize

@vectorize
def f_vec(x, y):
    return np.cos(x**2 + y**2) / (1 + x**2 + y**2)

np.max(f_vec(x, y))  # Run once to compile

start = time.time() # record start time

### f_vec vectorized through Numba
np.max(f_vec(x, y))

end = time.time() # record end time

elapsed_time = end - start # test timing
print(elapsed_time) 

Both Numba and NumPy use efficient machine code that’s specialized to these floating point operations. However, the code NumPy uses is, in some ways, less efficient.

For example, when NumPy computes `np.cos(x**2 + y**2)` it first creates the
intermediate arrays `x**2` and `y**2`, then it creates the array `np.cos(x**2 + y**2)`.

In the `@vectorize` version using Numba, the entire operation is reduced to a
single vectorized process and none of these intermediate arrays are created. 
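
To make the memory cost concrete, the NumPy expression can be unrolled by hand so that each temporary gets a name. This is a rough illustration of the evaluation order on a smaller grid, with variable names invented here. On the 5000 × 5000 grid used above, each float64 array of that shape occupies 5000 × 5000 × 8 bytes, or 200 MB.

```python
import numpy as np

grid = np.linspace(-3, 3, 500)
x, y = np.meshgrid(grid, grid)

# Each line below allocates a new array the same size as x
x_sq = x**2
y_sq = y**2
total = x_sq + y_sq        # the original expression computes this sum twice
numerator = np.cos(total)
denominator = 1 + total
result = numerator / denominator

print(f"{x_sq.nbytes / 1e6:.1f} MB per temporary")
```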

## Parallelization

In addition to vectorization, NumPy cleverly implements *implicit multithreading* in a lot of its compiled code. 

Multithreading is one type of parallelization that can speed up code execution, especially when handling large amounts of data or running CPU-intensive simulations and other calculations. 

Numba-vectorized functions can gain further speed through Numba’s automatic parallelization
feature, which is enabled by specifying `target='parallel'`. 

In this case, we need to specify the types of our inputs and outputs.

For example, for the same maximization problem,

In [None]:
@vectorize('float64(float64, float64)', target='parallel')
def f_vec(x, y):
    return np.cos(x**2 + y**2) / (1 + x**2 + y**2)

np.max(f_vec(x, y))  # Run once to compile

start = time.time() # record start time

### f_vec vectorized through Numba with parallelization
np.max(f_vec(x, y))

end = time.time() # record end time

elapsed_time = end - start # test timing
print(elapsed_time) 