# Numba

Numba is an open-source just-in-time (JIT) compiler for Python. It translates a subset of Python and NumPy code directly into machine code, using LLVM as the compiler backend. Numba also offers several options for parallelizing Python code on CPUs and GPUs, often with only minor code changes. There are many ways to approach compiling Python; Numba’s approach is to compile individual functions, or collections of functions, just in time, as they are needed.

To read more about it, please refer to [this article](https://analyticsindiamag.com/make-python-code-faster-with-numba/).

# Using Numba to make Python & NumPy code faster

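Numba can be installed from PyPI. The cells below upgrade pip, install Numba and NumPy, and then restart the kernel so that the newly installed packages are picked up: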

In [None]:
!python -m pip install pip --upgrade --user -q

In [None]:
!python -m pip install numba numpy --user -q

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

Numba uses decorators to convert Python functions into functions that compile themselves. The most common Numba decorator is @jit. Let’s create an example function and see @jit in action.

In [None]:
import numba
import numpy as np
from numba import jit

@jit(nopython=True)
def example_function(n):
    # Sum tanh over the diagonal, then add that scalar to every element
    trace = 0.0
    for i in range(n.shape[0]):
        trace += np.tanh(n[i, i])
    return n + trace

The nopython=True option requires Numba to compile the function fully, so that calls into the Python interpreter are removed completely. If that is not possible, an exception is raised; these exceptions usually indicate places in the function that need to be refactored to achieve better-than-Python performance. Using nopython=True is strongly recommended.

We’ll use the %timeit magic to measure execution time, because it runs the function many times to get a more accurate estimate for short functions. Our function has not been compiled yet; compilation happens the first time we call it:

In [None]:
import numpy as np
n = np.arange(10000).reshape(100, 100)
%timeit example_function(n) 

The function has now been compiled and executed, and the generated machine code is kept in memory for the rest of the session. When the function is called again with the same argument types, that machine code is executed directly without any need for recompilation.

In [None]:
%timeit example_function(n)

When benchmarking Numba-compiled functions, it is important to time them without including the compilation step, since a given function is compiled only once per set of argument types. Let’s compare against the uncompiled function. Numba-compiled functions have a .py_func attribute that gives access to the original, uncompiled Python function.

In [None]:
%timeit example_function.py_func(n)

The original Python function is more than 20 times slower than the Numba-compiled version. However, our example function uses explicit loops, which are very fast in Numba but slow in pure Python. The function is simple enough that we can try optimizing it by rewriting it using only NumPy expressions:

In [None]:
def numpy_example(n):
    return n + np.tanh(np.diagonal(n)).sum()
%timeit numpy_example(n) 

The refactored NumPy version is roughly 10 times faster than the Python version but still slower than the Numba-compiled version. 

## Multithreading with Numba

Operations in NumPy array expressions are often applied independently to each input element, so they carry a significant amount of implied parallelism. Numba’s ParallelAccelerator optimization identifies this parallelism and automatically distributes the work over several threads. To enable the parallelization pass, all we need to do is use the parallel=True option.

In [None]:
SQRT_2PI = np.sqrt(2 * np.pi)
@jit(nopython=True, parallel=True)
def gaussians(x, means, widths):
    n = means.shape[0]
    result = np.exp( -0.5 * ((x - means) / widths)**2 ) / widths
    return result / SQRT_2PI / n

Let’s call the function once to compile it:

In [None]:
means = np.random.uniform(-1, 1, size=1000000)
widths = np.random.uniform(0.1, 0.3, size=1000000)
gaussians(0.4, means, widths)

Now we can accurately compare the effect of threading and compiling with the normal Python version:

In [None]:
gaussians_nothread = jit(nopython=True)(gaussians.py_func)
%timeit gaussians(0.4, means, widths)  # numba-compiled and threading
%timeit gaussians_nothread(0.4, means, widths) # no threading
%timeit gaussians.py_func(0.4, means, widths) # normal python 

Some situations are well suited to multithreading even though there is no array expression, only a loop in which each iteration is independent of the others. In these cases, we can use prange() in a for loop to tell ParallelAccelerator that the loop can be executed in parallel:

In [None]:
import random
# Serial version
@jit(nopython=True)
def monte_carlo_pi_serial(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

# Parallel version
@jit(nopython=True, parallel=True)
def monte_carlo_pi_parallel(nsamples):
    acc = 0
    # Only change is here
    for i in numba.prange(nsamples):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

%time monte_carlo_pi_serial(int(4e8))
%time monte_carlo_pi_parallel(int(4e8)) 

One thing to note here is that, with parallel=True, prange() automatically handles the reduction on the variable acc in a thread-safe way. Additionally, Numba automatically initializes the random number generator in each thread independently.

Alternatively, you can use modules like concurrent.futures or Dask to run functions in multiple threads. For these use cases, ParallelAccelerator isn’t helpful; we only want the Numba-compiled function to run concurrently in different threads. To accomplish this, the Numba function needs to release the Global Interpreter Lock (GIL) during execution, which can be done using the nogil=True option.