# Background
* Numba was initially developed to speed up use cases that [NumPy](https://numpy.org) handles inefficiently.
* NumPy stores data in a multi-dimensional array object (ndarray).
* Python operators on ndarrays trigger operations that are implemented in C, which makes them very efficient.
* Before Numba, NumPy users had to write Python C extensions to implement custom computations efficiently.


* Numba is a **function-at-a-time** Just-in-Time **(JIT)** compiler for CPython.
* Numba lets users annotate a **compute-intensive** Python function for compilation without rewriting the code in a low-level language like C.

## How does Numba work

* The programmer adds a Numba [decorator](https://www.datacamp.com/tutorial/decorators-python) to the function. 
* The decorator replaces the original Python function with a special object that just-in-time compiles the function when it is called the first time.

In [None]:
import numba
from numba import jit, int32, prange, vectorize, float64, cuda
import numpy as np
import math


In [None]:
@jit
def f(x, y):
    return x + y

In [None]:
print(f(2, 3)) # First call: compiles a version specialized for integer arguments

In [None]:
# we can see what LLVM is doing with the function
for k, v in f.inspect_llvm().items():
    print("--------Type------\n",k)
    print("-----------LLVM IR string-----\n",v)

In [None]:
print(f(2.5, 3.9)) # New argument types: compiles a second version, specialized for floats

* We can tell Numba to generate code for only one set of argument types.

In [None]:
@jit(int32(int32, int32))
def f_1(x, y):
    return x + y

In [None]:
print(f_1(2, 3))

In [None]:
print(f_1('2', '3')) # raises TypeError: no compiled version matches string arguments

* One compiled function can call another compiled function.

In [None]:
@jit
def square(x):
    return x ** 2

@jit
def hypot(x, y):
    return math.sqrt(square(x) + square(y))

In [None]:
print(hypot(4, 3))

# Performance

In [None]:
def without_numba(a): 
    trace = 0.0
    for i in range(a.shape[0]):   
        trace += np.tanh(a[i, i]) 
    return a + trace            



In [None]:
@jit(nopython=True) # Set "nopython" mode for best performance, equivalent to @njit
def with_numba(a): # Function is compiled to machine code when called the first time
    trace = 0.0
    for i in range(a.shape[0]):   # Numba likes loops
        trace += np.tanh(a[i, i]) # Numba likes NumPy functions
    return a + trace              # Numba likes NumPy broadcasting



In [None]:
x = np.arange(100).reshape(10, 10)

In [None]:
%%time
print(without_numba(x))

In [None]:
%%time
print(with_numba(x))

In [None]:
x = np.arange(1000000).reshape(1000, 1000)

In [None]:
%%time
print(without_numba(x))

In [None]:
%%time
print(with_numba(x))


* The performance difference increases as the problem size increases.

# Compilation options

## nopython

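* nopython mode
    * Compiles without using the Python C API (faster).
    * Bypasses the Python interpreter entirely.
* object mode
    * Compiles using the Python C API (slower).
* In older Numba releases, `@jit` without `nopython=True` silently fell back to object mode when nopython compilation failed. Since Numba 0.59, nopython mode is the default and the automatic fallback no longer happens.
* Either way, it is good practice to compile everything in nopython mode.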

In [None]:
# nopython=True prevents Numba from falling back to object mode.
# Compilation raises an error if type inference fails.

@jit(nopython=True)
def f(x, y):
    return x + y

## nogil

In [None]:
# Numba-compiled code that operates only on native types does not need the GIL,
# so nogil=True releases it while the function runs.

@jit(nopython=True, nogil=True)
def f(x, y):
    return x + y

# Beware: without the GIL, threads that share data can run into synchronization issues.

## cache

In [None]:
# Chances are you will call the same function again and again with the same argument types,
# so you can cache the compiled code on disk and skip recompilation in later runs.

@jit(nopython=True, cache=True)
def f(x, y):
    return x + y

## Automatic parallelization

In [None]:
#automatic parallelization

@jit(nopython=True, parallel=True)
def f(x, y):
    return x + y

In [None]:
@jit(nopython=True)
def reduction_without_parallel(n):
    shp = (13, 17)
    result1 = 2 * np.ones(shp, dtype=np.float64)
    tmp = 2 * np.ones_like(result1)

    for i in prange(n):  # without parallel=True, prange behaves like range
        result1 *= tmp

    return result1

In [None]:
%%time 
reduction_without_parallel(10)

In [None]:
@jit(nopython=True, parallel=True)
def reduction_with_parallel(n):
    shp = (13, 17)
    result1 = 2 * np.ones(shp, dtype=np.float64)
    tmp = 2 * np.ones_like(result1)

    for i in prange(n):
        result1 *= tmp

    return result1

In [None]:
%%time 
reduction_with_parallel(10)

In [None]:
%%time 
reduction_without_parallel(10000000)

In [None]:
%%time 
reduction_with_parallel(10000000)

## ufunc
* In NumPy, a universal function (ufunc) is a function that operates on ndarrays element by element.

In [None]:
x = [1, 2, 3, 4]
y = [4, 5, 6, 7]
z = np.add(x, y)

print(z)

* Creating a ufunc that operates on an ndarray of a particular type is not straightforward.

In [None]:
def sinacosb(a, b):
    return math.sin(a) * math.cos(b)
    

In [None]:
n = 10000000
a = np.ones(n, dtype=np.dtype('f8'))
b = 2*a

In [None]:
sinacosb(a, b) # raises TypeError: math.sin and math.cos accept only scalars

In [None]:
# Numba makes this process easy

@vectorize([float64(float64, float64)]) 
def sinacosb_vect(a, b):
    return math.sin(a) * math.cos(b)
     

In [None]:
sinacosb_vect(a, b)