# Numba

[Numba](https://numba.pydata.org/numba-doc/dev/user/overview.html) is a compiler for Python array and numerical functions that gives you the power to speed up your applications with high performance functions written directly in Python.

Numba generates optimized machine code from pure Python code using the LLVM compiler infrastructure. With a few simple annotations, array-oriented and math-heavy Python code can be just-in-time compiled to reach performance similar to C, C++ and Fortran, without having to switch languages or Python interpreters.

Numba’s main features are:

* on-the-fly code generation (at import time or runtime, at the user’s preference)

* native code generation for the CPU (default) and GPU hardware

* integration with the Python scientific software stack (thanks to NumPy)

## Compiling Python code with `@jit`

### Lazy compilation
The recommended way to use the `@jit` decorator is to let Numba decide when and how to optimize:

In [1]:
from numba import jit

@jit
def sum(x, y):
    return x + y

%timeit sum(2,2)


The slowest run took 13.75 times longer than the fastest. This could mean that an intermediate result is being cached.
1.16 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Compilation options

A number of keyword-only arguments can be passed to the `@jit` decorator.

**nopython**

Numba has two compilation modes: **nopython** mode and **object** mode. Nopython mode produces much faster code, but its limitations can force Numba to fall back to object mode. To prevent this fallback and raise an error instead, pass `nopython=True`. Recent Numba releases (0.59 and later) use nopython mode by default.

In [2]:
@jit(nopython=True)
def sum(x, y):
    return x + y

%timeit sum(2,2)

267 ns ± 1.45 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


### Eager compilation

You can also tell Numba the function signature you expect. The `sum()` function would then look like:

In [3]:
@jit('int8(int8,int8)',nopython=True)
def sum(x, y):
    return x + y

%timeit sum(2,2)

282 ns ± 0.713 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


With an explicit signature, as above, the corresponding specialization is compiled by the `@jit` decorator itself, and no other specialization is allowed. This is useful if you want fine-grained control over the types chosen by the compiler (for example, to use single-precision floats).

By contrast, in lazy mode (no signature) compilation is deferred until the first function execution. Numba infers the argument types at call time and generates optimized code based on this information. It can also compile separate specializations depending on the input types: calling the lazily compiled `sum()` function with integers or with complex numbers generates different code paths.

### Signature specifications

Explicit `@jit` signatures can use a number of types. Here are some common ones:

- `void` is the return type of functions returning nothing (which actually return None when called from Python)
- `intp` and `uintp` are pointer-sized integers (signed and unsigned, respectively)
- `intc` and `uintc` are equivalent to C int and unsigned int integer types
- `int8`, `uint8`, `int16`, `uint16`, `int32`, `uint32`, `int64`, `uint64` are fixed-width integers of the corresponding bit width (signed and unsigned)
- `float32` and `float64` are single- and double-precision floating-point numbers, respectively
- `complex64` and `complex128` are single- and double-precision complex numbers, respectively
- array types can be specified by indexing any numeric type, e.g. `float32[:]` for a one-dimensional single-precision array or `int8[:,:]` for a two-dimensional array of 8-bit integers.

#### nogil

Whenever Numba optimizes Python code to native code that only works on native types and variables (rather than Python objects), holding Python’s global interpreter lock (GIL) is no longer necessary. Numba releases the GIL when entering such a compiled function if you pass `nogil=True`.

This is not possible if the function is compiled in object mode.

In [6]:
@jit(nogil=True,nopython=True)
def sum(x, y):
    return x + y

%timeit sum(2,2)

327 ns ± 3.77 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


#### cache
To avoid paying the compilation cost each time you invoke a Python program, you can instruct Numba to write the result of function compilation into a file-based cache. This is done by passing `cache=True`:

In [7]:
@jit(cache=True,nopython=True)
def sum(x, y):
    return x + y

%timeit sum(2,2)

266 ns ± 2.83 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


#### parallel

Enables automatic parallelization (and related optimizations) for those operations in the function known to have parallel semantics. For a list of supported operations, see Automatic parallelization with `@jit`. This feature is enabled by passing `parallel=True` and must be used in conjunction with `nopython=True`:

In [8]:
@jit(nopython=True, parallel=True)
def sum(x, y):
    return x + y


### Matrix addition
![matrix addition](https://media.geeksforgeeks.org/wp-content/uploads/20230608165718/Matrix-Addition.png)
*Image from geeksforgeeks.org*

In [9]:
# Initialize the input matrices
A = [[1, 2], [3, 4]]
B = [[4, 5], [6, 7]]
 
# Initialize the result matrix
C = [[0, 0], [0, 0]]

# just loop through each dimension 
for i in range(len(A)):
    for j in range(len(A[0])):
        C[i][j] = A[i][j] + B[i][j]
    
print(C)

[[5, 7], [9, 11]]
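
For comparison, the same element-wise addition can be written without explicit loops by using NumPy arrays. This is a small sketch with the same input values as above.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[4, 5], [6, 7]])

C = A + B  # element-wise addition over the whole array
print(C.tolist())  # → [[5, 7], [9, 11]]
```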


In [45]:
%%writefile ben.py
import random
import time
import numpy as np
import numba
from numba import jit


def matrix_addition(A,B,C):
    
    # just loop through each dimension 
    for i in range(len(A)):
        for j in range(len(A[0])):
            C[i][j] = A[i][j] + B[i][j]

    return(C)

# Parallel version
@jit(nopython=True, parallel=True)
def matrix_addition_parallel(A,B,C):
    
    for i in numba.prange(len(A)): # loop over rows in parallel
        for j in range(len(A[0])):
            C[i][j] = A[i][j] + B[i][j]

    return(C)
        

number_cols = 2000 
number_rows = 2000 

A = np.random.rand(number_rows,number_cols)
B = np.random.rand(number_rows,number_cols)
C = np.zeros((np.shape(A)[0],np.shape(A)[1]))

start = time.time()
result_serial = matrix_addition(A,B,C)
end = time.time()
print("Time serial:", end-start)

C = np.zeros((np.shape(A)[0],np.shape(A)[1]))

start = time.time()
result_parallel = matrix_addition_parallel(A,B,C)
end = time.time()
print("Time parallel:", end-start)


print("Print just checking: ",np.mean(result_serial - result_parallel))

Overwriting ben.py


In [46]:
!python ben.py

Time serial: 3.9077062606811523
Time parallel: 0.7738394737243652
Print just checking:  0.0

## The explicit matrix multiplication example, now with Numba!

In [67]:
from numba import jit
from random import random
import numpy as np
import numba
import time

def explicit_matmul(A,B,C):
    #A[m][n]
    #B[n][p]
    #C[m][p]    
    for i in range(np.shape(A)[0]): #(i=1...m) Rows in A
        for j in range(np.shape(B)[1]): # (j=1...p) Columns in B
            for k in range(np.shape(A)[1]): # (k=1...n) Columns in A
                C[i][j] += A[i][k] * B[k][j]
    return(C)


AX=AY=BX=BY=200

A = np.random.rand(AX,AY)
B = np.random.rand(BX,BY)  
C = np.zeros((AX,BY))  # result has the rows of A and the columns of B



start = time.perf_counter()
C = explicit_matmul(A,B,C)
end = time.perf_counter()
print("Serial: ",end-start)

Serial:  9.799237744882703


## Examples of lightweight profiling for your code

 -  **%timeit** A very useful magic function (especially for this course!)
 -  **time** (module) This module provides various time-related functions.
 -  **cProfile** (module) This module is recommended for most users; it’s a C extension with reasonable overhead that makes it suitable for profiling long-running programs. Based on lsprof, contributed by Brett Rosen and Ted Czotter.
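
Outside IPython, where the `%timeit` magic is not available, the standard-library `timeit` module offers similar functionality. A minimal sketch, with an illustrative function `add`:

```python
import timeit

def add(x, y):
    return x + y

calls_per_run = 100_000
# Repeat the measurement 7 times, with calls_per_run calls in each run
timings = timeit.repeat(lambda: add(2, 2), repeat=7, number=calls_per_run)

best_ns = min(timings) / calls_per_run * 1e9
print(f"best of 7 runs: {best_ns:.1f} ns per call")
```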

In [None]:
import time

start = time.perf_counter_ns()
explicit_matmul(A, B, np.zeros_like(C))  # fresh output array, since the function accumulates into C
end = time.perf_counter_ns()

print("Time of function execution is " + str(end - start) + " ns")

In [None]:
import cProfile

cProfile.run('explicit_matmul(A, B, np.zeros_like(C))')  # By default, run() prints the report to standard output


In [None]:
cProfile.run('explicit_matmul(A, B, np.zeros_like(C))', "my_perf_file.out")  # With a filename, run() saves the profile data to that file instead

In [None]:
import pstats
from pstats import SortKey

p = pstats.Stats('my_perf_file.out')  # read in the profile data

# you can sort by the internal time
p.sort_stats(SortKey.TIME)
p.print_stats()

# you can sort by the number of calls
p.sort_stats(SortKey.CALLS)
p.print_stats()

# you can reverse the order
p.reverse_order()
p.print_stats()
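
Profile data does not have to go through a file. The sketch below profiles a small illustrative function, `work`, and collects the report in a string instead of printing it directly. It assumes you only want the five most expensive entries by cumulative time.

```python
import cProfile
import io
import pstats
from pstats import SortKey

def work(n):
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
work(10_000)
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
# Drop directory names, sort by cumulative time, and keep the top 5 entries
stats.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(5)

report = buffer.getvalue()
print(report)
```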


In [None]:
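import cProfile
import functools

def do_profile(func):
    """Decorator that profiles each call to func and prints the statistics."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def profiled_func(*args, **kwargs):
        profile = cProfile.Profile()
        profile.enable()
        try:
            return func(*args, **kwargs)
        finally:
            # Stop profiling and print the report, even if func raised an exception
            profile.disable()
            profile.print_stats()
    return profiled_func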

In [None]:
# Matrix multiplication with NumPy
@do_profile
def numpy_matmul(A,B):
    npA = np.array(A)
    npB = np.array(B)
    C = np.matmul(npA,npB)  # multiply the converted arrays, not the nested lists
    return C

# Simple matrix multiplication algorithm
@do_profile
def explicit_matmul(A,B):
    # C has len(A) rows and len(B[0]) columns
    C = [[0 for x in range(len(B[0]))] for y in range(len(A))]
    for i in range(len(A)):
        for j in range(len(B[0])):
            for k in range(len(B)):
                C[i][j] += A[i][k] * B[k][j]
    return C

#Set matrix dimension
AX=AY=BX=BY=100

#Define Matrix A
A = [[random() for x in range(AX)] for y in range(AY)]

#Define Matrix B
B = [[random() for x in range(BX)] for y in range(BY)]

res = numpy_matmul(A,B)

res = explicit_matmul(A,B)