
# Numba Demo
* API:  https://numba.pydata.org/numba-doc/latest/index.html


### Implementing a simple function and getting the runtime

In [1]:
import random
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [2]:
%%timeit
monte_carlo_pi(10000)

100 loops, best of 5: 4.27 ms per loop
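
Outside a notebook, where the `%%timeit` magic is not available, the standard-library `timeit` module gives a comparable measurement. The following is a minimal sketch; the helper name `best_time` and the repeat counts are illustrative assumptions and not part of the original notebook.

```python
import random
import timeit


def monte_carlo_pi(nsamples):
    """Estimate pi from the fraction of random points inside the unit quarter circle."""
    acc = 0
    for _ in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples


def best_time(func, arg, number=10, repeat=5):
    """Return the best per-call time in seconds over `repeat` runs of `number` calls."""
    timings = timeit.repeat(lambda: func(arg), number=number, repeat=repeat)
    return min(timings) / number


seconds = best_time(monte_carlo_pi, 10000)
print(f"monte_carlo_pi(10000): {seconds * 1e3:.2f} ms per call")
```

Taking the minimum over several repeats mirrors the "best of 5" figure that `%%timeit` reports above.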


### Now the same thing with NUMBA compilation

In [3]:
from numba import jit
import random

@jit(nopython=True)
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

# NOTE: call the function once before timing so that the JIT compilation time is not included in the comparison
monte_carlo_pi(10000)

3.1336

In [4]:
%%timeit
monte_carlo_pi(10000)

1000 loops, best of 5: 244 µs per loop
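
The warm-up call matters because Numba compiles a function the first time it is called with a given argument type. A rough way to see this is to time the first and the second call separately. This is only a sketch: the helper `timed_call` and the variable names are illustrative, `njit` is used as shorthand for `jit(nopython=True)`, and the `ImportError` fallback is an added assumption so that the snippet still runs (without any speedup) where Numba is not installed.

```python
import random
import time

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    # Assumption for illustration: fall back to plain Python if Numba is missing
    def njit(func):
        return func


@njit
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples


def timed_call(nsamples):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = monte_carlo_pi(nsamples)
    return result, time.perf_counter() - start


# With Numba, the first call includes JIT compilation; the second reuses the compiled code
estimate, first_call = timed_call(10000)
_, second_call = timed_call(10000)
print(f"first call:  {first_call:.6f} s")
print(f"second call: {second_call:.6f} s")
```

With Numba installed, the first call is expected to be much slower than the second, which is why it is kept out of the timed cell.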


### Now with Multi-Threading

In [5]:
# install an extra threading library (TBB)
!pip install tbb

Collecting tbb
  Downloading tbb-2021.6.0-py2.py3-none-manylinux1_x86_64.whl (4.0 MB)
Installing collected packages: tbb
Successfully installed tbb-2021.6.0


In [6]:
import numba as nb
@jit(nopython=True, parallel=True)
def monte_carlo_pi_parallel(nsamples):
    acc = 0
    for i in nb.prange(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

# NOTE: call the function once before timing so that the JIT compilation time is not included in the comparison
monte_carlo_pi_parallel(10000)



3.16

In [7]:
%%timeit
monte_carlo_pi(10000)

10000 loops, best of 5: 120 µs per loop


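Note that the cell above times `monte_carlo_pi` again, not `monte_carlo_pi_parallel`; to measure the parallel version, call `monte_carlo_pi_parallel(10000)` under `%%timeit` instead. Parallel execution is not always faster: for a workload as small as 10,000 samples, the overhead of distributing the loop across threads can outweigh the gain.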
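
Whether `parallel=True` pays off depends on the amount of work per call. One way to check is to time the serial and the parallel variant for a small and a large sample count. This is a hedged sketch: the sample sizes, the function names `pi_serial` and `pi_parallel`, the helper `best_of`, and the `ImportError` fallback (which turns both variants into identical plain Python) are assumptions for illustration, and the measured numbers depend on the machine.

```python
import random
import time

try:
    from numba import njit, prange
except ImportError:
    # Assumption for illustration: without Numba, run both variants as plain Python
    prange = range

    def njit(**options):
        return lambda func: func


@njit()
def pi_serial(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples


@njit(parallel=True)
def pi_parallel(nsamples):
    acc = 0
    # prange marks the loop as parallel; acc is treated as a reduction variable
    for i in prange(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples


def best_of(func, nsamples, repeat=3):
    """Best wall-clock time in seconds over `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(nsamples)
        best = min(best, time.perf_counter() - start)
    return best


# Warm-up calls so that compilation time is excluded
pi_serial(10)
pi_parallel(10)

timings = {}
for n in (10_000, 1_000_000):
    timings[n] = (best_of(pi_serial, n), best_of(pi_parallel, n))
    print(f"n={n:>9}: serial {timings[n][0]:.6f} s, parallel {timings[n][1]:.6f} s")
```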

## NUMBA GPU-Example 

In [8]:
from numba import cuda, guvectorize, vectorize, void, int32, float64, uint32
import math
import numpy as np
np.random.seed(1)

In [9]:
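# CUDA kernel: each thread computes one element of r = a * x + y
@cuda.jit
def axpy(r, a, x, y):
    # Absolute index of this thread in the 1D grid
    i = cuda.grid(1)
    # Guard: the grid may contain more threads than array elements
    if i < len(r):
        r[i] = a * x[i] + y[i]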

In [10]:
def create_and_add_vectors(N):
    # Create input data and transfer to GPU
    x = np.random.random(N)
    y = np.random.random(N)
    d_x = cuda.to_device(x)
    d_y = cuda.to_device(y)
    d_r = cuda.device_array_like(d_x)
    a = 4.5

    # Compute grid dimensions
    
    # An arbitrary reasonable choice of block size
    block_dim = 256
    # Enough blocks to cover the input
    grid_dim = math.ceil(len(d_x) / block_dim)

    # Launch the kernel
    axpy[grid_dim, block_dim](d_r, a, d_x, d_y)
    
    # Return the result
    return d_r.copy_to_host()

In [11]:
create_and_add_vectors(32)

array([2.83448855, 3.77462551, 0.6923918 , 1.67601221, 1.34690244,
       1.25014935, 0.85645923, 2.30516759, 2.77431472, 3.17284096,
       2.16681931, 3.87276708, 1.02326113, 4.39942199, 1.03183967,
       3.31071794, 2.16564695, 2.6441328 , 0.65110818, 1.57029223,
       3.81497868, 4.62272375, 1.90198196, 3.16881432, 4.51786879,
       4.17245856, 0.97200449, 0.87550488, 0.86657132, 4.36569725,
       1.13696091, 2.30916358])
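
The bounds check `if i < len(r)` in the kernel is needed because the grid is rounded up to whole blocks: for `N = 32` and a block size of 256, one block of 256 threads is launched and 224 of them have no element to process. The indexing can be emulated on the CPU in plain Python to make this visible. This is a sketch for illustration only: `emulate_axpy_launch` is a hypothetical helper, it runs the threads one after another, and it is not how Numba executes the kernel.

```python
import math


def emulate_axpy_launch(a, x, y, block_dim=256):
    """Sequentially emulate a 1D launch of the axpy kernel on Python lists."""
    n = len(x)
    r = [0.0] * n
    # Enough blocks to cover the input, as in create_and_add_vectors
    grid_dim = math.ceil(n / block_dim)
    idle_threads = 0
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # Global thread index, the value cuda.grid(1) returns in a 1D launch
            i = block_idx * block_dim + thread_idx
            if i < n:
                r[i] = a * x[i] + y[i]
            else:
                # Threads past the end of the array must do nothing
                idle_threads += 1
    return r, grid_dim, idle_threads


x = [float(k) for k in range(32)]
y = [1.0] * 32
r, grid_dim, idle_threads = emulate_axpy_launch(4.5, x, y)
print(grid_dim, idle_threads)  # 1 block, 224 idle threads
print(r[:3])                   # [1.0, 5.5, 10.0]
```

Without the guard, the idle threads would index past the end of the arrays, which on a GPU means out-of-bounds memory access.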