<a href="https://colab.research.google.com/github/JacobDowns/CSCI-491-591/blob/main/lecture8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Use Numba on the GPU in Colab
!apt-get install nvidia-cuda-toolkit
!pip3 install numba

import os
os.environ['NUMBAPRO_LIBDEVICE'] = "/usr/lib/nvidia-cuda-toolkit/libdevice"
os.environ['NUMBAPRO_NVVM'] = "/usr/lib/x86_64-linux-gnu/libnvvm.so"

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  fonts-dejavu-core fonts-dejavu-extra libaccinj64-11.5 libatk-wrapper-java
  libatk-wrapper-java-jni libbabeltrace1 libcub-dev libcublas11 libcublaslt11
  libcudart11.0 libcufft10 libcufftw10 libcuinj64-11.5 libcupti-dev
  libcupti-doc libcupti11.5 libcurand10 libcusolver11 libcusolvermg11
  libcusparse11 libdebuginfod-common libdebuginfod1 libegl-dev libgail-common
  libgail18 libgl-dev libgl1-mesa-dev libgles-dev libgles1 libglvnd-core-dev
  libglvnd-dev libglx-dev libgtk2.0-0 libgtk2.0-bin libgtk2.0-common libipt2
  libnppc11 libnppial11 libnppicc11 libnppidei11 libnppif11 libnppig11
  libnppim11 libnppist11 libnppisu11 libnppitc11 libnpps11 libnvblas11
  libnvidia-compute-495 libnvidia-compute-510 libnvidia-compute-535
  libnvidia-ml-dev libnvjpeg11 libnvrtc-builtins11.5 libnvrtc11.2
  libnvtoolsext1 libnvvm4 libopengl-dev librsvg2-c

# Numba-Cuda
* Numba has its own spin-off project called `numba-cuda` that allows you to write CUDA kernels in Python in a similar manner to its regular JIT-compiled functions
* As demonstrated in the example kernels, the kernels you write will be executed by many (often hundreds or thousands of) threads at once
* The hierarchical model of grids, blocks, and threads is very much at work here
* Numba exposes three types of GPU memory
  * Global device memory (the largest and slowest)
  * On chip shared memory
  * Local memory

## Kernel Declaration
* As we previously described, a kernel is a function that will be executed on the GPU
* Kernels cannot explicitly return a value, so any result data needs to be written to an array passed to the kernel
* As in the CUDA examples shown previously, you must specify the number of blocks and the threads per block when launching a kernel
* Once a kernel is compiled, it can be called with different block / thread sizes without being recompiled
* The basic syntax for defining a kernel in Numba is
```python
@cuda.jit
def increment_by_one(an_array):
    """
    Increment all array elements by one.
    """
    # Do some stuff
```
while launching the kernel looks like this:
```python
threadsperblock = 32
blockspergrid = (an_array.size + (threadsperblock - 1)) // threadsperblock
increment_by_one[blockspergrid, threadsperblock](an_array)
```
* Note that kernels run asynchronously: launches queue their execution on the device and then return immediately. You can use `cuda.synchronize()` to wait for all previous kernel launches to finish executing
* However, the Numba docs note that passing an array that resides in host memory will implicitly cause a copy back to the host, which will be synchronous
* In this case, the kernel launch will not return until the data is copied back, and therefore appears to execute synchronously
* Clearly, there are some important considerations here about how to determine the number of blocks and threads per block. We'll come back to that later
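
The block-count expression in the launch above is an integer ceiling division. The following is a minimal pure-Python sketch of that arithmetic (the helper name `blocks_needed` is hypothetical and not part of Numba). It shows that the formula always yields enough threads to cover the array, without adding a spare block when the size divides evenly:

```python
def blocks_needed(n_elements, threadsperblock):
    """Smallest block count whose threads cover n_elements (hypothetical helper)."""
    return (n_elements + (threadsperblock - 1)) // threadsperblock

for n in (1, 32, 33, 100000):
    nblocks = blocks_needed(n, 32)
    # The total number of threads launched is never smaller than the array size
    print(n, nblocks, nblocks * 32)
```

Any threads beyond the end of the array still run, so the kernel body has to guard against out-of-range indices.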

## Multi-dimensional Blocks / Grids
* For dealing with multi-dimensional arrays, CUDA lets you specify multi-dimensional blocks and grids
* Thus `blockspergrid` and `threadsperblock` may each be a tuple of 1 to 3 integers
* Compared with a 1D launch of the same total size, this has no effect on the efficiency or behavior of the code, but it can help you index the data in a more natural way
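
As a rough illustration of why dimensionality is mainly an indexing convenience, the plain-Python sketch below counts the threads in a 1D launch and in a 2D launch. The helper `total_threads` is hypothetical and only does arithmetic; it does not call Numba or touch a GPU.

```python
import math

def total_threads(blockspergrid, threadsperblock):
    """Total threads in a launch; both arguments are tuples of ints (hypothetical helper)."""
    return math.prod(blockspergrid) * math.prod(threadsperblock)

# 4 blocks of 256 threads versus a 2 x 2 grid of 16 x 16 blocks
launch_1d = total_threads((4,), (256,))
launch_2d = total_threads((2, 2), (16, 16))
print(launch_1d, launch_2d)
```

Both launches create the same number of threads; only the way each thread is labeled differs.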

## Thread Positioning
* As in the CUDA example, each thread needs to know what chunk of the input data it is responsible for
* To do this, Numba CUDA kernels have access to `cuda.threadIdx.x`, `cuda.blockIdx.x`, and `cuda.blockDim.x`
```python
@cuda.jit
def increment_by_one(an_array):
    # Thread id in a 1D block
    tx = cuda.threadIdx.x
    # Block id in a 1D grid
    ty = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Compute flattened index inside the array
    pos = tx + ty * bw
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] += 1
```
> Note: Unless you are sure the block size evenly divides the array size, so that the launch creates exactly one thread per element, you must check boundaries as shown above.
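
To see why the boundary check matters, here is a small CPU-only sketch that replays the same index arithmetic sequentially. It does not use Numba or a GPU; the function name `emulate_increment_by_one` and the loop variables are hypothetical stand-ins for `cuda.blockIdx.x` and `cuda.threadIdx.x`.

```python
def emulate_increment_by_one(an_array, blockspergrid, threadsperblock):
    """Run the 1D kernel body once per (block, thread) pair, sequentially on the CPU."""
    skipped = 0
    for block_idx in range(blockspergrid):         # plays the role of cuda.blockIdx.x
        for thread_idx in range(threadsperblock):  # plays the role of cuda.threadIdx.x
            pos = thread_idx + block_idx * threadsperblock
            if pos < len(an_array):                # same boundary check as the kernel
                an_array[pos] += 1
            else:
                skipped += 1                       # thread with no element to process
    return skipped

data = [0] * 10
threadsperblock = 4
blockspergrid = (len(data) + threadsperblock - 1) // threadsperblock  # 3 blocks, 12 threads
skipped = emulate_increment_by_one(data, blockspergrid, threadsperblock)
print(data, skipped)
```

With 10 elements and 12 threads, two threads compute a position past the end of the array. On a real device, those threads would access memory outside the array if the check were missing.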

* Since this type of indexing is very common, Numba provides some niceties to calculate this position
* For example, `pos` can be calculated as follows using `cuda.grid`
```python
@cuda.jit
def increment_by_one(an_array):
    pos = cuda.grid(1)
    if pos < an_array.size:
        an_array[pos] += 1
```
For a 2D array, you can make the block and grid two dimensional:
```python
import math

threadsperblock = (16, 16)
blockspergrid_x = math.ceil(an_array.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(an_array.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
increment_a_2D_array[blockspergrid, threadsperblock](an_array)
```
Then you can get the thread's (x, y) position in the array with:
```python
@cuda.jit
def increment_a_2D_array(an_array):
    x, y = cuda.grid(2)
    if x < an_array.shape[0] and y < an_array.shape[1]:
        an_array[x, y] += 1
```
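
The 2D launch and kernel above can be checked on the CPU with a sequential sketch of the same indexing. This is an illustration only: the function `emulate_increment_2d` is hypothetical, the array is a plain list of lists, and the four nested loops stand in for the block and thread indices that `cuda.grid(2)` combines.

```python
import math

def emulate_increment_2d(an_array, blockspergrid, threadsperblock):
    """Visit every (x, y) pair a 2D launch would produce, sequentially on the CPU."""
    rows, cols = len(an_array), len(an_array[0])
    for bx in range(blockspergrid[0]):
        for by in range(blockspergrid[1]):
            for tx in range(threadsperblock[0]):
                for ty in range(threadsperblock[1]):
                    x = tx + bx * threadsperblock[0]
                    y = ty + by * threadsperblock[1]
                    if x < rows and y < cols:  # same boundary check as the kernel
                        an_array[x][y] += 1

grid_2d = [[0] * 5 for _ in range(3)]  # 3 x 5 array as a list of lists
threadsperblock = (2, 2)
blockspergrid = (math.ceil(3 / threadsperblock[0]), math.ceil(5 / threadsperblock[1]))
emulate_increment_2d(grid_2d, blockspergrid, threadsperblock)
print(blockspergrid, grid_2d)
```

Each element is incremented exactly once, even though the launch covers a 4 x 6 region that is larger than the 3 x 5 array.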

## Example: Vector Addition
* Let's see a simple example for performing vector addition


In [2]:
from numba import cuda

@cuda.jit
def f(a, b, c):
    # like threadIdx.x + (blockIdx.x * blockDim.x)
    tid = cuda.grid(1)
    size = len(c)

    if tid < size:
        c[tid] = a[tid] + b[tid]

In [3]:
import numpy as np
N = 100000
a = cuda.to_device(np.random.random(N))
b = cuda.to_device(np.random.random(N))
c = cuda.device_array_like(a)

In [4]:
# Enough threads per block for several warps per block
nthreads = 256
# Enough blocks to cover the entire vector depending on its length
nblocks = (len(a) + nthreads - 1) // nthreads
f[nblocks, nthreads](a, b, c)
print(c.copy_to_host())



ERROR:numba.cuda.cudadrv.driver:Call to cuLinkAddData results in CUDA_ERROR_UNSUPPORTED_PTX_VERSION


LinkerError: [222] Call to cuLinkAddData results in CUDA_ERROR_UNSUPPORTED_PTX_VERSION
ptxas application ptx input, line 9; fatal   : Unsupported .version 8.5; current version is '8.4'

After the installation is complete, please restart the Colab runtime by going to "Runtime" -> "Restart runtime" in the menu. Then, you can run the code again.