# Numba for the GPU
## 1- Universal functions (ufuncs) and generalized ufuncs (gufuncs)
"ufuncs" operate in an __elementwise__ fashion: each output element depends only on the corresponding input elements. Hence, they are well suited to parallelisation.

Given a scalar function of its inputs, Numba works out the broadcasting rules for array arguments automatically.
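
To make the broadcasting behaviour concrete, here is a minimal CPU-only sketch that uses NumPy's built-in `np.add` ufunc as a stand-in. It assumes that a Numba-compiled ufunc is called the same way and follows the same NumPy broadcasting rules. The array names are illustrative only.

```python
import numpy as np

# np.add is a built-in NumPy ufunc, used here as a stand-in for a compiled one.
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

elementwise = np.add(a, b)    # same shapes: elements are paired one-to-one
with_scalar = np.add(a, 100)  # the scalar is stretched to match `a`

# A (3, 1) column and a (4,) row broadcast to a (3, 4) grid.
column = np.arange(3).reshape(3, 1)
grid = np.add(column, a)

print(elementwise)
print(with_scalar)
print(grid.shape)
```

The scalar function itself never sees the shapes: it is applied to one pair of elements at a time, and the ufunc machinery decides how the inputs line up.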

In [30]:
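import numpy as np
from numba import vectorize

# To target CUDA, explicit type signatures have to be defined.
# Use float32 when possible for faster runtime.
# CUDA ufuncs pick a signature from the argument dtypes, so the calls
# below pass explicitly typed values to avoid an ambiguous or missing match.
@vectorize(['int64(int64, int64)',
            'float32(float32, float32)'], target='cuda')
def add_func(x, y):
    return x + y

print('a+b:\n', add_func(np.int64(1), np.int64(2)))
print('a+b:\n', add_func(np.float32(1.3), np.float32(2)))
print('a+b:\n', add_func(np.ones(3, dtype=np.float32), 2))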

# Ufuncs that call special functions (`exp`, `sin`, `cos`, etc.) on large data sets tend to run especially well on the GPU.

# Device functions can be called only from code that runs on the GPU (a kernel, a CUDA ufunc, or another device function).
# The CUDA compiler typically inlines device functions.
from numba import cuda
# * NumPy (`np`) math functions may not work on the device; use their `math` counterparts instead.
import math

@cuda.jit(device=True)
def device_exp(a):
    return math.exp(a)

@vectorize(['float32(float32, float32)'], target='cuda')
def function_to_be_compiled(a, b):
    return device_exp(a) + device_exp(b)

### GPU memory
# It is good practice to allocate device memory once and refill it with host data at run time.
import numpy as np
from numba import cuda
n = 100000
x = np.arange(n).astype(np.float32)
y = 2 * x
x_device = cuda.to_device(x)
y_device = cuda.to_device(y)

print(x_device)
print(x_device.shape)
print(x_device.dtype)

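# add_func (defined earlier) must be compiled with a 'float32(float32, float32)' signature for these float32 inputs.
%timeit add_func(x, y)  # host arrays: inputs are copied to the GPU and the result copied back on every call
%timeit add_func(x_device, y_device)  # device arrays: inputs are already on the GPU

out_device = cuda.device_array(shape=(n,), dtype=np.float32)  # does not initialize the contents, like np.empty()
%timeit add_func(x_device, y_device, out=out_device)  # writes into the preallocated output buffer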
out_host = out_device.copy_to_host()
print(out_host[:10])

Without a CUDA-capable GPU visible to the driver, the cells above fail at driver initialization:

CudaSupportError: Error at driver init:
[100] Call to cuInit results in CUDA_ERROR_NO_DEVICE:

Ufuncs broadcast a scalar function over array inputs, but what if you want to broadcast a lower-dimensional array function over a higher-dimensional array? This is called a generalized ufunc ("gufunc"), and it extends the ufunc idea from single elements to whole sub-arrays.

Generalized ufuncs are a little more tricky because they need a signature (not to be confused with the Numba type signature) that shows the index ordering when dealing with multiple inputs. Fully explaining "gufunc" signatures is beyond the scope of this tutorial, but you can learn more from:

Let's write our own L2-norm function. It takes an array input and computes the L2 norm along the last dimension. Generalized ufuncs take their output array as the last argument rather than returning a value. Even when the core output is a scalar, the result is an array with one fewer dimension than the input. For example, computing the row sums of an array returns a 1D array for a 2D input, or a 2D array for a 3D input.
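
The shape rule can be checked without a GPU. The sketch below is not Numba code: it relies on NumPy's `np.vectorize`, which accepts the same `'(i)->()'` layout-signature syntax but runs as a plain Python loop. The helper name `row_sum_core` is hypothetical and used only for illustration.

```python
import numpy as np

def row_sum_core(vec):
    # Core operation: reduce one 1D vector to a scalar.
    return vec.sum()

# '(i)->()' means: consume the last axis (length i), produce one scalar per vector.
row_sum = np.vectorize(row_sum_core, signature='(i)->()')

m2 = np.arange(6, dtype=np.float32).reshape(2, 3)      # 2D input
m3 = np.arange(24, dtype=np.float32).reshape(2, 3, 4)  # 3D input

out2 = row_sum(m2)
out3 = row_sum(m3)
print(out2.shape, out3.shape)
```

The leading dimensions are treated as loop dimensions and preserved; only the axis named in the signature is consumed.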

In [None]:
import math
import numpy as np
from numba import guvectorize

@guvectorize(['(float32[:], float32[:])'], # have to include the output array in the type signature
             '(i)->()',                 # map a 1D array to a scalar output
             target='cuda')
def l2_norm(vec, out):
    acc = 0.0
    for value in vec:
        acc += value**2
    out[0] = math.sqrt(acc)

angles = np.random.uniform(-np.pi, np.pi, 10)
coords = np.stack([np.cos(angles), np.sin(angles)], axis=1).astype(np.float32)  # match the float32 signature
print(coords)

l2_norm(coords)
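
As a sanity check for the result, a CPU reference can be computed with `np.linalg.norm`. This is a sketch only: the fixed seed and the tolerance are arbitrary choices, not part of the original example. Because every row is a point on the unit circle, each norm should be close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an arbitrary choice for reproducibility
angles = rng.uniform(-np.pi, np.pi, 10)
coords = np.stack([np.cos(angles), np.sin(angles)], axis=1).astype(np.float32)

# Reference L2 norm along the last axis, computed on the CPU.
expected = np.linalg.norm(coords, axis=-1)

print(expected.shape)
print(np.allclose(expected, 1.0, atol=1e-5))
```

On a machine with a GPU, comparing `l2_norm(coords)` against `expected` with `np.allclose` would confirm that the gufunc agrees with the reference.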

## 2- CUDA Python kernels

* Custom CUDA Kernels in Python with Numba
* Multidimensional Grids and Shared Memory for CUDA Python with Numba

# pyCUDA