# Numba for the GPU

- [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html)

- [PyCUDA](https://documen.tician.de/pycuda/), which requires writing the GPU code in C from within Python, is not tested here.

__Note__: the examples below may not show significant speedups; their goal is to introduce Numba's syntax.

## Universal and generalized functions (ufuncs/gufuncs)
Ufuncs operate __elementwise__, which makes them well suited to parallelisation.
Given a scalar function, Numba works out how to broadcast it over all of its inputs.

- Ufuncs that involve heavy math operations on large data sets may be suitable for the GPU.
- `np` math functions won't work on the device but their `math` counterparts do.
- Use `float32` where precision allows: it generally runs faster than `float64` on GPUs.

In [None]:
import numpy as np
from numba import cuda, vectorize, guvectorize
import math

In [None]:
# an explicit type signature has to be defined
@vectorize(['int64(int64, int64)'], target='cuda')
def add_func(x, y):
    return x + y

print('a+b:\n', add_func(1, 2))
print('a+b:\n', add_func(2.3, 2)) # implicit cast into int64

## Device functions

- Can be called only from code running on the GPU (a kernel, a CUDA ufunc, or another device function), not from the host.
- The CUDA compiler usually inlines device functions.

In [None]:
@cuda.jit(device=True) # NO explicit type signature
def device_exp(a):
    return math.exp(a)

@vectorize(['float32(float32, float32)'], target='cuda')
def function_to_be_compiled(a,b):
    return device_exp(a) + device_exp(b)

function_to_be_compiled(1,2)

## GPU memory
Allocate device memory once and refill it with host data at runtime: repeated allocations and host-device transfers are expensive.

In [None]:
@vectorize(['float32(float32, float32)'], target='cuda')
def add_func(x, y):
    return x + y

n = 1000000
x = np.arange(n).astype(np.float32)
y = 2 * x

x_device = cuda.to_device(x)
y_device = cuda.to_device(y)

%timeit add_func(x, y)
%timeit add_func(x_device, y_device) # the output is still a numba.cuda.cudadrv.devicearray.DeviceNDArray

out_device = cuda.device_array(shape=(n,), dtype=np.float32)  # does not initialize the contents, like np.empty()
%timeit add_func(x_device, y_device, out=out_device)
%timeit out_host = out_device.copy_to_host()
%timeit x_device = cuda.to_device(x)

## Generalized functions `gufuncs`

- Generalized ufuncs (ufuncs that operate on sub-arrays of their inputs, for example broadcasting a scalar against a 1D array) need a layout signature such as `(n),()->(n)` that declares the dimensions of each input and output.
- The last argument of a gufunc is its output array.

In [None]:
# the output array has to be included in the type signature
# '(n),()->(n)' maps a 1D array and a scalar to a 1D array
@guvectorize(['(float32[:], float32, float32[:])'], '(n),()->(n)', target='cuda')
def cuda_add(x, y, out):
    for i in range(x.shape[0]):
        out[i] = x[i] + y

cuda_add(np.ones(3), 1.0)
# cuda_add(np.ones(3), 1)  # TypeError: no matching signature



