# Numba for the GPU
[CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html)

Note: PyCUDA, which requires writing C code inside Python, is not tested here.

## Universal and generalised functions (ufuncs/gufuncs)
Universal functions ("ufuncs") operate in an __elementwise__ fashion, which makes them well suited to parallelisation.
You define a scalar function of all the inputs, and Numba works out the broadcasting rules for array arguments.

Ufuncs that call math functions (`exp`, `sin`, `cos`, etc.) on large data sets run well on the GPU. NumPy (`np`) math functions won't work on the device, but their `math` module counterparts do.

In [1]:
import numpy as np
from numba import vectorize
from numba import cuda
from numba import guvectorize
import math

In [2]:
# an explicit type signature has to be defined, use float32 when possible for faster runtime
@vectorize(['int64(int64, int64)'], target='cuda')
def add_func(x, y):
    return x + y

print('a+b:\n', add_func(1, 2))
# the float argument 2.3 is cast to int64 to match the signature, so the result is 4
print('a+b:\n', add_func(2.3, 2))

a+b:
 [3]
a+b:
 [4]


## Device functions
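Device functions can be called only from code that runs on the GPU (a kernel, a CUDA ufunc, or another device function), not from the host.

The CUDA compiler typically inlines device functions.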

In [3]:
@cuda.jit(device=True)
def device_exp(a):
    return math.exp(a)

@vectorize(['float32(float32, float32)'], target='cuda')
def function_to_be_compiled(a,b):
    return device_exp(a) + device_exp(b)

function_to_be_compiled(1,2)

array([10.107338], dtype=float32)

## GPU memory
It is good practice to allocate device memory once and refill it with host data at runtime, rather than allocating new device arrays on every call.

In [5]:
@vectorize(['float32(float32, float32)'], target='cuda')
def add_func(x, y):
    return x + y

n = 1000000
x = np.arange(n).astype(np.float32)
y = 2 * x
%timeit x_device = cuda.to_device(x)
x_device = cuda.to_device(x)
y_device = cuda.to_device(y)

%timeit add_func(x, y)
%timeit add_func(x_device, y_device) # the output is still a numba.cuda.cudadrv.devicearray.DeviceNDArray

out_device = cuda.device_array(shape=(n,), dtype=np.float32)  # does not initialize the contents, like np.empty()
%timeit add_func(x_device, y_device, out=out_device)
%timeit out_host = out_device.copy_to_host()

1.07 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.8 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.59 ms ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.65 ms ± 42.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
725 µs ± 41.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## Generalised functions `gufuncs`

Generalised ufuncs (which need to broadcast one of their inputs) require a layout signature that specifies the index ordering of each input and output. The last argument of a gufunc is its output array.

In [15]:
# have to include the output array in the type signature, '(n),()->(n)' maps a 1D array and a scalar to 1D output
@guvectorize(['(float32[:],float32, float32[:])'], '(n),()->(n)', target='cuda')
def cuda_add(x,y, out):
    for i in range(x.shape[0]):
        out[i] = x[i] + y

cuda_add(np.ones(3),1.0)

array([2., 2., 2.], dtype=float32)