In [None]:
# Modules needed: numba, numpy, math (vectorize and cuda are imported from numba)

# CUDA Ufuncs

Numba's `vectorize` decorator lets Python functions that take scalar input arguments be used as NumPy ufuncs. With `vectorize()`, Numba can compile a pure Python function into a ufunc that operates over NumPy arrays and, when given `target='cuda'`, executes on the GPU.

With `vectorize()`, you write your function to operate on scalar inputs rather than on arrays. Numba generates the surrounding loop (or kernel), which iterates efficiently over the actual inputs.
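
To make the "scalar function plus generated loop" idea concrete, here is a conceptual sketch in plain Python and NumPy. It is not what Numba actually emits; `scalar_kernel` and `apply_elementwise` are hypothetical names used only for illustration, and the hand-written loop merely stands in for the loop or kernel that Numba would generate.

```python
import math
import numpy as np

def scalar_kernel(x, y):
    # Written for a single pair of scalars, like the body of a vectorize() function
    return math.sqrt(x * x + y * y)

def apply_elementwise(func, xs, ys):
    # Hand-written stand-in for the element-wise loop that Numba generates
    out = np.empty_like(xs)
    for i in range(xs.size):
        out[i] = func(xs[i], ys[i])
    return out

xs = np.array([3.0, 5.0, 8.0])
ys = np.array([4.0, 12.0, 15.0])
result = apply_elementwise(scalar_kernel, xs, ys)
print(result)
```

With `vectorize()`, only the scalar function needs to be written; the equivalent of `apply_elementwise` is produced by the compiler.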

In [None]:
import numba
from numba import vectorize, cuda
import numpy as np
import math

### Law of Cosines

For a triangle with sides $a$, $b$, and $c$, where $C$ is the angle opposite side $c$, the law of cosines states that

$$
\frac{a^2+b^2-c^2}{2ab}=\cos C
$$
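
Solving the relation above for the angle gives $C = \arccos\left(\frac{a^2+b^2-c^2}{2ab}\right)$. As a quick sanity check, the sketch below evaluates this for two triangles whose angles are known in advance, using only the standard library. The function name `angle_opposite_c` is hypothetical and chosen for illustration.

```python
import math

def angle_opposite_c(a, b, c):
    # Law of cosines solved for the angle C (in radians) opposite side c
    cos_C = (a**2 + b**2 - c**2) / (2.0 * a * b)
    return math.acos(cos_C)

# A 3-4-5 triangle is right-angled, so C should be about 90 degrees
right_angle = angle_opposite_c(3.0, 4.0, 5.0)

# An equilateral triangle has all angles of about 60 degrees
equilateral = angle_opposite_c(1.0, 1.0, 1.0)

print(math.degrees(right_angle))
print(math.degrees(equilateral))
```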

### Numba Ufunc Kernel

Below, we define a GPU-accelerated vectorized function. Providing signatures to the decorator makes compilation eager, meaning it happens at decoration time, and specifying `target='cuda'` compiles the function for the GPU. GPU-targeted ufuncs require signatures.

As an exercise, complete the missing lines of code.  You will have to specify the signature(s) and computation.  

In [None]:
@vectorize(# --- FILL THIS IN ---,
           target='cuda')
def compute_angle(a, b, c):
    
    # --- FILL THIS IN ---
    

### Prepare Data

In [None]:
N = int(5e8)
dtype = np.float32

# prepare the input
a = np.array(np.random.sample(N)+3, dtype=dtype)

b = np.array(np.random.sample(N)+4, dtype=dtype)

c = np.array(np.random.sample(N)+5, dtype=dtype)
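
The offsets 3, 4, and 5 place the sides in the ranges [3, 4), [4, 5), and [5, 6), so every triple satisfies the triangle inequality and the argument passed to the arccosine stays inside [-1, 1]. The sketch below checks this on a small sample; the sample size `n_small` and the `_small` variable names are hypothetical and chosen to avoid overwriting the full-size arrays.

```python
import numpy as np

n_small = 1000  # small hypothetical sample, not the full N
dtype = np.float32

a_small = np.array(np.random.sample(n_small) + 3, dtype=dtype)
b_small = np.array(np.random.sample(n_small) + 4, dtype=dtype)
c_small = np.array(np.random.sample(n_small) + 5, dtype=dtype)

# Argument of arccos in the law of cosines
cos_C = (a_small**2 + b_small**2 - c_small**2) / (2.0 * a_small * b_small)

# c is the longest side here, so c < a + b is the binding triangle inequality
valid_triangles = bool(np.all(c_small < a_small + b_small))
in_domain = bool(np.all(np.abs(cos_C) <= 1.0))
print(valid_triangles, in_domain)
```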



### Call GPU Ufunc
Calling a GPU ufunc is as straightforward as calling a NumPy function. Under the hood, the CUDA launch configuration is handled automatically.

In [None]:
%%timeit -n2 -r5 -o
C_GPU = compute_angle(a,b,c)

In [None]:
# store the timing result (`_` holds the TimeitResult returned by the previous cell's -o flag)
GPU_TIMING = _

### Numpy Version
Notice that the NumPy version of the same calculation looks nearly identical to the ufunc definition.

In [None]:
%%timeit -n1 -r1 -o
# CPU version
C_CPU = np.arccos(( a**2 + b**2 - c**2 ) / ( 2.0 * a * b ))

In [None]:
# store the timing result (`_` holds the TimeitResult returned by the previous cell's -o flag)
CPU_TIMING = _

### Computing Speedup Factor

In [None]:
print('Speedup factor: ', CPU_TIMING.average / GPU_TIMING.average, 'X')

### Checking Results

In [None]:
# recompute: names assigned inside a %%timeit cell are not kept in the notebook namespace
C_GPU = compute_angle(a, b, c)
C_CPU = np.arccos((a**2 + b**2 - c**2) / (2.0 * a * b))

tol = 1e-5
if np.all(np.abs(C_CPU - C_GPU) < tol):
    print('results agree')
else:
    print('results differ by more than', tol)
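
When a check like the one above fails, it helps to know how large the worst discrepancy is. The following is a minimal sketch of such a report; the helper `compare_results` is hypothetical, and the small arrays are stand-ins for the CPU and GPU outputs rather than real results.

```python
import numpy as np

def compare_results(reference, candidate, tol=1e-5):
    # Compute the element-wise difference in float64 to avoid extra rounding
    diff = np.abs(reference.astype(np.float64) - candidate.astype(np.float64))
    # Return whether all elements agree within tol, and the largest difference
    return bool(np.all(diff < tol)), float(diff.max())

# Stand-in data: a float32 array and a slightly perturbed copy
reference = np.array([0.5, 1.0, 1.5], dtype=np.float32)
candidate = reference + np.float32(1e-7)

agree, worst = compare_results(reference, candidate)
print('agree:', agree, 'largest difference:', worst)
```

In the notebook, the same helper could be called as `compare_results(C_CPU, C_GPU)`.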