## Homework 10: GPUs

## Due Date: April 26, 2023, 11:59pm

#### Firstname Lastname: Ching-Tsung(Deron) Tsai

#### E-mail: ct2840@nyu.edu

#### Enter your solutions and submit this notebook

---

**Problem 1 (100p)**


Write two programs that run in parallel on a GPU: one using Numba/CUDA (50p) and one using PyOpenCL (50p).


Each program will:

- draw two random vectors $\vec u$ and $\vec v$ from $[0,1]^N$ where $N = 10^7$;


- calculate and output the similarity between $\vec u$ and $\vec v$.




The similarity between two vectors $\vec u$ and $\vec v$ is defined here as the cosine of the angle between them, $\measuredangle \left( \vec u, \vec v \right)$. That is, the program returns:

$$\cos \left( \measuredangle \left( \vec u, \vec v \right) \right).$$


Note that the output is a real value and must belong to $[-1, 1]$.
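
For checking the GPU results, the definition can be written out as $\cos \left( \measuredangle \left( \vec u, \vec v \right) \right) = \frac{\vec u \cdot \vec v}{\lVert \vec u \rVert \, \lVert \vec v \rVert}$. The following is a minimal CPU sketch of that formula in NumPy. The helper name `cosine_similarity_reference`, the seed, the smaller `N`, and the float64 accumulation are assumptions made for illustration and are not part of the assignment.

```python
import numpy as np

def cosine_similarity_reference(u, v):
    """CPU reference (hypothetical helper): cosine of the angle between u and v."""
    u = np.asarray(u, dtype=np.float64)   # accumulate in float64 to limit rounding error
    v = np.asarray(v, dtype=np.float64)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)   # seed chosen arbitrarily for this sketch
N = 10**6                        # smaller than the assignment's 10**7 to keep the check fast
u = rng.random(N)                # uniform samples from [0, 1)
v = rng.random(N)
cos_uv = cosine_similarity_reference(u, v)
print(cos_uv)
```

For independent uniform entries on $[0,1]$, $E[u_i v_i] = 1/4$ and $E[u_i^2] = E[v_i^2] = 1/3$, so for large $N$ the similarity concentrates near $(1/4)/(1/3) = 0.75$. Because all entries are non-negative, the value also cannot fall below $0$.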

### CUDA

In [1]:
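# "!export" runs in a temporary subshell, so it does not change the notebook's environment;
# the %env cell below is what actually enables the simulator.
!export NUMBA_ENABLE_CUDASIM=1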

In [2]:
%env NUMBA_ENABLE_CUDASIM=1

env: NUMBA_ENABLE_CUDASIM=1


In [3]:
from numba import cuda
print(cuda.gpus)
cuda.select_device(0)

<Managed Device 0>


In [4]:
import numpy as np
from numba import cuda
import math

@cuda.jit
def cos_similarity(u, v, s, norm_u, norm_v):
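    """
    Compute the element-wise products needed for the cosine similarity.

    u, v: input arrays
    s, norm_u, norm_v: output arrays holding u*v, u*u and v*v per element
    """
    i = cuda.grid(1)
    if i < u.shape[0]:       # skip threads that fall past the end of the arrays
        s[i] = u[i] * v[i]
        norm_u[i] = u[i] * u[i]
        norm_v[i] = v[i] * v[i]


N = int(1e7)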

# create 2 random inputs:
np.random.seed(777)        # make outcome reproducible
u = cuda.to_device(np.random.rand(N).astype(np.float32))
v = cuda.to_device(np.random.rand(N).astype(np.float32))

# to save outputs:
s = cuda.device_array(N, dtype=np.float32)
norm_u = cuda.device_array(N, dtype=np.float32)
norm_v = cuda.device_array(N, dtype=np.float32)


threads_per_block = 32
blocks_per_grid = math.ceil(N / threads_per_block)

# calculation:
cos_similarity[blocks_per_grid, threads_per_block](u, v, s, norm_u, norm_v)

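# copy the element-wise products back to the host and reduce them there
# (Numba device arrays on a real GPU do not provide .sum())
dot_product = s.copy_to_host().sum()
norm_u_sum = norm_u.copy_to_host().sum()
norm_v_sum = norm_v.copy_to_host().sum()
similarity = dot_product / (np.sqrt(norm_u_sum) * np.sqrt(norm_v_sum))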
print(similarity)

0.75015926
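
The launch configuration in the cell above rounds the number of blocks up so that every element gets a thread, which is why the kernel needs its `i < u.shape[0]` guard. The sketch below works through that arithmetic. The helper name `launch_config` and the alternative block size of 256 are assumptions used only for illustration.

```python
import math

def launch_config(n, threads_per_block):
    """Return (blocks_per_grid, surplus_threads) for a 1D launch covering n elements.

    Hypothetical helper: it mirrors the ceil-division used in the cell above.
    """
    blocks_per_grid = math.ceil(n / threads_per_block)
    surplus_threads = blocks_per_grid * threads_per_block - n
    return blocks_per_grid, surplus_threads

N = 10**7
print(launch_config(N, 32))    # 32 divides N exactly, so no thread is left over
print(launch_config(N, 256))   # 256 does not, so the last block has threads past the end
```

With 32 threads per block the grid covers the arrays exactly. With a block size that does not divide $N$, the surplus threads in the last block are the ones the bounds check skips.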


### PyOpenCL

In [10]:
import pyopencl as cl  
import pyopencl.array as pycl_array  
import numpy as np  

context = cl.create_some_context()  
queue = cl.CommandQueue(context) 

N = int(1e7)
# create 2 random inputs
np.random.seed(777)       # make output reproducible
u = pycl_array.to_device(queue, np.random.rand(N).astype(np.float32))
v = pycl_array.to_device(queue, np.random.rand(N).astype(np.float32))

# to save outputs:
s = pycl_array.empty_like(u) 
norm_u = pycl_array.empty_like(u)
norm_v = pycl_array.empty_like(u)

program = cl.Program(context, """
__kernel void cos_similarity(__global const float *u,
                             __global const float *v,
                             __global float *s,
                             __global float *norm_u,
                             __global float *norm_v)
{
  // one work-item per element; the global size passed at launch equals N,
  // so no bounds check is needed here
  int i = get_global_id(0);
  s[i] = u[i]*v[i];
  norm_u[i] = u[i]*u[i];
  norm_v[i] = v[i]*v[i];
}""").build()

# calculation:
program.cos_similarity(queue, u.shape, None, u.data, v.data, s.data, norm_u.data, norm_v.data)

dot_product = np.sum(s.get())
norm_u_sum = np.sum(norm_u.get())
norm_v_sum = np.sum(norm_v.get())
similarity = dot_product / (np.sqrt(norm_u_sum) * np.sqrt(norm_v_sum))
print(similarity)

0.75015926
