## Homework 10: GPUs

## Due Date: April 26, 2023, 11:59pm

#### Firstname Lastname: Seonhye Yang

#### E-mail: sy3420@nyu.edu

#### Enter your solutions and submit this notebook

---

**Problem 1 (100p)**


Write two programs that run in parallel on a GPU: one using Numba/CUDA (50p) and one using PyOpenCL (50p).


Each program will:

- draw two random vectors $\vec u$ and $\vec v$ from $[0,1]^N$ where $N = 10^7$;


- calculate and output the similarity between $\vec u$ and $\vec v$.




The similarity between two vectors $\vec u$ and $\vec v$ is defined here as the cosine of the angle between them, $\measuredangle \left( \vec u, \vec v \right)$. That is, the program returns:

$$\cos \left( \measuredangle \left( \vec u, \vec v \right) \right).$$


Note that the output is a real value and must belong to $[-1, 1]$.
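
For computation, the cosine can be written in terms of three dot products, which are the sums a program has to accumulate. As a rough sanity check on any output (an estimate that assumes the components are independent and uniform on $[0,1]$), replacing each sum by its expected value gives a result close to $3/4$ for large $N$:

$$\cos \left( \measuredangle \left( \vec u, \vec v \right) \right) = \frac{\vec u \cdot \vec v}{\lVert \vec u \rVert \, \lVert \vec v \rVert} = \frac{\sum_{i=1}^{N} u_i v_i}{\sqrt{\sum_{i=1}^{N} u_i^2} \, \sqrt{\sum_{i=1}^{N} v_i^2}}$$

$$\mathbb{E}\left[ u_i v_i \right] = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}, \qquad \mathbb{E}\left[ u_i^2 \right] = \int_0^1 x^2 \, dx = \frac{1}{3}, \qquad \cos \left( \measuredangle \left( \vec u, \vec v \right) \right) \approx \frac{N/4}{\sqrt{N/3} \, \sqrt{N/3}} = \frac{3}{4}$$

Because all components are nonnegative, the result for these particular inputs lies in $[0, 1]$.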

**Problem 1: Numba/CUDA**

In [6]:
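# `!export` runs in a temporary subshell, so it does not change this kernel's environment;
# the %env magic in the next cell is what actually sets the variable.
!export NUMBA_ENABLE_CUDASIM=1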

In [7]:
%env NUMBA_ENABLE_CUDASIM=1

env: NUMBA_ENABLE_CUDASIM=1


In [124]:
from numba import cuda

In [148]:
from numba import jit
import numpy as np

N = 10**7
u = np.random.rand(N)
v = np.random.rand(N)

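# Note: @jit compiles this loop for the CPU; a GPU version would use a @cuda.jit kernel.
@jit(nopython=True)
def cosine_similarity(u, v):
    # Accumulate the three dot products in a single pass.
    uv = 0.0  # u . v
    uu = 0.0  # u . u
    vv = 0.0  # v . v
    for i in range(u.shape[0]):
        uv += u[i] * v[i]
        uu += u[i] * u[i]
        vv += v[i] * v[i]
    # The cosine is undefined if either vector has zero norm.
    if uu == 0 or vv == 0:
        return None
    return uv / (uu * vv) ** 0.5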

In [149]:
cosine_similarity(u, v)

0.750266367175593

**Problem 1: PyOpenCL**

In [134]:
!pip install pyopencl

In [152]:
import numpy as np
import pyopencl as cl

N = 10**7

u = np.random.rand(N).astype(np.float32)
v = np.random.rand(N).astype(np.float32)

# ctx = cl.create_some_context()
platform = cl.get_platforms()[0]  # Select the first platform (index 0)
device = platform.get_devices()[1]  # Select the second device (index 1) on this platform
ctx = cl.Context([device])  # Create a context with the chosen device
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
u_0 = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=u)
v_0 = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=v)



# One output buffer per element-wise product, each the same size as the inputs.
# (u.v and v.u are identical, so a single buffer covers both.)
uu_1 = cl.Buffer(ctx, mf.WRITE_ONLY, u.nbytes)
vv_1 = cl.Buffer(ctx, mf.WRITE_ONLY, v.nbytes)
uv_1 = cl.Buffer(ctx, mf.WRITE_ONLY, u.nbytes)

# Each work-item writes the three products for its own index i.
# The outputs are assigned with '=' because the buffers start uninitialized.
prg = cl.Program(ctx, '''
__kernel void sum(__global const float *u_0,
                  __global const float *v_0,
                  __global float *uu_1,
                  __global float *vv_1,
                  __global float *uv_1)
{
    int i = get_global_id(0);
    uu_1[i] = u_0[i] * u_0[i];
    vv_1[i] = v_0[i] * v_0[i];
    uv_1[i] = u_0[i] * v_0[i];
}''').build()

# Launch one work-item per vector element; the argument order matches the kernel signature.
prg.sum(queue, u.shape, None, u_0, v_0, uu_1, vv_1, uv_1)

# Copy the element-wise products back to the host.
uu = np.empty_like(u)
vv = np.empty_like(v)
uv = np.empty_like(u)

cl.enqueue_copy(queue, uu, uu_1)
cl.enqueue_copy(queue, vv, vv_1)
cl.enqueue_copy(queue, uv, uv_1)


# Reduce the element-wise products on the host and combine them into the cosine.
cosine = np.sum(uv) / (np.sum(uu) * np.sum(vv)) ** 0.5

print(f"cosine:{cosine}")

cosine:0.7500136151735816
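
Both reported values are close to 0.75. The following CPU-only cross-check in plain Python suggests that this is what uniform random vectors should produce. It is only a sketch: the function name `cosine_similarity_reference`, the seed, and the reduced sample size are arbitrary choices for illustration, and no GPU library is involved.

```python
import math
import random


def cosine_similarity_reference(u, v):
    """Cosine of the angle between two equal-length sequences of floats."""
    # math.fsum limits rounding error when adding many small terms.
    uv = math.fsum(a * b for a, b in zip(u, v))
    uu = math.fsum(a * a for a in u)
    vv = math.fsum(b * b for b in v)
    if uu == 0.0 or vv == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return uv / math.sqrt(uu * vv)


random.seed(0)
n = 100_000  # smaller than 10**7 so the pure-Python loop stays fast
u_ref = [random.random() for _ in range(n)]
v_ref = [random.random() for _ in range(n)]
cos_ref = cosine_similarity_reference(u_ref, v_ref)
print(f"reference cosine: {cos_ref:.4f}")
```

A GPU result that differs from such a reference by much more than the sampling noise would point to a bug in the kernel rather than to the random inputs.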
