
# CUDA Programming Assignment (Using Numba on GPU)

**Instructions:**
- You will implement GPU kernels using Numba for:
  - Vector Addition
  - Dot Product
  - ReLU Activation
- Compare the performance and correctness against CPU implementations.

Note: This assignment cannot be reliably executed on Google Colab due to compatibility issues between the Colab environment's Python 3.11, Numba, and CUDA toolkit versions.

For successful execution, it's recommended to run this assignment on:

- A local machine with a compatible NVIDIA GPU and environment

OR

- A Kaggle notebook with GPU enabled (https://www.kaggle.com/code)


In [1]:
import numpy as np
from numba import cuda, float32
import math
import time
print("CUDA available?", cuda.is_available())
print("GPUs detected:", cuda.gpus)

CUDA available? False
GPUs detected: <Managed Device 0>


In [2]:
## Function for elementwise comparison between 2 arrays
def compare(a, b, rtol=1e-5, atol=1e-8):
    return np.allclose(a, b, rtol=rtol, atol=atol)

## Function to check if the relative error (difference) between 2 values is within a defined threshold
def within_relative_error(cpu_val, gpu_val, threshold=0.0002):
    if cpu_val == 0:
        return abs(gpu_val) < threshold
    relative_error = abs(cpu_val - gpu_val) / abs(cpu_val)
    return relative_error <= threshold
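
A quick sanity check may make the choice of tolerance more concrete. The sketch below repeats `within_relative_error` so that it runs standalone; the sample values are made up for illustration and are not results from this assignment. The underlying idea is that a sum of a million float32 products can differ slightly between CPU and GPU because the additions happen in a different order, so a scalar result is usually better compared on a relative scale than with a tight absolute tolerance.

```python
def within_relative_error(cpu_val, gpu_val, threshold=0.0002):
    # Same logic as the helper above, repeated so this snippet is standalone.
    if cpu_val == 0:
        return abs(gpu_val) < threshold
    return abs(cpu_val - gpu_val) / abs(cpu_val) <= threshold

# Hypothetical values: a difference of 30 on a result near 250000
# is a relative error of 1.2e-4, which is inside the 2e-4 threshold.
print(within_relative_error(250000.0, 250030.0))  # True
# A difference of 100 is a relative error of 4e-4, outside the threshold.
print(within_relative_error(250000.0, 250100.0))  # False
# When the reference value is zero, the check falls back to an absolute bound.
print(within_relative_error(0.0, 0.0001))  # True
```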

In [3]:
## Function to compute dot product of 2 vectors using CPU
def dot_product_cpu(A, B):
    assert len(A) == len(B)
    result = 0.0
    for i in range(len(A)):
        result += A[i] * B[i]
    return result

## Function to perform elementwise addition of 2 vectors using CPU
def vector_add_cpu(A, B):
    assert len(A) == len(B)
    result = [0.0] * len(A)
    for i in range(len(A)):
        result[i] = A[i] + B[i]
    return result

## Function to apply ReLU activation on a vector using CPU
def relu_activation_cpu(x):
    return [val if val > 0 else 0 for val in x]

In [4]:
## Number of datapoints
N = 1_000_000

## Randomly initializing the 2 vectors
A = np.random.rand(N).astype(np.float32)
B = np.random.rand(N).astype(np.float32)

## Number of threads per block
threads = 256
## Number of required blocks
blocks = math.ceil(N / threads)
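
The block count above rounds up, so the grid usually contains a few more threads than there are elements. The following sketch works through that arithmetic for the values used here; `launch_config` is a hypothetical helper name introduced only for this illustration. It is the reason a kernel needs a bounds check such as `if i < A.size`.

```python
import math

def launch_config(n, threads_per_block=256):
    # Round up so that every element is covered by at least one thread.
    num_blocks = math.ceil(n / threads_per_block)
    # Equivalent integer-only form of the same ceiling division.
    assert num_blocks == (n + threads_per_block - 1) // threads_per_block
    # Threads in the last block whose index falls past the end of the data.
    idle_threads = num_blocks * threads_per_block - n
    return num_blocks, idle_threads

print(launch_config(1_000_000))  # (3907, 192)
```

With one million elements and 256 threads per block, 192 threads in the final block have no element to process and must skip the write.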

In [5]:
## Storing the data in the gpu for processing
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
d_C = cuda.device_array_like(A)

## Part 1: Vector Addition (GPU)

In [6]:
# TODO

## Write a kernel function to perform vector addition between A and B

# Defining the GPU kernel
@cuda.jit
def vector_add_gpu(A, B, C):
    i = cuda.grid(1)
    if i < A.size:
        C[i] = A[i] + B[i]

In [7]:
## NOTE: Run this cell twice — GPU kernel launch is slow on first run due to compilation.
start_cpu = time.time()
cpu_result = vector_add_cpu(A, B)
cpu_time = (time.time() - start_cpu) * 1000

start_gpu = time.time()
## Call the kernel function here to perform vector addition and generate the result
vector_add_gpu[blocks, threads](d_A, d_B, d_C)
## Wait for the kernel to finish before stopping the timer
cuda.synchronize()
gpu_time = (time.time() - start_gpu) * 1000

print(f"Vector Add - CPU Time: {cpu_time:.3f} ms")
print(f"Vector Add - GPU Time: {gpu_time:.3f} ms")

# TODO
## Call the 'compare' function to check if the cpu and gpu results are equal
# Bring the result back to the host and verify it
gpu_result = d_C.copy_to_host()
print(f"Match: {compare(cpu_result, gpu_result)}")

NvvmSupportError: libNVVM cannot be found. Do `conda install cudatoolkit`:
Could not find module 'nvvm.dll' (or one of its dependencies). Try using the full path with constructor syntax.

Example output:

Vector Add - CPU Time: 239.820 ms

Vector Add - GPU Time: 0.369 ms

Match: True
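
The advice to run the timing cell twice can also be handled in code by making an untimed warm-up call first. The sketch below is one possible way to do that using only the standard library; `time_ms` is a hypothetical helper, not part of the assignment template. For a GPU kernel, the callable passed in would need to include `cuda.synchronize()`, as the timing cell does, because a kernel launch returns before the kernel has finished.

```python
import time

def time_ms(fn, *args, warmup=1, repeats=5):
    # Untimed warm-up calls absorb one-off costs such as JIT compilation.
    for _ in range(warmup):
        fn(*args)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        elapsed_ms = (time.perf_counter() - start) * 1000
        best = min(best, elapsed_ms)
    # Report the fastest run, which is the least affected by system noise.
    return best

# CPU-only usage example; the measured value depends on the machine.
print(f"sum over 100000 ints: {time_ms(sum, range(100_000)):.3f} ms")
```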

## Part 2: Dot Product (GPU)

In [None]:
# TODO

## Write a kernel function to perform dot product between A and B
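
Unlike vector addition, a dot product has to combine values across threads. One common pattern is a block-wise reduction: each thread writes one product into per-block shared memory, the block halves the number of active threads until a single partial sum remains, and one thread per block adds that partial sum to the global result. The sketch below simulates this pattern on the CPU in plain Python so the logic can be checked without a GPU. It is an illustration, not the required solution; `dot_product_blockwise` is a hypothetical name, and it assumes the block size is a power of two. In a Numba kernel the corresponding primitives would be `cuda.shared.array`, `cuda.syncthreads()` and `cuda.atomic.add`.

```python
import math

def dot_product_blockwise(a, b, threads_per_block=256):
    """CPU simulation of a block-wise GPU dot product (illustrative only)."""
    assert len(a) == len(b)
    n = len(a)
    num_blocks = math.ceil(n / threads_per_block)
    total = 0.0  # stands in for a one-element device array updated atomically
    for block in range(num_blocks):
        # "Shared memory": one slot per thread in this block.
        shared = [0.0] * threads_per_block
        for tid in range(threads_per_block):
            i = block * threads_per_block + tid  # global index, like cuda.grid(1)
            if i < n:  # bounds guard for the last block
                shared[tid] = a[i] * b[i]
        # Tree reduction: halve the number of active threads at each step.
        # On a GPU, a barrier would separate the steps.
        stride = threads_per_block // 2
        while stride > 0:
            for tid in range(stride):
                shared[tid] += shared[tid + stride]
            stride //= 2
        # Thread 0 of the block contributes the block's partial sum.
        total += shared[0]
    return total

print(dot_product_blockwise([1, 2, 3], [4, 5, 6], threads_per_block=2))  # 32.0
```

Because the partial sums are combined in a different order than in the sequential CPU loop, float32 results from a real kernel may differ slightly from `dot_product_cpu`, which is why the comparison uses a relative error threshold.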

In [None]:
## NOTE: Run this cell twice — GPU kernel launch is slow on first run due to compilation.

start_cpu = time.time()
dot_cpu = dot_product_cpu(A, B)
end_cpu = time.time()
cpu_time = (end_cpu - start_cpu) * 1000

start_gpu = time.time()

# TODO
## Call the kernel function here to perform dot product and generate the result

cuda.synchronize()
end_gpu = time.time()
gpu_time = (end_gpu - start_gpu) * 1000

print(f"Dot Product - CPU Time: {cpu_time:.3f} ms")
print(f"Dot Product - GPU Time: {gpu_time:.3f} ms")

# TODO
## Call the 'within_relative_error' function to check if the cpu and gpu results agree within the relative error threshold

Example output:

Dot Product - CPU Time: 241.741 ms

Dot Product - GPU Time: 0.244 ms

Match: True

## Part 3: ReLU Activation (GPU)

In [None]:
# TODO

## Write a kernel function to perform ReLU activation on A
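
ReLU is elementwise, so its kernel can follow the same one-thread-per-element shape as `vector_add_gpu`. The sketch below simulates that shape on the CPU: `relu_thread` is the body a single thread would run, and `launch_1d` is a hypothetical stand-in for a 1D kernel launch that calls the body once per global index. Both names are assumptions made for this illustration, and the code is a way to check the indexing logic without a GPU rather than the required solution.

```python
import math

def relu_thread(i, x, out):
    # Body of one simulated thread; i plays the role of cuda.grid(1).
    if i < len(x):  # guard: the last block may have threads past the end
        out[i] = x[i] if x[i] > 0 else 0.0

def launch_1d(thread_fn, num_blocks, threads_per_block, *args):
    # Run the thread body once per global index, sequentially on the CPU.
    for block in range(num_blocks):
        for tid in range(threads_per_block):
            thread_fn(block * threads_per_block + tid, *args)

demo_x = [-1.5, 0.0, 2.0, -0.25, 3.5]
demo_out = [0.0] * len(demo_x)
demo_threads = 4
launch_1d(relu_thread, math.ceil(len(demo_x) / demo_threads), demo_threads,
          demo_x, demo_out)
print(demo_out)  # [0.0, 0.0, 2.0, 0.0, 3.5]
```

Here two blocks of four threads cover five elements, so three simulated threads hit the guard and write nothing.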

In [None]:
## NOTE: Run this cell twice — GPU kernel launch is slow on first run due to compilation.

start_cpu = time.time()
relu_cpu = relu_activation_cpu(A)
cpu_time = (time.time() - start_cpu) * 1000

start_gpu = time.time()

# TODO
## Call the kernel function here to perform ReLU activation and generate the result

cuda.synchronize()
gpu_time = (time.time() - start_gpu) * 1000

print(f"ReLU Activation - CPU Time: {cpu_time:.3f} ms")
print(f"ReLU Activation - GPU Time: {gpu_time:.3f} ms")

# TODO
## Call the 'compare' function to check if the cpu and gpu results are equal

Example output:

ReLU Activation - CPU Time: 116.852 ms

ReLU Activation - GPU Time: 0.196 ms

Match: True


## Submission Instructions

- Make sure **all outputs are printed clearly**.
- Submit your completed `.ipynb` file on ELMS / Canvas.