# Lab - Parallel Computing Part 2 - GPU Kernel Timing

## E6692 Spring 2022

Complete **Part 1: Define GPU Kernel Functions for a Deep Learning Model** before starting this part.

In [None]:
import numpy as np
import torch
import torch.nn.functional as F
import time

from utils.context import Context, GPUKernels
from utils.plot_execution_times import plot_execution_times, INPUT_SIZES

# define GPU
device = torch.device('cuda')

# define block size
BLOCK_SIZE = 32

# define timing iterations
iterations = 10

# define kernel path
kernel_path = './kernels.cu'

%load_ext autoreload
%autoreload 2

%matplotlib inline

## Part 2: GPU Kernel Timing

Now that we have implemented CUDA versions of the deep learning network layers and verified their results against PyTorch's versions, we will compare the execution time of our CUDA implementations against PyTorch on the CPU and PyTorch on the GPU. PyTorch handles almost all GPU memory allocation and context management for us. Data is sent to the GPU with the **.to(device)** method, where device is the **torch.device()** object defined in the first cell, which corresponds to the Jetson Nano's GPU. The ReLU timing cell below shows how to time operations and how to run a PyTorch operation on the CPU versus the GPU, using the **relu()** function as the example.
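The start/stop/accumulate timing pattern used in this part can also be factored into a small helper. The sketch below is one possible way to do that, not part of the provided `utils` package: the name `average_time_ms` and its `warmup` and `sync` parameters are hypothetical. The `sync` hook is there because CUDA work launched from PyTorch can return to Python before the kernel has finished, so the clock should only stop after the work is known to be complete.

```python
import time


def average_time_ms(fn, iterations=10, warmup=1, sync=None):
    """Return the mean wall-clock time of fn() in milliseconds.

    fn:         zero-argument callable wrapping the operation to time
    iterations: number of timed runs to average over
    warmup:     untimed runs executed first to absorb one-time setup costs
    sync:       optional zero-argument callable that blocks until any
                outstanding asynchronous work has finished (hypothetical hook)
    """
    # Untimed warm-up runs.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    total_ms = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        total_ms += (time.perf_counter() - start) * 1000
    return total_ms / iterations
```

For example, `average_time_ms(lambda: F.relu(x_gpu), iterations, sync=torch.cuda.synchronize)` would time only the GPU computation for a tensor `x_gpu` that is already on the device. When the timed code ends with a copy back to the host, such as `.cpu()`, that copy already waits for the result, so no extra `sync` is needed.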

### ReLU Timing

In [None]:
context = Context(BLOCK_SIZE)
source_module = context.getSourceModule(kernel_path)
cuda_functions = GPUKernels(context, source_module)

In [None]:
# relu time profiling example

# define timing lists
time_pytorch_cpu = []
time_pytorch_gpu = []
time_cuda = []

for input_size in INPUT_SIZES:
    
    time_pytorch_cpu_total = 0
    time_pytorch_gpu_total = 0
    time_cuda_total = 0
    
    # average execution times for more stable and accurate results
    for _ in range(iterations):
    
        # define input array
        input_array = np.random.randint(-10, high=10, size=(input_size, input_size))

        # define PyTorch input array
        input_array_pytorch_cpu = torch.from_numpy(input_array)

        # profile CPU PyTorch
        pytorch_cpu_start = time.time()
        output_array_pytorch_cpu = F.relu(input_array_pytorch_cpu)
        pytorch_cpu_end = time.time()

        # accumulate PyTorch CPU time in milliseconds
        time_pytorch_cpu_total += (pytorch_cpu_end - pytorch_cpu_start) * 1000

        # profile GPU PyTorch (including memory transfer)
        pytorch_gpu_start = time.time()
        input_array_pytorch_gpu = input_array_pytorch_cpu.to(device)
        output_array_pytorch_gpu = F.relu(input_array_pytorch_gpu)
        output_array_pytorch_gpu = output_array_pytorch_gpu.cpu()
        pytorch_gpu_end = time.time()

        # accumulate PyTorch GPU time in milliseconds
        time_pytorch_gpu_total += (pytorch_gpu_end - pytorch_gpu_start) * 1000

        # profile CUDA (including memory transfer)
        cuda_start = time.time()
        cuda_output = cuda_functions.relu(input_array)
        cuda_end = time.time()

        # accumulate CUDA time in milliseconds
        time_cuda_total += (cuda_end - cuda_start) * 1000
        
    time_pytorch_cpu.append(time_pytorch_cpu_total / iterations)
    time_pytorch_gpu.append(time_pytorch_gpu_total / iterations)
    time_cuda.append(time_cuda_total / iterations)

# plot times
title = "ReLU Execution Time by Method"
plot_execution_times(time_pytorch_cpu, time_pytorch_gpu, time_cuda, title)


### Discuss the results of the relu() time profile:

#### 1. Are the GPU implementations faster than PyTorch CPU implementation? Why or why not?

TODO: Your answer here

#### 2. How does our CUDA implementation of ReLU compare to PyTorch's GPU implementation?

TODO: Your answer here

#### 3. Based on your answer to 1, is it worth computing the ReLU activation by itself on the GPU? If not, why is this function still needed in GPU deep learning implementations?

TODO: Your answer here

### Timing of 2D Convolution, 2D Max-Pooling, and Fully Connected layers

TODO: Using the relu time profile as an example, generate time profile plots using **plot_execution_times()** for **conv2d()**, **MaxPool2d()**, and **linear()**. Each CUDA implementation function should be compared to the corresponding CPU and GPU implementations in PyTorch. 
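When setting up inputs for the three layers, it helps to know the array shapes each one produces and expects. The sketch below computes them under PyTorch's default conventions (stride 1 and no padding for `F.conv2d`, stride equal to the kernel size for `F.max_pool2d`, and a weight of shape `(out_features, in_features)` for `F.linear`). The function names are hypothetical, and the kernels in `kernels.cu` may follow a different padding convention, so check them against your Part 1 implementation.

```python
def conv2d_output_size(input_size, mask_size, stride=1, padding=0):
    """Output height/width of a square 2D convolution (dilation of 1 assumed)."""
    return (input_size + 2 * padding - mask_size) // stride + 1


def max_pool2d_output_size(input_size, pool_size, stride=None):
    """Output height/width of square 2D max-pooling without padding."""
    if stride is None:
        stride = pool_size  # PyTorch's default: stride equals the kernel size
    return (input_size - pool_size) // stride + 1


def linear_parameter_shapes(input_size, out_features):
    """Weight and bias shapes for a fully connected layer.

    Assumes the (input_size, input_size) input is treated as input_size
    rows, each with input_size features.
    """
    weight_shape = (out_features, input_size)
    bias_shape = (out_features,)
    return weight_shape, bias_shape
```

Note that `F.conv2d` is documented for inputs of shape `(minibatch, in_channels, H, W)` and weights of shape `(out_channels, in_channels, kH, kW)`, so a 2D NumPy array typically needs batch and channel dimensions added before it is passed to PyTorch.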

In [None]:
# TODO: conv2d() time profile and plot. You may define any mask shape.



### Discuss the results of the conv2d() time profile.

TODO: Your discussion here


In [None]:
# TODO: MaxPool2d() time profile and plot. You may use any pooling kernel size.

      

### Discuss the results of the MaxPool2d() time profile.

TODO: Your discussion here

In [None]:
# TODO: linear() time profile and plot. Define weight and bias shapes corresponding 
#       to the input size


### Discuss the results of the linear() time profile.

TODO: Your discussion here