$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\Tr}[0]{^\top}
\newcommand{\softmax}[1]{\mathrm{softmax}\left({#1}\right)}
$$

# CS236781: Deep Learning
# Tutorial 10: CUDA Kernels

## Introduction

In this tutorial, we will cover:

- The CUDA programming model
- Implementing CUDA kernels with `numba`

In [158]:
# Setup
%matplotlib inline
import os
import sys
import math
import time
import tqdm
import torch
import matplotlib.pyplot as plt

In [159]:
plt.rcParams['font.size'] = 20
data_dir = os.path.expanduser('~/.pytorch-datasets')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## The CUDA programming model

CUDA is a parallel programming model and software environment that leverages the computational resources of NVIDIA GPUs for general-purpose numeric computation.

It provides compilers, programming-language extensions, optimized software libraries and developer tools.

- CUDA defines a programming model and a memory model
- CUDA programs run thousands of threads on hundreds of physical cores
- CUDA defines extensions to the C language for writing GPU code (but here we'll use Python :)
- Allows heterogeneous computation:
    - CPU runs sequential operations and invokes GPU
    - GPU runs massively-parallel work
    - Both can run concurrently

**Device**: The GPU  
**Host**: The machine controlling the GPU

| Heterogeneous computing | Host-device communication |
| --| --|
| <center><img src="img/hetero.png" width="300" /></center> | <center><img src="img/host_device.png" width="700" /></center>|


### CUDA Kernels

- A **Kernel** is a function that is *called from host* and *executes on device*
- Generally, one kernel executes at a time on the entire device
    - Actually, kernels can be queued into "streams"
    - Kernels from different streams can overlap
- A kernel runs using many concurrent threads
- Each thread executes the *same code*

<center><img src="img/kernel.png" width="300"/></center>

### Kernel "Geometry"

- A kernel launches as a 1d, 2d or 3d **grid** of **thread blocks**
- Each **block** contains multiple threads arranged in a 1d, 2d or 3d configuration
- Threads within a block can synchronize (barrier) and share memory
- Each thread has a **unique id** that is mostly used for
    - Selecting in/out data (computing memory access locations)
    - Control-flow decisions
    
<center><img src="img/kernel_geom.png" width="500"/></center>

Note that multi-dimensional grids and blocks are just for the convenience of the programmer.
- Helps implement algorithms for 2d and 3d data
- Nothing actually changes in the hardware execution

How is a kernel implemented and launched?

- The CUDA C-extensions allow the programmer to define which code is compiled for CPU or GPU.
- A special syntax (`<<< >>>`) allows the definition of kernel geometry when launching it.

```c
__global__ void MyKernel() {}      // call from host, execute on GPU
__device__ float MyDeviceFunc() {} // call from GPU, execute on GPU
__host__ int HostFunc() {}         // call from host, execute on host

dim3 dimGrid(100, 50);  // 5000 thread blocks in the grid, in a 2D layout
dim3 dimBlock(4, 8, 8); // 256 threads per block, in a 3D layout
MyKernel <<< dimGrid, dimBlock >>> (...); // Launch kernel
```
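To make the arithmetic in the comments above concrete, here is a small Python sketch that computes the launch size for that geometry. The tuple names `dim_grid` and `dim_block` simply mirror the C variables and are chosen for illustration; nothing here runs on the GPU.

```python
import math

dim_grid = (100, 50)    # blocks in the grid (2D layout)
dim_block = (4, 8, 8)   # threads in each block (3D layout)

num_blocks = math.prod(dim_grid)
threads_per_block = math.prod(dim_block)
total_threads = num_blocks * threads_per_block

print(num_blocks, threads_per_block, total_threads)  # 5000 256 1280000
```

So this single launch asks for well over a million threads, far more than the number of physical cores on any GPU.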

### GPU threads

- Practically zero creation and switching overhead
- Can launch kernels with thousands of threads, many more than physical cores (**"oversubscribed"**)
    - When a thread is blocked due to memory latency, it's instantly swapped out with another waiting thread
    - Instant thread switching hides memory latency
- Even very simple kernels can generate performance benefit with massive parallelization
- Scheduled together in "warps": groups of (usually 32) threads performing the same instruction (SIMT)

### Determining Thread IDs

The CUDA runtime provides **special variables** for determining the geometry of the currently executing kernel:
- `gridDim`: Dimensions of the grid, in blocks. Can be 1d, 2d, or 3d.
- `blockDim`: Dimension of the block, in threads. Can be 1d, 2d, or 3d.

The CUDA runtime provides **special variables** for calculating the unique thread id:
- `blockIdx`: Index of current block, within the grid. Can be 1d, 2d, or 3d.
- `threadIdx`: Index of current thread, within the block. Can be 1d, 2d, or 3d.

**Example**: How can we use the above variables to obtain the unique thread id?

A unique thread id for a 1d kernel geometry can be obtained with `blockIdx.x * blockDim.x + threadIdx.x`.

<center><img src="img/thread_id_1d.png" width="800"></center>
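As a quick sanity check of this formula, the following pure-Python sketch enumerates the ids it produces for a small 1d geometry. The helper name `global_thread_ids` and the sizes (3 blocks of 4 threads) are made up for illustration.

```python
def global_thread_ids(grid_dim, block_dim):
    """Enumerate blockIdx.x * blockDim.x + threadIdx.x for a 1d launch."""
    ids = []
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            ids.append(block_idx * block_dim + thread_idx)
    return ids

ids = global_thread_ids(grid_dim=3, block_dim=4)
print(ids)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Every id from 0 to `grid_dim * block_dim - 1` appears exactly once, which is what makes it usable as an index into the data.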

### Key idea of CUDA

- Write a single-threaded program with the **thread id** as a parameter.
- Use thread id to select a subset of data to process.
- Launch many threads, so that together they cover the entire dataset.
- Code automatically scales to all available physical processors.
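The points above can be sketched without a GPU. The following is a sequential emulation, not real CUDA code: `inc_kernel` and `launch` are hypothetical names, and a real device would run the emulated threads in parallel rather than in a loop.

```python
def inc_kernel(thread_id, data):
    """Body executed by one (emulated) thread: handle a single element."""
    if thread_id < len(data):  # surplus threads do nothing
        data[thread_id] += 1

def launch(kernel, grid_dim, block_dim, *args):
    """Sequentially emulate a 1d launch; a real GPU runs these in parallel."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx * block_dim + thread_idx, *args)

data = [0] * 10
launch(inc_kernel, 3, 4, data)  # 12 emulated threads cover 10 elements
print(data)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Note that the kernel body contains no loop over the data: the thread id alone decides which element each invocation touches.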

### Scalability

- A kernel transparently scales to devices with a different number of physical processors
- A thread block is executed within a single "streaming multiprocessor" (SM)
    - Each SM has many thread processors AKA CUDA cores

<center><img src="img/sm.png" width="300"></center>

- Hardware schedules thread blocks on any available multiprocessor
- Source code defining kernel "geometry" stays the same regardless of hardware

For example, the same Kernel configuration can be launched on devices with a different number of multiprocessors:
<center><img src="img/block_scheduling.png" width="1000"></center>
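A deliberately simplified model can illustrate why the same launch still works on a smaller or larger device. The sketch below assumes each SM runs a fixed number of blocks at a time and that all blocks take equally long; real hardware scheduling is more dynamic. The function `num_waves` and the numbers (8 blocks, 2 or 4 SMs) are illustrative assumptions.

```python
import math

def num_waves(num_blocks, num_sms, blocks_per_sm=1):
    """Rounds of block execution needed under a simplified scheduling model."""
    return math.ceil(num_blocks / (num_sms * blocks_per_sm))

for sms in (2, 4):
    print(sms, "SMs ->", num_waves(8, sms), "waves")
# 2 SMs -> 4 waves
# 4 SMs -> 2 waves
```

The kernel geometry (8 blocks) is identical in both cases; only the number of rounds needed to work through the blocks changes.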

### Memory Hierarchy

Different types of memory are available to device threads.

The most important ones are:

- Registers
    - Per-thread access
    - On chip $\rightarrow$ extremely fast
    - Persisted until thread terminates

- Thread-local memory
    - Stores per-thread local variables that cannot fit in the register memory
    - Located in DRAM $\rightarrow$ extremely slow
    - Persisted until thread terminates

- Shared memory
    - Shared between threads in the same thread block
    - Used for collaboration between threads in the same block
    - On chip $\rightarrow$ very fast
    - Persisted until end of block

- Global memory
    - Can be accessed by any thread in any thread block
    - Used to copy to/from host
    - Located in DRAM $\rightarrow$ extremely slow
    - Persisted for the life of the application

|Thread-local|Shared|Global|
|-|-|-|
|<img src="img/mem_local.png" width="330">| <img src="img/mem_shared.png" width="273">|<img src="img/mem_global.png" width="500"> |

## Rules of thumb

How many blocks?
- Should occupy every SM $\rightarrow$ At least one block per SM
- Should have something to run on SM if current block is waiting (e.g. sync) $\rightarrow$ At least two blocks per SM
- Should scale with same code if we upgrade hardware $\rightarrow$ Many blocks per SM!

How many threads?
- Many threads $\rightarrow$ hides global memory latency
- Too many threads $\rightarrow$ exhaust registers and shared memory
- Multiple of warp size
- Typical selection: 64 to 256 per block
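These rules can be turned into a small helper for 1d problems. This is only a sketch: `launch_config` is a hypothetical name, the warp size of 32 is assumed, and the helper does not check the number of SMs or the device's resource limits.

```python
import math

WARP_SIZE = 32  # assumed warp size (the usual value on NVIDIA GPUs)

def launch_config(n_elements, threads_per_block=256):
    """Return (num_blocks, threads_per_block) covering n_elements in 1d."""
    # Round the block size up to a whole number of warps
    threads_per_block = math.ceil(threads_per_block / WARP_SIZE) * WARP_SIZE
    # Enough blocks so that num_blocks * threads_per_block >= n_elements
    num_blocks = math.ceil(n_elements / threads_per_block)
    return num_blocks, threads_per_block

print(launch_config(10**6))       # (3907, 256)
print(launch_config(10**6, 100))  # (7813, 128)
```

With a million elements, either configuration yields thousands of blocks, comfortably more than a few blocks per SM.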

## Implementing CUDA Kernels with `numba`

### What is `numba`?

Numba is a **just-in-time** (JIT) **function compiler**, focused on **numerical python**.
It can be used to accelerate python code by generating efficient, **type-specialized** machine code.

Numba supports all major OSes and a wide range of hardware (Intel x86/64, NVIDIA CUDA, ARM).
It's developed and actively maintained by Anaconda Inc., and considered production ready.


Let's explain the terms we used above:

**Just-in-time**: Functions are compiled the first time they're called.  The compiler therefore knows the argument types.

Bonus: This also allows Numba to be used interactively in a Jupyter notebook :)

**Function compiler**:  Numba compiles Python functions, not entire applications.

Numba does not replace the Python interpreter. Instead, it transforms a function into a (usually) faster function.

**Numerical python**: Numba supports only a subset of the python language. It works well with numerical types such as `int`, `float`, and `complex`, functions from the `math` and `cmath` modules and with `numpy` arrays.

**Type-specialized**: Numba speeds up your function by generating a specialized implementation for the specific data types you are using.

<center><img src="img/numba_flowchart.png" width="800"/></center>

### First steps with `numba` on the CPU

In [189]:
import numpy as np
import numba

Let's implement a "Hello World" style example: A trivial function that increments an array by 1.

In [190]:
@numba.jit(nopython=True)
def inc_cpu(a):
    for i in range(len(a)):
        a[i] += 1

We use the `numba.jit` decorator to wrap our code in a `numba` object that JIT-compiles the function the first time it is called and caches the compiled version.

In [193]:
# The inc_cpu variable no longer points to a regular python function, but a callable wrapper.
inc_cpu

CPUDispatcher(<function inc_cpu at 0x7f05922c6e60>)

What's the `nopython` option?
- If `nopython=True`, `numba` will try to compile the entire function so that it can be run completely without the Python interpreter. This is usually what you want.
- Otherwise, `numba` will try to compile the entire function, but if there are unsupported operations or types it will try to only extract loops and compile them as separate functions.

First, let's create a million-element array and see how fast the python interpreter is using `%timeit`.

In [197]:
a = np.zeros((10**6,), dtype=np.float32)
a

array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)

In [198]:
# Run as regular python code (interpreted)
%timeit inc_cpu.py_func(a)
a

2.96 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


array([8., 8., 8., ..., 8., 8., 8.], dtype=float32)

We had to call the function with `.py_func` to get the original function (before wrapping with the jitter).

Now let's call it through the wrapper to time the compiled version:

In [199]:
# Run as jit-compiled machine code
%timeit inc_cpu(a)
a

220 µs ± 148 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)


array([8119., 8119., 8119., ..., 8119., 8119., 8119.], dtype=float32)

That's about 4 orders of magnitude faster (roughly 13,000×)! Not bad for just adding a decorator...

In [207]:
# Run using numpy add(), this is like a + 1 but without allocating output array
%timeit np.add(a, 1, out=a)

240 µs ± 82.3 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Nice, we get results similar to `numpy`'s optimized C code.

Important note about benchmarking:

The first time we called `inc_cpu` we paid a one-time overhead for the compilation.
However, the `%timeit` magic reports statistics over many repeated calls (the mean ± std. dev. shown above), so this one-time cost is not visible in our results.

### First steps with `numba` on the GPU

In [209]:
from numba import cuda

cuda.detect()

Found 1 CUDA devices
id 0    b'GeForce RTX 2080 Ti'                              [SUPPORTED]
                      compute capability: 7.5
                           pci device id: 0
                              pci bus id: 15
Summary:
	1/1 devices are supported


True

Let's rewrite our "Hello World" example as a CUDA kernel.

In [210]:
@cuda.jit
def inc_gpu(a):
    idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    
    # Notice:
    # 1. No loop
    # 2. We assume more threads than array elements
    if idx < a.shape[0]:
        a[idx] += 1

Now let's invoke this kernel with a geometry containing at least as many threads as array elements (over 1M threads!)

In [169]:
blocksize = 256
gridsize = math.ceil(a.shape[0] / blocksize)

# Copy data to GPU memory
d_a = cuda.to_device(a)

# Run as a kernel on GPU
# Note that we must synchronize to benchmark properly
%timeit inc_gpu[gridsize, blocksize](d_a); cuda.synchronize()

# Copying data back from device will also synchronize, i.e. wait for kernel to complete
a = d_a.copy_to_host()
a

182 µs ± 4.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


array([97341., 97341., 97341., ..., 97341., 97341., 97341.], dtype=float32)

In [142]:
@cuda.jit
def elementwise_mult_kernel(a, b, out):
    threads_per_block = cuda.blockDim.x
    thread_idx_in_block = cuda.threadIdx.x
    block_idx = cuda.blockIdx.x
    
    thread_idx_unique = thread_idx_in_block + block_idx * threads_per_block
    
    if thread_idx_unique >= len(a):
        return
    
    i = thread_idx_unique 
    out[i] = a[i] * b[i]

In [143]:
a = np.ones((10**6,), dtype=np.float32) * 2
b = np.ones((10**6,), dtype=np.float32) * 3

In [69]:
out = np.zeros_like(a)
elementwise_mult_kernel[1000, 1000](a, b, out)

out

array([6., 6., 6., ..., 6., 6., 6.], dtype=float32)

In [70]:
out = np.zeros_like(a)
elementwise_mult_kernel[10, 10](a, b, out)

out

array([6., 6., 6., ..., 0., 0., 0.], dtype=float32)

In [71]:
out[95:105]

array([6., 6., 6., 6., 6., 0., 0., 0., 0., 0.], dtype=float32)

Make sure we cover all data, regardless of kernel dimension:

In [72]:
@cuda.jit
def elementwise_mult_kernel_2(a, b, out):
    threads_per_block = cuda.blockDim.x
    num_blocks = cuda.gridDim.x
    
    thread_idx_in_block = cuda.threadIdx.x
    block_idx = cuda.blockIdx.x
    
    thread_idx_unique = thread_idx_in_block + block_idx * threads_per_block
    
    start = thread_idx_unique 
    end = len(a)
    stride = threads_per_block * num_blocks # jump over all threads, in case we have more data than threads
    
    for i in range(start, end, stride):
        out[i] = a[i] * b[i]

In [73]:
out = np.zeros_like(a)
elementwise_mult_kernel_2[10, 10](a, b, out)

out

array([6., 6., 6., ..., 6., 6., 6.], dtype=float32)
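This striding pattern is commonly called a grid-stride loop. The following pure-Python emulation (not GPU code) checks that 100 threads visit every element exactly once. It uses a smaller array of 1000 elements to keep the loop fast, and `grid_stride_indices` is a hypothetical helper name.

```python
import numpy as np

def grid_stride_indices(thread_id, total_threads, n):
    """Indices visited by one thread in a grid-stride loop over n elements."""
    return range(thread_id, n, total_threads)

n = 1000
total_threads = 10 * 10  # same [10, 10] geometry as above
visits = np.zeros(n, dtype=np.int64)
for tid in range(total_threads):
    for i in grid_stride_indices(tid, total_threads, n):
        visits[i] += 1

print(visits.min(), visits.max())  # 1 1
print(list(grid_stride_indices(3, total_threads, n))[:4])  # [3, 103, 203, 303]
```

Because each thread starts at its own id and jumps by the total number of threads, no element is skipped and none is processed twice.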

In [81]:
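@cuda.jit
def matmul_kernel(a, b, out):
    # Absolute 2d position of this thread within the grid
    i, j = cuda.grid(2)

    # Guard: the launch may contain more threads than output elements
    if i < out.shape[0] and j < out.shape[1]:
        for k in range(b.shape[0]):
            out[i, j] += a[i, k] * b[k, j]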

In [82]:
a = np.ones((2, 5), dtype=np.float32) * 2
b = np.ones((5, 2), dtype=np.float32) * 3

In [89]:
out = np.zeros((2,2), dtype=np.float32)
matmul_kernel[1, (10, 10)](a, b, out)

out

array([[30., 30.],
       [30., 30.]], dtype=float32)
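In the launch above, a single block of 10×10 threads is used for a 2×2 output, so most threads have no element to compute. A bounds check keeps those surplus threads from touching memory outside `out`. The sketch below is a sequential NumPy emulation of such a guarded 2d kernel, not GPU code; `matmul_emulated`, `a_demo` and `b_demo` are hypothetical names.

```python
import numpy as np

def matmul_emulated(a, b, block_dim=(10, 10)):
    """Emulate a single-block 2d launch: one thread per output element."""
    out = np.zeros((a.shape[0], b.shape[1]), dtype=a.dtype)
    active = 0
    for i in range(block_dim[0]):
        for j in range(block_dim[1]):
            # Guard: threads outside the output matrix must not write
            if i >= out.shape[0] or j >= out.shape[1]:
                continue
            active += 1
            for k in range(b.shape[0]):
                out[i, j] += a[i, k] * b[k, j]
    return out, active

a_demo = np.full((2, 5), 2, dtype=np.float32)
b_demo = np.full((5, 2), 3, dtype=np.float32)
out_demo, active = matmul_emulated(a_demo, b_demo)
print(out_demo)
print(active, "of", 10 * 10, "threads do useful work")  # 4 of 100 threads do useful work
```

Only 4 of the 100 emulated threads pass the guard, and the result matches `a_demo @ b_demo`.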

**Thread cooperation**

Sometimes threads need to cooperate, since not every computation is fully parallel.
- Unlimited cooperation among thousands of threads does not scale performance-wise
- Solution: Only threads within the same **block** can share memory
- Balance between scalability and cooperation
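A sum reduction is a typical case where threads must cooperate. The sketch below is a sequential NumPy emulation under stated assumptions, not GPU code: the `shared` array stands in for a block's shared memory, and each pass of the `while` loop corresponds to a step that would be followed by a barrier. In a `numba` kernel these roles would be played by `cuda.shared.array` and `cuda.syncthreads()`. The names `block_sum` and `blocked_sum` are hypothetical.

```python
import numpy as np

def block_sum(chunk):
    """Tree reduction over one block's data, emulating threads sharing a buffer."""
    shared = np.array(chunk, dtype=np.float64)  # stands in for shared memory
    n = len(shared)
    stride = 1
    while stride < n:
        # On a GPU, the threads of the block would do this step in parallel,
        # followed by a barrier before the next step
        for tid in range(0, n, 2 * stride):
            if tid + stride < n:
                shared[tid] += shared[tid + stride]
        stride *= 2
    return shared[0]

def blocked_sum(data, block_size=8):
    """Each block reduces its own chunk; partial sums are combined afterwards."""
    partials = [block_sum(data[s:s + block_size])
                for s in range(0, len(data), block_size)]
    return sum(partials)

data = np.arange(100, dtype=np.float64)
print(blocked_sum(data))  # 4950.0
```

Cooperation stays local to each block, and only the small list of per-block partial sums needs to be combined across blocks.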