# What CPU and GPU am I using?

Before we start, let's check which CPU and GPU we will be using. Performance can vary a lot depending on the hardware model. Google Colab does not let us choose the model, but it is free.

In [None]:
!echo "CPU:"
!cat /proc/cpuinfo | grep name
!echo "GPU:"
!nvidia-smi

CPU:
model name	: Intel(R) Xeon(R) CPU @ 2.30GHz
model name	: Intel(R) Xeon(R) CPU @ 2.30GHz
GPU:
Fri Jan 28 09:32:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                      

# Vector Addition

We start by loading a few packages and defining helper functions that generate the three vectors a, b and c, compute a checksum of the result, and time the calculation.

The standard Python implementation of the vector addition is provided for reference.

In [None]:
import time
import math
import numpy as np
from numpy.random import seed
from numpy.random import rand
from numba import jit, njit, prange, cuda, types, float32
import matplotlib.pyplot as plt

%matplotlib inline 

# Uniform random values in the interval [-10, 10)
def randomize_array(size):
    return 10.0 * 2.0 * (rand(size) - 0.5)

def init(size):
    seed(2)
    a = np.array(randomize_array(size), dtype=np.float32)
    b = np.array(randomize_array(size), dtype=np.float32)
    c = np.zeros(size, dtype=np.float32)
    return a, b, c

@njit(parallel=True)
def check(c):
    size = len(c)
    total = 0.0  # named total to avoid shadowing the built-in sum
    for i in prange(size):
        total += c[i]
    return total

def time_and_check(vec_op, size):
    a, b, c = init(size)

    start = time.perf_counter()  # monotonic, high-resolution timer
    vec_op(a, b, c)
    end = time.perf_counter()

    print('Size: ', size, ' elapsed time: ',end-start, ' checksum = ', check(c))

# Python implementation
def vec_add_interpreted(a, b, c):
    for i in range(len(a)):
        c[i] = a[i] + b[i]

Adding two vectors is straightforward, and the cost is linear in the size of the input, so we use a large vector to get a measurable execution time. We use a power of two so that the size is an exact multiple of the block size, which simplifies the first CUDA implementation.
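
For a sense of scale between the interpreted loop and the GPU versions, a vectorized NumPy addition makes a useful CPU baseline. The sketch below is not part of the original exercises: the helper name vec_add_numpy is hypothetical, and it assumes only that NumPy is installed.

```python
import numpy as np

def vec_add_numpy(a, b, c):
    # Element-wise addition in compiled NumPy code, written into the existing array c
    np.add(a, b, out=c)

# Small self-check on eight elements
a = np.arange(8, dtype=np.float32)
b = np.full(8, 10.0, dtype=np.float32)
c = np.zeros(8, dtype=np.float32)
vec_add_numpy(a, b, c)
print(c)
```

It has the same signature as vec_add_interpreted, so it can be passed to time_and_check in the same way.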

In [None]:
size = 2**22

print("Interpreted Python:")
time_and_check(vec_add_interpreted, size)
time_and_check(vec_add_interpreted, size)

Interpreted Python:
Size:  4194304  elapsed time:  1.9601237773895264  checksum =  -204.1574936332181
Size:  4194304  elapsed time:  1.9934566020965576  checksum =  -204.1574936332181


# CUDA implementation

This is a very simple operation, so you will implement the CUDA kernel function yourself.

## Exercise 1: The CUDA Hello World

In [None]:
# Exercise 1
def vec_add_numba_cuda(a, b, c):
    # 1. get thread position 'i' in the grid
    # 2. do computation at grid position 'i'

# call the function

blocksize = # 3. block size = number of threads per block
gridsize = # 4. grid size = number of blocks in the grid

# Check!
size = 2**20
time_and_check(vec_add_interpreted, size)
time_and_check(vec_add_numba_cuda[gridsize, blocksize], size)

### Solution

In [None]:
@cuda.jit
def vec_add_numba_cuda(a, b, c):
    i = cuda.grid(1)
    c[i] = a[i] + b[i]
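
To see what the kernel does with `cuda.grid(1)`, it can help to emulate the launch on the CPU. For a one-dimensional launch, `cuda.grid(1)` returns the block index times the block size plus the thread index within the block. The sketch below runs the kernel body once per (block, thread) pair, sequentially and in pure Python; emulate_vec_add is a hypothetical name used only for illustration, and a real GPU runs these iterations in parallel.

```python
def emulate_vec_add(a, b, c, gridsize, blocksize):
    """Run the kernel body once per (block, thread) pair, one after the other."""
    for block_idx in range(gridsize):
        for thread_idx in range(blocksize):
            # Same global index that cuda.grid(1) gives in a 1D launch
            i = block_idx * blocksize + thread_idx
            c[i] = a[i] + b[i]

a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [10.0] * 8
c = [0.0] * 8

# 2 blocks of 4 threads cover exactly 8 elements
emulate_vec_add(a, b, c, gridsize=2, blocksize=4)
print(c)
```

Because gridsize times blocksize equals the array length here, every index is valid; this is the situation the power-of-two size guarantees.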

In [None]:
size = 2**22

blocksize = 32
gridsize = size // blocksize  # size is an exact multiple of blocksize here
time_and_check(vec_add_numba_cuda[gridsize, blocksize], size)
time_and_check(vec_add_numba_cuda[gridsize, blocksize], size)

Size:  4194304  elapsed time:  0.5735862255096436  checksum =  -204.1574936332181
Size:  4194304  elapsed time:  0.01816868782043457  checksum =  -204.1574936332181
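
The first call is much slower than the second (about 0.57 s against 0.018 s) because Numba compiles the kernel the first time it is called with a given set of argument types; this is why each measurement is run twice. A minimal sketch of a timing helper that separates the warm-up call from the measured calls is shown below. The name best_time_after_warmup is hypothetical, and the dummy workload is only there so the sketch runs without a GPU.

```python
import time

def best_time_after_warmup(fn, *args, repeats=3):
    """Call fn once to trigger any lazy compilation, then return the best of `repeats` timings."""
    fn(*args)  # warm-up call, not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload that records how often it is called
calls = []
def dummy_op(n):
    calls.append(n)
    return sum(range(n))

best = best_time_after_warmup(dummy_op, 1000)
print("best time:", best, "calls:", len(calls))
```

When fn launches a kernel on device arrays, it should end with a synchronization such as `cuda.synchronize()`, since kernel launches return before the GPU has finished.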


## Exercise 2: Foolproof

Adapt the previous code to handle sizes that are not a multiple of the block size (and so, in general, not a power of 2). Hint: you need to change both the kernel and the gridsize.


In [None]:
def vec_add_numba_cuda(a, b, c):
    # check that thread position 'i' is valid

blocksize =
gridsize = 

size = 12345678
time_and_check(vec_add_interpreted, size)
time_and_check(vec_add_numba_cuda[gridsize, blocksize], size)


### Solution

In [None]:
@cuda.jit
def vec_add_numba_cuda(a, b, c):
    i = cuda.grid(1)

    if i < a.shape[0]:
        c[i] = a[i] + b[i]

In [None]:
size = 12345678

blocksize = 32
gridsize = (size + blocksize - 1) // blocksize  # ceiling division: enough blocks to cover every element

time_and_check(vec_add_interpreted, size)
time_and_check(vec_add_numba_cuda[gridsize, blocksize], size)
time_and_check(vec_add_numba_cuda[gridsize, blocksize], size)

Size:  12345678  elapsed time:  5.597100496292114  checksum =  9072.72968535521
Size:  12345678  elapsed time:  0.2529134750366211  checksum =  9072.72968535521
Size:  12345678  elapsed time:  0.08398103713989258  checksum =  9072.72968535521
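
The grid size arithmetic can be checked without a GPU. The sketch below uses ceiling division to find the smallest number of blocks that covers all elements, and counts the surplus threads that the bounds check in the kernel has to reject. The helper name launch_config is hypothetical.

```python
def launch_config(size, blocksize=32):
    """Return (gridsize, idle_threads) for a 1D launch covering `size` elements."""
    gridsize = (size + blocksize - 1) // blocksize  # ceiling division
    idle_threads = gridsize * blocksize - size      # threads rejected by the bounds check
    return gridsize, idle_threads

for size in (2**22, 12345678):
    gridsize, idle = launch_config(size)
    print("size:", size, "gridsize:", gridsize, "idle threads:", idle)
```

For the power-of-two size no thread is idle, while for 12345678 elements the last block contains a few threads whose index falls past the end of the arrays.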


## Exercise 3: Memory Management

By default, when we let Numba handle the data transfers, it copies all three arrays to the device and back on every kernel call.

This would be a good time to do some profiling using nvprof:
```
==8912== Profiling application: python vec_add.py
==8912== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   61.98%  79.743ms         6  13.290ms  5.6272ms  28.232ms  [CUDA memcpy DtoH]
                   36.66%  47.159ms         6  7.8599ms  5.3962ms  12.592ms  [CUDA memcpy HtoD]
                    1.36%  1.7535ms         2  876.77us  876.74us  876.80us  cudapy::__main__::dot_numba_cuda_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
      API calls:   39.09%  83.395ms         6  13.899ms  5.7020ms  29.060ms  cuMemcpyDtoH
                   38.00%  81.068ms         1  81.068ms  81.068ms  81.068ms  cuDevicePrimaryCtxRetain
                   22.22%  47.419ms         6  7.9031ms  5.3943ms  12.724ms  cuMemcpyHtoD
```

**61.98%** of the GPU time is spent in **[CUDA memcpy DtoH]** and a further **36.66%** in **[CUDA memcpy HtoD]**. **In total, 98.64% of the execution time on the GPU is lost in data transfers.** Only 1.36% of the time is spent on the actual calculation.

But in fact, we don't need to copy all three arrays every time. We only need to copy arrays a and b **to** the device (c is written on the device), and copy array c **from** the device to get the result.
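
The call counts in the profile can be reproduced with simple bookkeeping. The sketch below assumes that the profiled script launched the kernel twice (the kernel row shows 2 calls) and that each array is copied as a single transfer; the function name transfers is hypothetical.

```python
def transfers(kernel_calls, to_device_per_call, to_host_per_call):
    """Return (HtoD copies, DtoH copies) for a given number of kernel calls."""
    return kernel_calls * to_device_per_call, kernel_calls * to_host_per_call

# Automatic transfers: a, b and c go to the device and all three come back
auto = transfers(kernel_calls=2, to_device_per_call=3, to_host_per_call=3)

# Manual transfers: only a and b go to the device, only c comes back
manual = transfers(kernel_calls=2, to_device_per_call=2, to_host_per_call=1)

print("automatic (HtoD, DtoH):", auto)
print("manual    (HtoD, DtoH):", manual)
```

The automatic case matches the 6 HtoD and 6 DtoH calls in the profile above; the manual case is the target for this exercise.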

Below are some examples of how to control data transfers manually:

```
# Create device array d_a from array a and copy it to the device
d_a = cuda.to_device(a)

# Alternatively, create device array d_c from array c but DON'T copy it
d_c = cuda.to_device(c, copy=False)

# Copy the content of device array d_c to host array c
d_c.copy_to_host(c)
``` 

Then when calling the kernel function, use the freshly created device arrays rather than the host arrays:
```
vec_add_numba_cuda[gridsize, blocksize](d_a, d_b, d_c)
```

In [None]:
def vec_add_numba_cuda_no_copy(a, b, c):
    # your turn
        
    
size = 2**24

blocksize = 
gridsize = 

print("Cuda:")
time_and_check(vec_add_numba_cuda[gridsize, blocksize], size)
time_and_check(vec_add_numba_cuda[gridsize, blocksize], size)
print("Cuda no copy:")
time_and_check(vec_add_numba_cuda_no_copy, size)
time_and_check(vec_add_numba_cuda_no_copy, size)


If you did it right, it should be significantly faster!

### Solution

In [None]:
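def vec_add_numba_cuda_no_copy(a, b, c):
    # Copy the inputs to the device; only allocate the output there
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.to_device(c, copy=False)

    # Derive the launch configuration from the array, not from a global variable
    n = a.shape[0]
    blocksize = 32
    gridsize = (n + blocksize - 1) // blocksize

    vec_add_numba_cuda[gridsize, blocksize](d_a, d_b, d_c)

    # Copy only the result back to the host
    d_c.copy_to_host(c)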

In [None]:
size = 2**27  # 134217728 elements, the size shown in the recorded output below
print("Cuda:")
blocksize = 32
gridsize = int(size / blocksize)
time_and_check(vec_add_numba_cuda[gridsize, blocksize], size)
time_and_check(vec_add_numba_cuda[gridsize, blocksize], size)
print("Cuda no copy:")
time_and_check(vec_add_numba_cuda_no_copy, size)
time_and_check(vec_add_numba_cuda_no_copy, size)

Cuda:
Size:  134217728  elapsed time:  0.6065824031829834  checksum =  6869.8527526276885
Size:  134217728  elapsed time:  0.5396871566772461  checksum =  6869.8527526276885
Cuda no copy:
Size:  134217728  elapsed time:  0.1997087001800537  checksum =  6869.8527526276885
Size:  134217728  elapsed time:  0.382235050201416  checksum =  6869.8527526276885


```
==9108== Profiling application: python vec_add.py
==9108== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   67.61%  49.289ms         2  24.644ms  24.631ms  24.658ms  [CUDA memcpy DtoH]
                   29.98%  21.857ms         4  5.4642ms  5.4225ms  5.5229ms  [CUDA memcpy HtoD]
                    2.41%  1.7536ms         2  876.79us  876.16us  877.41us  cudapy::__main__::dot_numba_cuda_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
      API calls:   50.14%  76.012ms         1  76.012ms  76.012ms  76.012ms  cuDevicePrimaryCtxRetain
                   34.52%  52.337ms         2  26.168ms  26.138ms  26.199ms  cuMemcpyDtoH
                   14.41%  21.845ms         4  5.4613ms  5.4113ms  5.4877ms  cuMemcpyHtoD
```

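The kernel itself takes the same time as before (about 1.75 ms), but its share of the GPU time has almost doubled, from 1.36% to 2.41%. Summing the GPU activities, the total drops from about 128.7 ms to about 72.9 ms, so the GPU part of the execution is roughly 1.8x faster.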
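
These figures can be checked directly from the two profiles. The short calculation below uses only the GPU activity times printed by nvprof; it ignores API overhead such as context creation, so it is an estimate of the GPU side only.

```python
# GPU activity times in milliseconds, read from the two nvprof outputs
before = {"DtoH": 79.743, "HtoD": 47.159, "kernel": 1.7535}
after = {"DtoH": 49.289, "HtoD": 21.857, "kernel": 1.7536}

def summarize(profile):
    """Return (total GPU time in ms, kernel share in percent)."""
    total = sum(profile.values())
    kernel_share = 100.0 * profile["kernel"] / total
    return total, kernel_share

total_before, share_before = summarize(before)
total_after, share_after = summarize(after)
speedup = total_before / total_after

print(f"before: {total_before:.1f} ms, kernel share {share_before:.2f}%")
print(f"after:  {total_after:.1f} ms, kernel share {share_after:.2f}%")
print(f"speedup on GPU activities: {speedup:.2f}x")
```

The kernel time is the same in both profiles; what changes is the transfer time around it.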

## Exercise 4: Tiling

CUDA **streams** allow concurrency of execution on a single device within a given context. Queued work items in the same stream execute sequentially, but work items in different streams may execute concurrently. Most operations involving a CUDA device can be performed asynchronously using streams, including data transfers and kernel execution.

In this example, the computation time is too small for overlapping kernels to bring any benefit; however, we can try to overlap the data transfers.

This is done with a technique called **tiling**, where we split the work into smaller work items (also called chunks) and process each chunk in its own stream.

The code below is given as an example; you will not see any benefit in Google Colab. On a newer GPU, this can be 20 to 40% faster.

In [None]:
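def vec_add_numba_cuda_tiling(a, b, c):
    nChunks = 8
    n = a.shape[0]
    # Assumes n is a multiple of nChunks, as with the power-of-two sizes used here
    chunkSize = n // nChunks

    blocksize = 32
    # One thread per element of a chunk, not of the whole array
    gridsize = (chunkSize + blocksize - 1) // blocksize

    streams = []
    for i in range(nChunks):
        # Each chunk gets its own stream so that its transfers can overlap with other chunks
        stream = cuda.stream()
        streams.append(stream)

        begin = i * chunkSize
        end = begin + chunkSize

        d_a = cuda.to_device(a[begin:end], stream=stream)
        d_b = cuda.to_device(b[begin:end], stream=stream)
        d_c = cuda.to_device(c[begin:end], stream=stream, copy=False)

        vec_add_numba_cuda[gridsize, blocksize, stream](d_a, d_b, d_c)

        d_c.copy_to_host(c[begin:end], stream=stream)

    # Wait for all queued work to finish before the caller reads c
    for stream in streams:
        stream.synchronize()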
        
    
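size = 2**24
# Recompute the launch configuration for this size instead of reusing earlier values
blocksize = 32
gridsize = (size + blocksize - 1) // blocksize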
print("Cuda:")
time_and_check(vec_add_numba_cuda[gridsize, blocksize], size)
time_and_check(vec_add_numba_cuda[gridsize, blocksize], size)
print("Cuda no copy:")
time_and_check(vec_add_numba_cuda_no_copy, size)
time_and_check(vec_add_numba_cuda_no_copy, size)
print("Cuda tiling:")
time_and_check(vec_add_numba_cuda_tiling, size)
time_and_check(vec_add_numba_cuda_tiling, size)

Cuda:
Size:  16777216  elapsed time:  0.05959486961364746  checksum =  43100.967578660464
Size:  16777216  elapsed time:  0.057906150817871094  checksum =  43100.967578660464
Cuda no copy:
Size:  16777216  elapsed time:  0.029003381729125977  checksum =  43100.967578660464
Size:  16777216  elapsed time:  0.12186980247497559  checksum =  43100.967578660464
Cuda tiling:
Size:  16777216  elapsed time:  0.08003592491149902  checksum =  43100.967578660464
Size:  16777216  elapsed time:  0.06699371337890625  checksum =  43100.967578660464
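
The tiling function splits the array into equal chunks, which only works when the size is a multiple of the number of chunks. A minimal sketch of computing chunk boundaries for arbitrary sizes follows; chunk_bounds is a hypothetical helper, not part of the notebook, and each (begin, end) pair would take the place of the begin and end values computed inside the loop.

```python
def chunk_bounds(n, n_chunks):
    """Yield (begin, end) index pairs that split range(n) into at most n_chunks contiguous pieces."""
    chunk_size = (n + n_chunks - 1) // n_chunks  # ceiling division
    for i in range(n_chunks):
        begin = min(i * chunk_size, n)
        end = min(begin + chunk_size, n)
        if begin < end:  # skip empty trailing chunks
            yield begin, end

bounds = list(chunk_bounds(10, 4))
print(bounds)
```

With this scheme the last chunk may be shorter than the others, so the kernel needs the bounds check from Exercise 2 and the grid size should be computed from the length of each chunk.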
