# What CPU and GPU am I using?

Before we start, let's check which CPU and GPU we will be using. Performance can vary a lot depending on the hardware model. Google Colab does not let us choose the model, but it is free.

In [1]:
!echo "CPU:"
!cat /proc/cpuinfo | grep name
!echo "GPU:"
!nvidia-smi

CPU:
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13700K
model name	: 13th Gen Intel(R) Core(TM) i7-13

# Vector Addition Hello World

In [8]:
import numpy as np
from numba import cuda

# The CUDA kernel
# Goal: add 5 to a vector of zeros
# in:  0 0 0 0 0 0...
# out: 5 5 5 5 5 5...
@cuda.jit
def vec_add_constant(x, n, c):
    # thread position in the grid
    # shortcut for: i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    i = cuda.grid(1)

    # Guard against out-of-range indices.
    # If n is not a multiple of threads_per_block, the last block contains
    # extra threads with no element to process; they must skip the update.
    if i < n:
        x[i] += c
        
n = 4096 # vector size
c = 5.0 # constant to be added

h_x = np.zeros(n, dtype=np.float32) # host array filled with zeros
print(h_x)

# copy h_x to a device array named d_x on the GPU
d_x = cuda.to_device(h_x)

threads_per_block = 32 # up to 1024, ideally a multiple of 32 (the warp size)

# compute the number of blocks needed to cover the vector: ceil(4096 / 32) = 128
blocks = (n + threads_per_block - 1) // threads_per_block

# call the CUDA kernel on the GPU
vec_add_constant[blocks, threads_per_block](d_x, n, c)

# copy d_x back to h_x (from GPU to CPU)
h_x = d_x.copy_to_host()

print(h_x)

[0. 0. 0. ... 0. 0. 0.]
[5. 5. 5. ... 5. 5. 5.]
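
To make the launch geometry concrete, here is a minimal CPU-only sketch of what the kernel launch above does. It is plain Python, not Numba code: the helpers `blocks_needed` and `emulate_launch` are hypothetical names introduced for illustration, and the loops run sequentially where a GPU would run the threads in parallel. The second case uses a vector size that is not a multiple of the block size, to show why the `if i < n` guard is needed.

```python
# CPU-only emulation of a 1D CUDA launch (illustrative, not Numba code).

def blocks_needed(n, threads_per_block):
    """Ceiling division: smallest number of blocks that covers n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def emulate_launch(x, n, c, blocks, threads_per_block):
    """Run the body of vec_add_constant once per (block, thread) pair."""
    skipped = 0
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # same formula that cuda.grid(1) is a shortcut for
            i = block_idx * threads_per_block + thread_idx
            if i < n:
                x[i] += c
            else:
                skipped += 1  # thread past the end of the vector: does nothing
    return skipped


# The configuration used in this notebook: 4096 is a multiple of 32.
print(blocks_needed(4096, 32))  # 128

# A size that does not divide evenly: the last block is only partly used.
n, tpb = 1000, 256
blocks = blocks_needed(n, tpb)
x = [0.0] * n
skipped = emulate_launch(x, n, 5.0, blocks, tpb)
print(blocks, skipped)  # 4 24
```

With 1000 elements and 256 threads per block, 4 blocks provide 1024 threads, so 24 threads fall past the end of the vector and are filtered out by the guard.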


## Exercise 1: Vector Addition

Now it's your turn to implement the CUDA kernel! 

The goal is to add two vectors.

In [12]:
# CUDA kernel to add two vectors
# in:
# a:  0 0 0 0 0
# b:  5 5 5 5 5
# out:
# c:  5 5 5 5 5
@cuda.jit
def vec_add(a, b, c, n):
    i = cuda.grid(1)

    # TODO: finish the kernel

n = 4096 # vector size

h_a = np.zeros(n, dtype=np.float32) # host array filled with zeros
h_b = np.full(n, 5.0, dtype=np.float32) # host array filled with 5s
h_c = np.zeros(n, dtype=np.float32) # host array filled with zeros

print(h_a)
print(h_b)

# copy the host arrays (h_*) to device arrays (d_*) on the GPU
d_a = cuda.to_device(h_a)
d_b = cuda.to_device(h_b)
d_c = cuda.to_device(h_c)

threads_per_block = 32 # up to 1024, ideally a multiple of 32 (the warp size)

# compute the number of blocks needed to cover the vectors: ceil(4096 / 32) = 128
blocks = (n + threads_per_block - 1) // threads_per_block

# call the CUDA kernel on the GPU
vec_add[blocks, threads_per_block](d_a, d_b, d_c, n)

# copy the device arrays (d_*) back to the host arrays (h_*)
h_a = d_a.copy_to_host()
h_b = d_b.copy_to_host()
h_c = d_c.copy_to_host()

print(h_c)

[0. 0. 0. ... 0. 0. 0.]
[5. 5. 5. ... 5. 5. 5.]
[5. 5. 5. ... 5. 5. 5.]


## Exercise 2: Simple Memory Management

By default, if we pass host (NumPy) arrays directly to a kernel and let Numba take care of the data transfers, Numba will copy all three arrays to the device and back on every call.

This would be a good time to do some profiling using nvprof:
```
==8912== Profiling application: python vec_add.py
==8912== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   61.98%  79.743ms         6  13.290ms  5.6272ms  28.232ms  [CUDA memcpy DtoH]
                   36.66%  47.159ms         6  7.8599ms  5.3962ms  12.592ms  [CUDA memcpy HtoD]
                    1.36%  1.7535ms         2  876.77us  876.74us  876.80us  cudapy::__main__::dot_numba_cuda_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
      API calls:   39.09%  83.395ms         6  13.899ms  5.7020ms  29.060ms  cuMemcpyDtoH
                   38.00%  81.068ms         1  81.068ms  81.068ms  81.068ms  cuDevicePrimaryCtxRetain
                   22.22%  47.419ms         6  7.9031ms  5.3943ms  12.724ms  cuMemcpyHtoD
```

**61.98%** of the GPU time is spent in **[CUDA memcpy DtoH]** and another **36.66%** in **[CUDA memcpy HtoD]**. **In total, 98.6% of the execution time on the GPU is lost in data transfers.** Only 1.36% of the time is spent on the actual computation.

But in fact, we don't need to copy all three arrays every time. We only need to copy arrays a and b **to** the device (c will be written on the device), and we need to copy array c **from** the device to get the results.
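
To put rough numbers on this, the sketch below counts the bytes moved per kernel call for the vectors used in this notebook (n = 4096 `float32` values). The helper `bytes_moved` is a hypothetical name introduced for illustration, and the "automatic" case assumes, as described above, that all three arrays are copied in both directions.

```python
FLOAT32_BYTES = 4  # size of one np.float32 element


def bytes_moved(n, arrays_to_device, arrays_from_device):
    """Total bytes copied between host and device for one kernel call."""
    return n * FLOAT32_BYTES * (arrays_to_device + arrays_from_device)


n = 4096

# Automatic transfers: a, b and c go to the device, and all three come back.
automatic = bytes_moved(n, arrays_to_device=3, arrays_from_device=3)

# Manual transfers: only a and b go to the device, only c comes back.
manual = bytes_moved(n, arrays_to_device=2, arrays_from_device=1)

print(automatic, manual)  # 98304 49152
```

Under these assumptions the manual scheme moves half as many bytes. The actual time saved depends on the hardware and on the per-transfer overhead, so it is worth confirming with a profiler.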

Below are some examples of how to control data transfers manually:

```
# Create device array d_a from array a and copy it to the device
d_a = cuda.to_device(a)

# Alternatively, create device array d_c from array c but DON'T copy it
d_c = cuda.to_device(c, copy=False)

# Copy the content of device array d_c to host array c
d_c.copy_to_host(c)
``` 

Then when calling the kernel function, use the freshly created device arrays rather than the host arrays:
```
vec_add[blocks, threads_per_block](d_a, d_b, d_c, n)
```


If you did it right, it should be significantly faster!