# Intro to CuPy
Project Webpage: [https://cupy.dev/](https://cupy.dev/) \
User Guide: [https://docs.cupy.dev/en/stable/user_guide/index.html](https://docs.cupy.dev/en/stable/user_guide/index.html)

Query information about the GPUs available.

In [None]:
!nvidia-smi

## CuPy vs NumPy

CuPy (CUDA Python) has a syntax very similar to that of NumPy (Numerical Python).

While NumPy arrays are stored on the CPU, CuPy arrays are stored on the GPU.

In [None]:
import numpy as np
import cupy as cp

size = 2048

# Initializes a random 2048x2048 matrix on the CPU
A_cpu = np.random.rand(size, size).astype(np.float64)

# Initializes a random 2048x2048 matrix on the GPU
A_gpu = cp.random.rand(size, size).astype(np.float64)

NumPy arrays can be changed into CuPy arrays by copying them from the CPU to the GPU, and vice versa. This conversion is not implicit, so you can't apply CuPy operations on NumPy arrays without copying them over first.

In [None]:
# Array is initialized on the CPU
B_cpu = np.random.randn(size, size)
print(f"B_cpu type: {type(B_cpu)}")

# Copy array from CPU(host) —> GPU(device)
B_gpu = cp.asarray(B_cpu)
print(f"B_gpu type: {type(B_gpu)}")

# Apply calculations on the GPU
B_gpu = cp.sin(B_gpu)

# Copy array from GPU(device) —> CPU(host)
B_cpu = cp.asnumpy(B_gpu)
print(f"B_cpu type: {type(B_cpu)}")

In [None]:
# This raises a TypeError: CuPy functions do not accept NumPy arrays
cp.sin(B_cpu)
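
Because the conversion is never implicit, a function that should accept either kind of array has to pick the matching library itself. The sketch below uses `cupy.get_array_module` for that; `sine_anywhere` and `HAVE_GPU` are hypothetical names introduced for illustration, and the guard at the top is only there so the cell also runs on a machine without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    # Raises, or returns 0, when no usable CUDA device is present
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    cp = None
    HAVE_GPU = False


def sine_anywhere(x):
    """Apply sin with whichever library owns x (hypothetical helper)."""
    # get_array_module returns cupy for CuPy arrays and numpy otherwise
    xp = cp.get_array_module(x) if HAVE_GPU else np
    return xp.sin(x)


host_result = sine_anywhere(np.linspace(0.0, 1.0, 5))
print(type(host_result))

if HAVE_GPU:
    device_result = sine_anywhere(cp.linspace(0.0, 1.0, 5))
    print(type(device_result))
```

The result stays wherever the input lived, so no hidden host/device copy takes place.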

### Let's plot a part of B_cpu and B_gpu

In [None]:
import matplotlib.pyplot as plt

plt.plot(B_cpu[0,:100])

In [None]:
# matplotlib needs a NumPy array, so copy the slice back to the host first
plt.plot(B_gpu[0, :100].get())

### Let's compare the speed of operations

Since operations on CuPy arrays are done on the GPU, they can be much faster than NumPy operations on the CPU, especially for dense linear algebra on large matrices.

Note: CuPy launches GPU kernels asynchronously, so a call can return before the GPU has finished. `cp.cuda.Device().synchronize()` blocks until all queued GPU work is done, which is needed here for accurate timing; ordinary code rarely needs it.

In [None]:
# NumPy matrix multiplication
%timeit -n 5 C_cpu = np.matmul(A_cpu, B_cpu);

In [None]:
# CuPy matrix multiplication
%timeit -n 5 C_gpu = cp.matmul(A_gpu, B_gpu); cp.cuda.Device().synchronize()
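
As an alternative to synchronizing the whole device, a single operation can be timed with CUDA events recorded on the GPU stream. This is a minimal sketch, not the notebook's own method: `time_matmul_ms` and `HAVE_GPU` are hypothetical names, and the NumPy branch is only a fallback so the cell runs without a GPU.

```python
import time
import numpy as np

try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    cp = None
    HAVE_GPU = False


def time_matmul_ms(n=512):
    """Return elapsed milliseconds for one n x n matrix product (hypothetical helper)."""
    if not HAVE_GPU:
        # Fallback: time the NumPy product with a host timer
        a = np.random.rand(n, n)
        b = np.random.rand(n, n)
        t0 = time.perf_counter()
        np.matmul(a, b)
        return (time.perf_counter() - t0) * 1000.0

    a = cp.random.rand(n, n)
    b = cp.random.rand(n, n)
    cp.matmul(a, b)      # warm-up call so one-time setup is not timed

    start = cp.cuda.Event()
    end = cp.cuda.Event()
    start.record()       # enqueue a start marker on the current stream
    cp.matmul(a, b)      # returns immediately; the kernel runs asynchronously
    end.record()         # enqueue an end marker after the kernel
    end.synchronize()    # block the host until the end marker is reached
    return cp.cuda.get_elapsed_time(start, end)


elapsed_ms = time_matmul_ms()
print(f"matmul took {elapsed_ms:.3f} ms")
```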

Does the GPU look slower than expected? Re-run the cells above: the first run includes one-time overheads (see the Overhead section below).

### Multiple GPUs
CuPy also lets us work with data on multiple GPUs. As with host/device transfers, data has to be copied explicitly from one GPU to the other.

In [None]:
# (Will only work if you have more than 1 GPU)

# Create array on GPU 1
with cp.cuda.Device(1):
    C_gpu1 = cp.zeros((size, size), dtype=cp.float64)

# Copy array from GPU 1 —> GPU 0
with cp.cuda.Device(0): # not necessary, default device is 0
    C_gpu0 = cp.asarray(C_gpu1)
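
The cell above fails on a single-GPU machine. A guarded variant, sketched here with the hypothetical names `n_devices`, `x_gpu1` and `x_gpu0`, first asks the CUDA runtime how many devices are visible and only then performs the GPU-to-GPU copy.

```python
try:
    import cupy as cp
    n_devices = cp.cuda.runtime.getDeviceCount()
except Exception:
    # No CuPy installation or no usable CUDA driver
    cp = None
    n_devices = 0

print(f"Visible GPUs: {n_devices}")

if n_devices >= 2:
    # Allocate on GPU 1
    with cp.cuda.Device(1):
        x_gpu1 = cp.arange(4, dtype=cp.float64)
    # asarray copies the data to the current device (GPU 0)
    with cp.cuda.Device(0):
        x_gpu0 = cp.asarray(x_gpu1)
    print(f"source on {x_gpu1.device}, copy on {x_gpu0.device}")
else:
    print("Skipping the GPU-to-GPU copy: fewer than two GPUs are visible")
```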

## Overhead

There are 2 types of overhead to keep in mind when using the GPU with CuPy: **kernel overhead** and **data movement overhead**.

### Kernel Overhead

The first time a function is called in CuPy, there is compilation overhead because CuPy uses Just-In-Time (JIT) compilation. Later calls reuse the cached kernel, so they are much faster.

**The compiled kernels are cached in:** `~/.cupy/kernel_cache`
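
To see whether anything has been cached yet, the directory can be inspected from Python. This sketch assumes the default location given above, unless the `CUPY_CACHE_DIR` environment variable overrides it; `cache_dir` and `n_files` are names introduced for illustration.

```python
import os

# Default cache location, unless CUPY_CACHE_DIR points somewhere else
cache_dir = os.environ.get(
    "CUPY_CACHE_DIR", os.path.expanduser("~/.cupy/kernel_cache")
)

if os.path.isdir(cache_dir):
    # Count every file below the cache directory
    n_files = sum(len(files) for _, _, files in os.walk(cache_dir))
    print(f"{cache_dir} holds {n_files} cached files")
else:
    n_files = 0
    print(f"{cache_dir} does not exist yet (no kernels compiled so far)")
```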

In [None]:
size = 256

for i in range(4):
    D_gpu = cp.random.rand(size,size).astype(np.float64)
    %time cp.linalg.eigh(D_gpu); cp.cuda.Device().synchronize()

There is also a CUDA kernel launch overhead of a few microseconds every time a GPU kernel is launched. This overhead is amortized by larger problem sizes.

In [None]:
for size in [128, 256, 512, 1024]:
    print(f"\nArray size {size}x{size}")
    
    # NumPy
    print("- NumPy time")
    E_cpu = np.random.rand(size,size).astype(np.float64)
    %time np.linalg.eigh(E_cpu);

    # CuPy
    print("- CuPy time")
    E_gpu = cp.random.rand(size,size).astype(np.float64)
    cp.linalg.eigh(E_gpu);  # warm-up call, so JIT compilation is not timed
    %time cp.linalg.eigh(E_gpu); cp.cuda.Device().synchronize()

The CUDA kernel launch overhead can also be reduced by merging multiple kernels into one. With the `@cupy.fuse` decorator, the two multiplications in `2*x*y` are compiled into a single kernel, so the fused version takes less time than the unfused one because it pays the launch overhead only once.

In [None]:
def double_multiply(x, y):
    return 2*x*y

@cp.fuse
def double_multiply_fused(x,y):
    return 2*x*y

In [None]:
size = 2**16
F1 = cp.random.rand(size)
F2 = cp.random.rand(size)

double_multiply(F1, F2)  # warm-up call, so JIT compilation is not timed
%timeit -n 7 double_multiply(F1, F2); cp.cuda.Device().synchronize()

double_multiply_fused(F1, F2)  # warm-up call, so JIT compilation is not timed
%timeit -n 7 double_multiply_fused(F1, F2); cp.cuda.Device().synchronize()

There is also a one-time overhead on the very first CuPy call in a program, because the CUDA driver has to create the CUDA context.

### Data Movement Overhead

Transferring data between the CPU and the GPU is slower than processing the data on the GPU, so minimizing data movement in or out of the GPU is best for performance.

In [None]:
import time

In [None]:
# All data and operations on CPU

times = []
for i in range(10):
    start = time.perf_counter()  # high-resolution timer intended for measuring intervals
    
    G_cpu = np.random.rand(size).astype(np.float64)
    H_cpu = np.random.rand(size).astype(np.float64)
    np.vdot(H_cpu, G_cpu);
    
    times.append(time.perf_counter() - start)

print(f"All CPU takes on average {np.mean(times[-9:])*1000} ms")

In [None]:
# All data and operations on GPU

times = []
for i in range(10):
    start = time.perf_counter()
    
    G_gpu = cp.random.rand(size).astype(np.float64)
    H_gpu = cp.random.rand(size).astype(np.float64)
    cp.vdot(H_gpu, G_gpu)
    cp.cuda.Device().synchronize()
    
    times.append(time.perf_counter() - start)

print(f"All GPU takes on average {np.mean(times[-9:])*1000} ms")

In [None]:
# Transfer data from CPU to GPU to operate on GPU

times = []
for i in range(10):
    start = time.perf_counter()
    
    G_gpu = cp.asarray(G_cpu)
    H_gpu = cp.asarray(H_cpu)
    cp.vdot(H_gpu, G_gpu)
    cp.cuda.Device().synchronize()

    times.append(time.perf_counter() - start)

print(f"CPU —> GPU takes on average {np.mean(times[-9:])*1000} ms")

## GPU Memory Management

Query the free and total memory with `nvidia-smi` shell commands or in Python using CuPy.

In [None]:
!nvidia-smi -i 0 --query-gpu=memory.free,memory.total --format=csv

In [None]:
print("(memory free, memory total) in bytes:")
print(cp.cuda.Device().mem_info)

If you try to allocate more memory than the GPU has available, CuPy raises an `OutOfMemoryError`.

In [None]:
size = 2**16
I_gpu = cp.zeros((size, size))
J_gpu = cp.zeros((size, size)) 
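
The arithmetic behind the failure is easy to check before allocating anything. The sketch below assumes the default `float64` dtype of `cp.zeros`; `bytes_per_array` and `gib_per_array` are names introduced for illustration.

```python
import numpy as np

size = 2**16

# Each element of a float64 array takes 8 bytes
bytes_per_array = size * size * np.dtype(np.float64).itemsize
gib_per_array = bytes_per_array / 2**30

print(f"One {size}x{size} float64 array needs {gib_per_array:.0f} GiB")
print(f"Two such arrays need {2 * gib_per_array:.0f} GiB")
```

Comparing `bytes_per_array` with the free memory reported by `cp.cuda.Device().mem_info` shows in advance whether an allocation can fit.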

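Release the unused blocks held by CuPy's memory pool back to the GPU. Memory of arrays that are still referenced is not freed.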

In [None]:
cp.get_default_memory_pool().free_all_blocks()
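
The effect of the pool can be observed through its `used_bytes()` and `total_bytes()` counters. This is a small sketch with hypothetical names (`pool_stats`, `tmp`); it allocates only 8 MiB and does nothing when no GPU is present.

```python
try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    cp = None
    HAVE_GPU = False

pool_stats = []  # (used_bytes, total_bytes) after each step

if HAVE_GPU:
    mempool = cp.get_default_memory_pool()

    tmp = cp.zeros((1024, 1024))  # 8 MiB of float64
    pool_stats.append((mempool.used_bytes(), mempool.total_bytes()))

    del tmp  # the block returns to the pool, not to the driver
    pool_stats.append((mempool.used_bytes(), mempool.total_bytes()))

    mempool.free_all_blocks()  # unused blocks are handed back to the GPU
    pool_stats.append((mempool.used_bytes(), mempool.total_bytes()))

    labels = ["after allocation", "after del", "after free_all_blocks"]
    for label, (used, total) in zip(labels, pool_stats):
        print(f"{label}: used={used} bytes, held by pool={total} bytes")
```

Memory held by the pool but not in use still counts as occupied in `nvidia-smi`, which is why freeing the blocks changes the reported numbers.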

In [None]:
!nvidia-smi -i 0 --query-gpu=memory.free,memory.total --format=csv

One way to avoid `OutOfMemoryError` is to use unified memory, where CUDA migrates data between the CPU and GPU on demand (when a page fault occurs).

In [None]:
cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

size = 2**16
I_gpu = cp.zeros((size, size))
J_gpu = cp.zeros((size, size))
# These allocations can succeed now that the allocator uses unified memory

Operations on these arrays can be slower due to the GPU moving pages in and out of its memory.

In [None]:
%%time
cp.multiply(I_gpu, J_gpu)
cp.cuda.Device().synchronize()

## Cleanup

In [None]:
# restart the kernel
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)