# Intro to CUDA
A small preface on how coding in CUDA works, covering some basic components that'll build a strong foundation before we dive into the matrix multiplication kernels.
<br><br>

***Read if you do not know C++***

<sub>Go learn C++. Seriously, though, CUDA is essentially C++ with GPU-specific extensions. Teaching you C++ here would require more Markdown than this repository can hold, so do yourself a favour and at least brush up on some basic syntax before you start digging in, to make the most out of this guide.</sub>

<br><br>
## Execution Model - Introductory
Okay, I want you to pay real close attention to this part, as it is VITAL to understanding how parallel computing works - and in turn, how we can optimize it.

### Threads
Threads are the smallest unit of execution - imagine them as mini workers inside a computer chip, each handling one small task at a time.

Processes are different from threads. They are independent programs in execution, and threads live inside processes (hence, smallest). Think of a process as a team, and the threads as the mini workers in that team.

Processes are independent because each of them has its own memory space, while threads in the same process share the same memory space. This is why one process crashing does not affect other processes, but one thread crashing can bring down the entire process.

#### Cores
Cores are the physical execution units in a processor that run threads.

Each core can technically only execute one thread at a time, but each thread only gets an extremely brief execution slice before the CPU switches to another -- thousands of times per second. This time-slicing creates the illusion that many threads are running simultaneously.

**Small interactive exercise for you:** Click off this guide and go check how many cores your CPU has.

Got it? Great. So, your CPU having X cores means that technically only X threads are *truly* running at any given moment, and the rest are just taking turns so quickly that you don't notice. Your Chrome browser, VSCode, and Riot Client are all juggling threads -- pausing one task for a fraction of a millisecond to let another run, then switching back.
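If you'd rather ask the machine than dig through system settings, the C++ standard library can report this number. Below is a minimal sketch, assuming a C++11 compiler. `describe_core_count` and `describe_this_machine` are hypothetical helper names made up for this guide. Note that `std::thread::hardware_concurrency()` reports *logical* cores (hardware threads), which can be larger than the physical core count, and it is allowed to return 0 when the value cannot be determined.

```cpp
#include <string>
#include <thread>

// Hypothetical helper: turns the reported count into a readable message.
// A value of 0 means the standard library could not determine the count.
std::string describe_core_count(unsigned int n) {
    if (n == 0) {
        return "core count unknown";
    }
    return std::to_string(n) + " hardware threads can run at once";
}

// Asks the standard library how many threads can truly run concurrently
// on this machine and describes the answer.
std::string describe_this_machine() {
    return describe_core_count(std::thread::hardware_concurrency());
}
```

Printing `describe_this_machine()` from your own `main()` gives you the X used in the rest of this section.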

Also, your mighty X cores? Sit down for this one.
A state-of-the-art GPU (RTX 4090) has upwards of 16,000 CUDA cores (16,384, to be exact). These CUDA cores don't rely on time-slicing the way CPU cores do, though -- thousands of threads genuinely execute at the same time. This is where the limitations of CPUs, and the dominance of GPUs for massively parallel workloads, come into play.

### Blocks
Blocks are groups of threads that work together on the GPU. Threads in the same block can read and write a fast, on-chip shared memory region (SMEM) -- this is crucial for optimization tricks we'll see later on.

If you are enjoying the workers analogy - a block can be thought of as a bigger department of mini workers, all of whom are in the same office, able to talk to each other and use the department's shared tools and resources. However, our poor mini workers are under a cruel corporation, and they cannot directly talk to workers in other departments ~ **threads in different blocks cannot access each other's shared memory** (they can only exchange data through the much slower global memory).

### Grids
Grids are collections of blocks launched for the kernel. More precisely, when each kernel is launched (invoked) it creates a single grid, which has a user-defined number of blocks.

A grid would be the entire building of the corporation, with many different departments of mini workers, who are under strict command to not talk to each other.

### Keep reading and you will see how these all tie together.

<br><br>
## Kernel Launch Syntax
If I haven't mentioned it yet (I actually have not), a **kernel** is a function written in CUDA that runs on the GPU, executed by many threads in parallel.

<sub> You might be thinking of "kernels" in a different context, especially because of how intertwined they are with processes and cores. Let me stop you right there. This is a very common misconception. You would be thinking of OS kernels, which are a core component of an OS, responsible for managing system resources. These have **NOTHING** to do with the CUDA kernels we are focusing on. </sub>

CUDA kernels run on the GPU, but are launched (similar to a function call) from CPU (host) code. When kernels are launched, a **grid** of **blocks** is created. We can pass in the number of **blocks**, and the number of **threads** per **block** (similar to function arguments).

*Time for some actual code, finally!*


In [None]:
dim3 gridDim(32, 32, 1);      
dim3 blockDim(32, 32, 1);     

Both of these lines initialize objects of type `dim3`, a CUDA-specific struct that holds three unsigned integers: x, y and z, giving the size of each dimension.

Since each kernel launches a single grid, defining the dimensions of that grid (gridDim) is essentially deciding how many blocks to have. A grid with dimensions x,y,z will have x * y * z blocks, so 32 * 32 * 1 = 1024 blocks.

Then defining the dimensions of each block (blockDim) means deciding the number of threads to have in each block. A block with dimensions x,y,z will have x * y * z threads, so 32 * 32 * 1 = 1024 threads that run in parallel and share memory. (1024 also happens to be the maximum number of threads CUDA allows in a single block.)

Maybe I should have asked you to pull out your calculator too. We can see that the total number of threads would be 1024 * 1024 = 1,048,576.

So number of threads = `gridDim.x * gridDim.y * gridDim.z * blockDim.x * blockDim.y * blockDim.z`
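To check that arithmetic without a GPU, here is a small CPU-only sketch of the formula above. `Dim3` is a hypothetical stand-in for CUDA's `dim3` so that the snippet compiles with a plain C++ compiler, and `volume` and `total_threads` are helper names invented for illustration.

```cpp
#include <cstddef>

// Hypothetical stand-in for CUDA's dim3, so this compiles without the CUDA toolkit.
struct Dim3 {
    unsigned int x, y, z;
};

// Number of elements in a 3D extent: x * y * z.
std::size_t volume(const Dim3& d) {
    return static_cast<std::size_t>(d.x) * d.y * d.z;
}

// Total threads in a launch = blocks in the grid * threads per block.
std::size_t total_threads(const Dim3& grid, const Dim3& block) {
    return volume(grid) * volume(block);
}
```

With the values from this section, `total_threads(Dim3{32, 32, 1}, Dim3{32, 32, 1})` evaluates to 1048576, matching the calculator.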

##### Great image from Simon:

![](../../images/GEMM1/dimension-viz.png)

Since z = 1 here, just think of both the grid and each block as a single flat layer of a 3D volume.

Each unit square in a grid is a block.

Each unit square in each of these blocks is a thread.
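The nesting in the picture can also be walked through on the CPU. The sketch below is only an emulation: it visits the blocks and threads one after another in plain loops, whereas the GPU would run them in parallel. It assumes the indexing idiom commonly used in CUDA kernels, where a thread's global position along an axis is `blockIdx * blockDim + threadIdx`. This guide has not introduced that idiom yet, so treat it as an assumption here. `Extent2`, `global_index` and `visit_counts` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 2D extent (the z dimension is 1 in the example above).
struct Extent2 {
    unsigned int x, y;
};

// Global coordinate of a thread along one axis:
// blocks before it * threads per block + its position inside its own block.
unsigned int global_index(unsigned int block_idx, unsigned int block_dim,
                          unsigned int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Walks every block in the grid and every thread in each block (serially, on
// the CPU) and records how many times each global cell is visited.
std::vector<int> visit_counts(Extent2 grid, Extent2 block) {
    const std::size_t width = static_cast<std::size_t>(grid.x) * block.x;
    const std::size_t height = static_cast<std::size_t>(grid.y) * block.y;
    std::vector<int> counts(width * height, 0);

    for (unsigned int by = 0; by < grid.y; ++by) {
        for (unsigned int bx = 0; bx < grid.x; ++bx) {
            for (unsigned int ty = 0; ty < block.y; ++ty) {
                for (unsigned int tx = 0; tx < block.x; ++tx) {
                    const std::size_t gx = global_index(bx, block.x, tx);
                    const std::size_t gy = global_index(by, block.y, ty);
                    counts[gy * width + gx] += 1;
                }
            }
        }
    }
    return counts;
}
```

For a 32 x 32 grid of 32 x 32 blocks, every one of the 1024 x 1024 cells ends up with a count of exactly 1: each thread owns one unique square, and no square is left without a thread.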

That pen and paper I asked you to grab earlier? Yeah, I was not joking. I would advise you to make a rough sketch of these models, just so you have it with you as you go further along.

<sub>Note: the dim3 objects can be named anything i.e. not constrained to just gridDim and blockDim, but these would be appropriate identifiers</sub>



In [None]:
sgemm_naive<<<gridDim, blockDim>>>(M, N, K, alpha, A, B, beta, C);

This is a CUDA kernel launch, which, as you can see, looks similar to a function call - just with a few unfamiliar elements. 

The official syntax is:

`kernel_name<<< gridDim, blockDim, sharedMem, stream >>>(regular function arguments…)`

The *parameters* passed within `<<< >>>` are actually execution configuration parameters, setting up the kernel grid that will be created, as well as (optionally) how much shared memory to allocate or which stream to use.

To clarify, this happens on the **CPU (host)**, typically inside `int main()` (or another host function). The kernel then runs on the GPU.

For those of you familiar with web development, you can think of it like an API call from your frontend (CPU in this case) to an endpoint in your backend (GPU in this case). 

###### A CUDA kernel launch is, technically speaking, an API call to the CUDA runtime library, which handles the scheduling of the kernel on the GPU.

The frontend sends the request (launch configuration and kernel arguments) and the backend processes the request (executes the kernel). This is where our analogy kind of breaks: the regular API flow would be for the backend (GPU) to send a response back to the frontend (CPU).

This *can* happen but is not the case by default, as the host must explicitly call a memory transfer function to transfer data back to the CPU. Otherwise, the data just sits there in GPU memory, which is separate from CPU memory.



In [None]:
cudaMemcpy(destination_ptr, src_ptr, size, cudaMemcpyDeviceToHost);

This is the memory transfer function previously mentioned. It takes in the destination pointer (`void*`) and the source pointer (`const void*`), the size, i.e. the number of bytes to copy, and the kind, i.e. the direction of the copy, which can be:

- cudaMemcpyHostToDevice ~ Host(CPU) -> Device(GPU) 
- cudaMemcpyDeviceToHost ~ Device(GPU) -> Host(CPU)
- cudaMemcpyDeviceToDevice ~ Device(GPU) -> Device(GPU)
- cudaMemcpyHostToHost ~ Host(CPU) -> Host(CPU)



So, only after this transfer function call will the CPU have a copy of the kernel's results from the GPU.
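One detail that is easy to trip over: `size` is a count of *bytes*, not of elements. The CPU-only sketch below shows the arithmetic for a matrix of floats and uses `std::memcpy` as a stand-in for the copy, since it takes its arguments in the same (destination, source, byte count) order. It is not a CUDA call and moves nothing to or from a GPU. `matrix_bytes` and `copy_matrix` are hypothetical helper names.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Bytes needed for a rows x cols matrix of floats.
std::size_t matrix_bytes(std::size_t rows, std::size_t cols) {
    return rows * cols * sizeof(float);
}

// CPU-only stand-in for a transfer: copies rows * cols floats from src into a
// fresh buffer. The caller must ensure src holds at least rows * cols elements.
std::vector<float> copy_matrix(const std::vector<float>& src,
                               std::size_t rows, std::size_t cols) {
    std::vector<float> dst(rows * cols);
    // destination first, then source, then the size in BYTES
    std::memcpy(dst.data(), src.data(), matrix_bytes(rows, cols));
    return dst;
}
```

Passing `rows * cols` instead of `matrix_bytes(rows, cols)` as the size would copy only a fraction of the matrix, which is a classic source of half-empty result buffers.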

In [None]:
 cudaDeviceSynchronize(); 

This function call makes the CPU wait until the GPU has finished. This is an optional step, but it is often used.

<sub>**Note:** In the launch sequence this call goes before cudaMemcpy(). It is not strictly required there, because cudaMemcpy() already waits for the kernel to finish before it transfers the data. It has just been included here to make the execution order explicit.</sub>

<br><br>
## Consolidated CUDA launch code

In [None]:
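int main() {
    // assume we have set up pointers here for all the arguments below !
    dim3 gridDim(32, 32, 1);   // 32 * 32 * 1 = 1024 blocks in the grid
    dim3 blockDim(32, 32, 1);  // 32 * 32 * 1 = 1024 threads per block

    // launch the kernel on the GPU with the configuration above
    sgemm_naive<<<gridDim, blockDim>>>(M, N, K, alpha, A, B, beta, C);
    cudaDeviceSynchronize();  // make the CPU wait for the GPU to finish
    cudaMemcpy(h_C, C, size, cudaMemcpyDeviceToHost);  // copy the result from GPU memory into h_C

    return 0;
}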


So, to summarise using this example:
The host (CPU) function sets up pointers and execution configuration parameters, then uses those to launch the kernel, which executes on the GPU. In this case, it then asks for the result of the kernel to be copied from GPU memory back to the CPU (into h_C here).

## CUDA Project Structure

Apologies for going on a bit long, but we're now neatly wrapping up a tightly packed yet essential foundation in CUDA - enough to set you well on your way. Hopefully, the nerves you had approaching this guide 20 (I hope) minutes ago have eased up and you are settling into things. All that's left for this introduction is to fill in some final gaps, showing you how CUDA goes from an abstract concept to something actually running on your system.

##### The core file types:
- .cu: CUDA source files, which contain the kernel definitions - one file can hold multiple kernels
- .cuh: CUDA header files - hold kernel declarations, constants, macros, etc. Included in the host source code with #include
- .cpp/.cc: normal C++ source files - contain host-only code. The `<<< >>>` launch syntax only works in them if they are compiled as CUDA by nvcc (`-x cu`); otherwise they reach kernels through host wrapper functions defined in a .cu file

Interestingly, the .cu source files can also contain normal C/C++ host code, since they are compiled with NVCC, which handles both the host C/C++ code and the device CUDA code in the same source file. Thus, we can even do the following in a .cu file:

In [None]:
// kernels 
__global__ void kernelName(...){...}

int main() {
    // ... main function
    // calls to the kernel
}

And this is fine for small demos (as you will see in our kernels directory)! But when we are looking at bigger/more serious GPU development projects, we should definitely follow the standard of separation, with a quick example for each file shown below (told you C++ would be needed!):

In [None]:
// matrixMul.cu for device (kernel) code

#include "matrixMul.cuh"

__global__ void matrixMultiply(float *A, float *B, float *C, int M, int N, int K) {
    // compute logic here
}

In [None]:
// matrixMul.cuh for declaration of kernels

#pragma once

__global__ void matrixMultiply(float *A, float *B, float *C, int M, int N, int K);


In [None]:
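// main.cpp/cc for the host-side code where kernels are invoked
// note: the <<< >>> launch syntax is a CUDA extension, so this file must be
// compiled as CUDA (e.g. nvcc -x cu main.cpp) or renamed to main.cu

#include "matrixMul.cuh"
#include <cuda_runtime.h>
#include <iostream>

int main() {
    // ...
    matrixMultiply<<<blocks, threads>>>(d_A, d_B, d_C, M, N, K);
    // ...
}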
