<a href="https://colab.research.google.com/github/shubhamck/CoronaCV/blob/master/CUDA_Chapter_1_Scale.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to CUDA C++ Tutorials in Google Colab
This tutorial series starts from scratch and assumes no prior knowledge of CUDA.

## Chapter 1: Running your first CUDA C++ program in Google Colab

### Setting up Colab with an NVIDIA GPU
Google Colab is a handy tool for prototyping Python code and machine learning models. Behind the scenes, though, a Colab notebook can be treated as just another Linux computer.

When you open a new Colab notebook, Google gives you a minimal setup: a CPU and no GPU.

So first, let's tell Google to give us a GPU:

* Click on `Runtime` in the toolbar above
* Click on `Change runtime type` and select `T4 GPU`
![image.png](https://i.postimg.cc/VsZnBTFv/colab-gpu.png)

Now let's check that Google gave us a GPU. Run `nvidia-smi` to list the GPU, and `which nvcc` to confirm that the CUDA compiler is installed:

In [1]:
!which nvcc
!nvidia-smi

/usr/local/cuda/bin/nvcc
Sat Aug 10 23:08:31 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   59C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                           

> # ⛔ Alert
> Google only allows free GPU access for a limited time.
> I suggest switching the runtime back to `CPU` while you are only writing code. If you keep the `GPU` on for a long time, Google will ask you to pay to continue using it.

### Writing a simple CUDA kernel to multiply every element of an array by a number on the GPU

A common first CUDA kernel is a simple scalar multiplication over an array: every element is multiplied by the same number.

Let's look at the steps to do this:

1. Create an array on the CPU
2. Initialize the array with some `ints`
3. Create pointers, called `device pointers`, which will point to the memory locations on the GPU that hold the input and output arrays, and allocate that memory
4. Copy the input array to the GPU
5. Run the kernel
6. Copy the result array back to the CPU
7. Verify that the kernel performed the action as expected
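
Step 7 can be done on the CPU by recomputing the expected values and comparing them with what came back from the GPU. The following is a minimal host-side sketch, not part of the original program: the helper name `verify_scaled` is hypothetical, and it assumes the host arrays are plain `int` buffers like the ones used in this chapter.

```cpp
#include <cstddef>

// Hypothetical helper (not part of the original program): returns true when
// every out[i] equals in[i] * scale_factor, which is what the kernel should
// have produced.
bool verify_scaled(const int* in, const int* out, std::size_t n, int scale_factor)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        if (out[i] != in[i] * scale_factor)
        {
            return false; // the first mismatch is enough to fail the check
        }
    }
    return true;
}
```

In the program below, it could be called as `verify_scaled(h_in, h_out, ARRAY_SIZE, SCALE_FACTOR)` once the result has been copied back to the CPU.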

In [8]:
%%writefile scale.cu

#include <stdio.h>
#include <iostream>

// Kernel: every thread launched on the GPU runs this function once
__global__ void scale(int* in, int* out, int scale_factor)
{
    // Each thread knows its own index within its block.
    // This kernel is launched with a single block, so the index is also the array index.
    int id = threadIdx.x;

    // Read this thread's input element, multiply it by the scale factor,
    // and write the result to the same position in the output array
    out[id] = in[id] * scale_factor;
}


int main(int argc,char **argv)
{
    // Define some constants
    constexpr size_t ARRAY_SIZE = 64; // Size of input array
    constexpr size_t ARRAY_BYTES = ARRAY_SIZE * sizeof(int); // Size of input array in bytes
    constexpr int SCALE_FACTOR = 5; // scale factor

    // Initialize input array on CPU
    int h_in[ARRAY_SIZE];
    for (size_t i = 0; i < ARRAY_SIZE; ++i)
    {
        h_in[i] = static_cast<int>(i);
    }

    // Declare output array on CPU
    int h_out[ARRAY_SIZE];

    // Declare pointers on the CPU that will point to locations in GPU global memory
    int* d_in;
    int* d_out;

    // Allocate memory on the GPU
    cudaMalloc((void**) &d_in, ARRAY_BYTES);
    cudaMalloc((void**) &d_out, ARRAY_BYTES);

    // Copy input array from CPU to GPU
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

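    // Run the kernel with 1 block of ARRAY_SIZE threads, one thread per element
    scale<<<1, ARRAY_SIZE>>>(d_in, d_out, SCALE_FACTOR);

    // Check that the launch was accepted, then wait for the kernel to finish
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
    {
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess)
    {
        std::cerr << "Kernel failed: " << cudaGetErrorString(err) << "\n";
    }

    // Copy output array from GPU to CPU
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

    // Free the memory on the GPU
    cudaFree(d_in);
    cudaFree(d_out);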


    // Print output array on the CPU
    for (size_t i = 0; i < ARRAY_SIZE; ++i)
    {
        std::cout << i << " : " << h_out[i] << "\n";
    }
    return 0;
}

Overwriting scale.cu
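
To build intuition for what the launch `scale<<<1, ARRAY_SIZE>>>(...)` does, it can help to picture the same work as a plain CPU loop. The sketch below is only an analogy written in host C++: the names `scale_one_thread` and `emulate_launch` are hypothetical, and the loop runs the "threads" one after another, whereas the GPU is free to run them in parallel.

```cpp
#include <cstddef>

// Hypothetical stand-in for the kernel body: the work done by ONE thread.
// On the GPU, thread_id would come from threadIdx.x.
void scale_one_thread(std::size_t thread_id, const int* in, int* out, int scale_factor)
{
    out[thread_id] = in[thread_id] * scale_factor;
}

// Hypothetical stand-in for launching one block of num_threads threads.
// The loop visits the thread IDs sequentially; a GPU may run them concurrently.
void emulate_launch(std::size_t num_threads, const int* in, int* out, int scale_factor)
{
    for (std::size_t thread_id = 0; thread_id < num_threads; ++thread_id)
    {
        scale_one_thread(thread_id, in, out, scale_factor);
    }
}
```

Because each thread touches only its own element, the order in which the threads run does not change the result, which is why this operation maps so naturally onto a GPU.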


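Now let's compile the code. The first two commands replace `/usr/local/cuda` with a link to the CUDA 12.2 toolkit, matching the CUDA version that `nvidia-smi` reported above. The `-g` and `-G` flags ask `nvcc` to include debug information for host and device code, respectively.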

In [9]:
!rm -rf /usr/local/cuda
!ln -s /usr/local/cuda-12.2 /usr/local/cuda
!nvcc -g -G scale.cu -o scale

This will generate a `scale` binary in the current directory. You can run this binary and see the output:

In [10]:
!./scale

0 : 0
1 : 5
2 : 10
3 : 15
4 : 20
5 : 25
6 : 30
7 : 35
8 : 40
9 : 45
10 : 50
11 : 55
12 : 60
13 : 65
14 : 70
15 : 75
16 : 80
17 : 85
18 : 90
19 : 95
20 : 100
21 : 105
22 : 110
23 : 115
24 : 120
25 : 125
26 : 130
27 : 135
28 : 140
29 : 145
30 : 150
31 : 155
32 : 160
33 : 165
34 : 170
35 : 175
36 : 180
37 : 185
38 : 190
39 : 195
40 : 200
41 : 205
42 : 210
43 : 215
44 : 220
45 : 225
46 : 230
47 : 235
48 : 240
49 : 245
50 : 250
51 : 255
52 : 260
53 : 265
54 : 270
55 : 275
56 : 280
57 : 285
58 : 290
59 : 295
60 : 300
61 : 305
62 : 310
63 : 315
