<a href="https://colab.research.google.com/github/shubhamck/CudaColab/blob/main/CUDA_Chapter_2_Image_Dilation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to CUDA C++ Tutorials in Google Colab
This tutorial series starts from scratch and assumes no prior knowledge of CUDA.

## Chapter 2 : Dilate an Image on the GPU using CUDA C++ and OpenCV

### Setting up Colab with Nvidia GPU
Follow [Chapter 1](https://github.com/shubhamck/CoronaCV/blob/master/CUDA_Chapter_1_Scale.ipynb) to set up a GPU runtime for the notebook.

Now let's check that Google gave us a GPU. Run `which nvcc` to confirm the CUDA compiler is installed and `nvidia-smi` to see the GPU.

In [1]:
!which nvcc
!nvidia-smi

/usr/local/cuda/bin/nvcc
Sun Aug 11 04:08:41 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                           

> # ⛔ Alert
> Google only allows free GPU access for a limited time.
> I suggest using `Change Runtime` to switch back to `CPU` while coding. If you keep the `GPU` on for a long time, Google will ask you to pay to continue using it.

### Writing a simple CUDA kernel to dilate an image using a user-defined dilation template

Let's look at the steps to do this:

1. Read the input image using `opencv` and store it on the CPU
2. Create a pointer called `device input pointer` which will point to the memory location on the GPU where the input image will be copied
3. Create a pointer called `device output pointer` which will point to the memory location on the GPU where the resulting image will be written
4. Copy the input array to the GPU
5. Run the Kernel
6. Copy the result image back to the CPU
7. Verify if the image got dilated

### OpenCV C++

OpenCV is preinstalled in all Google Colab notebooks. Let's first use it in a simple program that reads an image and prints its size.

In [2]:
%%writefile opencv_test.cpp

#include <iostream>
#include <opencv2/opencv.hpp>  // found via -I/usr/include/opencv4

int main()
{
  // Load the image as single-channel 8-bit grayscale
  cv::Mat img = cv::imread("chess.png", cv::IMREAD_GRAYSCALE);
  if (img.empty())
  {
    std::cerr << "Could not read chess.png\n";
    return 1;
  }
  std::cout << "Image size : " << img.size();
  std::cout << "rows : " << img.rows << ", cols : " << img.cols << "\n";
}

Writing opencv_test.cpp


Now let's build the code.

In [3]:
!g++ opencv_test.cpp -o opencv_test -I/usr/include/opencv4 -lopencv_imgcodecs -lopencv_imgproc -lopencv_core

And run it:

In [4]:
!./opencv_test

Image size : [800 x 800]rows : 800, cols : 800


I used `chess.png`, which looks like this:

![chess.png](https://i.postimg.cc/ncyyDkR5/chess.png)


### Image Dilation in Parallel

Let's look at dilation from a parallelization point of view:

1. The Dilation Template or Kernel operates on each pixel of the input image
2. When the template is on a pixel, it finds the `max` of all the pixels falling under the template
3. This max value is then assigned to the corresponding pixel in the output image

![Dilation_img](https://i.postimg.cc/MTpWCmh9/dilation-shift.gif)
Image Source : https://python.plainenglish.io/image-dilation-explained-easily-e085c47fbac2
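
To make the three steps above concrete, here is a minimal CPU sketch of dilation on a flat, row-major grayscale buffer. The function name `dilate_cpu` and the `radius` parameter are hypothetical, chosen for illustration, and the sketch assumes a symmetric square template of width `2 * radius + 1` centred on the pixel. It is not the tutorial's GPU kernel, only a plain loop version of the same idea.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Dilate a row-major grayscale image with a square template of width
// (2 * radius + 1) centred on each pixel. `radius` is an illustrative
// parameter, not a name taken from the tutorial.
std::vector<std::uint8_t> dilate_cpu(const std::vector<std::uint8_t>& in,
                                     int rows, int cols, int radius)
{
    assert(in.size() == static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));
    std::vector<std::uint8_t> out(in.size(), 0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            std::uint8_t max_value = 0;
            // Visit every pixel under the template, skipping positions outside the image
            for (int i = r - radius; i <= r + radius; ++i) {
                for (int j = c - radius; j <= c + radius; ++j) {
                    if (i >= 0 && j >= 0 && i < rows && j < cols) {
                        max_value = std::max(max_value, in[i * cols + j]);
                    }
                }
            }
            // The max under the template becomes the output pixel
            out[r * cols + c] = max_value;
        }
    }
    return out;
}
```

For example, calling `dilate_cpu` with `radius = 1` on a `5 x 5` image that is black except for one white pixel in the centre returns an image in which the white pixel has grown into a `3 x 3` white square.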

Every output pixel depends only on the input image, so parallelizing this operation is quite easy:

1. Spawn a thread for each pixel in the input image
2. Each thread gets its pixel location `loc` from its thread identifier
3. The thread calculates the template max for that pixel
4. This value is then written to the output image pixel location `loc`
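
The mapping between a `(row, col)` pair and the flat location `loc` can be sketched as below, assuming the image is stored row-major (one row after another in memory). The names `PixelCoord`, `to_linear` and `from_linear` are hypothetical helpers for illustration only.

```cpp
#include <cassert>

// Hypothetical helper type for illustration only
struct PixelCoord
{
    int row;
    int col;
};

// (row, col) -> flat row-major location
inline int to_linear(int row, int col, int num_cols)
{
    return row * num_cols + col;
}

// flat row-major location -> (row, col)
inline PixelCoord from_linear(int loc, int num_cols)
{
    assert(num_cols > 0);
    return {loc / num_cols, loc % num_cols};
}
```

In an `800`-column image, the pixel at row `2`, column `3` lives at location `2 * 800 + 3 = 1603`, and `from_linear(1603, 800)` gives back row `2`, column `3`.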

How do we spawn the threads?

1. We need a total of `num_pixels = num_rows * num_cols` threads
2. If we launch the kernel as in Chapter 1, with `1 block` of `num_pixels` threads, the launch fails: a block can hold at most 1024 threads on current NVIDIA GPUs, and an `800 x 800` image needs 640,000
3. To tackle this we can make use of multiple `blocks`
4. We can launch `num_rows` `blocks`, each containing `num_cols` `threads`. This works here because `num_cols = 800` is below the 1024-thread limit; wider images need a different split
5. The syntax for that is `dilate<<<num_rows, num_cols>>>(...)`
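
The one-block-per-row layout stops working once an image is wider than 1024 pixels. A common alternative, sketched here as plain host-side arithmetic, is to cover the image with small square tiles of threads and round the block count up. This is not what the tutorial's kernel uses; `LaunchShape`, `ceil_div`, `make_launch_shape` and the default tile of `16` are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical container for a 2D launch configuration
struct LaunchShape
{
    int blocks_x;   // blocks along the columns
    int blocks_y;   // blocks along the rows
    int threads_x;  // threads per block along the columns
    int threads_y;  // threads per block along the rows
};

// Integer division rounded up, for positive operands
constexpr int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

// Cover a num_rows x num_cols image with tile x tile thread blocks
inline LaunchShape make_launch_shape(int num_rows, int num_cols, int tile = 16)
{
    assert(num_rows > 0 && num_cols > 0);
    assert(tile > 0 && tile * tile <= 1024);  // stay within the per-block thread limit
    return {ceil_div(num_cols, tile), ceil_div(num_rows, tile), tile, tile};
}
```

For the `800 x 800` image this gives `50 x 50` blocks of `16 x 16` threads. In CUDA the four values would go into `dim3 grid(blocks_x, blocks_y)` and `dim3 block(threads_x, threads_y)`, and the kernel would compute its row as `blockIdx.y * blockDim.y + threadIdx.y` and its column as `blockIdx.x * blockDim.x + threadIdx.x`. Because the block count is rounded up, such a kernel must return early for threads that fall outside the image.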

In [5]:
%%writefile dilate.cu

#include <iostream>
#include <opencv2/opencv.hpp>  // found via -I/usr/include/opencv4

// helper function to get row major index
__device__ int row_major_location(int row_id, int col_id, int num_cols)
{
  return row_id * num_cols + col_id;
}

// Kernel which runs on every thread
__global__ void dilate(uchar* in_buffer, uchar* out_buffer, int dilation_kernel_size, int num_rows, int num_cols)
{
    // Each thread knows its block ID and block id is the row id
    int row_id = blockIdx.x;
    // Each thread knows its ID and thread id is the col id
    int col_id = threadIdx.x;

    // Get pixel location ( assuming in_buffer is row major flattened image )
    int pixel_location = row_major_location(row_id, col_id, num_cols);

    uchar max_value = 0U; // initialize the max value to zero;

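    // Scan the window [row_id - k, row_id + k) x [col_id - k, col_id + k) with k = dilation_kernel_size,
    // i.e. 2k pixels along each axis
    for (int i = row_id - dilation_kernel_size; i < row_id + dilation_kernel_size; ++i)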
    {
      for (int j = col_id - dilation_kernel_size; j < col_id + dilation_kernel_size; ++j)
      {

        // check if location is valid i.e inside the image dimensions
        if (i >= 0 && j >= 0 && i < num_rows && j < num_cols)
        {
          uchar pixel_val = in_buffer[row_major_location(i, j, num_cols)];
          if (pixel_val > max_value)
          {
            max_value = pixel_val;
          }
        }
      }
    }

    // write the max value to the output buffer for the same pixel location
    out_buffer[pixel_location] = max_value;
}


int main(int argc,char **argv)
{
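    // The kernel scans offsets -10 .. +9 around each pixel, so the window covers 20x20 pixels
    constexpr int SQUARE_KERNEL_SIZE = 10;

    // Load the image as single-channel 8-bit grayscale
    auto input_img = cv::imread("chess.png", cv::IMREAD_GRAYSCALE);
    if (input_img.empty())
    {
        std::cerr << "Could not read chess.png\n";
        return 1;
    }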
    std::cout << "Size : " << input_img.size() << "\n";
    std::cout << "rows : " << input_img.rows << ", cols : " << input_img.cols << "\n";

    const auto num_rows = input_img.rows;
    const auto num_cols = input_img.cols;
    const auto num_pixels = num_rows * num_cols;
    const auto img_size_bytes = num_pixels * sizeof(uchar);

    // Declare output image on CPU
    cv::Mat output_img(num_rows, num_cols, CV_8UC1);

    // Declare pointers on the CPU which point to memory locations in GPU global memory
    uchar* d_input_buffer;
    uchar* d_output_buffer;

    // Allocate memory on the GPU
    cudaMalloc((void**) &d_input_buffer, img_size_bytes);
    cudaMalloc((void**) &d_output_buffer, img_size_bytes);

    // Copy input array from CPU to GPU
    cudaMemcpy(d_input_buffer, input_img.data, img_size_bytes, cudaMemcpyHostToDevice);

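    // Run the Kernel: one block per row, one thread per column (requires num_cols <= 1024)
    dilate<<<num_rows, num_cols>>>(d_input_buffer, d_output_buffer, SQUARE_KERNEL_SIZE, num_rows, num_cols);

    // Report launch failures such as an invalid block size
    const cudaError_t launch_status = cudaGetLastError();
    if (launch_status != cudaSuccess)
    {
        std::cerr << "Kernel launch failed : " << cudaGetErrorString(launch_status) << "\n";
        cudaFree(d_input_buffer);
        cudaFree(d_output_buffer);
        return 1;
    }

    // Copy output array from GPU to CPU (this cudaMemcpy waits for the kernel to finish)
    cudaMemcpy(output_img.data, d_output_buffer, img_size_bytes, cudaMemcpyDeviceToHost);

    // Free the memory on the GPU
    cudaFree(d_input_buffer);
    cudaFree(d_output_buffer);

    // Wait for any outstanding GPU work (already complete here because the copy above is blocking)
    cudaDeviceSynchronize();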


    // write the output image
    cv::imwrite("chess_dilate_gpu.png", output_img);
    return 0;
}

Overwriting dilate.cu


Now let's compile the code.

In [7]:
!rm -rf /usr/local/cuda
!ln -s /usr/local/cuda-12.2 /usr/local/cuda
!nvcc -g -G dilate.cu -o dilate -I/usr/include/opencv4 -lopencv_imgcodecs -lopencv_imgproc -lopencv_core

This will generate the `dilate` binary in the current directory. You can run this binary and see the output.

In [8]:
!./dilate

Size : [800 x 800]
rows : 800, cols : 800


After the code runs successfully with `SQUARE_KERNEL_SIZE = 10`, you should get the following image:
![chess_dilate_gpu.png](https://i.postimg.cc/MK1nRsZf/chess-dilate-gpu.png)
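
The last step in the plan, verifying the result, can go beyond looking at the picture. A minimal sketch, assuming the GPU output has been copied into a `std::vector`, is to recompute the dilation on the CPU with the same window the kernel in `dilate.cu` scans (rows `[r - k, r + k)` and columns `[c - k, c + k)`) and count differing pixels. `reference_dilate` and `count_mismatches` are hypothetical names, not part of the tutorial code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference using the same half-open window as the tutorial kernel:
// rows [r - k, r + k) and columns [c - k, c + k)
std::vector<std::uint8_t> reference_dilate(const std::vector<std::uint8_t>& in,
                                           int rows, int cols, int k)
{
    assert(in.size() == static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));
    std::vector<std::uint8_t> out(in.size(), 0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            std::uint8_t max_value = 0;
            for (int i = r - k; i < r + k; ++i) {
                for (int j = c - k; j < c + k; ++j) {
                    if (i >= 0 && j >= 0 && i < rows && j < cols) {
                        max_value = std::max(max_value, in[i * cols + j]);
                    }
                }
            }
            out[r * cols + c] = max_value;
        }
    }
    return out;
}

// Number of positions at which two equally sized buffers differ
std::size_t count_mismatches(const std::vector<std::uint8_t>& a,
                             const std::vector<std::uint8_t>& b)
{
    assert(a.size() == b.size());
    std::size_t mismatches = 0;
    for (std::size_t idx = 0; idx < a.size(); ++idx) {
        if (a[idx] != b[idx]) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

With the input pixels and the GPU result each copied into a vector, `count_mismatches(reference_dilate(input, num_rows, num_cols, 10), gpu_output)` should return `0` if the kernel behaved as intended.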
