# Parallel programming on the GPU

In this notebook, we'll look at some of the basics of running accelerated programs on GPUs. We'll be using Google Colab and its GPU.

Before we start, make sure you've selected a GPU as the hardware accelerator in Colab's runtime type settings.

Let's first look at some information about the hardware: 

In [None]:
!nvidia-smi

And some information about the software (the installed CUDA compiler):

In [None]:
!nvcc --version

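The source files are located on our Google Drive, so before we can compile anything we need to mount it. Note that the `nvcc` commands further down use bare file names, so they assume the working directory is the folder containing the `.cu` files.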

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## "Hello World" from the GPU

Let's start with the simplest possible GPU program. We'll refactor the source file `01-hello-gpu.cu` so that it runs on the GPU.

In [None]:
!nvcc -arch=sm_70 -o hello-gpu 01-hello-gpu.cu -run
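
For orientation, here is a rough sketch of what a GPU "Hello World" can look like. It is not the contents of `01-hello-gpu.cu`; the function names `helloCPU` and `helloGPU` are made up for illustration.

```cuda
#include <cstdio>

// Ordinary host function: runs on the CPU.
void helloCPU()
{
    printf("Hello from the CPU!\n");
}

// __global__ marks a kernel: a function that runs on the GPU
// and is launched from host code.
__global__ void helloGPU()
{
    printf("Hello from the GPU!\n");
}

int main()
{
    helloCPU();

    // Launch the kernel with 1 block containing 1 thread.
    helloGPU<<<1, 1>>>();

    // Kernel launches are asynchronous. Wait for the GPU to finish,
    // otherwise the program may exit before the kernel's output appears.
    cudaDeviceSynchronize();
    return 0;
}
```

Saved to a `.cu` file, a sketch like this should compile with the same `nvcc ... -run` pattern used in the cell above.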

## Parallel kernels

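Running a kernel in parallel on the GPU requires an execution configuration that specifies how many blocks to launch and how many threads each block contains, using the syntax `kernelName<<<NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK>>>()`. The kernel body is executed once by every thread, so the total number of threads is the product of the two values.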
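
As a minimal sketch of the launch syntax (the kernel name `firstParallel` and the 2 x 4 configuration are arbitrary choices for illustration, not taken from the course files), the following launches 2 blocks of 4 threads, so the message is printed 8 times:

```cuda
#include <cstdio>

// Every thread in the launch executes this body once.
__global__ void firstParallel()
{
    printf("This is running in parallel.\n");
}

int main()
{
    const int numberOfBlocks = 2;
    const int threadsPerBlock = 4;

    // 2 blocks * 4 threads per block = 8 threads in total.
    firstParallel<<<numberOfBlocks, threadsPerBlock>>>();

    // A bad launch configuration is reported by cudaGetLastError();
    // errors during execution surface at the next synchronization.
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t syncErr = cudaDeviceSynchronize();
    if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n",
                cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));
        return 1;
    }
    return 0;
}
```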

We'll refactor the source file `02-first-parallel.cu` to make it run in parallel.

In [None]:
!nvcc -arch=sm_70 -o first-parallel 02-first-parallel.cu -run

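Inside a kernel, CUDA exposes the index of the current thread within its block and the index of the current block within the grid, through the built-in variables `threadIdx.x` and `blockIdx.x`. We'll refactor the source file `02-thread-and-block-idx.cu` to show how that works.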

In [None]:
!nvcc -arch=sm_70 -o thread-and-block-idx 02-thread-and-block-idx.cu -run
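
To make the built-in index variables concrete, here is a small sketch (the kernel name `printIndices` and the 2 x 3 launch are assumptions for illustration) in which every thread reports where it sits in the grid:

```cuda
#include <cstdio>

__global__ void printIndices()
{
    // blockIdx.x : index of this block within the grid
    // gridDim.x  : number of blocks in the grid
    // threadIdx.x: index of this thread within its block
    // blockDim.x : number of threads per block
    printf("blockIdx.x=%u (gridDim.x=%u), threadIdx.x=%u (blockDim.x=%u)\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}

int main()
{
    // 2 blocks of 3 threads: block indices 0..1, thread indices 0..2.
    printIndices<<<2, 3>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Each of the 6 threads prints one line. The order in which blocks appear in the output is not guaranteed.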

## Accelerating FOR loops

Instead of running the iterations of a `for` loop in sequence, we can get the GPU to run each iteration in parallel, in its own thread.
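
The idea can be sketched as follows, assuming a loop of 10 iterations and hypothetical function names `loopCPU` and `loopGPU`. The loop disappears from the kernel; each thread computes its own global index and uses it in place of the loop counter:

```cuda
#include <cstdio>

// CPU version: a single thread runs all N iterations one after another.
void loopCPU(int N)
{
    for (int i = 0; i < N; ++i) {
        printf("CPU iteration %d\n", i);
    }
}

// GPU version: no loop. Each thread handles exactly one iteration.
__global__ void loopGPU()
{
    // Global index of this thread across all blocks.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    printf("GPU iteration %d\n", i);
}

int main()
{
    const int N = 10;
    loopCPU(N);

    // 2 blocks * 5 threads = 10 threads, one per iteration (indices 0..9).
    loopGPU<<<2, 5>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

This only works as written because the launch creates exactly as many threads as there are iterations.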

We'll refactor the code in `03-multi-threads-loop.cu` to see how to do that.

In [None]:
!nvcc -arch=sm_70 -o multi-threads-loop 03-multi-threads-loop.cu -run

## Grid size mismatch

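The grid often contains more threads than there are elements in the data set, because the number of threads per block rarely divides the data size evenly. In these cases, each thread needs to check that its global index is smaller than the number of elements before touching the data, so that the surplus threads do not access memory out of bounds.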
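
A minimal sketch of this bounds check, assuming 1000 elements and 256 threads per block (the names `initializeElementsTo` and `runInitialize` are made up for illustration, and unified memory via `cudaMallocManaged` is just one convenient way to allocate the array):

```cuda
#include <cstdio>

__global__ void initializeElementsTo(int value, int *a, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads whose index falls past the end of the array do nothing.
    if (i < N) {
        a[i] = value;
    }
}

// Host helper: fills an N-element array on the GPU and verifies it on the CPU.
bool runInitialize(int N, int value)
{
    int *a = nullptr;
    if (cudaMallocManaged(&a, N * sizeof(int)) != cudaSuccess) return false;

    const int threadsPerBlock = 256;
    // Round up so that at least N threads are launched.
    // For N = 1000 this gives 4 blocks, i.e. 1024 threads.
    const int numberOfBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;

    initializeElementsTo<<<numberOfBlocks, threadsPerBlock>>>(value, a, N);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < N; ++i) {
        if (a[i] != value) ok = false;
    }
    cudaFree(a);
    return ok;
}

int main()
{
    printf(runInitialize(1000, 7) ? "SUCCESS\n" : "FAILURE\n");
    return 0;
}
```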

In the script `04-mismatched-config-loop.cu`, we'll look at how this is solved. 

In [None]:
!nvcc -arch=sm_70 -o mismatched-config-loop 04-mismatched-config-loop.cu -run

## Data sets larger than the grid

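Sometimes the grid contains fewer threads than there are elements in the data set. In these cases, each thread has to process more than one element: the kernel loops over the data, starting at the thread's global index and advancing by the total number of threads in the grid (the grid stride) on each pass. This pattern is known as a grid-stride loop.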
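
A grid-stride loop can be sketched like this, assuming 10000 elements processed by a deliberately small grid of 32 x 256 = 8192 threads (the names `doubleElements` and `runDouble` are hypothetical, as is the use of `cudaMallocManaged`):

```cuda
#include <cstdio>

__global__ void doubleElements(int *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // first element for this thread
    int stride = gridDim.x * blockDim.x;              // total number of threads in the grid

    // Each thread handles idx, idx + stride, idx + 2*stride, ... while in bounds.
    for (int i = idx; i < N; i += stride) {
        a[i] *= 2;
    }
}

// Host helper: doubles a[i] = i on the GPU and verifies the result on the CPU.
bool runDouble(int N)
{
    int *a = nullptr;
    if (cudaMallocManaged(&a, N * sizeof(int)) != cudaSuccess) return false;
    for (int i = 0; i < N; ++i) a[i] = i;

    // 32 * 256 = 8192 threads, fewer than N = 10000 elements.
    doubleElements<<<32, 256>>>(a, N);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < N; ++i) {
        if (a[i] != 2 * i) ok = false;
    }
    cudaFree(a);
    return ok;
}

int main()
{
    printf(runDouble(10000) ? "All elements were doubled\n" : "FAILURE\n");
    return 0;
}
```

The loop condition `i < N` also covers the case where the grid is larger than the data set, so the same kernel works for either kind of mismatch.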

In the script `05-grid-stride-double.cu`, we'll look at how this is done. 

In [None]:
!nvcc -arch=sm_70 -o grid-stride-double 05-grid-stride-double.cu -run