
# 1. CUDA Programming Tutorial and Hands-on


Enter your name and student ID.

 * Name:
 * Student ID:



# 2. CUDA
* [CUDA](https://docs.nvidia.com/cuda/index.html) is an extension to C++ specific to NVIDIA GPUs
* It is the most basic, native programming model for NVIDIA GPUs

# 3. Compilers
* We use [NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) (`nvcc`) for CUDA compilers
* [LLVM ver. 18.1.8](https://llvm.org/) (`clang` and `clang++`) and NVIDIA's C/C++ compilers (`nvc` and `nvc++`), which we used for OpenMP, also support CUDA, but they fail to compile some of our code, so we stick to the more traditional `nvcc`

## 3-1. Set up NVIDIA CUDA and HPC SDK
Execute this before you use the NVIDIA CUDA Toolkit (`nvcc`) or the NVIDIA HPC SDK (`nvc` and `nvc++`)

In [None]:
export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin:$PATH
export PATH=/usr/local/cuda/bin:$PATH

* Check if it works
  * make sure the full path of nvcc is shown as `/usr/local/...`, not `/opt/nvidia/...`
* We do not recommend nvc/nvc++ for this exercise, but you might give them a try if you like

In [None]:
which nvcc
which nvc
which nvc++
nvcc --version
nvc --version

## 3-2. LLVM
* We do not recommend LLVM for this exercise, but you might give it a try if you like
* Execute this before you use LLVM

In [None]:
export PATH=/home/share/llvm/bin:$PATH
export LD_LIBRARY_PATH=/home/share/llvm/lib:/home/share/llvm/lib/x86_64-unknown-linux-gnu:$LD_LIBRARY_PATH

* Check if it works (check if full paths of clang/clang++ are shown)

In [None]:
which clang
which clang++
clang --version

# 4. Check host and GPU
* First check if you are using the right host, tauleg000, <font color="red">not taulec</font>

In [None]:
hostname
hostname | grep tauleg || echo "Oh, you are not on the right host, access https://tauleg000.zapto.org/ instead"

* Check if GPU is alive by nvidia-smi
* Do `nvidia-smi --help` or see manual (`man nvidia-smi` on terminal) for more info

In [None]:
nvidia-smi


# 5. Compiling and running CUDA programs
## 5-1. With nvcc (NVIDIA CUDA Toolkit compiler)
* Give a source file `.cu` extension or give `--x cu` option to indicate it is a CUDA source file

In [None]:
%%writefile cuda_hello.cu
#include <assert.h>
#include <stdio.h>

__global__ void cuda_thread_fun(int n) {
  int i        = blockDim.x * blockIdx.x + threadIdx.x;
  int nthreads = gridDim.x * blockDim.x;
  if (i < n) {
    printf("hello I am CUDA thread %d out of %d\n", i, nthreads);
  }
}

int main(int argc, char ** argv) {
  int n               = (argc > 1 ? atoi(argv[1]) : 100);
  int thread_block_sz = (argc > 2 ? atoi(argv[2]) : 64);
  int n_thread_blocks = (n + thread_block_sz - 1) / thread_block_sz;
  printf("%d threads/block * %d blocks\n", thread_block_sz, n_thread_blocks);

  // launch a kernel
  cuda_thread_fun<<<n_thread_blocks,thread_block_sz>>>(n);
  // wait for them to complete
  cudaDeviceSynchronize();
  return 0;
}

In [None]:
nvcc -o cuda_hello cuda_hello.cu

In [None]:
./cuda_hello

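* You should see 100 lines of "hello I am CUDA thread ??? out of 128": with the defaults `n = 100` and 64 threads per block, the program launches (100 + 63) / 64 = 2 blocks, i.e., 128 threads, and the 28 threads with `i >= n` print nothing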

* Alternatively, you can have a source file with an ordinary C++ extension `.cc` (or `.cpp`) and give `-x cu`.
* It is useful when you want to have a single source file for OpenMP and CUDA programs

In [None]:
ln -sf cuda_hello.cu cuda_hello.cc
nvcc -o cuda_hello -x cu cuda_hello.cc

In [None]:
./cuda_hello

## 5-2. With nvc++ (NVIDIA HPC SDK C++ compiler)
* Just to demonstrate `nvc++` supports CUDA, too

In [None]:
nvc++ -Wall -o cuda_hello cuda_hello.cu

In [None]:
nvc++ -Wall -o cuda_hello -x cu cuda_hello.cc

## 5-3. With clang++ (LLVM)
* Just to demonstrate `clang++` supports CUDA, too

In [None]:
clang++ -Wall -o cuda_hello cuda_hello.cu -L/usr/local/cuda/lib64 -lcudart

In [None]:
ln -sf cuda_hello.cu cuda_hello.cc
clang++ -Wall -o cuda_hello -x cu cuda_hello.cc -L/usr/local/cuda/lib64 -lcudart

# 6. CUDA kernel
* The most basic concept of CUDA programming is a _CUDA kernel_
* Syntactically, a CUDA kernel is a `void` function with `__global__` keyword attached to it
```
__global__ void cuda_thread_fun(int n) { ... }
```
* A CUDA kernel describes what a _single_ CUDA thread does
* You launch a number of CUDA threads all executing the same kernel by 
```
kernel_func<<<num_of_blocks,num_of_threads_per_block>>>(...);
```
* We have already seen this in the above code
* See [2. Programming model](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model) section for reference

## 6-1. <font color="red">You'd better always check errors</font>
* Checking errors after every operation that could go wrong is good practice in any kind of programming, not only in CUDA
* Like many C APIs (and unlike Python, which raises an exception), CUDA API calls and kernel launches do not stop the program when something goes wrong; unless you inspect the error code, the failure passes silently
* You could save a huge amount of time by checking errors 
  * every time you launch a CUDA kernel and
  * every time you call a CUDA API

* Here is the same piece of code with error checking

In [None]:
%%writefile cuda_hello_chk.cu
#include <assert.h>
#include <stdio.h>

/*
  you'd better spend time on making sure you always check errors ...
*/

void check_api_error_(cudaError_t e,
                      const char * msg, const char * file, int line) {
  if (e) {
    fprintf(stderr, "%s:%d:error: %s %s\n",
            file, line, msg, cudaGetErrorString(e));
    exit(1);
  }
}

#define check_api_error(e) check_api_error_(e, #e, __FILE__, __LINE__)

void check_launch_error_(const char * msg, const char * file, int line) {
  cudaError_t e = cudaGetLastError();
  if (e) {
    fprintf(stderr, "%s:%d:error: %s %s\n",
            file, line, msg, cudaGetErrorString(e));
    exit(1);
  }
}

#define check_launch_error(exp) do { exp; check_launch_error_(#exp, __FILE__, __LINE__); } while (0)


__global__ void cuda_thread_fun(int n) {
  int i        = blockDim.x * blockIdx.x + threadIdx.x;
  int nthreads = gridDim.x * blockDim.x;
  if (i < n) {
    printf("hello I am CUDA thread %d out of %d\n", i, nthreads);
  }
}

int main(int argc, char ** argv) {
  int n               = (argc > 1 ? atoi(argv[1]) : 100);
  int thread_block_sz = (argc > 2 ? atoi(argv[2]) : 64);
  int n_thread_blocks = (n + thread_block_sz - 1) / thread_block_sz;
  printf("%d threads/block * %d blocks\n", thread_block_sz, n_thread_blocks);

  check_launch_error((cuda_thread_fun<<<n_thread_blocks,thread_block_sz>>>(n)));
  check_api_error(cudaDeviceSynchronize());
  return 0;
}

In [None]:
nvcc -o cuda_hello_chk cuda_hello_chk.cu
# nvc++ -Wall -o cuda_hello_chk cuda_hello_chk.cu
# clang++ -Wall -Wno-unknown-cuda-version -o cuda_hello_chk cuda_hello_chk.cu -L/usr/local/cuda/lib64 -lcudart

In [None]:
./cuda_hello_chk

* I factored out the error-checking code into a header file `"cuda_util.h"` and included it in the directory (check it from the left menu)
* The following code is a more concise version using the header file

In [None]:
%%writefile cuda_hello_hdr_chk.cu
#include <assert.h>
#include <stdio.h>

#include "cuda_util.h"

__global__ void cuda_thread_fun(int n) {
  int i        = blockDim.x * blockIdx.x + threadIdx.x;
  int nthreads = gridDim.x * blockDim.x;
  if (i < n) {
    printf("hello I am CUDA thread %d out of %d\n", i, nthreads);
  }
}

int main(int argc, char ** argv) {
  int n               = (argc > 1 ? atoi(argv[1]) : 100);
  int thread_block_sz = (argc > 2 ? atoi(argv[2]) : 64);
  int n_thread_blocks = (n + thread_block_sz - 1) / thread_block_sz;
  printf("%d threads/block * %d blocks\n", thread_block_sz, n_thread_blocks);

  check_launch_error((cuda_thread_fun<<<n_thread_blocks,thread_block_sz>>>(n)));
  check_api_error(cudaDeviceSynchronize());
  return 0;
}

In [None]:
nvcc -o cuda_hello_hdr_chk cuda_hello_hdr_chk.cu
# nvc++ -Wall -o cuda_hello_hdr_chk cuda_hello_hdr_chk.cu
# clang++ -Wall -Wno-unknown-cuda-version -o cuda_hello_hdr_chk cuda_hello_hdr_chk.cu -L/usr/local/cuda/lib64 -lcudart

In [None]:
./cuda_hello_hdr_chk
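
To see what "silently" means in practice, here is a minimal sketch that launches a kernel with a deliberately invalid configuration. The names `noop` and `launch_too_big` are made up for this illustration, and the sketch assumes that 4096 threads per block is more than the GPU allows. The launch statement itself reports nothing; the failure only becomes visible because we ask with `cudaGetLastError`.

```cuda
#include <assert.h>
#include <stdio.h>

__global__ void noop() {}

// Launch with a deliberately invalid configuration and return the launch status.
// Assumption: 4096 threads per block exceeds the per-block limit of the GPU.
cudaError_t launch_too_big() {
  noop<<<1, 4096>>>();
  // the launch statement above reports nothing by itself; we have to ask
  return cudaGetLastError();
}

int main() {
  cudaError_t e = launch_too_big();
  printf("launch status: %s\n", cudaGetErrorString(e));
  assert(e != cudaSuccess);
  return 0;
}
```

If you delete the call to `cudaGetLastError`, the program should finish without any hint that the kernel never ran, which is exactly the situation the checking macros are meant to prevent.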

# 7. The number of CUDA threads launched
* You specify the number of threads launched by the two parameters in `<<<...,...>>>`, like
```
kernel_func<<<num_of_blocks,num_of_threads_per_block>>>(...);
```

* It will create `num_of_blocks * num_of_threads_per_block` threads in total.
* More precisely, it creates `num_of_blocks` _thread blocks_, each of which has `num_of_threads_per_block` threads.
* It is natural to wonder why you need to specify two parameters instead of just one parameter (the total number of threads) and how to choose `num_of_threads_per_block`.
* For now, just know that a thread block is the unit of scheduling
  * A GPU device fetches a single block at a time and dispatches it to a particular streaming multiprocessor (SM)
  * Remember that a single SM is like a CPU core; a single GPU device has a number of SMs just like a single CPU has a number of cores.
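
If you are curious how many SMs your GPU has, the runtime API can tell you. The following is a small sketch using `cudaGetDeviceProperties`; the helper name `sm_count` is made up for this illustration, and device 0 is assumed to be the GPU you are using.

```cuda
#include <assert.h>
#include <stdio.h>

// Number of SMs of device `dev`, or -1 if the query fails.
// (sm_count is a hypothetical helper, not part of the CUDA API.)
int sm_count(int dev) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;
  return prop.multiProcessorCount;
}

int main() {
  cudaDeviceProp prop;
  cudaError_t e = cudaGetDeviceProperties(&prop, 0);  // assumption: device 0
  if (e != cudaSuccess) {
    fprintf(stderr, "error: %s\n", cudaGetErrorString(e));
    return 1;
  }
  printf("device              : %s\n", prop.name);
  printf("SMs                 : %d\n", prop.multiProcessorCount);
  printf("max threads / block : %d\n", prop.maxThreadsPerBlock);
  printf("warp size           : %d\n", prop.warpSize);
  assert(sm_count(0) == prop.multiProcessorCount);
  return 0;
}
```

The "max threads / block" value is the upper limit for the second parameter of `<<<...,...>>>`.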

# <font color="green"> Problem 1 :  Change the number of threads per block</font>
Change the arguments of the following command line in various ways and see what happens

In [None]:
./cuda_hello_hdr_chk 10 3

_<font color="green">Answer for trivial work omitted</font>_

# 8. Thread ID
## 8-1. One-dimensional ID
* Just like OpenMP, CUDA provides a means for a thread to know its ID as well as the total number of threads launched together
* They are obtained from builtin variables
* Let's say you invoked a kernel with
```
f<<<12,34>>>(...);
```
you create 12 thread blocks having 34 threads each (408 threads in total).
 * `gridDim.x` gives the number of thread blocks ($= 12$)
 * `blockDim.x` gives the number of threads in a thread block ($= 34$)

* note: "grid" is the CUDA terminology to mean all the launched thread blocks (a CUDA thread $\in$ thread block $\in$ the entire grid)

 * `blockIdx.x` gives the block ID within the grid ($\in [0,12)$) 
 * `threadIdx.x` gives the thread ID within a thread block ($\in [0,34)$)
* If you want to get a single thread ID from 0 to 407 and the total number of threads, you get them by
```
int idx      = blockIdx.x * blockDim.x + threadIdx.x;
int nthreads = gridDim.x * blockDim.x;
```
* You have seen them in the above example.

* See [2.1 Kernels](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels) for reference
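
As a quick check of the formula, the sketch below launches `<<<12,34>>>` and lets only the first and the last thread of the grid print their IDs. The helper `global_id` and the kernel `show_ends` are names made up for this illustration.

```cuda
#include <assert.h>
#include <stdio.h>

// Global 1D thread ID from block index, block size and thread index.
// (global_id is a hypothetical helper; it is callable on host and device.)
__host__ __device__ int global_id(int block_idx, int block_dim, int thread_idx) {
  return block_idx * block_dim + thread_idx;
}

__global__ void show_ends() {
  int b        = blockIdx.x;
  int t        = threadIdx.x;
  int idx      = global_id(b, blockDim.x, t);
  int nthreads = gridDim.x * blockDim.x;
  // only the first and the last thread of the grid print
  if (idx == 0 || idx == nthreads - 1) {
    printf("block %d, thread %d -> global ID %d of %d\n", b, t, idx, nthreads);
  }
}

int main() {
  // the last thread is thread 33 of block 11
  assert(global_id(11, 34, 33) == 12 * 34 - 1);
  show_ends<<<12, 34>>>();
  cudaError_t e = cudaDeviceSynchronize();
  if (e != cudaSuccess) {
    fprintf(stderr, "error: %s\n", cudaGetErrorString(e));
    return 1;
  }
  return 0;
}
```

The order of the two printed lines is not guaranteed, since the two threads belong to different blocks.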

## 8-2. Two- or three-dimensional ID
* Each of the above four variables can actually have up to three elements, allowing you to view blocks and threads within a block arranged in a one-, two- or three-dimensional space.
* You specify them accordingly when you call a kernel, for which you use a variable of type `dim3` instead of an integer, to specify up to three numbers

* See [2.2 Thread Hierarchy](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy) for reference


In [None]:
%%writefile cuda_hello_2d.cu
#include <assert.h>
#include <stdio.h>

#include "cuda_util.h"

__global__ void cuda_thread_fun(int n) {
  int x          = blockDim.x * blockIdx.x + threadIdx.x;
  int y          = blockDim.y * blockIdx.y + threadIdx.y;
  int nthreads_x = gridDim.x * blockDim.x;
  int nthreads_y = gridDim.y * blockDim.y;
  int g          = x + nthreads_x * y;  // row-major linear ID: a row has nthreads_x threads
  if (g < n) {
    printf("hello I am CUDA thread (%d,%d) of (%d,%d)\n",
           x, y, nthreads_x, nthreads_y);
  }
}

int isqrt(int n) {
  int i;
  for (i = 0; i * i < n; i++) ;
  return i;
}

int main(int argc, char ** argv) {
  int n                 = (argc > 1 ? atoi(argv[1]) : 40);
  int nx                = isqrt(n);
  int ny                = (n + nx - 1) / nx;
  int thread_block_sz_x = (argc > 2 ? atoi(argv[2]) : 2);
  int thread_block_sz_y = (argc > 3 ? atoi(argv[3]) : 3);
  int n_thread_blocks_x = (nx + thread_block_sz_x - 1) / thread_block_sz_x;
  int n_thread_blocks_y = (ny + thread_block_sz_y - 1) / thread_block_sz_y;
  printf("(%d * %d) threads/block * (%d * %d) blocks\n",
         thread_block_sz_x, thread_block_sz_y,
         n_thread_blocks_x, n_thread_blocks_y);

  dim3 nb(n_thread_blocks_x, n_thread_blocks_y);
  dim3 tpb(thread_block_sz_x, thread_block_sz_y);
  check_launch_error((cuda_thread_fun<<<nb,tpb>>>(n)));
  check_api_error(cudaDeviceSynchronize());
  return 0;
}

In [None]:
nvcc -o cuda_hello_2d cuda_hello_2d.cu
# nvc++ -Wall -o cuda_hello_2d cuda_hello_2d.cu
# clang++ -Wall -Wno-unknown-cuda-version -o cuda_hello_2d cuda_hello_2d.cu -L/usr/local/cuda/lib64 -lcudart

In [None]:
./cuda_hello_2d

# <font color="green"> Problem 2 :  Specify 2D thread blocks and grids</font>
* Change the arguments of the following command line in various ways and see what happens

In [None]:
./cuda_hello_2d 40 2 3

_<font color="green">Answer for trivial work omitted</font>_

# 9. Passing data between host (CPU) and device (GPU)
* GPU is a device separate from a host CPU
* As such, CPU and GPU do not share memory by default; you need to explicitly pass data by calling APIs (unified memory, covered below, relaxes this, but explicit copies remain common in practice)
* The simplest way to pass data from host to device is through arguments to a kernel function, but
  * it cannot be used for device -&gt; host (recall that kernel functions are always void)
  * it is limited to call-by-value; a pointer argument is copied, but the host data it points to is not
* For anything other than call-by-value arguments, you should use `cudaMalloc` and `cudaMemcpy`

* See [3.2.2. Device Memory](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory) for reference

## 9-1. cudaMalloc
```
void * p;
check_api_error(cudaMalloc(&p, size));
```

* allocates `size` bytes of memory on device and
* returns an address valid on the device (not valid on the host) to variable `p`

* remember that `cudaMalloc` is called on the host; in this exercise, all device memory is allocated from the host before launching kernels (device code can call `malloc`, but only from a limited device-side heap, and we do not use it here)

## 9-2. cudaMemcpy
* host -&gt; device
```
check_api_error(cudaMemcpy(p_dev, p_host, size, cudaMemcpyHostToDevice));
```

* device -&gt; host
```
check_api_error(cudaMemcpy(p_host, p_dev, size, cudaMemcpyDeviceToHost));
```

* the first argument is always the destination
* p_dev should be an address on device (i.e., that has been allocated by `cudaMalloc`)

## 9-3. cudaFree
```
check_api_error(cudaFree(p_dev));
```

* frees memory allocated by cudaMalloc

* The following code demonstrates how to get some results back to host using `cudaMalloc` and `cudaMemcpy`
* Results show when each CUDA thread started executing

In [None]:
%%writefile cuda_memcpy.cu
#include <assert.h>
#include <stdio.h>

#include "cuda_util.h"

__global__ void cuda_thread_fun(long long * p, int n) {
  int i        = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) {  // surplus threads of the last block must not write past the array
    p[i] = clock64();
  }
}

int main(int argc, char ** argv) {
  int n               = (argc > 1 ? atoi(argv[1]) : 10);
  int thread_block_sz = (argc > 2 ? atoi(argv[2]) : 3);
  int n_thread_blocks = (n + thread_block_sz - 1) / thread_block_sz;

  long long * c = (long long *)malloc(sizeof(long long) * n);
  long long * c_dev;
  check_api_error(cudaMalloc(&c_dev, sizeof(long long) * n));
  check_launch_error((cuda_thread_fun<<<n_thread_blocks,thread_block_sz>>>(c_dev, n)));
  check_api_error(cudaDeviceSynchronize());
  check_api_error(cudaMemcpy(c, c_dev, sizeof(long long) * n, cudaMemcpyDeviceToHost));
  check_api_error(cudaFree(c_dev));
  for (int i = 0; i < n; i++) {
    printf("c[%d] = %lld\n", i, c[i]);
  }
  free(c);
  return 0;
}

In [None]:
nvcc -o cuda_memcpy cuda_memcpy.cu
# nvc++ -Wall -o cuda_memcpy cuda_memcpy.cu
# clang++ -Wall -Wno-unknown-cuda-version -o cuda_memcpy cuda_memcpy.cu -L/usr/local/cuda/lib64 -lcudart

In [None]:
./cuda_memcpy
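
The program above copies data only in the device -> host direction. For completeness, here is a sketch of a full round trip (host -> device, kernel, device -> host) that doubles every element of an array on the GPU. The macro `CHECK`, the kernel `twice` and the function `round_trip` are names made up for this illustration; `CHECK` merely stands in for `check_api_error` so that the sketch does not depend on `cuda_util.h`.

```cuda
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

// hypothetical stand-in for check_api_error in cuda_util.h
#define CHECK(call)                                             \
  do {                                                          \
    cudaError_t e_ = (call);                                    \
    if (e_ != cudaSuccess) {                                    \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
              cudaGetErrorString(e_));                          \
      exit(1);                                                  \
    }                                                           \
  } while (0)

__global__ void twice(double * a, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) a[i] = 2.0 * a[i];  // surplus threads must not touch a[]
}

// Double n values on the GPU; returns 1 if every element came back doubled.
int round_trip(int n) {
  size_t sz = sizeof(double) * n;
  double * a = (double *)malloc(sz);
  for (int i = 0; i < n; i++) a[i] = i;
  double * a_dev;
  CHECK(cudaMalloc(&a_dev, sz));
  CHECK(cudaMemcpy(a_dev, a, sz, cudaMemcpyHostToDevice));  // host -> device
  int bs = 64;
  int nb = (n + bs - 1) / bs;
  twice<<<nb, bs>>>(a_dev, n);
  CHECK(cudaGetLastError());
  // device -> host; this copy starts only after the kernel has finished
  CHECK(cudaMemcpy(a, a_dev, sz, cudaMemcpyDeviceToHost));
  CHECK(cudaFree(a_dev));
  int ok = 1;
  for (int i = 0; i < n; i++) {
    if (a[i] != 2.0 * i) ok = 0;
  }
  free(a);
  return ok;
}

int main() {
  int ok = round_trip(1000);
  printf("round trip %s\n", ok ? "OK" : "NG");
  assert(ok);
  return 0;
}
```

Note that the same pointer `a_dev` is never dereferenced on the host; it is only handed to `cudaMemcpy`, the kernel and `cudaFree`.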

# <font color="green"> Problem 3 :  Get data back from GPU to CPU</font>
* Change the arguments of the following command line in various ways and observe clock values printed

In [None]:
./cuda_memcpy 10 3

_<font color="green">Answer for trivial work omitted</font>_

* You observe some threads record exactly the same clock value
* What do you deduce from that?



_<font color="green">Example answer</font>_

* Threads having 32 consecutive thread IDs within a block report exactly the same clock value
* It stems from the fact that those threads (together called a _warp_) share an instruction pointer (execute the same instruction at the same time)

# 10. Unified Memory
* with unified memory you do not have to call `cudaMemcpy` to move data between host and GPU
* all you need to master is `cudaMallocManaged`, which you call in place of `cudaMalloc`
* you get a pointer that is valid both on CPU and GPU
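
The sketch below illustrates the idea in both directions: the host initializes an array, a kernel updates it, and the host reads the result, all through one pointer and without any `cudaMemcpy`. The names `CHECK`, `add_one` and `managed_demo` are made up for this illustration. One detail worth noticing is that the host still has to call `cudaDeviceSynchronize` before it reads the array, because a kernel launch returns before the kernel finishes.

```cuda
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

// hypothetical stand-in for check_api_error in cuda_util.h
#define CHECK(call)                                             \
  do {                                                          \
    cudaError_t e_ = (call);                                    \
    if (e_ != cudaSuccess) {                                    \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
              cudaGetErrorString(e_));                          \
      exit(1);                                                  \
    }                                                           \
  } while (0)

__global__ void add_one(int * a, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) a[i] += 1;
}

// Returns 1 if the host sees a[i] == i + 1 for all i after the kernel.
int managed_demo(int n) {
  int * a;
  CHECK(cudaMallocManaged(&a, sizeof(int) * n));
  for (int i = 0; i < n; i++) a[i] = i;  // written by the host
  int bs = 64;
  int nb = (n + bs - 1) / bs;
  add_one<<<nb, bs>>>(a, n);              // updated by the GPU
  CHECK(cudaGetLastError());
  CHECK(cudaDeviceSynchronize());         // wait before the host touches a[]
  int ok = 1;
  for (int i = 0; i < n; i++) {
    if (a[i] != i + 1) ok = 0;            // read by the host
  }
  CHECK(cudaFree(a));                     // managed memory is also freed by cudaFree
  return ok;
}

int main() {
  int ok = managed_demo(1000);
  printf("managed memory %s\n", ok ? "OK" : "NG");
  assert(ok);
  return 0;
}
```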

# <font color="green"> Problem 4 :  Use Unified Memory</font>
* Change the following program so that it uses `cudaMallocManaged` instead of `cudaMalloc`.  Make appropriate changes to other parts (e.g., remove unnecessary `cudaMemcpy`) so that it behaves similarly to the original one

In [None]:
%%writefile cuda_malloc_managed.cu
#include <assert.h>
#include <stdio.h>

#include "cuda_util.h"

__global__ void cuda_thread_fun(long long * p, int n) {
  int i        = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) {  // surplus threads of the last block must not write past the array
    p[i] = i;
  }
}

int main(int argc, char ** argv) {
  int n               = (argc > 1 ? atoi(argv[1]) : 10);
  int thread_block_sz = (argc > 2 ? atoi(argv[2]) : 3);
  int n_thread_blocks = (n + thread_block_sz - 1) / thread_block_sz;

  long long * c = (long long *)malloc(sizeof(long long) * n);
  long long * c_dev;
  check_api_error(cudaMalloc(&c_dev, sizeof(long long) * n));
  check_launch_error((cuda_thread_fun<<<n_thread_blocks,thread_block_sz>>>(c_dev, n)));
  check_api_error(cudaDeviceSynchronize());

  check_api_error(cudaMemcpy(c, c_dev, sizeof(long long) * n, cudaMemcpyDeviceToHost));
  check_api_error(cudaFree(c_dev));
  
  for (int i = 0; i < n; i++) {
    printf("c[%d] = %lld\n", i, c[i]);
  }

  free(c);
  return 0;
}

In [None]:
nvcc -o cuda_malloc_managed cuda_malloc_managed.cu
# nvc++ -Wall -o cuda_malloc_managed cuda_malloc_managed.cu
# clang++ -Wall -Wno-unknown-cuda-version -o cuda_malloc_managed cuda_malloc_managed.cu -L/usr/local/cuda/lib64 -lcudart

In [None]:
./cuda_malloc_managed 10 3

In [None]:
%%writefile cuda_malloc_managed.cu
#include <assert.h>
#include <stdio.h>

#include "cuda_util.h"

__global__ void cuda_thread_fun(long long * p, int n) {
  int i        = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) {  // surplus threads of the last block must not write past the array
    p[i] = i;
  }
}

int main(int argc, char ** argv) {
  int n               = (argc > 1 ? atoi(argv[1]) : 10);
  int thread_block_sz = (argc > 2 ? atoi(argv[2]) : 3);
  int n_thread_blocks = (n + thread_block_sz - 1) / thread_block_sz;

  long long * c;
  check_api_error(cudaMallocManaged(&c, sizeof(long long) * n));
  check_launch_error((cuda_thread_fun<<<n_thread_blocks,thread_block_sz>>>(c, n)));
  check_api_error(cudaDeviceSynchronize());

  
  for (int i = 0; i < n; i++) {
    printf("c[%d] = %lld\n", i, c[i]);
  }

  check_api_error(cudaFree(c));
  return 0;
}

# 11. CUDA device memory model
* memory blocks allocated by `cudaMalloc` are visible to (shared by) all threads and called _global memory_
* they persist on device until you release them by `cudaFree` (or the process finishes), so they can be used not only to pass values between device and host, but also to pass values between different kernel calls (without moving values back and forth between host and device each time you call a kernel)

* See [Memory Hierarchy](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy) for reference
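
As an illustration of passing values between kernel calls, the following sketch runs two kernels back to back on the same device array and copies the result to the host only once at the end. The names `CHECK`, `fill`, `square` and `two_kernels` are made up for this illustration. It relies on the fact that kernels launched one after another in this way (on the default stream) execute in launch order.

```cuda
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

// hypothetical stand-in for check_api_error in cuda_util.h
#define CHECK(call)                                             \
  do {                                                          \
    cudaError_t e_ = (call);                                    \
    if (e_ != cudaSuccess) {                                    \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
              cudaGetErrorString(e_));                          \
      exit(1);                                                  \
    }                                                           \
  } while (0)

__global__ void fill(int * a, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) a[i] = i;
}

__global__ void square(int * a, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) a[i] = a[i] * a[i];  // reads what the previous kernel wrote
}

// Returns 1 if the host finally sees a[i] == i * i for all i.
int two_kernels(int n) {
  int * a_dev;
  CHECK(cudaMalloc(&a_dev, sizeof(int) * n));
  int bs = 64;
  int nb = (n + bs - 1) / bs;
  fill<<<nb, bs>>>(a_dev, n);    // first kernel leaves its result in global memory
  CHECK(cudaGetLastError());
  square<<<nb, bs>>>(a_dev, n);  // second kernel picks it up; no copy in between
  CHECK(cudaGetLastError());
  int * a = (int *)malloc(sizeof(int) * n);
  CHECK(cudaMemcpy(a, a_dev, sizeof(int) * n, cudaMemcpyDeviceToHost));
  CHECK(cudaFree(a_dev));
  int ok = 1;
  for (int i = 0; i < n; i++) {
    if (a[i] != i * i) ok = 0;
  }
  free(a);
  return ok;
}

int main() {
  int ok = two_kernels(1000);
  printf("two kernels %s\n", ok ? "OK" : "NG");
  assert(ok);
  return 0;
}
```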

# 12. Race condition and atomic operation 
* As threads launched in a single kernel call run concurrently, they are subject to the same race condition as OpenMP threads
* That is, if two threads access the same variable (or the same array element) and at least one of them is a write, there is a race and the program almost certainly has a bug

* In the following program, each thread increments a variable by one; nevertheless, it does not print the number of threads launched, and the result may differ from run to run.

In [None]:
%%writefile cuda_race.cu
#include <assert.h>
#include <stdio.h>

#include "cuda_util.h"

__global__ void cuda_thread_fun(unsigned long long * p, int n) {
  int i        = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) {
    *p = *p + 1;
  }
}

int main(int argc, char ** argv) {
  int n               = (argc > 1 ? atoi(argv[1]) : 1000);
  int thread_block_sz = (argc > 2 ? atoi(argv[2]) : 64);
  int n_thread_blocks = (n + thread_block_sz - 1) / thread_block_sz;

  unsigned long long c;
  unsigned long long * c_dev;
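  check_api_error(cudaMalloc(&c_dev, sizeof(unsigned long long)));
  // cudaMalloc does not clear the memory; start the counter from zero
  check_api_error(cudaMemset(c_dev, 0, sizeof(unsigned long long)));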
  check_launch_error((cuda_thread_fun<<<n_thread_blocks,thread_block_sz>>>(c_dev, n)));
  check_api_error(cudaDeviceSynchronize());
  check_api_error(cudaMemcpy(&c, c_dev, sizeof(unsigned long long), cudaMemcpyDeviceToHost));
  check_api_error(cudaFree(c_dev));
  printf("c = %llu\n", c);
  return 0;
}

In [None]:
nvcc -o cuda_race cuda_race.cu
# nvc++ -Wall -o cuda_race cuda_race.cu
# clang++ -Wall -Wno-unknown-cuda-version -o cuda_race cuda_race.cu -L/usr/local/cuda/lib64 -lcudart

In [None]:
./cuda_race

# <font color="green"> Problem 5 :  Observe race condition</font>
Execute the above program many times and observe the results; try changing parameters.

In [None]:
./cuda_race 1000 64

_<font color="green">Answer for trivial work omitted</font>_

* OpenMP had three basic tools --- critical, atomic and reduction --- to resolve race conditions depending on the situation.
* Roughly, CUDA only has an analogue to atomic and does not have critical or reduction.

## 12-1. Atomic add
* CUDA has
```
atomicAdd(T* p, T x);
```
function for various types of T.  
It performs `*p = *p + x` _atomically_, meaning that it is guaranteed that `*p` is not updated between the point `*p` is read and the point `*p` is written.

* See [atomicAdd](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd) for reference
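
`atomicAdd` also returns the value `*p` had just before the addition, which is handy beyond plain counting. The sketch below uses the returned value as a "ticket": every thread obtains a distinct slot of an output array, no matter in which order threads run. The names `CHECK`, `take_ticket` and `tickets_unique` are made up for this illustration.

```cuda
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

// hypothetical stand-in for check_api_error in cuda_util.h
#define CHECK(call)                                             \
  do {                                                          \
    cudaError_t e_ = (call);                                    \
    if (e_ != cudaSuccess) {                                    \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
              cudaGetErrorString(e_));                          \
      exit(1);                                                  \
    }                                                           \
  } while (0)

__global__ void take_ticket(unsigned int * counter, int * out, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) {
    unsigned int t = atomicAdd(counter, 1u);  // value before the addition
    out[t] = i;                                // t is unique to this thread
  }
}

// Returns 1 if the counter ends at n and every thread ID appears exactly once.
int tickets_unique(int n) {
  unsigned int * counter_dev;
  int * out_dev;
  CHECK(cudaMalloc(&counter_dev, sizeof(unsigned int)));
  CHECK(cudaMemset(counter_dev, 0, sizeof(unsigned int)));  // start from zero
  CHECK(cudaMalloc(&out_dev, sizeof(int) * n));
  int bs = 64;
  int nb = (n + bs - 1) / bs;
  take_ticket<<<nb, bs>>>(counter_dev, out_dev, n);
  CHECK(cudaGetLastError());
  unsigned int counter;
  int * out = (int *)malloc(sizeof(int) * n);
  CHECK(cudaMemcpy(&counter, counter_dev, sizeof(unsigned int), cudaMemcpyDeviceToHost));
  CHECK(cudaMemcpy(out, out_dev, sizeof(int) * n, cudaMemcpyDeviceToHost));
  CHECK(cudaFree(counter_dev));
  CHECK(cudaFree(out_dev));
  char * seen = (char *)calloc(n, 1);
  int ok = (counter == (unsigned int)n);
  for (int k = 0; k < n && ok; k++) {
    if (out[k] < 0 || out[k] >= n || seen[out[k]]) ok = 0;
    else seen[out[k]] = 1;
  }
  free(seen);
  free(out);
  return ok;
}

int main() {
  int ok = tickets_unique(1000);
  printf("tickets %s\n", ok ? "OK" : "NG");
  assert(ok);
  return 0;
}
```

The order in which thread IDs end up in `out` is not deterministic; only the fact that each appears exactly once is.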

# <font color="green"> Problem 6 :  Use `atomicAdd`</font>
* Change the following program to resolve the race condition using `atomicAdd` and make sure the result always matches the number of threads launched.

In [None]:
%%writefile cuda_race_atomic_add.cu
#include <assert.h>
#include <stdio.h>

#include "cuda_util.h"

__global__ void cuda_thread_fun(unsigned long long * p, int n) {
  int i        = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) {
    *p = *p + 1;
  }
}

int main(int argc, char ** argv) {
  int n               = (argc > 1 ? atoi(argv[1]) : 1000);
  int thread_block_sz = (argc > 2 ? atoi(argv[2]) : 64);
  int n_thread_blocks = (n + thread_block_sz - 1) / thread_block_sz;

  unsigned long long c;
  unsigned long long * c_dev;
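  check_api_error(cudaMalloc(&c_dev, sizeof(unsigned long long)));
  // cudaMalloc does not clear the memory; start the counter from zero
  check_api_error(cudaMemset(c_dev, 0, sizeof(unsigned long long)));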
  check_launch_error((cuda_thread_fun<<<n_thread_blocks,thread_block_sz>>>(c_dev, n)));
  check_api_error(cudaDeviceSynchronize());
  check_api_error(cudaMemcpy(&c, c_dev, sizeof(unsigned long long), cudaMemcpyDeviceToHost));
  check_api_error(cudaFree(c_dev));
  printf("c = %llu\n", c);
  return 0;
}

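* Some `atomicAdd` variants are only available on newer GPU architectures (e.g., the `double` version requires compute capability 6.0 or higher), so we give `--generate-code arch=compute_80,code=sm_80` to `nvcc` when compiling programs that use `atomicAdd`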
* `--generate-code` specifies which GPU architectures/instruction set `nvcc` generates code for, so it might affect generated code in other ways including performance


In [None]:
nvcc --generate-code arch=compute_80,code=sm_80 -o cuda_race_atomic_add cuda_race_atomic_add.cu
# nvc++ -Wall -gpu=cc80 -o cuda_race_atomic_add cuda_race_atomic_add.cu
# clang++ -Wall -Wno-unknown-cuda-version --cuda-gpu-arch=sm_80 -o cuda_race_atomic_add cuda_race_atomic_add.cu -L/usr/local/cuda/lib64 -lcudart

In [None]:
./cuda_race_atomic_add 10000 64
./cuda_race_atomic_add 100000 64

In [None]:
%%writefile cuda_race_atomic_add_ans.cu
#include <assert.h>
#include <stdio.h>

#include "cuda_util.h"

__global__ void cuda_thread_fun(unsigned long long * p, int n) {
  int i        = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) {
    atomicAdd(p, 1ULL);  // the literal matches the type of *p
  }
}

int main(int argc, char ** argv) {
  int n               = (argc > 1 ? atoi(argv[1]) : 1000);
  int thread_block_sz = (argc > 2 ? atoi(argv[2]) : 64);
  int n_thread_blocks = (n + thread_block_sz - 1) / thread_block_sz;

  unsigned long long c;
  unsigned long long * c_dev;
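  check_api_error(cudaMalloc(&c_dev, sizeof(unsigned long long)));
  // cudaMalloc does not clear the memory; start the counter from zero
  check_api_error(cudaMemset(c_dev, 0, sizeof(unsigned long long)));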
  check_launch_error((cuda_thread_fun<<<n_thread_blocks,thread_block_sz>>>(c_dev, n)));
  check_api_error(cudaDeviceSynchronize());
  check_api_error(cudaMemcpy(&c, c_dev, sizeof(unsigned long long), cudaMemcpyDeviceToHost));
  check_api_error(cudaFree(c_dev));
  printf("c = %llu\n", c);
  return 0;
}

In [None]:
nvcc --generate-code arch=compute_80,code=sm_80 -o cuda_race_atomic_add_ans cuda_race_atomic_add_ans.cu
# nvc++ -Wall -gpu=cc80 -o cuda_race_atomic_add_ans cuda_race_atomic_add_ans.cu
# clang++ -Wall -Wno-unknown-cuda-version --cuda-gpu-arch=sm_80 -o cuda_race_atomic_add_ans cuda_race_atomic_add_ans.cu -L/usr/local/cuda/lib64 -lcudart

In [None]:
./cuda_race_atomic_add_ans 10000 64
./cuda_race_atomic_add_ans 100000 64

# 13. Barrier synchronization of threads
* Recent CUDA has the notion of cooperative groups, with which you can build a barrier synchronization between threads
* setup
```
#include <cooperative_groups.h>
namespace cg = cooperative_groups; // save typing
```
* create data representing a group
```
cg::grid_group g = cg::this_grid(); // all threads
```

* perform barrier synchronization when necessary (ensure no threads execute `<after>` until all threads finish `<before>`) 
```
  <before>
  g.sync();
  <after>
```

* You need to launch such kernels by
```
void * args[] = { &a0, &a1, ... };  // addresses of the arguments, not their values
cudaLaunchCooperativeKernel((void *)f, nb, bs, args);
```
instead of
```
f<<<nb,bs>>>(a0, a1, ...);
```

* See [Cooperative Groups](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups) for reference
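
Here is a small sketch of the whole recipe. In the first phase each thread writes its own element of `a`; in the second phase it reads the element written by its neighbor, which is safe only because of the `g.sync()` in between. The names `CHECK`, `neighbor` and `barrier_demo` are made up for this illustration. The sketch assumes device 0, checks at run time whether the device supports cooperative launch, and uses a grid small enough (4 blocks of 64 threads) that all blocks can reside on the GPU at once. Depending on your compiler version you may need the same `--generate-code` option as above to compile it.

```cuda
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// hypothetical stand-in for check_api_error in cuda_util.h
#define CHECK(call)                                             \
  do {                                                          \
    cudaError_t e_ = (call);                                    \
    if (e_ != cudaSuccess) {                                    \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
              cudaGetErrorString(e_));                          \
      exit(1);                                                  \
    }                                                           \
  } while (0)

__global__ void neighbor(int * a, int * b, int n) {
  cg::grid_group g = cg::this_grid();
  int i = (int)g.thread_rank();
  if (i < n) a[i] = i * i;           // <before>: write my own element
  g.sync();                          // every thread of the grid calls this
  if (i < n) b[i] = a[(i + 1) % n];  // <after>: read my neighbor's element
}

// Returns 1 on success, 0 on a wrong result,
// -1 if the device does not support cooperative launch.
int barrier_demo(int n, int bs) {
  int supported = 0;
  CHECK(cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, 0));
  if (!supported) return -1;
  int * a_dev;
  int * b_dev;
  CHECK(cudaMalloc(&a_dev, sizeof(int) * n));
  CHECK(cudaMalloc(&b_dev, sizeof(int) * n));
  int nb = (n + bs - 1) / bs;
  void * args[] = { &a_dev, &b_dev, &n };  // addresses of the arguments
  CHECK(cudaLaunchCooperativeKernel((void *)neighbor, dim3(nb), dim3(bs),
                                    args, 0, 0));
  CHECK(cudaDeviceSynchronize());
  int * b = (int *)malloc(sizeof(int) * n);
  CHECK(cudaMemcpy(b, b_dev, sizeof(int) * n, cudaMemcpyDeviceToHost));
  CHECK(cudaFree(a_dev));
  CHECK(cudaFree(b_dev));
  int ok = 1;
  for (int i = 0; i < n; i++) {
    int j = (i + 1) % n;
    if (b[i] != j * j) ok = 0;
  }
  free(b);
  return ok;
}

int main() {
  int r = barrier_demo(256, 64);
  printf("barrier demo: %s\n",
         r == 1 ? "OK" : (r == 0 ? "NG" : "cooperative launch not supported"));
  assert(r != 0);
  return 0;
}
```

Note that threads with `i >= n` still call `g.sync()`; a thread that skipped the barrier would leave the others waiting.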

# <font color="green"> Problem 7 :  Use barrier synchronization</font>
Change the kernel `sum_array()` in the following program so that it correctly outputs the sum of the array, implementing a reduction with barrier synchronization.

In [None]:
%%writefile cuda_sum.cu
#include <assert.h>
#include <stdio.h>

#include "cuda_util.h"

#include <cooperative_groups.h>

//using namespace cooperative_groups;
// Alternatively use an alias to avoid polluting the namespace with collective algorithms
namespace cg = cooperative_groups;

__global__ void sum_array(double * c, long n) {
  // should return c[0] + c[1] + ... + c[n-1] in c[0]
  // you can destroy other elements of the array
  cg::grid_group g = cg::this_grid();
  unsigned long long i = g.thread_rank();
}

int main(int argc, char ** argv) {
  long n                = (argc > 1 ? atoi(argv[1]) : 10000);
  int threads_per_block = (argc > 2 ? atoi(argv[2]) : 64);
  int n_thread_blocks = (n + threads_per_block - 1) / threads_per_block;

  double * c = (double *)malloc(sizeof(double) * n);
  for (long i = 0; i < n; i++) {
    c[i] = 1.0;
  }
  double * c_dev;
  check_api_error(cudaMalloc(&c_dev, sizeof(double) * n));
  check_api_error(cudaMemcpy(c_dev, c, sizeof(double) * n, cudaMemcpyHostToDevice));
  void * args[2] = { (void *)&c_dev, (void *)&n };
  check_api_error(cudaLaunchCooperativeKernel((void*)sum_array,
                                              n_thread_blocks,
                                              threads_per_block,
                                              args));
  check_api_error(cudaDeviceSynchronize());
  check_api_error(cudaMemcpy(c, c_dev, sizeof(double) * n, cudaMemcpyDeviceToHost));
  check_api_error(cudaFree(c_dev));
  printf("sum = %f\n", c[0]);
  assert(c[0] == n);
  return 0;
}

In [None]:
nvcc -o cuda_sum cuda_sum.cu
# nvc++ -Wall -o cuda_sum cuda_sum.cu
# clang++ -Wall -Wno-unknown-cuda-version -o cuda_sum cuda_sum.cu -L/usr/local/cuda/lib64 -lcudart

In [None]:
./cuda_sum

In [None]:
%%writefile cuda_sum_ans.cu
#include <assert.h>
#include <stdio.h>

#include "cuda_util.h"

#include <cooperative_groups.h>

//using namespace cooperative_groups;
// Alternatively use an alias to avoid polluting the namespace with collective algorithms
namespace cg = cooperative_groups;

__global__ void sum_array(double * c, long n) {
  // should return c[0] + c[1] + ... + c[n-1] in c[0]
  // you can destroy other elements of the array
  cg::grid_group g = cg::this_grid();
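  long i = g.thread_rank();
  long h;
  // each round adds the upper part c[h..m-1] onto c[0..m-h-1],
  // so that the sum of c[0..h-1] equals the previous sum of c[0..m-1]
  for (long m = n; m > 1; m = h) {
    h = (m + 1) / 2;
    if (i + h < m) {
      c[i] += c[i + h];
    }
    g.sync();  // finish all additions of this round before the next one
  }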
}

int main(int argc, char ** argv) {
  long n                = (argc > 1 ? atoi(argv[1]) : 10000);
  int threads_per_block = (argc > 2 ? atoi(argv[2]) : 64);
  int n_thread_blocks = (n + threads_per_block - 1) / threads_per_block;

  double * c = (double *)malloc(sizeof(double) * n);
  for (long i = 0; i < n; i++) {
    c[i] = 1.0;
  }
  double * c_dev;
  check_api_error(cudaMalloc(&c_dev, sizeof(double) * n));
  check_api_error(cudaMemcpy(c_dev, c, sizeof(double) * n, cudaMemcpyHostToDevice));
  void * args[2] = { (void *)&c_dev, (void *)&n };
  check_api_error(cudaLaunchCooperativeKernel((void*)sum_array,
                                              n_thread_blocks,
                                              threads_per_block,
                                              args));
  check_api_error(cudaDeviceSynchronize());
  check_api_error(cudaMemcpy(c, c_dev, sizeof(double) * n, cudaMemcpyDeviceToHost));
  check_api_error(cudaFree(c_dev));
  printf("sum = %f\n", c[0]);
  assert(c[0] == n);
  return 0;
}

In [None]:
nvcc -o cuda_sum_ans cuda_sum_ans.cu
# nvc++ -Wall -o cuda_sum_ans cuda_sum_ans.cu
# clang++ -Wall -Wno-unknown-cuda-version -o cuda_sum_ans cuda_sum_ans.cu -L/usr/local/cuda/lib64 -lcudart

In [None]:
./cuda_sum_ans

# <font color="green"> Problem 8 :  Putting them together: calculating an integral</font>
Write a CUDA program that calculates

$$ \int_0^1 \int_0^1 \sqrt{1 - x^2 - y^2}\,dx\,dy $$

* mathematical note: consider the integrand to be zero outside $1 - x^2 - y^2 \geq 0$

* Write a CUDA kernel that computes the integrand on a single point
* And launch it with as many threads as the number of points you compute the integrand at
* The result should be close to $\pi/6$ (1/8 of the volume of the unit ball)
* Play with the number of subintervals used for integration and the number of threads so that you can observe a speedup
* Measure the time not just for the entire computation, but also for each step: `cudaMalloc`, `cudaMemcpy` to initialize variables on the device, the kernel, and `cudaMemcpy` to get the result back
* Try atomicAdd as well as reduction 
* Play with unified memory also

In [None]:
%%writefile cuda_integral.cu


In [None]:
nvcc -o cuda_integral cuda_integral.cu
# nvc++ -Wall -o cuda_integral cuda_integral.cu
# clang++ -Wall -Wno-unknown-cuda-version -o cuda_integral cuda_integral.cu -L/usr/local/cuda/lib64 -lcudart

In [None]:
./cuda_integral

In [None]:
%%writefile cuda_integral_ans.cu
#include <stdio.h>
#include <unistd.h>
#include <math.h>
#include <time.h>

#include "cuda_util.h"

double cur_time() {
  struct timespec tp[1];
  clock_gettime(CLOCK_REALTIME, tp);
  return tp->tv_sec + tp->tv_nsec * 1.0e-9;
}

__global__ void cuda_thread_fun(int n, double xa, double ya, double dx, double dy, double * sp) {
  int i          = blockDim.x * blockIdx.x + threadIdx.x;
  int j          = blockDim.y * blockIdx.y + threadIdx.y;
  if (i < n && j < n) {
    double x = xa + i * dx;
    double y = ya + j * dy;
    double z2 = 1 - x * x - y * y;
    if (z2 > 0) {
      atomicAdd(sp, sqrt(z2) * dx * dy);
    }
  }
}

int main(int argc, char ** argv) {
  double xa = 0.0;
  double xb = 1.0;
  double ya = 0.0;
  double yb = 1.0;
  int n = 10000;
  double dx = (xb - xa) / n;
  double dy = (yb - ya) / n;

  // thread configuration
  int nx                = n;
  int ny                = n;
  int thread_block_sz_x = (argc > 1 ? atoi(argv[1]) : 8);
  int thread_block_sz_y = thread_block_sz_x;
  int n_thread_blocks_x = (nx + thread_block_sz_x - 1) / thread_block_sz_x;
  int n_thread_blocks_y = (ny + thread_block_sz_y - 1) / thread_block_sz_y;

  double s = 0.0;
  double * s_dev;
  double t0 = cur_time();
  check_api_error(cudaMalloc(&s_dev, sizeof(double)));
  double t1 = cur_time();
  check_api_error(cudaMemcpy(s_dev, &s, sizeof(double), cudaMemcpyHostToDevice));
  double t2 = cur_time();
  
  dim3 nb(n_thread_blocks_x, n_thread_blocks_y);
  dim3 tpb(thread_block_sz_x, thread_block_sz_y);
  check_launch_error((cuda_thread_fun<<<nb,tpb>>>(n, xa, ya, dx, dy, s_dev)));
  check_api_error(cudaDeviceSynchronize());
  double t3 = cur_time();
  
  check_api_error(cudaMemcpy(&s, s_dev, sizeof(double), cudaMemcpyDeviceToHost));
  double t4 = cur_time();
  
  printf("ans = %.9f\n", s);
  printf(" cudaMalloc  : %f sec\n", t1 - t0);
  printf(" host -> dev : %f sec\n", t2 - t1);
  printf(" kernel      : %f sec\n", t3 - t2);
  printf(" host <- dev : %f sec\n", t4 - t3);
  printf("---------------------------\n");
  printf("total        : %f sec\n", t4 - t0);
  check_api_error(cudaFree(s_dev));
  return 0;
}

In [None]:
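# atomicAdd on double requires compute capability 6.0 or higher, so give the target architecture explicitly
nvcc --generate-code arch=compute_80,code=sm_80 -o cuda_integral_ans cuda_integral_ans.cu
# nvc++ -Wall -gpu=cc80 -o cuda_integral_ans cuda_integral_ans.cu
# clang++ -Wall -Wno-unknown-cuda-version --cuda-gpu-arch=sm_80 -o cuda_integral_ans cuda_integral_ans.cu -L/usr/local/cuda/lib64 -lcudart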

In [None]:
./cuda_integral_ans