

# CS 380 - GPU and GPGPU Programming Lecture 14: GPU Compute APIs, Pt. 4

Markus Hadwiger, KAUST

## Reading Assignment #6 (until Oct 21)



### Read (required):

Programming Massively Parallel Processors book (4th edition),
 Chapter 5 (Memory architecture and data locality)

### Read (optional):

- Programming Massively Parallel Processors book (4th edition),
   Chapter 20 (An introduction to CUDA streams)
- Programming Massively Parallel Processors book (4th edition),
   Chapter 21 (CUDA dynamic parallelism)

### Semester Project (proposal until Oct 21!)



- Choosing your own topic encouraged! (we will also suggest some topics)
  - Pick something that you think is really cool!
  - Can be completely graphics or completely computation, or both combined
  - Can be built on CS 380 frameworks, Vulkan SDK, CUDA SDK, ...
- Write short (1-2 pages) project proposal by Oct 21 at the latest
  - Talk to us before you start writing!
     (content and complexity should fit the lecture)
- Submit semester project with report (deadline: Dec 5)
- Present semester project, event in final exams week: Dec 9 (tbd!)

### Semester Project Ideas (1)



#### Some ideas for topics

- Procedural shading with noise + marble etc. (GPU Gems 2, chapter 26)
- Procedural shading with noise + bump mapping (GPU Gems 2, chapter 26)
- Subdivision surfaces (GPU Gems 2, chapter 7)
- Ambient occlusion, screen space ambient occlusion
- Shadow mapping, hard shadows, soft shadows
- Deferred shading
- Particle system rendering + CUDA particle sort
- Advanced image filters: fast bilateral filtering, Gaussian kD trees
- Advanced image de-convolution (e.g., convex L1 optimization)
- PDE solvers (e.g., anisotropic diffusion filtering, 2D level set segmentation, 2D fluid flow)

### Semester Project Ideas (2)



#### Some ideas for topics

- Distance field computation (GPU Gems 3, chapter 34)
- Livewire ("intelligent scissors") segmentation in CUDA
- Linear systems solvers, matrix factorization (Cholesky, ...); with/without CUBLAS
- CUDA + matlab
- Fractals (Sierpinski, Koch, ...)
- Image compression
- Bilateral grid filtering for multichannel images
- Discrete wavelet transforms
- Fast histogram computations
- Terrain rendering from height map images; clipmaps or adaptive tesselation

# Code Example #2: Matrix Multiply

# Programming Model: Square Matrix Multiplication

- P = M \* N of size WIDTH x WIDTH
- Without tiling:
  - One thread handles one element of P
  - M and N are loaded WIDTH times from global memory



# Multiply Using One Thread Block

- One block of threads computes matrix P
  - Each thread computes one element of P
- Each thread
  - Loads a row of matrix M
  - Loads a column of matrix N
  - Perform one multiply and addition for each pair of M and N elements
  - Compute to off-chip memory access ratio close to 1:1 (not very high)
- Size of matrix limited by the number of threads allowed in a thread block



# Matrix Multiplication Device-Side Kernel Function (cont.)

```
for (int k = 0; k < M.width; ++k)
  float Melement = M.elements[ty * M.pitch + k];
  float Nelement = Nd.elements[k * N.pitch + tx];
  Pvalue += Melement * Nelement;
// Write the matrix to device memory;
// each thread writes one element
P.elements[ty * blockDim.x+ tx] = Pvalue;
                                                               ty
                                                      tx
```

### Handling Arbitrary Sized Square Matrices

 Have each 2D thread block to compute a (BLOCK\_WIDTH)<sup>2</sup> sub-matrix (tile) of the result matrix

Each has (BLOCK\_WIDTH)<sup>2</sup> threads

 Generate a 2D Grid of (WIDTH/BLOCK\_WIDTH)<sup>2</sup> blocks

You still need to put a loop around the kernel call for cases where WIDTH is greater than Max grid size!



### Multiply Using Several Blocks - Idea

- One thread per element of P
- Load sub-blocks of M and N into shared memory
- Each thread reads one element of M and one of N
- Reuse each sub-block for all threads, i.e. for all elements of P
- Outer loop on sub-blocks



Hendrik Lensch and Robert Strzodka

bsize-1

### Multiply Using Several Blocks - Idea

- One thread per element of P
- Load sub-blocks of M and N into shared memory
- Each thread reads one element of M and one of N
- Reuse each sub-block for all threads, i.e. for all elements of P
- Outer loop on sub-blocks



Hendrik Lensch and Robert Strzodka

bsize-1

### Example: Matrix Multiplication (1)



 Copy matrices to device; invoke kernel; copy result matrix back to host

```
// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
    // Load A and B to device memory
    Matrix d A:
    d A.width = d A.stride = A.width; d A.height = A.height;
    size t size = A.width * A.height * sizeof(float);
    cudaMalloc((void**)&d A.elements, size);
    cudaMemcpy(d A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d B;
    d B.width = d B.stride = B.width; d B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc((void**)&d B.elements, size);
    cudaMemcpy(d B.elements, B.elements, size,
               cudaMemcpyHostToDevice);
```

### Example: Matrix Multiplication (2)



```
// Allocate C in device memory
Matrix d C;
d C.width = d C.stride = C.width; d C.height = C.height;
size = C.width * C.height * sizeof(float);
cudaMalloc((void**)&d C.elements, size);
// Invoke kernel
dim3 dimBlock (BLOCK SIZE, BLOCK SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d A, d B, d C);
// Read C from device memory
cudaMemcpy (C.elements, d C.elements, size,
           cudaMemcpyDeviceToHost);
// Free device memory
cudaFree (d A.elements);
cudaFree (d B.elements);
cudaFree (d C.elements);
```

## Example: Matrix Multiplication (3)



blockCol

- Multiply matrix block-wise
- Set BLOCK\_SIZE for efficient hardware use, e.g., to 16 on cc. 1.x or 16 or 32 on cc. 2.x +

- Maximize parallelism
  - Launch as many threads per block as block elements
  - Each thread fetches one element per block
  - Perform row \* column dot products in parallel



### Example: Matrix Multiplication (4)



```
global void MatrixMul( float *matA, float *matB, float *matC, int w )
     shared float blockA[ BLOCK SIZE ][ BLOCK SIZE ];
    shared float blockB[ BLOCK SIZE ][ BLOCK SIZE ];
   int bx = blockIdx.x; int tx = threadIdx.x;
   int by = blockIdx.y; int ty = threadIdx.y;
   int col = bx * BLOCK SIZE + tx;
   int row = by * BLOCK SIZE + ty;
   float out = 0.0f;
   for ( int m = 0; m < w / BLOCK SIZE; m++ ) {
      blockB[ ty ][ tx ] = matB[ col + ( m * BLOCK SIZE + ty ) * w ];
       syncthreads();
       for ( int k = 0; k < BLOCK SIZE; k++ ) {
          out += blockA[ ty ][ k ] * blockB[ k ][ tx ];
        syncthreads();
   }
```

Caveat: for brevity, this code assumes matrix sizes are a multiple of the block size (either because they really are, or because padding is used; otherwise guard code would need to be added)

matC[ row \* w + col ] = out;

### Running on a V100 (Volta) SM





```
#define THREADS PER BLK 128
 _global__ void convolve(int N, float* input,
                         float* output)
   __shared__ float support[THREADS_PER_BLK+2];
   int index = blockIdx.x * blockDim.x +
               threadIdx.x:
   support[threadIdx.x] = input[index];
   if (threadIdx.x < 2) {</pre>
      support[THREADS PER BLK+threadIdx.x]
        = input[index+THREADS PER BLK];
   syncthreads();
   float result = 0.0f; // thread-local
   for (int i=0; i<3; i++)
     result += support[threadIdx.x + i];
   output[index] = result / 3.f;
```

A convolve thread block is executed by 4 warps (4 warps x 32 threads/warp = 128 CUDA threads per block)

#### SM core operation each clock:

- Each sub-core selects one runnable warp (from the 16 warps in its partition)
- Each sub-core runs next instruction for the CUDA threads in the warp (this instruction may apply to all or a subset of the CUDA threads in a warp depending on divergence)

courtesy Kayvon Fatahalian

(sub-core == SM partition)

