void123-dev/100DaysOfGPUProgramming

Day 0: Hello GPU World Kernel

Resource: Read Chapter 1 of Programming Massively Parallel Processors

What I learnt:

  • CUDA kernels are functions executed on the GPU by multiple threads in parallel
  • The 'threadIdx.x' variable gives each thread a unique ID within the block
  • 'cudaDeviceSynchronize()' ensures the GPU finishes before the program exits
  • Learnt about different GPUs, like the NVIDIA GeForce RTX 4090 and the AMD Radeon RX 7900 XTX
  • Learnt about how kernels utilize GPUs
  • Learnt about the GPU design and how it emphasizes throughput
  • Learnt some error handling in CUDA
  • Learnt about Amdahl's Law
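
A minimal sketch of these ideas (an illustrative hello-world kernel, not necessarily the exact Day 0 code):

```cuda
#include <cstdio>

// Each thread prints its unique ID within its block via threadIdx.x.
__global__ void helloGPU() {
    printf("Hello GPU World from thread %d in block %d\n",
           threadIdx.x, blockIdx.x);
}

int main() {
    helloGPU<<<2, 4>>>();        // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();     // wait for the GPU to finish before exiting
    return 0;
}
```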

Challenges Faced:

  • Had to install the CUDA toolkit on my laptop to gain experience
  • Due to no personal access to a GPU, I had to compile my code in Google Colab
  • Using an nvcc Jupyter-notebook plugin didn't work, so I had to compile the code manually
  • Mistook 'cudaGetErrorString' for 'cudaGetLastErrorString'

Performance Observations:

  • Kernel ran instantly, probably because it just prints messages

Thoughts:

  • What happens when I launch more blocks?

Day 1: Vector Addition Kernel

Resource: Read chapter 2.1-2.3 of PMPP

What I learnt:

  • Learnt about Data and Task Parallelism and how they differ
  • Learnt about how C pointers work and how we use them in writing CUDA kernels

Challenges Faced:

  • Understanding the maths in the grid-stride loop was quite difficult, as was the core computation
  • Took some time to gain intuition on C pointers
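
The grid-stride pattern that tripped me up looks roughly like this (a sketch, assuming float vectors of length n):

```cuda
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    // unique global index: block offset plus thread offset within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // total number of threads in the grid; each thread strides forward by this
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)   // covers all n elements even if the grid is smaller
        c[i] = a[i] + b[i];
}
```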

Performance Observations:

  • Kernel ran instantly as well

Thoughts:

  • I wonder how matrix multiplications work

Day 2: Vector Addition Kernel with Timing

Resources: Read Chapter 2.4 of PMPP, Read chapters 1-2 of the CUDA Best Practices Guide

What I learnt:

  • Learnt about the APOD design cycle
  • Learnt about CUDA API functions for managing device global memory
  • Learnt that CUDA's malloc is a lot like C's
  • Learnt how to structure my args in cudaMemcpy
  • Learnt how the pointer variable should be cast in cudaMalloc
  • Learnt about Gustafson's Law
  • Solved a vector addition problem on LeetGPU
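
The cudaMalloc/cudaMemcpy structure mentioned above, sketched for a hypothetical float array of n elements:

```cuda
#include <cstdlib>

int main() {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);   // host allocation, just like C
    float *d_a;
    cudaMalloc((void **)&d_a, bytes);      // note the (void **) cast of the pointer's address
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // args: dst, src, bytes, direction
    // ... launch a kernel on d_a here ...
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // copy results back
    cudaFree(d_a);
    free(h_a);
    return 0;
}
```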

Challenges Faced:

  • Struggled to understand the abstract math behind calculating the global idx and the grid-stride loop
  • Took some time solving a LeetGPU, had to figure out how to write a test program to follow the implementation details

Performance Observations:

  • Compared sequential and parallel addition of vectors on CPU and GPU respectively and achieved a 29x speedup in the computation, which means the sequential method of the CPU must have been really slow

Thoughts:

  • Wondering how many lines of code the largest kernel has

Day 3: Array Multiplication Kernel with Timing

Resources: Finished Chapter 2 of PMPP

What I Learnt:

  • Learnt about kernel functions and threading
  • Learnt about the built-in variables like blockDim.x and threadIdx.x and how many dimensions they handle
  • Learnt the difference between SPMD and SIMD
  • Learnt about qualifier keywords

Challenges Faced:

  • No challenges faced today; getting the hang of the CUDA syntax

Performance Observations:

  • Tested GPU execution time and found that the larger the number of threads per block, the fewer the blocks and the faster the execution time, and vice versa

Thoughts:

  • How to write 2D kernels

Day 4: Array Multiplication Kernel, GPU vs CPU

Resources: Read Chapter 3 of PMPP

What I Learnt:

  • Learnt about Multi-dimensional Grids
  • Learnt how to work with 2D and 3D arrays

Challenges Faced:

  • Had a little trouble with some syntax errors but quickly fixed them

Performance Observations:

  • The GPU achieved a max speedup of 2.82x
  • The larger the block size, the faster the kernel and the greater the speedup
  • CPU execution time was faster at a block size of 64

Thoughts:

  • Why do CPUs perform better with smaller thread counts?

Day 5: 2D Matrix Addition Kernel

Resources: Reread Chapter 3 of PMPP

What I learnt:

  • Learnt how to handle 2D thread and block indexing
  • Learnt how to implement strides
  • Created a mental model to help me gain intuition on the dimensions of a grid and block
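
The stride_x/stride_y idea, sketched (assuming row-major rows x cols matrices; names are illustrative):

```cuda
__global__ void matAdd2D(const float *A, const float *B, float *C,
                         int rows, int cols) {
    // 2D global indices: x walks columns, y walks rows
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // total grid extent in each dimension; threads stride by these
    int stride_x = gridDim.x * blockDim.x;
    int stride_y = gridDim.y * blockDim.y;
    for (int r = y; r < rows; r += stride_y)
        for (int c = x; c < cols; c += stride_x)
            C[r * cols + c] = A[r * cols + c] + B[r * cols + c];
}
```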

Challenges Faced:

  • Took me a bit of time to understand the purpose of stride_x and stride_y

Thoughts:

  • I wonder how CPUs perform on 2D computations

Day 6: 2D Matrix Addition Kernel GPU vs CPU

Resources: Chapter 3.4 of PMPP

What I Learnt:

  • Read about matrix multiplication tiling

Challenges Faced:

  • Debugging some syntax errors in the code gave me hell, but I'm getting better at debugging CUDA code

Thoughts:

  • Same thought as yesterday, will find an answer

Day 7: 2D Tiled Matrix Multiplication Kernel with Shared Memory

Resources: Chapter 4 of PMPP

What I learnt:

  • Learnt about tiling and shared memory and how they are key optimization techniques
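
The classic shared-memory tiling pattern, sketched (assuming square N x N matrices with N a multiple of TILE):

```cuda
#define TILE 16

__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   // tile of A staged in fast shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in fast shared memory
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        // each thread loads one element of each tile
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();               // tile fully loaded before anyone reads it
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // everyone done before the tile is overwritten
    }
    C[row * N + col] = acc;
}
```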

Challenges Faced:

  • It was challenging implementing the tiling and understanding the matrix computations

Thoughts:

  • I wonder how CPUs perform compared to GPUs in matrix multiplications

Day 8: Optimized a previous Array Multiplication Kernel I wrote

  • Very busy today, but I managed to write this kernel.

Performance Observations:

  • Removed the grid-stride loop and moved the cudaMemcpy out of the timing loop; doing this significantly improved the GPU execution time, from 0.199 ms on 155 blocks to 0.098 ms on 157 blocks

Day 9: Reduction Sum Kernel

Resources: Chapter 4 of PMPP

What I learnt:

  • Learnt about the architecture of a modern GPU and block scheduling

Challenges Faced:

  • Took a bit of time understanding the maths behind the partial sum
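
The partial-sum maths, sketched: each block reduces its slice of the input in shared memory, halving the number of active threads each step, and writes one partial sum (assumes a power-of-two block size and a launch with blockDim.x * sizeof(float) bytes of dynamic shared memory):

```cuda
__global__ void reduceSum(const float *in, float *partial, int n) {
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;   // out-of-range threads contribute 0
    __syncthreads();
    // tree reduction: stride halves each iteration
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];   // one partial sum per block
}
```

The per-block partial sums still need a final pass (on the host or a second kernel launch) to produce the total.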

Performance Observations:

  • The larger the block size, the faster the execution time

Thoughts:

  • What happens if we don't do a partial sum?

Day 10: Metal Flash Attention Kernel

Resources: Tinygrad's Generic Metal Flash Attention Kernel

What I learnt:

  • Learnt how to write Metal kernels, noted the similarities with CUDA
  • Learnt how to write the attention computation

Challenges Faced:

  • It was difficult translating the code while coming from a CUDA-minded perspective

Thoughts:

  • I wonder how a CUDA flash attention kernel is written

Day 11: Reduction Sum Kernel, GPU vs CPU

  • wrote a reduction sum kernel and compared performance between GPU and CPU; faced a challenge when rounding off the GPU and CPU sums as they weren't matching, but managed to fix it

Day 12: Dot Product Kernel

  • wrote a dot product kernel with shared memory, a reduction strategy and atomic operations
  • learnt how to profile my kernel with nvprof instead of manually testing GPU execution time
  • noticed that cudaMalloc takes up most of the GPU execution time
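
Those three ingredients combine roughly like this (a sketch, assuming *result is zeroed before launch, a power-of-two block size, and blockDim.x * sizeof(float) bytes of dynamic shared memory):

```cuda
__global__ void dotProduct(const float *a, const float *b, float *result, int n) {
    extern __shared__ float cache[];
    int tid = threadIdx.x;
    float temp = 0.0f;
    // grid-stride accumulation of per-thread partial products
    for (int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        temp += a[i] * b[i];
    cache[tid] = temp;
    __syncthreads();
    // block-level tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }
    // one atomic per block combines the block results into the final scalar
    if (tid == 0) atomicAdd(result, cache[0]);
}
```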

Day 13: LeetGPU Dot Product Kernel

  • solved a medium-level leetgpu problem: implemented a CUDA program that computes the dot product of two vectors containing 32-bit floating point numbers

Day 14: Matrix Transpose Kernel

  • wrote a matrix transpose kernel with shared memory for tiling, memory coalescing and 2D thread indexing

Day 15: Hadamard Product Kernel

  • wrote a hadamard product kernel with coalesced memory access; no shared memory since each element is accessed only once, and zero __syncthreads() calls to avoid synchronization overhead

Day 16: LeetGPU Matrix Transpose Example 1

  • solved a leetgpu problem: Write a program that transposes a matrix of 32-bit floating point numbers on a GPU. The transpose of a matrix switches its rows and columns. Given a matrix A of dimensions rows x cols, the transpose A^T will have dimensions cols x rows. All matrices are stored in row-major format.

Day 17: LeetGPU Matrix Transpose Example 2

  • solved a leetgpu problem: Write a program that transposes a matrix of 32-bit floating point numbers on a GPU. The transpose of a matrix switches its rows and columns. Given a matrix A of dimensions rows x cols, the transpose A^T will have dimensions cols x rows. All matrices are stored in row-major format.

Day 18: GEMM Kernels

  • practised solving 6 leetgpu problems
  • wrote 2 GEMM kernels; the first test is on performance with different A & B matrix shapes and large m, n, k integers, the second on performance with the same A & B matrix shapes and small m, n, k integers

Day 19: LeetGPU FP16 Gemm Kernel

  • solved another medium-level leetgpu problem by writing an FP16 GEMM...that's 4/42 solved


Day 20: Optimized GEMM kernel

  • practised solving 3 leetgpu problems, wrote an optimized GEMM kernel by using tiled shared memory

Day 21: L2 Norm Kernel

  • wrote an L2 norm vector kernel with shared memory, fusing the normalization computation and division, hence reducing global memory accesses
  • will be practicing leetgpu for the rest of the day

Day 22: Vector Subtraction Kernel

  • wrote a vector subtraction kernel and tested its performance on 1 million elements

Day 23: Metal Vector Addition Kernel

  • feeling unwell but managed to rewrite one of my CUDA vector add kernels as a Metal implementation.
  • learning how the syntax and attributes differ...Metal seems much simpler

Day 24: Metal MatMul Kernel

  • turns out adding a grid-stride loop to my vectorAdd kernel increases perf from 19.64GFLOPS to 19.81GFLOPS, that's 0.866% faster
  • implemented a naive metal matmul kernel, noticing the differences between cuda and metal

Day 25: Sigmoid Activation Kernel

  • implemented a sigmoid activation kernel

Day 26: ReLU Activation Kernel

  • implemented a ReLU activation kernel
  • solved a new leetgpu problem (ReLU), that's 5/44 solved now
  • will be practicing leetgpu for the rest of the day

Day 27: LeetGPU's

  • practised solving 4 leetgpu's

Day 28: Max Reduction Kernel

  • wrote a single-block max reduction kernel with shared memory

Day 29: LeetGPU Matrix Multiplication Kernel

  • solved a leetgpu matrix multiplication problem, that's 6/47 now..will stay leetgpu-maxxing for the rest of the day

Day 30: LeetGPU Leaky ReLU Kernel

  • wrote a leaky relu kernel and solved the problem on leetgpu, that's 7/47 now..stayed leetgpu-maxxing for the rest of the day

Day 31: LeetGPU Reverse Array Kernel

  • solved a reverse array leetgpu problem, that's 8/47 solved now...tried solving a reduction problem but ran into issues with memory allocation, will figure it out

Day 32: Single Block Softmax Kernel

  • learnt about the softmax kernel; started off simple by writing a single-block softmax kernel, avoiding any custom atomic operations but using shared memory, parallel reduction and the log-sum-exp trick to avoid overflow
  • solved 2 new leetgpu problems, matrix copy and softmax
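
A single-block softmax along these lines might look like this sketch: subtract the row max before exponentiating (the usual numerical-stability trick), then normalize by the sum. Assumes a power-of-two block size and blockDim.x * sizeof(float) bytes of dynamic shared memory:

```cuda
#include <cfloat>

__global__ void softmax(const float *x, float *y, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    // 1) block-wide max reduction
    float m = -FLT_MAX;
    for (int i = tid; i < n; i += blockDim.x) m = fmaxf(m, x[i]);
    s[tid] = m; __syncthreads();
    for (int k = blockDim.x / 2; k > 0; k >>= 1) {
        if (tid < k) s[tid] = fmaxf(s[tid], s[tid + k]);
        __syncthreads();
    }
    float mx = s[0]; __syncthreads();
    // 2) block-wide sum of exp(x - max); subtracting mx prevents overflow
    float t = 0.0f;
    for (int i = tid; i < n; i += blockDim.x) t += expf(x[i] - mx);
    s[tid] = t; __syncthreads();
    for (int k = blockDim.x / 2; k > 0; k >>= 1) {
        if (tid < k) s[tid] += s[tid + k];
        __syncthreads();
    }
    float sum = s[0]; __syncthreads();
    // 3) normalize
    for (int i = tid; i < n; i += blockDim.x) y[i] = expf(x[i] - mx) / sum;
}
```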

Day 33: LeetGPU Dot Product Optimization

  • optimized my dot product kernel on leetgpu by replacing the accumulation loop with a loop unrolled by a factor of 4, letting the kernel process four multiply-add operations at once; this reduced loop overhead, making it currently the fastest kernel on the H200, B200 and H100 and second place on an A100-80GB
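
The unroll-by-4 accumulation, sketched as a fragment (names are illustrative; tail elements not divisible by 4 would need separate handling):

```cuda
// four independent accumulators let the multiply-adds overlap
float t0 = 0.f, t1 = 0.f, t2 = 0.f, t3 = 0.f;
int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
int stride = 4 * gridDim.x * blockDim.x;   // each thread owns 4-element chunks
for (; i + 3 < n; i += stride) {
    t0 += a[i]     * b[i];
    t1 += a[i + 1] * b[i + 1];
    t2 += a[i + 2] * b[i + 2];
    t3 += a[i + 3] * b[i + 3];
}
float partial = t0 + t1 + t2 + t3;   // then reduce across threads as before
```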

Day 34: LeetGPU Matrix Transpose Optimization

  • optimized my matrix transpose kernel on leetgpu by adding tiled shared memory, achieving a 10.10x runtime speedup (from 25.79119 ms to 2.55282 ms)
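
The standard tiled-transpose pattern looks roughly like this (a sketch: a block stages a tile in shared memory so both the read and the write hit global memory coalesced):

```cuda
#define TILE 32

__global__ void transposeTiled(const float *in, float *out, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];   // +1 padding sidesteps bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];   // coalesced read
    __syncthreads();
    // swap block coordinates so the write is also coalesced
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```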

Day 35: LeetGPU Matrix Transpose Further Optimization

  • further optimized my matrix transpose kernel on leetgpu by using #pragma unroll to reduce branch overhead, using restrict pointers to rule out aliasing of overlapping memory, and experimenting with different tile sizes.
  • on tile size 8 and a Tesla T4, my kernel got 1.0198x faster (2.55282 ms to 2.50317 ms)

Day 36: Dot Product LeetGPU Practice

  • practiced writing a dot product kernel on leetgpu

Day 37: Batched Matrix Multiplication Kernel LeetGPU

  • learnt about and solved a batched matrix multiplication leetgpu problem; the fp16 GEMM kernel was kind of difficult to grasp but this BMM has helped me gain deeper intuition, that's 12/50 solved now.
  • will be practicing writing fp16 GEMM and BMM kernels for the rest of the day.

Day 38: Optimized Batched Matrix Multiplication Kernel LeetGPU

  • optimized my batched matrix multiplication kernel on leetgpu by adding shared memory, making my kernel 1.20x faster (3.53275 ms to 2.94001 ms)

Day 39: LeetGPU Practice

  • practiced solving 6 leetgpu problems

Day 40: Sigmoid Linear Unit Kernel LeetGPU

  • learnt about and solved a sigmoid linear unit kernel problem on leetgpu, that's 13/51 solved now
  • i'll stay leetgpu-maxxing for the rest of the day

Day 41: Matrix Transpose Kernel practice

  • practiced writing a matrix transpose kernel

Day 42: 1D Convolution Kernel & Fast Dot Product Kernel

  • learnt about and solved the 1D convolution kernel problem on leetgpu, that's 15/51 now
  • also wrote a fast dot product kernel, first place on H100, H200 and B200, 2nd on the A100, still chasing a podium on the T4 (not so far away), will find a solution soon!

Day 43: LeetGPU Practice

  • busy busy day, practiced a bunch of leetgpu problems

Day 44: Simple Pytorch Inference & Fast Reverse Array LeetGPU

  • solved the simple inference challenge on leetgpu and managed 1st place on the B200 and H200 and 2nd place on the T4, A100, H100
  • also here's a fast reverse array kernel i wrote(trust me)

Day 45: Fast Dot Product with Memory Allocation

  • been a while since i wrote a kernel and did manual memory allocation (so used to leetgpu doing it for me), so i rewrote my fast dot product kernel from leetgpu, added memory allocation, tested it on N = 1024 and profiled it

Day 46: Fast Reverse Array Kernel with Memory Allocation

  • added some memory allocation to my fast reverse array kernel from leetgpu and tested it on the playground on a GTX TITAN X, profiled it too

Day 47: 1D Convolution Kernel with Shared Memory Optimization

  • optimized my 1D convolution kernel on leetgpu by adding shared memory to reduce global memory accesses, 1.29x faster (from 6.75597 ms to 5.24543 ms)
  • 2nd fastest H200 1D convolution kernel on leetgpu

Day 48: 1D Convolution Kernel Optimization Attempt

  • attempted to optimize my 1D convolution kernel; on top of shared memory I tried to reduce the redundant output_size computation but only achieved a negligible speedup

Day 49: Refactored Fast Dot Product Kernel

  • refactored my fast dot product kernel, cleaner and slightly faster

Day 50: 2D Convolution Kernel

  • learnt about writing a 2D convolution kernel and solved the challenge on leetgpu
  • 2nd fastest 2D conv kernel on the B200 & H200, 3rd fastest on the H100
  • will be looking into optimizations to make it faster

Day 51: Further Optimization on Matrix Transpose Kernel

  • ran dozens of experiments and further optimized my H100 matrix transpose kernel on leetgpu by adding restrict pointers for the float input & output, tweaking the tile size and adding #pragma unroll loops
  • achieved a 1.94x speedup in runtime (from 0.45757 ms to 0.23612 ms)

Day 52: Mojo Vector Add Kernel

  • learnt how to write a vector add kernel in mojo, quite similar to cuda..it's like cuda but with python-like syntax
  • can you beat my H100 mojo vector add?

Day 53: Mojo Reverse Array Kernel

  • fastest T4 mojo reverse array kernel on leetgpu
  • getting the hang of mojo, it's not that hard compared to cuda because of its simple, readable python-like syntax
  • i like the ceildiv function, easier to write compared to cuda's
  • still trying to optimize my 2D conv cuda kernel

Day 54: Mojo ReLU Kernel

  • busy day but managed to squeeze in some time and learnt to write a ReLU activation function kernel in Mojo

Day 55: Mojo Matrix Copy & CUDA FP16 Gemm

  • learnt how to write a matrix copy kernel in mojo
  • i began reading through siboehm's blog on how to optimize a cuda matmul for cuBLAS-like performance. found out my fp16 gemm kept going out of memory bounds because of the way i was doing my indexing (idk how i didn't spot it all this time, skill issue), so i rewrote it and slightly optimized it by adding an unroll loop and cutting the redundant computation when writing the answer to C
  • think of this implementation as Kernel 1[very naive]

Day 56: SwiGLU & FP16 SGEMM with Shared Memory and Leaky ReLU in Mojo

  • learnt to write a swiglu kernel in cuda and a leaky relu kernel in mojo
  • read simon's blog [kernel 3 implementation] and ended up adding shared memory to my coalesced fp16 sgemm kernel, 31.7% faster now (from a runtime of 0.4431 ms to 0.3364 ms)

Day 57: Count Array Kernel

  • learnt how to write a count array kernel; this time i didn't start with a naive kernel: i went straight for the shared memory optimization
  • still practicing writing matmul kernels, prepping my kernel 3 implementation by adding 1D tiling..i'll probably do this on monday

Day 58: FP16 SGEMM & Dot Product Kernel Practice

  • practiced writing a cuda fp16 gemm and a dot product kernel

Day 59: 2D Convolution Kernel

  • was experimenting with my 2d convolution kernel and ended up making some tweaks: added an unroll loop for the convolution computation
  • currently the 2nd fastest H200 2D convolution kernel, achieved a 1.35x speedup in runtime (from 0.81811 ms to 0.60366 ms)

Day 60: F16 SGEMM with 1D BlockTiling Optimization

  • read simon's kernel 4 implementation, understood it, then proceeded to add 1D blocktiling to my fp16 gemm with shared memory and memory coalescing
  • now the 2nd fastest H200 fp16 gemm kernel on leetgpu following a 6.15% speedup in runtime (0.3364 ms to 0.3168 ms)

Day 61: Batched Matrix Multiplication with 1D BlockTiling

  • as a result of yesterday's fp16 gemm optimization, i took a look at my batched mat mul and saw how slow it was
  • optimized it by adding memory coalescing, shared memory and 1D blocktiling for columns(i find this very handy)
  • 32.5% faster(0.23302 to 0.17579ms) on the H200

Day 62: Dot Product Kernel with Unsigned Ints

  • optimized my dot product kernel by using unsigned ints instead of ints, with a significant increase in performance
  • apparently unsigned ints execute faster due to fewer instructions, especially in unrolled loops

Day 63: FP16 GEMM Practice

  • busy day but i managed to practice writing an fp16 gemm, cleaned it up: washed off some redundant computations

Day 64: Batched Mat Mul Optimization

  • slightly tweaked my batched mat mul by adding restrict pointers and scrapping any redundant indexing
  • 7.06% faster(0.17673ms to 0.16426ms)

Day 65: Dot Product Optimization

  • was practicing writing a dot product kernel when i saw some flaws and turned them into perfection
  • 1.44x faster (44% faster): 0.01812 ms to 0.01256 ms
  • fastest kernel on the A100, H100 and H200

Day 66: CuTe DSL Vector Add Kernel

  • started reading the nvidia cutlass documentation
  • learnt how to write a CuTe DSL vector add kernel, currently the fastest on all available GPUs on leetgpu

Day 67: FP16 GEMM & Dot Product Practice

  • practiced writing my very fast dot product kernel and a fp16 gemm kernel

Day 68: CuTe DSL ReLU Kernel and Cuda Reverse Array Optimization

  • learnt how to write a relu kernel in cute dsl
  • slightly tweaked my fast cuda reverse array kernel, making it 7.84x faster on the B200 (0.13842 ms to 0.01765 ms), with similar improvements on the T4, A100, H200 and H100

Day 69: FP16 GEMM with Tensor Cores via WMMA API

  • learnt how to write an FP16 GEMM with tensor cores via the WMMA API
  • saw a dramatic speedup: up to 4.91x faster on the H100 (0.4431 ms to 0.09027 ms), with similarly dramatic speedups on the H200, B200 and A100 (fastest kernel on all 4 GPUs)
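
The core WMMA pattern looks roughly like this sketch (one warp computes a 16x16 tile of C = A * B; all matrices row-major, M, N, K assumed multiples of 16, needs sm_70 or newer, and the launch must make blockDim.x a multiple of warpSize):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmmaGemm(const half *A, const half *B, float *C,
                         int M, int N, int K) {
    // which 16x16 output tile this warp owns
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);
    // march along K in 16-wide steps, one tensor-core MAC per step
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(bFrag, B + k * N + warpN * 16, N);
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, cFrag, N,
                            wmma::mem_row_major);
}
```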

Day 70: FP16 WMMA correction

  • realized switching the layout for my b_frag to ::col_major won't work because the leetgpu problem in question specifically asks us to assume all matrices are row-major
  • but i did find out that ::col_major is the correct memory layout for the b_frag because it aligns the data for tensor core operations and optimizes memory access patterns, hence improving performance

Day 71: WMMA GEMM A100 vs H100 Comparison

  • ran an experiment comparing my WMMA GEMM on the A100 vs H100
  • the A100 clocked a peak runtime of 0.13873 ms while the H100 sprinted to 0.09027 ms
  • that's a 1.54x speedup

Day 72: Optimized Batched Mat Mul

  • optimized my batched mat mul by padding shared memory, using restrict pointers for A, B, C and unsigned ints for unroll loops
  • achieved a 1.187x speedup in runtime.

About

100 Days Of GPU Programming.
