This is a personal fork of the GPU-Mode reference-kernels repository for writing and testing GPU kernels locally. The original repository contains reference implementations and competition problems for GPU kernel optimization challenges hosted by the GPU-Mode community.
Purpose of this fork:
- ✍️ Write and test custom CUDA/Triton kernels locally on WSL2
- 🧪 Experiment with kernel optimizations before submission
- 📚 Learn GPU programming through hands-on practice
- 📊 Run benchmarks on local hardware (RTX 2050, etc.)
Contents:
- Available Problems
- Getting Started
- Local Development & Testing
- Problem Structure
- Implementation Examples
- Example Workflows
The `problems/pmpp/` directory contains fundamental GPU programming problems based on the "Programming Massively Parallel Processors" (PMPP) curriculum, perfect for learning CUDA optimization techniques.
Problems:
- `vectoradd_py` - Vector addition (great starting point!)
- `vectorsum_py` - Vector summation/reduction
- `matmul_py` - Matrix multiplication
- `conv2d_py` - 2D convolution
- `histogram_py` - Histogram computation
- `prefixsum_py` - Parallel prefix sum (scan)
- `grayscale_py` - Image grayscale conversion
- `sort_py` - Parallel sorting
Other problem sets:
- `problems/amd/` - AMD kernel optimization challenges
- `problems/bioml/` - Bioinformatics and ML kernels
- `problems/amd_distributed/` - Multi-GPU distributed kernels
Prerequisites:
- NVIDIA GPU (for CUDA kernels) with compute capability 7.0+ recommended
- Python 3.8+
- CUDA Toolkit 12.1+ (for compiling inline CUDA kernels)
- PyTorch with CUDA support
- Clone this repository:

```bash
git clone https://github.com/thefirehacker/reference-kernels.git
cd reference-kernels
```

- Create a Python virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
```

- Install PyTorch with CUDA support:

```bash
# For CUDA 12.1
pip install --index-url https://download.pytorch.org/whl/cu121 torch
# For CUDA 12.4
pip install --index-url https://download.pytorch.org/whl/cu124 torch
```

- Verify CUDA is working:

```bash
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else None}')"
```
If you're running Windows with WSL2, you can test kernels locally on your NVIDIA GPU with full CUDA support!
📖 Complete WSL2 + CUDA + PyTorch Setup Guide
This comprehensive guide provides a known-good, reproducible path from a fresh WSL2 Ubuntu installation to running compiled CUDA kernels:
What's covered:
- ✅ Installing CUDA Toolkit 12.1 on WSL2 Ubuntu 22.04
- ✅ Configuring environment variables (`CUDA_HOME`, `PATH`, `LD_LIBRARY_PATH`)
- ✅ Installing PyTorch with a matching CUDA version
- ✅ Setting up the vectoradd problem with inline CUDA
- ✅ Running and benchmarking your kernels locally
- ✅ Troubleshooting common issues (OOM, compilation errors, etc.)
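Once those environment variables are in place, a quick sanity check from Python (a sketch, not a script from the repo) confirms the toolkit is visible to PyTorch before you try to compile any kernels:

```python
# Sanity check: PyTorch's extension builder must be able to locate the CUDA toolkit.
import shutil
from torch.utils.cpp_extension import CUDA_HOME  # resolved from CUDA_HOME / nvcc's location

print("CUDA_HOME seen by PyTorch:", CUDA_HOME)   # should point at /usr/local/cuda-12.x
print("nvcc on PATH:", shutil.which("nvcc"))     # should not be None
```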
Tested Hardware:
- NVIDIA RTX 2050 (4GB VRAM, SM 8.6)
- NVIDIA RTX 3060 and newer
- Most modern NVIDIA GPUs with CUDA support
Why WSL2?
- Native CUDA support via GPU passthrough (no virtualization overhead)
- Full `nvcc` compiler access for inline CUDA kernels
- Linux development environment on Windows
- Seamless integration with the Windows filesystem
Once you have CUDA and PyTorch set up, each problem includes a `run_local.py` script for local testing:

```bash
cd problems/pmpp/vectoradd_py
python run_local.py
```

What `run_local.py` does:
- Correctness validation - Checks your kernel output against the reference implementation
- Automatic sizing - Detects available VRAM and picks appropriate test sizes
- Warmup runs - Ensures GPU is ready and caches are warm
- Precision timing - Uses CUDA events for accurate millisecond measurements
- Statistical reporting - Reports mean, best, and worst case timings
Example output:

```
Detected dims=2, free≈3.4 GB, using sizes=[4096, 8192, 12288, 16384] (s_max≈16384)
size=4096: mean=0.3421 ms, best=0.3401 ms, worst=0.3502 ms
size=8192: mean=1.2134 ms, best=1.2089 ms, worst=1.2301 ms
size=12288: mean=2.7892 ms, best=2.7801 ms, worst=2.8123 ms
size=16384: mean=4.9123 ms, best=4.9034 ms, worst=4.9456 ms
```
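Under the hood, CUDA-event timing in PyTorch looks roughly like the sketch below (`custom_kernel` and `data` stand in for your submission's entry point and a generated input; the actual `run_local.py` code may differ):

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

# Warmup so JIT compilation and cache effects don't skew the measurement.
for _ in range(5):
    custom_kernel(data)
torch.cuda.synchronize()

start.record()
custom_kernel(data)
end.record()
torch.cuda.synchronize()                    # wait for the GPU to finish
print(f"{start.elapsed_time(end):.4f} ms")  # elapsed_time() reports milliseconds
```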
To test exactly as the competition evaluator does:
```bash
cd problems/pmpp/vectoradd_py

# Create test cases
printf "size: 4096; seed: 42\nsize: 8192; seed: 43\n" > tests.txt

# Run correctness tests
exec 3>&1
POPCORN_FD=3 python ../eval.py test tests.txt

# Run benchmarks
POPCORN_FD=3 python ../eval.py benchmark tests.txt
```
Each problem follows a standardized structure for consistency and ease of use:
```
problems/pmpp/vectoradd_py/
├── reference.py       # Reference PyTorch implementation (ground truth)
├── task.py            # Type definitions for inputs/outputs
├── task.yml           # Problem specification and test configurations
├── submission.py      # YOUR kernel implementation (what you edit!)
├── template.py        # Empty template to start from
├── utils.py           # Helper functions (correctness checking, seeding)
├── eval.py            # Official competition evaluator
├── run_local.py       # Local benchmark script (auto-sizing, timing)
└── solutions/         # Example reference solutions
    ├── correct/
    │   ├── submission_pytorch.py      # Pure PyTorch solution
    │   ├── submission_triton.py       # Triton kernel solution
    │   └── submission_cuda_inline.py  # Inline CUDA solution
    └── incorrect/     # Examples of common mistakes (for learning)
```
`reference.py`
- Contains the reference PyTorch implementation
- Defines `generate_input()` for creating test cases
- Defines `check_implementation()` for correctness validation
- Your kernel must match this output (within tolerance)
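For a vector-add style problem, a simplified sketch of those pieces might look like this (the real `reference.py` may differ in names and signatures):

```python
import torch

def generate_input(size: int, seed: int):
    # Deterministic random operands on the GPU.
    gen = torch.Generator(device="cuda").manual_seed(seed)
    A = torch.randn(size, device="cuda", generator=gen)
    B = torch.randn(size, device="cuda", generator=gen)
    return A, B

def check_implementation(data, output) -> bool:
    A, B = data
    expected = A + B  # the reference computation
    return torch.allclose(output, expected, rtol=1e-5, atol=1e-8)
```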
`task.py`
- Defines type hints: `input_t` and `output_t`
- Documents expected tensor shapes, dtypes, and devices
- Ensures type safety across submissions
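For vectoradd, those aliases could be as simple as the following sketch (the actual `task.py` may define richer types):

```python
# Hypothetical type aliases; the real task.py may differ.
import torch
from typing import Tuple

input_t = Tuple[torch.Tensor, torch.Tensor]  # the two operands, both on the CUDA device
output_t = torch.Tensor                      # their elementwise sum
```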
`task.yml`
- Specifies problem metadata (name, description, difficulty)
- Defines test case configurations
- Sets timeout limits and resource constraints
`submission.py`
- This is where you write your kernel!
- Must implement `custom_kernel(data: input_t) -> output_t`
- Can use PyTorch, Triton, inline CUDA, or CuPy
- Will be evaluated for both correctness and performance
`utils.py`
- `set_seed()` - Ensures reproducible random inputs
- `verbose_allclose()` - Detailed tensor comparison with error reporting
- `match_reference()` - Convenience function for correctness checking
You can implement kernels using any of these approaches:
Pure PyTorch:

```python
from task import input_t, output_t  # type aliases defined in the problem's task.py

def custom_kernel(data: input_t) -> output_t:
    A, B = data
    return A + B
```
Triton:

```python
import torch
import triton
import triton.language as tl
from task import input_t, output_t

@triton.jit
def add_kernel(a_ptr, b_ptr, c_ptr, N, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offset = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offset < N  # guard the tail block when N isn't a multiple of BLOCK_SIZE
    a = tl.load(a_ptr + offset, mask=mask)
    b = tl.load(b_ptr + offset, mask=mask)
    c = a + b
    tl.store(c_ptr + offset, c, mask=mask)

def custom_kernel(data: input_t) -> output_t:
    A, B = data
    C = torch.empty_like(A)
    N = A.numel()
    grid = lambda meta: (triton.cdiv(N, meta['BLOCK_SIZE']),)
    add_kernel[grid](A, B, C, N, BLOCK_SIZE=1024)
    return C
```
Inline CUDA:

```python
from torch.utils.cpp_extension import load_inline
from task import input_t, output_t

cuda_source = """
__global__ void add_kernel(const float* A, const float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) C[idx] = A[idx] + B[idx];
}

torch::Tensor add_cuda(torch::Tensor A, torch::Tensor B) {
    auto C = torch::empty_like(A);
    int N = A.numel();
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    add_kernel<<<blocks, threads>>>(
        A.data_ptr<float>(),
        B.data_ptr<float>(),
        C.data_ptr<float>(),
        N
    );
    return C;
}
"""

# Compile the CUDA source once at import time.
add_module = load_inline(
    name='add_cuda',
    cpp_sources='torch::Tensor add_cuda(torch::Tensor A, torch::Tensor B);',
    cuda_sources=cuda_source,
    functions=['add_cuda']
)

def custom_kernel(data: input_t) -> output_t:
    return add_module.add_cuda(data[0], data[1])
```
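If the JIT build is slow or fails unhelpfully, `load_inline` also accepts `verbose=True` and `extra_cuda_cflags` (standard parameters of `torch.utils.cpp_extension.load_inline`); a hedged example:

```python
# Optional: show the nvcc/ninja build log and pass extra flags to nvcc.
add_module = load_inline(
    name='add_cuda',
    cpp_sources='torch::Tensor add_cuda(torch::Tensor A, torch::Tensor B);',
    cuda_sources=cuda_source,
    functions=['add_cuda'],
    verbose=True,               # print compiler output (useful for debugging)
    extra_cuda_cflags=['-O3'],  # forwarded to nvcc
)
```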
When testing your kernel locally, ensure:
- ✅ Correctness - Output matches the reference implementation
- ✅ Type safety - Follow the `input_t` and `output_t` definitions
- ✅ Device compatibility - Tensors must stay on the CUDA device
- ✅ Numerical precision - Match the reference within tolerance (typically `rtol=1e-5`, `atol=1e-8`)
- ✅ Performance - Track improvements with `run_local.py` benchmarks
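A minimal self-check along those lines (a sketch; `custom_kernel` is your submission, `A`/`B` a generated input, and the expected result is recomputed in plain PyTorch):

```python
import torch

out = custom_kernel((A, B))
expected = A + B  # reference result for vectoradd
assert out.is_cuda, "output must stay on the CUDA device"
assert torch.allclose(out, expected, rtol=1e-5, atol=1e-8), "output differs from reference"
```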
Workflow 1 - Write your first kernel:

```bash
# 1. Pick a problem
cd problems/pmpp/vectoradd_py

# 2. Start with the template
cp template.py submission.py

# 3. Implement your kernel (edit submission.py)
nano submission.py  # or use your favorite editor

# 4. Test locally
python run_local.py

# 5. If it works, optimize and iterate!
```

Workflow 2 - Compare the example solutions:

```bash
cd problems/pmpp/vectoradd_py

# Try the PyTorch solution first
cp solutions/correct/submission_pytorch.py submission.py
python run_local.py

# Compare with Triton
cp solutions/correct/submission_triton.py submission.py
python run_local.py

# Try inline CUDA for maximum performance
cp solutions/correct/submission_cuda_inline.py submission.py
python run_local.py
```

Workflow 3 - Measure an optimization:

```bash
# Test baseline
python run_local.py > baseline.txt

# Make optimization changes to submission.py
# ... edit code ...

# Compare performance
python run_local.py > optimized.txt
diff baseline.txt optimized.txt
```
- 📖 WSL2 Setup Guide - Complete setup for Windows/WSL2
- 🌐 GPU-Mode Website - Original community and resources
- 💬 GPU-Mode Discord - Join the community
- 🔗 Original Repository - Upstream source
💡 Start simple - Get a correct PyTorch implementation first, then optimize
💡 Compare approaches - Try PyTorch, Triton, and CUDA versions to understand trade-offs
💡 Profile before optimizing - Use `nvprof` or `nsys` to find actual bottlenecks (a Python-level sketch follows this list)
💡 Check occupancy - Higher occupancy often (but not always) means better performance
💡 Memory coalescing - Ensure adjacent threads access adjacent memory for full bandwidth
💡 Shared memory - Use it strategically to reduce expensive global memory accesses
💡 Test on various sizes - Verify your kernel works across different input scales
💡 Benchmark locally - Use `run_local.py` to measure real performance on your GPU
💡 Learn from solutions - Study the `solutions/` folder for different implementation techniques
💡 Iterate quickly - The WSL2 setup enables fast compile-test-optimize cycles
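As a Python-level complement to `nsys` (see the profiling tip above), `torch.profiler` gives a quick per-kernel time breakdown; a minimal sketch, with `custom_kernel` and `data` standing in for your submission and an input:

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    custom_kernel(data)
    torch.cuda.synchronize()  # make sure the GPU work is captured

# Sort by total CUDA time to see which kernels dominate.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```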