A collection of exercises from LeetGPU (GitHub), featuring implementations in CUDA, PyTorch, and Triton.
Tested on WSL2 Ubuntu 24.04.
- CUDA Toolkit
- CMake 4
- Python 3.12+, GTest, NVBench
- uv
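
Before installing anything, it can help to see which of the tools above are already on `PATH`. The following is a minimal sketch, not part of the repository: the executable names (`nvcc`, `cmake`, `python3`, `uv`) are assumptions about how each tool is exposed on a typical Ubuntu setup. GTest and NVBench are libraries rather than executables, so they are not covered.

```python
import shutil

# Assumed executable names for the tools listed above; adjust if yours differ.
REQUIRED_TOOLS = {
    "CUDA Toolkit": "nvcc",
    "CMake": "cmake",
    "Python": "python3",
    "uv": "uv",
}


def find_missing_tools(tools: dict[str, str] = REQUIRED_TOOLS) -> list[str]:
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```
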
sudo apt-get install -y python3.12-dev libgtest-dev
uv sync
make build && make test
make build-release && make bench
make py-sync && make py-test
make clean

- Follow the Windows section (which applies to WSL2) of NVIDIA Developer Tools Solutions: Permission Issue with Performance Counters to grant all users access to the GPU performance counters.
- Restart WSL from PowerShell by running
wsl --shutdown
- Run ncu (without sudo):
# --set=full: most comprehensive profiling
# -f: force overwrite of output files if they already exist
# --kernel-name-base demangled: use human-readable kernel names in the output
# --kernel-name 'regex:vector_add': only profile kernels matching the regex pattern "vector_add"
# -o vector_add: write results to files with the "vector_add" prefix (creates .ncu-rep files)
# ./001_vector_addition_benchmark: the executable to profile, here an NVBench program
#   (NVBench flags are listed in https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md)
# --profile: NVBench flag, run once only
# --axis "N=67108864": NVBench flag, run the benchmark with N=67108864
# The comments sit above the command because text after a trailing backslash breaks the line continuation.
ncu \
  --set=full \
  -f \
  --kernel-name-base demangled \
  --kernel-name 'regex:vector_add' \
  -o vector_add \
  ./001_vector_addition_benchmark \
  --profile \
  --axis "N=67108864"

This will generate vector_add.ncu-rep, which can be opened in:
- Nsight Compute GUI (Windows): For interactive analysis with charts and recommendations
- Command line: ncu -i vector_add.ncu-rep, for text-based analysis
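
If you profile several exercises, the same ncu invocation can be assembled from Python, where each argument can carry its own comment without affecting how the command is parsed. This is a minimal sketch rather than a script from the repository: the function names `build_ncu_command` and `run_profile` are hypothetical, and the default values simply mirror the command in this section.

```python
import shlex
import subprocess


def build_ncu_command(
    benchmark: str = "./001_vector_addition_benchmark",
    kernel_regex: str = "vector_add",
    output: str = "vector_add",
    n: int = 67108864,
) -> list[str]:
    """Return the ncu invocation from this section as an argument list."""
    return [
        "ncu",
        "--set=full",                              # most comprehensive profiling
        "-f",                                      # overwrite existing output files
        "--kernel-name-base", "demangled",         # human-readable kernel names
        "--kernel-name", f"regex:{kernel_regex}",  # only profile matching kernels
        "-o", output,                              # writes <output>.ncu-rep
        benchmark,                                 # the NVBench executable to profile
        "--profile",                               # NVBench: run once only
        "--axis", f"N={n}",                        # NVBench: problem size
    ]


def run_profile(**kwargs) -> None:
    """Run the profiler; requires ncu on PATH and a built benchmark binary."""
    subprocess.run(build_ncu_command(**kwargs), check=True)


if __name__ == "__main__":
    # Print the command for inspection instead of running it.
    print(shlex.join(build_ncu_command()))
```

Passing the arguments as a list means no shell quoting or line continuation is involved, so the regex and the axis value reach ncu unchanged.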