A collection of CUDA programming exercises based on the following sources:
- NVIDIA CUDA Samples
- NVIDIA CUDA C++ Programming Guide
- Programming Massively Parallel Processors, 4th Edition, by Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj, published by Morgan Kaufmann
The command-line parsing implementation is incorporated directly from ArgParse, without modification, in order to minimize external dependencies.
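Assuming the vendored parser follows the familiar p-ranav/argparse API (the program name and `--size` flag below are purely illustrative, not the exercises' actual options), a typical exercise might declare its arguments like this:

```cuda
#include <iostream>
#include <argparse/argparse.hpp>

int main(int argc, char* argv[]) {
    argparse::ArgumentParser program("matrix_mul");

    // Hypothetical option; each exercise defines its own set.
    program.add_argument("--size")
        .default_value(1024)
        .scan<'i', int>()
        .help("square matrix dimension");

    try {
        program.parse_args(argc, argv);
    } catch (const std::exception& err) {
        std::cerr << err.what() << '\n' << program;  // print usage on bad input
        return 1;
    }

    const int n = program.get<int>("--size");
    // ... allocate buffers and launch kernels using n ...
    return 0;
}
```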
All examples can be built using the provided Makefile with the following command:

```
make <example_file_name>
```
For example, to build the matrix multiplication exercise:

```
make matrix_mul
```
All compiled binaries are output to the `build` directory.
Each binary can be launched without arguments to use default parameters, or with the `--help` flag to display all supported command-line arguments and options.
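For instance, after building the matrix multiplication exercise:

```
./build/matrix_mul --help
```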
Performance profiling can be performed with NVIDIA Nsight Compute (`ncu`). Refer to the Makefile for the specific profiling configuration flags. To profile an example, use the following make command:
```
make <example_file_name>_profile
```
For example, to profile the matrix multiplication implementation:

```
make matrix_mul_profile
```
This exercise demonstrates multiple implementations of matrix multiplication with varying levels of optimization:
- Naive Implementation: A straightforward approach that reads data directly from global memory, with each thread computing one output element of the result matrix. This serves as a baseline for performance comparison.
- Tiled Implementation: An optimized version that improves upon the naive approach by loading matrix tiles into shared memory, significantly reducing global memory accesses and improving memory bandwidth utilization (a sketch follows this list).
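A minimal sketch of the tiled kernel, assuming square n × n row-major matrices with n a multiple of the tile width; identifier names are illustrative rather than those used in the exercise:

```cuda
#define TILE_WIDTH 16

// C = A * B for square n x n matrices; n is assumed to be a multiple of
// TILE_WIDTH and the grid is assumed to cover the full output matrix.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float tileA[TILE_WIDTH][TILE_WIDTH];
    __shared__ float tileB[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE_WIDTH; ++t) {
        // Each thread stages one element of A and one of B into shared memory.
        tileA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE_WIDTH + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * n + col];
        __syncthreads();

        // Partial dot product over this tile; operands now come from shared memory.
        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

Each element of A and B is read from global memory once per tile rather than once per multiply-add, which is where the bandwidth savings over the naive kernel come from.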
This exercise implements a 256-bin histogram computation using several GPU optimization techniques: privatization, in which each thread block accumulates a private histogram in shared memory; atomic operations for concurrent updates; and thread coarsening to improve computational throughput. The GPU results are verified against a serial CPU implementation for correctness. The exercise also demonstrates the use of CUDA streams, events, and asynchronous memory operations for efficient host-device communication.
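A minimal sketch of the device-side privatization pattern (the streamed, event-timed host code and CPU verification are omitted; names are illustrative):

```cuda
#define NUM_BINS 256

// One private histogram per block in shared memory; each thread processes
// several input elements via a grid-stride loop (thread coarsening) before
// the block's partial counts are merged into the global histogram.
__global__ void histogram_private(const unsigned char* data, unsigned int* hist, int n) {
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    // Grid-stride loop: stride equals the total thread count in the grid.
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // Merge the block-private histogram into the global one.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&hist[b], local[b]);
}
```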
This exercise performs parallel reduction to calculate the sum of a large array of integers. The implementation demonstrates the use of CUDA Cooperative Groups, a flexible programming model for thread synchronization and coordination. Specifically, it showcases reduction operations over warp tile groups, leveraging hardware-level primitives for efficient intra-warp communication and computation.
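A minimal sketch of the warp-tile reduction stage, assuming CUDA 11 or later for `cg::reduce`; the grid-stride accumulation and the final atomic combine are illustrative choices rather than necessarily the exercise's exact structure:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

// Each warp reduces its lanes' partial sums with shuffle-based hardware
// primitives via cg::reduce; lane 0 of every warp then adds the warp's
// sum into the global total (assumed zero-initialized by the host).
__global__ void sum_reduce(const int* data, int* total, int n) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    // Grid-stride accumulation into a per-thread partial sum.
    int partial = 0;
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        partial += data[i];

    // Warp-level reduction over the 32-lane tile.
    int warp_sum = cg::reduce(warp, partial, cg::plus<int>());
    if (warp.thread_rank() == 0)
        atomicAdd(total, warp_sum);
}
```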
This exercise implements an inclusive prefix scan (cumulative sum) using two distinct approaches. The first uses the optimized implementation from the NVIDIA CUB library. The second is a custom kernel built from CTA (Cooperative Thread Array) collective functions that scan individual warps with hardware-accelerated instructions. The custom implementation employs a hierarchical strategy: first scanning within warps, then combining warp totals to scan the entire thread block, and finally scanning the block sums to handle arbitrarily long arrays.
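Two abbreviated sketches of the building blocks, shown as illustrations rather than the exercise's exact code: CUB's two-call device-wide scan idiom, and a shuffle-based warp-level inclusive scan (the block- and grid-level combination stages are omitted):

```cuda
#include <cub/cub.cuh>

// Library approach: CUB's device-wide inclusive sum. The first call only
// queries the required temporary-storage size; the second performs the scan.
void cub_inclusive_sum(const int* d_in, int* d_out, int n) {
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_in, d_out, n);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_in, d_out, n);
    cudaFree(d_temp);
}

// Custom approach, warp-level stage only: Kogge-Stone inclusive scan of one
// value per lane using shuffle instructions.
__device__ int warp_inclusive_scan(int value) {
    int lane = threadIdx.x % 32;
    for (int offset = 1; offset < 32; offset <<= 1) {
        int up = __shfl_up_sync(0xffffffffu, value, offset);
        if (lane >= offset)  // lanes below the offset keep their own value
            value += up;
    }
    return value;
}
```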
This exercise demonstrates parallel stream compaction, a fundamental operation that selects elements from an input array matching a predicate function and copies only those elements to a contiguous output array. The implementation compares three distinct approaches:
- STL Serial: A baseline single-threaded CPU implementation using standard library algorithms.
- STL Parallel: A multi-threaded CPU implementation leveraging C++ parallel execution policies (`std::execution::par`).
- GPU Three-Pass Kernel: A GPU-accelerated implementation employing a three-stage pipeline: (1) predicate evaluation to generate indicator flags, (2) an inclusive scan using the CUB library to compute output positions, and (3) a parallel gather of selected elements to their final positions in the output array (a sketch of this pipeline follows the list).
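A minimal sketch of the three passes, assuming `int` elements and an illustrative "keep even values" predicate; the function and buffer names are hypothetical:

```cuda
#include <cub/cub.cuh>

// Pass 1: mark elements that satisfy the predicate (here: even values).
__global__ void mark(const int* in, int* flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (in[i] % 2 == 0) ? 1 : 0;
}

// Pass 3: gather flagged elements; the inclusive scan of the flags gives
// each kept element its (1-based) position in the output.
__global__ void gather(const int* in, const int* flags, const int* pos,
                       int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[pos[i] - 1] = in[i];
}

// Host driver: pass 2 is CUB's device-wide inclusive sum over the flags.
void compact(const int* d_in, int* d_out, int* d_flags, int* d_pos, int n) {
    int threads = 256, blocks = (n + threads - 1) / threads;
    mark<<<blocks, threads>>>(d_in, d_flags, n);

    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_flags, d_pos, n);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_flags, d_pos, n);
    cudaFree(d_temp);

    gather<<<blocks, threads>>>(d_in, d_flags, d_pos, d_out, n);
}
```

The total number of selected elements is the last entry of the scanned positions array (`d_pos[n - 1]`), which the host can copy back after pass 2.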