
CUDA Kernels Repository

Welcome to the CUDA Kernels repository! This project is a comprehensive collection of CUDA implementations ranging from fundamental concepts to advanced mathematical operations. It is designed for both beginners starting their CUDA journey and professionals looking for reference implementations.

📂 Project Structure

The repository is organized into modules of increasing complexity:

1. Fundamentals (modules/01_fundamentals)

  • 01_kernel_basics: Introduction to writing your first CUDA kernel.
  • 02_grid_block: Understanding the Grid-Block-Thread hierarchy (a minimal kernel sketch follows this list).
  • 03_hardware: Querying GPU device properties and capabilities.
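
A minimal sketch of the pattern these modules introduce (the kernel and variable names below are illustrative, not taken from the module sources): each thread derives a unique global index from the grid/block/thread hierarchy and guards against out-of-range work.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes its global index from the Grid-Block-Thread hierarchy.
__global__ void fill_index(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;   // guard: the last block may be only partially used
}

int main() {
    const int n = 1024;
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));
    // 256 threads per block, enough blocks to cover all n elements.
    fill_index<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}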

2. Memory Management (modules/02_memory_management)

  • 01_vector_ops: Standard vector operations (Add, Sub, Mul, Dot) demonstrating global memory usage.
  • 02_vector_dot: Dot product implementation using atomic operations.
  • 03_constant_memory: Optimization using Constant Memory for read-only data.
  • 04_unified_memory: Simplified memory management using cudaMallocManaged and cudaMemPrefetchAsync.
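
The unified-memory idea in a minimal sketch (the kernel and variable names are illustrative, not the module's actual code): a single allocation is visible to both host and device, and a prefetch hint migrates it to the GPU before the kernel touches it.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));          // one pointer for CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;           // initialized on the host

    int dev;
    cudaGetDevice(&dev);
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);   // hint: move data to the GPU up front

    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                       // host reads the same pointer
    cudaFree(x);
    return 0;
}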

3. Advanced Math (modules/03_advanced_math)

  • 01_matrix_vector_ops: High-performance Matrix-Vector multiplication (Standard, Banded, Symmetric, Triangular) and Rank-1/Rank-2 updates. Includes CPU verification. A simple kernel sketch follows this list.
  • 02_fft: Fast Fourier Transform implementations (Radix-2 and Stockham algorithms).
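
For flavor, here is the shape of the standard dense case as a hedged, one-thread-per-row sketch (not the module's actual implementation, which also covers banded, symmetric, and triangular layouts).

#include <cuda_runtime.h>

// y = A * x for a dense, row-major M x N matrix; one thread per output row.
__global__ void matvec(const float *A, const float *x, float *y, int M, int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M) {
        float sum = 0.0f;
        for (int col = 0; col < N; ++col)
            sum += A[row * N + col] * x[col];
        y[row] = sum;
    }
}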

4. Optimizations (modules/04_optimizations)

  • 01_tiled_matmul: The "Holy Grail" of CUDA optimizations. Tiled Matrix-Matrix multiplication using Shared Memory.
  • 02_reduction: Highly optimized parallel reduction (Sum) using Warp Shuffle instructions.
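
A rough sketch of the warp-shuffle approach (assuming block sizes up to 1024 threads; not the module's exact code): each warp reduces its values with __shfl_down_sync, the per-warp partials meet in shared memory, and one atomicAdd per block folds the result into global memory.

#include <cuda_runtime.h>

// Sum across a warp: each step halves the number of active lanes.
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void reduce_sum(const float *in, float *out, int n) {
    __shared__ float warp_sums[32];                    // one slot per warp in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;

    val = warp_reduce_sum(val);
    if ((threadIdx.x & 31) == 0) warp_sums[threadIdx.x >> 5] = val;
    __syncthreads();

    if (threadIdx.x < 32) {                            // first warp combines the partials
        int num_warps = (blockDim.x + 31) >> 5;
        val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f;
        val = warp_reduce_sum(val);
        if (threadIdx.x == 0) atomicAdd(out, val);     // one global atomic per block
    }
}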

5. Concurrency (modules/05_concurrency)

  • 01_streams: Demonstrates maximizing GPU throughput by overlapping Compute with Memory Transfers using CUDA Streams.
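
The overlap pattern, sketched under the assumption that h_in/h_out are pinned host buffers (from cudaMallocHost), d_in/d_out are device buffers, and process is some element-wise kernel; these names are illustrative, not the module's.

// Split the work into chunks; each chunk's H2D copy, kernel, and D2H copy
// are issued into their own stream so transfers overlap with compute.
const int chunks = 4;
cudaStream_t streams[chunks];
for (int c = 0; c < chunks; ++c) cudaStreamCreate(&streams[c]);

for (int c = 0; c < chunks; ++c) {
    size_t off = (size_t)c * chunk_elems;
    cudaMemcpyAsync(d_in + off, h_in + off, chunk_bytes,
                    cudaMemcpyHostToDevice, streams[c]);
    process<<<blocks, threads, 0, streams[c]>>>(d_in + off, d_out + off, chunk_elems);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk_bytes,
                    cudaMemcpyDeviceToHost, streams[c]);
}
cudaDeviceSynchronize();
for (int c = 0; c < chunks; ++c) cudaStreamDestroy(streams[c]);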

6. Advanced Algorithms (modules/06_advanced_algorithms)

  • 01_histogram: Optimized frequency counting using Privatized Shared Memory Atomics to reduce Global Memory contention.
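
The privatization idea, as a minimal sketch (a 256-bin byte histogram is assumed; the names are illustrative): each block accumulates into its own shared-memory copy, so the expensive global atomics drop from one per element to one per bin per block.

#include <cuda_runtime.h>

__global__ void histogram256(const unsigned char *data, int n, unsigned int *bins) {
    __shared__ unsigned int local[256];
    for (int b = threadIdx.x; b < 256; b += blockDim.x) local[b] = 0;
    __syncthreads();

    // Grid-stride loop; the per-element atomics hit fast shared memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // Merge the block-private histogram into the global one.
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);
}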

7. The Ecosystem (modules/07_ecosystem)

  • 01_cublas: Industry-standard Matrix Multiplication using NVIDIA's hand-tuned cuBLAS library.
  • 02_thrust: High-level C++ template library ("STL for CUDA") for Sorting and Reducing without writing kernels (see the sketch after this list).
  • 03_curand: Parallel Random Number Generation (Monte Carlo Pi Estimation).
  • 04_cusparse: Sparse Matrix-Vector multiplication using Compressed Sparse Row (CSR) format.
  • 05_cusolver: Dense Cholesky Decomposition ($A = L L^T$).
  • 06_nvtx: Profiling range markers for Nsight Systems.
  • 07_dynamic_parallelism: Child kernel launches from the GPU (CDP).
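
As a taste of how little code the library route requires, here is a hedged Thrust sketch (the data is made up for illustration): sorting and summing a device vector without writing a single kernel.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <vector>
#include <cstdio>

int main() {
    std::vector<int> h = {5, 2, 8, 1, 9, 3};
    thrust::device_vector<int> d(h.begin(), h.end());   // copy host -> device
    thrust::sort(d.begin(), d.end());                   // parallel sort on the GPU
    int sum = thrust::reduce(d.begin(), d.end(), 0);    // parallel sum
    printf("smallest = %d, sum = %d\n", (int)d[0], sum);
    return 0;
}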

🚀 Getting Started

Prerequisites

  • NVIDIA GPU: Compute Capability 5.0 or higher recommended.
  • CUDA Toolkit: Version 10.0 or higher.
  • Compiler: nvcc (bundled with CUDA Toolkit).
  • Build Tool: make (or nmake on Windows, though the Makefiles are written for typical make).

Building and Running

Each module contains a Makefile. To build a specific module, navigate to its directory and run make.

Example: Running the Matrix-Vector Operations Demo

cd modules/03_advanced_math/01_matrix_vector_ops
make
./mv_app.exe

📚 Documentation

For a detailed learning path, check out A_BEGINNERS_GUIDE.md.

For a guide on the broader ecosystem (cuBLAS, Thrust, TensorRT, etc.), read CUDA_ECOSYSTEM_GUIDE.md.

🛠️ Verification

All modules are self-contained and include verification mechanisms (comparing GPU results against CPU reference implementations) to ensure correctness.

To verify the entire repository at once, run the included PowerShell script:

./scripts/verify_all.ps1

🤝 Contributing

Contributions are welcome! Please ensure code is formatted and includes verification logic.

📄 License

MIT License
