Some common CUDA kernel implementations (Not the fastest).
Updated Jun 24, 2024 · CUDA
🎉 CUDA notes / hand-written CUDA kernels for large models / C++ notes, updated occasionally: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
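Several of the kernels named above (softmax, layernorm, dot product) are built on a warp-level reduction. A minimal sketch of one common pattern, a warp sum reduction via shuffle intrinsics, is shown below; this is my own illustration, not code from any listed repository:

```cuda
// Sum-reduce a value across the 32 lanes of a warp using shuffle intrinsics.
// After the loop, lane 0 holds the sum of all 32 lanes' inputs.
__device__ float warp_reduce_sum(float val) {
    // Halve the stride each step: 16, 8, 4, 2, 1.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}
```

A block-level reduction is typically layered on top: each warp reduces its own values, warp leaders write partial sums to shared memory, and the first warp reduces those partials.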
A simple yet fast implementation of matrix multiplication in CUDA.
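The baseline that simple GEMM implementations start from is one thread per output element. A hedged sketch (my own, assuming row-major storage; none of the repos above necessarily use this exact form):

```cuda
// Naive SGEMM: C = A * B, where A is M x K, B is K x N, C is M x N,
// all row-major. Each thread computes a single element of C.
__global__ void sgemm_naive(int M, int N, int K,
                            const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```

Faster variants add shared-memory tiling, register blocking, and vectorized loads on top of this skeleton.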
FP64-equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme.
Fast SGEMM emulation on Tensor Cores
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.
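For reference, the WMMA API mentioned above exposes Tensor Cores through fragment types and `load`/`mma`/`store` intrinsics. A minimal sketch (my own, assuming row-major half inputs, float accumulation, K a multiple of 16, and one warp per 16x16 output tile):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp per block; block (bx, by) computes the 16x16 tile of C
// starting at row 16*by, column 16*bx. Launch with 32 threads per block
// and a grid of (N/16, M/16).
__global__ void hgemm_wmma_tile(int M, int N, int K,
                                const half *A, const half *B, float *C) {
    int tile_m = blockIdx.y * 16;
    int tile_n = blockIdx.x * 16;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // March along the K dimension in 16-wide steps, accumulating into c_frag.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tile_m * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_n, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + tile_m * N + tile_n, c_frag, N,
                            wmma::mem_row_major);
}
```

The MMA PTX path drops below this API to issue `mma.sync` instructions directly, which gives finer control over register layout and software pipelining.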
Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
My attempt at writing a GEMM kernel...
CUDA kernel functions
Code for benchmarking GPU performance with cublasSgemm and cublasHgemm.
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.