Some common CUDA kernel implementations (not the fastest). Updated Oct 27, 2024 · CUDA
Fast SGEMM emulation on Tensor Cores
My attempt at making a GEMM kernel...
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
CUDA kernel functions
Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
A simple yet fast implementation of matrix multiplication in CUDA.
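The simplest form of such a kernel assigns one thread per output element. A minimal sketch (the kernel name, row-major layout, and dimensions are assumptions, not taken from any of the listed repositories):

```cuda
// Naive SGEMM sketch: C = A * B, with A (M x K), B (K x N), C (M x N),
// all row-major in global memory. One thread computes one element of C.
__global__ void sgemm_naive(int M, int N, int K,
                            const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```

A typical launch uses 16x16 thread blocks, e.g. `dim3 block(16, 16); dim3 grid((N + 15) / 16, (M + 15) / 16); sgemm_naive<<<grid, block>>>(M, N, K, dA, dB, dC);`. This version is memory-bound; the faster kernels listed here add shared-memory tiling and register blocking on top of it.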
Code for benchmarking GPU performance using cublasSgemm and cublasHgemm.
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
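The WMMA API is the simplest entry point to Tensor Cores: each warp cooperatively multiplies 16x16 tiles through fragment objects. A minimal sketch, assuming M, N, K are multiples of 16 and both inputs are row-major half-precision (the kernel name and tiling are illustrative, not from the listed repositories):

```cuda
#include <mma.h>
using namespace nvcuda;

// WMMA HGEMM sketch: one warp per 16x16 output tile of C = A * B,
// with half inputs and a float accumulator (mixed precision).
__global__ void hgemm_wmma(int M, int N, int K,
                           const half *A, const half *B, float *C) {
    int tileM = blockIdx.y;  // which 16-row tile of C
    int tileN = blockIdx.x;  // which 16-column tile of C

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    // March along the K dimension, 16 columns/rows at a time.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a, A + tileM * 16 * K + k, K);
        wmma::load_matrix_sync(b, B + k * N + tileN * 16, N);
        wmma::mma_sync(acc, a, b, acc);  // acc += a * b on Tensor Cores
    }
    wmma::store_matrix_sync(C + tileM * 16 * N + tileN * 16, acc,
                            N, wmma::mem_row_major);
}
```

Launched with one warp per block, e.g. `hgemm_wmma<<<dim3(N / 16, M / 16), 32>>>(M, N, K, dA, dB, dC);`. The MMA PTX route mentioned above drops below this API to schedule the same Tensor Core operations by hand, which is what the faster kernels in these repositories do.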
🎉 Modern CUDA Learn Notes with PyTorch: CUDA Cores, Tensor Cores, fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, hgemm, sgemv, warp/block reduce, elementwise, softmax, layernorm, rmsnorm.