🎉 CUDA notes / hand-written CUDA kernels for LLMs / C++ notes, updated at leisure: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
Updated May 19, 2024 · CUDA
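Several of the kernels named in that entry (block reduce, softmax, layernorm) are built on a warp-level reduction. A minimal sketch of the warp reduce pattern using shuffle intrinsics — an illustrative example, not code from the listed repository:

```cuda
// Warp-level sum reduction via register shuffles: a butterfly
// exchange in which, after 5 steps, every lane of the 32-thread
// warp holds the sum of all 32 input values.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_xor_sync(0xffffffff, val, offset);
    return val;
}

__global__ void reduce32(const float* in, float* out) {
    // One warp reduces 32 elements; lane 0 writes the result.
    float v = warp_reduce_sum(in[threadIdx.x]);
    if (threadIdx.x == 0) *out = v;
}
```

A block reduce is typically layered on top of this: each warp reduces in registers, warp leaders write partials to shared memory, and one warp reduces the partials.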
🎉CUDA 笔记 / 大模型手撕CUDA / C++笔记,更新随缘: flash_attn、sgemm、sgemv、warp reduce、block reduce、dot product、elementwise、softmax、layernorm、rmsnorm、hist etc.
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
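The WMMA API mentioned above exposes tensor cores at warp granularity. A hedged sketch, assuming a single 16×16×16 half-precision tile with D = A·B (illustrative only; real HGEMM kernels tile and pipeline many such MMAs):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively computes a 16x16x16 tensor-core MMA.
// a is row-major, b is column-major, both with leading dimension 16.
__global__ void wmma_tile(const half* a, const half* b, float* d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);      // start from a zero accumulator
    wmma::load_matrix_sync(fa, a, 16);   // load A tile into registers
    wmma::load_matrix_sync(fb, b, 16);   // load B tile into registers
    wmma::mma_sync(acc, fa, fb, acc);    // acc += A * B on tensor cores
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```

The MMA PTX route replaces these intrinsics with inline `mma.sync` instructions, trading portability for finer control over register layout.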
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
Code for benchmarking GPU performance based on cublasSgemm and cublasHgemm.
The simplest but fast implementation of matrix multiplication in CUDA.
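The baseline such implementations start from is one thread per output element. A minimal sketch, assuming row-major C = A·B with A of shape M×K and B of shape K×N (an illustration of the naive scheme, not that repository's code):

```cuda
// Naive SGEMM: each thread computes one element of C by walking
// the shared dimension K. Correct but memory-bound; tiling A and B
// through shared memory is the usual first optimization.
__global__ void sgemm_naive(int M, int N, int K,
                            const float* A, const float* B, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```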
Use tensor core to calculate back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instruction.
Fast SGEMM emulation on Tensor Cores
CUDA kernel functions
My attempt at making a GEMM kernel...