diff --git a/README.md b/README.md
index 457ba38b..167e53f1 100644
--- a/README.md
+++ b/README.md
@@ -21,11 +21,12 @@
-
-
+
+
+
-Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./kernels/hgemm) for more details.
+Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./kernels/hgemm) for more details.
 
 |CUDA Cores|Sliced K(Loop over K)|Tile Block|Tile Thread|
 |:---:|:---:|:---:|:---:|
@@ -34,56 +35,22 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's d
 |✔️|✔️|✔️|✔️|
 |Copy Async|Tile MMA(More Threads)|Tile Warp(More Values)|Multi Stages|
 |✔️|✔️|✔️|✔️|
-|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shfl)|
+|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shuffle)|
 |✔️|✔️|✔️|✔️|
 |Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CuTe)|
 |✔️|✔️|✔️|✔️|
-
-
-
-
-
+## ©️Citations🎉🎉
+
+```BibTeX
+@misc{CUDA-Learn-Notes@2024,
+  title={CUDA-Learn-Notes: A Modern CUDA Learn Notes with PyTorch for Beginners},
+  url={https://github.com/DefTruth/CUDA-Learn-Notes},
+  note={Open-source software available at https://github.com/DefTruth/CUDA-Learn-Notes},
+  author={DefTruth et al.},
+  year={2024}
+}
+```
 
 ## 📖 150+ CUDA Kernels 🔥🔥 (Common Interview Questions) ([©️back👆🏻](#contents))
 
 **Workflow**: custom **CUDA** kernel impl -> **PyTorch** Python bindings -> Run tests. 👉TIPS: `*` = **Tensor Cores(WMMA/MMA)**, otherwise, CUDA Cores; `/` = not supported; `✔️` = supported; `❔` = in my plan.

diff --git a/kernels/hgemm/README.md b/kernels/hgemm/README.md
index 3816e2e2..fde995cf 100755
--- a/kernels/hgemm/README.md
+++ b/kernels/hgemm/README.md
@@ -166,7 +166,11 @@ python3 hgemm.py --cute-tn --mma --wmma-all --plot
 
 Tested on an NVIDIA GeForce RTX 3080 Laptop GPU under Windows WSL2: using mma4x4_warp4x4 (16 WMMA m16n16k16 ops, warp tile 64x64) together with thread block swizzle, most cases can match or even outperform cuBLAS.
 
+
+
+![image](https://github.com/user-attachments/assets/9472e970-c083-4b31-9252-3eeecc761078)
 
 ```bash
 python3 hgemm.py --wmma-all --plot
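
For readers unfamiliar with the cuBLAS baseline referenced in the README diff above, the sketch below (not taken from this repo) shows how an FP16 GEMM is commonly dispatched through `cublasGemmEx` with the `CUBLAS_GEMM_DEFAULT_TENSOR_OP` algorithm; the dimensions, layouts and absence of error checking are illustrative assumptions.

```cuda
// Hedged sketch (not repo code): the cuBLAS Tensor-Core baseline that custom
// HGEMM kernels are typically benchmarked against, i.e. cublasGemmEx() with
// CUBLAS_GEMM_DEFAULT_TENSOR_OP. Computes row-major C = A * B in FP16 via the
// usual column-major trick C^T = B^T * A^T. Error checking omitted.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void cublas_hgemm_baseline(const half* dA, const half* dB, half* dC,
                           int M, int N, int K) {
  cublasHandle_t handle;
  cublasCreate(&handle);

  const half alpha = __float2half(1.0f);
  const half beta  = __float2half(0.0f);

  // cuBLAS is column-major; passing (B, A) and swapping M/N computes
  // C^T = B^T * A^T, which is exactly row-major C = A * B.
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
               N, M, K,
               &alpha,
               dB, CUDA_R_16F, N,   // "A" operand = B^T (N x K), ld = N
               dA, CUDA_R_16F, K,   // "B" operand = A^T (K x M), ld = K
               &beta,
               dC, CUDA_R_16F, N,   // "C" operand = C^T (N x M), ld = N
               CUBLAS_COMPUTE_16F,
               CUBLAS_GEMM_DEFAULT_TENSOR_OP);  // algorithm named in the README

  cublasDestroy(handle);
}
```

In the plots mentioned above, a call path like this is the `orange` cuBLAS curve that the repo's WMMA/MMA kernels (`blue`) are measured against.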
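The feature tables in the README diff (Sliced K, warp tiling, copy async, multi stages, swizzling, collective store) all build on the same WMMA m16n16k16 primitive. Below is a minimal, hypothetical sketch of that primitive, with one warp computing a single 16x16 tile of C straight from global memory; the repo's kernels layer shared-memory staging, register double buffering and swizzling on top of this, and M/N/K are assumed to be multiples of 16.

```cuda
// Hedged sketch: one warp accumulates a single 16x16 tile of C = A * B using
// WMMA m16n16k16 fragments. Assumes row-major A (M x K), B (K x N), C (M x N),
// M/N/K multiples of 16, and one warp (32 threads) per block:
//   grid = (N / 16, M / 16), block = 32.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_naive_hgemm(const half* A, const half* B, half* C,
                                 int M, int N, int K) {
  int tile_m = blockIdx.y;  // which 16-row band of C
  int tile_n = blockIdx.x;  // which 16-col band of C

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
  wmma::fill_fragment(c_frag, __float2half(0.0f));

  // Sliced-K loop: march over K in steps of 16, accumulating into c_frag.
  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, A + tile_m * 16 * K + k, K);
    wmma::load_matrix_sync(b_frag, B + k * N + tile_n * 16, N);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // c_frag += a_frag * b_frag
  }
  wmma::store_matrix_sync(C + tile_m * 16 * N + tile_n * 16, c_frag, N,
                          wmma::mem_row_major);
}
```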
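The "custom CUDA kernel impl -> PyTorch Python bindings -> run tests" workflow mentioned in the README diff usually amounts to a small piece of C++ glue like the following hypothetical sketch (not this repo's actual binding code; `launch_hgemm_wmma` and the checks are placeholders): a launcher defined in a `.cu` file, a wrapper that validates inputs and allocates the output, and a `PYBIND11_MODULE` export that a Python script such as `hgemm.py` can call.

```cpp
// Hedged sketch of a PyTorch binding for a custom HGEMM kernel (illustrative).
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>

// Hypothetical launcher implemented in a separate .cu file.
void launch_hgemm_wmma(const at::Half* A, const at::Half* B, at::Half* C,
                       int M, int N, int K, cudaStream_t stream);

torch::Tensor hgemm(torch::Tensor a, torch::Tensor b) {
  TORCH_CHECK(a.is_cuda() && b.is_cuda(), "inputs must be CUDA tensors");
  TORCH_CHECK(a.dtype() == torch::kHalf && b.dtype() == torch::kHalf,
              "inputs must be float16");
  int64_t M = a.size(0), K = a.size(1), N = b.size(1);
  auto c = torch::empty({M, N}, a.options());
  launch_hgemm_wmma(a.data_ptr<at::Half>(), b.data_ptr<at::Half>(),
                    c.data_ptr<at::Half>(),
                    static_cast<int>(M), static_cast<int>(N),
                    static_cast<int>(K),
                    at::cuda::getCurrentCUDAStream());
  return c;
}

// Exposed to Python so a test/benchmark script can import and call it.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("hgemm", &hgemm, "HGEMM (WMMA) - illustrative binding");
}
```

Built with `torch.utils.cpp_extension` (for example `load(...)` or a `CUDAExtension` in `setup.py`), the resulting module can then be imported from Python and compared against `torch.matmul` or cuBLAS in the test scripts.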