From d86f8c723b71c7135f2e23c4090454423344fda8 Mon Sep 17 00:00:00 2001 From: DefTruth <31974251+DefTruth@users.noreply.github.com> Date: Sun, 24 Nov 2024 21:19:25 +0800 Subject: [PATCH 01/11] Update README.md --- README.md | 56 +++++++++++-------------------------------------------- 1 file changed, 11 insertions(+), 45 deletions(-) diff --git a/README.md b/README.md index 457ba38b..939538a4 100644 --- a/README.md +++ b/README.md @@ -39,51 +39,17 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's d |Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CuTe)| |✔️|✔️|✔️|✔️| - - - - - +## ©️Citations🎉🎉 + +```BibTeX +@misc{CUDA-Learn-Notes@2024, + title={CUDA-Learn-Notes: A Modern CUDA Learn Notes with PyTorch}, + url={https://github.com/DefTruth/CUDA-Learn-Notes}, + note={Open-source software available at https://github.com/DefTruth/CUDA-Learn-Notes}, + author={DefTruth etc}, + year={2024} +} +``` ## 📖 150+ CUDA Kernels 🔥🔥 (frequently asked interview questions) ([©️back👆🏻](#contents)) **Workflow**: custom **CUDA** kernel impl -> **PyTorch** Python bindings -> Run tests. 👉TIPS: `*` = **Tensor Cores(WMMA/MMA)**, otherwise, CUDA Cores; `/` = not supported; `✔️` = supported; `❔` = in my plan.
From 19337fcce136d5b4883c56ab5f26f5531749119b Mon Sep 17 00:00:00 2001 From: DefTruth <31974251+DefTruth@users.noreply.github.com> Date: Sun, 24 Nov 2024 21:29:05 +0800 Subject: [PATCH 02/11] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 939538a4..3ed556e9 100644 --- a/README.md +++ b/README.md @@ -34,9 +34,9 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's d |✔️|✔️|✔️|✔️| |Copy Async|Tile MMA(More Threads)|Tile Warp(More Values)|Multi Stages| |✔️|✔️|✔️|✔️| -|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shfl)| +|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shuffle)| |✔️|✔️|✔️|✔️| -|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CuTe)| +|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CUTLASS CuTe)| |✔️|✔️|✔️|✔️| ## ©️Citations🎉🎉 From c6b7e88b996ae67b22c608b67f9c466ecaf85b54 Mon Sep 17 00:00:00 2001 From: DefTruth <31974251+DefTruth@users.noreply.github.com> Date: Sun, 24 Nov 2024 21:30:50 +0800 Subject: [PATCH 03/11] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 3ed556e9..ebba3538 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,7 @@ -Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./kernels/hgemm) for more details. +Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./kernels/hgemm) for more details. 
|CUDA Cores|Sliced K(Loop over K)|Tile Block|Tile Thread| |:---:|:---:|:---:|:---:| From 578d57d2dfe2686ae7629e33af21ee03007fde92 Mon Sep 17 00:00:00 2001 From: DefTruth <31974251+DefTruth@users.noreply.github.com> Date: Sun, 24 Nov 2024 21:31:29 +0800 Subject: [PATCH 04/11] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index ebba3538..36b0951b 100644 --- a/README.md +++ b/README.md @@ -36,7 +36,7 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's d |✔️|✔️|✔️|✔️| |Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shuffle)| |✔️|✔️|✔️|✔️| -|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CUTLASS CuTe)| +|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CUTLASS/CuTe)| |✔️|✔️|✔️|✔️| ## ©️Citations🎉🎉 From c9a1a88b7fc7ed2321247f9c6c4ea5c1212bad5c Mon Sep 17 00:00:00 2001 From: DefTruth <31974251+DefTruth@users.noreply.github.com> Date: Sun, 24 Nov 2024 21:32:07 +0800 Subject: [PATCH 05/11] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 36b0951b..fe14b953 100644 --- a/README.md +++ b/README.md @@ -43,7 +43,7 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's d ```BibTeX @misc{CUDA-Learn-Notes@2024, - title={CUDA-Learn-Notes: A Modern CUDA Learn Notes with PyTorch}, + title={CUDA-Learn-Notes: A Modern CUDA Learn Notes with PyTorch for Beginners}, url={https://github.com/DefTruth/CUDA-Learn-Notes}, note={Open-source software available at https://github.com/DefTruth/CUDA-Learn-Notes}, author={DefTruth etc}, From db90a1bc110f68bebbd479a45142bbe85b1d857c Mon Sep 17 00:00:00 2001 From: DefTruth <31974251+DefTruth@users.noreply.github.com> Date: Sun, 24 Nov 2024 21:34:44 +0800 Subject: [PATCH 06/11] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index fe14b953..c3fcad40 
100644 --- a/README.md +++ b/README.md @@ -36,7 +36,7 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's d |✔️|✔️|✔️|✔️| |Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shuffle)| |✔️|✔️|✔️|✔️| -|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CUTLASS/CuTe)| +|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CuTe)| |✔️|✔️|✔️|✔️| ## ©️Citations🎉🎉 From abbefe1e45ad3f5175511e30da35247edb497047 Mon Sep 17 00:00:00 2001 From: DefTruth <31974251+DefTruth@users.noreply.github.com> Date: Sun, 24 Nov 2024 21:36:07 +0800 Subject: [PATCH 07/11] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c3fcad40..457fbaa7 100644 --- a/README.md +++ b/README.md @@ -34,7 +34,7 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's d |✔️|✔️|✔️|✔️| |Copy Async|Tile MMA(More Threads)|Tile Warp(More Values)|Multi Stages| |✔️|✔️|✔️|✔️| -|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shuffle)| +|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shfl)| |✔️|✔️|✔️|✔️| |Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CuTe)| |✔️|✔️|✔️|✔️| From d56ede65491ae0b50829308bc05ab88b37682aa2 Mon Sep 17 00:00:00 2001 From: DefTruth <31974251+DefTruth@users.noreply.github.com> Date: Sun, 24 Nov 2024 21:39:37 +0800 Subject: [PATCH 08/11] Update README.md --- kernels/hgemm/README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/kernels/hgemm/README.md b/kernels/hgemm/README.md index 3816e2e2..fde995cf 100755 --- a/kernels/hgemm/README.md +++ b/kernels/hgemm/README.md @@ -166,7 +166,11 @@ python3 hgemm.py --cute-tn --mma --wmma-all --plot Tested on an NVIDIA GeForce RTX 3080 Laptop using mma4x4_warp4x4 (16 WMMA m16n16k16 ops, warp tile 64x64) together with thread block swizzle; most cases match or even exceed cuBLAS. The tests were run on Windows WSL2 + RTX 3080 Laptop. + + +![image](https://github.com/user-attachments/assets/9472e970-c083-4b31-9252-3eeecc761078)
```bash python3 hgemm.py --wmma-all --plot From c046410700ec8ade7e8a0f3ddbf725fa3911fa26 Mon Sep 17 00:00:00 2001 From: DefTruth <31974251+DefTruth@users.noreply.github.com> Date: Sun, 24 Nov 2024 21:41:39 +0800 Subject: [PATCH 09/11] Update README.md --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 457fbaa7..d20a33e2 100644 --- a/README.md +++ b/README.md @@ -21,8 +21,9 @@
- - + + +
Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./kernels/hgemm) for more details. From 6587bca4e9bdf780f2a0b14bb7407439eb2a5eb3 Mon Sep 17 00:00:00 2001 From: DefTruth <31974251+DefTruth@users.noreply.github.com> Date: Sun, 24 Nov 2024 21:44:18 +0800 Subject: [PATCH 10/11] Update README.md --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index d20a33e2..f7b5d1d9 100644 --- a/README.md +++ b/README.md @@ -21,9 +21,9 @@
- - - + + +
Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./kernels/hgemm) for more details. @@ -35,7 +35,7 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's d |✔️|✔️|✔️|✔️| |Copy Async|Tile MMA(More Threads)|Tile Warp(More Values)|Multi Stages| |✔️|✔️|✔️|✔️| -|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shfl)| +|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shuffle)| |✔️|✔️|✔️|✔️| |Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CuTe)| |✔️|✔️|✔️|✔️| From 9a74a901e8319d067b9464846e1cef4fb947eff3 Mon Sep 17 00:00:00 2001 From: DefTruth <31974251+DefTruth@users.noreply.github.com> Date: Sun, 24 Nov 2024 21:44:45 +0800 Subject: [PATCH 11/11] Update README.md --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index f7b5d1d9..167e53f1 100644 --- a/README.md +++ b/README.md @@ -21,9 +21,9 @@
- - - + + +
Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./kernels/hgemm) for more details.