From d86f8c723b71c7135f2e23c4090454423344fda8 Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Sun, 24 Nov 2024 21:19:25 +0800
Subject: [PATCH 01/11] Update README.md
---
README.md | 56 +++++++++++--------------------------------------------
1 file changed, 11 insertions(+), 45 deletions(-)
diff --git a/README.md b/README.md
index 457ba38b..939538a4 100644
--- a/README.md
+++ b/README.md
@@ -39,51 +39,17 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's d
|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CuTe)|
|✔️|✔️|✔️|✔️|
-
-
-
-
-
+## ©️Citations🎉🎉
+
+```BibTeX
+@misc{CUDA-Learn-Notes@2024,
+ title={CUDA-Learn-Notes: A Modern CUDA Learn Notes with PyTorch},
+ url={https://github.com/DefTruth/CUDA-Learn-Notes},
+ note={Open-source software available at https://github.com/DefTruth/CUDA-Learn-Notes},
+ author={DefTruth etc},
+ year={2024}
+}
+```
## 📖 150+ CUDA Kernels 🔥🔥 (frequently asked interview questions) ([©️back👆🏻](#contents))
**Workflow**: custom **CUDA** kernel impl -> **PyTorch** Python bindings -> Run tests. 👉TIPS: `*` = **Tensor Cores(WMMA/MMA)**, otherwise, CUDA Cores; `/` = not supported; `✔️` = supported; `❔` = in my plan.
From 19337fcce136d5b4883c56ab5f26f5531749119b Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Sun, 24 Nov 2024 21:29:05 +0800
Subject: [PATCH 02/11] Update README.md
---
README.md | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index 939538a4..3ed556e9 100644
--- a/README.md
+++ b/README.md
@@ -34,9 +34,9 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's d
|✔️|✔️|✔️|✔️|
|Copy Async|Tile MMA(More Threads)|Tile Warp(More Values)|Multi Stages|
|✔️|✔️|✔️|✔️|
-|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shfl)|
+|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shuffle)|
|✔️|✔️|✔️|✔️|
-|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CuTe)|
+|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CUTLASS CuTe)|
|✔️|✔️|✔️|✔️|
## ©️Citations🎉🎉
From c6b7e88b996ae67b22c608b67f9c466ecaf85b54 Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Sun, 24 Nov 2024 21:30:50 +0800
Subject: [PATCH 03/11] Update README.md
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 3ed556e9..ebba3538 100644
--- a/README.md
+++ b/README.md
@@ -25,7 +25,7 @@
-Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./kernels/hgemm) for more details.
+Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./kernels/hgemm) for more details.
|CUDA Cores|Sliced K(Loop over K)|Tile Block|Tile Thread|
|:---:|:---:|:---:|:---:|
From 578d57d2dfe2686ae7629e33af21ee03007fde92 Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Sun, 24 Nov 2024 21:31:29 +0800
Subject: [PATCH 04/11] Update README.md
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index ebba3538..36b0951b 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,7 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's d
|✔️|✔️|✔️|✔️|
|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shuffle)|
|✔️|✔️|✔️|✔️|
-|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CUTLASS CuTe)|
+|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CUTLASS/CuTe)|
|✔️|✔️|✔️|✔️|
## ©️Citations🎉🎉
From c9a1a88b7fc7ed2321247f9c6c4ea5c1212bad5c Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Sun, 24 Nov 2024 21:32:07 +0800
Subject: [PATCH 05/11] Update README.md
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 36b0951b..fe14b953 100644
--- a/README.md
+++ b/README.md
@@ -43,7 +43,7 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's d
```BibTeX
@misc{CUDA-Learn-Notes@2024,
- title={CUDA-Learn-Notes: A Modern CUDA Learn Notes with PyTorch},
+ title={CUDA-Learn-Notes: A Modern CUDA Learn Notes with PyTorch for Beginners},
url={https://github.com/DefTruth/CUDA-Learn-Notes},
note={Open-source software available at https://github.com/DefTruth/CUDA-Learn-Notes},
author={DefTruth etc},
From db90a1bc110f68bebbd479a45142bbe85b1d857c Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Sun, 24 Nov 2024 21:34:44 +0800
Subject: [PATCH 06/11] Update README.md
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index fe14b953..c3fcad40 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,7 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's d
|✔️|✔️|✔️|✔️|
|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shuffle)|
|✔️|✔️|✔️|✔️|
-|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CUTLASS/CuTe)|
+|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CuTe)|
|✔️|✔️|✔️|✔️|
## ©️Citations🎉🎉
From abbefe1e45ad3f5175511e30da35247edb497047 Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Sun, 24 Nov 2024 21:36:07 +0800
Subject: [PATCH 07/11] Update README.md
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index c3fcad40..457fbaa7 100644
--- a/README.md
+++ b/README.md
@@ -34,7 +34,7 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's d
|✔️|✔️|✔️|✔️|
|Copy Async|Tile MMA(More Threads)|Tile Warp(More Values)|Multi Stages|
|✔️|✔️|✔️|✔️|
-|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shuffle)|
+|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shfl)|
|✔️|✔️|✔️|✔️|
|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CuTe)|
|✔️|✔️|✔️|✔️|
From d56ede65491ae0b50829308bc05ab88b37682aa2 Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Sun, 24 Nov 2024 21:39:37 +0800
Subject: [PATCH 08/11] Update README.md
---
kernels/hgemm/README.md | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernels/hgemm/README.md b/kernels/hgemm/README.md
index 3816e2e2..fde995cf 100755
--- a/kernels/hgemm/README.md
+++ b/kernels/hgemm/README.md
@@ -166,7 +166,11 @@ python3 hgemm.py --cute-tn --mma --wmma-all --plot
Tested on an NVIDIA GeForce RTX 3080 Laptop using mma4x4_warp4x4 (16 WMMA m16n16k16 ops, warp tile 64x64) together with thread block swizzle; most cases match or even exceed cuBLAS. The tests were run under Windows WSL2 + RTX 3080 Laptop.
+
+
+
```bash
python3 hgemm.py --wmma-all --plot
From c046410700ec8ade7e8a0f3ddbf725fa3911fa26 Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Sun, 24 Nov 2024 21:41:39 +0800
Subject: [PATCH 09/11] Update README.md
---
README.md | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index 457fbaa7..d20a33e2 100644
--- a/README.md
+++ b/README.md
@@ -21,8 +21,9 @@