From a196643bfa3a2e516f54b36e49fb99901d6fb997 Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Wed, 15 Jan 2025 11:22:57 +0000
Subject: [PATCH 1/5] Automated submodule update

---
 .dev/.gitignore           |  2 +-
 .dev/update_submodules.sh |  4 ++--
 README.md                 | 20 +++++++++++++++++++-
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/.dev/.gitignore b/.dev/.gitignore
index ea25c19b..473d6afb 100644
--- a/.dev/.gitignore
+++ b/.dev/.gitignore
@@ -22,4 +22,4 @@ bin
 *.log
 *.txt
 *.tex
-__pycache__
\ No newline at end of file
+tmp*
\ No newline at end of file
diff --git a/.dev/update_submodules.sh b/.dev/update_submodules.sh
index e71ef574..fb2798c0 100755
--- a/.dev/update_submodules.sh
+++ b/.dev/update_submodules.sh
@@ -1,8 +1,8 @@
 # update submodules
 set -x
 git submodule init
-# git submodule update --remote # update all submodules
-git submodule update --remote ffpa-attn-mma # only update ffpa-attn-mma
+git submodule update --remote # update all submodules
+# git submodule update --remote ffpa-attn-mma # only update ffpa-attn-mma
 git add .
 git commit -m "Automated submodule update"
 set +x
\ No newline at end of file
diff --git a/README.md b/README.md
index 12066614..366b79b9 100644
--- a/README.md
+++ b/README.md
@@ -357,6 +357,8 @@ The kernels listed here will guide you through a step-by-step progression, rangi
 ### 📚 Hard+ ⭐️⭐️⭐️⭐️ & Hard++ ⭐️⭐️⭐️⭐️⭐️ ([©️back👆🏻](#cuda-kernel))
 
+- 📚 FlashAttention-2 MMA (MMA Acc F32/F16, swizzle, QKV smem share, fine-grained tiling, etc.🎉)
+
 |📖 CUDA Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |
@@ -390,7 +392,23 @@ The kernels listed here will guide you through a step-by-step progression, rangi
 | ✔️ [flash_attn...tiling_qkv_swizzle{qk}{f32}*](./kernels/flash-attn/mma/swizzle/flash_attn_mma_tiling_qkv_swizzle_qk_F32F16F16F32.cu)|f16|f32|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️|
 | ✔️ [flash_attn...tiling_qkv_swizzle{qkv}{f32}*](./kernels/flash-attn/mma/swizzle/flash_attn_mma_tiling_qkv_swizzle_qkv_F32F16F16F32.cu)|f16|f32|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️|
 
-**rr**: reduced register usage (for `d>128`); **f32**: MMA accumulates in FP32 (otherwise FP16); the softmax accumulator is always FP32 for high precision; **swizzle**: currently, only smem swizzle is supported for MMA.
+NOTE: **rr**: reduced register usage (for `d>128`); **f32**: MMA accumulates in FP32 (otherwise FP16); the softmax accumulator is always FP32 for high precision; **swizzle**: currently, only smem swizzle is supported for MMA.
+
+- 📚 FFPA Attention MMA (**1.8x~3x**🎉 faster vs SDPA EA for D > 256, which FA2 does not support)
+
+|📖 CUDA Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |
+|:---|:---|:---|:---|:---|
+| ✔️ [ffpa_mma_stages_split_q_L1_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L1.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ✔️ [ffpa_mma_stages_split_q_L1_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ✔️ [ffpa_mma_stages_split_q_L1_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ? [ffpa_mma_stages_split_q_L2_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L2.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ? [ffpa_mma_stages_split_q_L2_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ? [ffpa_mma_stages_split_q_L2_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ? [ffpa_mma_stages_split_q_L3_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L3.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ? [ffpa_mma_stages_split_q_L3_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ? [ffpa_mma_stages_split_q_L3_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+
+NOTE: 🤖[ffpa-attn-mma](https://github.com/DefTruth/ffpa-attn-mma): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉 SRAM complexity for headdim > 256, **1.8x~3x**🎉 faster than SDPA EA: [📈L20 ~1.9x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-l20), [📈A30 ~1.8x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-a30), [📈3080 ~2.9x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-3080), [📈4090 ~2.1x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-4090).
 
 ## 📖 Blog Contents

From 82a99e1e04c329f54e60b7231d10983189c3791d Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Wed, 15 Jan 2025 11:27:22 +0000
Subject: [PATCH 2/5] add ffpa-attn-mma kernels to lists

---
 .github/.gitignore |  2 +-
 README.md          | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/.github/.gitignore b/.github/.gitignore
index ea25c19b..473d6afb 100644
--- a/.github/.gitignore
+++ b/.github/.gitignore
@@ -22,4 +22,4 @@ bin
 *.log
 *.txt
 *.tex
-__pycache__
\ No newline at end of file
+tmp*
\ No newline at end of file
diff --git a/README.md b/README.md
index 366b79b9..5c3b6e30 100644
--- a/README.md
+++ b/README.md
@@ -401,12 +401,12 @@
 | ✔️ [ffpa_mma_stages_split_q_L1_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L1.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
 | ✔️ [ffpa_mma_stages_split_q_L1_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
 | ✔️ [ffpa_mma_stages_split_q_L1_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
-| ? [ffpa_mma_stages_split_q_L2_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L2.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
-| ? [ffpa_mma_stages_split_q_L2_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
-| ? [ffpa_mma_stages_split_q_L2_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
-| ? [ffpa_mma_stages_split_q_L3_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L3.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
-| ? [ffpa_mma_stages_split_q_L3_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
-| ? [ffpa_mma_stages_split_q_L3_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ⚠️ [ffpa_mma_stages_split_q_L2_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L2.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ⚠️ [ffpa_mma_stages_split_q_L2_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ⚠️ [ffpa_mma_stages_split_q_L2_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L3.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ⚠️ [ffpa_mma_stages_split_q_L3_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
 
 NOTE: 🤖[ffpa-attn-mma](https://github.com/DefTruth/ffpa-attn-mma): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉 SRAM complexity for headdim > 256, **1.8x~3x**🎉 faster than SDPA EA: [📈L20 ~1.9x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-l20), [📈A30 ~1.8x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-a30), [📈3080 ~2.9x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-3080), [📈4090 ~2.1x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-4090).
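NOTE (editorial sketch, not repo code): the **f32** vs **f16** accumulator distinction in the kernel tables above maps onto two variants of the `m16n8k16` Tensor Core instruction (sm_80+). The wrapper names and fragment packing below are assumptions for illustration only; a *mixed_acc* kernel simply pairs the FP32 variant for Q@K^T, where softmax precision matters most, with the FP16 variant for P@V.

```cuda
#include <cstdint>

// D(f32) = A(f16) x B(f16) + C(f32): higher-precision accumulation ("f32").
__device__ __forceinline__ void mma_m16n8k16_acc_f32(
    const uint32_t (&a)[4], const uint32_t (&b)[2], float (&c)[4]) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
      : "+f"(c[0]), "+f"(c[1]), "+f"(c[2]), "+f"(c[3])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]));
}

// D(f16) = A(f16) x B(f16) + C(f16): half the accumulator registers and
// faster, but lower precision (the default when "f32" is absent).
__device__ __forceinline__ void mma_m16n8k16_acc_f16(
    const uint32_t (&a)[4], const uint32_t (&b)[2], uint32_t (&c)[2]) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%0,%1};\n"
      : "+r"(c[0]), "+r"(c[1])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]));
}
```

The trade-off these two variants expose is exactly the knob the F16F16F16 vs F16F16F32 rows toggle: the FP16 accumulator saves registers and throughput at the cost of accumulation error over long sequences.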
From c9155b630953bf75382ea55e05af89170903250a Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Wed, 15 Jan 2025 11:28:27 +0000
Subject: [PATCH 3/5] add ffpa-attn-mma kernels to lists

---
 README.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/README.md b/README.md
index 5c3b6e30..dd789fc5 100644
--- a/README.md
+++ b/README.md
@@ -398,15 +398,15 @@
 |📖 CUDA Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |
 |:---|:---|:---|:---|:---|
-| ✔️ [ffpa_mma_stages_split_q_L1_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L1.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
-| ✔️ [ffpa_mma_stages_split_q_L1_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
-| ✔️ [ffpa_mma_stages_split_q_L1_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
-| ⚠️ [ffpa_mma_stages_split_q_L2_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L2.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
-| ⚠️ [ffpa_mma_stages_split_q_L2_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
-| ⚠️ [ffpa_mma_stages_split_q_L2_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
-| ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L3.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
-| ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
-| ⚠️ [ffpa_mma_stages_split_q_L3_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️⭐️|
+| ✔️ [ffpa_mma_stages_split_q_L1_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L1.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
+| ✔️ [ffpa_mma_stages_split_q_L1_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
+| ✔️ [ffpa_mma_stages_split_q_L1_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
+| ⚠️ [ffpa_mma_stages_split_q_L2_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L2.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
+| ⚠️ [ffpa_mma_stages_split_q_L2_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
+| ⚠️ [ffpa_mma_stages_split_q_L2_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
+| ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F16](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L3.cu)|f16|f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
+| ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
+| ⚠️ [ffpa_mma_stages_split_q_L3_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
 
 NOTE: 🤖[ffpa-attn-mma](https://github.com/DefTruth/ffpa-attn-mma): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉 SRAM complexity for headdim > 256, **1.8x~3x**🎉 faster than SDPA EA: [📈L20 ~1.9x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-l20), [📈A30 ~1.8x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-a30), [📈3080 ~2.9x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-3080), [📈4090 ~2.1x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-4090).

From 672d39e74c299904c34edecb969134de16f2f8ed Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Wed, 15 Jan 2025 11:33:56 +0000
Subject: [PATCH 4/5] add ffpa-attn-mma kernels to lists

---
 README.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index dd789fc5..9019101c 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,7 @@
 
-📚 **Modern CUDA Learn Notes with PyTorch** for Beginners: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖150+ CUDA Kernels🔥🔥(Easy -> Hard++)](#cuda-kernel) with PyTorch bindings, [📖100+ LLM/VLM/CV/CUDA/CuTe🔥](#my-blogs-part-1) blogs, [📖toy-hgemm⚡️⚡️](./kernels/hgemm) which can achieve `98%~100%` performance of **cuBLAS**, and [📖flash-attention-mma⚡️⚡️](./kernels/flash-attn) using Tensor Cores with pure MMA PTX. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
+📚 **Modern CUDA Learn Notes with PyTorch** for Beginners: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖200+ CUDA Kernels🔥🔥(Easy -> Hard++)](#cuda-kernel) with PyTorch bindings, [📖100+ LLM/VLM/CV/CUDA/CuTe🔥](#my-blogs-part-1) blogs, [📖toy-hgemm⚡️⚡️](./kernels/hgemm) which can achieve `98%~100%` performance of **cuBLAS**, and [📖flash-attention-mma⚡️⚡️](./kernels/flash-attn) using Tensor Cores with pure MMA PTX. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
 
@@ -169,7 +169,7 @@ flash_attn_mma_stages_split_q_tiling_qkv_kernel(half* Q, half* K, half* V, half*
 }
 ```
 
-## 📖 150+ CUDA Kernels 🔥🔥 (Easy -> Hard++) ([©️back👆🏻](#contents))
+## 📖 200+ CUDA Kernels 🔥🔥 (Easy -> Hard++) ([©️back👆🏻](#contents))
 
@@ -392,7 +392,7 @@ The kernels listed here will guide you through a step-by-step progression, rangi
 | ✔️ [flash_attn...tiling_qkv_swizzle{qk}{f32}*](./kernels/flash-attn/mma/swizzle/flash_attn_mma_tiling_qkv_swizzle_qk_F32F16F16F32.cu)|f16|f32|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️|
 | ✔️ [flash_attn...tiling_qkv_swizzle{qkv}{f32}*](./kernels/flash-attn/mma/swizzle/flash_attn_mma_tiling_qkv_swizzle_qkv_F32F16F16F32.cu)|f16|f32|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️|
 
-NOTE: **rr**: reduced register usage (for `d>128`); **f32**: MMA accumulates in FP32 (otherwise FP16); the softmax accumulator is always FP32 for high precision; **swizzle**: currently, only smem swizzle is supported for MMA.
+💡NOTE: **rr**: reduced register usage (for `d>128`); **f32**: MMA accumulates in FP32 (otherwise FP16); the softmax accumulator is always FP32 for high precision; **swizzle**: currently, only smem swizzle is supported for MMA.
 
@@ -408,9 +408,9 @@
 | ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F32](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|f32|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
 | ⚠️ [ffpa_mma_stages_split_q_L3_mixed_acc](https://github.com/DefTruth/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|QK f32, PV f16|[link](https://github.com/DefTruth/ffpa-attn-mma)|⭐️⭐️⭐️⭐️|
 
-NOTE: 🤖[ffpa-attn-mma](https://github.com/DefTruth/ffpa-attn-mma): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉 SRAM complexity for headdim > 256, **1.8x~3x**🎉 faster than SDPA EA: [📈L20 ~1.9x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-l20), [📈A30 ~1.8x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-a30), [📈3080 ~2.9x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-3080), [📈4090 ~2.1x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-4090).
+💡NOTE: 🤖[ffpa-attn-mma](https://github.com/DefTruth/ffpa-attn-mma): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉 SRAM complexity for headdim > 256, **1.8x~3x**🎉 faster than SDPA EA: [📈L20 ~1.9x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-l20), [📈A30 ~1.8x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-a30), [📈3080 ~2.9x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-3080), [📈4090 ~2.1x↑🎉](https://github.com/DefTruth/ffpa-attn-mma?tab=readme-ov-file#L1-bench-4090).
 
-## 📖 Blog Contents
+## 📖 100+ LLM/VLM/CV/CUDA/CuTe Tech Blogs
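NOTE (editorial sketch, not repo code): the "O(1)🎉 SRAM complexity" claim in the FFPA note above comes from tiling Q and K along the head dimension instead of staging full `d`-length rows in shared memory. The toy kernel below shows only that idea, under heavy simplifying assumptions (scalar math instead of MMA, no online softmax, no bounds checks, `seqlen` divisible by the tile sizes, launch with `blockDim = dim3(kBc, kBr)`); the real ffpa-attn-mma kernels differ substantially.

```cuda
#include <cstddef>
#include <cuda_fp16.h>

template <int kBr, int kBc, int kTileD>
__global__ void qk_headdim_tiled_sketch(const half* Q, const half* K,
                                        float* S, int seqlen, int d) {
  // Shared-memory footprint is O(kBr*kTileD + kBc*kTileD): fixed,
  // no matter how large the head dimension d grows.
  __shared__ half q_tile[kBr][kTileD];
  __shared__ half k_tile[kBc][kTileD];
  const int row = blockIdx.y * kBr + threadIdx.y;  // query index
  const int col = blockIdx.x * kBc + threadIdx.x;  // key index
  float acc = 0.0f;
  // March over the head dimension in fixed-width slices; only one
  // slice of Q and one slice of K live in shared memory at a time.
  for (int d0 = 0; d0 < d; d0 += kTileD) {
    for (int t = threadIdx.x; t < kTileD; t += blockDim.x)
      q_tile[threadIdx.y][t] = Q[(size_t)row * d + d0 + t];
    for (int t = threadIdx.y; t < kTileD; t += blockDim.y)
      k_tile[threadIdx.x][t] = K[(size_t)col * d + d0 + t];
    __syncthreads();
    for (int t = 0; t < kTileD; ++t)
      acc += __half2float(q_tile[threadIdx.y][t]) *
             __half2float(k_tile[threadIdx.x][t]);
    __syncthreads();
  }
  S[(size_t)row * seqlen + col] = acc;  // one score of S = Q @ K^T
}
```

Because the tiles are sized by `kTileD` rather than by `d`, the smem budget no longer caps headdim; that is the D > 256 regime the FFPA tables above target.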
From 8d0e8f7c72be24d585e54fc9b1cbc46737f5dca1 Mon Sep 17 00:00:00 2001 From: DefTruth <31974251+DefTruth@users.noreply.github.com> Date: Wed, 15 Jan 2025 16:20:03 +0000 Subject: [PATCH 5/5] Automated submodule update --- ffpa-attn-mma | 2 +- third-party/.gitignore | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/ffpa-attn-mma b/ffpa-attn-mma index 6b98ccc7..de69c453 160000 --- a/ffpa-attn-mma +++ b/ffpa-attn-mma @@ -1 +1 @@ -Subproject commit 6b98ccc7d3d39043beb18ef6768ec3e8d0262756 +Subproject commit de69c4531108aecb552b25371121e1d8a1b180e8 diff --git a/third-party/.gitignore b/third-party/.gitignore index ea25c19b..473d6afb 100644 --- a/third-party/.gitignore +++ b/third-party/.gitignore @@ -22,4 +22,4 @@ bin *.log *.txt *.tex -__pycache__ \ No newline at end of file +tmp* \ No newline at end of file
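NOTE (editorial sketch, not repo code): the **swizzle** legend in the kernel tables above refers to XOR-permuting shared-memory addresses so that a warp's `ldmatrix`/MMA accesses fall into distinct banks. Below is one common mapping for rows of 64 halves (128 bytes); the swizzle functions actually used in kernels/flash-attn and ffpa-attn-mma may differ.

```cuda
// Logical (row, col) -> swizzled linear offset into a [rows][64] half tile.
// Each 8-half chunk is 16 bytes (one ldmatrix fragment per thread); XORing
// the chunk index with the low 3 bits of the row rotates chunks across
// banks, so column-strided MMA accesses stop hitting the same bank.
__device__ __forceinline__ int swizzle_offset_64h(int row, int col) {
  const int chunk = (col >> 3) ^ (row & 7);    // permuted 16-byte chunk
  return row * 64 + (chunk << 3) + (col & 7);  // offset in half elements
}
```

Stores and loads must apply the same mapping, e.g. `smem[swizzle_offset_64h(r, c)] = val;` when staging a tile, and the identical index computation when feeding `ldmatrix`.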