From 3dda091e9c5aac19cc0a17d53cae2a4e7d073fdb Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Fri, 10 Jan 2025 09:22:30 +0800
Subject: [PATCH 1/4] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index ad9db14c..3b218791 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@
 ## 📖 News 🔥🔥
 <div id="news"></div>  
 
-- [2025-01-08]: [📚Fully QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[cuffpa-py](https://github.com/DefTruth/cuffpa-py): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, ~1.5x🎉faster vs SDPA EA.
+- [2025-01-08]: [📚QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[cuffpa-py](https://github.com/DefTruth/cuffpa-py): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, 1.5x~2x🎉faster vs SDPA EA.
 - [2024-12-02]: HGEMM MMA kernels has been refactored into 🤖[cuhgemm-py](https://github.com/DefTruth/cuhgemm-py): ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, achieve peak⚡️ performance.
 
 <!--

From a8716e2c66e78c71e646b9ec72def4cc6997c692 Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Fri, 10 Jan 2025 09:23:00 +0800
Subject: [PATCH 2/4] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 3b218791..5e8b3c29 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@
 ## 📖 News 🔥🔥
 <div id="news"></div>  
 
-- [2025-01-08]: [📚QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[cuffpa-py](https://github.com/DefTruth/cuffpa-py): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, 1.5x~2x🎉faster vs SDPA EA.
+- [2025-01-08]: [📚QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[cuffpa-py](https://github.com/DefTruth/cuffpa-py): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, **1.5x~2x**🎉faster vs SDPA EA.
 - [2024-12-02]: HGEMM MMA kernels has been refactored into 🤖[cuhgemm-py](https://github.com/DefTruth/cuhgemm-py): ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, achieve peak⚡️ performance.
 
 <!--

From b8d9baa8f281ea7f7c44f9df403171e6b4f5ebb3 Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Fri, 10 Jan 2025 09:47:21 +0800
Subject: [PATCH 3/4] Update README.md

---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 5e8b3c29..d3d414d1 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,8 @@
 ## 📖 News 🔥🔥
 <div id="news"></div>  
 
-- [2025-01-08]: [📚QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[cuffpa-py](https://github.com/DefTruth/cuffpa-py): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, **1.5x~2x**🎉faster vs SDPA EA.
+- [2025-01-08]: [📚QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[cuffpa-py](https://github.com/DefTruth/cuffpa-py): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, **1.5x~2x**🎉faster than SDPA EA: [📈L20 ~1.7x↑🎉](https://github.com/DefTruth/cuffpa-py?tab=readme-ov-file#L1-bench), [📈A30 ~1.5x↑🎉](https://github.com/DefTruth/cuffpa-py?tab=readme-ov-file#L1-bench), [📈3080 ~2.5x↑🎉](https://github.com/DefTruth/cuffpa-py?tab=readme-ov-file#L1-bench), [📈4090 ~1.8x↑🎉](https://github.com/DefTruth/cuffpa-py?tab=readme-ov-file#L1-bench).  
+
 - [2024-12-02]: HGEMM MMA kernels has been refactored into 🤖[cuhgemm-py](https://github.com/DefTruth/cuhgemm-py): ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, achieve peak⚡️ performance.
 
 <!--

From 6a51680e0145111f6621947c6e5636f3f175a6de Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Fri, 10 Jan 2025 09:48:57 +0800
Subject: [PATCH 4/4] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d3d414d1..523e2524 100644
--- a/README.md
+++ b/README.md
@@ -45,7 +45,7 @@
 </div> 
 
 
-Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores algorithm, the `HGEMM (WMMA/MMA/CuTe)` in this repo (`blue`🔵) can achieve `98%~100%` of its (`orange`🟠) performance. Please check [toy-hgemm library⚡️⚡️](./kernels/hgemm) or [hgemm-tensorcores-mma⚡️⚡️](https://github.com/DefTruth/hgemm-tensorcores-mma) repo for more details.
+Currently, on NVIDIA L20, RTX 4090, and RTX 3080 Laptop GPUs, the `HGEMM (WMMA/MMA/CuTe)` kernels in this repo (`blue`🔵) achieve `98%~100%` of the performance of cuBLAS's default Tensor Cores algorithm (`orange`🟠). See the [toy-hgemm library⚡️⚡️](./kernels/hgemm) or the [cuhgemm-py⚡️⚡️](https://github.com/DefTruth/cuhgemm-py) repo for more details.
 
 ![toy-hgemm-library](https://github.com/user-attachments/assets/962bda14-b494-4423-b8eb-775da9f5503d)