Merged
5 changes: 3 additions & 2 deletions README.md
@@ -18,7 +18,8 @@
## 📖 News 🔥🔥
<div id="news"></div>

-- [2025-01-08]: [📚Fully QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[cuffpa-py](https://github.com/DefTruth/cuffpa-py): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, ~1.5x🎉faster vs SDPA EA.
+- [2025-01-08]: [📚QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[cuffpa-py](https://github.com/DefTruth/cuffpa-py): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, **1.5x~2x**🎉faster than SDPA EA: [📈L20 ~1.7x↑🎉](https://github.com/DefTruth/cuffpa-py?tab=readme-ov-file#L1-bench), [📈 A30 ~1.5x↑🎉](https://github.com/DefTruth/cuffpa-py?tab=readme-ov-file#L1-bench), [📈3080 ~2.5x↑🎉](https://github.com/DefTruth/cuffpa-py?tab=readme-ov-file#L1-bench), [📈4090 ~1.8x↑🎉](https://github.com/DefTruth/cuffpa-py?tab=readme-ov-file#L1-bench).
+
- [2024-12-02]: HGEMM MMA kernels has been refactored into 🤖[cuhgemm-py](https://github.com/DefTruth/cuhgemm-py): ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, achieve peak⚡️ performance.

<!--
@@ -44,7 +45,7 @@
</div>


-Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores algorithm, the `HGEMM (WMMA/MMA/CuTe)` in this repo (`blue`🔵) can achieve `98%~100%` of its (`orange`🟠) performance. Please check [toy-hgemm library⚡️⚡️](./kernels/hgemm) or [hgemm-tensorcores-mma⚡️⚡️](https://github.com/DefTruth/hgemm-tensorcores-mma) repo for more details.
+Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores algorithm, the `HGEMM (WMMA/MMA/CuTe)` in this repo (`blue`🔵) can achieve `98%~100%` of its (`orange`🟠) performance. Please check [toy-hgemm library⚡️⚡️](./kernels/hgemm) or [cuhgemm-py⚡️⚡️](https://github.com/DefTruth/cuhgemm-py) repo for more details.

![toy-hgemm-library](https://github.com/user-attachments/assets/962bda14-b494-4423-b8eb-775da9f5503d)
