diff --git a/README.md b/README.md
index 457ba38b..167e53f1 100644
--- a/README.md
+++ b/README.md
@@ -21,11 +21,12 @@
-Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./kernels/hgemm) for more details.
+Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./kernels/hgemm) for more details.
|CUDA Cores|Sliced K(Loop over K)|Tile Block|Tile Thread|
|:---:|:---:|:---:|:---:|
@@ -34,56 +35,22 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's d
|✔️|✔️|✔️|✔️|
|Copy Async|Tile MMA(More Threads)|Tile Warp(More Values)|Multi Stages|
|✔️|✔️|✔️|✔️|
-|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shfl)|
+|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store(Warp Shuffle)|
|✔️|✔️|✔️|✔️|
|Row Major(NN)|Col Major(TN)|SGEMM TF32|SMEM Swizzle(CuTe)|
|✔️|✔️|✔️|✔️|
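The Sliced K and Tile Block techniques in the table can be illustrated host-side. The following is a minimal NumPy sketch of a block-tiled GEMM that accumulates the output tile over K in slices, the same loop structure a CUDA thread block uses with shared-memory tiles; it is illustrative only, not the repo's CUDA implementation:

```python
import numpy as np

def tiled_gemm(A, B, BM=32, BN=32, BK=8):
    """Sliced-K, block-tiled GEMM: each (BM, BN) tile of C accumulates
    partial products over K in BK-sized slices, mirroring a thread block
    looping over K with shared-memory tiles."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, BM):           # tile block over rows of C
        for j in range(0, N, BN):       # tile block over cols of C
            acc = np.zeros((min(BM, M - i), min(BN, N - j)), dtype=A.dtype)
            for k in range(0, K, BK):   # sliced K: loop over K in BK chunks
                acc += A[i:i+BM, k:k+BK] @ B[k:k+BK, j:j+BN]
            C[i:i+BM, j:j+BN] = acc
    return C
```

The remaining rows of the table (copy async, double buffering, swizzling) optimize how those tiles move through the memory hierarchy without changing this loop structure.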
-
-
-
-
-
+## ©️Citations🎉🎉
+
+```BibTeX
+@misc{CUDA-Learn-Notes2024,
+ title={CUDA-Learn-Notes: A Modern CUDA Learn Notes with PyTorch for Beginners},
+ url={https://github.com/DefTruth/CUDA-Learn-Notes},
+ note={Open-source software available at https://github.com/DefTruth/CUDA-Learn-Notes},
+ author={DefTruth et al.},
+ year={2024}
+}
+```
## 📖 150+ CUDA Kernels 🔥🔥 (Common Interview Questions) ([©️back👆🏻](#contents))
**Workflow**: custom **CUDA** kernel impl -> **PyTorch** Python bindings -> Run tests. 👉TIPS: `*` = **Tensor Cores(WMMA/MMA)**, otherwise, CUDA Cores; `/` = not supported; `✔️` = supported; `❔` = in my plan.
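The last step of that workflow, running tests, typically means comparing the custom kernel's output against a framework reference at fp16-appropriate tolerances. A hedged, NumPy-only sketch of such a check (the function names are hypothetical, not this repo's API):

```python
import numpy as np

def reference_hgemm(a_fp16, b_fp16):
    # Mirror Tensor Core HGEMM semantics: fp16 inputs, fp32 accumulation.
    return (a_fp16.astype(np.float32) @ b_fp16.astype(np.float32)).astype(np.float16)

def check_against_reference(custom_out, a_fp16, b_fp16, atol=1e-2):
    # fp16 tolerances are loose; an absolute bound around 1e-2 is typical.
    ref = reference_hgemm(a_fp16, b_fp16)
    return np.allclose(custom_out.astype(np.float32),
                       ref.astype(np.float32), atol=atol)
```

In the real workflow, `custom_out` would come from the PyTorch binding of the CUDA kernel under test rather than from the reference itself.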
diff --git a/kernels/hgemm/README.md b/kernels/hgemm/README.md
index 3816e2e2..fde995cf 100755
--- a/kernels/hgemm/README.md
+++ b/kernels/hgemm/README.md
@@ -166,7 +166,11 @@ python3 hgemm.py --cute-tn --mma --wmma-all --plot
Tested on an NVIDIA GeForce RTX 3080 Laptop under Windows WSL2, using mma4x4_warp4x4 (16 WMMA m16n16k16 ops, warp tile 64x64) plus thread block swizzle; most cases match or even exceed cuBLAS.
+
+
+
```bash
python3 hgemm.py --wmma-all --plot