From d3a86b551d3f1f5f4ed961fac80175a4115549f4 Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Fri, 25 Oct 2024 14:13:30 +0800
Subject: [PATCH 1/3] Update hgemm_mma_stage.cu

---
 hgemm/hgemm_mma_stage.cu | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hgemm/hgemm_mma_stage.cu b/hgemm/hgemm_mma_stage.cu
index 89bd60e2..8fb5e1ef 100644
--- a/hgemm/hgemm_mma_stage.cu
+++ b/hgemm/hgemm_mma_stage.cu
@@ -1015,6 +1015,7 @@ hgemm_mma_m16n8k16_mma2x4_warp4x4x2_stages_dsmem_kernel(
     }
   }
 
+  // collective store with reg reuse & warp shuffle
   for (int i = 0; i < WARP_TILE_M; ++i) {
     // reuse RA[2][4][4] reg here, this may boost 0.3~0.5 TFLOPS up.
     // may not put 'if' in N loop, it will crash the 'pragma unroll' hint ?

From d47b2278248a5a4549940d41f528a802b630bc98 Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Fri, 25 Oct 2024 14:15:22 +0800
Subject: [PATCH 2/3] Update README.md

---
 hgemm/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hgemm/README.md b/hgemm/README.md
index 8f381423..85605e8d 100755
--- a/hgemm/README.md
+++ b/hgemm/README.md
@@ -33,7 +33,7 @@
 
 - NVIDIA L20
 
-The current best implementation: on the L20 (theoretical Tensor Cores FP16 throughput of 119.5 TFLOPS), the WMMA API reaches roughly 95%~98% of cuBLAS performance (105-113 TFLOPS vs 105-115 TFLOPS), and the MMA API reaches 115 TFLOPS, surpassing cuBLAS in some cases. A known issue is that bank conflicts are not fully eliminated; the current padding-based mitigation wastes shared memory and also hurts SM occupancy. Moreover, a hand-written smem swizzle has not been implemented yet (limited by the flexibility of the WMMA API and the row-major layout); a later attempt will implement smem swizzle via MMA PTX and a col-major layout. [Click to view the performance data](#NV-L20).
+The current best implementation: on the L20 (theoretical Tensor Cores FP16 throughput of 119.5 TFLOPS), the WMMA API reaches roughly 95%~98% of cuBLAS performance (105-113 TFLOPS vs 105-115 TFLOPS), and the MMA API reaches 115 TFLOPS, surpassing cuBLAS in some cases. A known issue is that bank conflicts are not fully eliminated; the current padding-based mitigation wastes shared memory and also hurts SM occupancy. Moreover, a hand-written smem swizzle/permute has not been implemented yet (limited by the flexibility of the WMMA API and the row-major layout); a later attempt will implement smem swizzle/permute via MMA PTX and a col-major layout. [Click to view the performance data](#NV-L20).
 
 - NVIDIA GeForce RTX 3080 Laptop

From e9f01a1ba5fd7305b83c7f68160d1cc308362263 Mon Sep 17 00:00:00 2001
From: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date: Fri, 25 Oct 2024 16:50:24 +0800
Subject: [PATCH 3/3] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 5363c748..36fd0f3a 100644
--- a/README.md
+++ b/README.md
@@ -159,7 +159,7 @@
 | ✔️ [hgemv_k16_f16](./hgemv/hgemv.cu)|f16|f16|[link](./hgemv/)|⭐️⭐️⭐️|
 | ✔️ [flash_attn_1_fwd_f32](./flash-attn/flash_attn.cu)|f32|f32|[link](./flash-attn)|⭐️⭐️⭐️|
 | ✔️ [flash_attn_2_fwd_f16_m16n8k16*](./flash-attn/flash_attn_mma.cu)|f16|f16|[link](./flash-attn)|⭐️⭐️⭐️|
-| ✔️ [hard_nms cpp only](./nms/nms.cc)|f32|/|/|⭐️|
+| ✔️ [nms_kernel](./nms/nms.cu)|f32|/|[link](./nms)|⭐️⭐️|
 | ✔️ [notes v1(deprecated)](./notes-v1.cu)|f32|f32|/|⭐️|
 
 👉TIPS: * means using **Tensor Cores(MMA/WMMA)**, otherwise, using CUDA Cores by default.