[cuda] Add new gamma beta backwards kernel #147773

ahmadsharif1 · 2025-02-24T22:21:17Z

Context:
Prior to this PR we had 3 non-ROCm CUDA kernels to handle the GammaBeta backwards pass:

One for small M
A faster 32x32 kernel for shapes where both M and N are divisible by 32
One for all other cases

This approach had several weaknesses:

In the non-32x32 case, performance was slow because the kernel did not use warp shuffles
For small M, loads were not coalesced, so performance was poor (though total runtime is quite small in those cases, so perhaps it doesn't matter much)
For large M and small N, we used only a few of the GPU's SMs because we exploited parallelism only in the N dimension, not in the M dimension
We had to maintain 3 different kernels.
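
To make the large-M, small-N weakness concrete, here is a rough back-of-the-envelope sketch in Python. The tile sizes (`cols_per_block`, `rows_per_block`) and the SM count are assumptions chosen for illustration, not values taken from the kernels in this PR, and the occupancy estimate naively assumes one resident block per SM.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def blocks_n_only(n, cols_per_block=32):
    """Thread blocks launched when the grid tiles only the N dimension."""
    return ceil_div(n, cols_per_block)


def blocks_m_and_n(m, n, rows_per_block, cols_per_block=32):
    """Thread blocks launched when the grid tiles both M and N."""
    return ceil_div(m, rows_per_block) * ceil_div(n, cols_per_block)


def busy_sm_fraction(num_blocks, num_sms):
    """Naive estimate: fraction of SMs that receive at least one block."""
    return min(num_blocks, num_sms) / num_sms


if __name__ == "__main__":
    M, N = 7_000_000, 32   # shape mentioned in the PR description
    NUM_SMS = 132          # assumed SM count; query the device for the real value
    ROWS_PER_BLOCK = 4096  # hypothetical tile height

    n_only = blocks_n_only(N)
    both = blocks_m_and_n(M, N, ROWS_PER_BLOCK)
    print("N-only grid:", n_only, "blocks,", busy_sm_fraction(n_only, NUM_SMS))
    print("M-and-N grid:", both, "blocks,", busy_sm_fraction(both, NUM_SMS))
```

Under these assumptions a grid that tiles only N has a single block to schedule for N=32, while a grid that also tiles M has far more blocks than SMs.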

This PR:

Adds a single templatized kernel that can technically replace all 3 kernels with equal or better performance. The only reason I did not remove the simple kernel is that the USE_ROCM path uses it, and I couldn't test my kernel with USE_ROCM.
Depending on template parameters, this kernel can either fully reduce the grad values or partially reduce them. In the partial reduction case, a second kernel is needed to fully reduce them.
For the large M and small N case, we can launch the partial reduction kernel followed by a .sum() to do the full reduction. The advantage is that the partial reduction can utilize all SMs on the GPU because we parallelize across the M dimension. This can lead to large performance gains: for instance, I saw a 10x+ improvement for M=7e6 and N=32 (a shape from a real model).
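
The partial-then-full scheme described above can be sketched on the CPU in plain Python. This is only a reference for the arithmetic, not the CUDA kernel from this PR: the function names and the `rows_per_chunk` parameter are hypothetical, and the formulas are the standard LayerNorm weight and bias gradients, dgamma[j] = sum over i of dY[i][j] * (X[i][j] - mean[i]) * rstd[i] and dbeta[j] = sum over i of dY[i][j].

```python
def gamma_beta_backward_full(dY, X, mean, rstd):
    """Full reduction: sum over all M rows for every one of the N columns."""
    n_cols = len(X[0])
    dgamma = [0.0] * n_cols
    dbeta = [0.0] * n_cols
    for i in range(len(X)):
        for j in range(n_cols):
            dgamma[j] += dY[i][j] * (X[i][j] - mean[i]) * rstd[i]
            dbeta[j] += dY[i][j]
    return dgamma, dbeta


def gamma_beta_backward_partial(dY, X, mean, rstd, rows_per_chunk):
    """Stage 1: every chunk of rows yields its own (dgamma, dbeta) partial sums.

    The chunks are independent of each other, which is what lets a GPU
    version spread them across SMs along the M dimension.
    """
    partials = []
    for start in range(0, len(X), rows_per_chunk):
        end = start + rows_per_chunk  # slicing clamps this at len(X)
        partials.append(
            gamma_beta_backward_full(
                dY[start:end], X[start:end], mean[start:end], rstd[start:end]
            )
        )
    return partials


def finish_reduction(partials):
    """Stage 2: sum the partials over the chunk axis (the role of .sum())."""
    n_cols = len(partials[0][0])
    dgamma = [sum(p[0][j] for p in partials) for j in range(n_cols)]
    dbeta = [sum(p[1][j] for p in partials) for j in range(n_cols)]
    return dgamma, dbeta
```

With exact arithmetic the two paths give the same result. In floating point they can differ slightly, because the two-stage version adds the terms in a different order.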

Full performance results are shown below on my H100:

This reverts commit bb59712.


pytorch-bot · 2025-02-24T22:21:22Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147773

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 15 New Failures

As of commit a634482 with merge base ffa19b9:

NEW FAILURES - The following jobs have failed:

Lint / lintrunner-noclang / linux-job (gh)
>>> Lint for test/test_nn.py:
pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 1, 3, linux.g4dn.12xlarge.nvidia.gpu) (gh)
distributed/fsdp/test_fsdp_clip_grad_norm.py::TestClipGradNormCUDA::test_ddp_parity_cuda
pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 3, 3, linux.g4dn.12xlarge.nvidia.gpu) (gh)
distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py::TestClipGradNormWorldSize2::test_clip_grad_norm_1d
pull / linux-focal-cuda12.4-py3.10-gcc9 / test (default, 1, 5, linux.4xlarge.nvidia.gpu) (gh)
test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda
pull / linux-focal-cuda12.4-py3.10-gcc9 / test (default, 3, 5, linux.4xlarge.nvidia.gpu) (gh)
test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_LayerNorm_cuda_float32
pull / linux-focal-cuda12.4-py3.10-gcc9 / test (default, 5, 5, linux.4xlarge.nvidia.gpu) (gh)
test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_TransformerDecoderLayer_cuda_float64
pull / linux-focal-cuda12.4-py3.10-gcc9-sm89 / test (default, 3, 5, linux.g6.4xlarge.experimental.nvidia.gpu) (gh)
test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_LayerNorm_cuda_float32
pull / linux-focal-py3.13-clang10 / test (crossref, 1, 2, linux.2xlarge) (gh)
test_nn.py::TestNN::test_layer_norm_backwards_eps
pull / linux-focal-py3.13-clang10 / test (default, 1, 5, linux.4xlarge) (gh)
test_nn.py::TestNN::test_layer_norm_backwards_eps
pull / linux-focal-py3.13-clang10 / test (dynamo_wrapped, 2, 3, linux.2xlarge) (gh)
test_nn.py::TestNN::test_layer_norm_backwards_eps
pull / linux-focal-py3.9-clang10 / test (crossref, 1, 2, linux.2xlarge) (gh)
test_nn.py::TestNN::test_layer_norm_backwards_eps
pull / linux-focal-py3.9-clang10 / test (default, 1, 5, linux.4xlarge) (gh)
test_nn.py::TestNN::test_layer_norm_backwards_eps
pull / linux-focal-py3.9-clang10 / test (dynamo_wrapped, 1, 3, linux.2xlarge) (gh)
test_nn.py::TestNN::test_layer_norm_backwards_eps
pull / linux-jammy-py3.10-clang15-asan / test (default, 1, 6, linux.4xlarge) (gh)
test_nn.py::TestNN::test_layer_norm_backwards_eps
pull / linux-jammy-py3.9-gcc11 / test (default, 1, 5, linux.2xlarge) (gh)
test_nn.py::TestNN::test_layer_norm_backwards_eps

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2025-04-26T02:52:46Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

ahmadsharif1 added 5 commits February 24, 2025 12:45

Ad[cuda] ded a correctness test for layernorm backwards

bb59712

Revert "Ad[cuda] ded a correctness test for layernorm backwards"

cb5730f

This reverts commit bb59712.

v<Replace this line with a title. Use 1 line only, 67

d06cb58

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

.

aeefa80

New gamma beta backwards kernel

b3a5ca1

pytorch-bot bot added the release notes: nn release notes category label Feb 24, 2025

.

a634482

github-actions bot added the Stale label Apr 26, 2025

github-actions bot closed this May 26, 2025
