[ET-VK] Fused RMSNorm operator to fix fp16 overflow by SS-JIA · Pull Request #18772 · pytorch/executorch

SS-JIA · 2026-04-08T14:08:39Z

Stack from ghstack (oldest at bottom):

Adds a fused RMSNorm operator that performs squaring, mean, rsqrt, and
weight scaling in a single shader dispatch. All accumulation is done
in fp32 regardless of input dtype. This prevents the fp16 overflow
that occurs in the squaring step when residual stream values exceed
sqrt(65504) ≈ 256 in magnitude (65504 is the largest finite fp16 value).
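
The overflow described above can be reproduced outside the shader. The following is a minimal NumPy sketch, not the Vulkan implementation: the function names `rms_norm_naive_fp16` and `rms_norm_fp32_accum`, the `eps` value, and the test input of 300.0 are all assumptions chosen for illustration.

```python
import numpy as np


def rms_norm_naive_fp16(x, weight, eps=1e-6):
    """Hypothetical unfused path: every intermediate stays in fp16."""
    with np.errstate(over="ignore"):
        sq = x * x  # overflows to inf once |x| > ~256
    mean_sq = np.mean(sq, axis=-1, keepdims=True)
    inv_rms = 1.0 / np.sqrt(mean_sq + eps)  # 1 / sqrt(inf) == 0
    return x * inv_rms * weight


def rms_norm_fp32_accum(x, weight, eps=1e-6):
    """Hypothetical fused-style path: accumulate in fp32, cast back at the end."""
    x32 = x.astype(np.float32)
    mean_sq = np.mean(x32 * x32, axis=-1, keepdims=True)
    inv_rms = 1.0 / np.sqrt(mean_sq + eps)
    out32 = x32 * inv_rms * weight.astype(np.float32)
    return out32.astype(x.dtype)


# 300^2 = 90000, which is above the fp16 maximum of 65504.
x = np.full((1, 8), 300.0, dtype=np.float16)
w = np.ones(8, dtype=np.float16)

with np.errstate(over="ignore"):
    naive_sq = x * x
naive_out = rms_norm_naive_fp16(x, w)
fused_out = rms_norm_fp32_accum(x, w)
```

With these inputs the fp16 squares become `inf`, so the naive path collapses the normalized output to zero, while the fp32-accumulating path returns values close to 1.0 and still hands back an fp16 array.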

The Python reference impl (rms_norm_impl) must preserve the input
dtype. Otherwise, PyTorch type promotion would produce fp32 output
from fp16 inputs, and the FusePatternsPass re-trace would propagate
that incorrect dtype through the graph.
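
The dtype concern can be sketched in a few lines of PyTorch. This is an illustrative stand-in, not the PR's `rms_norm_impl`: the name `rms_norm_ref`, its signature, and the `eps` default are assumptions. It shows one way promotion can leak in (mixing an fp16 tensor with an fp32 intermediate) and an explicit cast back to the input dtype.

```python
import torch


def rms_norm_ref(x, weight, eps=1e-6):
    """Hypothetical reference: compute in fp32, return in the input dtype."""
    x32 = x.to(torch.float32)
    inv_rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + eps)
    out32 = x32 * inv_rms * weight.to(torch.float32)
    # Without this cast the result would stay fp32 even for fp16 inputs.
    return out32.to(x.dtype)


x = torch.randn(2, 8).to(torch.float16)
w = torch.ones(8, dtype=torch.float16)

# Mixing an fp16 tensor with an fp32 tensor promotes the result to fp32.
fp32_stat = x.to(torch.float32).pow(2).mean(dim=-1, keepdim=True)
promoted = x * fp32_stat

out = rms_norm_ref(x, w)
```

Here `promoted.dtype` is `torch.float32`, whereas `out.dtype` matches the fp16 input, which is the property a re-trace would need in order to record the correct output dtype.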

Authored by Claude.

Differential Revision: D99841211

pytorch-bot · 2026-04-08T14:08:43Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18772

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 9864405 with merge base 4afd7f9:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest / windows / windows-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / unittest-editable / windows / windows-job (gh) (trunk failure)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 364237333 · Pull Request resolved: #18772

github-actions · 2026-04-08T14:10:10Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track of your work and include it in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Pull Request resolved: #18772 · ghstack-source-id: 364280899 · @exported-using-ghexport

Pull Request resolved: #18772 · ghstack-source-id: 364514329 · @exported-using-ghexport

This was referenced Apr 8, 2026

[ET-VK] Fix force_fp16 texture bias being silently rejected for CONTIGUOUS_ANY ops #18770

Merged

[ET-VK] Deduplicate transition clone nodes in TagMemoryMetaPass #18771

Merged

meta-cla Bot added the CLA Signed label Apr 8, 2026

meta-codesync Bot added fb-exported meta-exported labels Apr 8, 2026

manuelcandales approved these changes Apr 8, 2026

meta-codesync Bot merged commit 76a5a52 into gh/SS-JIA/518/base Apr 9, 2026
159 of 164 checks passed

meta-codesync Bot deleted the gh/SS-JIA/518/head branch April 9, 2026 01:42

meta-codesync Bot had a problem deploying to cherry-pick-bot April 9, 2026 01:42 Failure

SS-JIA mentioned this pull request Apr 9, 2026

[ET-VK] Fused RMSNorm operator to fix fp16 overflow #18786

Merged

meta-codesync[bot] merged 3 commits into gh/SS-JIA/518/base from gh/SS-JIA/518/head
