[inductor] generate fused rms/layer norm bwd #165370

shunting314 · 2025-10-13T21:48:23Z

Stack from ghstack (oldest at bottom):

RMS/Layer norm backward would generated 2 kind of reductions:

the reduction computing dx which reduce across the hidden dimension (in the context of transformer)
the reduction computing dw/db which reduce across the BxT (batch size , sequence length) dimension.

These 2 set of reductions have common input buffers but inductor can not fuse them because of different loop orders.

There are multiple sources of custom kernels that implement fused version of such kernel (Liger-Kernel, quack, Paul Zhang's internal post). This PR enable Inductor to generate such kernels automatically.

The generated kernel is very similar to https://github.com/linkedin/Liger-Kernel/blob/33924d20b6fa5dfc9391e4630ae82cc810114a30/src/liger_kernel/ops/rms_norm.py#L114 .

To make the implementation simple and performing, we enable such fusion only if the inner reduction (computing dx) is a persistent reduction. This should be true for representative inputs. Persistent reduction is critical for the perf here to make sure a loaded tensor does not need to be reload.

To make sure the inner reduction (computing dx) and outer reductions (computing dw/db) being fusible, the PR does the following:

convert the outer reductions to pointwise by replacing 'reduction' & 'store_reduction' node with a new type of node 'parital_accumulate'. The new node will collect the reduction type, buffer name, input of reduction etc, which is essential for proper codegening.
do loop reordering (rely on the earlier loop ordering after fusion work) to reorder the loops of the converted pointwise so it can be fused with the inner reduction
there can be epilogues that need to be added in the end. E.g. the outer reduction may be followed by a division for mean , or followed by a down cast if dw/db is in low precision (fp16/bf16).

Some early benchmarking on H100 shows about 2X speedup for both RMSNorm and LayerNorm backward for shape (1152 * 500, 384 ) used in some internal model. Note that, I manually disable split reduction in this benchmarking since otherwise the fusion will be skipped right now. The next PR will make the mix-order-reduction compose better with split reduction

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

[ghstack-poisoned]

pytorch-bot · 2025-10-13T21:48:27Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165370

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

All Docker build failed due to Ubuntu archive outage

⏳ No Failures, 1 Pending

As of commit 1ac94b7 with merge base 4e6afa8 ():
💚 Looks good so far! There are no failures yet. 💚

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

⏳ trunk / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, lf.linux.2xlarge, unstable) (gh) (#166072)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: c1adef9 Pull Request resolved: #165370