Use scalar fast path in optimized layer_norm for small tensors (#18636)
Conversation
🔗 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18636
Note: links to docs will display an error until the docs builds have been completed.
❌ As of commit 9e0d052 with merge base 4e2ae9c: 1 cancelled job (please retry) and 2 failures that were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@xiaodong705 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98795281.
Summary:
The optimized `native_layer_norm` kernel uses SIMD vectorization (RowwiseMoments + vec::map3), which has significant setup overhead that exceeds the benefit for small normalized dimensions. For EMG/CTRL-R trackpad models with N=26-144 features, this causes a latency regression vs portable kernels (D98628176).
Add a scalar fast path for N < 256 that bypasses SIMD vectorization,
using simple scalar loops for mean/variance computation and element-wise
normalization. This matches the portable kernel's approach for small
tensors while preserving the SIMD path for large tensors (N >= 256).
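A minimal standalone sketch of what such a scalar path looks like (function name, signature, and constants here are illustrative, not the actual ExecuTorch kernel code): compute mean and variance with plain loops, then normalize element-wise, with no vector-register setup at all.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative scalar layer_norm over one row of N elements.
// Small N (< 256) avoids the SIMD setup cost of RowwiseMoments + vec::map3.
void layer_norm_scalar(const float* x, const float* gamma, const float* beta,
                       std::size_t N, float eps, float* out) {
  // Mean and variance via simple scalar accumulation (E[x^2] - E[x]^2).
  float sum = 0.f, sq_sum = 0.f;
  for (std::size_t i = 0; i < N; ++i) {
    sum += x[i];
    sq_sum += x[i] * x[i];
  }
  const float mean = sum / static_cast<float>(N);
  const float var = sq_sum / static_cast<float>(N) - mean * mean;
  const float rstd = 1.f / std::sqrt(var + eps);
  // Element-wise normalization, matching the portable kernel's approach.
  for (std::size_t i = 0; i < N; ++i) {
    out[i] = (x[i] - mean) * rstd * gamma[i] + beta[i];
  }
}
```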
The threshold of 256 was determined by analyzing per-op layer_norm
dimensions across both trackpad models:
- Old model (HUGC72): 20x N=72 [scalar], 10x N=144 [scalar], 2x N=512 [SIMD]
- New model (O79HDB): 8x N=32-64 [scalar], 8x N=128 [scalar], 16x N=256 [SIMD], 5x N=512 [SIMD]
With threshold=128, the N=144 calls in the old model still hit the slow
SIMD path, causing ~1ms overhead. Raising to 256 captures these.
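The resulting dispatch is just a size check on the normalized dimension; a sketch under the assumptions above (the constant and helper names are hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical threshold dispatch: normalized dims below 256 take the
// scalar path; 256 and above keep the SIMD (RowwiseMoments + vec::map3) path.
constexpr std::size_t kScalarThreshold = 256;

inline bool use_scalar_path(std::size_t n) {
  return n < kScalarThreshold;
}
```

With this threshold, the old model's N=72 and N=144 calls and the new model's N=32-128 calls all take the scalar path, while N=256 and N=512 stay on SIMD.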
Portable vs old optimized regression (x86_64 devserver):
Old model (trackpad-HUGC72.pte):
| Threads | Portable (median) | Old Optimized (median) | Regression |
|---|---|---|---|
| 1 | 4,845 us | 8,818 us | 1.8x |
| 2 | 4,958 us | 8,982 us | 1.8x |
| 4 | 5,627 us | 9,564 us | 1.7x |
New model (trackpad-O79HDB-FP32.pte):
| Threads | Portable (median) | Old Optimized (median) | Regression |
|---|---|---|---|
| 1 | 5,617 us | 7,695 us | 1.4x |
| 2 | 5,818 us | 7,931 us | 1.4x |
| 4 | 7,399 us | 9,424 us | 1.3x |
**Full thread sweep (threshold=256):**
Old model (trackpad-HUGC72.pte):
| Threads | Portable (median) | Old Optimized (median) | New Optimized (median) | Speedup |
|---|---|---|---|---|
| 1 | 4,826 us | 8,818 us | 3,929 us | 2.24x |
| 2 | 5,021 us | 8,982 us | 4,041 us | 2.22x |
| 4 | 5,657 us | 9,564 us | 4,712 us | 2.03x |
New model (trackpad-O79HDB-FP32.pte):
| Threads | Portable (median) | Old Optimized (median) | New Optimized (median) | Speedup |
|---|---|---|---|---|
| 1 | 5,545 us | 7,695 us | 5,162 us | 1.49x |
| 2 | 5,744 us | 7,931 us | 5,235 us | 1.52x |
| 4 | 7,175 us | 9,424 us | 6,750 us | 1.40x |
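For clarity on how the Speedup column is derived: it is the ratio of the Old Optimized median to the New Optimized median (e.g. 8,818 / 3,929 ≈ 2.24x for the old model at 1 thread), not the ratio against portable.

```cpp
#include <cassert>
#include <cmath>

// Speedup as reported in the tables above: Old Optimized / New Optimized
// median latency in microseconds.
double speedup(double old_us, double new_us) { return old_us / new_us; }
```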
Optimized kernels now outperform portable on both models. The old model
sees the largest improvement because 10 layer_norm calls at N=144 moved
from the expensive SIMD path to the fast scalar path.
Suggested by Kimish Patel in the [PyTorch Edge Q&A discussion](https://fb.workplace.com/groups/pytorch.edge.users/permalink/2015326169337666/).
Differential Revision: D98795281