
Use scalar fast path in optimized layer_norm for small tensors (#18636)#18636

Merged
meta-codesync[bot] merged 1 commit into pytorch:main from xiaodong705:export-D98795281 on Apr 2, 2026

Conversation

xiaodong705 (Contributor) commented Apr 1, 2026

Summary:

The optimized `native_layer_norm` kernel uses SIMD vectorization
(RowwiseMoments + vec::map3) which has significant setup overhead that
exceeds the benefit for small normalized dimensions. For EMG/CTRL-R
trackpad models with N=26-144 features, this causes a latency
regression vs portable kernels (D98628176).

Add a scalar fast path for N < 256 that bypasses SIMD vectorization,
using simple scalar loops for mean/variance computation and element-wise
normalization. This matches the portable kernel's approach for small
tensors while preserving the SIMD path for large tensors (N >= 256).
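
A minimal sketch of what such a scalar path can look like is below (the function name, the threshold constant, and the exact loop structure are illustrative assumptions, not the PR's actual code):

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical names; the real kernel's identifiers may differ.
constexpr int64_t kScalarThreshold = 256;  // rows with N < 256 take the scalar path

// Normalize one row of length N:
//   out[i] = (x[i] - mean) * rstd * gamma[i] + beta[i]
void layer_norm_row_scalar(
    const float* x,      // input row
    const float* gamma,  // per-element scale (weight)
    const float* beta,   // per-element shift (bias)
    float* out,          // output row
    int64_t N,
    float eps) {
  // Plain scalar accumulation: no vector-register setup, no remainder handling.
  float sum = 0.0f;
  float sum_sq = 0.0f;
  for (int64_t i = 0; i < N; ++i) {
    sum += x[i];
    sum_sq += x[i] * x[i];
  }
  const float mean = sum / N;
  const float var = sum_sq / N - mean * mean;  // E[x^2] - E[x]^2
  const float rstd = 1.0f / std::sqrt(var + eps);
  // Element-wise normalize, scale, and shift.
  for (int64_t i = 0; i < N; ++i) {
    out[i] = (x[i] - mean) * rstd * gamma[i] + beta[i];
  }
}
```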

The threshold of 256 was determined by analyzing per-op layer_norm
dimensions across both trackpad models:

- Old model (HUGC72): 20x N=72 [scalar], 10x N=144 [scalar], 2x N=512 [SIMD]
- New model (O79HDB): 8x N=32-64 [scalar], 8x N=128 [scalar], 16x N=256 [SIMD], 5x N=512 [SIMD]

With a threshold of 128, the N=144 calls in the old model still hit the slow
SIMD path, causing ~1ms of overhead. Raising the threshold to 256 captures
these calls.
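
The dispatch itself is then a single branch on the normalized dimension, roughly as sketched here (`layer_norm_row_vectorized` is a stand-in for the existing RowwiseMoments + vec::map3 path, not a real function name):

```cpp
// Assumed dispatch structure, not the PR's actual code.
if (N < kScalarThreshold) {
  // Small rows: SIMD setup cost dominates, so run the scalar loops.
  layer_norm_row_scalar(x, gamma, beta, out, N, eps);
} else {
  // Large rows: amortized vector math wins; keep the existing SIMD path.
  layer_norm_row_vectorized(x, gamma, beta, out, N, eps);
}
```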

Portable vs old optimized regression (x86_64 devserver):

Old model (trackpad-HUGC72.pte):
| Threads | Portable (median) | Old Optimized (median) | Regression (Old Opt / Portable) |
|---------|-------------------|------------------------|---------------------------------|
| 1       | 4,845 us          | 8,818 us               | 1.8x                            |
| 2       | 4,958 us          | 8,982 us               | 1.8x                            |
| 4       | 5,627 us          | 9,564 us               | 1.7x                            |

New model (trackpad-O79HDB-FP32.pte):
| Threads | Portable (median) | Old Optimized (median) | Regression (Old Opt / Portable) |
|---------|-------------------|------------------------|---------------------------------|
| 1       | 5,617 us          | 7,695 us               | 1.4x                            |
| 2       | 5,818 us          | 7,931 us               | 1.4x                            |
| 4       | 7,399 us          | 9,424 us               | 1.3x                            |

**Full thread sweep (threshold=256):**

Old model (trackpad-HUGC72.pte):
| Threads | Portable (median) | Old Optimized (median) | New Optimized (median) | Speedup (Old Opt / New Opt) |
|---------|-------------------|------------------------|------------------------|-----------------------------|
| 1       | 4,826 us          | 8,818 us               | 3,929 us               | 2.24x                       |
| 2       | 5,021 us          | 8,982 us               | 4,041 us               | 2.22x                       |
| 4       | 5,657 us          | 9,564 us               | 4,712 us               | 2.03x                       |

New model (trackpad-O79HDB-FP32.pte):
| Threads | Portable (median) | Old Optimized (median) | New Optimized (median) | Speedup (Old Opt / New Opt) |
|---------|-------------------|------------------------|------------------------|-----------------------------|
| 1       | 5,545 us          | 7,695 us               | 5,162 us               | 1.49x                       |
| 2       | 5,744 us          | 7,931 us               | 5,235 us               | 1.52x                       |
| 4       | 7,175 us          | 9,424 us               | 6,750 us               | 1.40x                       |

Optimized kernels now outperform portable on both models. The old model
sees the largest improvement because 10 layer_norm calls at N=144 moved
from the expensive SIMD path to the fast scalar path.

Suggested by Kimish Patel in the [PyTorch Edge Q&A discussion](https://fb.workplace.com/groups/pytorch.edge.users/permalink/2015326169337666/).

Differential Revision: D98795281

pytorch-bot Bot commented Apr 1, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18636

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 2 Unrelated Failures

As of commit 9e0d052 with merge base 4e2ae9c:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label on Apr 1, 2026
meta-codesync Bot commented Apr 1, 2026

@xiaodong705 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98795281.

github-actions Bot commented Apr 1, 2026

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

meta-codesync Bot changed the title from "Use scalar fast path in optimized layer_norm for small tensors" to "Use scalar fast path in optimized layer_norm for small tensors (#18636)" on Apr 1, 2026
xiaodong705 added a commit to xiaodong705/executorch that referenced this pull request Apr 1, 2026
xiaodong705 force-pushed the export-D98795281 branch 2 times, most recently from dcea32a to 47cd2f6, on April 1, 2026 at 16:09
xiaodong705 added a commit to xiaodong705/executorch that referenced this pull request Apr 1, 2026
meta-codesync Bot merged commit fc6855d into pytorch:main Apr 2, 2026
159 of 163 checks passed
jpiat pushed a commit to jpiat/executorch that referenced this pull request Apr 14, 2026

Labels

CLA Signed, fb-exported, meta-exported
