Use scalar fast path in optimized layer_norm for small tensors (#18636)
Conversation
🔗 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18636
Note: links to docs will display an error until the docs builds have been completed.
❌ As of commit 9e0d052 with merge base 4e2ae9c: 1 cancelled job (please retry) and 2 failures that were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@xiaodong705 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98795281.
Summary:
The optimized `native_layer_norm` kernel uses SIMD vectorization (RowwiseMoments + vec::map3), which has significant setup overhead that exceeds the benefit for small normalized dimensions. For EMG/CTRL-R trackpad models with N=26-144 features, this causes a latency regression vs portable kernels (D98628176).
Add a scalar fast path for N < 256 that bypasses SIMD vectorization,
using simple scalar loops for mean/variance computation and element-wise
normalization. This matches the portable kernel's approach for small
tensors while preserving the SIMD path for large tensors (N >= 256).
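A minimal standalone sketch of what such a scalar path looks like (function name, signature, and constants here are illustrative, not the actual ExecuTorch kernel code): compute mean and variance with plain loops, then normalize element-wise, with no vector-register setup at all.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative scalar layer_norm over one row of N elements.
// Small N (< 256) avoids the SIMD setup cost of RowwiseMoments + vec::map3.
void layer_norm_scalar(const float* x, const float* gamma, const float* beta,
                       std::size_t N, float eps, float* out) {
  // Mean and variance via simple scalar accumulation (E[x^2] - E[x]^2).
  float sum = 0.f, sq_sum = 0.f;
  for (std::size_t i = 0; i < N; ++i) {
    sum += x[i];
    sq_sum += x[i] * x[i];
  }
  const float mean = sum / static_cast<float>(N);
  const float var = sq_sum / static_cast<float>(N) - mean * mean;
  const float rstd = 1.f / std::sqrt(var + eps);
  // Element-wise normalization, matching the portable kernel's approach.
  for (std::size_t i = 0; i < N; ++i) {
    out[i] = (x[i] - mean) * rstd * gamma[i] + beta[i];
  }
}
```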
The threshold of 256 was determined by analyzing per-op layer_norm
dimensions across both trackpad models:
- Old model (HUGC72): 20x N=72 [scalar], 10x N=144 [scalar], 2x N=512 [SIMD]
- New model (O79HDB): 8x N=32-64 [scalar], 8x N=128 [scalar], 16x N=256 [SIMD], 5x N=512 [SIMD]
With threshold=128, the N=144 calls in the old model still hit the slow
SIMD path, causing ~1ms overhead. Raising to 256 captures these.
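The resulting dispatch is just a size check on the normalized dimension; a sketch under the assumptions above (the constant and helper names are hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical threshold dispatch: normalized dims below 256 take the
// scalar path; 256 and above keep the SIMD (RowwiseMoments + vec::map3) path.
constexpr std::size_t kScalarThreshold = 256;

inline bool use_scalar_path(std::size_t n) {
  return n < kScalarThreshold;
}
```

With this threshold, the old model's N=72 and N=144 calls and the new model's N=32-128 calls all take the scalar path, while N=256 and N=512 stay on SIMD.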
Portable vs old optimized regression (x86_64 devserver):
Old model (trackpad-HUGC72.pte):
| Threads | Portable (median) | Old Optimized (median) | Regression |
|---|---|---|---|
| 1 | 4,845 us | 8,818 us | 1.8x |
| 2 | 4,958 us | 8,982 us | 1.8x |
| 4 | 5,627 us | 9,564 us | 1.7x |
New model (trackpad-O79HDB-FP32.pte):
| Threads | Portable (median) | Old Optimized (median) | Regression |
|---|---|---|---|
| 1 | 5,617 us | 7,695 us | 1.4x |
| 2 | 5,818 us | 7,931 us | 1.4x |
| 4 | 7,399 us | 9,424 us | 1.3x |
**Full thread sweep (threshold=256):**
Old model (trackpad-HUGC72.pte):
| Threads | Portable (median) | Old Optimized (median) | New Optimized (median) | Speedup |
|---|---|---|---|---|
| 1 | 4,826 us | 8,818 us | 3,929 us | 2.24x |
| 2 | 5,021 us | 8,982 us | 4,041 us | 2.22x |
| 4 | 5,657 us | 9,564 us | 4,712 us | 2.03x |
New model (trackpad-O79HDB-FP32.pte):
| Threads | Portable (median) | Old Optimized (median) | New Optimized (median) | Speedup |
|---|---|---|---|---|
| 1 | 5,545 us | 7,695 us | 5,162 us | 1.49x |
| 2 | 5,744 us | 7,931 us | 5,235 us | 1.52x |
| 4 | 7,175 us | 9,424 us | 6,750 us | 1.40x |
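For clarity on how the Speedup column is derived: it is the ratio of the Old Optimized median to the New Optimized median (e.g. 8,818 / 3,929 ≈ 2.24x for the old model at 1 thread), not the ratio against portable.

```cpp
#include <cassert>
#include <cmath>

// Speedup as reported in the tables above: Old Optimized / New Optimized
// median latency in microseconds.
double speedup(double old_us, double new_us) { return old_us / new_us; }
```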
Optimized kernels now outperform portable on both models. The old model
sees the largest improvement because 10 layer_norm calls at N=144 moved
from the expensive SIMD path to the fast scalar path.
Suggested by Kimish Patel in the [PyTorch Edge Q&A discussion](https://fb.workplace.com/groups/pytorch.edge.users/permalink/2015326169337666/).
Differential Revision: D98795281