Cortex-M backend: add quantized_activation op with LUT lowering for sigmoid/tanh/silu #19792
rascani wants to merge 2 commits into
// activation function (sigmoid / tanh / silu / ...), so the kernel does not
// need to know which activation it is implementing.
const int64_t n = input.numel();
for (int64_t i = 0; i < n; ++i) {
Can you take a stab at optimizing this using vector intrinsics and loop unrolling?
Done. Did DSP too in case that lands ahead of this. :)
Address Adrian's three review comments on pytorch#19792, plus SIMD acceleration of the LUT lookup (his comment asked for vector intrinsics and loop unrolling):

* Drop the target -> string indirection in the activation lowering. `passes_utils._ACTIVATION_FNS` now keys directly on the edge op target (`exir_ops.edge.aten.{sigmoid,tanh,silu}.default`), and `ConvertToCortexMPass._get_activation_replacement` passes `node.target` straight into `build_activation_lut` -- no `_ACTIVATION_KINDS` dict and no string round-trip.
* Replace the scalar LUT-lookup loop with three compile-gated paths:
  - M55/M85 (MVE): 16 lanes per iteration -- `vldrbq_u8` load, `vaddq_n_u8` to bias by 128, `vldrbq_gather_offset_s8` to gather the LUT result, `vstrbq_s8` to store.
  - M4/M7 (DSP, no MVE): 4 bytes per iteration -- fold four byte-loads into one word-load, batch the +128 bias with `__uadd8`, do four scalar LUT lookups (the DSP extension has no gather instruction), fold four byte-stores into one word-store. Uses `<arm_acle.h>` and local memcpy helpers rather than pulling in the heavyweight `arm_nnsupportfunctions.h`.
  - All other cores (M0+/M3): a 4x-unrolled scalar loop, which also handles the sub-vector remainder of the two SIMD paths.
* Switch the source header to Meta's standard copyright block to match the other cortex_m op files.

The three paths were cross-compiled for cortex-m0plus / m4 / m7 / m55; the M4 build emits `uadd8` and the M55 build emits the MVE gather. Runtime correctness on M4/M7 hardware/FVP is not yet exercised by CI -- the host unit tests cover the scalar path only.

Co-authored-by: Claude <noreply@anthropic.com>
Address Adrian's three review comments on pytorch#19792, plus SIMD acceleration of the LUT lookup (his comment asked for vector intrinsics and loop unrolling):

* Drop the target -> string indirection in the activation lowering. `passes_utils._ACTIVATION_FNS` now keys directly on the edge op target (`exir_ops.edge.aten.{sigmoid,tanh,silu}.default`), and `ConvertToCortexMPass._get_activation_replacement` passes `node.target` straight into `build_activation_lut` -- no `_ACTIVATION_KINDS` dict and no string round-trip.
* Replace the scalar LUT-lookup loop with three compile-gated paths:
  - M55/M85 (MVE): 16 lanes per iteration -- `vldrbq_u8` load, `vaddq_n_u8` to bias by 128, `vldrbq_gather_offset_s8` to gather the LUT result, `vstrbq_s8` to store.
  - M4/M7 (DSP, no MVE): 4 bytes per iteration -- fold four byte-loads into one word-load, batch the +128 bias with `__uadd8`, do four scalar LUT lookups (the DSP extension has no gather instruction), fold four byte-stores into one word-store. Uses `<arm_acle.h>` and local memcpy helpers rather than pulling in the heavyweight `arm_nnsupportfunctions.h`.
  - All other cores (M0+/M3): a 4x-unrolled scalar loop, which also handles the sub-vector remainder of the two SIMD paths.
* Switch the source header to Meta's standard copyright block to match the other cortex_m op files.

Also drop test_lstm_cell.py: the LSTMCell gates can't lower until a pre-annotation decompose pass lands, so the test isn't ready and is removed until that follow-up work is done.

The MVE path is verified on the Corstone-300 FVP (cortex-m55) via the existing test_implementation_quantized_activation suite. The three paths were cross-compiled for cortex-m0plus / m4 / m7 / m55; the M4 build emits `uadd8` and the M55 build emits the MVE gather.

Co-authored-by: Claude <noreply@anthropic.com>
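The DSP path described above (one word load, a batched +128 bias, four lookups, one word store) can be modelled on the host. The following is a minimal Python sketch of that data flow, not the C++ kernel from this PR: `uadd8` is a hypothetical stand-in for the lane-wise wraparound sum that the ACLE `__uadd8` intrinsic computes (the GE flags are ignored), and `lut_lookup_packed` is an illustrative name.

```python
import struct


def uadd8(a: int, b: int) -> int:
    """Model a 4-lane unsigned byte add: each lane wraps modulo 256."""
    out = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = (((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)) & 0xFF
        out |= lane_sum << shift
    return out


def lut_lookup_packed(data: bytes, lut: bytes) -> bytes:
    """Apply a 256-entry LUT to raw int8 bytes, four bytes per iteration.

    `data` holds int8 values as raw two's-complement bytes; adding 0x80 to
    each byte (mod 256) maps -128..127 onto LUT indices 0..255.
    """
    out = bytearray(len(data))
    n4 = len(data) & ~3  # largest multiple of 4 not exceeding len(data)
    for i in range(0, n4, 4):
        (word,) = struct.unpack_from("<I", data, i)   # one word load
        idx = uadd8(word, 0x80808080)                 # bias all four lanes
        packed = 0
        for lane in range(4):                         # four scalar lookups
            packed |= lut[(idx >> (8 * lane)) & 0xFF] << (8 * lane)
        struct.pack_into("<I", out, i, packed)        # one word store
    for i in range(n4, len(data)):                    # scalar remainder
        out[i] = lut[(data[i] + 128) & 0xFF]
    return bytes(out)
```

The scalar remainder loop at the end plays the same role as the scalar path in the commit message: it handles whatever the word-sized loop leaves over.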
8cc5dcf to e08e727
…igmoid/tanh/silu

CMSIS-NN has no s8 activation primitive; the s16 path requantizes around an on-target polynomial, which costs an extra s8 → s16 → activation → s8 trip per call. Instead this lowers standalone aten.sigmoid / aten.tanh / aten.silu to a single cortex_m.quantized_activation(input, lut) op backed by a 256-entry int8 LUT precomputed at AoT from the input/output qparams and the activation function. The kernel is a single byte-indexed lookup loop -- shape-agnostic, activation-agnostic, and free of any runtime requantization. Encoding the activation in the LUT bytes rather than a kind enum keeps the kernel surface to one op.

For SiLU specifically, the LUT can encode `x * sigmoid(x)` directly, so the naive sigmoid-plus-elementwise-mul decomposition is unnecessary. aten.silu is added to the to_edge preserve_ops list so it doesn't decompose to sigmoid+mul before the lowering pass sees it; this is set globally because no per-test opt-out exists today.

LUT-build numerics deliberately mirror the existing cortex_m CMSIS-NN conventions. Sigmoid/silu use a sign-branched stable form that always exponentiates a non-positive value, so the LUT build can't trip OverflowError for unusually wide input qparams. The final fp → int8 quantize uses round-half-away-from-zero, matching the rounding requantize_cmsis applies after its right-shift in passes_utils.

In Silero VAD the final `sigmoid(final_conv(x))` now lowers; the 3 remaining sigmoids and 2 tanhs are LSTMCell gates and stay in aten because PyTorch export captures nn.LSTMCell as a single high-level op -- the quantizer never sees the gates and can't annotate them, and to_edge only decomposes the cell after the quantizer has run. test_lstm_cell.py captures the expected end-state as an xfail that will flip green once a pre-annotation decompose pass lands; that work is tracked as a separate follow-up.

Other activations (GELU for KWT, Mish, ELU, Softplus) plug in as a few additional entries in passes_utils._ACTIVATION_FNS plus matching quantizer patterns. The generic op + LUT design carries them with no kernel changes.

Co-authored-by: Claude <noreply@anthropic.com>
e08e727 to 5045ac2
Summary
CMSIS-NN has no s8 activation primitive; the s16 path requantizes around an on-target polynomial, which costs an extra s8 → s16 → activation → s8 trip per call. Instead, this PR lowers standalone aten.sigmoid / aten.tanh / aten.silu to a single cortex_m.quantized_activation(input, lut) op backed by a 256-entry int8 LUT precomputed ahead of time (AoT) from the input/output qparams and the activation function. The kernel is a single byte-indexed lookup loop -- shape-agnostic, activation-agnostic, and free of any runtime requantization. Encoding the activation in the LUT bytes rather than a kind enum keeps the kernel surface to one op.
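To make the LUT idea concrete, here is a minimal Python sketch of the two halves: building a 256-entry table from qparams ahead of time, and applying it with a byte-indexed lookup. The names `build_lut_sketch`, `apply_lut_sketch`, and `quantize_sketch`, and the qparam values in the usage note, are illustrative assumptions; this is not the PR's `build_activation_lut` or the C++ kernel.

```python
import math


def quantize_sketch(value: float, scale: float, zero_point: int) -> int:
    """Quantize to int8: round half away from zero, add zero point, clamp."""
    scaled = value / scale
    rounded = int(math.copysign(math.floor(abs(scaled) + 0.5), scaled))
    return max(-128, min(127, rounded + zero_point))


def build_lut_sketch(fn, in_scale, in_zp, out_scale, out_zp):
    """Return 256 int8 outputs, one per possible int8 input.

    Entry k corresponds to quantized input k - 128, so the kernel can index
    the table with (input + 128).
    """
    lut = []
    for q_in in range(-128, 128):
        x = (q_in - in_zp) * in_scale          # dequantize the input code
        lut.append(quantize_sketch(fn(x), out_scale, out_zp))
    return lut


def apply_lut_sketch(values, lut):
    """The whole 'kernel': one table lookup per element, no requantization."""
    return [lut[v + 128] for v in values]
```

For example, with an assumed input scale of 1/32 and output scale of 1/128 (both zero points 0), `apply_lut_sketch([0, 127, -128], build_lut_sketch(math.tanh, 1 / 32, 0, 1 / 128, 0))` saturates at both ends of the tanh range. Swapping `math.tanh` for another function changes only the table contents, which is why the lookup loop needs no activation-specific code.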
For SiLU specifically, the LUT can encode `x * sigmoid(x)` directly, so the naive sigmoid-plus-elementwise-mul decomposition is unnecessary. aten.silu is added to the to_edge preserve_ops list so it doesn't decompose to sigmoid+mul before the lowering pass sees it; this is set globally because no per-test opt-out exists today.

LUT-build numerics deliberately mirror the existing cortex_m CMSIS-NN conventions. Sigmoid/silu use a sign-branched stable form that always exponentiates a non-positive value, so the LUT build can't trip OverflowError for unusually wide input qparams. The final fp → int8 quantize uses round-half-away-from-zero, matching the rounding requantize_cmsis applies after its right-shift in passes_utils.
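The two numeric choices above can be sketched in plain Python. This is a minimal illustration that assumes the LUT build evaluates the activation with `math.exp` on Python floats; the helper names are hypothetical and not taken from passes_utils.

```python
import math


def stable_sigmoid(x: float) -> float:
    """Sign-branched sigmoid: the argument to exp() is never positive."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))   # -x <= 0
    e = math.exp(x)                          # x < 0
    return e / (1.0 + e)


def stable_silu(x: float) -> float:
    """SiLU evaluated directly as x * sigmoid(x)."""
    return x * stable_sigmoid(x)


def round_half_away_from_zero(v: float) -> int:
    """Ties go away from zero, unlike Python's round(), which goes to even."""
    return int(math.copysign(math.floor(abs(v) + 0.5), v))
```

The naive form `1.0 / (1.0 + math.exp(-x))` raises OverflowError for a large negative `x` such as -1000, because `math.exp(1000)` is out of range; the branched form returns 0.0 instead. On ties, `round_half_away_from_zero(2.5)` gives 3 where the built-in `round(2.5)` gives 2.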
Test plan
In Silero VAD the final `sigmoid(final_conv(x))` now lowers; the 3 remaining sigmoids and 2 tanhs are LSTMCell gates and stay in aten because PyTorch export captures nn.LSTMCell as a single high-level op -- the quantizer never sees the gates and can't annotate them, and to_edge only decomposes the cell after the quantizer has run. test_lstm_cell.py captures the expected end-state as an xfail that will flip green once a pre-annotation decompose pass lands; that work is tracked as a separate follow-up.

Other activations (GELU for KWT, Mish, ELU, Softplus) plug in as a few additional entries in passes_utils._ACTIVATION_FNS plus matching quantizer patterns. The generic op + LUT design carries them with no kernel changes.
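As a rough illustration of why new activations need only a table entry, the sketch below uses a hypothetical registry keyed by strings; the real `_ACTIVATION_FNS` keys on edge op targets, and its contents are not shown in this PR page. Output quantization is omitted to keep the focus on the registry.

```python
import math


def softplus(x: float) -> float:
    """Overflow-safe softplus: max(x, 0) + log1p(exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical registry: string keys stand in for edge op targets.
ACTIVATIONS_SKETCH = {
    "tanh": math.tanh,
    "gelu": lambda x: 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))),
    "elu": lambda x: x if x > 0 else math.expm1(x),
    "softplus": softplus,
    "mish": lambda x: x * math.tanh(softplus(x)),
}


def float_table(name: str, in_scale: float, in_zp: int):
    """Evaluate one registered activation at all 256 dequantized inputs."""
    fn = ACTIVATIONS_SKETCH[name]
    return [fn((q - in_zp) * in_scale) for q in range(-128, 128)]
```

Every entry yields a 256-value table through the same code path, so only the registry (and, per the description above, the quantizer patterns) would grow when an activation is added.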