Cortex-M backend: add quantized_activation op with LUT lowering for sigmoid/tanh/silu by rascani · Pull Request #19792 · pytorch/executorch

rascani · 2026-05-26T23:27:17Z

Summary

CMSIS-NN has no s8 activation primitive; the s16 path requantizes around an on-target polynomial, which costs an extra s8 → s16 → activation → s8 round trip per call. Instead, this PR lowers standalone aten.sigmoid / aten.tanh / aten.silu to a single cortex_m.quantized_activation(input, lut) op backed by a 256-entry int8 LUT that is precomputed ahead of time (AoT) from the input/output qparams and the activation function. The kernel is a single byte-indexed lookup loop: shape-agnostic, activation-agnostic, and free of any runtime requantization. Encoding the activation in the LUT bytes rather than in a kind enum keeps the kernel surface to one op.
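To make the idea concrete, here is a minimal Python sketch of how such a table could be built and consumed. It is not the PR's code: `build_lut_sketch` and `apply_lut_sketch` are hypothetical names, the qparams are made-up example values, and indexing the table at `q + 128` is an assumption about the layout.

```python
import math
from typing import Callable, List


def round_half_away(v: float) -> int:
    """Round to nearest integer, with ties going away from zero."""
    return int(math.copysign(math.floor(abs(v) + 0.5), v))


def build_lut_sketch(
    fn: Callable[[float], float],
    in_scale: float,
    in_zp: int,
    out_scale: float,
    out_zp: int,
) -> List[int]:
    """Precompute one int8 output for each of the 256 possible int8 inputs.

    Assumed layout: entry i holds the result for input q = i - 128.
    """
    lut = []
    for q in range(-128, 128):
        x = (q - in_zp) * in_scale                  # dequantize the input
        y = fn(x)                                   # float activation
        q_out = round_half_away(y / out_scale) + out_zp
        lut.append(max(-128, min(127, q_out)))      # saturate to int8
    return lut


def apply_lut_sketch(values: List[int], lut: List[int]) -> List[int]:
    """Reference 'kernel': one table lookup per element, no arithmetic."""
    return [lut[v + 128] for v in values]


# Hypothetical qparams: input covers [-4, 4), output covers [-1, 1).
tanh_lut = build_lut_sketch(math.tanh, 4.0 / 128, 0, 1.0 / 128, 0)
outputs = apply_lut_sketch([-128, 0, 127], tanh_lut)
```

Swapping `math.tanh` for another function changes only the table contents, which is why the lookup side needs no knowledge of the activation.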

For SiLU specifically, the LUT can encode x * sigmoid(x) directly, so the naive sigmoid-plus-elementwise-mul decomposition is unnecessary. aten.silu is added to the to_edge preserve_ops list so it doesn't decompose to sigmoid+mul before the lowering pass sees it; this is set globally because no per-test opt-out exists today.

LUT-build numerics deliberately mirror the existing cortex_m CMSIS-NN conventions. Sigmoid/silu use a sign-branched stable form that always exponentiates a non-positive value, so the LUT build can't trip OverflowError for unusually wide input qparams. The final fp → int8 quantize uses round-half-away-from-zero, matching the rounding requantize_cmsis applies after its right-shift in passes_utils.
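The two numeric choices above can be sketched in plain Python. This is an illustration only; the function names are hypothetical and the PR's own helpers may be structured differently.

```python
import math


def stable_sigmoid(x: float) -> float:
    """Sign-branched sigmoid: math.exp only ever sees a non-positive argument."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)            # x < 0, so e is in (0, 1) or underflows to 0.0
    return e / (1.0 + e)


def stable_silu(x: float) -> float:
    """SiLU expressed directly as x * sigmoid(x)."""
    return x * stable_sigmoid(x)


def round_half_away_from_zero(v: float) -> int:
    """Ties go away from zero, unlike Python's round(), which ties to even."""
    return int(math.copysign(math.floor(abs(v) + 0.5), v))


def naive_sigmoid(x: float) -> float:
    """Textbook form, shown for contrast; overflows for very negative x."""
    return 1.0 / (1.0 + math.exp(-x))


try:
    naive_sigmoid(-1000.0)     # math.exp(1000.0) is out of float range
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
```

With the branched form, an extreme dequantized input such as -1000.0 simply evaluates to 0.0, while the naive form raises OverflowError. The rounding helper differs from the built-in `round` exactly on ties, for example at 0.5 and 2.5.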

Test plan

In Silero VAD the final sigmoid(final_conv(x)) now lowers; the 3 remaining sigmoids and 2 tanhs are LSTMCell gates and stay in aten because PyTorch export captures nn.LSTMCell as a single high-level op -- the quantizer never sees the gates and can't annotate them, and to_edge only decomposes the cell after the quantizer has run. test_lstm_cell.py captures the expected end-state as an xfail that will flip green once a pre-annotation decompose pass lands; that work is tracked as a separate follow-up.

Other activations (GELU for KWT, Mish, ELU, Softplus) plug in as a few additional entries in passes_utils._ACTIVATION_FNS plus matching quantizer patterns. The generic op + LUT design carries them with no kernel changes.
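As a rough illustration of what "a few additional entries" could look like, the sketch below uses a stand-in table keyed by plain strings. That keying, the name `ACTIVATION_FNS_SKETCH`, the builder, and the qparams are all assumptions for the example; the real `passes_utils._ACTIVATION_FNS` and the quantizer patterns are not reproduced here.

```python
import math
from typing import Callable, Dict, List


def gelu(x: float) -> float:
    """Exact (erf-based) GELU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softplus(x: float) -> float:
    """log(1 + exp(x)) written so exp never sees a positive argument."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    return x * math.tanh(softplus(x))


def elu(x: float, alpha: float = 1.0) -> float:
    return x if x > 0.0 else alpha * math.expm1(x)


# Hypothetical stand-in for the activation table.
ACTIVATION_FNS_SKETCH: Dict[str, Callable[[float], float]] = {
    "gelu": gelu,
    "mish": mish,
    "elu": elu,
    "softplus": softplus,
}


def build_lut_for(name: str, in_scale: float, in_zp: int,
                  out_scale: float, out_zp: int) -> List[int]:
    """Same 256-entry build for every activation; only fn changes."""
    fn = ACTIVATION_FNS_SKETCH[name]
    lut = []
    for q in range(-128, 128):
        y = fn((q - in_zp) * in_scale) / out_scale
        q_out = int(math.copysign(math.floor(abs(y) + 0.5), y)) + out_zp
        lut.append(max(-128, min(127, q_out)))
    return lut


# Made-up qparams, purely for demonstration.
luts = {name: build_lut_for(name, 0.05, 0, 0.05, 0)
        for name in ACTIVATION_FNS_SKETCH}
```

Every table has the same shape and value range, so a lookup kernel written for one of them serves all of them.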

pytorch-bot · 2026-05-26T23:27:20Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19792

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 5045ac2 with merge base 5395f20:

NEW FAILURE - The following job has failed:

trunk / test-llama-runner-mac (fp32, coreml) / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / test-qnn-testsuite-linux / test-backend-linux (qnn, models) / linux-job (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

trunk / test-arm-backend-ethos-u (test_smaller_stories_llama) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2026-05-26T23:27:25Z

✅ login: rascani / name: RJ Ascani (5045ac2, b91a8ef)
❌ - login: @claude / name: Claude. The commit (5045ac2, b91a8ef) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please visit our EasyCLA portal and chat with our support bot.

github-actions · 2026-05-26T23:28:02Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

AdrianLundell · 2026-05-27T07:02:07Z

+  // activation function (sigmoid / tanh / silu / ...), so the kernel does not
+  // need to know which activation it is implementing.
+  const int64_t n = input.numel();
+  for (int64_t i = 0; i < n; ++i) {


Can you try to take a stab at optimizing this using vector intrinsics and loop unrolling?

rascani · 2026-05-30

Done. Did DSP too in case that lands ahead of this. :)

Adrian's three review comments on pytorch#19792, plus SIMD acceleration of the LUT lookup (his comment asked for vector intrinsics and loop unrolling):

* Drop the target -> string indirection in the activation lowering. `passes_utils._ACTIVATION_FNS` now keys directly on the edge op target (`exir_ops.edge.aten.{sigmoid,tanh,silu}.default`), and `ConvertToCortexMPass._get_activation_replacement` passes `node.target` straight into `build_activation_lut` -- no `_ACTIVATION_KINDS` dict and no string round-trip.
* Replace the scalar LUT-lookup loop with three compile-gated paths:
  - M55/M85 (MVE): 16 lanes per iteration -- `vldrbq_u8` load, `vaddq_n_u8` to bias by 128, `vldrbq_gather_offset_s8` to gather the LUT result, `vstrbq_s8` to store.
  - M4/M7 (DSP, no MVE): 4 bytes per iteration -- fold four byte-loads into one word-load, batch the +128 bias with `__uadd8`, four LUT lookups (these cores have no gather instruction), fold four byte-stores into one word-store. Uses `<arm_acle.h>` and local memcpy helpers rather than pulling in the heavyweight `arm_nnsupportfunctions.h`.
  - All other cores (M0+/M3): a 4x-unrolled scalar tail, which also handles the sub-vector remainder of the two SIMD paths.
* Switch the source header to Meta's standard copyright block to match the other cortex_m op files.

The three paths were cross-compiled for cortex-m0plus / m4 / m7 / m55; the M4 build emits `uadd8` and the M55 build emits the MVE gather. Runtime correctness on M4/M7 hardware/FVP is not yet exercised by CI -- the host unit tests cover the scalar path only.

Co-authored-by: Claude <noreply@anthropic.com>

Adrian's three review comments on pytorch#19792, plus SIMD acceleration of the LUT lookup (his comment asked for vector intrinsics and loop unrolling):

* Drop the target -> string indirection in the activation lowering. `passes_utils._ACTIVATION_FNS` now keys directly on the edge op target (`exir_ops.edge.aten.{sigmoid,tanh,silu}.default`), and `ConvertToCortexMPass._get_activation_replacement` passes `node.target` straight into `build_activation_lut` -- no `_ACTIVATION_KINDS` dict and no string round-trip.
* Replace the scalar LUT-lookup loop with three compile-gated paths:
  - M55/M85 (MVE): 16 lanes per iteration -- `vldrbq_u8` load, `vaddq_n_u8` to bias by 128, `vldrbq_gather_offset_s8` to gather the LUT result, `vstrbq_s8` to store.
  - M4/M7 (DSP, no MVE): 4 bytes per iteration -- fold four byte-loads into one word-load, batch the +128 bias with `__uadd8`, four LUT lookups (these cores have no gather instruction), fold four byte-stores into one word-store. Uses `<arm_acle.h>` and local memcpy helpers rather than pulling in the heavyweight `arm_nnsupportfunctions.h`.
  - All other cores (M0+/M3): a 4x-unrolled scalar tail, which also handles the sub-vector remainder of the two SIMD paths.
* Switch the source header to Meta's standard copyright block to match the other cortex_m op files.

Also drop test_lstm_cell.py: the LSTMCell gates can't lower until a pre-annotation decompose pass lands, so the test isn't ready and is removed until that follow-up work is done.

The MVE path is verified on the Corstone-300 FVP (cortex-m55) via the existing test_implementation_quantized_activation suite. The three paths were cross-compiled for cortex-m0plus / m4 / m7 / m55; the M4 build emits `uadd8` and the M55 build emits the MVE gather.

Co-authored-by: Claude <noreply@anthropic.com>
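The DSP path described above can be modeled in Python to show why a per-byte add of 0x80 turns a raw int8 byte into its LUT index. This is a behavioral model only, not the kernel: `uadd8_model` imitates the lane-wise wrap-around of `__uadd8` (ignoring the GE flags), and the function names and the identity table are hypothetical.

```python
def uadd8_model(a: int, b: int) -> int:
    """Four independent 8-bit adds inside a 32-bit word, each wrapping mod 256."""
    out = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        out |= (lane_sum & 0xFF) << shift
    return out


def lut_lookup_dsp_model(data: bytes, lut: bytes) -> bytes:
    """data holds raw int8 bytes; lut holds 256 raw int8 bytes, index = q + 128."""
    out = bytearray(len(data))
    i = 0
    full_words = len(data) - len(data) % 4
    while i < full_words:
        word = int.from_bytes(data[i:i + 4], "little")   # one word-load
        idx = uadd8_model(word, 0x80808080)              # bias all four lanes
        for lane in range(4):                            # four separate lookups
            out[i + lane] = lut[(idx >> (8 * lane)) & 0xFF]
        i += 4
    while i < len(data):                                 # scalar remainder
        out[i] = lut[(data[i] + 0x80) & 0xFF]
        i += 1
    return bytes(out)


# Identity table: entry q + 128 stores q as a raw byte, so output == input.
identity_lut = bytes((i - 128) & 0xFF for i in range(256))
sample = bytes(range(256)) + b"\x01\x02\x03"             # length not a multiple of 4
result = lut_lookup_dsp_model(sample, identity_lut)
```

Adding 0x80 modulo 256 to the two's-complement byte of q yields q + 128 for every q in [-128, 127], so no sign extension is needed before indexing.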



meta-cla Bot added the CLA Signed label May 26, 2026

rascani added the ciflow/trunk label May 27, 2026

AdrianLundell reviewed May 27, 2026


rascani force-pushed the cortex-m-quantized-activation branch from 8cc5dcf to e08e727 on May 30, 2026 00:52

rascani and others added 2 commits May 29, 2026 17:59

rascani force-pushed the cortex-m-quantized-activation branch from e08e727 to 5045ac2 on May 30, 2026 01:00

rascani wants to merge 2 commits into pytorch:main from rascani:cortex-m-quantized-activation
