
Cortex-M backend: add quantized_activation op with LUT lowering for sigmoid/tanh/silu #19792

Draft
rascani wants to merge 2 commits into pytorch:main from rascani:cortex-m-quantized-activation

Conversation

@rascani (Contributor) commented May 26, 2026

Summary

CMSIS-NN has no s8 activation primitive — the s16 path requantizes around an on-target polynomial, which costs an extra s8 → s16 → activation → s8 trip per call. Instead this lowers standalone aten.sigmoid / aten.tanh / aten.silu to a single cortex_m.quantized_activation(input, lut) op backed by a 256-entry int8 LUT precomputed at AoT from the input/output qparams and the activation function. The kernel is a single byte-indexed lookup loop -- shape-agnostic, activation-agnostic, and free of any runtime requantization. Encoding the activation in the LUT bytes rather than a kind enum keeps the kernel surface to one op.
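A minimal Python sketch of the idea described above, not the code from this PR: build a 256-entry int8 table from the input/output qparams and an activation function, then apply it with a plain indexed lookup. The function names, argument order, and the example qparams are assumptions for illustration. The `+ 128` index bias and the round-half-away-from-zero quantize follow what the PR text describes.

```python
import math


def round_half_away(v):
    # Round to nearest integer, ties away from zero.
    return int(math.copysign(math.floor(abs(v) + 0.5), v))


def build_lut(fn, in_scale, in_zp, out_scale, out_zp):
    """Return 256 int8 values; entry i holds the result for input q = i - 128."""
    lut = []
    for q in range(-128, 128):
        x = (q - in_zp) * in_scale                 # dequantize the int8 input
        y = fn(x)                                  # float activation
        v = round_half_away(y / out_scale) + out_zp  # quantize to the output qparams
        lut.append(max(-128, min(127, v)))         # saturate to int8
    return lut


def quantized_activation(values, lut):
    """Model of the kernel: one table lookup per element, no requantization."""
    return [lut[q + 128] for q in values]


# Hypothetical qparams: input covers [-4, 4), output covers [-1, 1).
tanh_lut = build_lut(math.tanh, 1 / 32, 0, 1 / 128, 0)
result = quantized_activation([-128, 0, 127], tanh_lut)
```

Because the table already folds in dequantize, activation, and quantize, the lookup itself is identical for every activation and tensor shape.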

For SiLU specifically, the LUT can encode x * sigmoid(x) directly, so the naive sigmoid-plus-elementwise-mul decomposition is unnecessary. aten.silu is added to the to_edge preserve_ops list so it doesn't decompose to sigmoid+mul before the lowering pass sees it; this is set globally because no per-test opt-out exists today.

LUT-build numerics deliberately mirror the existing cortex_m CMSIS-NN conventions. Sigmoid/silu use a sign-branched stable form that always exponentiates a non-positive value, so the LUT build can't trip OverflowError for unusually wide input qparams. The final fp → int8 quantize uses round-half-away-from-zero, matching the rounding requantize_cmsis applies after its right-shift in passes_utils.
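The two numeric choices above can be sketched in Python as follows. This is one common way to write a sign-branched sigmoid and a ties-away rounding helper. The function names are illustrative, and the actual code in `passes_utils` may differ.

```python
import math


def stable_sigmoid(x):
    # exp() only ever sees a non-positive argument, so it can underflow to 0.0
    # but never overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def stable_silu(x):
    # SiLU built on the stable sigmoid above.
    return x * stable_sigmoid(x)


def round_half_away_from_zero(v):
    # Ties go away from zero: 2.5 -> 3 and -2.5 -> -3.
    # Python's built-in round() instead rounds ties to the nearest even integer.
    return int(math.copysign(math.floor(abs(v) + 0.5), v))
```

For comparison, the naive form `1.0 / (1.0 + math.exp(-x))` raises `OverflowError` for a large negative `x` such as `-1000.0`, which is the failure the branch avoids when the input qparams span a very wide range.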

Test plan

In Silero VAD the final sigmoid(final_conv(x)) now lowers; the 3 remaining sigmoids and 2 tanhs are LSTMCell gates and stay in aten because PyTorch export captures nn.LSTMCell as a single high-level op -- the quantizer never sees the gates and can't annotate them, and to_edge only decomposes the cell after the quantizer has run. test_lstm_cell.py captures the expected end-state as an xfail that will flip green once a pre-annotation decompose pass lands; that work is tracked as a separate follow-up.

Other activations (GELU for KWT, Mish, ELU, Softplus) plug in as a few additional entries in passes_utils._ACTIVATION_FNS plus matching quantizer patterns. The generic op + LUT design carries them with no kernel changes.
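As a rough illustration of why a new activation needs no kernel change, the sketch below keeps a registry of float functions and adds GELU as one more entry. The string keys, the `ACTIVATION_FNS` name, and the `sketch_build_lut` signature are assumptions made for this example. They are stand-ins for, not copies of, `passes_utils._ACTIVATION_FNS` and its LUT builder, and the quantizer patterns mentioned above are not modelled.

```python
import math


def _sigmoid(x):
    # Sign-branched form: exp() only sees non-positive arguments.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Hypothetical registry keyed by name for readability.
ACTIVATION_FNS = {
    "sigmoid": _sigmoid,
    "tanh": math.tanh,
    "silu": lambda x: x * _sigmoid(x),
}

# Supporting another activation is one more entry (exact erf-based GELU here).
ACTIVATION_FNS["gelu"] = lambda x: 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def sketch_build_lut(kind, in_scale, in_zp, out_scale, out_zp):
    """Build a 256-entry int8 table for the registered activation `kind`."""
    fn = ACTIVATION_FNS[kind]
    lut = []
    for q in range(-128, 128):
        y = fn((q - in_zp) * in_scale)
        scaled = y / out_scale
        v = int(math.copysign(math.floor(abs(scaled) + 0.5), scaled)) + out_zp
        lut.append(max(-128, min(127, v)))
    return lut


gelu_lut = sketch_build_lut("gelu", 0.05, 0, 0.05, 0)
```

The lookup kernel consumes `gelu_lut` exactly as it would a sigmoid or tanh table, since only the table contents differ.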

@pytorch-bot (Bot) commented May 26, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19792

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 5045ac2 with merge base 5395f20:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label May 26, 2026

@linux-foundation-easycla (Bot) commented May 26, 2026

CLA Not Signed

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Comment thread backends/cortex_m/passes/passes_utils.py
// activation function (sigmoid / tanh / silu / ...), so the kernel does not
// need to know which activation it is implementing.
const int64_t n = input.numel();
for (int64_t i = 0; i < n; ++i) {
Collaborator

Can you take a stab at optimizing this using vector intrinsics and loop unrolling?

Contributor Author

Done. Did DSP too in case that lands ahead of this. :)

Comment thread backends/cortex_m/ops/op_quantized_activation.cpp Outdated
rascani added a commit to rascani/executorch that referenced this pull request May 30, 2026
Adrian's three review comments on pytorch#19792, plus SIMD acceleration of the
LUT lookup (his comment asked for vector intrinsics and loop unrolling):

* Drop the target -> string indirection in the activation lowering.
  `passes_utils._ACTIVATION_FNS` now keys directly on the edge op target
  (`exir_ops.edge.aten.{sigmoid,tanh,silu}.default`), and
  `ConvertToCortexMPass._get_activation_replacement` passes `node.target`
  straight into `build_activation_lut` -- no `_ACTIVATION_KINDS` dict and no
  string round-trip.

* Replace the scalar LUT-lookup loop with three compile-gated paths:
  - M55/M85 (MVE): 16 lanes per iteration -- `vldrbq_u8` load, `vaddq_n_u8`
    to bias by 128, `vldrbq_gather_offset_s8` to gather the LUT result,
    `vstrbq_s8` to store.
  - M4/M7 (DSP, no MVE): 4 bytes per iteration -- fold four byte-loads into
    one word-load, batch the +128 bias with `__uadd8`, four LUT lookups
    (no M-class gather instruction exists), fold four byte-stores into one
    word-store. Uses `<arm_acle.h>` and local memcpy helpers rather than
    pulling in the heavyweight `arm_nnsupportfunctions.h`.
  - All other cores (M0+/M3): a 4x-unrolled scalar tail, which also handles
    the sub-vector remainder of the two SIMD paths.

* Switch the source header to Meta's standard copyright block to match
  the other cortex_m op files.
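The 4-bytes-per-iteration DSP path in the list above can be modelled in Python to check the indexing logic. This is a behavioural sketch only, not the C++ kernel: `uadd8` here emulates the per-byte modular add that the `__uadd8` intrinsic performs on a packed word, and the function names and test data are assumptions.

```python
import struct


def uadd8(a, b):
    """Add four packed bytes lane by lane, each lane wrapping modulo 256."""
    out = 0
    for shift in (0, 8, 16, 24):
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        out |= (lane_sum & 0xFF) << shift
    return out


def lut_lookup_packed(src, lut):
    """src: raw int8 bytes; lut: 256 raw bytes indexed by (value + 128)."""
    n = len(src)
    dst = bytearray(n)
    i = 0
    while i + 4 <= n:
        (word,) = struct.unpack_from("<I", src, i)   # one word-load for 4 inputs
        idx = uadd8(word, 0x80808080)                # bias all four lanes by 128
        for lane in range(4):                        # four separate LUT lookups
            dst[i + lane] = lut[(idx >> (8 * lane)) & 0xFF]
        i += 4
    while i < n:                                     # scalar tail for the remainder
        dst[i] = lut[(src[i] + 128) & 0xFF]
        i += 1
    return bytes(dst)


# Arbitrary table and a 7-byte input: one packed iteration plus a 3-byte tail.
demo_lut = bytes((i * 7 + 3) & 0xFF for i in range(256))
demo_src = bytes([0x80, 0xFF, 0x00, 0x7F, 0x01, 0xFE, 0x10])
demo_out = lut_lookup_packed(demo_src, demo_lut)
```

Adding 0x80 to the raw two's-complement byte maps int8 -128..127 onto table indices 0..255, which is why a single packed add can bias all four lanes at once.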

The three paths were cross-compiled for cortex-m0plus / m4 / m7 / m55;
the M4 build emits `uadd8` and the M55 build emits the MVE gather. Runtime
correctness on M4/M7 hardware/FVP is not yet exercised by CI -- the host
unit tests cover the scalar path only.

Co-authored-by: Claude <noreply@anthropic.com>
rascani added a commit to rascani/executorch that referenced this pull request May 30, 2026
Adrian's three review comments on pytorch#19792, plus SIMD acceleration of the
LUT lookup (his comment asked for vector intrinsics and loop unrolling):

* Drop the target -> string indirection in the activation lowering.
  `passes_utils._ACTIVATION_FNS` now keys directly on the edge op target
  (`exir_ops.edge.aten.{sigmoid,tanh,silu}.default`), and
  `ConvertToCortexMPass._get_activation_replacement` passes `node.target`
  straight into `build_activation_lut` -- no `_ACTIVATION_KINDS` dict and no
  string round-trip.

* Replace the scalar LUT-lookup loop with three compile-gated paths:
  - M55/M85 (MVE): 16 lanes per iteration -- `vldrbq_u8` load, `vaddq_n_u8`
    to bias by 128, `vldrbq_gather_offset_s8` to gather the LUT result,
    `vstrbq_s8` to store.
  - M4/M7 (DSP, no MVE): 4 bytes per iteration -- fold four byte-loads into
    one word-load, batch the +128 bias with `__uadd8`, four LUT lookups
    (no M-class gather instruction exists), fold four byte-stores into one
    word-store. Uses `<arm_acle.h>` and local memcpy helpers rather than
    pulling in the heavyweight `arm_nnsupportfunctions.h`.
  - All other cores (M0+/M3): a 4x-unrolled scalar tail, which also handles
    the sub-vector remainder of the two SIMD paths.

* Switch the source header to Meta's standard copyright block to match
  the other cortex_m op files.

Also drop test_lstm_cell.py: the LSTMCell gates can't lower until a
pre-annotation decompose pass lands, so the test isn't ready and is removed
until that follow-up work is done.

The MVE path is verified on the Corstone-300 FVP (cortex-m55) via the
existing test_implementation_quantized_activation suite. The three paths
were cross-compiled for cortex-m0plus / m4 / m7 / m55; the M4 build emits
`uadd8` and the M55 build emits the MVE gather.

Co-authored-by: Claude <noreply@anthropic.com>
@rascani rascani force-pushed the cortex-m-quantized-activation branch from 8cc5dcf to e08e727 Compare May 30, 2026 00:52
rascani and others added 2 commits May 29, 2026 17:59
…igmoid/tanh/silu

@rascani rascani force-pushed the cortex-m-quantized-activation branch from e08e727 to 5045ac2 Compare May 30, 2026 01:00

Labels

ciflow/trunk, CLA Signed


2 participants