[Pico2] Add CMSIS-NN INT8 support and latency instrumentation #18612

psiddh merged 3 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18612
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
⏳ 20 Pending, 1 Unrelated Failure as of commit 50bdac5 with merge base 75c677f. FLAKY: the following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Add INT8 quantized inference via CMSIS-NN kernels for Cortex-M33 on Pico2, alongside the existing FP32 portable path. This enables a direct FP32 vs INT8 comparison for MCU deployment benchmarking.

- Add export_mlp_mnist_cmsis.py: quantized export using CortexMQuantizer
- CMakeLists.txt: USE_CMSIS_NN and USE_SELECTIVE_BUILD options for flexible linking
- build_firmware_pico.sh: --cmsis flag, fix TARGET_CPU to cortex-m33+nofp, portable nproc, auto-detect ARM toolchain, remove unused cmake flags
- main.cpp: per-digit inference timing via time_us_32()

Co-authored-by: Claude <noreply@anthropic.com>
Pull request overview
Adds an INT8/CMSIS-NN accelerated inference path for the Pico2 (Cortex-M33) MNIST MLP demo, plus instrumentation to compare FP32 vs INT8 latency and memory usage during MCU benchmarking.
Changes:
- Add a new export script to produce an INT8-quantized (CMSIS-NN-targeting) .pte using CortexMQuantizer + Cortex-M passes.
- Add build-time switches for CMSIS-NN linking and (intended) selective-build linking in the Pico2 CMake + build script.
- Add per-digit inference timing and a latency summary; add a method allocator memory-usage printout.
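The latency-summary logic described above can be sketched in miniature (a hypothetical Python model of what main.cpp computes; on device the timestamps come from the Pico SDK's time_us_32(), and the values below are illustrative, not measured):

```python
def latency_summary(latencies_us):
    """Summarize per-digit inference latencies (in microseconds) into
    min/avg/max, mirroring the per-inference timing summary in main.cpp."""
    return {
        "min_us": min(latencies_us),
        "avg_us": sum(latencies_us) / len(latencies_us),
        "max_us": max(latencies_us),
    }

# Illustrative per-digit latencies for digits 0-9 (not measured values).
per_digit_us = [4210, 4198, 4225, 4190, 4230, 4205, 4218, 4201, 4215, 4208]
print(latency_summary(per_digit_us))
```

Collecting one latency per digit and reducing at the end keeps the per-inference overhead to two timer reads, which matters when the quantity being measured is itself microsecond-scale.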
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| examples/raspberry_pi/pico2/main.cpp | Adds per-inference timing + latency summary and a memory usage printout after method load. |
| examples/raspberry_pi/pico2/export_mlp_mnist_cmsis.py | New INT8/CMSIS-NN export flow using Cortex-M quantization + optimization passes. |
| examples/raspberry_pi/pico2/build_firmware_pico.sh | Adds --cmsis flag, portable nproc, toolchain autodetect, and Pico CMake flag handling. |
| examples/raspberry_pi/pico2/CMakeLists.txt | Adds USE_CMSIS_NN / USE_SELECTIVE_BUILD options and conditional linking for CMSIS-NN vs portable kernels. |
```python
pass_manager = CortexMPassManager(edge_program.exported_program())
transformed_ep = pass_manager.transform()

edge_program = to_edge(transformed_ep, compile_config=edge_config)
```
CortexMPassManager.transform() returns an ExportedProgram that is already in Edge dialect (it’s transforming edge_program.exported_program()). Re-wrapping that result with to_edge() is meant for ATen->Edge conversion and can re-run ATen->Edge passes on an Edge program. Instead, construct an EdgeProgramManager directly from the transformed Edge ExportedProgram (e.g., EdgeProgramManager(transformed_ep, compile_config=edge_config)) before calling to_executorch().
```shell
if [ $USE_CMSIS -eq 1 ]; then
    PICO_CMAKE_FLAGS+=(-DUSE_CMSIS_NN=ON)
fi
```
The script enables ExecuTorch selective build via -DEXECUTORCH_SELECT_OPS_MODEL, but the Pico2 firmware CMake is never told to link the selective kernel library (USE_SELECTIVE_BUILD option added in CMakeLists.txt). As a result, the firmware still links the full portable_ops_lib under --whole-archive, negating the binary-size benefits of selective build. Consider propagating this automatically (e.g., add -DUSE_SELECTIVE_BUILD=ON when SELECT_OPS_FLAGS is set) so selective build works end-to-end.
Suggested change:

```shell
if [ -n "${SELECT_OPS_FLAGS:-}" ]; then
    PICO_CMAKE_FLAGS+=(-DUSE_SELECTIVE_BUILD=ON)
fi
```
```cpp
// Probe method allocator usage: try allocating 1 byte to find cur_ position
void* probe = method_allocator.allocate(1, 1);
```
The allocator "probe" uses method_allocator.allocate(1, 1), which advances the allocator cursor and permanently consumes a byte (plus any padding). On tight SRAM budgets this can change whether method loading / subsequent allocations succeed, and it makes the reported usage slightly inconsistent with the post-probe state. Prefer querying usage without mutating the allocator (e.g., allocate(0, 1) to get the current cursor without consuming space, or add/use an explicit "bytes_used" accessor if available).
Suggested change:

```diff
-// Probe method allocator usage: try allocating 1 byte to find cur_ position
-void* probe = method_allocator.allocate(1, 1);
+// Probe method allocator usage without consuming memory.
+void* probe = method_allocator.allocate(0, 1);
```
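The zero-size probe trick can be illustrated with a toy bump allocator (a hypothetical model written for this review, not ExecuTorch's actual MemoryAllocator; real alignment and failure handling are more involved). A size-0 allocation returns the current cursor without advancing it, while the PR's original size-1 probe permanently consumes a byte:

```python
class BumpAllocator:
    """Toy bump allocator modeling MemoryAllocator-style behavior."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cur = 0  # offset of the next free byte

    def allocate(self, size, alignment=1):
        # Round the cursor up to the requested alignment, then bump by size.
        aligned = (self.cur + alignment - 1) // alignment * alignment
        if aligned + size > self.capacity:
            return None
        self.cur = aligned + size
        return aligned  # stand-in for the returned pointer

alloc = BumpAllocator(1024)
alloc.allocate(100, 8)                # simulate method-load allocations
used = alloc.allocate(0, 1)           # probe: returns cursor, consumes nothing
assert used == alloc.allocate(0, 1)   # repeated probes are idempotent
alloc.allocate(1, 1)                  # the original 1-byte probe mutates state
assert alloc.allocate(0, 1) == used + 1
```

The idempotence of the size-0 probe is exactly the property the review asks for: the reported usage matches the allocator's state after the measurement.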
```cmake
# CMSIS-NN model uses only cortex_m:: ops, no portable ops needed.
# Skip --whole-archive on portable_ops_lib to avoid pulling unused ops.
message(STATUS "Selective build: CMSIS-NN only (no portable ops)")
target_link_libraries(
  executorch_pico
  PRIVATE ${BAREMETAL_BUILD_DIR}/lib/libexecutorch.a
          ${BAREMETAL_BUILD_DIR}/lib/libexecutorch_core.a
          -Wl,--whole-archive
          ${BAREMETAL_BUILD_DIR}/lib/libcortex_m_ops_lib.a
```
In the USE_CMSIS_NN + USE_SELECTIVE_BUILD branch, portable_ops_lib / executorch_selected_kernels are not linked. Even with CMSIS-NN accelerated int8 linear, this model still needs non-CMSIS ops like view/reshape (from the TinyMLPMNIST forward) to be registered, which typically come from portable_ops_lib or executorch_selected_kernels. Consider linking executorch_selected_kernels alongside cortex_m_ops_lib (or keep portable_ops_lib) so this configuration doesn’t hit OperatorMissing at runtime.
Suggested change:

```diff
-# CMSIS-NN model uses only cortex_m:: ops, no portable ops needed.
-# Skip --whole-archive on portable_ops_lib to avoid pulling unused ops.
-message(STATUS "Selective build: CMSIS-NN only (no portable ops)")
+# CMSIS-NN accelerates supported kernels, but selective builds may still
+# require non-CMSIS operator registrations (for example view/reshape).
+# Link executorch_selected_kernels alongside cortex_m_ops_lib so the
+# model-required non-CMSIS ops are registered without pulling all portable ops.
+message(STATUS "Selective build: CMSIS-NN + executorch_selected_kernels")
 target_link_libraries(
   executorch_pico
   PRIVATE ${BAREMETAL_BUILD_DIR}/lib/libexecutorch.a
           ${BAREMETAL_BUILD_DIR}/lib/libexecutorch_core.a
           -Wl,--whole-archive
           ${BAREMETAL_BUILD_DIR}/lib/libcortex_m_ops_lib.a
+          ${BAREMETAL_BUILD_DIR}/lib/libexecutorch_selected_kernels.a
```