[Pico2] Add CMSIS-NN INT8 support and latency instrumentation #18612

psiddh merged 3 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18612
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
⏳ 20 Pending, 1 Unrelated Failure as of commit 50bdac5 with merge base 75c677f. FLAKY: the following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Add INT8 quantized inference via CMSIS-NN kernels for Cortex-M33 on Pico2, alongside the existing FP32 portable path. This enables a direct FP32 vs INT8 comparison for MCU deployment benchmarking.

- Add export_mlp_mnist_cmsis.py: quantized export using CortexMQuantizer
- CMakeLists.txt: USE_CMSIS_NN and USE_SELECTIVE_BUILD options for flexible linking
- build_firmware_pico.sh: --cmsis flag, fix TARGET_CPU to cortex-m33+nofp, portable nproc, auto-detect ARM toolchain, remove unused cmake flags
- main.cpp: per-digit inference timing via time_us_32()

Co-authored-by: Claude <noreply@anthropic.com>
Pull request overview
Adds an INT8/CMSIS-NN accelerated inference path for the Pico2 (Cortex-M33) MNIST MLP demo, plus instrumentation to compare FP32 vs INT8 latency and memory usage during MCU benchmarking.
Changes:
- Add a new export script to produce an INT8-quantized (CMSIS-NN-targeting) .pte using CortexMQuantizer + Cortex-M passes.
- Add build-time switches for CMSIS-NN linking and (intended) selective-build linking in the Pico2 CMake + build script.
- Add per-digit inference timing and a latency summary; add a method allocator memory-usage printout.
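The latency-summary logic described above can be sketched in miniature (a hypothetical Python model of what main.cpp computes; on device the timestamps come from the Pico SDK's time_us_32(), and the values below are illustrative, not measured):

```python
def latency_summary(latencies_us):
    """Summarize per-digit inference latencies (in microseconds) into
    min/avg/max, mirroring the per-inference timing summary in main.cpp."""
    return {
        "min_us": min(latencies_us),
        "avg_us": sum(latencies_us) / len(latencies_us),
        "max_us": max(latencies_us),
    }

# Illustrative per-digit latencies for digits 0-9 (not measured values).
per_digit_us = [4210, 4198, 4225, 4190, 4230, 4205, 4218, 4201, 4215, 4208]
print(latency_summary(per_digit_us))
```

Collecting one latency per digit and reducing at the end keeps the per-inference overhead to two timer reads, which matters when the quantity being measured is itself microsecond-scale.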
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| examples/raspberry_pi/pico2/main.cpp | Adds per-inference timing + latency summary and a memory usage printout after method load. |
| examples/raspberry_pi/pico2/export_mlp_mnist_cmsis.py | New INT8/CMSIS-NN export flow using Cortex-M quantization + optimization passes. |
| examples/raspberry_pi/pico2/build_firmware_pico.sh | Adds --cmsis flag, portable nproc, toolchain autodetect, and Pico CMake flag handling. |
| examples/raspberry_pi/pico2/CMakeLists.txt | Adds USE_CMSIS_NN / USE_SELECTIVE_BUILD options and conditional linking for CMSIS-NN vs portable kernels. |
```python
pass_manager = CortexMPassManager(edge_program.exported_program())
transformed_ep = pass_manager.transform()

edge_program = to_edge(transformed_ep, compile_config=edge_config)
```
CortexMPassManager.transform() returns an ExportedProgram that is already in Edge dialect (it’s transforming edge_program.exported_program()). Re-wrapping that result with to_edge() is meant for ATen->Edge conversion and can re-run ATen->Edge passes on an Edge program. Instead, construct an EdgeProgramManager directly from the transformed Edge ExportedProgram (e.g., EdgeProgramManager(transformed_ep, compile_config=edge_config)) before calling to_executorch().
```shell
if [ $USE_CMSIS -eq 1 ]; then
    PICO_CMAKE_FLAGS+=(-DUSE_CMSIS_NN=ON)
fi
```
The script enables ExecuTorch selective build via -DEXECUTORCH_SELECT_OPS_MODEL, but the Pico2 firmware CMake is never told to link the selective kernel library (USE_SELECTIVE_BUILD option added in CMakeLists.txt). As a result, the firmware still links the full portable_ops_lib under --whole-archive, negating the binary-size benefits of selective build. Consider propagating this automatically (e.g., add -DUSE_SELECTIVE_BUILD=ON when SELECT_OPS_FLAGS is set) so selective build works end-to-end.
Suggested change:

```shell
if [ -n "${SELECT_OPS_FLAGS:-}" ]; then
    PICO_CMAKE_FLAGS+=(-DUSE_SELECTIVE_BUILD=ON)
fi
```
```cpp
// Probe method allocator usage: try allocating 1 byte to find cur_ position
void* probe = method_allocator.allocate(1, 1);
```
The allocator "probe" uses method_allocator.allocate(1, 1), which advances the allocator cursor and permanently consumes a byte (plus any padding). On tight SRAM budgets this can change whether method loading / subsequent allocations succeed, and it makes the reported usage slightly inconsistent with the post-probe state. Prefer querying usage without mutating the allocator (e.g., allocate(0, 1) to get the current cursor without consuming space, or add/use an explicit "bytes_used" accessor if available).
Suggested change:

```diff
-// Probe method allocator usage: try allocating 1 byte to find cur_ position
-void* probe = method_allocator.allocate(1, 1);
+// Probe method allocator usage without consuming memory.
+void* probe = method_allocator.allocate(0, 1);
```
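The zero-size probe trick can be illustrated with a toy bump allocator (a hypothetical model written for this review, not ExecuTorch's actual MemoryAllocator; real alignment and failure handling are more involved). A size-0 allocation returns the current cursor without advancing it, while the PR's original size-1 probe permanently consumes a byte:

```python
class BumpAllocator:
    """Toy bump allocator modeling MemoryAllocator-style behavior."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cur = 0  # offset of the next free byte

    def allocate(self, size, alignment=1):
        # Round the cursor up to the requested alignment, then bump by size.
        aligned = (self.cur + alignment - 1) // alignment * alignment
        if aligned + size > self.capacity:
            return None
        self.cur = aligned + size
        return aligned  # stand-in for the returned pointer

alloc = BumpAllocator(1024)
alloc.allocate(100, 8)                # simulate method-load allocations
used = alloc.allocate(0, 1)           # probe: returns cursor, consumes nothing
assert used == alloc.allocate(0, 1)   # repeated probes are idempotent
alloc.allocate(1, 1)                  # the original 1-byte probe mutates state
assert alloc.allocate(0, 1) == used + 1
```

The idempotence of the size-0 probe is exactly the property the review asks for: the reported usage matches the allocator's state after the measurement.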
```cmake
# CMSIS-NN model uses only cortex_m:: ops, no portable ops needed.
# Skip --whole-archive on portable_ops_lib to avoid pulling unused ops.
message(STATUS "Selective build: CMSIS-NN only (no portable ops)")
target_link_libraries(
  executorch_pico
  PRIVATE ${BAREMETAL_BUILD_DIR}/lib/libexecutorch.a
          ${BAREMETAL_BUILD_DIR}/lib/libexecutorch_core.a
          -Wl,--whole-archive
          ${BAREMETAL_BUILD_DIR}/lib/libcortex_m_ops_lib.a
```
In the USE_CMSIS_NN + USE_SELECTIVE_BUILD branch, portable_ops_lib / executorch_selected_kernels are not linked. Even with CMSIS-NN accelerated int8 linear, this model still needs non-CMSIS ops like view/reshape (from the TinyMLPMNIST forward) to be registered, which typically come from portable_ops_lib or executorch_selected_kernels. Consider linking executorch_selected_kernels alongside cortex_m_ops_lib (or keep portable_ops_lib) so this configuration doesn’t hit OperatorMissing at runtime.
Suggested change:

```diff
-# CMSIS-NN model uses only cortex_m:: ops, no portable ops needed.
-# Skip --whole-archive on portable_ops_lib to avoid pulling unused ops.
-message(STATUS "Selective build: CMSIS-NN only (no portable ops)")
+# CMSIS-NN accelerates supported kernels, but selective builds may still
+# require non-CMSIS operator registrations (for example view/reshape).
+# Link executorch_selected_kernels alongside cortex_m_ops_lib so the
+# model-required non-CMSIS ops are registered without pulling all portable ops.
+message(STATUS "Selective build: CMSIS-NN + executorch_selected_kernels")
 target_link_libraries(
   executorch_pico
   PRIVATE ${BAREMETAL_BUILD_DIR}/lib/libexecutorch.a
           ${BAREMETAL_BUILD_DIR}/lib/libexecutorch_core.a
           -Wl,--whole-archive
           ${BAREMETAL_BUILD_DIR}/lib/libcortex_m_ops_lib.a
+          ${BAREMETAL_BUILD_DIR}/lib/libexecutorch_selected_kernels.a
```