
Conversation

Contributor
@psiddh psiddh commented Aug 11, 2025

Test Plan:

  1. examples/arm/run.sh - No regressions ==> Ok
  2. examples/arm/run.sh now runs 'qadd2' in quantize-only mode ==> Ok
  3. python -m unittest test_replace_quant_nodes.py ==> Ok
  4. python -m unittest test_quantize_op_fusion_pass.py ==> Ok

Reviewers:

Subscribers:

Tasks:

Tags:


cc @digantdesai @freddan80 @per @zingo @oscarandersson8218


pytorch-bot bot commented Aug 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13296

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 7262dda with merge base dfc387b:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label on Aug 11, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@psiddh psiddh force-pushed the cmsis_main branch 10 times, most recently from 23666ac to b166a2e on August 12, 2025 at 08:39
ET_LOG(Info, "Input1 dtype: %d, Input2 dtype: %d, Output dtype: %d",
static_cast<int>(input1_dtype), static_cast<int>(input2_dtype), static_cast<int>(out_dtype));

// Stub for now


Comment on lines 11 to 13
extern "C" {
#include "Include/arm_nnfunctions.h"
}


Tensor& out) {
ET_LOG(Info, "add_Tensor kernel called");

// Ensure input is char type


"other.scalar_type() %" PRId8 " is not char type",
static_cast<int8_t>(other.scalar_type()));

// Stub for now



// 🔧 FIX: Use template types that ExecuTorch definitely provides
// Use to<int64_t>() and to<double>() which are commonly instantiated
int64_t zp1 = input1_zero_point.to<int64_t>();


ET_LOG(Error, "quantized_add_out: arm_elementwise_add_s8 failed with status [%d]", status);
std::memset(out.mutable_data_ptr<int8_t>(), 0, out.nbytes());
} else {
ET_LOG(Info, "quantized_add_out: Successfully completed with AoT-computed parameters! 🎯");


Comment on lines 127 to 128
ET_LOG(Info, "quantized_add: input1_int8.sizes() = %zu", input1_int8.sizes().size());
return const_cast<Tensor&>(input1_int8); // to make compiler happy



Comment on lines 62 to 63
# Call the backend kernel that writes into 'out'
return exir_ops.edge.cortex_m.add.out(self, other, alpha, out)


output_multiplier: int,
output_shift: int,
) -> torch.Tensor:
return torch.empty_like(self, dtype=torch.int8)


# For now, convert back to float, add, and quantize (as placeholder)

# Dequantize inputs using multiplier/shift
self_fp = (self.float() - self_zero_point) * (self_multiplier / (1 << (31 - self_shift)))
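For context, here is a minimal sketch (my own illustration, not code from this PR) of how a floating-point scale is commonly decomposed into the fixed-point multiplier and shift that CMSIS-NN-style kernels consume; it matches the multiplier / (1 << (31 - shift)) convention used in the snippet above:

import math

def quantize_scale(scale: float):
    # Decompose scale so that scale == multiplier * 2**shift / 2**31,
    # with multiplier a Q0.31 value in [2**30, 2**31).
    if scale == 0.0:
        return 0, 0
    mantissa, shift = math.frexp(scale)       # scale = mantissa * 2**shift, 0.5 <= mantissa < 1
    multiplier = round(mantissa * (1 << 31))  # Q0.31 fixed-point mantissa
    if multiplier == (1 << 31):               # rounding pushed the mantissa up to 1.0
        multiplier //= 2
        shift += 1
    return multiplier, shift

# Example: a requantization scale of 0.05 -> multiplier ~= 0.8 * 2**31, shift = -4
print(quantize_scale(0.05))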


out: torch.Tensor,
) -> torch.Tensor:
# Validate shape compatibility if needed
assert out.shape == self.shape, "Output shape must match input shape"


kernels:
- arg_meta: null
kernel_name: cortex_m::dequantize_per_tensor_out
- func: aten::add.Tensor(Tensor input1, Tensor input2, *, Tensor(a!) out) -> Tensor(a!)


- arg_meta: null
kernel_name: cortex_m::add_out

- func: cortex_m::quantized_add(Tensor self, Scalar self_zero_point, Scalar self_multiplier, Scalar self_shift, Tensor other, Scalar other_zero_point, Scalar other_multiplier, Scalar other_shift, Scalar output_zero_point, Scalar output_multiplier, Scalar output_shift) -> Tensor


import torch
import math

class QuantizedAddFusionPass(ExportPass):


and dequant_node2.target == exir_ops.edge.cortex_m.dequantize_per_tensor.default):
continue

print("✅ Found complete cortex_m Q/DQ + add pattern!")
Contributor

Can we simplify this by using a subgraph_rewriter? I am thinking of this as a common pattern for most of these ops.
see - https://docs.pytorch.org/executorch/stable/compiler-custom-compiler-passes.html#level-2
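For reference, a minimal, self-contained sketch of how torch.fx's subgraph_rewriter works in general; the toy relu(add(x, y)) pattern here merely stands in for the dequant -> add -> quant chain and is not the actual cortex_m pass:

import torch
from torch.fx import symbolic_trace, subgraph_rewriter

class Toy(torch.nn.Module):
    def forward(self, x, y):
        # Stand-in for dequantize -> add -> quantize
        return torch.relu(torch.add(x, y))

def pattern(x, y):
    return torch.relu(torch.add(x, y))

def replacement(x, y):
    # Stand-in for a single fused op such as cortex_m::quantized_add
    return torch.clamp(torch.add(x, y), min=0.0)

gm = symbolic_trace(Toy())
matches = subgraph_rewriter.replace_pattern(gm, pattern, replacement)
print(f"replaced {len(matches)} occurrence(s)")
print(gm.code)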

Contributor Author

Agree, I am looking to replace this with subgraph_rewriter, but I haven't managed to do it successfully so far. Maybe a follow-up PR, if you are ok with it?

Contributor

create an issue?

Collaborator

All CMSIS-NN ops are quantized so adding one pass per op for fusing them in this way will result in a lot of duplicate code. In the arm backend we "fold" all q/dq-ops into quantized ops like this: https://github.com/pytorch/executorch/blob/main/backends/arm/_passes/fold_qdq_with_annotated_qparams_pass.py.

Could we reuse that here? And then have one single pass for mapping edge ops with correct dtypes to arm ops perhaps?

Contributor Author

Nice! Let me look into this pass and see if I can reuse it for this purpose

Contributor Author
@psiddh psiddh Aug 25, 2025

Reviewed the fold_qdq pass; I think it fundamentally serves a different purpose. As far as I can tell, that pass only removes Q/DQ nodes and stores the quantization parameters as metadata. The current cortex_m pass, on the other hand, replaces the operation itself with a custom CMSIS-NN operation (like cortex_m::quantized_add.out) and computes the CMSIS-NN-specific multipliers and shifts.

Moreover, with the refactored changes (latest version), quantized_op_fusion_pass.py is already properly extensible: we can easily add more operations by extending the SUPPORTED_OPS_MAPPING dictionary (see the sketch below) without creating separate passes for each operation.
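A hypothetical sketch of what such a mapping could look like (the actual structure in quantized_op_fusion_pass.py may differ; the mul entry is purely illustrative):

from executorch.exir.dialects._ops import ops as exir_ops

# Hypothetical shape of the table: edge op to fuse -> fused cortex_m op.
SUPPORTED_OPS_MAPPING = {
    exir_ops.edge.aten.add.Tensor: exir_ops.edge.cortex_m.quantized_add.default,
    # Future ops would be added here, e.g.:
    # exir_ops.edge.aten.mul.Tensor: exir_ops.edge.cortex_m.quantized_mul.default,
}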

Contributor Author

@digantdesai Here is the issue to track subgraph_rewriter: #13627

Collaborator

Great improvement with the generalized pass! FWIW this is closer to what we started out with in the Arm backend and we moved to folding the Q/DQ ops separately from all other lowering logic to simplify development as the backend grew more complex. This backend is smaller though so it might not be as big of an issue, we'll see.


# Step 2: Apply fusion pass
fusion_pass = QuantizedAddFusionPass()
final_program = intermediate_program.transform([fusion_pass])
Contributor

You can give these as a list; no need to call transform twice.
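For illustration, the suggestion amounts to something like this, reusing the names from the surrounding test code (a sketch, not the exact change):

# Apply both passes in one transform() call instead of two.
final_program = intermediate_program.transform(
    [ReplaceQuantNodesPass(), QuantizedAddFusionPass()]
)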

Contributor Author

Done

# Check fusion occurred
check_count(transformed_graph, exir_ops.edge.cortex_m.quantized_add.default, 1)

# Verify numerical equivalence
Contributor

why are these checks commented out?

Contributor Author

OK, the test case was doing the wrong thing previously, in the sense that it was comparing the fp32 output with the quantized int8 output. I fixed this 'numerical equivalence' check in the latest iteration and it passes as expected.
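The idea behind the fix looks roughly like the following sketch (assert_quantized_close is a hypothetical helper for illustration, not the test's actual code):

import torch

def assert_quantized_close(out_int8: torch.Tensor,
                           expected_fp: torch.Tensor,
                           scale: float, zero_point: int) -> None:
    # Dequantize the int8 result before comparing against the fp32 reference,
    # instead of comparing raw int8 values with float values.
    out_fp = (out_int8.to(torch.float32) - zero_point) * scale
    # Tolerate up to one quantization step of error.
    torch.testing.assert_close(out_fp, expected_fp, atol=scale, rtol=0.0)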

Comment on lines 763 to 771
gm = edge.exported_program().graph_module

logging.debug(">>> Lowered GraphModule code <<<")
logging.debug(gm.code) # Python‐style source of the graph
logging.debug(">>> Lowered GraphModule nodes <<<")
for node in gm.graph.nodes:
logging.debug(f"Node: {node.target}, args={node.args}, kwargs={node.kwargs}")
logging.debug("==== Graph after quantization ====")

Contributor

do we need this?

Contributor Author

I just left it, as it proved very useful while debugging this pass.

Collaborator

Generally we do not keep debugging code around upstream unless it is not overly verbose and there is a really good reason for it, so I would prefer to see this removed.

Contributor
@digantdesai digantdesai left a comment


"--delegate --quantize" # 4 qadd2
"--delegate --quantize" # 5 qops
"--delegate --quantize" # 6 mv2
"--quantize" # 5 qadd2 (quantize only)
Collaborator

Why is this test added here?

Contributor Author

This test case, qadd2 (no delegation, quantize only), serves as a good e2e test on the FVP simulator to validate the fused quantized node flow (with CMSIS-NN integration).

Collaborator

I like that we are testing this but adding them to the default list of models to run in run.sh will scale very poorly as we add more ops so I do not think it is the best place for it. Could it be added as a testsuite in https://github.com/pytorch/executorch/blob/main/backends/arm/test/test_arm_baremetal.sh perhaps (or even have a separate test_cortex_m.sh in this backend)?

edge = edge.transform([ReplaceQuantNodesPass()])
# Instantiate the pass
replace_quant_pass = ReplaceQuantNodesPass()
quantized_add_fusion_pass = QuantizedAddFusionPass()
Collaborator

Maybe not in this PR but the Cortex-M backend will need a more structured way of running passes soon


)

# Link directly to the CMSIS-NN static library file
target_link_libraries(
Collaborator

Why is it done this way instead of directly adding the library target?

Contributor Author

The CMSIS-NN library is linked by specifying the static library file path directly because CMSIS-NN is brought in via FetchContent and does not define a CMake target in the build.

using Error = executorch::runtime::Error;

// Basic tensor type / layout validation and dimension order checking
inline void validate_quantized_tensor_types_and_dim_order(
Collaborator

Most ops will require channels-last dim_order and there are multiple dtypes available so this name might be confusing in the future.

"--delegate --quantize" # 4 qadd2
"--delegate --quantize" # 5 qops
"--delegate --quantize" # 6 mv2
"--quantize" # 5 qadd2 (quantize only)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that we are testing this but adding them to the default list of models to run in run.sh will scale very poorly as we add more ops so I do not think it is the best place for it. Could it be added as a testsuite in https://github.com/pytorch/executorch/blob/main/backends/arm/test/test_arm_baremetal.sh perhaps (or even have a separate test_cortex_m.sh in this backend)?



output_shift_val);

// Call CMSIS-NN kernel with precomputed parameters
arm_cmsis_nn_status status = arm_elementwise_add_s8(
Collaborator

This could potentially be added as a pass as well: https://github.com/pytorch/executorch/blob/main/backends/arm/_passes/broadcast_args_pass.py. But long term the ideal solution would be to add broadcast support to CMSIS-NN to get it accelerated w/o memcopies.

Comment on lines +54 to +110
inline void validate_quantization_params(
const Scalar& zero_point1,
const Scalar& multiplier1,
const Scalar& shift1,
const Scalar& zero_point2,
const Scalar& multiplier2,
const Scalar& shift2,
const Scalar& output_zero_point,
const Scalar& output_multiplier,
const Scalar& output_shift,
Tensor& output) {
Contributor

Make it a util that checks a single set of quant params.

Suggested change
- inline void validate_quantization_params(
-     const Scalar& zero_point1,
-     const Scalar& multiplier1,
-     const Scalar& shift1,
-     const Scalar& zero_point2,
-     const Scalar& multiplier2,
-     const Scalar& shift2,
-     const Scalar& output_zero_point,
-     const Scalar& output_multiplier,
-     const Scalar& output_shift,
-     Tensor& output) {
+ inline void validate_quantization_params(
+     const Scalar& zero_point,
+     const Scalar& multiplier,
+     const Scalar& shift) {

out_shift_val);
}

inline Error resize_to_broadcast_target_size_quantized(
Contributor

why quantization in the name?

Contributor Author

Removed

Comment on lines 145 to 164
// Initialize shapes with 1s for padding
for (int i = 0; i < max_dim; i++) {
inp1_shape[i] = 1;
inp2_shape[i] = 1;
out_shape[i] = 1;
}

int offset_inp1 = max_dim - input1.dim();
int offset_inp2 = max_dim - input2.dim();
int offset_out = max_dim - output.dim();

for (int i = 0; i < input1.dim(); i++) {
inp1_shape[i + offset_inp1] = input1.size(i);
}
for (int i = 0; i < input2.dim(); i++) {
inp2_shape[i + offset_inp2] = input2.size(i);
}
for (int i = 0; i < output.dim(); i++) {
out_shape[i + offset_out] = output.size(i);
}
Contributor

do we need all this logic if we are going to call the get_broadcast_target_size util?

Contributor Author

Not needed anymore, removed it

// In the pass we are calling quantized_add's default variant
// but ExecuTorch's kernel dispatch mechanism will end up calling the out
// variant. This stub is to make sure that compiler doesn't complain.
Tensor quantized_add(
Contributor

remove this?

Error,
"quantized_add_out: arm_elementwise_add_s8 failed with status [%d]",
status);
std::memset(out.mutable_data_ptr<int8_t>(), 0, out.nbytes());
Contributor

Why do we need this?
Set a failure flag with the appropriate Error in ctx (ctx.fail(err);) so the runtime doesn't keep going with wrong data.

Contributor Author

Done

Contributor
@digantdesai digantdesai left a comment

Thanks @psiddh, left some nit comments.

# Copy essential fields
for field in ["val", "tensor_meta", "stack_trace"]:
if field in source_node.meta:
new_node.meta[field] = source_node.meta[field]
Contributor

wouldn't val here be fp32?

Contributor Author

Yes, the val field in the node metadata is fp32 from the original unquantized computation. This was done for debugging; in the latest iteration I removed it.

Comment on lines 130 to 131


Contributor

run code formatting?

Comment on lines 765 to 771
logging.debug(">>> Lowered GraphModule code <<<")
logging.debug(gm.code) # Python‐style source of the graph
logging.debug(">>> Lowered GraphModule nodes <<<")
for node in gm.graph.nodes:
logging.debug(f"Node: {node.target}, args={node.args}, kwargs={node.kwargs}")
logging.debug("==== Graph after quantization ====")

Contributor

Nit: remove?

Contributor
@mergennachin mergennachin left a comment

Can you convert these instructions to CI? Could be a follow-up PR.

I believe we should be able to run this on our AWS Graviton instances.

Looks like there is test_arm_baremetal.sh running; please either extend that or create new tests to exercise the code you just added.

Test Plan:

examples/arm/run.sh - No regressions ==> Ok
examples/arm/run.sh now runs 'qadd2' in quantize only mode ==> Ok
python -m unittest test_replace_quant_nodes.py ==> Ok
python -m unittest test_quantize_op_fusion_pass.py ==> Ok

# Add dependency to ensure CMSIS-NN builds before we try to link. Use the
# actual CMSIS-NN target name (usually 'cmsis-nn')
add_dependencies(cortex_m_kernels cmsis-nn)

Contributor

There is no C++ testing?

if(CORTEX_M_BUILD_TESTS)
  enable_testing()
  add_subdirectory(tests)
endif()

Contributor Author
@psiddh psiddh Aug 27, 2025

@mergennachin Created a task to track a dedicated test suite for cortex_m ops (covering Python tests, C++ unit tests, and E2E tests): #13739 (and the initial work on this: PR #1357).

Contributor

FYI, we can't run cortex-m op unit tests without the FVP, but for C++-only utils, sure.

target_link_libraries(cortex_m_kernels PRIVATE executorch)
target_compile_options(cortex_m_kernels PUBLIC ${_common_compile_options})

# Include directories for cortex_m_kernels
Contributor

Besides _common_compile_options, don't you also need Cortex-M-specific compile options?

Contributor

comes from the toolchain file.

Contributor Author
@psiddh psiddh commented Aug 27, 2025

Three follow-up issues were identified to make further improvements:

@psiddh psiddh force-pushed the cmsis_main branch 2 times, most recently from d9b48d3 to 5768f06 on August 28, 2025 at 04:15
Test Plan:
  a) Setup for Arm FVP and run 'examples/arm/run.sh' (Check no
regressions in e2e test scenarios)
  b) Then add to run.sh another iteration with qadd with only --quantize
flag and see that quantized add op is called
  c) cd backends/cortex_m/test/; python test_quantize_op_fusion_pass.py
     ----------------------------------------------------------------------
     Ran 9 tests in 11.128s
     OK

Reviewers:

Subscribers:

Tasks:

Tags:
@psiddh psiddh merged commit 18907e6 into pytorch:main Aug 28, 2025
111 of 112 checks passed

Labels

CLA Signed - This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
partner: arm - For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm

4 participants