
Conversation

Contributor
@psiddh psiddh commented Aug 11, 2025

Test Plan:

  1. examples/arm/run.sh - No regressions ==> Ok
  2. examples/arm/run.sh now runs 'qadd2' in quantize-only mode ==> Ok
  3. python -m unittest test_replace_quant_nodes.py ==> Ok
  4. python -m unittest test_quantize_op_fusion_pass.py ==> Ok

Reviewers:

Subscribers:

Tasks:

Tags:


cc @digantdesai @freddan80 @per @zingo @oscarandersson8218


pytorch-bot bot commented Aug 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13296

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 7262dda with merge base dfc387b:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label on Aug 11, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@psiddh psiddh force-pushed the cmsis_main branch 10 times, most recently from 23666ac to b166a2e on August 12, 2025 at 08:39
ET_LOG(Info, "Input1 dtype: %d, Input2 dtype: %d, Output dtype: %d",
static_cast<int>(input1_dtype), static_cast<int>(input2_dtype), static_cast<int>(out_dtype));

// Stub for now


Comment on lines 11 to 13
extern "C" {
#include "Include/arm_nnfunctions.h"
}


Tensor& out) {
ET_LOG(Info, "add_Tensor kernel called");

// Ensure input is char type


"other.scalar_type() %" PRId8 " is not char type",
static_cast<int8_t>(other.scalar_type()));

// Stub for now



// 🔧 FIX: Use template types that ExecuTorch definitely provides
// Use to<int64_t>() and to<double>() which are commonly instantiated
int64_t zp1 = input1_zero_point.to<int64_t>();


ET_LOG(Error, "quantized_add_out: arm_elementwise_add_s8 failed with status [%d]", status);
std::memset(out.mutable_data_ptr<int8_t>(), 0, out.nbytes());
} else {
ET_LOG(Info, "quantized_add_out: Successfully completed with AoT-computed parameters! 🎯");


Comment on lines 127 to 128
ET_LOG(Info, "quantized_add: input1_int8.sizes() = %zu", input1_int8.sizes().size());
return const_cast<Tensor&>(input1_int8); // to make compiler happy



Comment on lines 62 to 63
# Call the backend kernel that writes into 'out'
return exir_ops.edge.cortex_m.add.out(self, other, alpha, out)


output_multiplier: int,
output_shift: int,
) -> torch.Tensor:
return torch.empty_like(self, dtype=torch.int8)


# For now, convert back to float, add, and quantize (as placeholder)

# Dequantize inputs using multiplier/shift
self_fp = (self.float() - self_zero_point) * (self_multiplier / (1 << (31 - self_shift)))
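For context, here is a minimal sketch (my own illustration, not code from this PR) of how a floating-point scale is commonly decomposed into the fixed-point multiplier and shift that CMSIS-NN-style kernels consume; it matches the multiplier / (1 << (31 - shift)) convention used in the snippet above:

import math

def quantize_scale(scale: float):
    # Decompose scale so that scale == multiplier * 2**shift / 2**31,
    # with multiplier a Q0.31 value in [2**30, 2**31).
    if scale == 0.0:
        return 0, 0
    mantissa, shift = math.frexp(scale)       # scale = mantissa * 2**shift, 0.5 <= mantissa < 1
    multiplier = round(mantissa * (1 << 31))  # Q0.31 fixed-point mantissa
    if multiplier == (1 << 31):               # rounding pushed the mantissa up to 1.0
        multiplier //= 2
        shift += 1
    return multiplier, shift

# Example: a requantization scale of 0.05 -> multiplier ~= 0.8 * 2**31, shift = -4
print(quantize_scale(0.05))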


out: torch.Tensor,
) -> torch.Tensor:
# Validate shape compatibility if needed
assert out.shape == self.shape, "Output shape must match input shape"


kernels:
- arg_meta: null
kernel_name: cortex_m::dequantize_per_tensor_out
- func: aten::add.Tensor(Tensor input1, Tensor input2, *, Tensor(a!) out) -> Tensor(a!)


- arg_meta: null
kernel_name: cortex_m::add_out

- func: cortex_m::quantized_add(Tensor self, Scalar self_zero_point, Scalar self_multiplier, Scalar self_shift, Tensor other, Scalar other_zero_point, Scalar other_multiplier, Scalar other_shift, Scalar output_zero_point, Scalar output_multiplier, Scalar output_shift) -> Tensor


import torch
import math

class QuantizedAddFusionPass(ExportPass):


and dequant_node2.target == exir_ops.edge.cortex_m.dequantize_per_tensor.default):
continue

print("✅ Found complete cortex_m Q/DQ + add pattern!")
Contributor

Can we simplify this by using a subgraph_rewriter? I am thinking of this as a common pattern for most of these ops.
see - https://docs.pytorch.org/executorch/stable/compiler-custom-compiler-passes.html#level-2
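For reference, a minimal, self-contained sketch of how torch.fx's subgraph_rewriter works in general; the toy relu(add(x, y)) pattern here merely stands in for the dequant -> add -> quant chain and is not the actual cortex_m pass:

import torch
from torch.fx import symbolic_trace, subgraph_rewriter

class Toy(torch.nn.Module):
    def forward(self, x, y):
        # Stand-in for dequantize -> add -> quantize
        return torch.relu(torch.add(x, y))

def pattern(x, y):
    return torch.relu(torch.add(x, y))

def replacement(x, y):
    # Stand-in for a single fused op such as cortex_m::quantized_add
    return torch.clamp(torch.add(x, y), min=0.0)

gm = symbolic_trace(Toy())
matches = subgraph_rewriter.replace_pattern(gm, pattern, replacement)
print(f"replaced {len(matches)} occurrence(s)")
print(gm.code)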

Contributor Author

Agree, I am looking to replace this with subgraph_rewriter, but I haven't managed to do it successfully so far. Maybe a follow-up PR, if you are ok with it?

Contributor

create an issue?

Collaborator

All CMSIS-NN ops are quantized so adding one pass per op for fusing them in this way will result in a lot of duplicate code. In the arm backend we "fold" all q/dq-ops into quantized ops like this: https://github.com/pytorch/executorch/blob/main/backends/arm/_passes/fold_qdq_with_annotated_qparams_pass.py.

Could we reuse that here? And then have one single pass for mapping edge ops with correct dtypes to arm ops perhaps?

Contributor Author

Nice! Let me look into this pass and see if I can reuse it for this purpose

Contributor Author
@psiddh psiddh Aug 25, 2025

Reviewed the fold_qdq pass; I think it fundamentally serves a different purpose. As far as I can tell, that pass only removes Q/DQ nodes and stores the quantization parameters as metadata. The current cortex_m pass, on the other hand, replaces the operation itself with a custom CMSIS-NN operation (like cortex_m::quantized_add.out) and computes the CMSIS-NN-specific multipliers and shifts.

Moreover, with the refactored changes (latest version), quantized_op_fusion_pass.py is already properly extensible: we can easily add more operations by extending the SUPPORTED_OPS_MAPPING dictionary (see the sketch below) without creating separate passes for each operation.
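A hypothetical sketch of what such a mapping could look like (the actual structure in quantized_op_fusion_pass.py may differ; the mul entry is purely illustrative):

from executorch.exir.dialects._ops import ops as exir_ops

# Hypothetical shape of the table: edge op to fuse -> fused cortex_m op.
SUPPORTED_OPS_MAPPING = {
    exir_ops.edge.aten.add.Tensor: exir_ops.edge.cortex_m.quantized_add.default,
    # Future ops would be added here, e.g.:
    # exir_ops.edge.aten.mul.Tensor: exir_ops.edge.cortex_m.quantized_mul.default,
}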

Contributor Author

@digantdesai Here is the issue to track subgraph_rewriter: #13627

Collaborator

Great improvement with the generalized pass! FWIW this is closer to what we started out with in the Arm backend and we moved to folding the Q/DQ ops separately from all other lowering logic to simplify development as the backend grew more complex. This backend is smaller though so it might not be as big of an issue, we'll see.


# Step 2: Apply fusion pass
fusion_pass = QuantizedAddFusionPass()
final_program = intermediate_program.transform([fusion_pass])
Contributor

You can give these as a list; no need to call transform twice.
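For illustration, the suggestion amounts to something like this, reusing the names from the surrounding test code (a sketch, not the exact change):

# Apply both passes in one transform() call instead of two.
final_program = intermediate_program.transform(
    [ReplaceQuantNodesPass(), QuantizedAddFusionPass()]
)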

Contributor Author

Done

# Check fusion occurred
check_count(transformed_graph, exir_ops.edge.cortex_m.quantized_add.default, 1)

# Verify numerical equivalence
Contributor

why are these checks commented out?

Contributor Author

OK, the test case was doing the wrong thing previously, in the sense that it was comparing the fp32 output with the quantized int8 output. I fixed this 'numerical equivalence' check in the latest iteration and it passes as expected.
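The idea behind the fix looks roughly like the following sketch (assert_quantized_close is a hypothetical helper for illustration, not the test's actual code):

import torch

def assert_quantized_close(out_int8: torch.Tensor,
                           expected_fp: torch.Tensor,
                           scale: float, zero_point: int) -> None:
    # Dequantize the int8 result before comparing against the fp32 reference,
    # instead of comparing raw int8 values with float values.
    out_fp = (out_int8.to(torch.float32) - zero_point) * scale
    # Tolerate up to one quantization step of error.
    torch.testing.assert_close(out_fp, expected_fp, atol=scale, rtol=0.0)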

Comment on lines 763 to 771
gm = edge.exported_program().graph_module

logging.debug(">>> Lowered GraphModule code <<<")
logging.debug(gm.code) # Python‐style source of the graph
logging.debug(">>> Lowered GraphModule nodes <<<")
for node in gm.graph.nodes:
logging.debug(f"Node: {node.target}, args={node.args}, kwargs={node.kwargs}")
logging.debug("==== Graph after quantization ====")

Contributor

do we need this?

Contributor Author

I just left it, as it proved very useful while debugging this pass.

Collaborator

Generally we do not keep debugging code around upstream unless it is not overly verbose and there is a really good reason for it, so I would prefer to see this removed.

Contributor
@digantdesai digantdesai left a comment


"--delegate --quantize" # 4 qadd2
"--delegate --quantize" # 5 qops
"--delegate --quantize" # 6 mv2
"--quantize" # 5 qadd2 (quantize only)
Collaborator

Why is this test added here?

Contributor Author

This test case, qadd2 (no delegation, quantize only), serves as a good e2e test on the FVP simulator to validate the fused quantized node flow (with CMSIS-NN integration).

Collaborator

I like that we are testing this but adding them to the default list of models to run in run.sh will scale very poorly as we add more ops so I do not think it is the best place for it. Could it be added as a testsuite in https://github.com/pytorch/executorch/blob/main/backends/arm/test/test_arm_baremetal.sh perhaps (or even have a separate test_cortex_m.sh in this backend)?

edge = edge.transform([ReplaceQuantNodesPass()])
# Instantiate the pass
replace_quant_pass = ReplaceQuantNodesPass()
quantized_add_fusion_pass = QuantizedAddFusionPass()
Collaborator

Maybe not in this PR but the Cortex-M backend will need a more structured way of running passes soon


)

# Link directly to the CMSIS-NN static library file
target_link_libraries(
Collaborator

Why is it done this way instead of directly adding the library target?

Contributor Author

The CMSIS-NN library is linked by specifying the static library file path directly because CMSIS-NN is brought in via FetchContent and does not define a CMake target in the build.

using Error = executorch::runtime::Error;

// Basic tensor type / layout validation and dimension order checking
inline void validate_quantized_tensor_types_and_dim_order(
Collaborator

Most ops will require channels-last dim_order and there are multiple dtypes available so this name might be confusing in the future.

"--delegate --quantize" # 4 qadd2
"--delegate --quantize" # 5 qops
"--delegate --quantize" # 6 mv2
"--quantize" # 5 qadd2 (quantize only)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that we are testing this but adding them to the default list of models to run in run.sh will scale very poorly as we add more ops so I do not think it is the best place for it. Could it be added as a testsuite in https://github.com/pytorch/executorch/blob/main/backends/arm/test/test_arm_baremetal.sh perhaps (or even have a separate test_cortex_m.sh in this backend)?



output_shift_val);

// Call CMSIS-NN kernel with precomputed parameters
arm_cmsis_nn_status status = arm_elementwise_add_s8(
Collaborator

This could potentially be added as a pass as well: https://github.com/pytorch/executorch/blob/main/backends/arm/_passes/broadcast_args_pass.py. But long term the ideal solution would be to add broadcast support to CMSIS-NN to get it accelerated w/o memcopies.

Comment on lines +54 to +110
inline void validate_quantization_params(
const Scalar& zero_point1,
const Scalar& multiplier1,
const Scalar& shift1,
const Scalar& zero_point2,
const Scalar& multiplier2,
const Scalar& shift2,
const Scalar& output_zero_point,
const Scalar& output_multiplier,
const Scalar& output_shift,
Tensor& output) {
Contributor

Make it a util that checks a single set of quant params.

Suggested change
- inline void validate_quantization_params(
-     const Scalar& zero_point1,
-     const Scalar& multiplier1,
-     const Scalar& shift1,
-     const Scalar& zero_point2,
-     const Scalar& multiplier2,
-     const Scalar& shift2,
-     const Scalar& output_zero_point,
-     const Scalar& output_multiplier,
-     const Scalar& output_shift,
-     Tensor& output) {
+ inline void validate_quantization_params(
+     const Scalar& zero_point,
+     const Scalar& multiplier,
+     const Scalar& shift) {

out_shift_val);
}

inline Error resize_to_broadcast_target_size_quantized(
Contributor

why quantization in the name?

Contributor Author

Removed

Comment on lines 145 to 164
// Initialize shapes with 1s for padding
for (int i = 0; i < max_dim; i++) {
inp1_shape[i] = 1;
inp2_shape[i] = 1;
out_shape[i] = 1;
}

int offset_inp1 = max_dim - input1.dim();
int offset_inp2 = max_dim - input2.dim();
int offset_out = max_dim - output.dim();

for (int i = 0; i < input1.dim(); i++) {
inp1_shape[i + offset_inp1] = input1.size(i);
}
for (int i = 0; i < input2.dim(); i++) {
inp2_shape[i + offset_inp2] = input2.size(i);
}
for (int i = 0; i < output.dim(); i++) {
out_shape[i + offset_out] = output.size(i);
}
Contributor

do we need all this logic if we are going to call the get_broadcast_target_size util?

Contributor Author

Not needed anymore, removed it

// In the pass we are calling quantized_add's default variant
// but ExecuTorch's kernel dispatch mechanism will end up calling the out
// variant. This stub is to make sure that compiler doesn't complain.
Tensor quantized_add(
Contributor

remove this?

Error,
"quantized_add_out: arm_elementwise_add_s8 failed with status [%d]",
status);
std::memset(out.mutable_data_ptr<int8_t>(), 0, out.nbytes());
Contributor

Why do we need this?
Set a failure flag with the appropriate Error in ctx (ctx.fail(err);) so the runtime doesn't keep going with wrong data.

Contributor Author

Done

Contributor
@digantdesai digantdesai left a comment

Thanks @psiddh, left some nit comments.

# Copy essential fields
for field in ["val", "tensor_meta", "stack_trace"]:
if field in source_node.meta:
new_node.meta[field] = source_node.meta[field]
Contributor

wouldn't val here be fp32?

Contributor Author

Yes, the val field in the node metadata is fp32 from the original unquantized computation. This was done for debugging; in the latest iteration I removed it.

Comment on lines 130 to 131


Contributor

run code formatting?

Comment on lines 765 to 771
logging.debug(">>> Lowered GraphModule code <<<")
logging.debug(gm.code) # Python‐style source of the graph
logging.debug(">>> Lowered GraphModule nodes <<<")
for node in gm.graph.nodes:
logging.debug(f"Node: {node.target}, args={node.args}, kwargs={node.kwargs}")
logging.debug("==== Graph after quantization ====")

Contributor

Nit: remove?

Contributor
@mergennachin mergennachin left a comment

Can you convert these instructions to CI? Could be a follow-up PR.

I believe we should be able to run this on our AWS Graviton instances.

Looks like there is test_arm_baremetal.sh running; please either extend that or create new tests to exercise the code you just added.

Test Plan:

examples/arm/run.sh - No regressions ==> Ok
examples/arm/run.sh now runs 'qadd2' in quantize only mode ==> Ok
python -m unittest test_replace_quant_nodes.py ==> Ok
python -m unittest test_quantize_op_fusion_pass.py ==> Ok

# Add dependency to ensure CMSIS-NN builds before we try to link. Use the
# actual CMSIS-NN target name (usually 'cmsis-nn')
add_dependencies(cortex_m_kernels cmsis-nn)

Contributor

There is no C++ testing?

if(CORTEX_M_BUILD_TESTS)
  enable_testing()
  add_subdirectory(tests)
endif()

Contributor Author
@psiddh psiddh Aug 27, 2025

@mergennachin Created a task to track a dedicated test suite for cortex_m ops (covering Python tests, C++ unit tests, and E2E tests): #13739 (and the initial work on this: PR #1357).

Contributor

FYI, we can't run cortex-m op unit tests without the FVP, but for C++-only utils, sure.

target_link_libraries(cortex_m_kernels PRIVATE executorch)
target_compile_options(cortex_m_kernels PUBLIC ${_common_compile_options})

# Include directories for cortex_m_kernels
Contributor

Besides _common_compile_options, don't you also need Cortex-M-specific compile options?

Contributor

comes from the toolchain file.

Contributor Author
@psiddh psiddh commented Aug 27, 2025

Three follow-up issues were identified to make further improvements:

@psiddh psiddh force-pushed the cmsis_main branch 2 times, most recently from d9b48d3 to 5768f06 on August 28, 2025 at 04:15
Test Plan:
  a) Setup for Arm FVP and run 'examples/arm/run.sh' (Check no
regressions in e2e test scenarios)
  b) Then add to run.sh another iteration with qadd with only --quantize
flag and see that quantized add op is called
  c) cd backends/cortex_m/test/; python test_quantize_op_fusion_pass.py
     ----------------------------------------------------------------------
     Ran 9 tests in 11.128s
     OK

Reviewers:

Subscribers:

Tasks:

Tags:
@psiddh psiddh merged commit 18907e6 into pytorch:main Aug 28, 2025
111 of 112 checks passed

Labels

CLA Signed - This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
partner: arm - For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm

4 participants