
Conversation


@psiddh psiddh commented Sep 12, 2025

Integrate CMSIS-NN with per-channel quantization support

Test Plan:
With local changes: run the e2e test on the FVP simulator
./examples/arm/run_mcu_models_fvp.sh --target=cortex-m55 --models=qlinear




pytorch-bot bot commented Sep 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14252

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 3f53d8d with merge base c2ddeec:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Sep 12, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@digantdesai digantdesai changed the title from "Summary: Add Statefull FC Cortex-m linearOps" to "Summary: Add Stateful FC Cortex-m linearOps" Sep 12, 2025
Comment on lines +257 to +268
input_zero_point: int,
input_multiplier: int,
input_shift: int,
Contributor

Do we want to just take Tensor in (even for a single element)? Rationale is to support per-token like quant if we want like per-tensor today.

Contributor Author
@psiddh psiddh Sep 16, 2025

Will do this in follow up change (if you are ok )

Contributor

File an issue?
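
For context: a tensor-valued parameter would let a single kernel cover per-tensor, per-channel, and per-token schemes. A minimal C++ sketch of the lookup this enables, with invented names rather than the PR's actual code:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: with the quant params passed as a buffer,
// a single-element buffer behaves as per-tensor, while an n-element
// buffer behaves as per-channel (or per-token, if indexed by token).
inline int32_t multiplier_for(
    const int32_t* multipliers,
    size_t numel,
    size_t index) {
  return (numel == 1) ? multipliers[0] : multipliers[index];
}
```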

bias_multiplier: torch.Tensor,
bias_shift: torch.Tensor,
scratch_buffer: torch.Tensor,
output_zero_point: int,
Contributor

And this one too..

) * weight_scales.unsqueeze(1)
if bias is not None:
if bias_multiplier.numel() == 1:
bias_scale = bias_multiplier.item() * (2.0 ** (-bias_shift.item()))
Contributor

why do we need .item() specialization for numel == 1?

Contributor Author

.item() is needed for single-element tensors to extract a Python scalar for correct math, isn't it?

Contributor

float() works with numel() == 1, so you can get a float when you consume it, or pass it down as a tensor.

Contributor Author

Done

Contributor

@digantdesai digantdesai left a comment

Great starting point. Thanks @psiddh for all the back and forth.
Left some comments.

}

// start of cmsis buffer
ctx.buf = scratch_ptr;
Contributor

kernel_sum_state->get_scratch_ptr()

Contributor Author

I've encapsulated everything in a simple helper class now.

@digantdesai (Contributor)

@AdrianLundell - just FYI. We have more work to do here, but I want to give you a heads-up.

BINARY_DIR CMSIS_NN_BINARY_DIR
)
set(CMSIS_NN_INCLUDE_DIR "${CMSIS_NN_SOURCE_DIR}/Include")
set(CMSIS_NN_LIB "${CMSIS_NN_BINARY_DIR}/libcmsis-nn.a")
Collaborator

This is an absolute path into the binary tree, which causes portability issues when installing. Could you take another look and try again to use only the cmsis-nn target when building?

Contributor Author

Done

return out;
}

// Functional variant (stub, not used at runtime)
Collaborator

I feel like we should work toward removing the need for these stub functions.

Contributor Author

Agreed! I will create a follow-up issue to clean this up further.

can_delegate = True


class QuantLinearTest(torch.nn.Module):
Collaborator

We are trying to move away from adding testing logic to the aot_arm_compiler; please see #13902 for my suggestions on the cortex_m testing strategy.

Contributor

I think this can be reworked in the coming PRs; I don't want to block this one. WDYT, @AdrianLundell?

Collaborator

One way would be to test this way internally for now but not upstream it; that way we don't have to undo it later.

fc_params.output_offset = output_zp;
fc_params.activation.min = std::numeric_limits<int8_t>::min();
fc_params.activation.max = std::numeric_limits<int8_t>::max();
cmsis_nn_dims input_dims = {1, 1, 1, in_feat};
Collaborator

This is channels-last, whereas PyTorch defaults to channels-first. How are you handling that?

Contributor Author
@psiddh psiddh Sep 16, 2025

I think there are no spatial dims for linear; {1, 1, 1, in_feat} sets H = W = 1 because linear ops operate on flattened vectors.
Looking at the CMSIS-NN side, it appears to be NHWC.
Do we further need to handle NCHW↔NHWC for linear ops?

Contributor

I think you are right, Sid. Make sure to add tests, though, if you haven't already.
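
To make the mapping concrete, here is a small sketch of the dims a linear layer would use under this reasoning, matching the snippets quoted above; in_feat/out_feat values are placeholders:

```cpp
#include "arm_nn_types.h" // cmsis_nn_dims, from the CMSIS-NN Include dir

// A fully connected layer has no spatial dims, so H = W = 1 on input,
// filter, and output, and the NCHW-vs-NHWC question is moot; only the
// feature counts matter.
void make_linear_dims(int32_t in_feat, int32_t out_feat) {
  cmsis_nn_dims input_dims = {1, 1, 1, in_feat}; // N, H, W, C_in
  cmsis_nn_dims filter_dims = {in_feat, 1, 1, out_feat}; // as in the diff
  cmsis_nn_dims output_dims = {1, 1, 1, out_feat}; // N, H, W, C_out
  (void)input_dims;
  (void)filter_dims;
  (void)output_dims;
}
```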

"Single quant Output");
}

// From CMSIS-NN implementation, See arm_nn_requantize for details,
Contributor

Nit: add a permalink?
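
Since arm_nn_requantize comes up repeatedly in this thread, a non-authoritative reference model of the math it performs may help. The real CMSIS-NN code is pure integer arithmetic (a saturating doubling-high multiply plus a rounding power-of-two divide); this float version is only a sketch:

```cpp
#include <cmath>
#include <cstdint>

// Reference model only: result ≈ round(val * multiplier * 2^(shift - 31)).
// Not the production code path; CMSIS-NN avoids floating point entirely.
int32_t requantize_ref(int32_t val, int32_t multiplier, int32_t shift) {
  const double scaled =
      static_cast<double>(val) * multiplier * std::ldexp(1.0, shift - 31);
  return static_cast<int32_t>(std::lround(scaled));
}
```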

// scratch_ptr(start) scratch_ptr + cmsis_scratch scratch_ptr + total_size
//
// - CMSIS-NN workspace: used by CMSIS-NN kernels for temporary data
// - Always give CMSIS-NN the start of the buffer for alignment
Contributor

Can we have a fat header, i.e. 32 B, to solve the alignment problem? The reason I don't like a trailing header is that OOB reads (if not also writes) to match SIMD width are an acceptable practice.

Contributor Author

Good point. I refactored this one more time to put the fat header at the beginning with 32-byte alignment, and also updated the diagram.
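
A minimal sketch of what such a leading 32-byte header can look like; the field names are invented for illustration and are not the PR's actual struct:

```cpp
#include <cstdint>

// Only the 32-byte size/alignment contract matters here.
struct alignas(32) ScratchHeader {
  uint32_t kernel_sums_computed; // set once kernel sums are written
  uint32_t required_size;        // checked against the AoT-provided size
  uint8_t reserved[24];          // pad the struct to exactly 32 bytes
};
static_assert(sizeof(ScratchHeader) == 32, "header must stay 32 bytes");

// If the scratch buffer's base pointer is 32-byte aligned, then
// base + sizeof(ScratchHeader) (the CMSIS workspace) is aligned too.
```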

Comment on lines 58 to 61
int32_t in_feat,
int32_t out_feat,
const int8_t* weight_data,
int32_t weight_zp) {
Contributor

I see the in/out feat comment. What I was suggesting: instead of taking cmsis_filter_dims at the ctor and weight data/zp here, what if we take the weight tensor at the ctor and hide the rest inside the class?

Contributor Author

OK, I think this time I factored in your ask.

@psiddh psiddh force-pushed the main branch 4 times, most recently from 5c69054 to b9da8c9 Compare September 17, 2025 20:22
}

// Refer to CMSIS-NN 'arm_nn_requantize' implementation for details:
// https://fburl.com/afvegf0m
Contributor
@digantdesai digantdesai Sep 17, 2025

Just a regular GitHub permalink? Not sure this one is accessible in OSS.

// C++ Runtime:
// ┌─────────────────┬─────────────────────────────────────┐
// │ Fat Header │ CMSIS Workspace │
// │ (32 bytes) │ (X bytes) │
Contributor

Nit

Suggested change
// │ (32 bytes) │ (X bytes) │
// │ (32 bytes) │ (x bytes) │

// │ │
// ▼ ▼
// arm_vector_sum_s8() writes kernel sums (with bias if avail):
// [sum₀+bias₀][sum₁+bias₁][sum₂+bias₂]...[sum_{n-1}+bias_{n-1}]
Contributor
@digantdesai digantdesai Sep 17, 2025

Nice subscripts; sad we can't do it for {n-1} :-p

Suggested change
// [sum₀+bias₀][sum₁+bias₁][sum₂+bias₂]...[sum_{n-1}+bias_{n-1}]
// [sum₀+bias₀][sum₁+bias₁][sum₂+bias₂]...[sum_{n-1}+bias_{n-1}]

// (n × 4-byte int32_t values = X bytes)
//
// - n = out_features (number of output features)
// - X = n × 4 bytes (total CMSIS buffer size)
Contributor

Suggested change
// - X = n × 4 bytes (total CMSIS buffer size)
// - x = n * 4 bytes (total CMSIS buffer size)

// (n × 4-byte int32_t values = X bytes)
//
// - n = out_features (number of output features)
// - X = n × 4 bytes (total CMSIS buffer size)
Contributor

4 because sizeof(int32)?

Contributor Author

Yes.
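
To make the buffer layout concrete: one int32_t per output feature, hence X = n × sizeof(int32_t) bytes. A simplified reference loop assuming row-major weights; note that CMSIS-NN's arm_vector_sum_s8 also folds in the input offset, which this sketch omits:

```cpp
#include <cstdint>

// Models the row-sum-plus-bias part of the kernel-sum buffer only.
void kernel_sums_ref(
    int32_t* sums,
    const int8_t* weights, // [out_features x in_features], row-major
    const int32_t* bias,   // may be nullptr
    int32_t out_features,
    int32_t in_features) {
  for (int32_t o = 0; o < out_features; ++o) {
    int32_t acc = (bias != nullptr) ? bias[o] : 0;
    for (int32_t i = 0; i < in_features; ++i) {
      acc += weights[o * in_features + i];
    }
    sums[o] = acc;
  }
}
```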


// Initialize fat header with offsets
header_->kernel_sum_state.updated = false;
header_->kernel_sum_state.required_size = 0;
Contributor

why do we need this?

Contributor

@digantdesai digantdesai left a comment

We need some more work, but let's move forward. Make a list of tasks and we can fix these next.


// Create filter dims for validation
cmsis_nn_dims filter_dims = {in_feat, 1, 1, out_feat};
validate_size(filter_dims);
Contributor

Do we want to assert on the args before using them?

Contributor Author

Done

Comment on lines 158 to 168
// Cached weight tensor information
int32_t in_features;
int32_t out_features;

// Store offsets instead of raw pointers
uint32_t weight_data_offset;
uint32_t weight_zp_data_offset;
uint32_t bias_data_offset; // (0 if no bias)

bool is_per_channel;
uint8_t padding[3]; // 3 bytes explicit padding for 32-byte alignment
Contributor

I am not sure why these have to be in this struct as opposed to in the CMSISScratchBufferContext.

Contributor Author

Done

return 0;
const uint8_t* ptr_bytes = reinterpret_cast<const uint8_t*>(ptr);
ET_CHECK_MSG(ptr_bytes >= base_ptr_, "Pointer is before base address");
uint64_t offset = ptr_bytes - base_ptr_;
Contributor

Suggested change
uint64_t offset = ptr_bytes - base_ptr_;
const std::ptrdiff_t offset = ptr_bytes - base_ptr_;

Contributor Author

Done
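
As an aside, storing offsets rather than raw pointers is what makes the header architecture-independent: an offset survives the buffer being mapped at a different address, a raw pointer does not. A minimal round-trip sketch with invented helper names:

```cpp
#include <cstddef>
#include <cstdint>

// Convert a pointer inside the buffer to an offset from its base.
inline uint32_t to_offset(const uint8_t* base, const void* ptr) {
  const uint8_t* p = static_cast<const uint8_t*>(ptr);
  const std::ptrdiff_t off = p - base; // signed, as suggested above
  return static_cast<uint32_t>(off);
}

// Recover the pointer from the stored offset.
inline const void* from_offset(const uint8_t* base, uint32_t off) {
  return base + off;
}
```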

Comment on lines 262 to 264
cmsis_nn_context ctx = scratch_buf_ctx_mgr.get_cmsis_ctx();
// Compute kernel sums if needed
scratch_buf_ctx_mgr.compute_kernel_sums_if_needed();
Contributor

Suggested change
cmsis_nn_context ctx = scratch_buf_ctx_mgr.get_cmsis_ctx();
// Compute kernel sums if needed
scratch_buf_ctx_mgr.compute_kernel_sums_if_needed();
// Compute kernel sums if needed
scratch_buf_ctx_mgr.compute_kernel_sums_if_needed();
cmsis_nn_context ctx = scratch_buf_ctx_mgr.get_cmsis_ctx();

Contributor Author

Done

// - Total buffer = 32 + x bytes

public:
CMSISScratchBufferContext(
Contributor

Let's move this out of linear since other ops can also use this.

Contributor Author

Done

// Fixed-size metadata header using offsets for architecture independence.
// Stores cached weight information and kernel sum computation state.
// Ensures 32-byte alignment for the CMSIS workspace that follows.
struct CMSISFatHeader {
Contributor

Nit:

Suggested change
struct CMSISFatHeader {
struct CMSISHeader {

Contributor Author

Renamed to KernelSumHeader.

// Fixed-size metadata header using offsets for architecture independence.
// Stores cached weight information and kernel sum computation state.
// Ensures 32-byte alignment for the CMSIS workspace that follows.
struct CMSISFatHeader {
Contributor

The only purpose of introducing a larger (= fat, though we don't want to call it that :-p) header was to fix alignment for the kernel_sum buffer. We don't need that as a concept at runtime; a 32-byte header size at AoT and an aligned tensor from the ET runtime are sufficient.

OK for now, but let's refactor and scrub it in the next PR. Also move it out of linear.cpp.

Contributor

File a task? 🙏

Contributor Author

I moved the struct into another header already.

uint32_t bias_data_offset; // (0 if no bias)

bool is_per_channel;
uint8_t padding[3]; // 3 bytes explicit padding for 32-byte alignment
Contributor

You don't need any padding here, as long as this assert holds:

cmsis_workspace_ptr_ % 32 == 0

Contributor Author

Yep

@psiddh psiddh force-pushed the main branch 3 times, most recently from e4af3fd to 4d3d216 Compare September 18, 2025 07:57
Comment on lines 119 to 133
int32_t required_size = 0;
uint8_t padding[24]; // Pad to 32 bytes for alignment
Contributor

Why do we need both of these fields?

Contributor Author
@psiddh psiddh Sep 18, 2025

required_size is used for assertion/validation along with logging: it compares the size passed by AoT against the size the CMSIS API requires at C++ runtime. Looking at the CMSIS implementation code, standard 4-byte alignment is enough, so I removed the extra padding field.
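
A sketch of that validation, assuming hypothetical names for the AoT-provided and runtime-required sizes:

```cpp
#include <cstddef>
#include <executorch/runtime/platform/assert.h> // ET_CHECK_MSG

// provided_size: scratch size computed ahead of time (AoT);
// required_size: what the CMSIS-NN buffer-size query reports at runtime.
inline void validate_scratch_size(size_t provided_size, size_t required_size) {
  ET_CHECK_MSG(
      provided_size >= required_size,
      "Scratch buffer too small: have %zu, need %zu",
      provided_size,
      required_size);
}
```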


cmsis_nn_context get_cmsis_ctx() const {
cmsis_nn_context ctx;
ctx.buf = cmsis_workspace_ptr_;
Contributor

Do you want to put an alignment check on this pointer before passing it down to CMSIS?

Contributor Author

Like this, you mean?
ET_CHECK_MSG(
reinterpret_cast<uintptr_t>(cmsis_workspace_ptr_) % 4 == 0,
"CMSIS workspace not 4-byte aligned");

Integrate CMSIS-NN with per-channel quantization support

Test Plan:
Run e2e test on FVP simulator
./examples/arm/run_mcu_models_fvp.sh --target=cortex-m55 --models=qlinear

@psiddh psiddh merged commit d873063 into pytorch:main Sep 19, 2025
209 of 213 checks passed
@github-project-automation github-project-automation bot moved this from To triage to Done in ExecuTorch Embedded Sep 19, 2025
Collaborator

AdrianLundell commented Sep 19, 2025

Hi, our builds are failing after this was merged. We should make sure to test the Arm backend after changes in the cortex_m backend, since they are so intertwined. Can you have a look?

reproducer:

 backends/arm/scripts/build_executorch.sh

error:

In file included from /home/svc_mlsystest/executorch/backends/cortex_m/ops/op_quantized_linear.cpp:9:
11:42:08  /home/svc_mlsystest/executorch/backends/cortex_m/ops/cmsis_scratch_buffer_context.h: In member function 'void cortex_m::native::CMSISScratchBufferContext::compute_kernel_sums_if_needed()':
11:42:08  /home/svc_mlsystest/executorch/backends/cortex_m/ops/cmsis_scratch_buffer_context.h:89:7: error: 'arm_vector_sum_s8' was not declared in this scope
11:42:08     89 |       arm_vector_sum_s8(
11:42:08        |       ^~~~~~~~~~~~~~~~~
11:42:08  /home/svc_mlsystest/executorch/backends/cortex_m/ops/op_quantized_linear.cpp: In function 'Tensor& cortex_m::native::quantized_linear_out(KernelRuntimeContext&, const Tensor&, const Scalar&, const Scalar&, const Scalar&, const Tensor&, const Tensor&, const Tensor&, const Tensor&, const std::optional<executorch::runtime::etensor::Tensor>&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, const Scalar&, Tensor&)':
11:42:08  /home/svc_mlsystest/executorch/backends/cortex_m/ops/op_quantized_linear.cpp:103:16: error: 'arm_fully_connected_per_channel_s8' was not declared in this scope; did you mean 'arm_fully_connected_s8'?
11:42:08    103 |       status = arm_fully_connected_per_channel_s8(
11:42:08        |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11:42:08        |                arm_fully_connected_s8
11:42:08  gmake[2]: *** [backends/cortex_m/CMakeFiles/cortex_m_kernels.dir/build.make:121: backends/cortex_m/CMakeFiles/cortex_m_kernels.dir/ops/op_quantized_linear.cpp.obj] Error 1
11:42:08  gmake[2]: *** Waiting for unfinished jobs....
11:42:08  gmake[1]: *** [CMakeFiles/Makefile2:1110: backends/cortex_m/CMakeFiles/cortex_m_kernels.dir/all] Error 2
11:42:08  gmake[1]: *** Waiting for unfinished jobs....

@AdrianLundell (Collaborator)

After some more investigation, it builds in a clean environment, so this works fine. Long term, it would be nice to fix the build scripts so they can do incremental rebuilds instead of requiring a full rebuild.

edge = edge.transform(passes)

return edge
passes += [QuantizedLinearFusionPass, QuantizedOpFusionPass]
Contributor

OK, so this is enabled and we don't have any tests. Can you either disable it or land at least some tests?

StrycekSimon pushed a commit to nxp-upstream/executorch that referenced this pull request Sep 23, 2025
Integrate CMSIS-NN with per-channel quantization support

Test Plan:
With local changes: run the e2e test on the FVP simulator
./examples/arm/run_mcu_models_fvp.sh --target=cortex-m55
--models=qlinear


Co-authored-by: Github Executorch <github_executorch@arm.com>