Summary: Add Stateful FC Cortex-m linearOps #14252
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14252
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure. As of commit 3f53d8d with merge base c2ddeec, the following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
input_zero_point: int,
input_multiplier: int,
input_shift: int,
Do we want to just take Tensor in (even for a single element)? The rationale is to support per-token-like quantization later if we want, in addition to the per-tensor quantization we have today.
Will do this in a follow-up change (if you are OK with that).
File an issue?
bias_multiplier: torch.Tensor,
bias_shift: torch.Tensor,
scratch_buffer: torch.Tensor,
output_zero_point: int,
And this one too..
backends/cortex_m/ops/operators.py
Outdated
) * weight_scales.unsqueeze(1)
if bias is not None:
    if bias_multiplier.numel() == 1:
        bias_scale = bias_multiplier.item() * (2.0 ** (-bias_shift.item()))
Why do we need the .item() specialization for numel() == 1?
.item() is needed for single-element tensors to extract a Python scalar for correct math, isn't it?
float() works on tensors with numel() == 1, so you can get a float when you actually consume it, or keep passing it down as a tensor.
Done
Great starting point. Thanks @psiddh for all the back and forth.
Left some comments.
}

// start of cmsis buffer
ctx.buf = scratch_ptr;
kernel_sum_state->get_scratch_ptr()
I encapsulated everything in a simple helper class now.
@AdrianLundell - Just FYI: we need to do more work here, but I want to give you a heads-up.
backends/cortex_m/CMakeLists.txt
Outdated
BINARY_DIR CMSIS_NN_BINARY_DIR
)
set(CMSIS_NN_INCLUDE_DIR "${CMSIS_NN_SOURCE_DIR}/Include")
set(CMSIS_NN_LIB "${CMSIS_NN_BINARY_DIR}/libcmsis-nn.a")
This is an absolute path to the binary tree, which causes portability issues when installing. Could you take another look at this and try again to only use the cmsis-nn target when building?
Done
return out;
}

// Functional variant (stub, not used at runtime)
I feel like we should work toward removing the need for these stub functions
Agreed! I will create a follow-up issue to clean this up further.
examples/arm/aot_arm_compiler.py
Outdated
can_delegate = True


class QuantLinearTest(torch.nn.Module):
We are trying to move away from adding testing logic to the aot_arm_compiler; please see #13902 for my suggestions on the cortex_m testing strategy.
I think this can be reworked in the coming PRs; I don't want to block this one. WDYT @AdrianLundell?
One way would be to test this way internally for now but not upstream it; that way we don't have to undo it later?
fc_params.output_offset = output_zp;
fc_params.activation.min = std::numeric_limits<int8_t>::min();
fc_params.activation.max = std::numeric_limits<int8_t>::max();
cmsis_nn_dims input_dims = {1, 1, 1, in_feat};
This is channels-last, whereas PyTorch is channels-first by default. How are you handling that?
I think there are no spatial dims for linear: {1, 1, 1, in_feat} sets H=W=1 because linear ops operate on flattened vectors.
Looking here, it appears to be NHWC.
Do we further need to handle NCHW↔NHWC for linear ops?
I think you are right, Sid. Make sure to add tests, though, if you haven't already.
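For reference, a minimal sketch of the dim setup being discussed (the values mirror the diff above and the filter_dims shown later in this PR; the wrapper function is illustrative, not the PR's code):

```cpp
#include "arm_nn_types.h" // CMSIS-NN: defines cmsis_nn_dims as {n, h, w, c}

// Sketch: dim setup for an int8 fully-connected (linear) kernel.
// Linear has no spatial dims, so H = W = 1 and the input is one
// flattened row of in_feat values.
static void setup_fc_dims(int32_t in_feat, int32_t out_feat) {
  cmsis_nn_dims input_dims = {1, 1, 1, in_feat};
  cmsis_nn_dims filter_dims = {in_feat, 1, 1, out_feat};
  cmsis_nn_dims bias_dims = {1, 1, 1, out_feat};
  cmsis_nn_dims output_dims = {1, 1, 1, out_feat};
  // With H = W = 1, NHWC vs. NCHW is moot for a 2-D linear input: both
  // layouts flatten to the same contiguous vector of in_feat elements.
  (void)input_dims; (void)filter_dims; (void)bias_dims; (void)output_dims;
}
```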
"Single quant Output"); | ||
} | ||
|
||
// From CMSIS-NN implementation, See arm_nn_requantize for details, |
Nit: add a permalink?
// scratch_ptr(start)    scratch_ptr + cmsis_scratch    scratch_ptr + total_size
//
// - CMSIS-NN workspace: used by CMSIS-NN kernels for temporary data
// - Always give CMSIS-NN the start of the buffer for alignment
Can we have a fat header, i.e. 32 B, to solve the alignment problem? The reason I don't like a trailing header is that OOB reads (if not also writes) to match SIMD width are an acceptable practice.
Good point, so I refactored this one more time to fit the fat header at the beginning with 32-byte alignment. Also updated the diagram.
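A minimal sketch of the leading-header idea, assuming a fixed 32-byte header (field names are illustrative, not the exact ones in the PR):

```cpp
#include <cstdint>

// Sketch: a fixed 32-byte header placed *before* the CMSIS workspace.
// Because the header size is a multiple of 32, the workspace that follows
// keeps the 32-byte alignment of the underlying scratch allocation.
struct alignas(32) ScratchHeader {
  uint32_t kernel_sums_computed; // 0/1 flag: kernel sums cached yet?
  uint32_t required_size;        // workspace size expected by the kernel
  uint8_t reserved[24];          // pad the header out to exactly 32 bytes
};
static_assert(sizeof(ScratchHeader) == 32, "header must stay 32 bytes");

inline uint8_t* workspace_from(uint8_t* scratch_base) {
  // The workspace starts right after the header. With no trailing header,
  // kernels may over-read past the end of their region to match SIMD width
  // without clobbering any metadata.
  return scratch_base + sizeof(ScratchHeader);
}
```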
int32_t in_feat,
int32_t out_feat,
const int8_t* weight_data,
int32_t weight_zp) {
I see the in/out feat comment. What I was suggesting was: instead of taking cmsis_filter_dims at the ctor and weight data/zp here, what if we take weight_tensor at the ctor and hide the rest inside the class?
OK, I think this time I factored in your ask.
}

// Refer to CMSIS-NN 'arm_nn_requantize' implementation for details:
// https://fburl.com/afvegf0m
Just a regular GitHub permalink? Not sure this one is accessible in OSS.
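For readers who can't open that link: a plain-C++ reference sketch of the gemmlowp-style fixed-point requantization that arm_nn_requantize implements (saturation handling omitted for brevity; treat this as a reference model, not the CMSIS-NN source):

```cpp
#include <cstdint>

// Multiply two values, doubling the result and rounding to nearest:
// returns round(a * b / 2^31), i.e. the high 32 bits of 2*a*b with rounding.
static int32_t doubling_high_mult(int32_t a, int32_t b) {
  int64_t prod = (int64_t)a * (int64_t)b;
  // Sign-dependent rounding offset for the upcoming divide by 2^31.
  prod += (prod >= 0) ? (1LL << 30) : (1 - (1LL << 30));
  return (int32_t)(prod / (1LL << 31));
}

// Divide by 2^exponent, rounding the midpoint away from zero.
static int32_t divide_by_power_of_two(int32_t dividend, int32_t exponent) {
  const int32_t remainder_mask = (1 << exponent) - 1;
  int32_t result = dividend >> exponent;
  const int32_t remainder = dividend & remainder_mask;
  int32_t threshold = remainder_mask >> 1;
  if (result < 0) threshold++;
  if (remainder > threshold) result++;
  return result;
}

// Requantize a 32-bit accumulator by (multiplier * 2^shift), where
// multiplier is a Q31 fixed-point value representing a scale in [0.5, 1).
static int32_t requantize(int32_t val, int32_t multiplier, int32_t shift) {
  const int32_t left = (shift > 0) ? shift : 0;   // positive shift applied up front
  const int32_t right = (shift > 0) ? 0 : -shift; // negative shift becomes a rounding divide
  return divide_by_power_of_two(doubling_high_mult(val * (1 << left), multiplier), right);
}
```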
// C++ Runtime:
// ┌─────────────────┬─────────────────────────────────────┐
// │ Fat Header      │ CMSIS Workspace                     │
// │ (32 bytes)      │ (X bytes)                           │
Nit:
```suggestion
// │ (32 bytes)      │ (x bytes)                           │
```
// │                 │
// ▼                 ▼
// arm_vector_sum_s8() writes kernel sums (with bias if avail):
// [sum₀+bias₀][sum₁+bias₁][sum₂+bias₂]...[sum_{n-1}+bias_{n-1}]
Nice subscripts; sad you can't do it for {n-1} :-p
// (n × 4-byte int32_t values = X bytes)
//
// - n = out_features (number of output features)
// - X = n × 4 bytes (total CMSIS buffer size)
```suggestion
// - x = n * 4 bytes (total CMSIS buffer size)
```
// (n × 4-byte int32_t values = X bytes)
//
// - n = out_features (number of output features)
// - X = n × 4 bytes (total CMSIS buffer size)
4 because sizeof(int32)?
Yes
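In other words, a small sketch of the sizing math from the diagram above (function names are hypothetical):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: total scratch size = 32-byte header + kernel-sum workspace.
// n = out_features; each kernel sum is one int32_t, so x = n * 4 bytes.
constexpr size_t kHeaderBytes = 32;

constexpr size_t kernel_sum_bytes(size_t out_features) {
  return out_features * sizeof(int32_t); // the "x" in the diagram
}

constexpr size_t total_scratch_bytes(size_t out_features) {
  return kHeaderBytes + kernel_sum_bytes(out_features);
}

static_assert(total_scratch_bytes(128) == 32 + 128 * 4, "32 + x bytes");
```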
// Initialize fat header with offsets
header_->kernel_sum_state.updated = false;
header_->kernel_sum_state.required_size = 0;
why do we need this?
We need some more work, but let's move forward. Make a list of tasks and we can fix these next.
// Create filter dims for validation
cmsis_nn_dims filter_dims = {in_feat, 1, 1, out_feat};
validate_size(filter_dims);
Do we want to assert first before using the args?
Done
// Cached weight tensor information
int32_t in_features;
int32_t out_features;

// Store offsets instead of raw pointers
uint32_t weight_data_offset;
uint32_t weight_zp_data_offset;
uint32_t bias_data_offset; // (0 if no bias)

bool is_per_channel;
uint8_t padding[3]; // 3 bytes explicit padding for 32-byte alignment
I am not sure why these have to be in this struct as opposed to in the CMSISScratchBufferContext.
Done
return 0;
const uint8_t* ptr_bytes = reinterpret_cast<const uint8_t*>(ptr);
ET_CHECK_MSG(ptr_bytes >= base_ptr_, "Pointer is before base address");
uint64_t offset = ptr_bytes - base_ptr_;
```suggestion
const std::ptrdiff_t offset = ptr_bytes - base_ptr_;
```
Done
cmsis_nn_context ctx = scratch_buf_ctx_mgr.get_cmsis_ctx();
// Compute kernel sums if needed
scratch_buf_ctx_mgr.compute_kernel_sums_if_needed();
```suggestion
// Compute kernel sums if needed
scratch_buf_ctx_mgr.compute_kernel_sums_if_needed();
cmsis_nn_context ctx = scratch_buf_ctx_mgr.get_cmsis_ctx();
```
Done
// - Total buffer = 32 + x bytes

public:
CMSISScratchBufferContext(
Let's move this out of linear since other ops can also use this.
Done
// Fixed-size metadata header using offsets for architecture independence.
// Stores cached weight information and kernel sum computation state.
// Ensures 32-byte alignment for the CMSIS workspace that follows.
struct CMSISFatHeader {
Nit:
```suggestion
struct CMSISHeader {
```
renamed to KernelSumHeader
// Fixed-size metadata header using offsets for architecture independence.
// Stores cached weight information and kernel sum computation state.
// Ensures 32-byte alignment for the CMSIS workspace that follows.
struct CMSISFatHeader {
The only purpose of introducing a larger (= fat, though we don't want to call it that :-p) header was to fix alignment for the kernel_sum buffer. We don't need to have that as a concept at runtime: a header size of 32 at AOT and an aligned tensor from the ET runtime are sufficient.
OK for now, but let's refactor and scrub it in the next PR. Also move it out of linear.cpp.
File a task? 🙏
I moved the struct into another header already.
uint32_t bias_data_offset; // (0 if no bias)

bool is_per_channel;
uint8_t padding[3]; // 3 bytes explicit padding for 32-byte alignment
You don't need any padding here, as long as this assert is OK: cmsis_workspace_ptr_ % 32 == 0
Yep
int32_t required_size = 0;
uint8_t padding[24]; // Pad to 32 bytes for alignment
Why do we need both of these fields?
required_size is used for assertion/validation along with logging: it compares the size passed by AOT against the size the CMSIS API requires at C++ runtime. Looking at the CMSIS implementation code, standard 4-byte alignment is enough, so I removed the extra padding field.
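Roughly this shape of check, as a hypothetical sketch (the function name and message are illustrative, not the PR's exact code):

```cpp
#include <cstdint>
#include <executorch/runtime/platform/assert.h> // for ET_CHECK_MSG

// Sketch: validate the AOT-provisioned scratch size against what the
// CMSIS-NN API reports it needs at runtime, failing loudly on mismatch.
inline void validate_scratch_size(int32_t aot_provided_size,
                                  int32_t runtime_required_size) {
  ET_CHECK_MSG(
      aot_provided_size >= runtime_required_size,
      "Scratch buffer too small: AOT provided %d bytes, CMSIS-NN requires %d",
      (int)aot_provided_size,
      (int)runtime_required_size);
}
```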
cmsis_nn_context get_cmsis_ctx() const {
cmsis_nn_context ctx;
ctx.buf = cmsis_workspace_ptr_;
Do you want to put an alignment check on this pointer before passing it down to CMSIS?
Like this, you mean?
ET_CHECK_MSG(
reinterpret_cast<uintptr_t>(cmsis_workspace_ptr_) % 4 == 0,
"CMSIS workspace not 4-byte aligned");
Integrate with CMSIS-NN with per-channel quantization support

Test Plan: Run e2e test on FVP simulator
./examples/arm/run_mcu_models_fvp.sh --target=cortex-m55 --models=qlinear

Reviewers:
Subscribers:
Tasks:
Tags:
Hi, our builds are failing after this was merged. We should make sure to test the Arm backend after changes in the cortex_m backend, since they are so intertwined. Can you have a look? Reproducer:
error:
After some more investigation: it builds in a clean environment, so this works fine. It would be nice long-term to fix the build scripts so that they can do incremental rebuilds instead of requiring a full rebuild.
edge = edge.transform(passes)

return edge | ||
passes += [QuantizedLinearFusionPass, QuantizedOpFusionPass] |
OK, so this is enabled and we don't have any tests. Can you either disable it or land at least some tests?
Integrate with CMSIS-NN with per-channel quantization support

Test Plan:
With local changes: Run e2e test on FVP simulator
./examples/arm/run_mcu_models_fvp.sh --target=cortex-m55 --models=qlinear

Reviewers:
Subscribers:
Tasks:
Tags:

Co-authored-by: Github Executorch <github_executorch@arm.com>