
Conversation

@zonglinpeng
Contributor

Summary:
...test for transpose

Summary
Adding fixes for im2row and conv1d
Original patch: cad-audio@1cb9ee7#diff-72cb6b6b59f6d8be4021885fa14490cd182776f883b6d2132678968b9e7cb267

Explanation of Code Changes in xa_nn_transpose_32.c
This patch fixes a memory-alignment crash in the transpose operation for Cadence HiFi DSPs. Both changes are explained below:

Change 1: Simplified Inner Loop (Lines 170-191)
The Problem
The original code had a flawed alignment check:

// OLD CODE - BUGGY
if((((unsigned)p_inp4 & 1) == 0) && (((unsigned)p_out & 1) == 0))
{
    // "Aligned" path using AE_L32X2_IP / AE_S32X2_IP
}
else
{
    // Unaligned path using AE_LA32X2_IP / AE_SA32X2_IP
}

The bug: The check & 1 only verifies 2-byte alignment (checking if the least significant bit is 0). However:

ae_int32x2 is a 64-bit (8-byte) SIMD type
AE_L32X2_IP / AE_S32X2_IP are aligned load/store intrinsics that require 8-byte alignment
The correct check should be & 0x7 (for 8-byte alignment) or at minimum & 0x3 (for 4-byte alignment); a short sketch of these masks follows this list
Result: When pointers were 2-byte aligned but NOT 8-byte aligned, the code incorrectly took the "aligned" path, causing:

Crashes on HiFi DSPs due to misaligned memory access
Data corruption in some cases
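
For illustration, here is a small standalone C check of those masks. It is independent of the kernel; the buffer and the 2-byte offset are invented purely for the demo.

#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

/* An address is N-byte aligned when its low log2(N) bits are all zero. */
static int is_aligned(const void *p, uintptr_t n)
{
    return ((uintptr_t)p & (n - 1)) == 0;
}

int main(void)
{
    alignas(8) unsigned char storage[16];
    const unsigned char *p = storage + 2;   /* 2-byte aligned, but NOT 8-byte aligned */

    printf("(p & 1)   == 0 : %d\n", ((uintptr_t)p & 1) == 0);   /* old check: passes       */
    printf("(p & 0x7) == 0 : %d\n", is_aligned(p, 8));          /* real requirement: fails */
    return 0;
}

A pointer like p above would sail through the old check and still crash the aligned intrinsics.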
The Fix
// NEW CODE - SAFE
ae_int32x2 *__restrict__ pae_i = (ae_int32x2 *)(p_inp4);
ae_int32x2 *__restrict__ pae_o = (ae_int32x2 *)(p_out);
ae_valign a_inp = AE_LA64_PP(pae_i);       // Prime alignment register for input
ae_valign a_out = AE_ZALIGN64();           // Initialize alignment register for output
ae_int32x2 d0;
for(itr4 = 0; itr4 < (out_dim4 >> 1); itr4++)
{
    AE_LA32X2_IP(d0, a_inp, pae_i);        // Unaligned load
    AE_SA32X2_IP(d0, a_out, pae_o);        // Unaligned store
}
AE_SA64POS_FP(a_out, pae_o);               // Flush remaining data

Why this works:

Always uses unaligned intrinsics (AE_LA32X2_IP / AE_SA32X2_IP) which work correctly for both aligned and unaligned data
Eliminates the faulty branch entirely - no alignment check needed
Minimal performance impact - on modern HiFi DSPs, unaligned intrinsics on aligned data have negligible overhead
Simpler code - easier to maintain and less error-prone
Intrinsic       Type             Requirement
AE_L32X2_IP     Aligned load     Requires 8-byte aligned pointer
AE_LA32X2_IP    Unaligned load   Works with any alignment
AE_S32X2_IP     Aligned store    Requires 8-byte aligned pointer
AE_SA32X2_IP    Unaligned store  Works with any alignment
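
A portable C analogue of that distinction, for readers without the HiFi toolchain (the function names below are illustrative, not NNLib APIs): dereferencing an 8-byte type through a misaligned pointer is undefined behavior and can fault, while a memcpy-style access is defined for any alignment, which is the same guarantee the unaligned AE_LA / AE_SA variants give.

#include <stdint.h>
#include <string.h>

/* Aligned-style access: only valid when p is 8-byte aligned. */
static uint64_t load64_aligned(const void *p)
{
    return *(const uint64_t *)p;       /* misaligned p: undefined behavior, may fault */
}

/* Alignment-agnostic access: valid for any p. */
static uint64_t load64_any(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);           /* defined for any alignment */
    return v;
}

int main(void)
{
    uint64_t storage[2] = { 0x1122334455667788ULL, 0 };
    const unsigned char *buf = (const unsigned char *)storage;

    uint64_t a = load64_aligned(buf);      /* OK: storage is 8-byte aligned   */
    uint64_t b = load64_any(buf + 2);      /* OK even though buf + 2 is not   */
    return (int)((a ^ b) & 0);             /* keep both results alive; exit 0 */
}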
Change 2: Fixed Strided Load (Lines 216-222)
The Problem
// OLD CODE - PROBLEMATIC
AE_L32_XP(d0, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2);
AE_L32_XP(d1, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2);

The AE_L32_XP intrinsic:

Loads a 32-bit value from the pointer
Modifies the pointer by adding the byte offset
Has specific alignment and type requirements
Issues:

The << 2 converts stride to bytes, but the pointer increment behavior inside the macro can be problematic with unaligned addresses
The intrinsic may have stricter alignment requirements than expected
The cast (ae_int32 *)p_inp4 combined with the post-increment behavior can cause undefined behavior on some platforms
The Fix
// NEW CODE - EXPLICIT AND SAFE
d0 = AE_L32_X((ae_int32 *)p_inp4, 0); // Load with explicit offset
p_inp4 += inp_stride[p_5D_permute_vec[4]]; // Manual pointer increment
d1 = AE_L32_X((ae_int32 *)p_inp4, 0); // Load with explicit offset
p_inp4 += inp_stride[p_5D_permute_vec[4]]; // Manual pointer increment

Why this works:

AE_L32_X(ptr, offset) - Loads from ptr + offset without modifying the pointer
Explicit pointer arithmetic - p_inp4 += stride increments by elements (not bytes), which is correct for WORD32*
Clearer semantics - Separates load and pointer update, making behavior predictable
Avoids intrinsic quirks - No reliance on how the intrinsic handles pointer modification internally
Aspect           Old (AE_L32_XP)                  New (AE_L32_X + manual)
Pointer update   Inside intrinsic (byte offset)   Explicit (element offset)
Clarity          Implicit behavior                Explicit, readable
Safety           Potential alignment issues       Guaranteed correct
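
A minimal sketch of the byte-offset versus element-offset equivalence in that table (assuming WORD32 is a 32-bit integer, as in the kernel; the buffer and stride values are illustrative):

#include <assert.h>
#include <stdint.h>

typedef int32_t WORD32;

int main(void)
{
    WORD32 buf[32];
    WORD32 *p = buf;
    int stride = 5;                                                     /* stride in elements */

    WORD32 *by_elements = p + stride;                                   /* new code: element arithmetic */
    WORD32 *by_bytes = (WORD32 *)((unsigned char *)p + (stride << 2));  /* old code: byte offset, << 2  */

    assert(by_elements == by_bytes);   /* both land on the same address when sizeof(WORD32) == 4 */
    return 0;
}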
Summary
Change                      Root Cause                                            Fix Strategy
Inner loop simplification   Wrong alignment check (& 1 instead of & 0x7)          Always use unaligned intrinsics
Strided load fix            AE_L32_XP intrinsic issues with unaligned pointers    Use AE_L32_X + manual pointer increment
Both fixes follow the same principle: Instead of trying to detect alignment and use different code paths, use intrinsics that handle unaligned data correctly in all cases. This is safer, simpler, and has minimal performance impact on modern HiFi DSPs.
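
For context on what the kernel computes, a plain scalar reference for a 5-D permute could look like the sketch below. It is a hypothetical helper in the spirit of the new C++ tests, not the NNLib API; the names (reference_transpose_5d, inp_shape, permute_vec) and the row-major layout assumption are invented for the example.

#include <stdint.h>

typedef int32_t WORD32;

/* Gather-style 5-D permute: out dim d takes its extent from input dim permute_vec[d]. */
static void reference_transpose_5d(WORD32 *p_out,
                                   const WORD32 *p_inp,
                                   const int inp_shape[5],
                                   const int permute_vec[5])
{
    int out_shape[5];
    for (int i = 0; i < 5; i++) out_shape[i] = inp_shape[permute_vec[i]];

    /* Row-major strides of the input, in elements */
    int inp_stride[5];
    inp_stride[4] = 1;
    for (int i = 3; i >= 0; i--) inp_stride[i] = inp_stride[i + 1] * inp_shape[i + 1];

    /* Walk the output in order; gather each value from the permuted input offset */
    int idx = 0;
    for (int i0 = 0; i0 < out_shape[0]; i0++)
      for (int i1 = 0; i1 < out_shape[1]; i1++)
        for (int i2 = 0; i2 < out_shape[2]; i2++)
          for (int i3 = 0; i3 < out_shape[3]; i3++)
            for (int i4 = 0; i4 < out_shape[4]; i4++) {
                int out_idx[5] = { i0, i1, i2, i3, i4 };
                int offset = 0;
                for (int d = 0; d < 5; d++)
                    offset += out_idx[d] * inp_stride[permute_vec[d]];
                p_out[idx++] = p_inp[offset];
            }
}

int main(void)
{
    /* Smoke test: a 1x1x1x2x3 tensor with its last two dims swapped. */
    WORD32 inp[6] = { 0, 1, 2, 3, 4, 5 };
    WORD32 out[6];
    const int inp_shape[5] = { 1, 1, 1, 2, 3 };
    const int permute_vec[5] = { 0, 1, 2, 4, 3 };
    reference_transpose_5d(out, inp, inp_shape, permute_vec);
    /* Expected output: { 0, 3, 1, 4, 2, 5 } */
    return out[1] == 3 ? 0 : 1;
}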
Differential Revision: D92207944


pytorch-bot bot commented Feb 3, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17180

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 6c86501 with merge base c0cef6c:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Feb 3, 2026

meta-codesync bot commented Feb 3, 2026

@zonglinpeng has exported this pull request. If you are a Meta employee, you can view the originating Diff in D92207944.


github-actions bot commented Feb 3, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

…el for safe load & store, add cpp test for im2row, link transpose to internal, add cpp (pytorch#17180)


Labels

CLA Signed, fb-exported, meta-exported
