
Conversation

@zonglinpeng
Contributor

Summary:
...test for transpose

Summary
Adding fixes for im2row and conv1d
Original patch: cad-audio@1cb9ee7#diff-72cb6b6b59f6d8be4021885fa14490cd182776f883b6d2132678968b9e7cb267

Explanation of Code Changes in xa_nn_transpose_32.c
This patch fixes a memory-alignment crash in the transpose operation for Cadence HiFi DSPs. Both changes are explained below:

Change 1: Simplified Inner Loop (Lines 170-191)
The Problem
The original code had a flawed alignment check:

// OLD CODE - BUGGY
if((((unsigned)p_inp4 & 1) == 0) && (((unsigned)p_out & 1) == 0))
{
    // "Aligned" path using AE_L32X2_IP / AE_S32X2_IP
}
else
{
    // Unaligned path using AE_LA32X2_IP / AE_SA32X2_IP
}

The bug: The check & 1 only verifies 2-byte alignment (checking if the least significant bit is 0). However:

ae_int32x2 is a 64-bit (8-byte) SIMD type
AE_L32X2_IP / AE_S32X2_IP are aligned load/store intrinsics that require 8-byte alignment
The correct check should be & 0x7 (for 8-byte alignment) or at minimum & 0x3 (for 4-byte alignment); a short sketch of these masks follows this list
Result: When pointers were 2-byte aligned but NOT 8-byte aligned, the code incorrectly took the "aligned" path, causing:

Crashes on HiFi DSPs due to misaligned memory access
Data corruption in some cases
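
For illustration, here is a small standalone C check of those masks. It is independent of the kernel; the buffer and the 2-byte offset are invented purely for the demo.

#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

/* An address is N-byte aligned when its low log2(N) bits are all zero. */
static int is_aligned(const void *p, uintptr_t n)
{
    return ((uintptr_t)p & (n - 1)) == 0;
}

int main(void)
{
    alignas(8) unsigned char storage[16];
    const unsigned char *p = storage + 2;   /* 2-byte aligned, but NOT 8-byte aligned */

    printf("(p & 1)   == 0 : %d\n", ((uintptr_t)p & 1) == 0);   /* old check: passes       */
    printf("(p & 0x7) == 0 : %d\n", is_aligned(p, 8));          /* real requirement: fails */
    return 0;
}

A pointer like p above would sail through the old check and still crash the aligned intrinsics.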
The Fix
// NEW CODE - SAFE
ae_int32x2 *__restrict__ pae_i = (ae_int32x2 *)(p_inp4);
ae_int32x2 *__restrict__ pae_o = (ae_int32x2 *)(p_out);
ae_valign a_inp = AE_LA64_PP(pae_i);       // Prime alignment register for input
ae_valign a_out = AE_ZALIGN64();           // Initialize alignment register for output
ae_int32x2 d0;
for(itr4 = 0; itr4 < (out_dim4 >> 1); itr4++)
{
    AE_LA32X2_IP(d0, a_inp, pae_i);        // Unaligned load
    AE_SA32X2_IP(d0, a_out, pae_o);        // Unaligned store
}
AE_SA64POS_FP(a_out, pae_o);               // Flush remaining data

Why this works:

Always uses unaligned intrinsics (AE_LA32X2_IP / AE_SA32X2_IP) which work correctly for both aligned and unaligned data
Eliminates the faulty branch entirely - no alignment check needed
Minimal performance impact - on modern HiFi DSPs, unaligned intrinsics on aligned data have negligible overhead
Simpler code - easier to maintain and less error-prone
Intrinsic       Type             Requirement
AE_L32X2_IP     Aligned load     Requires 8-byte aligned pointer
AE_LA32X2_IP    Unaligned load   Works with any alignment
AE_S32X2_IP     Aligned store    Requires 8-byte aligned pointer
AE_SA32X2_IP    Unaligned store  Works with any alignment
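
A portable C analogue of that distinction, for readers without the HiFi toolchain (the function names below are illustrative, not NNLib APIs): dereferencing an 8-byte type through a misaligned pointer is undefined behavior and can fault, while a memcpy-style access is defined for any alignment, which is the same guarantee the unaligned AE_LA / AE_SA variants give.

#include <stdint.h>
#include <string.h>

/* Aligned-style access: only valid when p is 8-byte aligned. */
static uint64_t load64_aligned(const void *p)
{
    return *(const uint64_t *)p;       /* misaligned p: undefined behavior, may fault */
}

/* Alignment-agnostic access: valid for any p. */
static uint64_t load64_any(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);           /* defined for any alignment */
    return v;
}

int main(void)
{
    uint64_t storage[2] = { 0x1122334455667788ULL, 0 };
    const unsigned char *buf = (const unsigned char *)storage;

    uint64_t a = load64_aligned(buf);      /* OK: storage is 8-byte aligned   */
    uint64_t b = load64_any(buf + 2);      /* OK even though buf + 2 is not   */
    return (int)((a ^ b) & 0);             /* keep both results alive; exit 0 */
}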
Change 2: Fixed Strided Load (Lines 216-222)
The Problem
// OLD CODE - PROBLEMATIC
AE_L32_XP(d0, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2);
AE_L32_XP(d1, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2);

The AE_L32_XP intrinsic:

Loads a 32-bit value from the pointer
Modifies the pointer by adding the byte offset
Has specific alignment and type requirements
Issues:

The << 2 converts stride to bytes, but the pointer increment behavior inside the macro can be problematic with unaligned addresses
The intrinsic may have stricter alignment requirements than expected
The cast (ae_int32 *)p_inp4 combined with the post-increment behavior can cause undefined behavior on some platforms
The Fix
// NEW CODE - EXPLICIT AND SAFE
d0 = AE_L32_X((ae_int32 *)p_inp4, 0); // Load with explicit offset
p_inp4 += inp_stride[p_5D_permute_vec[4]]; // Manual pointer increment
d1 = AE_L32_X((ae_int32 *)p_inp4, 0); // Load with explicit offset
p_inp4 += inp_stride[p_5D_permute_vec[4]]; // Manual pointer increment

Why this works:

AE_L32_X(ptr, offset) - Loads from ptr + offset without modifying the pointer
Explicit pointer arithmetic - p_inp4 += stride increments by elements (not bytes), which is correct for WORD32*
Clearer semantics - Separates load and pointer update, making behavior predictable
Avoids intrinsic quirks - No reliance on how the intrinsic handles pointer modification internally
Aspect           Old (AE_L32_XP)                  New (AE_L32_X + manual)
Pointer update   Inside intrinsic (byte offset)   Explicit (element offset)
Clarity          Implicit behavior                Explicit, readable
Safety           Potential alignment issues       Guaranteed correct
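
A minimal sketch of the byte-offset versus element-offset equivalence in that table (assuming WORD32 is a 32-bit integer, as in the kernel; the buffer and stride values are illustrative):

#include <assert.h>
#include <stdint.h>

typedef int32_t WORD32;

int main(void)
{
    WORD32 buf[32];
    WORD32 *p = buf;
    int stride = 5;                                                     /* stride in elements */

    WORD32 *by_elements = p + stride;                                   /* new code: element arithmetic */
    WORD32 *by_bytes = (WORD32 *)((unsigned char *)p + (stride << 2));  /* old code: byte offset, << 2  */

    assert(by_elements == by_bytes);   /* both land on the same address when sizeof(WORD32) == 4 */
    return 0;
}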
Summary
Change                      Root Cause                                            Fix Strategy
Inner loop simplification   Wrong alignment check (& 1 instead of & 0x7)          Always use unaligned intrinsics
Strided load fix            AE_L32_XP intrinsic issues with unaligned pointers    Use AE_L32_X + manual pointer increment
Both fixes follow the same principle: Instead of trying to detect alignment and use different code paths, use intrinsics that handle unaligned data correctly in all cases. This is safer, simpler, and has minimal performance impact on modern HiFi DSPs.
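
For context on what the kernel computes, a plain scalar reference for a 5-D permute could look like the sketch below. It is a hypothetical helper in the spirit of the new C++ tests, not the NNLib API; the names (reference_transpose_5d, inp_shape, permute_vec) and the row-major layout assumption are invented for the example.

#include <stdint.h>

typedef int32_t WORD32;

/* Gather-style 5-D permute: out dim d takes its extent from input dim permute_vec[d]. */
static void reference_transpose_5d(WORD32 *p_out,
                                   const WORD32 *p_inp,
                                   const int inp_shape[5],
                                   const int permute_vec[5])
{
    int out_shape[5];
    for (int i = 0; i < 5; i++) out_shape[i] = inp_shape[permute_vec[i]];

    /* Row-major strides of the input, in elements */
    int inp_stride[5];
    inp_stride[4] = 1;
    for (int i = 3; i >= 0; i--) inp_stride[i] = inp_stride[i + 1] * inp_shape[i + 1];

    /* Walk the output in order; gather each value from the permuted input offset */
    int idx = 0;
    for (int i0 = 0; i0 < out_shape[0]; i0++)
      for (int i1 = 0; i1 < out_shape[1]; i1++)
        for (int i2 = 0; i2 < out_shape[2]; i2++)
          for (int i3 = 0; i3 < out_shape[3]; i3++)
            for (int i4 = 0; i4 < out_shape[4]; i4++) {
                int out_idx[5] = { i0, i1, i2, i3, i4 };
                int offset = 0;
                for (int d = 0; d < 5; d++)
                    offset += out_idx[d] * inp_stride[permute_vec[d]];
                p_out[idx++] = p_inp[offset];
            }
}

int main(void)
{
    /* Smoke test: a 1x1x1x2x3 tensor with its last two dims swapped. */
    WORD32 inp[6] = { 0, 1, 2, 3, 4, 5 };
    WORD32 out[6];
    const int inp_shape[5] = { 1, 1, 1, 2, 3 };
    const int permute_vec[5] = { 0, 1, 2, 4, 3 };
    reference_transpose_5d(out, inp, inp_shape, permute_vec);
    /* Expected output: { 0, 3, 1, 4, 2, 5 } */
    return out[1] == 3 ? 0 : 1;
}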
Differential Revision: D92207944


pytorch-bot bot commented Feb 3, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17180

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 6c86501 with merge base c0cef6c:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Feb 3, 2026

meta-codesync bot commented Feb 3, 2026

@zonglinpeng has exported this pull request. If you are a Meta employee, you can view the originating Diff in D92207944.


github-actions bot commented Feb 3, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

…el for safe load & store, add cpp test for im2row, link transpose to internal, add cpp (pytorch#17180)


Labels

CLA Signed, fb-exported, meta-exported
