-
Notifications
You must be signed in to change notification settings - Fork 828
Adding fixes for im2row and conv1d, link im2row, [ET][Cadence] patch transpose kernel for safe load & store, [ET][Cadence][Hifi] add cpp test for im2row, [ET][Cadence][Hifi] link transpose to internal, [ET][Cadence][Hifi] add cpp... (#16512) #17180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17180
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 New FailureAs of commit 6c86501 with merge base c0cef6c ( NEW FAILURE - The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@zonglinpeng has exported this pull request. If you are a Meta employee, you can view the originating Diff in D92207944. |
This PR needs a
|
…el for safe load & store, add cpp test for im2row, link transpose to internal, add cpp (pytorch#17180) Summary: ...test for transpose Summary Adding fixes for im2row and conv1d Original patch: cad-audio@1cb9ee7#diff-72cb6b6b59f6d8be4021885fa14490cd182776f883b6d2132678968b9e7cb267 Explanation of Code Changes in xa_nn_transpose_32.c This patch fixes a memory alignment crash issue in the transpose operation for Cadence HiFi DSPs. Let me explain both changes: Change 1: Simplified Inner Loop (Lines 170-191) The Problem The original code had a flawed alignment check: // OLD CODE - BUGGY if((((unsigned)p_inp4 & 1) == 0) && (((unsigned)p_out & 1) == 0)) { // "Aligned" path using AE_L32X2_IP / AE_S32X2_IP } else { // Unaligned path using AE_LA32X2_IP / AE_SA32X2_IP } The bug: The check & 1 only verifies 2-byte alignment (checking if the least significant bit is 0). However: ae_int32x2 is a 64-bit (8-byte) SIMD type AE_L32X2_IP / AE_S32X2_IP are aligned load/store intrinsics that require 8-byte alignment The correct check should be & 0x7 (for 8-byte alignment) or at minimum & 0x3 (for 4-byte alignment) Result: When pointers were 2-byte aligned but NOT 8-byte aligned, the code incorrectly took the "aligned" path, causing: Crashes on HiFi DSPs due to misaligned memory access Data corruption in some cases The Fix // NEW CODE - SAFE ae_int32x2 *__restrict__ pae_i = (ae_int32x2 *)(p_inp4); ae_int32x2 *__restrict__ pae_o = (ae_int32x2 *)(p_out); ae_valign a_inp = AE_LA64_PP(pae_i); // Prime alignment register for input ae_valign a_out = AE_ZALIGN64(); // Initialize alignment register for output ae_int32x2 d0; for(itr4 = 0; itr4 < (out_dim4 >> 1); itr4++) { AE_LA32X2_IP(d0, a_inp, pae_i); // Unaligned load AE_SA32X2_IP(d0, a_out, pae_o); // Unaligned store } AE_SA64POS_FP(a_out, pae_o); // Flush remaining data Why this works: Always uses unaligned intrinsics (AE_LA32X2_IP / AE_SA32X2_IP) which work correctly for both aligned and unaligned data Eliminates the faulty branch entirely - no alignment check needed Minimal performance impact - on modern HiFi DSPs, unaligned intrinsics on aligned data have negligible overhead Simpler code - easier to maintain and less error-prone Intrinsic Type Requirement AE_L32X2_IP Aligned load Requires 8-byte aligned pointer AE_LA32X2_IP Unaligned load Works with any alignment AE_S32X2_IP Aligned store Requires 8-byte aligned pointer AE_SA32X2_IP Unaligned store Works with any alignment Change 2: Fixed Strided Load (Lines 216-222) The Problem // OLD CODE - PROBLEMATIC AE_L32_XP(d0, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2); AE_L32_XP(d1, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2); The AE_L32_XP intrinsic: Loads a 32-bit value from the pointer Modifies the pointer by adding the byte offset Has specific alignment and type requirements Issues: The << 2 converts stride to bytes, but the pointer increment behavior inside the macro can be problematic with unaligned addresses The intrinsic may have stricter alignment requirements than expected The cast (ae_int32 *)p_inp4 combined with the post-increment behavior can cause undefined behavior on some platforms The Fix // NEW CODE - EXPLICIT AND SAFE d0 = AE_L32_X((ae_int32 *)p_inp4, 0); // Load with explicit offset p_inp4 += inp_stride[p_5D_permute_vec[4]]; // Manual pointer increment d1 = AE_L32_X((ae_int32 *)p_inp4, 0); // Load with explicit offset p_inp4 += inp_stride[p_5D_permute_vec[4]]; // Manual pointer increment Why this works: AE_L32_X(ptr, offset) - Loads from ptr + offset without modifying the pointer Explicit pointer arithmetic - p_inp4 += stride increments by elements (not bytes), which is correct for WORD32* Clearer semantics - Separates load and pointer update, making behavior predictable Avoids intrinsic quirks - No reliance on how the intrinsic handles pointer modification internally Aspect Old (AE_L32_XP) New (AE_L32_X + manual) Pointer update Inside intrinsic (byte offset) Explicit (element offset) Clarity Implicit behavior Explicit, readable Safety Potential alignment issues Guaranteed correct Summary Change Root Cause Fix Strategy Inner loop simplification Wrong alignment check (& 1 instead of & 0x7) Always use unaligned intrinsics Strided load fix AE_L32_XP intrinsic issues with unaligned pointers Use AE_L32_X + manual pointer increment Both fixes follow the same principle: Instead of trying to detect alignment and use different code paths, use intrinsics that handle unaligned data correctly in all cases. This is safer, simpler, and has minimal performance impact on modern HiFi DSPs. titled titled titled Differential Revision: D92207944
aa6f106 to
6c86501
Compare
Summary:
...test for transpose
Summary
Adding fixes for im2row and conv1d
Original patch: cad-audio@1cb9ee7#diff-72cb6b6b59f6d8be4021885fa14490cd182776f883b6d2132678968b9e7cb267
Explanation of Code Changes in xa_nn_transpose_32.c
This patch fixes a memory alignment crash issue in the transpose operation for Cadence HiFi DSPs. Let me explain both changes:
Change 1: Simplified Inner Loop (Lines 170-191)
The Problem
The original code had a flawed alignment check:
// OLD CODE - BUGGY
if((((unsigned)p_inp4 & 1) == 0) && (((unsigned)p_out & 1) == 0))
{
// "Aligned" path using AE_L32X2_IP / AE_S32X2_IP
}
else
{
// Unaligned path using AE_LA32X2_IP / AE_SA32X2_IP
}
The bug: The check & 1 only verifies 2-byte alignment (checking if the least significant bit is 0). However:
ae_int32x2 is a 64-bit (8-byte) SIMD type
AE_L32X2_IP / AE_S32X2_IP are aligned load/store intrinsics that require 8-byte alignment
The correct check should be & 0x7 (for 8-byte alignment) or at minimum & 0x3 (for 4-byte alignment)
Result: When pointers were 2-byte aligned but NOT 8-byte aligned, the code incorrectly took the "aligned" path, causing:
Crashes on HiFi DSPs due to misaligned memory access
Data corruption in some cases
The Fix
// NEW CODE - SAFE
ae_int32x2 *restrict pae_i = (ae_int32x2 *)(p_inp4);
ae_int32x2 *restrict pae_o = (ae_int32x2 *)(p_out);
ae_valign a_inp = AE_LA64_PP(pae_i); // Prime alignment register for input
ae_valign a_out = AE_ZALIGN64(); // Initialize alignment register for output
ae_int32x2 d0;
for(itr4 = 0; itr4 < (out_dim4 >> 1); itr4++)
{
AE_LA32X2_IP(d0, a_inp, pae_i); // Unaligned load
AE_SA32X2_IP(d0, a_out, pae_o); // Unaligned store
}
AE_SA64POS_FP(a_out, pae_o); // Flush remaining data
Why this works:
Always uses unaligned intrinsics (AE_LA32X2_IP / AE_SA32X2_IP) which work correctly for both aligned and unaligned data
Eliminates the faulty branch entirely - no alignment check needed
Minimal performance impact - on modern HiFi DSPs, unaligned intrinsics on aligned data have negligible overhead
Simpler code - easier to maintain and less error-prone
Intrinsic Type Requirement
AE_L32X2_IP Aligned load Requires 8-byte aligned pointer
AE_LA32X2_IP Unaligned load Works with any alignment
AE_S32X2_IP Aligned store Requires 8-byte aligned pointer
AE_SA32X2_IP Unaligned store Works with any alignment
Change 2: Fixed Strided Load (Lines 216-222)
The Problem
// OLD CODE - PROBLEMATIC
AE_L32_XP(d0, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2);
AE_L32_XP(d1, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2);
The AE_L32_XP intrinsic:
Loads a 32-bit value from the pointer
Modifies the pointer by adding the byte offset
Has specific alignment and type requirements
Issues:
The << 2 converts stride to bytes, but the pointer increment behavior inside the macro can be problematic with unaligned addresses
The intrinsic may have stricter alignment requirements than expected
The cast (ae_int32 *)p_inp4 combined with the post-increment behavior can cause undefined behavior on some platforms
The Fix
// NEW CODE - EXPLICIT AND SAFE
d0 = AE_L32_X((ae_int32 *)p_inp4, 0); // Load with explicit offset
p_inp4 += inp_stride[p_5D_permute_vec[4]]; // Manual pointer increment
d1 = AE_L32_X((ae_int32 *)p_inp4, 0); // Load with explicit offset
p_inp4 += inp_stride[p_5D_permute_vec[4]]; // Manual pointer increment
Why this works:
AE_L32_X(ptr, offset) - Loads from ptr + offset without modifying the pointer
Explicit pointer arithmetic - p_inp4 += stride increments by elements (not bytes), which is correct for WORD32*
Clearer semantics - Separates load and pointer update, making behavior predictable
Avoids intrinsic quirks - No reliance on how the intrinsic handles pointer modification internally
Aspect Old (AE_L32_XP) New (AE_L32_X + manual)
Pointer update Inside intrinsic (byte offset) Explicit (element offset)
Clarity Implicit behavior Explicit, readable
Safety Potential alignment issues Guaranteed correct
Summary
Change Root Cause Fix Strategy
Inner loop simplification Wrong alignment check (& 1 instead of & 0x7) Always use unaligned intrinsics
Strided load fix AE_L32_XP intrinsic issues with unaligned pointers Use AE_L32_X + manual pointer increment
Both fixes follow the same principle: Instead of trying to detect alignment and use different code paths, use intrinsics that handle unaligned data correctly in all cases. This is safer, simpler, and has minimal performance impact on modern HiFi DSPs.
titled
titled
titled
Differential Revision: D92207944