
[ARM] Override Clang x4 NEON intrinsics for Android #1694

Merged
merged 1 commit into zlib-ng:develop on Mar 5, 2024

Conversation

mtl1979
Collaborator

mtl1979 commented Feb 25, 2024

  • Clang for Android requires 256-bit alignment for x4 loads and stores, which can't be guaranteed and is unnecessary

Fixes #1343.
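One way such an override can be implemented is to emulate the x4 load with narrower loads that carry no 256-bit alignment expectation. A minimal sketch of that idea (illustrative name, not necessarily the exact code in this patch):

```c
/* Sketch only: emulate vld1q_u8_x4 with four plain 128-bit loads so that no
 * 256-bit alignment hint is emitted. The function name is illustrative and
 * not the identifier used in the actual patch. */
#include <arm_neon.h>

static inline uint8x16x4_t load_u8_x4_unaligned(const uint8_t *p) {
    uint8x16x4_t r;
    r.val[0] = vld1q_u8(p);
    r.val[1] = vld1q_u8(p + 16);
    r.val[2] = vld1q_u8(p + 32);
    r.val[3] = vld1q_u8(p + 48);
    return r;
}
```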


codecov bot commented Feb 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.03%. Comparing base (93b870f) to head (30e8575).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #1694   +/-   ##
========================================
  Coverage    83.03%   83.03%           
========================================
  Files          134      134           
  Lines        10336    10336           
  Branches      2813     2813           
========================================
  Hits          8583     8583           
  Misses        1054     1054           
  Partials       699      699           


@KungFuJesus
Contributor

Looking back on this, I may have missed the subtlety that this is strictly an aarch32 Android issue? Is that really the case, or does aarch64 with Android also suffer this nuisance?

@mtl1979
Collaborator Author

mtl1979 commented Feb 25, 2024

> Looking back on this, I may have missed the subtlety that this is strictly an aarch32 Android issue? Is that really the case, or does aarch64 with Android also suffer this nuisance?

So far I haven't seen anything that suggests that the AArch64 ABI even uses alignment hints... I've been reading the NEON headers for various ARM/AArch64 toolchains and also the source code for the code generator in Clang, as gcc is no longer used for Android.

@ccawley2011
Contributor

Is it not possible to guarantee alignment by simply doing scalar sums to 32 byte alignment instead of 16 in the adler32 code? The slide_hash code already appears to guarantee 64 byte alignment through the ZALLOC macro.

@mtl1979
Collaborator Author

mtl1979 commented Feb 25, 2024

> Is it not possible to guarantee alignment by simply doing scalar sums to 32 byte alignment instead of 16 in the adler32 code? The slide_hash code already appears to guarantee 64 byte alignment through the ZALLOC macro.

We're talking about 8-bit reads here which shouldn't need alignment at all... 16-bit reads and writes should only need a maximum of 16-bit (2-byte) alignment, not 256-bit (32-byte) alignment.

@KungFuJesus
Contributor

It certainly is possible, though for a lot of use cases you may end up doing a scalar run through the entire buffer before you ever hit that point.

@mtl1979
Collaborator Author

mtl1979 commented Feb 25, 2024

When I hear about Android-specific bugs, I always remember that ARM has been bi-endian since ARMv3, so enforcing alignment checks in software might have some justification, but in this case Google has gone full bats in the clock tower.

Since I started working with optimizations, I have seen quite a few motherboards where the main processor is little-endian and the co-processor is big-endian. CPUs on those motherboards have their own "move" instructions that automatically byteswap to correct the endianness.

I completely agree with what KungFuJesus said about some/many buffers being too short to allow aligning before using vector instructions.

@Dead2 added the Architecture (Architecture specific) label on Feb 27, 2024
@ccawley2011
Contributor

> We're talking about 8-bit reads here which shouldn't need alignment at all... 16-bit reads and writes should only need a maximum of 16-bit (2-byte) alignment, not 256-bit (32-byte) alignment.

The existing NEON adler32 already does scalar sums to 16 bytes/128 bits, with a comment suggesting that it's for speed, which is why I'm wondering if it's simpler to adjust that on Android instead of overriding the intrinsics.

https://github.com/zlib-ng/zlib-ng/blob/develop/arch/arm/adler32_neon.c#L171
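For reference, that aligning prologue follows the usual Adler-32 scalar pattern; a generic sketch (not the exact zlib-ng code behind the link above), where widening the mask from 15 to 31 would be the Android-specific tweak being floated:

```c
#include <stddef.h>
#include <stdint.h>

/* Generic sketch of an Adler-32 alignment prologue (not the exact zlib-ng
 * code): do scalar sums until buf reaches the requested alignment, then let
 * the vector loop take over. Passing 31 instead of 15 as the mask is the
 * 32-byte variant being discussed here. */
static size_t scalar_align_prologue(uint32_t *adler, uint32_t *sum2,
                                    const uint8_t *buf, size_t len,
                                    uintptr_t align_mask) {
    size_t done = 0;
    while (done < len && (((uintptr_t)(buf + done)) & align_mask) != 0) {
        *adler += buf[done++];
        *sum2 += *adler;
    }
    return done; /* bytes consumed before the vectorized part */
}
```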

@mtl1979
Collaborator Author

mtl1979 commented Feb 28, 2024

> We're talking about 8-bit reads here which shouldn't need alignment at all... 16-bit reads and writes should only need a maximum of 16-bit (2-byte) alignment, not 256-bit (32-byte) alignment.
>
> The existing NEON adler32 already does scalar sums to 16 bytes/128 bits, with a comment suggesting that it's for speed, which is why I'm wondering if it's simpler to adjust that on Android instead of overriding the intrinsics.
>
> https://github.com/zlib-ng/zlib-ng/blob/develop/arch/arm/adler32_neon.c#L171

Even 32-bit ARM has enough registers that it makes sense to utilize all or most of them instead of just one or two. That way we can delay the expensive arithmetic operations, such as the modulo, as long as possible, resulting in a speed gain.
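For context, the reason the modulo can be deferred at all is the standard zlib bound: with BASE = 65521, up to NMAX = 5552 bytes can be accumulated before the 32-bit sums can overflow. A scalar illustration of that deferral (not the NEON code under discussion):

```c
#include <stddef.h>
#include <stdint.h>

/* BASE and NMAX as in zlib: NMAX = 5552 is the largest n such that
 * 255*n*(n+1)/2 + (n+1)*(BASE-1) still fits in an unsigned 32-bit value. */
#define BASE 65521u
#define NMAX 5552u

/* Scalar illustration of deferring the modulo: reduce only once per NMAX
 * bytes instead of after every byte. The vector code pushes the same idea
 * further by also spreading partial sums across many registers. */
static uint32_t adler32_deferred_mod(uint32_t adler, const uint8_t *buf, size_t len) {
    uint32_t a = adler & 0xffff;
    uint32_t b = (adler >> 16) & 0xffff;
    while (len > 0) {
        size_t chunk = len < NMAX ? len : NMAX;
        len -= chunk;
        while (chunk--) {
            a += *buf++;
            b += a;
        }
        a %= BASE;
        b %= BASE;
    }
    return (b << 16) | a;
}
```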

@KungFuJesus
Contributor

KungFuJesus commented Feb 28, 2024

> We're talking about 8-bit reads here which shouldn't need alignment at all... 16-bit reads and writes should only need a maximum of 16-bit (2-byte) alignment, not 256-bit (32-byte) alignment.
>
> The existing NEON adler32 already does scalar sums to 16 bytes/128 bits, with a comment suggesting that it's for speed, which is why I'm wondering if it's simpler to adjust that on Android instead of overriding the intrinsics.
>
> https://github.com/zlib-ng/zlib-ng/blob/develop/arch/arm/adler32_neon.c#L171

I suppose we could conditionally force this to be a modulus of 32 bytes under the right conditions, but as I was alluding to earlier, those conditions don't exist for very short strings. The 16-byte alignment does help, but seemingly only on the little cores of the big.LITTLE SBCs I've tested; the big cores seem to pipeline the loads well enough that alignment is unimportant.

We also do a fake 4x load on platforms that lack this intrinsic, so any workaround that adjusted this aligning scalar sum would also need to take that into account.

@mtl1979
Collaborator Author

mtl1979 commented Feb 28, 2024

> We also do a fake 4x load on platforms that lack this intrinsic, so any workaround that adjusted this aligning scalar sum would also need to take that into account.

I've only seen Clang and MSVC having the x4 versions; at least my gcc doesn't have them... A lot of people still prefer gcc as it's kind of a self-contained toolchain, while Clang will almost always need parts of another toolchain to maintain ABI compatibility.

@KungFuJesus
Contributor

KungFuJesus commented Feb 28, 2024

> We also do a fake 4x load on platforms that lack this intrinsic, so any workaround that adjusted this aligning scalar sum would also need to take that into account.
>
> I've only seen Clang and MSVC having the x4 versions; at least my gcc doesn't have them... A lot of people still prefer gcc as it's kind of a self-contained toolchain, while Clang will almost always need parts of another toolchain to maintain ABI compatibility.

I have it in my version of GCC and I believe I've had it for a while (since maybe v10 or v11?). I imagine the impact of the wider load is felt more on some implementations than others. It is a bit of a balancing act to determine when the alignment is actually helpful and when doing the scalar computations completely nullifies the gains. On pre-Nehalem CPUs, the adler checksum there does a bit of calculus to determine that. On everything after Nehalem the alignment mattered less and less (and really the part where it has any impact, however small, is the stores rather than the loads).

Given that this data is fed from a raw stream of bytes and in the worst case we're already doing up to 15 scalar sums, I don't know how I feel about 16 more. Perhaps a viable strategy would be to do a 16-byte-wide load and jump over the wide load when not aligned. This would probably maximize the benefit while not going too overboard?

It's a little complicated, but to clarify, the loop is already unrolled by a factor of 4. The remainder peeling for the modulo 4 is done after that loop with "if (rem)". The loop could be restructured such that it jumped to that section and then jumped back up to the top of the loop based on both the alignment requirements and whether or not the remaining len had at least 32 bytes left to checksum.

@mtl1979
Collaborator Author

mtl1979 commented Feb 28, 2024

> I have it in my version of GCC and I believe I've had it for a while (since maybe v10 or v11?).

I have gcc 9.4.0 and a few different versions of Clang... The default gcc version number has lagged behind Clang version numbers a lot, at least on Ubuntu. Clang 12.0.0 is the latest I have without rebooting...

@KungFuJesus
Contributor

It's not the prettiest and there are some corner cases that are not being caught, but something along these lines is what I had been proposing:

adam@pi5:~/zlib-ng/build $ git diff
diff --git a/arch/arm/adler32_neon.c b/arch/arm/adler32_neon.c
index 8e46b380..972480c1 100644
--- a/arch/arm/adler32_neon.c
+++ b/arch/arm/adler32_neon.c
@@ -11,7 +11,7 @@
 #include "adler32_p.h"
 
 static void NEON_accum32(uint32_t *s, const uint8_t *buf, size_t len) {
-    static const uint16_t ALIGNED_(16) taps[64] = {
+    static const uint16_t ALIGNED_(32) taps[64] = {
         64, 63, 62, 61, 60, 59, 58, 57,
         56, 55, 54, 53, 52, 51, 50, 49,
         48, 47, 46, 45, 44, 43, 42, 41,
@@ -39,8 +39,22 @@ static void NEON_accum32(uint32_t *s, const uint8_t *buf, size_t len) {
     uint16x8_t s2_4, s2_5, s2_6, s2_7;
     s2_4 = s2_5 = s2_6 = s2_7 = vdupq_n_u16(0);
 
+loop_unrolled:
     size_t num_iter = len >> 2;
     int rem = len & 3;
+#if 1
+    int align_rem = (uintptr_t)buf & 31;
+
+    if (align_rem != 0) {
+        /* Determine if the modulus for the aligning loads is greater
+         * than the length of the buffer in 16 byte increments. If so,
+         * only checksum the remaining length of the buffer. If not,
+         * compute the residual number of 16 byte loads needed and at
+         * the end of this loop, jump back to the 4x load loop */
+        rem = 1;
+        goto rem_peel;
+    }
+#endif
 
     for (size_t i = 0; i < num_iter; ++i) {
         uint8x16x4_t d0_d3 = vld1q_u8_x4(buf);
@@ -75,10 +89,12 @@ static void NEON_accum32(uint32_t *s, const uint8_t *buf, size_t len) {
 
         adacc_prev = adacc;
         buf += 64;
+        len -= 4;
     }
 
     s3acc = vshlq_n_u32(s3acc, 6);
 
+rem_peel:
     if (rem) {
         uint32x4_t s3acc_0 = vdupq_n_u32(0);
         while (rem--) {
@@ -91,10 +107,15 @@ static void NEON_accum32(uint32_t *s, const uint8_t *buf, size_t len) {
             s3acc_0 = vaddq_u32(s3acc_0, adacc_prev);
             adacc_prev = adacc;
             buf += 16;
+            --len;
         }
 
         s3acc_0 = vshlq_n_u32(s3acc_0, 4);
         s3acc = vaddq_u32(s3acc_0, s3acc);
+
+        if (len) {
+            goto loop_unrolled;
+        }
     }
 
     uint16x8x4_t t0_t3 = vld1q_u16_x4(taps);

I don't love having to decrement another counting variable (undoubtedly there might also be a way around that by decrementing len before the top jump or something).
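For what it's worth, one goto-free way around the extra counter would be to decide the peel count up front and derive num_iter/rem from whatever is left. A hypothetical sketch keyed to the diff above (len counted in 16-byte blocks, buf assumed to already be 16-byte aligned), not tested against the real loop:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the patch: with buf already 16-byte aligned
 * and len given in 16-byte blocks, return how many blocks the single-vector
 * remainder loop should consume first so that the 4x loop sees a 32-byte
 * aligned address. num_iter and rem would then be computed from len - peel. */
static inline size_t peel_blocks_for_32byte_alignment(const uint8_t *buf, size_t len) {
    size_t peel = (((uintptr_t)buf & 31) != 0) ? 1 : 0;
    return peel < len ? peel : len;
}
```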

@mtl1979
Collaborator Author

mtl1979 commented Feb 28, 2024

Using goto in C or C++ is pretty much frowned upon... I know it's been used before, when there was absolutely no other choice, but it really makes the code almost unreadable and prone to hard-to-notice errors.

@KungFuJesus
Contributor

I mean, I get that it's spaghetti; I was just trying to find a way to decrease the potential machine code size and reuse bits of the code. On further inspection, though, this isn't going to be all that helpful, because the code doing the alignment to 16 bytes only does so if there will still be work left over (I forgot I did that). For a short enough string, it feeds in arbitrarily aligned data, because that's what the ABI allows for as long as the alignment hint doesn't get compiled in.

The benefit the 4x loads really buy you varies wildly between ARM implementations, but it's never been super huge (maybe on M1 it's more significant, I'm not sure). I think falling back to the 16-byte-at-a-time unaligned load on Android is probably acceptable, however silly.

It is interesting that it only requires 256-bit alignment despite being a 512-bit load. That might indicate that most microarchitectures are only loading 32 bytes at a time anyway.

@mtl1979
Collaborator Author

mtl1979 commented Feb 28, 2024

> It is interesting that it only requires 256-bit alignment despite being a 512-bit load. That might indicate that most microarchitectures are only loading 32 bytes at a time anyway.

I think I said before that the x4 versions might compile to two x2 loads/stores on 32-bit targets... However, I haven't checked whether some 32-bit ARM processors allow x4 loads/stores natively.

I disassembled a call to vld1q_u8_x4 and got:

_vld1q_u8_x4:
        vld1.8  {d0-d3}, [r0]!
        vld1.8  {d4-d7}, [r0]

I disassembled a call to vld1q_u16_x4 and got:

_vld1q_u16_x4:
        vld1.16 {d0-d3}, [r0]!
        vld1.16 {d4-d7}, [r0]
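For anyone wanting to reproduce those listings, wrappers along these lines (purely illustrative, not taken from the patch), compiled for a 32-bit ARM/NEON target with a compiler that provides the x4 intrinsics (e.g. Clang), show the same lowering: a single VLD1 can fill at most four D registers (32 bytes), so each 64-byte x4 load becomes two instructions.

```c
#include <arm_neon.h>

/* Illustrative wrappers: on a 32-bit ARM/NEON target the compiler lowers each
 * x4 intrinsic to a pair of vld1 instructions covering d0-d3 and d4-d7, as in
 * the disassembly above. */
uint8x16x4_t wrap_vld1q_u8_x4(const uint8_t *p) {
    return vld1q_u8_x4(p);
}

uint16x8x4_t wrap_vld1q_u16_x4(const uint16_t *p) {
    return vld1q_u16_x4(p);
}
```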

@Dead2 merged commit af494fc into zlib-ng:develop on Mar 5, 2024
140 checks passed