ARM Fixes... #1008
mtl1979
commented
Jun 18, 2021
- Some optimizations for NEON to reduce code size
- Don't call chunkcopy() from inside chunkcopy_safe(), as it might write past the safe length if len is not a multiple of the chunk size.
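To illustrate the second point, here is a minimal sketch of the overrun, not zlib-ng's actual code. The names `CHUNK_SIZE`, `copy_whole_chunks` and `copy_bounded`, and the convention that `safe_end` points one past the last writable byte, are assumptions made for this example only.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk width: one 128-bit vector register. */
#define CHUNK_SIZE 16

/* Copies in whole chunks, rounding len UP to a multiple of CHUNK_SIZE.
   It can therefore read and write up to CHUNK_SIZE - 1 bytes beyond len.
   Returns the position after the last chunk written. */
static unsigned char *copy_whole_chunks(unsigned char *out,
                                        const unsigned char *from,
                                        size_t len) {
    size_t chunks = (len + CHUNK_SIZE - 1) / CHUNK_SIZE;
    while (chunks--) {
        memcpy(out, from, CHUNK_SIZE);
        out += CHUNK_SIZE;
        from += CHUNK_SIZE;
    }
    return out;
}

/* Bounded variant: never writes at or beyond safe_end.
   Whole chunks are used only while a full chunk still fits in len;
   the remainder is copied with its exact byte count. */
static unsigned char *copy_bounded(unsigned char *out,
                                   const unsigned char *from,
                                   size_t len,
                                   const unsigned char *safe_end) {
    size_t room = safe_end > out ? (size_t)(safe_end - out) : 0;
    if (len > room)
        len = room;                     /* clamp to the writable region */
    while (len >= CHUNK_SIZE) {
        memcpy(out, from, CHUNK_SIZE);
        out += CHUNK_SIZE;
        from += CHUNK_SIZE;
        len -= CHUNK_SIZE;
    }
    if (len > 0) {
        memcpy(out, from, len);         /* tail: exact length, no overrun */
        out += len;
    }
    return out;
}
```

With a 37-byte request, the whole-chunk version touches 48 bytes, while the bounded version stops at exactly 37. That is why delegating from the safe routine to the chunked one is only correct when the length is a multiple of the chunk size.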
Codecov Report
@@            Coverage Diff             @@
##           develop    #1008     +/-   ##
===========================================
+ Coverage    77.93%   77.96%   +0.02%
===========================================
  Files           77       77
  Lines         8241     8249      +8
  Branches      1336     1340      +4
===========================================
+ Hits          6423     6431      +8
  Misses        1290     1290
  Partials       528      528
It looks like it passes the pigz CI test I added for aarch64, which is great! My only comment is that the commit messages could be a bit better and explain the changes. Other than that, it LGTM.
Actually, we may need benchmarks for x86 and aarch64.
@nmoinvaz It's 3 am here and I've been awake since Wednesday ;)
I can run some x86 benchmarks tonight.
I noticed my compiler can also do 32-byte chunks on AArch64... Dunno how much faster that is than using 16-byte chunks... I know 48-byte chunks caused some issues because that is an odd register count, not an even one.
* There is no need to convert between unsigned and signed vector types. All relevant intrinsics have versions for all unsigned vector types.
* Using vdupq_n_u64 duplicates the unsigned 64-bit integer to two consecutive aligned memory locations on the stack, so the compiler can use wider load instructions. All different-sized general-purpose registers overlay on ARM/AArch64, so any vector cast is a no-op in assembly.
* chunkcopy() can read or write more than the safe length if the length is not a multiple of the chunk size.
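The first two points can be sketched as follows. This is an illustrative example under stated assumptions, not the code from the PR: `fill16_from_8` is a hypothetical helper, and the non-NEON branch exists only so the sketch compiles on other targets. The intrinsics used (`vdupq_n_u64`, `vreinterpretq_u8_u64`, `vst1q_u8`) are standard `arm_neon.h` intrinsics.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: writes 16 bytes at out, consisting of the
   8 bytes at from repeated twice. */
static void fill16_from_8(uint8_t *out, const uint8_t *from) {
    uint64_t pattern;
    memcpy(&pattern, from, sizeof(pattern)); /* alignment-safe 8-byte load */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Stay in unsigned vector types throughout: duplicate the 64-bit
       value into both lanes, then reinterpret as bytes for the store.
       The reinterpret is a type-level cast only. */
    uint64x2_t lanes = vdupq_n_u64(pattern);
    vst1q_u8(out, vreinterpretq_u8_u64(lanes));
#else
    /* Portable fallback for non-NEON builds (assumption for this sketch). */
    memcpy(out, &pattern, sizeof(pattern));
    memcpy(out + sizeof(pattern), &pattern, sizeof(pattern));
#endif
}
```

No signed intermediate type is needed at any step, which is the simplification the first commit message describes.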
I made the individual commit messages a little more descriptive... I guess @Dead2 wants to be the second one to give feedback, as I know he likes to play with ARM devices ;)
Corpora.tar
ZLIB-NG 834e7d8
ZLIB-NG PR 1008 0025f87
That's like a 0.007% and 0.66% difference... lol...
Well, it's not bad. I just wanted to make sure there wasn't a degradation of any kind.
@nmoinvaz Exactly... I was expecting some degradation... But I guess the optimizer works as I wanted, as in modifying |
Baseline 834e7d8 aarch64
PR #1008 858ec3e aarch64