ARM Fixes... #1008

mtl1979 · 2021-06-18T23:49:40Z

Some optimizations for NEON to reduce code size
Don't call chunkcopy() from inside chunkcopy_safe() as it might write past safe length if len is not multiple of chunk size.

codecov · 2021-06-18T23:52:18Z

Codecov Report

Merging #1008 (858ec3e) into develop (101653c) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop    #1008      +/-   ##
===========================================
+ Coverage    77.93%   77.96%   +0.02%     
===========================================
  Files           77       77              
  Lines         8241     8249       +8     
  Branches      1336     1340       +4     
===========================================
+ Hits          6423     6431       +8     
  Misses        1290     1290              
  Partials       528      528

Flag	Coverage Δ
macos_clang	`68.41% <100.00%> (+0.02%)`	⬆️
macos_gcc	`67.24% <100.00%> (+0.03%)`	⬆️
ubuntu_clang	`74.88% <100.00%> (+0.02%)`	⬆️
ubuntu_clang_debug	`75.20% <100.00%> (+0.02%)`	⬆️
ubuntu_clang_inflate_allow_invalid_dist	`74.62% <100.00%> (+0.02%)`	⬆️
ubuntu_clang_inflate_strict	`74.86% <100.00%> (+0.02%)`	⬆️
ubuntu_clang_mmap	`74.85% <100.00%> (+0.02%)`	⬆️
ubuntu_clang_msan	`74.88% <100.00%> (+0.02%)`	⬆️
ubuntu_clang_pigz	`34.95% <100.00%> (-0.25%)`	⬇️
ubuntu_clang_pigz_no_optim	`37.41% <100.00%> (-0.32%)`	⬇️
ubuntu_clang_pigz_no_threads	`34.60% <100.00%> (-0.25%)`	⬇️
ubuntu_gcc	`74.43% <100.00%> (+0.03%)`	⬆️
ubuntu_gcc_aarch64	`74.72% <100.00%> (+0.02%)`	⬆️
ubuntu_gcc_aarch64_compat_no_opt	`73.49% <100.00%> (+0.02%)`	⬆️
ubuntu_gcc_aarch64_no_acle	`73.89% <100.00%> (+0.02%)`	⬆️
ubuntu_gcc_aarch64_no_neon	`74.25% <100.00%> (+0.02%)`	⬆️
ubuntu_gcc_armhf	`74.71% <100.00%> (+0.03%)`	⬆️
ubuntu_gcc_armhf_compat_no_opt	`73.46% <100.00%> (+0.02%)`	⬆️
ubuntu_gcc_armhf_no_acle	`74.90% <100.00%> (+0.03%)`	⬆️
ubuntu_gcc_armhf_no_neon	`75.21% <100.00%> (+0.02%)`	⬆️
ubuntu_gcc_armsf	`74.72% <100.00%> (+0.03%)`	⬆️
ubuntu_gcc_armsf_compat_no_opt	`73.46% <100.00%> (+0.02%)`	⬆️
ubuntu_gcc_compat_no_opt	`74.62% <100.00%> (+0.02%)`	⬆️
ubuntu_gcc_mingw_i686	`0.00% <0.00%> (ø)`
ubuntu_gcc_mingw_x86_64	`0.00% <0.00%> (ø)`
ubuntu_gcc_no_avx2	`74.68% <100.00%> (+0.02%)`	⬆️
ubuntu_gcc_no_pclmulqdq	`72.99% <100.00%> (+0.03%)`	⬆️
ubuntu_gcc_no_sse2	`74.01% <100.00%> (+0.02%)`	⬆️
ubuntu_gcc_no_sse4	`73.20% <100.00%> (+0.03%)`	⬆️
ubuntu_gcc_o3	`74.35% <100.00%> (+0.03%)`	⬆️
ubuntu_gcc_osb	`77.40% <100.00%> (+0.02%)`	⬆️
ubuntu_gcc_pigz	`35.25% <100.00%> (+0.02%)`	⬆️
ubuntu_gcc_pigz_aarch64	`35.97% <100.00%> (?)`
ubuntu_gcc_ppc	`74.98% <100.00%> (+0.02%)`	⬆️
ubuntu_gcc_ppc64	`75.83% <100.00%> (+0.01%)`	⬆️
ubuntu_gcc_ppc64le	`74.60% <100.00%> (+0.02%)`	⬆️
ubuntu_gcc_s390x	`73.52% <100.00%> (+0.01%)`	⬆️
ubuntu_gcc_sparc64	`76.19% <100.00%> (+0.01%)`	⬆️
win64_gcc	`70.06% <0.00%> (-0.10%)`	⬇️
win64_gcc_compat_no_opt	`73.60% <100.00%> (+0.03%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
arch/arm/chunkset_neon.c	`100.00% <100.00%> (ø)`
chunkset_tpl.h	`99.10% <100.00%> (+0.05%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 101653c...858ec3e. Read the comment docs.

nmoinvaz · 2021-06-19T00:31:42Z

It looks like it passes the pigz CI test I added for aarch64 which is great! My only comment is that the commit messages could be a bit better and explain the changes. Other than that it LGTM.

nmoinvaz · 2021-06-19T00:33:07Z

Actually we may need benchmarks for x86 and aarch64.

mtl1979 · 2021-06-19T00:33:34Z

@nmoinvaz It's 3 am here and I've been awake since Wednesday ;)

nmoinvaz · 2021-06-19T00:34:43Z

I can run some x86 benchmarks tonight.

mtl1979 · 2021-06-19T00:48:35Z

I noticed my compiler can also do 32 byte chunks on AArch64... Dunno how much faster that is than using 16 byte chunks...

I know 48 byte chunks caused some issues because that is odd register count, not even.

* There is no need to convert between unsigned and signed vector types. All relevant intrinsics have versions for all unsigned vector types.

* Using vdupq_n_u64 duplicates the unsigned 64-bit integer to two consecutive aligned memory locations in stack so compiler can use wider load instructions. All different-sized general-purpose registers overlay on ARM/AArch64, so any vector cast is no-op in assembly.

* chunkcopy() can read or write more than the safe length if the length is not multiple of chunk size.

mtl1979 · 2021-06-19T01:38:23Z

I made the individual commit messages a little more descriptive... I guess @Dead2 wants to be the second one to give feedback as I know he likes to play with ARM devices ;)

nmoinvaz · 2021-06-19T03:05:06Z

Corpora.tar

ZLIB-NG `834e7d8`

 Tool: minigzip-ng-develop.exe Levels: 1-9
 Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.809%      1.054/1.071/1.081/0.007        0.668/0.685/0.693/0.006      104,067,822
 2     35.391%      1.731/1.754/1.763/0.008        0.685/0.695/0.699/0.004       82,195,416
 3     33.698%      2.376/2.395/2.404/0.007        0.658/0.673/0.678/0.004       78,262,436
 4     32.989%      2.689/2.703/2.712/0.007        0.644/0.659/0.664/0.004       76,616,183
 5     32.483%      2.818/2.858/2.869/0.012        0.643/0.654/0.658/0.003       75,441,984
 6     32.318%      3.465/3.486/3.496/0.008        0.642/0.650/0.656/0.004       75,058,801
 7     32.061%      4.731/4.760/4.773/0.011        0.639/0.651/0.656/0.005       74,461,300
 8     31.978%      7.218/7.309/7.327/0.025        0.644/0.653/0.657/0.003       74,269,032
 9     31.961%      9.862/9.932/9.953/0.019        0.647/0.659/0.666/0.005       74,230,342

 avg1  34.188%                        4.030                          0.664
 tot                               1088.034                        179.321      714,603,316

ZLIB-NG PR 1008 `0025f87`

 Tool: minigzip-ng-chunkset-neon.exe Levels: 1-9
 Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.809%      1.057/1.067/1.075/0.006        0.674/0.681/0.686/0.003      104,067,822
 2     35.391%      1.727/1.747/1.757/0.008        0.681/0.691/0.695/0.003       82,195,416
 3     33.698%      2.360/2.395/2.409/0.011        0.656/0.665/0.672/0.004       78,262,436
 4     32.989%      2.667/2.699/2.710/0.009        0.642/0.654/0.659/0.004       76,616,183
 5     32.483%      2.839/2.861/2.871/0.009        0.637/0.650/0.654/0.004       75,441,984
 6     32.318%      3.454/3.486/3.496/0.009        0.636/0.645/0.650/0.003       75,058,801
 7     32.061%      4.723/4.767/4.782/0.012        0.639/0.650/0.654/0.003       74,461,300
 8     31.978%      7.233/7.305/7.329/0.025        0.636/0.648/0.652/0.004       74,269,032
 9     31.961%      9.808/9.915/9.945/0.036        0.642/0.653/0.660/0.005       74,230,342

 avg1  34.188%                        4.027                          0.660
 tot                               1087.268                        178.137      714,603,316

mtl1979 · 2021-06-19T03:07:58Z

That's like 0.007% and 0.66% difference... lol...

nmoinvaz · 2021-06-19T03:33:40Z

Well it's not bad. I just wanted to make sure there wasn't a degradation of any kind.

mtl1979 · 2021-06-19T03:35:44Z

@nmoinvaz Exactly... I was expecting some degradation... But I guess the optimizer works as I wanted, as in modifying len in every step doesn't slow it down.

Dead2 · 2021-06-19T13:18:51Z

Baseline `834e7d8` aarch64

   text    data     bss     dec     hex filename
 105194    1576      16  106786   1a122 libz-ng.so.2

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.166%      2.746/2.764/2.780/0.011        0.965/0.993/1.015/0.013       42,618,858
 2     43.879%      2.819/2.851/2.870/0.017        0.563/0.601/0.631/0.022       20,714,938
 3     42.383%      3.713/3.758/3.780/0.020        0.565/0.580/0.592/0.008       20,008,755
 4     41.660%      2.768/2.802/2.818/0.015        0.353/0.372/0.381/0.008       13,111,391
 5     41.216%      3.014/3.031/3.051/0.012        0.365/0.375/0.383/0.005       12,971,746
 6     41.039%      3.505/3.539/3.555/0.014        0.358/0.372/0.384/0.009       12,916,025
 7     40.778%      2.239/2.270/2.284/0.011        0.168/0.175/0.181/0.004        6,416,924
 8     40.704%      2.769/2.800/2.809/0.010        0.168/0.179/0.188/0.007        6,405,244
 9     40.696%      2.983/3.016/3.029/0.013        0.160/0.172/0.180/0.007        6,404,084

 avg1  42.947%                        2.981                          0.424
 tot                                402.450                         57.285      141,567,965

PR #1008 `858ec3e` aarch64

   text    data     bss     dec     hex filename
 105194    1576      16  106786   1a122 libz-ng.so.2

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.166%      2.739/2.769/2.803/0.017        0.979/0.997/1.010/0.009       42,618,858
 2     43.879%      2.838/2.857/2.871/0.011        0.584/0.612/0.625/0.013       20,714,938
 3     42.383%      3.733/3.754/3.780/0.016        0.551/0.582/0.597/0.015       20,008,755
 4     41.660%      2.765/2.800/2.815/0.014        0.353/0.375/0.387/0.011       13,111,391
 5     41.216%      3.003/3.026/3.042/0.012        0.352/0.372/0.381/0.008       12,971,746
 6     41.039%      3.513/3.533/3.540/0.007        0.351/0.364/0.378/0.008       12,916,025
 7     40.778%      2.247/2.271/2.283/0.009        0.156/0.179/0.188/0.009        6,416,924
 8     40.704%      2.772/2.804/2.817/0.011        0.161/0.177/0.184/0.007        6,405,244
 9     40.696%      2.988/3.024/3.039/0.013        0.156/0.177/0.188/0.008        6,404,084

 avg1  42.947%                        2.982                          0.426
 tot                                402.583                         57.530      141,567,965

- Fix inflate corruption on aarch64 #1008 - Fix MSVC warnings #1002 - Minor chunkset improvements #1000 #994 - Minor cleanup #997 - Add CI test for pigz on aarch64 #1004 - Cmake improvements #996

- Fix inflate corruption on aarch64 #1008 - Fix MSVC warnings #1002 #1013 - Minor chunkset improvements #1000 #994 - Minor cleanup #997 - Add CI test for pigz on aarch64 #1004 - Cmake improvements #996

- Fix inflate corruption on aarch64 #1008 - Fix MSVC warnings #1002 #1013 - Minor chunkset improvements #1000 #994 #1015 - Minor cleanup #997 - Add CI test for pigz on aarch64 #1004 - Cmake improvements #996

mtl1979 added 3 commits June 19, 2021 04:23

[chunkset_neon] Don't use signed vector types.

061aca5

* There is no need to convert between unsigned and signed vector types. All relevant intrinsics have versions for all unsigned vector types.

[chunkcopy_safe] Don't call chunkcopy().

858ec3e

* chunkcopy() can read or write more than the safe length if the length is not multiple of chunk size.

mtl1979 force-pushed the chunkset-neon branch from 0025f87 to 858ec3e Compare June 19, 2021 01:36

nmoinvaz approved these changes Jun 19, 2021

View reviewed changes

nmoinvaz added Architecture Architecture specific bug labels Jun 19, 2021

Dead2 merged commit 4521023 into zlib-ng:develop Jun 21, 2021

Dead2 added a commit that referenced this pull request Jun 21, 2021

Version 2.0.5

f9d61dc

- Fix inflate corruption on aarch64 #1008 - Fix MSVC warnings #1002 - Minor chunkset improvements #1000 #994 - Minor cleanup #997 - Add CI test for pigz on aarch64 #1004 - Cmake improvements #996

Dead2 mentioned this pull request Jun 21, 2021

Version 2.0.5 #1014

Merged

Dead2 added a commit that referenced this pull request Jun 21, 2021

Version 2.0.5

feb12cd

- Fix inflate corruption on aarch64 #1008 - Fix MSVC warnings #1002 #1013 - Minor chunkset improvements #1000 #994 - Minor cleanup #997 - Add CI test for pigz on aarch64 #1004 - Cmake improvements #996

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARM Fixes... #1008

ARM Fixes... #1008

mtl1979 commented Jun 18, 2021

codecov bot commented Jun 18, 2021 •

edited

Loading

nmoinvaz commented Jun 19, 2021

nmoinvaz commented Jun 19, 2021

mtl1979 commented Jun 19, 2021

nmoinvaz commented Jun 19, 2021

mtl1979 commented Jun 19, 2021

mtl1979 commented Jun 19, 2021

nmoinvaz commented Jun 19, 2021

mtl1979 commented Jun 19, 2021 •

edited

Loading

nmoinvaz commented Jun 19, 2021

mtl1979 commented Jun 19, 2021 •

edited

Loading

Dead2 commented Jun 19, 2021 •

edited

Loading

ARM Fixes... #1008

ARM Fixes... #1008

Conversation

mtl1979 commented Jun 18, 2021

codecov bot commented Jun 18, 2021 • edited Loading

Codecov Report

nmoinvaz commented Jun 19, 2021

nmoinvaz commented Jun 19, 2021

mtl1979 commented Jun 19, 2021

nmoinvaz commented Jun 19, 2021

mtl1979 commented Jun 19, 2021

mtl1979 commented Jun 19, 2021

nmoinvaz commented Jun 19, 2021

Corpora.tar

ZLIB-NG 834e7d8

ZLIB-NG PR 1008 0025f87

mtl1979 commented Jun 19, 2021 • edited Loading

nmoinvaz commented Jun 19, 2021

mtl1979 commented Jun 19, 2021 • edited Loading

Dead2 commented Jun 19, 2021 • edited Loading

Baseline 834e7d8 aarch64

PR #1008 858ec3e aarch64

codecov bot commented Jun 18, 2021 •

edited

Loading

ZLIB-NG `834e7d8`

ZLIB-NG PR 1008 `0025f87`

mtl1979 commented Jun 19, 2021 •

edited

Loading

mtl1979 commented Jun 19, 2021 •

edited

Loading

Dead2 commented Jun 19, 2021 •

edited

Loading

Baseline `834e7d8` aarch64

PR #1008 `858ec3e` aarch64