Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More struct and deflate-related cleanups and optimizations #740

Merged
merged 7 commits into from Aug 31, 2020

Conversation

Dead2
Copy link
Member

@Dead2 Dead2 commented Aug 29, 2020

This PR is a little all over the place, since I was hunting warnings all over, and it represents about 2 full days of testing and tweaking.
I did have to give up on shrinking s->heap, and also on getting rid of all of the bi_valid warnings, after numerous attemtpts from scratch, all resulting in incorrect deflate output data for unknown reasons (I have theories, but don't know exactly where the errors happen).

I don't know how many warnings got fixed and/or silenced, but I'd estimate 20 (when excluding duplicates), still lots more (and quite a lot of them false due to a GCC bug, makes it hard to spot the real ones).

This shrinks the deflate internal_state struct by 8 bytes. Before:

        /* size: 5984, cachelines: 94, members: 57 */
        /* sum members: 5972, holes: 3, sum holes: 8 */
        /* padding: 4 */
        /* last cacheline: 32 bytes */

After:

        /* size: 5976, cachelines: 94, members: 56 */
        /* sum members: 5967, holes: 2, sum holes: 5 */
        /* padding: 4 */
        /* last cacheline: 24 bytes */

PS: To review this, please look at each commit separately, otherwise it'll be very difficult to get an overview of what is happening.

@Dead2 Dead2 added cleanup Improving maintainability or removing code. optimization Reviews wanted labels Aug 29, 2020
@nmoinvaz
Copy link
Member

I noticed that you removed the UNLIKELY macro in longest_match. Did you notice any improvements when you removed it?

@Dead2
Copy link
Member Author

Dead2 commented Aug 29, 2020

Baseline 6264b5a

   text    data     bss     dec     hex filename
  95243    1232      32   96507   178fb libz-ng.so.1

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     57.781%      1.320/1.339/1.348/0.008        0.875/0.888/0.898/0.008      127,296,397
 2     43.874%      1.265/1.279/1.288/0.007        0.443/0.464/0.473/0.007       48,329,132
 3     42.490%      1.319/1.332/1.339/0.006        0.372/0.384/0.392/0.005       40,118,068
 4     41.472%      1.304/1.313/1.319/0.004        0.296/0.313/0.322/0.006       32,630,864
 5     41.212%      1.392/1.407/1.415/0.006        0.300/0.311/0.317/0.005       32,425,995
 6     41.037%      1.314/1.332/1.339/0.006        0.236/0.249/0.253/0.004       25,831,160
 7     40.784%      1.289/1.297/1.302/0.004        0.172/0.186/0.192/0.005       19,253,737
 8     40.706%      1.108/1.115/1.118/0.003        0.109/0.123/0.129/0.004       12,811,166
 9     40.695%      1.227/1.237/1.240/0.003        0.115/0.121/0.126/0.003       12,807,941

 avg1  43.339%                        1.295                          0.338
 tot                                349.546                         91.177      351,504,460

PR #740 254c3e7

   text    data     bss     dec     hex filename
  95547    1232      32   96811   17a2b libz-ng.so.1

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     57.781%      1.291/1.337/1.349/0.013        0.858/0.884/0.895/0.010      127,296,397
 2     43.874%      1.244/1.267/1.276/0.009        0.423/0.465/0.473/0.011       48,329,132
 3     42.490%      1.317/1.328/1.335/0.005        0.369/0.388/0.395/0.006       40,118,068
 4     41.472%      1.291/1.301/1.309/0.005        0.301/0.313/0.319/0.005       32,630,864
 5     41.212%      1.392/1.404/1.411/0.005        0.297/0.312/0.320/0.005       32,425,995
 6     41.037%      1.318/1.329/1.334/0.005        0.240/0.249/0.253/0.004       25,831,160
 7     40.784%      1.282/1.295/1.301/0.005        0.175/0.183/0.189/0.005       19,253,737
 8     40.706%      1.103/1.111/1.116/0.003        0.109/0.121/0.126/0.004       12,811,166
 9     40.695%      1.218/1.232/1.237/0.004        0.105/0.121/0.126/0.004       12,807,941

 avg1  43.339%                        1.289                          0.337
 tot                                348.143                         91.034      351,504,460

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     57.781%      1.298/1.333/1.345/0.011        0.865/0.884/0.898/0.008      127,296,397
 2     43.874%      1.256/1.268/1.274/0.005        0.443/0.461/0.469/0.008       48,329,132
 3     42.490%      1.317/1.329/1.334/0.005        0.366/0.385/0.396/0.008       40,118,068
 4     41.472%      1.287/1.302/1.309/0.006        0.301/0.314/0.322/0.006       32,630,864
 5     41.212%      1.386/1.406/1.414/0.006        0.300/0.312/0.320/0.005       32,425,995
 6     41.037%      1.319/1.329/1.335/0.004        0.232/0.247/0.253/0.005       25,831,160
 7     40.784%      1.287/1.296/1.302/0.004        0.175/0.185/0.192/0.005       19,253,737
 8     40.706%      1.104/1.112/1.116/0.003        0.109/0.120/0.126/0.004       12,811,166
 9     40.695%      1.219/1.229/1.234/0.004        0.109/0.121/0.126/0.004       12,807,941

 avg1  43.339%                        1.289                          0.337
 tot                                348.097                         90.880      351,504,460

About 0.4% faster on average.
Speed increase seems to be more or less even across the algorithms.
Compiled code size (text) increases by 0.32% or 304 bytes. (the first few commits decrease size, but the last few increase size)

@zlib-ng zlib-ng deleted a comment from codecov bot Aug 29, 2020
@zlib-ng zlib-ng deleted a comment from codecov bot Aug 29, 2020
arch/x86/compare258_avx.c Outdated Show resolved Hide resolved
arch/x86/compare258_avx.c Outdated Show resolved Hide resolved
arch/x86/compare258_sse.c Outdated Show resolved Hide resolved
arch/x86/compare258_sse.c Outdated Show resolved Hide resolved
@Dead2
Copy link
Member Author

Dead2 commented Aug 29, 2020

I noticed that you removed the UNLIKELY macro in longest_match. Did you notice any improvements when you removed it?

Yes, that is actually what made most of the speed increase.
The thing about UNLIKELY is that it not only puts the unlikely branch outside the function code block, but it also lets the compiler optimize it for size rather than speed. So even though those only hit about 30% of the time, it is faster if we don't do that.

compare258.c Outdated Show resolved Hide resolved
compare258.c Outdated Show resolved Hide resolved
@nmoinvaz
Copy link
Member

nmoinvaz commented Aug 29, 2020

Side note: I wonder if the compiler can better optimize if returning len from the compare functions instead of 258 and if it is faster or not. My initial tests at the time showed so, but I don't place much faith in those tests since it was before deflatebench.

Copy link
Member

@nmoinvaz nmoinvaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good to me other than comments made by @mtl1979 that should be fixed.

@codecov
Copy link

codecov bot commented Aug 29, 2020

Codecov Report

Merging #740 into develop will increase coverage by 0.04%.
The diff coverage is 73.49%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #740      +/-   ##
===========================================
+ Coverage    77.38%   77.43%   +0.04%     
===========================================
  Files           69       69              
  Lines         7614     7644      +30     
  Branches      1312     1322      +10     
===========================================
+ Hits          5892     5919      +27     
+ Misses        1211     1206       -5     
- Partials       511      519       +8     
Flag Coverage Δ
#macos_clang 63.85% <28.76%> (-0.05%) ⬇️
#macos_gcc 63.85% <28.76%> (-0.05%) ⬇️
#ubuntu_clang 75.90% <57.33%> (?)
#ubuntu_clang_debug 75.56% <57.33%> (-0.03%) ⬇️
#ubuntu_clang_inflate_allow_invalid_dist 74.91% <58.90%> (-0.03%) ⬇️
#ubuntu_clang_inflate_strict 75.19% <58.90%> (-0.03%) ⬇️
#ubuntu_clang_mmap 74.92% <58.90%> (?)
#ubuntu_clang_msan 74.97% <60.29%> (-0.03%) ⬇️
#ubuntu_gcc 77.08% <70.37%> (+0.14%) ⬆️
#ubuntu_gcc_armhf ?
#ubuntu_gcc_armhf_compat_no_opt ?
#ubuntu_gcc_armhf_no_acle 76.43% <53.06%> (-0.02%) ⬇️
#ubuntu_gcc_armhf_no_neon ?
#ubuntu_gcc_armsf 76.24% <53.06%> (-1.20%) ⬇️
#ubuntu_gcc_armsf_compat_no_opt ?
#ubuntu_gcc_compat_no_opt 76.30% <45.65%> (-0.02%) ⬇️
#ubuntu_gcc_no_avx2 77.02% <60.71%> (-0.01%) ⬇️
#ubuntu_gcc_no_pclmulqdq 75.61% <57.57%> (?)
#ubuntu_gcc_no_sse2 76.28% <58.46%> (-0.04%) ⬇️
#ubuntu_gcc_no_sse4 75.60% <59.09%> (-0.04%) ⬇️
#ubuntu_gcc_osb 69.60% <51.38%> (-0.02%) ⬇️
#ubuntu_gcc_ppc 77.60% <59.61%> (-0.01%) ⬇️
#ubuntu_gcc_ppc64 76.83% <59.61%> (-0.01%) ⬇️
#ubuntu_gcc_ppc64le 76.04% <53.06%> (-0.02%) ⬇️
#ubuntu_gcc_s390x 75.34% <62.79%> (-0.03%) ⬇️
#ubuntu_gcc_sparc64 77.60% <62.50%> (+1.12%) ⬆️
#win64_gcc 76.61% <54.54%> (?)
#win64_gcc_compat_no_opt 76.18% <45.65%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
deflate.h 75.00% <ø> (ø)
compare258.c 51.61% <46.15%> (ø)
deflate.c 68.28% <50.00%> (+0.12%) ⬆️
arch/x86/compare258_avx.c 100.00% <100.00%> (ø)
arch/x86/compare258_sse.c 100.00% <100.00%> (ø)
arch/x86/slide_avx.c 100.00% <100.00%> (ø)
arch/x86/slide_sse.c 95.23% <100.00%> (ø)
deflate_medium.c 86.18% <100.00%> (ø)
functable.c 84.61% <100.00%> (ø)
match_tpl.h 100.00% <100.00%> (ø)
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1f10aa4...d0da0b9. Read the comment docs.

@mtl1979
Copy link
Collaborator

mtl1979 commented Aug 29, 2020

@nmoinvaz Normally return value is in register, so if you return a variable, compiler can optimize the code, so same register is used across the whole function. This implies that the function doesn't call other functions. I don't think loading constant to register is much slower as compiler can zero- or sign extend constants more easily than variables. In most cases the compiler also knows it doesn't need to spill the previous contents of EAX/RAX register.

@zlib-ng zlib-ng deleted a comment from codecov bot Aug 29, 2020
@nmoinvaz
Copy link
Member

@mtl1979 Ah ok so it probably doesn't really matter.

@mtl1979
Copy link
Collaborator

mtl1979 commented Aug 29, 2020

@nmoinvaz Unless there is a lot of other memory loads close to the return of constant, optimizer most likely can mask the difference by reordering so that the load happens without a pipeline stall. There is obviously limit of how many memory loads can happen in one block of code (it depends on processor architecture, but safe is less than 5), so this doesn't apply to using constants in any other part of code.

@Dead2
Copy link
Member Author

Dead2 commented Aug 29, 2020

I pushed another commit that cleans up some annoying if block inconsistencies in deflate.c, please review if you have the time.
@nmoinvaz @mtl1979

@Dead2
Copy link
Member Author

Dead2 commented Aug 29, 2020

Removing {} around macros is usually bad idea.

Right, at least if the macros don't have them themselves. =/

ERR_RETURN(strm, Z_BUF_ERROR);
}

/* User must not provide more input after the first FINISH: */
if (s->status == FINISH_STATE && strm->avail_in != 0) {
if (s->status == FINISH_STATE && strm->avail_in != 0)
ERR_RETURN(strm, Z_BUF_ERROR);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtl1979 You meant this one, I guess.
That macro was already missing {} around it at line 787, but I'll add it to both just for consistency.

@mtl1979
Copy link
Collaborator

mtl1979 commented Aug 29, 2020

@Dead2 In most cases macro itself shouldn't have {} around the actual code, but a lot of people use dummy do or while block to avoid mysterious errors when including the macro in places where only a single code line is expected. Only when the macro declares new variables, it requires explicit {}.

Some older compilers did actually have maximum limit for nested block scopes, for example old Linaro ARM compiler.

@Dead2
Copy link
Member Author

Dead2 commented Aug 31, 2020

Rebased to fix merge conflicts with #735

Ran another benchmark too

 Tool: minigzip   Levels: 1-9
 Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     57.781%      1.309/1.336/1.350/0.011        0.862/0.882/0.892/0.008      127,296,397
 2     43.874%      1.250/1.270/1.276/0.006        0.446/0.464/0.473/0.008       48,329,132
 3     42.490%      1.308/1.329/1.336/0.007        0.372/0.386/0.395/0.005       40,118,068
 4     41.472%      1.285/1.299/1.305/0.005        0.305/0.314/0.319/0.005       32,630,864
 5     41.212%      1.390/1.405/1.411/0.006        0.303/0.314/0.320/0.005       32,425,995
 6     41.037%      1.317/1.328/1.335/0.005        0.226/0.247/0.253/0.006       25,831,160
 7     40.784%      1.275/1.290/1.298/0.006        0.172/0.186/0.189/0.004       19,253,737
 8     40.706%      1.097/1.108/1.113/0.005        0.116/0.124/0.126/0.003       12,811,166
 9     40.695%      1.210/1.226/1.231/0.004        0.113/0.121/0.126/0.003       12,807,941

 avg1  43.339%                        1.288                          0.338
 tot                                347.739                         91.126      351,504,460

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     57.781%      1.311/1.331/1.341/0.008        0.858/0.887/0.898/0.011      127,296,397
 2     43.874%      1.252/1.270/1.279/0.008        0.440/0.463/0.472/0.007       48,329,132
 3     42.490%      1.313/1.327/1.336/0.006        0.369/0.386/0.392/0.007       40,118,068
 4     41.472%      1.275/1.298/1.307/0.008        0.295/0.311/0.319/0.007       32,630,864
 5     41.212%      1.390/1.404/1.410/0.005        0.297/0.310/0.317/0.006       32,425,995
 6     41.037%      1.309/1.326/1.333/0.006        0.230/0.248/0.253/0.006       25,831,160
 7     40.784%      1.279/1.294/1.299/0.005        0.179/0.187/0.192/0.004       19,253,737
 8     40.706%      1.096/1.108/1.113/0.005        0.113/0.121/0.126/0.004       12,811,166
 9     40.695%      1.224/1.231/1.234/0.003        0.112/0.121/0.126/0.004       12,807,941

 avg1  43.339%                        1.287                          0.337
 tot                                347.601                         91.063      351,504,460

Actually slightly better than the previous two tests, but still within test-to-test variance.
But always nice with another confirmation that it performs better.

@Dead2 Dead2 merged commit 52eb835 into develop Aug 31, 2020
@Dead2 Dead2 deleted the struct-cleanup branch January 29, 2021 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cleanup Improving maintainability or removing code. optimization Reviews wanted
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants