More struct and deflate-related cleanups and optimizations #740

Dead2 · 2020-08-29T21:22:10Z

This PR is a little all over the place, since I was hunting warnings all over, and it represents about 2 full days of testing and tweaking.
I did have to give up on shrinking s->heap, and also on getting rid of all of the bi_valid warnings, after numerous attemtpts from scratch, all resulting in incorrect deflate output data for unknown reasons (I have theories, but don't know exactly where the errors happen).

I don't know how many warnings got fixed and/or silenced, but I'd estimate 20 (when excluding duplicates), still lots more (and quite a lot of them false due to a GCC bug, makes it hard to spot the real ones).

This shrinks the deflate internal_state struct by 8 bytes. Before:

        /* size: 5984, cachelines: 94, members: 57 */
        /* sum members: 5972, holes: 3, sum holes: 8 */
        /* padding: 4 */
        /* last cacheline: 32 bytes */

After:

        /* size: 5976, cachelines: 94, members: 56 */
        /* sum members: 5967, holes: 2, sum holes: 5 */
        /* padding: 4 */
        /* last cacheline: 24 bytes */

PS: To review this, please look at each commit separately, otherwise it'll be very difficult to get an overview of what is happening.

nmoinvaz · 2020-08-29T21:29:38Z

I noticed that you removed the UNLIKELY macro in longest_match. Did you notice any improvements when you removed it?

Dead2 · 2020-08-29T21:30:20Z

Baseline `6264b5a`

   text    data     bss     dec     hex filename
  95243    1232      32   96507   178fb libz-ng.so.1

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     57.781%      1.320/1.339/1.348/0.008        0.875/0.888/0.898/0.008      127,296,397
 2     43.874%      1.265/1.279/1.288/0.007        0.443/0.464/0.473/0.007       48,329,132
 3     42.490%      1.319/1.332/1.339/0.006        0.372/0.384/0.392/0.005       40,118,068
 4     41.472%      1.304/1.313/1.319/0.004        0.296/0.313/0.322/0.006       32,630,864
 5     41.212%      1.392/1.407/1.415/0.006        0.300/0.311/0.317/0.005       32,425,995
 6     41.037%      1.314/1.332/1.339/0.006        0.236/0.249/0.253/0.004       25,831,160
 7     40.784%      1.289/1.297/1.302/0.004        0.172/0.186/0.192/0.005       19,253,737
 8     40.706%      1.108/1.115/1.118/0.003        0.109/0.123/0.129/0.004       12,811,166
 9     40.695%      1.227/1.237/1.240/0.003        0.115/0.121/0.126/0.003       12,807,941

 avg1  43.339%                        1.295                          0.338
 tot                                349.546                         91.177      351,504,460

PR #740 `254c3e7`

   text    data     bss     dec     hex filename
  95547    1232      32   96811   17a2b libz-ng.so.1

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     57.781%      1.291/1.337/1.349/0.013        0.858/0.884/0.895/0.010      127,296,397
 2     43.874%      1.244/1.267/1.276/0.009        0.423/0.465/0.473/0.011       48,329,132
 3     42.490%      1.317/1.328/1.335/0.005        0.369/0.388/0.395/0.006       40,118,068
 4     41.472%      1.291/1.301/1.309/0.005        0.301/0.313/0.319/0.005       32,630,864
 5     41.212%      1.392/1.404/1.411/0.005        0.297/0.312/0.320/0.005       32,425,995
 6     41.037%      1.318/1.329/1.334/0.005        0.240/0.249/0.253/0.004       25,831,160
 7     40.784%      1.282/1.295/1.301/0.005        0.175/0.183/0.189/0.005       19,253,737
 8     40.706%      1.103/1.111/1.116/0.003        0.109/0.121/0.126/0.004       12,811,166
 9     40.695%      1.218/1.232/1.237/0.004        0.105/0.121/0.126/0.004       12,807,941

 avg1  43.339%                        1.289                          0.337
 tot                                348.143                         91.034      351,504,460

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     57.781%      1.298/1.333/1.345/0.011        0.865/0.884/0.898/0.008      127,296,397
 2     43.874%      1.256/1.268/1.274/0.005        0.443/0.461/0.469/0.008       48,329,132
 3     42.490%      1.317/1.329/1.334/0.005        0.366/0.385/0.396/0.008       40,118,068
 4     41.472%      1.287/1.302/1.309/0.006        0.301/0.314/0.322/0.006       32,630,864
 5     41.212%      1.386/1.406/1.414/0.006        0.300/0.312/0.320/0.005       32,425,995
 6     41.037%      1.319/1.329/1.335/0.004        0.232/0.247/0.253/0.005       25,831,160
 7     40.784%      1.287/1.296/1.302/0.004        0.175/0.185/0.192/0.005       19,253,737
 8     40.706%      1.104/1.112/1.116/0.003        0.109/0.120/0.126/0.004       12,811,166
 9     40.695%      1.219/1.229/1.234/0.004        0.109/0.121/0.126/0.004       12,807,941

 avg1  43.339%                        1.289                          0.337
 tot                                348.097                         90.880      351,504,460

About 0.4% faster on average.
Speed increase seems to be more or less even across the algorithms.
Compiled code size (text) increases by 0.32% or 304 bytes. (the first few commits decrease size, but the last few increase size)

arch/x86/compare258_avx.c

arch/x86/compare258_sse.c

Dead2 · 2020-08-29T21:33:34Z

I noticed that you removed the UNLIKELY macro in longest_match. Did you notice any improvements when you removed it?

Yes, that is actually what made most of the speed increase.
The thing about UNLIKELY is that it not only puts the unlikely branch outside the function code block, but it also lets the compiler optimize it for size rather than speed. So even though those only hit about 30% of the time, it is faster if we don't do that.

compare258.c

nmoinvaz · 2020-08-29T21:39:18Z

Side note: I wonder if the compiler can better optimize if returning len from the compare functions instead of 258 and if it is faster or not. My initial tests at the time showed so, but I don't place much faith in those tests since it was before deflatebench.

nmoinvaz

Look good to me other than comments made by @mtl1979 that should be fixed.

codecov · 2020-08-29T21:44:10Z

Codecov Report

Merging #740 into develop will increase coverage by 0.04%.
The diff coverage is 73.49%.

@@             Coverage Diff             @@
##           develop     #740      +/-   ##
===========================================
+ Coverage    77.38%   77.43%   +0.04%     
===========================================
  Files           69       69              
  Lines         7614     7644      +30     
  Branches      1312     1322      +10     
===========================================
+ Hits          5892     5919      +27     
+ Misses        1211     1206       -5     
- Partials       511      519       +8

Flag	Coverage Δ
#macos_clang	`63.85% <28.76%> (-0.05%)`	⬇️
#macos_gcc	`63.85% <28.76%> (-0.05%)`	⬇️
#ubuntu_clang	`75.90% <57.33%> (?)`
#ubuntu_clang_debug	`75.56% <57.33%> (-0.03%)`	⬇️
#ubuntu_clang_inflate_allow_invalid_dist	`74.91% <58.90%> (-0.03%)`	⬇️
#ubuntu_clang_inflate_strict	`75.19% <58.90%> (-0.03%)`	⬇️
#ubuntu_clang_mmap	`74.92% <58.90%> (?)`
#ubuntu_clang_msan	`74.97% <60.29%> (-0.03%)`	⬇️
#ubuntu_gcc	`77.08% <70.37%> (+0.14%)`	⬆️
#ubuntu_gcc_armhf	`?`
#ubuntu_gcc_armhf_compat_no_opt	`?`
#ubuntu_gcc_armhf_no_acle	`76.43% <53.06%> (-0.02%)`	⬇️
#ubuntu_gcc_armhf_no_neon	`?`
#ubuntu_gcc_armsf	`76.24% <53.06%> (-1.20%)`	⬇️
#ubuntu_gcc_armsf_compat_no_opt	`?`
#ubuntu_gcc_compat_no_opt	`76.30% <45.65%> (-0.02%)`	⬇️
#ubuntu_gcc_no_avx2	`77.02% <60.71%> (-0.01%)`	⬇️
#ubuntu_gcc_no_pclmulqdq	`75.61% <57.57%> (?)`
#ubuntu_gcc_no_sse2	`76.28% <58.46%> (-0.04%)`	⬇️
#ubuntu_gcc_no_sse4	`75.60% <59.09%> (-0.04%)`	⬇️
#ubuntu_gcc_osb	`69.60% <51.38%> (-0.02%)`	⬇️
#ubuntu_gcc_ppc	`77.60% <59.61%> (-0.01%)`	⬇️
#ubuntu_gcc_ppc64	`76.83% <59.61%> (-0.01%)`	⬇️
#ubuntu_gcc_ppc64le	`76.04% <53.06%> (-0.02%)`	⬇️
#ubuntu_gcc_s390x	`75.34% <62.79%> (-0.03%)`	⬇️
#ubuntu_gcc_sparc64	`77.60% <62.50%> (+1.12%)`	⬆️
#win64_gcc	`76.61% <54.54%> (?)`
#win64_gcc_compat_no_opt	`76.18% <45.65%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
deflate.h	`75.00% <ø> (ø)`
compare258.c	`51.61% <46.15%> (ø)`
deflate.c	`68.28% <50.00%> (+0.12%)`	⬆️
arch/x86/compare258_avx.c	`100.00% <100.00%> (ø)`
arch/x86/compare258_sse.c	`100.00% <100.00%> (ø)`
arch/x86/slide_avx.c	`100.00% <100.00%> (ø)`
arch/x86/slide_sse.c	`95.23% <100.00%> (ø)`
deflate_medium.c	`86.18% <100.00%> (ø)`
functable.c	`84.61% <100.00%> (ø)`
match_tpl.h	`100.00% <100.00%> (ø)`
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1f10aa4...d0da0b9. Read the comment docs.

mtl1979 · 2020-08-29T21:44:19Z

@nmoinvaz Normally return value is in register, so if you return a variable, compiler can optimize the code, so same register is used across the whole function. This implies that the function doesn't call other functions. I don't think loading constant to register is much slower as compiler can zero- or sign extend constants more easily than variables. In most cases the compiler also knows it doesn't need to spill the previous contents of EAX/RAX register.

nmoinvaz · 2020-08-29T22:07:26Z

@mtl1979 Ah ok so it probably doesn't really matter.

mtl1979 · 2020-08-29T22:29:07Z

@nmoinvaz Unless there is a lot of other memory loads close to the return of constant, optimizer most likely can mask the difference by reordering so that the load happens without a pipeline stall. There is obviously limit of how many memory loads can happen in one block of code (it depends on processor architecture, but safe is less than 5), so this doesn't apply to using constants in any other part of code.

Dead2 · 2020-08-29T22:47:14Z

I pushed another commit that cleans up some annoying if block inconsistencies in deflate.c, please review if you have the time.
@nmoinvaz @mtl1979

Dead2 · 2020-08-29T23:02:03Z

Removing {} around macros is usually bad idea.

Right, at least if the macros don't have them themselves. =/

Dead2 · 2020-08-29T23:06:36Z

deflate.c

        ERR_RETURN(strm, Z_BUF_ERROR);
    }

    /* User must not provide more input after the first FINISH: */
-    if (s->status == FINISH_STATE && strm->avail_in != 0) {
+    if (s->status == FINISH_STATE && strm->avail_in != 0)
        ERR_RETURN(strm, Z_BUF_ERROR);


@mtl1979 You meant this one, I guess.
That macro was already missing {} around it at line 787, but I'll add it to both just for consistency.

mtl1979 · 2020-08-29T23:23:50Z

@Dead2 In most cases macro itself shouldn't have {} around the actual code, but a lot of people use dummy do or while block to avoid mysterious errors when including the macro in places where only a single code line is expected. Only when the macro declares new variables, it requires explicit {}.

Some older compilers did actually have maximum limit for nested block scopes, for example old Linaro ARM compiler.

…ead.

longest_match related code.

Dead2 · 2020-08-31T10:43:38Z

Rebased to fix merge conflicts with #735

Ran another benchmark too

 Tool: minigzip   Levels: 1-9
 Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     57.781%      1.309/1.336/1.350/0.011        0.862/0.882/0.892/0.008      127,296,397
 2     43.874%      1.250/1.270/1.276/0.006        0.446/0.464/0.473/0.008       48,329,132
 3     42.490%      1.308/1.329/1.336/0.007        0.372/0.386/0.395/0.005       40,118,068
 4     41.472%      1.285/1.299/1.305/0.005        0.305/0.314/0.319/0.005       32,630,864
 5     41.212%      1.390/1.405/1.411/0.006        0.303/0.314/0.320/0.005       32,425,995
 6     41.037%      1.317/1.328/1.335/0.005        0.226/0.247/0.253/0.006       25,831,160
 7     40.784%      1.275/1.290/1.298/0.006        0.172/0.186/0.189/0.004       19,253,737
 8     40.706%      1.097/1.108/1.113/0.005        0.116/0.124/0.126/0.003       12,811,166
 9     40.695%      1.210/1.226/1.231/0.004        0.113/0.121/0.126/0.003       12,807,941

 avg1  43.339%                        1.288                          0.338
 tot                                347.739                         91.126      351,504,460

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     57.781%      1.311/1.331/1.341/0.008        0.858/0.887/0.898/0.011      127,296,397
 2     43.874%      1.252/1.270/1.279/0.008        0.440/0.463/0.472/0.007       48,329,132
 3     42.490%      1.313/1.327/1.336/0.006        0.369/0.386/0.392/0.007       40,118,068
 4     41.472%      1.275/1.298/1.307/0.008        0.295/0.311/0.319/0.007       32,630,864
 5     41.212%      1.390/1.404/1.410/0.005        0.297/0.310/0.317/0.006       32,425,995
 6     41.037%      1.309/1.326/1.333/0.006        0.230/0.248/0.253/0.006       25,831,160
 7     40.784%      1.279/1.294/1.299/0.005        0.179/0.187/0.192/0.004       19,253,737
 8     40.706%      1.096/1.108/1.113/0.005        0.113/0.121/0.126/0.004       12,811,166
 9     40.695%      1.224/1.231/1.234/0.003        0.112/0.121/0.126/0.004       12,807,941

 avg1  43.339%                        1.287                          0.337
 tot                                347.601                         91.063      351,504,460

Actually slightly better than the previous two tests, but still within test-to-test variance.
But always nice with another confirmation that it performs better.

Dead2 added cleanup Improving maintainability or removing code. optimization Reviews wanted labels Aug 29, 2020

zlib-ng deleted a comment from codecov bot Aug 29, 2020