
Memcpy/memcmp macros #1109

Merged
merged 3 commits into from Feb 10, 2022

Conversation

nmoinvaz
Member

@nmoinvaz nmoinvaz commented Jan 10, 2022

  • Define preprocessor macros zmemcpy/zmemcmp that force unaligned access on supported architectures
  • Use zmemcpy/zmemcmp in the code where necessary.
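
The idea can be sketched roughly as follows. This is an illustration, not the actual zlib-ng definitions: the macro names come from this PR, but the architecture test and the byte-wise fallback below are assumptions. On platforms with cheap unaligned access, forwarding to memcpy/memcmp lets the compiler collapse small fixed-size calls into unaligned word loads and stores; strict-alignment targets get a plain byte loop instead.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative only: the real unaligned-access detection is more involved. */
#if defined(__x86_64__) || defined(_M_X64) || defined(__aarch64__)
/* Compilers reduce small fixed-size memcpy/memcmp calls to unaligned
   word accesses on these targets, so just forward to the C library. */
#  define zmemcpy(dst, src, n) memcpy((dst), (src), (n))
#  define zmemcmp(a, b, n)     memcmp((a), (b), (n))
#else
/* Byte-wise fallback for strict-alignment targets, avoiding faults
   from unaligned word loads. */
static void *zmemcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    while (n--)
        *d++ = *s++;
    return dst;
}
static int zmemcmp(const void *a, const void *b, size_t n) {
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;
    for (; n > 0; n--, p++, q++) {
        if (*p != *q)
            return *p < *q ? -1 : 1;
    }
    return 0;
}
#endif
```

In hot paths such as match comparison, a call like zmemcmp(p, q, 8) can then compile down to a word compare rather than a byte loop on typical x86-64 compilers.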

@lgtm-com

lgtm-com bot commented Jan 10, 2022

This pull request introduces 15 alerts when merging 9e31ad4 into b2ef108 - view on LGTM.com

new alerts:

  • 15 for Unclear comparison precedence

@lgtm-com

lgtm-com bot commented Jan 10, 2022

This pull request introduces 15 alerts when merging d40a42d into b2ef108 - view on LGTM.com

new alerts:

  • 15 for Unclear comparison precedence
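
For context on the alert class: LGTM's "Unclear comparison precedence" check fires where the precedence between comparison operators and other operators is easy to misread. A generic C illustration of the pattern, not the code actually flagged in this PR:

```c
/* Hypothetical example of the alert class, not the flagged zlib-ng code.
   In C, `!=` binds tighter than `&`, so the first form parses as
   flags & (0x4 != 0), i.e. flags & 1 -- almost certainly not intended. */
int has_flag_ambiguous(int flags) {
    return flags & 0x4 != 0;     /* the kind of expression that triggers the alert */
}

int has_flag_clear(int flags) {
    return (flags & 0x4) != 0;   /* explicit parentheses silence it */
}
```

The two functions disagree for inputs like 6 (binary 110): the ambiguous form tests bit 0 instead of bit 2.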

@codecov

codecov bot commented Jan 10, 2022

Codecov Report

Merging #1109 (727254d) into develop (9146bd4) will decrease coverage by 0.51%.
The diff coverage is 86.11%.

❗ Current head 727254d differs from pull request most recent head f5d680f. Consider uploading reports for the commit f5d680f to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1109      +/-   ##
===========================================
- Coverage    80.23%   79.71%   -0.52%     
===========================================
  Files           98       92       -6     
  Lines         9040     8943      -97     
  Branches      1438     1432       -6     
===========================================
- Hits          7253     7129     -124     
- Misses        1223     1247      +24     
- Partials       564      567       +3     
Flag Coverage Δ
macos_clang 69.48% <51.85%> (-0.02%) ⬇️
macos_gcc 68.70% <53.84%> (+0.10%) ⬆️
ubuntu_clang 69.97% <18.51%> (-0.02%) ⬇️
ubuntu_clang_debug 69.33% <18.51%> (+1.51%) ⬆️
ubuntu_clang_inflate_allow_invalid_dist 69.84% <18.51%> (-0.02%) ⬇️
ubuntu_clang_inflate_strict 69.89% <18.51%> (-0.02%) ⬇️
ubuntu_clang_mmap 69.96% <18.51%> (-0.02%) ⬇️
ubuntu_clang_msan 69.97% <18.51%> (-0.02%) ⬇️
ubuntu_clang_pigz 33.83% <24.00%> (-0.01%) ⬇️
ubuntu_clang_pigz_no_optim 39.68% <76.00%> (-0.03%) ⬇️
ubuntu_clang_pigz_no_threads 33.48% <24.00%> (-0.01%) ⬇️
ubuntu_clang_reduced_mem 70.09% <18.51%> (-0.02%) ⬇️
ubuntu_gcc 69.01% <16.00%> (-0.10%) ⬇️
ubuntu_gcc_aarch64 70.72% <47.36%> (-0.02%) ⬇️
ubuntu_gcc_aarch64_compat_no_opt 68.97% <50.00%> (-0.09%) ⬇️
ubuntu_gcc_aarch64_no_acle 69.81% <47.36%> (+0.04%) ⬆️
ubuntu_gcc_aarch64_no_neon 69.94% <64.28%> (-0.03%) ⬇️
ubuntu_gcc_armhf 70.69% <47.36%> (-0.02%) ⬇️
ubuntu_gcc_armhf_compat_no_opt 69.03% <60.00%> (+0.03%) ⬆️
ubuntu_gcc_armhf_no_acle 70.88% <47.36%> (-0.02%) ⬇️
ubuntu_gcc_armhf_no_neon 70.98% <64.28%> (-0.07%) ⬇️
ubuntu_gcc_armsf 70.69% <47.36%> (+<0.01%) ⬆️
ubuntu_gcc_armsf_compat_no_opt 69.03% <60.00%> (+0.02%) ⬆️
ubuntu_gcc_benchmark 70.42% <19.04%> (-0.42%) ⬇️
ubuntu_gcc_compat_no_opt 70.24% <50.00%> (-0.08%) ⬇️
ubuntu_gcc_compat_sprefix 68.75% <16.00%> (-0.09%) ⬇️
ubuntu_gcc_mingw_i686 0.00% <0.00%> (ø)
ubuntu_gcc_mingw_x86_64 0.00% <0.00%> (ø)
ubuntu_gcc_no_avx2 70.25% <52.00%> (+0.04%) ⬆️
ubuntu_gcc_no_ctz 71.06% <70.00%> (-0.03%) ⬇️
ubuntu_gcc_no_ctzll 70.80% <75.00%> (-0.01%) ⬇️
ubuntu_gcc_no_pclmulqdq 67.36% <16.00%> (+0.03%) ⬆️
ubuntu_gcc_no_sse2 68.42% <16.00%> (+0.03%) ⬆️
ubuntu_gcc_no_sse4 68.22% <16.00%> (+0.03%) ⬆️
ubuntu_gcc_o3 69.62% <20.00%> (-0.02%) ⬇️
ubuntu_gcc_osb 70.68% <18.18%> (+1.59%) ⬆️
ubuntu_gcc_pigz 34.31% <21.73%> (-0.03%) ⬇️
ubuntu_gcc_pigz_aarch64 37.52% <58.82%> (+0.23%) ⬆️
ubuntu_gcc_ppc 68.22% <40.00%> (-0.87%) ⬇️
ubuntu_gcc_ppc64 71.76% <20.00%> (+0.16%) ⬆️
ubuntu_gcc_ppc64le 70.37% <47.36%> (+0.13%) ⬆️
ubuntu_gcc_ppc_no_power8 70.76% <40.00%> (-1.10%) ⬇️
ubuntu_gcc_s390x 73.39% <25.00%> (+1.11%) ⬆️
ubuntu_gcc_s390x_dfltcc ?
ubuntu_gcc_s390x_dfltcc_compat ?
ubuntu_gcc_s390x_no_crc32 ?
ubuntu_gcc_sparc64 71.97% <40.00%> (ø)
ubuntu_gcc_sprefix 68.60% <16.00%> (-0.09%) ⬇️
win64_gcc ∅ <ø> (∅)
win64_gcc_compat_no_opt ∅ <ø> (∅)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
insert_string_tpl.h 96.87% <ø> (ø)
deflate.h 30.00% <33.33%> (ø)
match_tpl.h 74.07% <75.00%> (ø)
arch/arm/chunkset_neon.c 100.00% <100.00%> (ø)
arch/power/chunkset_power8.c 77.77% <100.00%> (+11.11%) ⬆️
chunkset.c 100.00% <100.00%> (+38.09%) ⬆️
chunkset_tpl.h 99.14% <100.00%> (ø)
compare256.c 87.34% <100.00%> (+3.59%) ⬆️
deflate_quick.c 95.83% <100.00%> (ø)
inffast.c 82.56% <100.00%> (-0.07%) ⬇️
... and 40 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9146bd4...f5d680f. Read the comment docs.

@nmoinvaz nmoinvaz force-pushed the improvements/zmemcopy branch 3 times, most recently from 70239dc to de97ab4 Compare January 10, 2022 04:19
@nmoinvaz nmoinvaz marked this pull request as ready for review January 10, 2022 05:25
@Dead2 Dead2 added the Rebase needed (Please do a 'git rebase develop yourbranch') label Jan 14, 2022
@Dead2
Member

Dead2 commented Jan 14, 2022

This does look very promising; I wish we could do some broader benchmarking of it though, on aarch32 for example.

@nmoinvaz
Member Author

Yeah I agree. I don't have any access to ARM machines. 😟

@nmoinvaz nmoinvaz removed the Rebase needed (Please do a 'git rebase develop yourbranch') label Jan 14, 2022
@nmoinvaz
Member Author

Rebased.

@gdevenyi
Contributor

Oracle Cloud has free Ampere ARM VMs.

@KungFuJesus
Contributor

I've got an Odroid N2 for aarch64 and a quad PowerMac G5 for ppc64.

@Dead2 Dead2 added the Rebase needed (Please do a 'git rebase develop yourbranch') label Jan 16, 2022
@nmoinvaz nmoinvaz removed the Rebase needed (Please do a 'git rebase develop yourbranch') label Jan 16, 2022
@nmoinvaz
Member Author

Rebased.

@Dead2 Dead2 added the Rebase needed (Please do a 'git rebase develop yourbranch'), Benchmarks wanted (Please benchmark with and without this PR), and enhancement labels Jan 17, 2022
@nmoinvaz
Member Author

I plan to split some of this PR out into a PR just for moving unaligned detection into code.

@nmoinvaz nmoinvaz removed the Rebase needed (Please do a 'git rebase develop yourbranch') label Jan 18, 2022
@nmoinvaz
Member Author

Rebased.

@nmoinvaz
Member Author

nmoinvaz commented Feb 7, 2022

Looks like it still happens with -DWITH_OPTIM=OFF. And it still happens when doing -T ClangCl.

@nmoinvaz
Member Author

nmoinvaz commented Feb 7, 2022

@Dead2 it looks like it is caused by this commit: 0129e88.

@mtl1979
Collaborator

mtl1979 commented Feb 7, 2022

We knew limiting minimum match length to 4 instead of 3 would hurt compression in some cases...

@nmoinvaz
Member Author

nmoinvaz commented Feb 7, 2022

It hurts decompression.

@nmoinvaz
Member Author

nmoinvaz commented Feb 7, 2022

The compressed stream changes and somehow decompression is hurt. Maybe it is having to seek far back into the window all the time?

I did a comparison of minigzip output with z_verbose=2 between level 7 and 8, and there didn't seem to be much of a difference, imo.

Any debugging ideas?

@mtl1979
Collaborator

mtl1979 commented Feb 7, 2022

I would check distances of 3 and 4 byte matches... It's possible that the chain lengths are long enough to hurt performance.

@nmoinvaz
Member Author

nmoinvaz commented Feb 7, 2022

My thinking is that my CPU may need to bring memory in and out of cache to perform these operations if the matches are too far back in the window. If somebody else can run the test using silesia-small.tar, it might give a better idea.

@Dead2
Member

Dead2 commented Feb 7, 2022

I think I would do a profiling comparison, and see where the huge amount of extra time is spent. Not sure whether it will actually tell us what is "wrong" about the data, but hopefully it would give a hint.

My thinking is that the commit that you found did not cause this, it just modified the compressed data in a way that made the issue crop up in this specific circumstance. The compressed data is valid after all, and it would surprise me if it is very different from level 7, making the increased time spent even more surprising.

Interestingly, I do not see the same problem with the same data. But we do have several cases where we compress the same data differently depending on what arch-specific optimizations are enabled etc. Not sure how that still happens with WITH_OPTIM=OFF, but we have not really kept track of what affects the compressed output.

You could upload the compressed level 7 and level 8 files, then we could all test decompression to see whether we can at least replicate the slowdown with the same compressed data as you.

@nmoinvaz
Member Author

nmoinvaz commented Feb 7, 2022

Last night I tried to do some performance profiling in Visual Studio 2022 but nothing stood out at me.
Here is my data set. I have included the original .tar and also the compressed version of the same at level 7 and 8.
silesia-small-data.zip

@nmoinvaz
Member Author

nmoinvaz commented Feb 7, 2022

I will try running the same test on my Tiger Lake laptop to see if the issue exists there as well.

Level 7 decompression

[profiler screenshot]

Level 8 decompression

[profiler screenshot]

@KungFuJesus
Contributor

Nothing terribly conclusive there: just that 4% less of the total runtime is burned in zng_inflate_fast on level 8 than on level 7 (inclusive, not just self). There also seems to be ~5% less time in the C runtime, though it's hard to say if that's a good apples-to-apples comparison here (there's going to be some run-to-run variance, and it's hard to get great resolution). Do you see this same behavior with the larger version of the test data?

@nmoinvaz
Member Author

nmoinvaz commented Feb 8, 2022

Here are benchmark results on my Tiger Lake laptop:

MSVC 17.0.5 x86-64 silesia-small.tar

DEVELOP 748127e

 Processor: Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
 Tool: ../zlib-ng/build-x/Release/minigzip.exe Levels: 0-9
 Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.016/0.017/0.017/0.000        0.085/0.104/0.114/0.008       15,737,543
 1     54.166%      0.089/0.092/0.097/0.002        0.044/0.045/0.046/0.001        8,523,732
 2     43.871%      0.143/0.151/0.170/0.007        0.046/0.047/0.048/0.001        6,903,609
 3     42.387%      0.200/0.206/0.215/0.005        0.045/0.046/0.047/0.001        6,670,099
 4     41.647%      0.224/0.234/0.265/0.011        0.043/0.045/0.045/0.000        6,553,723
 5     41.216%      0.229/0.235/0.247/0.005        0.043/0.045/0.045/0.001        6,485,938
 6     41.037%      0.267/0.276/0.289/0.006        0.043/0.044/0.045/0.000        6,457,776
 7     40.778%      0.337/0.345/0.367/0.008        0.043/0.044/0.045/0.001        6,416,919
 8     40.704%      0.430/0.444/0.465/0.010        0.105/0.121/0.128/0.005        6,405,244
 9     40.409%      0.505/0.518/0.535/0.009        0.042/0.044/0.044/0.001        6,358,951

 avg1  48.622%                        0.252                          0.059
 avg2  54.025%                        0.280                          0.065
 tot                                 75.573                         17.558       76,513,534

Oddly enough on level 0, the decompression time is also higher.

@nmoinvaz
Member Author

nmoinvaz commented Feb 8, 2022

As far back as I can go in the git tree, it seems that inflate stored takes the same amount of time.

@KungFuJesus
Contributor

KungFuJesus commented Feb 8, 2022

Oddly enough on level 0, the decompression time is also higher.

So that's not too surprising. At level 0 you're likely incurring much more IO if you're actually decompressing from block storage. You're running nearly double the amount of data through memory and back out to storage.
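
A quick back-of-the-envelope check of that argument, using the compressed sizes from the benchmark tables in this thread (the helper name below is made up for illustration): the stored level 0 output is almost twice the size of the level 1 output, so nearly twice as many bytes make the round trip through memory and out to storage.

```c
/* Ratio of bytes written at level 0 vs. level 1; sizes taken from the
   silesia-small.tar benchmark table above. */
double stored_vs_level1_ratio(void) {
    const double level0_bytes = 15737543.0; /* stored, ~100% of input */
    const double level1_bytes = 8523732.0;  /* level 1, ~54% of input */
    return level0_bytes / level1_bytes;     /* roughly 1.85x the IO */
}
```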

@Dead2
Member

Dead2 commented Feb 8, 2022

That is what I have always seen too, that the time spent to inflate a file increases roughly linearly with its compressed size.

I have not gotten around to testing the file you uploaded yet.

@Dead2
Member

Dead2 commented Feb 9, 2022

Xeon E5-2650, x86-64, GCC 4.8

Baseline GCC4.8 88f331e

 Tool: minigzip   Levels: 0-9
 Runs: 32         Trim worst: 18

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.014/0.015/0.017/0.001        0.013/0.019/0.023/0.003       15,737,543
 1     54.166%      0.212/0.227/0.237/0.007        0.100/0.113/0.117/0.005        8,523,732
 2     43.871%      0.434/0.445/0.450/0.005        0.106/0.118/0.122/0.005        6,903,609
 3     42.387%      0.544/0.552/0.557/0.003        0.112/0.118/0.119/0.002        6,670,099
 4     41.647%      0.596/0.607/0.614/0.006        0.099/0.109/0.115/0.005        6,553,723
 5     41.216%      0.660/0.667/0.672/0.004        0.104/0.110/0.114/0.003        6,485,938
 6     41.037%      0.745/0.749/0.756/0.003        0.099/0.105/0.112/0.005        6,457,776
 7     40.778%      0.951/0.959/0.966/0.005        0.098/0.108/0.111/0.004        6,416,919
 8     40.704%      1.167/1.174/1.177/0.003        0.087/0.105/0.114/0.007        6,405,244
 9     40.409%      1.432/1.437/1.442/0.003        0.105/0.114/0.118/0.004        6,358,951

 avg1  48.622%                        0.683                          0.102
 avg2  54.025%                        0.759                          0.113
 tot                                 95.652                         14.269       76,513,534

   text    data     bss     dec     hex filename
 130303    1456      48  131807   202df libz-ng.so.2

PR #1109 GCC4.8 727254d

 Tool: minigzip   Levels: 0-9
 Runs: 32         Trim worst: 18

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.009/0.014/0.017/0.003        0.016/0.020/0.023/0.002       15,737,543
 1     54.166%      0.216/0.227/0.234/0.006        0.100/0.109/0.117/0.006        8,523,732
 2     43.871%      0.437/0.444/0.449/0.003        0.109/0.120/0.125/0.004        6,903,609
 3     42.387%      0.538/0.553/0.557/0.005        0.109/0.118/0.122/0.004        6,670,099
 4     41.647%      0.593/0.605/0.611/0.005        0.101/0.111/0.118/0.004        6,553,723
 5     41.216%      0.662/0.672/0.677/0.004        0.100/0.111/0.116/0.004        6,485,938
 6     41.037%      0.749/0.764/0.771/0.007        0.089/0.106/0.113/0.006        6,457,776
 7     40.778%      0.926/0.937/0.944/0.004        0.091/0.103/0.111/0.006        6,416,919
 8     40.704%      1.142/1.150/1.156/0.005        0.098/0.107/0.114/0.004        6,405,244
 9     40.409%      1.405/1.416/1.422/0.004        0.101/0.112/0.118/0.005        6,358,951

 avg1  48.622%                        0.678                          0.102
 avg2  54.025%                        0.754                          0.113
 tot                                 94.948                         14.235       76,513,534

   text    data     bss     dec     hex filename
 130495    1456      48  131999   2039f libz-ng.so.2

Slightly faster on older compiler for x86-64, that is very good to see.
Increases compiled code size slightly.

That alleviates my worries about this introducing any potential regressions.

@Dead2
Member

Dead2 commented Feb 9, 2022

@nmoinvaz No, you are right. Level 0 being so much slower is indeed seriously strange, and I would guess it is probably related to the level 8 slowdown. It involves less code, so possibly that could make it easier to narrow down the problem.

Interestingly, my Xeon benchmark above makes the exact same file as yours on level 8, but in my tests the performance is right in line with what is expected. And my level 0 decompression is more than 5x faster than level 1, not twice as slow like yours seems to be.

This one is tricky.

  • It could be a measurement error, but how could that happen reliably 40+ times only on level 8 and 0, and across two different machines even?
  • It could be MSVC-specific, but that still begs the question of how it only affects levels 0 and 8. Those two levels don't have any code-paths in common that the others don't. Is it possible to test a different compiler, or perhaps a different version?
  • It could be OS-specific, possibly related to IO or scheduling, but how any OS-specific issue could reliably trigger only on level 0 and 8 would be super-weird.
  • It could be a CPU-bug in modern Intel cpus, but I honestly cannot see how that could be possible.

One possible experiment that might or might not tell us anything: if you append silesia-small.tar onto itself, so the same bitstream comes twice in a row in a file twice the size, does that file show the same behavior? It should at least have the same slowdown for the first half of the file, since the bitstreams would be identical up to that point. However, if it turns out not to be slower than expected, then it becomes seriously weird and I'd begin to suspect Windows itself.
Also possibly worth a test is trimming down the file by removing the last 64KiB, for example; that should probably not make a measurable difference in the timings, but with this issue, who knows.

Also, not trimming any of the worst benchmark results would tell us for sure whether this slowdown sometimes happens on the other levels, and whether some runs are even worse than this on levels 0 and 8. (Still makes no sense that it would happen always on levels 0 and 8, but I am really grasping at straws here)

@KungFuJesus
Contributor

KungFuJesus commented Feb 9, 2022

@Dead2 are you sure it's not time lost in IO for @nmoinvaz at level 0? It'd be helpful if these were fed from a RAM disk. At that point you're only measuring the additional memory access time (which would still likely be slower, but not by the same scale at least).

Is there a way to measure system time of a process in windows to narrow it down? If it's all in system time that would 100% agree with that theory.

Also something to consider: that Xeon has a 20 MB LLC; his laptop's -U series Tiger Lake CPU only has a 12 MB LLC.

@nmoinvaz
Member Author

nmoinvaz commented Feb 9, 2022

Here is a run on WSL2 using GCC 9.3.0.

OS: Linux 5.10.16.3-microsoft-standard-WSL2 #1 SMP Fri Apr 2 22:23:49 UTC 2021 x86_64
CPU: x86_64
Tool: /mnt/c/users/nathan/source/zlib-ng/build-wsl/minigzip
Levels: 0-9
Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.171/0.195/0.215/0.013        0.245/0.273/0.288/0.011       15,737,543
 1     54.166%      0.260/0.331/0.369/0.029        0.182/0.225/0.254/0.021        8,523,732
 2     43.871%      0.334/0.395/0.430/0.025        0.196/0.246/0.270/0.022        6,903,609
 3     42.387%      0.389/0.446/0.481/0.025        0.188/0.223/0.243/0.015        6,670,099
 4     41.647%      0.416/0.487/0.517/0.028        0.189/0.221/0.242/0.017        6,553,723
 5     41.216%      0.470/0.531/0.563/0.027        0.177/0.215/0.235/0.018        6,485,938
 6     41.037%      0.514/0.580/0.613/0.030        0.181/0.213/0.229/0.014        6,457,776
 7     40.778%      0.603/0.661/0.703/0.028        0.177/0.209/0.229/0.015        6,416,919
 8     40.704%      0.736/0.782/0.805/0.020        0.254/0.297/0.318/0.018        6,405,244
 9     40.409%      0.822/0.881/0.902/0.020        0.185/0.219/0.238/0.016        6,358,951

 avg1  48.622%                        0.529                          0.234
 avg2  54.025%                        0.588                          0.260
 tot                                158.660                         70.243       76,513,534

It seems to be related only to this dataset. A single run using silesia.tar produces a faster level 0 (stored) and a faster level 8.

Testing level 0: cd   2.188   2.283     211,973,953 100.008%
Testing level 1: cd   4.459   3.376      94,127,290  44.409%
Testing level 2: cd   6.192   3.513      75,282,961  35.518%
Testing level 3: cd   6.210   3.395      71,816,478  33.882%
Testing level 4: cd   8.608   2.738      70,315,668  33.174%
Testing level 5: cd   7.887   3.183      69,225,542  32.660%
Testing level 6: cd   8.894   3.032      68,902,222  32.508%
Testing level 7: cd  11.174   3.005      68,366,800  32.255%
Testing level 8: cd  13.764   2.801      68,180,776  32.167%
Testing level 9: cd  13.918   2.720      67,586,442  31.887%

@nmoinvaz
Member Author

nmoinvaz commented Feb 9, 2022

@KungFuJesus I thought maybe if I used WSL2 I could have access to all the perf tools, but they were all missing or needed to be compiled manually, so I couldn't get access to the sys timings.

@KungFuJesus
Contributor

KungFuJesus commented Feb 9, 2022

@KungFuJesus I thought maybe if I used WSL2 I could have access to all the perf tools, but they were all missing or needed to be compiled manually, so I couldn't get access to the sys timings.

Perf is not a strictly userspace component, though WSL2 does provide a kernel, so perhaps that doesn't matter anymore. You should be able to do a direct comparison with time -v (it should give you memory utilization, system time, etc.). You will want to restrict it to just the compression levels you're testing, though, and perhaps just a single run.

@Dead2
Member

Dead2 commented Feb 9, 2022

@KungFuJesus For sure, it could absolutely be explained by slow IO.
But I don't see how the OS manages to somehow recognize and only do slow IO for compressed files compressed by level 0 or 8.

AFAIK, there is no way the OS should be aware of what compression level was used for those files. Also, the file is compressed before each decompression test, so it is not even the same file that was decompressed 70 times; it is 70 different (and probably cached) files that should not look any different to the OS (compared to level 7, for example) than the quite minor differences in size.

The OS could be slower when doing IO because of those exact file sizes, but that would be a quite weird bug.

Also, remember the fact that using different files for testing does not result in this phenomenon.

@KungFuJesus
Contributor

@KungFuJesus For sure, it could absolutely be explained by slow IO. But I don't see how the OS manages to somehow recognize and only do slow IO for compressed files compressed by level 0 or 8.

AFAIK, there is no way the OS should be aware of what compression level was used for those files. Also, the file is compressed before each decompression test, so it is not even the same file that was decompressed 70 times, it is 70 different (and probably cached) files that should not look any different to the OS (compared to level 7 for example) than the quite minor differences in size.

The OS could be slower when doing IO because of those exact file sizes, but that would be a quite weird bug.

Also, remember the fact that using different files for testing does not result in this phenomenon.

Well given that compression is kind of a state machine with opcodes, it's not super surprising that differently compressible files would react differently here. If I had to wager a guess, you're fitting more into cache than he is for this data set. Perf can be used to measure LLC misses, so it'd be interesting to see from the xeon and from the tiger lake -U CPU what those were for those given compression levels.

@Dead2
Member

Dead2 commented Feb 9, 2022

@KungFuJesus Level 7 and Level 8 should not differ so much that it would affect speed much. And certainly level 0 should not be very challenging at all, since it does not in fact compress the data at all.
I don't think the LLC is the problem that causes the 11 KB smaller level 8 file to be 2.5x slower to decompress than the level 7 file. If anything, it should be the other way around.
The compressed file is also only about 6.4MiB, so that should leave a lot of room even in the 15MiB tiger lake cache.

I have a feeling this is OS- or timing-related, but it is pretty hard to do any kind of deep inspection like this on Windows.
Intel's VTune Profiler could do it though: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html
It can be (or at least used to be) a bit of a hassle to set up, but that whole toolkit allows very powerful analysis on both Linux and Windows (and possibly Android).

Remember that even the Raspberry Pi 1B shows a much faster level 0 than level 1, and level 8 is just about as expected. And that thing does not have a lot of cache at all.

@nmoinvaz
Member Author

nmoinvaz commented Feb 9, 2022

I used Sysinternals Process Monitor to watch the file reads/writes from minigzip.exe as deflatebench ran. It seems on level 0 and level 8 a Win32 function called QueryEAFile is run when the archive is first opened. It is probably related to Windows Defender, although I can't confirm, because nothing really pulls up online for that function. We have seen in our own applications at work that Windows Defender can cause delays of several seconds when closing files, as the system scanner checks the file before allowing it to be committed to disk. The doubling of the decompression time makes sense if the Windows Defender scanner also opens and decompresses the archive for scanning.

Here are rough benchmarks (not all my apps were shut down or videos stopped playing) with Windows Defender on and off. For those who don't know, Windows Defender is the anti-virus/anti-malware that comes with Windows; it's not something I installed.

Real-time protection enabled

OS: Windows 10 10.0.22000 AMD64
CPU: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
Tool: ../zlib-ng/build/Release/minigzip.exe
Levels: 0-9
Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.015%      0.044/0.051/0.056/0.003        0.131/0.152/0.160/0.008       15,738,743
 1     54.166%      0.135/0.142/0.149/0.004        0.068/0.077/0.082/0.003        8,523,732
 2     43.871%      0.210/0.223/0.231/0.007        0.076/0.082/0.087/0.004        6,903,609
 3     42.387%      0.281/0.300/0.314/0.010        0.070/0.079/0.083/0.004        6,670,099
 4     41.647%      0.295/0.313/0.325/0.009        0.071/0.075/0.079/0.002        6,553,723
 5     41.216%      0.307/0.324/0.336/0.008        0.069/0.075/0.081/0.004        6,485,938
 6     41.037%      0.349/0.365/0.377/0.008        0.067/0.073/0.079/0.003        6,457,776
 7     40.778%      0.419/0.451/0.465/0.011        0.067/0.075/0.080/0.004        6,416,919
 8     40.704%      0.519/0.565/0.579/0.014        0.143/0.155/0.165/0.006        6,405,244
 9     40.409%      0.668/0.699/0.725/0.015        0.072/0.080/0.085/0.004        6,358,951

 avg1  48.623%                        0.343                          0.092
 avg2  54.026%                        0.381                          0.103
 tot                                102.984                         27.688       76,514,734

Real-time protection disabled

OS: Windows 10 10.0.22000 AMD64
CPU: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
Tool: ../zlib-ng/build/Release/minigzip.exe
Levels: 0-9
Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.015%      0.018/0.021/0.023/0.001        0.017/0.019/0.021/0.001       15,738,743
 1     54.166%      0.095/0.103/0.111/0.005        0.049/0.053/0.056/0.002        8,523,732
 2     43.871%      0.155/0.165/0.176/0.007        0.051/0.055/0.058/0.002        6,903,609
 3     42.387%      0.218/0.227/0.240/0.008        0.048/0.053/0.057/0.002        6,670,099
 4     41.647%      0.239/0.251/0.264/0.009        0.047/0.052/0.056/0.003        6,553,723
 5     41.216%      0.245/0.265/0.276/0.010        0.047/0.050/0.054/0.002        6,485,938
 6     41.037%      0.287/0.304/0.316/0.010        0.046/0.051/0.054/0.002        6,457,776
 7     40.778%      0.353/0.372/0.392/0.015        0.046/0.051/0.054/0.002        6,416,919
 8     40.704%      0.446/0.465/0.493/0.017        0.047/0.051/0.055/0.003        6,405,244
 9     40.409%      0.525/0.546/0.580/0.020        0.047/0.051/0.057/0.003        6,358,951

 avg1  48.623%                        0.272                          0.049
 avg2  54.026%                        0.302                          0.054
 tot                                 81.599                         14.582       76,514,734

Notice that level 0 and level 8 return to normal when real-time protection is disabled.

@KungFuJesus
Contributor

@KungFuJesus Level 7 and Level 8 should not differ so much that it would affect speed much. And certainly level 0 should not be very challenging at all, since it does not in fact compress the data at all. I don't think the LLC is the problem that causes the 11 KB smaller level 8 file to be 2.5x slower to decompress than the level 7 file. If anything, it should be the other way around. The compressed file is also only about 6.4 MiB, so that should leave a lot of room even in the 15 MiB Tiger Lake cache.

I have a feeling this is OS- or timing-related, but it is pretty hard to do any kind of deep inspection like this on Windows. Intel's VTune Profiler could do it though: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html It can be (or at least used to be) a bit of a hassle to set up, but that whole toolkit allows very powerful analysis on both Linux and Windows (and possibly Android).

Remember that even the Raspberry Pi 1B shows a much faster level 0 than level 1, and level 8 is just about as expected. And that thing does not have a lot of cache at all.

I'll contend that the level 7 & 8 difference could very well be a bug or a really strange side effect. However, level 0 being slower than level 1 despite being far less work for the CPU doesn't seem strange to me, especially if level 1 is not significantly more work and the compression reduces the traffic from memory to disk by nearly half. Even if it were just dumping it back into memory with a round trip, I could see the reduction in memory bandwidth and cache pressure being a win. This is precisely the argument made for compressed swap or file systems with transparent compression.

Though, on my system level 0 is beating level 1 by a long shot (at least when I'm operating entirely out of a tmpfs-backed file system):

 Tool: minigzip   Levels: 0-9       
 Runs: 15         Trim worst: 5         

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.000/0.001/0.002/0.001        0.000/0.001/0.003/0.001       15,737,543
 1     54.166%      0.070/0.072/0.074/0.001        0.033/0.035/0.037/0.002        8,523,732
 2     43.871%      0.134/0.137/0.139/0.002        0.032/0.037/0.039/0.002        6,903,609
 3     42.387%      0.162/0.165/0.166/0.001        0.033/0.035/0.037/0.001        6,670,099
 4     41.647%      0.192/0.193/0.194/0.001        0.031/0.034/0.037/0.002        6,553,723
 5     41.216%      0.208/0.211/0.214/0.002        0.031/0.034/0.036/0.001        6,485,938
 6     41.037%      0.253/0.255/0.258/0.001        0.032/0.034/0.035/0.001        6,457,776
 7     40.778%      0.345/0.348/0.350/0.002        0.031/0.034/0.036/0.001        6,416,919
 8     40.704%      0.452/0.455/0.456/0.001        0.032/0.034/0.038/0.002        6,405,244
 9     40.409%      0.533/0.536/0.537/0.001        0.031/0.034/0.036/0.002        6,358,951

 avg1  48.622%                        0.237                          0.031
 avg2  54.025%                        0.264                          0.035
 tot                                 23.724                          3.132       76,513,534

I do likely have cache-deprived systems I could try this on, but I might even have something a step further: a slow 7200 RPM disk:

 Tool: minigzip   Levels: 0-9       
 Runs: 15         Trim worst: 5         

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.014/0.016/0.018/0.002        0.017/0.020/0.022/0.002       15,737,543
 1     54.185%      0.147/0.150/0.153/0.002        0.074/0.080/0.082/0.003        8,526,745
 2     43.871%      0.270/0.275/0.277/0.002        0.076/0.081/0.083/0.002        6,903,702
 3     42.388%      0.336/0.338/0.340/0.002        0.074/0.077/0.080/0.002        6,670,239
 4     41.647%      0.382/0.384/0.387/0.002        0.072/0.076/0.079/0.002        6,553,746
 5     41.216%      0.418/0.423/0.426/0.003        0.073/0.076/0.079/0.002        6,485,936
 6     41.038%      0.489/0.491/0.493/0.002        0.072/0.075/0.077/0.002        6,457,827
 7     40.778%      0.644/0.648/0.650/0.002        0.072/0.075/0.077/0.002        6,416,941
 8     40.704%      0.810/0.812/0.814/0.002        0.071/0.074/0.077/0.002        6,405,249
 9     40.409%      0.894/0.897/0.899/0.002        0.074/0.077/0.079/0.002        6,358,951

 avg1  48.624%                        0.444                          0.071
 avg2  54.027%                        0.493                          0.079
 tot                                 44.352                          7.097       76,516,879

Ok, I'll eat my words. What is going on with Windows? Are you plugged in?

@KungFuJesus
Contributor

I used Sysinternals Process Monitor to watch the file reads/writes from minigzip.exe as deflatebench ran. It seems on level 0 and level 8 a Win32 function called QueryEAFile is run when the archive is first opened. It is probably related to Windows Defender, although I can't confirm, because nothing really pulls up online for that function. We have seen in our own applications at work that Windows Defender can cause delays of several seconds when closing files, as the system scanner checks the file before allowing it to be committed to disk. The doubling of the decompression time makes sense if the Windows Defender scanner also opens and decompresses the archive for scanning.

Lol, oh yeah, that's likely it. We've seen all kinds of weird performance issues with Windows Defender. You might want to whitelist the directory you're benchmarking to; it intercepts all writes and tries to inspect them. Man, I wonder how many times you've had bad benchmark info due to Defender chowing down on your IO performance.

This definitely justifies the Google Benchmark microbenchmarks quite a bit more.

@nmoinvaz
Member Author

nmoinvaz commented Feb 9, 2022

You might want to whitelist the directory you're benchmarking to; it intercepts all writes and tries to inspect them.

Good idea! I had forgotten about that; I had done it before, but I recently reimaged my machine and now need to set it up again.

@Dead2 Dead2 merged commit ab6665b into zlib-ng:develop Feb 10, 2022
Labels
Benchmarks wanted (Please benchmark with and without this PR), enhancement

5 participants