
Memcpy/memcmp macros #1109

Merged
merged 3 commits into from Feb 10, 2022

Conversation

nmoinvaz
Member

@nmoinvaz nmoinvaz commented Jan 10, 2022

  • Define preprocessor macros zmemcpy/zmemcmp that force unaligned access on supported architectures
  • Use zmemcpy/zmemcmp in the code where necessary.
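
The idea can be sketched roughly as follows. This is an illustration, not the actual zlib-ng definitions: the macro names come from this PR, but the architecture test and the byte-wise fallback below are assumptions. On platforms with cheap unaligned access, forwarding to memcpy/memcmp lets the compiler collapse small fixed-size calls into unaligned word loads and stores; strict-alignment targets get a plain byte loop instead.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative only: the real unaligned-access detection is more involved. */
#if defined(__x86_64__) || defined(_M_X64) || defined(__aarch64__)
/* Compilers reduce small fixed-size memcpy/memcmp calls to unaligned
   word accesses on these targets, so just forward to the C library. */
#  define zmemcpy(dst, src, n) memcpy((dst), (src), (n))
#  define zmemcmp(a, b, n)     memcmp((a), (b), (n))
#else
/* Byte-wise fallback for strict-alignment targets, avoiding faults
   from unaligned word loads. */
static void *zmemcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    while (n--)
        *d++ = *s++;
    return dst;
}
static int zmemcmp(const void *a, const void *b, size_t n) {
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;
    for (; n > 0; n--, p++, q++) {
        if (*p != *q)
            return *p < *q ? -1 : 1;
    }
    return 0;
}
#endif
```

In hot paths such as match comparison, a call like zmemcmp(p, q, 8) can then compile down to a word compare rather than a byte loop on typical x86-64 compilers.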

@lgtm-com

lgtm-com bot commented Jan 10, 2022

This pull request introduces 15 alerts when merging 9e31ad4 into b2ef108 - view on LGTM.com

new alerts:

  • 15 for Unclear comparison precedence

@lgtm-com

lgtm-com bot commented Jan 10, 2022

This pull request introduces 15 alerts when merging d40a42d into b2ef108 - view on LGTM.com

new alerts:

  • 15 for Unclear comparison precedence
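
For context on the alert class: LGTM's "Unclear comparison precedence" check fires where the precedence between comparison operators and other operators is easy to misread. A generic C illustration of the pattern, not the code actually flagged in this PR:

```c
/* Hypothetical example of the alert class, not the flagged zlib-ng code.
   In C, `!=` binds tighter than `&`, so the first form parses as
   flags & (0x4 != 0), i.e. flags & 1 -- almost certainly not intended. */
int has_flag_ambiguous(int flags) {
    return flags & 0x4 != 0;     /* the kind of expression that triggers the alert */
}

int has_flag_clear(int flags) {
    return (flags & 0x4) != 0;   /* explicit parentheses silence it */
}
```

The two functions disagree for inputs like 6 (binary 110): the ambiguous form tests bit 0 instead of bit 2.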

@codecov

codecov bot commented Jan 10, 2022

Codecov Report

Merging #1109 (727254d) into develop (9146bd4) will decrease coverage by 0.51%.
The diff coverage is 86.11%.

❗ Current head 727254d differs from pull request most recent head f5d680f. Consider uploading reports for the commit f5d680f to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1109      +/-   ##
===========================================
- Coverage    80.23%   79.71%   -0.52%     
===========================================
  Files           98       92       -6     
  Lines         9040     8943      -97     
  Branches      1438     1432       -6     
===========================================
- Hits          7253     7129     -124     
- Misses        1223     1247      +24     
- Partials       564      567       +3     
Flag Coverage Δ
macos_clang 69.48% <51.85%> (-0.02%) ⬇️
macos_gcc 68.70% <53.84%> (+0.10%) ⬆️
ubuntu_clang 69.97% <18.51%> (-0.02%) ⬇️
ubuntu_clang_debug 69.33% <18.51%> (+1.51%) ⬆️
ubuntu_clang_inflate_allow_invalid_dist 69.84% <18.51%> (-0.02%) ⬇️
ubuntu_clang_inflate_strict 69.89% <18.51%> (-0.02%) ⬇️
ubuntu_clang_mmap 69.96% <18.51%> (-0.02%) ⬇️
ubuntu_clang_msan 69.97% <18.51%> (-0.02%) ⬇️
ubuntu_clang_pigz 33.83% <24.00%> (-0.01%) ⬇️
ubuntu_clang_pigz_no_optim 39.68% <76.00%> (-0.03%) ⬇️
ubuntu_clang_pigz_no_threads 33.48% <24.00%> (-0.01%) ⬇️
ubuntu_clang_reduced_mem 70.09% <18.51%> (-0.02%) ⬇️
ubuntu_gcc 69.01% <16.00%> (-0.10%) ⬇️
ubuntu_gcc_aarch64 70.72% <47.36%> (-0.02%) ⬇️
ubuntu_gcc_aarch64_compat_no_opt 68.97% <50.00%> (-0.09%) ⬇️
ubuntu_gcc_aarch64_no_acle 69.81% <47.36%> (+0.04%) ⬆️
ubuntu_gcc_aarch64_no_neon 69.94% <64.28%> (-0.03%) ⬇️
ubuntu_gcc_armhf 70.69% <47.36%> (-0.02%) ⬇️
ubuntu_gcc_armhf_compat_no_opt 69.03% <60.00%> (+0.03%) ⬆️
ubuntu_gcc_armhf_no_acle 70.88% <47.36%> (-0.02%) ⬇️
ubuntu_gcc_armhf_no_neon 70.98% <64.28%> (-0.07%) ⬇️
ubuntu_gcc_armsf 70.69% <47.36%> (+<0.01%) ⬆️
ubuntu_gcc_armsf_compat_no_opt 69.03% <60.00%> (+0.02%) ⬆️
ubuntu_gcc_benchmark 70.42% <19.04%> (-0.42%) ⬇️
ubuntu_gcc_compat_no_opt 70.24% <50.00%> (-0.08%) ⬇️
ubuntu_gcc_compat_sprefix 68.75% <16.00%> (-0.09%) ⬇️
ubuntu_gcc_mingw_i686 0.00% <0.00%> (ø)
ubuntu_gcc_mingw_x86_64 0.00% <0.00%> (ø)
ubuntu_gcc_no_avx2 70.25% <52.00%> (+0.04%) ⬆️
ubuntu_gcc_no_ctz 71.06% <70.00%> (-0.03%) ⬇️
ubuntu_gcc_no_ctzll 70.80% <75.00%> (-0.01%) ⬇️
ubuntu_gcc_no_pclmulqdq 67.36% <16.00%> (+0.03%) ⬆️
ubuntu_gcc_no_sse2 68.42% <16.00%> (+0.03%) ⬆️
ubuntu_gcc_no_sse4 68.22% <16.00%> (+0.03%) ⬆️
ubuntu_gcc_o3 69.62% <20.00%> (-0.02%) ⬇️
ubuntu_gcc_osb 70.68% <18.18%> (+1.59%) ⬆️
ubuntu_gcc_pigz 34.31% <21.73%> (-0.03%) ⬇️
ubuntu_gcc_pigz_aarch64 37.52% <58.82%> (+0.23%) ⬆️
ubuntu_gcc_ppc 68.22% <40.00%> (-0.87%) ⬇️
ubuntu_gcc_ppc64 71.76% <20.00%> (+0.16%) ⬆️
ubuntu_gcc_ppc64le 70.37% <47.36%> (+0.13%) ⬆️
ubuntu_gcc_ppc_no_power8 70.76% <40.00%> (-1.10%) ⬇️
ubuntu_gcc_s390x 73.39% <25.00%> (+1.11%) ⬆️
ubuntu_gcc_s390x_dfltcc ?
ubuntu_gcc_s390x_dfltcc_compat ?
ubuntu_gcc_s390x_no_crc32 ?
ubuntu_gcc_sparc64 71.97% <40.00%> (ø)
ubuntu_gcc_sprefix 68.60% <16.00%> (-0.09%) ⬇️
win64_gcc ∅ <ø> (∅)
win64_gcc_compat_no_opt ∅ <ø> (∅)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
insert_string_tpl.h 96.87% <ø> (ø)
deflate.h 30.00% <33.33%> (ø)
match_tpl.h 74.07% <75.00%> (ø)
arch/arm/chunkset_neon.c 100.00% <100.00%> (ø)
arch/power/chunkset_power8.c 77.77% <100.00%> (+11.11%) ⬆️
chunkset.c 100.00% <100.00%> (+38.09%) ⬆️
chunkset_tpl.h 99.14% <100.00%> (ø)
compare256.c 87.34% <100.00%> (+3.59%) ⬆️
deflate_quick.c 95.83% <100.00%> (ø)
inffast.c 82.56% <100.00%> (-0.07%) ⬇️
... and 40 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9146bd4...f5d680f. Read the comment docs.

@nmoinvaz nmoinvaz force-pushed the improvements/zmemcopy branch 3 times, most recently from 70239dc to de97ab4 Compare January 10, 2022 04:19
@nmoinvaz nmoinvaz marked this pull request as ready for review January 10, 2022 05:25
@Dead2 Dead2 added the Rebase needed (Please do a 'git rebase develop yourbranch') label Jan 14, 2022
@Dead2
Member

Dead2 commented Jan 14, 2022

This does look very promising; I wish we could do some broader benchmarking of it though, on aarch32 for example.

@nmoinvaz
Member Author

Yeah I agree. I don't have any access to ARM machines. 😟

@nmoinvaz nmoinvaz removed the Rebase needed (Please do a 'git rebase develop yourbranch') label Jan 14, 2022
@nmoinvaz
Member Author

Rebased.

@gdevenyi
Contributor

Oracle Cloud has free Ampere ARM VMs.

@KungFuJesus
Contributor

I've got an Odroid N2 for aarch64 and a quad PowerMac G5 for ppc64.

@Dead2 Dead2 added the Rebase needed (Please do a 'git rebase develop yourbranch') label Jan 16, 2022
@nmoinvaz nmoinvaz removed the Rebase needed (Please do a 'git rebase develop yourbranch') label Jan 16, 2022
@nmoinvaz
Member Author

Rebased.

@Dead2 Dead2 added the Rebase needed (Please do a 'git rebase develop yourbranch'), Benchmarks wanted (Please benchmark with and without this PR), and enhancement labels Jan 17, 2022
@nmoinvaz
Member Author

I plan to split some of this PR out into a PR just for moving unaligned detection into code.

@nmoinvaz nmoinvaz removed the Rebase needed (Please do a 'git rebase develop yourbranch') label Jan 18, 2022
@nmoinvaz
Member Author

Rebased.

@nmoinvaz
Member Author

nmoinvaz commented Feb 7, 2022

Looks like it still happens with -DWITH_OPTIM=OFF. And it still happens when doing -T ClangCl.

@nmoinvaz
Member Author

nmoinvaz commented Feb 7, 2022

@Dead2 it looks like it is caused by this commit: 0129e88.

@mtl1979
Collaborator

mtl1979 commented Feb 7, 2022

We knew limiting minimum match length to 4 instead of 3 would hurt compression in some cases...

@nmoinvaz
Member Author

nmoinvaz commented Feb 7, 2022

It hurts decompression.

@nmoinvaz
Member Author

nmoinvaz commented Feb 7, 2022

The compressed stream changes and somehow decompression is hurt. Maybe it is having to seek far back into the window all the time?

I did a comparison of minigzip output with z_verbose=2 between level 7 and 8, and there didn't seem to be much of a difference, imo.

Any debugging ideas?

@mtl1979
Collaborator

mtl1979 commented Feb 7, 2022

I would check distances of 3 and 4 byte matches... It's possible that the chain lengths are long enough to hurt performance.

@nmoinvaz
Member Author

nmoinvaz commented Feb 7, 2022

My thinking is that my CPU may need to bring memory in and out of cache to perform these operations if the matches are too far back in the window. If somebody else can run the test using silesia-small.tar, it might give a better idea.

@Dead2
Member

Dead2 commented Feb 7, 2022

I think I would do a profiling comparison, and see where the huge amount of extra time is spent. Not sure whether it will actually tell us what is "wrong" about the data, but hopefully it would give a hint.

My thinking is that the commit that you found did not cause this, it just modified the compressed data in a way that made the issue crop up in this specific circumstance. The compressed data is valid after all, and it would surprise me if it is very different from level 7, making the increased time spent even more surprising.

Interestingly, I do not see the same problem with the same data. But we do have several cases where we compress the same data differently depending on what arch-specific optimizations are enabled etc. Not sure how that still happens with WITH_OPTIM=OFF, but we have not really kept track of what affects the compressed output.

You could upload the compressed level 7 and level 8 files, then we could all test decompression to see whether we can at least replicate the slowdown with the same compressed data as you.

@nmoinvaz
Member Author

nmoinvaz commented Feb 7, 2022

Last night I tried to do some performance profiling in Visual Studio 2022 but nothing stood out at me.
Here is my data set. I have included the original .tar and also the compressed version of the same at level 7 and 8.
silesia-small-data.zip

@nmoinvaz
Member Author

nmoinvaz commented Feb 7, 2022

I will try running the same test on my Tiger Lake laptop to see if the issue exists there as well.

Level 7 decompression

[profiler screenshot]

Level 8 decompression

[profiler screenshot]

@KungFuJesus
Contributor

Nothing terribly conclusive there: just that 4% less of the total runtime is burned in zng_inflate_fast on level 8 than on level 7 (inclusive, not just self). There also seems to be ~5% less time in the C runtime, though it's hard to say if that's a good apples-to-apples comparison here (there's going to be some run-to-run variance, and it's hard to get great resolution). Do you see this same behavior with the larger version of the test data?

@nmoinvaz
Member Author

nmoinvaz commented Feb 8, 2022

Here are benchmark results on my Tiger Lake laptop:

MSVC 17.0.5 x86-64 silesia-small.tar

DEVELOP 748127e

 Processor: Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
 Tool: ../zlib-ng/build-x/Release/minigzip.exe Levels: 0-9
 Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.016/0.017/0.017/0.000        0.085/0.104/0.114/0.008       15,737,543
 1     54.166%      0.089/0.092/0.097/0.002        0.044/0.045/0.046/0.001        8,523,732
 2     43.871%      0.143/0.151/0.170/0.007        0.046/0.047/0.048/0.001        6,903,609
 3     42.387%      0.200/0.206/0.215/0.005        0.045/0.046/0.047/0.001        6,670,099
 4     41.647%      0.224/0.234/0.265/0.011        0.043/0.045/0.045/0.000        6,553,723
 5     41.216%      0.229/0.235/0.247/0.005        0.043/0.045/0.045/0.001        6,485,938
 6     41.037%      0.267/0.276/0.289/0.006        0.043/0.044/0.045/0.000        6,457,776
 7     40.778%      0.337/0.345/0.367/0.008        0.043/0.044/0.045/0.001        6,416,919
 8     40.704%      0.430/0.444/0.465/0.010        0.105/0.121/0.128/0.005        6,405,244
 9     40.409%      0.505/0.518/0.535/0.009        0.042/0.044/0.044/0.001        6,358,951

 avg1  48.622%                        0.252                          0.059
 avg2  54.025%                        0.280                          0.065
 tot                                 75.573                         17.558       76,513,534

Oddly enough on level 0, the decompression time is also higher.

@nmoinvaz
Member Author

nmoinvaz commented Feb 8, 2022

As far back as I can go in the git tree, it seems that inflate stored takes the same amount of time.

@KungFuJesus
Contributor

KungFuJesus commented Feb 8, 2022

Oddly enough on level 0, the decompression time is also higher.

So that's not too surprising. At level 0 you're likely incurring much more IO if you're actually decompressing from block storage. You're running nearly double the amount of data through memory and back out to storage.
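
A quick back-of-the-envelope check of that argument, using the compressed sizes from the benchmark tables in this thread (the helper name below is made up for illustration): the stored level 0 output is almost twice the size of the level 1 output, so nearly twice as many bytes make the round trip through memory and out to storage.

```c
/* Ratio of bytes written at level 0 vs. level 1; sizes taken from the
   silesia-small.tar benchmark table above. */
double stored_vs_level1_ratio(void) {
    const double level0_bytes = 15737543.0; /* stored, ~100% of input */
    const double level1_bytes = 8523732.0;  /* level 1, ~54% of input */
    return level0_bytes / level1_bytes;     /* roughly 1.85x the IO */
}
```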

@Dead2
Member

Dead2 commented Feb 8, 2022

That is what I have always seen too, that the time spent to inflate a file increases roughly linearly with its compressed size.

I have not gotten around to testing the file you uploaded yet.

@Dead2
Member

Dead2 commented Feb 9, 2022

Xeon E5-2650, x86-64, GCC 4.8

Baseline GCC4.8 88f331e

 Tool: minigzip   Levels: 0-9
 Runs: 32         Trim worst: 18

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.014/0.015/0.017/0.001        0.013/0.019/0.023/0.003       15,737,543
 1     54.166%      0.212/0.227/0.237/0.007        0.100/0.113/0.117/0.005        8,523,732
 2     43.871%      0.434/0.445/0.450/0.005        0.106/0.118/0.122/0.005        6,903,609
 3     42.387%      0.544/0.552/0.557/0.003        0.112/0.118/0.119/0.002        6,670,099
 4     41.647%      0.596/0.607/0.614/0.006        0.099/0.109/0.115/0.005        6,553,723
 5     41.216%      0.660/0.667/0.672/0.004        0.104/0.110/0.114/0.003        6,485,938
 6     41.037%      0.745/0.749/0.756/0.003        0.099/0.105/0.112/0.005        6,457,776
 7     40.778%      0.951/0.959/0.966/0.005        0.098/0.108/0.111/0.004        6,416,919
 8     40.704%      1.167/1.174/1.177/0.003        0.087/0.105/0.114/0.007        6,405,244
 9     40.409%      1.432/1.437/1.442/0.003        0.105/0.114/0.118/0.004        6,358,951

 avg1  48.622%                        0.683                          0.102
 avg2  54.025%                        0.759                          0.113
 tot                                 95.652                         14.269       76,513,534

   text    data     bss     dec     hex filename
 130303    1456      48  131807   202df libz-ng.so.2

PR #1109 GCC4.8 727254d

 Tool: minigzip   Levels: 0-9
 Runs: 32         Trim worst: 18

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.009/0.014/0.017/0.003        0.016/0.020/0.023/0.002       15,737,543
 1     54.166%      0.216/0.227/0.234/0.006        0.100/0.109/0.117/0.006        8,523,732
 2     43.871%      0.437/0.444/0.449/0.003        0.109/0.120/0.125/0.004        6,903,609
 3     42.387%      0.538/0.553/0.557/0.005        0.109/0.118/0.122/0.004        6,670,099
 4     41.647%      0.593/0.605/0.611/0.005        0.101/0.111/0.118/0.004        6,553,723
 5     41.216%      0.662/0.672/0.677/0.004        0.100/0.111/0.116/0.004        6,485,938
 6     41.037%      0.749/0.764/0.771/0.007        0.089/0.106/0.113/0.006        6,457,776
 7     40.778%      0.926/0.937/0.944/0.004        0.091/0.103/0.111/0.006        6,416,919
 8     40.704%      1.142/1.150/1.156/0.005        0.098/0.107/0.114/0.004        6,405,244
 9     40.409%      1.405/1.416/1.422/0.004        0.101/0.112/0.118/0.005        6,358,951

 avg1  48.622%                        0.678                          0.102
 avg2  54.025%                        0.754                          0.113
 tot                                 94.948                         14.235       76,513,534

   text    data     bss     dec     hex filename
 130495    1456      48  131999   2039f libz-ng.so.2

Slightly faster on older compiler for x86-64, that is very good to see.
Increases compiled code size slightly.

That alleviates my worries about this introducing any potential regressions.

@Dead2
Member

Dead2 commented Feb 9, 2022

@nmoinvaz No, you are right. Level 0 being so much slower is indeed seriously strange, and I would guess it is probably related to the level 8 slowdown. It involves less code, so possibly that could make it easier to narrow down the problem.

Interestingly, my Xeon benchmark above makes the exact same file as yours on level 8, but in my tests the performance is right in line with what is expected. And my level 0 decompression is more than 5x faster than level 1, not twice as slow like yours seems to be.

This one is tricky.

  • It could be a measurement error, but how could that happen reliably 40+ times only on level 8 and 0, and across two different machines even?
  • It could be MSVC-specific, but that still begs the question of how it only affects levels 0 and 8. Those two levels don't have any code-paths in common that the others don't. Is it possible to test a different compiler, or perhaps a different version?
  • It could be OS-specific, possibly related to IO or scheduling, but how any OS-specific issue could reliably trigger only on level 0 and 8 would be super-weird.
  • It could be a CPU-bug in modern Intel cpus, but I honestly cannot see how that could be possible.

One possible experiment that might or might not tell us anything: if you append silesia-small.tar onto itself, so the same bitstream comes twice in a row in a file twice the size, does that file show the same behavior? It should at least have the same slowdown for the first half of the file, since the bitstreams would be identical up to that point. However, if it turns out not to be slower than expected, then it becomes seriously weird and I'd begin to suspect Windows itself.
Also possibly worth a test is trimming down the file by removing the last 64KiB, for example; that should probably not make a measurable difference in the timings, but with this issue, who knows.

Also, not trimming any of the worst benchmark results would tell us for sure whether this slowdown sometimes happens on the other levels, and whether some runs are even worse than this on levels 0 and 8. (Still makes no sense that it would happen always on levels 0 and 8, but I am really grasping at straws here)

@KungFuJesus
Contributor

KungFuJesus commented Feb 9, 2022

@Dead2 are you sure it's not time lost in IO for @nmoinvaz at level 0? It'd be helpful if these were fed from a RAM disk. At that point you're only measuring the additional memory access time (which would still likely be slower, but not by the same scale at least).

Is there a way to measure system time of a process in windows to narrow it down? If it's all in system time that would 100% agree with that theory.

Also something to consider: that Xeon has a 20 MB LLC; his laptop's -U series Tiger Lake CPU only has a 12 MB LLC.

@nmoinvaz
Member Author

nmoinvaz commented Feb 9, 2022

Here is a run on WSL2 using GCC 9.3.0.

OS: Linux 5.10.16.3-microsoft-standard-WSL2 #1 SMP Fri Apr 2 22:23:49 UTC 2021 x86_64
CPU: x86_64
Tool: /mnt/c/users/nathan/source/zlib-ng/build-wsl/minigzip
Levels: 0-9
Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.171/0.195/0.215/0.013        0.245/0.273/0.288/0.011       15,737,543
 1     54.166%      0.260/0.331/0.369/0.029        0.182/0.225/0.254/0.021        8,523,732
 2     43.871%      0.334/0.395/0.430/0.025        0.196/0.246/0.270/0.022        6,903,609
 3     42.387%      0.389/0.446/0.481/0.025        0.188/0.223/0.243/0.015        6,670,099
 4     41.647%      0.416/0.487/0.517/0.028        0.189/0.221/0.242/0.017        6,553,723
 5     41.216%      0.470/0.531/0.563/0.027        0.177/0.215/0.235/0.018        6,485,938
 6     41.037%      0.514/0.580/0.613/0.030        0.181/0.213/0.229/0.014        6,457,776
 7     40.778%      0.603/0.661/0.703/0.028        0.177/0.209/0.229/0.015        6,416,919
 8     40.704%      0.736/0.782/0.805/0.020        0.254/0.297/0.318/0.018        6,405,244
 9     40.409%      0.822/0.881/0.902/0.020        0.185/0.219/0.238/0.016        6,358,951

 avg1  48.622%                        0.529                          0.234
 avg2  54.025%                        0.588                          0.260
 tot                                158.660                         70.243       76,513,534

It seems to be related only to this dataset. A single run using silesia.tar produces a faster level 0 (stored) and a faster level 8.

Testing level 0: cd   2.188   2.283     211,973,953 100.008%
Testing level 1: cd   4.459   3.376      94,127,290  44.409%
Testing level 2: cd   6.192   3.513      75,282,961  35.518%
Testing level 3: cd   6.210   3.395      71,816,478  33.882%
Testing level 4: cd   8.608   2.738      70,315,668  33.174%
Testing level 5: cd   7.887   3.183      69,225,542  32.660%
Testing level 6: cd   8.894   3.032      68,902,222  32.508%
Testing level 7: cd  11.174   3.005      68,366,800  32.255%
Testing level 8: cd  13.764   2.801      68,180,776  32.167%
Testing level 9: cd  13.918   2.720      67,586,442  31.887%

@nmoinvaz
Member Author

nmoinvaz commented Feb 9, 2022

@KungFuJesus I thought maybe if I used WSL2 I could have access to all the perf tools, but they were all missing or needed to be compiled manually, so I couldn't get access to the sys timings.

@KungFuJesus
Contributor

KungFuJesus commented Feb 9, 2022

@KungFuJesus I thought maybe if I used WSL2 I could have access to all the perf tools, but they were all missing or needed to be compiled manually, so I couldn't get access to the sys timings.

Perf is not a strictly userspace component, though WSL2 does provide a kernel, so perhaps that doesn't matter anymore. You should be able to do a direct comparison with time -v (it should give you memory utilization, system time, etc.). You will want to restrict it to just the compression levels you're testing, though, and perhaps just a single run.

@Dead2
Member

Dead2 commented Feb 9, 2022

@KungFuJesus For sure, it could absolutely be explained by slow IO.
But I don't see how the OS manages to somehow recognize and only do slow IO for compressed files compressed by level 0 or 8.

AFAIK, there is no way the OS should be aware of what compression level was used for those files. Also, the file is compressed before each decompression test, so it is not even the same file that was decompressed 70 times; it is 70 different (and probably cached) files that should not look any different to the OS (compared to level 7, for example) than the quite minor differences in size.

The OS could be slower when doing IO because of those exact file sizes, but that would be a quite weird bug.

Also, remember the fact that using different files for testing does not result in this phenomenon.

@KungFuJesus
Contributor

@KungFuJesus For sure, it could absolutely be explained by slow IO. But I don't see how the OS manages to somehow recognize and only do slow IO for compressed files compressed by level 0 or 8.

AFAIK, there is no way the OS should be aware of what compression level was used for those files. Also, the file is compressed before each decompression test, so it is not even the same file that was decompressed 70 times, it is 70 different (and probably cached) files that should not look any different to the OS (compared to level 7 for example) than the quite minor differences in size.

The OS could be slower when doing IO because of those exact file sizes, but that would be a quite weird bug.

Also, remember the fact that using different files for testing does not result in this phenomenon.

Well given that compression is kind of a state machine with opcodes, it's not super surprising that differently compressible files would react differently here. If I had to wager a guess, you're fitting more into cache than he is for this data set. Perf can be used to measure LLC misses, so it'd be interesting to see from the xeon and from the tiger lake -U CPU what those were for those given compression levels.

@Dead2
Member

Dead2 commented Feb 9, 2022

@KungFuJesus Level 7 and Level 8 should not differ so much that it would affect speed much. And certainly level 0 should not be very challenging at all, since it does not in fact compress the data at all.
I don't think the LLC is the problem that causes the 11 KB smaller level 8 file to be 2.5x slower to decompress than the level 7 file. If anything, it should be the other way around.
The compressed file is also only about 6.4MiB, so that should leave a lot of room even in the 15MiB tiger lake cache.

I have a feeling this is OS- or timing-related, but it is pretty hard to do any kind of deep inspection like this on Windows.
Intel's VTune Profiler could do it though: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html
It can be (or at least used to be) a bit of a hassle to set up, but that whole toolkit allows very powerful analysis on both Linux and Windows (and possibly Android).

Remember that even the Raspberry Pi 1B shows a much faster level 0 than level 1, and level 8 is just about as expected. And that thing does not have a lot of cache at all.

@nmoinvaz
Member Author

nmoinvaz commented Feb 9, 2022

I used Sysinternals Process Monitor to watch the file reads/writes from minigzip.exe as deflatebench ran. It seems on level 0 and level 8 a Win32 function called QueryEAFile is run when the archive is first opened. It is probably related to Windows Defender, although I can't confirm, because nothing really pulls up online for that function. We have seen in our own applications at work that Windows Defender can cause delays of several seconds when closing files, as the system scanner checks the file before allowing it to be committed to disk. The doubling of the decompression time makes sense if the Windows Defender scanner also opens and decompresses the archive for scanning.

Here are rough benchmarks (not all my apps were shut down or videos stopped playing) with Windows Defender on and off. For those who don't know, Windows Defender is the anti-virus/anti-malware that comes with Windows; it's not something I installed.

Real-time protection enabled

OS: Windows 10 10.0.22000 AMD64
CPU: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
Tool: ../zlib-ng/build/Release/minigzip.exe
Levels: 0-9
Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.015%      0.044/0.051/0.056/0.003        0.131/0.152/0.160/0.008       15,738,743
 1     54.166%      0.135/0.142/0.149/0.004        0.068/0.077/0.082/0.003        8,523,732
 2     43.871%      0.210/0.223/0.231/0.007        0.076/0.082/0.087/0.004        6,903,609
 3     42.387%      0.281/0.300/0.314/0.010        0.070/0.079/0.083/0.004        6,670,099
 4     41.647%      0.295/0.313/0.325/0.009        0.071/0.075/0.079/0.002        6,553,723
 5     41.216%      0.307/0.324/0.336/0.008        0.069/0.075/0.081/0.004        6,485,938
 6     41.037%      0.349/0.365/0.377/0.008        0.067/0.073/0.079/0.003        6,457,776
 7     40.778%      0.419/0.451/0.465/0.011        0.067/0.075/0.080/0.004        6,416,919
 8     40.704%      0.519/0.565/0.579/0.014        0.143/0.155/0.165/0.006        6,405,244
 9     40.409%      0.668/0.699/0.725/0.015        0.072/0.080/0.085/0.004        6,358,951

 avg1  48.623%                        0.343                          0.092
 avg2  54.026%                        0.381                          0.103
 tot                                102.984                         27.688       76,514,734

Real-time protection disabled

OS: Windows 10 10.0.22000 AMD64
CPU: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
Tool: ../zlib-ng/build/Release/minigzip.exe
Levels: 0-9
Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.015%      0.018/0.021/0.023/0.001        0.017/0.019/0.021/0.001       15,738,743
 1     54.166%      0.095/0.103/0.111/0.005        0.049/0.053/0.056/0.002        8,523,732
 2     43.871%      0.155/0.165/0.176/0.007        0.051/0.055/0.058/0.002        6,903,609
 3     42.387%      0.218/0.227/0.240/0.008        0.048/0.053/0.057/0.002        6,670,099
 4     41.647%      0.239/0.251/0.264/0.009        0.047/0.052/0.056/0.003        6,553,723
 5     41.216%      0.245/0.265/0.276/0.010        0.047/0.050/0.054/0.002        6,485,938
 6     41.037%      0.287/0.304/0.316/0.010        0.046/0.051/0.054/0.002        6,457,776
 7     40.778%      0.353/0.372/0.392/0.015        0.046/0.051/0.054/0.002        6,416,919
 8     40.704%      0.446/0.465/0.493/0.017        0.047/0.051/0.055/0.003        6,405,244
 9     40.409%      0.525/0.546/0.580/0.020        0.047/0.051/0.057/0.003        6,358,951

 avg1  48.623%                        0.272                          0.049
 avg2  54.026%                        0.302                          0.054
 tot                                 81.599                         14.582       76,514,734

Notice that level 0 and level 8 return to normal when real-time protection is disabled.

@KungFuJesus
Contributor

@KungFuJesus Level 7 and Level 8 should not differ so much that it would affect speed much. And certainly level 0 should not be very challenging at all, since it does not in fact compress the data at all. I don't think the LLC is the problem that causes the 11 KB smaller level 8 file to be 2.5x slower to decompress than the level 7 file. If anything, it should be the other way around. The compressed file is also only about 6.4 MiB, so that should leave a lot of room even in the 15 MiB Tiger Lake cache.

I have a feeling this is OS- or timing-related, but it is pretty hard to do any kind of deep inspection like this on Windows. Intel's VTune Profiler could do it though: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html It can be (or at least used to be) a bit of a hassle to set up, but that whole toolkit allows very powerful analysis on both Linux and Windows (and possibly Android).

Remember that even the Raspberry Pi 1B shows a much faster level 0 than level 1, and level 8 is just about as expected. And that thing does not have a lot of cache at all.

I'll contend that the level 7 & 8 difference could very well be a bug or a really strange side effect. However, level 0 being slower than level 1 despite being far less work for the CPU doesn't seem strange to me, especially if level 1 is not significantly more work and the compression reduces the traffic from memory to disk by nearly half. Even if it were just dumping it back into memory with a round trip, I could see the reduction in memory bandwidth and cache pressure being a win. This is precisely the argument made for compressed swap or file systems with transparent compression.

Though, on my system level 0 is beating level 1 by a long shot (at least when I'm operating entirely out of a tmpfs-backed file system):

 Tool: minigzip   Levels: 0-9       
 Runs: 15         Trim worst: 5         

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.000/0.001/0.002/0.001        0.000/0.001/0.003/0.001       15,737,543
 1     54.166%      0.070/0.072/0.074/0.001        0.033/0.035/0.037/0.002        8,523,732
 2     43.871%      0.134/0.137/0.139/0.002        0.032/0.037/0.039/0.002        6,903,609
 3     42.387%      0.162/0.165/0.166/0.001        0.033/0.035/0.037/0.001        6,670,099
 4     41.647%      0.192/0.193/0.194/0.001        0.031/0.034/0.037/0.002        6,553,723
 5     41.216%      0.208/0.211/0.214/0.002        0.031/0.034/0.036/0.001        6,485,938
 6     41.037%      0.253/0.255/0.258/0.001        0.032/0.034/0.035/0.001        6,457,776
 7     40.778%      0.345/0.348/0.350/0.002        0.031/0.034/0.036/0.001        6,416,919
 8     40.704%      0.452/0.455/0.456/0.001        0.032/0.034/0.038/0.002        6,405,244
 9     40.409%      0.533/0.536/0.537/0.001        0.031/0.034/0.036/0.002        6,358,951

 avg1  48.622%                        0.237                          0.031
 avg2  54.025%                        0.264                          0.035
 tot                                 23.724                          3.132       76,513,534

I do likely have cache-deprived systems I could try this on, but I might even have something a step further: a slow 7200 RPM disk:

 Tool: minigzip   Levels: 0-9       
 Runs: 15         Trim worst: 5         

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.014/0.016/0.018/0.002        0.017/0.020/0.022/0.002       15,737,543
 1     54.185%      0.147/0.150/0.153/0.002        0.074/0.080/0.082/0.003        8,526,745
 2     43.871%      0.270/0.275/0.277/0.002        0.076/0.081/0.083/0.002        6,903,702
 3     42.388%      0.336/0.338/0.340/0.002        0.074/0.077/0.080/0.002        6,670,239
 4     41.647%      0.382/0.384/0.387/0.002        0.072/0.076/0.079/0.002        6,553,746
 5     41.216%      0.418/0.423/0.426/0.003        0.073/0.076/0.079/0.002        6,485,936
 6     41.038%      0.489/0.491/0.493/0.002        0.072/0.075/0.077/0.002        6,457,827
 7     40.778%      0.644/0.648/0.650/0.002        0.072/0.075/0.077/0.002        6,416,941
 8     40.704%      0.810/0.812/0.814/0.002        0.071/0.074/0.077/0.002        6,405,249
 9     40.409%      0.894/0.897/0.899/0.002        0.074/0.077/0.079/0.002        6,358,951

 avg1  48.624%                        0.444                          0.071
 avg2  54.027%                        0.493                          0.079
 tot                                 44.352                          7.097       76,516,879

Ok, I'll eat my words. What is going on with Windows? Are you plugged in?

@KungFuJesus
Contributor

I used Sysinternals Process Monitor to watch the file reads/writes from minigzip.exe as deflatebench ran. It seems on level 0 and level 8 a Win32 function called QueryEAFile is run when the archive is first opened. It is probably related to Windows Defender, although I can't confirm, because nothing really pulls up online for that function. We have seen in our own applications at work that Windows Defender can cause delays of several seconds when closing files, as the system scanner checks the file before allowing it to be committed to disk. The doubling of the decompression time makes sense if the Windows Defender scanner also opens and decompresses the archive for scanning.

Lol, oh yeah, that's likely it. We've seen all kinds of weird performance issues with Windows Defender. You might want to whitelist the directory you're benchmarking to; it intercepts all writes and tries to inspect them. Man, I wonder how many times you've had bad benchmark info due to Defender chowing down on your IO performance.

This definitely justifies the Google Benchmark microbenchmarks quite a bit more.

@nmoinvaz
Member Author

nmoinvaz commented Feb 9, 2022

You might want to whitelist the directory you're benchmarking to; it intercepts all writes and tries to inspect them.

Good idea! I had forgotten about that; I had done it before, but I recently reimaged my machine and now need to set it up again.

@Dead2 Dead2 merged commit ab6665b into zlib-ng:develop Feb 10, 2022
Labels
Benchmarks wanted (Please benchmark with and without this PR), enhancement

5 participants