Write an SSE2 optimized compare256 #1131

Merged (2 commits) on Feb 11, 2022

KungFuJesus
Contributor

@KungFuJesus KungFuJesus commented Jan 23, 2022

The SSE4 variant uses the unfortunate string comparison instructions from
SSE4.2, which not only are supported on fewer CPUs but are also often
slower than their SSE2 counterparts except in very specific circumstances.

This version should be ~2x faster than unaligned_64 for larger strings
and about half the performance of AVX2 comparisons on identical
hardware.

This version is meant to supplement pre-AVX hardware. An attempt was
made to align at least one of the strings, since much of that hardware
is pre-Nehalem.
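
For readers following along, here is a minimal sketch of the SSE2 technique described above: compare 16 bytes at a time, turn the per-byte result into a bit mask, and use a trailing-zero count to locate the first mismatch. It is not the code from this PR. The function names are hypothetical, `__builtin_ctz` assumes GCC or Clang, and the scalar branch only exists so the sketch still builds where SSE2 is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical name; returns how many leading bytes (0..256) of a and b match. */
uint32_t compare256_sse2_sketch(const uint8_t *a, const uint8_t *b) {
    uint32_t len = 0;
    assert(a != NULL && b != NULL);
#if defined(__SSE2__)
    while (len < 256) {
        /* Unaligned 16-byte loads from both inputs. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + len));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + len));
        /* cmpeq sets matching byte lanes to 0xFF; movemask packs one bit per lane. */
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
        if (mask != 0xFFFFu) {
            /* Lowest set bit of the inverted mask is the first mismatching byte. */
            return len + (uint32_t)__builtin_ctz(~mask);
        }
        len += 16;
    }
#else
    /* Scalar fallback so the sketch still builds without SSE2. */
    while (len < 256 && a[len] == b[len])
        len++;
#endif
    return len;
}

/* Small self-check: identical buffers, then a mismatch planted at byte 100. */
int compare256_sketch_selfcheck(void) {
    uint8_t a[256], b[256];
    memset(a, 0x5A, sizeof(a));
    memset(b, 0x5A, sizeof(b));
    if (compare256_sse2_sketch(a, b) != 256) return 0;
    b[100] ^= 0xFF;
    return compare256_sse2_sketch(a, b) == 100;
}
```

Both inputs must have at least 256 readable bytes, which matches how the benchmark names in this thread (`compare256/.../256`) exercise the function.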

@codecov

codecov bot commented Jan 23, 2022

Codecov Report

Merging #1131 (8e49017) into develop (9146bd4) will decrease coverage by 0.09%.
The diff coverage is 100.00%.

❗ Current head 8e49017 differs from pull request most recent head 52299a0. Consider uploading reports for the commit 52299a0 to get more accurate results

@@             Coverage Diff             @@
##           develop    #1131      +/-   ##
===========================================
- Coverage    80.23%   80.14%   -0.10%     
===========================================
  Files           98       93       -5     
  Lines         9040     8968      -72     
  Branches      1438     1431       -7     
===========================================
- Hits          7253     7187      -66     
+ Misses        1223     1217       -6     
  Partials       564      564              
Flag Coverage Δ
macos_clang 69.80% <97.50%> (+0.30%) ⬆️
macos_gcc 69.13% <90.32%> (+0.52%) ⬆️
ubuntu_clang 71.73% <92.50%> (+1.74%) ⬆️
ubuntu_clang_debug 71.27% <89.13%> (+3.45%) ⬆️
ubuntu_clang_inflate_allow_invalid_dist 70.06% <92.50%> (+0.21%) ⬆️
ubuntu_clang_inflate_strict 71.64% <92.50%> (+1.74%) ⬆️
ubuntu_clang_mmap 71.71% <92.50%> (+1.74%) ⬆️
ubuntu_clang_msan 70.19% <92.50%> (+0.21%) ⬆️
ubuntu_clang_pigz 35.72% <7.69%> (+1.89%) ⬆️
ubuntu_clang_pigz_no_optim 39.70% <ø> (ø)
ubuntu_clang_pigz_no_threads 35.26% <7.69%> (+1.77%) ⬆️
ubuntu_clang_reduced_mem 71.86% <92.50%> (+1.76%) ⬆️
ubuntu_gcc 71.07% <82.75%> (+1.96%) ⬆️
ubuntu_gcc_aarch64 70.76% <ø> (+0.02%) ⬆️
ubuntu_gcc_aarch64_compat_no_opt 69.05% <ø> (ø)
ubuntu_gcc_aarch64_no_acle 69.77% <ø> (ø)
ubuntu_gcc_aarch64_no_neon 69.97% <ø> (ø)
ubuntu_gcc_armhf 70.73% <ø> (+0.02%) ⬆️
ubuntu_gcc_armhf_compat_no_opt 68.99% <ø> (ø)
ubuntu_gcc_armhf_no_acle 70.92% <ø> (+0.02%) ⬆️
ubuntu_gcc_armhf_no_neon 71.04% <ø> (ø)
ubuntu_gcc_armsf 70.71% <ø> (+0.02%) ⬆️
ubuntu_gcc_armsf_compat_no_opt 69.00% <ø> (ø)
ubuntu_gcc_benchmark 72.32% <89.65%> (+1.49%) ⬆️
ubuntu_gcc_compat_no_opt 70.31% <ø> (ø)
ubuntu_gcc_compat_sprefix 70.84% <82.75%> (+1.99%) ⬆️
ubuntu_gcc_mingw_i686 0.00% <0.00%> (ø)
ubuntu_gcc_mingw_x86_64 0.00% <0.00%> (ø)
ubuntu_gcc_no_avx2 70.64% <90.00%> (+0.44%) ⬆️
ubuntu_gcc_no_ctz 71.08% <ø> (ø)
ubuntu_gcc_no_ctzll 70.81% <ø> (ø)
ubuntu_gcc_no_pclmulqdq 69.41% <83.33%> (+2.08%) ⬆️
ubuntu_gcc_no_sse2 70.18% <ø> (+1.79%) ⬆️
ubuntu_gcc_no_sse4 70.20% <83.33%> (+2.01%) ⬆️
ubuntu_gcc_o3 69.48% <0.00%> (-0.15%) ⬇️
ubuntu_gcc_osb 71.00% <88.37%> (+1.91%) ⬆️
ubuntu_gcc_pigz 36.55% <14.28%> (+2.22%) ⬆️
ubuntu_gcc_pigz_aarch64 37.36% <ø> (+0.07%) ⬆️
ubuntu_gcc_ppc 68.22% <ø> (-0.87%) ⬇️
ubuntu_gcc_ppc64 71.76% <ø> (+0.16%) ⬆️
ubuntu_gcc_ppc64le 70.40% <ø> (+0.16%) ⬆️
ubuntu_gcc_ppc_no_power8 70.76% <ø> (-1.10%) ⬇️
ubuntu_gcc_s390x 73.41% <ø> (+1.13%) ⬆️
ubuntu_gcc_s390x_dfltcc ?
ubuntu_gcc_s390x_dfltcc_compat ?
ubuntu_gcc_s390x_no_crc32 ?
ubuntu_gcc_sparc64 71.97% <ø> (ø)
ubuntu_gcc_sprefix 70.66% <82.75%> (+1.97%) ⬆️
win64_gcc ∅ <ø> (∅)
win64_gcc_compat_no_opt ∅ <ø> (∅)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
arch/x86/compare256_sse2.c 100.00% <100.00%> (ø)
functable.c 75.95% <100.00%> (+2.73%) ⬆️
arch/power/adler32_vmx.c 0.00% <0.00%> (-96.78%) ⬇️
fallback_builtins.h 33.33% <0.00%> (-44.45%) ⬇️
adler32.c 81.13% <0.00%> (ø)
uncompr.c 81.39% <0.00%> (ø)
zutil_p.h 71.42% <0.00%> (ø)
chunkset.c 61.90% <0.00%> (ø)
crc32_fold.c 87.50% <0.00%> (ø)
chunkset_tpl.h 99.14% <0.00%> (ø)
... and 31 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9146bd4...52299a0.

@KungFuJesus force-pushed the sse2_compare256 branch 2 times, most recently from 4f2a249 to fb265c1, January 23, 2022 04:29
@KungFuJesus
Contributor Author

KungFuJesus commented Jan 23, 2022

Performance is about where I'd expect it to be when we have to assume unaligned loads on pre-nehalem hardware:

2022-01-22T23:46:59-05:00
Running ./benchmark_zlib
Run on (4 X 2013.99 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 6144 KiB (x2)
Load Average: 0.01, 0.08, 0.08
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
compare256/unaligned_64/1           3.35 ns         3.35 ns    209028988
compare256/unaligned_64/8           4.35 ns         4.35 ns    160783187
compare256/unaligned_64/64          11.4 ns         11.4 ns     61479081
compare256/unaligned_64/256         35.9 ns         35.9 ns     19529461
compare256/unaligned_sse2/1         3.68 ns         3.68 ns    190033995
compare256/unaligned_sse2/8         3.68 ns         3.68 ns    190218654
compare256/unaligned_sse2/64        9.32 ns         9.32 ns     75047295
compare256/unaligned_sse2/256       24.9 ns         24.9 ns     28125358

On cascade lake, a platform where unaligned load instructions on aligned access have virtually no penalty, this is the breakdown:

------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
compare256/c/1                      1.22 ns         1.22 ns    572770519
compare256/c/8                      2.94 ns         2.93 ns    238564252
compare256/c/64                     19.9 ns         19.9 ns     35191136
compare256/c/256                    78.7 ns         78.7 ns      8896746
compare256/unaligned_16/1           1.79 ns         1.79 ns    390618875
compare256/unaligned_16/8           2.70 ns         2.70 ns    259911518
compare256/unaligned_16/64          11.3 ns         11.3 ns     61922727
compare256/unaligned_16/256         40.3 ns         40.3 ns     17351465
compare256/unaligned_32/1           1.71 ns         1.71 ns    409227483
compare256/unaligned_32/8           2.96 ns         2.96 ns    236767832
compare256/unaligned_32/64          16.6 ns         16.6 ns     42130549
compare256/unaligned_32/256         62.4 ns         62.4 ns     11219254
compare256/unaligned_64/1           1.73 ns         1.73 ns    404655632
compare256/unaligned_64/8           2.77 ns         2.77 ns    253041555
compare256/unaligned_64/64          8.06 ns         8.06 ns     86846111
compare256/unaligned_64/256         24.7 ns         24.7 ns     28299510
compare256/unaligned_sse2/1         1.47 ns         1.47 ns    477620250
compare256/unaligned_sse2/8         1.47 ns         1.47 ns    475605924
compare256/unaligned_sse2/64        3.42 ns         3.42 ns    204558888
compare256/unaligned_sse2/256       8.91 ns         8.91 ns     78066374
compare256/unaligned_sse4/1         2.45 ns         2.45 ns    285813847
compare256/unaligned_sse4/8         2.45 ns         2.45 ns    286352541
compare256/unaligned_sse4/64        8.42 ns         8.42 ns     83975254
compare256/unaligned_sse4/256       23.0 ns         23.0 ns     30509133
compare256/unaligned_avx2/1         1.64 ns         1.64 ns    428169536
compare256/unaligned_avx2/8         1.64 ns         1.64 ns    427699799
compare256/unaligned_avx2/64        2.69 ns         2.69 ns    255980607
compare256/unaligned_avx2/256       4.73 ns         4.73 ns    147394399

So ~half of avx2's performance, nearly double unaligned_64 when we don't have alignment to account for. The compare function on pre-nehalem hardware is about 40% faster on the 256 byte string when we do force aligned loads but to do so, we need both strings to be aligned (probably difficult to control and fairly unlikely without having to make performance killing copies into buffers).

@KungFuJesus force-pushed the sse2_compare256 branch 4 times, most recently from df815b9 to e49e507, January 23, 2022 07:41
@Dead2
Member

Dead2 commented Jan 23, 2022

Looks awesome, and it is very good that you split it into two commits. However, the commit for removing SSE4.2 support contains a lot of changes that add SSE2 support as well. This would make it difficult for future bisecting, review or reverts.
Could you please clean that up first?

@KungFuJesus
Contributor Author

I'll do my best, it's quite a bit of revision surgery, even if only two commits.

functable.c (outdated review thread, resolved)
@KungFuJesus
Contributor Author

KungFuJesus commented Jan 23, 2022

So I'm pretty sure this should be good now - I'm just somewhat confused by github's delta display here. See my comment above.

Ahh it is accurate, rebase confusingly reordered this. I'm guessing this must have happened on develop?

arch/x86/compare256_sse2.c (outdated review thread, resolved)
cpu_features.h (outdated review thread, resolved)
@nmoinvaz
Member

@KungFuJesus also I forgot that the README.md needs to be updated because it mentions SSE4.2 but should say SSE2 now.

@KungFuJesus force-pushed the sse2_compare256 branch 2 times, most recently from 2e71175 to 8143756, January 23, 2022 17:14
@KungFuJesus
Contributor Author

> @KungFuJesus also I forgot that the README.md needs to be updated because it mentions SSE4.2 but should say SSE2 now.

Resolved

functable.c (outdated review thread, resolved)
functable.c (outdated review thread, resolved)
test/benchmarks/benchmark_compare256.cc (outdated review thread, resolved)
@KungFuJesus force-pushed the sse2_compare256 branch 2 times, most recently from 827e80d to 584b8c0, January 23, 2022 17:50
@KungFuJesus marked this pull request as draft January 24, 2022 16:27
@KungFuJesus
Contributor Author

Hold off on this for a second, I did get a decent gain by aligning just one of the loads. The compiler is eliding the separate load by folding it into the comparison, using the indirect address from a register offset. This is helping in all cases (though not as much as if both loads were aligned, but that's impossible to do without obliterating performance to get a least common multiple for the alignment).
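
One way to picture "aligning just one of the loads" is sketched below. It is an illustration under stated assumptions rather than the code in this PR: the names are hypothetical, `__builtin_ctz` assumes GCC or Clang, and the head/tail handling (an unaligned first block, then an overlapping unaligned final block) is just one possible arrangement. Only `src0` is read with the aligned load; `src1` always uses the unaligned load.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>

/* Index of the first differing byte in a 16-byte block, or 16 if none differ. */
static uint32_t block_mismatch(__m128i a, __m128i b) {
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
    return mask == 0xFFFFu ? 16u : (uint32_t)__builtin_ctz(~mask);
}
#endif

/* Hypothetical sketch: only src0 is ever loaded with the aligned instruction. */
uint32_t compare256_one_aligned(const uint8_t *src0, const uint8_t *src1) {
#if defined(__SSE2__)
    uint32_t m, len;

    /* First block: neither pointer is known to be aligned. */
    m = block_mismatch(_mm_loadu_si128((const __m128i *)src0),
                       _mm_loadu_si128((const __m128i *)src1));
    if (m < 16) return m;

    /* Step forward 1..16 bytes so that src0 + len sits on a 16-byte boundary. */
    len = 16u - (uint32_t)((uintptr_t)src0 & 15u);
    while (len + 16 <= 256) {
        assert((((uintptr_t)(src0 + len)) & 15u) == 0);
        m = block_mismatch(_mm_load_si128((const __m128i *)(src0 + len)),
                           _mm_loadu_si128((const __m128i *)(src1 + len)));
        if (m < 16) return len + m;
        len += 16;
    }

    /* Leftover tail: one overlapping unaligned block covering bytes 240..255. */
    if (len < 256) {
        m = block_mismatch(_mm_loadu_si128((const __m128i *)(src0 + 240)),
                           _mm_loadu_si128((const __m128i *)(src1 + 240)));
        if (m < 16) return 240 + m;
    }
    return 256;
#else
    uint32_t len = 0;  /* scalar fallback when SSE2 is unavailable */
    while (len < 256 && src0[len] == src1[len]) len++;
    return len;
#endif
}

/* Tries every src0 misalignment and mismatch position; returns the failure count. */
int compare256_one_aligned_failures(void) {
    static uint8_t buf0[256 + 16], buf1[256 + 16];
    int failures = 0;
    for (unsigned off = 0; off < 16; off++) {
        for (unsigned pos = 0; pos <= 256; pos++) {
            memset(buf0, 0x11, sizeof(buf0));
            memset(buf1, 0x11, sizeof(buf1));
            if (pos < 256) buf1[3 + pos] ^= 0xFF;  /* src1 uses a fixed offset of 3 */
            if (compare256_one_aligned(buf0 + off, buf1 + 3) != pos) failures++;
        }
    }
    return failures;
}
```

The overlapping blocks re-check a few bytes that are already known to match, so the first set bit in the inverted mask still points at the true first mismatch.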

@KungFuJesus marked this pull request as ready for review January 24, 2022 16:56
@KungFuJesus
Contributor Author

Rebase Needed label is no longer needed

@Dead2 removed the "Rebase needed" label Feb 2, 2022
@Dead2
Member

Dead2 commented Feb 10, 2022

SSE2-only comparison
Xeon E5-2650, x86-64, GCC 4.8

Baseline ab6665b GCC4.8 SSE2-only

Tool: minideflate Levels: 0-9
Runs: 40         Trim worst: 20

Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
0    100.015%      0.010/0.019/0.023/0.003        0.016/0.024/0.029/0.003       15,738,731
1     54.177%      0.242/0.259/0.268/0.007        0.121/0.131/0.138/0.005        8,525,423
2     43.871%      0.447/0.460/0.466/0.005        0.131/0.138/0.142/0.003        6,903,690
3     42.387%      0.564/0.580/0.588/0.006        0.122/0.134/0.139/0.004        6,670,227
4     41.647%      0.614/0.635/0.645/0.008        0.121/0.129/0.134/0.005        6,553,734
5     41.225%      0.686/0.704/0.712/0.007        0.106/0.126/0.132/0.007        6,487,311
6     41.043%      0.767/0.786/0.794/0.006        0.114/0.126/0.134/0.005        6,458,729
7     40.778%      0.974/0.992/1.000/0.007        0.119/0.127/0.132/0.004        6,416,929
8     40.704%      1.187/1.206/1.218/0.008        0.111/0.125/0.129/0.005        6,405,237
9     40.409%      1.330/1.346/1.355/0.008        0.120/0.129/0.134/0.005        6,358,939

avg1  48.626%                        0.699                          0.119
avg2  54.029%                        0.776                          0.132
tot                                139.721                         23.752       76,518,950

  text    data     bss     dec     hex filename
116535    1456      48  118039   1cd17 libz-ng.so.2

PR #1131 e095a45 GCC4.8 SSE2-only

Tool: minideflate Levels: 0-9
Runs: 40         Trim worst: 20

Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
0    100.015%      0.013/0.020/0.023/0.003        0.016/0.024/0.029/0.004       15,738,731
1     54.177%      0.250/0.261/0.268/0.005        0.118/0.133/0.138/0.006        8,525,423
2     43.871%      0.461/0.468/0.474/0.004        0.130/0.137/0.141/0.004        6,903,690
3     42.387%      0.565/0.579/0.587/0.006        0.113/0.132/0.140/0.007        6,670,227
4     41.647%      0.620/0.636/0.645/0.007        0.117/0.127/0.133/0.005        6,553,734
5     41.225%      0.689/0.703/0.709/0.006        0.115/0.125/0.131/0.004        6,487,311
6     41.043%      0.766/0.786/0.796/0.010        0.117/0.125/0.131/0.004        6,458,729
7     40.778%      0.965/0.984/0.991/0.006        0.106/0.123/0.129/0.006        6,416,929
8     40.704%      1.180/1.199/1.208/0.007        0.117/0.125/0.132/0.004        6,405,237
9     40.409%      1.340/1.354/1.364/0.007        0.117/0.128/0.134/0.004        6,358,939

avg1  48.626%                        0.699                          0.118
avg2  54.029%                        0.777                          0.131
tot                                139.814                         23.620       76,518,950

  text    data     bss     dec     hex filename
119639    1456      48  121143   1d937 libz-ng.so.2

Unfortunately it does not seem to be any faster than the old code when compiled with nothing fancier than SSE2.

@Dead2
Member

Dead2 commented Feb 10, 2022

Comparison with SSE2, SSSE3, SSE4 and PCLMULQDQ
Xeon E5-2650, x86-64, GCC 4.8

Baseline ab6665b GCC4.8 SSE2, SSSE3, SSE4 and PCLMULQDQ

 Tool: minideflate Levels: 0-9
 Runs: 40         Trim worst: 20

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.015%      0.003/0.014/0.018/0.004        0.003/0.016/0.020/0.005       15,738,731
 1     54.159%      0.243/0.253/0.261/0.005        0.105/0.125/0.130/0.006        8,522,662
 2     43.870%      0.444/0.461/0.466/0.005        0.120/0.127/0.133/0.004        6,903,597
 3     42.387%      0.574/0.582/0.591/0.005        0.117/0.125/0.130/0.004        6,670,087
 4     41.647%      0.624/0.641/0.649/0.008        0.105/0.121/0.128/0.005        6,553,711
 5     41.226%      0.694/0.702/0.709/0.005        0.107/0.120/0.124/0.005        6,487,406
 6     41.044%      0.775/0.789/0.798/0.007        0.103/0.115/0.122/0.005        6,458,769
 7     40.778%      0.976/1.001/1.009/0.008        0.107/0.115/0.121/0.004        6,416,907
 8     40.703%      1.207/1.219/1.229/0.005        0.101/0.118/0.124/0.006        6,405,232
 9     40.409%      1.398/1.413/1.421/0.007        0.112/0.123/0.127/0.004        6,358,939

 avg1  48.624%                        0.707                          0.110
 avg2  54.026%                        0.786                          0.123
 tot                                141.499                         22.090       76,516,041

   text    data     bss     dec     hex filename
 125031    1456      48  126535   1ee47 libz-ng.so.2

PR #1131 e095a45 GCC4.8 SSE2, SSSE3, SSE4 and PCLMULQDQ

 Tool: minideflate Levels: 0-9
 Runs: 40         Trim worst: 20

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.015%      0.007/0.012/0.017/0.004        0.010/0.017/0.020/0.003       15,738,731
 1     54.159%      0.241/0.251/0.257/0.005        0.107/0.123/0.129/0.006        8,522,662
 2     43.870%      0.445/0.454/0.461/0.005        0.116/0.128/0.133/0.005        6,903,597
 3     42.387%      0.560/0.573/0.578/0.004        0.117/0.125/0.130/0.004        6,670,087
 4     41.647%      0.614/0.628/0.635/0.006        0.102/0.120/0.126/0.006        6,553,711
 5     41.226%      0.676/0.684/0.692/0.005        0.096/0.120/0.127/0.007        6,487,406
 6     41.044%      0.753/0.767/0.777/0.007        0.105/0.119/0.125/0.006        6,458,769
 7     40.778%      0.990/0.999/1.010/0.006        0.105/0.117/0.120/0.004        6,416,907
 8     40.703%      1.188/1.213/1.225/0.010        0.103/0.118/0.124/0.006        6,405,232
 9     40.409%      1.338/1.348/1.356/0.006        0.111/0.121/0.127/0.005        6,358,939

 avg1  48.624%                        0.693                          0.111
 avg2  54.026%                        0.770                          0.123
 tot                                138.581                         22.114       76,516,041

   text    data     bss     dec     hex filename
 125223    1456      48  126727   1ef07 libz-ng.so.2

Getting rid of the old SSE4 version makes a big impact on compression speed. On average about 2% faster.
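
The "about 2%" figure can be reproduced from the `tot` compression-time rows in the two tables above. The helper below is a hypothetical name used only for this arithmetic; the input numbers are copied from the tables.

```c
#include <assert.h>

/* Percentage by which pr_total improves on baseline_total (positive = faster). */
double pct_faster(double baseline_total, double pr_total) {
    assert(baseline_total > 0.0);
    return (baseline_total - pr_total) / baseline_total * 100.0;
}

/* SSE2, SSSE3, SSE4 and PCLMULQDQ build: 141.499 (baseline) vs 138.581 (PR). */
double pct_all_extensions(void) {
    return pct_faster(141.499, 138.581);
}

/* SSE2-only build: 139.721 (baseline) vs 139.814 (PR). */
double pct_sse2_only(void) {
    return pct_faster(139.721, 139.814);
}
```

With those inputs the first comes out at roughly 2.06% faster, and the SSE2-only comparison at roughly 0.07% slower, which is within the run-to-run noise shown in the stddev columns.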

@KungFuJesus
Contributor Author

> Getting rid of the old SSE4 version makes a big impact on compression speed. On average about 2% faster.

Heh, "big". It's only going to help in scenarios that actually make it to the full 256 byte comparison. It'd be nice to get a more microscopic view of the differences. Can you run benchmark_zlib --benchmark_filter=compare256 ?

Your results there suggest that maybe somehow something is benefiting from allowing those other instructions into the compilation flags (though without output from objdump or perf annotate I can't say for certain). The SSE4 instructions were for sure slower, they carry a lot of latency with them.

My suspicion for what's happening: the 256 byte comparison has a fair amount of unrolling happening with jumps based on the trailing zero count. If you compile with -msse2 only, you're telling the compiler it doesn't have BMI1 or BMI2 instructions. This means it uses bsf (bit scan forward). This instruction has gotten slower over the evolution of Intel CPUs since tzcnt can be used without having to account for when the operand is 0. What might be happening when you enable those other instructions is the code may be calling at least the BMI1 variant lzcnt on the inverted result and subtracting it from 32. This could in fact be faster on newer generations of CPUs.
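
To make the zero-operand point concrete: bsf leaves its destination undefined when the input is 0, while tzcnt returns the operand width, and GCC's `__builtin_ctz` is likewise undefined for 0. The sketch below shows the guard that a mismatch search needs either way. The function names are hypothetical, and which instruction the compiler actually emits for the builtin depends on the target flags, so nothing here demonstrates the codegen itself.

```c
#include <assert.h>
#include <stdint.h>

/* Portable trailing-zero count; returns 32 for x == 0, the value tzcnt defines. */
uint32_t ctz32_portable(uint32_t x) {
    uint32_t n = 0;
    if (x == 0) return 32;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* eq_mask has bit i set when byte i of a 16-byte block matched.
 * Returns the index of the first mismatching byte, or 16 if all matched. */
uint32_t first_mismatch_from_mask(uint32_t eq_mask) {
    uint32_t diff = ~eq_mask & 0xFFFFu;   /* set bits now mark mismatches */
    if (diff == 0) return 16;             /* guard: never count zeros of 0 */
    assert(diff != 0);
#if defined(__GNUC__)
    return (uint32_t)__builtin_ctz(diff); /* undefined for 0, hence the guard */
#else
    return ctz32_portable(diff);
#endif
}
```

For example, an equality mask of 0x00FF (low eight bytes matched) gives a first mismatch at index 8.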

Additionally, it's probably faster for Nehalem and newer CPUs to just take the unaligned load (so there's a small gap between the first core i series and haswell where this function may not be peak performance, but it still ought to be way better than the sse4 one, and at least a little better than unaligned_64).

@Dead2
Member

Dead2 commented Feb 10, 2022

@KungFuJesus I think you misunderstand. I did not enable those instruction sets globally.
I did this: cmake -DWITH_SSSE3=OFF -DWITH_SSE41=OFF -DWITH_SSE42=OFF -DWITH_AVX2=OFF .
That should be as close as I can get to running the code that an actual SSE2 cpu would be running.

Microbenchmarks are very good, but the purpose of these tests is to do as close to a real-world test as possible, both double-checking speeds and correctness with real-world data when using the full code-path.

@KungFuJesus
Contributor Author

> @KungFuJesus I think you misunderstand. I did not enable those instruction sets globally. I did this: cmake -DWITH_SSSE3=OFF -DWITH_SSE41=OFF -DWITH_SSE42=OFF -DWITH_AVX2=OFF . That should be as close as I can get to running the code that an actual SSE2 cpu would be running.
>
> Microbenchmarks are very good, but the purpose of these tests is to do as close to a real-world test as possible, both double-checking speeds and correctness with real-world data when using the full code-path.

Except for the fact that you're not actually running on SSE2 era hardware. The Q9650 tells a very different story for a lot of this stuff.

I'm not saying the macro benchmarks are without merit, they just give a really crappy resolution as to what's going on. The microbenchmarks can say a lot more about what's happening. In particular, since the longest match stuff is actually inlined into a template function, there are also a lot of other effects that the microbenchmark doesn't quite capture, either.

I would like to see what the results are for you in isolation of all other effects so that I can see exactly what difference this code makes. It also will give you better resolution if you use perf record to capture a profile, as it's only going to be measuring just this method and all the other noise is dropped out of the samples.

@Dead2
Member

Dead2 commented Feb 10, 2022

@KungFuJesus Just to be very clear; I am not arguing against this PR at all.

I just have an established review process that I go through with all PRs. With some PRs I skip steps like benchmarking because the PR clearly makes no changes that can affect those (such as CI changes), but doing these tests has caught a lot of different kinds of regressions before they make it into zlib-ng. I wish I had the time to do more in-depth testing customized to each PR, but I just don't have the bandwidth for that. I did however do some customized testing here, to ensure that I actually tested the SSE2 code-paths instead of just whatever is default on this CPU.
I do trust that your benchmarks show an actual performance benefit on your system, that has never been in question. 😉

I am going to approve this PR, now that I have found no negative effects.

Btw, the benchmarks don't compile with GCC4.8, so I cannot provide those. 💣

/usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc: In member function ‘void crc32::Bench(benchmark::State&, crc32_func)’:
/usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc:38:14: warning: ‘auto’ changes meaning in C++11; please remove it [-Wc++0x-compat]
         for (auto _ : state) {
              ^
/usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc:38:19: error: ‘_’ does not name a type
         for (auto _ : state) {
                   ^

@KungFuJesus
Contributor Author

> @KungFuJesus Just to be very clear; I am not arguing against this PR at all.

I understand that much; I was more hoping for the microbenchmarks so that I could establish why it doesn't help with -msse2 in your case. Compiling with that should be better by some observable margins, assuming most of the tests don't end up being stopped at their early exit with the first two bytes. By compiling with -msse2 for this file in the first revision and then in this PR, you're effectively comparing unaligned_64 to the sse2 method. And while we established that unaligned_64 is always faster than the SSE4 comparison was, I'd like to establish that the sse2 version is in fact faster for everyone than the unaligned_64 version.

The issue you're seeing is because GCC 4.8 doesn't compile as C++11 by default (range-based for and auto need -std=c++11), which is...unfortunate. But it can be fixed, at a fixed loop overhead cost, by modifying that for loop to instead be while (state.KeepRunning()). Perhaps we should ifdef that or something at compile time.

@Dead2
Member

Dead2 commented Feb 10, 2022

Unfortunately there are more errors when compiling the benchmarks, this is the next one to crop up but there are likely quite a few more:

/usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc:43:34: error: ‘hash’ has not been declared
         benchmark::DoNotOptimize(hash);
                                  ^
/usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc:43:38: error: ISO C++ forbids declaration of ‘DoNotOptimize’ with no type [-fpermissive]
         benchmark::DoNotOptimize(hash);
/usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc:43:38: error: invalid use of ‘::’

I have never done much C++ programming, and last time was ~20 years ago, so I have not kept up to date with all the nice new features unfortunately.

@KungFuJesus
Contributor Author

> Unfortunately there are more errors when compiling the benchmarks, this is the next one to crop up but there are likely quite a few more:
>
> /usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc:43:34: error: ‘hash’ has not been declared
>          benchmark::DoNotOptimize(hash);
>                                   ^
> /usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc:43:38: error: ISO C++ forbids declaration of ‘DoNotOptimize’ with no type [-fpermissive]
>          benchmark::DoNotOptimize(hash);
> /usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc:43:38: error: invalid use of ‘::’
>
> I have never done much C++ programming, and last time was ~20 years ago, so I have not kept up to date with all the nice new features unfortunately.

Ahh, that is likely due to glibc-ancient on SL/CentOS not providing fixed width types in the include (we declare it as a uint32_t). If you typedef that or find the right include to have on the ol' enterprise distributions of yore it should compile.

@Dead2
Member

Dead2 commented Feb 10, 2022

Not sure about that, it kind of looks like hash is declared fine but that it really has a problem with :: and that the hash appearing after that is misinterpreted. hash is used before that without any such error after all.
I did try to include stdint.h just in case, but it made no difference.

@KungFuJesus
Contributor Author

> Not sure about that, it kind of looks like hash is declared fine but that it really has a problem with :: and that the hash appearing after that is misinterpreted. hash is used before that without any such error after all. I did try to include stdint.h just in case, but it made no difference.

Hmm, namespace specifiers definitely aren't a new thing, I'm pretty sure those have been in the grammar since the beginning of C++. I believe DoNotOptimize() is just a static function inside the benchmark namespace that effectively wraps volatile or some other keyword around it so that it can't optimize the calculation away.

Though, if we're being really honest here I don't think that is 100% necessary, since I'm pretty sure we use it as an argument into the next loop iteration so the optimizer can't optimize it away since it doesn't know what sort of side effects calling the function might have. It's more there as a "just in case" so the optimizer can't get too clever about it.
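
The two safeguards mentioned here, feeding each result into the next iteration and marking the final value as used, can be sketched in plain C. This is an illustration only: `toy_checksum`, `benchmark_sink` and `run_bench_loop` are hypothetical names, the checksum is a stand-in rather than crc32, and a volatile store is just one simple way to keep a result alive, not a description of how the benchmark library implements `DoNotOptimize()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sink: a volatile store is a side effect the compiler must keep. */
static volatile uint32_t benchmark_sink;

/* Toy checksum standing in for crc32(); not the real algorithm. */
static uint32_t toy_checksum(uint32_t seed, const unsigned char *buf, size_t len) {
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        seed = seed * 31u + buf[i];
    return seed;
}

/* Each iteration depends on the previous result, so no call can be dropped
 * without changing the final value that is stored to the sink. */
uint32_t run_bench_loop(const unsigned char *buf, size_t len, int iterations) {
    uint32_t hash = 0;
    for (int i = 0; i < iterations; i++)
        hash = toy_checksum(hash, buf, len);
    benchmark_sink = hash;
    return hash;
}
```

In the real benchmark the function is called through a pointer, which already limits what the optimizer can assume; the sink plays the "just in case" role described above.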

@Dead2
Member

Dead2 commented Feb 10, 2022

Right, as I said, I am not a C++ guy at all, I have probably forgotten 90% of the little I knew about it back then by now. 😄

Anyway, if someone wants to have a go at fixing this, I might be able to set up access to something later, but it'll probably take me some time as I don't currently have a machine ready for that but have lots of stuff on the calendar.
Alternatively it is possible to install Centos 7 in a VM for example; http://isoredirect.centos.org/centos/7/isos/x86_64/

Hmm, or perhaps it could be done by just using godbolt?

@KungFuJesus
Contributor Author

KungFuJesus commented Feb 10, 2022

> Hmm, or perhaps it could be done by just using godbolt?

A little bit of a challenge when it requires pulling in headers. How far do you get if you just comment out the "DoNotOptimize" line? Actually now that I think about it, the other possibility could be that you don't have a C++ compiler installed and instead of using g++ it's trying to compile it with GCC as C.

@Dead2
Member

Dead2 commented Feb 10, 2022

Lines 36-45:

    void Bench(benchmark::State& state, crc32_func crc32) {
        uint32_t hash = 0;

        //for (auto _ : state) {
        while (state.KeepRunning())
            hash = crc32(hash, (const unsigned char *)random_ints, state.range(0));
        }

        //benchmark::DoNotOptimize(hash);
    }
/usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc:45:5: error: expected ‘;’ after class definition
     }
     ^
/usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc:27:10: warning: unused parameter ‘state’ [-Wunused-parameter]
     void SetUp(const ::benchmark::State& state) {
          ^
/usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc: In function ‘void TearDown(const benchmark::State&)’:
/usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc:48:18: error: ‘random_ints’ was not declared in this scope
         zng_free(random_ints);
                  ^
/usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc: At global scope:
/usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc:47:10: warning: unused parameter ‘state’ [-Wunused-parameter]
     void TearDown(const ::benchmark::State& state) {
          ^
/usr/src/github/zlib-ng/test/benchmarks/benchmark_crc32.cc:50:1: error: expected declaration before ‘}’ token
 };
 ^

@KungFuJesus
Contributor Author

Hah, fairly cryptic but the actual bug is that you're missing a { brace after the while. That's a C bug 😆

@Dead2
Member

Dead2 commented Feb 10, 2022

Doh, I just copied what you gave me, and never really looked at that line again 😆

@Dead2
Member

Dead2 commented Feb 10, 2022

Ok, only that for loop needed to be changed (in each of the four benchmark files).

Run on (32 X 2000.03 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 256 KiB (x16)
  L3 Unified 20480 KiB (x2)
Load Average: 1.08, 0.51, 0.44
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
compare256/c/1                      6.54 ns         6.54 ns    106934154
compare256/c/8                      14.0 ns         14.0 ns     50125371
compare256/c/64                     73.7 ns         73.7 ns      9518600
compare256/c/256                     274 ns          274 ns      2553467
compare256/unaligned_16/1           6.54 ns         6.54 ns    107021217
compare256/unaligned_16/8           11.2 ns         11.2 ns     62445151
compare256/unaligned_16/64          43.9 ns         43.9 ns     15940660
compare256/unaligned_16/256          154 ns          154 ns      4531052
compare256/unaligned_32/1           6.55 ns         6.55 ns    106961496
compare256/unaligned_32/8           10.3 ns         10.3 ns     68018063
compare256/unaligned_32/64          36.5 ns         36.5 ns     19202827
compare256/unaligned_32/256          129 ns          129 ns      5451357
compare256/unaligned_64/1           6.57 ns         6.57 ns    106520069
compare256/unaligned_64/8           8.52 ns         8.52 ns     82151264
compare256/unaligned_64/64          21.5 ns         21.5 ns     32500395
compare256/unaligned_64/256         64.5 ns         64.5 ns     10856761
compare256/unaligned_sse2/1         5.60 ns         5.60 ns    124913769
compare256/unaligned_sse2/8         5.60 ns         5.60 ns    124872603
compare256/unaligned_sse2/64        21.5 ns         21.5 ns     32589183
compare256/unaligned_sse2/256       64.4 ns         64.4 ns     10868078
compare256/unaligned_sse4/1         11.5 ns         11.5 ns     60828583
compare256/unaligned_sse4/8         11.5 ns         11.5 ns     60818275
compare256/unaligned_sse4/64        33.6 ns         33.6 ns     20801447
compare256/unaligned_sse4/256       92.8 ns         92.8 ns      7542976

@Dead2
Member

Dead2 commented Feb 10, 2022

Interestingly I got a segfault when benchmarking slide_hash_c

Program received signal SIGSEGV, Segmentation fault.
slide_hash_c_chain (wsize=44257, entries=65536, table=0x20473f659f85b70d) at /usr/src/github/zlib-ng/slide_hash.c:39
39                  Pos m = *q;

(gdb) bt
#0  slide_hash_c_chain (wsize=44257, entries=65536, table=0x20473f659f85b70d) at /usr/src/github/zlib-ng/slide_hash.c:39
#1  slide_hash_c (s=0x67a080) at /usr/src/github/zlib-ng/slide_hash.c:50
#2  0x0000000000405ec4 in slide_hash_c_Benchmark::BenchmarkCase(benchmark::State&) ()
#3  0x000000000040529e in benchmark::Fixture::Run(benchmark::State&) ()
#4  0x0000000000439162 in benchmark::internal::BenchmarkInstance::Run(unsigned long, int, benchmark::internal::ThreadTimer*, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*) const ()
#5  0x000000000041fdbd in benchmark::internal::(anonymous namespace)::RunInThread(benchmark::internal::BenchmarkInstance const*, unsigned long, int, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*) ()
#6  0x0000000000420529 in benchmark::internal::BenchmarkRunner::DoNIterations() ()
#7  0x0000000000420e46 in benchmark::internal::BenchmarkRunner::DoOneRepetition() ()
#8  0x000000000040e5f1 in benchmark::internal::(anonymous namespace)::RunBenchmarks(std::vector<benchmark::internal::BenchmarkInstance, std::allocator<benchmark::internal::BenchmarkInstance> > const&, benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*) ()
#9  0x0000000000410899 in benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*) ()
#10 0x0000000000404653 in main ()

(gdb) info locals
m = <optimized out>
t = 44257
q = 0x20473f659f85b70d

@KungFuJesus
Contributor Author

KungFuJesus commented Feb 10, 2022

So, the value of the wsize argument there doesn't look correct. I believe we only use power-of-2 sizes, and they happen to be multiples of 64. From what I understand, in practice we only actually use 2, but for sure that wsize argument does not look correct.
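
As a rough illustration of that sanity check (the helper names here are hypothetical and not part of zlib-ng), a benchmark fixture could reject a value like 44257 before it ever reaches slide_hash:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not zlib-ng code: true when wsize is a power of two.
 * A power of two has exactly one bit set, so clearing the lowest set bit
 * with (wsize & (wsize - 1)) must leave zero. */
static bool is_power_of_two(uint32_t wsize) {
    return wsize != 0 && (wsize & (wsize - 1)) == 0;
}

/* Hypothetical check based on the two properties mentioned above:
 * a power-of-two size that is also a multiple of 64. */
static bool wsize_looks_valid(uint32_t wsize) {
    return is_power_of_two(wsize) && (wsize % 64) == 0;
}
```

With this, `wsize_looks_valid(32768)` holds, while `wsize_looks_valid(44257)` does not, since 44257 is odd and so cannot be a power of two.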

For the compare benchmark, that's interesting. It looks as though for strings of any significant size the unaligned_64 and unaligned_sse2 times are basically equal, which is not what I found on my Cascade Lake system a while back. I also noticed that when compiling with -O3, the compiler actually unrolls that loop even more than we do, for a decent gain. I'll try running the compare benchmark again with the standard release flags out of cmake.

@KungFuJesus
Contributor Author

compare256/unaligned_64/8           1.95 ns         1.95 ns    358231809
compare256/unaligned_64/64          6.65 ns         6.65 ns    105467465
compare256/unaligned_64/256         23.7 ns         23.7 ns     29544215
compare256/unaligned_sse2/1         1.47 ns         1.47 ns    477523397
compare256/unaligned_sse2/8         1.47 ns         1.47 ns    477633824
compare256/unaligned_sse2/64        4.23 ns         4.23 ns    164934499
compare256/unaligned_sse2/256       12.9 ns         12.9 ns     54277867

Hmm, perhaps we chalk it up to gcc 4.8 being a fairly poor optimizer? It is rather ancient. It would be possible to get to the bottom of it with perf profiles. Interestingly, despite me not specifying any special flags, it does seem to be generating tzcnt instructions, which shouldn't be available with strictly SSE2 instructions (which is what this file should be compiled with). Here are the first few instructions of the function:

  0.05 │      movdqu   (%rdi),%xmm0
       │      movdqu   (%rsi),%xmm2
 13.39 │      pcmpeqb  %xmm2,%xmm0
  3.19 │      pmovmskb %xmm0,%eax
       │      cmp      $0xffff,%eax
  0.07 │    ↓ je       20
       │      not      %eax
 26.10 │      tzcnt    %eax,%eax
  0.34 │    ← ret
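
For reference, that listing corresponds roughly to the intrinsics below. This is a simplified sketch, not the code from this PR: the function name is made up, it skips the alignment handling and unrolling, it relies on the GCC/Clang `__builtin_ctz` builtin, and it falls back to a plain byte loop when the compiler is not targeting SSE2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Sketch only (hypothetical name): returns how many leading bytes of the two
 * 256-byte buffers match, in the range 0..256. */
static uint32_t compare256_sse2_sketch(const uint8_t *src0, const uint8_t *src1) {
    uint32_t len = 0;
#if defined(__SSE2__)
    while (len < 256) {
        /* movdqu: unaligned 16-byte loads from both inputs */
        __m128i a = _mm_loadu_si128((const __m128i *)(src0 + len));
        __m128i b = _mm_loadu_si128((const __m128i *)(src1 + len));
        /* pcmpeqb: each byte becomes 0xff where equal, 0x00 where different */
        __m128i eq = _mm_cmpeq_epi8(a, b);
        /* pmovmskb: collect the top bit of each byte into a 16-bit mask */
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);
        if (mask != 0xffff) {
            /* not + tzcnt/bsf: the lowest zero bit of mask is the first
             * mismatch. ~mask is non-zero here, so ctz is well defined. */
            return len + (uint32_t)__builtin_ctz(~mask);
        }
        len += 16;
    }
#else
    /* Scalar fallback so the sketch still builds without SSE2 */
    while (len < 256 && src0[len] == src1[len])
        len++;
#endif
    return len;
}
```

Two identical 256-byte buffers give 256, and flipping byte 37 in one of them gives 37.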

@KungFuJesus
Contributor Author

Oh cool, evidently the opcode they chose for bsf has some overlap with rep tzcnt such that on newer CPUs that translates to a tzcnt and older ones it's a bsf instruction.
https://c.tenor.com/ZiLugTiVQNgAAAAC/the-more-you-know.gif
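
To spell out why that overlap is safe here: tzcnt is encoded as bsf with an F3 (rep) prefix, and CPUs without BMI1 ignore the prefix and run bsf. The two only disagree when the input is zero (tzcnt returns the operand width, bsf leaves the destination undefined), and the compare path never passes zero because it only counts after checking that the mask is not 0xffff. A portable sketch of that contract, with hypothetical helper names rather than anything from fallback_builtins.h:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable fallback: count trailing zero bits.
 * Precondition: value != 0, which is exactly the case where bsf and tzcnt
 * produce the same result. */
static uint32_t ctz32_nonzero(uint32_t value) {
    assert(value != 0);
    uint32_t n = 0;
    while ((value & 1u) == 0) {
        value >>= 1;
        n++;
    }
    return n;
}

/* Index of the first mismatching byte given a 16-bit pmovmskb-style
 * equality mask. Precondition: at least one byte differs (mask != 0xffff),
 * so the inverted mask is never zero. */
static uint32_t first_mismatch(uint32_t eq_mask) {
    assert((eq_mask & 0xffffu) != 0xffffu);
    return ctz32_nonzero(~eq_mask & 0xffffu);
}
```

For example, an equality mask of 0xfffb has bit 2 clear, so `first_mismatch(0xfffb)` is 2.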

@Dead2
Member

Dead2 commented Feb 10, 2022

Not sure what defines the values of state.range that become wsize. But it fails on the first test; not sure whether that is supposed to be 1 like the hash tests or not.

And yes, GCC 4.8 is ancient and has a pretty bad optimizer. It is also the oldest version we attempt to support. ;)
That often makes it good for regression testing though.

We actually banked on tzcnt/bsf being pretty much compatible. I think it is still in fallback_builtins.h, but I think it was a lot more clearly commented before. Or perhaps I am remembering the discussions 😄

I'll have to log off and get in bed now 💤, thanks for playing 😄

@Dead2 Dead2 merged commit b3260fd into zlib-ng:develop Feb 11, 2022
@Dead2
Member

Dead2 commented Feb 11, 2022

@KungFuJesus I thought I'd try my hand at making a PR for fixing the benchmark on older GCC.
You suggested that GCC 4.8 does not fully support C++11; however, that does not actually seem to be the case.
See: https://gcc.gnu.org/projects/cxx-status.html#cxx11

So I am unsure what to actually test for now, any idea what feature proposal this depends on?

@KungFuJesus
Contributor Author

KungFuJesus commented Feb 11, 2022

So the feature in question is foreach (range-based for) loops; however, it could very well be that it just didn't default to C++11 mode. Try adding -std=c++11 to the CXX flags.

@Dead2
Member

Dead2 commented Feb 11, 2022

@KungFuJesus I considered that, but mistakenly thought: nah, GCC defaults to the highest standard it supports. I made a PR with the CMake fix.
