Write an SSE2 optimized compare256 #1131
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #1131 +/- ##
===========================================
- Coverage 80.23% 80.14% -0.10%
===========================================
Files 98 93 -5
Lines 9040 8968 -72
Branches 1438 1431 -7
===========================================
- Hits 7253 7187 -66
+ Misses 1223 1217 -6
Partials 564 564
force-pushed from 4f2a249 to fb265c1
Performance is about where I'd expect it to be when we have to assume unaligned loads on pre-Nehalem hardware:
On Cascade Lake, a platform where unaligned load instructions on aligned addresses carry virtually no penalty, this is the breakdown:
So roughly half of AVX2's performance, and nearly double unaligned_64 when we don't have alignment to account for. On pre-Nehalem hardware the compare function is about 40% faster on the 256-byte string when we force aligned loads, but to do so we need both strings to be aligned (probably difficult to control, and fairly unlikely without making performance-killing copies into buffers).
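The approach being discussed can be sketched with SSE2 intrinsics roughly as follows. This is a hypothetical sketch with my own naming, not zlib-ng's actual compare256 code: compare 16 bytes at a time with `_mm_cmpeq_epi8`, then use the movemask and a trailing-zero count to locate the first mismatching byte.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

/* Hypothetical sketch, not zlib-ng's actual implementation: return the length
 * of the common prefix (0..256) of two 256-byte buffers, 16 bytes at a time.
 * Unaligned loads are used on both sides, matching the pre-Nehalem caveat
 * discussed above. */
static uint32_t compare256_sse2_sketch(const uint8_t *src0, const uint8_t *src1) {
    uint32_t len = 0;
    while (len < 256) {
        __m128i a  = _mm_loadu_si128((const __m128i *)(src0 + len));
        __m128i b  = _mm_loadu_si128((const __m128i *)(src1 + len));
        __m128i eq = _mm_cmpeq_epi8(a, b);                 /* 0xFF where equal */
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);   /* one bit per byte */
        if (mask != 0xFFFF)                                /* some byte differs */
            return len + (uint32_t)__builtin_ctz(~mask);   /* first zero bit */
        len += 16;
    }
    return 256;
}
```

The `~mask` trick works because a mismatch guarantees at least one zero bit in the low 16 bits, so the trailing-zero count of the inverted mask is always a valid byte index here.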
force-pushed from df815b9 to e49e507
Looks awesome, and it is very good that you split it into two commits. However, the commit for removing SSE4.2 support contains a lot of changes that add SSE2 support as well. This would make it difficult for future bisecting, reviews, or reverts.
force-pushed from e49e507 to 3cf9373
I'll do my best; it's quite a bit of revision surgery, even if only two commits.
force-pushed from 3cf9373 to 8f7293c
So I'm pretty sure this should be good now; I'm just somewhat confused by GitHub's delta display here. See my comment above. Ahh, it is accurate, the rebase confusingly reordered this. I'm guessing this must have happened on develop?
force-pushed from 8f7293c to 2fda82a
@KungFuJesus also, I forgot that the README.md needs to be updated because it mentions SSE4.2 but should now say SSE2.
force-pushed from 2e71175 to 8143756
Resolved
force-pushed from 827e80d to 584b8c0
Hold off on this for a second: I did get a decent gain by aligning just one of the loads. The compiler is eliding the load into the comparison, using an indirect address from a register offset. This helps in all cases (though not as much as if both loads were aligned, but that's impossible without obliterating performance to reach a least common multiple for the alignment).
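For illustration, the head adjustment needed to bring one pointer onto a 16-byte boundary can be computed like this. This is a hypothetical helper, not the PR's exact code; the name and surrounding usage are assumptions:

```cpp
#include <stdint.h>

/* Hypothetical helper, not the PR's code: number of bytes to advance so that
 * p lands on a 16-byte boundary. After comparing that many head bytes
 * scalar-wise, loads from this pointer can use the aligned _mm_load_si128,
 * while the other string keeps using unaligned loads. */
static int head_bytes_to_align16(const uint8_t *p) {
    return (int)((16 - ((uintptr_t)p & 15)) & 15);
}
```

Only one side can be fixed up this way; aligning both would require the two pointers to share the same misalignment, which the caller generally can't guarantee.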
force-pushed from 584b8c0 to 4477786
Rebase Needed label is no longer needed |
SSE2-only comparison:
Baseline ab6665b, GCC 4.8, SSE2-only
PR #1131 e095a45, GCC 4.8, SSE2-only
Unfortunately it does not seem to be any faster than the old version when compiling with nothing fancier than SSE2.
Comparison with SSE2, SSSE3, SSE4 and PCLMULQDQ:
Baseline ab6665b, GCC 4.8, SSE2, SSSE3, SSE4 and PCLMULQDQ
PR #1131 e095a45, GCC 4.8, SSE2, SSSE3, SSE4 and PCLMULQDQ
Getting rid of the old SSE4 version makes a big impact on compression speed: on average about 2% faster.
Heh, "big". It's only going to help in scenarios that actually make it to the full 256-byte comparison. It'd be nice to get a more microscopic view of the differences. Can you run benchmark_zlib --benchmark_filter=compare256?

Your results suggest that something is benefiting from allowing those other instructions into the compilation flags (though without output from objdump or perf annotate I can't say for certain). The SSE4 instructions were for sure slower; they carry a lot of latency with them.

My suspicion for what's happening: the 256-byte comparison has a fair amount of unrolling, with jumps based on the trailing zero count. If you compile with -msse2 only, you're telling the compiler it doesn't have BMI1 or BMI2 instructions, so it uses bsf (bit scan forward). That instruction has gotten slower over the evolution of Intel CPUs, since tzcnt can be used without having to account for a zero operand. What might be happening when you enable those other instruction sets is that the code calls at least the BMI1 variant, lzcnt, on the inverted result and subtracts it from 32, which could in fact be faster on newer generations of CPUs.

Additionally, it's probably faster for Nehalem and newer CPUs to just take the unaligned load (so there's a small gap between the first Core i series and Haswell where this function may not be peak performance, but it still ought to be way better than the SSE4 one, and at least a little better than unaligned_64).
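To make the bsf/tzcnt point concrete, here is a hedged illustration (a hypothetical helper, not zlib-ng's code) of the guard that bsf-based code has to pay for: bsf leaves its destination undefined on a zero input, while BMI1's tzcnt is defined to return the operand width.

```cpp
#include <stdint.h>

/* Hypothetical illustration: locate the first mismatching byte from a 16-bit
 * _mm_movemask_epi8 result (bit set = bytes equal). With -msse2 only, GCC
 * lowers __builtin_ctz to bsf, which is undefined for a zero operand, so the
 * all-equal case must be branched on explicitly. With -mbmi, tzcnt handles
 * zero natively and the branch could go away. */
static uint32_t first_mismatch(unsigned match_mask) {
    unsigned diff = ~match_mask & 0xFFFFu;   /* bit set where bytes differ */
    if (diff == 0)
        return 16;                           /* all 16 bytes matched */
    return (uint32_t)__builtin_ctz(diff);
}
```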
@KungFuJesus I think you misunderstand. I did not enable those instruction sets globally. Microbenchmarks are very good, but the purpose of these tests is to get as close to a real-world test as possible, double-checking both speed and correctness with real-world data using the full code path.
Except for the fact that you're not actually running on SSE2-era hardware. The Q9650 tells a very different story for a lot of this stuff. I'm not saying the macro benchmarks are without merit; they just give a really crappy resolution as to what's going on. The microbenchmarks can say a lot more about what's happening. In particular, since the longest-match stuff is actually inlined into a template function, there are also a lot of other effects that we aren't quite capturing with the microbenchmark either. I would like to see what the results are for you in isolation of all other effects, so that I can know the difference on exactly what this code optimized. It also gives you better resolution if you use perf record to capture a profile, as it's only going to be measuring just this method, and all the other noise is dropped out of the samples.
@KungFuJesus Just to be very clear: I am not arguing against this PR at all. I just have an established review process that I go through with all PRs. With some PRs I skip steps like benchmarking because the PR clearly makes no changes that can affect those (such as CI changes), but doing these tests has caught a lot of different kinds of regressions before they make it into zlib-ng. I wish I had the time to do more in-depth testing customized to each PR, but I just don't have the bandwidth for that. I did however do some customized testing here, to ensure that I actually tested the SSE2 code paths instead of just whatever is default on this CPU. I am going to approve this PR, now that I have found no negative effects. Btw, the benchmarks don't compile with GCC 4.8, so I cannot provide those. 💣
I understand that much; I was more hoping for the microbenchmarks so that I could establish why it doesn't help in the -msse2 case for you. Compiling with that should be better by some observable margin, assuming most of the tests don't end up stopping at the early exit on the first two bytes. By compiling with -msse2 for this file in the first revision and then in this PR, you're effectively comparing unaligned_64 to the SSE2 method. And while we established that unaligned_64 is always faster than the SSE4 comparison was, I'd like to establish that the SSE2 version is in fact faster for everyone than the unaligned_64 version. The issue you're seeing is because GCC 4.8 doesn't have C++11 support, which is... unfortunate. But it can be fixed, at a small fixed loop-overhead cost, by modifying that for loop to be a plain index-based loop instead.
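The kind of rewrite meant here looks roughly like this. The function and names are illustrative, not the actual benchmark file: a C++11 range-based for loop replaced by a plain indexed loop that older compilers accept in their default mode.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only, not the real benchmark code. GCC 4.8 in its default
// -std=gnu++98 mode rejects the C++11 range-based form:
//     for (auto len : lengths) total += len;
// The pre-C++11 equivalent, at a small fixed loop-overhead cost:
static long sum_lengths(const std::vector<int> &lengths) {
    long total = 0;
    for (std::size_t i = 0; i < lengths.size(); i++)
        total += lengths[i];
    return total;
}
```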
Unfortunately there are more errors when compiling the benchmarks. This is the next one to crop up, but there are likely quite a few more:
I have never done much C++ programming, and last time was ~20 years ago, so I have not kept up to date with all the nice new features unfortunately. |
Ahh, that is likely due to the ancient glibc on SL/CentOS not providing the fixed-width types in the include (we declare it as a uint32_t). If you typedef that, or find the right include for the ol' enterprise distributions of yore, it should compile.
Not sure about that, it kind of looks like |
Hmm, namespace specifiers definitely aren't a new thing; I'm pretty sure those have been in the grammar since the beginning of C++. I believe DoNotOptimize() is just a static function inside the benchmark namespace that effectively wraps volatile (or some other keyword) around the value so the compiler can't optimize the calculation away. Though, if we're being really honest here, I don't think that is 100% necessary, since I'm pretty sure we use the result as an argument into the next loop iteration, so the optimizer can't eliminate it anyway: it doesn't know what sort of side effects calling the function might have. It's more there as a "just in case" so the optimizer can't get too clever about it.
Right, as I said, I am not a C++ guy at all; I have probably forgotten 90% of the little I knew about it back then by now. 😄 Anyway, if someone wants to have a go at fixing this, I might be able to set up access to something later, but it'll probably take me some time, as I don't currently have a machine ready for that and have lots of stuff on the calendar. Hmm, or perhaps it could be done by just using godbolt?
A little bit of a challenge when it requires pulling in headers. How far do you get if you just comment out the "DoNotOptimize" line? Actually now that I think about it, the other possibility could be that you don't have a C++ compiler installed and instead of using g++ it's trying to compile it with GCC as C. |
Lines 36-45:
Hah, fairly cryptic, but the actual bug is that you're missing a
Doh, I just copied what you gave me, and never really looked at that line again 😆 |
Ok, only that for loop needed to be changed (in each of the four benchmark files).
Interestingly I got a segfault when benchmarking slide_hash_c
So, the value for the wsize argument there doesn't look correct. I believe that we only use power-of-2 sizes, and they happen to be multiples of 64. From what I understand, in practice we only actually use 2, but for sure that wsize argument does not look correct.

For the compare benchmark, that's interesting. It looks as though for non-insignificant string sizes the times are basically equal, which is not what I found on my Cascade Lake system a while back. I also noticed that when compiling with -O3, the compiler actually unrolls that loop even more than we do, to some decent gain. I'll try running the compare benchmark again with the standard release flags from CMake.
Hmm, perhaps we chalk it up to GCC 4.8 being a fairly poor optimizer? It is rather ancient. It would be possible to get to the bottom of it with perf profiles. Interestingly, despite me not specifying any special flags, it does seem to be generating tzcnt instructions, which shouldn't be available with strictly SSE2 instructions (which this file should be compiling with). Here are the first few instructions of the function:
Oh cool, evidently the encoding they chose for tzcnt overlaps with bsf (tzcnt is encoded as a rep-prefixed bsf), such that on newer CPUs it executes as tzcnt and on older ones the prefix is ignored and it's a bsf instruction.
Not sure what defines the values of state.range that become wsize. But it fails on the first test; not sure whether that is supposed to be 1 like the hash tests or not. And yes, GCC 4.8 is ancient and has a pretty bad optimizer. It is also the oldest version we attempt to support. ;) We actually banked on tzcnt/bsf being pretty much compatible. I think it is still in fallback_builtins.h, but I think it was a lot more clearly commented before. Or perhaps I am misremembering the discussions 😄 I'll have to log off and get to bed now 💤, thanks for playing 😄
@KungFuJesus I thought I'd try my hand at making a PR to fix the benchmark on older GCC. I am unsure what to actually test for, though; any idea what feature proposal this depends on?
So the feature in question is foreach (range-based for) loops; however, it could very well be that it just didn't default to trying C++11 mode. Try adding -std=c++11 to the CXX flags.
@KungFuJesus I considered that, but mistakenly thought; nah, GCC defaults to the highest standard it supports. I made a PR with the CMake fix. |
The SSE4 variant uses the unfortunate string comparison instructions from
SSE4.2, which not only don't work on as many CPUs but are often slower
than their SSE2 counterparts except in very specific circumstances.
This version should be ~2x faster than unaligned_64 for larger strings
and about half the performance of AVX2 comparisons on identical
hardware.
This version is meant to supplement pre-AVX hardware. An attempt was
made to align at least one of the strings, since this helps a lot on
pre-Nehalem systems.