Bad optimisation of slice equality across platforms and architectures. #92215
On Linux it seems to call `bcmp`, and the direct equality is also slower there. Results of the benchmark on a Ryzen 3950X:
I also tested this on much more sizeable randomized data and it was consistent.
Mentioned to @RipleyTom that a few people on related issues mentioned Ryzen CPUs (they have now updated their post). I am on a 3900X, and someone in the Windows thread reported theirs to be a 3900X too. These are all high-end 3rd-gen Ryzen CPUs with many cores, known for their sensitivity to memory performance. The M1 test goes some way towards generalising the issue, but could just be getting lucky, so I asked someone with an i3-3220 to run cargo bench. They got less extreme but similar results (equality = 489 ns/iter, manual = 311 ns/iter). So it would seem that this isn't entirely Ryzen-specific, at the very least given the M1 and i3-3220 results above.
I tried to write an equivalent test for C++, using g++ and clang++, and got similar results.

g++:

clang++:
If the g++ results were wildly different from clang++'s, I'd point the finger at LLVM. However, g++ seems to have similar issues. In all the slower cases, we have a call out to `memcmp`. Or perhaps I'm missing something and this is really not a fair comparison.
In all of the asm outputs you provided, the manual version compiles with no `call` instructions at all. Could it be that simple? The overhead of all the `call`s dominating at these sizes?

@rustbot label -C-bug +I-slow
Yeah, as soon as I make the needle and haystack a little bigger, the memcmp version leaves the other one in the dust (these benchmarks use the unstable `test` crate, so they run under `cargo bench` on nightly):

```rust
use test::Bencher;

const BIG: usize = 1_000;
const SMALL: usize = 100;

fn make() -> (Vec<u8>, Vec<u8>) {
    // Haystack: BIG - 1 zeros followed by a single 1.
    let mut haystack = Vec::with_capacity(BIG);
    for _ in 0..BIG - 1 {
        haystack.push(0);
    }
    haystack.push(1);

    // Needle: SMALL - 1 zeros followed by a single 1, so it only
    // matches at the very end of the haystack.
    let mut needle = Vec::with_capacity(SMALL);
    for _ in 0..SMALL - 1 {
        needle.push(0);
    }
    needle.push(1);

    (haystack, needle)
}

#[bench]
fn bench_equality(bencher: &mut Bencher) {
    let (haystack, needle) = make();
    bencher.iter(|| naive_search_equality(&haystack[..], &needle[..]));
}

#[bench]
fn bench_manual(bencher: &mut Bencher) {
    let (haystack, needle) = make();
    bencher.iter(|| naive_search_manual(&haystack[..], &needle[..]));
}
```
On my M1 Mac, the breakpoint where the memcmp one starts to win is when the needle is 16 elements.

Haystack = 150 elements:
Probably because the memcmp function starts to use some wider integer comparisons or something.
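For intuition, here is a minimal sketch of why wider comparisons pay off once slices get long enough. This is my own illustration, not glibc's or any real `memcmp` implementation, and `eq_wordwise` is a made-up name: comparing eight bytes per iteration amortises the loop overhead that a byte-by-byte loop pays on every element.

```rust
use std::convert::TryInto; // in the prelude since the 2021 edition

/// Illustrative only: compare slices 8 bytes at a time as u64 words,
/// falling back to byte-by-byte for the tail.
fn eq_wordwise(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut a_words = a.chunks_exact(8);
    let mut b_words = b.chunks_exact(8);
    for (ca, cb) in (&mut a_words).zip(&mut b_words) {
        // One 8-byte comparison stands in for eight 1-byte comparisons.
        let wa = u64::from_ne_bytes(ca.try_into().unwrap());
        let wb = u64::from_ne_bytes(cb.try_into().unwrap());
        if wa != wb {
            return false;
        }
    }
    // Compare the remaining 0..=7 bytes individually.
    a_words
        .remainder()
        .iter()
        .zip(b_words.remainder())
        .all(|(x, y)| x == y)
}
```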
Yes, it makes sense: the extra cost of memcmp probably comes from the call plus the size checks it does to see whether it can use those optimisations. When it can, it's much better than byte-by-byte. 👍 for being more thorough than me (I checked with a much bigger haystack, but I didn't check different needle sizes 🤦).
I just retested with a longer needle on the M1 Mac, and it does indeed make a big difference. In fact, the memcmp version appears to be quicker with a larger needle.

I am guessing this is because it's doing something Boyer-Moore-Horspool-esque and is spending its time building something semantically equivalent to a bad-match table, the overhead of which might be dominating. That would explain why larger needles can be faster. I plan on rerunning this on x86_64 Windows and Linux (docker), but let's assume it'll be consistent with the above for now.

Which leads me to this: should the kind of check used branch off depending on the slice size? Is there something we can do here to get good performance in both cases? If we got this right, simple slice comparisons being sped up by 2x in some cases could be a big win for Rust; fast by default. I suppose the problem would be working out what value to branch at, and whether a sudden performance change when a slice-size threshold is crossed is avoidable / acceptable.
I don't think this makes much sense. Isn't that an algorithm for substring searching? The only part that's out of your control here is the memcmp loop; that's just a single loop comparing memory. Any funky search algorithm would have to be happening in your Rust code, in lieu of the window iteration you're doing. My guess as to why it's faster is that it starts using larger integer comparisons after a certain point. There's a very interesting blog post from Microsoft on how the Windows memset uses similar "breakpoints" to choose when to go byte-by-byte and when to use larger integer writes: https://msrc-blog.microsoft.com/2021/01/11/building-faster-amd64-memset-routines/
Are you suggesting that slice equality should avoid calling a memcmp-like function when the slice size is found to be less than some threshold at runtime (it almost certainly can't know the length at compile time), and instead do the byte-by-byte comparison inline in that case? I suppose it's possible that that results in a performance gain, but it's unclear to me what the overhead of the extra branch would be. And, as you said, determining what that breakpoint should be will be tricky. I think these are all reasons why one should use builtins like memcmp, bcmp, or whatever: people far smarter than us have been thinking about this exact problem for far longer. All of that said, I'm not really qualified to speak with confidence on this topic, so take it with a pinch of salt.
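As a rough sketch of the idea under discussion (the function name and the threshold value are my own assumptions, the 16 merely echoing the needle-size breakpoint observed above):

```rust
/// Assumed value, not a measured optimum.
const SMALL_THRESHOLD: usize = 16;

/// Hypothetical dispatch: inline byte loop for short slices,
/// deferring to the optimized memcmp/bcmp path for longer ones.
fn slice_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    if a.len() < SMALL_THRESHOLD {
        // Short slices: a simple loop the compiler can fully inline,
        // avoiding the overhead of a library call.
        a.iter().zip(b).all(|(x, y)| x == y)
    } else {
        // Long slices: `==` on slices lowers to a memcmp/bcmp call,
        // which wins once there is enough data to amortise the call.
        a == b
    }
}
```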
I mean, this is what we already see, isn't it? As soon as the slice size exceeds a threshold, the memcmp version pulls ahead.
Oops, yeah you're right, I have simply been staring at string comparisons for too long.
Yes. Or, at least, it's something worth exploring.
I agree generally speaking, but here we're seeing a clear case where it's a lot slower.
This is a distinct possibility indeed.
You make a very good point.
It might be worth me graphing how memcmp behaves over different inputs to verify this.

Appendix

I reran the tests on Windows and on Linux (docker) just to be sure. Same findings as on the M1 Mac.

Windows:
Docker:
That would be interesting to see!
All of the information in this issue can be found and reproduced in the linked repo.
The context for finding this problem is this: I was experimenting with string-searching algorithms as a way of learning Rust, but hit a big performance regression when refactoring some code.
The code with the performance issue is a naive string search. It can be reproduced with different implementations, but I feel this one expresses things most concisely:
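The original snippet isn't preserved in this extract; the following is a minimal sketch of what a manual search of this shape looks like (the body is my assumption, with names matching the benchmarks above):

```rust
/// Returns the index of the first window of `haystack` equal to `needle`,
/// comparing byte by byte in an explicit loop.
fn naive_search_manual(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|candidate| {
        for i in 0..needle.len() {
            if candidate[i] != needle[i] {
                return false;
            }
        }
        true
    })
}
```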
Let's call the above the "manual search". The compiler correctly recognises that `i` will never be out of bounds, since the window size is `needle`'s length and therefore `candidate` is always the same size as the needle. All is well so far. However, this is just doing simple equality testing ... why not just check for equality?
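Again a sketch under the same assumptions, this time using slice equality directly:

```rust
/// The same search, but letting `==` on slices do the comparison;
/// this lowers to a call to memcmp (or bcmp on some targets).
fn naive_search_equality(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack
        .windows(needle.len())
        .position(|candidate| candidate == needle)
}
```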
Much cleaner. However ... the performance is atrocious compared with the prior implementation. What's going on?
After running a profiler on it, I saw some calls out to VC++ libraries in the slow version that were not present in the fast version. Thinking there might be some performance hazard in the Rust FFI hopping back and forth so frequently, I tried statically linking things. The VC++ libraries disappeared from the profile, but the performance issue did not. All I saw was a huge amount of time spent in `memcmp`.

After talking to some people on the Rust Language Community Discord server (https://discord.com/invite/rust-lang-community), I thought it might be a Windows issue and did some investigation. I came across #90360, which looked similar to what I was seeing. However, just to be sure, I checked on other platforms.
It reproduced on Linux (inside a docker instance). Assuming this was some quirk of docker on Windows, I tried it on my M1 Mac. It reproduced again. This does not appear to be restricted to Windows at all.
The rustc version used is the same on each platform (except the host field, of course).
The hosts are:
The disassembly for each can be found in the repo linked at the start of this issue. The docker variant doesn't use `memcmp`, but does use `bcmp`.
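One detail worth noting about the `memcmp`/`bcmp` split (background fact, not something established in this thread): `bcmp` only has to report equal versus not-equal, whereas `memcmp` must return an ordering, so LLVM lowers pure equality checks like slice `==` to `bcmp` on targets where it is available. That would explain why the docker (Linux) disassembly shows `bcmp` while the other platforms show `memcmp`.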