
Bad optimisation of slice equality across platforms and architectures. #92215

Open
OMGtechy opened this issue Dec 22, 2021 · 10 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@OMGtechy

OMGtechy commented Dec 22, 2021

All of the information in this issue can be found and reproduced here.

The context for finding this problem is this: I was experimenting with string searching algorithms as a way of learning Rust, but when refactoring some code, hit a big performance regression.

The code having performance issues is a naive string search. It can be reproduced with different implementations, but I feel this one expresses things most concisely:

pub fn naive_search_manual(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0usize);
    }

    haystack.windows(needle.len()).position(|candidate| {
        for i in 0..candidate.len() {
            if needle[i] != candidate[i] {
                return false;
            }
        }

        return true;
    })
}

Let's call the above the "manual search." The compiler correctly recognises that i will never be out of bounds, since the window size is needle's length and therefore candidate is always the same size as the needle. All is well so far. However, this is just doing simple equality testing ... why not just check for equality?

pub fn naive_search_equality(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0usize);
    }

    haystack.windows(needle.len()).position(|candidate| candidate == needle)
}

Much cleaner. However ...

test bench_equality ... bench:         341 ns/iter (+/- 140)
test bench_manual   ... bench:         175 ns/iter (+/- 11)

The performance is atrocious compared with the prior implementation. What's going on?

After running a profiler on it, I saw some calls out to VC++ libraries in the slow version that were not present in the fast version. Thinking there might be some performance hazard in the FFI hopping back and forth so frequently, I tried statically linking things. The VC++ libraries disappeared from the profile, but the performance issue did not. All I saw was a huge amount of time spent in memcmp.

After talking to some people on the Rust Language Community Discord server (https://discord.com/invite/rust-lang-community) - I thought it might be a Windows issue and did some investigation. I came across #90360, which looked similar to what I was seeing. However, just to be sure, I checked on other platforms.

It reproduced on Linux (inside a docker instance). Assuming this was some quirk of docker on Windows, I tried it on my M1 Mac. It reproduced again. This does not appear to be restricted to Windows at all.

The rustc version used is the same on each platform (except the host, of course)

rustc 1.57.0 (f1edd0429 2021-11-29)
binary: rustc
commit-hash: f1edd0429582dd29cccacaf50fd134b05593bd9c
commit-date: 2021-11-29
host: [ ... snip ... ]
release: 1.57.0
LLVM version: 13.0.0

The hosts are:

host: x86_64-pc-windows-msvc
host: x86_64-unknown-linux-gnu
host: aarch64-apple-darwin

The disassembly for each can be found in the repo linked at the start of this issue. The docker variant doesn't use memcmp, but does use bcmp.

@OMGtechy OMGtechy added the C-bug Category: This is a bug. label Dec 22, 2021
@RipleyTom

RipleyTom commented Dec 22, 2021

On linux it seems to call bcmp and the direct equality is also slower.

Result of the bench of a Ryzen 3950X:

test bench_equality ... bench:         303 ns/iter (+/- 3)
test bench_manual   ... bench:         200 ns/iter (+/- 5)

I also tested this on much more sizeable randomized data and it was consistent.

@OMGtechy
Author

OMGtechy commented Dec 23, 2021

Mentioned to @RipleyTom that a few people on related issues mentioned Ryzen CPUs (they have now updated their post). I am on a 3900X, and someone in the Windows thread reported theirs to be a 3900X too. These are all high-end 3rd gen Ryzen CPUs with many cores, known for being sensitive to memory performance.

The M1 test goes some way towards generalising the issue, but could just be getting lucky, so I asked someone with an i3-3220 to run cargo bench. They got less extreme but similar results (equality = 489ns/iter, manual = 311ns/iter).

So, given the M1 and i3-3220 results above, it would seem this isn't entirely Ryzen-specific at the very least.

@OMGtechy
Author

OMGtechy commented Dec 23, 2021

I tried to write an equivalent test for C++, using g++ and clang++, and got similar results.

g++

-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_memcmp         249 ns          249 ns      2830428
BM_manual         185 ns          185 ns      3585112

clang++

-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_memcmp         256 ns          256 ns      2732908
BM_manual         152 ns          152 ns      4549339

If the g++ results were wildly different from clang++'s, I'd point the finger at LLVM. However, g++ seems to have similar issues. In all the slower cases, we have a memcmp or bcmp. Sure, memcmp does a little more than an equality check (it reports ordering too), but since this is an equality operator for a slice ... we don't need that extra work. We could take this extra performance and roll with it.

Or perhaps I'm missing something and this is really not a fair comparison.

@BGR360
Contributor

BGR360 commented Dec 25, 2021

In all of the asm outputs you provided, the manual version compiles with no calls, whereas the equality version calls a memcmp-like function.

Could it be that simple? The overhead of all the calls compared to inlining the comparisons? The slices you have in your benchmark are rather small, so it seems possible that the call overhead is dominating.

@rustbot label -C-bug +I-slow

@rustbot rustbot added I-slow Issue: Problems and improvements with respect to performance of generated code. and removed C-bug Category: This is a bug. labels Dec 25, 2021
@BGR360
Contributor

BGR360 commented Dec 25, 2021

Yeah, as soon as I make the needle and haystack a little bigger, the memcmp version leaves the other one in the dust:

const BIG: usize = 1_000;
const SMALL: usize = 100;

fn make() -> (Vec<u8>, Vec<u8>) {
    let mut haystack = Vec::with_capacity(BIG);
    for _ in 0..BIG - 1 {
        haystack.push(0);
    }
    haystack.push(1);

    let mut needle = Vec::with_capacity(SMALL);
    for _ in 0..SMALL - 1 {
        needle.push(0);
    }
    needle.push(1);

    (haystack, needle)
}

fn bench_equality(bencher: &mut Bencher) {
    let (haystack, needle) = make();
    bencher.iter(|| naive_search_equality(&haystack[..], &needle[..]));
}

fn bench_manual(bencher: &mut Bencher) {
    let (haystack, needle) = make();
    bencher.iter(|| naive_search_manual(&haystack[..], &needle[..]));
}

test bench_equality ... bench:       5,649 ns/iter (+/- 215)
test bench_manual   ... bench:      28,809 ns/iter (+/- 331)

On my M1 mac, the breakpoint where the memcmp one starts to win is when the needle is 16 elements:

Haystack = 150 elements:
Needle = 15 elements:

test bench_equality ... bench:       1,034 ns/iter (+/- 24)
test bench_manual   ... bench:         734 ns/iter (+/- 17)

Haystack = 150 elements:
Needle = 16 elements:

test bench_equality ... bench:         221 ns/iter (+/- 6)
test bench_manual   ... bench:         760 ns/iter (+/- 3)

Probably because the memcmp function starts to use some big integer comparisons or something.
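If that guess is right, the effect can be illustrated with a hand-rolled sketch. To be clear, this is purely illustrative and not how any real libc memcmp is written; it just shows the "wider comparisons" idea, comparing 8-byte blocks as single u64 loads with a byte-wise fallback for the tail:

```rust
use std::convert::TryInto;

// Illustrative sketch only: compares 8 bytes at a time as u64 values,
// then falls back to byte-by-byte for the remaining tail.
fn eq_chunked(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let ca = a.chunks_exact(8);
    let cb = b.chunks_exact(8);
    // Grab the tails up front; lengths are equal, so the tails match in size.
    let (ta, tb) = (ca.remainder(), cb.remainder());
    // Compare 8-byte blocks as single u64 comparisons.
    for (x, y) in ca.zip(cb) {
        let wx = u64::from_ne_bytes(x.try_into().unwrap());
        let wy = u64::from_ne_bytes(y.try_into().unwrap());
        if wx != wy {
            return false;
        }
    }
    // Byte-by-byte for the final < 8 bytes.
    ta.iter().eq(tb.iter())
}
```

The point is that once slices are long enough, most of the work happens in the u64 loop, which touches 8 bytes per comparison instead of 1.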

@RipleyTom

Yes, it makes sense; the extra cost of memcmp probably comes from the call plus the size checks to see if it can use some optimizations.

If it can, it's much better than byte-by-byte.

👍 for being more thorough than me (I checked with a much bigger haystack but I didn't check with different needle sizes 🤦).

@OMGtechy
Author

OMGtechy commented Dec 26, 2021

I just retested with a longer needle on M1 Mac, and it does indeed make a big difference. In fact, the memcmp version appears to be quicker with a larger needle.

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s


running 4 tests
test bench_equality_long_match  ... bench:         301 ns/iter (+/- 8)
test bench_equality_short_match ... bench:         496 ns/iter (+/- 14)
test bench_manual_long_match    ... bench:       1,474 ns/iter (+/- 32)
test bench_manual_short_match   ... bench:         152 ns/iter (+/- 6)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

I am guessing this is because it's doing something Boyer-Moore-Horspool-esque and is spending its time building something semantically equivalent to a bad match table - the overhead of which might be dominating. That would explain why larger needles can be faster.

I plan on rerunning this on x86_64 Windows and Linux (docker), but let's assume for now that it'll be consistent with the above. Which leads me to this: should the kind of check used branch off depending on the slice size? Is there something we can do here to get good performance in both cases?

If we got this right, simple slice comparisons being sped up by 2x in some cases could be a big win for Rust; fast by default. I suppose the problem would be working out what value to branch on, and thinking about whether a sudden performance change when a slice size threshold is crossed is avoidable / OK.

@BGR360
Contributor

BGR360 commented Dec 26, 2021

I am guessing this is because it's doing something boyer-moore-horspool-esk and is spending its time building something semantically equivalent to a bad match table

I don't think this makes much sense. Isn't that an algorithm for substring searching? The only part that's out of your control here is the memcmp loop; that's just doing a single loop comparing memory. Any funky search algorithm would have to be happening in your Rust code, in lieu of the window iteration you're doing.

My guess as to why it's faster is that it starts using larger integer comparisons after a certain point. There's a very interesting blog post from Microsoft about how the Windows memset routines use similar "breakpoints" to choose when to go byte-by-byte and when to use larger integers: https://msrc-blog.microsoft.com/2021/01/11/building-faster-amd64-memset-routines/

should we have the kind of check used branch off depending on the slice size

Are you suggesting that slice equality should avoid performing a call to a memcmp-like function when the slice size is found to be less than some threshold at runtime (it almost certainly can't know the length at compile time)? And to instead do byte-by-byte comparison inline in that case?

I suppose it's possible that that results in a performance gain. But it's unclear to me what the overhead of the extra branch would be. And also as you said, determining what that breakpoint should be will be tricky.
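For concreteness, the idea being floated could be sketched like this. The threshold of 16 is lifted from the M1 measurements above and is purely illustrative; real core-library code would need to pick (and justify) it far more carefully:

```rust
// Illustrative only: length-gated slice equality. The threshold comes
// from the M1 numbers in this thread and is not a tuned value.
const INLINE_THRESHOLD: usize = 16;

fn slice_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    if a.len() < INLINE_THRESHOLD {
        // Short slices: compare byte-by-byte inline, avoiding call overhead.
        return a.iter().zip(b.iter()).all(|(x, y)| x == y);
    }
    // Longer slices: fall back to the built-in comparison,
    // which lowers to a memcmp/bcmp call.
    a == b
}
```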

I think these are all reasons why one should use builtins like memcmp, bcmp, or whatever. Because people far smarter than us have been thinking about this exact problem for way longer.

All of that said, I'm not really all that qualified to speak with confidence on this topic, so take it with a pinch of salt.

whether a sudden performance change due when a slice size threshold is met is avoidable / OK.

I mean, this is what we already see isn't it? As soon as the slice size exceeds a threshold, the performance of memcmp gets faster.

@OMGtechy
Author

I don't think this makes much sense. Isn't that an algorithm for substring searching?

Oops, yeah you're right, I have simply been staring at string comparisons for too long.

Are you suggesting that slice equality should avoid performing a call to a memcmp-like function when the slice size is found to be less than some threshold at runtime (it almost certainly can't know the length at compile time)? And to instead do byte-by-byte comparison inline in that case?

Yes. Or, at least, it's something worth exploring.

I think these are all reasons why one should use builtins like memcmp, bcmp, or whatever. Because people far smarter than us have been thinking about this exact problem for way longer.

I agree generally speaking, but here we're seeing a clear case where it's a lot slower.

But it's unclear to me what the overhead of the extra branch would be.

This is a distinct possibility indeed.

I mean, this is what we already see isn't it? As soon as the slice size exceeds a threshold, the performance of memcmp gets faster.

You make a very good point.

My guess as to why it's faster is that it starts using larger integer comparisons after a certain point.

It might be worth me graphing how memcmp behaves over different inputs to verify this.
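A crude sketch of that experiment might look like the following. A real run should use cargo bench or criterion rather than hand-rolled timing; std::hint::black_box (stable since Rust 1.66; on the toolchain in this thread, test::black_box serves the same purpose) just stops the comparison from being optimised away:

```rust
use std::time::{Duration, Instant};

// Hand-rolled timing sketch; cargo bench / criterion would give far more
// reliable numbers, this only shows the shape of the experiment.
fn measure(len: usize, iters: u64) -> (Duration, u64) {
    let a = vec![0u8; len];
    let b = vec![0u8; len];
    let mut hits = 0u64;
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the compiler from hoisting or deleting the comparison.
        if std::hint::black_box(&a[..]) == std::hint::black_box(&b[..]) {
            hits += 1;
        }
    }
    (start.elapsed(), hits)
}

fn main() {
    // Sweep slice lengths around the suspected breakpoints.
    for len in [1usize, 4, 8, 15, 16, 32, 64, 128, 256] {
        let (elapsed, _) = measure(len, 1_000_000);
        println!("len {:>4}: {:?}", len, elapsed);
    }
}
```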


Appendix

I reran the tests on Windows and on Linux (docker) just to be sure. Same findings as M1 Mac.

Windows:

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s


running 4 tests
test bench_equality_long_match  ... bench:         261 ns/iter (+/- 19)
test bench_equality_short_match ... bench:         373 ns/iter (+/- 9)
test bench_manual_long_match    ... bench:       2,027 ns/iter (+/- 80)
test bench_manual_short_match   ... bench:         209 ns/iter (+/- 17)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

Docker:

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s


running 4 tests
test bench_equality_long_match  ... bench:         169 ns/iter (+/- 8)
test bench_equality_short_match ... bench:         329 ns/iter (+/- 33)
test bench_manual_long_match    ... bench:       2,611 ns/iter (+/- 586)
test bench_manual_short_match   ... bench:         235 ns/iter (+/- 19)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

@BGR360
Contributor

BGR360 commented Dec 26, 2021

It might be worth me graphing how memcmp behaves over different inputs to verify this.

That would be interesting to see!
