
x86_64 SSE2 fast-path for str.contains(&str) and short needles #103779

Merged (5 commits) on Nov 17, 2022

Conversation

@the8472 (Member) commented Oct 30, 2022

Based on Wojciech Muła's [SIMD-friendly algorithms for substring searching](http://0x80.pl/articles/simd-strfind.html#sse-avx2)

The two-way algorithm is Big-O efficient but it needs to preprocess the needle
to find a "critical factorization" of it. This additional work is significant
for short needles. Additionally it mostly advances needle.len() bytes at a time.

The SIMD-based approach used here, on the other hand, can advance based on its
vector width, which can exceed the needle length (except in pathological cases;
and since it is limited to small needles, the worst-case blowup is also small).

Benchmarks taken on a Zen2, compiled with `-Ccodegen-units=1`:

OLD:
test str::bench_contains_16b_in_long                     ... bench:         504 ns/iter (+/- 14) = 5061 MB/s
test str::bench_contains_2b_repeated_long                ... bench:         948 ns/iter (+/- 175) = 2690 MB/s
test str::bench_contains_32b_in_long                     ... bench:         445 ns/iter (+/- 6) = 5732 MB/s
test str::bench_contains_bad_naive                       ... bench:         130 ns/iter (+/- 1) = 569 MB/s
test str::bench_contains_bad_simd                        ... bench:          84 ns/iter (+/- 8) = 880 MB/s
test str::bench_contains_equal                           ... bench:         142 ns/iter (+/- 7) = 394 MB/s
test str::bench_contains_short_long                      ... bench:         677 ns/iter (+/- 25) = 3768 MB/s
test str::bench_contains_short_short                     ... bench:          27 ns/iter (+/- 2) = 2074 MB/s

NEW:
test str::bench_contains_16b_in_long                     ... bench:          82 ns/iter (+/- 0) = 31109 MB/s
test str::bench_contains_2b_repeated_long                ... bench:          73 ns/iter (+/- 0) = 34945 MB/s
test str::bench_contains_32b_in_long                     ... bench:          71 ns/iter (+/- 1) = 35929 MB/s
test str::bench_contains_bad_naive                       ... bench:           7 ns/iter (+/- 0) = 10571 MB/s
test str::bench_contains_bad_simd                        ... bench:          97 ns/iter (+/- 41) = 762 MB/s
test str::bench_contains_equal                           ... bench:           4 ns/iter (+/- 0) = 14000 MB/s
test str::bench_contains_short_long                      ... bench:          73 ns/iter (+/- 0) = 34945 MB/s
test str::bench_contains_short_short                     ... bench:          12 ns/iter (+/- 0) = 4666 MB/s
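
To make the scope concrete, here is a rough sketch of when the fast path applies, pieced together from the review discussion below (the exact conditions live in library/core/src/str/pattern.rs and may differ in detail):

```rust
fn would_use_sse2_fast_path(haystack: &[u8], needle: &[u8]) -> bool {
    cfg!(all(target_arch = "x86_64", target_feature = "sse2"))
        && (2..=8).contains(&needle.len()) // 1-byte needles go through memchr instead
        && haystack.len() >= needle.len() + 16
}
```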

@rustbot (Collaborator) commented Oct 30, 2022

r? @m-ou-se

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 30, 2022

@the8472 (Member, Author) commented Oct 30, 2022

@bors try @rust-timer queue

@rust-timer (Collaborator)

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 30, 2022
@bors (Contributor) commented Oct 30, 2022

⌛ Trying commit f23374740b495d91f8a1efa767129fe5f34a521a with merge 00a98801b5deff4884bae13cf558c2f7e3beb9ca...

@bors (Contributor) commented Oct 30, 2022

☀️ Try build successful - checks-actions
Build commit: 00a98801b5deff4884bae13cf558c2f7e3beb9ca

@rust-timer (Collaborator)

Queued 00a98801b5deff4884bae13cf558c2f7e3beb9ca with parent f42b6fa, future comparison URL.

@rust-timer (Collaborator)

Finished benchmarking commit (00a98801b5deff4884bae13cf558c2f7e3beb9ca): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

|                             | mean  | range          | count |
|-----------------------------|-------|----------------|-------|
| Regressions ❌ (primary)     | -     | -              | 0     |
| Regressions ❌ (secondary)   | -     | -              | 0     |
| Improvements ✅ (primary)    | -0.8% | [-1.8%, -0.3%] | 9     |
| Improvements ✅ (secondary)  | -2.0% | [-3.9%, -0.3%] | 12    |
| All ❌✅ (primary)            | -0.8% | [-1.8%, -0.3%] | 9     |

Max RSS (memory usage)


This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean | range        | count |
|-----------------------------|------|--------------|-------|
| Regressions ❌ (primary)     | 0.2% | [0.2%, 0.2%] | 1     |
| Regressions ❌ (secondary)   | 1.8% | [1.8%, 1.8%] | 1     |
| Improvements ✅ (primary)    | -    | -            | 0     |
| Improvements ✅ (secondary)  | -    | -            | 0     |
| All ❌✅ (primary)            | 0.2% | [0.2%, 0.2%] | 1     |

Cycles

This benchmark run did not return any relevant results for this metric.

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 31, 2022
@thomcc (Member) commented Oct 31, 2022

Hmm, is this the first libcore function with architecture-specific SIMD intrinsics? At one point I know this was kinda discouraged. I think part of the reason was miri (which we could check with #[cfg(not(miri))]) and cranelift (which IDK if we can check) (CC @RalfJung, @bjorn3).

This could be ported to core::simd fairly easily (it doesn't use anything we're missing or unhappy with), but the heavy use of movemsk means we should make sure it doesn't regress perf on non-x86 (and continue gating on it if it does), since that operation has codegen issues on many other targets even at the best of times [1].

On the bright side, this would remove most of the unsafe!

Footnotes

  1. The core::simd equivalent to _mm_movemask_epi8 (mask8x16::any) has awful scalarized codegen on some non-x86 targets. Moreover, this code actually uses the resulting bitmask for more than something like a test against 0 (which can be done more efficiently), which means it would probably need more changes to keep the hot part of the loop from getting scalarized.

@thomcc (Member) commented Oct 31, 2022

I had kind of a hard time following some parts of this (mainly the index arithmetic). To understand it, I wrote out what the algorithm is doing. I think this has slightly less handwaving than the linked page, but that may be personal taste. For my own reference and others', I've put it here.

(If you notice any cases where I have misunderstood something important, let me know; there are a few parts where I've made intentional simplifications to avoid getting into tedious details, though.)


The SIMD-optimized code seems to apply to needles of 2..=8 bytes (0 bytes is impossible, 1 byte uses memchr), if the haystack is sufficiently big (haystack.len() must be greater than or equal to needle.len() + 16). If these conditions are met, we do the following (a code sketch of these steps follows the list):

  1. Take the first (index = 0) and last (index = needle.len() - 1) bytes of the needle, and make two vectors by splatting each into a 16-byte vector.

    That is first_bytes = u8x16::from([needle[0]; 16]) and last_bytes = u8x16::from([needle[needle.len() - 1]; 16]) (NB. the code uses first and last rather than the _bytes suffix).

    Using the first and last bytes allows some minor optimizations and simplifies some logic, but the indexes get used elsewhere and (IMO) are very important to understanding why this code works.

  2. Iterate over the haystack in steps of 16, and load two 16 byte vectors at each step, one at the start of the chunk, and one offset from that by needle.len() - 1, stopping before this would go out of bounds (see step 5).

    That is, chunk_a = u8x16::from_slice(&haystack[i+0..][..16]) and chunk_b = u8x16::from_slice(&haystack[i+needle.len()-1..][..16]) (NB: the code uses a and b rather than a chunk_ prefix), where i is the offset to load from (e.g. the iteration count * 16).

    Note: The + 0 and + needle.len() - 1 are the same as the indices inside needle from which we took the bytes in step 1.

  3. Do packed equality tests for chunk_a vs first_bytes and chunk_b vs last_bytes to produce two masks, eq_first and eq_last. These are bitanded together, and we have a potential match if the result is non-zero.

    The goal of that is to test if there are any indices in the vectors where a match has been hit in the same index of both comparisons.

    Because the chunk_ vectors are offset by an amount equal to the index used to get the byte which built each _bytes vector, a test being true for simd_eq(chunk_a, first_bytes) and simd_eq(chunk_b, last_bytes) at some index means that at least two bytes of the needle are correct in haystack starting at chunk_offset + {index that matched}.

  4. If we've found a potential match, then (to oversimplify a bit) we go over the indexes that are true in both eq_first and eq_last, and check for a match at that index using the equivalent of needle == &haystack[i+that_index..][..needle.len()].

    Small note: The code actually avoids comparing the first and last indices of needle here, since effectively, that's what we tested in step 3 to get eq_first and eq_last. (It also uses the bitmask from movemsk, rather than the full mask, but this detail is not that important)

  5. Our main loop in step 2 has a conservative end bound, so to ensure there are no trailing bytes we didn't check, we perform steps 2-4 once more, but use an offset of haystack.len() - (needle.len() - 1) - 16 instead of i.

    This offset is chosen so that the slice loaded into chunk_b is exactly at the end of the haystack, ensuring there's no slop, and that we've tested the whole input.
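
To make the index arithmetic concrete, here is a minimal, self-contained sketch of the loop described in steps 1-5. It uses the same SSE2 intrinsics as the PR but follows the walkthrough's names and simplifies the bounds handling, so treat it as an illustration rather than the actual patch:

```rust
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn sse2_contains(haystack: &[u8], needle: &[u8]) -> bool {
    use core::arch::x86_64::*;

    // Preconditions from the discussion above: short needle, haystack big enough
    // for at least one 16-byte probe past the needle.
    debug_assert!((2..=8).contains(&needle.len()));
    debug_assert!(haystack.len() >= needle.len() + 16);

    let last_idx = needle.len() - 1;
    // Step 1: splat the first and last needle bytes into 16-byte vectors.
    let first_bytes = unsafe { _mm_set1_epi8(needle[0] as i8) };
    let last_bytes = unsafe { _mm_set1_epi8(needle[last_idx] as i8) };

    // Steps 2-4 for a single probe starting at haystack offset `i`.
    let probe = |i: usize| -> bool {
        unsafe {
            // Step 2: two loads, offset from each other by `last_idx`.
            let chunk_a = _mm_loadu_si128(haystack.as_ptr().add(i).cast());
            let chunk_b = _mm_loadu_si128(haystack.as_ptr().add(i + last_idx).cast());
            // Step 3: a candidate is any lane where both bytes match.
            let eq_first = _mm_cmpeq_epi8(chunk_a, first_bytes);
            let eq_last = _mm_cmpeq_epi8(chunk_b, last_bytes);
            let mut mask = _mm_movemask_epi8(_mm_and_si128(eq_first, eq_last)) as u32;
            // Step 4: verify every candidate with a full comparison.
            while mask != 0 {
                let bit = mask.trailing_zeros() as usize;
                if &haystack[i + bit..i + bit + needle.len()] == needle {
                    return true;
                }
                mask &= mask - 1; // clear the lowest set candidate bit
            }
        }
        false
    };

    // Step 2: main loop in steps of 16, stopping while both loads stay in bounds.
    let mut i = 0;
    while i + last_idx + 16 <= haystack.len() {
        if probe(i) {
            return true;
        }
        i += 16;
    }
    // Step 5: one final, possibly overlapping, probe aligned to the end of the haystack.
    probe(haystack.len() - last_idx - 16)
}
```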


I think some parts of this are clear in the code, but others aren't. The index arithmetic in particular isn't really clear (IMO), nor is the relationship between the byte indexes in the search vectors and the offsets into the chunks (which is important for understanding why this works).

The index arithmetic seems to be avoidable with iterators, which would clear a lot of this up (and might also help perf by avoiding bounds checks?).

I suppose haystack.array_chunks::<16>().zip(haystack[needle.len()-1..].array_chunks::<16>()) is close to what the main iteration wants to be doing (although it's not an exact match), but haystack.windows(needle.len() + 16).step_by(16) is closer to what it currently does, so maybe less disruptive (also maybe more optimizable? You'd probably have a better idea about that). A rough sketch of that shape follows below.

(Along those lines, I also find the use of the term chunk a little misleading, since it implies the iteration would be more like haystack.array_chunks::<16>(). I don't have a real alternative to suggest though, and I certainly used the term above, so whatever.)
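
A rough sketch of the windows + step_by shape suggested above (illustrative only; the step-5 tail probe would still be needed separately, and the SIMD comparison is elided):

```rust
fn probe_offsets(haystack: &[u8], needle: &[u8]) {
    for (idx, window) in haystack.windows(needle.len() + 16).step_by(16).enumerate() {
        let offset = idx * 16; // start of this probe within the haystack
        let chunk_a = &window[..16];
        let chunk_b = &window[needle.len() - 1..][..16];
        // ... packed comparisons as in steps 2-4 ...
        let _ = (offset, chunk_a, chunk_b);
    }
}
```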


(I think I'm invested enough to want to take this review at this point...)

r? @thomcc

@rustbot rustbot assigned thomcc and unassigned m-ou-se Oct 31, 2022
@thomcc (Member) left a review comment

This is great (and has great numbers), but I'd like to see this get a couple (big-ish) changes. #103779 (comment) has some rationale for these.

  1. Use core::simd instead of manual core::arch. I think this is needed for miri and non-LLVM codegen backends, but even if it is not, getting rid of the need for safety comments like // SAFETY: no preconditions other than sse2 being available on trivial stuff is kinda worth it IMO. (It's also arguably easier to maintain, and safe.)

    I'd probably say still keep the cfg(all(target_arch = "x86_64", target_feature = "sse2")) for now -- we probably don't want to start using core::simd on every target without careful thought, and I suspect even aarch64 is likely to have sub-par codegen for this under core::simd (even without the LLVM issue, the algorithm would need some tweaks on targets without a movemsk-style instruction).

  2. I think using iterators would make the index arithmetic and iteration logic a lot more clear. As mentioned, I had to spend a bit of time figuring out what it was doing, and this would have helped some (maybe a lot) of that.

  3. I think I'd also like some (inline) comments to explain what it's doing at a high level (no need to explain how simd works ofc, but the actual algorithm is kinda galaxybrain IMO, so some notes would be nice). Feel free to punt on this til after the others, since they might make it so that a link is enough.

Feel free to push back on these if you disagree (for example, iterators are def not the only way to make this clearer)
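
For illustration, a single probe step under core::simd might look roughly like the following. The portable-SIMD API is unstable, and names such as simd_eq and to_bitmask reflect the API around the time of this PR, so treat this as a sketch rather than working nightly code:

```rust
#![feature(portable_simd)]
use std::simd::{u8x16, SimdPartialEq, ToBitMask};

// Returns true if `needle` occurs at one of the 16 positions starting at `offset`.
fn probe(haystack: &[u8], needle: &[u8], offset: usize) -> bool {
    let first = u8x16::splat(needle[0]);
    let last = u8x16::splat(needle[needle.len() - 1]);
    // No unsafe needed: the loads are bounds-checked slice reads.
    let chunk_a = u8x16::from_slice(&haystack[offset..][..16]);
    let chunk_b = u8x16::from_slice(&haystack[offset + needle.len() - 1..][..16]);
    let mut mask = (chunk_a.simd_eq(first) & chunk_b.simd_eq(last)).to_bitmask();
    while mask != 0 {
        let bit = mask.trailing_zeros() as usize;
        if &haystack[offset + bit..][..needle.len()] == needle {
            return true;
        }
        mask &= mask - 1;
    }
    false
}
```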

library/core/src/str/pattern.rs: 3 review threads (all resolved, 2 marked outdated)
@thomcc (Member) commented Oct 31, 2022

CC @BurntSushi (who has talked about this from time to time before)

@the8472 (Member, Author) commented Oct 31, 2022

I had kind of a hard time following some parts of this (mainly the index arithmetic).

That's not just you. This went through several rounds of fixing out of bounds and integer overflow errors and lots of debug-printing to get the pieces aligned 😅

@RalfJung (Member)

Hmm, is this the first libcore function with architecture-specific SIMD intrinsics?

Seems like it is. At least Miri doesn't implement the vast majority of them and so far that has not caused errors from stdlib functions, to my knowledge.

@marmeladema (Contributor)

Not sure if you have seen it, but we already maintain a crate for this algorithm: https://github.com/cloudflare/sliceslice-rs

@the8472 (Member, Author) commented Oct 31, 2022

Thanks, I wasn't aware. I may steal some tweaks from there. Although the AVX parts wouldn't be useful in core unless we start shipping some x86-64-v3 toolchain.

@marmeladema (Contributor)

If that makes it easier for rustc, we could split the sse2 / SSE 4.2 versions out of the avx2 one.
Anyway, if you'd rather use an external crate for this and sliceslice fits your needs, we are open to discussing changes to make integration easier.

@bjorn3 (Member) commented Oct 31, 2022

@thomcc cg_clif should implement all intrinsics used in this PR, however they are implemented using scalar emulation. With the exception of _mm_movemask_epi8 all used vendor intrinsics are implemented using simd_* platform intrinsics or even without any simd related intrinsics at all. _mm_movemask_epi8 is implemented using llvm.x86.sse2.pmovmskb.128 which cg_clif implements for supporting the memchr crate used by regex. cg_clif lacks simd support for a lot of float operations (or implements it incorrectly) as well as more exotic simd operations for which there are no platform independent intrinsics.

@bjorn3 (Member) commented Oct 31, 2022

Hmm, is this the first libcore function with architecture-specific SIMD intrinsics? At one point I know this was kinda discouraged.

The main issue is that libcore is almost never recompiled and as such would never benefit from the SIMD at all. However SSE2 is a mandatory feature for x86_64, so using SSE2 SIMD on x86_64 should be fine I think.

@BurntSushi (Member)

memchr::memmem also implements a variant of this approach and AIUI memchr is already a dependency of std: https://github.com/BurntSushi/memchr/blob/8037d11b4357b0f07be2bb66dc2659d9cf28ad32/Cargo.toml#L34-L36 (Although I don't know if that means it can be used as a dependency of core.)

In practical terms, I think this is a great addition and will make a lot of common cases faster. So 👍 from me. I do agree with @thomcc though that using core::simd might be wiser here, if only as a way of dog-fooding.

One concern I do have here is whether this is a correct optimization or not. My understanding was that even though SSE2 is part of x86_64, the operating system might still choose to disable it because SSE2 requires support for the xmm registers. What that means is that, in my understanding, to be completely correct, you still have to do a CPUID check in order to use SSE2 on x86_64. But... I am not sure about this, and even if it is true, it may be irrelevant for all OS'es that we support? Dunno, but maybe something to look into.

Bigger picture, I'd like to point out that the memchr::memmem implementation kind of already did all of this substring litigation while maintaining std's time complexity guarantees. Two-Way is its path of last resort. And it uses almost exactly the algorithm implemented in this PR to accelerate the vast majority of substring searches on x86_64, even with needles bigger than 8 bytes. It also has special cases for optimizing latency in the presence of small haystacks. The only thing I can think of that would be needed to make it suitable for std is some internal rejiggering to make it so the same searcher can scan both forwards and backwards. (memchr::memmem supports reverse searching, but you have to specify that at construction time. And once the searcher is built, you can only use it for reverse searching. Similarly for forward searching. But I did this only because it's simpler, not because there is some fundamental reason why it cannot be done otherwise.)
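
For reference, a quick illustration of the memmem API described above, assuming the memchr crate as a dependency (this is the external crate, not anything currently in core):

```rust
use memchr::memmem;

fn main() {
    let haystack = b"substring searching with SIMD prefilters";
    // One-shot forward search:
    assert_eq!(memmem::find(haystack, b"SIMD"), Some(25));
    // A Finder precomputes the forward searcher; FinderRev is the separate
    // reverse-search construction mentioned above.
    let fwd = memmem::Finder::new("SIMD");
    let rev = memmem::FinderRev::new("SIMD");
    assert_eq!(fwd.find(haystack), rev.rfind(haystack));
}
```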

Now whether it makes sense to depend on an external crate for such a core part of std, I'm not sure about that. We could bring memchr into the regex repo to make it a bit more official if that helps. Or maybe it makes sense to just take pieces from memchr and re-implement them in core. But there's a lot there.

In terms of the algorithm choice, memchr::memmem does effectively the same thing with one small tweak: it recognizes the choice of first and last as arbitrary, and instead tries to choose the prefilter bytes based on which bytes are thought to be less common. It's obviously a heuristic, but it works surprisingly well and is IMO better than just always selecting the first and last bytes.
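
A hypothetical sketch of that "pick rarer bytes" idea (the real selection logic in memchr::memmem is more involved; the rank table here, where lower means rarer, is a stand-in the caller would have to supply):

```rust
fn pick_prefilter_indices(needle: &[u8], rank: &[u8; 256]) -> (usize, usize) {
    // Start from the PR's choice (first and last byte) and keep the two rarest seen.
    let (mut rare1, mut rare2) = (0, needle.len() - 1);
    for (i, &b) in needle.iter().enumerate() {
        if rank[b as usize] < rank[needle[rare1] as usize] {
            rare2 = rare1;
            rare1 = i;
        } else if i != rare1 && rank[b as usize] < rank[needle[rare2] as usize] {
            rare2 = i;
        }
    }
    (rare1, rare2)
}
```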

Another algorithm choice would be to use the SIMD routine as a prefilter for Two-Way in the case of longer needles. But that is a more invasive change I think. (It is what memchr::memmem does.)

@bjorn3 (Member) commented Oct 31, 2022

https://github.com/BurntSushi/memchr/blob/8037d11b4357b0f07be2bb66dc2659d9cf28ad32/Cargo.toml#L34-L36 (Although I don't know if that means it can be used as a dependency of core.)

Nothing can be a dependency of core as everything depends on core. (technically #![no_core] exists, but you have to define a lot of lang items to do anything practical with it. Those definitions will conflict with the definitions in core)

One concern I do have here is whether this is a correct optimization or not. My understanding was that even though SSE2 is part of x86_64, the operating system might still choose to disable it because SSE2 requires support for the xmm registers. What that means is that, in my understanding, to be completely correct, you still have to do a CPUID check in order to use SSE2 on x86_64. But... I am not sure about this, and even if it is true, it may be irrelevant for all OS'es that we support? Dunno, but maybe something to look into.

LLVM already uses SSE2 everywhere (including for passing float function arguments as per the x86_64 sysv abi) without runtime checks, as the target specs for almost all x86_64 targets enable SSE2. If an OS disables SSE2 it will have to be disabled in the target spec, at which point the SSE2 fast path in this PR will simply be compiled out.
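
In other words, the gate is a compile-time cfg, so targets whose spec disables SSE2 never compile the fast path; a minimal illustration (function names here are hypothetical, not the ones used in core):

```rust
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn str_contains_impl(haystack: &[u8], needle: &[u8]) -> bool {
    // ... SSE2 prefilter path ...
    let _ = (haystack, needle);
    unimplemented!()
}

#[cfg(not(all(target_arch = "x86_64", target_feature = "sse2")))]
fn str_contains_impl(haystack: &[u8], needle: &[u8]) -> bool {
    // ... Two-Way fallback ...
    let _ = (haystack, needle);
    unimplemented!()
}
```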

@bors (Contributor) commented Nov 15, 2022

⌛ Trying commit f29ca60f231c22fc341a90ba17ea69cf45b385ca with merge b926bd75207e52f30bfdaec8a2efdaae590de6b0...

@bors (Contributor) commented Nov 15, 2022

☀️ Try build successful - checks-actions
Build commit: b926bd75207e52f30bfdaec8a2efdaae590de6b0


@rust-timer (Collaborator)

Finished benchmarking commit (b926bd75207e52f30bfdaec8a2efdaae590de6b0): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

|                             | mean  | range          | count |
|-----------------------------|-------|----------------|-------|
| Regressions ❌ (primary)     | 0.4%  | [0.4%, 0.4%]   | 3     |
| Regressions ❌ (secondary)   | 1.5%  | [1.5%, 1.5%]   | 1     |
| Improvements ✅ (primary)    | -     | -              | 0     |
| Improvements ✅ (secondary)  | -0.8% | [-1.0%, -0.5%] | 2     |
| All ❌✅ (primary)            | 0.4%  | [0.4%, 0.4%]   | 3     |

Max RSS (memory usage)


This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean  | range          | count |
|-----------------------------|-------|----------------|-------|
| Regressions ❌ (primary)     | 2.4%  | [0.1%, 7.0%]   | 8     |
| Regressions ❌ (secondary)   | 3.2%  | [0.9%, 8.5%]   | 6     |
| Improvements ✅ (primary)    | -     | -              | 0     |
| Improvements ✅ (secondary)  | -1.1% | [-1.1%, -1.1%] | 1     |
| All ❌✅ (primary)            | 2.4%  | [0.1%, 7.0%]   | 8     |

Cycles


This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean  | range          | count |
|-----------------------------|-------|----------------|-------|
| Regressions ❌ (primary)     | -     | -              | 0     |
| Regressions ❌ (secondary)   | -     | -              | 0     |
| Improvements ✅ (primary)    | -2.2% | [-2.2%, -2.2%] | 1     |
| Improvements ✅ (secondary)  | -     | -              | 0     |
| All ❌✅ (primary)            | -2.2% | [-2.2%, -2.2%] | 1     |

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Nov 15, 2022
@the8472 (Member, Author) commented Nov 15, 2022

@bors r=thomcc rollup=never

The slight compile-time regressions are probably unavoidable since this adds code paths that are dispatched at runtime. Doc benchmarks are still mostly green, but now below the noise threshold (compared to the initial benchmark). I think the microbenchmarks justify this change, and we can tweak things around the edges later.

@bors (Contributor) commented Nov 15, 2022

📌 Commit a2b2010 has been approved by thomcc

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 15, 2022
@the8472 the8472 added the perf-regression-triaged The performance regression has been triaged. label Nov 15, 2022
@Manishearth (Member)

@bors p=1

going to close the tree for non-nevers for a while so they can drain out

@bors (Contributor) commented Nov 17, 2022

⌛ Testing commit a2b2010 with merge 9340e5c...

@bors (Contributor) commented Nov 17, 2022

☀️ Test successful - checks-actions
Approved by: thomcc
Pushing 9340e5c to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Nov 17, 2022
@bors bors merged commit 9340e5c into rust-lang:master Nov 17, 2022
@rustbot rustbot added this to the 1.67.0 milestone Nov 17, 2022
@rust-timer (Collaborator)

Finished benchmarking commit (9340e5c): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

|                             | mean  | range          | count |
|-----------------------------|-------|----------------|-------|
| Regressions ❌ (primary)     | 0.4%  | [0.3%, 0.6%]   | 4     |
| Regressions ❌ (secondary)   | -     | -              | 0     |
| Improvements ✅ (primary)    | -0.4% | [-0.5%, -0.3%] | 5     |
| Improvements ✅ (secondary)  | -1.0% | [-1.3%, -0.6%] | 4     |
| All ❌✅ (primary)            | -0.1% | [-0.5%, 0.6%]  | 9     |

Max RSS (memory usage)


This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean  | range          | count |
|-----------------------------|-------|----------------|-------|
| Regressions ❌ (primary)     | 3.4%  | [0.1%, 6.9%]   | 4     |
| Regressions ❌ (secondary)   | 2.5%  | [2.5%, 2.5%]   | 1     |
| Improvements ✅ (primary)    | -2.1% | [-2.1%, -2.1%] | 1     |
| Improvements ✅ (secondary)  | -     | -              | 0     |
| All ❌✅ (primary)            | 2.3%  | [-2.1%, 6.9%]  | 5     |

Cycles


This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean  | range          | count |
|-----------------------------|-------|----------------|-------|
| Regressions ❌ (primary)     | 1.8%  | [1.6%, 2.1%]   | 3     |
| Regressions ❌ (secondary)   | 1.8%  | [1.2%, 2.4%]   | 2     |
| Improvements ✅ (primary)    | -     | -              | 0     |
| Improvements ✅ (secondary)  | -     | -              | 0     |
| All ❌✅ (primary)            | 1.8%  | [1.6%, 2.1%]   | 3     |

@nnethercote (Contributor)

A few wins, a few losses, it roughly balances out.

@rustbot label: +perf-regression-triaged

pietroalbini added a commit to ferrocene/rust that referenced this pull request Nov 22, 2022
…=thomcc"

The current implementation seems to be unsound. See rust-lang#104726.
@pietroalbini pietroalbini mentioned this pull request Nov 22, 2022
@the8472 the8472 mentioned this pull request Nov 22, 2022
bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 22, 2022
Simd contains fix

Fixes rust-lang#104726

The bug was introduced by an improvement late in the original PR (rust-lang#103779) which added the backtracking when the last and first byte of the needle were the same. That changed the meaning of the variable for the last probe offset, which I should have split into the last byte offset and last probe offset. Not doing so led to incorrect loop conditions.
Aaron1011 pushed a commit to Aaron1011/rust that referenced this pull request Jan 6, 2023
x86_64 SSE2 fast-path for str.contains(&str) and short needles

(The commit message repeats the PR description and benchmarks quoted at the top of this page.)
Labels
merged-by-bors This PR was explicitly merged by bors. perf-regression Performance regression. perf-regression-triaged The performance regression has been triaged. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.