Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

`str::contains` is very slow for substring checks #25483

Closed
netvl opened this Issue May 16, 2015 · 12 comments

Comments

Projects
None yet
9 participants
@netvl
Copy link
Contributor

netvl commented May 16, 2015

Originally found in SO question.

This Rust code

fn main(){

    let needle   = (0..100).map(|_| "b").collect::<String>();
    let heystack = (0..100_000).map(|_| "a").collect::<String>();

    println!("Data ready.");

    for _ in 0..1_000_000 {
        if heystack.contains( &needle ) {
            // Stuff...
        }
    }

}

Executes for up to three minutes on my machine (even with -C opt-level=3), while the equivalent Ruby or Python code:

needle   = 'b' * 100
heystack = 'a' * 100_000

puts 'Data ready.'

1_000_000.times do
    heystack.include? needle
end

finishes in several seconds. Java variant finishes almost instantly (but that seems to be due to JIT optimizations).

It looks like that pattern searching for substrings could be severely optimized.

@bluss

This comment has been minimized.

Copy link
Contributor

bluss commented May 16, 2015

StrSearcher has a lot of panic branches that can probably avoided (slicing and unwrapping that will never happen).

@ArtemGr

This comment has been minimized.

Copy link
Contributor

ArtemGr commented May 16, 2015

I wonder wether LLVM can move the comparison out of the loop and if not, why?

@huonw huonw added the I-slow label May 16, 2015

@huonw

This comment has been minimized.

Copy link
Member

huonw commented May 16, 2015

I think this is fundamentally algorithmic problem, since StrSearcher is performing a naive string search at the moment. Eliminating unnecessary panics etc should probably wait until that problem is resolved (I could be wrong).

@bluss

This comment has been minimized.

Copy link
Contributor

bluss commented May 16, 2015

Oh wow, I thought that was something that we had already fixed.

@gereeter

This comment has been minimized.

Copy link
Contributor

gereeter commented May 16, 2015

The algorithmic problem was fixed, but this fix got disabled in the switch to the pattern API. The two-way search code is still present as dead code.

@DanielKeep

This comment has been minimized.

Copy link
Contributor

DanielKeep commented May 17, 2015

For comparison, I just did a straight translation of a C implementation of BMH. Results for the test in the SO question were:

test tests::boyermoore_horspool_memmem::case_tasos  ... bench:      5241 ns/iter (+/- 222)
test tests::naive::case_tasos                       ... bench:   1567384 ns/iter (+/- 64776)
test tests::std_str_find::case_tasos                ... bench:    864885 ns/iter (+/- 45622)

Where naive is just a 100% naive char-by-char search.

@alexcrichton alexcrichton added the T-libs label May 28, 2015

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented May 28, 2015

triage: I-nominated

@bluss

This comment has been minimized.

Copy link
Contributor

bluss commented May 28, 2015

@DanielKeep Can it be adapted to the pattern API?

@DanielKeep

This comment has been minimized.

Copy link
Contributor

DanielKeep commented May 29, 2015

@bluss I don't see why not; the signature is just (&str, &str) -> Option<usize>, but given that there's already a (not being used) higher-performance searcher in the stdlib, it'd probably make more sense to restore that first and then consider further improvements independently. So long as what's there isn't pathologically bad, it'll still be a considerable improvement. :)

(Not that my BMH implementation is likely to be an improvement: it was literally just copy, pasted, and translated from the C reference code. Hell, it uses... indexing. I was in a hurry. :P)

@Kimundi

This comment has been minimized.

Copy link
Member

Kimundi commented May 29, 2015

Yeah, after implementing the pattern API I haven't found the time to reintegrate the old fast searcher code yet. Part of the issue why I haven't done it directly is that the fast string searcher only supports single ended search, while with the Pattern API you'd want double ended search.

One way to resolve this is to just hack the naive and the fast search together into a hybrid that uses the fast search for forward iteration (which is all that happened before anyway), and the naive for backwards iteration, but due to me not fully understanding the fast algorithm I was reluctant to do that as part of the original patch.

Wouldn't harm to just do this now and see if any bugs appear though...

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Jun 1, 2015

triage: P-medium

@rust-highfive rust-highfive added P-medium and removed I-nominated labels Jun 1, 2015

@bluss

This comment has been minimized.

Copy link
Contributor

bluss commented Jun 14, 2015

This is allegedly a very interesting improvement upon our two-way search algorithm. http://drops.dagstuhl.de/opus/volltexte/2011/3355/

bluss pushed a commit to bluss/rust that referenced this issue Jun 15, 2015

Ulrik Sverdrup
Update StrSearch to use the Two Way searcher
In an effort to restore our substring search performance, revive the two
way searcher and adapt it to the Pattern API.

Fixes rust-lang#25483, a performance bug: That particular case now performs
as well in optimized rust as the ruby example.

No documentation was found on the reverse searching cas for the two way
searcher, so I have tried to do the best job possible. I had quickcheck
to fuzz test forward and reverse searching thoroughly.

I want to lift a restriction in the Pattern API that Rejects must have
indices on character boundaries, which isn't very simple to support. Any
correct matching on utf-8 text will of course have Match indices only
on character boundaries.

The two way searcher algorithm has very small, constant space overhead,
requiring no allocation. Our implementation is relatively fast,
especially due to the `byteset` addition to the algorithm, which speeds
up many no-match cases.

A bad case for the two way algorithm is:

```
let haystack = (0..10_000).map(|_| "dac").collect::<String>();
let needle = (0..100).map(|_| "bac").collect::<String>());
```

For this particular case, two way is not faster than the naive
implementation it replaces.

bluss pushed a commit to bluss/rust that referenced this issue Jun 15, 2015

Ulrik Sverdrup
core: Update StrSearch to use the Two Way searcher
In an effort to restore our substring search performance, revive the two
way searcher and adapt it to the Pattern API.

Fixes rust-lang#25483, a performance bug: That particular case now performs
as well in optimized rust as the ruby example.

No documentation was found on the reverse searching cas for the two way
searcher, so I have tried to do the best job possible. I had quickcheck
to fuzz test forward and reverse searching thoroughly.

I want to lift a restriction in the Pattern API that Rejects must have
indices on character boundaries, which isn't very simple to support. Any
correct matching on utf-8 text will of course have Match indices only
on character boundaries.

The two way searcher algorithm has very small, constant space overhead,
requiring no allocation. Our implementation is relatively fast,
especially due to the `byteset` addition to the algorithm, which speeds
up many no-match cases.

A bad case for the two way algorithm is:

```
let haystack = (0..10_000).map(|_| "dac").collect::<String>();
let needle = (0..100).map(|_| "bac").collect::<String>());
```

For this particular case, two way is not faster than the naive
implementation it replaces.

bluss pushed a commit to bluss/rust that referenced this issue Jun 15, 2015

Ulrik Sverdrup
core: Update StrSearch to use the Two Way searcher
In improve our substring search performance, revive the two way searcher
and adapt it to the Pattern API.

Fixes rust-lang#25483, a performance bug: That particular case now performs
as well in optimized rust as the ruby example.

No documentation was found on the reverse searching cas for the two way
searcher, so I have tried to do the best job possible. I had quickcheck
to fuzz test forward and reverse searching thoroughly.

The two way searcher implements both forward and reverse search,
but not double ended search at the moment. The forward and reverse parts
of the two way searcher are completely independent. This can be changed
later (note before for stabilization of StrSearcher itself).

I want to lift a restriction in the Pattern API that Rejects must have
indices on character boundaries, which isn't very simple to support. Any
correct matching on utf-8 text will of course have Match indices only
on character boundaries.

The two way searcher algorithm has very small, constant space overhead,
requiring no allocation. Our implementation is relatively fast,
especially due to the `byteset` addition to the algorithm, which speeds
up many no-match cases.

A bad case for the two way algorithm is:

```
let haystack = (0..10_000).map(|_| "dac").collect::<String>();
let needle = (0..100).map(|_| "bac").collect::<String>());
```

For this particular case, two way is not faster than the naive
implementation it replaces.

bluss pushed a commit to bluss/rust that referenced this issue Jun 15, 2015

Ulrik Sverdrup
core: Update StrSearch to use the Two Way searcher
To improve our substring search performance, revive the two way searcher
and adapt it to the Pattern API.

Fixes rust-lang#25483, a performance bug: That particular case now performs
as well in optimized rust as the ruby example.

No documentation was found on the reverse searching cas for the two way
searcher, so I have tried to do the best job possible. I had quickcheck
to fuzz test forward and reverse searching thoroughly.

The two way searcher implements both forward and reverse search,
but not double ended search at the moment. The forward and reverse parts
of the two way searcher are completely independent. This can be changed
later (note before for stabilization of StrSearcher itself).

I want to lift a restriction in the Pattern API that Rejects must have
indices on character boundaries, which isn't very simple to support. Any
correct matching on utf-8 text will of course have Match indices only
on character boundaries.

The two way searcher algorithm has very small, constant space overhead,
requiring no allocation. Our implementation is relatively fast,
especially due to the `byteset` addition to the algorithm, which speeds
up many no-match cases.

A bad case for the two way algorithm is:

```
let haystack = (0..10_000).map(|_| "dac").collect::<String>();
let needle = (0..100).map(|_| "bac").collect::<String>());
```

For this particular case, two way is not faster than the naive
implementation it replaces.

bluss pushed a commit to bluss/rust that referenced this issue Jun 15, 2015

Ulrik Sverdrup
core: Update StrSearch to use the Two Way searcher
To improve our substring search performance, revive the two way searcher
and adapt it to the Pattern API.

Fixes rust-lang#25483, a performance bug: That particular case now performs
as well in optimized rust as the ruby example.

No documentation was found on the reverse searching cas for the two way
searcher, so I have tried to do the best job possible. I had quickcheck
to fuzz test forward and reverse searching thoroughly.

The two way searcher implements both forward and reverse search,
but not double ended search at the moment. The forward and reverse parts
of the two way searcher are completely independent. This can be changed
later (note before for stabilization of StrSearcher itself).

I want to lift a restriction in the Pattern API that Rejects must have
indices on character boundaries, which isn't very simple to support. Any
correct matching on utf-8 text will of course have Match indices only
on character boundaries.

The two way searcher algorithm has very small, constant space overhead,
requiring no allocation. Our implementation is relatively fast,
especially due to the `byteset` addition to the algorithm, which speeds
up many no-match cases.

A bad case for the two way algorithm is:

```
let haystack = (0..10_000).map(|_| "dac").collect::<String>();
let needle = (0..100).map(|_| "bac").collect::<String>());
```

For this particular case, two way is not faster than the naive
implementation it replaces.

bluss pushed a commit to bluss/rust that referenced this issue Jun 16, 2015

Ulrik Sverdrup
core: Update StrSearcher to use the Two Way searcher
To improve our substring search performance, revive the two way searcher
and adapt it to the Pattern API.

Fixes rust-lang#25483, a performance bug: that particular case now completes faster
in optimized rust than in ruby (but they share the same order of magnitude).

No documentation was found on the reverse searching case for the two way
searcher, so I did my best. I had quickcheck to fuzz test forward and reverse
searching thoroughly.

The two way searcher implements both forward and reverse search,
but not double ended search. The forward and reverse parts of the two
way searcher are completely independent.

This commit changes and lifts a restriction in the Pattern API that Rejects
must have indices on character boundaries, which would have some extra
cost to support. Any correct matching on utf-8 text will of course have Match
indices only on character boundaries.

The two way searcher algorithm has very small, constant space overhead,
requiring no dynamic allocation. Our implementation is relatively fast,
especially due to the `byteset` addition to the algorithm, which speeds
up many no-match cases.

A bad case for the two way algorithm is:

```
let haystack = (0..10_000).map(|_| "dac").collect::<String>();
let needle = (0..100).map(|_| "bac").collect::<String>());
```

For this particular case, two way is not much faster than the naive
implementation it replaces.

bluss pushed a commit to bluss/rust that referenced this issue Jun 21, 2015

Ulrik Sverdrup
StrSearcher: Update substring search to use the Two Way algorithm
To improve our substring search performance, revive the two way searcher
and adapt it to the Pattern API.

Fixes rust-lang#25483, a performance bug: that particular case now completes faster
in optimized rust than in ruby (but they share the same order of magnitude).

Much thanks to @gereeter who helped me understand the reverse case
better and wrote the comment explaining `next_back` in the code.

I had quickcheck to fuzz test forward and reverse searching thoroughly.

The two way searcher implements both forward and reverse search,
but not double ended search. The forward and reverse parts of the two
way searcher are completely independent.

The two way searcher algorithm has very small, constant space overhead,
requiring no dynamic allocation. Our implementation is relatively fast,
especially due to the `byteset` addition to the algorithm, which speeds
up many no-match cases.

A bad case for the two way algorithm is:

```
let haystack = (0..10_000).map(|_| "dac").collect::<String>();
let needle = (0..100).map(|_| "bac").collect::<String>());
```

For this particular case, two way is not much faster than the naive
implementation it replaces.

bors added a commit that referenced this issue Jun 30, 2015

Auto merge of #26327 - bluss:two-way, r=aturon
Update substring search to use the Two Way algorithm

To improve our substring search performance, revive the two way searcher
and adapt it to the Pattern API.

Fixes #25483, a performance bug: that particular case now completes faster
in optimized rust than in ruby (but they share the same order of magnitude).

Many thanks to @gereeter who helped me understand the reverse case
better and wrote the comment explaining `next_back` in the code.

I had quickcheck to fuzz test forward and reverse searching thoroughly.

The two way searcher implements both forward and reverse search,
but not double ended search. The forward and reverse parts of the two
way searcher are completely independent.

The two way searcher algorithm has very small, constant space overhead,
requiring no dynamic allocation. Our implementation is relatively fast,
especially due to the `byteset` addition to the algorithm, which speeds
up many no-match cases.

A bad case for the two way algorithm is:

```
let haystack = (0..10_000).map(|_| "dac").collect::<String>();
let needle = (0..100).map(|_| "bac").collect::<String>());
```

For this particular case, two way is not much faster than the naive
implementation it replaces.

@bors bors closed this in #26327 Jun 30, 2015

jroesch added a commit to jroesch/rust that referenced this issue Jul 21, 2015

StrSearcher: Update substring search to use the Two Way algorithm
To improve our substring search performance, revive the two way searcher
and adapt it to the Pattern API.

Fixes rust-lang#25483, a performance bug: that particular case now completes faster
in optimized rust than in ruby (but they share the same order of magnitude).

Much thanks to @gereeter who helped me understand the reverse case
better and wrote the comment explaining `next_back` in the code.

I had quickcheck to fuzz test forward and reverse searching thoroughly.

The two way searcher implements both forward and reverse search,
but not double ended search. The forward and reverse parts of the two
way searcher are completely independent.

The two way searcher algorithm has very small, constant space overhead,
requiring no dynamic allocation. Our implementation is relatively fast,
especially due to the `byteset` addition to the algorithm, which speeds
up many no-match cases.

A bad case for the two way algorithm is:

```
let haystack = (0..10_000).map(|_| "dac").collect::<String>();
let needle = (0..100).map(|_| "bac").collect::<String>());
```

For this particular case, two way is not much faster than the naive
implementation it replaces.

thepowersgang added a commit to thepowersgang/rust that referenced this issue Jul 25, 2015

StrSearcher: Update substring search to use the Two Way algorithm
To improve our substring search performance, revive the two way searcher
and adapt it to the Pattern API.

Fixes rust-lang#25483, a performance bug: that particular case now completes faster
in optimized rust than in ruby (but they share the same order of magnitude).

Much thanks to @gereeter who helped me understand the reverse case
better and wrote the comment explaining `next_back` in the code.

I had quickcheck to fuzz test forward and reverse searching thoroughly.

The two way searcher implements both forward and reverse search,
but not double ended search. The forward and reverse parts of the two
way searcher are completely independent.

The two way searcher algorithm has very small, constant space overhead,
requiring no dynamic allocation. Our implementation is relatively fast,
especially due to the `byteset` addition to the algorithm, which speeds
up many no-match cases.

A bad case for the two way algorithm is:

```
let haystack = (0..10_000).map(|_| "dac").collect::<String>();
let needle = (0..100).map(|_| "bac").collect::<String>());
```

For this particular case, two way is not much faster than the naive
implementation it replaces.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.