Use memchr for str::find(char) #46735

Merged
bors merged 14 commits into rust-lang:master from Manishearth:memchr-find on Jan 1, 2018

8 participants
@Manishearth
Member

Manishearth commented Dec 14, 2017

This is a 10x improvement for searching for characters.

This also contains the patches from #46713 . Feel free to land both separately or together.

cc @mystor @alexcrichton

r? @bluss

fixes #46693

Member

Manishearth commented Dec 14, 2017

I haven't really tested this much, so there are probably failures. I'll do a second pass at self-review once I know we pass all tests (from Travis).

Member

Manishearth commented Dec 14, 2017

The memchr crate is even faster because it links to glibc's memchr, which uses SIMD and other fancy stuff. libcore can't link to this so to get these wins we'll have to do a SIMD impl ourselves.

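#![feature(test)] // nightly-only: enables the built-in bench harness
extern crate memchr;
extern crate test;

use test::Bencher;

#[bench]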
fn find_char(b: &mut Bencher) {
    let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
    b.iter(|| test::black_box(x.find('/')));
}


#[bench]
fn find_char_memchr(b: &mut Bencher) {
    let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
    b.iter(|| test::black_box(memchr::memchr(b'/', x.as_bytes())));
}

Before:

running 2 tests
test find_char        ... bench:         593 ns/iter (+/- 201)
test find_char_memchr ... bench:           9 ns/iter (+/- 1)

After:

running 2 tests
test find_char        ... bench:          57 ns/iter (+/- 12)
test find_char_memchr ... bench:           9 ns/iter (+/- 1)
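
The remark above that libcore cannot link to glibc's memchr raises the question of what a portable replacement might look like. The following is only a minimal sketch of one well-known technique, word-at-a-time (SWAR) scanning; the names `contains_zero_byte` and `memchr_swar` are chosen here for illustration, and this is not claimed to be the code in this PR or in libcore.

```rust
/// Classic SWAR test: true if any of the eight bytes in `x` is zero.
fn contains_zero_byte(x: u64) -> bool {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;
    (x.wrapping_sub(LO) & !x & HI) != 0
}

/// Hypothetical word-at-a-time byte search (illustrative only).
fn memchr_swar(needle: u8, haystack: &[u8]) -> Option<usize> {
    // Broadcast the needle into every byte lane of a 64-bit word.
    let repeated = u64::from(needle) * 0x0101_0101_0101_0101;
    let mut offset = 0;
    for chunk in haystack.chunks_exact(8) {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        // XOR turns any lane equal to the needle into a zero byte.
        if contains_zero_byte(u64::from_ne_bytes(buf) ^ repeated) {
            break;
        }
        offset += 8;
    }
    // Pinpoint the match (or scan the short tail) one byte at a time.
    haystack[offset..]
        .iter()
        .position(|&b| b == needle)
        .map(|i| offset + i)
}
```

Calling `memchr_swar(b'/', x.as_bytes())` would stand in for `memchr::memchr(b'/', x.as_bytes())` in the benchmark above; no timing claims are made for this sketch.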
Member

Manishearth commented Dec 14, 2017

This does not bring improvements for multibyte chars or for str::find(str). We can bring improvement for these, but it's tricky.

For str::find(str), when the needle starts with an ASCII char we can do something similar to what this PR does (and then use the original algorithm to finish the match).

When the thing we're searching for is not ASCII we can still search for the first byte. However, for most UTF-8 text the first byte will generally be pretty uniform; e.g. in Arabic text it will usually be 0xD8 or 0xD9, in Korean it will be 0xEA, 0xEB, 0xEC, or 0xED, in Devanagari it is usually 0xE0, etc. This means that memchr will have lots of false positives: we'll get lots of hits on the first byte and then have to check the second byte. This amount of stutter will probably make memchr's (minor) fixed overhead significant and destroy any perf gains we might get.
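
The uniformity of leading bytes can be checked directly. This is a small sketch; `leading_bytes` is a helper name made up for this illustration, and the sample words are arbitrary.

```rust
use std::collections::BTreeSet;

/// Collect the distinct first UTF-8 bytes of all non-ASCII chars in `text`.
fn leading_bytes(text: &str) -> BTreeSet<u8> {
    text.chars()
        .filter(|c| !c.is_ascii())
        .map(|c| {
            let mut buf = [0u8; 4];
            // The first byte of the encoding is the UTF-8 leading byte.
            c.encode_utf8(&mut buf).as_bytes()[0]
        })
        .collect()
}
```

For the Arabic word "سلام" this yields only 0xD8 and 0xD9, and for the Devanagari word "जलद" only 0xE0, so a memchr on the first byte would stop at every character of such text.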

Searching for the second byte or even better, the last byte, might work better. But I'm not sure if I want to write that code right now, and the tradeoffs are a bit trickier there :)
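
A rough sketch of the last-byte idea, assuming a plain byte scan as a stand-in for memchr; `find_char_by_last_byte` is a hypothetical name and this is not the code that landed in the PR.

```rust
/// Find `needle` in `haystack` by scanning for the LAST byte of its UTF-8
/// encoding and then verifying the bytes that precede each hit.
fn find_char_by_last_byte(haystack: &str, needle: char) -> Option<usize> {
    let mut buf = [0u8; 4];
    let encoded = needle.encode_utf8(&mut buf).as_bytes();
    let last = encoded[encoded.len() - 1];
    let bytes = haystack.as_bytes();
    let mut from = 0;
    // `position` stands in for a real memchr call here.
    while let Some(i) = bytes[from..].iter().position(|&b| b == last) {
        let end = from + i + 1; // one past the candidate's last byte
        if end >= encoded.len() && &bytes[end - encoded.len()..end] == encoded {
            return Some(end - encoded.len());
        }
        from = end; // false positive: resume after this hit
    }
    None
}
```

Because UTF-8 is self-synchronizing, a full match of the encoded bytes inside a valid `&str` always starts on a char boundary, so the returned index can be compared directly with `haystack.find(needle)`.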

@Manishearth Manishearth force-pushed the Manishearth:memchr-find branch from 93216f1 to f865164 Dec 16, 2017

@Manishearth Manishearth force-pushed the Manishearth:memchr-find branch from 876e2a1 to 75c07a3 Dec 18, 2017

Member

Manishearth commented Dec 18, 2017

Bench numbers do not materially change with the UTF-8 changes. I did come up with a pathological case of searching a Devanagari string for ä (which shares its last byte, 0xA4, with many Devanagari characters) that ends up being more than 2x slower (620 ns vs 1,672 ns in the numbers below) because every other character is a false positive hit (entirely negating memchr's win).

I think this pathological case is ok; it will only arise when mixing languages, and only for very specific characters.

I can check some form of these benchmarks into tree if y'all feel it necessary.

$ cargo bench
test find_char                            ... bench:         603 ns/iter (+/- 203)
test find_char_memchr                     ... bench:          10 ns/iter (+/- 3)
test find_multibyte_char_found            ... bench:         376 ns/iter (+/- 67)
test find_multibyte_char_notfound         ... bench:         618 ns/iter (+/- 129)
test find_multibyte_string_multibyte_char ... bench:         719 ns/iter (+/- 137)
test find_multibyte_string_pathological   ... bench:         620 ns/iter (+/- 98)

$ cargo +x-stage2 bench
test find_char                            ... bench:          67 ns/iter (+/- 45)
test find_char_memchr                     ... bench:          10 ns/iter (+/- 1)
test find_multibyte_char_found            ... bench:          50 ns/iter (+/- 12)
test find_multibyte_char_notfound         ... bench:          74 ns/iter (+/- 20)
test find_multibyte_string_multibyte_char ... bench:          74 ns/iter (+/- 20)
test find_multibyte_string_pathological   ... bench:       1,672 ns/iter (+/- 348)

Code:

#[bench]
fn find_char(b: &mut Bencher) {
    let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
    b.iter(|| test::black_box(x.find('/')));
}

#[bench]
fn find_char_memchr(b: &mut Bencher) {
    let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
    b.iter(|| test::black_box(memchr::memchr(b'/', x.as_bytes())));
}

#[bench]
fn find_multibyte_char_found(b: &mut Bencher) {
    let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, ก remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
    b.iter(|| test::black_box(x.find('ก')));
}

#[bench]
fn find_multibyte_char_notfound(b: &mut Bencher) {
    let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
    b.iter(|| test::black_box(x.find('ก')));
}

#[bench]
fn find_multibyte_string_multibyte_char(b: &mut Bencher) {
    let x = test::black_box("जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली");
    b.iter(|| test::black_box(x.find('ग'))); // not in the string
}

#[bench]
fn find_multibyte_string_pathological(b: &mut Bencher) {
    let x = test::black_box("जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली");
    b.iter(|| test::black_box(x.find('ä'))); // ä's last byte is found often in Devanagari text
}
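
To make the pathological case concrete, the sketch below counts how often the last byte of a needle's encoding occurs in a haystack. `last_byte_hits` is a helper name invented for this illustration; it only counts candidates and does no searching.

```rust
/// Count bytes in `haystack` equal to the last UTF-8 byte of `needle`.
/// Each such byte is a position where a last-byte memchr would stop.
fn last_byte_hits(haystack: &str, needle: char) -> usize {
    let mut buf = [0u8; 4];
    let encoded = needle.encode_utf8(&mut buf).as_bytes();
    let last = encoded[encoded.len() - 1];
    haystack.bytes().filter(|&b| b == last).count()
}
```

'ä' encodes as 0xC3 0xA4, while 'ज', 'ल' and 'द' each encode as 0xE0 0xA4 followed by a third byte. So `last_byte_hits("जलद", 'ä')` is 3 even though the string contains no 'ä', whereas `last_byte_hits("जलद", 'ग')` is 0.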
Member

Manishearth commented Dec 18, 2017

If we really care about the pathological case it can be avoided by having some check in the loop that after X false positives falls back to regular "loop on next" behavior.

I don't think we should, though.

We could also write some monster SSE-enabled memchr that can search for up to 4 byte units. I'm not doing that.
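
For reference, the fallback idea mentioned above (give up on the byte scan after X false positives) could look roughly like this. It is only a sketch: the threshold `MAX_FALSE_POSITIVES` and the function name are made up here, a plain byte scan stands in for memchr, and nothing like this is part of the PR.

```rust
/// Hypothetical budget of false positives before giving up on the byte scan.
const MAX_FALSE_POSITIVES: usize = 8;

fn find_char_with_fallback(haystack: &str, needle: char) -> Option<usize> {
    let mut buf = [0u8; 4];
    let encoded = needle.encode_utf8(&mut buf).as_bytes();
    let last = encoded[encoded.len() - 1];
    let bytes = haystack.as_bytes();
    let (mut from, mut misses) = (0, 0);
    // `position` stands in for a real memchr call here.
    while let Some(i) = bytes[from..].iter().position(|&b| b == last) {
        let end = from + i + 1;
        if end >= encoded.len() && &bytes[end - encoded.len()..end] == encoded {
            return Some(end - encoded.len());
        }
        from = end;
        misses += 1;
        if misses >= MAX_FALSE_POSITIVES {
            // Back up to a char boundary so the char containing `from`
            // is re-examined rather than skipped.
            let mut start = from;
            while !haystack.is_char_boundary(start) {
                start -= 1;
            }
            // Plain char-by-char loop over the rest of the haystack.
            return haystack[start..]
                .char_indices()
                .find(|&(_, c)| c == needle)
                .map(|(i, _)| start + i);
        }
    }
    None
}
```

The design question is where to set the threshold: too low and ordinary text loses the memchr win, too high and the pathological case still pays for many wasted stops.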

Member

Manishearth commented Dec 21, 2017

Contributor

bors commented Dec 21, 2017

⌛️ Trying commit 9b92a44 with merge afb0c20...

bors added a commit that referenced this pull request Dec 21, 2017

Auto merge of #46735 - Manishearth:memchr-find, r=<try>
Use memchr for str::find(char)

Contributor

bors commented Dec 21, 2017

☀️ Test successful - status-travis
State: approved= try=True

Member

Manishearth commented Dec 21, 2017

@Manishearth Manishearth force-pushed the Manishearth:memchr-find branch from b6f2d90 to 85919a0 Dec 25, 2017

Contributor

nagisa commented Jan 1, 2018

Member

BurntSushi commented Jan 1, 2018

> When the thing we're searching for is not ASCII we can still search for the first byte. However, for most UTF-8 text the first byte will generally be pretty uniform; e.g. in Arabic text it will usually be 0xD8 or 0xD9, in Korean it will be 0xEA, 0xEB, 0xEC, or 0xED, in Devanagari it is usually 0xE0, etc. This means that memchr will have lots of false positives: we'll get lots of hits on the first byte and then have to check the second byte. This amount of stutter will probably make memchr's (minor) fixed overhead significant and destroy any perf gains we might get.
>
> Searching for the second byte or even better, the last byte, might work better. But I'm not sure if I want to write that code right now, and the tradeoffs are a bit trickier there :)

Searching for the last byte is indeed a better heuristic on UTF-8 than searching for the first byte. You'd be in good company (GNU grep does that). But the last byte is still arbitrary. This is why the regex crate ranks every byte in order of what it believes is rare. Leading UTF-8 bytes are considered common while trailing bytes aren't. But you also get things like "z is rarer than a," which it commonly is. So the memchr is applied to the rarest byte in the pattern. Of course, you still wind up with pathological cases when the frequency rank doesn't match the corpus, but this will always be true when using memchr without analyzing the haystack beforehand (which obviously doesn't make sense in this specific domain of text search). That code is here: https://github.com/rust-lang/regex/blob/9c790659c4e83e3497c6f2d14a818b3a69654d5f/src/literals.rs#L379-L514

(To be clear, I think the frequency rank stuff is probably overkill for searching a single char, and I would probably just stick to the last byte. Different story if you tackled str::find(str) though. Do we really not already use memchr in str::find(str) though?)
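
The rarest-byte heuristic described above can be sketched as follows. The rank table `toy_rank` is entirely invented for this illustration and is not the table the regex crate uses; the function names are hypothetical too, and a plain byte scan stands in for memchr.

```rust
/// Invented toy rank: a higher value means "assumed more common".
fn toy_rank(b: u8) -> u8 {
    match b {
        b' ' | b'a' | b'e' | b't' => 250, // very common in English text
        0xC0..=0xFF => 240,               // UTF-8 leading bytes
        b'b'..=b'y' => 150,               // other lowercase letters (first arm wins for e, t)
        0x80..=0xBF => 100,               // UTF-8 continuation bytes
        _ => 50,                          // everything else, including b'z'
    }
}

/// Offset of the byte in `pattern` that the toy table ranks as rarest.
fn rarest_byte_offset(pattern: &[u8]) -> Option<usize> {
    pattern
        .iter()
        .enumerate()
        .min_by_key(|&(_, &b)| toy_rank(b))
        .map(|(i, _)| i)
}

/// Substring search that scans for the rarest pattern byte, then verifies.
/// Returns None for an empty pattern (unlike `str::find`, which returns Some(0)).
fn find_by_rarest_byte(haystack: &[u8], pattern: &[u8]) -> Option<usize> {
    let k = rarest_byte_offset(pattern)?;
    let rare = pattern[k];
    // In a match starting at `s`, the rare byte sits at `s + k`,
    // so the scan can begin at offset `k`.
    let mut from = k;
    while from < haystack.len() {
        let i = haystack[from..].iter().position(|&b| b == rare)?;
        let start = from + i - k;
        if haystack[start..].starts_with(pattern) {
            return Some(start);
        }
        from += i + 1;
    }
    None
}
```

With this table the pattern "jazz" is scanned by its first 'z' rather than by 'j' or 'a'. As noted above, any fixed table can still mismatch the corpus being searched.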

#[inline]
fn next(&mut self) -> SearchStep {
    let old_finger = self.finger;
    let slice = unsafe { self.haystack.get_unchecked(old_finger..self.haystack.len()) };

BurntSushi Jan 1, 2018

Member

Do the various bounds check elisions actually help here? I've tried eliding them in my own substring search algorithms and it meets with variable success.

Manishearth Jan 1, 2018

Member

I don't think they do, but I haven't checked and it seemed pretty easy to keep that invariant. I can check if you want.

BurntSushi Jan 1, 2018

Member

My general position has been to not elide bounds checks unless I'm pretty sure that it matters. If it were me, I'd remove the unsafe. :)
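
The trade-off under discussion can be illustrated with a small sketch. The `Cursor` type below is hypothetical (it is not the searcher type from this PR); it keeps a `finger` invariant so that both the checked and the unchecked slicing are valid, and whether the unchecked form is measurably faster would have to be benchmarked.

```rust
/// Hypothetical cursor; invariant: `finger <= haystack.len()`.
struct Cursor<'a> {
    haystack: &'a [u8],
    finger: usize,
}

impl<'a> Cursor<'a> {
    fn new(haystack: &'a [u8]) -> Self {
        Cursor { haystack, finger: 0 }
    }

    /// Move forward, clamping so the invariant always holds.
    fn advance(&mut self, n: usize) {
        self.finger = self.finger.saturating_add(n).min(self.haystack.len());
    }

    /// Safe variant: the index is bounds-checked on every call.
    fn rest_checked(&self) -> &'a [u8] {
        &self.haystack[self.finger..]
    }

    /// Unchecked variant: skips the bounds check.
    fn rest_unchecked(&self) -> &'a [u8] {
        // SAFETY: `advance` is the only mutator of `finger`, and it clamps
        // the value to `haystack.len()`.
        unsafe { self.haystack.get_unchecked(self.finger..) }
    }
}
```

Both methods return the same slice; the safe one simply cannot go wrong if the invariant is ever broken by a later change.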

Manishearth Jan 1, 2018

Member

I'll do some checking later this week, for now I'll land it.

Member

BurntSushi commented Jan 1, 2018

This LGTM! Nice work @Manishearth :-)

@rust-lang rust-lang deleted a comment from BubbaSheen Jan 1, 2018

Member

Manishearth commented Jan 1, 2018

> You'd be in good company (GNU grep does that).

yay :)

> To be clear, I think the frequency rank stuff is probably overkill for searching a single char, and I would probably just stick to the last byte.

phew

that sounds trickier to get right 😄

> Do we really not already use memchr in str::find(str) though?

Yeah, we use an interesting but non-memchr-based algorithm. I considered retrofitting the existing memchr'd .find(char) into .find(str), but that would mean losing the existing algorithm, which makes the wins iffier (not to mention that memchr yields very few wins if you're stuttering the algorithm all the time, which is far likelier with a .find(str) built on top of .find(char)).

> This LGTM! Nice work @Manishearth :-)

can this be landed r=you? I've made a small mistake which I need to rectify, aside from that it seems basically ready. Or should we wait for second review?

Member

BurntSushi commented Jan 1, 2018

@Manishearth Yeah r=me sounds great.

Member

Mark-Simulacrum commented Jan 1, 2018

Perf queued; in the future please ping me directly.

Member

Manishearth commented Jan 1, 2018

@bors r=burntsushi

Contributor

bors commented Jan 1, 2018

📌 Commit 5cf5516 has been approved by burntsushi

Contributor

bors commented Jan 1, 2018

⌛️ Testing commit 5cf5516 with merge b65f0be...

bors added a commit that referenced this pull request Jan 1, 2018

Auto merge of #46735 - Manishearth:memchr-find, r=burntsushi
Use memchr for str::find(char)

Contributor

bors commented Jan 1, 2018

☀️ Test successful - status-appveyor, status-travis
Approved by: burntsushi
Pushing b65f0be to master...

@bors bors merged commit 5cf5516 into rust-lang:master Jan 1, 2018

rep
}
/// Return the first index matching the byte `a` in `text`.

jesseschalken Feb 22, 2018

a is meant to be x?

Manishearth Feb 22, 2018

Member

yeah, fixing

Manishearth Feb 22, 2018

Member

someone fixed it already

@Manishearth Manishearth deleted the Manishearth:memchr-find branch Feb 22, 2018
