Modify Teddy's match verification. #275

Closed

jneem
Contributor

@jneem jneem commented Aug 15, 2016

Once Teddy finds a fingerprint, it needs to check if there is really a
match. The old approach was to iterate over the set bits of both 64-bit
halves of the fingerprint-checking vector. The new approach first
extracts a bitfield that tells which bytes of the fingerprint-checking
vector are non-zero. Then it iterates over those bytes. This seems to
be faster (up to about 10% in some benchmarks).

It seems like the main reason that this approach is faster is that most
matched fingerprints only match in one place. The new code narrows in
on the important place more quickly, whereas the old code wasted time
unnecessarily examining an empty u64.
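
To make the difference concrete, here is a minimal scalar sketch of the two verification loops. It is not the code from this patch: the real implementation works on a SIMD register, whereas this uses a plain `[u8; 16]` as a stand-in, and the names `Vec128`, `verify_old`, `nonzero_byte_mask` and `verify_new` are hypothetical. It also assumes that each set bit inside a byte identifies one candidate to verify at that byte position.

```rust
/// Hypothetical stand-in for the 128-bit fingerprint-checking vector.
type Vec128 = [u8; 16];

/// Old style: view the vector as two u64 halves and walk the set bits
/// of each half, even when a half is entirely zero.
/// Returns (byte position, bit within that byte) pairs.
fn verify_old(v: &Vec128) -> Vec<(usize, u32)> {
    let mut out = Vec::new();
    for half in 0..2 {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&v[half * 8..half * 8 + 8]);
        let mut bits = u64::from_le_bytes(bytes);
        while bits != 0 {
            let bit = bits.trailing_zeros();
            bits &= bits - 1; // clear the lowest set bit
            out.push((half * 8 + (bit / 8) as usize, bit % 8));
        }
    }
    out
}

/// Scalar analogue of "compare against zero, then take a byte mask":
/// bit i of the result is set iff byte i of the vector is non-zero.
fn nonzero_byte_mask(v: &Vec128) -> u16 {
    let mut mask = 0u16;
    for (i, &b) in v.iter().enumerate() {
        if b != 0 {
            mask |= 1 << i;
        }
    }
    mask
}

/// New style: first find which bytes are non-zero, then visit only those.
fn verify_new(v: &Vec128) -> Vec<(usize, u32)> {
    let mut out = Vec::new();
    let mut mask = nonzero_byte_mask(v);
    while mask != 0 {
        let byte_pos = mask.trailing_zeros() as usize;
        mask &= mask - 1; // clear the lowest set bit
        let mut bits = v[byte_pos];
        while bits != 0 {
            let bit = bits.trailing_zeros();
            bits &= bits - 1;
            out.push((byte_pos, bit));
        }
    }
    out
}
```

Both functions report the same candidates in the same order. In the common case of a single non-zero byte, `verify_new` goes straight to that byte from the mask, while `verify_old` still has to load and test the other, empty half.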

The gain seems to be larger with AVX2 support (which is not included in
this patch). Presumably it would be even larger with AVX512.

Here are some benchmarks (on an Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz). I've left only the ones that actually use Teddy.

Before:

test misc::medium_1K                         ... bench:          62 ns/iter (+/- 0) = 16967 MB/s
test misc::medium_1MB                        ... bench:          68 ns/iter (+/- 3) = 15420647 MB/s
test misc::medium_32                         ... bench:          60 ns/iter (+/- 0) = 1000 MB/s
test misc::medium_32K                        ... bench:          60 ns/iter (+/- 2) = 546600 MB/s
test misc::one_pass_short                    ... bench:          49 ns/iter (+/- 1) = 346 MB/s
test misc::one_pass_short_not                ... bench:          56 ns/iter (+/- 25) = 303 MB/s
test regexdna::variant1                      ... bench:   4,618,632 ns/iter (+/- 43,998) = 1100 MB/s
test regexdna::variant2                      ... bench:   7,714,161 ns/iter (+/- 59,343) = 658 MB/s
test regexdna::variant3                      ... bench:   9,313,201 ns/iter (+/- 41,022) = 545 MB/s
test regexdna::variant4                      ... bench:   9,442,413 ns/iter (+/- 391,462) = 538 MB/s
test regexdna::variant5                      ... bench:   7,824,204 ns/iter (+/- 50,196) = 649 MB/s
test regexdna::variant6                      ... bench:   7,424,271 ns/iter (+/- 59,398) = 684 MB/s
test regexdna::variant7                      ... bench:   8,191,592 ns/iter (+/- 50,278) = 620 MB/s
test regexdna::variant8                      ... bench:   8,295,141 ns/iter (+/- 47,041) = 612 MB/s
test regexdna::variant9                      ... bench:   8,095,375 ns/iter (+/- 58,014) = 627 MB/s
test sherlock::holmes_cochar_watson          ... bench:     195,226 ns/iter (+/- 4,244) = 3047 MB/s
test sherlock::holmes_coword_watson          ... bench:     638,784 ns/iter (+/- 10,402) = 931 MB/s
test sherlock::name_alt1                     ... bench:      40,088 ns/iter (+/- 1,697) = 14840 MB/s
test sherlock::name_alt2                     ... bench:     168,263 ns/iter (+/- 5,849) = 3535 MB/s
test sherlock::name_alt3                     ... bench:     180,960 ns/iter (+/- 2,098) = 3287 MB/s
test sherlock::name_alt4                     ... bench:     211,106 ns/iter (+/- 2,265) = 2818 MB/s
test sherlock::name_alt4_nocase              ... bench:     294,980 ns/iter (+/- 4,287) = 2016 MB/s
test sherlock::name_alt5                     ... bench:     173,042 ns/iter (+/- 2,525) = 3438 MB/s
test sherlock::name_alt5_nocase              ... bench:     800,761 ns/iter (+/- 26,055) = 742 MB/s
test sherlock::name_holmes_nocase            ... bench:     248,508 ns/iter (+/- 4,080) = 2394 MB/s
test sherlock::name_sherlock_holmes_nocase   ... bench:     226,836 ns/iter (+/- 3,920) = 2622 MB/s
test sherlock::name_sherlock_nocase          ... bench:     222,174 ns/iter (+/- 7,628) = 2677 MB/s
test sherlock::quotes                        ... bench:     577,495 ns/iter (+/- 8,331) = 1030 MB/s
test sherlock::the_nocase                    ... bench:     529,896 ns/iter (+/- 7,982) = 1122 MB/s

After:

test misc::medium_1K                         ... bench:          62 ns/iter (+/- 1) = 16967 MB/s
test misc::medium_1MB                        ... bench:          64 ns/iter (+/- 0) = 16384437 MB/s
test misc::medium_32                         ... bench:          60 ns/iter (+/- 0) = 1000 MB/s
test misc::medium_32K                        ... bench:          60 ns/iter (+/- 1) = 546600 MB/s
test misc::one_pass_short                    ... bench:          47 ns/iter (+/- 2) = 361 MB/s
test misc::one_pass_short_not                ... bench:          49 ns/iter (+/- 1) = 346 MB/s
test regexdna::variant1                      ... bench:   4,440,246 ns/iter (+/- 50,148) = 1144 MB/s
test regexdna::variant2                      ... bench:   8,062,373 ns/iter (+/- 45,368) = 630 MB/s
test regexdna::variant3                      ... bench:   9,511,231 ns/iter (+/- 62,748) = 534 MB/s
test regexdna::variant4                      ... bench:   9,561,492 ns/iter (+/- 56,091) = 531 MB/s
test regexdna::variant5                      ... bench:   8,168,135 ns/iter (+/- 57,706) = 622 MB/s
test regexdna::variant6                      ... bench:   7,873,334 ns/iter (+/- 164,196) = 645 MB/s
test regexdna::variant7                      ... bench:   7,959,515 ns/iter (+/- 45,165) = 638 MB/s
test regexdna::variant8                      ... bench:   8,049,168 ns/iter (+/- 55,434) = 631 MB/s
test regexdna::variant9                      ... bench:   7,953,276 ns/iter (+/- 60,756) = 639 MB/s
test sherlock::holmes_cochar_watson          ... bench:     176,303 ns/iter (+/- 4,207) = 3374 MB/s
test sherlock::holmes_coword_watson          ... bench:     627,607 ns/iter (+/- 9,955) = 947 MB/s
test sherlock::name_alt1                     ... bench:      40,231 ns/iter (+/- 902) = 14787 MB/s
test sherlock::name_alt2                     ... bench:     149,595 ns/iter (+/- 2,185) = 3976 MB/s
test sherlock::name_alt3                     ... bench:     162,793 ns/iter (+/- 2,406) = 3654 MB/s
test sherlock::name_alt4                     ... bench:     193,236 ns/iter (+/- 3,016) = 3078 MB/s
test sherlock::name_alt4_nocase              ... bench:     275,251 ns/iter (+/- 4,332) = 2161 MB/s
test sherlock::name_alt5                     ... bench:     155,146 ns/iter (+/- 2,357) = 3834 MB/s
test sherlock::name_alt5_nocase              ... bench:     765,181 ns/iter (+/- 8,713) = 777 MB/s
test sherlock::name_holmes_nocase            ... bench:     226,658 ns/iter (+/- 7,667) = 2624 MB/s
test sherlock::name_sherlock_holmes_nocase   ... bench:     203,958 ns/iter (+/- 2,969) = 2916 MB/s
test sherlock::name_sherlock_nocase          ... bench:     199,780 ns/iter (+/- 3,442) = 2977 MB/s
test sherlock::quotes                        ... bench:     579,021 ns/iter (+/- 11,342) = 1027 MB/s
test sherlock::the_nocase                    ... bench:     518,571 ns/iter (+/- 8,527) = 1147 MB/s
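
For reading the two tables side by side, the relative change can be computed from the ns/iter columns. The helper below is only an illustrative sketch (the name `percent_faster` is hypothetical); a positive result means the "After" run took less time per iteration, a negative one means it took more.

```rust
/// Percentage reduction in time per iteration when going from
/// `before_ns` to `after_ns`. Negative values indicate a slowdown.
fn percent_faster(before_ns: f64, after_ns: f64) -> f64 {
    (before_ns - after_ns) / before_ns * 100.0
}
```

For example, `percent_faster(168_263.0, 149_595.0)` for `sherlock::name_alt2` comes out at roughly 11%, while `percent_faster(7_714_161.0, 8_062_373.0)` for `regexdna::variant2` is negative, so the change is not uniform across the benchmarks listed.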

@BurntSushi
Member

I rebased, squashed and merged this in 12cb63b

Thanks!

@BurntSushi BurntSushi closed this Sep 4, 2016
@BurntSushi
Member

@jneem I somehow missed your comment about AVX2. Are you working on that? That'd be delicious. :-)

@jneem
Contributor Author

jneem commented Sep 6, 2016

Yes, but it's in a separate crate (https://github.com/jneem/teddy), which has diverged a bit. It also implements aligned loads, which gives another 10% or so.
