
x86_64 SSE2 fast-path for str.contains(&str) and short needles #103779

Merged: 5 commits, merged Nov 17, 2022

Commits on Nov 14, 2022

  1. Commit 4844e51
  2. Commit 467b299
  3. x86_64 SSE2 fast-path for str.contains(&str) and short needles

    Based on Wojciech Muła's "SIMD-friendly algorithms for substring searching"[0]
    
    The two-way algorithm is efficient in Big-O terms, but it needs to preprocess the
    needle to find a "critical factorization" of it. This additional work is significant
    for short needles. In addition, it mostly advances only needle.len() bytes at a time.
    
    The SIMD-based approach used here, on the other hand, can advance by its vector
    width, which can exceed the needle length. Pathological inputs can still slow it
    down, but because it is limited to short needles, the worst-case blowup is also small.
    
    Benchmarks taken on a Zen 2 CPU (CGU = codegen units):
    
    ```
    16CGU, OLD:
    test str::bench_contains_short_short                     ... bench:          27 ns/iter (+/- 1)
    test str::bench_contains_short_long                      ... bench:         667 ns/iter (+/- 29)
    test str::bench_contains_bad_naive                       ... bench:         131 ns/iter (+/- 2)
    test str::bench_contains_bad_simd                        ... bench:         130 ns/iter (+/- 2)
    test str::bench_contains_equal                           ... bench:         148 ns/iter (+/- 4)
    
    
    16CGU, NEW:
    test str::bench_contains_short_short                     ... bench:           8 ns/iter (+/- 0)
    test str::bench_contains_short_long                      ... bench:         135 ns/iter (+/- 4)
    test str::bench_contains_bad_naive                       ... bench:         130 ns/iter (+/- 2)
    test str::bench_contains_bad_simd                        ... bench:         292 ns/iter (+/- 1)
    test str::bench_contains_equal                           ... bench:           3 ns/iter (+/- 0)
    
    
    1CGU, OLD:
    test str::bench_contains_short_short                     ... bench:          30 ns/iter (+/- 0)
    test str::bench_contains_short_long                      ... bench:         713 ns/iter (+/- 17)
    test str::bench_contains_bad_naive                       ... bench:         131 ns/iter (+/- 3)
    test str::bench_contains_bad_simd                        ... bench:         130 ns/iter (+/- 3)
    test str::bench_contains_equal                           ... bench:         148 ns/iter (+/- 6)
    
    1CGU, NEW:
    test str::bench_contains_short_short                     ... bench:          10 ns/iter (+/- 0)
    test str::bench_contains_short_long                      ... bench:         111 ns/iter (+/- 0)
    test str::bench_contains_bad_naive                       ... bench:         135 ns/iter (+/- 3)
    test str::bench_contains_bad_simd                        ... bench:         274 ns/iter (+/- 2)
    test str::bench_contains_equal                           ... bench:           4 ns/iter (+/- 0)
    ```
    
    
    [0] http://0x80.pl/articles/simd-strfind.html#sse-avx2
    the8472 committed Nov 14, 2022 (3d4a848)
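
A minimal, portable sketch of the filtering idea from Muła's article, assuming a 16-byte lane width to match the SSE2 title. It is an illustration, not the code from this PR: the names `simd_style_contains`, `eq_mask` and `LANES` are hypothetical, and the "vectors" are emulated with plain loops and a `u32` bitmask rather than SSE2 registers (real SSE2 code would build the same mask with `_mm_cmpeq_epi8` and `_mm_movemask_epi8`). For each block of candidate start positions, the needle's first byte is compared at the start and its last byte at the matching end; only positions where both agree get a full comparison.

```rust
/// Width of one emulated "vector" in bytes (an SSE2 register holds 16 bytes).
const LANES: usize = 16;

/// Bitmask with bit `i` set when `block[i] == byte`.
/// Mimics `_mm_movemask_epi8(_mm_cmpeq_epi8(block, splat(byte)))`.
fn eq_mask(block: &[u8], byte: u8) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in block.iter().enumerate() {
        if b == byte {
            mask |= 1 << i;
        }
    }
    mask
}

/// Hypothetical first/last-byte filter search in the style of Muła's
/// "generic SIMD" algorithm, with the lanes emulated in scalar code.
fn simd_style_contains(haystack: &[u8], needle: &[u8]) -> bool {
    let n = needle.len();
    if n == 0 {
        return true;
    }
    if n > haystack.len() {
        return false;
    }
    let first = needle[0];
    let last = needle[n - 1];
    // Largest valid start position of a match.
    let last_start = haystack.len() - n;
    let mut start = 0;
    while start <= last_start {
        // Number of candidate start positions handled by this block.
        let width = LANES.min(last_start - start + 1);
        // `a` holds the candidate starts; `b` is the same window shifted by n - 1.
        let a = &haystack[start..start + width];
        let b = &haystack[start + n - 1..start + n - 1 + width];
        let mut candidates = eq_mask(a, first) & eq_mask(b, last);
        while candidates != 0 {
            let pos = start + candidates.trailing_zeros() as usize;
            // Verify the surviving candidate with a full comparison.
            if &haystack[pos..pos + n] == needle {
                return true;
            }
            // Clear the lowest set bit and move to the next candidate.
            candidates &= candidates - 1;
        }
        // Advance by a whole block, independent of the needle length.
        start += width;
    }
    false
}
```

Advancing by a whole block per step, rather than by at most needle.len() bytes, is what lets this kind of search beat two-way on short needles. The pathological case is a haystack where the first and last needle bytes coincide at almost every position, which forces a verification nearly everywhere.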

Commits on Nov 15, 2022

  1. generalize str.contains() tests to a range of haystack sizes

    The test's running time is cubic in the input length, but it is only called with about 70 characters, so it is still fast enough.
    the8472 committed Nov 15, 2022 (c37e8fa)
  2. - convert from core::arch to core::simd

    - bump SIMD compare to 32 bytes
    - import small-slice compare code from the memmem crate
    - try a few different probe bytes to avoid degenerate cases
      - except for 2-byte needles, which are special-cased
    the8472 committed Nov 15, 2022 (a2b2010)
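
One plausible reason 2-byte needles are a natural special case can be shown with the same first/last-byte filter. For a 2-byte needle, the two probe bytes are the entire needle, so any position that survives the filter is already a match and the verification step disappears. This is an assumption about the motivation, not a transcription of the PR: `contains_two_bytes` is a hypothetical name, and the 16-byte lanes are again emulated in scalar code.

```rust
/// Emulated vector width in bytes (SSE2 register size).
const LANES: usize = 16;

/// Bitmask with bit `i` set when `block[i] == byte`
/// (a scalar stand-in for cmpeq + movemask).
fn eq_mask(block: &[u8], byte: u8) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in block.iter().enumerate() {
        if b == byte {
            mask |= 1 << i;
        }
    }
    mask
}

/// Hypothetical 2-byte needle search: probe needle[0] at each candidate start
/// and needle[1] one byte later. Both probes together cover the whole needle,
/// so a non-zero combined mask is a confirmed match with no verification step.
fn contains_two_bytes(haystack: &[u8], needle: [u8; 2]) -> bool {
    if haystack.len() < 2 {
        return false;
    }
    let last_start = haystack.len() - 2;
    let mut start = 0;
    while start <= last_start {
        let width = LANES.min(last_start - start + 1);
        let a = &haystack[start..start + width];
        let b = &haystack[start + 1..start + 1 + width];
        if (eq_mask(a, needle[0]) & eq_mask(b, needle[1])) != 0 {
            return true;
        }
        start += width;
    }
    false
}
```

For longer needles the two probes leave the middle bytes unchecked, which is why choosing which bytes to probe can matter there (for example, a needle like `aaab` probed only at its first and last byte), while for a 2-byte needle there is nothing left to choose.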
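
The "range of haystack sizes" test generalization can be pictured roughly as follows. This is a hypothetical shape, not the test added in the PR: `naive_contains` and `check_all_sizes` are illustrative names. Every prefix of a fixed ASCII text serves as a haystack, and every substring of the full text serves as a needle, so both hits and misses are exercised against a naive oracle. The number of `contains` calls grows cubically with the text length, which stays cheap for a text of roughly 70 characters.

```rust
/// Naive reference implementation used as the oracle.
fn naive_contains(haystack: &str, needle: &str) -> bool {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    // `windows(0)` would panic, so the empty needle is handled first.
    n.is_empty() || h.windows(n.len()).any(|w| w == n)
}

/// Compares `str::contains` with the oracle for every haystack size and
/// every candidate needle; returns the number of `contains` calls made.
fn check_all_sizes(text: &str) -> usize {
    // Byte-index slicing below is only valid on char boundaries; ASCII guarantees that.
    assert!(text.is_ascii(), "this sketch assumes ASCII text");
    let mut calls = 0;
    // One haystack per size: every prefix of `text`, including the empty one.
    for len in 0..=text.len() {
        let haystack = &text[..len];
        // Needles come from the full text, so some occur in `haystack` and some do not.
        for start in 0..=text.len() {
            for end in start..=text.len() {
                let needle = &text[start..end];
                assert_eq!(
                    haystack.contains(needle),
                    naive_contains(haystack, needle),
                    "haystack={haystack:?} needle={needle:?}"
                );
                calls += 1;
            }
        }
    }
    calls
}
```

For a text of length n this makes (n + 1) · (n + 1)(n + 2) / 2 calls, which is where the cubic growth comes from.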