goodbye simd crate, hello std::arch #456

BurntSushi · 2018-03-13T01:55:44Z

This PR ports the regex crate's use of SIMD to std::arch, which in turn drops the dependency on the simd crate and any compile-time SIMD configuration requirements. As a bonus, we also add an AVX2 variant of what used to be an exclusively SSSE3 algorithm.

We do this by adding a new `unstable` feature which, when enabled, causes the regex crate to automatically use SSSE3- or AVX2-optimized variants of certain literal algorithms (specifically, the Teddy multi-matcher), depending on which CPU features are available at runtime. Once std::arch is stabilized, these optimizations will be enabled automatically.
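
For readers unfamiliar with runtime feature detection, here is a minimal sketch of the general pattern using the `is_x86_feature_detected!` macro from today's standard library. The `Backend` enum and `pick_backend` function are hypothetical names made up for illustration; they are not the regex crate's actual internals.

```rust
/// Hypothetical enum naming the available implementations
/// (illustrative only; not a type from the regex crate).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    Avx2,
    Ssse3,
    Fallback,
}

/// Pick the widest vector backend the current CPU supports.
/// The check happens at runtime, so a single binary can run on
/// machines with and without these CPU features.
pub fn pick_backend() -> Backend {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
        if is_x86_feature_detected!("ssse3") {
            return Backend::Ssse3;
        }
    }
    // Non-x86 targets, or x86 CPUs without SSSE3, use the scalar path.
    Backend::Fallback
}
```

A caller would typically run the check once, cache the result, and then dispatch to a function compiled with the matching `#[target_feature(enable = "...")]` attribute.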

Performance improvements from no-SIMD to SSSE3 (which roughly match the status quo, when SSSE3 is enabled at compile time):

 sherlock::holmes_cochar_watson         193,079 (3081 MB/s)             160,996 (3695 MB/s)               -32,083  -16.62%   x 1.20
 sherlock::name_alt2                    166,895 (3564 MB/s)             126,387 (4707 MB/s)               -40,508  -24.27%   x 1.32
 sherlock::name_alt3                    1,090,127 (545 MB/s)            137,516 (4326 MB/s)              -952,611  -87.39%   x 7.93
 sherlock::name_alt4                    200,849 (2962 MB/s)             164,347 (3619 MB/s)               -36,502  -18.17%   x 1.22
 sherlock::name_alt4_nocase             1,155,389 (514 MB/s)            225,784 (2634 MB/s)              -929,605  -80.46%   x 5.12
 sherlock::name_alt5                    231,145 (2573 MB/s)             131,767 (4515 MB/s)               -99,378  -42.99%   x 1.75
 sherlock::name_alt5_nocase             1,158,401 (513 MB/s)            550,028 (1081 MB/s)              -608,373  -52.52%   x 2.11
 sherlock::name_holmes_nocase           943,570 (630 MB/s)              190,772 (3118 MB/s)              -752,798  -79.78%   x 4.95
 sherlock::name_sherlock_holmes_nocase  1,084,468 (548 MB/s)            170,387 (3491 MB/s)              -914,081  -84.29%   x 6.36
 sherlock::name_sherlock_nocase         1,077,668 (552 MB/s)            163,711 (3634 MB/s)              -913,957  -84.81%   x 6.58
 sherlock::the_nocase                   1,356,722 (438 MB/s)            392,886 (1514 MB/s)              -963,836  -71.04%   x 3.45

And then improvements from SSSE3 to AVX2:

 sherlock::holmes_cochar_watson         160,996 (3695 MB/s)          104,124 (5713 MB/s)              -56,872  -35.33%   x 1.55
 sherlock::holmes_coword_watson         554,179 (1073 MB/s)          495,262 (1201 MB/s)              -58,917  -10.63%   x 1.12
 sherlock::name_alt2                    126,387 (4707 MB/s)          85,083 (6992 MB/s)               -41,304  -32.68%   x 1.49
 sherlock::name_alt3                    137,516 (4326 MB/s)          94,820 (6274 MB/s)               -42,696  -31.05%   x 1.45
 sherlock::name_alt4                    164,347 (3619 MB/s)          120,466 (4938 MB/s)              -43,881  -26.70%   x 1.36
 sherlock::name_alt4_nocase             225,784 (2634 MB/s)          180,290 (3299 MB/s)              -45,494  -20.15%   x 1.25
 sherlock::name_alt5                    131,767 (4515 MB/s)          86,539 (6874 MB/s)               -45,228  -34.32%   x 1.52
 sherlock::name_holmes_nocase           190,772 (3118 MB/s)          147,946 (4021 MB/s)              -42,826  -22.45%   x 1.29
 sherlock::name_sherlock_holmes_nocase  170,387 (3491 MB/s)          124,611 (4774 MB/s)              -45,776  -26.87%   x 1.37
 sherlock::name_sherlock_nocase         163,711 (3634 MB/s)          121,786 (4885 MB/s)              -41,925  -25.61%   x 1.34

🎉

BurntSushi · 2018-03-13T01:59:54Z

Is it possible to use a union on Rust nightly but still compile on Rust 1.12? 🤔

BurntSushi · 2018-03-13T02:08:43Z

Indeed, it is possible:

macro_rules! defunion {
    () => {
        #[derive(Clone, Copy)]
        #[allow(non_camel_case_types)]
        pub union u8x32 {
            vector: __m256i,
            bytes: [u8; 32],
        }
    }
}

defunion!();
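
My understanding of why this works is that the body of a `macro_rules!` definition is stored as raw tokens and only parsed as Rust when the macro is invoked, so an old compiler never sees the `union` item as long as the invocation is behind a `cfg`. As a rough sketch of what such a union buys you, here is a 128-bit version that reads the bytes back out of a vector. It assumes x86_64 (where SSE2 is always available), and `u8x16` and `splat_bytes` are names invented for this example:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_set1_epi8};

/// The same 16 bytes viewed either as a SIMD vector or as a byte array.
#[cfg(target_arch = "x86_64")]
#[derive(Clone, Copy)]
#[allow(non_camel_case_types)]
union u8x16 {
    vector: __m128i,
    bytes: [u8; 16],
}

/// Broadcast `b` into every lane of a vector, then read the lanes back
/// through the union's `bytes` field.
#[cfg(target_arch = "x86_64")]
pub fn splat_bytes(b: u8) -> [u8; 16] {
    // SSE2 is part of the x86_64 baseline, so this intrinsic is always usable here.
    let v = u8x16 { vector: unsafe { _mm_set1_epi8(b as i8) } };
    // Reading a union field is unsafe; both fields are plain 16-byte data,
    // so every bit pattern is valid for either view.
    unsafe { v.bytes }
}

/// Scalar stand-in so the example also builds on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn splat_bytes(b: u8) -> [u8; 16] {
    [b; 16]
}
```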

This commit ports the Teddy searcher to std::arch and moves it off the portable SIMD vector API. Performance remains the same, and the codegen appears to be identical, which is great! This also makes the `simd-accel` feature a no-op and adds a new `unstable` feature to the regex crate, which enables this new use of std::arch and, with it, the Teddy optimization. The `-C target-feature` and `-C target-cpu` settings are no longer necessary, since target features are now detected at runtime: once `unstable` is enabled, the Teddy optimizations become available automatically without any additional compile-time flags.

@jneem

This commit adds a copy of the Teddy searcher that works on AVX2. We don't attempt to reuse any code between them just yet, and instead just copy & paste and tweak parts of it to work on 32 bytes instead of 16. (Some parts were trickier than others. For example, @jneem figured out how to nearly compensate for the lack of a real 256-bit bytewise PALIGNR instruction, which we borrow here.) Overall, AVX2 provides a nice bump in performance.
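
To make the PALIGNR remark concrete: AVX2's `_mm256_alignr_epi8` acts on each 128-bit lane independently, so on its own it cannot carry bytes across the lane boundary. One common workaround, sketched below, first builds a helper vector from the high lane of the previous block and the low lane of the current block with `_mm256_permute2x128_si256`, then applies the per-lane `alignr`. This is an illustration of the general idea only and may differ from the code in this PR; the function names are invented for the example.

```rust
/// Scalar reference: the last `n` bytes of `prev` followed by the
/// first `32 - n` bytes of `cur`.
pub fn shift_in_scalar(prev: &[u8; 32], cur: &[u8; 32], n: usize) -> [u8; 32] {
    assert!(n <= 32);
    let mut out = [0u8; 32];
    out[..n].copy_from_slice(&prev[32 - n..]);
    out[n..].copy_from_slice(&cur[..32 - n]);
    out
}

/// AVX2 version for n = 1 (hypothetical helper, not the PR's code).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn shift_in_one_avx2(prev: &[u8; 32], cur: &[u8; 32]) -> [u8; 32] {
    use std::arch::x86_64::*;
    unsafe {
        let p = _mm256_loadu_si256(prev.as_ptr() as *const __m256i);
        let c = _mm256_loadu_si256(cur.as_ptr() as *const __m256i);
        // t = [high lane of prev, low lane of cur]
        let t = _mm256_permute2x128_si256::<0x21>(p, c);
        // Per lane: take the last byte of t's lane, then the first 15 bytes
        // of cur's lane. Because t was pre-shuffled, this behaves like a
        // full 256-bit bytewise shift by one position.
        let r = _mm256_alignr_epi8::<15>(c, t);
        let mut out = [0u8; 32];
        _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, r);
        out
    }
}

/// Use AVX2 when the CPU has it, otherwise fall back to the scalar version.
pub fn shift_in_one(prev: &[u8; 32], cur: &[u8; 32]) -> [u8; 32] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: AVX2 support was just verified at runtime.
            return unsafe { shift_in_one_avx2(prev, cur) };
        }
    }
    shift_in_scalar(prev, cur, 1)
}
```

The extra permute costs one more instruction per shift than a true 256-bit PALIGNR would.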

We no longer need to enable SIMD optimizations at compile time. They are automatically enabled when regex is compiled with the `unstable` feature.

This removes our compile-time SIMD flags and replaces them with the `unstable` feature, which will cause CI to use whatever CPU features are available. Ideally, we would test each important CPU feature combination, but I'd like to avoid doing that in one CI job and instead split them out into separate CI jobs to keep CI times low. That requires more work.

gitignore: add tmp dir

58dc611

BurntSushi added 5 commits March 12, 2018 22:09

bench: remove RUSTFLAGS

02962df

doc: note the new unstable feature

75055b6

BurntSushi force-pushed the ag/stdsimd branch from 5d3c524 to 75055b6 Compare March 13, 2018 02:11

BurntSushi merged commit 27ed3fa into rust-lang:master Mar 13, 2018

BurntSushi deleted the ag/stdsimd branch March 13, 2018 02:32

BurntSushi mentioned this pull request Mar 13, 2018

update regex to 0.2.9, enables portable binary releases BurntSushi/ripgrep#857

Merged
