goodbye simd crate, hello std::arch #456

Merged
BurntSushi merged 6 commits into rust-lang:master from BurntSushi:ag/stdsimd on Mar 13, 2018

BurntSushi commented Mar 13, 2018

This PR ports the regex crate's use of SIMD to std::arch, which in turn drops the dependency on the simd crate and any compile time SIMD configuration requirements. As a bonus, we also add an AVX2 variant of what used to be an exclusively SSSE3 algorithm.

We do this by adding a new `unstable` feature which, when enabled, causes the regex crate to automatically use SSSE3- or AVX2-optimized variants of certain literal algorithms (specifically, the Teddy multi-matcher), depending on which CPU features are available at runtime. Once std::arch is stabilized, these optimizations will be enabled automatically.
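As a rough illustration of what runtime selection can look like, the sketch below probes the CPU with the standard `is_x86_feature_detected!` macro and picks the widest supported level. The `SimdLevel` enum and the `detect_simd_level` function are hypothetical names made up for this example; they are not the regex crate's actual API.

```rust
// Hypothetical names throughout; a sketch, not the regex crate's code.

/// The implementation a searcher could select at runtime.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SimdLevel {
    Avx2,
    Ssse3,
    Fallback,
}

impl SimdLevel {
    /// Short human-readable name for logging or diagnostics.
    pub fn name(self) -> &'static str {
        match self {
            SimdLevel::Avx2 => "avx2",
            SimdLevel::Ssse3 => "ssse3",
            SimdLevel::Fallback => "fallback",
        }
    }
}

/// Probe the running CPU and prefer the widest vector extension.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn detect_simd_level() -> SimdLevel {
    if is_x86_feature_detected!("avx2") {
        SimdLevel::Avx2
    } else if is_x86_feature_detected!("ssse3") {
        SimdLevel::Ssse3
    } else {
        SimdLevel::Fallback
    }
}

/// On other architectures there is nothing to detect in this sketch.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn detect_simd_level() -> SimdLevel {
    SimdLevel::Fallback
}
```

Because the check happens when the program runs, the same binary can use AVX2 on one machine and fall back on another, which is why no `-C target-feature` flag is needed at build time.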

Performance improvements from no-SIMD to SSSE3 (which roughly match the status quo, when SSSE3 is enabled at compile time):

 name                                   no-SIMD ns/iter                 SSSE3 ns/iter                diff ns/iter   diff %  speedup
 sherlock::holmes_cochar_watson         193,079 (3081 MB/s)             160,996 (3695 MB/s)               -32,083  -16.62%   x 1.20
 sherlock::name_alt2                    166,895 (3564 MB/s)             126,387 (4707 MB/s)               -40,508  -24.27%   x 1.32
 sherlock::name_alt3                    1,090,127 (545 MB/s)            137,516 (4326 MB/s)              -952,611  -87.39%   x 7.93
 sherlock::name_alt4                    200,849 (2962 MB/s)             164,347 (3619 MB/s)               -36,502  -18.17%   x 1.22
 sherlock::name_alt4_nocase             1,155,389 (514 MB/s)            225,784 (2634 MB/s)              -929,605  -80.46%   x 5.12
 sherlock::name_alt5                    231,145 (2573 MB/s)             131,767 (4515 MB/s)               -99,378  -42.99%   x 1.75
 sherlock::name_alt5_nocase             1,158,401 (513 MB/s)            550,028 (1081 MB/s)              -608,373  -52.52%   x 2.11
 sherlock::name_holmes_nocase           943,570 (630 MB/s)              190,772 (3118 MB/s)              -752,798  -79.78%   x 4.95
 sherlock::name_sherlock_holmes_nocase  1,084,468 (548 MB/s)            170,387 (3491 MB/s)              -914,081  -84.29%   x 6.36
 sherlock::name_sherlock_nocase         1,077,668 (552 MB/s)            163,711 (3634 MB/s)              -913,957  -84.81%   x 6.58
 sherlock::the_nocase                   1,356,722 (438 MB/s)            392,886 (1514 MB/s)              -963,836  -71.04%   x 3.45
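For readers unfamiliar with this table layout, the last three columns follow from the two timings on each row. The sketch below recomputes them for two rows; the function names are hypothetical and the formulas are the usual definitions of relative change and speedup, not code taken from the benchmark tool.

```rust
/// Relative change in percent; negative means the second run was faster.
fn diff_percent(before_ns: u64, after_ns: u64) -> f64 {
    (after_ns as f64 - before_ns as f64) / before_ns as f64 * 100.0
}

/// How many times faster the second run is than the first.
fn speedup(before_ns: u64, after_ns: u64) -> f64 {
    before_ns as f64 / after_ns as f64
}

/// Format one row's derived columns the way the table prints them.
fn derived_columns(before_ns: u64, after_ns: u64) -> String {
    format!(
        "{:.2}% x {:.2}",
        diff_percent(before_ns, after_ns),
        speedup(before_ns, after_ns)
    )
}
```

For example, `derived_columns(193_079, 160_996)` yields `-16.62% x 1.20`, matching the `sherlock::holmes_cochar_watson` row.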

And then improvements from SSSE3 to AVX2:

 name                                   SSSE3 ns/iter                AVX2 ns/iter                diff ns/iter   diff %  speedup
 sherlock::holmes_cochar_watson         160,996 (3695 MB/s)          104,124 (5713 MB/s)              -56,872  -35.33%   x 1.55
 sherlock::holmes_coword_watson         554,179 (1073 MB/s)          495,262 (1201 MB/s)              -58,917  -10.63%   x 1.12
 sherlock::name_alt2                    126,387 (4707 MB/s)          85,083 (6992 MB/s)               -41,304  -32.68%   x 1.49
 sherlock::name_alt3                    137,516 (4326 MB/s)          94,820 (6274 MB/s)               -42,696  -31.05%   x 1.45
 sherlock::name_alt4                    164,347 (3619 MB/s)          120,466 (4938 MB/s)              -43,881  -26.70%   x 1.36
 sherlock::name_alt4_nocase             225,784 (2634 MB/s)          180,290 (3299 MB/s)              -45,494  -20.15%   x 1.25
 sherlock::name_alt5                    131,767 (4515 MB/s)          86,539 (6874 MB/s)               -45,228  -34.32%   x 1.52
 sherlock::name_holmes_nocase           190,772 (3118 MB/s)          147,946 (4021 MB/s)              -42,826  -22.45%   x 1.29
 sherlock::name_sherlock_holmes_nocase  170,387 (3491 MB/s)          124,611 (4774 MB/s)              -45,776  -26.87%   x 1.37
 sherlock::name_sherlock_nocase         163,711 (3634 MB/s)          121,786 (4885 MB/s)              -41,925  -25.61%   x 1.34

🎉

BurntSushi commented Mar 13, 2018

Is it possible to use a union on Rust nightly but still compile on Rust 1.12?

BurntSushi commented Mar 13, 2018

Indeed, it is possible:

macro_rules! defunion {
    () => {
        #[derive(Clone, Copy)]
        #[allow(non_camel_case_types)]
        pub union u8x32 {
            vector: __m256i,
            bytes: [u8; 32],
        }
    }
}

defunion!();
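A slightly fuller sketch of the same trick follows. The trick presumably works because a macro body is stored as raw tokens, so a compiler that does not know `union` never parses the definition unless the macro is actually invoked. The `cfg` gate on the invocation and the helper methods (`from_bytes`, `from_vector`, `to_bytes`) are assumptions added for illustration. In the crate the invocation would more plausibly be gated on the `unstable` feature; `target_arch` is used here only so the example builds on its own.

```rust
// Sketch only: the cfg gate and helper methods are illustrative assumptions.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::__m256i;

// The body below is only parsed as an item when the macro is invoked.
#[allow(unused_macros)]
macro_rules! defunion {
    () => {
        #[derive(Clone, Copy)]
        #[allow(non_camel_case_types)]
        pub union u8x32 {
            vector: __m256i,
            bytes: [u8; 32],
        }
    };
}

// Invoke the macro only where the vector type exists.
#[cfg(target_arch = "x86_64")]
defunion!();

#[cfg(target_arch = "x86_64")]
impl u8x32 {
    /// Build the union from its byte view.
    pub fn from_bytes(bytes: [u8; 32]) -> u8x32 {
        u8x32 { bytes }
    }

    /// Build the union from its vector view.
    pub fn from_vector(vector: __m256i) -> u8x32 {
        u8x32 { vector }
    }

    /// Read the byte view. Both fields are 32 bytes of plain data with
    /// no invalid bit patterns, so reading either one is sound.
    pub fn to_bytes(&self) -> [u8; 32] {
        unsafe { self.bytes }
    }
}
```

The union gives cheap access to individual bytes of a 256-bit vector without a separate store to a temporary buffer.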

BurntSushi added some commits Mar 11, 2018

teddy: port teddy searcher to std::arch
This commit ports the Teddy searcher to use std::arch and moves off the
portable SIMD vector API. Performance remains the same, and it looks
like the codegen is identical, which is great!

This also makes the `simd-accel` feature a no-op and adds a new
`unstable` feature which will enable the Teddy optimization. The `-C
target-feature` or `-C target-cpu` settings are no longer necessary,
since this will now do runtime target feature detection.

We also add a new `unstable` feature to the regex crate, which will
enable this new use of std::arch. Once enabled, the Teddy optimizations
become available automatically without any additional compile time
flags.

teddy: port teddy searcher to AVX2
This commit adds a copy of the Teddy searcher that works on AVX2. We
don't attempt to reuse any code between them just yet, and instead just
copy & paste and tweak parts of it to work on 32 bytes instead of 16.
(Some parts were trickier than others. For example, @jneem figured out
how to nearly compensate for the lack of a real 256-bit bytewise PALIGNR
instruction, which we borrow here.)
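To make the PALIGNR remark concrete: `_mm256_alignr_epi8` shifts within each 128-bit lane separately, so bytes do not cross the middle of the register. One commonly used workaround, shown below as a sketch and not necessarily the exact sequence used in this commit, first builds a "bridge" vector with `_mm256_permute2x128_si256` and then applies the per-lane shift. The function names are hypothetical. The example shifts in a single byte from the previous block.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: last byte of `prev` followed by the first 31 of `cur`.
pub fn shift_in_last_byte_scalar(prev: &[u8; 32], cur: &[u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    out[0] = prev[31];
    out[1..].copy_from_slice(&cur[..31]);
    out
}

/// AVX2 version of the same operation (hypothetical helper).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn shift_in_last_byte_avx2(prev: &[u8; 32], cur: &[u8; 32]) -> [u8; 32] {
    let p = _mm256_loadu_si256(prev.as_ptr() as *const __m256i);
    let c = _mm256_loadu_si256(cur.as_ptr() as *const __m256i);
    // bridge = [high half of prev, low half of cur]
    let bridge = _mm256_permute2x128_si256::<0x21>(p, c);
    // Per lane: take the top byte of `bridge`, then 15 bytes of `cur`.
    let shifted = _mm256_alignr_epi8::<15>(c, bridge);
    let mut out = [0u8; 32];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, shifted);
    out
}

/// Uses AVX2 when the CPU supports it, otherwise the scalar reference.
pub fn shift_in_last_byte(prev: &[u8; 32], cur: &[u8; 32]) -> [u8; 32] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: AVX2 support was just verified at runtime.
            return unsafe { shift_in_last_byte_avx2(prev, cur) };
        }
    }
    shift_in_last_byte_scalar(prev, cur)
}
```

The extra permute is the cost of not having a true 256-bit bytewise shift: two instructions do the work that a single PALIGNR does on 128-bit vectors.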

Overall, AVX2 provides a nice bump in performance.

bench: remove RUSTFLAGS
We no longer need to enable SIMD optimizations at compile time. They are
automatically enabled when regex is compiled with the `unstable`
feature.

ci: remove RUSTFLAGS, enable unstable
This removes our compile time SIMD flags and replaces them with the
`unstable` feature, which will cause CI to use whatever CPU features are
available.

Ideally, we would test each important CPU feature combination, but I'd
like to avoid doing that in one CI job and instead split them out into
separate CI jobs to keep CI times low. That requires more work.

BurntSushi merged commit 27ed3fa into rust-lang:master on Mar 13, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

BurntSushi deleted the BurntSushi:ag/stdsimd branch on Mar 13, 2018
