Improve is_ascii performance on x86_64 with explicit SSE2 intrinsics #151611
9b57e05 to e166f69
r? @folkertdev

Neat! Can you remove the mentions of GitHub issues/PRs from the commit message (they spam those pages)? The PR description is where that sort of extra context can go.
Use explicit SSE2 intrinsics to avoid LLVM's broken AVX-512 auto-vectorization which generates ~31 kshiftrd instructions.

Performance
- AVX-512: 34-48x faster
- SSE2: 1.5-2x faster

Improves on earlier pr
e166f69 to a72f68e
I ran this on a non-AVX-512 machine too (just AVX2), and there it is also better across the board. Thanks for looking into this!

@bors r+
…erformance, r=folkertdev

Improve is_ascii performance on x86_64 with explicit SSE2 intrinsics

# Summary

Improves `slice::is_ascii` performance for SSE2 targets by roughly 1.5-2x on larger inputs. AVX-512 keeps similar performance characteristics.

This builds on the work already merged in rust-lang#151259. In particular, this PR improves the default SSE2 performance, so I don't consider this a temporary fix anymore.

Thanks to @folkertdev for pointing me to consider `as_chunk` again.

# The implementation:

- Uses 64-byte chunks with 4x 16-byte SSE2 loads OR'd together
- Extracts the MSB mask with a single `pmovmskb` instruction
- Falls back to usize-at-a-time SWAR for inputs < 64 bytes

# Performance impact (vs before rust-lang#151259):

- AVX-512: 34-48x faster
- SSE2: 1.5-2x faster

<details>
<summary>Benchmark Results (click to expand)</summary>

Benchmarked on AMD Ryzen 9 9950X (AVX-512 capable). Values show relative performance (1.00 = fastest). Tops out at 139GB/s for large inputs.

### early_non_ascii

| Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
|------------|------------|----------|------------|----------|
| 64 | 1.01 | **1.00** | 13.45 | 1.13 |
| 1024 | 1.01 | **1.00** | 13.53 | 1.14 |
| 65536 | 1.01 | **1.00** | 13.99 | 1.12 |
| 1048576 | 1.02 | **1.00** | 13.29 | 1.12 |

### late_non_ascii

| Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
|------------|------------|----------|------------|----------|
| 64 | **1.00** | 1.01 | 13.37 | 1.13 |
| 1024 | 1.10 | **1.00** | 42.42 | 1.95 |
| 65536 | **1.00** | 1.06 | 42.22 | 1.73 |
| 1048576 | **1.00** | 1.03 | 34.73 | 1.46 |

### pure_ascii

| Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
|------------|------------|----------|------------|----------|
| 4 | 1.03 | **1.00** | 1.75 | 1.32 |
| 8 | **1.00** | 1.14 | 3.89 | 2.06 |
| 16 | **1.00** | 1.04 | 1.13 | 1.62 |
| 32 | 1.07 | 1.19 | 5.11 | **1.00** |
| 64 | **1.00** | 1.13 | 13.32 | 1.57 |
| 128 | **1.00** | 1.01 | 19.97 | 1.55 |
| 256 | **1.00** | 1.02 | 27.77 | 1.61 |
| 1024 | **1.00** | 1.02 | 41.34 | 1.84 |
| 4096 | 1.02 | **1.00** | 45.61 | 1.98 |
| 16384 | 1.01 | **1.00** | 48.67 | 2.04 |
| 65536 | **1.00** | 1.03 | 43.86 | 1.77 |
| 262144 | **1.00** | 1.06 | 41.44 | 1.79 |
| 1048576 | 1.02 | **1.00** | 35.36 | 1.44 |

</details>

Adds assembly test to verify:

- `kshiftrd`/`kshiftrq` are NOT generated
- `pmovmskb`/`vpor` ARE generated

## Reproduction / Test Projects

Standalone validation tools: https://github.com/bonega/is-ascii-fix-validation

- `bench/` - Criterion benchmarks for SSE2 vs AVX-512 comparison
- `fuzz/` - Compares old/new implementations with libfuzzer

Relates to: llvm/llvm-project#176906
…uwer

Rollup of 11 pull requests

Successful merges:

- #145393 (Add codegen test for removing trailing zeroes from `NonZero`)
- #148764 (ptr_aligment_type: add more APIs)
- #149869 (std: avoid tearing `dbg!` prints)
- #150065 (add CSE optimization tests for iterating over slice)
- #150842 (Fix(lib/win/thread): Ensure `Sleep`'s usage passes over the requested duration under Win7)
- #151505 (Various refactors to the proc_macro bridge)
- #151560 (relnotes: fix 1.93's `as_mut_array` methods)
- #151611 (Improve is_ascii performance on x86_64 with explicit SSE2 intrinsics)
- #151317 (x86 soft-float feature: mark it as forbidden rather than unstable)
- #151577 (Rename `DepKindStruct` to `DepKindVTable`)
- #151620 (Fix 'the the' typo in library/core/src/array/iter.rs)
@bors r-

Commit a72f68e has been unapproved.
The SSE2 helper function is not inlined across crate boundaries, so we cannot verify the codegen in an assembly test. The fix is still verified by the absence of performance regression.
I dropped the assembly tests for X86-64 since the SSE2 function is not inlined and impossible to test. Adding

You can add I also needed to swap the

in addition, with

@folkertdev I brought the tests back with Thank you

@bors r+
Summary

Improves `slice::is_ascii` performance for SSE2 targets by roughly 1.5-2x on larger inputs. AVX-512 keeps similar performance characteristics.

This builds on the work already merged in #151259. In particular, this PR improves the default SSE2 performance, so I don't consider this a temporary fix anymore.

Thanks to @folkertdev for pointing me to consider `as_chunk` again.

The implementation:

- Uses 64-byte chunks with 4x 16-byte SSE2 loads OR'd together
- Extracts the MSB mask with a single `pmovmskb` instruction
- Falls back to usize-at-a-time SWAR for inputs < 64 bytes

Performance impact (vs before #151259):

- AVX-512: 34-48x faster
- SSE2: 1.5-2x faster
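
For readers who want to see the shape of the approach, here is a minimal sketch of the chunked SSE2 check, not the code merged in this PR. The function name `is_ascii_sse2_sketch` is hypothetical, and the tail is handled byte by byte for brevity, whereas the PR describes a usize-at-a-time SWAR fallback. It assumes only the stable `core::arch::x86_64` intrinsics; `_mm_movemask_epi8` is the intrinsic that corresponds to `pmovmskb`.

```rust
/// Hypothetical sketch: returns true when every byte is below 0x80.
#[cfg(target_arch = "x86_64")]
pub fn is_ascii_sse2_sketch(bytes: &[u8]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8, _mm_or_si128};

    let mut chunks = bytes.chunks_exact(64);
    for chunk in &mut chunks {
        let p = chunk.as_ptr() as *const __m128i;
        // SAFETY: SSE2 is part of the x86_64 baseline, `chunk` is exactly
        // 64 bytes long (four 16-byte vectors), and `_mm_loadu_si128`
        // permits unaligned addresses.
        let mask = unsafe {
            let a = _mm_loadu_si128(p);
            let b = _mm_loadu_si128(p.add(1));
            let c = _mm_loadu_si128(p.add(2));
            let d = _mm_loadu_si128(p.add(3));
            // OR the four vectors so a high bit set in any byte survives.
            let combined = _mm_or_si128(_mm_or_si128(a, b), _mm_or_si128(c, d));
            // Collect the most significant bit of each byte into an integer.
            _mm_movemask_epi8(combined)
        };
        if mask != 0 {
            return false;
        }
    }
    // Simplification for this sketch: check the remaining < 64 bytes one by one.
    chunks.remainder().iter().all(|b| *b < 0x80)
}

/// Portable stand-in so the sketch also compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn is_ascii_sse2_sketch(bytes: &[u8]) -> bool {
    bytes.iter().all(|b| *b < 0x80)
}
```

OR-ing four loads before extracting the mask means one branch per 64 bytes instead of one per 16, which is presumably why the description groups the loads this way.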
Benchmark Results (click to expand)
Benchmarked on AMD Ryzen 9 9950X (AVX-512 capable). Values show relative performance (1.00 = fastest).
Tops out at 139GB/s for large inputs.
early_non_ascii
late_non_ascii
pure_ascii
Reproduction / Test Projects

Standalone validation tools: https://github.com/bonega/is-ascii-fix-validation

- `bench/` - Criterion benchmarks for SSE2 vs AVX-512 comparison
- `fuzz/` - Compares old/new implementations with libfuzzer

Relates to: llvm/llvm-project#176906
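
The idea of comparing two implementations can be illustrated without libfuzzer. The sketch below is an assumption-laden stand-in, not the contents of the `fuzz/` directory: `is_ascii_swar_sketch` is a hypothetical usize-at-a-time SWAR check of the kind the description mentions for short inputs, and `check_all_positions` is a hypothetical helper that compares it against the standard library's `<[u8]>::is_ascii` with a non-ASCII byte placed at every position.

```rust
/// Hypothetical SWAR check: tests one `usize` worth of bytes per iteration.
pub fn is_ascii_swar_sketch(bytes: &[u8]) -> bool {
    const WORD: usize = core::mem::size_of::<usize>();
    // 0x80 repeated in every byte of a machine word.
    const HIGH_BITS: usize = usize::from_ne_bytes([0x80; WORD]);

    let mut chunks = bytes.chunks_exact(WORD);
    for chunk in &mut chunks {
        let mut buf = [0u8; WORD];
        buf.copy_from_slice(chunk);
        // Any byte >= 0x80 leaves its top bit set after masking.
        if usize::from_ne_bytes(buf) & HIGH_BITS != 0 {
            return false;
        }
    }
    // Fewer than WORD bytes remain; check them individually.
    chunks.remainder().iter().all(|b| *b < 0x80)
}

/// Hypothetical differential check: for a buffer of `len` ASCII bytes,
/// flip each position to 0x80 in turn and confirm the sketch agrees
/// with the standard library on every variant.
pub fn check_all_positions(len: usize) -> bool {
    let mut buf = vec![b'a'; len];
    if is_ascii_swar_sketch(&buf) != buf.is_ascii() {
        return false;
    }
    for i in 0..len {
        buf[i] = 0x80;
        if is_ascii_swar_sketch(&buf) != buf.is_ascii() {
            return false;
        }
        buf[i] = b'a';
    }
    true
}
```

A fuzzer explores far more inputs than this exhaustive position sweep, but the comparison at the core is the same: the candidate and the reference must return the same answer for every input.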