Improve is_ascii performance on x86_64 with explicit SSE2 intrinsics #151611

bonega · 2026-01-24T19:27:40Z

Summary

Improves slice::is_ascii performance on SSE2 targets by roughly 1.5-2x on larger inputs.
AVX-512 keeps similar performance characteristics.

This is building on the work already merged in #151259.
In particular, this PR improves the default SSE2 performance; I no longer consider this a temporary fix.
Thanks to @folkertdev for prompting me to consider as_chunk again.

The implementation:

Uses 64-byte chunks with 4x 16-byte SSE2 loads OR'd together
Extracts the MSB mask with a single pmovmskb instruction
Falls back to usize-at-a-time SWAR for inputs < 64 bytes
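
The three points above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the function names (`chunk64_is_ascii`, `swar_is_ascii`, `is_ascii_sketch`) are hypothetical, the handling of the tail after the last full 64-byte chunk is an assumption, and the non-x86_64 fallback exists only so the sketch builds everywhere.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8, _mm_or_si128};

/// Checks one 64-byte chunk: four unaligned 16-byte loads are OR'd together,
/// then a single movemask collects the top bit of every byte lane.
#[cfg(target_arch = "x86_64")]
fn chunk64_is_ascii(chunk: &[u8]) -> bool {
    assert_eq!(chunk.len(), 64);
    let p = chunk.as_ptr() as *const __m128i;
    // SAFETY: SSE2 is part of the x86_64 baseline, `_mm_loadu_si128` has no
    // alignment requirement, and all four loads stay inside the 64 bytes.
    unsafe {
        let a = _mm_loadu_si128(p);
        let b = _mm_loadu_si128(p.add(1));
        let c = _mm_loadu_si128(p.add(2));
        let d = _mm_loadu_si128(p.add(3));
        let combined = _mm_or_si128(_mm_or_si128(a, b), _mm_or_si128(c, d));
        // pmovmskb: bit i of the result is the most significant bit of byte i.
        _mm_movemask_epi8(combined) == 0
    }
}

/// Portable stand-in (assumption) so the sketch also builds elsewhere.
#[cfg(not(target_arch = "x86_64"))]
fn chunk64_is_ascii(chunk: &[u8]) -> bool {
    chunk.iter().all(|b| *b < 0x80)
}

/// SWAR check: tests the high bit of every byte in a `usize` at once.
fn swar_is_ascii(bytes: &[u8]) -> bool {
    const WORD: usize = std::mem::size_of::<usize>();
    // 0x80 repeated in every byte of the word.
    const HIGH_BITS: usize = usize::MAX / 255 * 0x80;
    let mut words = bytes.chunks_exact(WORD);
    for w in &mut words {
        let mut buf = [0u8; WORD];
        buf.copy_from_slice(w);
        if usize::from_ne_bytes(buf) & HIGH_BITS != 0 {
            return false;
        }
    }
    // Fewer than WORD bytes remain; check them one by one.
    words.remainder().iter().all(|b| *b < 0x80)
}

fn is_ascii_sketch(bytes: &[u8]) -> bool {
    if bytes.len() < 64 {
        return swar_is_ascii(bytes);
    }
    let mut chunks = bytes.chunks_exact(64);
    for chunk in &mut chunks {
        if !chunk64_is_ascii(chunk) {
            return false;
        }
    }
    // Assumption: the leftover tail reuses the SWAR path.
    swar_is_ascii(chunks.remainder())
}
```

OR-ing the four vectors first means only one mask extraction and one branch are needed per 64 bytes, since any byte with its high bit set keeps that bit set in the combined vector.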

Performance impact (vs before #151259):

AVX-512: 34-48x faster
SSE2: 1.5-2x faster

Benchmark Results

Benchmarked on an AMD Ryzen 9 9950X (AVX-512 capable). Values are relative to the fastest variant in each row (1.00 = fastest, higher is slower).
Throughput tops out at 139 GB/s for large inputs.

early_non_ascii

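| Input Size (bytes) | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
|--------------------|------------|----------|------------|----------|
| 64 | 1.01 | **1.00** | 13.45 | 1.13 |
| 1024 | 1.01 | **1.00** | 13.53 | 1.14 |
| 65536 | 1.01 | **1.00** | 13.99 | 1.12 |
| 1048576 | 1.02 | **1.00** | 13.29 | 1.12 |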

late_non_ascii

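| Input Size (bytes) | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
|--------------------|------------|----------|------------|----------|
| 64 | **1.00** | 1.01 | 13.37 | 1.13 |
| 1024 | 1.10 | **1.00** | 42.42 | 1.95 |
| 65536 | **1.00** | 1.06 | 42.22 | 1.73 |
| 1048576 | **1.00** | 1.03 | 34.73 | 1.46 |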

pure_ascii

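| Input Size (bytes) | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
|--------------------|------------|----------|------------|----------|
| 4 | 1.03 | **1.00** | 1.75 | 1.32 |
| 8 | **1.00** | 1.14 | 3.89 | 2.06 |
| 16 | **1.00** | 1.04 | 1.13 | 1.62 |
| 32 | 1.07 | 1.19 | 5.11 | **1.00** |
| 64 | **1.00** | 1.13 | 13.32 | 1.57 |
| 128 | **1.00** | 1.01 | 19.97 | 1.55 |
| 256 | **1.00** | 1.02 | 27.77 | 1.61 |
| 1024 | **1.00** | 1.02 | 41.34 | 1.84 |
| 4096 | 1.02 | **1.00** | 45.61 | 1.98 |
| 16384 | 1.01 | **1.00** | 48.67 | 2.04 |
| 65536 | **1.00** | 1.03 | 43.86 | 1.77 |
| 262144 | **1.00** | 1.06 | 41.44 | 1.79 |
| 1048576 | 1.02 | **1.00** | 35.36 | 1.44 |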
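
For readers who want to reproduce numbers of this shape without the Criterion setup, the sketch below shows one way to build the three input classes and to normalize timings so that the fastest variant reads 1.00. Everything here is an assumption for illustration: the builder names only mirror the benchmark group names, the thread does not say where exactly the non-ASCII byte sits in the PR's benchmarks (the first and last byte are used here), and `time_is_ascii` is a crude stand-in for a real benchmark harness.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical input builders named after the benchmark groups.
fn pure_ascii(len: usize) -> Vec<u8> {
    vec![b'a'; len]
}

/// Assumption: "early" means the very first byte is non-ASCII.
fn early_non_ascii(len: usize) -> Vec<u8> {
    let mut v = pure_ascii(len);
    if let Some(first) = v.first_mut() {
        *first = 0xFF;
    }
    v
}

/// Assumption: "late" means the very last byte is non-ASCII.
fn late_non_ascii(len: usize) -> Vec<u8> {
    let mut v = pure_ascii(len);
    if let Some(last) = v.last_mut() {
        *last = 0xFF;
    }
    v
}

/// Crude timing loop: average nanoseconds per `is_ascii` call.
fn time_is_ascii(input: &[u8], iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from hoisting or deleting the call.
        black_box(black_box(input).is_ascii());
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}

/// Divides every timing by the smallest one, so the fastest entry is 1.00.
fn relative_to_fastest(times: &[f64]) -> Vec<f64> {
    let fastest = times.iter().copied().fold(f64::INFINITY, f64::min);
    times.iter().map(|t| t / fastest).collect()
}
```

With this normalization a cell such as 13.45 means that variant took 13.45 times as long as the fastest one in the same row.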

Reproduction / Test Projects

Standalone validation tools: https://github.com/bonega/is-ascii-fix-validation

bench/ - Criterion benchmarks for SSE2 vs AVX-512 comparison
fuzz/ - Compares old/new implementations with libfuzzer
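
The idea behind the fuzz target, comparing two implementations on the same input, can be illustrated without libfuzzer. The sketch below is not taken from the linked repository: it replaces coverage-guided fuzzing with a small deterministic xorshift generator, and the names `is_ascii_reference`, `XorShift64`, and `find_mismatch` are hypothetical.

```rust
/// Byte-at-a-time oracle: simple enough to serve as the reference.
fn is_ascii_reference(bytes: &[u8]) -> bool {
    bytes.iter().all(|b| *b < 0x80)
}

/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Returns the first generated input on which `candidate` disagrees with
/// the oracle, or `None` if all `rounds` inputs agree.
fn find_mismatch(candidate: fn(&[u8]) -> bool, rounds: usize, seed: u64) -> Option<Vec<u8>> {
    // xorshift must not start from an all-zero state.
    let mut rng = XorShift64(seed.max(1));
    for _ in 0..rounds {
        let len = (rng.next_u64() % 300) as usize;
        let mut input = vec![b'x'; len];
        // Roughly half of the non-empty inputs get one non-ASCII byte
        // at a random position.
        if len > 0 && rng.next_u64() % 2 == 0 {
            let pos = (rng.next_u64() % len as u64) as usize;
            input[pos] = 0x80 | (rng.next_u64() as u8);
        }
        if candidate(&input) != is_ascii_reference(&input) {
            return Some(input);
        }
    }
    None
}
```

Calling `find_mismatch(|b: &[u8]| b.is_ascii(), 1000, 42)` should return `None`. A candidate that only inspects the first 64 bytes would be expected to produce a counterexample quickly, because lengths range up to 299 and the non-ASCII byte is placed at random.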

Relates to: llvm/llvm-project#176906

rustbot · 2026-01-24T19:27:45Z

r? @joboet

rustbot has assigned @joboet.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

bonega · 2026-01-24T19:31:35Z

r? @folkertdev

folkertdev · 2026-01-24T19:40:44Z

Neat!

Can you remove the mentions of GitHub issues/PRs from the commit message? (They spam those pages.) The PR description is where that sort of extra context can go.

Use explicit SSE2 intrinsics to avoid LLVM's broken AVX-512 auto-vectorization which generates ~31 kshiftrd instructions.

Performance
- AVX-512: 34-48x faster
- SSE2: 1.5-2x faster

Improves on earlier pr

folkertdev · 2026-01-24T22:29:20Z

I also ran this on a machine without AVX-512 (AVX2 only), and it is better across the board there as well. Thanks for looking into this!

@bors r+

rust-bors · 2026-01-24T22:29:23Z

📌 Commit a72f68e has been approved by folkertdev

It is now in the queue for this repository.

@folkertdev

…erformance, r=folkertdev

Improve is_ascii performance on x86_64 with explicit SSE2 intrinsics

Adds assembly test to verify:

- `kshiftrd`/`kshiftrq` are NOT generated
- `pmovmskb`/`vpor` ARE generated

…uwer

Rollup of 11 pull requests

Successful merges:

- #145393 (Add codegen test for removing trailing zeroes from `NonZero`)
- #148764 (ptr_aligment_type: add more APIs)
- #149869 (std: avoid tearing `dbg!` prints)
- #150065 (add CSE optimization tests for iterating over slice)
- #150842 (Fix(lib/win/thread): Ensure `Sleep`'s usage passes over the requested duration under Win7)
- #151505 (Various refactors to the proc_macro bridge)
- #151560 (relnotes: fix 1.93's `as_mut_array` methods)
- #151611 (Improve is_ascii performance on x86_64 with explicit SSE2 intrinsics)
- #151317 (x86 soft-float feature: mark it as forbidden rather than unstable)
- #151577 (Rename `DepKindStruct` to `DepKindVTable`)
- #151620 (Fix 'the the' typo in library/core/src/array/iter.rs)

matthiaskrgr · 2026-01-25T06:44:46Z

@bors r-
failed in #151627 (comment)

rust-bors · 2026-01-25T06:44:50Z

Commit a72f68e has been unapproved.

The SSE2 helper function is not inlined across crate boundaries, so we cannot verify the codegen in an assembly test. The fix is still verified by the absence of performance regression.

bonega · 2026-01-25T08:50:29Z

I dropped the assembly tests for x86_64, since the SSE2 function is not inlined and its codegen therefore cannot be checked.

Adding #[inline(always)] is an option, but I don't like it:

LLVM should get to decide whether to inline or not.
Extra code would be added to all call sites.

folkertdev · 2026-01-25T17:51:23Z

You can add #[inline]. I agree that inline(always) is too much, but #[inline] does the job at -Copt-level=3, and it is already used for all of the other is_ascii helpers. The loop here only contributes marginally to the function, and in many simple cases will get mostly optimized out.

I also needed to swap the por and pmovmskb lines, so that they agree with their order in the algorithm.

folkertdev · 2026-01-25T17:52:18Z

In addition, with #[inline] we can keep the test coverage for the generated instructions. I think that is valuable.
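
A check of this kind could look roughly like the sketch below. It is an illustration only: the PR's actual test file, its revisions, and its compiler flags are not shown in this thread, and the function name `is_ascii_codegen_probe` is hypothetical. The `//@` lines follow the compiletest directive conventions used in rustc's test suite, and the `CHECK` lines are FileCheck patterns. A real test file would normally also start with `#![crate_type = "lib"]`.

```rust
//@ assembly-output: emit-asm
//@ compile-flags: -Copt-level=3
//@ only-x86_64

// The positive CHECK lines are matched in order, so `por` is listed before
// `pmovmskb`, following the order in which the algorithm uses them.
// Each CHECK-NOT only covers the region between the neighbouring positive
// matches, here between the label and the first `por`.

// CHECK-LABEL: is_ascii_codegen_probe:
// CHECK-NOT: kshiftrd
// CHECK-NOT: kshiftrq
// CHECK: por
// CHECK: pmovmskb
#[unsafe(no_mangle)]
pub fn is_ascii_codegen_probe(bytes: &[u8]) -> bool {
    // With an `#[inline]` helper behind `is_ascii`, the SSE2 loop can be
    // inlined into this function, which makes its instructions visible here.
    bytes.is_ascii()
}
```

`#[inline]` makes the helper's body available for inlining in downstream crates while still leaving the decision to LLVM, which is why it is enough for a test like this without forcing the code into every call site.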

bonega · 2026-01-25T19:10:21Z

@folkertdev I brought the tests back with #[inline].
They may still be a bit brittle, depending on which optimizations LLVM chooses, but in the worst case we just fix the test if it breaks in the future.

Thank you

folkertdev · 2026-01-25T23:21:15Z

@bors r+

rust-bors · 2026-01-25T23:21:18Z

📌 Commit dbc870a has been approved by folkertdev

It is now in the queue for this repository.

rustbot assigned joboet Jan 24, 2026

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jan 24, 2026

bonega force-pushed the improve-is-slice-is-ascii-performance branch from 9b57e05 to e166f69 Compare January 24, 2026 19:28

rustbot assigned folkertdev and unassigned joboet Jan 24, 2026

Fix is_ascii performance on x86_64 with explicit SSE2 intrinsics

a72f68e


bonega force-pushed the improve-is-slice-is-ascii-performance branch from e166f69 to a72f68e Compare January 24, 2026 21:04

rust-bors bot added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 24, 2026

JonathanBrouwer mentioned this pull request Jan 25, 2026

Rollup of 11 pull requests #151627

Closed

rust-bors bot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Jan 25, 2026

Remove x86_64 assembly test for is_ascii

cbcd869


Mark is_ascii_sse2 as #[inline]

dbc870a

rust-bors bot added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jan 25, 2026
