@bonega
Contributor

@bonega bonega commented Jan 24, 2026

Summary

Improves slice::is_ascii performance on SSE2 targets by roughly 1.5-2x on larger inputs.
AVX-512 keeps similar performance characteristics.

This builds on the work already merged in #151259.
In particular, this PR improves the default SSE2 performance, so I no longer consider this a temporary fix.
Thanks to @folkertdev for prompting me to consider as_chunk again.

The implementation:

  • Uses 64-byte chunks with 4x 16-byte SSE2 loads OR'd together
  • Extracts the MSB mask with a single pmovmskb instruction
  • Falls back to usize-at-a-time SWAR for inputs < 64 bytes
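
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function names `is_ascii_sketch` and `is_ascii_swar` are hypothetical, and using the SWAR routine for the tail of longer inputs is an assumption made here for simplicity.

```rust
/// SWAR fallback (hypothetical name): test one `usize` worth of bytes at a time.
fn is_ascii_swar(bytes: &[u8]) -> bool {
    const WORD: usize = core::mem::size_of::<usize>();
    // 0x80 repeated in every byte lane of a usize.
    const HIGH_BITS: usize = usize::from_ne_bytes([0x80; WORD]);

    let mut words = bytes.chunks_exact(WORD);
    for w in &mut words {
        let mut buf = [0u8; WORD];
        buf.copy_from_slice(w);
        // Any set high bit means a byte >= 0x80, i.e. non-ASCII.
        if (usize::from_ne_bytes(buf) & HIGH_BITS) != 0 {
            return false;
        }
    }
    // Leftover bytes that do not fill a whole word.
    words.remainder().iter().all(|&b| b < 0x80)
}

/// 64-byte chunks, four 16-byte loads OR'd together, one movemask per chunk.
#[cfg(target_arch = "x86_64")]
fn is_ascii_sketch(bytes: &[u8]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8, _mm_or_si128};

    let mut chunks = bytes.chunks_exact(64);
    for chunk in &mut chunks {
        let p = chunk.as_ptr() as *const __m128i;
        // SAFETY: `chunk` is exactly 64 bytes, so four unaligned 16-byte loads
        // stay in bounds; SSE2 is part of the x86_64 baseline.
        let mask = unsafe {
            let a = _mm_loadu_si128(p);
            let b = _mm_loadu_si128(p.add(1));
            let c = _mm_loadu_si128(p.add(2));
            let d = _mm_loadu_si128(p.add(3));
            let combined = _mm_or_si128(_mm_or_si128(a, b), _mm_or_si128(c, d));
            // Collects the most significant bit of each of the 16 bytes.
            _mm_movemask_epi8(combined)
        };
        if mask != 0 {
            return false;
        }
    }
    // Assumption of this sketch: the tail (< 64 bytes) goes through SWAR.
    is_ascii_swar(chunks.remainder())
}

/// On other architectures the sketch simply uses the SWAR path.
#[cfg(not(target_arch = "x86_64"))]
fn is_ascii_sketch(bytes: &[u8]) -> bool {
    is_ascii_swar(bytes)
}
```

OR-ing the four vectors first means only one mask extraction and one branch are needed per 64 bytes, which is presumably where most of the gain on SSE2 comes from.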

Performance impact (vs before #151259):

  • AVX-512: 34-48x faster

  • SSE2: 1.5-2x faster

    Benchmark Results

    Benchmarked on AMD Ryzen 9 9950X (AVX-512 capable). Values show relative performance (1.00 = fastest).
    Throughput tops out at 139 GB/s for large inputs.

    early_non_ascii

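    | Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
    |------------|------------|----------|------------|----------|
    | 64 | 1.01 | 1.00 | 13.45 | 1.13 |
    | 1024 | 1.01 | 1.00 | 13.53 | 1.14 |
    | 65536 | 1.01 | 1.00 | 13.99 | 1.12 |
    | 1048576 | 1.02 | 1.00 | 13.29 | 1.12 |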

    late_non_ascii

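    | Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
    |------------|------------|----------|------------|----------|
    | 64 | 1.00 | 1.01 | 13.37 | 1.13 |
    | 1024 | 1.10 | 1.00 | 42.42 | 1.95 |
    | 65536 | 1.00 | 1.06 | 42.22 | 1.73 |
    | 1048576 | 1.00 | 1.03 | 34.73 | 1.46 |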

    pure_ascii

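    | Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
    |------------|------------|----------|------------|----------|
    | 4 | 1.03 | 1.00 | 1.75 | 1.32 |
    | 8 | 1.00 | 1.14 | 3.89 | 2.06 |
    | 16 | 1.00 | 1.04 | 1.13 | 1.62 |
    | 32 | 1.07 | 1.19 | 5.11 | 1.00 |
    | 64 | 1.00 | 1.13 | 13.32 | 1.57 |
    | 128 | 1.00 | 1.01 | 19.97 | 1.55 |
    | 256 | 1.00 | 1.02 | 27.77 | 1.61 |
    | 1024 | 1.00 | 1.02 | 41.34 | 1.84 |
    | 4096 | 1.02 | 1.00 | 45.61 | 1.98 |
    | 16384 | 1.01 | 1.00 | 48.67 | 2.04 |
    | 65536 | 1.00 | 1.03 | 43.86 | 1.77 |
    | 262144 | 1.00 | 1.06 | 41.44 | 1.79 |
    | 1048576 | 1.02 | 1.00 | 35.36 | 1.44 |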
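
To make the table conventions concrete, here is a small sketch of how inputs for the three scenarios and the "1.00 = fastest" normalization could look. Everything in it is an assumption for illustration: the generator names mirror the table headings, but the position of the non-ASCII byte (first or last) and the helper `relative` are not taken from the actual benchmark code.

```rust
/// All-ASCII input of the given length.
fn pure_ascii(len: usize) -> Vec<u8> {
    vec![b'a'; len]
}

/// Assumed layout: the non-ASCII byte is the very first one.
fn early_non_ascii(len: usize) -> Vec<u8> {
    let mut v = pure_ascii(len);
    if let Some(first) = v.first_mut() {
        *first = 0x80;
    }
    v
}

/// Assumed layout: the non-ASCII byte is the very last one.
fn late_non_ascii(len: usize) -> Vec<u8> {
    let mut v = pure_ascii(len);
    if let Some(last) = v.last_mut() {
        *last = 0x80;
    }
    v
}

/// Divide every timing by the smallest one, so the fastest entry reads 1.00.
fn relative(timings: &[f64]) -> Vec<f64> {
    let fastest = timings.iter().copied().fold(f64::INFINITY, f64::min);
    timings.iter().map(|t| t / fastest).collect()
}
```

Under this reading, a value of 13.45 in a row means that variant took about 13.45 times as long as the fastest variant for that input size.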

Reproduction / Test Projects

Standalone validation tools: https://github.com/bonega/is-ascii-fix-validation

  • bench/ - Criterion benchmarks for SSE2 vs AVX-512 comparison
  • fuzz/ - Compares old/new implementations with libfuzzer
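
The idea behind the fuzz target is differential testing: feed the same bytes to two implementations and flag any disagreement. The linked repository uses libFuzzer for this; the sketch below swaps in a tiny xorshift generator so that it needs no external crates. The names `differential_check` and `reference_is_ascii` are hypothetical.

```rust
/// Tiny xorshift PRNG so the sketch has no dependencies.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Obviously-correct reference: every byte must be below 0x80.
fn reference_is_ascii(bytes: &[u8]) -> bool {
    bytes.iter().all(|&b| b < 0x80)
}

/// Runs `candidate` against the reference on pseudo-random inputs.
/// Returns the first input on which the two disagree, if any.
fn differential_check(candidate: fn(&[u8]) -> bool, rounds: usize) -> Result<(), Vec<u8>> {
    let mut state = 0x9E37_79B9_7F4A_7C15u64;
    for _ in 0..rounds {
        let len = (xorshift(&mut state) % 300) as usize;
        // Start from pure ASCII bytes.
        let mut input: Vec<u8> = (0..len)
            .map(|_| (xorshift(&mut state) & 0x7F) as u8)
            .collect();
        // Roughly half the time, plant a single non-ASCII byte somewhere.
        if len > 0 && (xorshift(&mut state) & 1) == 1 {
            let pos = (xorshift(&mut state) % len as u64) as usize;
            input[pos] |= 0x80;
        }
        if candidate(&input) != reference_is_ascii(&input) {
            return Err(input);
        }
    }
    Ok(())
}
```

A call such as `differential_check(|b: &[u8]| b.is_ascii(), 10_000)` would then exercise the standard library implementation on lengths that cover the short-input path, whole 64-byte chunks, and odd tails.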

Relates to: llvm/llvm-project#176906

@rustbot added labels on Jan 24, 2026:
  • S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.)
  • T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)
  • T-libs (Relevant to the library team, which will review and decide on the PR/issue.)
@rustbot
Collaborator

rustbot commented Jan 24, 2026

r? @joboet

rustbot has assigned @joboet.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@bonega bonega force-pushed the improve-is-slice-is-ascii-performance branch from 9b57e05 to e166f69 Compare January 24, 2026 19:28
@bonega
Contributor Author

bonega commented Jan 24, 2026

r? @folkertdev

@rustbot rustbot assigned folkertdev and unassigned joboet Jan 24, 2026
@folkertdev
Contributor

Neat!

Can you remove the mentions of GitHub issues/PRs from the commit message? (They spam those pages.) The PR description is where that sort of extra context can go.

Use explicit SSE2 intrinsics to avoid LLVM's broken AVX-512
auto-vectorization which generates ~31 kshiftrd instructions.

Performance
- AVX-512: 34-48x faster
- SSE2: 1.5-2x faster

Improves on earlier pr
@bonega bonega force-pushed the improve-is-slice-is-ascii-performance branch from e166f69 to a72f68e Compare January 24, 2026 21:04
@folkertdev
Contributor

I ran this on a non-AVX-512 machine too (just AVX2), and it is better across the board there as well. Thanks for looking into this!

@bors r+

@rust-bors
Contributor

rust-bors bot commented Jan 24, 2026

📌 Commit a72f68e has been approved by folkertdev

It is now in the queue for this repository.

@rust-bors bot changed labels on Jan 24, 2026:
  • Added: S-waiting-on-bors (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.)
  • Removed: S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.)
JonathanBrouwer added a commit to JonathanBrouwer/rust that referenced this pull request Jan 25, 2026
…erformance, r=folkertdev

Improve is_ascii performance on x86_64 with explicit SSE2 intrinsics

# Summary

Improves `slice::is_ascii` performance on SSE2 targets by roughly 1.5-2x on larger inputs.
AVX-512 keeps similar performance characteristics.

This builds on the work already merged in rust-lang#151259.
In particular, this PR improves the default SSE2 performance, so I no longer consider this a temporary fix.
Thanks to @folkertdev for prompting me to consider `as_chunk` again.

# The implementation:
- Uses 64-byte chunks with 4x 16-byte SSE2 loads OR'd together
- Extracts the MSB mask with a single `pmovmskb` instruction
- Falls back to usize-at-a-time SWAR for inputs < 64 bytes

# Performance impact (vs before rust-lang#151259):
- AVX-512: 34-48x faster
- SSE2: 1.5-2x faster

  <details>
  <summary>Benchmark Results (click to expand)</summary>

  Benchmarked on AMD Ryzen 9 9950X (AVX-512 capable). Values show relative performance (1.00 = fastest).
  Tops out at 139GB/s for large inputs.

  ### early_non_ascii

  | Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
  |------------|------------|----------|------------|----------|
  | 64 | 1.01 | **1.00** | 13.45 | 1.13 |
  | 1024 | 1.01 | **1.00** | 13.53 | 1.14 |
  | 65536 | 1.01 | **1.00** | 13.99 | 1.12 |
  | 1048576 | 1.02 | **1.00** | 13.29 | 1.12 |

  ### late_non_ascii

  | Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
  |------------|------------|----------|------------|----------|
  | 64 | **1.00** | 1.01 | 13.37 | 1.13 |
  | 1024 | 1.10 | **1.00** | 42.42 | 1.95 |
  | 65536 | **1.00** | 1.06 | 42.22 | 1.73 |
  | 1048576 | **1.00** | 1.03 | 34.73 | 1.46 |

  ### pure_ascii

  | Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
  |------------|------------|----------|------------|----------|
  | 4 | 1.03 | **1.00** | 1.75 | 1.32 |
  | 8 | **1.00** | 1.14 | 3.89 | 2.06 |
  | 16 | **1.00** | 1.04 | 1.13 | 1.62 |
  | 32 | 1.07 | 1.19 | 5.11 | **1.00** |
  | 64 | **1.00** | 1.13 | 13.32 | 1.57 |
  | 128 | **1.00** | 1.01 | 19.97 | 1.55 |
  | 256 | **1.00** | 1.02 | 27.77 | 1.61 |
  | 1024 | **1.00** | 1.02 | 41.34 | 1.84 |
  | 4096 | 1.02 | **1.00** | 45.61 | 1.98 |
  | 16384 | 1.01 | **1.00** | 48.67 | 2.04 |
  | 65536 | **1.00** | 1.03 | 43.86 | 1.77 |
  | 262144 | **1.00** | 1.06 | 41.44 | 1.79 |
  | 1048576 | 1.02 | **1.00** | 35.36 | 1.44 |

  </details>

Adds assembly test to verify:
- `kshiftrd`/`kshiftrq` are NOT generated
- `pmovmskb`/`vpor` ARE generated

## Reproduction / Test Projects

Standalone validation tools: https://github.com/bonega/is-ascii-fix-validation

- `bench/` - Criterion benchmarks for SSE2 vs AVX-512 comparison
- `fuzz/` - Compares old/new implementations with libfuzzer

Relates to: llvm/llvm-project#176906
rust-bors bot pushed a commit that referenced this pull request Jan 25, 2026
…uwer

Rollup of 11 pull requests

Successful merges:

 - #145393 (Add codegen test for removing trailing zeroes from `NonZero`)
 - #148764 (ptr_aligment_type: add more APIs)
 - #149869 (std: avoid tearing `dbg!` prints)
 - #150065 (add CSE optimization tests for iterating over slice)
 - #150842 (Fix(lib/win/thread): Ensure `Sleep`'s usage passes over the requested duration under Win7)
 - #151505 (Various refactors to the proc_macro bridge)
 - #151560 (relnotes: fix 1.93's `as_mut_array` methods)
 - #151611 (Improve is_ascii performance on x86_64 with explicit SSE2 intrinsics)
 - #151317 (x86 soft-float feature: mark it as forbidden rather than unstable)
 - #151577 (Rename `DepKindStruct` to `DepKindVTable`)
 - #151620 (Fix 'the the' typo in library/core/src/array/iter.rs)
@matthiaskrgr
Member

@bors r-
failed in #151627 (comment)

@rust-bors bot changed labels on Jan 25, 2026:
  • Added: S-waiting-on-author (Status: This is awaiting some action (such as code changes or more information) from the author.)
  • Removed: S-waiting-on-bors (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.)
@rust-bors
Contributor

rust-bors bot commented Jan 25, 2026

Commit a72f68e has been unapproved.

The SSE2 helper function is not inlined across crate boundaries,
so we cannot verify the codegen in an assembly test. The fix is
still verified by the absence of performance regression.
@bonega
Contributor Author

bonega commented Jan 25, 2026

I dropped the assembly tests for x86-64, since the SSE2 function is not inlined and its codegen therefore cannot be checked by the test.

Adding #[inline(always)] is an option, but I don't like it:

  • LLVM should get to decide to inline or not.
  • Extra code would be added to all callsites.

@folkertdev
Contributor

You can add #[inline]. I agree that inline(always) is too much, but #[inline] does the job at -Copt-level=3, and it is already used for all of the other is_ascii helpers. The loop here only contributes marginally to the function, and in many simple cases will get mostly optimized out.

I also needed to swap the por and pmovmskb lines, so that they agree with their order in the algorithm.
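
The difference between the two attributes discussed here can be illustrated with a pair of hypothetical helpers (these names and bodies are illustrative only and are not the functions from the PR). As far as the Rust reference describes it, `#[inline]` is a suggestion that also makes the body available for inlining in other crates, while `#[inline(always)]` asks for inlining at every call site.

```rust
/// Hypothetical helper standing in for the SSE2 routine.
/// `#[inline]` leaves the final decision to the optimizer.
#[inline]
pub fn any_high_bit(bytes: &[u8]) -> bool {
    (bytes.iter().fold(0u8, |acc, &b| acc | b) & 0x80) != 0
}

/// Same body, but the attribute requests inlining at every call site,
/// which can duplicate the loop into each caller.
#[inline(always)]
pub fn any_high_bit_forced(bytes: &[u8]) -> bool {
    (bytes.iter().fold(0u8, |acc, &b| acc | b) & 0x80) != 0
}

/// A caller in the style of `is_ascii`: behaviour is identical either way;
/// only the generated code layout may differ.
pub fn is_ascii_via_helper(bytes: &[u8]) -> bool {
    !any_high_bit(bytes)
}
```

Both variants compute the same result, so the choice only affects code size and whether a test in another crate can see the helper's instructions in the caller.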

@folkertdev
Contributor

In addition, with #[inline] we can keep the test coverage for the generated instructions. I think that is valuable.
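
For readers unfamiliar with assembly tests, the following is a rough guess at the shape such a test could take; the actual test file from the PR is not shown in this conversation. It assumes the usual compiletest directives and FileCheck patterns, and the function name `is_ascii_probe` is hypothetical. The `CHECK` lines are matched in order, which is why `por` comes before `pmovmskb`.

```rust
//@ assembly-output: emit-asm
//@ compile-flags: -Copt-level=3
//@ only-x86_64

// The mask-shuffling instructions from the bad AVX-512 codegen must not appear.
// CHECK-NOT: kshiftrd
// CHECK-NOT: kshiftrq
// The OR of the loaded vectors comes first, then the single mask extraction.
// CHECK: por
// CHECK: pmovmskb
pub fn is_ascii_probe(bytes: &[u8]) -> bool {
    // Relies on the helper being `#[inline]` so its body shows up here.
    bytes.is_ascii()
}
```

Because the patterns are plain substrings, `por` and `pmovmskb` would also match the VEX-encoded `vpor` and `vpmovmskb` forms.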

@bonega
Contributor Author

bonega commented Jan 25, 2026

@folkertdev I brought the tests back with #[inline].
They may still be a bit brittle, depending on which optimizations LLVM chooses, but in the worst case we just fix the test if it breaks in the future.

Thank you

@folkertdev
Contributor

@bors r+

@rust-bors
Contributor

rust-bors bot commented Jan 25, 2026

📌 Commit dbc870a has been approved by folkertdev

It is now in the queue for this repository.

@rust-bors bot changed labels on Jan 25, 2026:
  • Added: S-waiting-on-bors (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.)
  • Removed: S-waiting-on-author (Status: This is awaiting some action (such as code changes or more information) from the author.)