-
Notifications
You must be signed in to change notification settings - Fork 13.9k
slice/ascii: Optimize eq_ignore_ascii_case
with auto-vectorization
#147436
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
b031c51
7b88f48
a5ba248
31bf836
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -60,6 +60,25 @@ impl [u8] { | |
return false; | ||
} | ||
|
||
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))] | ||
{ | ||
const CHUNK_SIZE: usize = 16; | ||
// The following function has two invariants: | ||
// 1. The slice lengths must be equal, which we checked above. | ||
// 2. The slice lengths must greater than or equal to N, which this | ||
// if-statement is checking. | ||
if self.len() >= CHUNK_SIZE { | ||
return self.eq_ignore_ascii_case_chunks::<CHUNK_SIZE>(other); | ||
} | ||
} | ||
|
||
self.eq_ignore_ascii_case_simple(other) | ||
} | ||
|
||
/// ASCII case-insensitive equality check without chunk-at-a-time | ||
/// optimization. | ||
#[inline] | ||
const fn eq_ignore_ascii_case_simple(&self, other: &[u8]) -> bool { | ||
// FIXME(const-hack): This implementation can be reverted when | ||
// `core::iter::zip` is allowed in const. The original implementation: | ||
// self.len() == other.len() && iter::zip(self, other).all(|(a, b)| a.eq_ignore_ascii_case(b)) | ||
|
@@ -78,6 +97,65 @@ impl [u8] { | |
true | ||
} | ||
|
||
/// Optimized version of `eq_ignore_ascii_case` to process chunks at a time. | ||
/// | ||
/// Platforms that have SIMD instructions may benefit from this | ||
/// implementation over `eq_ignore_ascii_case_simple`. | ||
/// | ||
/// # Invariants | ||
/// | ||
/// The caller must guarantee that the slices are equal in length, and the | ||
/// slice lengths are greater than or equal to `N` bytes. | ||
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))] | ||
#[inline] | ||
const fn eq_ignore_ascii_case_chunks<const N: usize>(&self, other: &[u8]) -> bool { | ||
// FIXME(const-hack): The while-loops that follow should be replaced by | ||
// for-loops when available in const. | ||
|
||
let (self_chunks, self_rem) = self.as_chunks::<N>(); | ||
let (other_chunks, _) = other.as_chunks::<N>(); | ||
|
||
// Branchless check to encourage auto-vectorization | ||
#[inline(always)] | ||
const fn eq_ignore_ascii_inner<const L: usize>(lhs: &[u8; L], rhs: &[u8; L]) -> bool { | ||
let mut equal_ascii = true; | ||
let mut j = 0; | ||
while j < L { | ||
equal_ascii &= lhs[j].eq_ignore_ascii_case(&rhs[j]); | ||
j += 1; | ||
} | ||
|
||
equal_ascii | ||
} | ||
|
||
// Process the chunks, returning early if an inequality is found | ||
let mut i = 0; | ||
while i < self_chunks.len() && i < other_chunks.len() { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, it feels wrong that we need to check both lengths here. The answer can never actually differ, right? (Because the caller checks equal-length first anyway.) The function boundary split feels awkward, since it's losing information -- both in simple and non-simple -- that would be helpful to the optimizer when it comes to removing bounds checks. What does the actual loop condition look like in llvm or asm here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It does feel wrong, and you are correct about how they can never differ. I included this to eliminate an extra branch and panic in the loop. The comparison gets optimized out so there is no extra length check. I never got around to filing this one as a bug report.
Left is with both checks, right is with only 1 Difference in the loop condition is this on the right which jumps to a panic .LBB0_3:
sub r8, 1
jb .LBB0_15 edit: here's a more minimal version with just the chunk loop https://rust.godbolt.org/z/MGEv3qszY There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Reported as #147953 |
||
if !eq_ignore_ascii_inner(&self_chunks[i], &other_chunks[i]) { | ||
return false; | ||
} | ||
i += 1; | ||
} | ||
|
||
// Check the length invariant which is necessary for the tail-handling | ||
// logic to be correct. This should have been upheld by the caller, | ||
// otherwise lengths less than N will compare as true without any | ||
// checking. | ||
debug_assert!(self.len() >= N); | ||
|
||
// If there are remaining tails, load the last N bytes in the slices to | ||
// avoid falling back to per-byte checking. | ||
Comment on lines
+146
to
+147
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is only correct because of a check in the caller that's not documented in the comments on this method nor in the code here. I would suggest at least having a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed, I've been thinking about this since my last push. I think calling this function with a const generic parameter might be a better choice to encode the information, so I'm going to change it to that along with a |
||
if !self_rem.is_empty() { | ||
if let (Some(a_rem), Some(b_rem)) = (self.last_chunk::<N>(), other.last_chunk::<N>()) { | ||
if !eq_ignore_ascii_inner(a_rem, b_rem) { | ||
return false; | ||
} | ||
} | ||
} | ||
|
||
true | ||
} | ||
|
||
/// Converts this slice to its ASCII upper case equivalent in-place. | ||
/// | ||
/// ASCII letters 'a' to 'z' are mapped to 'A' to 'Z', | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
use test::{Bencher, black_box}; | ||
|
||
use super::corpora::*; | ||
|
||
#[bench] | ||
fn bench_str_under_8_bytes_eq(b: &mut Bencher) { | ||
let s = black_box("foo"); | ||
let other = black_box("foo"); | ||
b.iter(|| assert!(s.eq_ignore_ascii_case(other))) | ||
} | ||
|
||
#[bench] | ||
fn bench_str_of_8_bytes_eq(b: &mut Bencher) { | ||
let s = black_box(en::TINY); | ||
let other = black_box(en::TINY); | ||
b.iter(|| assert!(s.eq_ignore_ascii_case(other))) | ||
} | ||
|
||
#[bench] | ||
fn bench_str_17_bytes_eq(b: &mut Bencher) { | ||
let s = black_box(&en::SMALL[..17]); | ||
let other = black_box(&en::SMALL[..17]); | ||
b.iter(|| assert!(s.eq_ignore_ascii_case(other))) | ||
} | ||
|
||
#[bench] | ||
fn bench_str_31_bytes_eq(b: &mut Bencher) { | ||
let s = black_box(&en::SMALL[..31]); | ||
let other = black_box(&en::SMALL[..31]); | ||
b.iter(|| assert!(s.eq_ignore_ascii_case(other))) | ||
} | ||
|
||
#[bench] | ||
fn bench_medium_str_eq(b: &mut Bencher) { | ||
let s = black_box(en::MEDIUM); | ||
let other = black_box(en::MEDIUM); | ||
b.iter(|| assert!(s.eq_ignore_ascii_case(other))) | ||
} | ||
|
||
#[bench] | ||
fn bench_large_str_eq(b: &mut Bencher) { | ||
let s = black_box(en::LARGE); | ||
let other = black_box(en::LARGE); | ||
b.iter(|| assert!(s.eq_ignore_ascii_case(other))) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
//@ compile-flags: -Copt-level=3 | ||
//@ only-x86_64 | ||
#![crate_type = "lib"] | ||
|
||
// Ensure that the optimized variant of the function gets auto-vectorized and | ||
// that the inner helper function is inlined. | ||
// CHECK-LABEL: @eq_ignore_ascii_case_autovectorized | ||
#[no_mangle] | ||
pub fn eq_ignore_ascii_case_autovectorized(s: &str, other: &str) -> bool { | ||
// CHECK: load <16 x i8> | ||
// CHECK: load <16 x i8> | ||
// CHECK: bitcast <16 x i1> | ||
// CHECK-NOT: call {{.*}}eq_ignore_ascii_inner | ||
// CHECK-NOT: panic | ||
s.eq_ignore_ascii_case(other) | ||
} |
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix: you don't need
always
here. It's a non-generic function which is why it went badly with no attribute, but#[inline]
is sufficient.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without
always
, it fails the filecheck test I've written and doesn't seem to inline as shown in the godbolt.Left is
inline(always)
, right isinline
https://rust.godbolt.org/z/Px6nMPWvK
edit: I realize above I included const generics.
With all generics removed, it still doesn't seem to inline the inner call without
always
https://rust.godbolt.org/z/4Tq5hWY6T