Optimize escape_ascii #125340
base: master

Conversation
rustbot has assigned @Mark-Simulacrum.

r? @Kobzol

I'm probably not the best person to review this, but I can try. I have the same question as here, though: do you have some (micro)benchmarks to show that this is an improvement? :)
Force-pushed from 6bfb89d to 8b94af3 (compare).
@Kobzol, what's the best way to do a benchmark for this? Just create a standalone crate with two versions of this function, or is there a recommended way to test against different commits in this repo?
Well, that depends. From the microbenchmark side, you could show e.g. on godbolt that this produces "objectively" better assembly. From the macrobenchmark side, you would probably bring some program that is actually improved by this change. Usually people have some explicit motivation for doing these kinds of optimizations, which is demonstrated by some change either in codegen or an improvement for some real-world code.
I have updated the Godbolt link in the PR description to reflect the current changes, i.e. 3 fewer jumps and 7 fewer instructions. I have also done a microbenchmark using the following source:

```rust
#![feature(ascii_char)]
#![feature(ascii_char_variants)]
#![feature(let_chains)]
#![feature(inline_const)]
#![feature(const_option)]

use core::ascii;
use core::ops::Range;

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

const HEX_DIGITS: [ascii::Char; 16] = *b"0123456789abcdef".as_ascii().unwrap();

#[inline]
const fn backslash<const N: usize>(a: ascii::Char) -> ([ascii::Char; N], Range<u8>) {
    const { assert!(N >= 2) };
    let mut output = [ascii::Char::Null; N];
    output[0] = ascii::Char::ReverseSolidus;
    output[1] = a;
    (output, 0..2)
}

#[inline]
const fn escape_ascii_before<const N: usize>(byte: u8) -> ([ascii::Char; N], Range<u8>) {
    const { assert!(N >= 4) };
    match byte {
        b'\t' => backslash(ascii::Char::SmallT),
        b'\r' => backslash(ascii::Char::SmallR),
        b'\n' => backslash(ascii::Char::SmallN),
        b'\\' => backslash(ascii::Char::ReverseSolidus),
        b'\'' => backslash(ascii::Char::Apostrophe),
        b'\"' => backslash(ascii::Char::QuotationMark),
        byte => {
            let mut output = [ascii::Char::Null; N];
            if let Some(c) = byte.as_ascii()
                && !byte.is_ascii_control()
            {
                output[0] = c;
                (output, 0..1)
            } else {
                let hi = HEX_DIGITS[(byte >> 4) as usize];
                let lo = HEX_DIGITS[(byte & 0xf) as usize];
                output[0] = ascii::Char::ReverseSolidus;
                output[1] = ascii::Char::SmallX;
                output[2] = hi;
                output[3] = lo;
                (output, 0..4)
            }
        }
    }
}

#[inline]
const fn escape_ascii_after<const N: usize>(byte: u8) -> ([ascii::Char; N], Range<u8>) {
    const { assert!(N >= 4) };
    let mut output = [ascii::Char::Null; N];
    // NOTE: This `match` is roughly ordered by the frequency of ASCII
    // characters for performance.
    match byte.as_ascii() {
        Some(
            c @ ascii::Char::QuotationMark
            | c @ ascii::Char::Apostrophe
            | c @ ascii::Char::ReverseSolidus,
        ) => backslash(c),
        Some(c) if !byte.is_ascii_control() => {
            output[0] = c;
            (output, 0..1)
        }
        Some(ascii::Char::LineFeed) => backslash(ascii::Char::SmallN),
        Some(ascii::Char::CarriageReturn) => backslash(ascii::Char::SmallR),
        Some(ascii::Char::CharacterTabulation) => backslash(ascii::Char::SmallT),
        _ => {
            let hi = HEX_DIGITS[(byte >> 4) as usize];
            let lo = HEX_DIGITS[(byte & 0xf) as usize];
            output[0] = ascii::Char::ReverseSolidus;
            output[1] = ascii::Char::SmallX;
            output[2] = hi;
            output[3] = lo;
            (output, 0..4)
        }
    }
}

pub fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("escape_ascii");
    group.sample_size(1000);
    for i in [b'a', b'Z', b'\"', b'\t', b'\n', b'\xff'] {
        let i_s = if let Some(c) = i.as_ascii() {
            format!("{c:?}")
        } else {
            format!("'\\x{i:02x}'")
        };
        group.bench_with_input(BenchmarkId::new("before", &i_s), &i, |b, i| {
            b.iter(|| escape_ascii_before::<4>(*i));
        });
        group.bench_with_input(BenchmarkId::new("after", &i_s), &i, |b, i| {
            b.iter(|| escape_ascii_after::<4>(*i));
        });
    }
    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```
Graph (unfortunately the Y-axis is not sorted by input):
Your benchmark was executed on a single-byte input? It would be good to also see how it behaves on something larger, e.g. a short/medium/long byte slice, to see the effects in practice.

Could you describe the motivation for this change? If I understand your comment correctly, "frequency of ASCII characters" means how often given characters appear in the input. It makes sense to me to optimize for the common case, which I would expect is that the input does not need to be escaped at all. So my intuition would be to start by first checking whether it's an alphabetic ASCII character, and then continue from there. So this optimization seems reasonable, in general. I just wonder if you have some use case where this escaping is an actual bottleneck and we could actually see some wins in practice?

Btw, in general, the fact that there are fewer instructions doesn't necessarily mean that the code will be faster.
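A hypothetical way to extend the single-byte benchmark to slice inputs, using only the standard library so it is easy to reproduce (`escape_slice`, the sizes, and the input mix are illustrative, not code from the PR):

```rust
use std::time::Instant;

// Illustrative slice-level escaper built on the stable per-byte
// `std::ascii::escape_default`, standing in for the PR's function.
fn escape_slice(input: &[u8], out: &mut Vec<u8>) {
    for &b in input {
        out.extend(std::ascii::escape_default(b));
    }
}

fn main() {
    for size in [16usize, 1024, 65536] {
        // Mostly-printable input with an occasional byte that needs escaping.
        let input: Vec<u8> = (0..size)
            .map(|i| if i % 64 == 0 { b'\n' } else { b'a' })
            .collect();
        let mut out = Vec::with_capacity(size * 4);
        let iters = 1_000;
        let start = Instant::now();
        for _ in 0..iters {
            out.clear();
            escape_slice(std::hint::black_box(&input), &mut out);
        }
        let per_byte = start.elapsed().as_nanos() as f64 / (iters as f64 * size as f64);
        println!("{size:>6} B: {per_byte:.2} ns/byte");
    }
}
```

Criterion's `Throughput::Bytes` would give the same per-byte view with proper statistics; the point is only that slice-level throughput, rather than single-byte latency, is what shows up in real programs.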
Hmm. Omitting the non-ASCII case, perhaps this could be done with a lookup table? You could squeeze it down to just 127 bytes if you use the eighth bit to determine if there should be a backslash, since the escaped character will only need 7 bits. This way, you don't need to worry about ordering things by prevalence. I have no idea what the current codegen looks like, so I don't know if it'd be much faster, but that feels like the best route to me.
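For illustration, one possible sketch of that table idea (hypothetical code, not the PR's implementation): a 128-entry table where the high bit means "prepend a backslash", the low 7 bits hold the character to emit, and zero means "fall through to the `\xNN` path":

```rust
// Hypothetical 128-entry table: high bit set => emit '\\' first,
// low 7 bits = the character to emit after it. Control characters
// stay 0 and fall through to the \xNN path below.
const ESCAPE: [u8; 128] = {
    let mut t = [0u8; 128];
    let mut i = 0x20;
    while i < 0x7f {
        t[i] = i as u8; // printable: emit as-is
        i += 1;
    }
    t[b'\t' as usize] = 0x80 | b't';
    t[b'\r' as usize] = 0x80 | b'r';
    t[b'\n' as usize] = 0x80 | b'n';
    t[b'\\' as usize] = 0x80 | b'\\';
    t[b'\'' as usize] = 0x80 | b'\'';
    t[b'"' as usize] = 0x80 | b'"';
    t
};

/// Returns the escaped bytes and how many of them are valid.
fn escape_byte(byte: u8) -> ([u8; 4], usize) {
    if byte < 0x80 {
        let e = ESCAPE[byte as usize];
        if e & 0x80 != 0 {
            return ([b'\\', e & 0x7f, 0, 0], 2);
        }
        if e != 0 {
            return ([e, 0, 0, 0], 1);
        }
    }
    const HEX: &[u8; 16] = b"0123456789abcdef";
    (
        [b'\\', b'x', HEX[(byte >> 4) as usize], HEX[(byte & 0xf) as usize]],
        4,
    )
}
```

Whether this beats a branch-ordered `match` would come down to codegen; the table trades branch prediction for a data-dependent load.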
I have made some further changes and updated the Godbolt link in the PR description. The instruction count is again slightly lower, and LLVM-MCA now also shows fewer instructions and better IPC and throughput. I re-ran the previous benchmark with larger inputs (a 100 MB file with random data and a 100 MB JSON file); the results show no difference between the two functions. I also ran LLVM-MCA locally for the Cortex-M4, and it shows ~25% fewer instructions with ~35% higher throughput.

LLVM-MCA (Cortex-M4), before:

```
cargo asm --features before --lib --target thumbv7em-none-eabihf --att --mca --mca-arg=-mcpu=cortex-m4
```

LLVM-MCA (Cortex-M4), after:

```
cargo asm --features after --lib --target thumbv7em-none-eabihf --att --mca --mca-arg=-mcpu=cortex-m4
```
I suspect that in the grand scheme of things (escaping strings, rather than chars), this might not have such a large effect (btw, https://lemire.me/blog/2024/05/31/quickly-checking-whether-a-string-needs-escaping/ might be interesting to you). The code looked a bit more readable before, but no strong opinion on my side. r? libs
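The check described in that post can be sketched in scalar form (hypothetical code, not part of this PR; `needs_escape` and `escape_ascii_str` are illustrative names): pre-scan for any byte that needs escaping and borrow the input unchanged when none does, which is the common case for real strings.

```rust
use std::borrow::Cow;

/// True if `byte` must be escaped: non-printable, non-ASCII,
/// or one of the backslash/quote characters.
fn needs_escape(byte: u8) -> bool {
    !(0x20..0x7f).contains(&byte) || matches!(byte, b'\\' | b'\'' | b'"')
}

/// Fast-path sketch: if nothing in `input` needs escaping, return it
/// as a borrow; otherwise escape byte by byte with the stable
/// `std::ascii::escape_default`.
fn escape_ascii_str(input: &[u8]) -> Cow<'_, [u8]> {
    if !input.iter().copied().any(needs_escape) {
        return Cow::Borrowed(input);
    }
    let mut out = Vec::with_capacity(input.len() + 4);
    for &b in input {
        out.extend(std::ascii::escape_default(b));
    }
    Cow::Owned(out)
}
```

The blog post's point is that the pre-scan vectorizes trivially (one comparison over 16+ bytes at a time), so the per-byte cost of the slow path matters much less than how cheaply the fast path can be skipped.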
Follow-up to #124307. CC @joboet
Alternative/addition to #125317.
Based on #124307 (comment), it doesn't look like this function is the cause for the regression, but this change produces even fewer instructions (https://rust.godbolt.org/z/nebzqoveG).