Optimize escape_ascii. #125340

Open
wants to merge 5 commits into
base: master

reitermarkus
Contributor

@reitermarkus reitermarkus commented May 20, 2024

Follow-up to #124307. CC @joboet

Alternative/addition to #125317.

Based on #124307 (comment), it doesn't look like this function is the cause for the regression, but this change produces even fewer instructions (https://rust.godbolt.org/z/nebzqoveG).

@rustbot
Collaborator

rustbot commented May 20, 2024

r? @Mark-Simulacrum

rustbot has assigned @Mark-Simulacrum.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels May 20, 2024
@reitermarkus
Contributor Author

r? @Kobzol

@rustbot rustbot assigned Kobzol and unassigned Mark-Simulacrum May 20, 2024
@Kobzol
Contributor

Kobzol commented May 20, 2024

I'm probably not the best person to review this, but I can try. I have the same question as here though - do you have some (micro)benchmarks to show that this is an improvement? :)

@reitermarkus
Contributor Author

@Kobzol, what's the best way to do a benchmark for this? Just create a standalone crate with two versions of this function, or is there a recommended way to test against different commits in this repo?

@Kobzol
Contributor

Kobzol commented May 22, 2024

Well, that depends. From the microbenchmark side, you could show e.g. on godbolt that this produces "objectively" better assembly. From the macrobenchmark side, you would probably bring some program that is actually improved by this change.

Usually people have some explicit motivation for doing these kinds of optimizations, which is demonstrated by some change either in codegen or an improvement for some real-world code.

@reitermarkus
Contributor Author

reitermarkus commented May 23, 2024

> e.g. on godbolt

I have updated the Godbolt link in the PR description to reflect the current changes, i.e. 3 fewer jumps and 7 fewer instructions.

I have also done a micro benchmark using criterion:

Source
#![feature(ascii_char)]
#![feature(ascii_char_variants)]
#![feature(let_chains)]
#![feature(inline_const)]
#![feature(const_option)]

use core::ascii;
use core::ops::Range;

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, PlotConfiguration};

const HEX_DIGITS: [ascii::Char; 16] = *b"0123456789abcdef".as_ascii().unwrap();

#[inline]
const fn backslash<const N: usize>(a: ascii::Char) -> ([ascii::Char; N], Range<u8>) {
    const { assert!(N >= 2) };
    let mut output = [ascii::Char::Null; N];
    output[0] = ascii::Char::ReverseSolidus;
    output[1] = a;
    (output, 0..2)
}

#[inline]
const fn escape_ascii_before<const N: usize>(byte: u8) -> ([ascii::Char; N], Range<u8>) {
    const { assert!(N >= 4) };

    match byte {
        b'\t' => backslash(ascii::Char::SmallT),
        b'\r' => backslash(ascii::Char::SmallR),
        b'\n' => backslash(ascii::Char::SmallN),
        b'\\' => backslash(ascii::Char::ReverseSolidus),
        b'\'' => backslash(ascii::Char::Apostrophe),
        b'\"' => backslash(ascii::Char::QuotationMark),
        byte => {
            let mut output = [ascii::Char::Null; N];

            if let Some(c) = byte.as_ascii()
                && !byte.is_ascii_control()
            {
                output[0] = c;
                (output, 0..1)
            } else {
                let hi = HEX_DIGITS[(byte >> 4) as usize];
                let lo = HEX_DIGITS[(byte & 0xf) as usize];

                output[0] = ascii::Char::ReverseSolidus;
                output[1] = ascii::Char::SmallX;
                output[2] = hi;
                output[3] = lo;

                (output, 0..4)
            }
        }
    }
}

#[inline]
const fn escape_ascii_after<const N: usize>(byte: u8) -> ([ascii::Char; N], Range<u8>) {
    const { assert!(N >= 4) };

    let mut output = [ascii::Char::Null; N];

    // NOTE: This `match` is roughly ordered by the frequency of ASCII
    //       characters for performance.
    match byte.as_ascii() {
        Some(
            c @ ascii::Char::QuotationMark
            | c @ ascii::Char::Apostrophe
            | c @ ascii::Char::ReverseSolidus,
        ) => backslash(c),
        Some(c) if !byte.is_ascii_control() => {
            output[0] = c;
            (output, 0..1)
        }
        Some(ascii::Char::LineFeed) => backslash(ascii::Char::SmallN),
        Some(ascii::Char::CarriageReturn) => backslash(ascii::Char::SmallR),
        Some(ascii::Char::CharacterTabulation) => backslash(ascii::Char::SmallT),
        _ => {
            let hi = HEX_DIGITS[(byte >> 4) as usize];
            let lo = HEX_DIGITS[(byte & 0xf) as usize];

            output[0] = ascii::Char::ReverseSolidus;
            output[1] = ascii::Char::SmallX;
            output[2] = hi;
            output[3] = lo;

            (output, 0..4)
        }
    }
}

pub fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("escape_ascii");

    group.sample_size(1000);

    for i in [b'a', b'Z', b'\"', b'\t', b'\n', b'\xff'] {
        let i_s = if let Some(c) = i.as_ascii() {
            format!("{c:?}")
        } else {
            format!("'\\x{i:02x}'")
        };

        group.bench_with_input(BenchmarkId::new("before", &i_s), &i, |b, i| {
            b.iter(|| escape_ascii_before::<4>(*i));
        });
        group.bench_with_input(BenchmarkId::new("after", &i_s), &i, |b, i| {
            b.iter(|| escape_ascii_after::<4>(*i));
        });
    }

    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Output
escape_ascii/before/'a' time:   [1.6945 ns 1.7047 ns 1.7170 ns]
Found 21 outliers among 1000 measurements (2.10%)
  8 (0.80%) low mild
  4 (0.40%) high mild
  9 (0.90%) high severe
escape_ascii/after/'a'  time:   [427.36 ps 428.23 ps 429.15 ps]
Found 23 outliers among 1000 measurements (2.30%)
  2 (0.20%) high mild
  21 (2.10%) high severe
escape_ascii/before/'Z' time:   [1.6944 ns 1.6971 ns 1.6996 ns]
escape_ascii/after/'Z'  time:   [430.95 ps 431.52 ps 432.06 ps]
Found 372 outliers among 1000 measurements (37.20%)
  230 (23.00%) low severe
  37 (3.70%) high mild
  105 (10.50%) high severe
escape_ascii/before/'"' time:   [1.3287 ns 1.3308 ns 1.3328 ns]
Found 1 outliers among 1000 measurements (0.10%)
  1 (0.10%) high mild
escape_ascii/after/'"'  time:   [429.44 ps 430.54 ps 431.73 ps]
Found 9 outliers among 1000 measurements (0.90%)
  2 (0.20%) high mild
  7 (0.70%) high severe
escape_ascii/before/'\t'
                        time:   [1.3326 ns 1.3369 ns 1.3413 ns]
Found 99 outliers among 1000 measurements (9.90%)
  80 (8.00%) high mild
  19 (1.90%) high severe
escape_ascii/after/'\t' time:   [1.3184 ns 1.3215 ns 1.3246 ns]
Found 308 outliers among 1000 measurements (30.80%)
  158 (15.80%) low mild
  10 (1.00%) high mild
  140 (14.00%) high severe
escape_ascii/before/'\n'
                        time:   [1.3336 ns 1.3377 ns 1.3419 ns]
escape_ascii/after/'\n' time:   [1.3033 ns 1.3057 ns 1.3080 ns]
Found 223 outliers among 1000 measurements (22.30%)
  210 (21.00%) low mild
  9 (0.90%) high mild
  4 (0.40%) high severe
escape_ascii/before/'\xff'
                        time:   [1.5074 ns 1.5116 ns 1.5168 ns]
Found 7 outliers among 1000 measurements (0.70%)
  3 (0.30%) high mild
  4 (0.40%) high severe
escape_ascii/after/'\xff'
                        time:   [444.86 ps 456.22 ps 469.96 ps]
Found 51 outliers among 1000 measurements (5.10%)
  8 (0.80%) high mild
  43 (4.30%) high severe

Graph (unfortunately Y-axis is not sorted by input):

[Image: violin plot of the benchmark results]

@Kobzol
Contributor

Kobzol commented May 24, 2024

Your benchmark was executed on a single byte input? It would be good to also see how it behaves on something larger, e.g. a short/medium size/long byte slice, to see the effects in practice.
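
A larger-input measurement of this kind could be sketched with the standard library alone. The sketch below is not the benchmark used in this PR: it assumes the stable `<[u8]>::escape_ascii` as the function under test, and `make_input` and `time_escape` are hypothetical helper names. A proper comparison would still use a harness such as criterion.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Builds a deterministic pseudo-random byte buffer (hypothetical helper).
/// Uses a small xorshift generator so no external crate is needed;
/// `seed` must be non-zero.
fn make_input(len: usize, mut seed: u32) -> Vec<u8> {
    (0..len)
        .map(|_| {
            seed ^= seed << 13;
            seed ^= seed >> 17;
            seed ^= seed << 5;
            (seed >> 24) as u8
        })
        .collect()
}

/// Escapes the whole slice once and returns the escaped length together
/// with the elapsed wall-clock time.
fn time_escape(input: &[u8]) -> (usize, Duration) {
    let start = Instant::now();
    // `black_box` discourages the optimizer from removing the work.
    let escaped_len = black_box(input).escape_ascii().map(black_box).count();
    (escaped_len, start.elapsed())
}
```

Calling `time_escape` on buffers of, say, 16 bytes, 4 KiB and 16 MiB would cover the short, medium and long cases; a single timed run is noisy, so each size should be repeated many times.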

Could you describe the motivation for this change? If I understand your comment correctly, "frequency of ASCII characters" means how often given characters appear in the input. It makes sense to me to optimize for the common case, which I would expect to be input that does not need to be escaped at all. So my intuition would be to first check whether it's an alphabetic ASCII character, and then continue from there. So this optimization seems reasonable, in general. I just wonder if you have some use case where this escaping is an actual bottleneck and we could actually see some wins in practice?
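
The "common case first" ordering described above might look roughly like this on stable Rust. This is only an illustration, not the code in this PR: `escape_byte_fast_path` is a hypothetical name, the output is a plain `[u8; 4]` plus a length rather than `ascii::Char`, and the compiler is free to reorder or merge match arms, so source order does not guarantee branch order in the generated code.

```rust
/// Escapes one byte into a fixed 4-byte buffer and returns the buffer
/// together with the number of bytes used.
fn escape_byte_fast_path(byte: u8) -> ([u8; 4], usize) {
    const HEX_DIGITS: &[u8; 16] = b"0123456789abcdef";

    match byte {
        // Expected common case: letters, digits and space pass through.
        b'a'..=b'z' | b'A'..=b'Z' | b'0'..=b'9' | b' ' => ([byte, 0, 0, 0], 1),
        // Printable characters that need a backslash.
        b'"' | b'\'' | b'\\' => ([b'\\', byte, 0, 0], 2),
        // Control characters with a short escape.
        b'\n' => ([b'\\', b'n', 0, 0], 2),
        b'\r' => ([b'\\', b'r', 0, 0], 2),
        b'\t' => ([b'\\', b't', 0, 0], 2),
        // Remaining printable ASCII (punctuation) passes through.
        0x21..=0x7e => ([byte, 0, 0, 0], 1),
        // Everything else (other control bytes, DEL, non-ASCII) becomes \xNN.
        _ => {
            let hi = HEX_DIGITS[(byte >> 4) as usize];
            let lo = HEX_DIGITS[(byte & 0xf) as usize];
            ([b'\\', b'x', hi, lo], 4)
        }
    }
}
```

Comparing the result against `u8::escape_ascii` for all 256 byte values is a cheap way to confirm that reordering the arms did not change behavior.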

Btw, in general, the fact that there are fewer instructions doesn't necessarily mean that the code will be faster. In microarchitecture simulation (llvm-mca), the original code seems to have better IPC (https://rust.godbolt.org/z/3qKeohGjs), although in this case it's hard to draw conclusions from that, because this function is very data-dependent.

@clarfonthey
Contributor

Hmm.

Omitting the non-ASCII case, perhaps this could be done with a lookup table? You could squeeze it down to just 127 bytes if you use the eighth bit to determine if there should be a backslash, since the escaped character will only need 7 bits. This way, you don't need to worry about ordering things by prevalence. Have no idea what the current codegen looks like so I dunno if it'd be much faster, but that feels like the best route to me.
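
A minimal sketch of that lookup-table idea, under a few assumptions that are not from the comment itself: the table has one entry per 7-bit value (128 entries), an entry of `0` is used as a sentinel for "emit a `\xNN` hex escape", and the names `ESCAPE_TABLE`, `BACKSLASH` and `escape_byte_table` are made up for illustration. Whether this beats the `match` would need measuring.

```rust
const HEX_DIGITS: &[u8; 16] = b"0123456789abcdef";

/// High bit of a table entry: emit a backslash before the low 7 bits.
const BACKSLASH: u8 = 0x80;

/// Builds the table at compile time. Entries left at 0 mean "hex escape".
const fn build_table() -> [u8; 128] {
    let mut table = [0u8; 128];
    // Printable ASCII (0x20..=0x7e) maps to itself.
    let mut b = 0x20usize;
    while b <= 0x7e {
        table[b] = b as u8;
        b += 1;
    }
    // Characters with a two-byte backslash escape.
    table[b'\t' as usize] = BACKSLASH | b't';
    table[b'\r' as usize] = BACKSLASH | b'r';
    table[b'\n' as usize] = BACKSLASH | b'n';
    table[b'\\' as usize] = BACKSLASH | b'\\';
    table[b'\'' as usize] = BACKSLASH | b'\'';
    table[b'"' as usize] = BACKSLASH | b'"';
    table
}

static ESCAPE_TABLE: [u8; 128] = build_table();

/// Escapes one byte into a 4-byte buffer and returns the used length.
fn escape_byte_table(byte: u8) -> ([u8; 4], usize) {
    // Non-ASCII bytes fall outside the table and are always hex-escaped.
    let entry = ESCAPE_TABLE.get(byte as usize).copied().unwrap_or(0);
    if entry == 0 {
        let hi = HEX_DIGITS[(byte >> 4) as usize];
        let lo = HEX_DIGITS[(byte & 0xf) as usize];
        ([b'\\', b'x', hi, lo], 4)
    } else if (entry & BACKSLASH) != 0 {
        ([b'\\', entry & 0x7f, 0, 0], 2)
    } else {
        ([entry, 0, 0, 0], 1)
    }
}
```

The sentinel works because no byte that is emitted literally or after a backslash has the value 0, so the table needs no second lookup to decide between the three output shapes.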

@reitermarkus
Contributor Author

I have made some further changes and updated the Godbolt link in the PR description. The instruction count is again slightly lower, and LLVM-MCA now also shows fewer instructions and better IPC and throughput.

I re-ran the previous benchmark with larger inputs (a 100 MB file with random data, and a 100 MB JSON file). The results show no difference between the two functions:

[Images: three violin plots of the large-input benchmark results]

I also ran LLVM-MCA locally for Cortex M4, and it shows ~25% fewer instructions with ~35% higher throughput:

LLVM-MCA (Cortex M4) - before

cargo asm --features before --lib --target thumbv7em-none-eabihf --att --mca --mca-arg=-mcpu=cortex-m4

    Finished release [optimized] target(s) in 0.03s

Iterations:        100
Instructions:      6900
Total Cycles:      6901
Total uOps:        6900

Dispatch Width:    1
uOps Per Cycle:    1.00
IPC:               1.00
Block RThroughput: 69.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     1.00                        mvn	r2, #8
 1      1     1.00                        uxtab	r3, r2, r1
 1      1     1.00                        uxtb.w	r12, r1
 1      1     1.00                        cmp	r3, #30
 1      1     1.00                  U     bhi	.LBB0_4
 1      1     1.00                  U     tbb	[pc, r3]
 1      1     1.00                        mov.w	r1, #512
 1      1     1.00           *            strh	r1, [r0, #4]
 1      1     1.00                        movw	r1, #29788
 1      1     1.00           *            str	r1, [r0]
 1      1     1.00                  U     bx	lr
 1      1     1.00                        cmp.w	r12, #92
 1      1     1.00                  U     bne	.LBB0_6
 1      1     1.00                        mov.w	r1, #512
 1      1     1.00           *            strh	r1, [r0, #4]
 1      1     1.00                        movw	r1, #23644
 1      1     1.00           *            str	r1, [r0]
 1      1     1.00                  U     bx	lr
 1      1     1.00                        cmp.w	r12, #128
 1      1     1.00                        mov	r3, r12
 1      1     1.00                  U     it	hs
 1      1     1.00                        movhs	r3, #128
 1      1     1.00                        sxtb	r2, r1
 1      1     1.00                        cmp	r2, #0
 1      1     1.00                  U     bmi	.LBB0_9
 1      1     1.00                        cmp.w	r12, #32
 1      1     1.00                  U     blo	.LBB0_9
 1      1     1.00                        cmp.w	r12, #127
 1      1     1.00                  U     itttt	ne
 1      1     1.00                        movne	r1, #1
 1      1     1.00           *            strbne	r1, [r0, #5]
 1      1     1.00                        movne	r1, #0
 1      1     1.00           *            strne.w	r1, [r0, #1]
 1      1     1.00                  U     itt	ne
 1      1     1.00           *            strbne	r3, [r0]
 1      1     1.00                  U     bxne	lr
 1      1     1.00                        movw	r3, :lower16:.L__unnamed_1
 1      1     1.00                        mov.w	r2, #1024
 1      1     1.00                        and	r1, r1, #15
 1      1     1.00           *            strh	r2, [r0, #4]
 1      1     1.00                        movt	r3, :upper16:.L__unnamed_1
 1      1     1.00                        lsr.w	r2, r12, #4
 1      2     1.00    *                   ldrb	r2, [r3, r2]
 1      2     1.00    *                   ldrb	r1, [r3, r1]
 1      1     1.00                        movw	r3, #30812
 1      1     1.00           *            strh	r3, [r0]
 1      1     1.00           *            strb	r1, [r0, #3]
 1      1     1.00           *            strb	r2, [r0, #2]
 1      1     1.00                  U     bx	lr
 1      1     1.00                        mov.w	r1, #512
 1      1     1.00           *            strh	r1, [r0, #4]
 1      1     1.00                        movw	r1, #28252
 1      1     1.00           *            str	r1, [r0]
 1      1     1.00                  U     bx	lr
 1      1     1.00                        mov.w	r1, #512
 1      1     1.00           *            strh	r1, [r0, #4]
 1      1     1.00                        movw	r1, #29276
 1      1     1.00           *            str	r1, [r0]
 1      1     1.00                  U     bx	lr
 1      1     1.00                        mov.w	r1, #512
 1      1     1.00           *            strh	r1, [r0, #4]
 1      1     1.00                        movw	r1, #8796
 1      1     1.00           *            str	r1, [r0]
 1      1     1.00                  U     bx	lr
 1      1     1.00                        mov.w	r1, #512
 1      1     1.00           *            strh	r1, [r0, #4]
 1      1     1.00                        movw	r1, #10076
 1      1     1.00           *            str	r1, [r0]
 1      1     1.00                  U     bx	lr


Resources:
[0]   - M4Unit


Resource pressure per iteration:
[0]    
69.00  

Resource pressure by instruction:
[0]    Instructions:
1.00   mvn	r2, #8
1.00   uxtab	r3, r2, r1
1.00   uxtb.w	r12, r1
1.00   cmp	r3, #30
1.00   bhi	.LBB0_4
1.00   tbb	[pc, r3]
1.00   mov.w	r1, #512
1.00   strh	r1, [r0, #4]
1.00   movw	r1, #29788
1.00   str	r1, [r0]
1.00   bx	lr
1.00   cmp.w	r12, #92
1.00   bne	.LBB0_6
1.00   mov.w	r1, #512
1.00   strh	r1, [r0, #4]
1.00   movw	r1, #23644
1.00   str	r1, [r0]
1.00   bx	lr
1.00   cmp.w	r12, #128
1.00   mov	r3, r12
1.00   it	hs
1.00   movhs	r3, #128
1.00   sxtb	r2, r1
1.00   cmp	r2, #0
1.00   bmi	.LBB0_9
1.00   cmp.w	r12, #32
1.00   blo	.LBB0_9
1.00   cmp.w	r12, #127
1.00   itttt	ne
1.00   movne	r1, #1
1.00   strbne	r1, [r0, #5]
1.00   movne	r1, #0
1.00   strne.w	r1, [r0, #1]
1.00   itt	ne
1.00   strbne	r3, [r0]
1.00   bxne	lr
1.00   movw	r3, :lower16:.L__unnamed_1
1.00   mov.w	r2, #1024
1.00   and	r1, r1, #15
1.00   strh	r2, [r0, #4]
1.00   movt	r3, :upper16:.L__unnamed_1
1.00   lsr.w	r2, r12, #4
1.00   ldrb	r2, [r3, r2]
1.00   ldrb	r1, [r3, r1]
1.00   movw	r3, #30812
1.00   strh	r3, [r0]
1.00   strb	r1, [r0, #3]
1.00   strb	r2, [r0, #2]
1.00   bx	lr
1.00   mov.w	r1, #512
1.00   strh	r1, [r0, #4]
1.00   movw	r1, #28252
1.00   str	r1, [r0]
1.00   bx	lr
1.00   mov.w	r1, #512
1.00   strh	r1, [r0, #4]
1.00   movw	r1, #29276
1.00   str	r1, [r0]
1.00   bx	lr
1.00   mov.w	r1, #512
1.00   strh	r1, [r0, #4]
1.00   movw	r1, #8796
1.00   str	r1, [r0]
1.00   bx	lr
1.00   mov.w	r1, #512
1.00   strh	r1, [r0, #4]
1.00   movw	r1, #10076
1.00   str	r1, [r0]
1.00   bx	lr

LLVM-MCA (Cortex M4) - after

cargo asm --features after --lib --target thumbv7em-none-eabihf --att --mca --mca-arg=-mcpu=cortex-m4

    Finished release [optimized] target(s) in 0.02s

Iterations:        100
Instructions:      5100
Total Cycles:      5301
Total uOps:        5100

Dispatch Width:    1
uOps Per Cycle:    0.96
IPC:               0.96
Block RThroughput: 51.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     1.00           *      U     push	{r4, r6, r7, lr}
 1      1     1.00                  U     add	r7, sp, #8
 1      1     1.00                        uxtb	r4, r1
 1      1     1.00                        movw	r12, :lower16:.L__unnamed_1
 1      1     1.00                        and	r3, r1, #15
 1      1     1.00                        movt	r12, :upper16:.L__unnamed_1
 1      1     1.00                        lsrs	r2, r4, #4
 1      1     1.00                        cmp	r4, #126
 1      2     1.00    *                   ldrb.w	lr, [r12, r3]
 1      2     1.00    *                   ldrb.w	r12, [r12, r2]
 1      1     1.00                  U     bhi	.LBB0_9
 1      1     1.00                        movs	r2, #92
 1      1     1.00                        movs	r3, #2
 1      1     1.00                        cmp	r4, #34
 1      1     1.00                  U     beq	.LBB0_4
 1      1     1.00                        cmp	r4, #39
 1      1     1.00                  U     beq	.LBB0_4
 1      1     1.00                        cmp	r4, #92
 1      1     1.00                  U     bne	.LBB0_5
 1      1     1.00                        mov	r4, r1
 1      1     1.00                        b	.LBB0_10
 1      1     1.00                        cmp	r4, #31
 1      1     1.00                  U     bls	.LBB0_7
 1      1     1.00                        movs	r4, #120
 1      1     1.00                        movs	r3, #1
 1      1     1.00                        mov	r2, r1
 1      1     1.00                        b	.LBB0_10
 1      1     1.00                        subs	r1, #9
 1      1     1.00                        uxtb	r2, r1
 1      1     1.00                        cmp	r2, #4
 1      1     1.00                  U     bhi	.LBB0_9
 1      1     1.00                        movw	r2, :lower16:.Lswitch.table.after.1
 1      1     1.00                        sxtb	r1, r1
 1      1     1.00                        movt	r2, :upper16:.Lswitch.table.after.1
 1      2     1.00    *                   ldrb	r4, [r2, r1]
 1      1     1.00                        movw	r2, :lower16:.Lswitch.table.after
 1      1     1.00                        movt	r2, :upper16:.Lswitch.table.after
 1      2     1.00    *                   ldrb	r3, [r2, r1]
 1      1     1.00                        movs	r2, #92
 1      1     1.00                        b	.LBB0_10
 1      1     1.00                        movs	r4, #120
 1      1     1.00                        movs	r2, #92
 1      1     1.00                        movs	r3, #4
 1      1     1.00                        movs	r1, #0
 1      1     1.00           *            strb	r3, [r0, #5]
 1      1     1.00           *            strb	r1, [r0, #4]
 1      1     1.00           *            strb.w	lr, [r0, #3]
 1      1     1.00           *            strb.w	r12, [r0, #2]
 1      1     1.00           *            strb	r4, [r0, #1]
 1      1     1.00           *            strb	r2, [r0]
 1      2     1.00    *             U     pop	{r4, r6, r7, pc}


Resources:
[0]   - M4Unit


Resource pressure per iteration:
[0]    
51.00  

Resource pressure by instruction:
[0]    Instructions:
1.00   push	{r4, r6, r7, lr}
1.00   add	r7, sp, #8
1.00   uxtb	r4, r1
1.00   movw	r12, :lower16:.L__unnamed_1
1.00   and	r3, r1, #15
1.00   movt	r12, :upper16:.L__unnamed_1
1.00   lsrs	r2, r4, #4
1.00   cmp	r4, #126
1.00   ldrb.w	lr, [r12, r3]
1.00   ldrb.w	r12, [r12, r2]
1.00   bhi	.LBB0_9
1.00   movs	r2, #92
1.00   movs	r3, #2
1.00   cmp	r4, #34
1.00   beq	.LBB0_4
1.00   cmp	r4, #39
1.00   beq	.LBB0_4
1.00   cmp	r4, #92
1.00   bne	.LBB0_5
1.00   mov	r4, r1
1.00   b	.LBB0_10
1.00   cmp	r4, #31
1.00   bls	.LBB0_7
1.00   movs	r4, #120
1.00   movs	r3, #1
1.00   mov	r2, r1
1.00   b	.LBB0_10
1.00   subs	r1, #9
1.00   uxtb	r2, r1
1.00   cmp	r2, #4
1.00   bhi	.LBB0_9
1.00   movw	r2, :lower16:.Lswitch.table.after.1
1.00   sxtb	r1, r1
1.00   movt	r2, :upper16:.Lswitch.table.after.1
1.00   ldrb	r4, [r2, r1]
1.00   movw	r2, :lower16:.Lswitch.table.after
1.00   movt	r2, :upper16:.Lswitch.table.after
1.00   ldrb	r3, [r2, r1]
1.00   movs	r2, #92
1.00   b	.LBB0_10
1.00   movs	r4, #120
1.00   movs	r2, #92
1.00   movs	r3, #4
1.00   movs	r1, #0
1.00   strb	r3, [r0, #5]
1.00   strb	r1, [r0, #4]
1.00   strb.w	lr, [r0, #3]
1.00   strb.w	r12, [r0, #2]
1.00   strb	r4, [r0, #1]
1.00   strb	r2, [r0]
1.00   pop	{r4, r6, r7, pc}
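
The quoted percentages can be re-derived from the two summaries (6900 vs. 5100 instructions, Block RThroughput 69.0 vs. 51.0). The helper names below are made up for this sketch. Note that llvm-mca models the listing as straight-line code repeated in a loop, so these are whole-function totals rather than the cost of any single input path.

```rust
/// Relative reduction from `before` to `after`, in percent.
fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Relative throughput gain in percent when the reciprocal throughput
/// (cycles per block, lower is better) drops from `before` to `after`.
fn percent_throughput_gain(before: f64, after: f64) -> f64 {
    (before / after - 1.0) * 100.0
}
```

With the figures above, `percent_reduction(6900.0, 5100.0)` is about 26 and `percent_throughput_gain(69.0, 51.0)` is about 35, which matches the "~25%" and "~35%" in the comment.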

@Kobzol
Contributor

Kobzol commented Jun 3, 2024

I suspect that in the grand scheme of things (escaping strings, rather than chars), this might not have such a large effect (btw https://lemire.me/blog/2024/05/31/quickly-checking-whether-a-string-needs-escaping/ might be interesting to you). The code looked a bit more readable before, but I have no strong opinion on that.
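
At the string level, the "check first, escape only if needed" idea could be sketched as below. This is not the implementation from the linked post: it uses the `escape_ascii` rules from this PR as the definition of "needs escaping", a plain scalar scan rather than anything vectorized, and hypothetical names (`byte_needs_escaping`, `escape_slice`).

```rust
use std::borrow::Cow;

/// Returns true if `escape_ascii` would change this byte: anything outside
/// printable ASCII, plus the quote, apostrophe and backslash characters.
fn byte_needs_escaping(byte: u8) -> bool {
    !matches!(byte, 0x20..=0x7e) || matches!(byte, b'"' | b'\'' | b'\\')
}

/// Escapes a slice, borrowing the input unchanged when nothing needs
/// escaping and allocating only when at least one byte does.
fn escape_slice(input: &[u8]) -> Cow<'_, [u8]> {
    if !input.iter().copied().any(byte_needs_escaping) {
        return Cow::Borrowed(input);
    }
    Cow::Owned(input.escape_ascii().collect())
}
```

For input that is mostly clean, the scan avoids both the allocation and the per-byte escape logic, which is where a per-character micro-optimization would matter least.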

r? libs

@rustbot rustbot assigned joboet and unassigned Kobzol Jun 3, 2024