Remove the branches from len_utf8 #125129

Open
wants to merge 2 commits into
base: master

cuviper
Member

@cuviper cuviper commented May 14, 2024

This changes len_utf8 to add all of the range comparisons together,
rather than branching on each one. We should definitely test performance
though, because it's possible that this will pessimize mostly-ascii
inputs that would have had a short branch-predicted path before.
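
A minimal sketch of the two shapes being compared, assuming the usual UTF-8 range thresholds (0x80, 0x800, 0x10000). The function names are illustrative only and are not the names used in the standard library:

```rust
// Thresholds at which a code point needs one more UTF-8 byte.
const MAX_ONE_B: u32 = 0x80;
const MAX_TWO_B: u32 = 0x800;
const MAX_THREE_B: u32 = 0x10000;

/// Branching form: tests one range at a time and returns as soon as one matches.
pub const fn len_utf8_branchy(code: u32) -> usize {
    if code < MAX_ONE_B {
        1
    } else if code < MAX_TWO_B {
        2
    } else if code < MAX_THREE_B {
        3
    } else {
        4
    }
}

/// Branchless form: evaluates every comparison and sums the boolean results.
pub const fn len_utf8_branchless(code: u32) -> usize {
    1 + (code >= MAX_ONE_B) as usize
        + (code >= MAX_TWO_B) as usize
        + (code >= MAX_THREE_B) as usize
}
```

Both forms should agree on every input; the values on either side of each threshold (0x7F/0x80, 0x7FF/0x800, 0xFFFF/0x10000) are the interesting ones to check.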

@rustbot
Collaborator

rustbot commented May 14, 2024

r? @Mark-Simulacrum

rustbot has assigned @Mark-Simulacrum.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels May 14, 2024
@cuviper
Member Author

cuviper commented May 14, 2024

Here's a godbolt comparison. For a single character, it looks like this:

Before:

len_char:
        mov     eax, 1
        cmp     edi, 128
        jb      .LBB0_3
        mov     eax, 2
        cmp     edi, 2048
        jb      .LBB0_3
        cmp     edi, 65536
        mov     eax, 4
        sbb     rax, 0
.LBB0_3:
        ret

After:

len_char:
        xor     eax, eax
        cmp     edi, 127
        seta    al
        cmp     edi, 2048
        sbb     rax, -1
        cmp     edi, 65536
        sbb     rax, -1
        inc     rax
        ret

I also included an example of summing [char] lengths, and the new implementation shows auto-vectorization.
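
For reference, the summing case can be sketched roughly as below. The name `sum_len_utf8` is made up for this illustration, and whether the loop actually gets vectorized depends on the compiler version and target options:

```rust
/// Sums the UTF-8 encoded lengths of a slice of chars using the branchless form.
/// With no data-dependent branch in the loop body, the optimizer is free to
/// vectorize the comparisons, though that is not guaranteed.
pub fn sum_len_utf8(cs: &[char]) -> usize {
    cs.iter()
        .map(|&c| {
            let code = c as u32;
            1 + (code >= 0x80) as usize
                + (code >= 0x800) as usize
                + (code >= 0x10000) as usize
        })
        .sum()
}
```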

cc @lincot @scottmcm -- this may be relevant to #124810 too.

@cuviper
Member Author

cuviper commented May 14, 2024

@bors try @rust-timer queue

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 14, 2024
bors added a commit to rust-lang-ci/rust that referenced this pull request May 14, 2024
Remove the branches from `len_utf8`

@bors
Contributor

bors commented May 14, 2024

⌛ Trying commit 38f14be with merge 1c3131c...

@bors
Contributor

bors commented May 14, 2024

☀️ Try build successful - checks-actions
Build commit: 1c3131c (1c3131cffea82ac12fe79bbd8ae3fc5dbbde3650)

@lincot

lincot commented May 14, 2024

In a benchmark, current len_utf8 gives me:

test bench_len_chars    ... bench:         290 ns/iter (+/- 7) = 3531 MB/s
test bench_push_1_byte  ... bench:      10,489 ns/iter (+/- 169) = 953 MB/s
test bench_push_2_bytes ... bench:      14,303 ns/iter (+/- 295) = 1398 MB/s
test bench_push_3_bytes ... bench:      19,078 ns/iter (+/- 328) = 1572 MB/s
test bench_push_4_bytes ... bench:      23,845 ns/iter (+/- 357) = 1677 MB/s

Branchless:

test bench_len_chars    ... bench:         315 ns/iter (+/- 8) = 3250 MB/s
test bench_push_1_byte  ... bench:      12,196 ns/iter (+/- 209) = 819 MB/s
test bench_push_2_bytes ... bench:      14,433 ns/iter (+/- 283) = 1385 MB/s
test bench_push_3_bytes ... bench:      16,871 ns/iter (+/- 356) = 1778 MB/s
test bench_push_4_bytes ... bench:      21,463 ns/iter (+/- 681) = 1863 MB/s
bench.rs
#![feature(test)]

extern crate test;
use core::{array, mem::MaybeUninit};
use rand::seq::SliceRandom;
use rand_pcg::Pcg64Mcg;
use test::{black_box, Bencher};

const TAG_CONT: u8 = 0b1000_0000;
const TAG_TWO_B: u8 = 0b1100_0000;
const TAG_THREE_B: u8 = 0b1110_0000;
const TAG_FOUR_B: u8 = 0b1111_0000;
const MAX_ONE_B: u32 = 0x80;
const MAX_TWO_B: u32 = 0x800;
const MAX_THREE_B: u32 = 0x10000;

#[inline]
const fn len_utf8(code: u32) -> usize {
    const BRANCHLESS: bool = true;

    if BRANCHLESS {
        1 + ((code >= MAX_ONE_B) as usize)
            + ((code >= MAX_TWO_B) as usize)
            + ((code >= MAX_THREE_B) as usize)
    } else {
        if code < MAX_ONE_B {
            1
        } else if code < MAX_TWO_B {
            2
        } else if code < MAX_THREE_B {
            3
        } else {
            4
        }
    }
}

#[inline]
fn len_chars(cs: &[char]) -> usize {
    cs.iter().map(|&c| len_utf8(c as u32)).sum()
}

#[inline]
pub fn push(s: &mut String, ch: char) {
    let len = s.len();
    let ch_len = len_utf8(ch as u32);
    s.reserve(ch_len);

    // SAFETY: at least the length needed to encode `ch`
    // has been reserved in `s`
    unsafe {
        encode_utf8_raw_unchecked(ch as u32, s.as_mut_vec().spare_capacity_mut());
        s.as_mut_vec().set_len(len + ch_len);
    }
}

#[inline]
pub fn encode_utf8_raw(code: u32, dst: &mut [u8]) -> &mut [u8] {
    let len = len_utf8(code);
    if dst.len() < len {
        panic!(
            "encode_utf8: need {} bytes to encode U+{:X}, but the buffer has {}",
            len,
            code,
            dst.len(),
        );
    }

    // SAFETY: `encode_utf8_raw_unchecked` only writes initialized bytes to the slice,
    // `dst` has been checked to be long enough to hold the encoded codepoint
    unsafe { encode_utf8_raw_unchecked(code, &mut *(dst as *mut [u8] as *mut [MaybeUninit<u8>])) }
}

#[inline]
pub unsafe fn encode_utf8_raw_unchecked(code: u32, dst: &mut [MaybeUninit<u8>]) -> &mut [u8] {
    let len = len_utf8(code);
    // SAFETY: the caller must guarantee that `dst` is at least `len` bytes long
    unsafe {
        match len {
            1 => {
                dst.get_unchecked_mut(0).write(code as u8);
            }
            2 => {
                dst.get_unchecked_mut(0)
                    .write((code >> 6 & 0x1F) as u8 | TAG_TWO_B);
                dst.get_unchecked_mut(1)
                    .write((code & 0x3F) as u8 | TAG_CONT);
            }
            3 => {
                dst.get_unchecked_mut(0)
                    .write((code >> 12 & 0x0F) as u8 | TAG_THREE_B);
                dst.get_unchecked_mut(1)
                    .write((code >> 6 & 0x3F) as u8 | TAG_CONT);
                dst.get_unchecked_mut(2)
                    .write((code & 0x3F) as u8 | TAG_CONT);
            }
            4 => {
                dst.get_unchecked_mut(0)
                    .write((code >> 18 & 0x07) as u8 | TAG_FOUR_B);
                dst.get_unchecked_mut(1)
                    .write((code >> 12 & 0x3F) as u8 | TAG_CONT);
                dst.get_unchecked_mut(2)
                    .write((code >> 6 & 0x3F) as u8 | TAG_CONT);
                dst.get_unchecked_mut(3)
                    .write((code & 0x3F) as u8 | TAG_CONT);
            }
            _ => unreachable!(),
        }
    }

    // SAFETY: data has been written to the first `len` bytes
    unsafe { &mut *(dst.get_unchecked_mut(..len) as *mut [MaybeUninit<u8>] as *mut [u8]) }
}

#[bench]
fn bench_len_chars(bencher: &mut Bencher) {
    const BYTES: usize = 1024;
    bencher.bytes = BYTES as _;
    let mut rng = Pcg64Mcg::new(0xcafe_f00d_d15e_a5e5);
    let cs: [_; BYTES] = array::from_fn(|_| *['0', 'д', '❗', '🤨'].choose(&mut rng).unwrap());
    bencher.iter(|| len_chars(black_box(&cs)));
}

const ITERATIONS: u64 = if cfg!(miri) { 1 } else { 10_000 };

#[bench]
fn bench_push_1_byte(bencher: &mut Bencher) {
    const CHAR: char = '0';
    assert_eq!(CHAR.len_utf8(), 1);
    bencher.bytes = ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity(ITERATIONS as _);
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

#[bench]
fn bench_push_2_bytes(bencher: &mut Bencher) {
    const CHAR: char = 'д';
    assert_eq!(CHAR.len_utf8(), 2);
    bencher.bytes = 2 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity((2 * ITERATIONS) as _);
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

#[bench]
fn bench_push_3_bytes(bencher: &mut Bencher) {
    const CHAR: char = '❗';
    assert_eq!(CHAR.len_utf8(), 3);
    bencher.bytes = 3 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity((3 * ITERATIONS) as _);
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

#[bench]
fn bench_push_4_bytes(bencher: &mut Bencher) {
    const CHAR: char = '🤨';
    assert_eq!(CHAR.len_utf8(), 4);
    bencher.bytes = 4 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity((4 * ITERATIONS) as _);
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

So despite autovectorization, the branchless version appears to be slower for len_chars
with chars of random length. In String::push it trades ASCII performance
for faster 3- and 4-byte chars.

Also, curiously, if we hint to the compiler that the branchless version is equal
to the branchy version, it uses the latter: godbolt.
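
The linked godbolt is not reproduced here, so the following is only one plausible way to express such a hint, using `core::hint::unreachable_unchecked`. The function name `len_utf8_hinted` is hypothetical:

```rust
/// Computes the length both ways and tells the optimizer the results are equal.
/// The optimizer may then pick whichever form it prefers.
pub fn len_utf8_hinted(code: u32) -> usize {
    let branchy = if code < 0x80 {
        1
    } else if code < 0x800 {
        2
    } else if code < 0x10000 {
        3
    } else {
        4
    };
    let branchless = 1
        + (code >= 0x80) as usize
        + (code >= 0x800) as usize
        + (code >= 0x10000) as usize;
    if branchless != branchy {
        // SAFETY: the two computations agree for every `u32`,
        // so this branch can never be taken.
        unsafe { core::hint::unreachable_unchecked() }
    }
    branchless
}
```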

@cuviper
Member Author

cuviper commented May 14, 2024

Thanks for the benchmark! I agree that ascii is taking a hit, like I originally suspected, but my results look more favorable on the vectorized sum.

AMD Ryzen 7 5800X, original:

test bench_len_chars    ... bench:         619.19 ns/iter (+/- 26.64) = 1654 MB/s
test bench_push_1_byte  ... bench:      11,439.94 ns/iter (+/- 301.62) = 874 MB/s
test bench_push_2_bytes ... bench:      10,856.02 ns/iter (+/- 439.35) = 1842 MB/s
test bench_push_3_bytes ... bench:      15,011.17 ns/iter (+/- 339.38) = 1998 MB/s
test bench_push_4_bytes ... bench:      19,140.64 ns/iter (+/- 212.55) = 2089 MB/s

branchless:

test bench_len_chars    ... bench:         354.67 ns/iter (+/- 1.95) = 2892 MB/s
test bench_push_1_byte  ... bench:      11,122.96 ns/iter (+/- 518.84) = 899 MB/s
test bench_push_2_bytes ... bench:      12,862.57 ns/iter (+/- 109.74) = 1554 MB/s
test bench_push_3_bytes ... bench:      14,385.21 ns/iter (+/- 160.54) = 2085 MB/s
test bench_push_4_bytes ... bench:      16,230.85 ns/iter (+/- 198.19) = 2464 MB/s

AMD Ryzen 7 7700X, original:

test bench_len_chars    ... bench:         488.00 ns/iter (+/- 12.27) = 2102 MB/s
test bench_push_1_byte  ... bench:       5,816.53 ns/iter (+/- 163.51) = 1719 MB/s
test bench_push_2_bytes ... bench:       9,846.46 ns/iter (+/- 323.56) = 2031 MB/s
test bench_push_3_bytes ... bench:      12,459.59 ns/iter (+/- 2,992.14) = 2407 MB/s
test bench_push_4_bytes ... bench:      14,598.12 ns/iter (+/- 155.69) = 2740 MB/s

branchless:

test bench_len_chars    ... bench:         312.69 ns/iter (+/- 10.73) = 3282 MB/s
test bench_push_1_byte  ... bench:       8,301.02 ns/iter (+/- 42.66) = 1204 MB/s
test bench_push_2_bytes ... bench:      11,581.18 ns/iter (+/- 39.54) = 1726 MB/s
test bench_push_3_bytes ... bench:      12,544.53 ns/iter (+/- 533.95) = 2391 MB/s
test bench_push_4_bytes ... bench:      14,519.43 ns/iter (+/- 36.51) = 2755 MB/s

Intel i7-1365U, original:

test bench_len_chars    ... bench:         629.65 ns/iter (+/- 13.55) = 1627 MB/s
test bench_push_1_byte  ... bench:       5,443.92 ns/iter (+/- 157.32) = 1837 MB/s
test bench_push_2_bytes ... bench:      11,739.24 ns/iter (+/- 370.86) = 1703 MB/s
test bench_push_3_bytes ... bench:      14,572.58 ns/iter (+/- 25.91) = 2058 MB/s
test bench_push_4_bytes ... bench:      17,236.41 ns/iter (+/- 435.68) = 2320 MB/s

branchless:

test bench_len_chars    ... bench:         551.74 ns/iter (+/- 1.60) = 1858 MB/s
test bench_push_1_byte  ... bench:      10,603.42 ns/iter (+/- 17.86) = 943 MB/s
test bench_push_2_bytes ... bench:      14,588.05 ns/iter (+/- 37.33) = 1370 MB/s
test bench_push_3_bytes ... bench:      16,011.42 ns/iter (+/- 29.64) = 1873 MB/s
test bench_push_4_bytes ... bench:      18,701.51 ns/iter (+/- 18.93) = 2138 MB/s

All of these were using the current nightly on Fedora 40, with default target options.
(i.e. no -Ctarget-cpu or features to enable extra vector stuff.)

$ rustc +nightly -Vv
rustc 1.80.0-nightly (ab14f944a 2024-05-13)
binary: rustc
commit-hash: ab14f944afe4234db378ced3801e637eae6c0f30
commit-date: 2024-05-13
host: x86_64-unknown-linux-gnu
release: 1.80.0-nightly
LLVM version: 18.1.4

@cuviper
Member Author

cuviper commented May 14, 2024

I'll still wait for results from the perf server, but I'll be fine with closing this if there's no clear gain, which seems likely.

@orlp
Contributor

orlp commented May 15, 2024

Can I suggest a hybrid?

pub fn len_utf8_semibranchless(code: u32) -> usize {
    if code < MAX_ONE_B {
        1
    } else {
        2
        + ((code >= MAX_TWO_B) as usize)
        + ((code >= MAX_THREE_B) as usize)
    }
}
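
As a sanity check on the hybrid, a small sketch that compares it against the standard library's `char::len_utf8` for every valid `char`. The thresholds are inlined as literals, and the helper name `first_mismatch` is made up for this illustration:

```rust
/// Hybrid form: branch on ASCII, then sum the remaining two comparisons.
pub const fn len_utf8_semibranchless(code: u32) -> usize {
    if code < 0x80 {
        1
    } else {
        2 + (code >= 0x800) as usize + (code >= 0x10000) as usize
    }
}

/// Returns the first `char` for which the hybrid disagrees with
/// `char::len_utf8`, or `None` if they agree everywhere.
pub fn first_mismatch() -> Option<char> {
    ('\0'..=char::MAX).find(|&c| len_utf8_semibranchless(c as u32) != c.len_utf8())
}
```

The range `'\0'..=char::MAX` skips the surrogate gap, so only valid scalar values are visited.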

@rust-timer
Collaborator

Finished benchmarking commit (1c3131c): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.3% | [0.2%, 0.4%] | 14 |
| Regressions ❌ (secondary) | 0.4% | [0.3%, 0.4%] | 6 |
| Improvements ✅ (primary) | -0.5% | [-0.6%, -0.4%] | 2 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.2% | [-0.6%, 0.4%] | 16 |

Max RSS (memory usage)

Results (primary 0.8%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 3.5% | [2.5%, 4.4%] | 2 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -4.5% | [-4.5%, -4.5%] | 1 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.8% | [-4.5%, 4.4%] | 3 |

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results (primary -0.0%, secondary 0.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.1% | [0.0%, 0.1%] | 12 |
| Regressions ❌ (secondary) | 0.1% | [0.1%, 0.1%] | 1 |
| Improvements ✅ (primary) | -0.1% | [-0.3%, -0.0%] | 9 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -0.0% | [-0.3%, 0.1%] | 21 |

Bootstrap: 679.638s -> 678.862s (-0.11%)
Artifact size: 316.03 MiB -> 316.26 MiB (0.07%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels May 15, 2024
@scottmcm
Member

I do wonder if we should lean into the ASCII branch more. Letting the branch predictor assist runs of ASCII or non-ASCII, which are probably fairly common, might end up being a good idea, like we have branchy fast-paths for ASCII in the UTF-8 checks.
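
To illustrate the idea of leaning into the ASCII branch, here is a rough sketch that counts the UTF-8 length of a `char` slice with a branchy fast path for ASCII runs and a branchless computation for the rest. The function `utf8_len_ascii_fast_path` is hypothetical and is not code from this PR or from the standard library:

```rust
/// Total UTF-8 length of `cs`. ASCII runs are consumed in a tight, easily
/// predicted loop; each non-ASCII char uses the branchless 2..=4 computation.
pub fn utf8_len_ascii_fast_path(cs: &[char]) -> usize {
    let mut total = 0;
    let mut rest = cs;
    while !rest.is_empty() {
        // Fast path: every char in an ASCII run is exactly one byte.
        let run = rest.iter().take_while(|c| c.is_ascii()).count();
        total += run;
        rest = &rest[run..];
        // Slow path: at most one non-ASCII char, needing 2 to 4 bytes.
        if let Some((&c, tail)) = rest.split_first() {
            let code = c as u32;
            total += 2 + (code >= 0x800) as usize + (code >= 0x10000) as usize;
            rest = tail;
        }
    }
    total
}
```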

@lincot

lincot commented May 15, 2024

Indeed, it was the -Ctarget-cpu option that interfered with vectorization.
Without it, the results are:

Original:

test bench_len_chars    ... bench:         744 ns/iter (+/- 13) = 1376 MB/s
test bench_push_1_byte  ... bench:      12,482 ns/iter (+/- 225) = 801 MB/s
test bench_push_2_bytes ... bench:      12,123 ns/iter (+/- 308) = 1649 MB/s
test bench_push_3_bytes ... bench:      16,598 ns/iter (+/- 308) = 1807 MB/s
test bench_push_4_bytes ... bench:      21,426 ns/iter (+/- 668) = 1866 MB/s

Branchless:

test bench_len_chars    ... bench:         404 ns/iter (+/- 9) = 2534 MB/s
test bench_push_1_byte  ... bench:      12,507 ns/iter (+/- 399) = 799 MB/s
test bench_push_2_bytes ... bench:      14,826 ns/iter (+/- 183) = 1348 MB/s
test bench_push_3_bytes ... bench:      16,058 ns/iter (+/- 182) = 1868 MB/s
test bench_push_4_bytes ... bench:      18,949 ns/iter (+/- 254) = 2110 MB/s

Semibranchless:

test bench_len_chars    ... bench:         542 ns/iter (+/- 16) = 1889 MB/s
test bench_push_1_byte  ... bench:      12,658 ns/iter (+/- 377) = 790 MB/s
test bench_push_2_bytes ... bench:      14,220 ns/iter (+/- 193) = 1406 MB/s
test bench_push_3_bytes ... bench:      16,635 ns/iter (+/- 210) = 1803 MB/s
test bench_push_4_bytes ... bench:      21,321 ns/iter (+/- 227) = 1876 MB/s

@cuviper
Member Author

cuviper commented May 15, 2024

Trying @orlp's suggestion...

@bors try @rust-timer queue

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 15, 2024
bors added a commit to rust-lang-ci/rust that referenced this pull request May 15, 2024
Remove the branches from `len_utf8`

@bors
Contributor

bors commented May 15, 2024

⌛ Trying commit ba2f5a9 with merge 681b867...

@bors
Contributor

bors commented May 15, 2024

☀️ Try build successful - checks-actions
Build commit: 681b867 (681b867cb4045990114517eda02cfc9a3f8cb9c8)

@rust-timer
Collaborator

Finished benchmarking commit (681b867): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.3% | [0.3%, 0.3%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -0.3% | [-0.4%, -0.2%] | 2 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -0.1% | [-0.4%, 0.3%] | 3 |

Max RSS (memory usage)

Results (primary -2.3%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.4% | [0.0%, 2.8%] | 2 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -4.8% | [-9.1%, -2.6%] | 3 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -2.3% | [-9.1%, 2.8%] | 5 |

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results (primary 0.0%, secondary 0.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.1% | [0.0%, 0.4%] | 5 |
| Regressions ❌ (secondary) | 0.1% | [0.1%, 0.1%] | 3 |
| Improvements ✅ (primary) | -0.1% | [-0.2%, -0.0%] | 6 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.0% | [-0.2%, 0.4%] | 11 |

Bootstrap: 678.357s -> 680.336s (0.29%)
Artifact size: 316.14 MiB -> 316.20 MiB (0.02%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 15, 2024
@lincot

lincot commented May 15, 2024

In String::push, the second use of len_utf8 matches against
the possible lengths, so the branches are inevitable there,
and the branchless version doesn't play well with it (godbolt).
In the first use, the reserve call, the branchless version lengthens the ASCII
path, and the semibranchless version only removes a single jump (godbolt).

The semibranchless version also doesn't allow vectorization.
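
To make the "second use" concrete, here is a safe sketch of an encoder that matches on the computed length. It is a simplified stand-in for the unsafe `encode_utf8_raw_unchecked` in the benchmark, and the name `encode_utf8_sketch` is made up for this illustration. Whichever way the length is computed, the `match` still has to branch on it:

```rust
/// Encodes `c` into `dst` and returns the number of bytes written.
pub fn encode_utf8_sketch(c: char, dst: &mut [u8; 4]) -> usize {
    let code = c as u32;
    // Branchless length computation...
    let len = 1
        + (code >= 0x80) as usize
        + (code >= 0x800) as usize
        + (code >= 0x10000) as usize;
    // ...followed by a match that branches on the result anyway.
    match len {
        1 => dst[0] = code as u8,
        2 => {
            dst[0] = (code >> 6 & 0x1F) as u8 | 0b1100_0000;
            dst[1] = (code & 0x3F) as u8 | 0b1000_0000;
        }
        3 => {
            dst[0] = (code >> 12 & 0x0F) as u8 | 0b1110_0000;
            dst[1] = (code >> 6 & 0x3F) as u8 | 0b1000_0000;
            dst[2] = (code & 0x3F) as u8 | 0b1000_0000;
        }
        // `len` is always in 1..=4, so this arm handles the 4-byte case.
        _ => {
            dst[0] = (code >> 18 & 0x07) as u8 | 0b1111_0000;
            dst[1] = (code >> 12 & 0x3F) as u8 | 0b1000_0000;
            dst[2] = (code >> 6 & 0x3F) as u8 | 0b1000_0000;
            dst[3] = (code & 0x3F) as u8 | 0b1000_0000;
        }
    }
    len
}
```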

I've added a benchmark for random chars and found that its mere presence
slows down bench_push_1_byte
(fixed by setting codegen-units to 1).

benchmark results

Branchy, then branchy:

test bench_push_1_byte       ... bench:      10,436 ns/iter (+/- 116) = 958 MB/s
test bench_push_2_bytes      ... bench:      12,065 ns/iter (+/- 136) = 1657 MB/s
test bench_push_3_bytes      ... bench:      16,792 ns/iter (+/- 243) = 1786 MB/s
test bench_push_4_bytes      ... bench:      19,085 ns/iter (+/- 379) = 2095 MB/s
test bench_push_random_bytes ... bench:      22,145 ns/iter (+/- 851) = 1128 MB/s

Branchless, then branchy:

test bench_push_1_byte       ... bench:      11,297 ns/iter (+/- 146) = 885 MB/s
test bench_push_2_bytes      ... bench:      12,112 ns/iter (+/- 250) = 1651 MB/s
test bench_push_3_bytes      ... bench:      16,727 ns/iter (+/- 110) = 1793 MB/s
test bench_push_4_bytes      ... bench:      19,365 ns/iter (+/- 440) = 2065 MB/s
test bench_push_random_bytes ... bench:      19,006 ns/iter (+/- 465) = 1315 MB/s

Semibranchless, then branchy:

test bench_push_1_byte       ... bench:      10,452 ns/iter (+/- 141) = 956 MB/s
test bench_push_2_bytes      ... bench:      12,218 ns/iter (+/- 337) = 1636 MB/s
test bench_push_3_bytes      ... bench:      16,955 ns/iter (+/- 312) = 1769 MB/s
test bench_push_4_bytes      ... bench:      19,153 ns/iter (+/- 990) = 2088 MB/s
test bench_push_random_bytes ... bench:      20,529 ns/iter (+/- 779) = 1217 MB/s

Branchless, then branchless:

test bench_push_1_byte       ... bench:      12,068 ns/iter (+/- 757) = 828 MB/s
test bench_push_2_bytes      ... bench:      14,547 ns/iter (+/- 341) = 1374 MB/s
test bench_push_3_bytes      ... bench:      16,833 ns/iter (+/- 370) = 1782 MB/s
test bench_push_4_bytes      ... bench:      19,103 ns/iter (+/- 172) = 2093 MB/s
test bench_push_random_bytes ... bench:      75,144 ns/iter (+/- 576) = 332 MB/s
bench.rs
#![feature(test)]

extern crate test;
use core::{array, mem::MaybeUninit};
use rand::seq::SliceRandom;
use rand_pcg::Pcg64Mcg;
use test::{black_box, Bencher};

const TAG_CONT: u8 = 0b1000_0000;
const TAG_TWO_B: u8 = 0b1100_0000;
const TAG_THREE_B: u8 = 0b1110_0000;
const TAG_FOUR_B: u8 = 0b1111_0000;
const MAX_ONE_B: u32 = 0x80;
const MAX_TWO_B: u32 = 0x800;
const MAX_THREE_B: u32 = 0x10000;

#[inline]
const fn len_utf8_branchy(code: u32) -> usize {
    if code < MAX_ONE_B {
        1
    } else if code < MAX_TWO_B {
        2
    } else if code < MAX_THREE_B {
        3
    } else {
        4
    }
}

#[inline]
const fn len_utf8_branchless(code: u32) -> usize {
    1 + ((code >= MAX_ONE_B) as usize)
        + ((code >= MAX_TWO_B) as usize)
        + ((code >= MAX_THREE_B) as usize)
}

#[inline]
const fn len_utf8_semibranchless(code: u32) -> usize {
    if code < MAX_ONE_B {
        1
    } else {
        2 + ((code >= MAX_TWO_B) as usize) + ((code >= MAX_THREE_B) as usize)
    }
}

#[inline]
pub fn push(s: &mut String, ch: char) {
    let len = s.len();
    let ch_len = len_utf8_branchless(ch as u32);
    s.reserve(ch_len);

    // SAFETY: at least the length needed to encode `ch`
    // has been reserved in `s`
    unsafe {
        encode_utf8_raw_unchecked(ch as u32, s.as_mut_vec().spare_capacity_mut());
        s.as_mut_vec().set_len(len + ch_len);
    }
}

#[inline]
pub unsafe fn encode_utf8_raw_unchecked(code: u32, dst: &mut [MaybeUninit<u8>]) -> &mut [u8] {
    let len = len_utf8_branchy(code);
    // SAFETY: the caller must guarantee that `dst` is at least `len` bytes long
    unsafe {
        match len {
            1 => {
                dst.get_unchecked_mut(0).write(code as u8);
            }
            2 => {
                dst.get_unchecked_mut(0)
                    .write((code >> 6 & 0x1F) as u8 | TAG_TWO_B);
                dst.get_unchecked_mut(1)
                    .write((code & 0x3F) as u8 | TAG_CONT);
            }
            3 => {
                dst.get_unchecked_mut(0)
                    .write((code >> 12 & 0x0F) as u8 | TAG_THREE_B);
                dst.get_unchecked_mut(1)
                    .write((code >> 6 & 0x3F) as u8 | TAG_CONT);
                dst.get_unchecked_mut(2)
                    .write((code & 0x3F) as u8 | TAG_CONT);
            }
            4 => {
                dst.get_unchecked_mut(0)
                    .write((code >> 18 & 0x07) as u8 | TAG_FOUR_B);
                dst.get_unchecked_mut(1)
                    .write((code >> 12 & 0x3F) as u8 | TAG_CONT);
                dst.get_unchecked_mut(2)
                    .write((code >> 6 & 0x3F) as u8 | TAG_CONT);
                dst.get_unchecked_mut(3)
                    .write((code & 0x3F) as u8 | TAG_CONT);
            }
            _ => unreachable!(),
        }
    }

    // SAFETY: data has been written to the first `len` bytes
    unsafe { &mut *(dst.get_unchecked_mut(..len) as *mut [MaybeUninit<u8>] as *mut [u8]) }
}

const ITERATIONS: u64 = if cfg!(miri) { 1 } else { 10_000 };

#[bench]
fn bench_push_1_byte(bencher: &mut Bencher) {
    const CHAR: char = '0';
    assert_eq!(CHAR.len_utf8(), 1);
    bencher.bytes = ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity(ITERATIONS as _);
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

#[bench]
fn bench_push_2_bytes(bencher: &mut Bencher) {
    const CHAR: char = 'д';
    assert_eq!(CHAR.len_utf8(), 2);
    bencher.bytes = 2 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity((2 * ITERATIONS) as _);
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

#[bench]
fn bench_push_3_bytes(bencher: &mut Bencher) {
    const CHAR: char = '❗';
    assert_eq!(CHAR.len_utf8(), 3);
    bencher.bytes = 3 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity((3 * ITERATIONS) as _);
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

#[bench]
fn bench_push_4_bytes(bencher: &mut Bencher) {
    const CHAR: char = '🤨';
    assert_eq!(CHAR.len_utf8(), 4);
    bencher.bytes = 4 * ITERATIONS;
    bencher.iter(|| {
        let mut s = String::with_capacity((4 * ITERATIONS) as _);
        for _ in 0..black_box(ITERATIONS) {
            push(&mut s, black_box(CHAR));
        }
        s
    });
}

#[bench]
fn bench_push_random_bytes(bencher: &mut Bencher) {
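    // The four candidate chars are 1, 2, 3, and 4 bytes long, so a uniformly
    // random pick averages (1 + 2 + 3 + 4) / 4 = 2.5 bytes per char.
    bencher.bytes = (2 + 3) * ITERATIONS / 2;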
    let mut rng = Pcg64Mcg::new(0xcafe_f00d_d15e_a5e5);
    let input: [_; ITERATIONS as usize] =
        array::from_fn(|_| *['0', 'д', '❗', '🤨'].choose(&mut rng).unwrap());
    bencher.iter(|| {
        let mut s = String::with_capacity((4 * ITERATIONS) as _);
        for c in input {
            push(&mut s, black_box(c));
        }
        s
    });
}

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 25, 2024
@Mark-Simulacrum
Member

The results look pretty neutral to me, @cuviper -- feel free to close or continue iterating.

Labels
perf-regression Performance regression. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs Relevant to the library team, which will review and decide on the PR/issue.

8 participants