
remove reserve_for_push #104668

Closed

Conversation

conradludgate (Contributor)

Based on a quick investigation around extend_one, I don't believe the reserve_for_push abstraction is beneficial.

Quick local benchmarks show a small improvement; I'd be interested in a try build too.
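For context, here is a minimal, hypothetical sketch of the shape of the abstraction under discussion (not the real std source; `MiniVec` and its growth policy are illustrative): `push` keeps its hot path to a single capacity check and routes the rare out-of-capacity case through a dedicated `#[inline(never)]` helper, mirroring how `reserve_for_push` sits next to `Vec::push`.

```rust
// Hypothetical sketch of the reserve_for_push pattern; MiniVec and the
// doubling policy are illustrative, not the actual std implementation.
struct MiniVec<T> {
    buf: Vec<T>, // backing storage stands in for std's RawVec here
}

impl<T> MiniVec<T> {
    fn push(&mut self, value: T) {
        // Hot path: the capacity check is the only extra work in the
        // common case.
        if self.buf.len() == self.buf.capacity() {
            // Rare path: kept out of the caller via #[inline(never)].
            self.reserve_for_push();
        }
        self.buf.push(value);
    }

    #[inline(never)]
    fn reserve_for_push(&mut self) {
        // Amortized growth: roughly double the capacity (minimum 4).
        let new_cap = (self.buf.capacity() * 2).max(4);
        self.buf.reserve_exact(new_cap - self.buf.len());
    }
}

fn main() {
    let mut v = MiniVec { buf: Vec::new() };
    for i in 0..1000 {
        v.push(i);
    }
    assert_eq!(v.buf.len(), 1000);
    println!("ok");
}
```

The PR's question is whether this extra out-of-line helper actually pays for itself compared to reusing the general reserve path.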

@rustbot (Collaborator)

rustbot commented Nov 21, 2022

r? @thomcc

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Nov 21, 2022
@rustbot (Collaborator)

rustbot commented Nov 21, 2022

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs, please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs, please edit the PR description to add a link to the relevant API Change Proposal, or create one if you haven't already. If you're unsure where your change falls, no worries: just leave it as is and the reviewer will take a look and decide whether to forward it on if necessary.

Examples of T-libs-api changes:

  • Stabilizing library features
  • Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
  • Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
  • Changing public documentation in ways that create new stability guarantees
  • Changing observable runtime behavior of library APIs

@Kobzol (Contributor)

Kobzol commented Nov 21, 2022

It was introduced ~a year ago (#91352), and at the time it was a compile-time win. Let's see what happens now :)

@bors try @rust-timer queue


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Nov 21, 2022
@bors (Contributor)

bors commented Nov 21, 2022

⌛ Trying commit 48cb20e with merge 5c3b88615f267d31914a465c4f0efe1b142a7467...

@conradludgate (Contributor, Author)

Thanks! If this makes compile times slower (I'd guess due to inlining), I have an alternative idea: use #[cold] or unlikely.
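A minimal sketch of that alternative (illustrative names, not actual std code): annotate the growth function itself with `#[cold]` rather than wrapping it in an `#[inline(never)]` helper, so the optimizer lays the rarely-taken branch out of the hot path while still making its own inlining decisions.

```rust
// Hypothetical sketch of the #[cold] alternative; the function names and
// doubling policy are illustrative, not the actual std implementation.
#[cold]
fn grow(buf: &mut Vec<u32>) {
    // Same amortized doubling as before; #[cold] hints that this call is
    // rarely taken, without forbidding inlining outright.
    let new_cap = (buf.capacity() * 2).max(4);
    buf.reserve_exact(new_cap - buf.len());
}

fn push(buf: &mut Vec<u32>, value: u32) {
    if buf.len() == buf.capacity() {
        grow(buf); // unlikely branch, hinted cold
    }
    buf.push(value);
}

fn main() {
    let mut v = Vec::new();
    for i in 0..100 {
        push(&mut v, i);
    }
    assert_eq!(v.len(), 100);
    println!("ok");
}
```

The trade-off is that `#[cold]` is a heuristic hint rather than a hard barrier, so it can behave differently from `#[inline(never)]` in large callers.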

@Kobzol (Contributor)

Kobzol commented Nov 21, 2022

Well, runtime benchmarks are still WIP, so on CI we can only measure compile-time performance plus the runtime performance of rustc itself. If you think this change produces runtime performance benefits, please post some benchmark numbers.

@conradludgate (Contributor, Author)

A simple benchmark with criterion 0.4, pushing 1 << 18 numbers into the vec:

[Screenshot: criterion benchmark results, 2022-11-21 14:18]

use criterion::{black_box, criterion_group, criterion_main, Criterion};

pub fn push(c: &mut Criterion) {
    c.bench_function("push", |b| {
        b.iter(|| {
            let mut v = Vec::new();
            for i in black_box(1..1 << 18) {
                v.push(i);
            }
            black_box(v);
        })
    });
}

criterion_group!(benches, push);
criterion_main!(benches);

In case criterion didn't like my changes, I also made a much simpler benchmark, which showed a ~50% reduction in time:

[Screenshot: benchmark output, 2022-11-21 14:21]

use std::hint::black_box;
use std::time::{Duration, Instant};

#[test]
fn foo_bench() {
    let mut durs = vec![Duration::ZERO; 10000];
    let mut start = Instant::now();
    for dur in &mut durs {
        black_box(foo()); // `foo` is the function under test (elided here)
        let next = Instant::now();
        *dur = next - start;
        start = next;
    }
    dbg!(durs.into_iter().sum::<Duration>() / 10000);
}

I couldn't believe these numbers, so I rebuilt stage2 from master and got the same results as nightly. I'm not sure if there's a fault with my benchmark methodology here. This is running on a MacBook Pro 13" M1.

@conradludgate (Contributor, Author)

A theory could be that the improvements had a compounding effect within criterion, since it must use Vec::push internally.

Another theory could be that my new code just doesn't work correctly, is UB, and gets optimised away (this seems extremely unlikely, though).

@Kobzol (Contributor)

Kobzol commented Nov 21, 2022

It doesn't have to be optimized away just because it's UB. Now that the code isn't hidden behind #[inline(never)], in theory it could be inlined all the way and then optimized away (although the reserve function also contains an uninlineable block of code). Could you please post the assembly, e.g. using cargo asm?

@bors (Contributor)

bors commented Nov 21, 2022

☀️ Try build successful - checks-actions
Build commit: 5c3b88615f267d31914a465c4f0efe1b142a7467


@conradludgate (Contributor, Author)

conradludgate commented Nov 21, 2022

OK, I think the inlining was quite aggressive. I moved the vec push loop from the benchmark iter into a library function and called that instead:

[Screenshot: criterion benchmark results, 2022-11-21 15:23]

This code shows a ~5-15% reduction in time.

The asm of the library function isn't different from before, apart from calling do_reserve_and_handle instead of reserve_for_push.

The gains seem to come from do_reserve_and_handle; for some reason it produces different codegen here. Maybe it can optimise slightly better when it doesn't need to do the reservation twice?

@saethlin (Member)

(It is generally preferable to paste text, not images.)

Some of these benchmarks look encouraging, but I am suspicious because this is all microbenchmarking. In a microbenchmark the optimal inlining heuristic is often "yes", but in a real program, whether or not cold paths are inlined often has a cascading effect on later inlining decisions.

So in addition to these benchmarks and whatever rustc-perf says, I'd want to know: what is the code size of a Vec::push call after this change? If it has shrunk, that is awesome.

@rust-timer (Collaborator)

Finished benchmarking commit (5c3b88615f267d31914a465c4f0efe1b142a7467): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

                            mean    range            count
Regressions ❌ (primary)      0.6%   [0.2%, 1.5%]     15
Regressions ❌ (secondary)    0.5%   [0.3%, 0.7%]     14
Improvements ✅ (primary)     -      -                 0
Improvements ✅ (secondary)   -      -                 0
All ❌✅ (primary)            0.6%   [0.2%, 1.5%]     15

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                            mean    range              count
Regressions ❌ (primary)      0.1%   [0.1%, 0.1%]       1
Regressions ❌ (secondary)    2.4%   [2.4%, 2.4%]       1
Improvements ✅ (primary)    -4.6%   [-4.6%, -4.6%]     1
Improvements ✅ (secondary)  -2.2%   [-3.2%, -0.9%]     7
All ❌✅ (primary)           -2.3%   [-4.6%, 0.1%]      2

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                            mean    range            count
Regressions ❌ (primary)      0.8%   [0.8%, 0.8%]      1
Regressions ❌ (secondary)    2.4%   [2.2%, 2.6%]      3
Improvements ✅ (primary)     -      -                 0
Improvements ✅ (secondary)   -      -                 0
All ❌✅ (primary)            0.8%   [0.8%, 0.8%]      1

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Nov 21, 2022
@the8472 (Member)

the8472 commented Nov 21, 2022

Alloc has a bunch of built-in benches. You can run them with ./x bench --stage 0 library/alloc. There are some additional instructions in the std dev guide.

Edit: ah, never mind, I just saw the perf.rlo results.

@conradludgate (Contributor, Author)

conradludgate commented Nov 21, 2022

Alloc has a bunch of built-in benches

I couldn't find many that do a Vec::push underneath.

Running the benchmarks against String, because they have some String::push calls:

master:

test string::bench_exact_size_shrink_to_fit              ... bench:          28 ns/iter (+/- 0)
test string::bench_from                                  ... bench:          28 ns/iter (+/- 0)
test string::bench_from_str                              ... bench:          28 ns/iter (+/- 0)
test string::bench_insert_char_long                      ... bench:          76 ns/iter (+/- 1)
test string::bench_insert_char_short                     ... bench:          74 ns/iter (+/- 0)
test string::bench_insert_str_long                       ... bench:          73 ns/iter (+/- 1)
test string::bench_insert_str_short                      ... bench:          74 ns/iter (+/- 0)
test string::bench_push_char_one_byte                    ... bench:       4,567 ns/iter (+/- 64) = 2189 MB/s
test string::bench_push_char_two_bytes                   ... bench:      26,438 ns/iter (+/- 590) = 756 MB/s
test string::bench_push_str                              ... bench:          31 ns/iter (+/- 0)
test string::bench_push_str_one_byte                     ... bench:      23,057 ns/iter (+/- 247) = 433 MB/s
test string::bench_to_string                             ... bench:          28 ns/iter (+/- 0)
test string::bench_with_capacity                         ... bench:          26 ns/iter (+/- 0)
test string::from_utf8_lossy_100_ascii                   ... bench:          69 ns/iter (+/- 0)
test string::from_utf8_lossy_100_invalid                 ... bench:       1,120 ns/iter (+/- 25)
test string::from_utf8_lossy_100_multibyte               ... bench:          65 ns/iter (+/- 1)
test string::from_utf8_lossy_invalid                     ... bench:         117 ns/iter (+/- 1)

branch:

test string::bench_exact_size_shrink_to_fit              ... bench:          29 ns/iter (+/- 1)
test string::bench_from                                  ... bench:          28 ns/iter (+/- 0)
test string::bench_from_str                              ... bench:          28 ns/iter (+/- 0)
test string::bench_insert_char_long                      ... bench:          76 ns/iter (+/- 1)
test string::bench_insert_char_short                     ... bench:          74 ns/iter (+/- 1)
test string::bench_insert_str_long                       ... bench:          73 ns/iter (+/- 0)
test string::bench_insert_str_short                      ... bench:          74 ns/iter (+/- 0)
test string::bench_push_char_one_byte                    ... bench:       4,568 ns/iter (+/- 18) = 2189 MB/s
test string::bench_push_char_two_bytes                   ... bench:      29,720 ns/iter (+/- 1,413) = 672 MB/s
test string::bench_push_str                              ... bench:          30 ns/iter (+/- 0)
test string::bench_push_str_one_byte                     ... bench:      26,201 ns/iter (+/- 145) = 381 MB/s
test string::bench_to_string                             ... bench:          28 ns/iter (+/- 0)
test string::bench_with_capacity                         ... bench:          26 ns/iter (+/- 0)
test string::from_utf8_lossy_100_ascii                   ... bench:          37 ns/iter (+/- 0)
test string::from_utf8_lossy_100_invalid                 ... bench:       1,172 ns/iter (+/- 22)
test string::from_utf8_lossy_100_multibyte               ... bench:          55 ns/iter (+/- 1)
test string::from_utf8_lossy_invalid                     ... bench:         113 ns/iter (+/- 2)

This looks like a regression to me. I don't understand where it comes from, though, as the String::push asm looks identical to me, and I don't understand why push_str would be impacted, as it doesn't perform any Vec::push calls underneath.

@the8472 (Member)

the8472 commented Nov 21, 2022

As the linked guide says, you'd have to make sure it's not noise. But even if it isn't, it's not an improvement, the perf.rlo benchmarks don't look good either, and binary sizes have increased too.

@thomcc (Member)

thomcc commented Nov 24, 2022

Yeah, in its current state I don't see evidence that this helps.

@rustbot author

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 24, 2022
Labels
perf-regression Performance regression. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs Relevant to the library team, which will review and decide on the PR/issue.

8 participants