perf: Use for_each in Vec::extend #68046

Closed
wants to merge 8 commits

Conversation

@Marwes (Contributor) commented Jan 9, 2020

`for_each` is specialized for iterators such as `chain`, allowing for
faster iteration than a normal `for`/`while` loop.

Note that since this only checks `size_hint` once at the start, it may
end up needing to call `reserve` more often in the case where `size_hint`
returns a larger and more accurate lower bound during iteration.

This could perhaps be alleviated with an implementation closer to the current
one, but the extra complexity would likely end up harming the normal case
of an accurate or 0 (think `filter`) lower bound.

```rust
while let Some(element) = iterator.next() {
    let (lower, _) = iterator.size_hint();
    self.reserve(lower.saturating_add(1));
    unsafe {
        let len = self.len();
        ptr::write(self.get_unchecked_mut(len), element);
        // NB can't overflow since we would have had to alloc the address space
        self.set_len(len + 1);
    }

    iterator.by_ref().take(self.capacity()).for_each(|element| {
        unsafe {
            let len = self.len();
            ptr::write(self.get_unchecked_mut(len), element);
            // NB can't overflow since we would have had to alloc the address space
            self.set_len(len + 1);
        }
    });
}

// OR

let (lower, _) = iterator.size_hint();
self.reserve(lower);
loop {
    let result = iterator.by_ref().try_for_each(|element| {
        if self.len() == self.capacity() {
            return Err(element);
        }
        unsafe {
            let len = self.len();
            ptr::write(self.get_unchecked_mut(len), element);
            // NB can't overflow since we would have had to alloc the address space
            self.set_len(len + 1);
        }
        Ok(())
    });

    match result {
        Ok(()) => break,
        Err(element) => {
            let (lower, _) = iterator.size_hint();
            self.reserve(lower.saturating_add(1));
            self.push(element);
        }
    }
}
```

Closes #63340

@rust-highfive (Collaborator)

r? @Mark-Simulacrum

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jan 9, 2020
@Marwes (Contributor, Author) commented Jan 9, 2020

It seems that benchmarks aren't run/checked by CI, as `./x.py bench --stage=1 src/liballoc` failed with a missing import for `black_box`.

@Centril (Contributor) commented Jan 9, 2020

@bors try @rust-timer queue

@rust-timer (Collaborator)

Awaiting bors try build completion

@bors (Contributor) commented Jan 9, 2020

⌛ Trying commit fd175a8 with merge 45ce712...

bors added a commit that referenced this pull request Jan 9, 2020
perf: Use `for_each` in `Vec::extend`

@bors (Contributor) commented Jan 9, 2020

☀️ Try build successful - checks-azure
Build commit: 45ce712 (45ce712754dc7ab7e68cbc506e6d43eb04e74128)

@rust-timer (Collaborator)

Queued 45ce712 with parent adc6572, future comparison URL.

@rust-timer (Collaborator)

Finished benchmarking try commit 45ce712, comparison URL.

@Marwes (Contributor, Author) commented Jan 9, 2020

The piston-image regression doesn't make any sense...

@Mark-Simulacrum (Member)

I don't have time myself to figure out why the benchmarks here regressed, but it's not obvious that the regression is truly spurious; if we're generating many more impls, or more complicated ones, due to unrolling or the like, it's not impossible that we're genuinely regressing.

(For one thing, the new code contains a closure, whereas the previous code didn't and so could plausibly instantiate less.)

Happy to rerun the compiler benchmarks if you want to remove the closure (I suspect it might be replaceable with `Vec::push`).

I would also like to see some ad-hoc benchmarks (e.g., extending with 100 elements or something) done locally before/after this change.
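A minimal sketch of such an ad-hoc benchmark (hypothetical; it uses only `std::time::Instant` rather than the in-tree `test::Bencher` harness):

```rust
use std::time::Instant;

fn main() {
    const ROUNDS: usize = 1_000_000;

    let start = Instant::now();
    let mut total = 0usize;
    for _ in 0..ROUNDS {
        // Extend an empty Vec with 100 elements, as suggested above.
        let mut v: Vec<u32> = Vec::new();
        v.extend(0..100u32);
        // Keep the result observable so the loop is not optimized away.
        total = total.wrapping_add(v.len());
    }
    println!("total = {}, elapsed = {:?}", total, start.elapsed());
}
```

Running it with a toolchain built before and after the change gives a rough comparison.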

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 19, 2020
@rust-highfive (Collaborator)

Your PR failed (pretty log, raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

```
2020-01-21T18:05:13.5527553Z ========================== Starting Command Output ===========================
2020-01-21T18:05:13.5531691Z [command]/bin/bash --noprofile --norc /home/vsts/work/_temp/c22da964-3b9a-4ac0-ba81-1adfd7784d5e.sh
2020-01-21T18:05:13.5531798Z 
2020-01-21T18:05:13.5536050Z ##[section]Finishing: Disable git automatic line ending conversion
2020-01-21T18:05:13.5542651Z ##[section]Starting: Checkout rust-lang/rust@refs/pull/68046/merge to s
2020-01-21T18:05:13.5544420Z Task         : Get sources
2020-01-21T18:05:13.5544501Z Description  : Get sources from a repository. Supports Git, TfsVC, and SVN repositories.
2020-01-21T18:05:13.5544535Z Version      : 1.0.0
2020-01-21T18:05:13.5544566Z Author       : Microsoft
---
2020-01-21T18:05:14.5844515Z ##[command]git remote add origin https://github.com/rust-lang/rust
2020-01-21T18:05:14.5926728Z ##[command]git config gc.auto 0
2020-01-21T18:05:14.5933752Z ##[command]git config --get-all http.https://github.com/rust-lang/rust.extraheader
2020-01-21T18:05:14.5938643Z ##[command]git config --get-all http.proxy
2020-01-21T18:05:14.5946458Z ##[command]git -c http.extraheader="AUTHORIZATION: basic ***" fetch --force --tags --prune --progress --no-recurse-submodules --depth=2 origin +refs/heads/*:refs/remotes/origin/* +refs/pull/68046/merge:refs/remotes/pull/68046/merge
```

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@the8472 (Member) commented Jan 22, 2020

> Note that since this only checks `size_hint` once at the start, it may
> end up needing to call `reserve` more often in the case where `size_hint`
> returns a larger and more accurate lower bound during iteration.

It's worse than that. The current implementation takes one element from the iterator and only then queries the `size_hint`; this PR takes the `size_hint` first. The standard library contains iterators that initially can't provide a size hint at all and only provide better ones as you iterate, perhaps only much later.

The btree range iterators and flattening iterators would be such examples.

```rust
let mut flat = vec![vec!['b'; 1], vec!['c'; 1_000_000]].into_iter().flatten();
dbg!(flat.size_hint());
flat.next();
dbg!(flat.size_hint());
flat.next();
dbg!(flat.size_hint());
```

```
[src/main.rs:7] flat.size_hint() = (
    0,
    None,
)
[src/main.rs:9] flat.size_hint() = (
    0,
    None,
)
[src/main.rs:11] flat.size_hint() = (
    999999,
    Some(
        999999,
    ),
)
```

I don't think this is currently covered by benchmarks, so the drawbacks of one case are easily missed over the advantages in another case.
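A sketch of a benchmark that would cover this case (hypothetical; written in the style of the liballoc benches, which require the nightly `test` crate):

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Extend from a `flatten` iterator whose lower bound stays at 0 until the
// large inner vector is reached, so an implementation that only checks
// `size_hint` up front reserves nothing useful.
#[bench]
fn bench_extend_flatten_late_hint(b: &mut Bencher) {
    b.iter(|| {
        let nested = vec![vec![0u8; 1], vec![0u8; 100_000]];
        let mut v: Vec<u8> = Vec::new();
        v.extend(black_box(nested).into_iter().flatten());
        v
    });
}
```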

@the8472 (Member) left a comment

Have you run the whole battery of `Vec` benchmarks and done a before/after comparison, e.g. via `cargo benchcmp`?

```rust
self.reserve(lower);
loop {
    let cap = self.capacity();
    let result = iterator.by_ref().try_fold((), |(), element| {
```
Member

You can use `try_for_each`, which takes a reference (so no `by_ref` needed) and doesn't require an accumulator to fold through.
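A small, self-contained illustration of the difference (hypothetical, outside the `Vec` internals): `try_fold` threads a `()` accumulator through the closure, while `try_for_each` passes only the element:

```rust
fn main() {
    // With try_fold the closure has to accept (and return) the unused () accumulator.
    let mut it = (0..5).chain(5..10);
    let with_fold: Result<(), i32> =
        it.try_fold((), |(), x| if x < 8 { Ok(()) } else { Err(x) });
    assert_eq!(with_fold, Err(8));

    // try_for_each drops the accumulator, so the closure only sees the element.
    let mut it = (0..5).chain(5..10);
    let with_for_each: Result<(), i32> =
        it.try_for_each(|x| if x < 8 { Ok(()) } else { Err(x) });
    assert_eq!(with_for_each, Err(8));
}
```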

Member

Until #46477 is fixed you should probably replace the closure by passing a free function instead (as #62429 did for iterators). I think that's what @Mark-Simulacrum meant.

@Marwes (Contributor, Author) commented Jan 22, 2020

> You can use `try_for_each`, which takes a reference (so no `by_ref` needed) and doesn't require an accumulator to fold through.

`try_for_each` just forwards to `try_fold` for `Chain` (and the same is true for most other iterators). I wanted to avoid tripping up the optimizer as much as possible, as I have not seen any improvements so far.

`try_fold` also only takes `&mut self`, so `by_ref` should definitely be removed: I just saw that the `Iterator` impl for `&mut T` is unable to forward `try_fold` and friends due to object safety, which means I was just using the default implementation of `try_fold`, easily explaining why I am not seeing any improvement for chained iterators... Actually, `by_ref()` shouldn't affect anything in this case. The correct `try_fold` should still be selected, as it would require calling `try_fold` on a `&mut &mut T` to get the wrong one.

This might be fixable with some specialization, at least: #68472

> Until #46477 is fixed you should probably replace the closure by passing a free function instead (as #62429 did for iterators). I think that's what @Mark-Simulacrum meant.

Yeah, I should do that. I was just trying to get an actual benchmark showing that this was an improvement first :/

```rust
// NB can't overflow since we would have had to alloc the address space
self.set_len(len + 1);
let (lower, _) = iterator.size_hint();
self.reserve(lower);
```
Member

This still has the issue of trying to reserve before advancing the iterator. Taking an element and then checking the hint and capacity is strictly better, because the hint will be more accurate, and if you get a `None` you can skip the hint calculation and the reserve attempt entirely.

@Marwes (Contributor, Author) commented Jan 22, 2020

The benefit is that it avoids calling `next` at all. While the hint should be more accurate after calling `next`, it would be even more accurate if we called `next` twice, so I don't quite buy that argument. Calling `next` once would at least give a fast path for the empty iterator, which I find a more convincing argument; however, it is only a benefit if `size_hint(); reserve(0)` is slow enough to warrant this fast path.

Pushed another variant which calls `next` first, which should optimize better for the empty iterator. I haven't had time to benchmark it yet.

Member

I was more concerned about doing a suboptimal resize if we got an incorrect lower bound on the first attempt, but you're right: if it's just 0 it should hopefully be cheap enough, and the next `reserve` will do it properly.

@Marwes (Contributor, Author)

I have gone back and forth a bit, but I think reserving eagerly has the better tradeoff.

When `size_hint` is 0 it does call `reserve` unnecessarily, but that is only one more branch in an already fast case. If `size_hint > 0`, then most of the time we will reserve either way, and most of the time with the same value (`size_hint_minus_removed_value + 1` vs `size_hint`).

The only case where I'd expect it to be faster to call `next` first is if `size_hint` is expensive despite the iterator being empty, and `next` is faster to call than `size_hint`, which seems unlikely (happy to be corrected, however!).

@Swatinem (Contributor)

Just a driveby comment, without knowing any of the details…

Is it possible this is an opposite case of #64572?

@the8472 (Member) commented Jan 24, 2020

In some ways. But that PR was superseded by #64600 and #64885, which achieved most of those gains without doing away with internal iteration. So ideally we'd improve chain iterators here without incurring significant losses elsewhere.

@Marwes (Contributor, Author) commented Jan 24, 2020

LLVM seems to be clever enough to optimize the current, `next`-calling `extend` into the same optimized code that we expect from `try_fold`. I will try to create a better benchmark which shows an actual difference between the approaches.

Another perf run could be interesting though, to see if the regressions are at least fixed with the current implementation (I'm seeing lots of variance in my benchmarks, but it does not appear slower at least).
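A sketch of the kind of `chain` benchmark meant above (hypothetical; in the style of the liballoc benches, which need the nightly `test` crate and its `black_box`):

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Extending from a chained iterator: the case where an internal-iteration
// based extend (for_each/try_fold) is expected to beat a plain next() loop.
#[bench]
fn bench_extend_chain(b: &mut Bencher) {
    b.iter(|| {
        let mut v: Vec<u32> = Vec::with_capacity(2048);
        v.extend(black_box(0..1024u32).chain(black_box(0..1024u32)));
        v
    });
}
```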

@the8472 (Member) commented Jan 24, 2020

> seeing lots of variance in my benchmarks

Monitor your CPU clocks when running microbenchmarks. Thermal throttling can confound results. Disabling boost clocks can help if that's the case.

Markus Westerlind and others added 6 commits February 6, 2020 21:14

Should put less stress on LLVM since there are fewer closures passed around, and lets us refine how much we reserve with `size_hint` if the first guess is too low.
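A hypothetical sketch (not the actual commit) of the shape such a closure-free loop could take, written here against a plain `&mut Vec<T>` with safe `push` in place of the raw-pointer writes above; the function name is made up for illustration:

```rust
fn extend_desugared_sketch<T, I: Iterator<Item = T>>(vec: &mut Vec<T>, mut iterator: I) {
    // First guess from the iterator's lower bound.
    let (lower, _) = iterator.size_hint();
    vec.reserve(lower);

    while let Some(element) = iterator.next() {
        if vec.len() == vec.capacity() {
            // The first guess was too low; ask again now that the iterator
            // has advanced and may report a better lower bound.
            let (lower, _) = iterator.size_hint();
            vec.reserve(lower.saturating_add(1));
        }
        vec.push(element);
    }
}
```

The real implementation would use the unchecked pointer writes shown earlier rather than `push`.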
@Marwes (Contributor, Author) commented Feb 25, 2020

This could use another perf run with the latest changes. I suspect it won't be enough to remedy the compile time regressions though.

@joelpalmer

Ping from Triage: Any updates? @Marwes?

@Marwes (Contributor, Author) commented Mar 9, 2020

> This could use another perf run with the latest changes. I suspect it won't be enough to remedy the compile time regressions though.

#68046 (comment)

If the perf run still shows an unacceptable regression, I don't see a way to land this right now. Perhaps with MIR optimizations acting on the generic functions it could be optimized enough before getting to LLVM to make this possible, but otherwise the overhead is fairly fundamental to the change.

@Dylan-DPC-zz

@bors try @rust-timer queue

@rust-timer (Collaborator)

Awaiting bors try build completion

@bors (Contributor) commented Mar 13, 2020

⌛ Trying commit e41f55e with merge f2ee309252c8b5a6db0a206d760db8047cfb69eb...

@bors (Contributor) commented Mar 13, 2020

💥 Test timed out

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 13, 2020
@rust-highfive (Collaborator)

Your PR failed (pretty log, raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.


I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @rust-lang/infra. (Feature Requests)

@Dylan-DPC-zz

@bors try @rust-timer queue

@rust-timer (Collaborator)

Awaiting bors try build completion

@bors (Contributor) commented Mar 21, 2020

⌛ Trying commit e41f55e with merge 18cc66a1f5cf20ca7c7e28fbe43469d13c436b05...

@bors (Contributor) commented Mar 21, 2020

☀️ Try build successful - checks-azure
Build commit: 18cc66a1f5cf20ca7c7e28fbe43469d13c436b05 (18cc66a1f5cf20ca7c7e28fbe43469d13c436b05)

@rust-timer (Collaborator)

Queued 18cc66a1f5cf20ca7c7e28fbe43469d13c436b05 with parent 38114ff, future comparison URL.

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 29, 2020
@JohnCSimon (Member)

Ping from Triage: Any updates @Marwes? Thank you.

@Marwes (Contributor, Author) commented Apr 6, 2020

I can't see a way to resolve the compiler performance regressions from this at this time. Perhaps with more optimizations before monomorphization the overhead could be reduced enough (so that LLVM wouldn't need to instantiate and inline this added complexity for each and every iterator).

Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author.

Successfully merging this pull request may close these issues.

chain() make collect very slow