Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove manual unrolling from slice::Iter(Mut)::try_fold #64600

Merged
merged 2 commits into from Sep 30, 2019

Conversation

@scottmcm
Copy link
Member

commented Sep 19, 2019

While this definitely helps sometimes (particularly for trivial closures), it's also a pessimization sometimes, so it's better to leave this to (hypothetical) future LLVM improvements instead of forcing this on everyone.

I think it's better for the advice to be that sometimes you need to unroll manually than you sometimes need to not-unroll manually (like #64545).


Final perf comparison: #64600 (comment)


For context see #64572 (comment)

@scottmcm

This comment has been minimized.

Copy link
Member Author

commented Sep 19, 2019

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

Copy link

commented Sep 19, 2019

Awaiting bors try build completion

@bors

This comment has been minimized.

Copy link
Contributor

commented Sep 19, 2019

⌛️ Trying commit 38d8c8d with merge dd115ba...

bors added a commit that referenced this pull request Sep 19, 2019
[DO NOT MERGE] Experiment with removing unrolling from slice::Iter::try_fold

For context see #64572 (comment)

r? @scottmcm
@bors

This comment has been minimized.

Copy link
Contributor

commented Sep 19, 2019

☀️ Try build successful - checks-azure
Build commit: dd115ba

@rust-timer

This comment has been minimized.

Copy link

commented Sep 19, 2019

Queued dd115ba with parent eceec57, future comparison URL.

@rust-timer

This comment has been minimized.

Copy link

commented Sep 19, 2019

Finished benchmarking try commit dd115ba, comparison URL.

@nnethercote

This comment has been minimized.

Copy link
Contributor

commented Sep 19, 2019

This change gets roughly half the improvements that the commit in #64572 gets.

@bluss

This comment has been minimized.

Copy link
Member

commented Sep 20, 2019

I think that unrolling would eventually have to go and be removed from libcore, I was just hoping the compiler would catch up and be able to unroll loops with multiple exits itself. Unrolling should ideally belong to the compiler, so it can make the decision about when to duplicate code. I haven't revisited that, so for all I know llvm could have learned this by now. [Edit: checked -- rustc nightly does not unroll such things by itself right now either. I wonder if this multiple exit improvement means that things are on the way..? No clue]

This seems like a situation where it's easy to find both good and bad cases. Things like scans through bytes for parsing with a simple predicate benefit a lot from unrolling in all/find etc.

While this definitely helps sometimes (particularly for trivial closures), it's also a pessimization sometimes, so it's better to leave this to (hypothetical) future LLVM improvements instead of forcing this on everyone.

I think it's better for the advice to be that sometimes you need to unroll manually than you sometimes need to not-unroll manually (like #64545).
@scottmcm scottmcm force-pushed the scottmcm:no-slice-tryfold-unroll branch from 38d8c8d to 92e91f7 Sep 22, 2019
@scottmcm scottmcm changed the title [DO NOT MERGE] Experiment with removing unrolling from slice::Iter::try_fold Remove manual unrolling from slice::Iter(Mut)::try_fold Sep 22, 2019
@scottmcm

This comment has been minimized.

Copy link
Member Author

commented Sep 22, 2019

Ok, it seems like the inclination is that we should do this so I've turned this into a "real" PR.

I do think it's better for the advice to be that sometimes you need to unroll manually than you sometimes need to not-unroll manually, though I certainly with LLVM was better at these cases.

I'm not sure who should approve this -- does it need libs sign-off?

@timvermeulen

This comment has been minimized.

Copy link
Contributor

commented Sep 22, 2019

Couldn't you just remove the try_fold and fold overrides entirely? They seem to now mirror the default implementations. And I think there's manual unrolling going on in the DoubleEndedIterator impl, too.

@scottmcm scottmcm force-pushed the scottmcm:no-slice-tryfold-unroll branch from 572de05 to 2f7b32a Sep 23, 2019
@rust-highfive

This comment was marked as resolved.

Copy link
Collaborator

commented Sep 23, 2019

The job mingw-check of your PR failed (pretty log, raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
2019-09-23T04:17:57.9332535Z ##[command]git remote add origin https://github.com/rust-lang/rust
2019-09-23T04:17:57.9508127Z ##[command]git config gc.auto 0
2019-09-23T04:17:57.9579611Z ##[command]git config --get-all http.https://github.com/rust-lang/rust.extraheader
2019-09-23T04:17:58.6557923Z ##[command]git config --get-all http.proxy
2019-09-23T04:17:58.6564483Z ##[command]git -c http.extraheader="AUTHORIZATION: basic ***" fetch --force --tags --prune --progress --no-recurse-submodules --depth=2 origin +refs/heads/*:refs/remotes/origin/* +refs/pull/64600/merge:refs/remotes/pull/64600/merge
---
2019-09-23T04:24:01.4772415Z     Checking core v0.0.0 (/checkout/src/libcore)
2019-09-23T04:24:04.3260088Z error: unused import: `Try`
2019-09-23T04:24:04.3260685Z   --> src/libcore/slice/mod.rs:31:25
2019-09-23T04:24:04.3261338Z    |
2019-09-23T04:24:04.3261948Z 31 | use crate::ops::{FnMut, Try, self};
2019-09-23T04:24:04.3262524Z    |
2019-09-23T04:24:04.3262802Z    = note: `-D unused-imports` implied by `-D warnings`
2019-09-23T04:24:04.3262889Z 
2019-09-23T04:24:10.3420841Z    Compiling libc v0.2.62
---
2019-09-23T04:24:12.9117615Z == clock drift check ==
2019-09-23T04:24:12.9132669Z   local time: Mon Sep 23 04:24:12 UTC 2019
2019-09-23T04:24:12.9984868Z   network time: Mon, 23 Sep 2019 04:24:12 GMT
2019-09-23T04:24:12.9990300Z == end clock drift check ==
2019-09-23T04:24:26.2030889Z ##[error]Bash exited with code '1'.
2019-09-23T04:24:26.2065868Z ##[section]Starting: Checkout
2019-09-23T04:24:26.2067691Z ==============================================================================
2019-09-23T04:24:26.2067734Z Task         : Get sources
2019-09-23T04:24:26.2067789Z Description  : Get sources from a repository. Supports Git, TfsVC, and SVN repositories.

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@scottmcm scottmcm force-pushed the scottmcm:no-slice-tryfold-unroll branch from 2f7b32a to 6ac64ab Sep 24, 2019
@scottmcm

This comment has been minimized.

Copy link
Member Author

commented Sep 24, 2019

@bors try @rust-timer queue

(I'm curious to see the new self-profile results, and want to make sure that removing the overrides still keeps the gain here -- it might mean more work to eliminate !s.)

@rust-timer

This comment has been minimized.

Copy link

commented Sep 24, 2019

Awaiting bors try build completion

@bors

This comment has been minimized.

Copy link
Contributor

commented Sep 24, 2019

⌛️ Trying commit 6ac64ab with merge 8be3622...

bors added a commit that referenced this pull request Sep 24, 2019
Remove manual unrolling from slice::Iter(Mut)::try_fold

While this definitely helps sometimes (particularly for trivial closures), it's also a pessimization sometimes, so it's better to leave this to (hypothetical) future LLVM improvements instead of forcing this on everyone.

I think it's better for the advice to be that sometimes you need to unroll manually than you sometimes need to not-unroll manually (like #64545).

---

For context see #64572 (comment)
@bors

This comment has been minimized.

Copy link
Contributor

commented Sep 24, 2019

☀️ Try build successful - checks-azure
Build commit: 8be3622 (8be3622ad74755484fd9b9e401d0ee96837be244)

@rust-timer

This comment has been minimized.

Copy link

commented Sep 24, 2019

Queued 8be3622 with parent 66bf391, future comparison URL.

@rust-timer

This comment has been minimized.

Copy link

commented Sep 24, 2019

Finished benchmarking try commit 8be3622, comparison URL.

@bjorn3

This comment has been minimized.

Copy link
Contributor

commented Sep 24, 2019

script-servo-opt regressed, the rest is positive or neutral.

@scottmcm

This comment has been minimized.

Copy link
Member Author

commented Sep 24, 2019

Oh, interesting. script-servo-opt is new, right? I can't find it in the previous perf comparison...

@scottmcm

This comment has been minimized.

Copy link
Member Author

commented Sep 24, 2019

New perf link (thank you, Mark-Simulacrum!) with self-profile results for both sides:
https://perf.rust-lang.org/compare.html?start=b4ba2a3953ea9ec28f01c314be315d46673bd782&end=8be3622ad74755484fd9b9e401d0ee96837be244

It looks like nearly all of the speedup for clap-rs-debug run clean is ~2.5s in LLVM_emit_obj (and ~0.3s in LLVM_make_bitcode). Any idea why this would be so much better there? I'd hypothesized that the closure might be getting inlined 5x even in debug, but based on a quick experiment (https://rust.godbolt.org/z/HmYVvd) that doesn't seem to happen. Since this PR still ends up using try_fold (unlike #64572), it doesn't feel like it should have had a major enough impact on the amount of code that's getting sent to LLVM to account for the size of the win.

And for script-servo-opt run baseline incremental it's LLVM_emit_obj that increased by ~13.5s with this change. Curious.

@bluss

This comment has been minimized.

Copy link
Member

commented Sep 27, 2019

This change looks good to me, but I guess we are waiting for some discussion. I'll try to ask @Geal about nom performance and unrolling.

You know how much I would like to say we can just reimplement important stuff, like an unrolling slice iterator, outside libcore, but the libcore version is still tied up with unstable features like assume — unknown what impact they have as of current rustc version.

@Geal

This comment has been minimized.

Copy link
Contributor

commented Sep 27, 2019

@bluss no issue for me, nom does not use try_fold nor try_rfold internally

@bluss

This comment has been minimized.

Copy link
Member

commented Sep 27, 2019

It looks like it's not impossible for rustc to unroll an "Iterator::all" like loop. It just can't do it in the simplest forms that those loops take, for example not in for elt in v { .. }

I have some old alternative slice iterator code, and it can be automatically unrolled by the compiler. The code is here (github link to iter.rs) and there are benchmarks that show the unrolling on that specific branch. I haven't managed to reduce the loop that will unroll, though — maybe it's specific to the code in the benchmark? The compiler's unroll disappears if the special case .all() method is removed from that iterator, even though it only adds indirections (similar to using try_fold).

andjo403 added a commit to andjo403/rust that referenced this pull request Sep 28, 2019
changes from rust-lang#64600 and part of rust-lang#64572.
think that if all other functions shall be implemented with try_fold nth shall also be it.
bors added a commit that referenced this pull request Sep 28, 2019
[WIP] use try_fold instead of try_for_each to reduce compile time

as it was stated in #64572 that the biggest gain was due to less code was generated I tried to reduce the number of functions to inline by using try_fold direct instead of calling try_for_each that calls try_fold.

as there is some gains with using the try_fold function this is maybe a way forward.
when I tried to compile the clap-rs benchmark I get times gains only some % from #64572

there is more function that use eg. fold that calls try_fold that also can be changed but the question is how mush "duplication" that is tolerated in std to give faster compile times

this PR contains the changes from #64600 to make the compare fair with #64572

can someone start a perf run?

cc @nnethercote @scottmcm @bluss
r? @ghost
@scottmcm

This comment has been minimized.

Copy link
Member Author

commented Sep 29, 2019

@bluss do you still have r+ here, or do I need to find a different reviewer for this?

@bluss

This comment has been minimized.

Copy link
Member

commented Sep 30, 2019

@scottmcm I guess I do, but with the nominated tag I thought we were waiting for the libs team

@bluss

This comment has been minimized.

Copy link
Member

commented Sep 30, 2019

@bors r+ rollup=never

@bors

This comment has been minimized.

Copy link
Contributor

commented Sep 30, 2019

📌 Commit 6ac64ab has been approved by bluss

@bors

This comment has been minimized.

Copy link
Contributor

commented Sep 30, 2019

⌛️ Testing commit 6ac64ab with merge e0436d9...

bors added a commit that referenced this pull request Sep 30, 2019
Remove manual unrolling from slice::Iter(Mut)::try_fold

While this definitely helps sometimes (particularly for trivial closures), it's also a pessimization sometimes, so it's better to leave this to (hypothetical) future LLVM improvements instead of forcing this on everyone.

I think it's better for the advice to be that sometimes you need to unroll manually than you sometimes need to not-unroll manually (like #64545).

---

For context see #64572 (comment)
@bors

This comment has been minimized.

Copy link
Contributor

commented Sep 30, 2019

☀️ Test successful - checks-azure
Approved by: bluss
Pushing e0436d9 to master...

@bors bors added the merged-by-bors label Sep 30, 2019
@bors bors merged commit 6ac64ab into rust-lang:master Sep 30, 2019
5 checks passed
5 checks passed
homu Test successful
Details
pr Build #20190924.6 succeeded
Details
pr (Linux mingw-check) Linux mingw-check succeeded
Details
pr (Linux x86_64-gnu-llvm-6.0) Linux x86_64-gnu-llvm-6.0 succeeded
Details
pr (LinuxTools) LinuxTools succeeded
Details
bors added a commit that referenced this pull request Oct 1, 2019
[WIP] use try_fold instead of try_for_each to reduce compile time

as it was stated in #64572 that the biggest gain was due to less code was generated I tried to reduce the number of functions to inline by using try_fold direct instead of calling try_for_each that calls try_fold.

as there is some gains with using the try_fold function this is maybe a way forward.
when I tried to compile the clap-rs benchmark I get times gains only some % from #64572

there is more function that use eg. fold that calls try_fold that also can be changed but the question is how mush "duplication" that is tolerated in std to give faster compile times

this PR contains the changes from #64600 to make the compare fair with #64572

can someone start a perf run?

cc @nnethercote @scottmcm @bluss
r? @ghost
@bluss

This comment has been minimized.

Copy link
Member

commented Oct 1, 2019

@Geal is now in the rustc 1.40.0-nightly (22bc9e1 2019-09-30) nightly. I'd be surprised if it didn't impact benches in some way

@scottmcm scottmcm deleted the scottmcm:no-slice-tryfold-unroll branch Oct 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
9 participants
You can’t perform that action at this time.