Remove manual unrolling from slice::Iter(Mut)::try_fold #64600
@bors try @rust-timer queue |
|
Awaiting bors try build completion |
[DO NOT MERGE] Experiment with removing unrolling from slice::Iter::try_fold For context see #64572 (comment) r? @scottmcm
|
Queued dd115ba with parent eceec57, future comparison URL. |
|
Finished benchmarking try commit dd115ba, comparison URL. |
|
This change gets roughly half the improvements that the commit in #64572 gets. |
|
I think that unrolling would eventually have to go and be removed from libcore; I was just hoping the compiler would catch up and be able to unroll loops with multiple exits itself. Unrolling should ideally belong to the compiler, so it can make the decision about when to duplicate code. I haven't revisited that, so for all I know LLVM could have learned this by now. [Edit: checked -- rustc nightly does not unroll such things by itself right now either. I wonder if this multiple-exit improvement means that things are on the way..? No clue] This seems like a situation where it's easy to find both good and bad cases. Things like scans through bytes for parsing with a simple predicate benefit a lot from unrolling in all/find etc. |
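To make the "loop with multiple exits" case concrete, here is a minimal sketch of a byte scan written both ways. The function names `position_plain` and `position_unrolled` are hypothetical and the unroll factor of 4 is an assumption for illustration; this is not the libcore code.

```rust
/// Plain scan: one predicate call and one possible early exit per iteration.
pub fn position_plain<P: Fn(u8) -> bool>(haystack: &[u8], pred: P) -> Option<usize> {
    for (i, &b) in haystack.iter().enumerate() {
        if pred(b) {
            return Some(i);
        }
    }
    None
}

/// Hand-unrolled scan: four predicate calls per loop iteration, each with
/// its own early exit, followed by a tail loop for the last 0..=3 bytes.
pub fn position_unrolled<P: Fn(u8) -> bool>(haystack: &[u8], pred: P) -> Option<usize> {
    let mut i = 0;
    // `i` never exceeds `haystack.len()`, so the subtraction cannot underflow.
    while haystack.len() - i >= 4 {
        if pred(haystack[i]) {
            return Some(i);
        }
        if pred(haystack[i + 1]) {
            return Some(i + 1);
        }
        if pred(haystack[i + 2]) {
            return Some(i + 2);
        }
        if pred(haystack[i + 3]) {
            return Some(i + 3);
        }
        i += 4;
    }
    while i < haystack.len() {
        if pred(haystack[i]) {
            return Some(i);
        }
        i += 1;
    }
    None
}
```

Both functions return the same result for any input; whether the unrolled form is actually faster depends on the predicate and the optimizer, and can only be settled by measuring.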
While this definitely helps sometimes (particularly for trivial closures), it's also a pessimization sometimes, so it's better to leave this to (hypothetical) future LLVM improvements instead of forcing this on everyone. I think it's better for the advice to be that sometimes you need to unroll manually than you sometimes need to not-unroll manually (like #64545).
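As a rough illustration of the trade-off described above, the sketch below contrasts a plain `try_fold`-style loop with a version unrolled by hand. It is a simplification under stated assumptions: it uses `Option` as the short-circuiting type because the general `Try` trait is unstable, it works on a plain slice rather than `slice::Iter`, and the names `try_fold_plain` and `try_fold_unrolled` are hypothetical.

```rust
/// Plain version: the closure is called from exactly one call site.
pub fn try_fold_plain<T, B>(
    slice: &[T],
    init: B,
    mut f: impl FnMut(B, &T) -> Option<B>,
) -> Option<B> {
    let mut acc = init;
    for x in slice {
        acc = f(acc, x)?;
    }
    Some(acc)
}

/// Hand-unrolled version: the closure has five call sites (four in the
/// main loop, one in the tail loop), so a large inlined closure body may
/// be duplicated that many times.
pub fn try_fold_unrolled<T, B>(
    slice: &[T],
    init: B,
    mut f: impl FnMut(B, &T) -> Option<B>,
) -> Option<B> {
    let mut acc = init;
    let mut rest = slice;
    while rest.len() >= 4 {
        acc = f(acc, &rest[0])?;
        acc = f(acc, &rest[1])?;
        acc = f(acc, &rest[2])?;
        acc = f(acc, &rest[3])?;
        rest = &rest[4..];
    }
    for x in rest {
        acc = f(acc, x)?;
    }
    Some(acc)
}
```

With a trivial closure the extra call sites cost little. With a larger closure that gets inlined at each site, the generated code can grow noticeably, which is one plausible way the unrolled form turns into a pessimization.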
38d8c8d to 92e91f7
|
Ok, it seems like the inclination is that we should do this, so I've turned this into a "real" PR. I do think it's better for the advice to be that sometimes you need to unroll manually than that you sometimes need to not-unroll manually, though I certainly wish LLVM were better at these cases. I'm not sure who should approve this -- does it need |
|
Couldn't you just remove the |
572de05 to 2f7b32a
2f7b32a to 6ac64ab
|
@bors try @rust-timer queue (I'm curious to see the new self-profile results, and want to make sure that removing the overrides still keeps the gain here -- it might mean more work to eliminate |
|
Awaiting bors try build completion |
Queued 8be3622 with parent 66bf391, future comparison URL. |
|
Finished benchmarking try commit 8be3622, comparison URL. |
|
Oh, interesting. |
|
New perf link (thank you, Mark-Simulacrum!) with self-profile results for both sides: It looks like nearly all of the speedup for And for |
|
This change looks good to me, but I guess we are waiting for some discussion. I'll try to ask @Geal about nom performance and unrolling. You know how much I would like to say we can just reimplement important stuff, like an unrolling slice iterator, outside libcore, but the libcore version is still tied up with unstable features like |
|
@bluss no issue for me, nom does not use |
|
It looks like it's not impossible for rustc to unroll an "Iterator::all" like loop. It just can't do it in the simplest forms that those loops take, for example not in I have some old alternative slice iterator code, and it can be automatically unrolled by the compiler. The code is here (github link to iter.rs) and there are benchmarks that show the unrolling on that specific branch. I haven't managed to reduce the loop that will unroll, though — maybe it's specific to the code in the benchmark? The compiler's unroll disappears if the special case |
Changes from rust-lang#64600 and part of rust-lang#64572. I think that if all the other functions are implemented with try_fold, nth should be as well.
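For readers wondering what an `nth` built on `try_fold` could look like, here is one possible sketch. It is an assumption for illustration, not the code from the referenced commit, and the name `nth_via_try_fold` is hypothetical. It uses `Result` so that the early exit (`Err`) carries the element that was found.

```rust
/// Returns the `n`-th element (zero-based) of the iterator, consuming the
/// elements before it. `Ok(remaining)` means "keep going", `Err(item)`
/// means "found it, stop here".
pub fn nth_via_try_fold<I: Iterator>(iter: &mut I, n: usize) -> Option<I::Item> {
    iter.try_fold(n, |remaining, item| {
        if remaining == 0 {
            Err(item)
        } else {
            Ok(remaining - 1)
        }
    })
    .err()
}
```

If the iterator runs out before the counter reaches zero, `try_fold` returns `Ok(_)` and `.err()` turns that into `None`, matching the behavior of `Iterator::nth`.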
[WIP] use try_fold instead of try_for_each to reduce compile time. As it was stated in #64572 that the biggest gain came from less code being generated, I tried to reduce the number of functions to inline by using try_fold directly instead of calling try_for_each, which calls try_fold. As there are some gains from using try_fold directly, this may be a way forward. When I compiled the clap-rs benchmark, I got only some % of the time gains from #64572. There are more functions, e.g. fold, that call try_fold and could also be changed, but the question is how much "duplication" is tolerated in std to give faster compile times. This PR contains the changes from #64600 to make the comparison with #64572 fair. Can someone start a perf run? cc @nnethercote @scottmcm @bluss r? @ghost
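The layering described here can be sketched on stable Rust with `Option<()>` as the short-circuiting type. The two functions below are hypothetical stand-ins for an `all`-like adapter, not the standard library source: one goes through `try_for_each` (which in turn forwards to `try_fold`), the other calls `try_fold` directly and so has one generic layer less to instantiate and inline.

```rust
/// `all` layered on `try_for_each`.
pub fn all_via_try_for_each<I: Iterator>(
    iter: &mut I,
    mut pred: impl FnMut(I::Item) -> bool,
) -> bool {
    iter.try_for_each(|x| if pred(x) { Some(()) } else { None })
        .is_some()
}

/// `all` written directly against `try_fold`, skipping the intermediate
/// `try_for_each` layer.
pub fn all_via_try_fold<I: Iterator>(
    iter: &mut I,
    mut pred: impl FnMut(I::Item) -> bool,
) -> bool {
    iter.try_fold((), |(), x| if pred(x) { Some(()) } else { None })
        .is_some()
}
```

Both versions stop at the first element that fails the predicate and leave the rest of the iterator untouched. Any compile-time difference between them would have to be confirmed with a perf run, as requested above.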
|
@bluss do you still have r+ here, or do I need to find a different reviewer for this? |
|
@scottmcm I guess I do, but with the nominated tag I thought we were waiting for the libs team |
|
@bors r+ rollup=never |
Final perf comparison: #64600 (comment)
For context see #64572 (comment)