Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
Sign upRangeInclusive iteration performance improvement. #57378
Conversation
rust-highfive
assigned
rkruppe
Jan 6, 2019
This comment has been minimized.
This comment has been minimized.
|
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @rkruppe (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
rust-highfive
added
the
S-waiting-on-review
label
Jan 6, 2019
This comment has been minimized.
This comment has been minimized.
|
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
This comment has been minimized.
This comment has been minimized.
matthieu-m
commented
Jan 6, 2019
•
|
The failure seems legitimate, LLVM apparently fails to constant-fold the loop over a I've reproduced the issue on the playground (https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=c6ad080bf6386dab551c2bed1ad6dbfb); so I'll have to fiddle with this to understand what is blocking LLVM. |
This comment has been minimized.
This comment has been minimized.
|
This is somewhat surprising, given that fn foo3c(n: u64) -> u64 {
let mut count = 0;
(0..n).for_each(|_| {
(0..n).chain(::std::iter::once(n)).rev().for_each(|j| {
count += j;
})
});
count
}constant-folds just fine |
This comment has been minimized.
This comment has been minimized.
matthieu-m
commented
Jan 8, 2019
|
Alright, let's go nuts: https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=c23c205c5f6dcdfeb958c2a6cf83ecdb .
At this point, I'm really wondering what trips up LLVM. Note: use of explicit @rkruppe : I am thinking that this test, as written, is bad. Whether LLVM const-fold or not seems to have no relation to the "tightness" of the generated LLVM IR, or its overall performance. It seems that we would be better serve by a check which actually verifies the number of conditional jump involved in the inner loop, rather than using const-folding as a proxy for performance. |
scottmcm
referenced this pull request
Jan 9, 2019
Closed
Override <RangeInclusive as Iterator>::try_(r)fold #56563
This comment has been minimized.
This comment has been minimized.
|
Cross-reference: #56563 |
This comment has been minimized.
This comment has been minimized.
matthieu-m
commented
Jan 12, 2019
•
|
Performance discussion should be accompanied by benchmarks, so I put together a number of benchmarks and used criterion to evaluate the relative performance of:
And the results are the following:
(see gist for details of each benchmark, I reported only the black-hole cases: https://gist.github.com/matthieu-m/df8dcfed3e23ca83ea5abf9e7b3ca4d3) This yields two conclusions:
Also, it is notable that LLVM's closed formula transformation kicks in for I guess either understanding or fixing this hole is the key to getting an implementation of inclusive ranges which both yields good assembly and let LLVM perform the closed formula transformation. In the absence of such understanding/fixing, I would tend to prefer better straightforward assembly at the expense of the closed formula transformation: it is easier for the user to substitute a closed formula rather than re-implement an inclusive range, and I am doubtful that a closed formula exists in many cases. I also have to revise my statement about performance; while on the Add Mul example, inclusive ranges perform as good as exclusive one, there is still some overhead remaining in the Pythagorean Triples case. It may simply be the slight overhead of the inner loop magnified by the number of times it is executed, of course, and this PR still significantly improves performance: from x1.91 to x1.27 slow-down. Does anyone have any idea as to what could prevent LLVM from effecting the closed formula transformation? |
This comment has been minimized.
This comment has been minimized.
matthieu-m
commented
Jan 12, 2019
|
@kennytm As the author of the current version of |
This comment has been minimized.
This comment has been minimized.
|
@matthieu-m I haven't investigated what causes LLVM to const-fold a loop, it just happened that the test works after tweaking the representation and putting |
This comment has been minimized.
This comment has been minimized.
matthieu-m
commented
Jan 13, 2019
•
|
Updated performance number after specializing
This reinforces the conclusion that the A custom Unfortunately, it does nothing to improve the performance of "simple" loops using external iteration, where LLVM just fails to perform Loop Splitting and subsequently to transform the loop into a closed form. My experiments with Loop Splitting have found it extremely finicky, with very similar cases falling on either side of the divide. This is pretty frustrating 1 The performance penalty observed is specific to the absence of Loop Splitting by LLVM; however in interior iteration we can manually split the loop between the loop itself and either a header or trailer, thereby gaining all our due performance without relying on getting lucky during optimizations. 2 As a more general note, it also means that (a) it is likely beneficial to implement a specialized |
This comment has been minimized.
This comment has been minimized.
|
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
This comment has been minimized.
This comment has been minimized.
|
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
This comment has been minimized.
This comment has been minimized.
Many were -- including |
This comment has been minimized.
This comment has been minimized.
|
Sorry, it doesn't seem like I'll be able to give this PR proper attention in the near future. Please assign someone else. |
This comment has been minimized.
This comment has been minimized.
rust-highfive
assigned
Kimundi
and unassigned
rkruppe
Jan 22, 2019
matthieu-m
force-pushed the
matthieu-m:range_incl_perf
branch
from
80aa9e4
to
eb5b096
Feb 3, 2019
matthieu-m
closed this
Feb 3, 2019
This comment has been minimized.
This comment has been minimized.
matthieu-m
commented
Feb 3, 2019
|
Unfortunately, I have yet to find a way to get LLVM to play nice with external iteration. I'll open another PR to improve internal iteration; as force-push corrupted this one, it seems. |
matthieu-m commentedJan 6, 2019
•
edited
The current implementation of Iterator::{next, next_back} for
RangeInclusive leads to sub-optimal performance of loops as LLVM is not
capable of splitting the loop into a first-pass initialization
(computing is_empty) followed by the actual loop. This results in each
iteration performing two conditional jumps, which not only impacts the
performance of unoptimized loops, but also inhibits unrolling and
vectorization.
The proposed implementation switches things around, performing extra
work only on the last iteration of the loop. This results in even
unoptimized loops performing a single conditional jump in all but the
last iteration, matching Range's performance, as well as letting LLVM
unroll and vectorize when it would do so for Range's loop.
As a result, it should make iterating on inclusive ranges as fast as
iterating on exclusive ones; avoiding a papercut performance pitfall.
Unfortunately, it also appears to foil LLVM Loop Splitting optimization.