New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RangeInclusive internal iteration performance improvement. #58122

Open
wants to merge 2 commits into
base: master
from

Conversation

Projects
None yet
6 participants
@matthieu-m
Copy link

matthieu-m commented Feb 3, 2019

Specialize Iterator::try_fold and DoubleEndedIterator::try_rfold to improve code generation in all internal iteration scenarios.

This changes brings the performance of internal iteration with RangeInclusive on par with the performance of iteration with Range:

  • Single conditional jump in hot loop,
  • Unrolling and vectorization,
  • And even Closed Form substitution.

Unfortunately, it only applies to internal iteration. Despite various attempts at stream-lining the implementation of next and next_back, LLVM has stubbornly refused to optimize external iteration appropriately, leaving me with a choice between:

  • The current implementation, for which Closed Form substitution is performed, but which uses 2 conditional jumps in the hot loop when optimization fail.
  • An implementation using a is_done boolean, which uses 1 conditional jump in the hot loop when optimization fail, allowing unrolling and vectorization, but for which Closed Form substitution fails.

In the absence of any conclusive evidence as to which usecase matters most, and with no assurance that the lack of Closed Form substitution is not indicative of other optimizations being foiled, there is no way
to pick one implementation over the other, and thus I defer to the statu quo as far as next and next_back are concerned.

Matthieu M
RangeInclusive internal iteration performance improvement.
Specialize Iterator::try_fold and DoubleEndedIterator::try_rfold to
improve code generation in all internal iteration scenarios.

This changes brings the performance of internal iteration with
RangeInclusive on par with the performance of iteration with Range:

 - Single conditional jump in hot loop,
 - Unrolling and vectorization,
 - And even Closed Form substitution.

Unfortunately, it only applies to internal iteration. Despite various
attempts at stream-lining the implementation of next and next_back,
LLVM has stubbornly refused to optimize external iteration
appropriately, leaving me with a choice between:

 - The current implementation, for which Closed Form substitution is
   performed, but which uses 2 conditional jumps in the hot loop when
   optimization fail.
 - An implementation using a "is_done" boolean, which uses 1
   conditional jump in the hot loop when optimization fail, allowing
   unrolling and vectorization, but for which Closed Form substitution
   fails.

In the absence of any conclusive evidence as to which usecase matters
most, and with no assurance that the lack of Closed Form substitution
is not indicative of other optimizations being foiled, there is no way
to pick one implementation over the other, and thus I defer to the
statu quo as far as next and next_back are concerned.
@rust-highfive

This comment has been minimized.

Copy link
Collaborator

rust-highfive commented Feb 3, 2019

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @dtolnay (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@matthieu-m

This comment has been minimized.

Copy link
Author

matthieu-m commented Feb 3, 2019

Take 2 on #57378 .

Show resolved Hide resolved src/libcore/iter/range.rs
@matthieu-m

This comment has been minimized.

Copy link
Author

matthieu-m commented Feb 10, 2019

@dtolnay : I've addressed the concerns raised by scottmcm, so this should be ready for review.

@dtolnay
Copy link
Member

dtolnay left a comment

The implementation looks good to me, but since this is intended to be a performance improvement only, do you have a benchmark that other people could run to confirm the improvement?

@matthieu-m

This comment has been minimized.

Copy link
Author

matthieu-m commented Feb 15, 2019

During development, I used the following benchmark https://gist.github.com/matthieu-m/3a9ab9afc4eee80565c8d95f94db6fa4 where FixedRangeInclusive is a copy/paste of the current version, with the performance improvements of this PR.

On my machine, I get:

Benchmark Exclusive Chain Current Proposed (vs Current)
sum 1.1315 ns 4.8270 ns 972.33 ps 1.5776 ns (+62%)
triangle foreach 1.1667 ns 1.7881 ns 978.75 ps 1.5791 ns (+61%)
triangle loop 1.1299 ns 1.1643 ms 968.94 ps 975.43 ps (-0.34%)
addmul foreach 1.1611 ms 1.1633 ms 1.1799 ms 1.1644 ms (-1.3%)
addmul loop 1.1689 ms 1.1785 ms 1.1658 ms 1.1638 ms (-0.17%)
triples 960.00 us 889.76 us 1.8435 ms 968.05 us (-47%)

Where we can see that:

  • The sum and triangle foreach benchmarks are all transformed into a closed form, but not the same one.
  • The triangle loop and addmul loop benchmarks use the same next() implementation, so are mostly interesting to see the small imprecision of such measurements.
  • The addmul foreach benchmark is simple enough that LLVM was already optimizing it well, so the small improvement is close to the noise.
  • The triples benchmark, which originally prompted this work, shows a x2 improvement.

Overall, the new RangeInclusive implementation performs on par with the Range one (modulo closed form), as the performance cliff on triples is smoothed out.

@dtolnay
Copy link
Member

dtolnay left a comment

Thanks!

@dtolnay

This comment has been minimized.

Copy link
Member

dtolnay commented Feb 15, 2019

@bors r+

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Feb 15, 2019

📌 Commit 4fed67f has been approved by dtolnay

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Feb 15, 2019

⌛️ Testing commit 4fed67f with merge 74beb94...

bors added a commit that referenced this pull request Feb 15, 2019

Auto merge of #58122 - matthieu-m:range_incl_perf, r=dtolnay
RangeInclusive internal iteration performance improvement.

Specialize `Iterator::try_fold` and `DoubleEndedIterator::try_rfold` to improve code generation in all internal iteration scenarios.

This changes brings the performance of internal iteration with `RangeInclusive` on par with the performance of iteration with `Range`:

 - Single conditional jump in hot loop,
 - Unrolling and vectorization,
 - And even Closed Form substitution.

Unfortunately, it only applies to internal iteration. Despite various attempts at stream-lining the implementation of `next` and `next_back`, LLVM has stubbornly refused to optimize external iteration appropriately, leaving me with a choice between:

 - The current implementation, for which Closed Form substitution is performed, but which uses 2 conditional jumps in the hot loop when optimization fail.
 - An implementation using a `is_done` boolean, which uses 1 conditional jump in the hot loop when optimization fail, allowing unrolling and vectorization, but for which Closed Form substitution fails.

In the absence of any conclusive evidence as to which usecase matters most, and with no assurance that the lack of Closed Form substitution is not indicative of other optimizations being foiled, there is no way
to pick one implementation over the other, and thus I defer to the statu quo as far as `next` and `next_back` are concerned.
@bors

This comment has been minimized.

Copy link
Contributor

bors commented Feb 15, 2019

💔 Test failed - checks-travis

@rust-highfive

This comment has been minimized.

Copy link
Collaborator

rust-highfive commented Feb 15, 2019

Your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@pietroalbini

This comment has been minimized.

Copy link
Member

pietroalbini commented Feb 15, 2019

@bors retry -- apparently we can't clone the repo anymore on macOS

Centril added a commit to Centril/rust that referenced this pull request Feb 20, 2019

Rollup merge of rust-lang#58122 - matthieu-m:range_incl_perf, r=dtolnay
RangeInclusive internal iteration performance improvement.

Specialize `Iterator::try_fold` and `DoubleEndedIterator::try_rfold` to improve code generation in all internal iteration scenarios.

This changes brings the performance of internal iteration with `RangeInclusive` on par with the performance of iteration with `Range`:

 - Single conditional jump in hot loop,
 - Unrolling and vectorization,
 - And even Closed Form substitution.

Unfortunately, it only applies to internal iteration. Despite various attempts at stream-lining the implementation of `next` and `next_back`, LLVM has stubbornly refused to optimize external iteration appropriately, leaving me with a choice between:

 - The current implementation, for which Closed Form substitution is performed, but which uses 2 conditional jumps in the hot loop when optimization fail.
 - An implementation using a `is_done` boolean, which uses 1 conditional jump in the hot loop when optimization fail, allowing unrolling and vectorization, but for which Closed Form substitution fails.

In the absence of any conclusive evidence as to which usecase matters most, and with no assurance that the lack of Closed Form substitution is not indicative of other optimizations being foiled, there is no way
to pick one implementation over the other, and thus I defer to the statu quo as far as `next` and `next_back` are concerned.

bors added a commit that referenced this pull request Feb 20, 2019

Auto merge of #58594 - Centril:rollup, r=Centril
Rollup of 12 pull requests

Successful merges:

 - #55632 (Deny the `overflowing_literals` lint for all editions)
 - #57656 (Deprecate the unstable Vec::resize_default)
 - #57997 (Wrap write_bytes in a function. Move docs)
 - #58059 (deprecate before_exec in favor of unsafe pre_exec)
 - #58122 (RangeInclusive internal iteration performance improvement.)
 - #58198 (Suggest removing parentheses surrounding lifetimes)
 - #58353 (Check the Self-type of inherent associated constants)
 - #58476 (Remove `LazyTokenStream`.)
 - #58511 (Const to op simplification)
 - #58555 (Add a note about 2018e if someone uses `try {` in 2015e)
 - #58588 (remove a bit of dead code)
 - #58589 (cleanup macro after 2018 transition)

Failed merges:

r? @ghost
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment