-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimise the GCD implementations. #11
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to know whether you are interested in this kind of optimisation in principle
Optimizations that don't require breaking changes are absolutely welcome!
what kind of benchmark results you would like to see to inlcude it.
You could create a new file in benches/
and copy the old implementation for a few baseline benchmarks, for comparison with benchmarks of the current Integer::gcd
with your changes.
64-bit and 128-bit types may be more interesting -- if anything, I expect the results to be even more dramatic, but we should check.
src/lib.rs
Outdated
loop { | ||
if m > n { | ||
m -= n; | ||
if m == 0 { return n << shift; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is impossible, since m > n
.
src/lib.rs
Outdated
loop { | ||
if m > n { | ||
m -= n; | ||
if m == 0 { return n << shift; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again impossible with m > n
.
Address review comments.
@cuviper Thanks for your comments! I've addressed them and added a benchmark. Here are the results for my machine (i7-4710MQ, x86_64, Linux):
For the test numbers I used (the Fibonacci numbers) the new version is faster for all types, except for the 128 bit types, where it is 20% slower. I suspect this is because the compiler is not able to reuse the comparison between The Fibonacci numbers may not be good test candidates. For the standard "modulo" version of the Euclidean algorithm they are precisely the numbers that need the most steps relative to their size before the algorithm terminates. However, for Stein's algorithm they shouldn't be special in any way, and tests suggest that I get similar results for other choices. Should I special-case the 128 bit implementations, so we use the faster version in each case? Or do we need to test this on more target architectures first? |
On my i7-7700k, x86_64-unknown-linux-gnu, I get:
Same hardware running i686:
The fact that i686's The biggest difference I can see in assembly is that the new versions use packed instructions, and PCMPEQB is reported as a hotspot in perf. But that's also true in the i686 anomaly... |
I tried with
|
I'm inclined to say this is a clear win on "normal" integers, and close enough on 128-bit integers that it might not make much difference in real use (with other surrounding calculations too). What do you think? |
@cuviper It's a bit hard to tell what the common use case to optimise for is. In the end, GCD computations are kind of fringe in real applications, so probably the "common use case" simply does not exist. But I agree that the new implementation seems to be faster overall, and the differences for the 128 bit implementation do not justify complicating the code any further. |
OK, let's merge! bors r+ |
11: Optimise the GCD implementations. r=cuviper a=smarnach This change avoids using `swap()` in the GCD implementation, which results in a speedup of almost a factor of two on my system. Here are a few benchmark numbers for `u32` values, where `t_old` is the time per function call for the current implementation, and `t_new` is the time of the implementation introduced in this change. ``` | m | n | t_old (ns) | t_new (ns) | +---------------+---------------+------------+------------+ | 2_971_215_073 | 1_836_311_903 | 76 | 40 | | 253_241 | 489_997 | 26 | 12 | | 4_183_928_743 | 1_234_567 | 63 | 31 | | 3_221_225_469 | 3 | 93 | 47 | | 0xffff_ffff | 0x8000_0000 | 98 | 55 | ``` I'm aware that having tested this on a single system for a single type doesn't tell us too much, but before trying to perform a more thorough benchmark, I'd like to know whether you are interested in this kind of optimisation in principle, and what kind of benchmark results you would like to see to inlcude it. Co-authored-by: Sven Marnach <sven@marnach.net>
Build succeeded |
@cuviper Thanks! |
This change avoids using
swap()
in the GCD implementation, which results in a speedup of almost a factor of two on my system. Here are a few benchmark numbers foru32
values, wheret_old
is the time per function call for the current implementation, andt_new
is the time of the implementation introduced in this change.I'm aware that having tested this on a single system for a single type doesn't tell us too much, but before trying to perform a more thorough benchmark, I'd like to know whether you are interested in this kind of optimisation in principle, and what kind of benchmark results you would like to see to inlcude it.