
Optimise the GCD implementations. #11

Merged (3 commits into rust-num:master, Oct 3, 2018)

Conversation

smarnach (Contributor):

This change avoids using swap() in the GCD implementation, which results in a speedup of almost a factor of two on my system. Here are a few benchmark numbers for u32 values, where t_old is the time per function call for the current implementation, and t_new is the time of the implementation introduced in this change.

```
| m             | n             | t_old (ns) | t_new (ns) |
+---------------+---------------+------------+------------+
| 2_971_215_073 | 1_836_311_903 |         76 |         40 |
|       253_241 |       489_997 |         26 |         12 |
| 4_183_928_743 |     1_234_567 |         63 |         31 |
| 3_221_225_469 |             3 |         93 |         47 |
|   0xffff_ffff |   0x8000_0000 |         98 |         55 |
```

I'm aware that having tested this on a single system for a single type doesn't tell us too much, but before trying to perform a more thorough benchmark, I'd like to know whether you are interested in this kind of optimisation in principle, and what kind of benchmark results you would like to see to include it.
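
To make the idea concrete, here is a minimal sketch of a swap-free binary (Stein's) GCD for `u32` -- simplified and not the exact code in this change -- where each loop iteration subtracts the smaller operand from the larger one in place instead of swapping:

```rust
// Illustrative swap-free binary (Stein's) GCD sketch; identifiers and
// structure are simplified relative to the actual change.
fn gcd(mut m: u32, mut n: u32) -> u32 {
    if m == 0 { return n; }
    if n == 0 { return m; }

    // Power of two shared by m and n.
    let shift = (m | n).trailing_zeros();
    // Make both operands odd.
    m >>= m.trailing_zeros();
    n >>= n.trailing_zeros();

    while m != n {
        // Subtract the smaller value in place instead of swapping;
        // the difference of two odd numbers is even, so shift it odd again.
        if m > n {
            m -= n;
            m >>= m.trailing_zeros();
        } else {
            n -= m;
            n >>= n.trailing_zeros();
        }
    }
    m << shift
}
```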

@hauleth requested a review from cuviper, October 1, 2018, 12:46.
@cuviper (Member) left a comment:

> I'd like to know whether you are interested in this kind of optimisation in principle

Optimizations that don't require breaking changes are absolutely welcome!

> what kind of benchmark results you would like to see to include it.

You could create a new file in `benches/` and copy the old implementation for a few baseline benchmarks, for comparison with benchmarks of the current `Integer::gcd` with your changes.

64-bit and 128-bit types may be more interesting -- if anything, I expect the results to be even more dramatic, but we should check.
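
Roughly, such a file might look like the sketch below (the stand-in `gcd_old`, the input values, and the bench names are illustrative only, not the actual benchmark that ended up in this PR; it needs a nightly toolchain for the `test` crate):

```rust
// benches/gcd.rs -- illustrative shape only, not the real benchmark file.
#![feature(test)]
extern crate num_integer;
extern crate test;

use num_integer::Integer;
use test::{black_box, Bencher};

// Stand-in baseline; the real file would copy the previous swap-based code.
fn gcd_old(mut m: u32, mut n: u32) -> u32 {
    while n != 0 {
        let r = m % n;
        m = n;
        n = r;
    }
    m
}

#[bench]
fn bench_gcd(b: &mut Bencher) {
    b.iter(|| black_box(2_971_215_073u32).gcd(&black_box(1_836_311_903)));
}

#[bench]
fn bench_gcd_old(b: &mut Bencher) {
    b.iter(|| gcd_old(black_box(2_971_215_073), black_box(1_836_311_903)));
}
```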

src/lib.rs (outdated):

```rust
loop {
    if m > n {
        m -= n;
        if m == 0 { return n << shift; }
```
cuviper (Member): This case is impossible, since `m > n`.

src/lib.rs (outdated):

```rust
loop {
    if m > n {
        m -= n;
        if m == 0 { return n << shift; }
```
cuviper (Member): Again impossible with `m > n`.

@smarnach (Contributor, author) commented Oct 3, 2018:

@cuviper Thanks for your comments! I've addressed them and added a benchmark. Here are the results for my machine (i7-4710MQ, x86_64, Linux):

running 20 tests
test i128::bench_gcd     ... bench:      55,449 ns/iter (+/- 1,621)
test i128::bench_gcd_old ... bench:      47,280 ns/iter (+/- 1,502)
test i16::bench_gcd      ... bench:         243 ns/iter (+/- 18)
test i16::bench_gcd_old  ... bench:         376 ns/iter (+/- 9)
test i32::bench_gcd      ... bench:         876 ns/iter (+/- 21)
test i32::bench_gcd_old  ... bench:       1,739 ns/iter (+/- 53)
test i64::bench_gcd      ... bench:       3,578 ns/iter (+/- 147)
test i64::bench_gcd_old  ... bench:       6,117 ns/iter (+/- 163)
test i8::bench_gcd       ... bench:          70 ns/iter (+/- 4)
test i8::bench_gcd_old   ... bench:          82 ns/iter (+/- 2)
test u128::bench_gcd     ... bench:      55,329 ns/iter (+/- 2,088)
test u128::bench_gcd_old ... bench:      45,653 ns/iter (+/- 1,334)
test u16::bench_gcd      ... bench:         270 ns/iter (+/- 20)
test u16::bench_gcd_old  ... bench:         478 ns/iter (+/- 18)
test u32::bench_gcd      ... bench:         902 ns/iter (+/- 86)
test u32::bench_gcd_old  ... bench:       1,875 ns/iter (+/- 59)
test u64::bench_gcd      ... bench:       3,631 ns/iter (+/- 141)
test u64::bench_gcd_old  ... bench:       6,772 ns/iter (+/- 201)
test u8::bench_gcd       ... bench:          77 ns/iter (+/- 3)
test u8::bench_gcd_old   ... bench:          90 ns/iter (+/- 2)

test result: ok. 0 passed; 0 failed; 0 ignored; 20 measured; 0 filtered out

For the test numbers I used (the Fibonacci numbers), the new version is faster for all types, except for the 128-bit types, where it is 20% slower. I suspect this is because the compiler is not able to reuse the comparison between m and n as a subtraction, but I did not manage to look at the assembly output to confirm this. Incidentally, the greatest speedup is achieved for u32, the type I first tested this with.

The Fibonacci numbers may not be good test candidates. For the standard "modulo" version of the Euclidean algorithm they are precisely the numbers that need the most steps relative to their size before the algorithm terminates. However, for Stein's algorithm they shouldn't be special in any way, and tests suggest that I get similar results for other choices.
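
(As an aside, consecutive Fibonacci pairs are easy to generate: for the "modulo" Euclidean algorithm, gcd(F(k+1), F(k)) reduces to gcd(F(k), F(k-1)) at every step, which is why they maximise the step count relative to their size. The sketch below only illustrates such inputs and is not necessarily how the benchmark generates them.)

```rust
// Illustrative: consecutive Fibonacci pairs (F(k+1), F(k)) that fit in a u32.
// The largest such pair is (2_971_215_073, 1_836_311_903), the first row of
// the table in the PR description.
fn fib_pairs() -> Vec<(u32, u32)> {
    let (mut a, mut b) = (1u32, 1u32);
    let mut pairs = Vec::new();
    while let Some(next) = a.checked_add(b) {
        pairs.push((next, b));
        a = b;
        b = next;
    }
    pairs
}
```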

Should I special-case the 128-bit implementations, so we use the faster version in each case? Or do we need to test this on more target architectures first?

@cuviper (Member) commented Oct 3, 2018:

On my i7-7700k, x86_64-unknown-linux-gnu, I get:

$ cargo +nightly bench --bench gcd
    Finished release [optimized] target(s) in 0.00s
     Running target/release/deps/gcd-1320b9ac6e705468

running 20 tests
test i128::bench_gcd     ... bench:      39,316 ns/iter (+/- 139)
test i128::bench_gcd_old ... bench:      34,610 ns/iter (+/- 205)
test i16::bench_gcd      ... bench:         176 ns/iter (+/- 0)
test i16::bench_gcd_old  ... bench:         216 ns/iter (+/- 0)
test i32::bench_gcd      ... bench:         635 ns/iter (+/- 1)
test i32::bench_gcd_old  ... bench:       1,055 ns/iter (+/- 8)
test i64::bench_gcd      ... bench:       2,601 ns/iter (+/- 92)
test i64::bench_gcd_old  ... bench:       4,082 ns/iter (+/- 20)
test i8::bench_gcd       ... bench:          50 ns/iter (+/- 0)
test i8::bench_gcd_old   ... bench:          53 ns/iter (+/- 0)
test u128::bench_gcd     ... bench:      39,990 ns/iter (+/- 239)
test u128::bench_gcd_old ... bench:      36,842 ns/iter (+/- 93)
test u16::bench_gcd      ... bench:         187 ns/iter (+/- 0)
test u16::bench_gcd_old  ... bench:         343 ns/iter (+/- 8)
test u32::bench_gcd      ... bench:         639 ns/iter (+/- 4)
test u32::bench_gcd_old  ... bench:       1,325 ns/iter (+/- 8)
test u64::bench_gcd      ... bench:       2,534 ns/iter (+/- 30)
test u64::bench_gcd_old  ... bench:       4,924 ns/iter (+/- 40)
test u8::bench_gcd       ... bench:          56 ns/iter (+/- 0)
test u8::bench_gcd_old   ... bench:          65 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 20 measured; 0 filtered out

Same hardware running i686:

$ cargo +nightly bench --bench gcd --target i686-unknown-linux-gnu
    Finished release [optimized] target(s) in 0.00s
     Running target/i686-unknown-linux-gnu/release/deps/gcd-b0e38615de2ee657

running 20 tests
test i128::bench_gcd     ... bench:      88,740 ns/iter (+/- 403)
test i128::bench_gcd_old ... bench:      95,065 ns/iter (+/- 381)
test i16::bench_gcd      ... bench:         208 ns/iter (+/- 1)
test i16::bench_gcd_old  ... bench:         269 ns/iter (+/- 0)
test i32::bench_gcd      ... bench:         696 ns/iter (+/- 0)
test i32::bench_gcd_old  ... bench:       1,058 ns/iter (+/- 19)
test i64::bench_gcd      ... bench:       7,824 ns/iter (+/- 91)
test i64::bench_gcd_old  ... bench:       9,857 ns/iter (+/- 766)
test i8::bench_gcd       ... bench:          72 ns/iter (+/- 0)
test i8::bench_gcd_old   ... bench:          83 ns/iter (+/- 0)
test u128::bench_gcd     ... bench:      89,749 ns/iter (+/- 573)
test u128::bench_gcd_old ... bench:      84,835 ns/iter (+/- 6,675)
test u16::bench_gcd      ... bench:         222 ns/iter (+/- 26)
test u16::bench_gcd_old  ... bench:         342 ns/iter (+/- 5)
test u32::bench_gcd      ... bench:         706 ns/iter (+/- 3)
test u32::bench_gcd_old  ... bench:       1,462 ns/iter (+/- 12)
test u64::bench_gcd      ... bench:       7,882 ns/iter (+/- 61)
test u64::bench_gcd_old  ... bench:       9,177 ns/iter (+/- 109)
test u8::bench_gcd       ... bench:          74 ns/iter (+/- 0)
test u8::bench_gcd_old   ... bench:          76 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 20 measured; 0 filtered out

The fact that i686's i128 got better doesn't help the evaluation here...

The biggest difference I can see in assembly is that the new versions use packed instructions, and PCMPEQB is reported as a hotspot in perf. But that's also true in the i686 anomaly...

@cuviper (Member) commented Oct 3, 2018:

I tried with RUSTFLAGS=-Ctarget-cpu=native, and now x86_64 flipped the i128 result!

$ RUSTFLAGS=-Ctarget-cpu=native cargo +nightly bench --bench gcd
   Compiling num-traits v0.2.5
   Compiling num-integer v0.1.39 (/home/jistone/rust/num/integer)
    Finished release [optimized] target(s) in 2.65s
     Running target/release/deps/gcd-1320b9ac6e705468

running 20 tests
test i128::bench_gcd     ... bench:      27,438 ns/iter (+/- 363)
test i128::bench_gcd_old ... bench:      30,866 ns/iter (+/- 280)
test i16::bench_gcd      ... bench:         152 ns/iter (+/- 1)
test i16::bench_gcd_old  ... bench:         198 ns/iter (+/- 14)
test i32::bench_gcd      ... bench:         430 ns/iter (+/- 13)
test i32::bench_gcd_old  ... bench:         665 ns/iter (+/- 9)
test i64::bench_gcd      ... bench:       1,792 ns/iter (+/- 54)
test i64::bench_gcd_old  ... bench:       3,344 ns/iter (+/- 64)
test i8::bench_gcd       ... bench:          44 ns/iter (+/- 0)
test i8::bench_gcd_old   ... bench:          55 ns/iter (+/- 0)
test u128::bench_gcd     ... bench:      35,119 ns/iter (+/- 257)
test u128::bench_gcd_old ... bench:      27,378 ns/iter (+/- 74)
test u16::bench_gcd      ... bench:         139 ns/iter (+/- 4)
test u16::bench_gcd_old  ... bench:         243 ns/iter (+/- 2)
test u32::bench_gcd      ... bench:         443 ns/iter (+/- 3)
test u32::bench_gcd_old  ... bench:         911 ns/iter (+/- 12)
test u64::bench_gcd      ... bench:       1,875 ns/iter (+/- 58)
test u64::bench_gcd_old  ... bench:       4,216 ns/iter (+/- 47)
test u8::bench_gcd       ... bench:          63 ns/iter (+/- 0)
test u8::bench_gcd_old   ... bench:          78 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 20 measured; 0 filtered out
$ RUSTFLAGS=-Ctarget-cpu=native cargo +nightly bench --bench gcd --target=i686-unknown-linux-gnu
   Compiling num-traits v0.2.5
   Compiling num-integer v0.1.39 (/home/jistone/rust/num/integer)
    Finished release [optimized] target(s) in 2.71s
     Running target/i686-unknown-linux-gnu/release/deps/gcd-b0e38615de2ee657

running 20 tests
test i128::bench_gcd     ... bench:      94,120 ns/iter (+/- 4,215)
test i128::bench_gcd_old ... bench:      84,299 ns/iter (+/- 2,349)
test i16::bench_gcd      ... bench:         193 ns/iter (+/- 1)
test i16::bench_gcd_old  ... bench:         231 ns/iter (+/- 2)
test i32::bench_gcd      ... bench:         454 ns/iter (+/- 2)
test i32::bench_gcd_old  ... bench:         696 ns/iter (+/- 4)
test i64::bench_gcd      ... bench:       7,001 ns/iter (+/- 84)
test i64::bench_gcd_old  ... bench:       7,713 ns/iter (+/- 45)
test i8::bench_gcd       ... bench:          68 ns/iter (+/- 0)
test i8::bench_gcd_old   ... bench:          81 ns/iter (+/- 0)
test u128::bench_gcd     ... bench:      93,363 ns/iter (+/- 632)
test u128::bench_gcd_old ... bench:      90,175 ns/iter (+/- 447)
test u16::bench_gcd      ... bench:         168 ns/iter (+/- 0)
test u16::bench_gcd_old  ... bench:         274 ns/iter (+/- 2)
test u32::bench_gcd      ... bench:         447 ns/iter (+/- 0)
test u32::bench_gcd_old  ... bench:         985 ns/iter (+/- 7)
test u64::bench_gcd      ... bench:       7,132 ns/iter (+/- 78)
test u64::bench_gcd_old  ... bench:       7,434 ns/iter (+/- 100)
test u8::bench_gcd       ... bench:         103 ns/iter (+/- 0)
test u8::bench_gcd_old   ... bench:          86 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 20 measured; 0 filtered out

@cuviper (Member) commented Oct 3, 2018:

I'm inclined to say this is a clear win on "normal" integers, and close enough on 128-bit integers that it might not make much difference in real use (with other surrounding calculations too).

What do you think?

@smarnach (Contributor, author) commented Oct 3, 2018:

@cuviper It's a bit hard to tell what the common use case to optimise for is. In the end, GCD computations are kind of fringe in real applications, so probably the "common use case" simply does not exist. But I agree that the new implementation seems to be faster overall, and the differences for the 128-bit implementation do not justify complicating the code any further.

@cuviper (Member) commented Oct 3, 2018:

OK, let's merge!

bors r+

bors bot added a commit that referenced this pull request Oct 3, 2018
11: Optimise the GCD implementations. r=cuviper a=smarnach

This change avoids using `swap()` in the GCD implementation, which results in a speedup of almost a factor of two on my system.  Here are a few benchmark numbers for `u32` values, where `t_old` is the time per function call for the current implementation, and `t_new` is the time of the implementation introduced in this change.
```
| m             | n             | t_old (ns) | t_new (ns) |
+---------------+---------------+------------+------------+
| 2_971_215_073 | 1_836_311_903 |         76 |         40 |
|       253_241 |       489_997 |         26 |         12 |
| 4_183_928_743 |     1_234_567 |         63 |         31 |
| 3_221_225_469 |             3 |         93 |         47 |
|   0xffff_ffff |   0x8000_0000 |         98 |         55 |
```
I'm aware that having tested this on a single system for a single type doesn't tell us too much, but before trying to perform a more thorough benchmark, I'd like to know whether you are interested in this kind of optimisation in principle, and what kind of benchmark results you would like to see to include it.

Co-authored-by: Sven Marnach <sven@marnach.net>
@bors (bot) commented Oct 3, 2018:

Build succeeded

@bors merged commit 1cb1d29 into rust-num:master on Oct 3, 2018.
@smarnach (Contributor, author) commented Oct 3, 2018:

@cuviper Thanks!
