Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize 64-bit division-by-constant for x86 platforms #73

Merged
merged 4 commits into from
Aug 16, 2018

Conversation

abolz
Copy link
Contributor

@abolz abolz commented Aug 14, 2018

On x86 platforms division-by-constant likely results in calls to library
functions like __udivdi3 or __aulldiv, even if the divisor is constant.

This patch performs division-by-constant on x86 platforms in the same
way as x64 compilers would do, i.e., by a multiplication and shifts.

To compute the high 64 bits of a 64x64 product, the current umul128
implementation can be reused. All tested compilers (MSVC, clang and g++)
eliminate the instructions used to compute the low 64 bits of the
product.

Each occurrence of "x/5" or "x%10" has been replaced by a function like
"div5(x)" or an expression like "x - 10*div10(x)". This makes the code
less readable, but the performance increase is probably worth it.

For MSVC x64 and g++ x64 the performance is unaffected.


Benchmark results for MSVC x86 /O2:

benchmark -ryu -64
Before:
64:  165.414   21.922
64:  165.394   19.978
64:  167.034   22.027
64:  164.965   19.600
After:
64:  122.904   13.817
64:  123.616   17.523
64:  123.353   14.597
64:  122.699   13.635
benchmark -ryu -64 -invert
Before:
64:  235.964   10.823
64:  235.708    7.874
64:  236.425   11.082
64:  235.988    7.720
After:
64:  182.802   10.710
64:  182.420   11.635
64:  181.156    6.522
64:  181.247    7.254

Some more benchmarks using google's benchmark library:

Benchmark:                Time Before:     After:
------------------------------------------------------------------------
// Test random bit patterns.
BM_RandomBits_double           226 ns      170 ns      1.33x

// Test numbers with x significand digits of the form "d.dddd".
BM_Digits_double/_1            725 ns      409 ns      1.77x
BM_Digits_double/_2            629 ns      362 ns      1.74x
BM_Digits_double/_3            585 ns      331 ns      1.77x
BM_Digits_double/_4            557 ns      313 ns      1.78x
BM_Digits_double/_5            521 ns      304 ns      1.71x
BM_Digits_double/_6            466 ns      289 ns      1.61x
BM_Digits_double/_7            431 ns      271 ns      1.59x
BM_Digits_double/_8            399 ns      256 ns      1.56x
BM_Digits_double/_9            367 ns      239 ns      1.54x
BM_Digits_double/_10           363 ns      231 ns      1.57x
BM_Digits_double/_11           328 ns      214 ns      1.53x
BM_Digits_double/_12           300 ns      198 ns      1.52x
BM_Digits_double/_13           269 ns      185 ns      1.45x
BM_Digits_double/_14           242 ns      170 ns      1.42x
BM_Digits_double/_15           214 ns      155 ns      1.38x

// Test uniformly distributed random numbers in the range [10^x, 10^y].
BM_Uniform_double/-30/-25      200 ns      149 ns      1.34x
BM_Uniform_double/-25/-20      201 ns      147 ns      1.37x
BM_Uniform_double/-20/-15      195 ns      144 ns      1.35x
BM_Uniform_double/-15/-20      192 ns      141 ns      1.36x
BM_Uniform_double/-10/-5       194 ns      144 ns      1.35x
BM_Uniform_double/-5/-1        194 ns      146 ns      1.33x
BM_Uniform_double/-1/1         194 ns      145 ns      1.34x
BM_Uniform_double/1/5          195 ns      145 ns      1.34x
BM_Uniform_double/5/10         192 ns      143 ns      1.34x
BM_Uniform_double/10/15        232 ns      162 ns      1.43x
BM_Uniform_double/15/20        244 ns      172 ns      1.42x
BM_Uniform_double/20/25        235 ns      164 ns      1.43x
BM_Uniform_double/25/30        246 ns      171 ns      1.44x
BM_Uniform_double/30/35        241 ns      170 ns      1.42x
BM_Uniform_double/35/40        211 ns      158 ns      1.34x
BM_Uniform_double/40/45        197 ns      146 ns      1.35x
BM_Uniform_double/45/50        191 ns      143 ns      1.34x

The idea is to move some of the basic building blocks of the Ryu
algorithm into this file which might then be optimized for different
compilers/architectures without decreasing code clarity.
On x86 platforms division-by-constant likely results in calls to library
functions like __udivdi3 or __aulldiv, even if the divisor is constant.

This patch performs division-by-constant on x86 platforms in the same
way as x64 compilers would do, i.e., by a multiplication and shifts.

To compute the high 64 bits of a 64x64 product, the current umul128
implementation can be reused. All tested compilers (MSVC, clang and g++)
eliminate the instructions used to compute the low 64 bits of the
product.

Each occurrence of "x/5" or "x%10" has been replaced by a function like
"div5(x)" or an expression like "x - 10*div10(x)". This makes the code
less readable, but the performance increase is probably worth it.

For MSVC x64 and g++ x64 the performance is unaffected.

---

Benchmark results for MSVC x86 /O2:

benchmark -ryu -64
Before:
64:  165.414   21.922
64:  165.394   19.978
64:  167.034   22.027
64:  164.965   19.600
After:
64:  122.904   13.817
64:  123.616   17.523
64:  123.353   14.597
64:  122.699   13.635

benchmark -ryu -64 -invert
Before:
64:  235.964   10.823
64:  235.708    7.874
64:  236.425   11.082
64:  235.988    7.720
After:
64:  182.802   10.710
64:  182.420   11.635
64:  181.156    6.522
64:  181.247    7.254

---

Some more benchmarks using google's benchmark library:

Benchmark:                Time Before:     After:
------------------------------------------------------------------------
// Test random bit patterns.
BM_RandomBits_double           226 ns      170 ns      1.33x

// Test numbers with x significand digits of the form "d.dddd".
BM_Digits_double/_1            725 ns      409 ns      1.77x
BM_Digits_double/_2            629 ns      362 ns      1.74x
BM_Digits_double/_3            585 ns      331 ns      1.77x
BM_Digits_double/_4            557 ns      313 ns      1.78x
BM_Digits_double/_5            521 ns      304 ns      1.71x
BM_Digits_double/_6            466 ns      289 ns      1.61x
BM_Digits_double/_7            431 ns      271 ns      1.59x
BM_Digits_double/_8            399 ns      256 ns      1.56x
BM_Digits_double/_9            367 ns      239 ns      1.54x
BM_Digits_double/_10           363 ns      231 ns      1.57x
BM_Digits_double/_11           328 ns      214 ns      1.53x
BM_Digits_double/_12           300 ns      198 ns      1.52x
BM_Digits_double/_13           269 ns      185 ns      1.45x
BM_Digits_double/_14           242 ns      170 ns      1.42x
BM_Digits_double/_15           214 ns      155 ns      1.38x

// Test uniformly distributed random numbers in the range [10^x, 10^y].
BM_Uniform_double/-30/-25      200 ns      149 ns      1.34x
BM_Uniform_double/-25/-20      201 ns      147 ns      1.37x
BM_Uniform_double/-20/-15      195 ns      144 ns      1.35x
BM_Uniform_double/-15/-20      192 ns      141 ns      1.36x
BM_Uniform_double/-10/-5       194 ns      144 ns      1.35x
BM_Uniform_double/-5/-1        194 ns      146 ns      1.33x
BM_Uniform_double/-1/1         194 ns      145 ns      1.34x
BM_Uniform_double/1/5          195 ns      145 ns      1.34x
BM_Uniform_double/5/10         192 ns      143 ns      1.34x
BM_Uniform_double/10/15        232 ns      162 ns      1.43x
BM_Uniform_double/15/20        244 ns      172 ns      1.42x
BM_Uniform_double/20/25        235 ns      164 ns      1.43x
BM_Uniform_double/25/30        246 ns      171 ns      1.44x
BM_Uniform_double/30/35        241 ns      170 ns      1.42x
BM_Uniform_double/35/40        211 ns      158 ns      1.34x
BM_Uniform_double/40/45        197 ns      146 ns      1.35x
BM_Uniform_double/45/50        191 ns      143 ns      1.34x
// in the same way as 64-bit compilers would do.
//
// NB:
// The multipliers and shift values are the ones generated by clang x64
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have an upstream bug that we can cite here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clang I just found this: https://bugs.llvm.org/show_bug.cgi?id=37932
I haven't found something similar for gcc or msvc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed VSO#664754 "On x86, division by 64-bit constants could be dramatically more efficient" in MSVC's internal database, with your code as a minimal repro. (This is separate from VSO#634761 "C2 emits code that's 34.3% slower than Clang/LLVM for the Ryu algorithm" which is the overall bug for Ryu codegen quality.)

Add links to some bug reports.
@StephanTLavavej
Copy link
Contributor

This PR continues to have great performance on my i7-4790. With my benchmark changes in #74 (defaulting to "inverted mode" for realistic branch prediction), I observe a 45% improvement for MSVC, 48% for Clang, and larger improvements for smaller output (which performs more divisions to trim more digits):

MSVC Digits Before After Speedup
All 153.713 106.1 1.45
7 439.555 167.245 2.63
6 476.098 178.361 2.67
5 510.015 186.656 2.73
4 545.182 195.279 2.79
3 584.221 206.642 2.83
2 635.785 221.941 2.86
1 711.366 238.35 2.98
Clang Digits Before After Speedup
All 118.688 80.378 1.48
7 284.491 130.421 2.18
6 305.25 137.703 2.22
5 336.106 145.762 2.31
4 356.973 153.755 2.32
3 379.057 158.039 2.40
2 412.38 177.199 2.33
1 486.917 185.11 2.63

@ulfjack ulfjack merged commit 3474a77 into ulfjack:master Aug 16, 2018
@abolz abolz deleted the div branch August 17, 2018 06:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants