Optimize 64-bit division-by-constant for x86 platforms #73
Conversation
The idea is to move some of the basic building blocks of the Ryu algorithm into this file which might then be optimized for different compilers/architectures without decreasing code clarity.
On x86 platforms, division by a constant likely results in calls to library functions such as __udivdi3 or __aulldiv, even though the divisor is known at compile time. This patch performs division-by-constant on x86 platforms the same way x64 compilers do, i.e., by a multiplication and shifts. To compute the high 64 bits of a 64x64-bit product, the existing umul128 implementation can be reused. All tested compilers (MSVC, clang, and g++) eliminate the instructions used to compute the low 64 bits of the product.

Each occurrence of "x/5" or "x%10" has been replaced by a function call like "div5(x)" or an expression like "x - 10*div10(x)". This makes the code less readable, but the performance increase is probably worth it. For MSVC x64 and g++ x64 the performance is unaffected.

---

Benchmark results for MSVC x86 /O2:

benchmark -ryu -64

Before:
64: 165.414 21.922
64: 165.394 19.978
64: 167.034 22.027
64: 164.965 19.600

After:
64: 122.904 13.817
64: 123.616 17.523
64: 123.353 14.597
64: 122.699 13.635

benchmark -ryu -64 -invert

Before:
64: 235.964 10.823
64: 235.708  7.874
64: 236.425 11.082
64: 235.988  7.720

After:
64: 182.802 10.710
64: 182.420 11.635
64: 181.156  6.522
64: 181.247  7.254

---

Some more benchmarks using Google's benchmark library:

Benchmark                        Time Before:  After:
------------------------------------------------------------------------
// Test random bit patterns.
BM_RandomBits_double             226 ns       170 ns    1.33x
// Test numbers with x significand digits of the form "d.dddd".
BM_Digits_double/_1              725 ns       409 ns    1.77x
BM_Digits_double/_2              629 ns       362 ns    1.74x
BM_Digits_double/_3              585 ns       331 ns    1.77x
BM_Digits_double/_4              557 ns       313 ns    1.78x
BM_Digits_double/_5              521 ns       304 ns    1.71x
BM_Digits_double/_6              466 ns       289 ns    1.61x
BM_Digits_double/_7              431 ns       271 ns    1.59x
BM_Digits_double/_8              399 ns       256 ns    1.56x
BM_Digits_double/_9              367 ns       239 ns    1.54x
BM_Digits_double/_10             363 ns       231 ns    1.57x
BM_Digits_double/_11             328 ns       214 ns    1.53x
BM_Digits_double/_12             300 ns       198 ns    1.52x
BM_Digits_double/_13             269 ns       185 ns    1.45x
BM_Digits_double/_14             242 ns       170 ns    1.42x
BM_Digits_double/_15             214 ns       155 ns    1.38x
// Test uniformly distributed random numbers in the range [10^x, 10^y].
BM_Uniform_double/-30/-25        200 ns       149 ns    1.34x
BM_Uniform_double/-25/-20        201 ns       147 ns    1.37x
BM_Uniform_double/-20/-15        195 ns       144 ns    1.35x
BM_Uniform_double/-15/-10        192 ns       141 ns    1.36x
BM_Uniform_double/-10/-5         194 ns       144 ns    1.35x
BM_Uniform_double/-5/-1          194 ns       146 ns    1.33x
BM_Uniform_double/-1/1           194 ns       145 ns    1.34x
BM_Uniform_double/1/5            195 ns       145 ns    1.34x
BM_Uniform_double/5/10           192 ns       143 ns    1.34x
BM_Uniform_double/10/15          232 ns       162 ns    1.43x
BM_Uniform_double/15/20          244 ns       172 ns    1.42x
BM_Uniform_double/20/25          235 ns       164 ns    1.43x
BM_Uniform_double/25/30          246 ns       171 ns    1.44x
BM_Uniform_double/30/35          241 ns       170 ns    1.42x
BM_Uniform_double/35/40          211 ns       158 ns    1.34x
BM_Uniform_double/40/45          197 ns       146 ns    1.35x
BM_Uniform_double/45/50          191 ns       143 ns    1.34x
// in the same way as 64-bit compilers would do.
//
// NB:
// The multipliers and shift values are the ones generated by clang x64
Do we have an upstream bug that we can cite here?
For clang I just found this: https://bugs.llvm.org/show_bug.cgi?id=37932
I haven't found anything similar for gcc or msvc.
And for gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37443
I've filed VSO#664754 "On x86, division by 64-bit constants could be dramatically more efficient" in MSVC's internal database, with your code as a minimal repro. (This is separate from VSO#634761 "C2 emits code that's 34.3% slower than Clang/LLVM for the Ryu algorithm" which is the overall bug for Ryu codegen quality.)
Add links to some bug reports.
This PR continues to have great performance on my i7-4790. With my benchmark changes in #74 (defaulting to "inverted mode" for realistic branch prediction), I observe a 45% improvement for MSVC, 48% for Clang, and larger improvements for smaller output (which performs more divisions to trim more digits):