Optimize 64-bit division-by-constant for x86 platforms #73
Conversation
The idea is to move some of the basic building blocks of the Ryu algorithm into this file which might then be optimized for different compilers/architectures without decreasing code clarity.
On x86 platforms, division by a constant likely results in calls to library functions such as __udivdi3 or __aulldiv, even though the divisor is known at compile time. This patch performs division-by-constant on x86 platforms the same way x64 compilers do, i.e., by a multiplication and shifts. To compute the high 64 bits of a 64x64-bit product, the existing umul128 implementation can be reused. All tested compilers (MSVC, clang, and g++) eliminate the instructions used to compute the low 64 bits of the product.

Each occurrence of "x/5" or "x%10" has been replaced by a function call like "div5(x)" or an expression like "x - 10*div10(x)". This makes the code less readable, but the performance increase is probably worth it. For MSVC x64 and g++ x64 the performance is unaffected.

---

Benchmark results for MSVC x86 /O2:

benchmark -ryu -64

Before:
64: 165.414 21.922
64: 165.394 19.978
64: 167.034 22.027
64: 164.965 19.600

After:
64: 122.904 13.817
64: 123.616 17.523
64: 123.353 14.597
64: 122.699 13.635

benchmark -ryu -64 -invert

Before:
64: 235.964 10.823
64: 235.708  7.874
64: 236.425 11.082
64: 235.988  7.720

After:
64: 182.802 10.710
64: 182.420 11.635
64: 181.156  6.522
64: 181.247  7.254

---

Some more benchmarks using Google's benchmark library:

Benchmark                        Time Before:  After:
------------------------------------------------------------------------
// Test random bit patterns.
BM_RandomBits_double             226 ns       170 ns    1.33x
// Test numbers with x significand digits of the form "d.dddd".
BM_Digits_double/_1              725 ns       409 ns    1.77x
BM_Digits_double/_2              629 ns       362 ns    1.74x
BM_Digits_double/_3              585 ns       331 ns    1.77x
BM_Digits_double/_4              557 ns       313 ns    1.78x
BM_Digits_double/_5              521 ns       304 ns    1.71x
BM_Digits_double/_6              466 ns       289 ns    1.61x
BM_Digits_double/_7              431 ns       271 ns    1.59x
BM_Digits_double/_8              399 ns       256 ns    1.56x
BM_Digits_double/_9              367 ns       239 ns    1.54x
BM_Digits_double/_10             363 ns       231 ns    1.57x
BM_Digits_double/_11             328 ns       214 ns    1.53x
BM_Digits_double/_12             300 ns       198 ns    1.52x
BM_Digits_double/_13             269 ns       185 ns    1.45x
BM_Digits_double/_14             242 ns       170 ns    1.42x
BM_Digits_double/_15             214 ns       155 ns    1.38x
// Test uniformly distributed random numbers in the range [10^x, 10^y].
BM_Uniform_double/-30/-25        200 ns       149 ns    1.34x
BM_Uniform_double/-25/-20        201 ns       147 ns    1.37x
BM_Uniform_double/-20/-15        195 ns       144 ns    1.35x
BM_Uniform_double/-15/-10        192 ns       141 ns    1.36x
BM_Uniform_double/-10/-5         194 ns       144 ns    1.35x
BM_Uniform_double/-5/-1          194 ns       146 ns    1.33x
BM_Uniform_double/-1/1           194 ns       145 ns    1.34x
BM_Uniform_double/1/5            195 ns       145 ns    1.34x
BM_Uniform_double/5/10           192 ns       143 ns    1.34x
BM_Uniform_double/10/15          232 ns       162 ns    1.43x
BM_Uniform_double/15/20          244 ns       172 ns    1.42x
BM_Uniform_double/20/25          235 ns       164 ns    1.43x
BM_Uniform_double/25/30          246 ns       171 ns    1.44x
BM_Uniform_double/30/35          241 ns       170 ns    1.42x
BM_Uniform_double/35/40          211 ns       158 ns    1.34x
BM_Uniform_double/40/45          197 ns       146 ns    1.35x
BM_Uniform_double/45/50          191 ns       143 ns    1.34x
// in the same way as 64-bit compilers would do.
//
// NB:
// The multipliers and shift values are the ones generated by clang x64
Do we have an upstream bug that we can cite here?
For clang I just found this: https://bugs.llvm.org/show_bug.cgi?id=37932
I haven't found anything similar for gcc or msvc.
And for gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37443
I've filed VSO#664754 "On x86, division by 64-bit constants could be dramatically more efficient" in MSVC's internal database, with your code as a minimal repro. (This is separate from VSO#634761 "C2 emits code that's 34.3% slower than Clang/LLVM for the Ryu algorithm" which is the overall bug for Ryu codegen quality.)
Add links to some bug reports.
This PR continues to have great performance on my i7-4790. With my benchmark changes in #74 (defaulting to "inverted mode" for realistic branch prediction), I observe a 45% improvement for MSVC, 48% for Clang, and larger improvements for smaller output (which performs more divisions to trim more digits):