[libsleef] Add modified Payne Hanek argument reduction #197

Merged
merged 15 commits into master from Add-payne-hanek-reduction on Jul 6, 2018

Conversation

@shibatch
Owner

shibatch commented Jun 19, 2018

This patch adds a modified Payne-Hanek argument reduction that can be used for very large arguments.

The Payne-Hanek reduction algorithm can handle very large arguments via table look-ups. This patch implements a vectorized version of the algorithm. The argument ranges for the DP and SP trig functions become [-1e+299, 1e+299] and [-1e+28, 1e+28], respectively. In order to avoid using 64-bit or 128-bit multiplication, the algorithm is modified to use DD (double-double) computation. Gather instructions are used for the table look-ups. The reduction subroutine is tested to confirm that it correctly handles the worst case, 6381956970095103.0 * 2.0^797.

The traditional implementation can be seen in the following gist.
https://gist.github.com/simonbyrne/d640ac1c1db3e1774cf2d405049beef3
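
To illustrate what the reduction does, below is a minimal scalar sketch of the classic Payne-Hanek approach using the 128-bit integer arithmetic that this patch deliberately avoids (the DD modification exists precisely so the algorithm vectorizes without such multiplications). The table two_over_pi and all names are illustrative, not the patch's actual code; a production version needs a wider window and careful rounding.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical table: the binary expansion of 2/pi, 64 bits per entry;
   two_over_pi[0] holds the bits of weight 2^-1 .. 2^-64, and so on. */
extern const uint64_t two_over_pi[20];

/* Reduce a large positive x modulo pi/2: returns r in [0, pi/2) and the
   quadrant, i.e. the integer part of x*(2/pi) mod 4, in *q. */
double payne_hanek_pio2(double x, int *q) {
  int e;
  uint64_t m = (uint64_t)ldexp(frexp(x, &e), 53); /* x = m * 2^(e-53) */
  e -= 53;

  /* Bits of 2/pi at positions <= e-2 contribute only whole multiples
     of 2*pi to x*(2/pi), so they can be discarded (m is an integer). */
  int first = e - 2;                /* assumes e > 2, i.e. x is large */
  int word = first / 64, off = first % 64;

  /* Grab a 128-bit window of 2/pi starting just below that position. */
  unsigned __int128 w = ((unsigned __int128)two_over_pi[word] << 64)
                        | two_over_pi[word + 1];
  if (off) w = (w << off) | (two_over_pi[word + 2] >> (64 - off));

  /* m * w modulo 2^128 gives x*(2/pi) mod 4 in 2.126 fixed point. */
  unsigned __int128 hi = (unsigned __int128)m * (uint64_t)(w >> 64);
  unsigned __int128 lo = (unsigned __int128)m * (uint64_t)w;
  unsigned __int128 p  = ((unsigned __int128)(uint64_t)hi << 64) + lo;

  *q = (int)(p >> 126);                      /* top 2 bits: quadrant  */
  p <<= 2;                                   /* keep fractional part  */
  return ldexp((double)(uint64_t)(p >> 64), -64) * (M_PI / 2);
}
```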

@shibatch shibatch requested a review from fpetrogalli-arm Jun 19, 2018

@fpetrogalli-arm

@shibatch Are there any performance improvements with this?

@shibatch

Owner

shibatch commented Jun 20, 2018

I haven't taken a benchmark, but it depends on how fast gather instructions are. On Skylake, the latency and reciprocal throughput of a gather instruction are 20 and 4 cycles, so executing 4 gather instructions takes at least 20 + 4 * 3 = 32 clocks, which is already pretty slow.
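
As a back-of-the-envelope model of that arithmetic (purely illustrative):

```c
/* Timing model for n independent instructions with latency L cycles and
   reciprocal throughput T cycles: the last one issues at (n-1)*T and
   completes L cycles later.  With n=4, L=20, T=4 this gives 32 cycles. */
static int pipeline_cycles(int n, int L, int T) {
  return (n - 1) * T + L;
}
```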

The execution time should not matter in practice, since the old reduction algorithms are still enabled. I would say there is something wrong if someone is calling a trig function with a very large argument. The point is that when adopting SLEEF in a project, people want to feel safe replacing their existing calls to the standard math functions with SLEEF functions.

@fpetrogalli-arm

Collaborator

fpetrogalli-arm commented Jun 20, 2018

@shibatch

I haven't taken a benchmark, but it depends on how fast gather instructions are. On Skylake, the latency and reciprocal throughput of a gather instruction are 20 and 4 cycles, so executing 4 gather instructions takes at least 20 + 4 * 3 = 32 clocks, which is already pretty slow.

But isn't this patch using gather instructions? If so, it doesn't seem like a good idea, given that they are slow.

The execution time should not matter in practice, since the old reduction algorithms are still enabled.

I think that execution time is a very important constraint.

I would say there is something wrong if someone is calling a trig function with a very large argument.

You are probably right, but penalizing users who compute trigonometric functions on proper values with an extra vector test instruction doesn't seem fair.

The point is that when adopting SLEEF in a project, people want to feel safe replacing their existing calls to the standard math functions with SLEEF functions.

I agree. But I think we should provide them as separate functions, instead of merging them into the same function with that if statement on the "test all ones" function (separate functions would of course add another degree of freedom for the run-time system, if we ever come up with it).

@shibatch

Owner

shibatch commented Jun 20, 2018

But isn't this patch using gather instructions? If so, it doesn't seem like a good idea, given that they are slow.

The Payne-Hanek algorithm is slow compared to Cody-Waite.

You are probably right, but penalizing users who compute trigonometric functions on proper values with an extra vector test instruction doesn't seem fair.

CPUs with out-of-order execution and branch prediction should see only a small performance impact from this. To estimate the execution time, we have to think about the critical path, rather than simply summing up the execution times of all the instructions. Since the extra vector test instructions are not on the critical path, they don't affect the overall execution time much.
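
To make the claim concrete, here is an illustrative AVX2 sketch of the pattern under discussion (sin_cody_waite, sin_payne_hanek, and the 1e+14 threshold are hypothetical stand-ins, not SLEEF's actual identifiers):

```c
#include <immintrin.h>

/* Hypothetical stand-ins for the fast and slow paths. */
__m256d sin_cody_waite(__m256d x);
__m256d sin_payne_hanek(__m256d x);

__m256d sin_dispatch(__m256d x) {
  /* |x|: clear the sign bit. */
  __m256d ax  = _mm256_andnot_pd(_mm256_set1_pd(-0.0), x);
  __m256d big = _mm256_cmp_pd(ax, _mm256_set1_pd(1e+14), _CMP_GT_OQ);

  /* The fast path runs unconditionally; its inputs don't depend on the
     test, so the test sits off the critical path. */
  __m256d r = sin_cody_waite(x);

  /* Rarely-taken branch: only when some lane is huge. */
  if (__builtin_expect(_mm256_movemask_pd(big) != 0, 0))
    r = _mm256_blendv_pd(r, sin_payne_hanek(x), big);

  return r;
}
```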

@shibatch

Owner

shibatch commented Jun 21, 2018

Here is the benchmark.
These are the results with the latest patch.

DP graph
SP graph

@shibatch

Owner

shibatch commented Jun 21, 2018

As a workaround for the bug in armclang, I changed the definition of the INLINE macro so that it does not insert the always_inline attribute when SLEEF is compiled for the SVE target. armclang still aggressively inlines functions, so I think this is not a problem. Please check the assembly output from the compiler.

https://github.com/shibatch/sleef/wiki/197/sleefsimddp.s

Testing for SVE target is now enabled.
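
A sketch of the kind of macro change described (the exact definition in SLEEF may differ):

```c
/* Keep always_inline everywhere except the SVE build, where the
   attribute triggered the armclang bug (sketch; the real macro may
   differ in detail). */
#ifdef ENABLE_SVE
#define INLINE inline                            /* let armclang decide */
#else
#define INLINE inline __attribute__((always_inline))
#endif
```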

shibatch added some commits Jun 21, 2018

@shibatch

Owner

shibatch commented Jun 22, 2018

I moved the fixup code to the reduction part, and now most of the functions are slightly faster than before when the argument is small. The graphs are updated.

@shibatch shibatch changed the title from [libsleef] Add modified Payne Henek argument reduction to [libsleef] Add modified Payne Hanek argument reduction Jun 22, 2018

shibatch added some commits Jun 23, 2018

@fpetrogalli-arm

Collaborator

fpetrogalli-arm commented Jun 27, 2018

Hi @shibatch, a couple of questions before I dig into the code review:

  1. Am I correct in thinking that the plots you have reported are for x86? If so, could you please report the same benchmarks for an AArch64 machine?

  2. Would it be possible to support such a big range of values by improving the reduction algorithms (especially for periodic functions like sine and cosine), instead of specializing the polynomials?

  3. Are you happy about the ~4x slowdown of the 4ULP version of sin/cos/tan (SP and DP) when sampling the input values outside of [0,6.28]?

  4. It is very nice that the functions are faster in [0, 6.28]. Is that due to the effect of using __builtin_expect? If so, what is the maximum range where this effect still works?

  5. How does this compare to SVML or libmvec on x86? Is SLEEF any better due to the changes in this patch? Specifically, can you say that the 4ULP sin is now faster than sin in SVML for the new very large input values?

@shibatch

Owner

shibatch commented Jun 27, 2018

Am I correct in thinking that the plots you have reported are for x86? If so, could you please report the same benchmarks for an AArch64 machine?

Sure. I will post the results.

Would it be possible to support such a big range of values by improving the reduction algorithms (especially for periodic functions like sine and cosine), instead of specializing the polynomials?

Reduction algorithms for trig functions are not easy, and there are basically two: Cody-Waite, which has been used since the first version of SLEEF, and Payne-Hanek, which is the one introduced in this PR. The versions used in SLEEF are modified versions of both, but fundamentally it is not easy to improve these algorithms further.
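
For contrast with the Payne-Hanek sketch above, here is a minimal scalar sketch of Cody-Waite reduction, with constants in the style of fdlibm's pi/2 splitting (illustrative, not SLEEF's code):

```c
#include <math.h>

/* pi/2 split into three parts with trailing zero bits, so that the
   products n * PIO2_H etc. stay exact while n is small -- which is
   why Cody-Waite only works on a limited argument range. */
#define PIO2_H 1.57079632673412561417e+00
#define PIO2_M 6.07710050650619224932e-11
#define PIO2_L 2.02226624879595063154e-21

/* Reduce x modulo pi/2 for moderate |x|; quadrant goes to *q. */
double cody_waite_pio2(double x, int *q) {
  double n = round(x * (2 / M_PI));   /* nearest multiple of pi/2 */
  *q = (int)n & 3;
  double r = x - n * PIO2_H;          /* exact while n is small   */
  r -= n * PIO2_M;                    /* peel off more bits       */
  r -= n * PIO2_L;
  return r;                           /* roughly [-pi/4, pi/4]    */
}
```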

Are you happy about the ~4x slowdown of the 4ULP version of sin/cos/tan (SP and DP) when sampling the input values outside of [0,6.28]?

For the DP versions of the functions, there is actually almost no slowdown, since an input domain that large was not supported before. For the 1-ulp SP versions, the situation is the same: there is almost no slowdown.

There is some slowdown in the 3.5-ulp SP versions of the functions between 125 and 39000, but I don't know if people care about this. For the input domain below 125, they should be a little faster than before.

It is very nice that the functions are faster in [0, 6.28]. Is that due to the effect of using __builtin_expect? If so, what is the maximum range where this effect still works?

For DP functions, the fastest algorithm is used up to 15, and the second algorithm is used up to 1e+9.

For SP functions, the faster algorithm is used up to 125.

The effect is not only from __builtin_expect. It is a combination of __builtin_expect, moving the fixup code to the slow reduction routine, and adjusting the polynomial for reduction. A scalar caricature of the tier selection is sketched below.
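
(Function names below are hypothetical; the thresholds are the ones just described.)

```c
/* Hypothetical stand-ins for the three reduction tiers. */
double reduce_tier1(double d);       /* fastest Cody-Waite, |d| <= 15  */
double reduce_tier2(double d);       /* wider Cody-Waite, up to ~1e+9  */
double reduce_payne_hanek(double d); /* table-based, up to ~1e+299     */

static double reduce(double d) {
  double a = d < 0 ? -d : d;
  if (__builtin_expect(a <= 15.0, 1)) return reduce_tier1(d);
  if (__builtin_expect(a <= 1e+9, 1)) return reduce_tier2(d);
  return reduce_payne_hanek(d);
}
```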

How does this compare to SVML or libmvec on x86? Is SLEEF any better due to the changes in this patch? Specifically, can you say that the 4ULP sin is now faster than sin in SVML for the new very large input values?

They are a little slower than SVML for very large arguments. I will post the graph.

@shibatch

Owner

shibatch commented Jun 27, 2018

These are comparisons between SVML and SLEEF with this patch on a Core i7-6700.
SVML DP graph
SVML SP graph

These graphs show the comparison on an RK3399.
AArch64 DP graph
AArch64 SP graph

* Add back 125 to 39000 reduction for 3.5ULP SP functions
* Benchmarking tool now compiles for aarch64
@shibatch

Owner

shibatch commented Jun 30, 2018

I put back the reduction routine for the [125, 39000] range in the 3.5-ulp SP functions.
There should be no big slowdown in those functions anymore.

@@ -290,6 +290,12 @@ static INLINE vdouble vloadu_vd_p(const double *ptr) { return _mm256_loadu_pd(pt
static INLINE void vstore_v_p_vd(double *ptr, vdouble v) { _mm256_store_pd(ptr, v); }
static INLINE void vstoreu_v_p_vd(double *ptr, vdouble v) { _mm256_storeu_pd(ptr, v); }
static INLINE vdouble vgather_vd_p_vi(const double *ptr, vint vi) {
int a[4];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

Shouldn't this be int a[VECLENSP]?
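
For reference, a plausible full shape of this emulated gather with a VECLEN macro (a reconstruction for illustration, assuming a helper such as vstoreu_v_p_vi; not the actual patch code):

```c
static INLINE vdouble vgather_vd_p_vi(const double *ptr, vint vi) {
  int a[VECLENDP];                 /* length macro instead of literal 4 */
  vstoreu_v_p_vi(a, vi);           /* assumed index-spill helper        */
  /* Scalar loads, highest lane first for _mm256_set_pd. */
  return _mm256_set_pd(ptr[a[3]], ptr[a[2]], ptr[a[1]], ptr[a[0]]);
}
```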

@@ -477,6 +483,13 @@ static INLINE vfloat vloadu_vf_p(const float *ptr) { return _mm256_loadu_ps(ptr)
static INLINE void vstore_v_p_vf(float *ptr, vfloat v) { _mm256_store_ps(ptr, v); }
static INLINE void vstoreu_v_p_vf(float *ptr, vfloat v) { _mm256_storeu_ps(ptr, v); }
static INLINE vfloat vgather_vf_p_vi2(const float *ptr, vint2 vi2) {
int a[8];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

Same here, shouldn't it be int a[VECLENSP]?

@@ -74,6 +74,18 @@ static INLINE void vstoreu_v_p_vf(float *ptr, vfloat v) { ptr[0] = v[0]; ptr[1]
static INLINE void vscatter2_v_p_i_i_vd(double *ptr, int offset, int step, vdouble v) { vstore_v_p_vd((double *)(&ptr[2*offset]), v); }
static INLINE vdouble vgather_vd_p_vi(const double *ptr, vint vi) {
int a[4];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

Shouldn't this also be int[VECLENDP]?

}
static INLINE vfloat vgather_vf_p_vi2(const float *ptr, vint2 vi2) {
int a[4];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

Shouldn't this also be int[VECLENSP]?

@@ -267,6 +267,12 @@ static INLINE vdouble vloadu_vd_p(const double *ptr) { return _mm_loadu_pd(ptr);
static INLINE void vstore_v_p_vd(double *ptr, vdouble v) { _mm_store_pd(ptr, v); }
static INLINE void vstoreu_v_p_vd(double *ptr, vdouble v) { _mm_storeu_pd(ptr, v); }
static INLINE vdouble vgather_vd_p_vi(const double *ptr, vint vi) {
int a[4];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

int a[VECLENDP];?

}
static float vcast_f_vf(vfloat v) {
float a[64];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

int a[svcntw()];

}
static int vcast_i_vi(vint v) {
int a[64];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

int a[svcntw()];

}
static int vcast_i_vi2(vint2 v) {
int a[64];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

int a[svcntw()];
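
For context, this suggestion works because svcntw() returns the number of 32-bit lanes in an SVE vector at run time, so the buffer becomes a C99 VLA sized to the actual hardware vector length instead of a worst-case 64. A minimal sketch, under the assumption that vint is svint32_t in the SVE helper:

```c
#include <arm_sve.h>

/* Sketch of extracting lane 0 through a VLA-sized spill buffer. */
static int vcast_i_vi(svint32_t v) {
  int a[svcntw()];                 /* exactly one vector's worth of ints */
  svst1_s32(svptrue_b32(), a, v);  /* store all lanes                    */
  return a[0];                     /* extract lane 0                     */
}
```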

#ifdef ENABLE_SVE
typedef __sizeless_struct {
vfloat d;
vint2 i;

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

Shouldn't this be vint i; and not vint2 i;?

@shibatch

shibatch Jul 4, 2018

Owner

vint2 is correct here

@fpetrogalli-arm

fpetrogalli-arm Jul 6, 2018

Collaborator

my bad, sorry

#else
typedef struct {
vfloat d;
vint2 i;

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

Same here, shouldn't this be vint i;?

shibatch added some commits Jul 4, 2018

@shibatch shibatch merged commit 4d5e75f into master Jul 6, 2018

0 of 6 checks passed at merge time:
continuous-integration/appveyor/branch: waiting for AppVeyor build to complete
continuous-integration/appveyor/pr: waiting for AppVeyor build to complete
continuous-integration/jenkins/branch: this commit is being built
continuous-integration/jenkins/pr-merge: this commit is being built
continuous-integration/travis-ci/pr: the Travis CI build is in progress
continuous-integration/travis-ci/push: the Travis CI build is in progress

@shibatch shibatch deleted the Add-payne-hanek-reduction branch Jul 6, 2018
