[libsleef] Add modified Payne Hanek argument reduction #197

Merged
merged 15 commits into master from Add-payne-hanek-reduction on Jul 6, 2018

Conversation

@shibatch
Owner

shibatch commented Jun 19, 2018

This patch adds a modified Payne-Hanek argument reduction that can be used for very large arguments.

The Payne-Hanek reduction algorithm can handle very large arguments via table look-ups. This patch implements a vectorized version of the algorithm. The argument ranges for the DP and SP trig functions become [-1e+299, 1e+299] and [-1e+28, 1e+28], respectively. In order to avoid using 64-bit or 128-bit multiplication, the algorithm is modified to use DD (double-double) computation. Gather instructions are used for the table look-ups. The reduction subroutine is tested to confirm that it correctly handles the worst case, 6381956970095103.0 * 2.0^797.

The traditional implementation can be seen in the following gist.
https://gist.github.com/simonbyrne/d640ac1c1db3e1774cf2d405049beef3
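
To illustrate what the reduction does, below is a minimal scalar sketch of the classic Payne-Hanek approach using the 128-bit integer arithmetic that this patch deliberately avoids (the DD modification exists precisely so the algorithm vectorizes without such multiplications). The table two_over_pi and all names are illustrative, not the patch's actual code; a production version needs a wider window and careful rounding.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical table: the binary expansion of 2/pi, 64 bits per entry;
   two_over_pi[0] holds the bits of weight 2^-1 .. 2^-64, and so on. */
extern const uint64_t two_over_pi[20];

/* Reduce a large positive x modulo pi/2: returns r in [0, pi/2) and the
   quadrant, i.e. the integer part of x*(2/pi) mod 4, in *q. */
double payne_hanek_pio2(double x, int *q) {
  int e;
  uint64_t m = (uint64_t)ldexp(frexp(x, &e), 53); /* x = m * 2^(e-53) */
  e -= 53;

  /* Bits of 2/pi at positions <= e-2 contribute only whole multiples
     of 2*pi to x*(2/pi), so they can be discarded (m is an integer). */
  int first = e - 2;                /* assumes e > 2, i.e. x is large */
  int word = first / 64, off = first % 64;

  /* Grab a 128-bit window of 2/pi starting just below that position. */
  unsigned __int128 w = ((unsigned __int128)two_over_pi[word] << 64)
                        | two_over_pi[word + 1];
  if (off) w = (w << off) | (two_over_pi[word + 2] >> (64 - off));

  /* m * w modulo 2^128 gives x*(2/pi) mod 4 in 2.126 fixed point. */
  unsigned __int128 hi = (unsigned __int128)m * (uint64_t)(w >> 64);
  unsigned __int128 lo = (unsigned __int128)m * (uint64_t)w;
  unsigned __int128 p  = ((unsigned __int128)(uint64_t)hi << 64) + lo;

  *q = (int)(p >> 126);                      /* top 2 bits: quadrant  */
  p <<= 2;                                   /* keep fractional part  */
  return ldexp((double)(uint64_t)(p >> 64), -64) * (M_PI / 2);
}
```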

@shibatch shibatch requested a review from fpetrogalli-arm Jun 19, 2018

@fpetrogalli-arm

@shibatch Are there any performance improvements with this?

@shibatch

Owner

shibatch commented Jun 20, 2018

I haven't taken a benchmark, but it depends on how fast gather instructions are. On Skylake, the latency and reciprocal throughput of a gather instruction are 20 and 4 cycles, so executing 4 gather instructions takes at least 20 + 4 * 3 = 32 clocks, which is already pretty slow.
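
As a back-of-the-envelope model of that arithmetic (purely illustrative):

```c
/* Timing model for n independent instructions with latency L cycles and
   reciprocal throughput T cycles: the last one issues at (n-1)*T and
   completes L cycles later.  With n=4, L=20, T=4 this gives 32 cycles. */
static int pipeline_cycles(int n, int L, int T) {
  return (n - 1) * T + L;
}
```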

The execution time should not matter in practice, since the old reduction algorithms are still enabled. I would say there is something wrong if someone is calling a trig function with a very large argument. The point is that when adopting SLEEF in a project, people want to feel safe replacing their existing calls to the standard math functions with SLEEF functions.

@fpetrogalli-arm

Collaborator

fpetrogalli-arm commented Jun 20, 2018

@shibatch

I haven't taken a benchmark, but it depends on how fast gather instructions are. On Skylake, the latency and reciprocal throughput of a gather instruction are 20 and 4 cycles, so executing 4 gather instructions takes at least 20 + 4 * 3 = 32 clocks, which is already pretty slow.

But isn't this patch using gather instructions? If so, it doesn't seem like a good idea, given that they are slow.

The execution time should not matter in practice, since the old reduction algorithms are still enabled.

I think that execution time is a very important constraint.

I would say there is something wrong if someone is calling a trig function with a very large argument.

You are probably right, but penalizing users who compute trigonometric functions on proper values with an extra vector test instruction doesn't seem fair.

The point is that when adopting SLEEF in a project, people want to feel safe replacing their existing calls to the standard math functions with SLEEF functions.

I agree. But I think we should provide them as separate functions, instead of merging them into the same function with that if statement on the "test all ones" function (separate functions would of course add another degree of freedom for the run-time system, if we ever come up with it).

@shibatch

Owner

shibatch commented Jun 20, 2018

But isn't this patch using gather instructions? If so, it doesn't seem like a good idea, given that they are slow.

The Payne-Hanek algorithm is slow compared to Cody-Waite.

You are probably right, but penalizing users who compute trigonometric functions on proper values with an extra vector test instruction doesn't seem fair.

CPUs with out-of-order execution and branch prediction should see only a small performance impact from this. To estimate the execution time, we have to think about the critical path, rather than simply summing up the execution times of all the instructions. Since the extra vector test instructions are not on the critical path, they don't affect the overall execution time much.
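
To make the claim concrete, here is an illustrative AVX2 sketch of the pattern under discussion (sin_cody_waite, sin_payne_hanek, and the 1e+14 threshold are hypothetical stand-ins, not SLEEF's actual identifiers):

```c
#include <immintrin.h>

/* Hypothetical stand-ins for the fast and slow paths. */
__m256d sin_cody_waite(__m256d x);
__m256d sin_payne_hanek(__m256d x);

__m256d sin_dispatch(__m256d x) {
  /* |x|: clear the sign bit. */
  __m256d ax  = _mm256_andnot_pd(_mm256_set1_pd(-0.0), x);
  __m256d big = _mm256_cmp_pd(ax, _mm256_set1_pd(1e+14), _CMP_GT_OQ);

  /* The fast path runs unconditionally; its inputs don't depend on the
     test, so the test sits off the critical path. */
  __m256d r = sin_cody_waite(x);

  /* Rarely-taken branch: only when some lane is huge. */
  if (__builtin_expect(_mm256_movemask_pd(big) != 0, 0))
    r = _mm256_blendv_pd(r, sin_payne_hanek(x), big);

  return r;
}
```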

@shibatch

Owner

shibatch commented Jun 21, 2018

Here is the benchmark.
These are the results with the latest patch.

DP graph
SP graph

@shibatch

Owner

shibatch commented Jun 21, 2018

As a workaround for the bug in armclang, I changed the definition of the INLINE macro so that it does not insert the always_inline attribute when SLEEF is compiled for the SVE target. armclang still aggressively inlines functions, so I think this is not a problem. Please check the assembly output from the compiler.

https://github.com/shibatch/sleef/wiki/197/sleefsimddp.s

Testing for SVE target is now enabled.
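
A sketch of the kind of macro change described (the exact definition in SLEEF may differ):

```c
/* Keep always_inline everywhere except the SVE build, where the
   attribute triggered the armclang bug (sketch; the real macro may
   differ in detail). */
#ifdef ENABLE_SVE
#define INLINE inline                            /* let armclang decide */
#else
#define INLINE inline __attribute__((always_inline))
#endif
```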

shibatch added some commits Jun 21, 2018

@shibatch

Owner

shibatch commented Jun 22, 2018

I moved the fixup code to the reduction part, and now most of the functions are slightly faster than before when the argument is small. The graphs are updated.

@shibatch shibatch changed the title from [libsleef] Add modified Payne Henek argument reduction to [libsleef] Add modified Payne Hanek argument reduction Jun 22, 2018

shibatch added some commits Jun 23, 2018

@fpetrogalli-arm

Collaborator

fpetrogalli-arm commented Jun 27, 2018

Hi @shibatch, a couple of questions before I dig into the code review:

  1. Am I correct in thinking that the plots you have reported are for x86? If so, could you please report the same benchmarks for an AArch64 machine?

  2. Would it be possible to support such a big range of values by improving the reduction algorithms (especially for periodic functions like sine and cosine), instead of specializing the polynomials?

  3. Are you happy about the ~4x slowdown of the 4ULP version of sin/cos/tan (SP and DP) when sampling the input values outside of [0,6.28]?

  4. It is very nice that the functions are faster in [0, 6.28]. Is that due to the effect of using __builtin_expect? If so, what is the maximum range where this effect still works?

  5. How does this compare to SVML or libmvec on x86? Is SLEEF any better due to the changes in this patch? Specifically, can you say that the 4ULP sin is now faster than sin in SVML for the new very large input values?

@shibatch

Owner

shibatch commented Jun 27, 2018

Am I correct in thinking that the plots you have reported are for x86? If so, could you please report the same benchmarks for an AArch64 machine?

Sure. I will post the results.

Would it be possible to support such a big range of values by improving the reduction algorithms (especially for periodic functions like sine and cosine), instead of specializing the polynomials?

Reduction algorithms for trig functions are not easy, and there are basically two: Cody-Waite, which has been used since the first version of SLEEF, and Payne-Hanek, which is the one introduced in this PR. The versions used in SLEEF are modified versions of both, but fundamentally it is not easy to improve these algorithms further.
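
For contrast with the Payne-Hanek sketch above, here is a minimal scalar sketch of Cody-Waite reduction, with constants in the style of fdlibm's pi/2 splitting (illustrative, not SLEEF's code):

```c
#include <math.h>

/* pi/2 split into three parts with trailing zero bits, so that the
   products n * PIO2_H etc. stay exact while n is small -- which is
   why Cody-Waite only works on a limited argument range. */
#define PIO2_H 1.57079632673412561417e+00
#define PIO2_M 6.07710050650619224932e-11
#define PIO2_L 2.02226624879595063154e-21

/* Reduce x modulo pi/2 for moderate |x|; quadrant goes to *q. */
double cody_waite_pio2(double x, int *q) {
  double n = round(x * (2 / M_PI));   /* nearest multiple of pi/2 */
  *q = (int)n & 3;
  double r = x - n * PIO2_H;          /* exact while n is small   */
  r -= n * PIO2_M;                    /* peel off more bits       */
  r -= n * PIO2_L;
  return r;                           /* roughly [-pi/4, pi/4]    */
}
```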

Are you happy about the ~4x slowdown of the 4ULP version of sin/cos/tan (SP and DP) when sampling the input values outside of [0,6.28]?

For the DP versions of the functions, there is actually almost no slowdown, since an input domain that large was not supported before. For the 1-ulp SP versions, the situation is the same: there is almost no slowdown.

There is some slowdown in the 3.5-ulp SP versions of the functions between 125 and 39000, but I don't know if people care about this. For the input domain below 125, they should be a little faster than before.

It is very nice that the functions are faster in [0, 6.28]. Is that due to the effect of using __builtin_expect? If so, what is the maximum range where this effect still works?

For DP functions, the fastest algorithm is used up to 15, and the second algorithm is used up to 1e+9.

For SP functions, the faster algorithm is used up to 125.

The effect is not only from __builtin_expect. It is a combination of __builtin_expect, moving the fixup code to the slow reduction routine, and adjusting the polynomial for reduction. A scalar caricature of the tier selection is sketched below.
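
(Function names below are hypothetical; the thresholds are the ones just described.)

```c
/* Hypothetical stand-ins for the three reduction tiers. */
double reduce_tier1(double d);       /* fastest Cody-Waite, |d| <= 15  */
double reduce_tier2(double d);       /* wider Cody-Waite, up to ~1e+9  */
double reduce_payne_hanek(double d); /* table-based, up to ~1e+299     */

static double reduce(double d) {
  double a = d < 0 ? -d : d;
  if (__builtin_expect(a <= 15.0, 1)) return reduce_tier1(d);
  if (__builtin_expect(a <= 1e+9, 1)) return reduce_tier2(d);
  return reduce_payne_hanek(d);
}
```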

How does this compare to SVML or libmvec on x86? Is SLEEF any better due to the changes in this patch? Specifically, can you say that the 4ULP sin is now faster than sin in SVML for the new very large input values?

They are a little slower than SVML for very large arguments. I will post the graph.

@shibatch

Owner

shibatch commented Jun 27, 2018

These are comparisons between SVML and SLEEF with this patch on a Core i7-6700.
SVML DP graph
SVML SP graph

These graphs show the comparison on an RK3399.
AArch64 DP graph
AArch64 SP graph

* Add back 125 to 39000 reduction for 3.5ULP SP functions
* Benchmarking tool now compiles for aarch64
@shibatch

Owner

shibatch commented Jun 30, 2018

I put back the reduction routine for the [125, 39000] range in the 3.5-ulp SP functions.
There should be no big slowdown in those functions anymore.

@@ -290,6 +290,12 @@ static INLINE vdouble vloadu_vd_p(const double *ptr) { return _mm256_loadu_pd(pt
static INLINE void vstore_v_p_vd(double *ptr, vdouble v) { _mm256_store_pd(ptr, v); }
static INLINE void vstoreu_v_p_vd(double *ptr, vdouble v) { _mm256_storeu_pd(ptr, v); }
static INLINE vdouble vgather_vd_p_vi(const double *ptr, vint vi) {
int a[4];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

Shouldn't this be int a[VECLENSP]?
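
For reference, a plausible full shape of this emulated gather with a VECLEN macro (a reconstruction for illustration, assuming a helper such as vstoreu_v_p_vi; not the actual patch code):

```c
static INLINE vdouble vgather_vd_p_vi(const double *ptr, vint vi) {
  int a[VECLENDP];                 /* length macro instead of literal 4 */
  vstoreu_v_p_vi(a, vi);           /* assumed index-spill helper        */
  /* Scalar loads, highest lane first for _mm256_set_pd. */
  return _mm256_set_pd(ptr[a[3]], ptr[a[2]], ptr[a[1]], ptr[a[0]]);
}
```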

@@ -477,6 +483,13 @@ static INLINE vfloat vloadu_vf_p(const float *ptr) { return _mm256_loadu_ps(ptr)
static INLINE void vstore_v_p_vf(float *ptr, vfloat v) { _mm256_store_ps(ptr, v); }
static INLINE void vstoreu_v_p_vf(float *ptr, vfloat v) { _mm256_storeu_ps(ptr, v); }
static INLINE vfloat vgather_vf_p_vi2(const float *ptr, vint2 vi2) {
int a[8];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

Same here, shouldn't it be int a[VECLENSP]?

@@ -74,6 +74,18 @@ static INLINE void vstoreu_v_p_vf(float *ptr, vfloat v) { ptr[0] = v[0]; ptr[1]
static INLINE void vscatter2_v_p_i_i_vd(double *ptr, int offset, int step, vdouble v) { vstore_v_p_vd((double *)(&ptr[2*offset]), v); }
static INLINE vdouble vgather_vd_p_vi(const double *ptr, vint vi) {
int a[4];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

Shouldn't this also be int[VECLENDP]?

}
static INLINE vfloat vgather_vf_p_vi2(const float *ptr, vint2 vi2) {
int a[4];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

Shouldn't this also be int[VECLENSP]?

@@ -267,6 +267,12 @@ static INLINE vdouble vloadu_vd_p(const double *ptr) { return _mm_loadu_pd(ptr);
static INLINE void vstore_v_p_vd(double *ptr, vdouble v) { _mm_store_pd(ptr, v); }
static INLINE void vstoreu_v_p_vd(double *ptr, vdouble v) { _mm_storeu_pd(ptr, v); }
static INLINE vdouble vgather_vd_p_vi(const double *ptr, vint vi) {
int a[4];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

int a[VECLENDP];?

}
static float vcast_f_vf(vfloat v) {
float a[64];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

int a[svcntw()];

}
static int vcast_i_vi(vint v) {
int a[64];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

int a[svcntw()];

}
static int vcast_i_vi2(vint2 v) {
int a[64];

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

int a[svcntw()];
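
For context, this suggestion works because svcntw() returns the number of 32-bit lanes in an SVE vector at run time, so the buffer becomes a C99 VLA sized to the actual hardware vector length instead of a worst-case 64. A minimal sketch, under the assumption that vint is svint32_t in the SVE helper:

```c
#include <arm_sve.h>

/* Sketch of extracting lane 0 through a VLA-sized spill buffer. */
static int vcast_i_vi(svint32_t v) {
  int a[svcntw()];                 /* exactly one vector's worth of ints */
  svst1_s32(svptrue_b32(), a, v);  /* store all lanes                    */
  return a[0];                     /* extract lane 0                     */
}
```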

#ifdef ENABLE_SVE
typedef __sizeless_struct {
vfloat d;
vint2 i;

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

Shouldn't this be vint i; and not vint2 i;?

@shibatch

shibatch Jul 4, 2018

Owner

vint2 is correct here

@fpetrogalli-arm

fpetrogalli-arm Jul 6, 2018

Collaborator

my bad, sorry

#else
typedef struct {
vfloat d;
vint2 i;

@fpetrogalli-arm

fpetrogalli-arm Jul 4, 2018

Collaborator

Same here, shouldn't this be vint i;?

shibatch added some commits Jul 4, 2018

@shibatch shibatch merged commit 4d5e75f into master Jul 6, 2018

0 of 6 checks passed at merge time:
continuous-integration/appveyor/branch: waiting for AppVeyor build to complete
continuous-integration/appveyor/pr: waiting for AppVeyor build to complete
continuous-integration/jenkins/branch: this commit is being built
continuous-integration/jenkins/pr-merge: this commit is being built
continuous-integration/travis-ci/pr: the Travis CI build is in progress
continuous-integration/travis-ci/push: the Travis CI build is in progress

@shibatch shibatch deleted the Add-payne-hanek-reduction branch Jul 6, 2018
