Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: stats: fix skewnorm.cdf losing precision at large x #8473

Closed
wants to merge 1 commit into from

Conversation

ev-br
Copy link
Member

@ev-br ev-br commented Feb 24, 2018

Now that Owen's T function is available, can use it for the skew normal distribution.

closes #7746

@josef-pkt
Copy link
Member

LGTM, it's good to have this now

@person142
Copy link
Member

Closes #5160 too, right?

@WarrenWeckesser
Copy link
Member

Nice. The survival function _sf should now be defined as

def _sf(self, x, a):
    return _norm_cdf(-x) + 2*sc.owens_t(x, a)

so we get accurate values for large x

@ev-br
Copy link
Member Author

ev-br commented Feb 24, 2018

@WarrenWeckesser I thought about it, but

In [9]: x = np.linspace(10, 50, 20)

In [10]: stats.skewnorm.sf(x, -1)
Out[10]: 
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.])

And I did not find a failing test case (admittedly, I did not try too hard). Do you have one?

@WarrenWeckesser
Copy link
Member

Those values are all 0, but that just means the true values are smaller than ~1.11e-16. We should be able to do better than that.

Here's a comparison:

In [55]: def skewnorm_sf(x, a):
    ...:     x = np.asarray(x)
    ...:     return ndtr(-x) + 2*owens_t(x, a)
    ...: 

In [56]: x = np.arange(10.)

Here's the default sf, which is simply 1 - cdf, so it can't do better than ~1.11e-16:

In [57]: skewnorm.sf(x, -1)
Out[57]: 
array([  2.50000000e-01,   2.51714896e-02,   5.17568504e-04,
         1.82222470e-06,   1.00306752e-09,   8.22675261e-14,
         1.11022302e-16,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00])

Here's the version I suggested:

In [58]: skewnorm_sf(x, -1)
Out[58]: 
array([  2.50000000e-01,   2.51714896e-02,   5.17568504e-04,
         1.82222470e-06,   1.00306756e-09,   8.21691225e-14,
         9.73347976e-19,   1.63800341e-24,  -8.48025473e-30,
        -8.42594351e-34])

But that reveals a new problem: the survival function should not be negative! How accurate is the implementation of Owen's T with a call such as owens_t(10, -1)?

@WarrenWeckesser
Copy link
Member

I said

We should be able to do better than that.

but apparently it isn't as easy as I hoped. For x large enough, the terms _norm_cdf(-x) and 2*owens_t(x, -1) have opposite signs and almost the same magnitude, so we get catastrophic loss of precision.

@WarrenWeckesser
Copy link
Member

The indentation in the test function is wrong--it should be four spaces, not five.

@WarrenWeckesser
Copy link
Member

The CDF implementation using owens_t also has problems. For example:

In [52]: skewnorm.cdf(-8, 1)
Out[52]: -8.4802547311258769e-30

With scipy 1.0.0, we get

In [5]: skewnorm.cdf(-8, 1)
Out[5]: 3.8700350444395389e-31

According to Wolfram Alpha, the correct answer is 3.870035046664392611e-31.

@WarrenWeckesser WarrenWeckesser added the needs-work Items that are pending response from the author label Feb 24, 2018
@josef-pkt
Copy link
Member

problems in cdf and sf should be symmetric for opposite signs of x and a.

one possibility would be to recompute using super for edge cases where numerical integration (scipy 1.0) works better.

@ev-br
Copy link
Member Author

ev-br commented Feb 24, 2018

The current version fixes a problem which was reported by a user for somewhat large arguments. At still larger arguments, cdf is exponentially small and the current version suffers from catastrophic cancellation. In that regime, one likely needs to figure out the asymptotic expansion of the Owen's T function. While this could be a fun project, and the answer is likely available from the literature (or possibly even implemented inside of scipy.special.owen_t), it is out of scope for this PR.

@WarrenWeckesser
Copy link
Member

Evgeni, so what do you think is the next step? I don't think this pull request should be merged in its current state. It fixes one problem but creates a new problem.

@person142
Copy link
Member

A small thought: the test case skewnorm(x, 1) is the worst possible scenario because when a = 1 Owen's T function is 0.5*Phi(x)*(1 - Phi(x)). Of course this means there will also be problems around a = 1, but in this case you can use method T6 from Patefield and Tandy which decomposes Owen's T as 0.5*Phi(x)*(1 - Phi(x)) + [stuff] and cancel the common Phi(x) terms.

Haven't thought about whether there's cancellation in other regimes.

@WarrenWeckesser
Copy link
Member

Haven't thought about whether there's cancellation in other regimes.

It is not hard to find examples where the CDF implementation using Owen's T can result in big errors even when a is not 1. In this session, I don't have this pull request installed, so I'll define a one-liner called skewnorm_cdf(x, a) that uses Owen's T to compute the CDF.

In [61]: import scipy

In [62]: scipy.__version__
Out[62]: '1.1.0.dev0+895a774'

In [63]: from scipy.stats import skewnorm

Here's the CDF implemented with Owen's T function:

In [64]: from scipy.special import ndtr, owens_t

In [65]: def skewnorm_cdf(x, a):
    ...:     return ndtr(x) - 2*owens_t(x, a)
    ...: 

A couple cases where the error is big:

x = -4, a = 2. The correct value (according to Wolfram Alpha) is 8.1298399188811398e-21.

In [66]: skewnorm_cdf(-4, 2)
Out[66]: -1.0164395367051604e-19

In [67]: skewnorm.cdf(-4, 2)
Out[67]: 8.1298399223752232e-21

x = -2, a = 5. The correct value is 1.55326826787106273e-26.

In [105]: skewnorm_cdf(-2, 5)
Out[105]: -2.0816681711721685e-17

In [106]: skewnorm.cdf(-2, 5)
Out[106]: 1.5532682678674716e-26

@WarrenWeckesser
Copy link
Member

WarrenWeckesser commented Feb 25, 2018

An alternative to using Owen's T is to split the integral of the PDF at x=0, to ensure that quad "sees" the significant part of the PDF. If I define the following in skew_norm_gen, the problem shown in #7746 is fixed, and the other examples I gave above result in the "correct" values (relative error less than 1e-9).

    def _cdf_single(self, x, *args):
        if x < 0:
            cdf = integrate.quad(self._pdf, self.a, x, args=args)[0]
        else:
            t1 = integrate.quad(self._pdf, self.a, 0, args=args)[0]
            t2 = integrate.quad(self._pdf, 0, x, args=args)[0]
            cdf = t1 + t2
        if cdf > 1:
            # Presumably numerical noise, e.g. 1.0000000000000002
            cdf = 1.0
        return cdf

It would still be nice to get the CDF and SF functions working reliably with Owen's T, because it is much faster. For example,

In [19]: from scipy.special import ndtr, owens_t

In [20]: def skewnorm_cdf(x, a):
    ...:     return ndtr(x) - 2*owens_t(x, a)
    ...: 

In [21]: skewnorm.cdf(3, 2)  # Implemented using quad, split at x=0.
Out[21]: 0.99730020393729912

In [22]: skewnorm_cdf(3, 2)  # Implemented using Owen's T function.
Out[22]: 0.99730020393729946

In [23]: %timeit skewnorm.cdf(3, 2)
334 µs ± 1.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [24]: %timeit skewnorm_cdf(3, 2)   # Much faster to use Owen's T.
2.7 µs ± 19.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@WarrenWeckesser
Copy link
Member

@josef-pkt wrote

problems in cdf and sf should be symmetric for opposite signs of x and a.

Indeed, for the skew normal distribution, sf(x, a) = cdf(-x, -a). So we can implement _sf as

    def _sf(self, x, a):
        return self._cdf(-x, -a)

and focus on getting the CDF function right.

@WarrenWeckesser
Copy link
Member

It might take a while for someone to resolve the problem of the catastrophic loss of precision, so in the meantime, I incorporated my comments above into an alternative pull request: #8501

@ev-br
Copy link
Member Author

ev-br commented Mar 3, 2018

Closing this in favor of gh-8501

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-work Items that are pending response from the author scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

False CDF values for skew normal distribution
4 participants