
Add the limiting distributions to generalized Pareto distribution with shapes c=0 and c=-1 #3225

Merged
merged 5 commits into scipy:master from ev-br:genpareto on Sep 9, 2014

Conversation

@ev-br (Member) commented Jan 19, 2014

closes gh-1295

The implementation here is similar to the original one by @pbrod, as listed in gh-1295.

Several points I'd like to flag for review:

  • The exact properties of the c → 0 limit. For example, I'm currently skipping the (otherwise failing) test for continuity of the ppf at small c. I wrote that test without much thought, and I'm not sure whether its requirement is too stringent. The specific question, I guess, is whether the Box-Cox transform converges uniformly as lambda → 0.
  • We might want to add a new ufunc for computing log(1 + a*x)/x with the correct behavior at x = 0, instead of a private helper in genpareto (a rough sketch follows this list).
  • _munp and entropy are not properly vectorized at the moment (neither in master nor in this PR).
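A minimal sketch of what such a helper might look like (illustration only; the name and signature are made up and this is not the private helper used in the PR):

    import numpy as np

    def log1p_ax_over_x(a, x):
        # log(1 + a*x)/x, with the limiting value a at x = 0
        a, x = np.asarray(a, dtype=float), np.asarray(x, dtype=float)
        with np.errstate(divide='ignore', invalid='ignore'):
            val = np.log1p(a * x) / x
        # as x -> 0, log(1 + a*x)/x -> a
        return np.where(x == 0, a, val)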

@coveralls

Coverage Status

Coverage remained the same when pulling 4c9413d on ev-br:genpareto into dc7555b on scipy:master.

@josef-pkt (Member)

about ppf: I don't know the new boxcox, but in the old version we have 1 - eps - 1, which becomes mostly noise for c < 1e-8.

My guess is that it's a purely numerical problem that won't go away without using a (Taylor) series expansion around c = 0 or something like that.

>>> q = np.linspace(0., 1., 30, endpoint=False)
>>> stats.genpareto.ppf(q, 1e-8) - stats.expon.ppf(q)
array([  0.00000000e+00,  -3.44033249e-09,   6.59031606e-09,
         4.65046893e-09,  -1.35370382e-09,   1.05718484e-09,
        -3.23048349e-09,   4.84360874e-09,  -4.33792879e-09,
        -9.90649074e-09,  -2.17329854e-09,   6.37755515e-09,
         2.47717025e-09,  -2.54236632e-11,  -5.40378220e-09,
         3.53435514e-09,   1.01246712e-08,   2.18065133e-09,
         3.03871706e-10,  -8.03573652e-10,   1.36105660e-09,
         6.01152550e-09,  -1.86942684e-09,   1.36590266e-08,
         3.83822663e-09,   2.70998719e-08,   2.38693882e-08,
         2.95770421e-08,   2.74037433e-08,   5.31425592e-08])
>>> stats.genpareto.ppf(q, 1e-10) - stats.expon.ppf(q)
array([  0.00000000e+00,   2.18604272e-07,   8.28155354e-07,
        -3.50620899e-07,   2.42895362e-07,  -7.31690011e-07,
         1.74405200e-07,  -1.50587615e-07,  -8.03698507e-07,
        -2.54155556e-07,  -5.57284811e-07,   6.72511370e-07,
        -9.07905710e-07,  -5.99545857e-07,  -3.82879611e-07,
         5.80850328e-07,  -8.11440367e-07,   8.23745690e-07,
         7.55255528e-07,  -2.22848179e-07,   2.35655171e-08,
        -3.27055382e-07,   1.97970718e-07,  -2.30590039e-07,
        -8.84340193e-07,   6.04415845e-07,   7.78821045e-07,
        -3.03489865e-07,  -8.60774676e-07,  -2.79924348e-07])
>>> stats.genpareto.ppf(q, 1e-14) - stats.expon.ppf(q)
array([ 0.        ,  0.01050737, -0.00237949,  0.00566179, -0.00987408,
       -0.00468587, -0.00109895,  0.00075036,  0.00070752, -0.00140358,
       -0.00578482,  0.00953527, -0.00012303,  0.00933194, -0.00688377,
       -0.00480891, -0.0071884 ,  0.00752147, -0.00590785, -0.00410139,
       -0.01059372, -0.00493194,  0.01051179,  0.01020716, -0.01071676,
        0.00680183,  0.00570288,  0.0066788 ,  0.00089398, -0.00391493])

@pbrod (Contributor) commented Jan 19, 2014

For lmbda != 0, scipy.special.boxcox is implemented as (pow(x, lmbda) - 1.0) / lmbda, which is numerically unstable for small lmbda.

A better solution in the _ppf method is to replace the call to boxcox with the following:

    def _ppf(self, q, c):
        x = -log1p(-q)
        return _lazywhere((x == x) & (c != 0), (x, c),
                          lambda x, c: expm1(c*x) / c, x)

With this replacement you get machine precision, as shown here:

In [36]: q = np.linspace(0., 1., 30, endpoint=False)
In [37]: c=1e-8;(np.abs(genpareto.cdf(genpareto.ppf(q, c),c) - q))
Out[37]:
array([ 0.00000000e+00, 0.00000000e+00, 1.38777878e-17,
0.00000000e+00, 0.00000000e+00, 2.77555756e-17,
0.00000000e+00, 5.55111512e-17, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 1.11022302e-16,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
1.11022302e-16, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00])

In [40]: c=1e-15;(np.abs(genpareto.cdf(genpareto.ppf(q, c),c) - q))
Out[40]:
array([ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 2.77555756e-17, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 1.11022302e-16, 0.00000000e+00,
0.00000000e+00, 1.11022302e-16, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
1.11022302e-16, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00])

@pv (Member) commented Jan 19, 2014

There's scipy.special.boxcox.

If there's something to improve, the improvement should be done in scipy.special.


     def _ppf(self, q, c):
-        vals = 1.0/c * (pow(1-q, -c)-1)
-        return vals
+        return -boxcox(1. - q, -c)
@ev-br (Member Author)

@pv which is exactly what's used here

@ev-br (Member Author) commented Jan 19, 2014

@pbrod would you be interested in fixing special.boxcox like you've shown?

@pv (Member) commented Jan 19, 2014

A suitable fix is probably to use for |c| << log(x) the expansion (x^c - 1)/c = (exp(c log(x)) - 1)/c = sum_{n=1}^inf c^{n-1} log(x)^n/n!.
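For illustration, a rough sketch of that truncated series (a sketch of the suggestion only; the eventual fix in scipy.special used expm1/log1p instead of a series, see below):

    import numpy as np

    def boxcox_small_lmbda(x, lmbda, terms=8):
        # (x**lmbda - 1)/lmbda as sum_{n>=1} lmbda**(n-1) * log(x)**n / n!,
        # useful only when |lmbda * log(x)| is small
        logx = np.log(x)
        term = logx                  # the n = 1 term
        total = np.zeros_like(term)
        for n in range(1, terms + 1):
            total = total + term
            term = term * lmbda * logx / (n + 1)   # ratio of term n+1 to term n
        return total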

@pbrod (Contributor) commented Jan 19, 2014

Using scipy.special.boxcox is not a good idea for small q in genpareto._ppf. Even if you replace the (pow(x, lmbda) - 1.0) / lmbda part in scipy.special.boxcox with expm1(lmbda*log(x))/lmbda, you will still lose precision in genpareto.ppf for small q, because x = 1 - q.
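A quick illustration of the 1 - q point:

>>> 1.0 - 1e-20    # q far below machine epsilon is lost entirely when forming 1 - q
1.0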

@WarrenWeckesser (Member)

special.boxcox should be fixed, and since I'm the one responsible for the naive implementation, I'm happy to fix it. (I'm already experimenting with the series given by @pv.) However, @pbrod is correct: if q is near zero, then the battle for precision is lost as soon as you compute 1 - q, regardless of how boxcox is implemented.

@pv (Member) commented Jan 19, 2014

If boxcox(1 - q, c) is common, it may make sense to extend the boxcox function to also support this, e.g. boxcox(-q, c, at_1=True).

@josef-pkt (Member)

q near 0 or near 1 is a bit of a different issue, because we can use isf or ppf depending on whether we are in the upper or lower tail, I think. (*)
(Otherwise I didn't pay enough attention to understand the numerics of the different solutions.)

(*) So far that's a choice made by the user. In some cases we run into problems when we use ppf in the extreme upper tail, as in #3214.
ppf in this case is also used for the rvs.

If boxcox(1 - q, c) is common

I have no idea how common the extreme cases are; I think they're usually not common with actual data.
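A small illustration of the isf-vs-ppf point (values are only illustrative; the point is that isf never forms 1 - p):

    from scipy import stats

    c, p = 0.1, 1e-18                       # probability mass far in the upper tail
    x_ppf = stats.genpareto.ppf(1 - p, c)   # 1 - p rounds to 1.0, so p is lost before ppf runs
    x_isf = stats.genpareto.isf(p, c)       # isf works with p directly, no 1 - p step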

@pv (Member) commented Jan 20, 2014

I have no idea how common the extreme cases are, I think usually not common with actual data.

How common is the call boxcox(1 - q, c) in the Scipy codebase? If it's more than once (or perhaps even once), it may make sense to abstract this.

@ev-br (Member Author) commented Jan 20, 2014

I think a good API would be to separate the offset explicitly: boxcox(x, lmbda, x0=0), computing ((x + x0)**lmbda - 1)/lmbda. This way a user can pick the right variant, and the implementation has a chance to do the right thing under the hood.
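A sketch of the proposed signature (purely illustrative; this API was not adopted, and scipy.special grew boxcox1p instead, see below):

    import numpy as np

    def boxcox_with_offset(x, lmbda, x0=0):
        # ((x + x0)**lmbda - 1) / lmbda; a real implementation would want
        # log1p(x) rather than log(x + x0) for the x0 = 1 case
        if lmbda == 0:
            return np.log(x + x0)
        return np.expm1(lmbda * np.log(x + x0)) / lmbda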

I don't think there's much boxcox in the scipy codebase so far, given that it was only introduced in #3150.

Overall, I think it's worth it to grow the collection of ufuncs for, loosely speaking, 'simple-looking combinations of elementary functions with all the corner cases taken care of' (xlogy, log1p, boxcox and so on).

@WarrenWeckesser (Member)

I updated boxcox and added the new function boxcox1p here: #3229

It was sufficient to express the function using expm1 and either log or log1p; no need for the series expansion.

Consistent with the functions expm1, log1p and xlog1py, I added the new function boxcox1p instead of adding additional arguments to boxcox.
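Roughly, the idea is the following (a sketch only, not the actual scipy.special implementation):

    import numpy as np

    def boxcox_sketch(x, lmbda):
        # (x**lmbda - 1)/lmbda, written via expm1 so it stays accurate for small lmbda
        if lmbda == 0:
            return np.log(x)
        return np.expm1(lmbda * np.log(x)) / lmbda

    def boxcox1p_sketch(x, lmbda):
        # ((1 + x)**lmbda - 1)/lmbda, accurate for small x and small lmbda
        if lmbda == 0:
            return np.log1p(x)
        return np.expm1(lmbda * np.log1p(x)) / lmbda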

@ev-br (Member Author) commented Jan 22, 2014

Incorporated the new boxcox, boxcox1p, added an explicit _isf method, and squashed it all into a single commit.

@josef-pkt (Member)

looks good to me

@coveralls

Coverage Status

Coverage remained the same when pulling 9c3c02f on ev-br:genpareto into 6b6b41a on scipy:master.

        val = (-1.0/c)**n * sum(comb(n, k)*(-1)**k / (1.0-c*k), axis=0)
        return where(c*n < 1, val, inf)
    else:
        return gam(n+1)
Contributor review comment:

Vectorization of _munp can be done like this

def _munp(self, n, c):
    def __munp(n, c):
        val = 0.0
        k = arange(0, n + 1)
        for ki, cnk in zip(k, comb(n, k)):
            val = val + cnk * (-1) ** ki / (1.0 - c * ki)
        return where(c * n < 1, val * (-1.0 / c) ** n, inf)
    munp = lambda c: __munp(n, c)
    return _lazywhere(c != 0, (c,), munp, gam(n + 1))

@ev-br (Member Author) commented Jan 24, 2014

In the last commit I've replaced the call to np.log1p with scipy.special.log1p in genpareto and elsewhere in _continuous_distns. All tests keep passing locally, but this needs to be tested on Windows (numpy/numpy#4225).

@coveralls

Coverage Status

Coverage remained the same when pulling 9fc4d7b on ev-br:genpareto into 6b6b41a on scipy:master.

@@ -11,11 +11,11 @@
 from scipy import special
 from scipy import optimize
 from scipy import integrate
-from scipy.special import (gammaln as gamln, gamma as gam)
+from scipy.special import (gammaln as gamln, gamma as gam, boxcox, boxcox1p)
Contributor review comment:

Why not import log1p here?

@ev-br (Member Author)

No strong preference here; it's mostly a matter of taste. I personally have a slight preference for being a little more explicit, and here at least I would expect a bare log1p to be the numpy function. Can change it if there are strong opinions though.

@coveralls

Coverage Status

Coverage remained the same when pulling 4bfdceb on ev-br:genpareto into 6b6b41a on scipy:master.

@WarrenWeckesser (Member)

This does not fix the problem that @pbrod pointed out. The problem is not in log1p; it is in _logpdf(). _log1pcx correctly returns inf, but then in _logpdf that value is multiplied by (c + 1.) (which is 0), and 0 * inf is nan.

You could refactor a bit, and maybe use xlog1py, but my initial impression is that somewhere in the code you'll need special cases for both c = 0 and c = -1.
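The mechanism in a nutshell (xlog1py is documented to return 0 when its first argument is 0):

>>> import numpy as np
>>> from scipy import special
>>> 0. * np.inf
nan
>>> special.xlog1py(0., np.inf)
0.0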

@ev-br (Member Author) commented Mar 2, 2014

Ah, yes, you're right. It's not just tweaking the plumbing: at c = -1, genpareto must reduce to a uniform distribution on [0, 1], which this implementation does not do. Will revert the last commit and look at this a bit more.

@ev-br (Member Author) commented Mar 9, 2014

In the last commit I'm using xlog1py (thanks @WarrenWeckesser) to special-case both c=0 and c=-1.

@coveralls

Coverage Status

Coverage remained the same when pulling 5c1ec5a on ev-br:genpareto into b54a499 on scipy:master.

@coveralls

Coverage Status

Coverage remained the same when pulling e4949fc on ev-br:genpareto into b54a499 on scipy:master.

For c=0, genpareto is equivalent to the exponential distribution.
np.log1p(np.inf) is reported to produce nans on some platforms; see numpy issue gh-4225.
@ev-br changed the title from "Add the limiting exponential distribution to generalized Pareto distribution with shape c=0" to "Add the limiting distributions to generalized Pareto distribution with shapes c=0 and c=-1" on Jul 23, 2014
@ev-br (Member Author) commented Jul 23, 2014

Rebased, changed the title to better reflect the content of the PR (c=0 and c=-1). I believe I've addressed all the review comments.

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 9718599 on ev-br:genpareto into 686537d on scipy:master.

@pv removed the PR label on Aug 13, 2014
@ev-br added this to the 0.15.0 milestone on Sep 5, 2014
@ev-br (Member Author) commented Sep 5, 2014

I think it'd be nice to have it in 0.15 if time permits.

@pbrod (Contributor) commented Sep 9, 2014

I agree.

argriffing added a commit that referenced this pull request Sep 9, 2014
Add the limiting distributions to generalized Pareto distribution with shapes c=0 and c=-1
@argriffing merged commit 8b8531d into scipy:master on Sep 9, 2014
@argriffing (Contributor)

Thanks for making these improvements! I especially agree with:

Overall, I think it's worth it to grow the collection of ufuncs for, loosely speaking, 'simple-looking combinations of elementary functions with all the corner cases taken care of'

@ev-br (Member Author) commented Sep 9, 2014

@argriffing you've implemented a nice bunch of them recently, have you not :-)

@ev-br deleted the genpareto branch on September 9, 2014 15:33
@argriffing (Contributor)

Yes, I added some weird functions that have been graciously merged, but they are not yet ufunc-ified and they do not yet deal with all corner cases.

@ev-br (Member Author) commented Sep 9, 2014

Ah, good to know! I think it'd be very useful to actually make them ufuncs and fix up all the corner cases.

@argriffing (Contributor)

I think it'd be very useful to actually make them ufuncs and fix up all the corner cases

PR #3981

@argriffing (Contributor)

Adding boxcox inverse transform ufuncs to scipy.special could help clean up

def _logpdf(self, x, c):
    # (x == x) is False for nans, so they fall through to the fill value;
    # for c == 0 the fill value -x is the exponential logpdf
    return _lazywhere((x == x) & (c != 0), (x, c),
                      lambda x, c: -special.xlog1py(c+1., c*x) / c, -x)

as well as helping answer http://stackoverflow.com/questions/26391454
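For reference, a sketch of what such an inverse ufunc would compute (scipy.special later gained inv_boxcox and inv_boxcox1p; this is not their implementation):

    import numpy as np

    def inv_boxcox1p_sketch(y, lmbda):
        # invert y = ((1 + x)**lmbda - 1)/lmbda for x
        if lmbda == 0:
            return np.expm1(y)                        # inverse of log1p
        return np.expm1(np.log1p(lmbda * y) / lmbda)  # (1 + lmbda*y)**(1/lmbda) - 1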

@ev-br (Member Author) commented Oct 16, 2014

Yeah, it could. Open an issue for this, so that it's more visible?

Labels: enhancement, scipy.stats

Successfully merging this pull request may close these issues.

Add the limiting exponential distribution to the generalized pareto distribution when shape c=0. (Trac #768)