Add the limiting distributions to generalized Pareto distribution with shapes c=0 and c=-1 #3225

Merged
merged 5 commits into scipy:master from ev-br:genpareto on Sep 9, 2014

7 participants

@ev-br
SciPy member

closes gh-1295

The implementation here is similar to the original one by @pbrod, as listed in gh-1295.

Several points I'd like to flag for a review:

  • The exact properties of the c → 0 limit. For example, I'm currently skipping the (otherwise failing) test for continuity of the ppf at small c --- I first wrote the test without much thought, and I'm not sure whether the test's requirement is too stringent. The specific question here, I guess, is whether the Box-Cox transform converges uniformly as lambda → 0.
  • We might want to add a new ufunc for computing log(1 + a*x)/x with the correct behavior at x=0, instead of a private helper in genpareto.
  • _munp and entropy are not properly vectorized at the moment (neither in master, nor in this PR).
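For the second bullet, one possible shape for such a helper is sketched below (the name and placement are hypothetical; this is not an existing scipy ufunc):

```python
import numpy as np

def log1pax_over_x(x, a):
    # log(1 + a*x)/x with the limiting value a at x = 0
    # (hypothetical helper, illustrating the desired corner-case behavior)
    x = np.asarray(x, dtype=float)
    safe = np.where(x == 0, 1.0, x)          # avoid 0/0 in the masked branch
    return np.where(x == 0, a, np.log1p(a * x) / safe)
```

The point of ufunc-ifying it would be to get this x = 0 handling (and broadcasting) once, centrally, instead of re-deriving it inside each distribution.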
@coveralls

Coverage Status

Coverage remained the same when pulling 4c9413d on ev-br:genpareto into dc7555b on scipy:master.

@josef-pkt
SciPy member

about ppf: I don't know the new boxcox, but in the old version we have 1-eps - 1 which becomes mostly noise for c<1e-8

my guess is that it's a purely numerical problem that won't go away without using a (Taylor) series expansion around c=0 or something like that

>>> q = np.linspace(0., 1., 30, endpoint=False)
>>> stats.genpareto.ppf(q, 1e-8) - stats.expon.ppf(q)
array([  0.00000000e+00,  -3.44033249e-09,   6.59031606e-09,
         4.65046893e-09,  -1.35370382e-09,   1.05718484e-09,
        -3.23048349e-09,   4.84360874e-09,  -4.33792879e-09,
        -9.90649074e-09,  -2.17329854e-09,   6.37755515e-09,
         2.47717025e-09,  -2.54236632e-11,  -5.40378220e-09,
         3.53435514e-09,   1.01246712e-08,   2.18065133e-09,
         3.03871706e-10,  -8.03573652e-10,   1.36105660e-09,
         6.01152550e-09,  -1.86942684e-09,   1.36590266e-08,
         3.83822663e-09,   2.70998719e-08,   2.38693882e-08,
         2.95770421e-08,   2.74037433e-08,   5.31425592e-08])
>>> stats.genpareto.ppf(q, 1e-10) - stats.expon.ppf(q)
array([  0.00000000e+00,   2.18604272e-07,   8.28155354e-07,
        -3.50620899e-07,   2.42895362e-07,  -7.31690011e-07,
         1.74405200e-07,  -1.50587615e-07,  -8.03698507e-07,
        -2.54155556e-07,  -5.57284811e-07,   6.72511370e-07,
        -9.07905710e-07,  -5.99545857e-07,  -3.82879611e-07,
         5.80850328e-07,  -8.11440367e-07,   8.23745690e-07,
         7.55255528e-07,  -2.22848179e-07,   2.35655171e-08,
        -3.27055382e-07,   1.97970718e-07,  -2.30590039e-07,
        -8.84340193e-07,   6.04415845e-07,   7.78821045e-07,
        -3.03489865e-07,  -8.60774676e-07,  -2.79924348e-07])
>>> stats.genpareto.ppf(q, 1e-14) - stats.expon.ppf(q)
array([ 0.        ,  0.01050737, -0.00237949,  0.00566179, -0.00987408,
       -0.00468587, -0.00109895,  0.00075036,  0.00070752, -0.00140358,
       -0.00578482,  0.00953527, -0.00012303,  0.00933194, -0.00688377,
       -0.00480891, -0.0071884 ,  0.00752147, -0.00590785, -0.00410139,
       -0.01059372, -0.00493194,  0.01051179,  0.01020716, -0.01071676,
        0.00680183,  0.00570288,  0.0066788 ,  0.00089398, -0.00391493])
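The noise above comes from the naive form ((1-q)**(-c) - 1)/c; a sketch of the stable rewrite via expm1/log1p (mathematically the same quantity, and equivalent to what a fixed boxcox computes) shows the difference:

```python
import numpy as np

def ppf_naive(q, c):
    # old genpareto._ppf: ((1-q)**(-c) - 1)/c, catastrophic cancellation
    # for tiny c
    return (np.power(1.0 - q, -c) - 1.0) / c

def ppf_stable(q, c):
    # identical quantity rewritten as expm1(-c*log1p(-q))/c
    return np.expm1(-c * np.log1p(-q)) / c

q = np.linspace(0.0, 1.0, 30, endpoint=False)
# for c = 1e-14 the stable form matches expon.ppf(q) = -log1p(-q) to
# machine precision, while the naive form is off by ~1e-2 (as shown above)
```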
@pbrod

The scipy.special.boxcox for lmbda != 0 is implemented as (pow(x, lmbda) - 1.0) / lmbda and is numerically unstable for small lmbda.

A better solution in the _ppf method is to replace the call to boxcox with the following:

    def _ppf(self, q, c):
        x = -log(-log(q))
        return _lazywhere((x == x) & (c != 0), (x, c),
                          lambda x, c: -expm1(-c*x) / c, x)

Replacing with the above you will get the machine precision as shown here:

In [36]: q = np.linspace(0., 1., 30, endpoint=False)
In [37]: c=1e-8;(np.abs(genpareto.cdf(genpareto.ppf(q, c),c) - q))
Out[37]:
array([ 0.00000000e+00, 0.00000000e+00, 1.38777878e-17,
0.00000000e+00, 0.00000000e+00, 2.77555756e-17,
0.00000000e+00, 5.55111512e-17, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 1.11022302e-16,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
1.11022302e-16, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00])

In [40]: c=1e-15;(np.abs(genpareto.cdf(genpareto.ppf(q, c),c) - q))
Out[40]:
array([ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 2.77555756e-17, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 1.11022302e-16, 0.00000000e+00,
0.00000000e+00, 1.11022302e-16, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
1.11022302e-16, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00])

@pv
SciPy member
pv commented Jan 19, 2014

There's scipy.special.boxcox.

If there's something to improve, the improvement should be done in scipy.special.

@ev-br ev-br commented on an outdated diff Jan 19, 2014
scipy/stats/_continuous_distns.py
def _ppf(self, q, c):
- vals = 1.0/c * (pow(1-q, -c)-1)
- return vals
+ return -boxcox(1. - q, -c)
@ev-br
SciPy member
ev-br added a line comment Jan 19, 2014

@pv which is exactly what's used here

@ev-br
SciPy member

@pbrod would you be interested in fixing special.boxcox like you've shown?

@pv
SciPy member
pv commented Jan 19, 2014

A suitable fix is probably to use for |c| << log(x) the expansion (x^c - 1)/c = (exp(c log(x)) - 1)/c = sum_{n=1}^inf c^{n-1} log(x)^n/n!.
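A truncated version of that series can be sketched as follows (illustrative only; each term is generated from the previous one by the ratio c*log(x)/(n+1)):

```python
import numpy as np

def boxcox_series(x, c, nterms=15):
    # truncated sum_{n=1}^inf c**(n-1) * log(x)**n / n!  for (x**c - 1)/c
    lx = np.log(x)
    term = lx              # n = 1 term: log(x)
    total = term
    for n in range(1, nterms):
        term *= c * lx / (n + 1)   # ratio of consecutive series terms
        total += term
    return total
```

For |c·log(x)| << 1 a handful of terms already reaches machine precision, since the terms shrink factorially.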

@pbrod

Using scipy.special.boxcox is not a good idea for small q in genpareto._ppf. Even if you replace the (pow(x, lmbda) - 1.0) / lmbda part in scipy.special.boxcox with expm1(lmbda*log(x))/lmbda, you will still lose precision in genpareto.ppf for small q, because x = 1-q.

@WarrenWeckesser

special.boxcox should be fixed, and since I'm the one responsible for the naive implementation, I'm happy to fix it. (I'm already experimenting with the series given by @pv.) However, @pbrod is correct: if q is near zero, then the battle for precision is lost as soon as you compute 1 - q, regardless of how boxcox is implemented.

@pv
SciPy member
pv commented Jan 19, 2014

If boxcox(1 - q, c) is common, it may make sense to extend the boxcox function to support also this, e.g. boxcox(-q, c, at_1=True).

@josef-pkt
SciPy member

q near 0 or near 1 is a bit of a different issue, because we can use isf or ppf depending on whether we are in the upper or lower tail, I think. (*)
(Otherwise I didn't pay enough attention to understand the numerics of the different solutions.)

(*) so far a choice by the user. We run into problems in some cases when we do use ppf in the extreme upper tail, as in #3214.
ppf in this case is also used for the rvs.

If boxcox(1 - q, c) is common

I have no idea how common the extreme cases are, I think usually not common with actual data.

@ev-br
SciPy member

I think a good API would be to separate the offset explicitly: boxcox(x, lmbda, x0=0), calculating ((x+x0)**lmbda - 1)/lmbda. This way a user can pick the right one, and the implementation has a chance to do the right thing under the hood.

I don't think there's much boxcox in the scipy codebase so far, given that it was only introduced in #3150.

Overall, I think it's worth it to grow the collection of ufuncs for, loosely speaking, 'simply-looking combinations of elementary functions with all the corner cases taken care of' (xlogy, log1p, boxcox and so on).

@WarrenWeckesser

I updated boxcox and added the new function boxcox1p here: #3229

It was sufficient to express the function using expm1 and either log or log1p --- no need for the series expansion.

Consistent with the functions expm1, log1p and xlog1py, I added the new function boxcox1p instead of adding additional arguments to boxcox.
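The expm1/log1p rewrite can be sketched in a few lines (a simplified scalar sketch, not the actual scipy.special implementation from gh-3229):

```python
import numpy as np

def boxcox_sketch(x, lmbda):
    # (x**lmbda - 1)/lmbda rewritten as expm1(lmbda*log(x))/lmbda;
    # the lmbda -> 0 limit is log(x)
    if lmbda == 0:
        return np.log(x)
    return np.expm1(lmbda * np.log(x)) / lmbda

def boxcox1p_sketch(y, lmbda):
    # ((1+y)**lmbda - 1)/lmbda, stable for small y via log1p;
    # the lmbda -> 0 limit is log1p(y)
    if lmbda == 0:
        return np.log1p(y)
    return np.expm1(lmbda * np.log1p(y)) / lmbda
```

For lmbda = 1e-12 these stay within a few ulp of the limiting log/log1p, where the naive pow-based form loses roughly half its significant digits to cancellation.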

@ev-br
SciPy member

Incorporated the new boxcox, boxcox1p, added an explicit _isf method, and squashed it all into a single commit.

@josef-pkt
SciPy member

looks good to me

@coveralls

Coverage Status

Coverage remained the same when pulling 9c3c02f on ev-br:genpareto into 6b6b41a on scipy:master.

@pbrod pbrod commented on an outdated diff Jan 22, 2014
scipy/stats/_continuous_distns.py
def _munp(self, n, c):
- k = arange(0, n+1)
- val = (-1.0/c)**n * sum(comb(n, k)*(-1)**k / (1.0-c*k), axis=0)
- return where(c*n < 1, val, inf)
+ if c != 0:
+ k = arange(0, n+1)
+ val = (-1.0/c)**n * sum(comb(n, k)*(-1)**k / (1.0-c*k), axis=0)
+ return where(c*n < 1, val, inf)
+ else:
+ return gam(n+1)
@pbrod
pbrod added a line comment Jan 22, 2014

Vectorization of _munp can be done like this

def _munp(self, n, c):
    def __munp(n, c):
        val = 0.0
        k = arange(0, n + 1)
        for ki, cnk in zip(k, comb(n, k)):
            val = val + cnk * (-1) ** ki / (1.0 - c * ki)
        return where(c * n < 1, val * (-1.0 / c) ** n, inf)
    munp = lambda c: __munp(n, c)
    return _lazywhere(c != 0, (c,), munp, gam(n + 1))
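For readers unfamiliar with the private helper, a simplified stand-in for `_lazywhere` (not scipy's actual implementation; same-shape inputs assumed) behaves like this:

```python
import numpy as np

def lazywhere(cond, arrays, f, fillvalue):
    # simplified stand-in for scipy's private _lazywhere: evaluate f only
    # where cond is True, use fillvalue elsewhere
    cond = np.asarray(cond)
    out = np.full(cond.shape, fillvalue, dtype=float)
    out[cond] = f(*(np.asarray(a, dtype=float)[cond] for a in arrays))
    return out

# e.g. 1/c where c != 0, with a fallback elsewhere:
c = np.array([2.0, 0.0, 4.0])
vals = lazywhere(c != 0, (c,), lambda c: 1.0 / c, np.inf)  # [0.5, inf, 0.25]
```

The "lazy" part is the point: the callable only ever sees arguments where the condition holds, so it never evaluates 1/0 (or, in _munp above, the c = 0 singularity).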
@pbrod pbrod and 2 others commented on an outdated diff Jan 23, 2014
scipy/stats/_continuous_distns.py
def _logpdf(self, x, c):
- return (-1.0-1.0/c) * np.log1p(c*x)
+ return -(c + 1.) * self._log1pcx(x, c)
@pbrod
pbrod added a line comment Jan 23, 2014

genpareto.logpdf(inf, 0.1) returns nan; the correct answer is -inf. (This failure is due to np.log1p(inf) returning nan in numpy version 1.8.0.)
genpareto.logpdf(1, -1) returns nan; the correct answer is 0.

@WarrenWeckesser
WarrenWeckesser added a line comment Jan 23, 2014

@pbrod: I can't reproduce the np.log1p bug. Using both 1.7.1 and 1.8.0, np.log1p(np.inf) returns inf. Perhaps it is platform dependent? (I'm using Ubuntu 12.04 (64 bit).) Have you reported the problem on the numpy issue tracker (https://github.com/numpy/numpy/issues)?

@pbrod
pbrod added a line comment Jan 23, 2014

I am using Windows 7 and numpy version 1.8.0 that comes with pythonxy, and yes,
I have just reported it on numpy.

@ev-br
SciPy member
ev-br added a line comment Jan 24, 2014

@pbrod what does scipy.special.log1p(inf) return on your system? --- apparently, there's one from cephes/unity.c
(I cannot test it myself, since I'm on ubuntu precise at the moment)

@pbrod
pbrod added a line comment Jan 24, 2014

However, scipy.special.log1p gives the correct answer on my system:

In [2]: scipy.special.log1p(inf)
Out[2]: inf

@ev-br
SciPy member
ev-br added a line comment Jan 24, 2014

Great! Would you be able to run the full stats test suite on your machine, using the updated PR?

@pbrod
pbrod added a line comment Jan 24, 2014
@ev-br
SciPy member

In the last commit I've replaced the call to np.log1p with scipy.special.log1p in genpareto and elsewhere in _continuous_distns. All tests keep passing locally, but this needs to be tested on Windows (numpy/numpy#4225).

@coveralls

Coverage Status

Coverage remained the same when pulling 9fc4d7b on ev-br:genpareto into 6b6b41a on scipy:master.

@pbrod pbrod commented on the diff Jan 24, 2014
scipy/stats/_continuous_distns.py
@@ -11,11 +11,11 @@
from scipy import special
from scipy import optimize
from scipy import integrate
-from scipy.special import (gammaln as gamln, gamma as gam)
+from scipy.special import (gammaln as gamln, gamma as gam, boxcox, boxcox1p)
@pbrod
pbrod added a line comment Jan 24, 2014

Why not import log1p here?

@ev-br
SciPy member
ev-br added a line comment Jan 25, 2014

No strong preference here, mostly a matter of taste. I personally have a slight preference for being a little more explicit, and a bare log1p here would likely be read as the numpy function. Can change it if there are strong opinions though.

@pbrod pbrod and 3 others commented on an outdated diff Jan 24, 2014
scipy/stats/_continuous_distns.py
from numpy import (where, arange, putmask, ravel, sum, shape,
log, sqrt, exp, arctanh, tan, sin, arcsin, arctan,
- tanh, cos, cosh, sinh, log1p, expm1)
+ tanh, cos, cosh, sinh, expm1)
@pbrod
pbrod added a line comment Jan 24, 2014

Why not import expm1 from scipy.special also, since numpy.expm1(inf) returns nan on Windows?

@ev-br
SciPy member
ev-br added a line comment Jan 25, 2014

It does not seem to be directly related to this PR --- expm1 is only used in _cdf, and the argument does not go to positive inf. (BTW, boxcox has been reimplemented in terms of expm1 --- is it affected? ping @WarrenWeckesser)
But I agree, it might be that expm1 & log1p have to be replaced everywhere in the scipy codebase. I'll open a ticket.

@WarrenWeckesser
WarrenWeckesser added a line comment Jan 25, 2014

Why wouldn't the argument of _cdf go to inf? The result should be 1. If we use the buggy version of np.expm1, _cdf(inf, c) will return nan.

@josef-pkt
SciPy member
josef-pkt added a line comment Jan 25, 2014

Having a correct _cdf(inf, c) is a bonus and desired. However, IIRC, cdf wouldn't/shouldn't delegate to _cdf when x equals the upper support bound .b (inf here), and the wrapper code should return 1.

@ev-br
SciPy member
ev-br added a line comment Jan 25, 2014

@WarrenWeckesser indeed, I was wrong.
The last commit uses special.expm1

@WarrenWeckesser
WarrenWeckesser added a line comment Jan 25, 2014

@josef-pkt: Good point. So it is not essential that _cdf handle inf, but it would be nice.

@coveralls

Coverage Status

Coverage remained the same when pulling 4bfdceb on ev-br:genpareto into 6b6b41a on scipy:master.

@pv pv added the PR label Feb 19, 2014
@ev-br
SciPy member

Rebased, added a vectorized _munp implementation (vectorization beats readability, huh), and force-pushed the result. I believe I've incorporated all review comments.

@coveralls

Coverage Status

Coverage remained the same when pulling 6169f67 on ev-br:genpareto into 32cd96d on scipy:master.

@pbrod

It looks good, except that genpareto.logpdf(1, -1) still returns nan; the correct answer is 0.

@ev-br
SciPy member

@pbrod does the last commit fix it for you? [I cannot reproduce it locally, so I'm reduced to guessing and trying.] @pv this involves a small tweak in special.

@coveralls

Coverage Status

Changes Unknown when pulling d4a484d on ev-br:genpareto into scipy:master.

@WarrenWeckesser

This does not fix the problem that @pbrod pointed out. The problem is not in log1p; it is in _logpdf(). _log1pcx correctly returns inf, but then in _logpdf that value is multiplied by (c + 1.) (which is 0), and 0 * inf is nan.

You could refactor a bit, and maybe use xlog1py, but my initial impression is that somewhere in the code you'll need special cases for both c = 0 and c = -1.
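The failure mode can be reproduced in isolation (log1pcx here is an illustrative stand-in for the PR's private _log1pcx helper, and xlog1py mimics the convention of special.xlog1py):

```python
import numpy as np

def log1pcx(x, c):
    # illustrative stand-in for the PR's _log1pcx helper: log(1 + c*x)/c
    return np.log1p(c * x) / c

def xlog1py(a, y):
    # a * log1p(y) with the convention 0 * log1p(y) = 0, as in special.xlog1py
    return np.where(a == 0, 0.0, a * np.log1p(y))

c, x = -1.0, 1.0
with np.errstate(divide='ignore', invalid='ignore'):
    broken = -(c + 1.0) * log1pcx(x, c)    # prefactor 0, log factor inf: nan
    fixed = -xlog1py(c + 1.0, c * x) / c   # 0.0, the uniform logpdf on [0, 1]
```

The xlog1py-style product sidesteps the indeterminate 0 * inf, which is why the refactoring suggested here works for c = -1.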

@ev-br
SciPy member

Ah, yes, you're right. It's not just tweaking the plumbing: at c=-1, genpareto must reduce to a uniform distribution on [0, 1], which this implementation does not. Will revert the last commit and look at this a bit more.

@ev-br
SciPy member

In the last commit I'm using xlog1py (thanks @WarrenWeckesser) to special-case both c=0 and c=-1.

@coveralls

Coverage Status

Coverage remained the same when pulling 5c1ec5a on ev-br:genpareto into b54a499 on scipy:master.

@coveralls

Coverage Status

Coverage remained the same when pulling e4949fc on ev-br:genpareto into b54a499 on scipy:master.

ev-br added some commits Jan 18, 2014
@ev-br ev-br ENH: add the limit shape c=0 to genpareto distribution
For c=0, genpareto is equivalent to the exponential
distribution.
3a8fec3
@ev-br ev-br BUG: use special.log1p instead of np.log1p
np.log1p(np.inf) is reported to produce nans on some platforms,
see numpy issue gh-4225
1dd2992
@ev-br ev-br BUG: use scipy.special.expm1 instead of np.expm1 733d320
@ev-br ev-br ENH: vectorize genpareto._munp
implementation is by @pbrod
540c23f
@ev-br ev-br BUG: special-case genpareto(c=-1) 9718599
@ev-br ev-br changed the title from Add the limiting exponential distribution to generalized Pareto distribution with shape c=0 to Add the limiting distributions to generalized Pareto distribution with shapes c=0 and c=-1 Jul 23, 2014
@ev-br
SciPy member

Rebased, changed the title to better reflect the content of the PR (c=0 and c=-1). I believe I've addressed all the review comments.

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 9718599 on ev-br:genpareto into 686537d on scipy:master.

@pv pv removed the PR label Aug 13, 2014
@ev-br ev-br added the enhancement label Sep 5, 2014
@ev-br ev-br added this to the 0.15.0 milestone Sep 5, 2014
@ev-br
SciPy member

I think it'd be nice to have it in 0.15 if time permits.

@pbrod

I agree..

@argriffing argriffing merged commit 8b8531d into scipy:master Sep 9, 2014

1 check passed

Details: continuous-integration/travis-ci --- The Travis CI build passed
@argriffing

Thanks for making these improvements! I especially agree with:

Overall, I think it's worth it to grow the collection of ufuncs for, loosely speaking, 'simply-looking combinations of elementary functions with all the corner cases taken care of'

@ev-br
SciPy member

@argriffing you've implemented a nice bunch of them recently, have you not :-)

@ev-br ev-br deleted the ev-br:genpareto branch Sep 9, 2014
@argriffing

Yes I added some weird functions that have been graciously merged, but they are not yet ufunc-ified and they do not yet deal with all corner cases.

@ev-br
SciPy member

Ah, good to know! I think it'd be very useful to actually make them ufuncs and fix up all the corner cases

@argriffing

I think it'd be very useful to actually make them ufuncs and fix up all the corner cases

PR #3981

@argriffing

Adding boxcox inverse transform ufuncs to scipy.special could help clean up

def _logpdf(self, x, c):
    return _lazywhere((x == x) & (c != 0), (x, c),
        lambda x, c: -special.xlog1py(c+1., c*x) / c, -x)

as well as helping answer http://stackoverflow.com/questions/26391454
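A hedged sketch of what such an inverse transform could look like (hypothetical helper, not an existing scipy.special function at the time of this discussion):

```python
import numpy as np

def inv_boxcox_sketch(y, lmbda):
    # invert y = (x**lmbda - 1)/lmbda for x, i.e. x = (1 + lmbda*y)**(1/lmbda),
    # computed stably as exp(log1p(lmbda*y)/lmbda); the lmbda -> 0 limit
    # is exp(y)
    if lmbda == 0:
        return np.exp(y)
    return np.exp(np.log1p(lmbda * y) / lmbda)
```

It round-trips the forward transform: feeding it boxcox(2, 0.5) = (sqrt(2) - 1)/0.5 recovers 2 to machine precision.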

@ev-br
SciPy member

Yeah, it could. Open an issue for this? So that it's more visible
