# Add the limiting distributions to generalized Pareto distribution with shapes c=0 and c=-1 #3225

Merged
merged 5 commits into scipy:master from ev-br:genpareto on Sep 9, 2014

### 7 participants

SciPy member

closes gh-1295

The implementation here is similar to the original one by @pbrod, as listed in gh-1295.

Several points I'd like to flag for review:

• The exact properties of the c → 0 limit. For example, I'm currently skipping the (otherwise failing) test for continuity of the ppf at small c --- I first wrote the test without much thought, and I'm not sure whether its requirement is actually too stringent. The specific question here, I guess, is whether the Box-Cox transform is uniformly convergent as lambda → 0.
• We might want to add a new ufunc for computing log(1 + a*x)/x with the correct behavior at x=0, instead of a private helper in genpareto.
• _munp and entropy are not properly vectorized at the moment (neither in master nor in this PR).
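The log(1 + a*x)/x helper mentioned above could be sketched like this (a hypothetical standalone version for illustration, not the PR's actual private helper; `log1px_over_x` is an invented name):

```python
import numpy as np

def log1px_over_x(a, x):
    # Hypothetical sketch of a log(1 + a*x)/x ufunc with the x -> 0
    # limit handled: since log(1 + a*x) ~ a*x for small x, the limit is a.
    a = np.asarray(a, dtype=float)
    x = np.asarray(x, dtype=float)
    safe_x = np.where(x != 0, x, 1.0)   # avoid 0/0 in the masked branch
    return np.where(x != 0, np.log1p(a * x) / safe_x, a)
```

Such a ufunc would let genpareto (and genextreme, which has the same structure) share one well-tested corner-case implementation.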

Coverage remained the same when pulling 4c9413d on ev-br:genpareto into dc7555b on scipy:master.

SciPy member

about ppf: I don't know the new boxcox, but in the old version we have 1-eps - 1, which becomes mostly noise for c < 1e-8

my guess is that it's a purely numerical problem that won't go away without using a (Taylor) series expansion around c=0 or something like that

```python
>>> q = np.linspace(0., 1., 30, endpoint=False)
>>> stats.genpareto.ppf(q, 1e-8) - stats.expon.ppf(q)
array([  0.00000000e+00,  -3.44033249e-09,   6.59031606e-09,
         4.65046893e-09,  -1.35370382e-09,   1.05718484e-09,
        -3.23048349e-09,   4.84360874e-09,  -4.33792879e-09,
        -9.90649074e-09,  -2.17329854e-09,   6.37755515e-09,
         2.47717025e-09,  -2.54236632e-11,  -5.40378220e-09,
         3.53435514e-09,   1.01246712e-08,   2.18065133e-09,
         3.03871706e-10,  -8.03573652e-10,   1.36105660e-09,
         6.01152550e-09,  -1.86942684e-09,   1.36590266e-08,
         3.83822663e-09,   2.70998719e-08,   2.38693882e-08,
         2.95770421e-08,   2.74037433e-08,   5.31425592e-08])
>>> stats.genpareto.ppf(q, 1e-10) - stats.expon.ppf(q)
array([  0.00000000e+00,   2.18604272e-07,   8.28155354e-07,
        -3.50620899e-07,   2.42895362e-07,  -7.31690011e-07,
         1.74405200e-07,  -1.50587615e-07,  -8.03698507e-07,
        -2.54155556e-07,  -5.57284811e-07,   6.72511370e-07,
        -9.07905710e-07,  -5.99545857e-07,  -3.82879611e-07,
         5.80850328e-07,  -8.11440367e-07,   8.23745690e-07,
         7.55255528e-07,  -2.22848179e-07,   2.35655171e-08,
        -3.27055382e-07,   1.97970718e-07,  -2.30590039e-07,
        -8.84340193e-07,   6.04415845e-07,   7.78821045e-07,
        -3.03489865e-07,  -8.60774676e-07,  -2.79924348e-07])
>>> stats.genpareto.ppf(q, 1e-14) - stats.expon.ppf(q)
array([ 0.        ,  0.01050737, -0.00237949,  0.00566179, -0.00987408,
       -0.00468587, -0.00109895,  0.00075036,  0.00070752, -0.00140358,
       -0.00578482,  0.00953527, -0.00012303,  0.00933194, -0.00688377,
       -0.00480891, -0.0071884 ,  0.00752147, -0.00590785, -0.00410139,
       -0.01059372, -0.00493194,  0.01051179,  0.01020716, -0.01071676,
        0.00680183,  0.00570288,  0.0066788 ,  0.00089398, -0.00391493])
```


The scipy.special.boxcox for lmbda != 0 is implemented as (pow(x, lmbda) - 1.0) / lmbda and is numerically unstable for small lmbda.
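The instability can be demonstrated with a small sketch (`boxcox_naive` and `boxcox_stable` are illustrative names, not scipy functions):

```python
import math

def boxcox_naive(x, lmbda):
    # the form quoted above: subtracting 1 from pow(x, lmbda), which is
    # very close to 1 for small lmbda, cancels catastrophically
    return (math.pow(x, lmbda) - 1.0) / lmbda

def boxcox_stable(x, lmbda):
    # expm1 keeps full precision: (x**lmbda - 1)/lmbda = expm1(lmbda*log(x))/lmbda
    return math.expm1(lmbda * math.log(x)) / lmbda

# As lmbda -> 0 both should approach log(x); only the stable form does so cleanly.
```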

A better solution in the _ppf method is to replace the call to boxcox with the following:

```python
def _ppf(self, q, c):
    x = -log(-log(q))
    return _lazywhere((x == x) & (c != 0), (x, c),
                      lambda x, c: -expm1(-c*x) / c, x)
```


Replacing it with the above, you get machine precision, as shown here:

```python
In [36]: q = np.linspace(0., 1., 30, endpoint=False)

In [37]: c = 1e-8; np.abs(genpareto.cdf(genpareto.ppf(q, c), c) - q)
Out[37]:
array([  0.00000000e+00,   0.00000000e+00,   1.38777878e-17,
         0.00000000e+00,   0.00000000e+00,   2.77555756e-17,
         0.00000000e+00,   5.55111512e-17,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   1.11022302e-16,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         1.11022302e-16,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00])

In [40]: c = 1e-15; np.abs(genpareto.cdf(genpareto.ppf(q, c), c) - q)
Out[40]:
array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   2.77555756e-17,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   1.11022302e-16,   0.00000000e+00,
         0.00000000e+00,   1.11022302e-16,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         1.11022302e-16,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00])
```

SciPy member
commented Jan 19, 2014

There's scipy.special.boxcox.

If there's something to improve, the improvement should be done in scipy.special.

commented on an outdated diff Jan 19, 2014
scipy/stats/_continuous_distns.py
```diff
 def _ppf(self, q, c):
-    vals = 1.0/c * (pow(1-q, -c)-1)
-    return vals
+    return -boxcox(1. - q, -c)
```
SciPy member ev-br added a line comment Jan 19, 2014

@pv which is exactly what's used here
SciPy member

@pbrod would you be interested in fixing special.boxcox like you've shown?

SciPy member
commented Jan 19, 2014

A suitable fix is probably to use for |c| << log(x) the expansion (x^c - 1)/c = (exp(c log(x)) - 1)/c = sum_{n=1}^inf c^{n-1} log(x)^n/n!.
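As an illustration only (assuming |c·log(x)| is small enough that a few terms suffice; `boxcox_series` is an invented name), the truncated series could look like:

```python
import math

def boxcox_series(x, c, nterms=4):
    # (x**c - 1)/c via the expansion sum_{n>=1} c**(n-1) * log(x)**n / n!
    lx = math.log(x)
    return sum(c**(n - 1) * lx**n / math.factorial(n)
               for n in range(1, nterms + 1))
```

For small c this agrees with the stable expm1 form to machine precision.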

Using scipy.special.boxcox is not a good idea for small q in genpareto._ppf. Even if you replace the (pow(x, lmbda) - 1.0) / lmbda part of scipy.special.boxcox with expm1(lmbda*log(x))/lmbda, you will still lose precision in genpareto.ppf for small q, because x = 1 - q.

special.boxcox should be fixed, and since I'm the one responsible for the naive implementation, I'm happy to fix it. (I'm already experimenting with the series given by @pv.) However, @pbrod is correct: if q is near zero, then the battle for precision is lost as soon as you compute 1 - q, regardless of how boxcox is implemented.
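The 1 - q problem can be seen directly (`boxcox1p_sketch` is a hypothetical helper in the spirit of the later boxcox1p, computing boxcox(1 + y, lmbda) without ever forming 1 + y):

```python
import math

def boxcox1p_sketch(y, lmbda):
    # boxcox(1 + y, lmbda) = expm1(lmbda * log1p(y)) / lmbda,
    # computed without rounding 1 + y to a double first
    return math.expm1(lmbda * math.log1p(y)) / lmbda

# Forming 1 - q first destroys the information for tiny q:
# 1.0 - 1e-18 == 1.0 exactly in doubles, so boxcox(1 - q, c) returns 0.
# boxcox1p_sketch(-q, c) retains it, returning approximately -q.
```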

SciPy member
commented Jan 19, 2014

If boxcox(1 - q, c) is common, it may make sense to extend the boxcox function to support also this, e.g. boxcox(-q, c, at_1=True).

SciPy member

q near 0 or near 1 is a bit of a different issue, because we can use isf or ppf depending on whether we are in the upper or lower tail, I think. (*)
(Otherwise I didn't pay enough attention to understand the numerics of the different solutions.)

(*) so far a choice by the user. We run into problems in some cases when we do use ppf in the extreme upper tail, as in #3214.
ppf in this case is also used for the rvs.

If boxcox(1 - q, c) is common

I have no idea how common the extreme cases are; I think they are usually not common with actual data.

SciPy member

I think a good API would be to separate the offset explicitly: boxcox(x, lmbda, x0=0), calculating ((x+x0)**lmbda -1)/lmbda. This way both a user can pick the right one, and the implementation would have a chance to do the right thing under the hood.
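A sketch of that proposal (hypothetical signature, not implemented in scipy; `boxcox_offset` is an invented name). The point of separating the offset is that the implementation can pick a stable path, e.g. use log1p when x0 == 1:

```python
import math

def boxcox_offset(x, lmbda, x0=0.0):
    # hypothetical API sketch: ((x + x0)**lmbda - 1) / lmbda
    if x0 == 1.0:
        lg = math.log1p(x)       # exact near x = 0, unlike log(x + 1)
    else:
        lg = math.log(x + x0)
    if lmbda == 0.0:
        return lg                # the lmbda -> 0 limit is log(x + x0)
    return math.expm1(lmbda * lg) / lmbda
```

A caller wanting boxcox(1 - q, c) would pass x = -q, x0 = 1 and keep full precision for small q.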

I don't think there's much boxcox in the scipy codebase so far, given that it was only introduced in #3150.

Overall, I think it's worth it to grow the collection of ufuncs for, loosely speaking, 'simply-looking combinations of elementary functions with all the corner cases taken care of' (xlogy, log1p, boxcox and so on).

I updated boxcox and added the new function boxcox1p here: #3229

It was sufficient to express the function using expm1 and either log or log1p; no need for the series expansion.

Consistent with the functions expm1, log1p and xlog1py, I added the new function boxcox1p instead of adding additional arguments to boxcox.

SciPy member

Incorporated the new boxcox and boxcox1p, added an explicit _isf method, and squashed it all into a single commit.

SciPy member

looks good to me

Coverage remained the same when pulling 9c3c02f on ev-br:genpareto into 6b6b41a on scipy:master.

commented on an outdated diff Jan 22, 2014
scipy/stats/_continuous_distns.py
```diff
 def _munp(self, n, c):
-    k = arange(0, n+1)
-    val = (-1.0/c)**n * sum(comb(n, k)*(-1)**k / (1.0-c*k), axis=0)
-    return where(c*n < 1, val, inf)
+    if c != 0:
+        k = arange(0, n+1)
+        val = (-1.0/c)**n * sum(comb(n, k)*(-1)**k / (1.0-c*k), axis=0)
+        return where(c*n < 1, val, inf)
+    else:
+        return gam(n+1)
```
pbrod added a line comment Jan 22, 2014

Vectorization of _munp can be done like this:

```python
def _munp(self, n, c):
    def __munp(n, c):
        val = 0.0
        k = arange(0, n + 1)
        for ki, cnk in zip(k, comb(n, k)):
            val = val + cnk * (-1) ** ki / (1.0 - c * ki)
        return where(c * n < 1, val * (-1.0 / c) ** n, inf)
    munp = lambda c: __munp(n, c)
    return _lazywhere(c != 0, (c,), munp, gam(n + 1))
```
and 2 others commented on an outdated diff Jan 23, 2014
scipy/stats/_continuous_distns.py
```diff
 def _logpdf(self, x, c):
-    return (-1.0-1.0/c) * np.log1p(c*x)
+    return -(c + 1.) * self._log1pcx(x, c)
```
pbrod added a line comment Jan 23, 2014

genpareto.logpdf(inf, 0.1) returns nan. Correct answer is -inf. (This failure is due to np.log1p(inf) returning nan in numpy version 1.8.0.)
genpareto.logpdf(1, -1) returns nan. Correct answer is 0.

WarrenWeckesser added a line comment Jan 23, 2014

@pbrod: I can't reproduce the np.log1p bug. Using both 1.7.1 and 1.8.0, np.log1p(np.inf) returns inf. Perhaps it is platform dependent? (I'm using Ubuntu 12.04 (64 bit).) Have you reported the problem on the numpy issue tracker (https://github.com/numpy/numpy/issues)?

pbrod added a line comment Jan 23, 2014

I am using Windows 7 and the numpy version 1.8.0 that comes with pythonxy, and yes, I have just reported it on numpy.

SciPy member ev-br added a line comment Jan 24, 2014

@pbrod what does scipy.special.log1p(inf) return on your system? Apparently, there's one from cephes/unity.c. (I cannot test it myself, since I'm on ubuntu precise at the moment.)

pbrod added a line comment Jan 24, 2014

scipy.special.log1p(inf) returns inf, i.e. the correct answer on my system:

In [2]: scipy.special.log1p(inf)
Out[2]: inf

SciPy member ev-br added a line comment Jan 24, 2014

Great! Would you be able to run the full stats test suite on your machine, using the updated PR?

pbrod added a line comment Jan 24, 2014

I am not able to run the complete test suite on this PR. However, it works correctly when using scipy.special.log1p.
SciPy member

In the last commit I've replaced the call to np.log1p with scipy.special.log1p in genpareto and elsewhere in _continuous_distns. All tests keep passing locally, but this needs to be tested on Windows (numpy/numpy#4225).

Coverage remained the same when pulling 9fc4d7b on ev-br:genpareto into 6b6b41a on scipy:master.

commented on the diff Jan 24, 2014
scipy/stats/_continuous_distns.py
```diff
@@ -11,11 +11,11 @@
 from scipy import special
 from scipy import optimize
 from scipy import integrate
-from scipy.special import (gammaln as gamln, gamma as gam)
+from scipy.special import (gammaln as gamln, gamma as gam, boxcox, boxcox1p)
```
pbrod added a line comment Jan 24, 2014

Why not import log1p here?

SciPy member ev-br added a line comment Jan 25, 2014

No strong preference here; mostly a matter of taste. I personally have a slight preference for being a little more explicit, and here at least I would definitely expect a bare log1p to be a numpy function. Can change it if there are strong opinions, though.
and 3 others commented on an outdated diff Jan 24, 2014
scipy/stats/_continuous_distns.py
```diff
 from numpy import (where, arange, putmask, ravel, sum, shape,
                    log, sqrt, exp, arctanh, tan, sin, arcsin, arctan,
-                   tanh, cos, cosh, sinh, log1p, expm1)
+                   tanh, cos, cosh, sinh, expm1)
```
pbrod added a line comment Jan 24, 2014

Why not also import expm1 from scipy.special, since numpy.expm1(inf) returns nan on windows?

SciPy member ev-br added a line comment Jan 25, 2014

It does not seem to be directly related to this PR --- expm1 is only used in _cdf, and the argument does not go to positive inf. (BTW, boxcox has been reimplemented in terms of expm1, is it affected? ping @WarrenWeckesser) But I agree, it might be that expm1 & log1p have to be replaced everywhere in the scipy codebase. I'll open a ticket.

WarrenWeckesser added a line comment Jan 25, 2014

Why wouldn't the argument of _cdf go to inf? The result should be 1. If we use the buggy version of np.expm1, _cdf(inf, c) will return nan.

SciPy member josef-pkt added a line comment Jan 25, 2014

Having a correct _cdf(inf, c) is a bonus and desired. However, IIRC, cdf wouldn't/shouldn't delegate to _cdf if x == b == inf, and the wrapper code should return 1.

SciPy member ev-br added a line comment Jan 25, 2014

@WarrenWeckesser indeed, I was wrong. The last commit uses special.expm1.

WarrenWeckesser added a line comment Jan 25, 2014

@josef-pkt: Good point. So it is not essential that _cdf handle inf, but it would be nice.
referenced this pull request Jan 25, 2014
Open

#### corner cases in numpy.log1p and friends #3242

Coverage remained the same when pulling 4bfdceb on ev-br:genpareto into 6b6b41a on scipy:master.

referenced this pull request Feb 4, 2014
Open

#### Fix #3258 (Frechet distributions are incorrect) #3275

added the PR label Feb 19, 2014
SciPy member

Rebased, added the vectorized _munp implementation (vectorization beats readability, huh), and force-pushed the result. I believe I've incorporated all review comments.

Coverage remained the same when pulling 6169f67 on ev-br:genpareto into 32cd96d on scipy:master.

It looks good, except that genpareto.logpdf(1, -1) still returns nan. The correct answer is 0.

SciPy member

@pbrod does the last commit fix it for you? [I cannot reproduce it locally, so I'm reduced to guessing and trying.] @pv this involves a small tweak in special.

Changes Unknown when pulling d4a484d on ev-br:genpareto into scipy:master.

This does not fix the problem that @pbrod pointed out. The problem is not in log1p; it is in _logpdf(). _log1pcx correctly returns inf, but then in _logpdf that value is multiplied by (c + 1.) (which is 0), and 0 * inf is nan.

You could refactor a bit, and maybe use xlog1py, but my initial impression is that somewhere in the code you'll need special cases for both c = 0 and c = -1.
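The 0 * inf issue is exactly what the xlog1py convention avoids. A minimal stdlib sketch of that convention (`xlog1py_sketch` is illustrative, not scipy's implementation):

```python
import math

def xlog1py_sketch(x, y):
    # follows the xlog1py convention: x * log1p(y), defined as 0
    # when x == 0, even where log1p(y) would be infinite
    return 0.0 if x == 0 else x * math.log1p(y)

# Plain IEEE arithmetic gives 0 * (-inf) == nan, which is the nan
# seen in _logpdf at c = -1; the convention above returns 0 instead,
# which is the correct log-pdf of the uniform distribution.
```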

SciPy member

Ah, yes, you're right. It's not just a matter of tweaking the plumbing: at c=-1, genpareto must reduce to a uniform distribution on [0, 1], which this implementation does not do. Will revert the last commit and look at this a bit more.
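To see why: the genpareto pdf is (1 + c·x)^(-1 - 1/c), so at c = -1 the exponent -1 - 1/c is 0 and the pdf is identically 1 on [0, 1], i.e. the uniform distribution. A quick numeric check (a sketch using the pdf formula directly rather than scipy):

```python
def genpareto_pdf(x, c):
    # genpareto pdf (1 + c*x)**(-1 - 1/c); for c < 0 the support
    # is 0 <= x <= -1/c, so for c = -1 it is the interval [0, 1]
    return (1.0 + c * x) ** (-1.0 - 1.0 / c)

# at c = -1 the exponent is 0, so the pdf equals 1 across the support
```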

SciPy member

In the last commit I'm using xlog1py (thanks @WarrenWeckesser) to special-case both c=0 and c=-1.

Coverage remained the same when pulling 5c1ec5a on ev-br:genpareto into b54a499 on scipy:master.

Coverage remained the same when pulling e4949fc on ev-br:genpareto into b54a499 on scipy:master.

added some commits Jan 18, 2014

• ev-br: ENH: add the limit shape c=0 to genpareto distribution. For c=0, genpareto is equivalent to the exponential distribution. (3a8fec3)
• ev-br: BUG: use special.log1p instead of np.log1p. np.log1p(np.inf) is reported to produce nans on some platforms, see numpy issue gh-4225. (1dd2992)
• ev-br: BUG: use scipy.special.expm1 instead of np.expm1 (733d320)
• ev-br: ENH: vectorize genpareto._munp. Implementation is by @pbrod. (540c23f)
• ev-br: BUG: special-case genpareto(c=-1) (9718599)
changed the title from "Add the limiting exponential distribution to generalized Pareto distribution with shape c=0" to "Add the limiting distributions to generalized Pareto distribution with shapes c=0 and c=-1" Jul 23, 2014
SciPy member

Rebased, changed the title to better reflect the content of the PR (c=0 and c=-1). I believe I've addressed all the review comments.

Coverage increased (+0.02%) when pulling 9718599 on ev-br:genpareto into 686537d on scipy:master.

removed the PR label Aug 13, 2014
added the enhancement label Sep 5, 2014
added this to the 0.15.0 milestone Sep 5, 2014
SciPy member

I think it'd be nice to have it in 0.15 if time permits.

I agree.

merged commit 8b8531d into scipy:master Sep 9, 2014

#### 1 check passed

continuous-integration/travis-ci: The Travis CI build passed

Thanks for making these improvements! I especially agree with:

Overall, I think it's worth it to grow the collection of ufuncs for, loosely speaking, 'simply-looking combinations of elementary functions with all the corner cases taken care of'

SciPy member

@argriffing you've implemented a nice bunch of them recently, have you not :-)

deleted the ev-br:genpareto branch Sep 9, 2014

Yes, I added some weird functions that have been graciously merged, but they are not yet ufunc-ified and they do not yet deal with all corner cases.

SciPy member

Ah, good to know! I think it'd be very useful to actually make them ufuncs and fix up all the corner cases.

I think it'd be very useful to actually make them ufuncs and fix up all the corner cases

PR #3981

Adding boxcox inverse transform ufuncs to scipy.special could help clean up

```python
def _logpdf(self, x, c):
    return _lazywhere((x == x) & (c != 0), (x, c),
                      lambda x, c: -special.xlog1py(c+1., c*x) / c, -x)
```

as well as helping answer http://stackoverflow.com/questions/26391454
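For readers unfamiliar with it, _lazywhere is a private scipy.stats helper. Roughly, its behavior can be sketched like this (a simplified version, not the actual implementation; it assumes same-shaped inputs):

```python
import numpy as np

def lazywhere_sketch(cond, arrays, f, fillvalue):
    # Evaluate f only on the elements where cond is True, and use
    # fillvalue (scalar or broadcastable array) everywhere else.
    # This avoids evaluating f on arguments that would raise warnings
    # or produce nans, e.g. a division by c at c == 0.
    arrays = [np.asarray(a, dtype=float) for a in arrays]
    cond = np.broadcast_to(np.asarray(cond, dtype=bool), arrays[0].shape)
    out = np.array(np.broadcast_to(np.asarray(fillvalue, dtype=float), cond.shape))
    out[cond] = f(*[a[cond] for a in arrays])
    return out
```

In the snippet above, the lambda is only applied where c != 0, and the exponential-limit value -x fills the c == 0 entries.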

SciPy member

Yeah, it could. Open an issue for this? So that it's more visible
