Add the limiting distributions to generalized Pareto distribution with shapes c=0 and c=-1 #3225

Merged
merged 5 commits into scipy:master from ev-br:genpareto on Sep 9, 2014

7 participants

@ev-br
SciPy member

closes gh-1295

The implementation here is similar to the original one by @pbrod, as listed in gh-1295.

Several points I'd like to flag for a review:

  • The exact properties of the c → 0 limit. For example, I'm currently skipping the (otherwise failing) test for continuity of the ppf at small c --- I first wrote the test without much thought, and I'm not sure whether the test's requirement is too stringent. The specific question here, I guess, is whether the Box-Cox transform converges uniformly as lambda → 0.
  • We might want to add a new ufunc for computing log(1 + a*x)/x with the correct behavior at x=0, instead of a private helper in genpareto.
  • _munp and entropy are not properly vectorized at the moment (neither in master, nor in this PR).
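For the second bullet, one possible shape for such a helper is sketched below (the name and placement are hypothetical; this is not an existing scipy ufunc):

```python
import numpy as np

def log1pax_over_x(x, a):
    # log(1 + a*x)/x with the limiting value a at x = 0
    # (hypothetical helper, illustrating the desired corner-case behavior)
    x = np.asarray(x, dtype=float)
    safe = np.where(x == 0, 1.0, x)          # avoid 0/0 in the masked branch
    return np.where(x == 0, a, np.log1p(a * x) / safe)
```

The point of ufunc-ifying it would be to get this x = 0 handling (and broadcasting) once, centrally, instead of re-deriving it inside each distribution.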
@coveralls

Coverage Status

Coverage remained the same when pulling 4c9413d on ev-br:genpareto into dc7555b on scipy:master.

@josef-pkt
SciPy member

about ppf: I don't know the new boxcox, but in the old version we have 1-eps - 1 which becomes mostly noise for c<1e-8

my guess is that it's a purely numerical problem that won't go away without using a (Taylor) series expansion around c=0 or something like that

>>> q = np.linspace(0., 1., 30, endpoint=False)
>>> stats.genpareto.ppf(q, 1e-8) - stats.expon.ppf(q)
array([  0.00000000e+00,  -3.44033249e-09,   6.59031606e-09,
         4.65046893e-09,  -1.35370382e-09,   1.05718484e-09,
        -3.23048349e-09,   4.84360874e-09,  -4.33792879e-09,
        -9.90649074e-09,  -2.17329854e-09,   6.37755515e-09,
         2.47717025e-09,  -2.54236632e-11,  -5.40378220e-09,
         3.53435514e-09,   1.01246712e-08,   2.18065133e-09,
         3.03871706e-10,  -8.03573652e-10,   1.36105660e-09,
         6.01152550e-09,  -1.86942684e-09,   1.36590266e-08,
         3.83822663e-09,   2.70998719e-08,   2.38693882e-08,
         2.95770421e-08,   2.74037433e-08,   5.31425592e-08])
>>> stats.genpareto.ppf(q, 1e-10) - stats.expon.ppf(q)
array([  0.00000000e+00,   2.18604272e-07,   8.28155354e-07,
        -3.50620899e-07,   2.42895362e-07,  -7.31690011e-07,
         1.74405200e-07,  -1.50587615e-07,  -8.03698507e-07,
        -2.54155556e-07,  -5.57284811e-07,   6.72511370e-07,
        -9.07905710e-07,  -5.99545857e-07,  -3.82879611e-07,
         5.80850328e-07,  -8.11440367e-07,   8.23745690e-07,
         7.55255528e-07,  -2.22848179e-07,   2.35655171e-08,
        -3.27055382e-07,   1.97970718e-07,  -2.30590039e-07,
        -8.84340193e-07,   6.04415845e-07,   7.78821045e-07,
        -3.03489865e-07,  -8.60774676e-07,  -2.79924348e-07])
>>> stats.genpareto.ppf(q, 1e-14) - stats.expon.ppf(q)
array([ 0.        ,  0.01050737, -0.00237949,  0.00566179, -0.00987408,
       -0.00468587, -0.00109895,  0.00075036,  0.00070752, -0.00140358,
       -0.00578482,  0.00953527, -0.00012303,  0.00933194, -0.00688377,
       -0.00480891, -0.0071884 ,  0.00752147, -0.00590785, -0.00410139,
       -0.01059372, -0.00493194,  0.01051179,  0.01020716, -0.01071676,
        0.00680183,  0.00570288,  0.0066788 ,  0.00089398, -0.00391493])
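The noise above comes from the naive form ((1-q)**(-c) - 1)/c; a sketch of the stable rewrite via expm1/log1p (mathematically the same quantity, and equivalent to what a fixed boxcox computes) shows the difference:

```python
import numpy as np

def ppf_naive(q, c):
    # old genpareto._ppf: ((1-q)**(-c) - 1)/c, catastrophic cancellation
    # for tiny c
    return (np.power(1.0 - q, -c) - 1.0) / c

def ppf_stable(q, c):
    # identical quantity rewritten as expm1(-c*log1p(-q))/c
    return np.expm1(-c * np.log1p(-q)) / c

q = np.linspace(0.0, 1.0, 30, endpoint=False)
# for c = 1e-14 the stable form matches expon.ppf(q) = -log1p(-q) to
# machine precision, while the naive form is off by ~1e-2 (as shown above)
```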
@pbrod

The scipy.special.boxcox for lmbda != 0 is implemented as (pow(x, lmbda) - 1.0) / lmbda and is numerically unstable for small lmbda.

A better solution in the _ppf method is to replace the call to boxcox with the following:

    def _ppf(self, q, c):
        x = -log(-log(q))
        return _lazywhere((x == x) & (c != 0), (x, c),
                          lambda x, c: -expm1(-c*x) / c, x)

Replacing with the above you will get the machine precision as shown here:

In [36]: q = np.linspace(0., 1., 30, endpoint=False)
In [37]: c=1e-8;(np.abs(genpareto.cdf(genpareto.ppf(q, c),c) - q))
Out[37]:
array([ 0.00000000e+00, 0.00000000e+00, 1.38777878e-17,
0.00000000e+00, 0.00000000e+00, 2.77555756e-17,
0.00000000e+00, 5.55111512e-17, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 1.11022302e-16,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
1.11022302e-16, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00])

In [40]: c=1e-15;(np.abs(genpareto.cdf(genpareto.ppf(q, c),c) - q))
Out[40]:
array([ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 2.77555756e-17, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 1.11022302e-16, 0.00000000e+00,
0.00000000e+00, 1.11022302e-16, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
1.11022302e-16, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00])

@pv
SciPy member
pv commented Jan 19, 2014

There's scipy.special.boxcox.

If there's something to improve, the improvement should be done in scipy.special.

@ev-br ev-br commented on an outdated diff Jan 19, 2014
scipy/stats/_continuous_distns.py
def _ppf(self, q, c):
- vals = 1.0/c * (pow(1-q, -c)-1)
- return vals
+ return -boxcox(1. - q, -c)
@ev-br
SciPy member
ev-br added a line comment Jan 19, 2014

@pv which is exactly what's used here

@ev-br
SciPy member

@pbrod would you be interested in fixing special.boxcox like you've shown?

@pv
SciPy member
pv commented Jan 19, 2014

A suitable fix is probably to use for |c| << log(x) the expansion (x^c - 1)/c = (exp(c log(x)) - 1)/c = sum_{n=1}^inf c^{n-1} log(x)^n/n!.
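A truncated version of that series can be sketched as follows (illustrative only; each term is generated from the previous one by the ratio c*log(x)/(n+1)):

```python
import numpy as np

def boxcox_series(x, c, nterms=15):
    # truncated sum_{n=1}^inf c**(n-1) * log(x)**n / n!  for (x**c - 1)/c
    lx = np.log(x)
    term = lx              # n = 1 term: log(x)
    total = term
    for n in range(1, nterms):
        term *= c * lx / (n + 1)   # ratio of consecutive series terms
        total += term
    return total
```

For |c·log(x)| << 1 a handful of terms already reaches machine precision, since the terms shrink factorially.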

@pbrod

Using scipy.special.boxcox is not a good idea for small q in genpareto._ppf. Even if you replace the (pow(x, lmbda) - 1.0) / lmbda part in scipy.special.boxcox with expm1(lmbda*log(x))/lmbda, you will still lose precision in genpareto.ppf for small q, because x = 1-q.

@WarrenWeckesser

special.boxcox should be fixed, and since I'm the one responsible for the naive implementation, I'm happy to fix it. (I'm already experimenting with the series given by @pv.) However, @pbrod is correct: if q is near zero, then the battle for precision is lost as soon as you compute 1 - q, regardless of how boxcox is implemented.

@pv
SciPy member
pv commented Jan 19, 2014

If boxcox(1 - q, c) is common, it may make sense to extend the boxcox function to support also this, e.g. boxcox(-q, c, at_1=True).

@josef-pkt
SciPy member

q near 0 or near 1 is a bit of a different issue, because we can use isf or ppf depending on whether we are in the upper or lower tail, I think. (*)
(Otherwise I didn't pay enough attention to understand the numerics of the different solutions.)

(*) so far a choice by the user. We run into problems in some cases when we do use ppf in the extreme upper tail, as in #3214.
ppf in this case is also used for the rvs.

If boxcox(1 - q, c) is common

I have no idea how common the extreme cases are, I think usually not common with actual data.

@ev-br
SciPy member

I think a good API would be to separate the offset explicitly: boxcox(x, lmbda, x0=0), calculating ((x+x0)**lmbda - 1)/lmbda. This way a user can pick the right one, and the implementation has a chance to do the right thing under the hood.

I don't think there's much boxcox in the scipy codebase so far, given that it was only introduced in #3150.

Overall, I think it's worth it to grow the collection of ufuncs for, loosely speaking, 'simply-looking combinations of elementary functions with all the corner cases taken care of' (xlogy, log1p, boxcox and so on).

@WarrenWeckesser

I updated boxcox and added the new function boxcox1p here: #3229

It was sufficient to express the function using expm1 and either log or log1p --- no need for the series expansion.

Consistent with the functions expm1, log1p and xlog1py, I added the new function boxcox1p instead of adding additional arguments to boxcox.
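The expm1/log1p rewrite can be sketched in a few lines (a simplified scalar sketch, not the actual scipy.special implementation from gh-3229):

```python
import numpy as np

def boxcox_sketch(x, lmbda):
    # (x**lmbda - 1)/lmbda rewritten as expm1(lmbda*log(x))/lmbda;
    # the lmbda -> 0 limit is log(x)
    if lmbda == 0:
        return np.log(x)
    return np.expm1(lmbda * np.log(x)) / lmbda

def boxcox1p_sketch(y, lmbda):
    # ((1+y)**lmbda - 1)/lmbda, stable for small y via log1p;
    # the lmbda -> 0 limit is log1p(y)
    if lmbda == 0:
        return np.log1p(y)
    return np.expm1(lmbda * np.log1p(y)) / lmbda
```

For lmbda = 1e-12 these stay within a few ulp of the limiting log/log1p, where the naive pow-based form loses roughly half its significant digits to cancellation.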

@ev-br
SciPy member

Incorporated the new boxcox, boxcox1p, added an explicit _isf method, and squashed it all into a single commit.

@josef-pkt
SciPy member

looks good to me

@coveralls

Coverage Status

Coverage remained the same when pulling 9c3c02f on ev-br:genpareto into 6b6b41a on scipy:master.

@pbrod pbrod commented on an outdated diff Jan 22, 2014
scipy/stats/_continuous_distns.py
def _munp(self, n, c):
- k = arange(0, n+1)
- val = (-1.0/c)**n * sum(comb(n, k)*(-1)**k / (1.0-c*k), axis=0)
- return where(c*n < 1, val, inf)
+ if c != 0:
+ k = arange(0, n+1)
+ val = (-1.0/c)**n * sum(comb(n, k)*(-1)**k / (1.0-c*k), axis=0)
+ return where(c*n < 1, val, inf)
+ else:
+ return gam(n+1)
@pbrod
pbrod added a line comment Jan 22, 2014

Vectorization of _munp can be done like this

def _munp(self, n, c):
    def __munp(n, c):
        val = 0.0
        k = arange(0, n + 1)
        for ki, cnk in zip(k, comb(n, k)):
            val = val + cnk * (-1) ** ki / (1.0 - c * ki)
        return where(c * n < 1, val * (-1.0 / c) ** n, inf)
    munp = lambda c: __munp(n, c)
    return _lazywhere(c != 0, (c,), munp, gam(n + 1))
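For readers unfamiliar with the private helper, a simplified stand-in for `_lazywhere` (not scipy's actual implementation; same-shape inputs assumed) behaves like this:

```python
import numpy as np

def lazywhere(cond, arrays, f, fillvalue):
    # simplified stand-in for scipy's private _lazywhere: evaluate f only
    # where cond is True, use fillvalue elsewhere
    cond = np.asarray(cond)
    out = np.full(cond.shape, fillvalue, dtype=float)
    out[cond] = f(*(np.asarray(a, dtype=float)[cond] for a in arrays))
    return out

# e.g. 1/c where c != 0, with a fallback elsewhere:
c = np.array([2.0, 0.0, 4.0])
vals = lazywhere(c != 0, (c,), lambda c: 1.0 / c, np.inf)  # [0.5, inf, 0.25]
```

The "lazy" part is the point: the callable only ever sees arguments where the condition holds, so it never evaluates 1/0 (or, in _munp above, the c = 0 singularity).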
@pbrod pbrod and 2 others commented on an outdated diff Jan 23, 2014
scipy/stats/_continuous_distns.py
def _logpdf(self, x, c):
- return (-1.0-1.0/c) * np.log1p(c*x)
+ return -(c + 1.) * self._log1pcx(x, c)
@pbrod
pbrod added a line comment Jan 23, 2014

genpareto.logpdf(inf, 0.1) returns nan; the correct answer is -inf. (This failure is due to np.log1p(inf) returning nan in numpy version 1.8.0.)
genpareto.logpdf(1, -1) returns nan; the correct answer is 0.

@WarrenWeckesser
WarrenWeckesser added a line comment Jan 23, 2014

@pbrod: I can't reproduce the np.log1p bug. Using both 1.7.1 and 1.8.0, np.log1p(np.inf) returns inf. Perhaps it is platform dependent? (I'm using Ubuntu 12.04 (64 bit).) Have you reported the problem on the numpy issue tracker (https://github.com/numpy/numpy/issues)?

@pbrod
pbrod added a line comment Jan 23, 2014

I am using Windows 7 and numpy version 1.8.0 that comes with pythonxy, and yes,
I have just reported it on numpy.

@ev-br
SciPy member
ev-br added a line comment Jan 24, 2014

@pbrod what does scipy.special.log1p(inf) return on your system? --- apparently, there's one from cephes/unity.c
(I cannot test it myself, since I'm on ubuntu precise at the moment)

@pbrod
pbrod added a line comment Jan 24, 2014

However, scipy.special.log1p gives the correct answer on my system:

In [2]: scipy.special.log1p(inf)
Out[2]: inf

@ev-br
SciPy member
ev-br added a line comment Jan 24, 2014

Great! Would you be able to run the full stats test suite on your machine, using the updated PR?

@pbrod
pbrod added a line comment Jan 24, 2014
@ev-br
SciPy member

In the last commit I've replaced the call to np.log1p with scipy.special.log1p in genpareto and elsewhere in _continuous_distns. All tests keep passing locally, but this needs to be tested on Windows (numpy/numpy#4225).

@coveralls

Coverage Status

Coverage remained the same when pulling 9fc4d7b on ev-br:genpareto into 6b6b41a on scipy:master.

@pbrod pbrod commented on the diff Jan 24, 2014
scipy/stats/_continuous_distns.py
@@ -11,11 +11,11 @@
from scipy import special
from scipy import optimize
from scipy import integrate
-from scipy.special import (gammaln as gamln, gamma as gam)
+from scipy.special import (gammaln as gamln, gamma as gam, boxcox, boxcox1p)
@pbrod
pbrod added a line comment Jan 24, 2014

Why not import log1p here?

@ev-br
SciPy member
ev-br added a line comment Jan 25, 2014

No strong preference here, mostly a matter of taste. I personally have a slight preference for being a little more explicit, and a bare log1p here would likely be read as the numpy function. Can change it if there are strong opinions though.

@pbrod pbrod and 3 others commented on an outdated diff Jan 24, 2014
scipy/stats/_continuous_distns.py
from numpy import (where, arange, putmask, ravel, sum, shape,
log, sqrt, exp, arctanh, tan, sin, arcsin, arctan,
- tanh, cos, cosh, sinh, log1p, expm1)
+ tanh, cos, cosh, sinh, expm1)
@pbrod
pbrod added a line comment Jan 24, 2014

Why not import expm1 from scipy.special also, since numpy.expm1(inf) returns nan on Windows?

@ev-br
SciPy member
ev-br added a line comment Jan 25, 2014

It does not seem to be directly related to this PR --- expm1 is only used in _cdf, and the argument does not go to positive inf. (BTW, boxcox has been reimplemented in terms of expm1 --- is it affected? ping @WarrenWeckesser)
But I agree, it might be that expm1 & log1p have to be replaced everywhere in the scipy codebase. I'll open a ticket.

@WarrenWeckesser
WarrenWeckesser added a line comment Jan 25, 2014

Why wouldn't the argument of _cdf go to inf? The result should be 1. If we use the buggy version of np.expm1, _cdf(inf, c) will return nan.

@josef-pkt
SciPy member
josef-pkt added a line comment Jan 25, 2014

Having a correct _cdf(inf, c) is a bonus and desired. However, IIRC, cdf wouldn't/shouldn't delegate to _cdf when x equals the upper support bound .b (inf here), and the wrapper code should return 1.

@ev-br
SciPy member
ev-br added a line comment Jan 25, 2014

@WarrenWeckesser indeed, I was wrong.
The last commit uses special.expm1

@WarrenWeckesser
WarrenWeckesser added a line comment Jan 25, 2014

@josef-pkt: Good point. So it is not essential that _cdf handle inf, but it would be nice.

@coveralls

Coverage Status

Coverage remained the same when pulling 4bfdceb on ev-br:genpareto into 6b6b41a on scipy:master.

@pv pv added the PR label Feb 19, 2014
@ev-br
SciPy member

Rebased, added a vectorized _munp implementation (vectorization beats readability, huh), and force-pushed the result. I believe I've incorporated all review comments.

@coveralls

Coverage Status

Coverage remained the same when pulling 6169f67 on ev-br:genpareto into 32cd96d on scipy:master.

@pbrod

It looks good, except that genpareto.logpdf(1, -1) still returns nan; the correct answer is 0.

@ev-br
SciPy member

@pbrod does the last commit fix it for you? [I cannot reproduce it locally, so I'm reduced to guessing and trying.] @pv this involves a small tweak in special.

@coveralls

Coverage Status

Changes Unknown when pulling d4a484d on ev-br:genpareto into scipy:master.

@WarrenWeckesser

This does not fix the problem that @pbrod pointed out. The problem is not in log1p; it is in _logpdf(). _log1pcx correctly returns inf, but then in _logpdf that value is multiplied by (c + 1.) (which is 0), and 0 * inf is nan.

You could refactor a bit, and maybe use xlog1py, but my initial impression is that somewhere in the code you'll need special cases for both c = 0 and c = -1.
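The failure mode can be reproduced in isolation (log1pcx here is an illustrative stand-in for the PR's private _log1pcx helper, and xlog1py mimics the convention of special.xlog1py):

```python
import numpy as np

def log1pcx(x, c):
    # illustrative stand-in for the PR's _log1pcx helper: log(1 + c*x)/c
    return np.log1p(c * x) / c

def xlog1py(a, y):
    # a * log1p(y) with the convention 0 * log1p(y) = 0, as in special.xlog1py
    return np.where(a == 0, 0.0, a * np.log1p(y))

c, x = -1.0, 1.0
with np.errstate(divide='ignore', invalid='ignore'):
    broken = -(c + 1.0) * log1pcx(x, c)    # prefactor 0, log factor inf: nan
    fixed = -xlog1py(c + 1.0, c * x) / c   # 0.0, the uniform logpdf on [0, 1]
```

The xlog1py-style product sidesteps the indeterminate 0 * inf, which is why the refactoring suggested here works for c = -1.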

@ev-br
SciPy member

Ah, yes, you're right. It's not just tweaking the plumbing: at c=-1, genpareto must reduce to a uniform distribution on [0, 1], which this implementation does not. Will revert the last commit and look at this a bit more.

@ev-br
SciPy member

In the last commit I'm using xlog1py (thanks @WarrenWeckesser) to special-case both c=0 and c=-1.

@coveralls

Coverage Status

Coverage remained the same when pulling 5c1ec5a on ev-br:genpareto into b54a499 on scipy:master.

@coveralls

Coverage Status

Coverage remained the same when pulling e4949fc on ev-br:genpareto into b54a499 on scipy:master.

ev-br added some commits Jan 18, 2014
@ev-br ev-br ENH: add the limit shape c=0 to genpareto distribution
For c=0, genpareto is equivalent to the exponential
distribution.
3a8fec3
@ev-br ev-br BUG: use special.log1p instead of np.log1p
np.log1p(np.inf) is reported to produce nans on some platforms,
see numpy issue gh-4225
1dd2992
@ev-br ev-br BUG: use scipy.special.expm1 instead of np.expm1 733d320
@ev-br ev-br ENH: vectorize genpareto._munp
implementation is by @pbrod
540c23f
@ev-br ev-br BUG: special-case genpareto(c=-1) 9718599
@ev-br ev-br changed the title from Add the limiting exponential distribution to generalized Pareto distribution with shape c=0 to Add the limiting distributions to generalized Pareto distribution with shapes c=0 and c=-1 Jul 23, 2014
@ev-br
SciPy member

Rebased, changed the title to better reflect the content of the PR (c=0 and c=-1). I believe I've addressed all the review comments.

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 9718599 on ev-br:genpareto into 686537d on scipy:master.

@pv pv removed the PR label Aug 13, 2014
@ev-br ev-br added the enhancement label Sep 5, 2014
@ev-br ev-br added this to the 0.15.0 milestone Sep 5, 2014
@ev-br
SciPy member

I think it'd be nice to have it in 0.15 if time permits.

@pbrod

I agree..

@argriffing argriffing merged commit 8b8531d into scipy:master Sep 9, 2014

1 check passed

Details: continuous-integration/travis-ci --- The Travis CI build passed
@argriffing

Thanks for making these improvements! I especially agree with:

Overall, I think it's worth it to grow the collection of ufuncs for, loosely speaking, 'simply-looking combinations of elementary functions with all the corner cases taken care of'

@ev-br
SciPy member

@argriffing you've implemented a nice bunch of them recently, have you not :-)

@ev-br ev-br deleted the ev-br:genpareto branch Sep 9, 2014
@argriffing

Yes I added some weird functions that have been graciously merged, but they are not yet ufunc-ified and they do not yet deal with all corner cases.

@ev-br
SciPy member

Ah, good to know! I think it'd be very useful to actually make them ufuncs and fix up all the corner cases

@argriffing

I think it'd be very useful to actually make them ufuncs and fix up all the corner cases

PR #3981

@argriffing

Adding boxcox inverse transform ufuncs to scipy.special could help clean up

def _logpdf(self, x, c):
    return _lazywhere((x == x) & (c != 0), (x, c),
        lambda x, c: -special.xlog1py(c+1., c*x) / c, -x)

as well as helping answer http://stackoverflow.com/questions/26391454
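A hedged sketch of what such an inverse transform could look like (hypothetical helper, not an existing scipy.special function at the time of this discussion):

```python
import numpy as np

def inv_boxcox_sketch(y, lmbda):
    # invert y = (x**lmbda - 1)/lmbda for x, i.e. x = (1 + lmbda*y)**(1/lmbda),
    # computed stably as exp(log1p(lmbda*y)/lmbda); the lmbda -> 0 limit
    # is exp(y)
    if lmbda == 0:
        return np.exp(y)
    return np.exp(np.log1p(lmbda * y) / lmbda)
```

It round-trips the forward transform: feeding it boxcox(2, 0.5) = (sqrt(2) - 1)/0.5 recovers 2 to machine precision.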

@ev-br
SciPy member

Yeah, it could. Open an issue for this? So that it's more visible
