
Add exponentially modified gaussian distribution (scipy.stats.expongauss) #4533

Merged
merged 1 commit into scipy:master on Mar 8, 2015

Conversation

@aconley commented Feb 18, 2015

  • 'Enhancement' request.
  • Uses already existing stats unit tests
  • No additional PEP8 failures in _continuous_distns beyond those
    already existing
  • Added to release notes for 0.16.0
  • The reference guide doesn't discuss individual distributions
    in scipy.stats, so nothing added there.
  • Example PDFs and CDFs checked against the wikipedia page
    (http://en.wikipedia.org/wiki/Exponentially_modified_Gaussian_distribution),
    random samples tested against mean, variance, histograms compared
    for various parameters.

@argriffing
Contributor

Looks good to me, if the framework handles the testing. I guess mu is left out because it's a location parameter which is handled automatically by the scipy.stats framework. I see that other distributions with constrained shape parameters have _argcheck. Should this distribution have that too?
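For reference, overriding _argcheck in the rv_continuous framework is a one-liner; a minimal sketch (demo_gen and its shape parameter are made up for illustration, and the check shown mirrors the framework default of strictly positive shapes):

```python
import numpy as np
from scipy.stats import rv_continuous

class demo_gen(rv_continuous):
    """Toy distribution with one strictly positive shape parameter K."""

    def _argcheck(self, K):
        # Return a boolean (array) marking which shape values are valid.
        return K > 0

    def _pdf(self, x, K):
        # Exponential with rate K on x >= 0, just to have a concrete pdf.
        return K * np.exp(-K * x)

demo = demo_gen(a=0.0, name='demo')
valid = demo.pdf(1.0, 2.0)     # finite density for a valid shape
invalid = demo.pdf(1.0, -1.0)  # nan, because _argcheck rejects K = -1
```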

@aconley
Author

aconley commented Feb 18, 2015

Ah hah! I was wondering how to do that. Will add.

Loc is indeed mu. Unfortunately, there doesn't seem to be a combination of sigma and lambda that can act as scale.

@ev-br
Member

ev-br commented Feb 18, 2015

I don't think this is being tested at all.
For the test loop in https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py to test this new distribution, you'd need to add an entry to the list in
https://github.com/scipy/scipy/blob/master/scipy/stats/_distr_params.py
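For context, the entries in _distr_params.py are just name/shape-tuple pairs that the test loop iterates over; an illustrative excerpt (the shape values shown for expongauss are placeholders, not the PR's actual test values):

```python
# Illustrative excerpt of the distcont list in scipy/stats/_distr_params.py:
# each entry pairs a distribution name with the shape parameters the generic
# continuous-distribution tests should use.
distcont = [
    ['expon', ()],
    ['norm', ()],
    ['expongauss', (2.0, 1.0)],  # (lam, s) -- placeholder values
]
```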


The two shape parameters for `expongauss` (``lam`` and ``s``) must
be set explicitly.
.. versionadded:: 0.16.0
Member

I'm pretty sure this needs a blank line above .. versionadded::

@ev-br added the scipy.stats and enhancement labels Feb 18, 2015
@aconley
Author

aconley commented Feb 18, 2015

@ev-br -- adding to _distr_params: wow, that's well documented... will add -- as well as a note to test_continuous_basic that you should add parameters there.

@@ -139,6 +139,7 @@ Brandon Liu for stats.combine_pvalues.
Clark Fitzgerald for namedtuple outputs in scipy.stats.
Florian Wilhelm for usage of RandomState in scipy.stats distributions.
Robert T. McGibbon for Levinson-Durbin Toeplitz solver.
Alex Conley for the Exponentiall Modified Gaussian distribution.
Contributor

Exponentiall

@ev-br
Member

ev-br commented Feb 18, 2015

@aconley Yeah, it's not too well documented :-(. Basically, there is a loop over the distributions in
https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L91.
Additionally, there is the same loop for "slow" distributions, plus a separate loop checking moments vs integration; see test_moments. Is this expongauss slow?

Parameters with which to test are pulled from _distr_params.py in
https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L15

The reason why parameters are separated is that they are also used to construct the default docstring example :-).

If you have an idea of how to document this, please do :-)

@ev-br
Member

ev-br commented Feb 18, 2015

I'd say LGTM once this is added to the main test loop over the distributions and these tests are green.
(regardless of the nitpicks about the formatting of the docstring).

If there are any additional tests, please add them to stats/tests/test_distributions.py

@josef-pkt
Member

_argcheck is not needed, AFAIK it still defaults to strictly positive. try it.

@josef-pkt
Member

adding explicit _logpdf would be useful, AFAICS

@josef-pkt
Member

also explicit _sf should have better values in the tail than going through _cdf

@josef-pkt
Member

extra task: it would be good to add new distributions to the list in the tutorial, although, IIRC we haven't added previous additions.

@aconley
Author

aconley commented Feb 18, 2015

Should all hopefully be addressed.

Turning on the unit tests revealed some overflow behavior which is now fixed. However, it does still print an overflow warning, even though the result is fine. Is it worth using numpy.seterr to turn off the overflow warning just inside the pdf function? Or is that unwanted/unneeded overhead?

@ev-br
Member

ev-br commented Feb 18, 2015

Please, no seterr in the library code. (My personal opinion, different ones exist). Have you tried 1) moving computations to log space, with _pdf(...): return exp(self._logpdf(...))?, 2) using special.erfcx?
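For what it's worth, the point of special.erfcx is that it evaluates exp(z**2)*erfc(z) without forming either extreme factor; a small sketch of the failure mode it avoids (the value of z is chosen for illustration):

```python
import numpy as np
from scipy.special import erfc, erfcx

z = 40.0

# Naive product: erfc(40) underflows to 0.0 while exp(1600.0) overflows to
# inf, so 0.0 * inf comes out as nan.
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(z**2) * erfc(z)

# erfcx computes the same quantity stably; for large z it is ~ 1/(z*sqrt(pi)).
stable = erfcx(z)
```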

@aconley
Author

aconley commented Feb 18, 2015

  1. seems like a potential loss of precision, although it would certainly work. Is avoiding an overflow warning that isn't affecting the actual outputs worth that?
  2. Is a more interesting idea -- I don't know if it will work, but it's worth a shot!

@aconley
Author

aconley commented Feb 18, 2015

Alas, erfcx doesn't help. Or, rather, it fixes the problem for large negative x but introduces a new problem for large positive x (since erfc(-1000) is 2, but erfcx(-1000) overflows).

@aconley
Author

aconley commented Feb 18, 2015

Oh, and the log approach causes a different set of warnings (divide by zero), although it again produces valid results.

@josef-pkt
Member

maybe you need to separate the lower tail and the upper tail by branching the calculations, based on your comment on erfcx versus erfc.
In general working on the wrong tail, for example in the normal distribution, can be pretty awful.

I didn't check the details: Is there some symmetry in the distribution between small and large values that could be exploited?

@aconley
Author

aconley commented Feb 18, 2015

Not that I can see.

Splitting the lower/upper tail would work -- but it's actually even simpler if I go that route. Just check for the argument to the exponential being less than, say, -700 and return zero if it is (since you're going to get that anyways!). I was trying to find a way to avoid that ugliness, but maybe numpy.vectorize is the only way...

@josef-pkt
Member

don't use numpy.vectorize, you already have broadcasted arguments, so it should be just a conditional assignment. In other cases, like 0 * log(0) we switched to helper functions.

@ev-br and others are more up to date on how this is handled now.

@josef-pkt
Member

IIRC, Evgeni added a lazy_where that can be used in these cases and avoids the calculations that cause the warning.
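The idea behind that helper (scipy's internal _lazywhere) is simple enough to sketch; this is an illustrative reimplementation, not scipy's actual code:

```python
import numpy as np

def lazywhere(cond, arrays, f, fillvalue):
    """Evaluate f only on the elements where cond is True; everywhere else
    use fillvalue, so the masked-out branch never runs and never warns."""
    cond = np.asarray(cond)
    out = np.full(cond.shape, fillvalue, dtype=float)
    out[cond] = f(*(np.asarray(a)[cond] for a in arrays))
    return out

exparg = np.array([-800.0, -1.0, 0.0])
# Only exponentiate where the argument is safe; elsewhere the limit is 0.
vals = lazywhere(exparg > -700, (exparg,), np.exp, 0.0)
```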

@ev-br
Member

ev-br commented Feb 19, 2015

Re lazywhere --- I only found myself copy-pasting a pre-existing piece of code, and refactored into a function :-). The code itself was there before me, so is it you Josef, or Per or Travis?

@ev-br
Member

ev-br commented Feb 19, 2015

The test failure for the "full" test suite seems real: the 4th moment from _stats does not seem to agree with a direct calculation from expect. Can you investigate, @aconley?

Re far tails and warnings. We already filter out warnings in the test suite, we can filter out this one too --- in the test suite, not the library code.

We already do calculations in log space in several cases, there was a discussion about this a while ago (a PR from @WarrenWeckesser IIRC).

We also have several cases where calculations fail in the "wrong" tail, so here we might just live with it.

Meanwhile I wonder if we could/should select the tails at the level of the generic cdf/sf: have cdf dispatch to either _cdf or _sf depending on the arguments. But this is clearly not for this PR.

@WarrenWeckesser
Member

... there was a discussion about this a while ago (a PR from @WarrenWeckesser IIRC)

Here: #3510

@ev-br
Member

ev-br commented Feb 19, 2015

Not that I want to be a pain in everybody's neck, but I do have a sort of general comment. This PR defines a two-parameter distribution with shape parameters lam and s. Wikipedia, http://en.wikipedia.org/wiki/Exponentially_modified_Gaussian_distribution , gives a three-parameter distribution. As discussed above, one of these parameters can be absorbed into the loc parameter, which is automatically added by the framework. That makes the total number of parameters four (lam, s, loc and scale) instead of three.

Unless I'm wrong somewhere, this distribution can indeed be written in a standardized form with a single shape parameter. The relation to the form from this PR would then be (with xi being the shape parameter):

loc = mu + lam * s**2
scale = sqrt(2) * s
xi = sqrt(2) * s * lam

Here: https://dl.dropboxusercontent.com/u/18720218/expon_modified_normal.pdf
(my handwriting is abysmal, I know)

I think it's better to standardize this distribution. (And only then deal with numerical issues at the tails, if any are left.) Thoughts?

@aconley
Author

aconley commented Feb 19, 2015

Wouldn’t it be a bit cleaner to define

loc = mu + lam * s**2
scale = s
xi = s * lam?

Then the distribution is

xi / (2*scale) exp(-(xi * x + xi^2/2)) erfc(-x / sqrt(2))

That has the advantage that the ‘canonical’ values s = 1 lam = 1 map onto
scale = 1, xi = 1.

The default loc=0 is a little weird though.

@aconley
Author

aconley commented Feb 19, 2015

Ok, even better (?)

loc = mu
s = sigma
xi = s * lambda
pdf = const * exp(-xi * x) * erfc(-(x - xi)/sqrt(2))
where const = xi / (2*s) * exp(xi^2/2)

Then the canonical version (sum of a centered gaussian with unit variance and an exponential with scale 1) has loc=0, scale=1, shape=1.
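A quick numerical cross-check that this standardized form agrees with the Wikipedia three-parameter pdf under loc = mu, scale = sigma, xi = sigma*lambda (helper names are ad hoc; note the normalization constant carries exp(+xi^2/2)):

```python
import math

def emg_pdf_wiki(x, mu, sigma, lam):
    # Three-parameter EMG pdf as given on the Wikipedia page.
    return (lam / 2.0) * math.exp((lam / 2.0) * (2*mu + lam*sigma**2 - 2*x)) \
        * math.erfc((mu + lam*sigma**2 - x) / (math.sqrt(2.0) * sigma))

def emg_pdf_std(y, xi):
    # Standardized one-shape-parameter form, xi = sigma*lam.
    return (xi / 2.0) * math.exp(xi**2 / 2.0 - xi * y) \
        * math.erfc(-(y - xi) / math.sqrt(2.0))

mu, sigma, lam = 0.3, 1.5, 0.8
xi = sigma * lam
x = 2.0
a = emg_pdf_wiki(x, mu, sigma, lam)
# The usual loc/scale transform: f(x) = f_std((x - loc)/scale) / scale.
b = emg_pdf_std((x - mu) / sigma, xi) / sigma
```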

@ev-br
Member

ev-br commented Feb 19, 2015

+1 for the last parametrization.

@aconley
Author

aconley commented Feb 20, 2015

_lazywhere doesn't work for single-argument functions, so I added a _lazywhere_single -- if there's some easy way to fix _lazywhere for that, I'm happy to switch back. Ignoring that, this should

  • Fix the overflow issues
  • Change the parameterization to loc, scale, shape
  • Hopefully fix the kurtosis issue

# Avoid overflows; setting exp(exparg) to the max float works
# all right here
expval = _lazywhere_single(exparg < _maxexparg, (exparg),
                           exp, _maxfloat)
Member

_lazywhere should work with a single argument: just feed it a single-element tuple. See, for example, invgamma.stats method.

Author

Yes, silly me, I forgot the trailing , in (exparg,)
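(The trap is just Python's tuple syntax: parentheses alone don't make a tuple, the trailing comma does.)

```python
exparg = [1.0, 2.0]
assert (exparg) is exparg             # plain parentheses: still the same list
assert isinstance((exparg,), tuple)   # trailing comma: a one-element tuple
assert len((exparg,)) == 1
```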

@ev-br
Member

ev-br commented Feb 20, 2015

I made one more round of comments, mostly rather minor. Two issues remaining:

  • _rvs only handles a standardized distribution. loc and scale are handled at the generic rvs level and are not seen by _rvs.
  • the kurtosis test still fails. Might be an issue with the definition. I think stats returns the excess kurtosis, so that for a gaussian it's zero, not three:
In [6]: norm.stats(moments='mvsk')
Out[6]: (array(0.0), array(1.0), array(0.0), array(0.0))

If push comes to shove, you may just mark it as a known failure, see test_moments in test_continuous_basic.py.

One other thing which might be helpful is to spell out in the docstring the transformation from the 'wikipedia' definition --- and/or other definitions used in the literature to the standardized one here.

@josef-pkt
Member

One other thing which might be helpful is to spell out in the docstring the transformation from the 'wikipedia' definition --- and/or other definitions used in the literature to the standardized one here.

To this point: I haven't thought about how this would be implemented, but it would be good to add helper methods that convert the parameterization to something more "standard" in some cases.
lognormal is an example that shows up very often in questions.
In statsmodels I started to write helper functions/methods that translate our parameterization that is specialized for regression to the one of scipy (e.g. negativebinomial), so we can reuse scipy's rvs and other methods.

It would be just a distribution specific method without any generic support, I guess.

@ev-br
Member

ev-br commented Feb 20, 2015

@josef-pkt Would a docstring example with an explicit formula not be sufficient?

@josef-pkt
Member

It's sufficient for now, but I think eventually we should support it automatically, with code.
I'm not using lognormal very often, but each time I have to stare at it (documentation and examples and/or the math) for a while to understand and remember the parameterization.

@ev-br
Member

ev-br commented Feb 20, 2015

Let's take the discussion of parametrization to #4538 to avoid side-tracking this PR.

@aconley
Author

aconley commented Feb 20, 2015

You'll like this: it is calculating the excess kurtosis, using the formula from Wikipedia. The problem is that the Ex Kurtosis formula on Wikipedia is wrong -- it should have a 3 in front of it, not a 2 (and it simplifies much further with that change). This also fixes the test.

So, the unit tests in scipy have uncovered a wikipedia mistake.

I need to find a reference for this somewhere so I can fix wikipedia -- they probably won't like my explanation that you can just derive this from the characteristic function.

Edit: fixed on Wikipedia; their original source had the 3, so it was apparently a transcription error.
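The corrected value is easy to verify from cumulants: for X = Z + Y with Z ~ N(0,1) and Y exponential with mean K, the cumulants add, so kappa_2 = 1 + K**2 and kappa_4 = 6*K**4, giving excess kurtosis 6*K**4/(1 + K**2)**2. A quick check against the distribution as it eventually landed in scipy (under the name exponnorm):

```python
from scipy.stats import exponnorm  # the name this distribution shipped under

K = 1.5
# Cumulants of N(0,1) + Exp(mean K) add:
#   kappa2 = 1 + K**2,  kappa4 = 6*K**4
excess_kurtosis = 6.0 * K**4 / (1.0 + K**2)**2

m, v, s, k = exponnorm.stats(K, moments='mvsk')
```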

@ev-br
Member

ev-br commented Feb 20, 2015

Oh, great :-).

Two last nitpicks:

  • _maxfloat = float_info.max seems to be the same as _XMAX = np.finfo(float).machar.xmax, which is already stored via from .constants import ...
  • if you are at ease with git, could you squash your commits into a single commit and rewrite the commit message with the ENH prefix (see http://docs.scipy.org/doc/numpy/dev/gitwash/development_workflow.html).

In any case, LGTM.

@aconley aconley force-pushed the expgauss branch 2 times, most recently from 09f0831 to 700879c Compare February 20, 2015 21:27
@ev-br
Member

ev-br commented Feb 20, 2015

Trivial tweaks in the docs (`*` needs to be escaped so sphinx doesn't treat it as markup, etc.) here, ev-br/scipy@scipy:master...ev-br:pr/4533
Only the last commit is relevant, ev-br@ae9bdcb, can you take it over?

…nnorm)

* Added to release notes for 0.16.0
* Example PDFs and CDFs checked against the wikipedia page
 (http://en.wikipedia.org/wiki/Exponentially_modified_Gaussian_distribution),
 random samples tested against mean, variance, histograms compared
 for various parameters.  A few simple unit tests added against
 non loc/scale/shape parameterization statistics.
* RVS checked against PDFs for a range of parameters.
@aconley
Author

aconley commented Feb 20, 2015

Ok, done.

@argriffing
Contributor

It is kind of impressive to me that the scipy stats tests detected the excess kurtosis error on Wikipedia.

@ev-br
Member

ev-br commented Mar 4, 2015

I think I'll merge it soon-ish unless further comments.

@rgommers rgommers added this to the 0.16.0 milestone Mar 8, 2015
@ev-br
Member

ev-br commented Mar 8, 2015

No more comments, merging. Thanks @aconley !

ev-br added a commit that referenced this pull request Mar 8, 2015
Add exponentially modified gaussian distribution (scipy.stats.expongauss)
@ev-br ev-br merged commit e8e1391 into scipy:master Mar 8, 2015