
Add exponentially modified gaussian distribution (scipy.stats.expongauss) #4533

Merged
merged 1 commit into scipy:master on Mar 8, 2015

Conversation

@aconley commented Feb 18, 2015

  • 'Enhancement' request.
  • Uses already existing stats unit tests
  • No additional PEP8 failures in _continuous_distns beyond those
    already existing
  • Added to release notes for 0.16.0
  • The reference guide doesn't discuss individual distributions
    in scipy.stats, so nothing added there.
  • Example PDFs and CDFs checked against the wikipedia page
    (http://en.wikipedia.org/wiki/Exponentially_modified_Gaussian_distribution),
    random samples tested against mean, variance, histograms compared
    for various parameters.

@argriffing
Contributor

Looks good to me, if the framework handles the testing. I guess mu is left out because it's a location parameter which is handled automatically by the scipy.stats framework. I see that other distributions with constrained shape parameters have _argcheck. Should this distribution have that too?
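For reference, overriding _argcheck in the rv_continuous framework is a one-liner; a minimal sketch (demo_gen and its shape parameter are made up for illustration, and the check shown mirrors the framework default of strictly positive shapes):

```python
import numpy as np
from scipy.stats import rv_continuous

class demo_gen(rv_continuous):
    """Toy distribution with one strictly positive shape parameter K."""

    def _argcheck(self, K):
        # Return a boolean (array) marking which shape values are valid.
        return K > 0

    def _pdf(self, x, K):
        # Exponential with rate K on x >= 0, just to have a concrete pdf.
        return K * np.exp(-K * x)

demo = demo_gen(a=0.0, name='demo')
valid = demo.pdf(1.0, 2.0)     # finite density for a valid shape
invalid = demo.pdf(1.0, -1.0)  # nan, because _argcheck rejects K = -1
```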

@aconley
Author

aconley commented Feb 18, 2015

Ah hah! I was wondering how to do that. Will add.

Loc is indeed mu. Unfortunately, there doesn't seem to be a combination of sigma and lambda that can act as scale.

@ev-br
Member

ev-br commented Feb 18, 2015

I don't think this is being tested at all.
For the test loop in https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py to test this new distribution, you'd need to add an entry to the list in
https://github.com/scipy/scipy/blob/master/scipy/stats/_distr_params.py
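For context, the entries in _distr_params.py are just name/shape-tuple pairs that the test loop iterates over; an illustrative excerpt (the shape values shown for expongauss are placeholders, not the PR's actual test values):

```python
# Illustrative excerpt of the distcont list in scipy/stats/_distr_params.py:
# each entry pairs a distribution name with the shape parameters the generic
# continuous-distribution tests should use.
distcont = [
    ['expon', ()],
    ['norm', ()],
    ['expongauss', (2.0, 1.0)],  # (lam, s) -- placeholder values
]
```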


The two shape parameters for `expongauss` (``lam`` and ``s``) must
be set explicitly.
.. versionadded:: 0.16.0
Member

I'm pretty sure this needs a blank line above .. versionadded::

@ev-br added the scipy.stats and enhancement labels Feb 18, 2015
@aconley
Author

aconley commented Feb 18, 2015

@ev-br -- adding to _distr_params: wow, that's well documented... will add -- as well as a note to test_continuous_basic that you should add parameters there.

@@ -139,6 +139,7 @@ Brandon Liu for stats.combine_pvalues.
Clark Fitzgerald for namedtuple outputs in scipy.stats.
Florian Wilhelm for usage of RandomState in scipy.stats distributions.
Robert T. McGibbon for Levinson-Durbin Toeplitz solver.
Alex Conley for the Exponentiall Modified Gaussian distribution.
Contributor

Exponentiall

@ev-br
Member

ev-br commented Feb 18, 2015

@aconley Yeah, it's not too well documented :-(. Basically, there is a loop over the distributions in
https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L91.
Additionally, there is the same loop for "slow" distributions, plus a separate loop checking moments vs integration; see test_moments. Is this expongauss slow?

Parameters with which to test are pulled from _distr_params.py in
https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L15

The reason why parameters are separated is that they are also used to construct the default docstring example :-).

If you have an idea of how to document this, please do :-)

@ev-br
Member

ev-br commented Feb 18, 2015

I'd say LGTM once this is added to the main test loop over the distributions and these tests are green.
(regardless of the nitpicks about the formatting of the docstring).

If there are any additional tests, please add them to stats/tests/test_distributions.py

@josef-pkt
Member

_argcheck is not needed, AFAIK it still defaults to strictly positive. try it.

@josef-pkt
Member

adding explicit _logpdf would be useful, AFAICS

@josef-pkt
Member

also explicit _sf should have better values in the tail than going through _cdf

@josef-pkt
Member

extra task: it would be good to add new distributions to the list in the tutorial, although, IIRC we haven't added previous additions.

@aconley
Author

aconley commented Feb 18, 2015

Should all hopefully be addressed.

Turning on the unit tests revealed some overflow behavior which is now fixed. However, it does still print an overflow warning, even though the result is fine. Is it worth using numpy.seterr to turn off the overflow warning just inside the pdf function? Or is that unwanted/unneeded overhead?

@ev-br
Member

ev-br commented Feb 18, 2015

Please, no seterr in the library code. (My personal opinion, different ones exist). Have you tried 1) moving computations to log space, with _pdf(...): return exp(self._logpdf(...))?, 2) using special.erfcx?
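For what it's worth, the point of special.erfcx is that it evaluates exp(z**2)*erfc(z) without forming either extreme factor; a small sketch of the failure mode it avoids (the value of z is chosen for illustration):

```python
import numpy as np
from scipy.special import erfc, erfcx

z = 40.0

# Naive product: erfc(40) underflows to 0.0 while exp(1600.0) overflows to
# inf, so 0.0 * inf comes out as nan.
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(z**2) * erfc(z)

# erfcx computes the same quantity stably; for large z it is ~ 1/(z*sqrt(pi)).
stable = erfcx(z)
```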

@aconley
Author

aconley commented Feb 18, 2015

  1. seems like a potential loss of precision, although it would certainly work. Is avoiding an overflow warning that isn't affecting the actual outputs worth that?
  2. Is a more interesting idea -- I don't know if it will work, but it's worth a shot!

@aconley
Author

aconley commented Feb 18, 2015

Alas, erfcx doesn't help. Or, rather, it fixes the problem for large negative x but introduces a new problem for large positive x (since erfc(-1000) is 2, but erfcx(-1000) overflows).

@aconley
Author

aconley commented Feb 18, 2015

Oh, and the log approach causes a different set of warnings (divide by zero), although it again produces valid results.

@josef-pkt
Member

maybe you need to separate the lower tail and the upper tail by branching the calculations, based on your comment on erfcx versus erfc.
In general working on the wrong tail, for example in the normal distribution, can be pretty awful.

I didn't check the details: Is there some symmetry in the distribution between small and large values that could be exploited?

@aconley
Author

aconley commented Feb 18, 2015

Not that I can see.

Splitting the lower/upper tail would work -- but it's actually even simpler if I go that route. Just check for the argument to the exponential being less than, say, -700 and return zero if it is (since you're going to get that anyways!). I was trying to find a way to avoid that ugliness, but maybe numpy.vectorize is the only way...

@josef-pkt
Member

don't use numpy.vectorize, you already have broadcasted arguments, so it should be just a conditional assignment. In other cases, like 0 * log(0) we switched to helper functions.

@ev-br and others are more up to date on how this is handled now.

@josef-pkt
Member

IIRC, Evgeni added a lazy_where that can be used in these cases and avoids the calculations that cause the warning.
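The idea behind that helper (scipy's internal _lazywhere) is simple enough to sketch; this is an illustrative reimplementation, not scipy's actual code:

```python
import numpy as np

def lazywhere(cond, arrays, f, fillvalue):
    """Evaluate f only on the elements where cond is True; everywhere else
    use fillvalue, so the masked-out branch never runs and never warns."""
    cond = np.asarray(cond)
    out = np.full(cond.shape, fillvalue, dtype=float)
    out[cond] = f(*(np.asarray(a)[cond] for a in arrays))
    return out

exparg = np.array([-800.0, -1.0, 0.0])
# Only exponentiate where the argument is safe; elsewhere the limit is 0.
vals = lazywhere(exparg > -700, (exparg,), np.exp, 0.0)
```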

@ev-br
Member

ev-br commented Feb 19, 2015

Re lazywhere --- I only found myself copy-pasting a pre-existing piece of code, and refactored into a function :-). The code itself was there before me, so is it you Josef, or Per or Travis?

@ev-br
Member

ev-br commented Feb 19, 2015

The test failure for the "full" test suite seems real: the 4th moment from _stats does not seem to agree with a direct calculation from expect. Can you investigate, @aconley?

Re far tails and warnings. We already filter out warnings in the test suite, we can filter out this one too --- in the test suite, not the library code.

We already do calculations in log space in several cases, there was a discussion about this a while ago (a PR from @WarrenWeckesser IIRC).

We also have several cases where calculations fail in the "wrong" tail, so here we might just live with it.

Meanwhile I wonder if we could/should select the tails at the level of the generic cdf/sf: have cdf dispatch to either _cdf or _sf depending on the arguments. But this is clearly not for this PR.

@WarrenWeckesser
Member

... there was a discussion about this a while ago (a PR from @WarrenWeckesser IIRC)

Here: #3510

@ev-br
Member

ev-br commented Feb 19, 2015

Not that I want to be a pain in everybody's neck, but I do have a sort of general comment. This PR defines a two-parameter distribution with shape parameters lam and s. Wikipedia, http://en.wikipedia.org/wiki/Exponentially_modified_Gaussian_distribution , gives a three-parameter distribution. As discussed above, one of these parameters can be absorbed into the loc parameter, which is automatically added by the framework. That makes the total number of parameters four (lam, s, loc and scale) instead of three.

Unless I'm wrong somewhere, this distribution can indeed be written in a standardized form with a single shape parameter. The relation to the form from this PR would then be (with xi being the shape parameter):

loc = mu + lam * s**2
scale = sqrt(2) * s
xi = sqrt(2) * s * lam

Here: https://dl.dropboxusercontent.com/u/18720218/expon_modified_normal.pdf
(my handwriting is abysmal, I know)

I think it's better to standardize this distribution. (And only then deal with numerical issues at the tails, if any are left.) Thoughts?

@aconley
Author

aconley commented Feb 19, 2015

Wouldn’t it be a bit cleaner to define

loc = mu + lam * s**2
scale = s
xi = s * lam?

Then the distribution is

xi / (2*scale) exp(-(xi * x + xi^2/2)) erfc(-x / sqrt(2))

That has the advantage that the ‘canonical’ values s = 1 lam = 1 map onto
scale = 1, xi = 1.

The default loc=0 is a little weird though.

@aconley
Author

aconley commented Feb 19, 2015

Ok, even better (?)

loc = mu
s = sigma
xi = s * lambda
pdf = const * exp(-xi * x) * erfc(-(x - xi)/sqrt(2))
where const = xi / (2*s) * exp(xi^2/2)

Then the canonical version (sum of a centered gaussian with unit variance and an exponential with scale 1) has loc=0, scale=1, shape=1.
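A quick numerical cross-check that this standardized form agrees with the Wikipedia three-parameter pdf under loc = mu, scale = sigma, xi = sigma*lambda (helper names are ad hoc; note the normalization constant carries exp(+xi^2/2)):

```python
import math

def emg_pdf_wiki(x, mu, sigma, lam):
    # Three-parameter EMG pdf as given on the Wikipedia page.
    return (lam / 2.0) * math.exp((lam / 2.0) * (2*mu + lam*sigma**2 - 2*x)) \
        * math.erfc((mu + lam*sigma**2 - x) / (math.sqrt(2.0) * sigma))

def emg_pdf_std(y, xi):
    # Standardized one-shape-parameter form, xi = sigma*lam.
    return (xi / 2.0) * math.exp(xi**2 / 2.0 - xi * y) \
        * math.erfc(-(y - xi) / math.sqrt(2.0))

mu, sigma, lam = 0.3, 1.5, 0.8
xi = sigma * lam
x = 2.0
a = emg_pdf_wiki(x, mu, sigma, lam)
# The usual loc/scale transform: f(x) = f_std((x - loc)/scale) / scale.
b = emg_pdf_std((x - mu) / sigma, xi) / sigma
```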

@ev-br
Member

ev-br commented Feb 19, 2015

+1 for the last parametrization.

@aconley
Author

aconley commented Feb 20, 2015

_lazywhere doesn't work for single-argument functions, so I added a _lazywhere_single -- if there's some easy way to fix _lazywhere for that, I'm happy to switch back. Ignoring that, this should

  • Fix the overflow issues
  • Change the parameterization to loc, scale, shape
  • Hopefully fix the kurtosis issue

# Avoid overflows; setting exp(exparg) to the max float works
# all right here
expval = _lazywhere_single(exparg < _maxexparg, (exparg),
                           exp, _maxfloat)
Member

_lazywhere should work with a single argument: just feed it a single-element tuple. See, for example, invgamma.stats method.

Author

Yes, silly me, I forgot the trailing , in (exparg,)
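(The trap is just Python's tuple syntax: parentheses alone don't make a tuple, the trailing comma does.)

```python
exparg = [1.0, 2.0]
assert (exparg) is exparg             # plain parentheses: still the same list
assert isinstance((exparg,), tuple)   # trailing comma: a one-element tuple
assert len((exparg,)) == 1
```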

@ev-br
Member

ev-br commented Feb 20, 2015

I made one more round of comments, mostly rather minor. Two issues remaining:

  • _rvs only handles a standardized distribution. loc and scale are handled at the generic rvs level and are not seen by _rvs.
  • the kurtosis test still fails. Might be an issue with the definition. I think stats returns the excess kurtosis, so that for a gaussian it's zero, not three:
In [6]: norm.stats(moments='mvsk')
Out[6]: (array(0.0), array(1.0), array(0.0), array(0.0))

If push comes to shove, you may just mark it as a known failure, see test_moments in test_continuous_basic.py.

One other thing which might be helpful is to spell out in the docstring the transformation from the 'wikipedia' definition --- and/or other definitions used in the literature to the standardized one here.

@josef-pkt
Member

One other thing which might be helpful is to spell out in the docstring the transformation from the 'wikipedia' definition --- and/or other definitions used in the literature to the standardized one here.

To this point: I haven't thought about how this would be implemented, but it would be good to add helper methods that convert the parameterization to something more "standard" in some cases.
lognormal is an example that shows up very often in questions.
In statsmodels I started to write helper functions/methods that translate our parameterization that is specialized for regression to the one of scipy (e.g. negativebinomial), so we can reuse scipy's rvs and other methods.

It would be just a distribution specific method without any generic support, I guess.

@ev-br
Member

ev-br commented Feb 20, 2015

@josef-pkt Would a docstring example with an explicit formula not be sufficient?

@josef-pkt
Member

It's sufficient for now, but I think eventually we should support it automatically, with code.
I'm not using lognormal very often, but each time I have to stare at it (documentation and examples and/or the math) for a while to understand and remember the parameterization.

@ev-br
Member

ev-br commented Feb 20, 2015

Let's take the discussion of parametrization to #4538 to avoid side-tracking this PR.

@aconley
Author

aconley commented Feb 20, 2015

You'll like this: it is calculating the excess kurtosis, using the formula from Wikipedia. The problem is that the Ex Kurtosis formula on Wikipedia is wrong -- it should have a 3 in front of it, not a 2 (and it simplifies much further with that change). This also fixes the test.

So, the unit tests in scipy have uncovered a wikipedia mistake.

I need to find a reference for this somewhere so I can fix wikipedia -- they probably won't like my explanation that you can just derive this from the characteristic function.

Edit: fixed on Wikipedia; their original source had the 3, so it was apparently a transcription error.
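The corrected value is easy to verify from cumulants: for X = Z + Y with Z ~ N(0,1) and Y exponential with mean K, the cumulants add, so kappa_2 = 1 + K**2 and kappa_4 = 6*K**4, giving excess kurtosis 6*K**4/(1 + K**2)**2. A quick check against the distribution as it eventually landed in scipy (under the name exponnorm):

```python
from scipy.stats import exponnorm  # the name this distribution shipped under

K = 1.5
# Cumulants of N(0,1) + Exp(mean K) add:
#   kappa2 = 1 + K**2,  kappa4 = 6*K**4
excess_kurtosis = 6.0 * K**4 / (1.0 + K**2)**2

m, v, s, k = exponnorm.stats(K, moments='mvsk')
```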

@ev-br
Member

ev-br commented Feb 20, 2015

Oh, great :-).

Two last nitpicks:

  • _maxfloat = float_info.max seems to be the same as _XMAX = np.finfo(float).machar.xmax, which is already stored via from .constants import ...
  • if you are at ease with git, could you squash your commits into a single commit and rewrite the commit message with the ENH prefix (see http://docs.scipy.org/doc/numpy/dev/gitwash/development_workflow.html).

In any case, LGTM.

@aconley aconley force-pushed the expgauss branch 2 times, most recently from 09f0831 to 700879c Compare February 20, 2015 21:27
@ev-br
Member

ev-br commented Feb 20, 2015

Trivial tweaks in the docs (`*` needs to be escaped so sphinx doesn't treat it as markup, etc.) here, ev-br/scipy@scipy:master...ev-br:pr/4533
Only the last commit is relevant, ev-br@ae9bdcb, can you take it over?

…nnorm)

* Added to release notes for 0.16.0
* Example PDFs and CDFs checked against the wikipedia page
 (http://en.wikipedia.org/wiki/Exponentially_modified_Gaussian_distribution),
 random samples tested against mean, variance, histograms compared
 for various parameters.  A few simple unit tests added against
 non loc/scale/shape parameterization statistics.
* RVS checked against PDFs for a range of parameters.
@aconley
Author

aconley commented Feb 20, 2015

Ok, done.

@argriffing
Contributor

It is kind of impressive to me that the scipy stats tests detected the excess kurtosis error on Wikipedia.

@ev-br
Member

ev-br commented Mar 4, 2015

I think I'll merge it soon-ish unless further comments.

@rgommers rgommers added this to the 0.16.0 milestone Mar 8, 2015
@ev-br
Member

ev-br commented Mar 8, 2015

No more comments, merging. Thanks @aconley !

ev-br added a commit that referenced this pull request Mar 8, 2015
Add exponentially modified gaussian distribution (scipy.stats.expongauss)
@ev-br ev-br merged commit e8e1391 into scipy:master Mar 8, 2015