ENH: add the Edgeworth expansion based on the normal distribution #1325

Merged
merged 6 commits into from Oct 9, 2014

Projects

None yet

4 participants

@ev-br
Contributor
ev-br commented Jan 24, 2014

This PR adds the Edgeworth expansion (normal distribution + cumulants), as discussed in this ML thread:
https://groups.google.com/forum/#!topic/pystatsmodels/so83mPP9nkE

(time flies) No significant changes since that email exchange last year.

@josef-pkt
Member

Thanks Evgeni, it might take me a bit of time going back to this.

as motivation:
I coded a test recently, that used 4 term Edgeworth expansion to get better approximate p-values, because the test statistic has considerable skew in small samples.
(However, I gave up after mean and variance, because I worried finding more typos in another two pages of formulas will take too much time.)

@ev-br
Contributor
ev-br commented Jan 24, 2014

Well, I guess one advantage of this is that at least some part of the math is checked (FWIW) and covered by tests.
(on at least the chi square example, where we know the answers).
Speaking of which, I don't really understand the TravisCI failure --- distributions.test() passes locally with the scipy master, but statsmodels.test() errs left and right.

@josef-pkt
Member

My guess is that it's a backwards compatibility issue. TravisCI uses the default scipy, which is 0.9.0 in Ubuntu precise

@josef-pkt josef-pkt commented on an outdated diff Jan 24, 2014
statsmodels/distributions/edgeworth.py
+ self._coef, self._mu, self._sigma = self._compute_coefs_pdf(cum)
+ self._herm_pdf = HermiteE(self._coef)
+ if self._coef.size > 2:
+ self._herm_cdf = HermiteE(-self._coef[1:])
+ else:
+ self._herm_cdf = lambda x: 0.
+
+ # warn if pdf(x) < 0 for some values of x within 4 sigma
+ r = np.real_if_close(self._herm_pdf.roots())
+ r = (r - self._mu) / self._sigma
+ if r[(np.imag(r) == 0) & (np.abs(r) < 4)].any():
+ mesg = 'PDF has zeros at %s ' % r
+ warnings.warn(mesg, UserWarning)
+
+ kwds.update({'momtype': 0}) # use pdf, not ppf in self.moment()
+ super(ExpandedNormal, self).__init__(**kwds)
@josef-pkt
josef-pkt Jan 24, 2014 Member

given the test failure, I think kwds needs to include name

@josef-pkt josef-pkt and 1 other commented on an outdated diff Jan 24, 2014
statsmodels/distributions/tests/test_edgeworth.py
+ assert_allclose(ne2.pdf(x), stats.norm.pdf(x, loc=3, scale=2))
+
+ def test_chi2_moments(self):
+ # construct the expansion for \chi^2
+ N, df = 6, 15
+ cum = [_chi2_cumulant(n+1, df) for n in range(N)]
+ ne = ExpandedNormal(cum, name='edgw_chi2')
+
+ # compare the moments
+ assert_allclose([_chi2_moment(n, df) for n in range(N)],
+ [ne.moment(n) for n in range(N)])
+
+ # compate the pdf
+ m, s = df, np.sqrt(2*df)
+ x = np.linspace(m - s, m + s, 10)
+ assert_allclose(ne.pdf(x), stats.chi2.pdf(x, df=df),
@josef-pkt
josef-pkt Jan 24, 2014 Member

shape parameter df has to be positional in older scipy

@ev-br
ev-br Jan 24, 2014 Contributor

Indeed! Will fix both as soon as I can --- it's a bit awkward to do from
the phone :-)

/Zh
On Jan 24, 2014 4:36 PM, "Josef Perktold" notifications@github.com wrote:

In statsmodels/distributions/tests/test_edgeworth.py:

  •    assert_allclose(ne2.pdf(x), stats.norm.pdf(x, loc=3, scale=2))
    
  • def test_chi2_moments(self):
  •    # construct the expansion for \chi^2
    
  •    N, df = 6, 15
    
  •    cum = [_chi2_cumulant(n+1, df) for n in range(N)]
    
  •    ne = ExpandedNormal(cum, name='edgw_chi2')
    
  •    # compare the moments
    
  •    assert_allclose([_chi2_moment(n, df) for n in range(N)],
    
  •                    [ne.moment(n) for n in range(N)])
    
  •    # compate the pdf
    
  •    m, s = df, np.sqrt(2*df)
    
  •    x = np.linspace(m - s, m + s, 10)
    
  •    assert_allclose(ne.pdf(x), stats.chi2.pdf(x, df=df),
    

shape parameter df has to be positional in older scipy


Reply to this email directly or view it on GitHubhttps://github.com/statsmodels/statsmodels/pull/1325/files#r9154059
.

@josef-pkt
Member

It looks like you use two features of the distributions that are only in very recent scipy.
Do you have a default name in scipy master?

@ev-br
Contributor
ev-br commented Jan 25, 2014

After fixing the issues with the name and named args, there's still a numerical problem of some sort. Locally, all tests pass with scipy master, but with scipy 0.9 it fails to reconstruct a chi-square distribution [same failure on TravisCI]: pdf starts oscillating much earlier, hence ppf cannot be defined etc. Am wondering what were the relevant changes between 0.9 and 0.13. Will have a look into it.

@josef-pkt
Member

the ppf brentq error is also because of a previous version of the ppf inversion that had fixed limits, .xa, .xb.

It was change in scipy to use an expanding search if the f at the bounds don't have the same sign.
base on the error message on TravisCI in _ppf_single_call

Note the old version required a large xa and xb to be set in the definition of the distribution.

I haven't tried the example, and don't know why the function itself would be different.
Also I assume you are testing against a recent numpy with polynomial while TravisCI uses your compatibility function. I have no idea if that matters.

@jseabold
Member

Excuse to bump scipy version in master for 0.6.0?

@jseabold
Member

Also time to discuss numpy version?

@josef-pkt
Member

Ubuntu LTS (recent scikit-learn discussion) ?

some of the goodies in scipy are too recent and we would have to bump too high, many of Evgeni's improvements are only one release old.

Do we need to move to more recent numpy than 1.6 ?

@ev-br
Contributor
ev-br commented Jan 28, 2014

The current LTS has numpy 1.6 already.
Ubuntu 13.10 ships scipy 0.12 and numpy 1.7; the next LTS is going to be 14.04, due this Spring.
Unless I misread this: https://launchpad.net/ubuntu/+source/python-scipy, there's a chance of even having 0.13 in next LTS :-).
The previous LTS, 10.04 is definitely at the end of life, even its 'server edition'.

As far as this one is concerned:

  • named args & default distribution name (which require scipy 0.13) are not needed, are in fact, removed from this PR. No problem here
  • Hermite polynomials need numpy 1.6, so even the current LTS is OK.
  • Something with xa and xb: In fact I don't know when they were finally removed.

So far I did not manage to have a proper look at the failure beyond noticing that a (rather strange) combination of numpy 1.8 and scipy 0.9 still fails.

@josef-pkt josef-pkt referenced this pull request Feb 20, 2014
Merged

Bug norm expan shapes #1403

@jseabold
Member
jseabold commented Apr 3, 2014

Can you rebase this on master? We've bumped our minimum versions and we'll see how Travis does now.

@ev-br
Contributor
ev-br commented Apr 3, 2014

@jseabold Done. I don't have a proper dev setup for statsmodels at the moment, so all I can say is that tests pass locally with scipy master & numpy 1.8. TravisCI?

@jseabold
Member
jseabold commented Apr 3, 2014

Yes, travis should let us know.

@coveralls

Coverage Status

Coverage remained the same when pulling c3ffcae on ev-br:edgeworth into c70a709 on statsmodels:master.

@jseabold
Member
jseabold commented Apr 3, 2014

Failure is unrelated. Good to merge?

@ev-br
Contributor
ev-br commented Apr 3, 2014

Might want to remove the old numpy compat for _hermite (will do tomorrow)

@jseabold
Member
jseabold commented Apr 3, 2014

Ok. Ping the PR when it's ready. (I won't be online tomorrow, likely).

@coveralls

Coverage Status

Coverage remained the same when pulling b191422 on ev-br:edgeworth into c70a709 on statsmodels:master.

@ev-br
Contributor
ev-br commented Apr 4, 2014

Done.
I think it's usable as is, even though there's plenty of room for improvement. For one, ppf (hence rvs) is sloth-slow since it goes through the generic path (no Cornish-Fisher).

@jseabold
Member
jseabold commented Apr 4, 2014

Thanks. Would you mind also making a bullet point note in docs/source/release/version0.6.rst. We're trying to be better about release notes and advertising improvements.

@ev-br
Contributor
ev-br commented Apr 4, 2014

Added a line. Feel free to edit or expand if desired.

@coveralls

Coverage Status

Coverage remained the same when pulling 233b88b on ev-br:edgeworth into c70a709 on statsmodels:master.

@jseabold
Member

It looks like this could go in after a rebase?

@jseabold jseabold added this to the 0.6 milestone Sep 26, 2014
@ev-br
Contributor
ev-br commented Sep 28, 2014

rebased

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 09016f7 on ev-br:edgeworth into cfebe9a on statsmodels:master.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 3301a64 on ev-br:edgeworth into cfebe9a on statsmodels:master.

@ev-br
Contributor
ev-br commented Sep 28, 2014

I'm not sure why trying to import NoseWrapper in distributions.__init__ causes test failures.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling dec35ee on ev-br:edgeworth into cfebe9a on statsmodels:master.

@jseabold
Member
jseabold commented Oct 8, 2014

Thanks. Merge? The NoseWrapper circular imports happened with the recent warnings refactor. Probably needs a fix but not critical IMO.

@jseabold
Member
jseabold commented Oct 9, 2014

Gonna go ahead and merge this. Thanks.

@jseabold jseabold merged commit 953d944 into statsmodels:master Oct 9, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details
@ev-br ev-br referenced this pull request in scipy/scipy Dec 13, 2014
Closed

Statistics Review: pdf_fromgamma (Trac #172) #699

@ev-br ev-br deleted the ev-br:edgeworth branch Dec 13, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment