Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor stats/distributions.py #2724

Merged
merged 9 commits into from
Sep 22, 2013
Merged

refactor stats/distributions.py #2724

merged 9 commits into from
Sep 22, 2013

Conversation

ev-br
Copy link
Member

@ev-br ev-br commented Aug 14, 2013

Deduplicate a bit of code in stats/distributions.py: move exact duplicates from rv_continuous and rv_discrete to rv_generic, fix a couple of typos.

npt.assert_equal(distfn.logsf(x, *args), [0.0, -np.inf])

npt.assert_equal(distfn.ppf([0.0, 1.0], *args), x)
npt.assert_equal(distfn.isf([0.0, 1.0], *args), x[::-1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong for discrete distributions. (I always have to draw graphs and look at the definition, and never remember exactly)

scipy 0.0

>>> stats.poisson.isf(1, 1)
-1.0
>>> stats.poisson.ppf(0, 1)
-1.0
>>> 

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@josef-pkt yes, indeed. That's why these tests are only run for the continuous distributions. In fact these tests can be removed: these were basically for me to see if there's a clean way of moving cdf and friends to rv_generic (and no, one of the blockers is the different behaviour at the edges of the support)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I didn't realize it's only in the continuous tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the tests are good, I don't think the edge behavior is tested otherwise.

@josef-pkt
Copy link
Member

There is a lot of moving around, that's difficult to review.

Needs a check of test coverage, because I would have thought that we cannot combine these for discrete and continuous.
entropy doesn't have an generic tests AFAIR

@ev-br
Copy link
Member Author

ev-br commented Aug 14, 2013

@josef-pkt That's precisely why I kept very atomic commits, almost one commit per function moved. Does looking at individual commits help?
Moreover, commit messages list (I think) all changes in functionality other than copy-paste.

Entropy: deliberately breaking the generic entropy (returning 42*output instead of output) produces four test failures in test_distributions (TestBinom, TestHypergeom, TestBernoulli, and TestRvDiscrete).

@josef-pkt
Copy link
Member

Entropy: deliberately breaking the generic entropy (returning 42*output instead of output) produces four test failures in test_distributions (TestBinom, TestHypergeom, TestBernoulli, and TestRvDiscrete).

But IIRC they have explicit implementation of entropy and don't use the generic path.

a basic test for the changes of entropy would be to compare .entropy with the generic implementation .vecentropy or whatever that is called.

I will try to go over the changes later today or in the next days.

@ev-br
Copy link
Member Author

ev-br commented Aug 14, 2013

@josef-pkt Is this sort of what you mean (line 463 of test_continuous_basic.py):

+def check_vecentropy(distfn, args):
+    npt.assert_equal( distfn.vecentropy(*args), distfn._entropy(*args) )

@ev-br
Copy link
Member Author

ev-br commented Aug 21, 2013

To help reviewing this, here's a gist having three versions of the functions moved: the first 'revision' is functions from the rv_discrete of the current master, the second 'revision' is the same functions from rv_continuous (master as well), and the third revision is rv_generic from the PR.
https://gist.github.com/bburlo/6292751
Hope it provide at least some line-by-line diffs.

@josef-pkt
Copy link
Member

@josef-pkt https://github.com/josef-pkt Is this sort of what you mean
(line 463 of test_continuous_basic.py):

+def check_vecentropy(distfn, args):

  • npt.assert_equal( distfn.vecentropy(_args), distfn._entropy(_args) )

Yes, that should check the generic (numerical integration) solution with
the explicit ones, when they are defined

two comments:

  • I expect that there are several bugs in specific distributions, but this
    should be a better way of identifying the cases without too many false
    positives
  • this is a "tautology" if a distribution doesn't specify _entropy, it
    just calculates twice the same thing.

To the second point: Since you are working on this now, maybe you can pick
up an idea that I never implemented: Inspect whether the specific
distribution defines the ._xxx method *) and only then compare with the
generic implementation like in this case.
This would also be useful for the higher order moments k=3,4 and skew and
kurtosis in stats.
*) I asked the question on the dev (IIRC) mailing list, and Robert Kern
answered how to find which class in the class hierarchy defines a method.

@ev-br
Copy link
Member Author

ev-br commented Aug 21, 2013

@josef-pkt Agree, these tests not failing might indicate a tautology rather than a bug. Nonetheless, none of them fail at the moment --- so that all distributions which do define _entropy seem to be in the clear.

Avoiding the tautology is definitely doable (by walking the __bases__ or maybe there's a smarter way), but is it critical for the tests really?

@ev-br
Copy link
Member Author

ev-br commented Aug 21, 2013

$ grep 'def _entropy' distributions.py |wc -l
48

Subtract two for rv_discrete & rv_continuous.

@josef-pkt
Copy link
Member

I'm very surprised that all the tests pass, I thought I saw some failures with random tests before (but maybe my information is outdated, there have been several bug-fixes since I looked at this.)

Avoiding the tautology is not critical at all, more of a cleanup job to avoid unnecessary time used by the test suite.

@ev-br
Copy link
Member Author

ev-br commented Aug 21, 2013

@josef-pkt ooops, my bad. it was only enabled for continuous distributions with numargs==0. Have just pushed a commit with these tests enabled for all distributions (other than those marked as 'slow' --- these tests do not run as a part of 'usual' test suite).

Locally all tests pass (python2.7, numpy1.7.1). There are several floating-point warnings (foldnorm and logistic) --- both spurious in the sense you mentioned.

@ev-br
Copy link
Member Author

ev-br commented Aug 21, 2013

for the tautology: do you think it would make sense to open a separate issue for cleaning up the test suite?

@josef-pkt
Copy link
Member

Locally all tests pass

Sounds too good to be true, without verifying

I think I interpreted vecentropy wrongly, it just delegates to the _entropy method, not to the generic calculations like vecppf and veccdf
https://github.com/scipy/scipy/blob/master/scipy/stats/distributions.py#L1069

looks like all tautologies (except for vectorization)

I don't know yet how to access the generic calculation. I never really looked at entropy very closely.

@josef-pkt
Copy link
Member

for the tautology: do you think it would make sense to open a separate
issue for cleaning up the test suite?

Yes,
In a quick search I only found gh-1795 but I thought there are several more
for improving the stats.distributions test suite

@josef-pkt
Copy link
Member

I don't know why entropy follows a different pattern.
related: I was puzzled why entropy needs the vectorize workaround when ppf and cdf don't, mentioned here gh-2702
but reviewing that sounds like more than this PR needs to handle

@josef-pkt
Copy link
Member

see gh-2765 for a quick try on the entropy bugs (after outsourcing the numerical integration code for _entropy)

@rgommers
Copy link
Member

I think it's time to finish reviewing and merge this. @EvgeniBurovski could you rebase it first? Also, with an interactive rebase you can remove the TST: enable check_vecentropy commit instead of reverting it.

Lift moment, _munp, __call__ and freeze to rv_generic.
rv_continuous version had a guard for vecentropy w/ numargs ==0, kept it.
Changed betwen rv_continuous and rv_discrete were:
* rv_c had guards for nan**1.5; kept.
* rv_c had a shortcut for no valid arguments; kept.
Also trivial renamings: _veccdf vs _cdfvec etc.
rv_discrete(rv_continuous) version used bitwise(logical) and.
Kept the logical_and.
@ev-br
Copy link
Member Author

ev-br commented Sep 15, 2013

@rgommers done. I understand it's not easy to review, let me know if there's something I can do to simplify it (https://gist.github.com/EvgeniBurovski/6292751)

some individual commits here show leftovers from conflict merges, but the final code does not have any. Let me know if this is ok.

The last commit (0fa9789) differs from the rest since it's not just copy-paste, it actually does change things a little (confines inspect to __init__s only)

@rgommers
Copy link
Member

No worries, I think this is reviewable commit by commit. And our test suite has expanded quite a bit recently, which also helps.

@rgommers
Copy link
Member

This all looks good. Moving the inspect makes sense, doesn't really help for speed but keeps the magic in one place. The leftover merge thinks are not that bad - they will make git bisect skip a commit here and there, but other than that it doesn't really hurt.

@rgommers
Copy link
Member

Useful cleanup, some things had diverged where they shouldn't have. Thanks @EvgeniBurovski

rgommers added a commit that referenced this pull request Sep 22, 2013
@rgommers rgommers merged commit 29f1330 into scipy:master Sep 22, 2013
@ev-br ev-br deleted the stats_refactor branch August 12, 2015 14:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants