Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: stats: Add the Irwin-Hall distribution #20481

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

nimish
Copy link

@nimish nimish commented Apr 15, 2024

Reference issue

Closes #14806

What does this implement/fix?

Adds the Irwin-Hall distribution i.e. $$\sum^{n}_{k=1}U_k$$ where $U_k \sim U(0, 1)$ and are independent.

Additional information

The Bates distribution is just the IH distribution scaled by $1/n$

@nimish nimish requested a review from ev-br as a code owner April 15, 2024 02:25
@github-actions github-actions bot added scipy.stats enhancement A new feature or improvement labels Apr 15, 2024
@nimish nimish changed the title ENH: Add the Irwin-Hall distribution ENH: stats: Add the Irwin-Hall distribution Apr 15, 2024
@nimish nimish force-pushed the irwinhall branch 7 times, most recently from 7449412 to 5e1ac5a Compare April 15, 2024 14:56
@nimish
Copy link
Author

nimish commented Apr 15, 2024

@ev-br @mdhaber I can't figure out how to have the distribution reject any n that isn't a positive integer, so I didn't add it to the tests.

@mdhaber
Copy link
Contributor

mdhaber commented Apr 15, 2024

Did adding this to _argcheck not work? Take a look at some of the discrete distributions; some have this constraint.

Note that this would need to be vectorized to support array n. Also, it looks like you reversed your entries in _distr_params.py; the one in distcont should be valid.

@nimish nimish force-pushed the irwinhall branch 2 times, most recently from 961b8b9 to 3167ba1 Compare April 15, 2024 16:07
@mdhaber
Copy link
Contributor

mdhaber commented Apr 15, 2024

@nimish please run the tests locally before pushing and add [skip ci] to your commit message when you don't need CI to run. Resources are limited.

@nimish
Copy link
Author

nimish commented Apr 15, 2024

Did adding this to _argcheck not work? Take a look at some of the discrete distributions; some have this constraint.

Note that this would need to be vectorized to support array n. Also, it looks like you reversed your entries in _distr_params.py; the one in distcont should be valid.

Done.

@nimish nimish force-pushed the irwinhall branch 4 times, most recently from a624402 to 72221c0 Compare April 16, 2024 19:04
@mdhaber
Copy link
Contributor

mdhaber commented Apr 16, 2024

Mostly, but the vectorization needs to work for arbitrary shapes. As written, it only works for 1D arrays.

Also, this needs at least a test against reference values of the CDF or PDF to show that this is indeed the Irwin-Hall distribution. It's OK to let the existing test suite confirm that the distribution is self-consistent (e.g. that the PDF is the derivative of the CDF, etc.).

The references also need to be fixed; they are causing issues with the doc build. Check against other distributions. Check your spaces between .. and the number of the reference; Sphinx is very picky. There should be brackets around the citation numbers, too.

What is lol.rst?

@nimish
Copy link
Author

nimish commented Apr 16, 2024

Mostly, but the vectorization needs to work for arbitrary shapes. As written, it only works for 1D arrays.

Also, this needs at least a test against reference values of the CDF or PDF to show that this is indeed the Irwin-Hall distribution. It's OK to let the existing test suite confirm that the distribution is self-consistent (e.g. that the PDF is the derivative of the CDF, etc.).

The references also need to be fixed; they are causing issues with the doc build. Check against other distributions. Check your spaces between .. and the number of the reference; Sphinx is very picky. There should be brackets around the citation numbers, too.

What is lol.rst?

Lol.rst isnt supposed to be there, just a scratch file to test what was throwing the doc parser off.

For tests: I suppose adding a test to confirm the sum of n uniforms has the same pdf and cdf would make sense; I'll add it in.

Equivalenty the mgf being the mgf of the uniform raised to n would also work but I don't see an explicit mgf for that.

@mdhaber
Copy link
Contributor

mdhaber commented Apr 16, 2024

For tests: I suppose adding a test to confirm the sum of n uniforms has the same pdf and cdf would make sense; I'll add it in.

Well, that reminds me that this should override _rvs like that. I'm not sure what sort of test you mean. Do you mean comparing the PDF with a convolution of individual uniform PDFs? This would be ok. Or do you mean comparing a normalized histogram (from summing uniformly distributed random samples) against the PDF? The latter would be a pretty weak test.

I would prefer comparison of the output of pdf (at least) against that of UniformSumDistribution.

@nimish nimish force-pushed the irwinhall branch 3 times, most recently from 4d913f0 to 2247132 Compare April 17, 2024 06:20
@nimish
Copy link
Author

nimish commented Apr 17, 2024

For tests: I suppose adding a test to confirm the sum of n uniforms has the same pdf and cdf would make sense; I'll add it in.

Well, that reminds me that this should override _rvs like that. I'm not sure what sort of test you mean. Do you mean comparing the PDF with a convolution of individual uniform PDFs? This would be ok. Or do you mean comparing a normalized histogram (from summing uniformly distributed random samples) against the PDF? The latter would be a pretty weak test.

I would prefer comparison of the output of pdf (at least) against that of UniformSumDistribution.

Changed _rvs to just sum up n uniform samples.

Added both a K-S GoF (basically a weak histo test....) against a sum of 10 uniforms and the exact values as computed by Wolfram Alpha.

Test suite passes on my machine, at least, and I fixed up the sphinx stuff.

@nimish
Copy link
Author

nimish commented Apr 17, 2024

Sigh, apparently not. Let me fix the doc gen

@nimish
Copy link
Author

nimish commented Apr 30, 2024

BTW, no more force-pushes please. We will squash merge this at the end, and I need to see what has changed in the history.

Also, you can include [skip ci] in the body of commit messages if you don't need to re-run CI. Please do that to conserve resources when appropriate.

@mdhaber
Sure, will do both.

Though I'd add config to only ever run a single CI run at a time for PRs and/or give the PR author the ability to cancel the run if done accidentally. Personally I'd also set it to just manual runs if it's that much of an issue/run only the shallowest suite automatically.

Anything left to do before merging?

@mdhaber
Copy link
Contributor

mdhaber commented Apr 30, 2024

Though I'd add config to only ever run a single CI run at a time for PRs and/or give the PR author the ability to cancel the run if done accidentally.

Most CI is not running automatically on this PR, I think because this is your first. It's just a good habit to get into for when it does run by default.

Before merging I need to review the changes. But if those all look good and nothing major comes up, I will consider merging.

@mdhaber
Copy link
Contributor

mdhaber commented May 1, 2024

Anything left to do before merging?

Oops, yes - thanks for your patience!

All new features need an announcement at https://discuss.scientific-python.org/c/contributor/scipy/32. We used to do this on a mailing list; here is an example. One thing I imagine that people might want to comment on is the name - irwinhall vs, say, uniformsum, so that might be something to ask about in the post. Another thing to get other opinions about is what to do about the fit method.

Besides that, I un-resolved all the comments above and re-resolved the ones that are resolved. We just need to address the ones that are still unresolved:

  • We need a decision here about what to do with the fit method, given that it's essentially guaranteed to fail unless the shape parameter is fixed. I'd suggest overriding fit, raising a NotImplementedError unless the shape is fixed, and falling back to super().fit otherwise. (Update: always raising NotImplementedError is fine, too.)
  • The comment about the manual _argcheck within the loop of the rvs method is still open.
  • According to this comment, the _munp implementation needs to be corrected or removed. (And I'll need to remove the commented-out code before merging.)
  • I can add the reference for the stats method that I requested, remove the redundant test, and fix the PEP8 issue; don't worry about those.
  • According to this, we need to adjust the comment about how to create a Bates distribution. (I'll add an improved comment before merging.)
  • It would help for someone more familiar with B-splines to review the PDF implementation. (I absolutely believe it; it's just not immediately obvious to me, and there are many things to do!) (Update: Thanks @steppi!)
  • We'll need to make sure CI is green before merging. I'm running it so we can see if there's anything else that fails besides the moment check. (Update: I'll fix the lint issues before merging.)
  • Approval from discussion forum.

Thanks @nimish!

@fancidev
Copy link
Contributor

fancidev commented May 2, 2024

Overriding fit that checks n is fixed sounds appropriate. That is also why I asked if the distribution can be “easily” extended to allow real n.

UniformSum sounds intuitive to me (and it’s the name picked by Mathematica), but I’ve never used or heard of the distribution. Not sure if IrwinHall or Bates would be more familiar to people who actually use them.

@nimish
Copy link
Author

nimish commented May 2, 2024

Overriding fit that checks n is fixed sounds appropriate. That is also why I asked if the distribution can be “easily” extended to allow real n.

UniformSum sounds intuitive to me (and it’s the name picked by Mathematica), but I’ve never used or heard of the distribution. Not sure if IrwinHall or Bates would be more familiar to people who actually use them.

Doing the obvious extension to non integral 'n' by adding an extra weighted uniform breaks a lot of the nice formulas.

I could see adding a UniformSum that did accept non integer N and keeping the IH as the integral only version

IME neither seem to be used that often over just treating the sum as a sum.

@mdhaber
Copy link
Contributor

mdhaber commented May 2, 2024

Re: implementing for non-integral n - please let's not attempt that for now.

nimish and others added 2 commits May 5, 2024 18:56
Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>
@nimish
Copy link
Author

nimish commented May 5, 2024

  1. removed argcheck
  2. I've added a commented-out implementation of the moments calculated from the cumulants that really should live in the generic rv_continuous implementation. It's debatable whether it's better to do that over the generic computation. Some other distributions have very easily calculated cumulants and difficult moments like this one.
  3. Removed the bates comment
  4. fit will throw an error and has been added to the xfail/skip lists

@mdhaber
Copy link
Contributor

mdhaber commented May 5, 2024

Great. Please remember to send that announcement #20481 (comment)

I'll fix the lint errors, etc.

@nimish
Copy link
Author

nimish commented May 6, 2024

Anything left to do before merging?

Oops, yes - thanks for your patience!

All new features need an announcement at discuss.scientific-python.org/c/contributor/scipy/32. We used to do this on a mailing list; here is an example. One thing I imagine that people might want to comment on is the name - irwinhall vs, say, uniformsum, so that might be something to ask about in the post. Another thing to get other opinions about is what to do about the fit method.

Besides that, I un-resolved all the comments above and re-resolved the ones that are resolved. We just need to address the ones that are still unresolved:

  • We need a decision here about what to do with the fit method, given that it's essentially guaranteed to fail unless the shape parameter is fixed. I'd suggest overriding fit, raising a NotImplementedError unless the shape is fixed, and falling back to super().fit otherwise. (Update: always raising NotImplementedError is fine, too.)
  • The comment about the manual _argcheck within the loop of the rvs method is still open.
  • According to this comment, the _munp implementation needs to be corrected or removed. (And I'll need to remove the commented-out code before merging.)
  • I can add the reference for the stats method that I requested, remove the redundant test, and fix the PEP8 issue; don't worry about those.
  • According to this, we need to adjust the comment about how to create a Bates distribution. (I'll add an improved comment before merging.)
  • It would help for someone more familiar with B-splines to review the PDF implementation. (I absolutely believe it; it's just not immediately obvious to me, and there are many things to do!) (Update: Thanks @steppi!)
  • We'll need to make sure CI is green before merging. I'm running it so we can see if there's anything else that fails besides the moment check. (Update: I'll fix the lint issues before merging.)
  • Approval from discussion forum.

Thanks @nimish!

Re the moments/cumulants code: I'll remove it.

@nimish
Copy link
Author

nimish commented May 7, 2024

@mdhaber removed the moments code, made the post

@mdhaber
Copy link
Contributor

mdhaber commented May 7, 2024

@nimish Great, thanks! Re:

Adding that will be a separate feature because scipy needs more cumulant infrastructure (which I’ll look at adding)

I'd recommend taking a look at gh-15928, commenting there, and getting feedback before writing code.

@rlucas7
Copy link
Member

rlucas7 commented May 8, 2024

@nimish thanks for working on this distribution. I'm not sure what the state of the distributions infrastructure is @mdhaber or @steppi please chime in if I'm derailing anything here.

@nimish for the general moments of this distribution you should be able to use the formula I noted here. That formula should work for all non-central non-negative integer moments as far as I can tell.

The binomial coefficients are available from scipy.special (in approximate and exact form), as are the Stirling numbers of the second kind, for the latter they were added here #18103

I'll drop a note with a backlink here in the discuss forum.

@mdhaber
Copy link
Contributor

mdhaber commented May 8, 2024

Not derailing, since the new infrastructure is a different issue. If the formula is that simple in terms of the Stirling numbers, it would be OK to include them in this PR, accompanied by the citation information.

However I'd do some informal tests of the useful domain of the formula. When do the terms begin to overflow? And in what cases do we get low precision (when is the result close to 1)? Does the generic implementation do better in these cases?

@nimish
Copy link
Author

nimish commented May 8, 2024

@nimish thanks for working on this distribution. I'm not sure what the state of the distributions infrastructure is @mdhaber or @steppi please chime in if I'm derailing anything here.

@nimish for the general moments of this distribution you should be able to use the formula I noted here. That formula should work for all non-central non-negative integer moments as far as I can tell.

The binomial coefficients are available from scipy.special (in approximate and exact form), as are the Stirling numbers of the second kind, for the latter they were added here #18103

I'll drop a note with a backlink here in the discuss forum.

Thanks. I have the code and derivation of a general cumulant to moment formula but left it out for now since it's not clear if it's better than just calculating the expectation directly

@mdhaber @rlucas7 https://gist.github.com/nimish/c986c6f71863eca8b119cd12094e9e72 is a correct derivation of the moments from cumulants, and it's what I removed in eeaef71 (#20481)

It looks like the stirling number formula for the moments lines up with the one from the cumulants so I'll stick the simpler one from @rlucas7 in, thanks!

@rlucas7
Copy link
Member

rlucas7 commented May 9, 2024

It looks like the stirling number formula for the moments lines up with the one from the cumulants so I'll stick the simpler one from @rlucas7 in, thanks!

Sounds good @nimish I agree with @mdhaber 's comment here:

I'd do some informal tests of the useful domain of the formula. When do the terms begin to overflow? And in what cases do we get low precision (when is the result close to 1)? Does the generic implementation do better in these cases?

I'd guess that at some higher n (number of uniforms summed) and k raw moment power) the Stirling numbers of the second kind and the Binomial coefficients both are large and the division into floating point should be checked.

I'm guessing the values for this scenario are larger than where I'd think to use but in any case we usually try to document where the numerics are less accurate. (you can see examples of this in the pr for stirling2 where I wrote the approximate branch of the code).

Note: for the moments here I'd recommend to use the exact=True branch of binomial and stirling2.

@nimish
Copy link
Author

nimish commented May 9, 2024

Makes sense. Worst case scenario i fall back to logs or other tricks

@mdhaber
Copy link
Contributor

mdhaber commented May 9, 2024

Most recent commit addresses my comments and should fix the lint issues. I'll also run the xslow tests locally to make sure all the right ones are xslowed or skipped.

@WarrenWeckesser
Copy link
Member

WarrenWeckesser commented May 9, 2024

This is not a full review of the pull request, but I can see some changes that should be made.

  • An explicit _sf method should be created, otherwise a call such as irwinhall(10).sf(9.9) will return 0.0 instead of the correct 2.755731922398591e-17. It can be implemented to take advantage of the symmetry of the distribution.
  • All uses of assert_allclose should have an explicit rtol given, and generally it should be much smaller than the default 1e-7. rtol=1e-14 is a typical value to start with (loosen as needed, but hopefully we don't need to approach the default value). @nimish, this suggestion is as much for all the other SciPy developers as it is for you--this suggestion should probably be in the "missing bits" developer docs, if it is not already written down somewhere. I know it has come up in a review or two over the last few years.
  • Currently there is a single test function test_irwinhall() that tests everything. This should be split into several shorter test functions, where each test is for a specific method (e.g. test_irwinhall_pdf(), test_irwinhall_cdf(), etc). Modularizing the tests like that makes it much easer to read the reports generated by pytest when a test fails. It also allows all the tests to be run, instead of failing at the first failed assertion. For most (all?) of the distributions, we define a class to contain the tests (e.g, TestBinom, TestArcsine, etc; see the first several thousand lines of test_distributions.py). We don't follow that pattern exclusively (I think I've add several test functions over the years that could have been a method of an existing class), but for a new distribution, we should follow the old pattern--for consistency and for the organizational convenience--and create a class (e.g. class TestIrwinHall) with the individual tests as methods.
  • A test of the CDF that checks the values against values from Wolfram Alpha should be added.

@mdhaber
Copy link
Contributor

mdhaber commented May 9, 2024

this suggestion is as much for all the other SciPy developers as it is for you

@WarrenWeckesser I'd suggest participating in gh-17807 (which I opened to discuss these points when they were coming up frequently) if you want to impose stricter requirements on all distributions. It's not that I don't understand these things or remember to consider them; I'm pretty familiar with scipy.stats at this point and especially the accuracy issues in distribution tails and extreme parameter values, etc. I just don't necessarily agree that these things should have to be in the distribution-specific implementations or distribution-specific tests, especially of a symmetric distribution. Much of this should be taken care of by the infrastructure (and can be), but the existing infrastructure doesn't, so I wasn't going to hold this distribution to stricter, undocumented requirements than the rest. There are many distributions with room for improvement (gh-17832), and this would be one of them if merged in its current state. You are welcome to open a PR to improve it. That said, if this is important to you, you can also request changes in this review and take responsibility for merging this PR.

The suggestion about breaking the tests up is good, and I break up my tests by their intent, but there were enough higher-priority things to adjust already that I wasn't going to insist on it.

These are all good suggestions and I'd encourage @nimish to consider them, but I personally won't require them here or of all distributions added with the old infrastructure. (This is the last of those that I will review, in any case. I would have just let this wait for you or others to review like gh-19145, but I happened to be interested in this distribution when the original issue was posted.)

@nimish
Copy link
Author

nimish commented May 11, 2024

@WarrenWeckesser good point on the numeric cancellation, that's a genuine bug I'll look at fixing. But as @mdhaber said catastrophic cancellation at the tails of a symmetric distribution is an infra-wide problem that likely needs regression testing and should be enforced at a higher level.

  • rtol: this is debatable, and really a project wide policy thing since the semantics of isclose are subtle. I'll keep this at the default unless there's a good reason not to.
  • I can split up the tests into individual functions. We don't actually use much of the class functionality in pytest (again a higher level stats infra problem to tackle) so I didn't see a need.
  • CDF test: given the strict dependence of the distribution's CDF on the BSpline used for the PDF, all this would be doing is showing that BSpline.antiderivative works as intended.

We should be within 10 ulps of the exact rational division,
but this boils down to the precision of bspline evaluation
more than anything
@nimish
Copy link
Author

nimish commented May 12, 2024

@WarrenWeckesser / @mdhaber

Long story short, the best we can do here is to be equal within 10-ish ulps since we're basically just testing the precision of the BSpline evaluation algorithm. Unfortunately, since the exact value of cdf(99/100) = 36287999999999999/36288000000000000 is within an ulp of 1, even wolframalpha will round it to 1.0 unless specifically asked to evaluate it using rationals.

I think we're reasonably close and even then it's tough to justify any closer without heuristics that might be better placed in global infrastructure. The exact value only needs rational computation so perhaps mpmath or sympy could step in here.

Note that even specifically implemented and optimized spline eval algorithms have roughly the same error: https://www.boost.org/doc/libs/1_76_0/libs/math/doc/html/math_toolkit/sf_poly/cardinal_b_splines.html and it grows with the IH parameter.

https://en.wikipedia.org/wiki/Hoeffding%27s_inequality guarantee quite rapid convergence to zero anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add the Irwin-Hall (Uniform Sum) and Bates (Uniform Mean) distributions to scipy.stats
6 participants