stats binom at non-integer n (Trac #1895) #2414

Closed
scipy-gitbot opened this issue Apr 25, 2013 · 19 comments · Fixed by #15769
Labels
defect · Migrated from Trac · needs-decision · scipy.stats
Comments


Original ticket http://projects.scipy.org/scipy/ticket/1895 on 2013-04-19 by @josef-pkt, assigned to @rgommers.

stats.binom does not impose that n is an integer.

The cdf seems to work correctly (?), but the pmf doesn't sum to one.
The cdf from scipy.special might just floor the argument n (cast to int).
I don't know what to make of the case when n is not an integer.
Is there an extension of the distribution to all real n >= 0?

>>> stats.binom.cdf(np.arange(-3,5), 3.5, 0.5)
array([ 0.   ,  0.   ,  0.   ,  0.125,  0.5  ,  0.875,  1.   ,  1.   ])
>>> np.diff(stats.binom.cdf(np.arange(-3,5), 3.5, 0.5))
array([ 0.   ,  0.   ,  0.125,  0.375,  0.375,  0.125,  0.   ])
>>> stats.binom.pmf(np.arange(-3,5), 3.5, 0.5)
array([ 0.        ,  0.        ,  0.        ,  0.08838835,  0.30935922,
        0.38669902,  0.19334951,  0.        ])
>>> stats.binom.pmf(np.arange(-3,5), 3.5, 0.5).sum()
0.97779609585952287

>>> stats.binom.pmf(np.arange(-3,5), 3, 0.5)
array([ 0.   ,  0.   ,  0.   ,  0.125,  0.375,  0.375,  0.125,  0.   ])

trac user Sytse wrote on 2013-04-23

I don't know of an extension of the binomial distribution for non-integer n and couldn't find one in the literature.
In my opinion the parameter n should be restricted to non-negative integer values.

BTW, how should a check that a parameter has an integer value be coded in scipy? abs(par - round(par)) < eps?
I couldn't find an example, either in the hypergeometric distribution or in the discrete uniform distribution, where parameters imo should also have integer values.
A similar check should therefore also be introduced for all parameters in these two distributions (and maybe in more?).
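
A minimal sketch of the tolerance-style check suggested above (the helper name and the eps value are illustrative, not existing SciPy API):

    import numpy as np

    def is_integer_valued(par, eps=1e-8):
        # Sketch of the abs(par - round(par)) < eps idea; eps is arbitrary here.
        par = np.asarray(par)
        return np.abs(par - np.round(par)) < eps

    is_integer_valued(3.0)            # True
    is_integer_valued(3.0000000001)   # True (within tolerance)
    is_integer_valued(3.5)            # False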


@josef-pkt wrote on 2013-04-23

For a related discussion, see #198 and gh-2172.
In that case the distribution extends to the real line.

"practicality beats purity"

I'm using binom right now for the binomial tests and for power and sample size calculations. It's good to know the limitation, but I don't know yet what the best way to enforce this is.


trac user Sytse wrote on 2013-04-23

The discussion you refer to is about something different: the Erlang distribution appears to be identical to the gamma distribution, apart from the fact that one of the Erlang parameters is restricted to integer values.
You therefore may consider the gamma distribution as a 'continuous' extension of the Erlang distribution: both share the same support and the limit of the gamma probability density function as the parameter approaches an integer value is the same as the Erlang probability density function for this integer value.

If the integer-value restriction on the binomial n is dropped, you obviously don't get such a 'continuous' extension: the resulting 'distribution' either lacks the finite support of the binomial case, or its total probability is not 1; and total probability 1 is, in my opinion, an essential property of a probability distribution.

The slogan "practicality beats purity" sounds appealing, but is imo not a fair summary of the problem.
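
As a quick numerical illustration of the Erlang/gamma point (a sketch; the particular x and k values are arbitrary):

    import numpy as np
    from scipy import stats

    # The gamma distribution with integer shape k coincides with the Erlang
    # distribution, so gamma acts as the continuous extension described above.
    x = np.linspace(0.1, 5.0, 5)
    k = 3
    np.allclose(stats.gamma.pdf(x, k), stats.erlang.pdf(x, k))   # True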


@josef-pkt wrote on 2013-04-23

I completely agree, but what should be done?

  • just document it
  • round n
  • check and return nans

Nobody has ever complained about this, and I only found it because of a random check of the distribution.

Maybe the case where we run over a large number of n doesn't show up very often (only in sample size calculations, and maybe estimation). So I guess the check would cost much.

Floating point errors: what's the correct check? (Your BTW question above.)

I think I can convince myself to return nans.

in R

> pbinom(3, 10.1, 0.5)
[1] NaN
> pbinom(3, 10., 0.5)
[1] 0.171875
> pbinom(3, 10.0000000001, 0.5)
[1] 0.171875
> pbinom(3, 10.000000001, 0.5)
[1] 0.171875
> pbinom(3, 10.00000001, 0.5)
[1] 0.171875
> pbinom(3, 10.0000001, 0.5)
[1] 0.171875
> pbinom(3, 10.000001, 0.5)
[1] NaN


@josef-pkt wrote on 2013-04-23

typo in previous comment

So I guess the check would NOT cost much


mdhaber commented Feb 2, 2021

I think this issue just needs a decision. I would suggest just putting an integer equality check in _argcheck, not rounding, e.g.

    def _argcheck(self, n, p):
        return (n >= 0) & (n == np.asarray(n, dtype=int)) & (p >= 0) & (p <= 1)

Ideally n comes in as an array, but it doesn't if passed from, e.g., moment. That is a separate issue (gh-12197 I think), and until it is resolved, we need np.asarray.

For reference, gh-13204 performs this sort of integer check for the new zipfian distribution because there is not an obvious way of extending it to non-integer n for a <= 1.
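
A quick standalone exercise of that expression, pulled out of the class so the edge cases are visible (the wrapper name is illustrative only):

    import numpy as np

    def argcheck_sketch(n, p):
        # The proposed check: n must be a non-negative integer value
        # (integral-valued floats such as 3.0 pass) and p must lie in [0, 1].
        n = np.asarray(n)
        return (n >= 0) & (n == np.asarray(n, dtype=int)) & (p >= 0) & (p <= 1)

    argcheck_sketch(3.0, 0.5)   # True  -- integral-valued float accepted
    argcheck_sketch(3.5, 0.5)   # False -- non-integer n rejected
    argcheck_sketch(3, 1.5)     # False -- p outside [0, 1] rejected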


rlucas7 commented Feb 10, 2021

@mdhaber Josh and I had some related discussions here: #11045.
The bdtr family of special functions from Cephes runs the cdf, ppf, and sf for the binomial in SciPy, at least until the boostinator goes in (love the name BTW).

Given that _argcheck applies to the cdf and sf as well as the pmf, I do think the looser requirement of double (or float) should be preferred.

This wiki section describes the relationship: basically, repeated integration by parts of the incomplete beta function gives the distribution function for the binomial.
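
As a concrete check of that incomplete-beta relationship for integer n (the particular k, n, p values are arbitrary):

    from scipy import special, stats

    # For integer n the binomial CDF can be written with the regularized
    # incomplete beta function: P(X <= k) = I_{1-p}(n - k, k + 1).
    k, n, p = 3, 10, 0.5
    stats.binom.cdf(k, n, p)               # 0.171875
    special.bdtr(k, n, p)                  # 0.171875
    special.betainc(n - k, k + 1, 1 - p)   # 0.171875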


mdhaber commented Feb 10, 2021

at least until boostinator goes in (love the name BTW).

100% @mckib2! BTW, it's ready from my perspective if you have a minute to check it out!

This wiki section describes the relationship, basically repeated integration by parts of the incomplete beta function to get the distribution function for binomial.

I'll take a look, thanks. Does that mean that most of the methods work fine with non-integer n, and it's just the pmf that's wrong? Should there be a separate argument check in just the _pmf method?


rlucas7 commented Feb 10, 2021 via email


mdhaber commented Feb 10, 2021

Given that the plan is to move to boost, how does boost handle these cases?

I'll reply in gh-13328


mdhaber commented Mar 9, 2022

@steppi I noticed your comments in gh-15213. Do you have any suggestions here? Or about gh-13632 (more generally)?


steppi commented Mar 9, 2022

@steppi I noticed your comments in gh-15213. Do you have any suggestions here? Or about gh-13632 (more generally)?

My vote would be just to do the integer check. Here's some background for context:

The natural generalization of the binomial distribution for non-integral n is a quasiprobability distribution that includes negative probabilities. It has support over all of the non-negative integers. For k < n it assigns positive probability, and the sign of the probability alternates for k > n. Quasiprobability distributions show up in quantum mechanics, but I'm not aware of this one appearing anywhere in the physics literature.

Some special cases show up in a paper by Gábor J. Székely entitled Half of a coin: negative probabilities. The main idea is that if x and y are real numbers such that x + y = n for n a positive integer, and X ~ quasibinom(x, p), Y ~ quasibinom(y, p), then X + Y ~ binom(n, p) in the sense that the convolution of the distributions for X and Y yields binom(n, p). This can be generalized to sums of arbitrary finite length. Székely treats the case where the x, y, etc. values sum to 1, yielding the distributions associated to "fractional coins".

The pmf takes on negative values and the cdf is not monotonically increasing. They are well-defined functions, but they don't satisfy the properties needed for a probability distribution. Also, it doesn't make sense to sample from a quasiprobability distribution, so there would be no meaningful rvs.

I think these quasiprobability distributions are interesting but outside the scope of scipy.stats. However, any generalization of the binomial distribution to non-integral n that isn't such a quasiprobability distribution would strike me as unnatural. So I think it makes sense just to do the integer check.
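
A minimal numerical sketch of the generalization described above, using scipy.special.binom (which accepts real n); the quasibinom_pmf helper and the particular x, y, p values are only illustrative:

    import numpy as np
    from scipy import special, stats

    def quasibinom_pmf(k, n, p):
        # Generalized binomial "pmf": C(n, k) * p**k * (1 - p)**(n - k), with
        # the binomial coefficient evaluated for real n.  For non-integer n it
        # takes negative values for some k > n.
        return special.binom(n, k) * p**k * (1 - p)**(n - k)

    k = np.arange(60)
    p = 0.25
    x, y = 1.5, 2.5                      # x + y = 4, a positive integer

    fx = quasibinom_pmf(k, x, p)
    fy = quasibinom_pmf(k, y, p)

    np.any(fx < 0)                       # True: negative "probabilities"
    fx.sum()                             # ~1.0 (the series converges for p < 1/2)

    # Convolving the two quasi-distributions recovers binom(4, p).
    conv = np.convolve(fx, fy)[:5]
    np.allclose(conv, stats.binom.pmf(np.arange(5), 4, p))   # True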


steppi commented Mar 9, 2022

For gh-13632, my vote would also be to do the integral check unless there really is a meaningful extension to non-integral parameter values that gives an actual probability distribution.


mdhaber commented Mar 9, 2022

Thanks @steppi.

@tupui Since you also replied in gh-13632, I thought I'd bring this one up. Based on @steppi's comments above, what do you think about deprecating calling SciPy's built-in discrete distributions with non-integer values of arguments that are supposed to be integer (unless there is a clearly correct extension to non-integer values already implemented in SciPy)?


mdhaber commented Mar 10, 2022

Crossref gh-3758. Not the same, but related (which arguments can be non-integral).


tupui commented Mar 11, 2022

If accepting only integers simplifies things on our side, I am fine with it. I have used discrete distributions with non-integer values a lot for modelling, but I could have used a mapping if SciPy only dealt with integers.


steppi commented Mar 11, 2022

If accepting only integers simplifies things on our side, I am fine with it. I have used discrete distributions with non-integer values a lot for modelling, but I could have used a mapping if SciPy only dealt with integers.

@tupui can you give an example? I think @mdhaber is talking about discrete distributions with parameters that only make sense for integer values, but I think you may be talking about discrete distributions which take on non-integer values, for example a distribution that takes on 0 with probability 1/2 and 0.5 with probability 0.5. Do I understand correctly? I’d hope that we only deprecate for cases where there genuinely isn’t a valid probability distribution when a parameter is extended to be non-integral, like the example from this issue of the binomial distribution with non-integral n.


tupui commented Mar 11, 2022

You're correct 👍 then all good.
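
For what it's worth, a sketch of the mapping workaround mentioned above, using steppi's example of a distribution on {0, 0.5} (rv_discrete's values argument expects integer support points, so the real-valued outcomes are mapped by hand):

    import numpy as np
    from scipy import stats

    # A random variable taking the value 0 with probability 1/2 and 0.5 with
    # probability 1/2: define it on integer indices, then map back.
    support = np.array([0.0, 0.5])
    dist = stats.rv_discrete(values=([0, 1], [0.5, 0.5]))

    samples = support[dist.rvs(size=5, random_state=0)]    # values in {0.0, 0.5}
    mean = np.sum(support * dist.pmf([0, 1]))               # 0.25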


mdhaber commented Mar 11, 2022

I think we're on the same page. I'll open a PR.
