stats binom at non-integer n (Trac #1895) #2414

Closed
scipy-gitbot opened this issue Apr 25, 2013 · 19 comments · Fixed by #15769
Labels
defect · Migrated from Trac · needs-decision · scipy.stats
Comments


Original ticket http://projects.scipy.org/scipy/ticket/1895 on 2013-04-19 by @josef-pkt, assigned to @rgommers.

stats.binom does not impose that n is an integer.

The cdf seems to work correctly (?), but the pmf doesn't sum to one.
The cdf from scipy.special might just floor the argument n (cast to int).
I don't know what to make of the case when n is not an integer.
Is there an extension of the distribution to all real n >= 0?

>>> stats.binom.cdf(np.arange(-3,5), 3.5, 0.5)
array([ 0.   ,  0.   ,  0.   ,  0.125,  0.5  ,  0.875,  1.   ,  1.   ])
>>> np.diff(stats.binom.cdf(np.arange(-3,5), 3.5, 0.5))
array([ 0.   ,  0.   ,  0.125,  0.375,  0.375,  0.125,  0.   ])
>>> stats.binom.pmf(np.arange(-3,5), 3.5, 0.5)
array([ 0.        ,  0.        ,  0.        ,  0.08838835,  0.30935922,
        0.38669902,  0.19334951,  0.        ])
>>> stats.binom.pmf(np.arange(-3,5), 3.5, 0.5).sum()
0.97779609585952287

>>> stats.binom.pmf(np.arange(-3,5), 3, 0.5)
array([ 0.   ,  0.   ,  0.   ,  0.125,  0.375,  0.375,  0.125,  0.   ])

trac user Sytse wrote on 2013-04-23

I don't know of an extension of the binomial distribution for non-integer n and couldn't find one in the literature.
In my opinion the parameter n should be restricted to non-negative integer values.

BTW, how should a check that a parameter has an integer value be coded in scipy? abs(par - round(par)) < eps?
I couldn't find an example, either in the hypergeometric distribution or in the discrete uniform distribution, where parameters imo should also have integer values.
A similar check should therefore also be introduced for all parameters in these two distributions (and maybe in more?).
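
A minimal sketch of the tolerance-style check suggested above (the helper name and the eps value are illustrative, not existing SciPy API):

    import numpy as np

    def is_integer_valued(par, eps=1e-8):
        # Sketch of the abs(par - round(par)) < eps idea; eps is arbitrary here.
        par = np.asarray(par)
        return np.abs(par - np.round(par)) < eps

    is_integer_valued(3.0)            # True
    is_integer_valued(3.0000000001)   # True (within tolerance)
    is_integer_valued(3.5)            # False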


@josef-pkt wrote on 2013-04-23

For a related discussion, see #198 and gh-2172.
In that case the distribution extends to the real line.

"practicality beats purity"

I'm using binom right now for the binomial tests and for power and sample size calculations. It's good to know the limitation, but I don't know yet what the best way to enforce this is.


trac user Sytse wrote on 2013-04-23

The discussion you refer to is about something different: the Erlang distribution appears to be identical to the gamma distribution, apart from the fact that one of the Erlang parameters is restricted to integer values.
You therefore may consider the gamma distribution as a 'continuous' extension of the Erlang distribution: both share the same support and the limit of the gamma probability density function as the parameter approaches an integer value is the same as the Erlang probability density function for this integer value.

If the integer-value restriction on the binomial n is dropped, you obviously don't get such a 'continuous' extension: the resulting 'distribution' either lacks the finite support of the binomial case, or its total probability is not 1; and total probability 1 is, in my opinion, an essential property of a probability distribution.

The slogan "practicality beats purity" sounds appealing, but is imo not a fair summary of the problem.
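
As a quick numerical illustration of the Erlang/gamma point (a sketch; the particular x and k values are arbitrary):

    import numpy as np
    from scipy import stats

    # The gamma distribution with integer shape k coincides with the Erlang
    # distribution, so gamma acts as the continuous extension described above.
    x = np.linspace(0.1, 5.0, 5)
    k = 3
    np.allclose(stats.gamma.pdf(x, k), stats.erlang.pdf(x, k))   # True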


@josef-pkt wrote on 2013-04-23

I completely agree, but what should be done?

  • just document it
  • round n
  • check and return nans

Nobody has ever complained about this, and I only found it because of a random check of the distribution.

Maybe the case where we run over a large number of n doesn't show up very often (only in sample size calculations, and maybe estimation). So I guess the check would cost much.

Floating point errors: what's the correct check? (Your BTW question above.)

I think I can convince myself to return nans.

in R

> pbinom(3, 10.1, 0.5)
[1] NaN
> pbinom(3, 10., 0.5)
[1] 0.171875
> pbinom(3, 10.0000000001, 0.5)
[1] 0.171875
> pbinom(3, 10.000000001, 0.5)
[1] 0.171875
> pbinom(3, 10.00000001, 0.5)
[1] 0.171875
> pbinom(3, 10.0000001, 0.5)
[1] 0.171875
> pbinom(3, 10.000001, 0.5)
[1] NaN


@josef-pkt wrote on 2013-04-23

typo in previous comment

So I guess the check would NOT cost much


mdhaber commented Feb 2, 2021

I think this issue just needs a decision. I would suggest just putting an integer equality check in _argcheck, not rounding, e.g.

    def _argcheck(self, n, p):
        return (n >= 0) & (n == np.asarray(n, dtype=int)) & (p >= 0) & (p <= 1)

Ideally n comes in as an array, but it doesn't if passed from, e.g., moment. That is a separate issue (gh-12197 I think), and until it is resolved, we need np.asarray.

For reference, gh-13204 performs this sort of integer check for the new zipfian distribution because there is not an obvious way of extending it to non-integer n for a <= 1.
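
A quick standalone exercise of that expression, pulled out of the class so the edge cases are visible (the wrapper name is illustrative only):

    import numpy as np

    def argcheck_sketch(n, p):
        # The proposed check: n must be a non-negative integer value
        # (integral-valued floats such as 3.0 pass) and p must lie in [0, 1].
        n = np.asarray(n)
        return (n >= 0) & (n == np.asarray(n, dtype=int)) & (p >= 0) & (p <= 1)

    argcheck_sketch(3.0, 0.5)   # True  -- integral-valued float accepted
    argcheck_sketch(3.5, 0.5)   # False -- non-integer n rejected
    argcheck_sketch(3, 1.5)     # False -- p outside [0, 1] rejected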


rlucas7 commented Feb 10, 2021

@mdhaber Josh and I had some related discussions here: #11045.
The bdtr family of special functions from Cephes runs the cdf, ppf, and sf for the binomial in SciPy, at least until the boostinator goes in (love the name BTW).

Given that _argcheck applies to the cdf and sf as well as the pmf, I do think the looser requirement of double (or float) should be preferred.

This wiki section describes the relationship: basically, repeated integration by parts of the incomplete beta function gives the distribution function for the binomial.
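
As a concrete check of that incomplete-beta relationship for integer n (the particular k, n, p values are arbitrary):

    from scipy import special, stats

    # For integer n the binomial CDF can be written with the regularized
    # incomplete beta function: P(X <= k) = I_{1-p}(n - k, k + 1).
    k, n, p = 3, 10, 0.5
    stats.binom.cdf(k, n, p)               # 0.171875
    special.bdtr(k, n, p)                  # 0.171875
    special.betainc(n - k, k + 1, 1 - p)   # 0.171875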


mdhaber commented Feb 10, 2021

at least until boostinator goes in (love the name BTW).

100% @mckib2! BTW, it's ready from my perspective if you have a minute to check it out!

This wiki section describes the relationship, basically repeated integration by parts of the incomplete beta function to get the distribution function for binomial.

I'll take a look, thanks. Does that mean that most of the methods work fine with non-integer n, and it's just the pmf that's wrong? Should there be a separate argument check in just the _pmf method?


rlucas7 commented Feb 10, 2021 via email


mdhaber commented Feb 10, 2021

Given that the plan is to move to boost, how does boost handle these cases?

I'll reply in gh-13328


mdhaber commented Mar 9, 2022

@steppi I noticed your comments in gh-15213. Do you have any suggestions here? Or about gh-13632 (more generally)?


steppi commented Mar 9, 2022

@steppi I noticed your comments in gh-15213. Do you have any suggestions here? Or about gh-13632 (more generally)?

My vote would be just to do the integer check. Here's some background for context:

The natural generalization of the binomial distribution for non-integral n is a quasiprobability distribution that includes negative probabilities. It has support over all of the non-negative integers. For k < n it assigns positive probability, and the sign of the probability alternates for k > n. Quasiprobability distributions show up in quantum mechanics, but I'm not aware of this one appearing anywhere in the physics literature.

Some special cases show up in a paper by Gábor J. Székely entitled Half of a coin: negative probabilities. The main idea is that if x and y are real numbers such that x + y = n for n a positive integer, and X ~ quasibinom(x, p), Y ~ quasibinom(y, p), then X + Y ~ binom(n, p) in the sense that the convolution of the distributions for X and Y yields binom(n, p). This can be generalized to sums of arbitrary finite length. Székely treats the case where the x, y, etc. values sum to 1, yielding the distributions associated to "fractional coins".

The pmf takes on negative values and the cdf is not monotonically increasing. They are well-defined functions, but they don't satisfy the properties needed for a probability distribution. Also, it doesn't make sense to sample from a quasiprobability distribution, so there would be no meaningful rvs.

I think these quasiprobability distributions are interesting but outside the scope of scipy.stats. However, any generalization of the binomial distribution to non-integral n that isn't such a quasiprobability distribution would strike me as unnatural. So I think it makes sense just to do the integer check.
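
A minimal numerical sketch of the generalization described above, using scipy.special.binom (which accepts real n); the quasibinom_pmf helper and the particular x, y, p values are only illustrative:

    import numpy as np
    from scipy import special, stats

    def quasibinom_pmf(k, n, p):
        # Generalized binomial "pmf": C(n, k) * p**k * (1 - p)**(n - k), with
        # the binomial coefficient evaluated for real n.  For non-integer n it
        # takes negative values for some k > n.
        return special.binom(n, k) * p**k * (1 - p)**(n - k)

    k = np.arange(60)
    p = 0.25
    x, y = 1.5, 2.5                      # x + y = 4, a positive integer

    fx = quasibinom_pmf(k, x, p)
    fy = quasibinom_pmf(k, y, p)

    np.any(fx < 0)                       # True: negative "probabilities"
    fx.sum()                             # ~1.0 (the series converges for p < 1/2)

    # Convolving the two quasi-distributions recovers binom(4, p).
    conv = np.convolve(fx, fy)[:5]
    np.allclose(conv, stats.binom.pmf(np.arange(5), 4, p))   # True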


steppi commented Mar 9, 2022

For gh-13632, my vote would also be to do the integral check unless there really is a meaningful extension to non-integral parameter values that gives an actual probability distribution.


mdhaber commented Mar 9, 2022

Thanks @steppi.

@tupui Since you also replied in gh-13632, I thought I'd bring this one up. Based on @steppi's comments above, what do you think about deprecating calling SciPy's built-in discrete distributions with non-integer values of arguments that are supposed to be integer (unless there is a clearly correct extension to non-integer values already implemented in SciPy)?


mdhaber commented Mar 10, 2022

Crossref gh-3758. Not the same, but related (which arguments can be non-integral).


tupui commented Mar 11, 2022

If accepting only integers simplifies things on our side, I am fine with it. I have used discrete distributions with non-integer values a lot for modelling, but I could have used a mapping if SciPy only dealt with integers.


steppi commented Mar 11, 2022

If accepting only integers simplifies things on our side, I am fine with it. I have used discrete distributions with non-integer values a lot for modelling, but I could have used a mapping if SciPy only dealt with integers.

@tupui can you give an example? I think @mdhaber is talking about discrete distributions with parameters that only make sense for integer values, but I think you may be talking about discrete distributions which take on non-integer values, for example a distribution that takes on 0 with probability 1/2 and 0.5 with probability 0.5. Do I understand correctly? I’d hope that we only deprecate for cases where there genuinely isn’t a valid probability distribution when a parameter is extended to be non-integral, like the example from this issue of the binomial distribution with non-integral n.


tupui commented Mar 11, 2022

You're correct 👍 then all good.
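
For what it's worth, a sketch of the mapping workaround mentioned above, using steppi's example of a distribution on {0, 0.5} (rv_discrete's values argument expects integer support points, so the real-valued outcomes are mapped by hand):

    import numpy as np
    from scipy import stats

    # A random variable taking the value 0 with probability 1/2 and 0.5 with
    # probability 1/2: define it on integer indices, then map back.
    support = np.array([0.0, 0.5])
    dist = stats.rv_discrete(values=([0, 1], [0.5, 0.5]))

    samples = support[dist.rvs(size=5, random_state=0)]    # values in {0.0, 0.5}
    mean = np.sum(support * dist.pmf([0, 1]))               # 0.25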


mdhaber commented Mar 11, 2022

I think we're on the same page. I'll open a PR.
