scipy.stats.binom.ppf Incorrect for p=0 #5122

ghost · 2015-08-07T17:30:03Z

Hi,
The when calling the function scipy.stats.binom.ppf(x,n,0) it returns the following values:
x=0 : scipy.stats.binom.ppf(x,n,0) == -1
0<x<1 : scipy.stats.binom.ppf(x,n,0) == n-1
x=1 : scipy.stats.binom.ppf(x,n,0) == n
However, I believe that it should return the values:
x=0 : scipy.stats.binom.ppf(x,n,0) == -1
0<x<1 : scipy.stats.binom.ppf(x,n,0) == 0
x=1 : scipy.stats.binom.ppf(x,n,0) == 0

argriffing · 2015-08-07T22:09:57Z

0<x<0

I guess you mean 0<x<1?

ghost · 2015-08-07T22:21:40Z

I did. Sorry for the error.
On Aug 7, 2015 6:10 PM, "argriffing" notifications@github.com wrote:

0<x<0

I guess you mean 0<x<1?

—
Reply to this email directly or view it on GitHub
#5122 (comment).

nayyarv · 2015-08-09T15:31:25Z

Btw, I got scipy.stats.binom.ppf(q, n, 0) ==0 for 0.5<q<1. (on, 0.16.0)

Most of what's happening is coming from here https://github.com/scipy/scipy/blob/master/scipy/stats/_discrete_distns.py#L64

The special.bdtrik is returning the wrong value (as far as I can tell) for this case. Running special.bdtrik for the values above gives the problematic answers.

As in

special.bdtrik(0, n, p)  = 0 # for non-zero n and p
special.bdtrik(q, n, 0)  = n # for non-zero n and q<0.5
special.bdtrik(q, n, 0)  = 0 # for non-zero n and q>0.5

It appears bdtrik is auto-generated and has no documentation: http://docs.scipy.org/doc/scipy/reference/generated/scipy.special.bdtrik.html

So I have no idea what's happening.

I suspect this has to do with implementation where it chooses method of evaluation depending on whether q>0.5 or not.

ghost · 2015-08-09T22:21:34Z

Hi,

For version 0.13.0b1 I get the following:

ppf(q,n,0) == n-1 for 0<q<=.5
ppf(q,n,0) == 0 for .5<q<1
ppf(q,n,0) == n for q==1

I'm not sure what is wrong with the code, however hopefully the following
will help:

I suspect the issue is due to the fact that a binomial random variable
takes on discrete values and thus the cdf is defined by a collection of
line segments that are closed on the left hand side and open on the right
hand side. (e.g., it will look a little like the first example on this
page :
https://en.wikipedia.org/wiki/File:Discrete_probability_distribution_illustration.svg
).

To ensure that its inverse (the ppf function) is defined everywhere in
[0,1] one will have to do some "tricks" to connect those line segments.
Here is a very inefficient example of how one might do that:

ppf = lambda x, n, p : (float(i) for i in xrange(n+1) if
scipy.stats.binom.cdf(i,n,p) >= x).next()

Here I find the first value "i" such that the cdf is >= x.

(keep n small if you want to test this out)

I suspect that the section of code that does this is the likely cause of
the error. This is of course assuming that I know what I'm talking about
and there isn't some fancier method of computing the ppf that I don't know
about. This paper (which I haven't read) may be informative (
http://www.lancaster.ac.uk/postgrad/moorhead/Work/SummerProject13.pdf ).

On Sun, Aug 9, 2015 at 11:31 AM, Varun Nayyar notifications@github.com
wrote:

Btw, I got scipy.stats.binom.ppf(q, n, 0) ==0 for 0.5<q<1. (on, 0.16.0)

Most of what's happening is coming from here
https://github.com/scipy/scipy/blob/master/scipy/stats/_discrete_distns.py#L64

The special.bdtrik is returning the wrong value (as far as I can tell)
for this case. Running special.bdtrik for the values above gives the
problematic answers.

As in

special.bdtrik(0, n, p) = 0 # for non-zero n and p
special.bdtrik(q, n, 0) = n # for non-zero n and q<0.5
special.bdtrik(q, n, 0) = 0 # for non-zero n and q>0.5

It appears bdtrik is auto-generated and has no documentation:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.special.bdtrik.html

So I have no idea what's happening.

I suspect this has to do with implementation where it chooses method of
evaluation depending on whether q>0.5 or not.

—
Reply to this email directly or view it on GitHub
#5122 (comment).

argriffing · 2015-08-11T15:14:19Z

Is this related to #1603?

josef-pkt · 2015-08-11T15:20:13Z

Going to the limits is often difficult numerically as in #1603.
However, in this case we would need the right values just at the limit p=0, as in the original description.

The current ppf looks correct for small (but not tiny) p, but jumps to wrong values at p=0. (ignoring the segment of p in #1603)

argriffing · 2015-08-11T15:24:12Z

@josef-pkt I vaguely recall that earlier you said the scipy.stats functions are not in general implemented correctly at limits. Is this still true? Are problems at these limits considered bugs or out of scope? If limits are allowed (e.g. p=0 or p=1 in a binomial distribution, sigma=0 in a normal distribution, etc.) then should some kind of meta-issue be opened for this?

josef-pkt · 2015-08-11T15:40:39Z

@argriffing

First, specific to this issue: AFAICS, ppf returns wrong numbers and not np.nan, so it's currently not treated as out of domain. The fix is either to make it out of domain or return the correct number which is available in this case.

edit The difference to the case sigma=0 in the normal distribution is that here we already have a discrete distribution with probabilities and a well defined finite limit. When scale->0 in continuous distribtuion, then we don't have a standard continuous distribution anymore. (end edit)

In general about edge cases:

I would treat them case by case and improve (include boundaries) if possible and if it doesn't produce slow spaghetti code. scipy.special has improved a lot over the years, and also in corner cases in stats are now much better handled than they were before.

However, I wouldn't make any assumptions that any of the functions are well defined at or close to singular limits. For example in statsmodels (e.g. GLM and discrete) we clip values most of the times to stay away from the boundaries, or impose transformations that force values away from the boundaries. (#1603 doesn't worry me longer than half a minute)

A case that is now well handled by scipy but not yet used much in statsmodels is the 0log0 case or many similar cases. (or currently box-cox transformation where I didn't pay attention)

argriffing · 2015-08-11T17:40:58Z

special.bdtrik(0, n, p) = 0 # for non-zero n and p

I'm not sure if it helps debugging, but this happens:

>>> bdtrik(0, 4, 1e-16)
0.0
>>> bdtrik(0, 4, 1e-20)
4.0
>>> bdtrik(0, 4, 0)
4.0

I guess this is related to that other issue.

argriffing · 2015-08-11T17:50:02Z

I think it's hitting edge case bugs in https://github.com/scipy/scipy/blob/master/scipy/special/cdflib/cdfbin.f or https://github.com/scipy/scipy/blob/master/scipy/special/cdflib/dinvr.f, if anyone would like following GOTOs through unmaintained fortran code.

argriffing · 2015-08-11T18:37:39Z

Here is a very inefficient example of how one might do that:
ppf = lambda x, n, p : (float(i) for i in xrange(n+1) if scipy.stats.binom.cdf(i,n,p) >= x).next()
Here I find the first value "i" such that the cdf is >= x.

Yes this is how scipy documents ppf of discrete distributions in http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#specific-points-for-discrete-distributions: ppf(q) = min{x : cdf(x) >= q, x integer}, except possibly for the lower bound of the search (I think returning -1 for ppf(0, n, 0) is deliberate?).

abukaj · 2017-05-16T09:13:40Z

It is almost second anniversary of the issue and the issue still haunts scipy (v. 0.19.0).

Maybe some kind of a monkey-path like 0 if (p, q) == (0, 0) else _ppf(q, n, p) will do it for now?

mdhaber · 2020-12-20T08:02:03Z

In the original post, the desired answer for x=1 is:

x=1 : scipy.stats.binom.ppf(x,n,0) == 0

Note that Boost and R (e.g. qbinom(1, 10, 0)) agree with SciPy here, returning n. Are they all wrong?

rgommers added the scipy.stats label Aug 8, 2015

abukaj mentioned this issue May 16, 2017

scipy.stats.binom.ppf returns nan for q between 0 and 1 if n == 0 #7406

Closed

mdhaber mentioned this issue Dec 12, 2020

A Solid Foundation for Statistics in Python with SciPy mdhaber/scipy#26

Closed

mdhaber mentioned this issue Dec 20, 2020

WIP: Boost stats distributions mdhaber/scipy#48

Closed

mckib2 added a commit to mckib2/scipy that referenced this issue Dec 20, 2020

TEST: add test for issue scipy#5122

5ca60dd

mckib2 added a commit to mckib2/scipy that referenced this issue Jan 2, 2021

TEST: add test for issue scipy#5122

039d67c

mckib2 mentioned this issue Jan 2, 2021

ENH: Boost stats distributions #13328

Merged

rgommers closed this as completed in #13328 May 8, 2021

rgommers added this to the 1.7.0 milestone May 8, 2021

rgommers added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label May 8, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scipy.stats.binom.ppf Incorrect for p=0 #5122

scipy.stats.binom.ppf Incorrect for p=0 #5122

ghost commented Aug 7, 2015

argriffing commented Aug 7, 2015

ghost commented Aug 7, 2015

nayyarv commented Aug 9, 2015

ghost commented Aug 9, 2015

argriffing commented Aug 11, 2015

josef-pkt commented Aug 11, 2015

argriffing commented Aug 11, 2015

josef-pkt commented Aug 11, 2015

argriffing commented Aug 11, 2015

argriffing commented Aug 11, 2015

argriffing commented Aug 11, 2015

abukaj commented May 16, 2017

mdhaber commented Dec 20, 2020 •

edited

scipy.stats.binom.ppf Incorrect for p=0 #5122

scipy.stats.binom.ppf Incorrect for p=0 #5122

Comments

ghost commented Aug 7, 2015

argriffing commented Aug 7, 2015

ghost commented Aug 7, 2015

nayyarv commented Aug 9, 2015

ghost commented Aug 9, 2015

argriffing commented Aug 11, 2015

josef-pkt commented Aug 11, 2015

argriffing commented Aug 11, 2015

josef-pkt commented Aug 11, 2015

argriffing commented Aug 11, 2015

argriffing commented Aug 11, 2015

argriffing commented Aug 11, 2015

abukaj commented May 16, 2017

mdhaber commented Dec 20, 2020 • edited

mdhaber commented Dec 20, 2020 •

edited