Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scipy.stats.binom.ppf Incorrect for p=0 #5122

Closed
ghost opened this issue Aug 7, 2015 · 13 comments · Fixed by #13328
Closed

scipy.stats.binom.ppf Incorrect for p=0 #5122

ghost opened this issue Aug 7, 2015 · 13 comments · Fixed by #13328
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats
Milestone

Comments

@ghost
Copy link

ghost commented Aug 7, 2015

Hi,
The when calling the function scipy.stats.binom.ppf(x,n,0) it returns the following values:
x=0 : scipy.stats.binom.ppf(x,n,0) == -1
0<x<1 : scipy.stats.binom.ppf(x,n,0) == n-1
x=1 : scipy.stats.binom.ppf(x,n,0) == n
However, I believe that it should return the values:
x=0 : scipy.stats.binom.ppf(x,n,0) == -1
0<x<1 : scipy.stats.binom.ppf(x,n,0) == 0
x=1 : scipy.stats.binom.ppf(x,n,0) == 0

@argriffing
Copy link
Contributor

0<x<0

I guess you mean 0<x<1?

@ghost
Copy link
Author

ghost commented Aug 7, 2015

I did. Sorry for the error.
On Aug 7, 2015 6:10 PM, "argriffing" notifications@github.com wrote:

0<x<0

I guess you mean 0<x<1?


Reply to this email directly or view it on GitHub
#5122 (comment).

@nayyarv
Copy link

nayyarv commented Aug 9, 2015

Btw, I got scipy.stats.binom.ppf(q, n, 0) ==0 for 0.5<q<1. (on, 0.16.0)

Most of what's happening is coming from here https://github.com/scipy/scipy/blob/master/scipy/stats/_discrete_distns.py#L64

The special.bdtrik is returning the wrong value (as far as I can tell) for this case. Running special.bdtrik for the values above gives the problematic answers.

As in

special.bdtrik(0, n, p)  = 0 # for non-zero n and p
special.bdtrik(q, n, 0)  = n # for non-zero n and q<0.5
special.bdtrik(q, n, 0)  = 0 # for non-zero n and q>0.5

It appears bdtrik is auto-generated and has no documentation: http://docs.scipy.org/doc/scipy/reference/generated/scipy.special.bdtrik.html

So I have no idea what's happening.

I suspect this has to do with implementation where it chooses method of evaluation depending on whether q>0.5 or not.

@ghost
Copy link
Author

ghost commented Aug 9, 2015

Hi,

For version 0.13.0b1 I get the following:

ppf(q,n,0) == n-1 for 0<q<=.5
ppf(q,n,0) == 0 for .5<q<1
ppf(q,n,0) == n for q==1

I'm not sure what is wrong with the code, however hopefully the following
will help:

I suspect the issue is due to the fact that a binomial random variable
takes on discrete values and thus the cdf is defined by a collection of
line segments that are closed on the left hand side and open on the right
hand side. (e.g., it will look a little like the first example on this
page :
https://en.wikipedia.org/wiki/File:Discrete_probability_distribution_illustration.svg
).

To ensure that its inverse (the ppf function) is defined everywhere in
[0,1] one will have to do some "tricks" to connect those line segments.
Here is a very inefficient example of how one might do that:

ppf = lambda x, n, p : (float(i) for i in xrange(n+1) if
scipy.stats.binom.cdf(i,n,p) >= x).next()

Here I find the first value "i" such that the cdf is >= x.

(keep n small if you want to test this out)

I suspect that the section of code that does this is the likely cause of
the error. This is of course assuming that I know what I'm talking about
and there isn't some fancier method of computing the ppf that I don't know
about. This paper (which I haven't read) may be informative (
http://www.lancaster.ac.uk/postgrad/moorhead/Work/SummerProject13.pdf ).

On Sun, Aug 9, 2015 at 11:31 AM, Varun Nayyar notifications@github.com
wrote:

Btw, I got scipy.stats.binom.ppf(q, n, 0) ==0 for 0.5<q<1. (on, 0.16.0)

Most of what's happening is coming from here
https://github.com/scipy/scipy/blob/master/scipy/stats/_discrete_distns.py#L64

The special.bdtrik is returning the wrong value (as far as I can tell)
for this case. Running special.bdtrik for the values above gives the
problematic answers.

As in

special.bdtrik(0, n, p) = 0 # for non-zero n and p
special.bdtrik(q, n, 0) = n # for non-zero n and q<0.5
special.bdtrik(q, n, 0) = 0 # for non-zero n and q>0.5

It appears bdtrik is auto-generated and has no documentation:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.special.bdtrik.html

So I have no idea what's happening.

I suspect this has to do with implementation where it chooses method of
evaluation depending on whether q>0.5 or not.


Reply to this email directly or view it on GitHub
#5122 (comment).

@argriffing
Copy link
Contributor

Is this related to #1603?

@josef-pkt
Copy link
Member

Going to the limits is often difficult numerically as in #1603.
However, in this case we would need the right values just at the limit p=0, as in the original description.

The current ppf looks correct for small (but not tiny) p, but jumps to wrong values at p=0. (ignoring the segment of p in #1603)

@argriffing
Copy link
Contributor

@josef-pkt I vaguely recall that earlier you said the scipy.stats functions are not in general implemented correctly at limits. Is this still true? Are problems at these limits considered bugs or out of scope? If limits are allowed (e.g. p=0 or p=1 in a binomial distribution, sigma=0 in a normal distribution, etc.) then should some kind of meta-issue be opened for this?

@josef-pkt
Copy link
Member

@argriffing

First, specific to this issue: AFAICS, ppf returns wrong numbers and not np.nan, so it's currently not treated as out of domain. The fix is either to make it out of domain or return the correct number which is available in this case.

edit The difference to the case sigma=0 in the normal distribution is that here we already have a discrete distribution with probabilities and a well defined finite limit. When scale->0 in continuous distribtuion, then we don't have a standard continuous distribution anymore. (end edit)

In general about edge cases:

I would treat them case by case and improve (include boundaries) if possible and if it doesn't produce slow spaghetti code. scipy.special has improved a lot over the years, and also in corner cases in stats are now much better handled than they were before.

However, I wouldn't make any assumptions that any of the functions are well defined at or close to singular limits. For example in statsmodels (e.g. GLM and discrete) we clip values most of the times to stay away from the boundaries, or impose transformations that force values away from the boundaries. (#1603 doesn't worry me longer than half a minute)

A case that is now well handled by scipy but not yet used much in statsmodels is the 0log0 case or many similar cases. (or currently box-cox transformation where I didn't pay attention)

@argriffing
Copy link
Contributor

special.bdtrik(0, n, p) = 0 # for non-zero n and p

I'm not sure if it helps debugging, but this happens:

>>> bdtrik(0, 4, 1e-16)
0.0
>>> bdtrik(0, 4, 1e-20)
4.0
>>> bdtrik(0, 4, 0)
4.0

I guess this is related to that other issue.

@argriffing
Copy link
Contributor

I think it's hitting edge case bugs in https://github.com/scipy/scipy/blob/master/scipy/special/cdflib/cdfbin.f or https://github.com/scipy/scipy/blob/master/scipy/special/cdflib/dinvr.f, if anyone would like following GOTOs through unmaintained fortran code.

@argriffing
Copy link
Contributor

Here is a very inefficient example of how one might do that:

ppf = lambda x, n, p : (float(i) for i in xrange(n+1) if scipy.stats.binom.cdf(i,n,p) >= x).next()

Here I find the first value "i" such that the cdf is >= x.

Yes this is how scipy documents ppf of discrete distributions in http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#specific-points-for-discrete-distributions: ppf(q) = min{x : cdf(x) >= q, x integer}, except possibly for the lower bound of the search (I think returning -1 for ppf(0, n, 0) is deliberate?).

@abukaj
Copy link

abukaj commented May 16, 2017

It is almost second anniversary of the issue and the issue still haunts scipy (v. 0.19.0).

Maybe some kind of a monkey-path like 0 if (p, q) == (0, 0) else _ppf(q, n, p) will do it for now?

@mdhaber
Copy link
Contributor

mdhaber commented Dec 20, 2020

In the original post, the desired answer for x=1 is:

x=1 : scipy.stats.binom.ppf(x,n,0) == 0

Note that Boost and R (e.g. qbinom(1, 10, 0)) agree with SciPy here, returning n. Are they all wrong?

mckib2 added a commit to mckib2/scipy that referenced this issue Dec 20, 2020
mckib2 added a commit to mckib2/scipy that referenced this issue Jan 2, 2021
@rgommers rgommers added this to the 1.7.0 milestone May 8, 2021
@rgommers rgommers added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label May 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants