Negative p-value on hypergeom.cdf (Trac #1719) #2238

Closed
scipy-gitbot opened this Issue Apr 25, 2013 · 29 comments


Original ticket http://projects.scipy.org/scipy/ticket/1719 on 2012-08-16 by trac user lucapinello, assigned to unknown.

Hi!
I have noticed that not only is the precision of hypergeom.cdf not very good, but in some cases the value is even negative.

I compared the results with Matlab and R.

Two simple examples:

  1. Negative values:
    from scipy.stats import hypergeom

1-hypergeom.cdf(30,13397950,4363,12390)

-3.5282108346024188e-09

In both Matlab and R you obtain 0.

  2. Bad precision:

1-hypergeom.cdf(20,13397950,4363,9260)

2.0581985005208026e-09

In both Matlab and R you obtain 1.228906e-11

Thanks for this great project; I hope you can fix this bug soon.

Best,

Luca Pinello

@rgommers wrote on 2012-08-18

Still an issue in current master. Hypergeom recently got a _sf method with higher precision, but needs a _cdf or _logcdf method too.

Note that what you want is actually the survival function, which is way faster and gives correct results:

In [11]: hypergeom.sf(30,13397950,4363,12390)
Out[11]: 1.329942944239457e-17

In [12]: hypergeom.sf(20,13397950,4363,9260)
Out[12]: 1.2289025689799385e-11

Contributor

argriffing commented May 18, 2013

This can be fixed by adding hypergeom._cdf(...) = 1 - _sf(...) in distributions.py. I would change this and submit a PR, except that there's already a modification to distributions.py in the PR queue, and I'm not good enough with git to merge them smoothly.
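
A minimal sketch of that idea as a standalone helper (the function name is hypothetical; this is not the actual distributions.py patch):

from scipy.stats import hypergeom

def hypergeom_cdf_via_sf(k, M, n, N):
    # sketch: compute the CDF as the complement of the
    # higher-precision survival function
    return 1.0 - hypergeom.sf(k, M, n, N)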

Contributor

argriffing commented May 19, 2013

I tried this, and I was stymied by the requirement that the hypergeometric pmf(k, M, n, N) allows k to be a sequence. I wasn't aware of this aspect of the hypergeometric distribution interface, and it is causing my tests to fail.

Owner

rgommers commented May 20, 2013

Not sure what is failing for you exactly - the first parameter to most distribution methods is array_like, i.e. an ndarray or a sequence. k is also immediately converted to an array in pmf. Can you push the code you have now?
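
For illustration, a quick check that the public pmf accepts a sequence for k (conversion and broadcasting happen in the generic machinery; parameters taken from this issue):

from scipy import stats

# k given as a sequence works through the public method
print(stats.hypergeom.pmf([0, 1, 2], 13397950, 4363, 12390))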

Contributor

argriffing commented May 20, 2013

It should be possible to fix the bug using just cdf = 1 - sf, but I absent-mindedly made the changes on the master branch instead of a separate branch, and I also wasn't able to get the tests working. I'll let someone else handle this one; it should be very easy.

Contributor

argriffing commented May 20, 2013

Maybe @josef-pkt would be familiar enough with scipy.stats to do this as a one or two line fix?

Member

josef-pkt commented May 20, 2013

I don't think cdf = 1-sf is a good fix, because then we will run into precision problems on the left tail.

General recommendation to users: use sf if you want 1 - cdf in the right tail; in many cases it's more precise.

I think the best would be to add a _cdf that does the same loop as the new _sf, so we also get higher precision in the right tail.

However, I think this doesn't solve the problem in this issue if we are already running into precision problems coming from the other tail. The only way to increase precision would be to dispatch to either a cdf or an sf loop depending on whether we are closer to the left or the right tail.

Another however: I don't really understand where the precision problem is coming from. We are only summing over a small number of terms, and I don't see why _cdf should be larger than 1. There might be a bug somewhere:

>>> stats.hypergeom.pmf(30,13397950,4363,12390)
9.0186518528656726e-17
>>> stats.hypergeom.a
-13381197

.a is wrong, so we calculate a huge number of irrelevant terms. This is with scipy 0.9, which I have open right now.

Member

josef-pkt commented May 20, 2013

The .a is fixed in current master; I need to switch to a newer scipy.

Member

josef-pkt commented May 20, 2013

For reference: gh-2321 fixed the .a and the ppf loop.

Member

josef-pkt commented May 20, 2013

All pmf values for k larger than 30 are smaller than 1e-16, so there is no way to calculate the upper tail coming from the other side.

We are just adding 30 values, but the floating point error adds up:

>>> scipy.__version__
'0.12.0rc1'
>>> stats.hypergeom.pmf(30,13397950,4363,12390)
9.0186518528656726e-17
>>> stats.hypergeom.a
0
>>> stats.hypergeom.cdf(30,13397950,4363,12390)
1.0000000112708283

>>> stats.hypergeom.pmf(np.arange(0, 10, 1),13397950,4363,12390)
array([ 0.01764525,  0.07128356,  0.14394158,  0.19371231,  0.1954586 ,
        0.15772756,  0.10603382,  0.06108009,  0.03077714,  0.01378064])
>>> stats.hypergeom.pmf(np.arange(0, 30, 1),13397950,4363,12390).sum()
1.0000000112708283

The only solution I see is to choose between the _sf and the _cdf calculation depending on which tail gives better precision, maybe using the mean as a threshold. This is actually what the user is supposed to do by calling sf instead of calculating 1 - cdf.

>>> stats.hypergeom.stats(13397950,4363,12390)
(array(4.034764273638878), ...)
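
A minimal sketch of that dispatch idea for a scalar k (the helper name is hypothetical; this is not scipy code):

import numpy as np
from scipy import stats

def hypergeom_cdf_dispatch(k, M, n, N):
    # compute the shorter tail directly and complement the other,
    # using the mean N*n/M as the threshold
    if k < N * n / M:
        # left tail is short: sum the pmf terms up to k directly
        return stats.hypergeom.pmf(np.arange(np.floor(k) + 1), M, n, N).sum()
    # right tail is short: take the complement of the more precise sf
    return 1.0 - stats.hypergeom.sf(k, M, n, N)
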
Member

josef-pkt commented May 20, 2013

There is also the problem that we don't truncate the returned values for cdf and sf to be within the interval [0, 1].
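
That truncation is just a clip at the probability bounds; a sketch using the out-of-range values seen in this thread (not the eventual patch):

import numpy as np

raw = np.array([1.0000000112708283, -3.5282108346024188e-09])
print(np.clip(raw, 0.0, 1.0))  # clipped to [0, 1]: gives 1.0 and 0.0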

Member

josef-pkt commented May 20, 2013

And here's the (almost) symmetric test case, moving the problem into the other tail:

>>> stats.hypergeom.cdf(12390 - 30, 13397950, 13397950-4363, 12390)
1.0348594866567994e-16
>>> stats.hypergeom.sf(12390 - 30, 13397950, 13397950-4363, 12390)
1.0000000060954974

There are small numerical differences from the previous case that I can't account for. (In addition to the numerical problems, I might be off by one for cdf versus sf, i.e. weak versus strict inequalities.)

>>> stats.hypergeom.pmf(12390 - np.arange(0, 30, 1),13397950,13397950-4363,12390).sum()
1.000000006095497

Contributor

argriffing commented May 20, 2013

@josef-pkt OK, I see that this is probably more complicated than I had appreciated. Thanks for your reply!

My thinking had been that the more sophisticated integral in sf could replace the naive quad integral used by cdf and that this would fix the issue, but now I see it could be more complicated than that. When I try to fix the bug myself, I get a combination of precision errors and plumbing errors that are beyond my scipy.stats knowledge to debug. For example, I don't know what .a and .b are, and I don't know why copy-pasting the _sf integral as a new _cdf integral (with appropriate modifications) should give zip errors when I keep the same .asarray() calls. I assume this is because _cdf and _sf have subtly different interfaces even though they are conceptually symmetric with each other.

In the current scipy master branch, I can do

>>> 1 - (1 - scipy.stats.hypergeom.sf(30, 13397950, 4363, 12390))
0.0
>>> 1 - (1 - scipy.stats.hypergeom.sf(20, 13397950, 4363, 9260))
1.2289058659575858e-11

which had led me to believe that just replacing cdf by 1-sf would be a non-ideal but sufficient patch, but now I guess that this would cause regressions in other parts of the distribution.

Member

josef-pkt commented May 20, 2013

Second case: the generic cdf calculation is using all relevant terms in the sum; there is no truncation of the number of terms in this case.

This looks like pure precision error in the underlying function, the calculation of the pmf:

>>> stats.hypergeom.cdf(20,13397950,4363,9260)
0.99999998823069003
>>> stats.hypergeom.pmf(np.arange(20+1),13397950,4363,9260).sum()
0.99999998823069003

>>> 1 - stats.hypergeom.pmf(np.arange(20+1),13397950,4363,9260).sum()
1.176930997104364e-08
>>> stats.hypergeom.pmf(np.arange(20, 9260+1),13397950,4363,9260).sum()
8.6715410757106549e-11
Member

josef-pkt commented May 20, 2013

@argriffing

some background:

hypergeom is a discrete distribution, so we add up the pmf values in a sum to get sf or cdf, instead of using integrate.quad as for the continuous distributions.

.a and .b are the bounds of the support of the distribution; outside of them the probability is zero, and we can start the sum at either .a or .b (if we want the full sum for sf or cdf).

"copy-pasting the _sf integral as a new _cdf integral (with appropriate modifications)" should work; I don't see/remember what would be different.

I thought the following might explain your zip error, but I don't think it's relevant for what you did:

The generic functions like sf do argument checks and (relevant here) broadcasting of the arguments, so hypergeom._sf is written under the assumption that all arguments have the same shape. If you call hypergeom._sf directly, it doesn't know how to handle differently shaped arguments. In most other distributions and methods we don't have loops, and there we get the usual numpy broadcasting behavior even when we call the private methods directly.
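
A quick illustration of that difference; np.broadcast_arrays shows roughly what the generic machinery does before a private method is called (simplified, not scipy internals verbatim):

import numpy as np
from scipy import stats

# public method: array_like arguments are broadcast automatically
print(stats.hypergeom.sf([20, 30], 13397950, 4363, [9260, 12390]))

# roughly what happens before the private _sf sees the arguments
k, M, n, N = np.broadcast_arrays([20, 30], 13397950, 4363, [9260, 12390])
print(k.shape, M.shape, n.shape, N.shape)  # all (2,)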

Contributor

argriffing commented May 20, 2013

The CellProfiler project has some functions

pochdivgamma(a, b, iterations)
hyper3F2regularizedZ1(a1, a2, a3, b1, b2)
pochdivpoch(a, b, iterations)
pochdivpochgen(a, b)
hyper3F2Z1(a1, a2, a3, b1, b2, tol=1e-15)
hyp2f1mine(a, b, c)
hyper3F2aZ1(a1, a2, a3, b2, tol=1e-10)

given

from scipy.special import gamma, hyp2f1, gammaln

which might help with the hypergeometric distribution pdf and cdf, but CellProfiler is GPL...

Owner

rgommers commented May 20, 2013

CellProfiler code has previously been relicensed for scikit-image. If you really want to use some of their code (I haven't checked the above in detail), ask @thouis.

Contributor

thouis commented May 22, 2013

It's fine with me. I wrote that code, but I'm not on the project any more. @ljosa could probably change the license for just that file, or extract it and tack on a new license header.

ljosa commented Aug 21, 2013

I have changed the license of the files cpa/hypergeom.py and cpa/dirichletintegrate.py in CellProfiler-Analyst from GPLv2 to BSD.

If you incorporate enough of it in SciPy that we no longer need to maintain our own version of these functions, please let me know.

Sorry for taking so long to get this done. I hope it is not too late to be useful.

Member

josef-pkt commented Aug 22, 2013

@argriffing I have no idea how these functions help for stats.hypergeom. I'm no expert in special functions. I only use them if they are already used in code, or if I have a reference that uses a specific special function.

Owner

rgommers commented Aug 22, 2013

@ljosa thanks for doing that!

Contributor

argriffing commented Aug 22, 2013

I'll try to look at this today. My thought was that, according to Wikipedia (http://en.wikipedia.org/wiki/Hypergeometric_distribution), the closed form of the hypergeometric survival function is a regularized 3F2 hypergeometric function with z=1, and because CellProfiler-Analyst provides a regularized 3F2 hypergeometric function with z=1, there might be some overlap.

Owner

pv commented Aug 22, 2013

3F2 for z=1 has a closed-form expression: http://functions.wolfram.com/HypergeometricFunctions/Hypergeometric3F2/03/02/01/0002/

EDIT: given Clebsch-Gordan coefficients :)
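
An aside not from this thread: the arbitrary-precision library mpmath can evaluate 3F2 at z=1 directly, which gives an independent way to sanity-check any closed-form route (the parameter values here are arbitrary):

import mpmath

mpmath.mp.dps = 30  # work with 30 decimal digits
# 3F2(1, 2, 3; 4.5, 5.5; 1) converges at z=1 since b1 + b2 > a1 + a2 + a3
print(mpmath.hyp3f2(1, 2, 3, 4.5, 5.5, 1))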

Contributor

argriffing commented Aug 22, 2013

I looked at this a bit, and it seems not to deal with the sizes of numbers that are causing problems with the hypergeometric cdf. The domain of 3F2(a1,a2,a3;b1,b2;1) that is handled most carefully by the CPA code requires b1 = a1+1, which is not what we want for the hypergeometric cdf. I started checking http://functions.wolfram.com/HypergeometricFunctions/Hypergeometric3F2/17/02/06/ for relevant transformations, but I gave up.

Contributor

thouis commented Sep 6, 2013

@argriffing - I don't think the hyper3F2Z1 function in the CPA code requires b1 = a1 + 1, it just assumes that every term in the numerator has a larger term in the denominator so that all of the series from pochdivpochgen() converge.

Contributor

argriffing commented Sep 6, 2013

@thouis yes, I was referring to the hyper3F2aZ1 function, whose docstring says "same has hyper3F2Z1 but with b1 = a1+1". I tried hyper3F2Z1 for the hypergeometric cdf/sf, but it couldn't deal with the size of the parameter values that caused someone to open this scipy issue. The hyper3F2aZ1 function treats the domain more carefully, but it has that extra b1 = a1+1 restriction.

Member

ev-br commented Feb 14, 2014

The immediate problem with hypergeom has been fixed by clipping in #3211. With the current master branch I get:

>>> 1-stats.hypergeom.cdf(30,13397950,4363,12390)
0.0
>>> stats.hypergeom.sf(12390 - 30, 13397950, 13397950-4363, 12390)
0.99999999835287956
>>> stats.hypergeom.pmf(12390 - np.arange(0, 30, 1),13397950,13397950-4363,12390).sum()
0.99999999835287956
>>> print scipy.__version__
0.14.0.dev-9812e4c

Owner

rgommers commented Feb 16, 2014

I get slightly different values for some reason (with numpy 1.5.1); I didn't check why:

In [2]: stats.hypergeom.cdf(30,13397950,4363,12390)
Out[2]: 0.99999998798961176

In [3]: 1 - stats.hypergeom.cdf(30,13397950,4363,12390)
Out[3]: 1.2010388239502845e-08

In [4]: stats.hypergeom.sf(12390 - 30, 13397950, 13397950-4363, 12390)
Out[4]: 0.99999998281428071

In [5]: stats.hypergeom.pmf(12390 - np.arange(0, 30, 1),13397950,13397950-4363,12390).sum()
Out[5]: 0.99999998281428071

In [6]: scipy.__version__
Out[6]: '0.14.0.dev-fefae50'

But can this now be closed or is there something left to do?

Member

josef-pkt commented Feb 16, 2014

I don't see anything left to do.

@josef-pkt josef-pkt closed this Feb 16, 2014

@rgommers rgommers added this to the 0.14.0 milestone Feb 16, 2014
