Negative p-value on hypergeom.cdf (Trac #1719) #2238

Closed
scipy-gitbot opened this Issue Apr 25, 2013 · 29 comments


Original ticket http://projects.scipy.org/scipy/ticket/1719 on 2012-08-16 by trac user lucapinello, assigned to unknown.

Hi!
I have noticed that not only is the precision of hypergeom.cdf not very good, but in some cases the value is even negative.

I compared the results with Matlab and R.

Two simple examples:

  1. Negative values:
    from scipy.stats import hypergeom

1-hypergeom.cdf(30,13397950,4363,12390)

-3.5282108346024188e-09

In both Matlab and R you obtain 0.

  2. Bad precision:

1-hypergeom.cdf(20,13397950,4363,9260)

2.0581985005208026e-09

In both Matlab and R you obtain 1.228906e-11

Thanks for this great project; I hope you can fix this bug soon.

Best,

Luca Pinello

@rgommers wrote on 2012-08-18

Still an issue in current master. Hypergeom recently got a _sf method with higher precision, but needs a _cdf or _logcdf method too.

Note that what you want is actually the survival function, which is way faster and gives correct results:

In [11]: hypergeom.sf(30,13397950,4363,12390)
Out[11]: 1.329942944239457e-17

In [12]: hypergeom.sf(20,13397950,4363,9260)
Out[12]: 1.2289025689799385e-11

Contributor

argriffing commented May 18, 2013

This can be fixed by adding hypergeom._cdf(...) = 1 - _sf(...) in distributions.py. I would change this and submit a PR, except that there's already a modification to distributions.py in the PR queue, and I'm not good enough with git to merge them smoothly.
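
A minimal sketch of that idea as a standalone helper (the function name is hypothetical; this is not the actual distributions.py patch):

from scipy.stats import hypergeom

def hypergeom_cdf_via_sf(k, M, n, N):
    # sketch: compute the CDF as the complement of the
    # higher-precision survival function
    return 1.0 - hypergeom.sf(k, M, n, N)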

Contributor

argriffing commented May 19, 2013

I tried this, and I was stymied by the requirement that the hypergeometric pmf(k, M, n, N) allows k to be a sequence. I wasn't aware of this aspect of the hypergeometric distribution interface, and it is causing my tests to fail.

Owner

rgommers commented May 20, 2013

Not sure what is failing for you exactly - the first parameter to most distribution methods is array_like, i.e. an ndarray or a sequence. k is also immediately converted to an array in pmf. Can you push the code you have now?
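
For illustration, a quick check that the public pmf accepts a sequence for k (conversion and broadcasting happen in the generic machinery; parameters taken from this issue):

from scipy import stats

# k given as a sequence works through the public method
print(stats.hypergeom.pmf([0, 1, 2], 13397950, 4363, 12390))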

Contributor

argriffing commented May 20, 2013

It should be possible to fix the bug using just cdf = 1 - sf, but I absent-mindedly made the changes on the master branch instead of a separate branch, and I also wasn't able to get the tests working. I'll let someone else handle this one; it should be very easy.

Contributor

argriffing commented May 20, 2013

Maybe @josef-pkt would be familiar enough with scipy.stats to do this as a one or two line fix?

Member

josef-pkt commented May 20, 2013

I don't think cdf = 1-sf is a good fix, because then we will run into precision problems on the left tail.

General recommendation to users: use sf if you want 1 - cdf in the right tail; in many cases it's more precise.

I think the best would be to add a _cdf that does the same loop as the new _sf, so we also get higher precision in the right tail.

However, I think this doesn't solve the problem in this issue if we are already running into precision problems coming from the other tail. The only way to increase precision would be to dispatch to either a cdf or an sf loop depending on whether we are closer to the left or the right tail.

Another however: I don't really understand where the precision problem is coming from. We are only summing over a small number of terms, and I don't see why _cdf should be larger than 1. There might be a bug somewhere:

>>> stats.hypergeom.pmf(30,13397950,4363,12390)
9.0186518528656726e-17
>>> stats.hypergeom.a
-13381197

.a is wrong, so we calculate a huge number of irrelevant terms. This is with scipy 0.9, which I have open right now.

Member

josef-pkt commented May 20, 2013

The .a is fixed in current master; I need to switch to a newer scipy.

Member

josef-pkt commented May 20, 2013

For reference: gh-2321 fixed the .a and the ppf loop.

Member

josef-pkt commented May 20, 2013

All pmf values for k larger than 30 are smaller than 1e-16, so there is no way to calculate the upper tail coming from the other side.

We are just adding 30 values, but the floating point error adds up:

>>> scipy.__version__
'0.12.0rc1'
>>> stats.hypergeom.pmf(30,13397950,4363,12390)
9.0186518528656726e-17
>>> stats.hypergeom.a
0
>>> stats.hypergeom.cdf(30,13397950,4363,12390)
1.0000000112708283

>>> stats.hypergeom.pmf(np.arange(0, 10, 1),13397950,4363,12390)
array([ 0.01764525,  0.07128356,  0.14394158,  0.19371231,  0.1954586 ,
        0.15772756,  0.10603382,  0.06108009,  0.03077714,  0.01378064])
>>> stats.hypergeom.pmf(np.arange(0, 30, 1),13397950,4363,12390).sum()
1.0000000112708283

The only solution I see is to choose between the _sf and the _cdf calculation depending on which tail gives better precision, maybe using the mean as a threshold. This is actually what the user is supposed to do by calling sf instead of calculating 1 - cdf.

>>> stats.hypergeom.stats(13397950,4363,12390)
(array(4.034764273638878), ...)
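
A minimal sketch of that dispatch idea for a scalar k (the helper name is hypothetical; this is not scipy code):

import numpy as np
from scipy import stats

def hypergeom_cdf_dispatch(k, M, n, N):
    # compute the shorter tail directly and complement the other,
    # using the mean N*n/M as the threshold
    if k < N * n / M:
        # left tail is short: sum the pmf terms up to k directly
        return stats.hypergeom.pmf(np.arange(np.floor(k) + 1), M, n, N).sum()
    # right tail is short: take the complement of the more precise sf
    return 1.0 - stats.hypergeom.sf(k, M, n, N)
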
Member

josef-pkt commented May 20, 2013

There is also the problem that we don't truncate the returned values for cdf and sf to be within the interval [0, 1].
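
That truncation is just a clip at the probability bounds; a sketch using the out-of-range values seen in this thread (not the eventual patch):

import numpy as np

raw = np.array([1.0000000112708283, -3.5282108346024188e-09])
print(np.clip(raw, 0.0, 1.0))  # clipped to [0, 1]: gives 1.0 and 0.0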

Member

josef-pkt commented May 20, 2013

And here's the (almost) symmetric test case, moving the problem into the other tail:

>>> stats.hypergeom.cdf(12390 - 30, 13397950, 13397950-4363, 12390)
1.0348594866567994e-16
>>> stats.hypergeom.sf(12390 - 30, 13397950, 13397950-4363, 12390)
1.0000000060954974

There are small numerical differences from the previous case that I can't account for. (In addition to the numerical problems, I might be off by one for cdf versus sf, i.e. weak versus strict inequalities.)

>>> stats.hypergeom.pmf(12390 - np.arange(0, 30, 1),13397950,13397950-4363,12390).sum()
1.000000006095497

Contributor

argriffing commented May 20, 2013

@josef-pkt OK, I see that this is probably more complicated than I had appreciated. Thanks for your reply!

My thinking had been that the more sophisticated integral in sf could replace the naive quad integral used by cdf and that this would fix the issue, but now I see it could be more complicated than that. When I try to fix the bug myself, I get a combination of precision errors and plumbing errors that are beyond my scipy.stats knowledge to debug. For example, I don't know what .a and .b are, and I don't know why copy-pasting the _sf integral as a new _cdf integral (with appropriate modifications) should give zip errors when I keep the same .asarray() calls. I assume this is because _cdf and _sf have subtly different interfaces even though they are conceptually symmetric with each other.

In the current scipy master branch, I can do

>>> 1 - (1 - scipy.stats.hypergeom.sf(30, 13397950, 4363, 12390))
0.0
>>> 1 - (1 - scipy.stats.hypergeom.sf(20, 13397950, 4363, 9260))
1.2289058659575858e-11

which had led me to believe that just replacing cdf by 1-sf would be a non-ideal but sufficient patch, but now I guess that this would cause regressions in other parts of the distribution.

Member

josef-pkt commented May 20, 2013

Second case: the generic cdf calculation is using all relevant terms in the sum; there is no truncation of the number of terms in this case.

This looks like pure precision error in the underlying function, the calculation of the pmf:

>>> stats.hypergeom.cdf(20,13397950,4363,9260)
0.99999998823069003
>>> stats.hypergeom.pmf(np.arange(20+1),13397950,4363,9260).sum()
0.99999998823069003

>>> 1 - stats.hypergeom.pmf(np.arange(20+1),13397950,4363,9260).sum()
1.176930997104364e-08
>>> stats.hypergeom.pmf(np.arange(20, 9260+1),13397950,4363,9260).sum()
8.6715410757106549e-11
Member

josef-pkt commented May 20, 2013

@argriffing

some background:

hypergeom is a discrete distribution, so we add up the pmf values in a sum to get sf or cdf, instead of using integrate.quad as for the continuous distributions.

.a and .b are the bounds of the support of the distribution; outside of them the probability is zero, and we can start the sum at either .a or .b (if we want the full sum for sf or cdf).

"copy-pasting the _sf integral as a new _cdf integral (with appropriate modifications)" should work; I don't see/remember what would be different.

I thought the following might explain your zip error, but I don't think it's relevant for what you did:

The generic functions like sf do argument checks and (relevant here) broadcasting of the arguments, so hypergeom._sf is written under the assumption that all arguments have the same shape. If you call hypergeom._sf directly, it doesn't know how to handle differently shaped arguments. In most other distributions and methods we don't have loops, and there we get the usual numpy broadcasting behavior even when we call the private methods directly.
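
A quick illustration of that difference; np.broadcast_arrays shows roughly what the generic machinery does before a private method is called (simplified, not scipy internals verbatim):

import numpy as np
from scipy import stats

# public method: array_like arguments are broadcast automatically
print(stats.hypergeom.sf([20, 30], 13397950, 4363, [9260, 12390]))

# roughly what happens before the private _sf sees the arguments
k, M, n, N = np.broadcast_arrays([20, 30], 13397950, 4363, [9260, 12390])
print(k.shape, M.shape, n.shape, N.shape)  # all (2,)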

Contributor

argriffing commented May 20, 2013

The CellProfiler project has some functions

pochdivgamma(a, b, iterations)
hyper3F2regularizedZ1(a1, a2, a3, b1, b2)
pochdivpoch(a, b, iterations)
pochdivpochgen(a, b)
hyper3F2Z1(a1, a2, a3, b1, b2, tol=1e-15)
hyp2f1mine(a, b, c)
hyper3F2aZ1(a1, a2, a3, b2, tol=1e-10)

given

from scipy.special import gamma, hyp2f1, gammaln

which might help with the hypergeometric distribution pdf and cdf, but CellProfiler is GPL...

Owner

rgommers commented May 20, 2013

CellProfiler code has previously been relicensed for scikit-image. If you really want to use some of their code (I haven't checked the above in detail), ask @thouis.

Contributor

thouis commented May 22, 2013

It's fine with me. I wrote that code, but I'm not on the project any more. @ljosa could probably change the license for just that file, or extract it and tack on a new license header.

ljosa commented Aug 21, 2013

I have changed the license of the files cpa/hypergeom.py and cpa/dirichletintegrate.py in CellProfiler-Analyst from GPLv2 to BSD.

If you incorporate enough of it in SciPy that we no longer need to maintain our own version of these functions, please let me know.

Sorry for taking so long to get this done. I hope it is not too late to be useful.

Member

josef-pkt commented Aug 22, 2013

@argriffing I have no idea how these functions help for stats.hypergeom. I'm no expert in special functions. I only use them if they are already used in code, or if I have a reference that uses a specific special function.

Owner

rgommers commented Aug 22, 2013

@ljosa thanks for doing that!

Contributor

argriffing commented Aug 22, 2013

I'll try to look at this today. My thought was that, according to Wikipedia (http://en.wikipedia.org/wiki/Hypergeometric_distribution), the closed form of the hypergeometric survival function is a regularized 3F2 hypergeometric function with z=1, and because CellProfiler-Analyst provides a regularized 3F2 hypergeometric function with z=1, there might be some overlap.

Owner

pv commented Aug 22, 2013

3F2 for z=1 has a closed-form expression: http://functions.wolfram.com/HypergeometricFunctions/Hypergeometric3F2/03/02/01/0002/

EDIT: given Clebsch-Gordan coefficients :)
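
An aside not from this thread: the arbitrary-precision library mpmath can evaluate 3F2 at z=1 directly, which gives an independent way to sanity-check any closed-form route (the parameter values here are arbitrary):

import mpmath

mpmath.mp.dps = 30  # work with 30 decimal digits
# 3F2(1, 2, 3; 4.5, 5.5; 1) converges at z=1 since b1 + b2 > a1 + a2 + a3
print(mpmath.hyp3f2(1, 2, 3, 4.5, 5.5, 1))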

Contributor

argriffing commented Aug 22, 2013

I looked at this a bit, and it seems not to deal with the sizes of numbers that are causing problems with the hypergeometric cdf. The domain of 3F2(a1,a2,a3;b1,b2;1) that is handled most carefully by the CPA code requires b1 = a1+1, which is not what we want for the hypergeometric cdf. I started checking http://functions.wolfram.com/HypergeometricFunctions/Hypergeometric3F2/17/02/06/ for relevant transformations, but I gave up.

Contributor

thouis commented Sep 6, 2013

@argriffing - I don't think the hyper3F2Z1 function in the CPA code requires b1 = a1 + 1, it just assumes that every term in the numerator has a larger term in the denominator so that all of the series from pochdivpochgen() converge.

Contributor

argriffing commented Sep 6, 2013

@thouis yes, I was referring to the hyper3F2aZ1 function, whose docstring says "same has hyper3F2Z1 but with b1 = a1+1". I tried hyper3F2Z1 for the hypergeometric cdf/sf, but it couldn't deal with the size of the parameter values that caused someone to open this scipy issue. The hyper3F2aZ1 function treats the domain more carefully, but it has that extra b1 = a1+1 restriction.

Member

ev-br commented Feb 14, 2014

The immediate problem with hypergeom has been fixed by clipping in #3211. With the current master branch I get:

>>> 1-stats.hypergeom.cdf(30,13397950,4363,12390)
0.0
>>> stats.hypergeom.sf(12390 - 30, 13397950, 13397950-4363, 12390)
0.99999999835287956
>>> stats.hypergeom.pmf(12390 - np.arange(0, 30, 1),13397950,13397950-4363,12390).sum()
0.99999999835287956
>>> print scipy.__version__
0.14.0.dev-9812e4c

Owner

rgommers commented Feb 16, 2014

I get slightly different values for some reason (with numpy 1.5.1); I didn't check why:

In [2]: stats.hypergeom.cdf(30,13397950,4363,12390)
Out[2]: 0.99999998798961176

In [3]: 1 - stats.hypergeom.cdf(30,13397950,4363,12390)
Out[3]: 1.2010388239502845e-08

In [4]: stats.hypergeom.sf(12390 - 30, 13397950, 13397950-4363, 12390)
Out[4]: 0.99999998281428071

In [5]: stats.hypergeom.pmf(12390 - np.arange(0, 30, 1),13397950,13397950-4363,12390).sum()
Out[5]: 0.99999998281428071

In [6]: scipy.__version__
Out[6]: '0.14.0.dev-fefae50'

But can this now be closed or is there something left to do?

Member

josef-pkt commented Feb 16, 2014

I don't see anything left to do.

@josef-pkt josef-pkt closed this Feb 16, 2014

@rgommers rgommers added this to the 0.14.0 milestone Feb 16, 2014
