Negative p-value on hypergeom.cdf (Trac #1719) #2238

Closed
scipy-gitbot opened this Issue · 29 comments

8 participants

@scipy-gitbot

Original ticket http://projects.scipy.org/scipy/ticket/1719 on 2012-08-16 by trac user lucapinello, assigned to unknown.

Hi!
I have noticed that not only is the precision of hypergeom.cdf not very good, but in some cases the value is even negative.

I compared the results with Matlab and R.

Two simple examples:

1) Negative values:
from scipy.stats import hypergeom

1-hypergeom.cdf(30,13397950,4363,12390)

-3.5282108346024188e-09

In both Matlab and R you obtain 0.

2) Bad precision:

1-hypergeom.cdf(20,13397950,4363,9260)

2.0581985005208026e-09

In both Matlab and R you obtain 1.228906e-11.

Thanks for this great project, I hope you can fix this bug soon.

Best,

Luca Pinello

@scipy-gitbot

@rgommers wrote on 2012-08-18

Still an issue in current master. Hypergeom recently got a _sf method with higher precision, but needs a _cdf or _logcdf method too.

Note that what you want is actually the survival function, which is way faster and gives correct results:

In [11]: hypergeom.sf(30,13397950,4363,12390)
Out[11]: 1.329942944239457e-17

In [12]: hypergeom.sf(20,13397950,4363,9260)
Out[12]: 1.2289025689799385e-11
@argriffing
Collaborator

This can be fixed by adding hypergeom _cdf(...) = 1 - _sf(...) in distributions.py. I would change this and submit a PR, except that there's already a modification to distributions.py in the PR queue and I'm not good enough with git to merge them smoothly.
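At the public-API level, the proposed wiring amounts to something like the following sketch (a hypothetical helper, not the actual distributions.py change; it assumes the higher-precision sf from master):

```python
from scipy.stats import hypergeom

def cdf_via_sf(k, M, n, N):
    """Hypothetical helper: compute the hypergeometric cdf as 1 - sf,
    reusing the higher-precision survival function."""
    return 1.0 - hypergeom.sf(k, M, n, N)
```

This fixes the right tail, at the cost of precision in the left tail where 1 - sf cancels catastrophically.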

@argriffing
Collaborator

I tried this, and I was stymied by the requirement that the hypergeometric pmf(k, M, n, N) allows k to be a sequence. I wasn't aware of this aspect of the hypergeometric distribution interface, and it is causing my tests to fail.

@rgommers
Owner

Not sure what is failing for you exactly - the first parameter to most distribution methods is array_like, i.e. an ndarray or a sequence. k is also immediately converted to an array in pmf. Can you push the code you have now?

@argriffing
Collaborator

It should be possible to fix the bug using just cdf = 1-sf but I absent-mindedly made the changes to the master branch instead of a separate branch, and I also wasn't able to get the tests working. I'll let someone else handle this one, it should be very easy.

@argriffing
Collaborator

Maybe @josef-pkt would be familiar enough with scipy.stats to do this as a one or two line fix?

@josef-pkt
Collaborator

I don't think cdf = 1 - sf is a good fix, because then we will run into precision problems in the left tail.

General recommendation to users: use sf if you want 1 - cdf in the right tail; in many cases it is more precise.

I think the best would be to add a _cdf that does the same loop as the new _sf, so we also get higher precision in the right tail.

However, I think this doesn't solve the problem in this issue if we are already running into precision problems coming from the other tail. The only way to increase precision would be to dispatch to either the cdf or the sf loop depending on whether we are closer to the left or the right tail.
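That dispatch can be sketched in pure Python; exact integer binomials via math.comb stand in for scipy's pmf loop, and the helper name is hypothetical:

```python
from math import comb

def hypergeom_cdf(k, M, n, N):
    # Hypothetical helper (not scipy's implementation): sum pmf terms from
    # whichever tail is shorter, using the mean as the threshold. With exact
    # integer binomials, the single final division is the only rounding step.
    C = comb(M, N)
    lo, hi = max(0, N - (M - n)), min(n, N)   # support bounds
    if k < N * n / M:
        # left of the mean: sum P(X <= k) directly
        return sum(comb(n, j) * comb(M - n, N - j) for j in range(lo, k + 1)) / C
    # right of the mean: sum the survival function and subtract
    return 1.0 - sum(comb(n, j) * comb(M - n, N - j) for j in range(k + 1, hi + 1)) / C
```

Summing from the shorter tail keeps the accumulated error proportional to the tail mass, which is exactly the property sf already has.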

However (a separate point), I don't really understand where the precision problem is coming from. We are only summing a small number of terms, and I don't see why _cdf should be larger than 1. There might be a bug somewhere.

>>> stats.hypergeom.pmf(30,13397950,4363,12390)
9.0186518528656726e-17
>>> stats.hypergeom.a
-13381197

.a is wrong, so we calculate a huge number of irrelevant terms. (This is with scipy 0.9, which I have open right now.)

@josef-pkt
Collaborator

The .a is fixed in current master. I need to switch to a newer scipy.

@josef-pkt
Collaborator

For reference: gh-2321 fixed the .a and the ppf loop.

@josef-pkt
Collaborator

All pmf values for k larger than 30 are smaller than 1e-16, so there is no way to calculate the upper tail coming from the other side.

We are just adding ~30 values, but the floating-point error adds up:

>>> scipy.__version__
'0.12.0rc1'
>>> stats.hypergeom.pmf(30,13397950,4363,12390)
9.0186518528656726e-17
>>> stats.hypergeom.a
0
>>> stats.hypergeom.cdf(30,13397950,4363,12390)
1.0000000112708283

>>> stats.hypergeom.pmf(np.arange(0, 10, 1),13397950,4363,12390)
array([ 0.01764525,  0.07128356,  0.14394158,  0.19371231,  0.1954586 ,
        0.15772756,  0.10603382,  0.06108009,  0.03077714,  0.01378064])
>>> stats.hypergeom.pmf(np.arange(0, 30, 1),13397950,4363,12390).sum()
1.0000000112708283
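The floating-point explanation can be confirmed in exact rational arithmetic with only the standard library (a sketch, independent of scipy):

```python
from fractions import Fraction
from math import comb

# Parameters from the example above: hypergeom(M=13397950, n=4363, N=12390)
M, n, N = 13397950, 4363, 12390
den = comb(M, N)
# exact numerator of P(X <= 30): every term is an exact integer
num = sum(comb(n, k) * comb(M - n, N - k) for k in range(31))
tail = Fraction(den - num, den)   # exact P(X > 30)
# float(tail) agrees with hypergeom.sf(30, ...) quoted above (~1.3e-17),
# so the pmf terms are fine and cdf > 1 comes purely from float accumulation
```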

The only solution I see is to choose between the _sf and the _cdf calculation depending on which tail gives better precision, maybe using the mean as the threshold. This is actually what the user is supposed to do by calling sf instead of computing 1 - cdf.

>>> stats.hypergeom.stats(13397950,4363,12390)
(array(4.034764273638878)
@josef-pkt
Collaborator

There is also the problem that we don't clip the returned values of cdf and sf to the interval [0, 1].
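A minimal sketch of such a clamp with NumPy (hypothetical post-processing, not the actual scipy change):

```python
import numpy as np

# Values like the ones in this thread: a slightly negative 1 - cdf and a
# cdf that overshoots 1.0 can simply be clamped to the valid range.
vals = np.array([-3.5282108346024188e-09, 0.5, 1.0000000112708283])
clamped = np.clip(vals, 0.0, 1.0)   # -> [0.0, 0.5, 1.0]
```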

@josef-pkt
Collaborator

and here's the (almost) symmetric test case, moving the problem into the other tail

>>> stats.hypergeom.cdf(12390 - 30, 13397950, 13397950-4363, 12390)
1.0348594866567994e-16
>>> stats.hypergeom.sf(12390 - 30, 13397950, 13397950-4363, 12390)
1.0000000060954974

There are small numerical differences from the previous case that I can't trace.
(In addition to the numerical problems, I might be off by one for cdf versus sf, i.e. weak versus strict inequalities.)

>>> stats.hypergeom.pmf(12390 - np.arange(0, 30, 1),13397950,13397950-4363,12390).sum()
1.000000006095497
@argriffing
Collaborator

@josef-pkt OK, I see that this is probably more complicated than I had appreciated. Thanks for your reply!

My thinking had been that the more sophisticated integral in sf could replace the naive quad integral used by cdf and this would fix the issue, but now I see that it could be more complicated than this. When I try to fix the bug myself, I get a combination of precision errors and plumbing errors which are beyond my scipy.stats knowledge to debug. For example, I don't know what .a and .b are, and I don't know why copypasting the _sf integral as a new _cdf integral (with appropriate modifications) should give zip errors when I keep the same .asarray() calls. I assume that this is because _cdf and _sf have subtly different interfaces even though they are conceptually symmetric with each other.

In the current scipy master branch, I can do

>>> 1 - (1 - scipy.stats.hypergeom.sf(30, 13397950, 4363, 12390))
0.0
>>> 1 - (1 - scipy.stats.hypergeom.sf(20, 13397950, 4363, 9260))
1.2289058659575858e-11

which had led me to believe that just replacing cdf by 1-sf would be a non-ideal but sufficient patch, but now I guess that this would cause regressions in other parts of the distribution.

@josef-pkt
Collaborator

Second case: the generic cdf calculation uses all relevant terms in the sum; there is no truncation of the number of terms here.

It looks like pure precision error in the underlying pmf calculation:

>>> stats.hypergeom.cdf(20,13397950,4363,9260)
0.99999998823069003
>>> stats.hypergeom.pmf(np.arange(20+1),13397950,4363,9260).sum()
0.99999998823069003

>>> 1 - stats.hypergeom.pmf(np.arange(20+1),13397950,4363,9260).sum()
1.176930997104364e-08
>>> stats.hypergeom.pmf(np.arange(20, 9260+1),13397950,4363,9260).sum()
8.6715410757106549e-11
@josef-pkt
Collaborator

@argriffing

some background:

hypergeom is a discrete distribution, so to get sf or cdf we add up pmf values in a sum instead of using integrate.quad as for the continuous distributions.

.a and .b are the bounds of the support of the distribution; outside of them the probability is zero, so we can start the sum at either .a or .b (if we want the full sum for sf or cdf).
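For scipy's hypergeom(M, n, N) parameterization, those bounds can be sketched as follows (hypothetical helper, matching the values discussed above):

```python
from math import comb

def support(M, n, N):
    """Support [a, b] of hypergeom(M, n, N): at most min(n, N) successes can
    be drawn, and at least N - (M - n) once the failures are exhausted."""
    return max(0, N - (M - n)), min(n, N)

a, b = support(13397950, 4363, 12390)   # (0, 4363) for the example above
# without the max(0, ...) clamp this would give 12390 - 13393587 =
# -13381197, exactly the buggy .a reported for scipy 0.9 earlier

# summing the pmf over [a, b] exhausts the probability mass (Vandermonde);
# small example with a nonzero lower bound:
M, n, N = 50, 13, 45
a2, b2 = support(M, n, N)               # (8, 13)
total = sum(comb(n, k) * comb(M - n, N - k) for k in range(a2, b2 + 1))
```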

"copypasting the _sf integral as a new _cdf integral (with appropriate modifications)" should work.
I don't see/remember what would be different

I thought the following might explain your zip error, but I don't think it's relevant for what you did:

The generic functions like sf do argument checks and (relevant here) broadcasting of the arguments. So hypergeom._sf is written under the assumption that all arguments have the same shape.
If you call hypergeom._sf directly, then it doesn't know how to handle different shaped arguments. In most other distributions and methods, we don't have loops, and there we get the usual numpy broadcasting behavior also when we call the private methods directly.
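In NumPy terms, what the public methods do before calling the private loop is roughly this (a sketch of the broadcasting step, not scipy's actual code):

```python
import numpy as np

# The public sf/cdf broadcast all arguments to one common shape; the
# private _sf loop then assumes equally shaped arrays.
k = np.arange(3)                      # array argument, shape (3,)
M, n, N = 13397950, 4363, 12390       # scalar shape parameters
kb, Mb, nb, Nb = np.broadcast_arrays(k, M, n, N)
```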

@argriffing
Collaborator

The CellProfiler project has some functions

pochdivgamma(a, b, iterations)
hyper3F2regularizedZ1(a1, a2, a3, b1, b2)
pochdivpoch(a, b, iterations)
pochdivpochgen(a, b)
hyper3F2Z1(a1, a2, a3, b1, b2, tol=1e-15)
hyp2f1mine(a, b, c)
hyper3F2aZ1(a1, a2, a3, b2, tol=1e-10)

given

from scipy.special import gamma, hyp2f1, gammaln

which might help with the hypergeometric distribution pdf and cdf, but CellProfiler is GPL...

@rgommers
Owner

CellProfiler code has previously been relicensed for scikit-image. If you really want to use some of their code (I haven't checked the above in detail), ask @thouis.

@thouis
@ljosa

I have changed the license of the files cpa/hypergeom.py and cpa/dirichletintegrate.py in CellProfiler-Analyst from GPLv2 to BSD.

If you incorporate enough of it in SciPy that we no longer need to maintain our own version of these functions, please let me know.

Sorry for taking so long to get this done. I hope it is not too late to be useful.

@josef-pkt
Collaborator

@argriffing I have no idea how these functions help for stats.hypergeom. I'm no expert in special functions. I only use them if they are already used in code, or if I have a reference that uses a specific special function.

@rgommers
Owner

@ljosa thanks for doing that!

@argriffing
Collaborator

I'll try to look at this today. My thinking was that because, according to Wikipedia (http://en.wikipedia.org/wiki/Hypergeometric_distribution), the closed form of the hypergeometric survival function is a regularized 3F2 hypergeometric function with z=1, and because CellProfiler-Analyst provides a regularized 3F2 hypergeometric function with z=1, there might be some overlap.

@pv
Owner
pv commented

3F2 for z=1 has a closed-form expression: http://functions.wolfram.com/HypergeometricFunctions/Hypergeometric3F2/03/02/01/0002/

EDIT: given clebsch-gordan coefficients :)
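One classical closed form of this kind, Saalschütz's theorem for a terminating Saalschützian 3F2 at z = 1, can be checked exactly with the standard library (a self-contained sketch, independent of the CPA code):

```python
from fractions import Fraction
from math import factorial

def poch(x, m):
    """Rising factorial (x)_m in exact arithmetic."""
    r = Fraction(1)
    for i in range(m):
        r *= x + i
    return r

def hyp3f2_terminating(a, b, n, c, d):
    """Terminating 3F2(a, b, -n; c, d; 1) as a finite exact sum."""
    return sum(poch(a, k) * poch(b, k) * poch(-n, k)
               / (poch(c, k) * poch(d, k) * factorial(k))
               for k in range(n + 1))

# Saalschuetz's theorem: with d = 1 + a + b - c - n,
#   3F2(a, b, -n; c, d; 1) = (c-a)_n (c-b)_n / ((c)_n (c-a-b)_n)
a, b, n, c = 1, 2, 3, 5
d = 1 + a + b - c - n
lhs = hyp3f2_terminating(a, b, n, c, d)
rhs = poch(c - a, n) * poch(c - b, n) / (poch(c, n) * poch(c - a - b, n))
```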

@argriffing
Collaborator

I looked at this a bit, and it seems not to deal with the sizes of the numbers that are causing problems with the hypergeometric cdf. The domain of 3F2(a1,a2,a3;b1,b2;1) that is handled most carefully by the CPA code requires b1 = a1 + 1, which is not what we want for the hypergeometric cdf. I started checking http://functions.wolfram.com/HypergeometricFunctions/Hypergeometric3F2/17/02/06/ for relevant transformations but I gave up.

@thouis

@argriffing - I don't think the hyper3F2Z1 function in the CPA code requires b1 = a1 + 1, it just assumes that every term in the numerator has a larger term in the denominator so that all of the series from pochdivpochgen() converge.

@argriffing
Collaborator

@thouis yes, I was referring to the hyper3F2aZ1 function whose docstring says "same has hyper3F2Z1 but with b1 = a1+1". I tried the hyper3F2Z1 for hypergeometric cdf/sf but it couldn't deal with the bigness of the parameter values that caused someone to open the scipy issue. The hyper3F2aZ1 function treats the domain more carefully but it has that extra b1=a1+1 restriction.

@ev-br
Collaborator

The immediate problem with hypergeom has been fixed by clipping in #3211. With current master branch I get:

>>> 1-stats.hypergeom.cdf(30,13397950,4363,12390)
0.0
>>> stats.hypergeom.sf(12390 - 30, 13397950, 13397950-4363, 12390)
0.99999999835287956
>>> stats.hypergeom.pmf(12390 - np.arange(0, 30, 1),13397950,13397950-4363,12390).sum()
0.99999999835287956
>>> print scipy.__version__
0.14.0.dev-9812e4c
@rgommers
Owner

I get slightly different values for some reason (with numpy 1.5.1), didn't check why:

In [2]: stats.hypergeom.cdf(30,13397950,4363,12390)
Out[2]: 0.99999998798961176

In [3]: 1 - stats.hypergeom.cdf(30,13397950,4363,12390)
Out[3]: 1.2010388239502845e-08

In [4]: stats.hypergeom.sf(12390 - 30, 13397950, 13397950-4363, 12390)
Out[4]: 0.99999998281428071

In [5]: stats.hypergeom.pmf(12390 - np.arange(0, 30, 1),13397950,13397950-4363,12390).sum()
Out[5]: 0.99999998281428071

In [6]: scipy.__version__
Out[6]: '0.14.0.dev-fefae50'

But can this now be closed or is there something left to do?

@josef-pkt
Collaborator

I don't see anything left to do.

@josef-pkt josef-pkt closed this
@rgommers rgommers added this to the 0.14.0 milestone