Negative p-value on hypergeom.cdf (Trac #1719) #2238
@rgommers wrote on 2012-08-18 Still an issue in current master. Hypergeom recently got a Note that what you want is actually the survival function, which is way faster and gives correct results:
This can be fixed by adding hypergeom _cdf(...) = 1 - _sf(...) in distributions.py. I would change this and submit a PR, except that there's already a modification to distributions.py in the PR queue and I'm not good enough with git to merge them smoothly. |
I tried this, and I was stymied by the requirement that the hypergeometric |
Not sure what is failing for you exactly - the first parameter to most distribution methods is |
It should be possible to fix the bug using just |
Maybe @josef-pkt would be familiar enough with scipy.stats to do this as a one or two line fix? |
I don't think General recommendation to users: use sf if you want 1-cdf in the right tail; in many cases it's more precise. I think the best would be to add a _cdf that does the same loop as the new _sf, so we also get higher precision in the right tail. However, I think this doesn't solve the problem in this issue if we are already running into precision problems coming from the other tail. The only way to increase precision would be to dispatch to either a cdf or a sf loop depending on whether we are closer to the left or right tail. That said, I don't really understand where the precision problem is coming from. We are only summing over a small number of terms, and I don't see why _cdf should be larger than 1.
the .a is fixed in current master. I need to switch to a newer scipy |
For reference, gh-2321 fixed the .a and the ppf loop.
All pmf values for k larger than 30 are smaller than 1e-16, so there is no way to calculate the upper tail coming from the other side. We are just adding 30 values, but the floating-point errors add up.
The only solution I see is to choose between the _sf and _cdf calculations depending on which tail gives better precision, maybe using the mean as a threshold. This is actually what the user is supposed to do by calling sf instead of calculating 1-cdf.
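The tail dispatch described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the scipy implementation: the helper names `pmf` and `cdf_sf` are hypothetical, the pmf is computed with exact integer binomials from `math.comb` (Python 3.8+) rather than with scipy's own pmf, and the argument order (k, M, n, N) follows `scipy.stats.hypergeom`.

```python
from math import comb


def pmf(k, M, n, N):
    """P(X = k) when N items are drawn from M, of which n are marked."""
    # Exact integer arithmetic; rounded to a float only once, in the division.
    return comb(n, k) * comb(M - n, N - k) / comb(M, N)


def cdf_sf(k, M, n, N):
    """Return (cdf, sf) at k, summing pmf terms over the tail nearer to k."""
    lo, hi = max(0, N - (M - n)), min(n, N)  # support of the distribution
    mean = N * n / M
    if k < mean:
        # Left of the mean: sum the lower tail directly, derive sf from it.
        cdf = sum((pmf(j, M, n, N) for j in range(lo, min(k, hi) + 1)), 0.0)
        return cdf, 1.0 - cdf
    # Right of the mean: sum the upper tail directly, derive cdf from it.
    sf = sum((pmf(j, M, n, N) for j in range(max(k + 1, lo), hi + 1)), 0.0)
    return 1.0 - sf, sf


# Small illustrative parameters (not the ones from the ticket).
cdf, sf = cdf_sf(9, 50, 10, 12)
naive_sf = 1.0 - sum(pmf(j, 50, 10, 12) for j in range(0, 10))
print(sf, naive_sf)
```

With this layout the small tail is always a direct sum of small terms, so it never comes from subtracting a number close to 1. The sketch recomputes each binomial from scratch, which is too slow for parameters as large as those in the ticket; a term-to-term recurrence would be the natural replacement.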
There is also the problem that we don't truncate the returned values for cdf and sf to be within the interval [0,1] |
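Truncating to [0, 1] could look like the following minimal sketch. The helper name `clip_probability` is hypothetical, and the input is the negative value reported in the original ticket.

```python
def clip_probability(p):
    """Clamp a computed probability to the interval [0, 1] (hypothetical helper)."""
    return min(1.0, max(0.0, p))


# The negative result reported in the ticket becomes 0.0 after clipping.
print(clip_probability(-3.5282108346024188e-09))
```

Clipping only removes impossible values. It would not help with the ticket's second example, where the reported 2.0581985005208026e-09 already lies in [0, 1] but differs from the 1.228906e-11 given by Matlab and R.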
And here's the (almost) symmetric test case, moving the problem into the other tail.
There are small numerical differences from the previous case, and I don't know where they are coming from.
@josef-pkt OK, I see that this is probably more complicated than I had appreciated. Thanks for your reply! My thinking had been that the more sophisticated integral in In the current scipy master branch, I can do
>>> 1 - (1 - scipy.stats.hypergeom.sf(30, 13397950, 4363, 12390))
0.0
>>> 1 - (1 - scipy.stats.hypergeom.sf(20, 13397950, 4363, 9260))
1.2289058659575858e-11
which had led me to believe that just replacing
In the second case, the generic cdf calculation uses all relevant terms in the sum; there is no truncation in the number of terms. It looks like pure precision error in the underlying function, the calculation of the pmf.
Some background: hypergeom is a discrete distribution, so we add all the pmf values in a sum to get sf or cdf, instead of using integrate.quad as for the continuous distributions.
"copypasting the _sf integral as a new _cdf integral (with appropriate modifications)" should work. I thought the following might explain your zip error, but I don't think it's relevant for what you did. The generic functions like
The CellProfiler project has some functions
pochdivgamma(a, b, iterations)
hyper3F2regularizedZ1(a1, a2, a3, b1, b2)
pochdivpoch(a, b, iterations)
pochdivpochgen(a, b)
hyper3F2Z1(a1, a2, a3, b1, b2, tol=1e-15)
hyp2f1mine(a, b, c)
hyper3F2aZ1(a1, a2, a3, b2, tol=1e-10)
given
from scipy.special import gamma, hyp2f1, gammaln
which might help with the hypergeometric distribution pdf and cdf, but CellProfiler is GPL...
CellProfiler code has previously been relicensed for scikit-image. If you really want to use some of their code (I haven't checked the above in detail), ask @thouis. |
It's fine with me. I wrote that code, but I'm not on the project any more. |
I have changed the license of the files cpa/hypergeom.py and cpa/dirichletintegrate.py in CellProfiler-Analyst from GPLv2 to BSD. If you incorporate enough of it in SciPy that we no longer need to maintain our own version of these functions, please let me know. Sorry for taking so long to get this done. I hope it is not too late to be useful. |
@argriffing I have no idea how these functions help for stats.hypergeom. I'm no expert in special functions. I only use them if they are already used in code, or if I have a reference that uses a specific special function. |
@ljosa thanks for doing that! |
I'll try to look at this today. My thought was that because, according to Wikipedia (http://en.wikipedia.org/wiki/Hypergeometric_distribution), the closed form of the hypergeometric survival function is a regularized 3F2 hypergeometric function with z=1, and because CellProfiler-Analyst provides a regularized 3F2 hypergeometric function with z=1, there might be some overlap.
3F2 for z=1 has a closed-form expression: http://functions.wolfram.com/HypergeometricFunctions/Hypergeometric3F2/03/02/01/0002/ EDIT: given clebsch-gordan coefficients :) |
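For readers following the 3F2 discussion, the link between the survival function and a 3F2 series at z=1 can be worked out from the ratio of consecutive pmf terms. The notation below is an assumption chosen to match the scipy argument order (k, M, n, N): M items in total, n of them marked, N drawn. It is a sketch of the derivation, not a quote from Wikipedia or from the CellProfiler-Analyst code.

```latex
\begin{aligned}
p(j) &= \frac{\binom{n}{j}\binom{M-n}{N-j}}{\binom{M}{N}},
\qquad
\frac{p(j+1)}{p(j)} = \frac{(n-j)(N-j)}{(j+1)(M-n-N+j+1)}, \\[4pt]
\mathrm{sf}(k) &= \sum_{j>k} p(j)
 = p(k+1)\;
 {}_3F_2\!\left(1,\; k+1-n,\; k+1-N;\; k+2,\; M-n-N+k+2;\; 1\right).
\end{aligned}
```

Substituting j = k+1+i into the term ratio gives the ratio of consecutive terms of the 3F2 series with these parameters, and the series terminates because k+1-n and k+1-N are non-positive integers on the support. The prefactor p(k+1) is itself a pmf value, so any precision problem in computing the pmf carries over to this form.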
I looked at this a bit, and it seems to not deal with the sizes of numbers that are causing problems with the hypergeometric cdf. The domain of |
@argriffing - I don't think the hyper3F2Z1 function in the CPA code requires b1 = a1 + 1, it just assumes that every term in the numerator has a larger term in the denominator so that all of the series from pochdivpochgen() converge. |
@thouis yes, I was referring to the hyper3F2aZ1 function whose docstring says "same has hyper3F2Z1 but with b1 = a1+1". I tried the hyper3F2Z1 for hypergeometric cdf/sf but it couldn't deal with the bigness of the parameter values that caused someone to open the scipy issue. The hyper3F2aZ1 function treats the domain more carefully but it has that extra b1=a1+1 restriction. |
The immediate problem with
I get slightly different values for some reason (with numpy 1.5.1), didn't check why:
But can this now be closed or is there something left to do? |
I don't see anything left to do. |
Original ticket http://projects.scipy.org/scipy/ticket/1719 on 2012-08-16 by trac user lucapinello, assigned to unknown.
Hi!
I have noticed that not only is the precision of hypergeom.cdf not very good, but in some cases the value is negative.
I compared the results with Matlab and R.
Two simple examples:
from scipy.stats import hypergeom
-3.5282108346024188e-09
In both Matlab and R you obtain 0.
2.0581985005208026e-09
In both Matlab and R you obtain 1.228906e-11
Thanks for this great project, I hope you can fix this bug soon.
Best,
Luca Pinello
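The expressions that produced the two numbers in the ticket are not preserved above. The following is a hedged reconstruction, assuming they were of the form `1 - hypergeom.cdf(...)` with the same parameters as the `sf` calls quoted in the comments; that pairing is an assumption, not something the ticket text confirms. It compares the subtraction with the direct survival function, as recommended in the thread.

```python
from scipy.stats import hypergeom

# Assumed parameters (k, M, n, N), taken from the sf calls quoted in the
# comments; the ticket's own expressions did not survive.
cases = [(30, 13397950, 4363, 12390), (20, 13397950, 4363, 9260)]

# For each case: (upper tail via 1 - cdf, upper tail via sf).
results = [(1 - hypergeom.cdf(*case), hypergeom.sf(*case)) for case in cases]

for case, (via_cdf, via_sf) in zip(cases, results):
    print(case, via_cdf, via_sf)
```

The values from `1 - cdf` depend on the installed scipy version, so none are stated here. For the second case, the thread reports 1.2289058659575858e-11 from `sf` on the scipy master branch of the time, in agreement with the Matlab and R value of 1.228906e-11.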