Scipy.stats.poisson underflow/overflow + solution #8424
Comments
Hi, logcdf(x) simply returns log(cdf(x)). So in your examples above, the problem already arises when computing the cdf.
(R gives the same result using And the cdf is simply I don't know how much accuracy one can expect here (note: |
Hi, The log cdf can be calculated from the log pmf with full accuracy instead of taking the log of the cdf. |
OK, you are right, the logcdf should be computed from the logpmf in that case instead of
This amounts to |
The question is how to do this efficiently and vectorized in x and mu. |
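One possible way to vectorize this, offered as a sketch under assumptions rather than as anything proposed in the thread: evaluate the log pmf on a grid covering 0..max(k), mask the entries above each k with -inf, and reduce with `scipy.special.logsumexp`. The name `poisson_logcdf_lse` is hypothetical. The sketch assumes integer k >= 0 and mu > 0, and its memory use grows with max(k).

```python
import numpy as np
from scipy import special

def poisson_logcdf_lse(k, mu):
    """Hypothetical helper: log P(X <= k) for X ~ Poisson(mu), via logsumexp.

    Assumes integer k >= 0 and mu > 0; k and mu broadcast against each other.
    """
    k, mu = np.broadcast_arrays(np.asarray(k), np.asarray(mu, dtype=float))
    i = np.arange(int(k.max()) + 1)      # candidate counts 0..max(k)
    mu_col = mu[..., None]               # add a trailing axis for the grid
    # log pmf(i; mu) = i*log(mu) - mu - log(i!)
    logpmf = i * np.log(mu_col) - mu_col - special.gammaln(i + 1)
    # drop terms with i > k from each row before summing
    logpmf = np.where(i <= k[..., None], logpmf, -np.inf)
    return special.logsumexp(logpmf, axis=-1)

print(poisson_logcdf_lse([5, 5], [100.0, 1000.0]))
```

For large k the grid becomes wide, so this trades memory for simplicity; a chunked or recursive evaluation might be preferable there.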
I'm not convinced by the
For |
@pvanmulbregt For small k and large mu, which causes overflow issues in the current scipy implementation, I think my solution still holds. In this situation, a log is taken of a value very close to zero; digits several places after the decimal point are not very important here. EDIT: I meant that my formula has underflow issues; the formula is still mathematically correct. |
The formula is fine, it just incurs rounding errors which overwhelm the final result. |
I found a formula to calculate the results in Python for k up to 250. I have no proof for the formula, but the results seem to agree with the ones @pvanmulbregt posted.
EDIT: the second formula for 30 < k < 250 only seems to work for mu < 30, but is otherwise wrong. |
Continued fraction expansions may be helpful because they're easy to logarithmize. For large mu:
import numpy as np
from scipy import stats, special
def logG(s, z):  # upper incomplete gamma, unregularized, large mu
    return (s*np.log(z) - z - np.log(
        (z + (1 - s)/
        (1 + 1/
        (z + (2-s)/
        (1 + 2/
        (z + (3-s)/
        (1 + 3/
        (z + (4-s)/
        (1 + 4/
        (z + (5-s)/
        (1 + 5))))))))))))
k, mu = 5, 100
print(stats.poisson.logcdf(k, mu)) # -81.71088951390705
print(logG(k+1, mu) - special.gammaln(k+1)) # -81.71088951390705
k, mu = 5, 1000
print(stats.poisson.logcdf(k, mu)) # -inf
print(logG(k+1, mu) - special.gammaln(k+1)) # -970.243707846241
For large k:
def logg(s, z):  # lower incomplete gamma, unregularized, large z
    return (s*np.log(z) - z - np.log(
        (s - s*z/
        (s+1 + z/
        (s+2 - (s+1)*z/
        (s+3 + 2*z/
        s+4))))))
k, mu = 30, 5
print(stats.poisson.logcdf(k, mu)) # -4.551914400963152e-15
a = logg(k+1, mu) - special.gammaln(k+1)
print(np.real(special.logsumexp([0., a + 1j*np.pi]))) # -4.551914400963152e-15
k, mu = 100, 5
print(stats.poisson.logcdf(k, mu)) # 0.0
a = logg(k+1, mu) - special.gammaln(k+1)
print(np.real(special.logsumexp([0., a + 1j*np.pi]))) # 6.591317973300016e-216 |
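The hand-unrolled fraction in `logG` can also be evaluated with a loop, which makes the truncation depth a parameter. The following is a minimal sketch using only the standard library; the names `log_upper_gamma_cf` and `poisson_logcdf_cf` and the default `depth=60` are assumptions for illustration, not part of scipy, and the sketch is only meant for the large-mu regime discussed above.

```python
import math

def log_upper_gamma_cf(s, z, depth=60):
    """Log of the unregularized upper incomplete gamma function, large z.

    Evaluates the continued-fraction expansion used by logG bottom-up,
    truncated after `depth` levels (an assumed default, not a tuned value).
    """
    tail = z  # innermost level truncated to z
    for n in range(depth, 0, -1):
        # one level: z + (n - s) / (1 + n / <next level>)
        tail = z + (n - s) / (1.0 + n / tail)
    return s * math.log(z) - z - math.log(tail)

def poisson_logcdf_cf(k, mu):
    # log P(X <= k) = log Gamma(k+1, mu) - log Gamma(k+1)
    return log_upper_gamma_cf(k + 1, mu) - math.lgamma(k + 1)

print(poisson_logcdf_cf(5, 100))
print(poisson_logcdf_cf(5, 1000))
```

For integer s = k+1 the level n = s contributes (n - s) = 0, so the fraction terminates there and any depth of at least k+1 reproduces the finite sum up to rounding. For mu that is small relative to k the intermediate levels can cancel badly, so this sketch should not be used in that regime.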
scipy.stats.poisson.logcdf has underflow and overflow issues. This can be solved by applying logsumexp to an array of the individual Poisson logpmf values.
Note: neither the issue nor the solution is exclusive to this scipy function.
Reproducing code example:
Possible solution:
(implemented for one pair of argument types only. Might be fully vectorized instead of for loop?)
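The original snippets are not preserved here. As a stand-in, the following is a minimal scalar sketch of the approach described above (logsumexp over the individual logpmf values), not the author's code. The helper name `poisson_logcdf_scalar` is hypothetical, and it assumes an integer k >= 0 and a scalar mu > 0.

```python
import numpy as np
from scipy import special, stats

def poisson_logcdf_scalar(k, mu):
    """Hypothetical scalar helper: logsumexp over logpmf(0..k, mu)."""
    counts = np.arange(int(k) + 1)                 # 0, 1, ..., k
    return special.logsumexp(stats.poisson.logpmf(counts, mu))

# Compare against the built-in logcdf for the cases quoted in the thread.
for k, mu in [(5, 100), (5, 1000)]:
    print(k, mu, stats.poisson.logcdf(k, mu), poisson_logcdf_scalar(k, mu))
```

What the built-in `logcdf` prints for the second case may depend on the scipy version; the thread reports `-inf` for the version under discussion.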
Check that the issues are resolved:
Scipy/Numpy/Python version information: