Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scipy.stats.poisson underflow/overflow + solution #8424

Open
Tracked by #20223
labodyn opened this issue Feb 15, 2018 · 9 comments
Open
Tracked by #20223

Scipy.stats.poisson underflow/overflow + solution #8424

labodyn opened this issue Feb 15, 2018 · 9 comments
Labels
enhancement A new feature or improvement scipy.stats

Comments

@labodyn
Copy link

labodyn commented Feb 15, 2018

poisson.stats.logcdf has underflow and overflow issues. This can be solved using logsumexp on an array of the individual poisson logpmf's.

Note: neither issue nor solution are exclusive to this scipy function.

Reproducing code example:

>>> from scipy.stats import poisson
>>> poisson.logcdf(5, 1000)
-inf
>>> poisson.logcdf(500, 10)
0.0

Possible solution:

(implemented for one pair of argument types only. Might be fully vectorized instead of for loop?)

import numpy as np
from scipy.stats import poisson
from scipy.misc import logsumexp

def poisson_logcdf_stable(k, mu):
    assert isinstance(mu, np.ndarray) and isinstance(k, int), 'Unsupported types, implement yourself.'
    poissons = np.zeros((k + 1, *mu.shape))
    for i in range(k + 1):
        poissons[i, ...] = poisson.logpmf(i, mu)
    return logsumexp(poissons, axis=0)

Check resolvement of issues

>>> mu = np.array([1000, 5])
>>> poisson.logcdf(5, mu)
[       -inf -0.48457219]
>>> poisson_logcdf_stable(5, mu)
[ -9.70243708e+02  -4.84572190e-01] 
>>> poisson.logcdf(500, mu)
[-156.76165407    0.        ]
>>> poisson_logcdf_stable(500, mu)
[ -1.56761654e+02  -6.66133815e-16] 

Scipy/Numpy/Python version information:

0.19.1 1.13.3 sys.version_info(major=3, minor=6, micro=3, releaselevel='final', serial=0)
@ilayn ilayn added defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats labels Feb 21, 2018
@chrisb83
Copy link
Member

chrisb83 commented Feb 24, 2018

Hi,

logcdf(x) simply returns log(cdf(x)). So in your examples above, the problem already arises when computing the cdf.

stats.poisson.cdf(5, 1000) # 0.0; same for stats.poisson.pmf(5, 1000)
stats.poisson.cdf(500, 10) # 1.0

(R gives the same result using ppois)

And the cdf is simply special.pdtr (https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.pdtr.html)

I don't know how much accuracy one can expect here (note: exp(-1000) in the first expression), but it is more a question about the accuracy of the cdf.

@labodyn
Copy link
Author

labodyn commented Feb 24, 2018

Hi,

The log cdf can be calculated from the log pmf with full accuracy instead of taking the log of the cdf.

@chrisb83
Copy link
Member

OK, you are right, the logcdf should be computed from the logpmf in that case instead of log(cdf(.)).
I think one needs to define _logcdf for poisson in _discrete_distn.py:

def _logcdf(self, x, mu):
    k = floor(x)
    x = np.arange(0, k+1)
    return logsumexp(self._logpmf(x, mu))

This amounts to logsumexp(stats.poisson.logpmf(np.arange(0, 6), 1000)) in your first example.

@josef-pkt
Copy link
Member

The question is in how to do this efficiently and vectorized in x and mu.
Looping over all (x, mu) pairs might be much slower than the current version.

@pvanmulbregt
Copy link
Contributor

I'm not convinced by the poisson_logcdf_stable(500, 5)=-6.66133815e-16 example. I did the calculations in SageMath using a 10,000bit real field, and the pmf values decayed very quickly.

   k           pmf             cdf              1-cdf               log(cdf)
   0    6.73794700e-03  0.0067379469990855    9.93262053e-01   -5.00000000e+00
   1    3.36897350e-02  0.0404276819945128    9.59572318e-01   -3.20824053e+00
   2    8.42243375e-02  0.1246520194830811    8.75347981e-01   -2.08222927e+00
   3    1.40373896e-01  0.2650259152973617    7.34974085e-01   -1.32792766e+00
   4    1.75467370e-01  0.4404932850652124    5.59506715e-01   -8.19860078e-01
   5    1.75467370e-01  0.6159606548330632    3.84039345e-01   -4.84572190e-01
   6    1.46222808e-01  0.7621834629729387    2.37816537e-01   -2.71567987e-01
   7    1.04444863e-01  0.8666283259299927    1.33371674e-01   -1.43145084e-01
   8    6.52780393e-02  0.9319063652781514    6.80936347e-02   -7.05229358e-02
   9    3.62655774e-02  0.9681719426937951    3.18280573e-02   -3.23455807e-02
  10    1.81327887e-02  0.9863047314016170    1.36952686e-02   -1.37899139e-02
  11    8.24217669e-03  0.9945469080869906    5.45309191e-03   -5.46801429e-03
  12    3.43424029e-03  0.9979811483725630    2.01885163e-03   -2.02089226e-03
  13    1.32086165e-03  0.9993020100208601    6.97989979e-04   -6.98233688e-04
  14    4.71736303e-04  0.9997737463238232    2.26253676e-04   -2.26279275e-04
  15    1.57245434e-04  0.9999309917581444    6.90082419e-05   -6.90106230e-05
  16    4.91391982e-05  0.9999801309563696    1.98690436e-05   -1.98692410e-05
  17    1.44527054e-05  0.9999945836617300    5.41633827e-06   -5.41635294e-06
  18    4.01464038e-06  0.9999985983021079    1.40169789e-06   -1.40169887e-06
  19    1.05648431e-06  0.9999996547864179    3.45213582e-07   -3.45213642e-07
  20    2.64121077e-07  0.9999999189074954    8.10925046e-08   -8.10925079e-08
  21    6.28859708e-08  0.9999999817934663    1.82065338e-08   -1.82065339e-08
  22    1.42922661e-08  0.9999999960857323    3.91426767e-09   -3.91426768e-09
  23    3.10701437e-09  0.9999999991927467    8.07253300e-10   -8.07253300e-10
  24    6.47294660e-10  0.9999999998400414    1.59958640e-10   -1.59958640e-10
  25    1.29458932e-10  0.9999999999695003    3.04997078e-11   -3.04997078e-11
  26    2.48959485e-11  0.9999999999943963    5.60375933e-12   -5.60375933e-12
  27    4.61036083e-12  0.9999999999990066    9.93398502e-13   -9.93398502e-13
  28    8.23278719e-13  0.9999999999998299    1.70119783e-13   -1.70119783e-13
  29    1.41944607e-13  0.9999999999999718    2.81751762e-14   -2.81751762e-14
  30    2.36574345e-14  0.9999999999999954    4.51774169e-15   -4.51774169e-15
  31    3.81571524e-15  0.9999999999999993    7.02026458e-16   -7.02026458e-16
  32    5.96205506e-16  0.9999999999999999    1.05820953e-16   -1.05820953e-16
  33    9.03341675e-17  1.0000000000000000    1.54867851e-17   -1.54867851e-17
  34    1.32844364e-17  1.0000000000000000    2.20234871e-18   -2.20234871e-18
  35    1.89777663e-18  1.0000000000000000    3.04572079e-19   -3.04572079e-19
  36    2.63580087e-19  1.0000000000000000    4.09919915e-20   -4.09919915e-20
  37    3.56189307e-20  1.0000000000000000    5.37306080e-21   -5.37306080e-21
  38    4.68670141e-21  1.0000000000000000    6.86359393e-22   -6.86359393e-22
  39    6.00859155e-22  1.0000000000000000    8.55002376e-23   -8.55002376e-23
  40    7.51073944e-23  1.0000000000000000    1.03928432e-23   -1.03928432e-23
...
 244   1.69438241e-310  1.0000000000000000   3.52965828e-312  -3.52965828e-312
 245   3.45792329e-312  1.0000000000000000   7.17349982e-314  -7.17349982e-314
 246   7.02829936e-314  1.0000000000000000   1.45200453e-315  -1.45200453e-315
 247   1.42273266e-315  1.0000000000000000   2.92718678e-317  -2.92718678e-317
 248   2.86841273e-317  1.0000000000000000   5.87740492e-319  -5.87740492e-319
 249   5.75986671e-319  1.0000000000000000   1.17538217e-320  -1.17538217e-320
 250   1.15216109e-320  1.0000000000000000   2.32210854e-322  -2.32210854e-322
 251   2.27270197e-322  1.0000000000000000   4.94065646e-324  -4.94065646e-324
 252   4.94065646e-324  1.0000000000000000    0.00000000e+00   -0.00000000e+00
 253    0.00000000e+00  1.0000000000000000    0.00000000e+00   -0.00000000e+00

For k=33, the logcdf was already less than 10^(-16).
The logcdf bottomed out near k=250 at 10^(-324), and after that the cdf was too close to 1 to get a value.
At k=500, the pmf was ~ e^(-1811) and none of the higher-k values were going to contribute to the cdf. I wonder if the non-zero value returned by poisson_logcdf_stable() is just rounding error arising during the computations inside poisson.logpmf and then scipy.misc.logsumexp.

@labodyn
Copy link
Author

labodyn commented Apr 3, 2018

@pvanmulbregt
You're right, my formula is wrong for large k. logsumexp works by normalisation on the largest value in the sum, which results in underflow of the smaller values. These smaller values are important for the precise end result however; a log is taken over a value very close to one.

For small k and large mu, which causes overflow issues in current scipy implementation, I think my solution still holds. In this situation, a log is taken over a value very close to zero; digits several places after the comma are not very important here.

EDIT: I meant my formula has underflow issues, the formula is mathematically still correct indeed.

@pvanmulbregt
Copy link
Contributor

The formula is fine, it just incurs rounding errors which overwhelm the final result.
In the mu=1000, k=5 example, the logcdf() is ~-970, so the cdf will underflow a 64bit double and be 0. Are you asking for logcdf() to be accurate in this part of the domain?
One possible alternative might be to use an approximation valid for large mu, the Normal(mean=mu, var=mu) with a continuity correction. Is that appropriate for your needs?
[I can see a few ways to compute log(cdf): Using the incomplete gamma function as now; use logsumexp(); using the Normal(mean=mu, var=mu) approximation. But it's not immediately clear to me how to stitch any of them them together preserving a CDF's properties --- increasing as a function of k, decreasing as a function of mu --- and control the computational requirements. The incomplete gamma already stitches together different algorithms.]

@labodyn
Copy link
Author

labodyn commented May 14, 2018

I found some a formula to calculate the results in python for k up to 250. I have no proof for the formula but the results seem to agree with ones @pvanmulbregt posted.

def poisson_logcdf_stable(k, mu):
    """A more numerical stable method to calculate poisson.logcdf. See Scipy issue #8424."""
    if not isinstance(mu, np.ndarray) or not isinstance(k, int):
        raise TypeError('Unsupported types, implement yourself.')

    if k < 30:
        poissons = np.zeros((k + 1, *mu.shape))
        for i in range(k + 1):
            poissons[i, ...] = poisson.logpmf(i, mu)
        result = logsumexp(poissons, axis=0)
    elif k <= 250:  # This is a magic formula without proof, but it seems to work.
        n_accuracy_samples = 15
        poissons = np.zeros((n_accuracy_samples, *mu.shape))
        for i in range(n_accuracy_samples):
            poissons[i, ...] = poisson.pmf(i + k + 1, mu)
        result = -poissons.sum(axis=0)
    else:
        raise ValueError('Underflow for k > 250. Not fixed yet.')
    return result

EDIT: the second formula for 30 < k < 250 only seems to works for mu < 30, but is otherwise wrong.

@mdhaber mdhaber added enhancement A new feature or improvement and removed defect A clear bug or issue that prevents SciPy from being installed or used as expected labels Mar 26, 2022
@mdhaber
Copy link
Contributor

mdhaber commented Feb 2, 2023

Continued fraction expansions may be helpful because they're easy to logarithmize.

For large mu:

import numpy as np
from scipy import stats, special

def logG(s, z):  # upper incomplete gamma, unregularized, large mu
    return (s*np.log(z) - z - np.log(
            (z + (1 - s)/
             (1 + 1/
              (z + (2-s)/
               (1 + 2/
                (z + (3-s)/
                 (1 + 3/
                  (z + (4-s)/
                   (1 + 4/
                    (z + (5-s)/
                     (1 + 5))))))))))))

k, mu = 5, 100
print(stats.poisson.logcdf(k, mu))  # -81.71088951390705
print(logG(k+1, mu) - special.gammaln(k+1))  # -81.71088951390705

k, mu = 5, 1000
print(stats.poisson.logcdf(k, mu))  # -inf
print(logG(k+1, mu) - special.gammaln(k+1))  # -970.243707846241

For large k:

def logg(s, z):  # lower incomplete gamma, unregularized, large z
    return (s*np.log(z) - z - np.log(
            (s - s*z/
             (s+1 + z/
              (s+2 - (s+1)*z/
               (s+3 + 2*z/
                s+4))))))

k, mu = 30, 5
print(stats.poisson.logcdf(k, mu))  # -4.551914400963152e-15
a = logg(k+1, mu) - special.gammaln(k+1)
print(np.real(special.logsumexp([0., a + 1j*np.pi])))  # -4.551914400963152e-15

k, mu = 100, 5
print(stats.poisson.logcdf(k, mu))  # 0.0
a = logg(k+1, mu) - special.gammaln(k+1)
print(np.real(special.logsumexp([0., a + 1j*np.pi])))  # 6.591317973300016e-216

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

No branches or pull requests

6 participants