Scipy.stats.poisson underflow/overflow + solution #8424

labodyn · 2018-02-15T13:52:01Z

poisson.stats.logcdf has underflow and overflow issues. This can be solved using logsumexp on an array of the individual poisson logpmf's.

Note: neither issue nor solution are exclusive to this scipy function.

Reproducing code example:

>>> from scipy.stats import poisson
>>> poisson.logcdf(5, 1000)
-inf
>>> poisson.logcdf(500, 10)
0.0

Possible solution:

(implemented for one pair of argument types only. Might be fully vectorized instead of for loop?)

import numpy as np
from scipy.stats import poisson
from scipy.misc import logsumexp

def poisson_logcdf_stable(k, mu):
    assert isinstance(mu, np.ndarray) and isinstance(k, int), 'Unsupported types, implement yourself.'
    poissons = np.zeros((k + 1, *mu.shape))
    for i in range(k + 1):
        poissons[i, ...] = poisson.logpmf(i, mu)
    return logsumexp(poissons, axis=0)

Check resolvement of issues

>>> mu = np.array([1000, 5])
>>> poisson.logcdf(5, mu)
[       -inf -0.48457219]
>>> poisson_logcdf_stable(5, mu)
[ -9.70243708e+02  -4.84572190e-01] 
>>> poisson.logcdf(500, mu)
[-156.76165407    0.        ]
>>> poisson_logcdf_stable(500, mu)
[ -1.56761654e+02  -6.66133815e-16]

Scipy/Numpy/Python version information:

0.19.1 1.13.3 sys.version_info(major=3, minor=6, micro=3, releaselevel='final', serial=0)

The text was updated successfully, but these errors were encountered:

chrisb83 · 2018-02-24T20:36:06Z

Hi,

logcdf(x) simply returns log(cdf(x)). So in your examples above, the problem already arises when computing the cdf.

stats.poisson.cdf(5, 1000) # 0.0; same for stats.poisson.pmf(5, 1000)
stats.poisson.cdf(500, 10) # 1.0

(R gives the same result using ppois)

And the cdf is simply special.pdtr (https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.pdtr.html)

I don't know how much accuracy one can expect here (note: exp(-1000) in the first expression), but it is more a question about the accuracy of the cdf.

labodyn · 2018-02-24T23:09:22Z

Hi,

The log cdf can be calculated from the log pmf with full accuracy instead of taking the log of the cdf.

chrisb83 · 2018-02-25T08:43:10Z

OK, you are right, the logcdf should be computed from the logpmf in that case instead of log(cdf(.)).
I think one needs to define _logcdf for poisson in _discrete_distn.py:

def _logcdf(self, x, mu):
    k = floor(x)
    x = np.arange(0, k+1)
    return logsumexp(self._logpmf(x, mu))

This amounts to logsumexp(stats.poisson.logpmf(np.arange(0, 6), 1000)) in your first example.

josef-pkt · 2018-02-25T13:18:53Z

The question is in how to do this efficiently and vectorized in x and mu.
Looping over all (x, mu) pairs might be much slower than the current version.

pvanmulbregt · 2018-04-02T20:31:38Z

I'm not convinced by the poisson_logcdf_stable(500, 5)=-6.66133815e-16 example. I did the calculations in SageMath using a 10,000bit real field, and the pmf values decayed very quickly.

   k           pmf             cdf              1-cdf               log(cdf)
   0    6.73794700e-03  0.0067379469990855    9.93262053e-01   -5.00000000e+00
   1    3.36897350e-02  0.0404276819945128    9.59572318e-01   -3.20824053e+00
   2    8.42243375e-02  0.1246520194830811    8.75347981e-01   -2.08222927e+00
   3    1.40373896e-01  0.2650259152973617    7.34974085e-01   -1.32792766e+00
   4    1.75467370e-01  0.4404932850652124    5.59506715e-01   -8.19860078e-01
   5    1.75467370e-01  0.6159606548330632    3.84039345e-01   -4.84572190e-01
   6    1.46222808e-01  0.7621834629729387    2.37816537e-01   -2.71567987e-01
   7    1.04444863e-01  0.8666283259299927    1.33371674e-01   -1.43145084e-01
   8    6.52780393e-02  0.9319063652781514    6.80936347e-02   -7.05229358e-02
   9    3.62655774e-02  0.9681719426937951    3.18280573e-02   -3.23455807e-02
  10    1.81327887e-02  0.9863047314016170    1.36952686e-02   -1.37899139e-02
  11    8.24217669e-03  0.9945469080869906    5.45309191e-03   -5.46801429e-03
  12    3.43424029e-03  0.9979811483725630    2.01885163e-03   -2.02089226e-03
  13    1.32086165e-03  0.9993020100208601    6.97989979e-04   -6.98233688e-04
  14    4.71736303e-04  0.9997737463238232    2.26253676e-04   -2.26279275e-04
  15    1.57245434e-04  0.9999309917581444    6.90082419e-05   -6.90106230e-05
  16    4.91391982e-05  0.9999801309563696    1.98690436e-05   -1.98692410e-05
  17    1.44527054e-05  0.9999945836617300    5.41633827e-06   -5.41635294e-06
  18    4.01464038e-06  0.9999985983021079    1.40169789e-06   -1.40169887e-06
  19    1.05648431e-06  0.9999996547864179    3.45213582e-07   -3.45213642e-07
  20    2.64121077e-07  0.9999999189074954    8.10925046e-08   -8.10925079e-08
  21    6.28859708e-08  0.9999999817934663    1.82065338e-08   -1.82065339e-08
  22    1.42922661e-08  0.9999999960857323    3.91426767e-09   -3.91426768e-09
  23    3.10701437e-09  0.9999999991927467    8.07253300e-10   -8.07253300e-10
  24    6.47294660e-10  0.9999999998400414    1.59958640e-10   -1.59958640e-10
  25    1.29458932e-10  0.9999999999695003    3.04997078e-11   -3.04997078e-11
  26    2.48959485e-11  0.9999999999943963    5.60375933e-12   -5.60375933e-12
  27    4.61036083e-12  0.9999999999990066    9.93398502e-13   -9.93398502e-13
  28    8.23278719e-13  0.9999999999998299    1.70119783e-13   -1.70119783e-13
  29    1.41944607e-13  0.9999999999999718    2.81751762e-14   -2.81751762e-14
  30    2.36574345e-14  0.9999999999999954    4.51774169e-15   -4.51774169e-15
  31    3.81571524e-15  0.9999999999999993    7.02026458e-16   -7.02026458e-16
  32    5.96205506e-16  0.9999999999999999    1.05820953e-16   -1.05820953e-16
  33    9.03341675e-17  1.0000000000000000    1.54867851e-17   -1.54867851e-17
  34    1.32844364e-17  1.0000000000000000    2.20234871e-18   -2.20234871e-18
  35    1.89777663e-18  1.0000000000000000    3.04572079e-19   -3.04572079e-19
  36    2.63580087e-19  1.0000000000000000    4.09919915e-20   -4.09919915e-20
  37    3.56189307e-20  1.0000000000000000    5.37306080e-21   -5.37306080e-21
  38    4.68670141e-21  1.0000000000000000    6.86359393e-22   -6.86359393e-22
  39    6.00859155e-22  1.0000000000000000    8.55002376e-23   -8.55002376e-23
  40    7.51073944e-23  1.0000000000000000    1.03928432e-23   -1.03928432e-23
...
 244   1.69438241e-310  1.0000000000000000   3.52965828e-312  -3.52965828e-312
 245   3.45792329e-312  1.0000000000000000   7.17349982e-314  -7.17349982e-314
 246   7.02829936e-314  1.0000000000000000   1.45200453e-315  -1.45200453e-315
 247   1.42273266e-315  1.0000000000000000   2.92718678e-317  -2.92718678e-317
 248   2.86841273e-317  1.0000000000000000   5.87740492e-319  -5.87740492e-319
 249   5.75986671e-319  1.0000000000000000   1.17538217e-320  -1.17538217e-320
 250   1.15216109e-320  1.0000000000000000   2.32210854e-322  -2.32210854e-322
 251   2.27270197e-322  1.0000000000000000   4.94065646e-324  -4.94065646e-324
 252   4.94065646e-324  1.0000000000000000    0.00000000e+00   -0.00000000e+00
 253    0.00000000e+00  1.0000000000000000    0.00000000e+00   -0.00000000e+00

For k=33, the logcdf was already less than 10^(-16).
The logcdf bottomed out near k=250 at 10^(-324), and after that the cdf was too close to 1 to get a value.
At k=500, the pmf was ~ e^(-1811) and none of the higher-k values were going to contribute to the cdf. I wonder if the non-zero value returned by poisson_logcdf_stable() is just rounding error arising during the computations inside poisson.logpmf and then scipy.misc.logsumexp.

labodyn · 2018-04-03T09:16:29Z

@pvanmulbregt
You're right, my formula is wrong for large k. logsumexp works by normalisation on the largest value in the sum, which results in underflow of the smaller values. These smaller values are important for the precise end result however; a log is taken over a value very close to one.

For small k and large mu, which causes overflow issues in current scipy implementation, I think my solution still holds. In this situation, a log is taken over a value very close to zero; digits several places after the comma are not very important here.

EDIT: I meant my formula has underflow issues, the formula is mathematically still correct indeed.

pvanmulbregt · 2018-04-03T15:27:14Z

The formula is fine, it just incurs rounding errors which overwhelm the final result.
In the mu=1000, k=5 example, the logcdf() is ~-970, so the cdf will underflow a 64bit double and be 0. Are you asking for logcdf() to be accurate in this part of the domain?
One possible alternative might be to use an approximation valid for large mu, the Normal(mean=mu, var=mu) with a continuity correction. Is that appropriate for your needs?
[I can see a few ways to compute log(cdf): Using the incomplete gamma function as now; use logsumexp(); using the Normal(mean=mu, var=mu) approximation. But it's not immediately clear to me how to stitch any of them them together preserving a CDF's properties --- increasing as a function of k, decreasing as a function of mu --- and control the computational requirements. The incomplete gamma already stitches together different algorithms.]

labodyn · 2018-05-14T15:15:20Z

I found some a formula to calculate the results in python for k up to 250. I have no proof for the formula but the results seem to agree with ones @pvanmulbregt posted.

def poisson_logcdf_stable(k, mu):
    """A more numerical stable method to calculate poisson.logcdf. See Scipy issue #8424."""
    if not isinstance(mu, np.ndarray) or not isinstance(k, int):
        raise TypeError('Unsupported types, implement yourself.')

    if k < 30:
        poissons = np.zeros((k + 1, *mu.shape))
        for i in range(k + 1):
            poissons[i, ...] = poisson.logpmf(i, mu)
        result = logsumexp(poissons, axis=0)
    elif k <= 250:  # This is a magic formula without proof, but it seems to work.
        n_accuracy_samples = 15
        poissons = np.zeros((n_accuracy_samples, *mu.shape))
        for i in range(n_accuracy_samples):
            poissons[i, ...] = poisson.pmf(i + k + 1, mu)
        result = -poissons.sum(axis=0)
    else:
        raise ValueError('Underflow for k > 250. Not fixed yet.')
    return result

EDIT: the second formula for 30 < k < 250 only seems to works for mu < 30, but is otherwise wrong.

mdhaber · 2023-02-02T09:18:48Z

Continued fraction expansions may be helpful because they're easy to logarithmize.

For large mu:

import numpy as np
from scipy import stats, special

def logG(s, z):  # upper incomplete gamma, unregularized, large mu
    return (s*np.log(z) - z - np.log(
            (z + (1 - s)/
             (1 + 1/
              (z + (2-s)/
               (1 + 2/
                (z + (3-s)/
                 (1 + 3/
                  (z + (4-s)/
                   (1 + 4/
                    (z + (5-s)/
                     (1 + 5))))))))))))

k, mu = 5, 100
print(stats.poisson.logcdf(k, mu))  # -81.71088951390705
print(logG(k+1, mu) - special.gammaln(k+1))  # -81.71088951390705

k, mu = 5, 1000
print(stats.poisson.logcdf(k, mu))  # -inf
print(logG(k+1, mu) - special.gammaln(k+1))  # -970.243707846241

For large k:

def logg(s, z):  # lower incomplete gamma, unregularized, large z
    return (s*np.log(z) - z - np.log(
            (s - s*z/
             (s+1 + z/
              (s+2 - (s+1)*z/
               (s+3 + 2*z/
                s+4))))))

k, mu = 30, 5
print(stats.poisson.logcdf(k, mu))  # -4.551914400963152e-15
a = logg(k+1, mu) - special.gammaln(k+1)
print(np.real(special.logsumexp([0., a + 1j*np.pi])))  # -4.551914400963152e-15

k, mu = 100, 5
print(stats.poisson.logcdf(k, mu))  # 0.0
a = logg(k+1, mu) - special.gammaln(k+1)
print(np.real(special.logsumexp([0., a + 1j*np.pi])))  # 6.591317973300016e-216

ilayn added defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats labels Feb 21, 2018

mdhaber added enhancement A new feature or improvement and removed defect A clear bug or issue that prevents SciPy from being installed or used as expected labels Mar 26, 2022

steppi mentioned this issue Mar 9, 2024

META: Streamlined Special Function Development in SciPy SDG Tracking #20223

Open

26 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scipy.stats.poisson underflow/overflow + solution #8424

Scipy.stats.poisson underflow/overflow + solution #8424

labodyn commented Feb 15, 2018 •

edited

chrisb83 commented Feb 24, 2018 •

edited

labodyn commented Feb 24, 2018

chrisb83 commented Feb 25, 2018

josef-pkt commented Feb 25, 2018

pvanmulbregt commented Apr 2, 2018

labodyn commented Apr 3, 2018 •

edited

pvanmulbregt commented Apr 3, 2018

labodyn commented May 14, 2018 •

edited

mdhaber commented Feb 2, 2023 •

edited

Scipy.stats.poisson underflow/overflow + solution #8424

Scipy.stats.poisson underflow/overflow + solution #8424

Comments

labodyn commented Feb 15, 2018 • edited

Reproducing code example:

Possible solution:

Check resolvement of issues

Scipy/Numpy/Python version information:

chrisb83 commented Feb 24, 2018 • edited

labodyn commented Feb 24, 2018

chrisb83 commented Feb 25, 2018

josef-pkt commented Feb 25, 2018

pvanmulbregt commented Apr 2, 2018

labodyn commented Apr 3, 2018 • edited

pvanmulbregt commented Apr 3, 2018

labodyn commented May 14, 2018 • edited

mdhaber commented Feb 2, 2023 • edited

labodyn commented Feb 15, 2018 •

edited

chrisb83 commented Feb 24, 2018 •

edited

labodyn commented Apr 3, 2018 •

edited

labodyn commented May 14, 2018 •

edited

mdhaber commented Feb 2, 2023 •

edited