Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

statsmodels.stats.rates.test_poisson_2indep has ZeroDivisionError if count2 is 0 #8313

Closed
Anaphory opened this issue Jun 16, 2022 · 5 comments

Comments

@Anaphory
Copy link

Describe the bug

When running statsmodels.stats.rates.test_poisson_2indep on data with count2=0, the function raises a ZeroDivisionError instead of computing the test statistics.

Code Sample, a copy-pastable example if possible

>>> from statsmodels.stats.rates import test_poisson_2indep
/home/gereon/.local/lib/python3.10/site-packages/statsmodels/compat/pandas.py:65: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import Int64Index as NumericIndex
>>> test_poisson_2indep(1, 1, 0, 4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gereon/.local/lib/python3.10/site-packages/statsmodels/stats/rates.py", line 140, in test_poisson_2indep
    ratio = rates[0] / rates[1]
ZeroDivisionError: float division by zero
>>> test_poisson_2indep(0, 4, 1, 1)
<class 'statsmodels.stats.base.HolderTuple'>
statistic = -2.0
pvalue = 0.04550026389635839
distribution = 'normal'
method = 'score'
alternative = 'two-sided'
rates = (0.0, 1.0)
ratio = 0.0
ratio_null = 1
tuple = (-2.0, 0.04550026389635839)

Expected Output

I believe the test is symmetric, so

>>> test_poisson_2indep(1, 1, 0, 4)
<class 'statsmodels.stats.base.HolderTuple'>
statistic = -2.0
pvalue = 0.04550026389635839
distribution = 'normal'
method = 'score'
alternative = 'two-sided'
rates = (1.0, 0.0)
ratio = 0.0
ratio_null = 1
tuple = (-2.0, 0.04550026389635839)

Output of import statsmodels.api as sm; sm.show_versions()

INSTALLED VERSIONS

Python: 3.10.4.final.0
OS: Linux 5.10.114-1-MANJARO #1 SMP PREEMPT Mon May 9 07:52:55 UTC 2022 x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

statsmodels

Installed: 0.13.2 (/home/gereon/.local/lib/python3.10/site-packages/statsmodels)

Required Dependencies

cython: 0.29.28 (/usr/lib/python3.10/site-packages/Cython)
numpy: 1.22.3 (/usr/lib/python3.10/site-packages/numpy)
scipy: 1.8.1 (/home/gereon/.local/lib/python3.10/site-packages/scipy)
pandas: 1.4.2 (/usr/lib/python3.10/site-packages/pandas)
dateutil: 2.8.2 (/usr/lib/python3.10/site-packages/dateutil)
patsy: 0.5.2 (/home/gereon/.local/lib/python3.10/site-packages/patsy)

Optional Dependencies

matplotlib: 3.5.2 (/usr/lib/python3.10/site-packages/matplotlib)
backend: GTK3Agg
cvxopt: Not installed
joblib: 1.1.0 (/usr/lib/python3.10/site-packages/joblib)

Developer Tools

IPython: 8.3.0 (/usr/lib/python3.10/site-packages/IPython)
jinja2: 3.0.3 (/usr/lib/python3.10/site-packages/jinja2)
sphinx: 4.5.0 (/usr/lib/python3.10/site-packages/sphinx)
pygments: 2.12.0 (/usr/lib/python3.10/site-packages/pygments)
pytest: 7.1.2 (/usr/lib/python3.10/site-packages/pytest)
virtualenv: 20.11.0 (/usr/lib/python3.10/site-packages/virtualenv)

@Anaphory
Copy link
Author

The offending line is this one.

rates = (y1 / n1, y2 / n2)

If I am right about it being symmetric, it should boil down to

if count2 == 0:
    return test_poisson_2indep(count2, exposure2, count1, exposure1, ratio_null=1/ratio_null, method=method, alternative=alternative, etest_kwds=etest_kwds)

but there may be things in the one-sided tests in particular that I am forgetting.

@josef-pkt
Copy link
Member

josef-pkt commented Jun 16, 2022

zero division should be a warning, but Python raises.
We should use numpy in cases like ratio = rates[0] / rates[1]

np.float64(5) / 0
<ipython-input-93-51152f6d3e5e>:1: RuntimeWarning: divide by zero encountered in double_scalars
  np.float64(5) / 0
inf

The problem is still present in my current PR #8166

score test works even for zero counts,

(using my branch)

test_poisson_2indep(1, 1, np.float64(0), 4)
...\statsmodels\statsmodels\stats\rates.py:858: RuntimeWarning: divide by zero encountered in double_scalars
  ratio = rate1 / rate2

<class 'statsmodels.stats.base.HolderTuple'>
statistic = 2.0
pvalue = 0.04550026389635839
distribution = 'normal'
method = 'score'
alternative = 'two-sided'
rates = (1.0, 0.0)
ratio = inf
diff = 1.0
value = None
rates_cmle = None
ratio_null = 1
tuple = (2.0, 0.04550026389635839)

I think the fix should be something like the following which works automatically for scalar and vectorized rates
np.float64(1) * rate1 / rate2

edit
using np.divide is simpler

@josef-pkt
Copy link
Member

current behavior in #8166
several RuntimeWarnings

It looks like we got continuity with respect to count around zero

Note, nan in exact-cond and cond-midp statistics are by design, not available

count1, n1, count2, n2 = 1, 1, np.float64(0), 4
​
for compare in ["ratio", "diff"]:
    print(compare)
    for meth in method_names_poisson_2indep["test"][compare]:
        tst = test_poisson_2indep(count1, n1, count2, n2, method=meth,
                                  compare=compare,
                                  value=None, alternative='two-sided')
        tst2 = test_poisson_2indep(count1, n1, count2+1e-10, n2, method=meth,
                                  compare=compare,
                                  value=None, alternative='two-sided')
        print("   %-12s" % meth, tst.tuple, tst2.tuple)
​
ratio
   wald         (1.0, 0.31731050786291415) (0.9999999999718752, 0.3173105078765248)
   score        (2.0, 0.04550026389635839) (1.99999999985, 0.0455002639125557)
   score-log    (inf, 0.0) (9.764858116912382, 1.5933487650281872e-22)
   wald-log     (nan, nan) (0.00024412145289839742, 0.9998052192637062)
   exact-cond   (nan, 0.19999999999999996) (nan, 0.19999999999999996)
   cond-midp    (nan, 0.09999999999999995) (nan, 0.09999999999223139)
   sqrt         (1.549895138835137, 0.12116668661678107) (1.5498951387621074, 0.12116668663431229)
   etest-score  (2.0, 0.08791618383341977) (1.99999999985, 0.08791618383563313)
   etest-wald   (1.0, 0.5411722584954477) (0.9999999999718752, 0.5411722585188068)

diff
   wald         (1.0, 0.31731050786291415) (0.9999999999718752, 0.3173105078765248)
   score        (2.0, 0.04550026389635839) (1.99999999985, 0.0455002639125557)
   waldccv      (0.8081220356417685, 0.4190203335343159) (0.8081220356199162, 0.4190203335468944)
   etest-score  (2.0, 0.08791618383341977) (1.99999999985, 0.08791618383563313)
   etest-wald   (1.0, 0.5411722584954477) (0.9999999999718752, 0.5411722585188068)

@josef-pkt
Copy link
Member

wald-log is messier, 3 divisions

elif method in ['wald-log']:
     stat = (np.log(y1 / y2) - np.log(r_d)) / np.sqrt(1 / y1 + 1 / y2)

also, we don't use asarray in the functions

@josef-pkt
Copy link
Member

josef-pkt commented Jun 17, 2022

asarray it is

It works and it still returns a scalar (nulmpy scalar)

type(tst.pvalue)
numpy.float64

change will be in PR #8166 (fixing main would just cause messy merge conflicts)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants