error with GEE on missing data. #1877

Closed
brentp opened this Issue Aug 8, 2014 · 11 comments

Projects

None yet

4 participants

@brentp
brentp commented Aug 8, 2014

Self-contained script below. This is because the missing data removes entire groups and then it attempts to index an array with an empty array in cluster_list

import statsmodels.api as sm
from statsmodels.genmod.dependence_structures import Independence
import patsy
import pandas as pd
import numpy as np
from StringIO import StringIO

data = StringIO("""\
id,al,status,fake,grps
4A,A,1,1,0
5A,A,1,2.0,1
6A,A,1,1,2
7A,A,1,2.0,3
8A,A,1,1,4
9A,A,1,2.0,5
11A,A,1,1,6
12A,A,1,2.0,7
13A,A,1,1,8
14A,A,1,1,9
15A,A,1,1,10
16A,A,1,2.0,11
17A,A,1,2.0,12
18A,A,1,2.0,13
19A,A,1,2.0,14
20A,A,1,2.0,15
2C,C,0,2.0,0
3C,C,0,1,1
4C,C,0,1,2
5C,C,0,2.0,3
6C,C,0,1,4
9C,C,0,1,5
10C,C,0,1,6
12C,C,0,1,7
14C,C,0,1,8
15C,C,0,1,9
17C,C,0,1,10
22C,C,0,1,11
23C,C,0,1,12
24C,C,0,1,13
32C,C,0,2.0,14
35C,C,0,1,15""")

df = pd.read_csv(data)

y, X = patsy.dmatrices("status ~ fake", df)


# this is OK.
model = sm.GEE(y, X, groups = df['grps'],
           cov_struct=Independence(), family=sm.families.Binomial())
print("OK!!")


# add missing data.
df.ix[df['fake'] == 1, 'fake'] = np.nan
y, X = patsy.dmatrices("status ~ fake", df)

#ERROR here:
model = sm.GEE(y, X, groups = df['grps'],
           cov_struct=Independence(), family=sm.families.Binomial())

Here is the output + traceback:

OK!!
Traceback (most recent call last):
  File "t.py", line 59, in <module>
    cov_struct=Independence(), family=sm.families.Binomial())
  File "/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/genmod/generalized_estimating_equations.py", line 261, in __init__
    constraint=constraint)
  File "/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/genmod/generalized_estimating_equations.py", line 335, in _reset
    self.endog_li = self.cluster_list(self.endog)
  File "/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/genmod/generalized_estimating_equations.py", line 411, in cluster_list
    for k in self.group_labels]
IndexError: arrays used as indices must be of integer (or boolean) type
@josef-pkt
Member

I think this is mainly a disagreement about who is doing the missing value handling.
(We don't support yet that extra arrays are correctly handled for missing values if patsy removes the missing values first.)

However, AFAICS, In this example we need to raise an exception that groups doesn't have the same length as y and X.
There is no way that GEE can correctly interpret groups if it is longer than y and X, and doesn't match up.
If y and X were also pandas series or dataframes then we could match on indices. That's the idea for sharing missing handling between patsy and statsmodels.

Note using pandas dropna to drop the missing observations from the entire dataframe including from df['grps'] works (but breaks in fit with numpy.linalg.linalg.LinAlgError: Singular matrix)

# add missing data.
df.ix[df['fake'] == 1, 'fake'] = np.nan
df = df.dropna()
y, X = patsy.dmatrices("status ~ fake", df)

#ERROR here:
model = sm.GEE(y, X, groups = df['grps'],
           cov_struct=Independence(), family=sm.families.Binomial())
@josef-pkt josef-pkt added this to the 0.6 milestone Aug 8, 2014
@brentp
brentp commented Aug 8, 2014

I see. It does seem like something could be done in when from_formula is used.

model = sm.GEE.from_formula("status ~ fake", df, groups = df['grps'],
           cov_struct=Independence(), family=sm.families.Binomial())

gives the same error.

@kshedden
Contributor

I will look into this soon. If I recall I previously checked for empty groups but took that out because I thought it was superfluous.

@josef-pkt
Member

@kshedden I don't think you have to fix empty groups unless we find an example for it. (I guess it is not possible to have them.)

The "bug" is that this example doesn't raise an exception with incorrect groups (wrong shape) right away.

The extension that we don't have yet in any model, is to disable patsy's missing value handling, or to cooperate with patsy's missing value handling. issue #805

@kshedden
Contributor

I still haven't had a chance to look into this, but from_formula should work with missing data. I will figure out what is going on there...

@kshedden kshedden added a commit to kshedden/statsmodels that referenced this issue Aug 12, 2014
@kshedden kshedden Fix #1877 relating to missing data handling in GEE ed1bc5f
@kshedden
Contributor

I think I have fixed this now. I was using groups instead of self.groups after the missing data was already processed upstream. This is fixed in my gee_sensitivity3 branch. I also added a test for this. All tests run clean locally, waiting for travis to return.

However the example from @brentp will still fail with a linear algebra error, because after dropping the cases with missing values, the independent variable becomes constant.

@kshedden
Contributor

Travis returns green.

@kshedden
Contributor

My mistake. This is not going to work out for the time being. I was misled because it worked with an old version of Patsy, but that was a fluke.

GEE called from the API should handle missing data properly, but GEE using formulas will not work with missing data until this issue is addressed upstream.

@josef-pkt josef-pkt closed this in 455add8 Aug 20, 2014
@josef-pkt
Member

reopen, this has to be fixed in base

@josef-pkt josef-pkt reopened this Aug 23, 2014
@josef-pkt
Member

this should still get a closer look before a 0.6 release

@bert9bert bert9bert added a commit to bert9bert/statsmodels that referenced this issue Aug 29, 2014
@kshedden @bert9bert kshedden + bert9bert Fix #1877 relating to missing data handling in GEE 26a78dd
@PierreBdR PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014
@kshedden @josef-pkt kshedden + josef-pkt Fix #1877 relating to missing data handling in GEE b343427
@jseabold
Member

Closed by #2034.

@jseabold jseabold closed this Oct 15, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment