Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error with GEE on missing data. #1877

Closed
brentp opened this issue Aug 8, 2014 · 11 comments
Closed

error with GEE on missing data. #1877

brentp opened this issue Aug 8, 2014 · 11 comments

Comments

@brentp
Copy link
Contributor

brentp commented Aug 8, 2014

Self-contained script below. This is because the missing data removes entire groups and then it attempts to index an array with an empty array in cluster_list

import statsmodels.api as sm
from statsmodels.genmod.dependence_structures import Independence
import patsy
import pandas as pd
import numpy as np
from StringIO import StringIO

data = StringIO("""\
id,al,status,fake,grps
4A,A,1,1,0
5A,A,1,2.0,1
6A,A,1,1,2
7A,A,1,2.0,3
8A,A,1,1,4
9A,A,1,2.0,5
11A,A,1,1,6
12A,A,1,2.0,7
13A,A,1,1,8
14A,A,1,1,9
15A,A,1,1,10
16A,A,1,2.0,11
17A,A,1,2.0,12
18A,A,1,2.0,13
19A,A,1,2.0,14
20A,A,1,2.0,15
2C,C,0,2.0,0
3C,C,0,1,1
4C,C,0,1,2
5C,C,0,2.0,3
6C,C,0,1,4
9C,C,0,1,5
10C,C,0,1,6
12C,C,0,1,7
14C,C,0,1,8
15C,C,0,1,9
17C,C,0,1,10
22C,C,0,1,11
23C,C,0,1,12
24C,C,0,1,13
32C,C,0,2.0,14
35C,C,0,1,15""")

df = pd.read_csv(data)

y, X = patsy.dmatrices("status ~ fake", df)


# this is OK.
model = sm.GEE(y, X, groups = df['grps'],
           cov_struct=Independence(), family=sm.families.Binomial())
print("OK!!")


# add missing data.
df.ix[df['fake'] == 1, 'fake'] = np.nan
y, X = patsy.dmatrices("status ~ fake", df)

#ERROR here:
model = sm.GEE(y, X, groups = df['grps'],
           cov_struct=Independence(), family=sm.families.Binomial())

Here is the output + traceback:

OK!!
Traceback (most recent call last):
  File "t.py", line 59, in <module>
    cov_struct=Independence(), family=sm.families.Binomial())
  File "/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/genmod/generalized_estimating_equations.py", line 261, in __init__
    constraint=constraint)
  File "/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/genmod/generalized_estimating_equations.py", line 335, in _reset
    self.endog_li = self.cluster_list(self.endog)
  File "/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/genmod/generalized_estimating_equations.py", line 411, in cluster_list
    for k in self.group_labels]
IndexError: arrays used as indices must be of integer (or boolean) type
@josef-pkt
Copy link
Member

I think this is mainly a disagreement about who is doing the missing value handling.
(We don't support yet that extra arrays are correctly handled for missing values if patsy removes the missing values first.)

However, AFAICS, In this example we need to raise an exception that groups doesn't have the same length as y and X.
There is no way that GEE can correctly interpret groups if it is longer than y and X, and doesn't match up.
If y and X were also pandas series or dataframes then we could match on indices. That's the idea for sharing missing handling between patsy and statsmodels.

Note using pandas dropna to drop the missing observations from the entire dataframe including from df['grps'] works (but breaks in fit with numpy.linalg.linalg.LinAlgError: Singular matrix)

# add missing data.
df.ix[df['fake'] == 1, 'fake'] = np.nan
df = df.dropna()
y, X = patsy.dmatrices("status ~ fake", df)

#ERROR here:
model = sm.GEE(y, X, groups = df['grps'],
           cov_struct=Independence(), family=sm.families.Binomial())

@josef-pkt josef-pkt added this to the 0.6 milestone Aug 8, 2014
@brentp
Copy link
Contributor Author

brentp commented Aug 8, 2014

I see. It does seem like something could be done in when from_formula is used.

model = sm.GEE.from_formula("status ~ fake", df, groups = df['grps'],
           cov_struct=Independence(), family=sm.families.Binomial())

gives the same error.

@kshedden
Copy link
Contributor

I will look into this soon. If I recall I previously checked for empty groups but took that out because I thought it was superfluous.

@josef-pkt
Copy link
Member

@kshedden I don't think you have to fix empty groups unless we find an example for it. (I guess it is not possible to have them.)

The "bug" is that this example doesn't raise an exception with incorrect groups (wrong shape) right away.

The extension that we don't have yet in any model, is to disable patsy's missing value handling, or to cooperate with patsy's missing value handling. issue #805

@kshedden
Copy link
Contributor

I still haven't had a chance to look into this, but from_formula should work with missing data. I will figure out what is going on there...

@kshedden
Copy link
Contributor

I think I have fixed this now. I was using groups instead of self.groups after the missing data was already processed upstream. This is fixed in my gee_sensitivity3 branch. I also added a test for this. All tests run clean locally, waiting for travis to return.

However the example from @brentp will still fail with a linear algebra error, because after dropping the cases with missing values, the independent variable becomes constant.

@kshedden
Copy link
Contributor

Travis returns green.

@kshedden
Copy link
Contributor

My mistake. This is not going to work out for the time being. I was misled because it worked with an old version of Patsy, but that was a fluke.

GEE called from the API should handle missing data properly, but GEE using formulas will not work with missing data until this issue is addressed upstream.

@josef-pkt
Copy link
Member

reopen, this has to be fixed in base

@josef-pkt
Copy link
Member

this should still get a closer look before a 0.6 release

bert9bert pushed a commit to bert9bert/statsmodels that referenced this issue Aug 29, 2014
PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014
@jseabold
Copy link
Member

Closed by #2034.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants