Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error with GEE on missing data. #1877

Closed
brentp opened this Issue Aug 8, 2014 · 11 comments

Comments

Projects
None yet
4 participants
@brentp
Copy link

commented Aug 8, 2014

Self-contained script below. This is because the missing data removes entire groups and then it attempts to index an array with an empty array in cluster_list

import statsmodels.api as sm
from statsmodels.genmod.dependence_structures import Independence
import patsy
import pandas as pd
import numpy as np
from StringIO import StringIO

data = StringIO("""\
id,al,status,fake,grps
4A,A,1,1,0
5A,A,1,2.0,1
6A,A,1,1,2
7A,A,1,2.0,3
8A,A,1,1,4
9A,A,1,2.0,5
11A,A,1,1,6
12A,A,1,2.0,7
13A,A,1,1,8
14A,A,1,1,9
15A,A,1,1,10
16A,A,1,2.0,11
17A,A,1,2.0,12
18A,A,1,2.0,13
19A,A,1,2.0,14
20A,A,1,2.0,15
2C,C,0,2.0,0
3C,C,0,1,1
4C,C,0,1,2
5C,C,0,2.0,3
6C,C,0,1,4
9C,C,0,1,5
10C,C,0,1,6
12C,C,0,1,7
14C,C,0,1,8
15C,C,0,1,9
17C,C,0,1,10
22C,C,0,1,11
23C,C,0,1,12
24C,C,0,1,13
32C,C,0,2.0,14
35C,C,0,1,15""")

df = pd.read_csv(data)

y, X = patsy.dmatrices("status ~ fake", df)


# this is OK.
model = sm.GEE(y, X, groups = df['grps'],
           cov_struct=Independence(), family=sm.families.Binomial())
print("OK!!")


# add missing data.
df.ix[df['fake'] == 1, 'fake'] = np.nan
y, X = patsy.dmatrices("status ~ fake", df)

#ERROR here:
model = sm.GEE(y, X, groups = df['grps'],
           cov_struct=Independence(), family=sm.families.Binomial())

Here is the output + traceback:

OK!!
Traceback (most recent call last):
  File "t.py", line 59, in <module>
    cov_struct=Independence(), family=sm.families.Binomial())
  File "/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/genmod/generalized_estimating_equations.py", line 261, in __init__
    constraint=constraint)
  File "/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/genmod/generalized_estimating_equations.py", line 335, in _reset
    self.endog_li = self.cluster_list(self.endog)
  File "/usr/local/lib/python2.7/dist-packages/statsmodels-0.6.0-py2.7-linux-x86_64.egg/statsmodels/genmod/generalized_estimating_equations.py", line 411, in cluster_list
    for k in self.group_labels]
IndexError: arrays used as indices must be of integer (or boolean) type
@josef-pkt

This comment has been minimized.

Copy link
Member

commented Aug 8, 2014

I think this is mainly a disagreement about who is doing the missing value handling.
(We don't support yet that extra arrays are correctly handled for missing values if patsy removes the missing values first.)

However, AFAICS, In this example we need to raise an exception that groups doesn't have the same length as y and X.
There is no way that GEE can correctly interpret groups if it is longer than y and X, and doesn't match up.
If y and X were also pandas series or dataframes then we could match on indices. That's the idea for sharing missing handling between patsy and statsmodels.

Note using pandas dropna to drop the missing observations from the entire dataframe including from df['grps'] works (but breaks in fit with numpy.linalg.linalg.LinAlgError: Singular matrix)

# add missing data.
df.ix[df['fake'] == 1, 'fake'] = np.nan
df = df.dropna()
y, X = patsy.dmatrices("status ~ fake", df)

#ERROR here:
model = sm.GEE(y, X, groups = df['grps'],
           cov_struct=Independence(), family=sm.families.Binomial())

@josef-pkt josef-pkt added this to the 0.6 milestone Aug 8, 2014

@brentp

This comment has been minimized.

Copy link
Author

commented Aug 8, 2014

I see. It does seem like something could be done in when from_formula is used.

model = sm.GEE.from_formula("status ~ fake", df, groups = df['grps'],
           cov_struct=Independence(), family=sm.families.Binomial())

gives the same error.

@kshedden

This comment has been minimized.

Copy link
Contributor

commented Aug 11, 2014

I will look into this soon. If I recall I previously checked for empty groups but took that out because I thought it was superfluous.

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Aug 11, 2014

@kshedden I don't think you have to fix empty groups unless we find an example for it. (I guess it is not possible to have them.)

The "bug" is that this example doesn't raise an exception with incorrect groups (wrong shape) right away.

The extension that we don't have yet in any model, is to disable patsy's missing value handling, or to cooperate with patsy's missing value handling. issue #805

@kshedden

This comment has been minimized.

Copy link
Contributor

commented Aug 12, 2014

I still haven't had a chance to look into this, but from_formula should work with missing data. I will figure out what is going on there...

@kshedden

This comment has been minimized.

Copy link
Contributor

commented Aug 12, 2014

I think I have fixed this now. I was using groups instead of self.groups after the missing data was already processed upstream. This is fixed in my gee_sensitivity3 branch. I also added a test for this. All tests run clean locally, waiting for travis to return.

However the example from @brentp will still fail with a linear algebra error, because after dropping the cases with missing values, the independent variable becomes constant.

@kshedden

This comment has been minimized.

Copy link
Contributor

commented Aug 12, 2014

Travis returns green.

@kshedden

This comment has been minimized.

Copy link
Contributor

commented Aug 13, 2014

My mistake. This is not going to work out for the time being. I was misled because it worked with an old version of Patsy, but that was a fluke.

GEE called from the API should handle missing data properly, but GEE using formulas will not work with missing data until this issue is addressed upstream.

@josef-pkt josef-pkt closed this in 455add8 Aug 20, 2014

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Aug 23, 2014

reopen, this has to be fixed in base

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Aug 23, 2014

this should still get a closer look before a 0.6 release

bert9bert added a commit to bert9bert/statsmodels that referenced this issue Aug 29, 2014

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014

@jseabold

This comment has been minimized.

Copy link
Member

commented Oct 15, 2014

Closed by #2034.

@jseabold jseabold closed this Oct 15, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.