missing in extra data (example sandwiches, robust covariances) #1220

Open
josef-pkt opened this Issue Dec 9, 2013 · 7 comments

@josef-pkt
Member

josef-pkt commented Dec 9, 2013

See the comments in #863.

The sandwich calculations require additional arrays such as groups or time.

If patsy or statsmodels' missing='drop' removes rows with missing values, then the user's extra data no longer lines up with the cleaned endog and exog.

Currently the user needs to drop rows themselves to get consistently matched data for all variables (it's just a dropna() when using pandas).

Instead, we could drop the rows internally when missing='drop' and we have the index of the dropped rows.
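For concreteness, a minimal sketch of the mismatch and of the current workaround (column names and data are invented; the cluster-robust fit uses the documented cov_type/cov_kwds interface):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: one missing value in y and one in x.
df = pd.DataFrame({
    "y":    [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x":    [0.1, np.nan, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "firm": [1, 1, 2, 2, 3, 3, 4, 4],   # extra array needed for the sandwich cov
})

# missing='drop' (or patsy's NA handling) removes rows 1 and 2 internally,
# so df["firm"] (length 8) no longer lines up with the cleaned endog/exog (length 6).

# Current workaround: drop the rows yourself so all columns stay aligned.
df_clean = df.dropna()
res = smf.ols("y ~ x", data=df_clean).fit(
    cov_type="cluster", cov_kwds={"groups": df_clean["firm"]}
)
print(res.bse)
```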

@josef-pkt josef-pkt referenced this issue Jan 29, 2014

Closed

Gee #1314

@josef-pkt


Member

josef-pkt commented Jun 16, 2014

We should just add a helper function that checks and converts/adjusts extra arrays for missing values, based on the stored information about which rows have been dropped or kept.
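A rough sketch of what such a helper might look like (name and signature are hypothetical; it assumes the model records which rows were kept):

```python
import numpy as np
import pandas as pd

def _adjust_extra_array(arr, kept_index, nobs):
    """Hypothetical helper: restrict an extra array (groups, time, ...) to the
    rows kept after missing-value handling, or raise if that is not possible."""
    if isinstance(arr, (pd.Series, pd.DataFrame)):
        if not pd.Index(kept_index).isin(arr.index).all():
            raise ValueError("extra array does not cover all retained rows")
        aligned = arr.loc[kept_index]
        if aligned.isnull().values.any():
            raise ValueError("extra array has missing values in retained rows")
        return aligned
    arr = np.asarray(arr)
    if len(arr) == nobs:
        return arr                # already matches the cleaned data
    return arr[kept_index]        # assumes kept_index holds integer positions
```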

@josef-pkt josef-pkt added this to the 0.6 milestone Aug 12, 2014

@josef-pkt


Member

josef-pkt commented Aug 12, 2014

Adding to the 0.6 milestone; if this turns out not to be easy to add, we drop it from 0.6 again.

@jseabold


Member

jseabold commented Oct 9, 2014

Is there an example that demonstrates the problem somewhere for a test?

@jseabold


Member

jseabold commented Oct 15, 2014

Should be closed by #2034. A test case here would be helpful. Re-open if still present.

@jseabold jseabold closed this Oct 15, 2014

@josef-pkt


Member

josef-pkt commented Nov 15, 2015

Reopening.

#2034 fixed the handling of extra arrays in model.__init__, but cov_type and cov_kwds are fit arguments and don't have the missing-value connection yet.
http://stackoverflow.com/questions/33712400/statsmodels-ols-clustered-standard-errors-not-accepting-series-from-df

We can add the check to the cov_type handling for the case where the original data and the cov arrays have matching pandas indices and there are no extra missing values in the cov_kwds Series or DataFrame. In all other cases we don't have enough information to adjust the cov_kwds data and need to raise an exception.
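A sketch of that check, assuming the cov_kwds entry is a pandas object whose index can be matched against the index of the rows the model retained (the function name is hypothetical):

```python
import pandas as pd

def _check_cov_kwd(value, model_index):
    """Hypothetical check: align a cov_kwds entry (e.g. groups or time)
    with the rows the model kept, or raise if we cannot."""
    if not isinstance(value, (pd.Series, pd.DataFrame)):
        raise ValueError("cannot adjust cov_kwds for dropped rows: "
                         "no pandas index to match against the model data")
    if not model_index.isin(value.index).all():
        raise ValueError("cov_kwds data does not cover all model observations")
    aligned = value.loc[model_index]
    if aligned.isnull().values.any():
        raise ValueError("cov_kwds data contains additional missing values")
    return aligned
```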

@josef-pkt josef-pkt reopened this Nov 15, 2015

@josef-pkt josef-pkt modified the milestones: 0.10, 0.6 May 27, 2017

@jbrockmendel


Contributor

jbrockmendel commented May 27, 2017

Is this related to the place(s) in ModelData.__init__ that run self.__dict__.update(...)?

@josef-pkt


Member

josef-pkt commented May 27, 2017

@jbrockmendel No, I think not.
As per my last comment here, extra arrays in a model's __init__ are, or can be, handled by the missing check; that's why the issue was closed. (I don't know whether every model with extra arrays already uses this.)

However, the problem is that we now get additional arrays or pandas equivalents in fit's cov_kwds for panel and cluster cov_types. Those are currently not checked for whether the original data had rows removed by the missing-value handling. Those arrays also never go through the model.data handling.

I moved this to 0.10 because it will be a bit messy, and users can work around it by just using dropna() on their initial DataFrame. (I still think users are better off calling dropna() themselves than pushing it into the model.)

Aside:
The missing='drop' option in __init__ checks all arrays for any missing values in each row. For arrays passed to fit we already have the model and cannot handle any additional missing values. However, it would be possible to adjust for the rows that were removed in __init__ if we have the matching index.
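For illustration, the adjustment described in the aside, assuming the model exposes the retained pandas index (model.data.row_labels for pandas inputs; the data is a toy example):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "y":    [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x":    [0.1, 0.2, 0.3, np.nan, 0.5, 0.6, 0.7, 0.8],
    "firm": [1, 1, 2, 2, 3, 3, 4, 4],
})

mod = sm.OLS(df["y"], sm.add_constant(df["x"]), missing="drop")
kept = mod.data.row_labels                 # index of the rows that survived __init__
groups = df["firm"].loc[kept]              # align the cluster variable with the model
res = mod.fit(cov_type="cluster", cov_kwds={"groups": groups})
print(res.bse)
```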
