Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: GEE default covariance is not used #1906

Closed
josef-pkt opened this issue Aug 18, 2014 · 10 comments

Comments

Projects
None yet
2 participants
@josef-pkt
Copy link
Member

commented Aug 18, 2014

If I set the covariance_type in GEE.fit, then it doesn't seem to be used.

The summary table is the same whatever covariance_type I have chosen.
bse is also the same
summary, confint and similar have a covariance_type argument but the default doesn't depend on the covariance_type set in GEE.fit

other issue: I prefer to standardize covariance_type -> cov_type

aside:
found while comparing GEE and GLM with cluster robust standard errors
naive is over/under dispersed, scale differs compared to GLM naive by pearson_chi2 / df_resid

bse_naive = np.sqrt(np.diag(rslt1c.naive_covariance))
print(bse_naive)
print(rslt_glm.bse.values)
print(bse_naive / rslt_glm.bse.values)

[ 0.30598606  0.10789098  0.00908399  0.00114894]
[ 0.1356608   0.04783413  0.00402744  0.00050939]
[ 2.25552299  2.25552299  2.25552299  2.25552299]

np.sqrt(rslt_glm.pearson_chi2 / rslt_glm.model.df_resid)
2.2555229851698391

@josef-pkt josef-pkt added this to the 0.6 milestone Aug 18, 2014

@josef-pkt josef-pkt changed the title GEE: default covariance is not used BUG: GEE default covariance is not used Aug 19, 2014

@josef-pkt

This comment has been minimized.

Copy link
Member Author

commented Aug 19, 2014

moving to prio-high

took me a while to see what's wrong

bse is correctly set and uses the define covariance_type

@cache_readonly
    def bse(self):
        return self.standard_errors(self.covariance_type)

However, if we call summary right after fit, then the covariance_type get's changed to the default robust, and then the covariance_type is not the one set in fit anymore
self.covariance_type = covariance_type

summary cannot choose the covariance_type, i.e. we cannot have it as an argument, because it relies on cached attributes which become inconsistent if some of them are changed.

also I don't see pvalues being set to the default covariance_type (pvalues are inherited), I haven't verified this yet, just based on reading the code.

Similar, other inherited methods like wald_test and conf_int should be using the wrong cov_params (just a guess)

edit conf_int is not inherited, but it uses "robust" as default in the method argument, and does not default to the chosen self.covariance_type

@kshedden I will try to prepare a PR for this so that the inherited methods work correctly.
Essentially, use cov_params_default everywhere if covariance_type/cov_type is not specifically given as argument.
(This would follow mostly my recent changes to the handling of covariance matrices and, I think, avoid the possibility to get inconsistent attributes and results.)

@josef-pkt

This comment has been minimized.

Copy link
Member Author

commented Aug 19, 2014

Two results instances, one with and one without calling summary()

http://nbviewer.ipython.org/gist/josef-pkt/3cf9c6a2bf304904c839#Appendix:-Convariance-Types-in-GEE

mod1b = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ind, family=fam)
rslt1b = mod1b.fit(covariance_type='naive')
rslt1b_ = mod1b.fit(covariance_type='naive')
print '\n'
print rslt1b.summary()


                               GEE Regression Results                              
===================================================================================
Dep. Variable:                           y   No. Observations:                  236
Model:                                 GEE   No. clusters:                       59
Method:                        Generalized   Min. cluster size:                   4
                      Estimating Equations   Max. cluster size:                   4
Family:                            Poisson   Mean cluster size:                 4.0
Dependence structure:         Independence   Num. iterations:                    51
Date:                     Tue, 19 Aug 2014   Scale:                           5.087
Covariance type:                    robust   Time:                         12:13:20
====================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.361      1.589      0.112        -0.134     1.280
trt[T.progabide]    -0.1519      0.171     -0.888      0.375        -0.487     0.183
age                  0.0223      0.011      1.960      0.050      2.11e-06     0.045
base                 0.0226      0.001     18.451      0.000         0.020     0.025
==============================================================================
Skew:                          3.7823   Kurtosis:                      28.6672
Centered skew:                 2.7597   Centered kurtosis:             21.9865
==============================================================================
In [31]:

#print(rslt1b_.summary())
In [32]:

rslt1b.covariance_type, rslt1b_.covariance_type
Out[32]:
('robust', 'naive')
In [39]:

rslt1b.bse, rslt1b_.bse
Out[39]:
(array([ 0.36072614,  0.17105111,  0.01140096,  0.00122675]),
 array([ 0.30598606,  0.10789098,  0.00908399,  0.00114894]))
@kshedden

This comment has been minimized.

Copy link
Contributor

commented Aug 19, 2014

@josef-pkt I can work on this, but it sounds like you have already started.
Let me know if you want me to take this over. I'm fine of course if you
can do it.

On Wed, Aug 20, 2014 at 12:33 AM, Josef Perktold notifications@github.com
wrote:

Two results instances, one with and one without calling summary()

http://nbviewer.ipython.org/gist/josef-pkt/3cf9c6a2bf304904c839#Appendix:-Convariance-Types-in-GEE

mod1b = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ind, family=fam)
rslt1b = mod1b.fit(covariance_type='naive')
rslt1b_ = mod1b.fit(covariance_type='naive')
print '\n'
print rslt1b.summary()

                           GEE Regression Results

Dep. Variable: y No. Observations: 236
Model: GEE No. clusters: 59
Method: Generalized Min. cluster size: 4
Estimating Equations Max. cluster size: 4
Family: Poisson Mean cluster size: 4.0
Dependence structure: Independence Num. iterations: 51
Date: Tue, 19 Aug 2014 Scale: 5.087

Covariance type: robust Time: 12:13:20

                   coef    std err          z      P>|z|      [95.0% Conf. Int.]

Intercept 0.5730 0.361 1.589 0.112 -0.134 1.280
trt[T.progabide] -0.1519 0.171 -0.888 0.375 -0.487 0.183
age 0.0223 0.011 1.960 0.050 2.11e-06 0.045

base 0.0226 0.001 18.451 0.000 0.020 0.025

Skew: 3.7823 Kurtosis: 28.6672

Centered skew: 2.7597 Centered kurtosis: 21.9865

In [31]:

#print(rslt1b_.summary())
In [32]:

rslt1b.covariance_type, rslt1b_.covariance_type
Out[32]:
('robust', 'naive')
In [39]:

rslt1b.bse, rslt1b_.bse
Out[39]:
(array([ 0.36072614, 0.17105111, 0.01140096, 0.00122675]),
array([ 0.30598606, 0.10789098, 0.00908399, 0.00114894]))


Reply to this email directly or view it on GitHub
#1906 (comment)
.

@josef-pkt

This comment has been minimized.

Copy link
Member Author

commented Aug 19, 2014

Yes I already started, mainly to see if fixing summary and adding cov_params_default solves the main problems.
I'm working on top of two other PR's, one not yet pushed.

@josef-pkt

This comment has been minimized.

Copy link
Member Author

commented Aug 19, 2014

another tricky piece:
setting cov_params_default in GEEResult.__init__ doesn't actually work because the covariances and covariance_type is added to the results instance after the init has been set.

we would have to set it in the fit method already
However, it's cleaner to add some of the covariance information as arguments in GEEResults.__init__ similar to the way I handle it in the other models.

@josef-pkt

This comment has been minimized.

Copy link
Member Author

commented Aug 19, 2014

@kshedden If you want to see how I did it for GEE, the discrete models and for RegressionModel
https://github.com/statsmodels/statsmodels/pull/1870/files#diff-724e15f4019b2a24906f6b271d67e370L876

new keywords in XXXResults.__init__ : cov_type='nonrobust', cov_kwds=None, use_t=None

@josef-pkt

This comment has been minimized.

Copy link
Member Author

commented Aug 19, 2014

GEE would default to cov_type='robust' as it does now, since it's the standard for GEE.

@josef-pkt

This comment has been minimized.

Copy link
Member Author

commented Aug 19, 2014

related to last point OrdinalGEEResults and NominalGEEResults need the same __init__ arguments. Both of those __init__ methods could currently be dropped, they don't do anything except call super.

josef-pkt added a commit to josef-pkt/statsmodels that referenced this issue Aug 20, 2014

@josef-pkt

This comment has been minimized.

Copy link
Member Author

commented Aug 20, 2014

I started the minimal changes so far in #1916

@kshedden One question, based on my reading of the code the subclasses NominalGEEResults and OrdinalGEEResults never get the naive and bias corrected covariances attached. They are calculated in the super call but not attached to the subclass results instance.

Is this intentional or a bug? Should those two also have all three covariances?
(right now I only attached the robust_covariance so all existing unit tests pass after the first refactoring.

a consequence of fixing this: So far cov_params() was pointing to the robust covariance. Now it always returns the default covariance, cov_params_default that depends on the cov_type. To have the robust covariance matrix available, I attached it as new attribute robust_covariance, which follows the same pattern as currently for the other two.

(more radical change: only calculate robust and robust_bc covariances on demand, and move part of GEE_covmat to be called by the results class. (And store the naive covariance as normalized_cov_params.) (robust_bc doesn't need robust, AFAICS from a quick look.)

@josef-pkt

This comment has been minimized.

Copy link
Member Author

commented Aug 23, 2014

fixed in #1924 and merged

@josef-pkt josef-pkt closed this Aug 23, 2014

bert9bert added a commit to bert9bert/statsmodels that referenced this issue Aug 29, 2014

yarikoptic added a commit to yarikoptic/statsmodels that referenced this issue Oct 23, 2014

Merge commit 'v0.5.0-1269-g957a43e' into debian-experimental
* commit 'v0.5.0-1269-g957a43e': (59 commits)
  TST: avoid pandas in assert with wrapped results (failure pandas>0.12)
  REF: GEEResults rename to postfix names, keep old as alias for now
  BUG: NominalGEE pandas handling, fix typo, adjust unit tests
  ENH: GEE add Wrapper, closes statsmodels#1904
  BUG: NominalGEE fix initialization for pandas, closes statsmodels#1931
  BUG/ENH: add df_resid, df_model, fixes t_test, closes statsmodels#1918
  REF: GEE rename naive_covariance -> cov_naive, same for others
  REF: rename covariance_type -> cov_type
  TST: test_glm: remove test noise
  BUG: GEE subclasses cov_params_default doesn't get attached, unit tests
  REF: GEE summary remove covariance_type argument (not possible)
  BUG/TST: GEE fix predict closes statsmodels#1919 and fix conf_int for cov_params_default add GEE to generic test, and adjust those (missing _results, summary2)
  BUG: GEE partial cleanup of cov_type, see statsmodels#1906
  ENH: cov_kwds add generic 'scaling_factor' option to match Stata
  BUG: genmod links, CDFLink, probit deriv2 use approx_fprime (not complex)
  BUG: fix used hessian in sandwich_covariance, with GLM test cases
  REF: GLM change default scale in score_obs (tests pass, possibly wrong, see PR comment)
  REF/TST generic cov_type, TST compare GLM Logit, GLM OLS (this fails)
  REF/ENH: add get_robustcov_results generically to LikelihoodModel/Results    NegativeBinomial as test case, needs cleanup for code duplication.
  ENH: add cov_type to GLM.fit, with tests for family Poisson
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.