BUG: GEE default covariance is not used #1906

Closed
josef-pkt opened this Issue Aug 18, 2014 · 10 comments

Projects

None yet

2 participants

@josef-pkt
Member

If I set the covariance_type in GEE.fit, then it doesn't seem to be used.

The summary table is the same whatever covariance_type I have chosen.
bse is also the same
summary, confint and similar have a covariance_type argument but the default doesn't depend on the covariance_type set in GEE.fit

other issue: I prefer to standardize covariance_type -> cov_type

aside:
found while comparing GEE and GLM with cluster robust standard errors
naive is over/under dispersed, scale differs compared to GLM naive by pearson_chi2 / df_resid

bse_naive = np.sqrt(np.diag(rslt1c.naive_covariance))
print(bse_naive)
print(rslt_glm.bse.values)
print(bse_naive / rslt_glm.bse.values)

[ 0.30598606  0.10789098  0.00908399  0.00114894]
[ 0.1356608   0.04783413  0.00402744  0.00050939]
[ 2.25552299  2.25552299  2.25552299  2.25552299]

np.sqrt(rslt_glm.pearson_chi2 / rslt_glm.model.df_resid)
2.2555229851698391
@josef-pkt josef-pkt added this to the 0.6 milestone Aug 18, 2014
@josef-pkt josef-pkt changed the title from GEE: default covariance is not used to BUG: GEE default covariance is not used Aug 19, 2014
@josef-pkt
Member

moving to prio-high

took me a while to see what's wrong

bse is correctly set and uses the define covariance_type

@cache_readonly
    def bse(self):
        return self.standard_errors(self.covariance_type)

However, if we call summary right after fit, then the covariance_type get's changed to the default robust, and then the covariance_type is not the one set in fit anymore
self.covariance_type = covariance_type

summary cannot choose the covariance_type, i.e. we cannot have it as an argument, because it relies on cached attributes which become inconsistent if some of them are changed.

also I don't see pvalues being set to the default covariance_type (pvalues are inherited), I haven't verified this yet, just based on reading the code.

Similar, other inherited methods like wald_test and conf_int should be using the wrong cov_params (just a guess)

edit conf_int is not inherited, but it uses "robust" as default in the method argument, and does not default to the chosen self.covariance_type

@kshedden I will try to prepare a PR for this so that the inherited methods work correctly.
Essentially, use cov_params_default everywhere if covariance_type/cov_type is not specifically given as argument.
(This would follow mostly my recent changes to the handling of covariance matrices and, I think, avoid the possibility to get inconsistent attributes and results.)

@josef-pkt
Member

Two results instances, one with and one without calling summary()

http://nbviewer.ipython.org/gist/josef-pkt/3cf9c6a2bf304904c839#Appendix:-Convariance-Types-in-GEE

mod1b = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ind, family=fam)
rslt1b = mod1b.fit(covariance_type='naive')
rslt1b_ = mod1b.fit(covariance_type='naive')
print '\n'
print rslt1b.summary()


                               GEE Regression Results                              
===================================================================================
Dep. Variable:                           y   No. Observations:                  236
Model:                                 GEE   No. clusters:                       59
Method:                        Generalized   Min. cluster size:                   4
                      Estimating Equations   Max. cluster size:                   4
Family:                            Poisson   Mean cluster size:                 4.0
Dependence structure:         Independence   Num. iterations:                    51
Date:                     Tue, 19 Aug 2014   Scale:                           5.087
Covariance type:                    robust   Time:                         12:13:20
====================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept            0.5730      0.361      1.589      0.112        -0.134     1.280
trt[T.progabide]    -0.1519      0.171     -0.888      0.375        -0.487     0.183
age                  0.0223      0.011      1.960      0.050      2.11e-06     0.045
base                 0.0226      0.001     18.451      0.000         0.020     0.025
==============================================================================
Skew:                          3.7823   Kurtosis:                      28.6672
Centered skew:                 2.7597   Centered kurtosis:             21.9865
==============================================================================
In [31]:

#print(rslt1b_.summary())
In [32]:

rslt1b.covariance_type, rslt1b_.covariance_type
Out[32]:
('robust', 'naive')
In [39]:

rslt1b.bse, rslt1b_.bse
Out[39]:
(array([ 0.36072614,  0.17105111,  0.01140096,  0.00122675]),
 array([ 0.30598606,  0.10789098,  0.00908399,  0.00114894]))
@kshedden
Contributor

@josef-pkt I can work on this, but it sounds like you have already started.
Let me know if you want me to take this over. I'm fine of course if you
can do it.

On Wed, Aug 20, 2014 at 12:33 AM, Josef Perktold notifications@github.com
wrote:

Two results instances, one with and one without calling summary()

http://nbviewer.ipython.org/gist/josef-pkt/3cf9c6a2bf304904c839#Appendix:-Convariance-Types-in-GEE

mod1b = GEE.from_formula("y ~ age + trt + base", "subject", data, cov_struct=ind, family=fam)
rslt1b = mod1b.fit(covariance_type='naive')
rslt1b_ = mod1b.fit(covariance_type='naive')
print '\n'
print rslt1b.summary()

                           GEE Regression Results

Dep. Variable: y No. Observations: 236
Model: GEE No. clusters: 59
Method: Generalized Min. cluster size: 4
Estimating Equations Max. cluster size: 4
Family: Poisson Mean cluster size: 4.0
Dependence structure: Independence Num. iterations: 51
Date: Tue, 19 Aug 2014 Scale: 5.087

Covariance type: robust Time: 12:13:20

                   coef    std err          z      P>|z|      [95.0% Conf. Int.]

Intercept 0.5730 0.361 1.589 0.112 -0.134 1.280
trt[T.progabide] -0.1519 0.171 -0.888 0.375 -0.487 0.183
age 0.0223 0.011 1.960 0.050 2.11e-06 0.045

base 0.0226 0.001 18.451 0.000 0.020 0.025

Skew: 3.7823 Kurtosis: 28.6672

Centered skew: 2.7597 Centered kurtosis: 21.9865

In [31]:

#print(rslt1b_.summary())
In [32]:

rslt1b.covariance_type, rslt1b_.covariance_type
Out[32]:
('robust', 'naive')
In [39]:

rslt1b.bse, rslt1b_.bse
Out[39]:
(array([ 0.36072614, 0.17105111, 0.01140096, 0.00122675]),
array([ 0.30598606, 0.10789098, 0.00908399, 0.00114894]))


Reply to this email directly or view it on GitHub
#1906 (comment)
.

@josef-pkt
Member

Yes I already started, mainly to see if fixing summary and adding cov_params_default solves the main problems.
I'm working on top of two other PR's, one not yet pushed.

@josef-pkt
Member

another tricky piece:
setting cov_params_default in GEEResult.__init__ doesn't actually work because the covariances and covariance_type is added to the results instance after the init has been set.

we would have to set it in the fit method already
However, it's cleaner to add some of the covariance information as arguments in GEEResults.__init__ similar to the way I handle it in the other models.

@josef-pkt
Member

@kshedden If you want to see how I did it for GEE, the discrete models and for RegressionModel
https://github.com/statsmodels/statsmodels/pull/1870/files#diff-724e15f4019b2a24906f6b271d67e370L876

new keywords in XXXResults.__init__ : cov_type='nonrobust', cov_kwds=None, use_t=None

@josef-pkt
Member

GEE would default to cov_type='robust' as it does now, since it's the standard for GEE.

@josef-pkt
Member

related to last point OrdinalGEEResults and NominalGEEResults need the same __init__ arguments. Both of those __init__ methods could currently be dropped, they don't do anything except call super.

@josef-pkt josef-pkt added a commit to josef-pkt/statsmodels that referenced this issue Aug 20, 2014
@josef-pkt josef-pkt BUG: GEE partial cleanup of cov_type, see #1906 6805ccf
@josef-pkt
Member

I started the minimal changes so far in #1916

@kshedden One question, based on my reading of the code the subclasses NominalGEEResults and OrdinalGEEResults never get the naive and bias corrected covariances attached. They are calculated in the super call but not attached to the subclass results instance.

Is this intentional or a bug? Should those two also have all three covariances?
(right now I only attached the robust_covariance so all existing unit tests pass after the first refactoring.

a consequence of fixing this: So far cov_params() was pointing to the robust covariance. Now it always returns the default covariance, cov_params_default that depends on the cov_type. To have the robust covariance matrix available, I attached it as new attribute robust_covariance, which follows the same pattern as currently for the other two.

(more radical change: only calculate robust and robust_bc covariances on demand, and move part of GEE_covmat to be called by the results class. (And store the naive covariance as normalized_cov_params.) (robust_bc doesn't need robust, AFAICS from a quick look.)

@josef-pkt josef-pkt added a commit to josef-pkt/statsmodels that referenced this issue Aug 20, 2014
@josef-pkt josef-pkt BUG: GEE partial cleanup of cov_type, see #1906 e0af9c4
@josef-pkt josef-pkt added a commit to josef-pkt/statsmodels that referenced this issue Aug 21, 2014
@josef-pkt josef-pkt BUG: GEE partial cleanup of cov_type, see #1906 06c5245
@josef-pkt
Member

fixed in #1924 and merged

@josef-pkt josef-pkt closed this Aug 23, 2014
@bert9bert bert9bert added a commit to bert9bert/statsmodels that referenced this issue Aug 29, 2014
@josef-pkt @bert9bert josef-pkt + bert9bert BUG: GEE partial cleanup of cov_type, see #1906 92d574b
@yarikoptic yarikoptic added a commit to yarikoptic/statsmodels that referenced this issue Oct 23, 2014
@yarikoptic yarikoptic Merge commit 'v0.5.0-1269-g957a43e' into debian-experimental
* commit 'v0.5.0-1269-g957a43e': (59 commits)
  TST: avoid pandas in assert with wrapped results (failure pandas>0.12)
  REF: GEEResults rename to postfix names, keep old as alias for now
  BUG: NominalGEE pandas handling, fix typo, adjust unit tests
  ENH: GEE add Wrapper, closes #1904
  BUG: NominalGEE fix initialization for pandas, closes #1931
  BUG/ENH: add df_resid, df_model, fixes t_test, closes #1918
  REF: GEE rename naive_covariance -> cov_naive, same for others
  REF: rename covariance_type -> cov_type
  TST: test_glm: remove test noise
  BUG: GEE subclasses cov_params_default doesn't get attached, unit tests
  REF: GEE summary remove covariance_type argument (not possible)
  BUG/TST: GEE fix predict closes #1919 and fix conf_int for cov_params_default add GEE to generic test, and adjust those (missing _results, summary2)
  BUG: GEE partial cleanup of cov_type, see #1906
  ENH: cov_kwds add generic 'scaling_factor' option to match Stata
  BUG: genmod links, CDFLink, probit deriv2 use approx_fprime (not complex)
  BUG: fix used hessian in sandwich_covariance, with GLM test cases
  REF: GLM change default scale in score_obs (tests pass, possibly wrong, see PR comment)
  REF/TST generic cov_type, TST compare GLM Logit, GLM OLS (this fails)
  REF/ENH: add get_robustcov_results generically to LikelihoodModel/Results    NegativeBinomial as test case, needs cleanup for code duplication.
  ENH: add cov_type to GLM.fit, with tests for family Poisson
  ...
f48717d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment