robust covariance, cov_type in fit #1870

Merged
merged 13 commits into from Aug 21, 2014

Projects

None yet

2 participants

@josef-pkt
Member

integrating robust covariances into the model fit method, see issue #1418
summary issue for robust covariances is #1158

continuing after PR #1867
now starting with discrete models to build the generic version

two possible problems using a standalone function, e.g.
get_robustcov_results(res_olsg_._results, cov_type='HC1', use_self=True)

  • cache already has content that doesn't get overwritten
  • we cannot call the standalone function with the wrapper. We need access and change the underlying Results instance.

use_self=True is only intended for use in a results.init when we can make sure we haven't used the cache yet, and internally we only have the results instance and not the wrapper.

@coveralls

Coverage Status

Coverage decreased (-0.05%) when pulling e1a1917 on josef-pkt:ENH_covtype_discrete into 8709f00 on statsmodels:master.

@josef-pkt josef-pkt added this to the 0.6 milestone Aug 4, 2014
@coveralls

Coverage Status

Coverage decreased (-0.04%) when pulling e2659fa on josef-pkt:ENH_covtype_discrete into 8709f00 on statsmodels:master.

@coveralls

Coverage Status

Coverage decreased (-0.02%) when pulling 4741cac on josef-pkt:ENH_covtype_discrete into 8709f00 on statsmodels:master.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 3e042b4 on josef-pkt:ENH_covtype_discrete into 8709f00 on statsmodels:master.

@josef-pkt
Member

some problems in the design:
commit: REF/ENH: add get_robustcov_results generically to LikelihoodModel/Results 3e042b4

Since we are changing some attributes in the results instance, we need to do it at the right time.

  • transform parameters in fit: Negative Binomial (and some others) change the transform_params during fit. That means that the super class fit only sees the transformed parameter version (of loglike, score/score_obs and hessian). Calculating robust covariance matrices based on transformed parameters is incorrect for the final estimate of untransformed parameters.

example NegativeBinomial nb uses log(alpha) internally but reports alpha. (This is different from models were the original parameterization is in transformed params, e.g. the link functions)

  • current implementation in the last commit, allows for different ways and different locations for adding the robust covariances.
    In LikelihoodModel and Results it is hidden behind kwargs, so that subclasses can intercept cov_type before calling the super methods.
  • use_t there were some changes where use_t is specified and added to the model. #1830 added it to the LikelihoodModel.
    TODO: currently it use_t is handled at different places and is duplicated or repeated several times.
  • new generic method of LikelihoodModel results: _get_robustcov_results. This should allow more generic code that can be overwritten in subclasses if there are specific cases of robust covariances. for example overdispersion in Poisson.
@josef-pkt
Member

possible bug in code so far: how do we handle scale if it separately estimated.

In a new set of test I'm comparing more GLM models to equivalent other models.
For GLM Gaussian compared to OLS the bse have the wrong scale in the current default.
I get correct answers if I set scale=1 in GLM.score_obs.

Current test cases for MLE robust covariances have scale=1, Poisson, NegativeBinomial, and GLM also agrees with Logit.

Difference between estimating equations with a canceled multiplicative term versus score/score_obs which should be the derivative of the full loglikelihood function.

TODO: add score_obs and Hessian (correctly scaled) to RegressionModels, OLS,...
get NormalMLE model back - where did I park my initial version?

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling a939030 on josef-pkt:ENH_covtype_discrete into 8709f00 on statsmodels:master.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 389ee4d on josef-pkt:ENH_covtype_discrete into 8709f00 on statsmodels:master.

josef-pkt added some commits Aug 4, 2014
@josef-pkt
Member

rebased and force pushed.

comment to earlier comment:

possible bug in code so far: how do we handle scale if it separately estimated.

scale wasn't a problem in the code when comparing GLM to OLS,
there was a problem somewhere else that got me confused (accessed the wrong array).

Non-native scale will or might still be a problem in overdispersed Poisson, or when scale is fixed, but not here.

Aside: score and Hessian of OLS have a division by the scale/sigma**2.
GLM uses the division by scale. For OLS, scale drops out of sandwich calculation for OLS and is not included.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 0bcc690 on josef-pkt:ENH_covtype_discrete into ab1a57b on statsmodels:master.

@josef-pkt
Member

TravisCI is green, I will merge this tomorrow.

(work on documentation needs to wait.)

@josef-pkt josef-pkt commented on the diff Aug 21, 2014
statsmodels/base/covtype.py
+# -*- coding: utf-8 -*-
+"""
+Created on Mon Aug 04 08:00:16 2014
+
+Author: Josef Perktold
+License: BSD-3
+
+"""
+
+from statsmodels.compat.python import lrange, lzip, range
+
+import numpy as np
+
+
+def get_robustcov_results(self, cov_type='HC1', use_t=None, **kwds):
+ """create new results instance with robust covariance as default
@josef-pkt
josef-pkt Aug 21, 2014 Member

add use_self explicitly as keyword
docstring doesn't mention, attach to existing instance

@josef-pkt
josef-pkt Aug 21, 2014 Member

need to make docstring into template that can be copied to model methods

@josef-pkt josef-pkt commented on the diff Aug 21, 2014
statsmodels/base/covtype.py
+
+from statsmodels.compat.python import lrange, lzip, range
+
+import numpy as np
+
+
+def get_robustcov_results(self, cov_type='HC1', use_t=None, **kwds):
+ """create new results instance with robust covariance as default
+
+ Parameters
+ ----------
+ cov_type : string
+ the type of robust sandwich estimator to use. see Notes below
+ use_t : bool
+ If true, then the t distribution is used for inference.
+ If false, then the normal distribution is used.
@josef-pkt
josef-pkt Aug 21, 2014 Member

use_t is now separate keyword in fit.
remove it from here ?

@josef-pkt josef-pkt commented on the diff Aug 21, 2014
statsmodels/base/covtype.py
+
+ Parameters
+ ----------
+ cov_type : string
+ the type of robust sandwich estimator to use. see Notes below
+ use_t : bool
+ If true, then the t distribution is used for inference.
+ If false, then the normal distribution is used.
+ kwds : depends on cov_type
+ Required or optional arguments for robust covariance calculation.
+ see Notes below
+
+ Returns
+ -------
+ results : results instance
+ This method creates a new results instance with the requested
@josef-pkt
josef-pkt Aug 21, 2014 Member

outdated, prefered is use_self

@josef-pkt josef-pkt commented on the diff Aug 21, 2014
statsmodels/base/covtype.py
+
+ - `maxlag` integer (required) : number of lags to use
+ - `kernel` string (optional) : kernel, default is Bartlett
+ - `use_correction` False or string in ['hac', 'cluster'] (optional) :
+ If False the the sandwich covariance is calulated without
+ small sample correction.
+ - `df_correction` bool (optional)
+ adjustment to df_resid, see cov_type 'cluster' above
+ #TODO: we need more options here
+
+ Reminder:
+ `use_correction` in "nw-groupsum" and "nw-panel" is not bool,
+ needs to be in [False, 'hac', 'cluster']
+
+ TODO: Currently there is no check for extra or misspelled keywords,
+ except in the case of cov_type `HCx`
@josef-pkt
josef-pkt Aug 21, 2014 Member

needs to be fixed

@josef-pkt josef-pkt commented on the diff Aug 21, 2014
statsmodels/base/covtype.py
+ # other models
+ # TODO: make it DRYer repeated code for checking kwds
+ if cov_type in ('HC0', 'HC1', 'HC2', 'HC3'):
+ if kwds:
+ raise ValueError('heteroscedasticity robust covarians ' +
+ 'does not use keywords')
+ res.cov_kwds['description'] = ('Standard Errors are heteroscedasticity ' +
+ 'robust ' + '(' + cov_type + ')')
+
+ res.cov_params_default = getattr(self, 'cov_' + cov_type.upper(), None)
+ if res.cov_params_default is None:
+ # results classes that don't have cov_HCx attribute
+ res.cov_params_default = sw.cov_white_simple(self,
+ use_correction=False)
+ elif cov_type == 'HAC':
+ maxlags = kwds['maxlags'] # required?, default in cov_hac_simple
@josef-pkt
josef-pkt Aug 21, 2014 Member

do we use default for maxlags in HAC?
can be changed in a backwards compatible way.

@josef-pkt josef-pkt commented on the diff Aug 21, 2014
statsmodels/base/model.py
@@ -409,8 +409,18 @@ def fit(self, start_params=None, method='newton', maxiter=100,
warn(warndoc, RuntimeWarning)
Hinv = None
+ if 'cov_type' in kwargs:
+ cov_kwds = kwargs.get('cov_kwds', {})
+ kwds = {'cov_type':kwargs['cov_type'], 'cov_kwds':cov_kwds}
+ else:
+ kwds = {}
+ if 'use_t' in kwargs:
+ kwds['use_t'] = kwargs['use_t']
+ #prints for debugging
@josef-pkt
josef-pkt Aug 21, 2014 Member

these kwargs are additional to any optimizer kwargs.
Not enough unit tests to see whether this works for all combinations.
refactor optimization options to optim_kwargs (later) see GMM

@josef-pkt josef-pkt commented on the diff Aug 21, 2014
statsmodels/base/covtype.py
+
+ Notes
+ -----
+ Warning: Some of the options and defaults in cov_kwds may be changed in a
+ future version.
+
+ The covariance keywords provide an option 'scaling_factor' to adjust the
+ scaling of the covariance matrix, that is the covariance is multiplied by
+ this factor if it is given and is not `None`. This allows the user to
+ adjust the scaling of the covariance matrix to match other statistical
+ packages.
+ For example, `scaling_factor=(nobs - 1.) / (nobs - k_params)` provides a
+ correction so that the robust covariance matrices match those of Stata in
+ some models like GLM and discrete Models.
+
+ The following covariance types and required or optional arguments are
@josef-pkt
josef-pkt Aug 21, 2014 Member

TODO: add nonrobust handling in here

@josef-pkt josef-pkt commented on the diff Aug 21, 2014
statsmodels/base/model.py
+ # robust covariance
+ # We put cov_type in kwargs so subclasses can decide in fit whether to
+ # use this generic implementation
+ if 'use_t' in kwargs:
+ use_t = kwargs['use_t']
+ if use_t is not None:
+ self.use_t = use_t
+ if 'cov_type' in kwargs:
+ cov_type = kwargs.get('cov_type', 'nonrobust')
+ cov_kwds = kwargs.get('cov_kwds', {})
+
+ if cov_type == 'nonrobust':
+ self.cov_type = 'nonrobust'
+ self.cov_kwds = {'description' : 'Standard Errors assume that the ' +
+ 'covariance matrix of the errors is correctly ' +
+ 'specified.'}
@josef-pkt
josef-pkt Aug 21, 2014 Member

nonrobust doesn't define cov_params_default
move to get_robustcov_results or not.
OLS AFAIR the nonrobust case doesn't change the current calls and code path in OLS, except for adding this information.

@josef-pkt
Member

merging, see followup in #1922

@josef-pkt josef-pkt merged commit e373b72 into statsmodels:master Aug 21, 2014

2 checks passed

continuous-integration/appveyor AppVeyor build succeeded
Details
continuous-integration/travis-ci The Travis CI build passed
Details
@josef-pkt josef-pkt deleted the josef-pkt:ENH_covtype_discrete branch Aug 21, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment