ENH: MLEInfluence for two-part models, extra params, BetaModel #7912
Conversation
My checking, against their appendix: betareg doesn't have df_betas and dffits as public methods. I might have to split params so we can have Cook's distance separately for the mean and precision parameters. The rankings from betareg's Cook's distance and mine look pretty different.

Aside: checking with an explicit LOO loop (copied from OLSInfluence and adjusted for the second exog), d_params looks close, except I would have expected the difference to have the opposite sign.
Reference for influence in a model with variable dispersion: I guess I won't bother matching R betareg for now. They ignore the effect of changing the dispersion parameter estimates. I might just verify a few things with the LOO loop. I haven't checked the fitted/resid related influence and outlier statistics yet, e.g. the (internally) studentized residual.
The score residuals as used in MLEInfluence.resid_studentized are still unclear. My definition is kind of OK and more generic, but it's not what the literature on beta regression is doing. The score residuals defined as sf / hf are reasonably close to Pearson residuals in the BetaModel case for the income/food-expenditure data with one slope variable for precision.
Infrastructure: the latest commit adds the same pattern as in BetaModel to NBP and GPP using numdiff. I didn't look at the new results numbers yet; they need checking.
I'm not sure if I should add the results methods.

Update, decision: I will add the results methods. Then I have a good place to provide model-specific explanations and references. This is mainly for the BetaModel case, where I can refer to Rocha/Simas to explain the differences from R betareg. Probit doesn't support MLEInfluence yet.

For later: we should add an explicit LOO loop for selected observations to the Influence classes.
Reading up again a bit: for MLE, the Fisher information matrix is the variance of the score, Var(score) = E(-hessian). So sf / hf standardizes the score by the hessian in a one-parameter model (without exog). In a 2-parameter model, standardizing sf1 only needs hf11 (no matrix inverse). However, according to this, I think we should standardize sf1 by sqrt(hf11), where the above needs to use the negative hessian, -hf.

Update: so this is the analogue of the constant-distribution-parameter case, except that we take derivatives w.r.t. the linear predictors. I found a related article that uses score_obs directly (including the exog part): Kapitula, Laura Ring, and Edward J. Bedrick. 2005. "Diagnostics for the Exponential Normal Growth Curve Model." Statistics in Medicine 24 (1): 95–108. https://doi.org/10.1002/sim.1919.

Aside: one question then is which "score residuals" I should include, and how. MLEInfluence.resid_studentized is a cached attribute, so we cannot add options to it.
One more: I just realized I'm using sf_i / hf_i, i.e. dividing by the hessian_factor of an individual observation. Kapitula and Bedrick use the information matrix/hessian of the full sample, where I is the negative hessian (second derivative) at the MLE. Cook's distance uses cov_params, i.e. also the hessian at the MLE, although it might be a robust version and not the hessian.
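The generic MLE Cook's distance built from parameter changes and cov_params is the quadratic form D_i = d_i' V^{-1} d_i / k with V = cov_params at the full-sample MLE. A minimal sketch, with hypothetical d_params and cov_params values standing in for the real estimates:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 3
d_params = rng.normal(size=(10, k)) * 0.1  # hypothetical LOO parameter changes
cov_params = np.diag([0.5, 0.2, 0.1])      # hypothetical cov_params at the MLE

# D_i = d_i' cov_params^{-1} d_i / k, computed row-wise without forming the inverse
cooks_d = (d_params * np.linalg.solve(cov_params, d_params.T).T).sum(axis=1) / k
```

Swapping a robust cov_params into V is the same computation, which is where the "robust version and not hessian" question enters.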
One problem still is that there is no citable reference for what I'm doing.

Another aside (back to this after a detour to downloading and skimming articles): GPP hf[0] has around 5 to 6% of observations with the wrong sign (positive) in the docvis example. The analytic hessian agrees with the numdiff hessian; however, results look good when using the code of check_jac to compare hessians. The GPP test classes do not subclass any other class in the tests. Note that the default in GPP is p=1; I'm getting about the same fraction of hf with the wrong sign with p=2.
I don't see any bug or problem in GPP hessian_factor yet.
Update: the loglike also looks globally concave as a function of the mean mu, based on a plot and the numdiff hessian at the same point. In the docvis data, the second-to-last observation has a positive hessian_factor (warning in the unit test).
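Checking an analytic hessian against statsmodels' numdiff is cheap. A minimal sketch of the pattern, on a toy globally concave log-likelihood (a hypothetical stand-in, not the GPP loglike):

```python
import numpy as np
from statsmodels.tools.numdiff import approx_hess

def loglike(params):
    # toy globally concave log-likelihood; its analytic hessian is -I
    return -0.5 * np.sum((params - np.array([1.0, 2.0])) ** 2)

params0 = np.array([0.5, 0.5])
h_num = approx_hess(params0, loglike)

# numerical hessian should match the analytic one and be negative definite
assert np.allclose(h_num, -np.eye(2), atol=1e-5)
assert np.all(np.linalg.eigvalsh(h_num) < 0)
```

The eigenvalue check is the global-concavity analogue of looking for wrong-sign hessian_factor entries per observation.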
Also, we need a separate results class for NBP. I'm giving up for now on investigating the GPP hessian; that takes more work and should be a separate issue.

An aside I want to add: the ZI models inherit aic, bic from DiscreteResults, AFAICS. Are those correct?
Merging this as is; all green. I might switch to using resid_pearson as the default for resid_studentized, for GPP or in general.
It works, but I have not looked at the results numbers yet, only a smoke test.
Uses numdiff for _deriv_score_obs_dendog, see #7891. _deriv_score_obs_dendog takes the derivative w.r.t. endog; this might be a problem in discrete models.
The count models NBP and GPP should be OK.
Probit will likely be a problem, and we cannot use complex-step derivatives.
I'm not sure whether the definitions used in MLEInfluence apply to Probit.
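For the count-model case, the derivative of a score factor w.r.t. endog can be spot-checked numerically with real-step (not complex-step) differences. A minimal sketch with approx_fprime on a toy Poisson-style score factor, s_i = y_i - mu_i, using hypothetical mu values:

```python
import numpy as np
from statsmodels.tools.numdiff import approx_fprime

mu = np.array([1.5, 2.0, 0.7])  # hypothetical fitted means

def score_factor(endog):
    # toy Poisson-style score factor w.r.t. the linear predictor
    return endog - mu

endog0 = np.array([1.0, 3.0, 1.0])
jac = approx_fprime(endog0, score_factor)

# d s_i / d y_j = 1 if i == j else 0, so the Jacobian is the identity
assert np.allclose(jac, np.eye(3), atol=1e-6)
```

This uses forward differences in the endog values themselves, which is why a real-valued (rather than complex-step) scheme is the relevant one here.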
Standardized score residuals use sf[0] / hf[0]. Does this work in general?
Do we need the same for the second score_factor?
Note: _deriv_mean_dparams needs to be w.r.t. the full params even if the mean only depends on the mean params (for a consistent shape with the hessian).

R betareg has some of the influence/outlier measures to test against.