SUMM/ENH/FAQ avoiding data problems from ill conditioning and tools #6381

Open · josef-pkt opened this issue Jan 4, 2020 · 0 comments

FAQ question: "I didn't realize that my original data is ill conditioned; it is badly scaled and has high multicollinearity. What can I do?"

Current context: multicollinearity diagnostics for model.exog, which might be badly parameterized.
I.e. we don't want "theoretical" diagnostics that use some transformed data; we want to know what the problems with our actual exog are. #2380

Example: VIF based on the correlation matrix ignores multicollinearity with, or problems involving, the constant. The OLS condition number is based on exog and is sensitive to scaling.
Belsley (1980) has an example where the constant is almost collinear with an exog column that has a large mean and small variance.
The NIST test cases have a similar example with a small coefficient of variation, and another example with badly scaled polynomials.
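A small numpy sketch of that failure mode (a constructed example in the spirit of Belsley's, not taken from Belsley or NIST): the raw condition number flags the near-collinearity of a large-mean, small-variance column with the constant, while correlation-based VIFs, which implicitly demean the data, cannot see it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Constructed example: x has a large mean and small variance, so it is
# nearly collinear with the constant column.
x = 1000.0 + 0.01 * rng.standard_normal(n)
z = rng.standard_normal(n)
exog = np.column_stack([np.ones(n), x, z])

# Condition number of the raw exog: very large, driven by the scaling
# and the near-collinearity of x with the constant.
cond_raw = np.linalg.cond(exog)

# VIFs from the correlation matrix of the non-constant columns:
# demeaning removes exactly the collinearity with the constant,
# so both VIFs come out close to 1 and the problem is invisible.
corr = np.corrcoef(np.column_stack([x, z]), rowvar=False)
vif = np.diag(np.linalg.inv(corr))
```

Here `cond_raw` is on the order of 1e5, while both VIFs are essentially 1.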

We want to alert users to those problems or make it easy for them to check, e.g. #1908 for adding sanity checks.

However, Belsley argues that fundamental ill conditioning cannot be removed by transformation: if a transformation removes the ill conditioning, then the transformation itself is ill conditioned, i.e. we just shift the ill conditioning from the model to the transformation.

Second: numerical problems in optimization, for example Poisson with large values in exog.
#1131
#1715
#3925
#1699 #4577 StandardizeTransform
fit_transformed ?

#2062

Note: in nonlinear models like GLM and discrete, X could be reasonably well behaved, but the nonlinear transform creates numerical problems, e.g. exp underflow or overflow.
In this case, transforming exog improves convergence of the optimizer and doesn't just shift the ill conditioning.
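A minimal numpy illustration of the Poisson-type overflow (hypothetical numbers; `StandardizeTransform` itself is not invoked here, only the rescaling idea it implements): with exog values on the order of 1e4, evaluating a mean function mu = exp(X @ beta) overflows for quite moderate parameter values, while the same evaluation on standardized exog stays finite.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# Hypothetical data: one exog column on the scale of 1e4, as in a
# Poisson model where the mean is mu = exp(X @ beta).
x = 1e4 * rng.random(n)
exog = np.column_stack([np.ones(n), x])

beta = np.array([0.0, 0.1])   # a moderate slope on the raw scale
with np.errstate(over="ignore"):
    mu_raw = np.exp(exog @ beta)          # overflows to inf for large x

# Standardizing exog keeps the linear predictor in a safe range; the
# fitted parameters can be mapped back to the original scale afterwards.
x_std = (x - x.mean()) / x.std()
exog_std = np.column_stack([np.ones(n), x_std])
mu_std = np.exp(exog_std @ beta)          # finite everywhere
```

Any gradient-based optimizer that touches `mu_raw` sees inf/nan in the objective, which is the non-convergence reported in the issues above.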

Current plans:

- functions to identify multicollinearity: condition index, VIF and similar
- a quick summary including bad scaling, e.g. min/max and coefficient of variation of the data
- multicollinearity measures both on the data/exog and on the model and results: score_obs, hessian and cov_params
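A sketch of what such a quick scaling summary could report (the helper name `scaling_summary` is made up for illustration, not a statsmodels API):

```python
import numpy as np

def scaling_summary(exog):
    """Per-column scaling diagnostics: min, max, coefficient of variation.
    (Hypothetical helper, not a statsmodels API.)  A coefficient of
    variation near zero flags a column that is nearly collinear with
    the constant."""
    exog = np.asarray(exog, dtype=float)
    mean = exog.mean(axis=0)
    std = exog.std(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        cv = np.where(mean != 0, std / np.abs(mean), np.inf)
    return {"min": exog.min(axis=0), "max": exog.max(axis=0), "cv": cv}

rng = np.random.default_rng(2)
summ = scaling_summary(np.column_stack([
    np.ones(100),                               # constant: cv == 0
    1000.0 + 0.01 * rng.standard_normal(100),   # tiny cv: suspicious
    rng.standard_normal(100),                   # well scaled
]))
```

The second column is exactly the Belsley/NIST pattern: its tiny coefficient of variation is the cheap warning sign that the condition number would confirm.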

Later:

- look into fit_transformed for GLM and discrete again
- alternative estimators, e.g. ridge, penalized, Firth

Note: Firth depends on properties of the data including endog.
As with perfect separation and (almost) empty cells, there is the additional, separate issue of whether there is enough (or too much) information about the relationship between exog and endog.

Related: perfectly collinear variables or unidentified parameters. What can we still infer?
e.g. is_estimable #6271
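A numpy sketch of the is_estimable idea under perfect collinearity (illustrative only, not the statsmodels implementation): with x2 = 2 * x1 the individual slopes are not identified, but a contrast that lies in the row space of exog still is, and the minimum-norm (pinv) solution recovers it.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.standard_normal(n)
x2 = 2.0 * x1                       # perfectly collinear with x1
exog = np.column_stack([np.ones(n), x1, x2])
endog = exog @ np.array([1.0, 0.5, 0.25]) + 0.01 * rng.standard_normal(n)

# Individual slopes are not identified: rank < number of columns.
rank = np.linalg.matrix_rank(exog)          # 2, not 3

# is_estimable-style check (sketch): a contrast c is estimable iff it
# lies in the row space of exog, i.e. projecting it onto that row
# space leaves it unchanged.
c = np.array([0.0, 1.0, 2.0])               # total effect b1 + 2*b2
proj = np.linalg.pinv(exog) @ exog          # projector onto row space
estimable = np.allclose(c @ proj, c)

# The pinv solution still recovers estimable contrasts.
beta_pinv = np.linalg.pinv(exog) @ endog
total_effect = c @ beta_pinv                # near 0.5 + 2 * 0.25 = 1.0
```

So even without dropping columns we can still answer "what is the combined effect of x1?", while each slope on its own is arbitrary.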
