SUMM/ENH/FAQ avoiding data problems from ill conditioning and tools #6381

Open · josef-pkt opened this issue Jan 4, 2020 · 0 comments

FAQ question: "I didn't realize that my original data is ill conditioned; it is badly scaled and has high multicollinearity. What can I do?"

Current context: multicollinearity diagnostics for model.exog, which might be badly parameterized.
I.e. we don't want "theoretical" diagnostics that use some transformed data; we want to know what the problems with our actual exog are. #2380

Example: VIF based on the correlation matrix ignores multicollinearity with, or problems involving, the constant. The OLS condition number is based on exog and is sensitive to scaling.
Belsley (1980) has an example where the constant is almost collinear with an exog column that has a large mean and small variance.
The NIST test cases have a similar example with a small coefficient of variation, and another example with badly scaled polynomials.
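A small numpy sketch of that failure mode (a constructed example in the spirit of Belsley's, not taken from Belsley or NIST): the raw condition number flags the near-collinearity of a large-mean, small-variance column with the constant, while correlation-based VIFs, which implicitly demean the data, cannot see it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Constructed example: x has a large mean and small variance, so it is
# nearly collinear with the constant column.
x = 1000.0 + 0.01 * rng.standard_normal(n)
z = rng.standard_normal(n)
exog = np.column_stack([np.ones(n), x, z])

# Condition number of the raw exog: very large, driven by the scaling
# and the near-collinearity of x with the constant.
cond_raw = np.linalg.cond(exog)

# VIFs from the correlation matrix of the non-constant columns:
# demeaning removes exactly the collinearity with the constant,
# so both VIFs come out close to 1 and the problem is invisible.
corr = np.corrcoef(np.column_stack([x, z]), rowvar=False)
vif = np.diag(np.linalg.inv(corr))
```

Here `cond_raw` is on the order of 1e5, while both VIFs are essentially 1.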

We want to alert users to those problems or make it easy for them to check, e.g. #1908 for adding sanity checks.

However, Belsley argues that fundamental ill conditioning cannot be removed by transformation: if a transformation removes the ill conditioning, then the transformation itself is ill conditioned, i.e. we just shift the ill conditioning from the model to the transformation.

Second: numerical problems in optimization, for example Poisson with large values in exog.
#1131
#1715
#3925
#1699 #4577 StandardizeTransform
fit_transformed ?

#2062

Note: in nonlinear models like GLM and discrete, X could be reasonably well behaved, but the nonlinear transform creates numerical problems, e.g. exp underflow or overflow.
In this case, transforming exog improves convergence of the optimizer and doesn't just shift the ill conditioning.
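A minimal numpy illustration of the Poisson-type overflow (hypothetical numbers; `StandardizeTransform` itself is not invoked here, only the rescaling idea it implements): with exog values on the order of 1e4, evaluating a mean function mu = exp(X @ beta) overflows for quite moderate parameter values, while the same evaluation on standardized exog stays finite.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# Hypothetical data: one exog column on the scale of 1e4, as in a
# Poisson model where the mean is mu = exp(X @ beta).
x = 1e4 * rng.random(n)
exog = np.column_stack([np.ones(n), x])

beta = np.array([0.0, 0.1])   # a moderate slope on the raw scale
with np.errstate(over="ignore"):
    mu_raw = np.exp(exog @ beta)          # overflows to inf for large x

# Standardizing exog keeps the linear predictor in a safe range; the
# fitted parameters can be mapped back to the original scale afterwards.
x_std = (x - x.mean()) / x.std()
exog_std = np.column_stack([np.ones(n), x_std])
mu_std = np.exp(exog_std @ beta)          # finite everywhere
```

Any gradient-based optimizer that touches `mu_raw` sees inf/nan in the objective, which is the non-convergence reported in the issues above.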

Current plans:

- functions to identify multicollinearity: condition index, VIF and similar
- a quick summary including bad scaling, e.g. min/max and coefficient of variation of the data
- multicollinearity measures both on the data/exog and on the model and results: score_obs, hessian and cov_params
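A sketch of what such a quick scaling summary could report (the helper name `scaling_summary` is made up for illustration, not a statsmodels API):

```python
import numpy as np

def scaling_summary(exog):
    """Per-column scaling diagnostics: min, max, coefficient of variation.
    (Hypothetical helper, not a statsmodels API.)  A coefficient of
    variation near zero flags a column that is nearly collinear with
    the constant."""
    exog = np.asarray(exog, dtype=float)
    mean = exog.mean(axis=0)
    std = exog.std(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        cv = np.where(mean != 0, std / np.abs(mean), np.inf)
    return {"min": exog.min(axis=0), "max": exog.max(axis=0), "cv": cv}

rng = np.random.default_rng(2)
summ = scaling_summary(np.column_stack([
    np.ones(100),                               # constant: cv == 0
    1000.0 + 0.01 * rng.standard_normal(100),   # tiny cv: suspicious
    rng.standard_normal(100),                   # well scaled
]))
```

The second column is exactly the Belsley/NIST pattern: its tiny coefficient of variation is the cheap warning sign that the condition number would confirm.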

Later:

- look into fit_transformed for GLM and discrete again
- alternative estimators, e.g. ridge, penalized, Firth

Note: Firth depends on properties of the data including endog.
As with perfect separation and (almost) empty cells, there is the additional, separate issue of whether there is enough (or too much) information about the relationship between exog and endog.

Related: perfectly collinear variables or unidentified parameters. What can we still infer?
e.g. is_estimable #6271
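A numpy sketch of the is_estimable idea under perfect collinearity (illustrative only, not the statsmodels implementation): with x2 = 2 * x1 the individual slopes are not identified, but a contrast that lies in the row space of exog still is, and the minimum-norm (pinv) solution recovers it.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.standard_normal(n)
x2 = 2.0 * x1                       # perfectly collinear with x1
exog = np.column_stack([np.ones(n), x1, x2])
endog = exog @ np.array([1.0, 0.5, 0.25]) + 0.01 * rng.standard_normal(n)

# Individual slopes are not identified: rank < number of columns.
rank = np.linalg.matrix_rank(exog)          # 2, not 3

# is_estimable-style check (sketch): a contrast c is estimable iff it
# lies in the row space of exog, i.e. projecting it onto that row
# space leaves it unchanged.
c = np.array([0.0, 1.0, 2.0])               # total effect b1 + 2*b2
proj = np.linalg.pinv(exog) @ exog          # projector onto row space
estimable = np.allclose(c @ proj, c)

# The pinv solution still recovers estimable contrasts.
beta_pinv = np.linalg.pinv(exog) @ endog
total_effect = c @ beta_pinv                # near 0.5 + 2 * 0.25 = 1.0
```

So even without dropping columns we can still answer "what is the combined effect of x1?", while each slope on its own is arbitrary.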
