<a href="https://colab.research.google.com/github/BI-DS/EBA-3530/blob/main/Lecture_5/multicollinearity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import statsmodels.api as sm

# Multicollinearity
Let's simulate some data to see the effect of multicollinearity and how we can fix a misspecified design matrix.

In [2]:
# Number of observations for each category
Nk = 30

# Number of categories
K = 3

# Total number of observations
N = Nk*K

# Construct the design matrix x with dummy variables (zero-one vectors).
x = np.zeros((N, K + 1)) # N observations in total; K category dummies plus an intercept
x[:,0] = np.ones(N) # Add the intercept

# Loop in the dummies
xo = np.ones(Nk)
cnt = 0

# the first column is the intercept in the regression
# the second column a dummy variable for the first category, etc.
for i in range(K):
  x[cnt:Nk+cnt,i+1] = xo
  cnt = cnt + Nk
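
As a cross-check on the loop above, the same design matrix can be built without a loop. This is only a sketch of an alternative construction; the names `x_loop`, `x_vec` and `dummies` are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np

Nk, K = 30, 3          # observations per category, number of categories
N = Nk * K

# Loop-based construction, mirroring the cell above
x_loop = np.zeros((N, K + 1))
x_loop[:, 0] = 1.0                              # intercept column
for i in range(K):
    x_loop[i * Nk:(i + 1) * Nk, i + 1] = 1.0    # dummy for category i + 1

# Vectorized alternative: np.kron repeats each row of the identity Nk times
dummies = np.kron(np.eye(K), np.ones((Nk, 1)))  # shape (N, K)
x_vec = np.hstack([np.ones((N, 1)), dummies])   # prepend the intercept column

print(np.array_equal(x_loop, x_vec))
```

Both versions give a 90 x 4 matrix whose first column is the intercept and whose remaining columns are the category dummies.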

What is the problem with x?

In [12]:
print(np.sum(x[:,1:],axis=1))

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
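
The row sums above hint at the problem: the dummies add up to the intercept column. One way to confirm this numerically, sketched here under the assumption that `x` is built exactly as in the cell above, is to compare the number of columns with the matrix rank.

```python
import numpy as np

Nk, K = 30, 3
N = Nk * K

# Rebuild the design matrix: intercept plus one dummy per category
x = np.hstack([np.ones((N, 1)), np.kron(np.eye(K), np.ones((Nk, 1)))])

n_cols = x.shape[1]
rank = np.linalg.matrix_rank(x)

# The dummy columns sum, row by row, to the intercept column
dummies_equal_intercept = np.allclose(x[:, 1:].sum(axis=1), x[:, 0])

print(n_cols, rank, dummies_equal_intercept)
```

The matrix has 4 columns but rank 3, so one column is redundant.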


In [3]:
# Let y be the intercept plus a category effect plus some random noise, AND let
# the true coefficient on the dummy for the third category be zero. This means
# that the third category has the same mean as the intercept itself.
beta = np.array([10, 5, -5, 0])

# y = x'*beta will give
# y_1 = 10 + 5  = 15 for category 1
# y_2 = 10 - 5  = 5  for category 2
# y_3 = 10      = 10 for category 3
# But we also add some noise
y = np.dot(x,beta) + np.random.normal(size=(N))

In [None]:
model = sm.OLS(y,x).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.951
Model:                            OLS   Adj. R-squared:                  0.950
Method:                 Least Squares   F-statistic:                     847.1
Date:                Wed, 22 Feb 2023   Prob (F-statistic):           9.19e-58
Time:                        12:26:13   Log-Likelihood:                -119.42
No. Observations:                  90   AIC:                             244.8
Df Residuals:                      87   BIC:                             252.3
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.4747      0.073    101.924      0.0

**See the warning message!** The regression is rank deficient, meaning that the columns of x are linearly dependent. In particular, we know from above that `np.sum(x[:,1:],1)=1`, i.e., the dummy columns sum to 1 in every row, which is exactly `x[:,0]` (the intercept).
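
To see why rank deficiency matters, here is a small sketch showing that two different coefficient vectors produce identical fitted values, so the data cannot tell them apart. The shift direction `null_dir` and the constant `c` are illustrative choices, not part of the original notebook.

```python
import numpy as np

Nk, K = 30, 3
N = Nk * K
x = np.hstack([np.ones((N, 1)), np.kron(np.eye(K), np.ones((Nk, 1)))])

beta = np.array([10.0, 5.0, -5.0, 0.0])     # true coefficients from the notebook

# Lowering the intercept by c and raising every dummy by c leaves x @ beta unchanged
null_dir = np.array([-1.0, 1.0, 1.0, 1.0])
c = 3.0                                      # arbitrary shift, chosen for illustration
beta_alt = beta + c * null_dir               # [7, 8, -2, 3]

same_fit = np.allclose(x @ beta, x @ beta_alt)
print(beta_alt, same_fit)
```

Any value of `c` gives the same fit, which is why the individual coefficients are not identified until one column is dropped or a penalty is added.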


**Dropping the 1st category**

In [6]:
model = sm.OLS(y,x[:,[0,2,3]]).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.945
Model:                            OLS   Adj. R-squared:                  0.944
Method:                 Least Squares   F-statistic:                     748.6
Date:                Wed, 13 Mar 2024   Prob (F-statistic):           1.50e-55
Time:                        09:35:28   Log-Likelihood:                -130.18
No. Observations:                  90   AIC:                             266.4
Df Residuals:                      87   BIC:                             273.9
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         15.0089      0.191     78.629      0.0

Remember the true beta is `[10 5 -5 0]` for the intercept, cat1, cat2, and cat3, respectively. With cat1 dropped, the intercept now estimates the cat1 mean, `beta[0] + beta[1] = 15`, and the estimated parameters for cat2 and cat3 are relative to cat1, i.e. `beta[2] - beta[1] = -10` and `beta[3] - beta[1] = -5`. **Just as the estimated coefficients above!**

Remember that
* $y_1 = 10 + 5 \rightarrow 15$
* $y_2 = 10 -5 \rightarrow 5$
* $y_3 = 10  \rightarrow 10$
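
The arithmetic can be written out as a short sketch: starting from the true `beta`, compute what the regression without the cat1 dummy should recover, and the category means those coefficients imply. The names `beta_ref` and `means` are introduced here only for illustration.

```python
import numpy as np

beta = np.array([10, 5, -5, 0])   # intercept, cat1, cat2, cat3

# With cat1 as the reference category:
#   new intercept = cat1 mean, other coefficients = differences from cat1
beta_ref = np.array([
    beta[0] + beta[1],   # intercept: 15
    beta[2] - beta[1],   # cat2 relative to cat1: -10
    beta[3] - beta[1],   # cat3 relative to cat1: -5
])

# Category means implied by the reference-coded coefficients
means = np.array([beta_ref[0], beta_ref[0] + beta_ref[1], beta_ref[0] + beta_ref[2]])

print(beta_ref, means)
```

The implied means are 15, 5 and 10, matching the list above.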

Let's double check...

In [7]:
print('Estimate category 1 is {:.1f}'.format(model.params[0]))
print('Estimate category 2 is {:.1f}'.format(model.params[0]+model.params[1]))
print('Estimate category 3 is {:.1f}'.format(model.params[0]+model.params[2]))

Estimate category 1 is 15.0
Estimate category 2 is 4.6
Estimate category 3 is 9.9


Now try a **lasso** model

In [8]:
from sklearn.linear_model import LassoCV

# Fit LassoCV with 5-fold cross-validation. x already contains an intercept
# column, so we set fit_intercept=False.
Lasso = LassoCV(cv=5, fit_intercept=False).fit(x,y.ravel())

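Print lasso coefficients ... Since `fit_intercept=False`, `Lasso.intercept_` is 0 and the intercept is the first entry of `Lasso.coef_`.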

In [9]:
print('Lasso estimated intercept is {:.1f}'.format(Lasso.intercept_))
print('Lasso estimated coefficients are {}'.format(np.round(Lasso.coef_,1)))

Lasso estimated intercept is 0.0
Lasso estimated coefficients are [ 9.8  5.1 -5.2  0. ]
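
As with the OLS fit, we can double check the category means implied by the lasso coefficients. The sketch below hard-codes the rounded coefficients printed above, so the numbers are only as precise as that output; the names `lasso_coef` and `lasso_means` are illustrative.

```python
import numpy as np

# Rounded coefficients copied from the printed output above:
# intercept column, cat1, cat2, cat3
lasso_coef = np.array([9.8, 5.1, -5.2, 0.0])

# Each category mean is the intercept coefficient plus that category's dummy coefficient
lasso_means = np.round(lasso_coef[0] + lasso_coef[1:], 1)

for k, m in enumerate(lasso_means, start=1):
    print('Lasso estimate category {} is {:.1f}'.format(k, m))
```

The implied means (about 14.9, 4.6 and 9.8) are close to the true values 15, 5 and 10, even though all four columns of x were kept.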
