In [None]:
import numpy as np
import statsmodels.api as sm

# Multicollinearity
Let's simulate some data to see the effect of multicollinearity and how we can fix a wrongly specified design matrix.

In [7]:
# Number of observations for each category
Nk = 30

# Number of categories
K = 3

# Total number of observations
N = Nk*K

# Construct the design matrix x with dummy variables (zero-one vectors).
x = np.zeros((N, K + 1)) # N observations in total; K categories plus an intercept
x[:,0] = np.ones(N) # Add the intercept

# Loop in the dummies
xo = np.ones(Nk)
cnt = 0

# the first column is the intercept in the regression
# the second column a dummy variable for the first category, etc. 
for i in range(K):
  x[cnt:Nk+cnt,i+1] = xo
  cnt = cnt + Nk

In [16]:
# Let y be a function of the intercept, the dummies, and some random noise. Also, let the
# true coefficient on the dummy for the third category be zero. This means that the third
# category has the same mean as the intercept itself.

beta = np.array([10, 5, -5, 0])
# y = x'*beta will give
# y_1 = 10 + 5  = 15 for category 1
# y_2 = 10 - 5 = 5   for category 2
# y_3 = 10     = 10  for category 3
# But we also add some noise
y = np.dot(x,beta) + np.random.normal(size=(N))

In [None]:
model = sm.OLS(y,x).fit()
print(model.summary())

**See the warning message!** The regression is rank deficient,
meaning that the columns of x are linearly dependent. In particular, we know from above that `np.sum(x[:,1:],1)=1`, i.e., the dummy columns sum to 1 in every row, which is the same as `x[:,0]` (the intercept).
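
One way to confirm the rank deficiency numerically is to compare the rank of the design matrix with its number of columns. The sketch below rebuilds the same design matrix (using `np.repeat` and `np.eye` as a shortcut, which is an assumption of this example rather than the loop used above; `x_check` is a name introduced here) and checks the rank with `np.linalg.matrix_rank`.

```python
import numpy as np

Nk, K = 30, 3            # observations per category, number of categories
N = Nk * K

# Intercept column followed by one dummy column per category
dummies = np.repeat(np.eye(K), Nk, axis=0)         # shape (N, K)
x_check = np.column_stack([np.ones(N), dummies])   # shape (N, K + 1)

# A full-rank design would have rank equal to the number of columns
rank = np.linalg.matrix_rank(x_check)
print('columns:', x_check.shape[1], 'rank:', rank)

# The dummies add up to the intercept column in every row
print(np.allclose(x_check[:, 1:].sum(axis=1), x_check[:, 0]))
```

The rank is one less than the number of columns, so one column has to go (or the intercept has to be dropped) before the coefficients are uniquely identified.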


**Dropping the 1st category**

In [None]:
model = sm.OLS(y,x[:,[0,2,3]]).fit()
print(model.summary())

Remember the true beta is `[10 5 -5 0]` for intercept, cat1, cat2, and cat3. With cat1 dropped, the intercept now estimates the cat1 mean, beta[0] + beta[1] = 15, and the estimated parameters for cat2 and cat3 are relative to cat1, i.e. beta[2] - beta[1] = -10 and beta[3] - beta[1] = -5. **Just as the estimated coefficients above!**
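
To see this reparameterization without noise getting in the way, a minimal sketch is to fit the reduced design on the noise-free means. The names `x_full`, `x_reduced` and `y_clean` are introduced here for illustration only, and `np.linalg.lstsq` stands in for `sm.OLS`.

```python
import numpy as np

Nk, K = 30, 3
N = Nk * K
beta = np.array([10, 5, -5, 0])

# Same layout as above: intercept, then one dummy per category
x_full = np.column_stack([np.ones(N), np.repeat(np.eye(K), Nk, axis=0)])
y_clean = x_full @ beta               # noise-free means: 15, 5, 10 per category

x_reduced = x_full[:, [0, 2, 3]]      # drop the cat1 dummy
coef, *_ = np.linalg.lstsq(x_reduced, y_clean, rcond=None)

# Up to floating-point error: intercept 15, cat2 -10, cat3 -5
print(np.round(coef, 6))
```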

We know that $y = x'\beta$, hence
* $y_1 = 10 + 5 \rightarrow 15$
* $y_2 = 10 -5 \rightarrow 5$
* $y_3 = 10  \rightarrow 10$

Let's double check...

In [40]:
print('Estimate category 1 is {:.1f}'.format(model.params[0]))
print('Estimate category 2 is {:.1f}'.format(model.params[0]+model.params[1]))
print('Estimate category 3 is {:.1f}'.format(model.params[0]+model.params[2]))

Estimate category 1 is 15.2
Estimate category 2 is 4.7
Estimate category 3 is 9.9


Now try a **lasso** model

In [41]:
from sklearn.linear_model import LassoCV

# Fit LassoCV with 5-fold cross-validation on the full x (all dummies included)
Lasso = LassoCV(cv=5, fit_intercept=True).fit(x,y.ravel())

Print lasso coefficients ...

In [46]:
print('Lasso estimated intercept is {:.1f}'.format(Lasso.intercept_))
print('Lasso estimated coefficients are {}'.format(np.round(Lasso.coef_[1:],1)))

Lasso estimated intercept is 9.9
Lasso estimated coefficients are [ 5.2 -5.1  0. ]
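
Note that `x` still contains the column of ones while `fit_intercept=True` is set. Since scikit-learn centers the features before fitting when it estimates an intercept, a constant column becomes all zeros and should receive a zero coefficient, which is why only `coef_[1:]` is of interest. A minimal sketch on freshly simulated data follows; the seed and the names `x_sim`, `y_sim` and `lasso_sim` are assumptions of this example, so the numbers will differ from those printed above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)        # seed chosen arbitrarily for this sketch
Nk, K = 30, 3
N = Nk * K
beta = np.array([10, 5, -5, 0])

# Rank-deficient design: intercept plus a dummy for every category
x_sim = np.column_stack([np.ones(N), np.repeat(np.eye(K), Nk, axis=0)])
y_sim = x_sim @ beta + rng.normal(size=N)

lasso_sim = LassoCV(cv=5, fit_intercept=True).fit(x_sim, y_sim)
print('intercept:', round(float(lasso_sim.intercept_), 1))
print('coefficients:', np.round(lasso_sim.coef_, 1))
print('coefficient on the constant column:', lasso_sim.coef_[0])
```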
