#**Ridge Regression (Tikhonov regularization)**

In the previous topics we mentioned that it is unwise to include variables in a model that exhibit multicollinearity. Multicollinearity can produce inaccurate estimates of the regression coefficients, inflate their standard errors, deflate the partial t-tests for the regression coefficients, give false, nonsignificant p-values, and degrade the predictability of the model (and that’s just for starters).

There are five sources of multicollinearity (see Montgomery [1982] for details):

1. Data collection. In this case, the data have been collected from a narrow subspace of the independent variables. The multicollinearity has been created by the sampling methodology; it does not exist in the population. Obtaining more data on an expanded range would cure this multicollinearity problem. The extreme example of this is when you try to fit a line to a single point.

2. Physical constraints of the linear model or population. This source of multicollinearity will exist no matter what sampling technique is used. Many manufacturing or service processes have constraints on independent variables (as to their range), either physically, politically, or legally, which will create multicollinearity.

3. Over-defined model. Here, there are more variables than observations. This situation should be avoided. NCSS Statistical Software NCSS.com Ridge Regression 335-2 © NCSS, LLC. All Rights Reserved.

4. Model choice or specification. This source of multicollinearity comes from using independent variables that are powers or interactions of an original set of variables. It should be noted that if the sampling subspace of independent variables is narrow, then any combination of those variables will increase the multicollinearity problem even further.

5. Outliers. Extreme values or outliers in the X-space can cause multicollinearity as well as hide it. We call this outlier-induced multicollinearity. This should be corrected by removing the outliers before ridge regression is applied.


The concept behind ridge regression, also known as L2 regularization, is to adjust the estimates that you would normally get from ordinary least squares regression to give new estimates that carry a small amount of bias but have lower variance. This deals with the inflated variance that a high VIF signals and reduces overfitting. It doesn't get rid of attributes, but it can point you to those which are less significant. In ridge regression, the cost function is altered by adding a penalty proportional to the sum of the squared coefficients, which helps us identify and deal with overfitting.

In ordinary least squares regression we minimise the following cost function:

>>$\min_{\beta}(y-X\beta)^T(y-X\beta)$ which gives

>>$\hat{\beta}_{OLS}={(X^TX)}^{-1}X^Ty$

In ridge regression we have the following:

>>$\min_{\beta}(y-X\beta)^T(y-X\beta) +\lambda(\beta^T\beta-c)$ which gives

>>$\hat{\beta}_R={(X^TX+{\lambda}I)}^{-1}X^Ty$

This is equivalent to minimising the ordinary least squares cost function subject to the condition below:

>>For some $c>0$: $\beta^T\beta=\sum_{j=0}^p \beta_j^2\le c$
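As a rough numerical check of the closed-form ridge estimate above, the sketch below evaluates $(X^TX+\lambda I)^{-1}X^Ty$ directly with NumPy. It assumes the columns of X and y are centred first so that the intercept is left unpenalised (a common convention, not something stated above). The helper name `ridge_closed_form` and the small artificial dataset with one nearly collinear column are made up for this illustration.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Hypothetical helper: solve (Xc'Xc + lam*I) beta = Xc'yc on centred data."""
    Xc = X - X.mean(axis=0)   # centre each column so the intercept is not penalised
    yc = y - y.mean()
    p = Xc.shape[1]
    # Solve the linear system instead of forming the inverse explicitly
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# Illustrative data: the fifth column is almost a linear combination of columns 3 and 4
rng = np.random.RandomState(0)
y = rng.randn(10)
X = rng.randn(10, 5)
X[:, 4] = 2.5 * X[:, 2] + 2.2 * X[:, 3] + X[:, 4] / 100

beta_ols = ridge_closed_form(X, y, 0.0)    # lam = 0 reduces to ordinary least squares
beta_ridge = ridge_closed_form(X, y, 0.2)

print("OLS coefficients  :", beta_ols)
print("ridge coefficients:", beta_ridge)
print("squared norms     :", beta_ols @ beta_ols, beta_ridge @ beta_ridge)
```

Setting `lam` to zero recovers the least squares solution, and any positive `lam` gives a coefficient vector with a smaller squared norm, which is the constraint form of the problem seen from the other side.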

Choosing a value for $\lambda$ (often written $k$ in the ridge literature) is not a simple task, which is perhaps one major reason why ridge regression isn’t used as much as least squares or logistic regression. You can read one way to find it in Dorugade and D. N. Kashid’s paper *Alternative Method for Choosing Ridge Parameter for Regression*. Some of the literature recommends that the $\lambda$ value be kept under 0.3.
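One widely used, data-driven way to pick $\lambda$ is cross-validation. The sketch below is not the method from the paper mentioned above; it is a minimal leave-one-out search over a hand-picked grid, assuming a centred ridge fit with an unpenalised intercept. The function names `ridge_fit` and `loo_cv_error`, the grid of values and the artificial data are all assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Centred ridge fit; returns (intercept, coefficients)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta

def loo_cv_error(X, y, lam):
    """Mean squared leave-one-out prediction error for a given lam."""
    n = len(y)
    errors = []
    for i in range(n):
        keep = np.arange(n) != i                      # hold out observation i
        b0, beta = ridge_fit(X[keep], y[keep], lam)
        errors.append((y[i] - (b0 + X[i] @ beta)) ** 2)
    return np.mean(errors)

# Illustrative data with one nearly collinear column
rng = np.random.RandomState(0)
y = rng.randn(10)
X = rng.randn(10, 5)
X[:, 4] = 2.5 * X[:, 2] + 2.2 * X[:, 3] + X[:, 4] / 100

lambdas = [0.01, 0.05, 0.1, 0.2, 0.3, 1.0, 10.0]      # illustrative grid only
cv_errors = [loo_cv_error(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(cv_errors))]

for lam, err in zip(lambdas, cv_errors):
    print(f"lambda={lam:<5} LOO MSE={err:.4f}")
print("lambda with the smallest LOO error:", best_lam)
```

With only ten observations the leave-one-out errors are noisy, so treat the selected value as a guide rather than a firm answer.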








Now let's complete a ridge regression in Python. Directly below I have created an artificial dataset which has a large amount of multicollinearity. I have also printed it out in the code below so that it relates to my comments. Only run the generator if you want to experiment.
I have also written a function to calculate the standard error of the parameter estimates. Statsmodels does this automatically, but that is not available for ridge regression on Google Colab. We need this to see how the standard error changes when we implement the ridge regression.

In [7]:
### Generator for artificial dataset.

import numpy as np
n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
X[:,4]=2.5*X[:,2]+2.2*X[:,3]+(X[:,4]/100)

In [8]:
import numpy as np
n_samples, n_features = 10, 5
X=np.array([[ 0.14404357 , 1.45427351,  0.76103773,  0.12167502,  2.17471798],
   [ 0.33367433,  1.49407907, -0.20515826,  0.3130677,   0.16731233],
   [-2.55298982,  0.6536186,   0.8644362,  -0.74216502,  0.551025  ],
   [-1.45436567,  0.04575852, -0.18718385,  1.53277921,  2.91884823],
   [ 0.15494743,  0.37816252, -0.88778575, -1.98079647, -6.58069572],
   [ 0.15634897,  1.23029068,  1.20237985, -0.38732682,  2.1508076 ],
   [-1.04855297, -1.42001794, -1.70627019,  1.9507754,   0.02093387],
   [-0.4380743,  -1.25279536,  0.77749036, -1.61389785, -1.60897678],
   [-0.89546656,  0.3869025,  -0.51080514, -1.18063218, -3.87468547],
   [ 0.42833187,  0.06651722,  0.3024719,  -0.63432209,-0.64295627]])
X2=X[0:10,0:4]
print(X2)

y=[ 1.76405235,0.40015721,  0.97873798,  2.2408932,   1.86755799, -0.97727788,  0.95008842, -0.15135721, -0.10321885,  0.4105985 ]

[[ 0.14404357  1.45427351  0.76103773  0.12167502]
 [ 0.33367433  1.49407907 -0.20515826  0.3130677 ]
 [-2.55298982  0.6536186   0.8644362  -0.74216502]
 [-1.45436567  0.04575852 -0.18718385  1.53277921]
 [ 0.15494743  0.37816252 -0.88778575 -1.98079647]
 [ 0.15634897  1.23029068  1.20237985 -0.38732682]
 [-1.04855297 -1.42001794 -1.70627019  1.9507754 ]
 [-0.4380743  -1.25279536  0.77749036 -1.61389785]
 [-0.89546656  0.3869025  -0.51080514 -1.18063218]
 [ 0.42833187  0.06651722  0.3024719  -0.63432209]]


In [9]:
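def se(X,mse):
  # Standard error of each slope using the single-predictor formula
  # sqrt(MSE / sum((x - mean(x))^2)), applied one column at a time.
  # Note: this does not account for correlation between the columns of X.
  SE=np.zeros(X.shape[1])
  for i in range(X.shape[1]):
     SE[i]=np.sqrt(mse/np.square(X[:,i]-np.mean(X[:,i])).sum())

  return SE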
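The `se` function above applies the single-predictor formula to each column in turn. When predictors are correlated, the textbook least squares standard errors use the diagonal of $(X_c^TX_c)^{-1}$ instead, where $X_c$ is the centred design matrix. The sketch below compares the two, assuming a model with an intercept and an MSE computed on $n-p-1$ degrees of freedom. The names `se_simple` and `se_full` are made up for this comparison, and the formula applies to ordinary least squares, not to the ridge estimate.

```python
import numpy as np

def se_simple(X, mse):
    """Column-by-column formula: sqrt(MSE / sum((x - mean)^2))."""
    Xc = X - X.mean(axis=0)
    return np.sqrt(mse / np.square(Xc).sum(axis=0))

def se_full(X, mse):
    """OLS standard errors for an intercept model: sqrt(MSE * diag((Xc'Xc)^-1))."""
    Xc = X - X.mean(axis=0)
    return np.sqrt(mse * np.diag(np.linalg.inv(Xc.T @ Xc)))

# Illustrative data with one nearly collinear column
rng = np.random.RandomState(0)
y = rng.randn(10)
X = rng.randn(10, 5)
X[:, 4] = 2.5 * X[:, 2] + 2.2 * X[:, 3] + X[:, 4] / 100

# OLS fit with an intercept via least squares on [1, X]
A = np.column_stack([np.ones(len(y)), X])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
resid = y - A @ coef
mse = resid @ resid / (len(y) - X.shape[1] - 1)   # residual df = n - p - 1

print("simple formula:", se_simple(X, mse))
print("full formula  :", se_full(X, mse))
```

The ratio of the two results for a column is the square root of that column's VIF, so the gap is largest for the columns involved in the near-linear dependence.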

The following piece of code prints out the VIF for the first four X factors (`X2`) and the correlation matrix for all five. It then goes on to produce the coefficients, R-squared and the standard errors for each parameter. If you multiply the standard error by 1.96, then subtract this value from and add it to your parameter estimate, you get an approximate 95% confidence interval:

>>$\hat{\beta} \pm 1.96 \times S.E.$

If zero falls in this interval, the parameter is not significantly different from zero at that level and is a candidate for being dropped.

In [10]:
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X2, i) for i in range(X2.shape[1])]

print(vif)

print(np.corrcoef(X.transpose()))
reg = LinearRegression().fit(X2, y)
mse=np.square(y-reg.predict(X2)).sum()/(n_samples-X2.shape[1]-1)  # residual df = n - p - 1 = 5
print('Standard errors are ',se(X2,mse))
print('r_squared :',reg.score(X2, y))
print('reg coefficents : ',reg.coef_)

   VIF Factor
0    1.015238
1    1.239427
2    1.362308
3    1.151187
[[ 1.          0.28495567  0.050991   -0.22445704 -0.17770115]
 [ 0.28495567  1.          0.43815271 -0.12449802  0.22222447]
 [ 0.050991    0.43815271  1.         -0.35107814  0.44352535]
 [-0.22445704 -0.12449802 -0.35107814  1.          0.68349441]
 [-0.17770115  0.22222447  0.44352535  0.68349441  1.        ]]
Standard errors are  [0.40718094 0.38835036 0.43162862 0.30906448]
r_squared : 0.25409974966549564
reg coefficents :  [-0.27391759  0.27547156 -0.4669002   0.11115832]
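The VIF figures above cover only the first four columns. To see what the fifth column does, the sketch below computes $VIF_j = 1/(1-R_j^2)$ by hand, regressing each column on the others. It assumes an intercept in every auxiliary regression, so the numbers may differ slightly from the statsmodels values when the design matrix has no constant column. The function name `vif_by_regression` is made up for this example, and the data are rebuilt with the generator recipe shown earlier.

```python
import numpy as np

def vif_by_regression(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        rss = np.sum((target - fitted) ** 2)
        tss = np.sum((target - target.mean()) ** 2)
        vifs[j] = tss / rss               # same as 1 / (1 - R_j^2)
    return vifs

# Same recipe as the artificial dataset generator
rng = np.random.RandomState(0)
y = rng.randn(10)
X = rng.randn(10, 5)
X[:, 4] = 2.5 * X[:, 2] + 2.2 * X[:, 3] + X[:, 4] / 100

vif_four = vif_by_regression(X[:, :4])    # without the constructed column
vif_all = vif_by_regression(X)            # with the constructed column

print("VIF, first four columns:", np.round(vif_four, 2))
print("VIF, all five columns  :", np.round(vif_all, 2))
```

Comparing the two printed rows shows how much of the inflation comes from adding the constructed column.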


In [11]:
print(y)

[1.76405235, 0.40015721, 0.97873798, 2.2408932, 1.86755799, -0.97727788, 0.95008842, -0.15135721, -0.10321885, 0.4105985]


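Using the interval formula from the previous section on the output above, all four intervals contain zero (for example, the third coefficient gives $-0.467 \pm 1.96 \times 0.432$, roughly $[-1.31, 0.38]$), so none of the coefficients is individually significant. With only these four columns the VIFs are all close to 1; the multicollinearity enters with the fifth column, which is included in the ridge fit that follows.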
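The interval check can be sketched in a few lines, using the coefficients and standard errors printed by the least squares cell above. The multiplier 1.96 is the normal approximation; with so few observations a t-based multiplier would be larger and the intervals wider.

```python
import numpy as np

# Values copied from the printed OLS output (fit on the first four columns)
coef = np.array([-0.27391759, 0.27547156, -0.4669002, 0.11115832])
se_vals = np.array([0.40718094, 0.38835036, 0.43162862, 0.30906448])

z = 1.96                                   # normal-approximation multiplier
lower = coef - z * se_vals
upper = coef + z * se_vals
contains_zero = (lower <= 0) & (upper >= 0)

for j, (lo, hi, flag) in enumerate(zip(lower, upper, contains_zero), start=1):
    print(f"X{j}: [{lo:.3f}, {hi:.3f}]  contains zero: {flag}")
```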

In [12]:
from sklearn.linear_model import Ridge
#X=X[:,0:4]
clf = Ridge(alpha=0.2)
ridge=clf.fit(X, y)

mse=np.square(y-ridge.predict(X)).sum()/(n_samples-n_features)
print('Standard errors are ',se(X,mse))
print('r_squared :',ridge.score(X,y))
print('reg coefficents : ',ridge.coef_)

Standard errors are  [0.4073432  0.38850512 0.43180063 0.30918765 0.13440328]
r_squared : 0.2535051418548737
reg coefficents :  [-0.26452402  0.26160454 -0.29887643  0.24664966 -0.06131185]


If we use a $\lambda$ of 0.2, all five variables stay in the model: ridge shrinks the coefficients but does not set any of them to zero. Applying the same interval formula to the output above, the interval for the 5th coefficient ($-0.061 \pm 1.96 \times 0.134$) contains zero, as do the other four, so on this output no variable stands out as significant. We created the 5th variable from X3 and X4, so we would expect it to be insignificant and a candidate for being dropped. However, if you remember back in MOOC 1 we talked about being very careful when removing variables. It is not straightforward. You will see that when we do Lasso Regression the 5th variable is the one selected for exclusion.

#**Review**

We have looked at how to implement ridge regression and how multicollinearity can send us in the wrong direction with respect to our assumptions about which variables are appropriate. The major drawback is that there is no single agreed methodology for selecting $\lambda$.

Adjust the values of alpha ($\lambda$) and see what happens.
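If you want a starting point for that experiment, the sketch below sweeps a few values of $\lambda$ and prints the coefficients and their overall size for each. It uses a NumPy closed-form fit on centred data as a stand-in for the `Ridge` estimator, so the helper `ridge_coefs` and the list of values are assumptions for illustration only.

```python
import numpy as np

def ridge_coefs(X, y, lam):
    """Closed-form ridge coefficients on centred data (intercept not penalised)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)

# Same recipe as the artificial dataset generator
rng = np.random.RandomState(0)
y = rng.randn(10)
X = rng.randn(10, 5)
X[:, 4] = 2.5 * X[:, 2] + 2.2 * X[:, 3] + X[:, 4] / 100

lambdas = [0.0, 0.01, 0.1, 0.2, 0.3, 1.0, 10.0, 100.0]   # illustrative values
norms = []
for lam in lambdas:
    beta = ridge_coefs(X, y, lam)
    norms.append(np.linalg.norm(beta))
    print(f"lambda={lam:<6} coefs={np.round(beta, 3)}  norm={norms[-1]:.3f}")
```

The coefficient norm never increases as $\lambda$ grows, and none of the coefficients is set exactly to zero, which is the key difference from the Lasso.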

We are now going to move to the next step which will cover Lasso Regression.