## Regularization and model selection for Regression

The Gauss-Markov theorem states that the least squares estimate of the parameters $\beta$ in a linear regression model has the smallest variance, and hence the smallest mean squared error (MSE), among all linear unbiased estimators. However, a biased estimator can still have a smaller MSE. Such estimators accept a little bias in exchange for a larger reduction in variance, which lowers the MSE overall. Variable selection and regularization are two families of methods used to design such estimators.

___Subset selection___ is the process of identifying a reduced subset of predictors from the entire set of $p$ predictors to model the response, so that we may achieve lower variance in our estimators. The _Best Subset Selection_ method fits all $\binom{p}{k}$ models for each $k = 1, 2, \ldots, p$ and selects the best resulting model from among all the possibilities. Because this becomes computationally infeasible as $p$ grows, _Stepwise Selection_ methods instead start with a null (Forward Stepwise) or full (Backward Stepwise) model and add (Forward) or remove (Backward) predictors one step at a time in a greedy manner. Subset selection therefore results in models with fewer predictors, which are more interpretable and can have lower prediction error. However, the process is discrete: each variable is either kept or dropped, so the selected model often has high variance. Regularization (aka shrinkage) methods are a continuous version of variable selection.
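
To make the computational gap concrete, the short sketch below counts how many models each approach fits. The function names are illustrative, not from any library, and the forward stepwise count assumes the usual procedure: one null model, then at step $k$ one candidate fit for each of the $p - k$ predictors not yet in the model.

```python
from math import comb


def best_subset_model_count(p: int) -> int:
    """Models fit by best subset selection: C(p, k) for each k = 1..p."""
    return sum(comb(p, k) for k in range(1, p + 1))


def forward_stepwise_model_count(p: int) -> int:
    """Null model plus (p - k) candidate fits at each step k = 0..p-1."""
    return 1 + sum(p - k for k in range(p))


for p in (5, 10, 20):
    print(p, best_subset_model_count(p), forward_stepwise_model_count(p))
# 5 31 16
# 10 1023 56
# 20 1048575 211
```

The best subset count equals $2^p - 1$, so it doubles with every added predictor, whereas the forward stepwise count, $1 + p(p+1)/2$, grows only quadratically.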

#### Regularization methods
In regularization methods, we fit the model with all $p$ variables but shrink the coefficients towards zero, thereby reducing variance at the cost of a slight increase in bias. In the least squares fitting algorithm, we estimate the parameters $\beta$ by minimizing the residual sum of squares (RSS):

\begin{align}
J(\beta) & = RSS \\
& = \sum_{i=1}^{n}(y_i - \hat y_i)^2 \\
& = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij}\right)^2
\end{align}
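
As a quick check of the RSS definition, here is a minimal plain-Python sketch that evaluates it for a given intercept and coefficient vector. The data and the helper names `predict` and `rss` are made up for illustration.

```python
def predict(beta0, betas, x_row):
    """Fitted value: beta_0 + sum_j beta_j * x_ij for one observation."""
    return beta0 + sum(b * x for b, x in zip(betas, x_row))


def rss(beta0, betas, X, y):
    """Residual sum of squares over all n observations."""
    return sum((y_i - predict(beta0, betas, x_row)) ** 2
               for x_row, y_i in zip(X, y))


# Toy data: n = 3 observations, p = 2 predictors (hypothetical values).
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 4.0, 8.0]

# Fitted values are 4, 3, 5, so the residuals are 1, 1, 3.
print(rss(1.0, [1.0, 1.0], X, y))  # 11.0
```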

In regularization methods, we add a shrinkage penalty term to the cost-function we are trying to minimize. For ___Ridge Regression___, the cost-function takes the form:

$$J(\beta) = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij}\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$

And for ___Lasso Regression___:
$$J(\beta) = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij}\right)^2 + \lambda \sum_{j=1}^{p}\lvert \beta_j \rvert$$

Here $\lambda \ge 0$ is the tuning parameter. As with least squares, the $\beta$ are chosen to make the RSS small, but each choice of $\beta$ is also charged the shrinkage penalty. Note that the intercept $\beta_0$ is not penalized. In both Ridge and Lasso, the penalty is small when the $\beta_j$ are close to zero, so minimizing the cost function shrinks the estimates towards $0$. The tuning parameter $\lambda$ controls the impact of the penalty term on the regression model. When $\lambda = 0$, the penalty vanishes and we recover the least squares estimate; as $\lambda \to \infty$, the coefficient estimates tend to zero. Adding the shrinkage penalty reduces the flexibility of the model, thereby increasing bias but decreasing variance. As $\lambda$ increases, the fit to the training data becomes poorer, and beyond some point the added bias outweighs the reduction in variance, so the test MSE rises again.
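
The effect of $\lambda$ is easiest to see with a single predictor. Assuming $x$ and $y$ are both centered (so the unpenalized intercept is zero and can be dropped), setting the derivative of $\sum_i (y_i - \beta x_i)^2 + \lambda \beta^2$ to zero gives $\hat\beta = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$. The sketch below evaluates this on made-up data; `ridge_slope` is an illustrative name, not a library function.

```python
def ridge_slope(x, y, lam):
    """Minimizer of sum((y_i - b*x_i)^2) + lam * b^2 for one centered predictor."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


# Hypothetical centered data: both x and y sum to zero.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.7]

# lam = 0 reproduces the least squares slope; larger lam shrinks it towards 0.
for lam in (0.0, 1.0, 10.0, 100.0, 1e6):
    print(lam, ridge_slope(x, y, lam))
```

The estimate decreases monotonically in $\lambda$ but stays nonzero for any finite $\lambda$.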

The difference between Lasso and Ridge regression is the choice of penalty term. The lasso uses an $\ell_1$ penalty: the $\ell_1$ norm of the coefficient vector $\beta$ is the sum of the absolute values of its entries, $\lVert \beta \rVert_1 = \sum_j \lvert \beta_j \rvert$. Ridge regression uses an $\ell_2$ penalty: the squared $\ell_2$ norm $\lVert \beta \rVert_2^2 = \sum_j \beta_j^2$, where $\lVert \beta \rVert_2 = \sqrt{\sum_j \beta_j^2}$ is the Euclidean length of $\beta$. While both penalty terms pull the estimates towards zero, the $\ell_2$ penalty encourages solutions where most parameter values are small, whereas the $\ell_1$ penalty often results in solutions where some parameters are exactly zero. So Lasso regression performs a form of subset selection, and hence tends to produce more interpretable models.
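
The contrast between "small" and "exactly zero" can be illustrated in the one-predictor case, again assuming centered $x$ and $y$ so the intercept drops out. Minimizing $\sum_i (y_i - \beta x_i)^2 + \lambda \lvert \beta \rvert$ gives a soft-thresholded estimate, $\hat\beta = S(\sum_i x_i y_i, \lambda/2) / \sum_i x_i^2$, while the ridge estimate is $\sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$. The function names and data below are illustrative only.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_slope(x, y, lam):
    """Minimizer of sum((y_i - b*x_i)^2) + lam * |b| for one centered predictor."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam / 2.0) / sxx


def ridge_slope(x, y, lam):
    """Minimizer of sum((y_i - b*x_i)^2) + lam * b^2 for one centered predictor."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


# Hypothetical centered data: both x and y sum to zero.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.7]

for lam in (0.0, 10.0, 40.0):
    print(lam, lasso_slope(x, y, lam), ridge_slope(x, y, lam))
```

On this data the lasso slope is exactly `0.0` at $\lambda = 40$, because $\lvert \sum_i x_i y_i \rvert \le \lambda / 2$, while the ridge slope is merely small. With several correlated predictors there is no such closed form for the lasso, and it is typically fit iteratively.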