# 6 Linear Model Selection and Regularization

In the regression setting, the standard linear model

<a id="Formula6.1"></a>
<font size=5><center> $ Y = \beta_{0} + \beta_{1}X_{1} + \dots + \beta_{p}X_{p} + \epsilon $ </center></font>

is commonly used to describe the relationship between a response $ Y $ and a set of variables $ X_{1},X_{2},\dots,X_{p} $. One typically fits this model using least squares.

- In Chapter 7 we generalize <a href="#Formula6.1">(6.1)</a> in order to `accommodate non-linear`, but `still additive`, `relationships`.

- In Chapter 8 we consider even `more general non-linear models`.

However, the linear model `has distinct advantages` in terms of `inference` and, on `real-world problems`, is `often surprisingly competitive` in relation to non-linear methods.

Hence, before moving to the `non-linear world`, we discuss in this chapter some ways in which the `simple linear model` can be `improved`, by `replacing plain least squares fitting` with some `alternative fitting procedures`.

__Why might we want to use another fitting procedure instead of `least squares`?__ 

As we will see, alternative fitting procedures can yield better _prediction accuracy_ and _model interpretability_.
    
- ___Prediction Accuracy:___ Provided that the `true relationship` between the `response and the predictors` is approximately `linear`, the `least squares` estimates will have `low bias`.  <br>If $n \gg p$ (that is, if $n$, the `number of observations`, is `much larger` than $p$, the `number of variables`), then the `least squares estimates` tend to also have `low variance`, and hence will perform well on `test observations`. <br>However, if $n$ is not much larger than $p$, then there can be a lot of `variability` in the `least squares fit`, resulting in `overfitting` and `consequently poor predictions` on `future observations not used in model training`. And if $p > n$, then there is __`no longer a unique least squares coefficient estimate`__: the `variance` is `infinite` so the method `cannot be used at all`. <br>By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias. This can lead to substantial improvements in the accuracy with which we can predict the response for observations not used in model training.


- ___Model Interpretability:___ It is often the case that some or many of the variables used in a multiple regression model are in fact not associated with the response. Including such irrelevant variables leads to unnecessary complexity in the resulting model. By removing these variables—that is, by setting the corresponding coefficient estimates to zero—we can obtain a model that is more easily interpreted. Now least squares is extremely unlikely to yield any coefficient estimates that are exactly zero. In this chapter, we see some approaches for automatically performing ___`feature selection`___ or ___`variable selection`___—that is, for excluding irrelevant variables from a multiple regression model.
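The claim that there is no unique least squares estimate when $p > n$ can be checked numerically. The following is a minimal sketch on synthetic data, assuming NumPy and, for brevity, a model without an intercept; the variable names are illustrative only. It builds two different coefficient vectors that both fit the training data exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 5, 8                       # fewer observations than variables (p > n)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# The rank of X is at most n, so the p x p matrix X^T X cannot be inverted.
rank_X = np.linalg.matrix_rank(X)

# lstsq returns the minimum-norm solution, which is only one of many.
beta_a, *_ = np.linalg.lstsq(X, y, rcond=None)

# The last right singular vector lies in the null space of X (X @ null_vec is
# approximately 0), so adding any multiple of it leaves the fitted values unchanged.
_, _, vt = np.linalg.svd(X)
null_vec = vt[-1]
beta_b = beta_a + 3.0 * null_vec

rss_a = float(np.sum((y - X @ beta_a) ** 2))
rss_b = float(np.sum((y - X @ beta_b) ** 2))

print("rank of X:", rank_X, "with p =", p)
print("RSS for the two coefficient vectors:", rss_a, rss_b)
print("largest coefficient difference:", float(np.max(np.abs(beta_a - beta_b))))
```

Both residual sums of squares are zero up to rounding error even though the coefficient vectors differ, which is the overfitting situation described above.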

There are many `alternatives`, both `classical` and `modern`, to using `least squares` to fit <a href="#Formula6.1">(6.1)</a>. 

In this chapter, we discuss `three important classes of methods`.

1. ___Subset Selection___. This approach involves identifying a subset of the $p$ predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.


2. ___Shrinkage___. This approach involves fitting a model involving all $p$ predictors. However, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Hence, shrinkage methods can also perform variable selection.


3. ___Dimension Reduction.___ This approach involves projecting the $p$ predictors into an $M$-dimensional subspace, where $M < p$. This is achieved by computing $M$ different linear combinations, or projections, of the variables. Then these $M$ projections are used as predictors to fit a linear regression model by least squares.
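To make the shrinkage idea in item 2 concrete, here is a small sketch on synthetic data. It assumes one particular shrinkage method, an $\ell_2$ (ridge-style) penalty with a closed-form solution, which the list above does not name; the penalty value `lam = 50.0` and the helper `penalized_coefficients` are arbitrary illustrations, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=n)

# Center the predictors and the response so that no intercept is needed.
Xc = X - X.mean(axis=0)
yc = y - y.mean()


def penalized_coefficients(Xc, yc, lam):
    """Solve (Xc^T Xc + lam * I) beta = Xc^T yc; lam = 0 gives plain least squares."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)


beta_ls = penalized_coefficients(Xc, yc, 0.0)       # least squares estimates
beta_shrunk = penalized_coefficients(Xc, yc, 50.0)  # assumed penalty value

print("least squares:", np.round(beta_ls, 3))
print("shrunken     :", np.round(beta_shrunk, 3))
print("norms        :", np.linalg.norm(beta_ls), np.linalg.norm(beta_shrunk))
```

With this penalty the coefficient vector moves towards zero but its entries generally do not become exactly zero; other penalties behave differently in that respect.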



## 6.1 Subset Selection
### 6.1.1 Best Subset Selection

To perform ___`Best Subset Selection`___, we fit a separate least squares regression for each possible combination of the $p$ predictors. That is, we fit all $p$ models that contain exactly one predictor, all $ \binom{p}{2} = p(p - 1)/2$ models that contain `exactly two predictors`, and so forth. 

We then look at all of the `resulting models`, with the `goal of identifying` the `one that is best`.
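The number of fits involved grows quickly with $p$. As a quick illustration (the helper name `subset_counts` is made up for this note), the counts $\binom{p}{k}$ for $k = 0, \dots, p$ add up to $2^p$:

```python
from math import comb


def subset_counts(p):
    """Number of candidate models with exactly k predictors, for k = 0..p."""
    return [comb(p, k) for k in range(p + 1)]


counts = subset_counts(10)
total = sum(counts)

print(counts[1], counts[2])   # 10 one-predictor models, 45 two-predictor models
print(total)                  # 1024, which equals 2 ** 10
print(sum(subset_counts(20))) # 1048576 candidate models for p = 20
```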


The `problem of selecting the best model` from `among` the $2^p$ `possibilities` considered by `best subset selection` is `not trivial`. 

This is usually broken up `into two stages`, as described in <a href="#Algorithm-6.1-Best-Subset-Selection">___`Algorithm 6.1`___</a>.

***
#### Algorithm 6.1 Best Subset Selection
***
<font size=3 face="Times New Roman"><b>

1.  Let $ M_{0} $ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation. 


2. For $ k = 1, 2, \dots, p$:
    
    (a) Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.
    
    (b) Pick the best among these $\binom{p}{k}$ models, and call it $ M_{k} $. Here best is defined as having the smallest RSS, or equivalently the largest $R^2$.


3. Select a single best model from among $ M_{0} , \dots , M_{p}$ using cross-validated prediction error, $C_{p}$ (AIC), BIC, or adjusted $R^2$.
</b></font>
***
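The steps of Algorithm 6.1 can be sketched in a few lines of NumPy. This is an illustrative sketch on synthetic data, not a reference implementation: the function names are hypothetical, and Step 3 uses one common Gaussian-likelihood form of BIC, $n \log(\mathrm{RSS}/n) + \log(n)\,(k + 1)$, as an assumed criterion. Cross-validation, $C_p$, or adjusted $R^2$ could be substituted.

```python
from itertools import combinations

import numpy as np


def rss_of_fit(X, y, cols):
    """RSS of a least squares fit with an intercept plus the predictors in `cols`."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def best_subset_path(X, y):
    """Steps 1 and 2: the lowest-RSS model M_k for each size k = 0..p."""
    p = X.shape[1]
    path = [((), rss_of_fit(X, y, ()))]           # M_0: intercept only
    for k in range(1, p + 1):
        candidates = combinations(range(p), k)     # all C(p, k) subsets of size k
        best = min(candidates, key=lambda cols: rss_of_fit(X, y, cols))
        path.append((best, rss_of_fit(X, y, best)))
    return path


def select_by_bic(path, n):
    """Step 3 (assumed criterion): pick the M_k with the smallest BIC."""
    def bic(entry):
        cols, rss = entry
        return n * np.log(rss / n) + np.log(n) * (len(cols) + 1)
    return min(path, key=bic)


# Synthetic data: only predictors 0 and 2 are related to the response.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

path = best_subset_path(X, y)
chosen_cols, chosen_rss = select_by_bic(path, n)

for cols, rss in path:
    print(len(cols), cols, round(rss, 2))
print("selected predictors:", chosen_cols)
```

The training RSS along the path never increases with $k$, which is why Step 3 needs a criterion other than RSS or $R^2$ to compare models of different sizes. The exhaustive loop fits $2^p$ models in total, so this sketch is only practical for small $p$.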

In <a href="#Algorithm-6.1-Best-Subset-Selection">__`Algorithm 6.1`__</a>, Step 2 identifies the best model (on the training data) for each subset size, in order to reduce the problem from one of $2^p$ possible models to one of $p + 1$ possible models. <br>In <a href="#Figure6.1">Figure 6.1</a>, these models form the lower frontier depicted in red.


<a id="Figure6.1"></a>
![image.png](Figures/Figure6.1.png)
>__FIGURE 6.1.__ For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and $R^2$ are displayed. 
<br>The red frontier tracks the best model for a given number of predictors, according to RSS and $R^2$ . Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.
