# Subset selection






## Best subset selection

To perform best subset selection, we fit a separate least squares regression for each possible combination of the $D$ predictors. That is, we fit all $D$ models that contain exactly one predictor, all $\left(\begin{array}{c} D \\ 2 \end{array}\right) = D(D - 1)/2$ models that contain exactly two predictors, and so forth. We then look at all of the resulting models, with the goal of identifying the one that is best.


```{prf:algorithm} Best subset selection
:label: best-subset


1. Let $\mathcal{M}_0$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For $k = 1, 2, \ldots, D$:
	1. Fit all $\left(\begin{array}{c} D \\ k \end{array}\right)$ models that contain exactly $k$ predictors, where $D$  is the total number of predictors.
	2. Choose the best among these $\left(\begin{array}{c} D \\ k \end{array}\right)$ models, denoted $\mathcal{M}_k$. Here best is defined using some criterion, such as $\mathrm{R}^2$.
	<!-- 2. Choose the best among these $\left(\begin{array}{c} D \\ k \end{array}\right)$ models, denoted $\mathcal{M}_k$. Here best is defined using some criterion, such as $C_p$, $AIC$, $BIC$, adjusted $R^2$, or $Mallow's\;C_p$. -->
3. Select a single best model from $\mathcal{M}_0, \mathcal{M}_1, \ldots, \mathcal{M}_D$ using cross-validation. 
<!-- Here we will use $C_p$, but you could also use $AIC$, $BIC$, adjusted $R^2$, or $Mallow's\;C_p$. -->

```
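
The first two steps of the algorithm can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the helper names `rss` and `best_subset` and the synthetic data are assumptions made for the example. Among models of the same size, the one with the smallest residual sum of squares (RSS) is also the one with the largest $\mathrm{R}^2$.

```python
from itertools import combinations

import numpy as np


def rss(X, y, cols):
    """RSS of a least squares fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def best_subset(X, y):
    """Return {k: (best_columns, rss)} for k = 0, ..., D."""
    D = X.shape[1]
    best = {0: ((), rss(X, y, ()))}  # null model: predicts the sample mean
    for k in range(1, D + 1):
        # Fit all C(D, k) models with exactly k predictors, keep the smallest RSS
        best[k] = min(
            ((cols, rss(X, y, cols)) for cols in combinations(range(D), k)),
            key=lambda pair: pair[1],
        )
    return best


# Synthetic data (illustrative assumption): only columns 0 and 2 drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

models = best_subset(X, y)
for k, (cols, value) in models.items():
    print(k, cols, round(value, 2))
```

The training RSS can only decrease as $k$ grows, which is why step 3 relies on cross-validation rather than on the training fit to pick the final model.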

## Stepwise selection

For computational reasons, best subset selection cannot be applied with very large $D$. Best subset selection may also suffer from statistical problems when $D$ is large. The larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data. Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates. For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.

### Forward stepwise selection

Forward stepwise selection is a computationally efficient alternative to best subset selection. While the best subset selection procedure considers all $2^D$ possible models containing subsets of the $D$ predictors, forward stepwise considers a much smaller set of models. Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one at a time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.


```{prf:algorithm} Forward stepwise selection
:label: forward-stepwise

1. Let $\mathcal{M}_0$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For $k = 1, 2, \ldots, D$:
    1. Fit all $D - k + 1$ models that contain all of the predictors in $\mathcal{M}_{k-1}$ plus one additional predictor.
    2. Choose the best among these $D - k + 1$ models, denoted $\mathcal{M}_k$. Here best is defined using some criterion, such as $\mathrm{R}^2$.
3. Select a single best model from $\mathcal{M}_0, \mathcal{M}_1, \ldots, \mathcal{M}_D$ using cross-validation.
    
```
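
The greedy loop in step 2 can be sketched as follows. This is an illustrative sketch only: the helper names `rss` and `forward_stepwise` and the synthetic data are assumptions, and the smallest RSS is used as the selection criterion (equivalent to the largest $\mathrm{R}^2$ when comparing models of the same size).

```python
import numpy as np


def rss(X, y, cols):
    """RSS of a least squares fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def forward_stepwise(X, y):
    """Return the nested sequence of column sets M_0, M_1, ..., M_D."""
    D = X.shape[1]
    selected = []
    path = [tuple(selected)]  # M_0: the null model
    for _ in range(D):
        remaining = [j for j in range(D) if j not in selected]
        # Try adding each remaining predictor; keep the one with the smallest RSS
        best_j = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best_j)
        path.append(tuple(selected))
    return path


# Synthetic data (illustrative assumption): only columns 0 and 2 drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

path = forward_stepwise(X, y)
for k, cols in enumerate(path):
    print(k, cols)
```

Each model in `path` extends the previous one by exactly one column, which is the nesting property that makes the search cheap.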


Unlike best subset selection, which involves fitting $2^D$ models, forward stepwise selection involves fitting one null model, along with $D - k$ models in the $k$th iteration, for $k = 0, \dots, D - 1$. This amounts to a total of $1 + \sum_{k=0}^{D-1}(D-k) = 1 + D(D+1)/2$ models. This is a substantial difference: when $D = 20$, best subset selection requires fitting 1,048,576 models, whereas forward stepwise selection requires fitting only 211 models.
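
The two counts quoted for $D = 20$ can be checked with a few lines of standard-library Python (the variable names are illustrative):

```python
from math import comb

D = 20

# Best subset: one model per subset of the D predictors, summed over subset sizes
best_subset_count = sum(comb(D, k) for k in range(D + 1))

# Forward stepwise: the null model plus D - k fits in iteration k = 0, ..., D - 1
forward_count = 1 + sum(D - k for k in range(D))

print(best_subset_count, forward_count)
```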

Forward stepwise selection’s computational advantage over best subset selection is clear. Though forward stepwise tends to do well in practice, it is not guaranteed to find the best possible model out of all $2^D$ models containing subsets of the $D$ predictors. For example, suppose that in a given data set with $D = 3$ predictors, the best possible one-variable model contains $x_1$, and the best possible two-variable model instead contains $x_2$ and $x_3$. Then forward stepwise selection will fail to select the best possible two-variable model, because $\mathcal{M}_1$ will contain $x_1$, so $\mathcal{M}_2$ must also contain $x_1$ together with one additional variable.
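
The situation in this example can be reproduced on synthetic data. The construction below is an assumption made purely for illustration: $x_2$ and $x_3$ share a large component that cancels in their sum, so each is nearly useless alone but together they determine the response exactly, while $x_1$ is a noisy copy of the response. Columns 0, 1, 2 of `X` play the roles of $x_1$, $x_2$, $x_3$.

```python
from itertools import combinations

import numpy as np


def rss(X, y, cols):
    """RSS of a least squares fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


rng = np.random.default_rng(0)
n = 500
z = rng.normal(scale=10.0, size=n)  # large shared component
a, b = rng.normal(size=n), rng.normal(size=n)
x2, x3 = z + a, -z + b
y = x2 + x3  # equals a + b, so x2 and x3 together explain y exactly
x1 = y + rng.normal(scale=0.5, size=n)  # noisy copy of y
X = np.column_stack([x1, x2, x3])

# Exhaustive search over one- and two-variable models
best_one = min(combinations(range(3), 1), key=lambda c: rss(X, y, c))
best_two = min(combinations(range(3), 2), key=lambda c: rss(X, y, c))

# Forward stepwise: the two-variable model must contain the first pick
first = best_one[0]
forward_two = min(
    ((first, j) for j in range(3) if j != first),
    key=lambda c: rss(X, y, c),
)

print("best one-variable model:", best_one)
print("best two-variable model:", best_two)
print("forward stepwise two-variable model:", forward_two)
```

Here the exhaustive search and the greedy search agree on the one-variable model but disagree on the two-variable model, because the greedy path cannot drop its first pick.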

### Backward stepwise selection

Backward stepwise selection is similar to forward stepwise selection, except that it begins with the full model that contains all $D$  predictors, and then removes the least useful predictor, one-at-a-time, until the model contains no predictors.

```{prf:algorithm} Backward stepwise selection
:label: backward-stepwise

1. Let $\mathcal{M}_D$ denote the full model, which contains all $D$ predictors.
2. For $k = D, D - 1, \ldots, 1$:
    1. Fit all $k$ models that contain all but one of the predictors in $\mathcal{M}_k$, for a total of $k - 1$ predictors.
    2. Choose the best among these $k$ models, denoted $\mathcal{M}_{k-1}$. Here best is defined using some criterion, such as $\mathrm{R}^2$.
3. Select a single best model from $\mathcal{M}_0, \mathcal{M}_1, \ldots, \mathcal{M}_D$ using cross-validation.
```
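
Backward stepwise selection, which starts from the full model and repeatedly drops the least useful predictor, can be sketched in the same style. The helper names `rss` and `backward_stepwise` and the synthetic data are assumptions for illustration; the least useful predictor is taken to be the one whose removal increases the RSS the least.

```python
import numpy as np


def rss(X, y, cols):
    """RSS of a least squares fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def backward_stepwise(X, y):
    """Return the nested sequence of column sets M_D, M_{D-1}, ..., M_0."""
    current = list(range(X.shape[1]))
    path = [tuple(current)]  # M_D: the full model
    while current:
        # Try dropping each predictor; keep the reduced model with the smallest RSS
        drop = min(
            current,
            key=lambda j: rss(X, y, [c for c in current if c != j]),
        )
        current.remove(drop)
        path.append(tuple(current))
    return path


# Synthetic data (illustrative assumption): only columns 0 and 2 drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

for cols in backward_stepwise(X, y):
    print(len(cols), cols)
```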

Like forward stepwise selection, the backward selection approach searches through only $1 + D(D + 1)/2$ models, and so can be applied in settings where $D$ is too large to apply best subset selection. Also like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the $D$ predictors.

## Choosing the best model

Best subset selection, forward selection, and backward selection result in the creation of a set of models, each of which contains a subset of the $D$  predictors. To apply these methods, we need a way to determine which of these models is best. We can use either a validation set approach or a cross-validation approach introduced in {doc}`Cross-validation <../05-cross-val-bootstrap/cross-validation>` to estimate the test error directly and then choose the best model.
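
As a rough sketch of the cross-validation approach, the snippet below estimates the test MSE of each candidate model with $K$-fold cross-validation and keeps the candidate with the lowest estimate. The helper `cv_mse`, the synthetic data, and the hard-coded candidate list (standing in for $\mathcal{M}_0, \ldots, \mathcal{M}_D$ produced by one of the selection procedures) are all assumptions for illustration.

```python
import numpy as np


def cv_mse(X, y, cols, n_folds=5, seed=0):
    """Estimate the test MSE of a least squares fit on `cols` by K-fold CV."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    errors = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        # Fit on the training folds, evaluate on the held-out fold
        coef, *_ = np.linalg.lstsq(A[train_mask], y[train_mask], rcond=None)
        errors.append(np.mean((y[test_idx] - A[test_idx] @ coef) ** 2))
    return float(np.mean(errors))


# Synthetic data (illustrative assumption): only columns 0 and 2 drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Hypothetical candidate sequence M_0, ..., M_4 (one model per size)
candidates = [(), (0,), (0, 2), (0, 2, 1), (0, 2, 1, 3)]
scores = [cv_mse(X, y, cols) for cols in candidates]
best = candidates[int(np.argmin(scores))]

for cols, score in zip(candidates, scores):
    print(cols, round(score, 4))
print("selected:", best)
```

Unlike the training RSS, the cross-validated error does not necessarily keep decreasing as predictors are added, so it can be used to compare models of different sizes.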

**Example: feature selection using `scikit-learn`**

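Here we use the [Credit dataset](https://github.com/pykale/transparentML/blob/main/data/Credit.csv) to illustrate feature selection. To select features, we use `RFECV`, recursive feature elimination with cross-validation, as implemented in `scikit-learn`. Like backward stepwise selection, it starts from all features and removes them one at a time; at each step it drops the feature that the fitted estimator ranks as least important (for a linear model, the one with the smallest coefficient magnitude).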


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFECV

%matplotlib inline

Load the [Credit dataset](https://github.com/pykale/transparentML/blob/main/data/Credit.csv), convert the categorical predictors `Student`, `Own`, `Married`, and `Region` to 0/1 indicator variables, and inspect the first three rows.

In [None]:
credit_url = "https://github.com/pykale/transparentML/raw/main/data/Credit.csv"

credit_df = pd.read_csv(credit_url)
credit_df["Student2"] = credit_df.Student.map({"No": 0, "Yes": 1})
credit_df["Own2"] = credit_df.Own.map({"No": 0, "Yes": 1})
credit_df["Married2"] = credit_df.Married.map({"No": 0, "Yes": 1})
credit_df["South"] = credit_df.Region.map(
    {"South": 1, "North": 0, "West": 0, "East": 0}
)
credit_df["West"] = credit_df.Region.map({"West": 1, "North": 0, "South": 0, "East": 0})
credit_df["East"] = credit_df.Region.map({"East": 1, "North": 0, "South": 0, "West": 0})
# credit_df["Region2"] = credit_df.Region.astype("category")
credit_df.head(3)

In [None]:
X = credit_df.drop(["Own", "Student", "Married", "Region", "Balance"], axis=1).values
y = credit_df.Balance.values

In [None]:
# Create the RFE object and compute a cross-validated score.
regr = LinearRegression()
# "neg_mean_squared_error" is the negated MSE, so a higher score (closer to zero) is better

min_features_to_select = 1  # Minimum number of features to consider
cv = 10  # Number of folds in cross-validation
rfecv = RFECV(
    estimator=regr,
    step=1,
    cv=cv,
    scoring="neg_mean_squared_error",
    min_features_to_select=min_features_to_select,
)
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (MSE)")
scores = np.concatenate(
    [-rfecv.cv_results_["split%s_test_score" % i].reshape(-1, 1) for i in range(cv)],
    axis=1,
)
plt.plot(
    # With step=1, one score is recorded per subset size from the minimum up to all features
    range(min_features_to_select, X.shape[1] + 1),
    scores,
)
plt.show()
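
Beyond the number of features, a fitted `RFECV` object also records which features were kept (`support_`) and the order in which the others were eliminated (`ranking_`, where 1 marks a selected feature). The sketch below shows this on synthetic data generated with `make_regression`; the data, the feature names `f0`, `f1`, ... and the variable names are assumptions for illustration, not part of the Credit example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (illustrative): 6 features, 3 of them informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=3, noise=5.0, random_state=0
)
feature_names = np.array([f"f{i}" for i in range(X_demo.shape[1])])

selector = RFECV(
    estimator=LinearRegression(),
    step=1,
    cv=5,
    scoring="neg_mean_squared_error",
)
selector.fit(X_demo, y_demo)

# Boolean mask of kept features, and elimination rank of every feature
print("Selected features:", list(feature_names[selector.support_]))
print("Ranking:", dict(zip(feature_names.tolist(), selector.ranking_.tolist())))
```

For the Credit example, the same attributes on `rfecv` could be paired with the column names of the data frame after dropping the non-numeric columns.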

## Exercises


