# Week 3 Overview
This week, you will examine the following techniques to use with linear regression: forward and backward selection, principal component regression (PCR), and partial least squares regression (PLSR).

### Learning Objectives
At the end of this week, you should be able to: 
- Discern the appropriate conditions for using forward and backward selection, PCR, and PLSR on your project’s dataset. 
- Perform forward and backward selection to analyze your project’s dataset. 
- Perform PCR and PLSR to analyze your project’s dataset. 

## 3.1 Lesson: Forward and Backward Selection

### Forward Selection
Recall that forward and backward selection are stepwise methods for selecting variables in a regression model. Forward selection starts with no predictors and adds them one by one based on which improves the model most, while backward selection starts with all predictors and removes the least useful ones step by step. 

In forward selection, we perform linear regression by first considering each individual feature as the sole relevant feature in the regression. We test which regression leads to the least loss and stick with that feature. Then, we move on to pick the next feature. For example, suppose our three features are the mass, diameter, and price of a cookie. We want to predict how tasty the cookie is. Then we’d start by performing a regression where mass is the only feature and try to predict tastiness. We’d perform another regression for just diameter and another for just price.

Which of these is the best regression (according to some metric)? Let’s say mass is the best feature at level one. Now we have to do two more regressions: mass and diameter or mass and price. We compare these two to each other, giving us the best regression at level two. We don’t try diameter and price, because our level one regression was mass, so we have to include mass. At level three, there’s only one possibility: mass and diameter and price. So that’s our best regression at level three.  

We keep going until a certain criterion is met: 
- We may predetermine how many features we want to use, or 
- We could test the model on the validation set and stop adding features when the validation loss stops getting better. 
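
The forward selection loop with the validation-based stopping rule can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names (`validation_mse`, `forward_selection`), the synthetic cookie data, and the choice to rank candidates by validation loss are all assumptions made for this example.

```python
import numpy as np

def validation_mse(X_train, y_train, X_val, y_val, cols):
    """Fit least squares (with intercept) on the given columns; return validation MSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, cols]])
    A_val = np.column_stack([np.ones(len(X_val)), X_val[:, cols]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_val @ coef - y_val) ** 2))

def forward_selection(X_train, y_train, X_val, y_val):
    """Greedily add features until the validation loss stops getting better."""
    selected, best_loss = [], np.inf
    remaining = list(range(X_train.shape[1]))
    while remaining:
        # One regression per candidate: the current features plus one new feature.
        losses = {j: validation_mse(X_train, y_train, X_val, y_val, selected + [j])
                  for j in remaining}
        best_j = min(losses, key=losses.get)
        if losses[best_j] >= best_loss:
            break  # no candidate improves the validation loss
        selected.append(best_j)
        remaining.remove(best_j)
        best_loss = losses[best_j]
    return selected, best_loss

# Hypothetical synthetic cookies: tastiness depends on mass and diameter, not price.
rng = np.random.default_rng(0)
names = ["mass", "diameter", "price"]
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 0.1 * rng.normal(size=300)
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

selected, loss = forward_selection(X_train, y_train, X_val, y_val)
print([names[j] for j in selected], round(loss, 4))
```

In this synthetic setup, mass should be chosen at level one and diameter at level two. Whether price is added at level three depends on noise in the validation set.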

### Backward Selection
In backward selection, we start with all possible features and then remove them one at a time. Each time we remove a feature, the performance on the training data cannot improve (and will usually get worse), but the performance on the validation data might improve. Again, we could keep going until: 
- We reach a predetermined number of features, or 
- We test the model on the validation set and stop removing features when the validation loss stops getting better. 
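
Backward selection with a predetermined number of features might look like the sketch below. The function names, the `n_keep` parameter, and the synthetic data are hypothetical, and ranking removals by validation loss is one possible choice among several.

```python
import numpy as np

def validation_mse(X_train, y_train, X_val, y_val, cols):
    """Fit least squares (with intercept) on the given columns; return validation MSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, cols]])
    A_val = np.column_stack([np.ones(len(X_val)), X_val[:, cols]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_val @ coef - y_val) ** 2))

def backward_selection(X_train, y_train, X_val, y_val, n_keep):
    """Start with every feature and drop one at a time until n_keep remain."""
    selected = list(range(X_train.shape[1]))
    while len(selected) > n_keep:
        # One regression per candidate removal: the current set minus one feature.
        losses = {j: validation_mse(X_train, y_train, X_val, y_val,
                                    [k for k in selected if k != j])
                  for j in selected}
        least_useful = min(losses, key=losses.get)  # removing it hurts the least
        selected.remove(least_useful)
    return selected

# Hypothetical synthetic data: column 2 is unrelated to the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 0.1 * rng.normal(size=300)
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

keep_two = backward_selection(X_train, y_train, X_val, y_val, n_keep=2)
keep_one = backward_selection(X_train, y_train, X_val, y_val, n_keep=1)
print(keep_two, keep_one)
```

Swapping the `while` condition for a check on whether the best removal still lowers the validation loss would give the second stopping rule instead.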

### Limitations of Forward and Backward Selection
Note that neither of these approaches considers all possible feature combinations. If there are $N$ features, then forward selection fits $N$ regressions at the first step, then $N-1$ at the second step (because you've already included one feature), then $N-2$, and so on. In total, you consider at most $N + (N-1) + \dots + 1 = \frac{N(N+1)}{2}$ combinations. The same is true for backward selection.

For example, if the features are $A$, $B$, and $C$, and you pick $A$ first in forward selection, then you have tried $B$ and $C$ alone, but you will never reach the combination $(B,C)$ without $A$. (In fact, $(B,C)$ is the only combination you will miss in this case.) There are seven combinations, and you will reach six of them. 

Likewise, if you remove $A$ first in backward selection, then you have tried $(A,B)$, $(A,C)$, and $(B,C)$, but you will never reach the feature set that contains $A$ alone. (Again, $A$ is the *only* combination you will miss in this case.)

Although only one combination is missed in the examples above, the problem gets worse as the number of features grows. 
- With 3 features there are $2^3 - 1 = 7$ combinations, but with 10 features there are $2^{10} - 1 = 1{,}023$ combinations. 
- In forward/backward selection you'd only reach $\frac{10 \cdot 11}{2} \; =\; 55$ of them.
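
To make the counting concrete, the short sketch below enumerates the feature sets that forward selection actually fits for a given order of picks and compares them with all non-empty subsets. The function names are made up for this illustration.

```python
from itertools import combinations

def forward_visited(order):
    """Feature sets fitted by forward selection when features are picked in `order`."""
    visited, chosen = set(), []
    for pick in order:
        # At each level, every not-yet-chosen feature is tried alongside the chosen ones.
        for candidate in (f for f in order if f not in chosen):
            visited.add(frozenset(chosen + [candidate]))
        chosen.append(pick)
    return visited

def all_nonempty_subsets(features):
    return {frozenset(c)
            for r in range(1, len(features) + 1)
            for c in combinations(features, r)}

# Three features, with A picked first (then B, then C).
visited = forward_visited(["A", "B", "C"])
missed = all_nonempty_subsets(["A", "B", "C"]) - visited
print(len(visited), [sorted(s) for s in missed])  # 6 [['B', 'C']]

# Ten features: 55 fitted sets out of 1,023 possible ones.
ten = list(range(10))
n_visited, n_total = len(forward_visited(ten)), len(all_nonempty_subsets(ten))
print(n_visited, n_total)  # 55 1023
```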

The hope is that the one-at-a-time process still does a good job: any particular combination can (in principle) be reached, as long as each step picks the right feature to add or remove. 

Also note that if you compute the p-values of these regressions in the usual way, these p-values cannot be trusted. That’s because if you test 55 different regressions and pick the best one, it could be that the regression looks good just by luck. In fact, getting at least one p-value of 0.02 or smaller (close to $\frac{1}{55}$) would be expected for this number of regressions, even if none of them is measuring anything real, assuming their p-values are independent of one another: the probability is $1 - 0.98^{55} \approx 0.67$. 
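
The multiple-testing arithmetic can be checked directly. The sketch below assumes that, when nothing real is being measured, each p-value is an independent draw from a uniform distribution on $[0, 1]$; the variable names and the simulation size are arbitrary choices for illustration.

```python
import random

n_tests, threshold = 55, 0.02

# Chance that at least one of 55 independent null p-values is at or below 0.02.
p_at_least_one = 1 - (1 - threshold) ** n_tests
# Expected number of such p-values among the 55 tests.
expected_hits = n_tests * threshold
print(round(p_at_least_one, 3), round(expected_hits, 2))  # 0.671 1.1

# Quick simulation of the same quantity.
random.seed(0)
n_trials = 20_000
hits = sum(
    min(random.random() for _ in range(n_tests)) <= threshold
    for _ in range(n_trials)
)
simulated = hits / n_trials
print(simulated)
```

The simulated frequency should land close to the closed-form value of about 0.67.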

### Think About It
In forward selection with four features ($A$, $B$, $C$, and $D$), which feature combinations do you miss if you add the features in the order $A$, $B$, $C$, and then $D$?

## Principal Component Regression (PCR)

In a PCR, we perform a principal component analysis (PCA) to transform the data. Then we perform a linear regression on the transformed samples.

For example, if $X$ is:
$$
X \;=\; \begin{bmatrix}
2 & 1 \\
1 & 2
\end{bmatrix}
$$

First, we center the data by subtracting the mean of each column:
$$
X^* \;=\; \begin{bmatrix}
0.5 & -0.5 \\
-0.5 & 0.5
\end{bmatrix}
$$

Then we want the eigenvectors of
$$
\frac{1}{n - 1} \, X^{*T} X^*,
$$
where $n$ is the number of samples in the problem (the number of rows of $X$). In our case, $n = 2$, so
$$
\frac{1}{2 - 1} \, X^{*T} X^* \;=\; \begin{bmatrix}
0.5 & -0.5 \\
-0.5 & 0.5
\end{bmatrix}.
$$

The principal component vectors are then $[1,\,1]$ (with eigenvalue 0) and $[1,\,-1]$ (with eigenvalue 1). With PCR, we’d use the latter (with the larger eigenvalue) as the first principal component, normalized to unit length:
$$
\bigl[\sqrt{0.5},\; -\sqrt{0.5}\bigr].
$$

The first score vector then comes from taking the inner product of the first row (a sample, after centering) with each principal component; for example:
$$
[\,0.5,\; -0.5\,] \;\cdot\; \bigl[\sqrt{0.5},\, -\sqrt{0.5}\bigr] \;=\; \sqrt{0.5}
\quad\text{and}\quad
[\,0.5,\; -0.5\,] \;\cdot\; \bigl[\sqrt{0.5},\, \sqrt{0.5}\bigr] \;=\; 0.
$$
Thus, the score vector for the first sample is $[\sqrt{0.5},\,0]$.

The score vector of the second sample is calculated similarly as $[-\sqrt{0.5},\,0]$. The second component is not useful as a predictor because its score is the same for every sample (hence the eigenvalue 0).
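
The worked example above can be reproduced numerically. This is a sketch using NumPy's `eigh`; the sign convention (making the first entry of each eigenvector positive) is an assumption added here so that the output matches the vectors in the text, since eigenvectors are only defined up to sign.

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Center each column, then form the sample covariance matrix.
X_centered = X - X.mean(axis=0)
n_samples = X.shape[0]
cov = X_centered.T @ X_centered / (n_samples - 1)

# eigh returns eigenvalues in ascending order; reorder so the largest comes first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Assumed sign convention: first entry of each eigenvector (column) is positive.
eigvecs = eigvecs * np.sign(eigvecs[0, :])

# Each row of `scores` is the score vector of one sample.
scores = X_centered @ eigvecs
print(np.round(eigvals, 6))
print(np.round(eigvecs[:, 0], 4))  # first principal component
print(np.round(scores, 4))
```

The first column of `scores` holds $\pm\sqrt{0.5} \approx \pm 0.7071$, and the second column is zero for both samples.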

> **Note:** If you use *all* of the new features (the principal components), the linear regression on the transformed samples does exactly the same thing as regression on the original samples; only the coefficients are adjusted to account for the linear transformation from the base features to PCA scores. The only way to get a different answer is if you remove some of the features. This could be helpful if:
>
> - Using all of the features would mean overfitting. (Performance on the validation set improves when features are removed.) This is especially likely with the PCA scores if some of the scores have very low variance.
> - Fewer features would lead to better explainability (e.g., you can graph two features, but you can’t graph 100 features).

This approach assumes that the high-variance directions selected by the PCA are the correct ones. It fails if a low-variance direction is what’s needed. That is, there could be a direction in which the data features do not vary much, but that small variance triggers a large deviation in the target or outcome variable.

In PCR, as with PCA, it can be helpful to standardize the features. That is, each feature should be rescaled so as to have mean zero and standard deviation one. Otherwise, the importance of a feature would be arbitrary, depending on the unit you choose (e.g., miles vs. feet).
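
Putting the pieces together, a PCR pipeline (standardize, project onto the leading principal components, then regress) can be sketched as below. The function names and the synthetic data are hypothetical, and the principal directions are taken from an SVD of the standardized training data, which is one common way to compute them. The final check illustrates the earlier note: keeping all components reproduces ordinary least squares, while keeping fewer gives a different model.

```python
import numpy as np

def ols_predict(X_train, y_train, X_new):
    """Ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def pcr_predict(X_train, y_train, X_new, n_components):
    """Standardize, keep the top principal-component scores, then regress."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0, ddof=1)
    Z_train, Z_new = (X_train - mu) / sigma, (X_new - mu) / sigma
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Z_train, full_matrices=False)
    V = Vt[:n_components].T
    return ols_predict(Z_train @ V, y_train, Z_new @ V)

# Hypothetical data: three features measured on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([2.0, 0.3, 0.01]) + 0.1 * rng.normal(size=150)
X_train, y_train, X_new = X[:100], y[:100], X[100:]

ols_pred = ols_predict(X_train, y_train, X_new)
pcr_all = pcr_predict(X_train, y_train, X_new, n_components=3)
pcr_one = pcr_predict(X_train, y_train, X_new, n_components=1)

print(np.allclose(pcr_all, ols_pred))  # all components: same predictions as OLS
print(np.allclose(pcr_one, ols_pred))  # one component: a different model
```

In practice, the number of components would be chosen by comparing validation loss across values of `n_components`.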

---

## Partial Least Squares Regression (PLSR)

With PLSR, we perform dimensionality reduction similarly to PCR, but with a key difference: instead of selecting components that capture the most variance in the features $X$, we select components that maximize the covariance between the projected features (score vectors) and the outcome $Y$.

This approach differs from PCR in that it explicitly incorporates the relationship between the features and the target variable. While PCR may retain components that explain a lot of variance in $X$ but are uninformative for predicting $Y$, PLSR prioritizes directions in the data that are most useful for prediction. This is particularly helpful when some of the components with large variance are not actually relevant to the outcome.
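
The difference between the two criteria can be seen by comparing only the first direction each method picks. The sketch below uses hypothetical synthetic data in which the high-variance feature is unrelated to the outcome. For a single outcome, the unit vector $w$ that maximizes the covariance between $Xw$ and $Y$ is proportional to $X^T Y$ (with both centered), which is what the code uses. A full PLSR would go on to extract further components, and that step is not shown here.

```python
import numpy as np

# Hypothetical data: the high-variance feature is noise, the low-variance one drives y.
rng = np.random.default_rng(0)
n = 2000
x_big = 3.0 * rng.normal(size=n)
x_small = rng.normal(size=n)
X = np.column_stack([x_big, x_small])
y = 2.0 * x_small + 0.1 * rng.normal(size=n)

Xc, yc = X - X.mean(axis=0), y - y.mean()

# PCR: first direction maximizes the variance of the scores Xc @ w.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_dir = Vt[0]

# PLSR: first direction maximizes the covariance between Xc @ w and yc.
pls_dir = Xc.T @ yc
pls_dir = pls_dir / np.linalg.norm(pls_dir)

def r_squared(score, target):
    """R^2 of a one-variable linear regression equals the squared correlation."""
    return float(np.corrcoef(score, target)[0, 1] ** 2)

r2_pcr = r_squared(Xc @ pca_dir, yc)
r2_pls = r_squared(Xc @ pls_dir, yc)
print(np.round(pca_dir, 3), round(r2_pcr, 3))
print(np.round(pls_dir, 3), round(r2_pls, 3))
```

Here the first PCA direction lines up with the noisy high-variance feature and explains almost none of $Y$, while the first PLSR direction lines up with the feature that actually drives $Y$.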

---

## Think About It

- Describe a situation in which a particular feature (or principal component) would have low variance but would still be important in making a prediction. (For instance, imagine that we don’t standardize the data.)  