**SM339 &#x25aa; Applied Statistics &#x25aa; Spring 2024 &#x25aa; Uhan**

# Lesson 21. Correlated Predictors

_Setup._ Tweak the width and height values below to adjust the size of your plots in this notebook.

In [None]:
options(repr.plot.width=8, repr.plot.height=8)

## What is multicollinearity?

- Roughly speaking: _when predictors are related to each other_

- Formal definition: A set of predictors exhibits __multicollinearity__ when one or more of the predictors is strongly correlated with some linear combination of the other predictors in the set

- Multicollinearity _doesn't always indicate a poor model_, but it can result in some odd behavior

- If two predictors are exactly linearly related (correlation = 1 or -1), then the least squares estimates for $\beta$ are not unique
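
To see what an exact linear relationship does to the estimates, here is a minimal sketch using made-up data (the variables `x1`, `x2`, and `y` are simulated for illustration and are not part of any course dataset). Because `x2` is an exact linear function of `x1`, `lm()` cannot estimate both coefficients and reports `NA` for one of them.

```r
# Simulated data (an assumption for illustration only)
set.seed(1)
x1 <- rnorm(30)
x2 <- 2 * x1 + 3          # x2 is an exact linear function of x1
y  <- 1 + x1 + rnorm(30)

cor(x1, x2)               # correlation of x1 and x2

fit <- lm(y ~ x1 + x2)
coef(fit)                 # the coefficient lm() cannot estimate is reported as NA
```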

## Red flags

### Example 1

Suppose we would like to predict the price of a house based on its $\mathit{Size}$ (in square feet) and its $\mathit{Lot}$ size (also in square feet).
The dataset `Houses` in `Stat2Data` contains relevant data for 20 houses sold in 2008 in a small Midwestern town.

In [None]:
library(Stat2Data)
data(Houses)
head(Houses)

#### a.
The two-predictor model that predicts $\mathit{Price}$ from $\mathit{Size}$ and $\mathit{Lot}$ is

$$ \mathit{Price} = \beta_0 + \beta_1 \mathit{Size} + \beta_2 \mathit{Lot} + \varepsilon \qquad \varepsilon \sim \text{iid } N(0, \sigma_{\varepsilon}^2) $$

#### b.
Fit the model in R.
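
One way to fit this model is sketched below. It assumes the `Houses` data frame has columns named `Price`, `Size`, and `Lot`, matching the model statement above; the object name `fit_houses` is just a label chosen here.

```r
library(Stat2Data)
data(Houses)

# Two-predictor model: Price explained by Size and Lot
fit_houses <- lm(Price ~ Size + Lot, data = Houses)
summary(fit_houses)
```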

#### c.
Check the diagnostic plots for linearity, equal variance, and normality.
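
A possible way to draw the two standard diagnostic plots, assuming the same column names as in the model statement (`Price`, `Size`, `Lot`). The residuals vs. fitted plot speaks to linearity and equal variance, and the normal Q-Q plot speaks to normality.

```r
library(Stat2Data)
data(Houses)

fit_houses <- lm(Price ~ Size + Lot, data = Houses)

par(mfrow = c(1, 2))          # show the two plots side by side
plot(fit_houses, which = 1)   # residuals vs. fitted values
plot(fit_houses, which = 2)   # normal Q-Q plot of the residuals
par(mfrow = c(1, 1))          # restore the default layout
```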

*Write your notes here. Double-click to edit.*

#### d.
What's strange about the summary output?

*Write your notes here. Double-click to edit.*

<div class="alert alert-danger">
    <h4>Red flag #1 for multicollinearity.</h4>
    The ANOVA F-test says that the model as a whole is effective, but none (or few) of the individual predictors are significant
</div>

- So... what's going on?

    - The individual $t$-tests are testing whether the predictor is helpful _given that the other predictor(s) are in the model_

    - Highly correlated predictors contain similar information, so if one is already in the model, including the other doesn't help significantly

    - We may want to try dropping a predictor and seeing if the resulting model performs similarly
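
The pattern behind this red flag can be reproduced with simulated data. The sketch below is illustrative only: `x1`, `x2`, and `y` are made up, and the sample size and noise levels are arbitrary choices. With predictors this strongly correlated, the overall F-test will typically have a very small p-value while the individual $t$-tests do not, though the exact numbers depend on the random draw.

```r
# Simulated data (an assumption for illustration only)
set.seed(21)
n  <- 20
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # x2 is nearly a copy of x1
y  <- x1 + x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
s   <- summary(fit)

cor(x1, x2)                      # correlation between the two predictors

# p-value for the overall (ANOVA) F-test
f <- s$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# p-values for the individual t-tests
coef(s)[, "Pr(>|t|)"]
```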

#### e.
Do $\mathit{Size}$ and $\mathit{Lot}$ seem to be "highly" correlated?
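
A quick check, sketched under the assumption that the columns are named `Size` and `Lot` as in the model statement, is to compute the sample correlation and look at a scatterplot:

```r
library(Stat2Data)
data(Houses)

r_size_lot <- cor(Houses$Size, Houses$Lot)   # sample correlation
r_size_lot

plot(Lot ~ Size, data = Houses)              # scatterplot of the two predictors
```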

*Write your notes here. Double-click to edit.*

#### f.
Check if $\mathit{Size}$ and/or $\mathit{Lot}$ are significant as __single__ predictors of $\mathit{Price}$.
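
One possible approach is to fit two simple linear regression models and look at the $t$-test for the slope in each. The column names are assumed to match the model statement, and the object names `fit_size` and `fit_lot` are labels chosen here.

```r
library(Stat2Data)
data(Houses)

fit_size <- lm(Price ~ Size, data = Houses)   # Size as the only predictor
fit_lot  <- lm(Price ~ Lot,  data = Houses)   # Lot as the only predictor

summary(fit_size)
summary(fit_lot)
```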

*Write your notes here. Double-click to edit.*

### Example 2

In a previous lesson, we looked at the `Perch` data, with measurements on 56 perch from a Finnish lake.
We decided that the best model for predicting $\mathit{Weight}$ from $\mathit{Length}$ and $\mathit{Width}$ included both individual predictors and an interaction term.

In [None]:
data(Perch)
head(Perch)

#### a.
What kind of association (positive, negative, none) do we expect between each predictor and the response?

*Write your notes here. Double-click to edit.*

#### b.
Re-fit this model from a previous lesson:

$$ \mathit{Weight} = \beta_0 + \beta_1 \mathit{Length} + \beta_2 \mathit{Width} + \beta_3 (\mathit{Length} \times \mathit{Width}) + \varepsilon \qquad \varepsilon \sim \text{iid } N(0, \sigma_{\varepsilon}^2) $$

What seems kind of strange?
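
A sketch of the re-fit, assuming the `Perch` columns are named `Weight`, `Length`, and `Width` as in the model above. In an R formula, `Length * Width` expands to both individual predictors plus their interaction.

```r
library(Stat2Data)
data(Perch)

# Length * Width is shorthand for Length + Width + Length:Width
fit_perch <- lm(Weight ~ Length * Width, data = Perch)
summary(fit_perch)
```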

*Write your notes here. Double-click to edit.*

<div class="alert alert-danger">
    <h4>Red flag #2 for multicollinearity.</h4>
    The signs of the estimated coefficients in the fitted model don't seem right
</div>

- So... what's going on?

    - $\mathit{Length}$ and $\mathit{Width}$ essentially show up twice each, due to the interaction term

    - All three predictors ($\mathit{Length}$, $\mathit{Width}$, and $\mathit{Length} \times \mathit{Width}$) are probably highly correlated (we can check this below)

    - Highly correlated predictors contain similar information &mdash; it's difficult to separate out their effects

    - __In this situation, we should not interpret individual coefficients, but we may still use the model to make predictions__

#### c.

Compute the correlation between $\mathit{Weight}$, $\mathit{Length}$, and $\mathit{Width}$.
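
A correlation matrix gives all of the pairwise correlations at once. The sketch below assumes the column names `Weight`, `Length`, and `Width`; the second matrix adds the product $\mathit{Length} \times \mathit{Width}$, computed here by hand, to see how the interaction term relates to the individual predictors.

```r
library(Stat2Data)
data(Perch)

# Pairwise correlations among the response and the two measured predictors
cor_main <- cor(Perch[, c("Weight", "Length", "Width")])
cor_main

# Pairwise correlations among the three terms in the model
cor_terms <- cor(cbind(Length      = Perch$Length,
                       Width       = Perch$Width,
                       LengthWidth = Perch$Length * Perch$Width))
cor_terms
```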

<div class="alert alert-info">
    <h3>Recap: red flags for multicollinearity</h3>

The signature red flags for multicollinearity are:

1. The ANOVA F-test says that the model as a whole is effective, but none (or few) of the individual predictors are significant (Example 1)

2. The signs of the estimated coefficients in the fitted model don't seem right (Example 2)
</div>

## Formal detection of multicollinearity

- For each predictor $X_i$ in a model, the __variance inflation factor (VIF)__ is computed as

    $$ \mathit{VIF}_i = \frac{1}{1 - R_i^2} $$
    
    where $R_i^2$ is the coefficient of multiple determination for a model that predicts $X_i$ using the other predictors in the model
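
To make the definition concrete, the VIFs can be computed by hand: regress each predictor on the others, take $R_i^2$, and apply the formula. The sketch below uses the built-in `mtcars` data with three predictors (`wt`, `disp`, `hp`) picked purely for illustration; they are not part of this lesson's examples.

```r
# mtcars ships with R; these predictors are chosen only for illustration
predictors <- c("wt", "disp", "hp")

vif_by_hand <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Auxiliary regression: predictor p explained by the other predictors
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

vif_by_hand
```

Since $0 \le R_i^2 < 1$ whenever the auxiliary regression is not a perfect fit, every VIF is at least 1, and it grows without bound as $R_i^2$ approaches 1.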
    
<div class="alert alert-info">
    <h3>Rule of thumb for using VIF</h3>
    As a <em>rough</em> rule, we are usually concerned about multicollinearity if <em>any VIF > 5</em>
</div>

### Example 3

What are the VIFs for the predictors in the model in the Perch example?
Should we be concerned about multicollinearity?

_Note._ You may need to install the `car` package first:

```r
install.packages('car') 
```

You only need to do this once per computer.
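
Once `car` is installed, its `vif()` function computes the VIFs from a fitted model. The sketch below assumes the `Perch` column names used earlier. Depending on the installed version of `car`, `vif()` may print a note about higher-order terms because the model contains an interaction.

```r
library(Stat2Data)
library(car)
data(Perch)

fit_perch <- lm(Weight ~ Length * Width, data = Perch)

vifs <- vif(fit_perch)   # one VIF per term in the model
vifs
```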

*Write your notes here. Double-click to edit.*

## What should/can we do about it?

- Multicollinearity does not necessarily indicate a bad model

- The related predictors might all be important in the model, like in the Perch example

- If our rule of thumb detects multicollinearity, _we should discount individual coefficients and $t$-tests_

- We can still use the model to make predictions, in spite of multicollinearity, although it may be an indication that a simpler model would be sufficient

- If the purpose of the model is to investigate the effects of the individual predictors (i.e., we want to interpret the individual $\beta$s), then we can:

    1. _Drop a predictor_ and check if the reduced model is about as effective as the full model
    
    2. _Combine predictors_ &ndash; this is common with survey data
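
Both options can be sketched with simulated data. Everything below is made up for illustration: `x1` and `x2` are two strongly correlated predictors, and averaging them is just one simple way to combine predictors, not a prescribed method.

```r
# Simulated data (an assumption for illustration only)
set.seed(339)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # strongly correlated with x1
y  <- 2 + x1 + x2 + rnorm(n)

# Option 1: drop a predictor and compare the reduced model to the full model
full    <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x1)
c(full    = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
anova(reduced, full)            # nested F-test for the dropped predictor

# Option 2: combine the predictors into a single variable
combined <- (x1 + x2) / 2
comb_fit <- lm(y ~ combined)
summary(comb_fit)$adj.r.squared
```

If the adjusted $R^2$ values are close and the nested F-test is not significant, the simpler model is doing about as well as the full one.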