**SM339 &#x25aa; Applied Statistics &#x25aa; Spring 2024 &#x25aa; Uhan**

# Lesson 20. Using Existing Predictors to Create New Predictors &mdash; Part 2

_Setup._ Tweak the width and height values below to adjust the size of your plots in this notebook.

```r
options(repr.plot.width=8, repr.plot.height=8)
```

## Including polynomial terms

- Consider the following model with polynomial terms:

    $$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \varepsilon \qquad \varepsilon \sim N(0, \sigma_{\varepsilon}^2) $$

- We can fit such a model with the following R code:

```r
fit <- lm(y ~ x1 + I(x1^2) + x2 + I(x2^2))
```

- `I()` ensures that operators such as `^` are treated as arithmetic: inside a formula, `^` otherwise means crossing terms, not raising to a power
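
To see why `I()` matters, here is a minimal sketch on simulated data (the variable names and the numbers used to generate the data are made up for illustration). Without `I()`, R reads `x1^2` as a formula operation that reduces to just `x1`, so the squared term is silently dropped.

```r
# Simulated data for illustration only
set.seed(1)
x1 <- runif(50, 0, 10)
y <- 2 + 3 * x1 - 0.5 * x1^2 + rnorm(50)

# Without I(): `^` is formula crossing, so x1^2 reduces to x1
fit_no_I <- lm(y ~ x1 + x1^2)

# With I(): x1^2 is computed arithmetically and enters as its own term
fit_I <- lm(y ~ x1 + I(x1^2))

names(coef(fit_no_I))  # intercept and x1 only
names(coef(fit_I))     # intercept, x1, and I(x1^2)
```

Comparing the coefficient names is a quick way to confirm that every term you intended actually made it into the model.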

## Including an interaction term

- Consider the following model with an interaction term:

    $$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon \qquad \varepsilon \sim N(0, \sigma_{\varepsilon}^2) $$

- We can fit such a model with the following R code:

```r
fit <- lm(y ~ x1 + x2 + x1:x2)
```
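
R's formula syntax also offers the shorthand `x1 * x2`, which expands to `x1 + x2 + x1:x2`. The sketch below checks this on simulated data (the data and coefficient values are made up for illustration).

```r
# Simulated data for illustration only
set.seed(2)
x1 <- rnorm(40)
x2 <- rnorm(40)
y <- 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + rnorm(40)

# Main effects and interaction written out explicitly
fit_long <- lm(y ~ x1 + x2 + x1:x2)

# Same model using the `*` shorthand
fit_short <- lm(y ~ x1 * x2)

# Compare the estimated coefficients of the two fits
all.equal(coef(fit_long), coef(fit_short))
```

Writing the terms out explicitly, as in the code above the sketch, makes it easier to match each coefficient to the model equation.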

## Example 1

The dataset `Perch` contains the weight (in grams), length (in centimeters), and width (in centimeters) for 56 perch caught in a lake in Finland.

![https://www.flickr.com/photos/chesbayprogram/26004012710](img/perch.jpg)

```r
library(Stat2Data)
data(Perch)
```

We would like to find a model that does a good job predicting perch weight based on the fish's width and/or length.
We'll explore three potential models.
For each one, report whether the linearity condition appears to be met, and the $R^2_{adj}$ value.
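
One way to carry out both checks for any fitted model is sketched below. The helper `check_fit()` is hypothetical (it is not part of any package), and the demonstration uses simulated data rather than `Perch`. The commented-out last line shows how it might be applied to a model for the perch data, assuming `Perch` has columns named `Weight`, `Width`, and `Length`.

```r
# Hypothetical helper: residuals vs. fitted plot, plus adjusted R^2
check_fit <- function(fit) {
  # A curved pattern in this plot suggests the linearity condition is not met
  plot(fitted(fit), resid(fit),
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
  summary(fit)$adj.r.squared
}

# Simulated data with a curved relationship, for illustration only
set.seed(3)
x <- runif(60, 1, 10)
y <- 5 + 0.8 * x^2 + rnorm(60, sd = 2)

adj_r2_linear <- check_fit(lm(y ~ x))            # residual plot shows curvature
adj_r2_quad <- check_fit(lm(y ~ x + I(x^2)))     # curvature is accounted for

# Possible use on the perch data (assumes columns Weight, Width, Length):
# check_fit(lm(Weight ~ Width + Length, data = Perch))
```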

### a.
__Model 1.__
A two-predictor model using linear terms for both predictors.

$$ \mathit{Weight} = \beta_0 + \beta_1 \mathit{Width} + \beta_2 \mathit{Length} + \varepsilon \qquad \varepsilon \sim N(0, \sigma_{\varepsilon}^2) $$

*Write your notes here. Double-click to edit.*

### b.
__Model 2.__
A model that includes both predictors and their interaction.

$$ \mathit{Weight} = \beta_0 + \beta_1 \mathit{Width} + \beta_2 \mathit{Length} + \beta_3 (\mathit{Width} \times \mathit{Length}) + \varepsilon \qquad \varepsilon \sim N(0, \sigma_{\varepsilon}^2) $$

*Write your notes here. Double-click to edit.*

### c.
__Model 3.__
A complete second order model.

$$ \mathit{Weight} = \beta_0 + \beta_1 \mathit{Width} + \beta_2 \mathit{Length} + \beta_3 (\mathit{Width} \times \mathit{Length}) + \beta_4 \mathit{Width}^2 + \beta_5 \mathit{Length}^2 + \varepsilon \qquad \varepsilon \sim N(0, \sigma_{\varepsilon}^2) $$

*Write your notes here. Double-click to edit.*

### d.
Based on your findings above, which of these models do you think is best? Why? Consider diagnostic plots, $R_{adj}^2$, and model complexity.

*Write your notes here. Double-click to edit.*

<div class="alert alert-warning">
    Remember:
    <blockquote>All models are wrong, but some are useful. -George Box</blockquote>
</div>

## Some more notes

- Why not just throw in everything we can think of?

- This will improve the fit to our sampled data, but probably won't generalize to the population

- Also, each additional predictor costs a degree of freedom when estimating the error: with $n$ observations and $k$ predictors, only $n - k - 1$ remain. Standard errors (SE) tend to grow, so confidence intervals and prediction intervals get wider, making it "harder" to reject null hypotheses

- So we want to include only the terms that make a significant difference; extra terms just add noise
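
The notes above can be illustrated with a small simulation. This is a sketch under made-up assumptions: the response depends on a single predictor `x`, and ten extra predictors of pure noise are added to the model.

```r
# Simulated data for illustration only
set.seed(4)
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Ten predictors of pure noise, unrelated to y
noise <- matrix(rnorm(n * 10), nrow = n)

fit_small <- lm(y ~ x)
fit_big <- lm(y ~ x + noise)

# R^2 cannot decrease when terms are added to a model
c(small = summary(fit_small)$r.squared, big = summary(fit_big)$r.squared)

# Adjusted R^2 penalizes the extra terms
c(small = summary(fit_small)$adj.r.squared, big = summary(fit_big)$adj.r.squared)

# Residual degrees of freedom: n - k - 1
c(small = df.residual(fit_small), big = df.residual(fit_big))
```

The larger model fits the sample at least as well by $R^2$, but it does so with far fewer residual degrees of freedom, which is why $R^2$ alone is a poor guide for choosing between models.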

## Recap: ways to compare models

- Adjusted $R^2$

- Simplicity: if the $R^2_{adj}$ values are close, prefer the simpler model

- Check linearity condition (and other conditions if inference is desired)

- Individual $t$-test $p$-values (for single terms)
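
As a rough sketch of how these comparisons might look in R, the code below fits three candidate models to simulated data (the data frame `d` and its columns are made up for illustration), then collects the adjusted $R^2$ values and the individual $t$-test $p$-values.

```r
# Simulated data for illustration only
set.seed(5)
d <- data.frame(x1 = runif(50, 0, 5), x2 = runif(50, 0, 5))
d$y <- 1 + 2 * d$x1 + 0.5 * d$x1 * d$x2 + rnorm(50)

# Three candidate models of increasing complexity
models <- list(
  linear       = lm(y ~ x1 + x2, data = d),
  interaction  = lm(y ~ x1 * x2, data = d),
  second_order = lm(y ~ x1 * x2 + I(x1^2) + I(x2^2), data = d)
)

# Adjusted R^2 for each candidate model
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
adj_r2

# Individual t-test p-values for each term in one model
p_values <- summary(models$interaction)$coefficients[, "Pr(>|t|)"]
p_values
```

Keeping the candidate models in a named list makes it easy to apply the same check to each one.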