**SM339 &#x25aa; Applied Statistics &#x25aa; Spring 2023 &#x25aa; Uhan**

# Lesson 21. Techniques for Choosing Variables

*Edit and run the cell below to resize the plots.*

```r
options(repr.plot.width=8, repr.plot.height=8)
```

## Overview

- A __term__ is a predictor, function of a predictor (like quadratic terms), or quantity derived from more than one predictor 

- Suppose we have many potential terms that we can include in our model

- If we have $k$ possible terms, then how many possible models are there? 

*Write your notes here. Double-click to edit.*

- In this lesson, we will learn about techniques for choosing a "good" set of predictors
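The counting question above can be checked by brute force for a small $k$. The sketch below uses three made-up term names (`x1`, `x2`, `x3` are placeholders, not variables from any dataset) and lists every include/exclude combination. Since each term is either in or out of a model, the number of rows should match $2^k$.

```r
# Hypothetical term names, used only to illustrate the counting argument
term_names <- c("x1", "x2", "x3")
k <- length(term_names)

# Each row is one candidate model: TRUE means the term is included
inclusion <- expand.grid(rep(list(c(FALSE, TRUE)), k))
names(inclusion) <- term_names

inclusion
nrow(inclusion)   # number of candidate models, counting the intercept-only model
2^k               # same count from the in-or-out argument
```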

## Example 1

The dataset `FirstYearGPA` from the `Stat2Data` library contains measurements on 219 college students.
The response variable is $\mathit{GPA}$ (grade point average after one year of college).
The potential predictors are:

| Variable | Description | 
| :- | :- |
| _HSGPA_ | High school GPA |
| _SATV_ | Verbal/critical reading SAT score |
| _SATM_ | Math SAT score |
| _Male_ | 1 for male, 0 for female |
| _HU_ | Number of credit hours earned in humanities courses in high school |
| _SS_ | Number of credit hours earned in social science courses in high school |
| _FirstGen_ | 1 if the student is first in their family to attend college |
| _White_ | 1 for white students, 0 for others |
| _CollegeBound_ | 1 if the student attended a high school where $\ge$ 50% of students intend to go on to college |

```r
library(Stat2Data)
data(FirstYearGPA)
head(FirstYearGPA)
```

- Let's start by creating scatterplots between all the _quantitative_ variables in `FirstYearGPA`

- This way, we can visually see the correlations between our response variable $\mathit{GPA}$ and the other quantitative variables

- We can create a dataframe using certain existing columns from another dataframe like this:

    ```r
    FirstYearGPA[, c('GPA', 'HSGPA', 'SATV', 'SATM', 'HU', 'SS')]
    ```
   <br>
    
    - We've used a similar construct before to select certain rows of a dataframe
    - Note the placement of the comma!<br><br>
    
- Then, we can use the `pairs()` function to get scatterplots between all the variables in the dataframe, like this:
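One possible way to fill in this step is sketched below. The name `quant_vars` is just an illustrative choice.

```r
library(Stat2Data)
data(FirstYearGPA)

# Keep only the response and the quantitative predictors
quant_vars <- FirstYearGPA[, c('GPA', 'HSGPA', 'SATV', 'SATM', 'HU', 'SS')]

# Scatterplot matrix: one panel for every pair of columns
pairs(quant_vars)
```

The first row (or first column) of the resulting grid shows $\mathit{GPA}$ against each of the other quantitative variables.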

- We can also create boxplots of our response variable $\mathit{GPA}$ _by group_ for the four categorical predictors, like this:
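A minimal sketch of the grouped boxplots follows. It loops over the four 0/1 predictors and uses the formula interface of `boxplot()`. The 2-by-2 layout and the name `categorical_vars` are illustrative choices, not something the lesson prescribes.

```r
library(Stat2Data)
data(FirstYearGPA)

categorical_vars <- c('Male', 'FirstGen', 'White', 'CollegeBound')

# Arrange the four boxplots in a 2-by-2 grid, remembering the old layout
old_par <- par(mfrow = c(2, 2))

for (v in categorical_vars) {
    # reformulate() builds the formula GPA ~ <grouping variable>
    boxplot(reformulate(v, response = 'GPA'), data = FirstYearGPA,
            xlab = v, ylab = 'GPA')
}

par(old_par)   # restore the previous plotting layout
```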

- We can also get the correlation between $\mathit{GPA}$ and all the other possible predictors, like this:
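One way to compute these correlations is sketched below. It keeps only the numeric columns before calling `cor()`, as a precaution in case any column is not stored as a number, and then pulls out the column of correlations with $\mathit{GPA}$.

```r
library(Stat2Data)
data(FirstYearGPA)

# Keep numeric columns only (the 0/1 indicators count as numeric)
numeric_cols <- FirstYearGPA[, sapply(FirstYearGPA, is.numeric)]

# Full correlation matrix, then the column of correlations with GPA
cor_matrix <- cor(numeric_cols)
gpa_cor <- cor_matrix[, 'GPA']

# Drop GPA's correlation with itself and round for readability
gpa_cor_others <- gpa_cor[names(gpa_cor) != 'GPA']
round(gpa_cor_others, 3)
```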

* Based on the plots and computations above, which predictors do you think are promising?

*Write your notes here. Double-click to edit.*

## Best subsets regression

- __Best subsets regression__ chooses predictors by comparing _all possible subsets of predictors_ according to some metric 

    - For example, adjusted $R^2$

- Given the power and speed of today's computers, this is feasible, as long as the number of predictors is not too large

- In R, we can use the `regsubsets()` function from the `leaps` library 

    - _Note._ You may need to install the `leaps` library first:

        ```r
        install.packages('leaps')
        ```

## Example 2

Continuing the `FirstYearGPA` example in Example 1 above...

* We can run the best subsets regression procedure like this:

```r
library(leaps)
```

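```r
models <- regsubsets(GPA ~ HSGPA + SATV + SATM + Male + HU + SS + FirstGen + White + CollegeBound, 
                     data = FirstYearGPA, nbest = 2)
# Avoid calling the result `sum`, which would mask base R's sum() function
models_summary <- summary(models)
# Inclusion table, followed by R^2, adjusted R^2, and Mallows's Cp
cbind(as.data.frame(models_summary$outmat), rsq = models_summary$rsq,
      adjr2 = models_summary$adjr2, cp = models_summary$cp)
```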

- The `nbest = 2` keyword argument in `regsubsets()` tells R to output the information for the 2 models with the highest $R^2$ at each size

- The second to last column of the `regsubsets()` output table shows the adjusted $R^2$ for each model

- Suppose that our goal is to find a model that maximizes the adjusted $R^2$ 

- Which is the "best" model under this criterion?

*Write your notes here. Double-click to edit.*
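Rather than scanning the table by eye, the row with the largest adjusted $R^2$ can be located with `which.max()`. This is only a sketch: the names `models_summary` and `best_adjr2_row` are illustrative, and the search covers only the models that `regsubsets()` reports.

```r
library(Stat2Data)
library(leaps)
data(FirstYearGPA)

models <- regsubsets(GPA ~ HSGPA + SATV + SATM + Male + HU + SS + FirstGen + White + CollegeBound,
                     data = FirstYearGPA, nbest = 2)
models_summary <- summary(models)

# Row of the summary table with the largest adjusted R^2
best_adjr2_row <- which.max(models_summary$adjr2)

models_summary$outmat[best_adjr2_row, ]   # "*" marks the predictors in that model
models_summary$adjr2[best_adjr2_row]      # its adjusted R^2
```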

* Let's fit the model with the highest adjusted $R^2$:
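One way to fit this model without retyping its predictors is sketched below. It reads the predictor names from the `regsubsets()` results and builds the formula with `reformulate()`. This assumes the coefficient names match the column names of `FirstYearGPA`, which holds when every predictor is numeric. The names `best_terms` and `fit_adjr2` are illustrative.

```r
library(Stat2Data)
library(leaps)
data(FirstYearGPA)

models <- regsubsets(GPA ~ HSGPA + SATV + SATM + Male + HU + SS + FirstGen + White + CollegeBound,
                     data = FirstYearGPA, nbest = 2)
models_summary <- summary(models)

# Predictors in the reported model with the largest adjusted R^2
best_adjr2_row <- which.max(models_summary$adjr2)
best_terms <- names(coef(models, best_adjr2_row))[-1]   # drop "(Intercept)"

# Refit with lm() to get the usual coefficient table and t-tests
fit_adjr2 <- lm(reformulate(best_terms, response = 'GPA'), data = FirstYearGPA)
summary(fit_adjr2)
```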

- It appears that two of the predictors, $\mathit{Male}$ and $\mathit{SS}$, would not be significant at the 0.05 level

- Could a simpler model be just as effective?

- Should we perhaps use another criterion for selecting the "best" model?

## Mallows's $C_p$

- The criteria we have used so far to evaluate a model (e.g., $R^2$, adjusted $R^2$, individual $t$-tests, etc.) depend _only_ on the predictors in the model being evaluated

- They do _not_ take into account what information might be available in the other potential predictors that aren't in the model

- __Mallows's $C_p$__ is a measure of model quality that _does_ consider other potential predictors

- These are the values in the last column of the table output by `regsubsets()` above

- We prefer models where $C_p$ is _small_

- For details on how Mallows's $C_p$ is computed, see Section 4.2 in STAT2 
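To make the idea concrete, here is a hedged sketch that computes $C_p$ by hand. It assumes the common textbook form $C_p = \mathit{SSE}_m / \mathit{MSE}_k + 2(m+1) - n$, where $m$ is the number of predictors in the reduced model, $k$ is the number in the full model, and $n$ is the sample size. Check this against the STAT2 definition before relying on it. The helper `mallows_cp()` is hypothetical, and the reduced model is an arbitrary illustration.

```r
library(Stat2Data)
data(FirstYearGPA)

# Hypothetical helper (not from any package): Mallows's Cp for a reduced
# model, measured against the full model with all candidate predictors
mallows_cp <- function(reduced, full) {
    n <- nobs(full)
    m <- length(coef(reduced)) - 1                       # predictors in the reduced model
    sse_reduced <- sum(resid(reduced)^2)                 # SSE of the reduced model
    mse_full <- sum(resid(full)^2) / df.residual(full)   # MSE of the full model
    sse_reduced / mse_full + 2 * (m + 1) - n
}

full <- lm(GPA ~ HSGPA + SATV + SATM + Male + HU + SS + FirstGen + White + CollegeBound,
           data = FirstYearGPA)

# Illustrative reduced model; this choice of predictors is arbitrary
reduced <- lm(GPA ~ HSGPA + SATV, data = FirstYearGPA)

mallows_cp(reduced, full)
mallows_cp(full, full)   # for the full model the formula reduces to k + 1 = 10
```

Because the full model's $\mathit{MSE}$ appears in the denominator, the value for a reduced model depends on the predictors left out of it, which is the sense in which $C_p$ "considers other potential predictors."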

## Example 3

Continuing the `FirstYearGPA` example from Examples 1 and 2...

* Which model has the smallest $C_p$?

*Write your notes here. Double-click to edit.*

* Let's fit the model with the smallest $C_p$:
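A sketch of this step follows. It finds the reported model with the smallest $C_p$ using `which.min()`, then reads off its predictors and refits it with `lm()`. As before, it assumes the coefficient names match the column names of `FirstYearGPA`, and the names `cp_terms` and `fit_cp` are illustrative.

```r
library(Stat2Data)
library(leaps)
data(FirstYearGPA)

models <- regsubsets(GPA ~ HSGPA + SATV + SATM + Male + HU + SS + FirstGen + White + CollegeBound,
                     data = FirstYearGPA, nbest = 2)
models_summary <- summary(models)

# Row of the summary table with the smallest Cp, and that model's predictors
min_cp_row <- which.min(models_summary$cp)
cp_terms <- names(coef(models, min_cp_row))[-1]   # drop "(Intercept)"
cp_terms

fit_cp <- lm(reformulate(cp_terms, response = 'GPA'), data = FirstYearGPA)
summary(fit_cp)
```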

- Note that this model omits the $\mathit{Male}$ predictor, which was not significant in the six-predictor model with the highest adjusted $R^2$

- However, $\mathit{SS}$ is still not significant at the 0.05 level

- Looking at the `regsubsets()` output table above, there is a four-predictor model that removes $\mathit{SS}$ with essentially the same $C_p$ of 3.900

- Let's fit this model:
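One way to fit it is sketched below. The code starts from the predictors of the smallest-$C_p$ model and drops `SS` with `setdiff()`. It assumes, as described above, that the smallest-$C_p$ model does contain `SS`. If it does not, `four_terms` is simply the same set of predictors. The names `four_terms` and `fit_four` are illustrative.

```r
library(Stat2Data)
library(leaps)
data(FirstYearGPA)

models <- regsubsets(GPA ~ HSGPA + SATV + SATM + Male + HU + SS + FirstGen + White + CollegeBound,
                     data = FirstYearGPA, nbest = 2)
models_summary <- summary(models)

# Predictors of the smallest-Cp model, with SS removed
cp_terms <- names(coef(models, which.min(models_summary$cp)))[-1]
four_terms <- setdiff(cp_terms, 'SS')
four_terms

fit_four <- lm(reformulate(four_terms, response = 'GPA'), data = FirstYearGPA)
summary(fit_four)
```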

- Now all predictors are significant at the 0.05 level

- In the interest of parsimony, we would typically prefer the simpler four-predictor model, since the increase in $C_p$ is very small

## Notes

- We saw in Examples 2 and 3 that the choice of a "best" model can differ depending on the metric used

- Often, there is more than one model that does a good job of predicting a response variable

- It is possible for different statisticians who are studying the same dataset to come up with somewhat different regression models

- "Best" is misleading: we are not searching for one true ideal model, but for a good model that helps us answer the question of interest