# Tutorial 9: Prediction and Model Selection

#### Lecture and Tutorial Learning Goals:

By the end of this section, students will be able to:

- Explain the difference between confidence intervals for prediction and prediction confidence intervals and what elements need to be estimated to construct these intervals.

- Write a computer script to calculate these intervals. Interpret and communicate the results from that computer script.

- Give an example of a question that can be answered by predictive modelling.

- Explain the algorithms for the following variable selection methods:
  - Forward selection
  - Backward selection

- Explain when a linear regression is an appropriate model to predict new outcomes based on new values of the input variables.

- List model metrics that are suitable for evaluation of a statistical model developed for the purpose of predictive modelling (e.g., RMSE), as well as how they are calculated.

- Discuss how different estimation methods can result in different predictions.

In [1]:
# Run this cell before continuing.
library(tidyverse)
library(broom)
library(repr)
library(infer)
library(gridExtra)
library(faraway)
library(mltools)
library(leaps)
library(glmnet)
library(cowplot)
source("tests_tutorial_09.R")


## 1. Prediction CI *versus* CI for Prediction

In previous lectures, we learned how to estimate LR models and use them to make inferences about the population parameters. In this lecture, we will learn different concepts related to *prediction*.

> **Heads up**: It is important to distinguish *in-sample* prediction from *out-of-sample* prediction.

We have seen different measures that compare the *in-sample* values of the response with their corresponding predicted values from an LR to evaluate the goodness of fit of the model.

In this first section we are going to recognize and measure the *uncertainty* of these predictions.

Let us start by loading the dataset to be used throughout this tutorial. We will use the dataset `fat` from the package `faraway`. You can find detailed information about it in [Johnson (1996)](https://www.tandfonline.com/doi/full/10.1080/10691898.1996.11910505). This dataset contains the percentage of body fat and a variety of body measurements (continuous variables) of 252 men. We will use the variable `brozek` as the response variable and a subset of 14 variables to build different models.

Run the code below to create the working data frame called `fat_sample`.

In [2]:
fat_sample <- fat %>%
  select(
    brozek, age, weight, height, adipos, neck, chest, abdom,
    hip, thigh, knee, ankle, biceps, forearm, wrist
  )

head(fat_sample,3)

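| | brozek | age | weight | height | adipos | neck | chest | abdom | hip | thigh | knee | ankle | biceps | forearm | wrist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 12.6 | 23 | 154.25 | 67.75 | 23.7 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
| 2 | 6.9 | 22 | 173.25 | 72.25 | 23.4 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
| 3 | 24.6 | 22 | 154.0 | 66.25 | 24.7 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |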


The response variable `brozek` is the percent of body fat using Brozek's equation:

$$\texttt{brozek} = \frac{457}{\texttt{density}} - 414.2,$$

where body `density` is measured in $\text{g}/\text{cm}^3$.
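
As a quick illustration of the formula above, the sketch below evaluates Brozek's equation for a few body densities. The function name `brozek_from_density` and the density values are hypothetical and chosen only for illustration; they are not taken from the `fat` dataset.

```r
# Brozek's equation as stated above: percent body fat from body density (g/cm^3).
brozek_from_density <- function(density) {
  457 / density - 414.2
}

# Hypothetical densities, for illustration only.
example_density <- c(1.03, 1.05, 1.07)
round(brozek_from_density(example_density), 1)
```

Because density appears in the denominator, a higher body density corresponds to a lower estimated body fat percentage.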

The 14 input variables are:

- `age`: Age in $\text{years}$.
- `weight`: Weight in $\text{lb}$.
- `height`: Height in $\text{in}$.
- `adipos`: Adiposity index in $\text{kg}/\text{m}^2$, computed with the weight expressed in $\text{kg}$ and the height expressed in $\text{m}$: $\texttt{adipos} = \texttt{weight} / \texttt{height}^2$.

- `neck`: Neck circumference in $\text{cm}$.
- `chest`: Chest circumference in $\text{cm}$.
- `abdom`: Abdomen circumference at the umbilicus and level with the iliac crest in $\text{cm}$.
- `hip`: Hip circumference in $\text{cm}$.
- `thigh`: Thigh circumference in $\text{cm}$.
- `knee`: Knee circumference in $\text{cm}$.
- `ankle`: Ankle circumference in $\text{cm}$.
- `biceps`: Extended biceps circumference in $\text{cm}$.
- `forearm`: Forearm circumference in $\text{cm}$.
- `wrist`: Wrist circumference distal to the styloid processes in $\text{cm}$.
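
Since `weight` and `height` are recorded in pounds and inches while `adipos` is reported in $\text{kg}/\text{m}^2$, reproducing the adiposity index requires a unit conversion first. Here is a minimal sketch using the standard conversion factors and the first row of `fat_sample` shown above; the helper name `adiposity_index` is an assumption made for this example.

```r
# Adiposity index (kg/m^2) from weight in pounds and height in inches.
adiposity_index <- function(weight_lb, height_in) {
  weight_kg <- weight_lb * 0.45359237  # pounds to kilograms
  height_m  <- height_in * 0.0254      # inches to metres
  weight_kg / height_m^2
}

# First row of fat_sample: weight = 154.25 lb, height = 67.75 in.
adiposity_index(154.25, 67.75)
```

The result should land within a few tenths of the `adipos` value listed for that row (23.7); small differences are expected because the recorded measurements are rounded.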

**Question 1.0**
<br>{points: 1}

Let's start by building an SLR using only `weight` to predict `brozek`.

Use the `lm()` function to estimate the SLR. Store this estimated model in the variable `SLR_fat`.

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [3]:
# SLR_fat <- ...(..., ...)
# SLR_fat

# your code here
SLR_fat <- lm(brozek ~ weight, data = fat_sample)
SLR_fat



Call:
lm(formula = brozek ~ weight, data = fat_sample)

Coefficients:
(Intercept)       weight  
    -9.9952       0.1617  


In [4]:
test_1.0()

Test passed 🎊
Test passed 😸
[1] "Success!"
Test passed 😸
Test passed 🎉
[1] "Success!"


**Question 1.1**
<br>{points: 1}

In previous lectures, we have learned how to obtain and interpret confidence intervals for the regression parameters. 

Since the predictions are functions of the estimated LR, they also depend on the sample used! A different sample would have resulted in a different estimated LR and different predictions! As discussed for the estimation of the regression parameters, we can obtain confidence intervals that take into account the sample-to-sample variation of the predictions as well!

There are two types of intervals we can construct, depending on the quantity we want to predict: *confidence intervals for prediction* and *prediction confidence intervals*.

> **Heads up**: The two names are easy to confuse! The first targets the *average* response, while the second targets an *actual* (individual) response.

Let's start by computing *confidence intervals for prediction*. These are intervals to predict the *average* brozek index for men of different weights. 

Using `SLR_fat` and `predict`, obtain the asymptotic 95% CIP (confidence intervals for prediction). Create a dataframe, called `fat_cip`, that contains the response, the input, the predictions, and the lower and upper bounds of the intervals for each observation **in that order from left-to-right**. 

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [5]:
# fat_cip <- fat_sample  %>% 
#    select(..., ...) %>% 
#    cbind(predict(...,interval="confidence",se.fit=TRUE)$fit)  %>% 
#    mutate_if(is.numeric, round, 3)
# head(fat_cip)


# your code here
fat_cip <- fat_sample  %>% 
   select(brozek, weight) %>% 
   cbind(predict(SLR_fat, interval="confidence", se.fit=TRUE)$fit)  %>% 
   mutate_if(is.numeric, round, 3)
head(fat_cip)

| | brozek | weight | fit | lwr | upr |
|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 12.6 | 154.25 | 14.948 | 13.954 | 15.943 |
| 2 | 6.9 | 173.25 | 18.021 | 17.246 | 18.796 |
| 3 | 24.6 | 154.0 | 14.908 | 13.909 | 15.907 |
| 4 | 10.9 | 184.75 | 19.881 | 19.105 | 20.657 |
| 5 | 27.8 | 184.25 | 19.8 | 19.026 | 20.573 |
| 6 | 20.6 | 210.25 | 24.004 | 22.89 | 25.118 |


In [6]:
test_1.1()

Test passed 😀
Test passed 😸
Test passed 🎊
Test passed 🎉
Test passed 🌈
Test passed 🥳
Test passed 🥳
Test passed 🎉
Test passed 🌈
[1] "Success!"
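
The confidence interval for prediction from Question 1.1 can also be assembled by hand, which makes explicit what has to be estimated: the fitted mean and its standard error. The sketch below uses simulated data rather than the `fat` dataset, and the names `toy`, `fit`, and `cip_manual` are assumptions made for illustration. It relies on `predict()` for `lm` objects using a $t$ quantile with the residual degrees of freedom.

```r
set.seed(1)  # simulated data, for illustration only
x <- runif(50, 100, 250)
y <- -10 + 0.16 * x + rnorm(50, sd = 5)
toy <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = toy)

# Fitted means and the standard errors of those fitted means.
pred <- predict(fit, se.fit = TRUE)
t_crit <- qt(0.975, df = fit$df.residual)  # t quantile for a 95% interval

cip_manual <- cbind(
  fit = pred$fit,
  lwr = pred$fit - t_crit * pred$se.fit,
  upr = pred$fit + t_crit * pred$se.fit
)

# Compare against the built-in interval.
cip_builtin <- predict(fit, interval = "confidence", level = 0.95)
all.equal(unname(cip_manual), unname(cip_builtin))
```

The comparison at the end checks that the manual interval and the one returned by `predict()` agree up to numerical tolerance.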


**Question 1.2**
<br>{points: 1}

We have just calculated the 95% confidence interval for the mean brozek index for men of different weights in our sample. 

Provide a brief interpretation for the 95% confidence interval for prediction you have calculated in row 1.

> *Your answer goes here.*

Row 1: with 95% confidence, the mean body fat percentage (`brozek`) of men weighing 154.25 lb is between 13.954% and 15.943%.

**Question 1.3**
<br>{points: 1}

Let's now compute and interpret *prediction confidence intervals*. These are intervals to predict the (actual) brozek index for men of different weights.  

You can use `SLR_fat` and `predict` again to obtain the asymptotic 95% PI (prediction intervals) by changing the argument `interval`. Create a dataframe, called `fat_pi`, that contains the response, the input, the predictions, and the lower and upper bounds of the intervals for each observation, **in that order from left to right**.

> **Heads up**: Read the warning message! Since your goal is to predict an actual value, it is important to note that these observations do not come from a test set.

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [7]:
# fat_pi <- fat_sample  %>% 
#    select(..., ...) %>% 
#    cbind(predict(...,interval="prediction",se.fit=TRUE)$fit)  %>% 
#    mutate_if(is.numeric, round, 3)
# head(fat_pi)


# your code here
fat_pi <- fat_sample  %>% 
   select(brozek, weight) %>% 
   cbind(predict(SLR_fat,interval="prediction",se.fit=TRUE)$fit)  %>% 
   mutate_if(is.numeric, round, 3)
head(fat_pi)

“predictions on current data refer to _future_ responses”


| | brozek | weight | fit | lwr | upr |
|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 12.6 | 154.25 | 14.948 | 2.824 | 27.072 |
| 2 | 6.9 | 173.25 | 18.021 | 5.913 | 30.129 |
| 3 | 24.6 | 154.0 | 14.908 | 2.784 | 27.032 |
| 4 | 10.9 | 184.75 | 19.881 | 7.773 | 31.989 |
| 5 | 27.8 | 184.25 | 19.8 | 7.692 | 31.908 |
| 6 | 20.6 | 210.25 | 24.004 | 11.87 | 36.138 |


In [8]:
test_1.3()

Test passed 🌈
Test passed 😀
Test passed 🌈
Test passed 😀
Test passed 🎊
Test passed 😸
Test passed 🎉
Test passed 🥇
Test passed 😸
[1] "Success!"
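
A prediction interval adds the variability of an individual response to the uncertainty of the estimated mean. The sketch below builds the interval by hand on simulated data (not the `fat` dataset) and compares it with `predict()`; the names `toy`, `new_x`, and `pi_manual` are assumptions made for illustration. Predicting at new input values through `newdata` also avoids the warning about predictions on current data.

```r
set.seed(1)  # simulated data, for illustration only
x <- runif(50, 100, 250)
y <- -10 + 0.16 * x + rnorm(50, sd = 5)
toy <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = toy)

new_x <- data.frame(x = c(150, 200))  # hypothetical new input values
pred <- predict(fit, newdata = new_x, se.fit = TRUE)
t_crit <- qt(0.975, df = fit$df.residual)

# Standard error for a new response:
# uncertainty of the fitted mean plus the estimated error variance.
se_new <- sqrt(pred$se.fit^2 + pred$residual.scale^2)

pi_manual <- cbind(
  fit = pred$fit,
  lwr = pred$fit - t_crit * se_new,
  upr = pred$fit + t_crit * se_new
)

pi_builtin <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)
ci_builtin <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

all.equal(unname(pi_manual), unname(pi_builtin))

# Interval widths: prediction intervals versus confidence intervals.
pi_width <- pi_builtin[, "upr"] - pi_builtin[, "lwr"]
ci_width <- ci_builtin[, "upr"] - ci_builtin[, "lwr"]
pi_width > ci_width
```

The extra `residual.scale^2` term inside the square root is the only difference from the confidence interval for the mean, and it is the reason the prediction interval is always the wider of the two.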


**Question 1.4**
<br>{points: 1}

We have just calculated the 95% prediction interval for the brozek index of men of different weights in our sample. 

Provide a brief interpretation for the 95% prediction interval you have calculated in row 1.

> *Your answer goes here.*

Row 1: with 95% confidence, the body fat percentage (`brozek`) of an individual man weighing 154.25 lb is between 2.824% and 27.072%.

**Question 1.5**
<br>{points: 1}

Compare the confidence intervals computed in **Question 1.1** with the prediction intervals computed in **Question 1.3** (by row). Which intervals are wider? Respond and explain why in one or two sentences.

> *Your answer goes here.*

The confidence intervals for prediction computed in Question 1.1 only account for the sample-to-sample variation of the estimated mean response, whereas the prediction intervals computed in Question 1.3 also account for the variability of an individual response around that mean (the error term). Therefore, prediction intervals (Q1.3) are wider than confidence intervals (Q1.1).

## 2. Predictive Modelling using Linear Regression

In this section you will use the LR as a *predictive model*. Predictive models are built and trained to predict *new* observations. Thus, we need two types of datasets: a *training* set and a *test* set. 

If two independent datasets are not available to build a predictive model, we can:

- approximate the *test* MSE

or 

- use the data in hand and split it to create these datasets.

In this section, you will split the data to build a predictive model on one part using all available variables and test it on the second part of the data.

**Question 2.0**
<br>{points: 1}

Let's start by randomly splitting `fat_sample` into two sets on a 70-30% basis: `training_fat` (70% of the data) and `testing_fat` (the remaining 30%). We will then train a full LR with all the available input variables on the training set.

You can do the following:

1. Create an `ID` column in `fat_sample` (i.e., `fat_sample$ID`) with the row number corresponding to each man in the sample.

2. Use the function `sample_n()` to create `training_fat` (sampling *without* replacement) with 70\% of the observations coming from `fat_sample`.

3. Use `anti_join()` with `fat_sample` and `training_fat` to create `testing_fat` by column `ID`. 

4. Remove the variable `ID` used to split the data.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [9]:
set.seed(123) # DO NOT CHANGE!

# fat_sample$ID <- rownames(fat_sample)
# training_fat <- ...(..., size = nrow(fat_sample) * 0.70,
#   replace = ...
# )

# testing_fat <- anti_join(...,
#   ...,
#   by = ...
# )

# training_fat <- training_fat %>% select(-"ID")
# testing_fat <- testing_fat %>% select(-"ID")

# head(training_fat)
# nrow(training_fat)

# head(testing_fat)
# nrow(testing_fat)

# your code here
fat_sample$ID <- rownames(fat_sample)
training_fat <- sample_n(fat_sample, size = nrow(fat_sample) * 0.70,
  replace = FALSE
)

testing_fat <- anti_join(fat_sample,
  training_fat,
  by = "ID"
)

training_fat <- training_fat %>% select(-"ID")
testing_fat <- testing_fat %>% select(-"ID")

head(training_fat)
nrow(training_fat)

head(testing_fat)
nrow(testing_fat)


| | brozek | age | weight | height | adipos | neck | chest | abdom | hip | thigh | knee | ankle | biceps | forearm | wrist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 159 | 12.8 | 30 | 136.5 | 68.75 | 20.3 | 35.9 | 88.7 | 76.6 | 89.8 | 50.1 | 34.8 | 21.8 | 27.0 | 34.9 | 16.9 |
| 207 | 31.7 | 44 | 166.0 | 65.5 | 27.2 | 39.1 | 100.6 | 93.9 | 100.1 | 58.9 | 37.6 | 21.4 | 33.1 | 29.5 | 17.3 |
| 179 | 22.0 | 38 | 187.25 | 69.25 | 27.5 | 38.0 | 102.7 | 92.7 | 101.9 | 64.7 | 39.5 | 24.7 | 34.8 | 30.3 | 18.1 |
| 14 | 20.8 | 30 | 205.25 | 71.25 | 28.5 | 39.4 | 104.1 | 101.8 | 108.6 | 66.0 | 41.5 | 23.7 | 36.9 | 31.6 | 18.8 |
| 195 | 22.3 | 42 | 162.75 | 72.75 | 21.6 | 35.4 | 92.2 | 85.6 | 96.5 | 60.2 | 38.9 | 22.4 | 31.7 | 27.1 | 17.1 |
| 170 | 16.5 | 35 | 172.75 | 69.5 | 25.2 | 37.6 | 99.1 | 90.8 | 98.1 | 60.1 | 39.1 | 23.4 | 32.5 | 29.8 | 17.4 |


| | brozek | age | weight | height | adipos | neck | chest | abdom | hip | thigh | knee | ankle | biceps | forearm | wrist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 2 | 6.9 | 22 | 173.25 | 72.25 | 23.4 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
| 3 | 24.6 | 22 | 154.0 | 66.25 | 24.7 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
| 10 | 12.0 | 23 | 198.25 | 73.5 | 25.8 | 42.1 | 99.6 | 88.6 | 104.1 | 63.1 | 41.7 | 25.0 | 35.6 | 30.0 | 19.2 |
| 12 | 8.5 | 27 | 216.0 | 76.0 | 26.3 | 39.4 | 103.6 | 90.9 | 107.7 | 66.2 | 39.2 | 25.9 | 37.2 | 30.2 | 19.0 |
| 15 | 21.7 | 35 | 187.75 | 69.5 | 27.4 | 40.5 | 101.3 | 96.4 | 100.1 | 69.0 | 39.0 | 23.1 | 36.1 | 30.5 | 18.2 |
| 18 | 22.4 | 32 | 209.25 | 71.0 | 29.2 | 42.1 | 107.6 | 97.5 | 107.0 | 66.9 | 40.0 | 24.4 | 38.2 | 31.6 | 19.3 |


In [10]:
test_2.0()

Test passed 🥇
Test passed 🥇
Test passed 🎉
Test passed 🥇
Test passed 🎊
Test passed 🥇
Test passed 😀
Test passed 🥳
[1] "Success!"
Test passed 😀
Test passed 🥳
Test passed 🥇
Test passed 🥇
Test passed 😀
Test passed 🎊
Test passed 🎊
Test passed 🥳
[1] "Success!"
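
The same 70-30% split can also be expressed with row indices in base R. This is only a sketch of an alternative to the `sample_n()` / `anti_join()` approach used in Question 2.0: it uses the built-in `mtcars` data as a stand-in, and the names `training_toy` and `testing_toy` are assumptions made for illustration.

```r
set.seed(123)  # a fixed seed makes the split reproducible

n <- nrow(mtcars)
n_train <- floor(0.70 * n)  # round down so the size is a whole number

# Sample training rows without replacement; the remaining rows form the test set.
train_rows <- sample(seq_len(n), size = n_train, replace = FALSE)
training_toy <- mtcars[train_rows, ]
testing_toy  <- mtcars[-train_rows, ]

c(train = nrow(training_toy), test = nrow(testing_toy))

# The two sets should not share any observation.
length(intersect(rownames(training_toy), rownames(testing_toy)))
```

Using `floor()` makes the rounding of the training size explicit instead of leaving it to the sampling function.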


**Question 2.1**
<br>{points: 1}

Let's start by building a predictive additive LR with *all* **14** inputs. Call this object `fat_full_OLS`. 

Estimate an additive LR with *all* **14** inputs against the response variable `brozek`  using `lm()` and data from `training_fat`. 

> **If you write down the input variables, the order should match the column order from `training_fat` to pass the autograding tests**.

This will be our baseline model.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [11]:
# fat_full_OLS <- lm(...,
#   ...
# )
# fat_full_OLS

# your code here
fat_full_OLS <- lm(brozek ~ .,
  data = training_fat
)
fat_full_OLS


Call:
lm(formula = brozek ~ ., data = training_fat)

Coefficients:
(Intercept)          age       weight       height       adipos         neck  
  -12.67471      0.08101     -0.08122     -0.04235      0.10737     -0.61347  
      chest        abdom          hip        thigh         knee        ankle  
   -0.13109      0.95090     -0.20715      0.17006      0.08332      0.34849  
     biceps      forearm        wrist  
    0.19753      0.39584     -1.49116  


In [12]:
test_2.1()

Test passed 🥇
Test passed 🌈
[1] "Success!"


**Question 2.2**
<br>{points: 1}

Using `predict()` and `fat_full_OLS`, obtain the (out-of-sample) predicted brozek values for men in `testing_fat`. 

> `testing_fat` will be used as independent *test data*.

Store them in a variable called `fat_test_pred_full_OLS`. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [13]:
# fat_test_pred_full_OLS <- ...(..., newdata = ...)
# head(fat_test_pred_full_OLS)

# your code here
fat_test_pred_full_OLS <- predict(fat_full_OLS, newdata = testing_fat)
head(fat_test_pred_full_OLS)

In [14]:
test_2.2()

Test passed 🥳
Test passed 🎊
[1] "Success!"


**Question 2.3**
<br>{points: 1}

We will now compute the **Root Mean Squared Error (RMSE)** using data from the test set to evaluate the predictive model. This metric has the same units as the response, and the smaller its value, the better the predictions of the model.

Use the function `rmse()` from the `mltools` package to compute the $\text{RMSE}_{\text{test}}$ based on the *predicted* brozek values stored in `fat_test_pred_full_OLS` for men in the test set. Note that the observed brozek values for these men are in `testing_fat$brozek`. 

Store this metric in a tibble called `fat_RMSE_models` with two columns:

- `Model`: The regression model from which we will obtain the prediction accuracy.
- `RMSE`: The $\text{RMSE}_{\text{test}}$ corresponding to the model.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [15]:
# fat_RMSE_models <- tibble(
#   Model = "OLS Full Regression",
#   RMSE = ...(
#     ...,
#     ...
#   )
# )
# fat_RMSE_models

# your code here
fat_RMSE_models <- tibble(
  Model = "OLS Full Regression",
  RMSE = rmse(
    fat_test_pred_full_OLS,
    testing_fat$brozek
  )
)
head(fat_RMSE_models)


| Model | RMSE |
|---|---|
| `<chr>` | `<dbl>` |
| OLS Full Regression | 3.997984 |


In [16]:
test_2.3()

Test passed 🎉
Test passed 🎊
Test passed 🥳
Test passed 😸
Test passed 😀
[1] "Success!"
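
The RMSE is the square root of the average squared prediction error, $\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. A minimal base R sketch of that calculation follows; the function name `rmse_manual` and the numbers are hypothetical and serve only to show the arithmetic.

```r
# Root mean squared error between predictions and observed values.
rmse_manual <- function(preds, actuals) {
  sqrt(mean((preds - actuals)^2))
}

# Hypothetical predictions and observations, for illustration only.
preds   <- c(12.0, 18.5, 25.0, 20.0)
actuals <- c(14.0, 18.5, 22.0, 21.0)

# Errors are -2, 0, 3, -1, so the mean squared error is 14 / 4 = 3.5.
rmse_manual(preds, actuals)
```

Because the errors are squared before averaging, a few large errors increase the RMSE more than many small ones.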


## 3. Selecting a predictive model

The previous model uses all input variables to predict. However, we may want to select a smaller model by using only a subset of the input variables. The *stepwise selection* algorithms presented in worksheet_09 can be used to build predictive models. 

A good predictive model is one that minimizes the *test* MSE. However, we cannot use the same set both to select the model and to evaluate its performance.

Metrics such as $C_p$, AIC and BIC are computed with the *training* set and can be used to *approximate* the *test* MSE, without looking at the *test* data. 

The test set will then be used *only* to assess the predictive performance of the selected model.
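
To make the idea of approximating the test MSE from training data concrete, one common form of Mallows' $C_p$ for a model with $k$ inputs is $C_p = \text{RSS}_k / \hat{\sigma}^2 - n + 2(k + 1)$, where $\hat{\sigma}^2$ is the residual variance estimated from the full model. The sketch below computes it by hand with base R on the built-in `mtcars` data; the function name `mallows_cp` and the choice of candidate model are assumptions made for illustration, and other software may scale the statistic differently.

```r
# Mallows' Cp of a candidate model, using the full model to estimate sigma^2.
mallows_cp <- function(candidate, full) {
  n <- nobs(full)
  sigma2_full <- sum(resid(full)^2) / df.residual(full)
  n_params <- length(coef(candidate))  # inputs plus the intercept
  sum(resid(candidate)^2) / sigma2_full - n + 2 * n_params
}

full_fit  <- lm(mpg ~ ., data = mtcars)        # all 10 inputs
small_fit <- lm(mpg ~ wt + hp, data = mtcars)  # hypothetical candidate model

mallows_cp(small_fit, full_fit)

# For the full model itself, Cp equals its number of parameters.
mallows_cp(full_fit, full_fit)
```

The penalty term $2(k + 1)$ grows with model size, so a larger model is only preferred when it reduces the RSS enough to offset it.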

**Question 3.0**
<br>{points: 1}

Using only the training data in `training_fat`, select a reduced LR using the **forward selection** algorithm. Recall that this method is implemented in the function `regsubsets()` from library `leaps`.

The function `regsubsets()` identifies various subsets of input variables selected for models of different sizes. The argument `x` of `regsubsets()` is analogous to `formula` in `lm()`. 

Create one object using `regsubsets()` with `training_fat` and call it `fat_forward_sel`. We will use `fat_fwd_summary` to check your results.

> **Maintain the order of columns seen in `training_fat`**

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [17]:
# fat_forward_sel <- ...(
#   ..., ...,
#   ...,
#   ...
# )
# fat_forward_sel

#fat_fwd_summary <- summary(fat_forward_sel)

#fat_fwd_summary <- tibble(
#    n_input_variables = 1:14,
#    RSS = fat_fwd_summary$rss,
#    BIC = fat_fwd_summary$bic,
#    Cp = fat_fwd_summary$cp
#)

# your code here
fat_forward_sel <- regsubsets(
    x = brozek ~ .,
    nvmax = 14,
    data = training_fat,
    method = "forward"
)
fat_forward_sel

fat_fwd_summary <- summary(fat_forward_sel)

fat_fwd_summary <- tibble(
   n_input_variables = 1:14,
   RSS = fat_fwd_summary$rss,
   BIC = fat_fwd_summary$bic,
   Cp = fat_fwd_summary$cp
)


Subset selection object
Call: regsubsets.formula(x = brozek ~ ., nvmax = 14, data = training_fat, 
    method = "forward")
14 Variables  (and intercept)
        Forced in Forced out
age         FALSE      FALSE
weight      FALSE      FALSE
height      FALSE      FALSE
adipos      FALSE      FALSE
neck        FALSE      FALSE
chest       FALSE      FALSE
abdom       FALSE      FALSE
hip         FALSE      FALSE
thigh       FALSE      FALSE
knee        FALSE      FALSE
ankle       FALSE      FALSE
biceps      FALSE      FALSE
forearm     FALSE      FALSE
wrist       FALSE      FALSE
1 subsets of each size up to 14
Selection Algorithm: forward

In [18]:
test_3.0()

Test passed 😀
Test passed 🥇
Test passed 🥇
Test passed 😸
Test passed 🥳
Test passed 🥇
Test passed 🎉
[1] "Success!"
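
The idea behind forward selection can be sketched in a few lines of base R: start from the intercept-only model and, at each step, add the input that gives the smallest RSS. This is a simplified illustration on the built-in `mtcars` data, not the implementation used by `regsubsets()`; the function name `forward_select` is hypothetical.

```r
# Greedy forward selection: at each step, add the input that minimizes the RSS.
forward_select <- function(data, response) {
  candidates <- setdiff(names(data), response)
  selected <- character(0)
  rss_path <- numeric(0)

  while (length(candidates) > 0) {
    # RSS of the current model extended by each remaining candidate.
    rss <- sapply(candidates, function(v) {
      f <- reformulate(c(selected, v), response = response)
      sum(resid(lm(f, data = data))^2)
    })
    best <- names(which.min(rss))
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
    rss_path <- c(rss_path, min(rss))
  }

  data.frame(step = seq_along(selected), added = selected, RSS = rss_path)
}

forward_path <- forward_select(mtcars, "mpg")
forward_path
```

Each row of `forward_path` corresponds to the best model of that size along the forward path. Since the RSS can only decrease as inputs are added, a criterion such as $C_p$ or BIC is still needed to choose among the sizes.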


**Question 3.1**
<br>{points: 1}

Out of the fourteen best models selected for each size by the *forward* selection algorithm and stored in `fat_forward_sel`, we will select the best one in terms of the *out-of-sample* prediction accuracy, as estimated by Mallows' $C_p$. 

Use the $C_p$ computed for each model, stored in `fat_fwd_summary`, to select the best predictive model and indicate which input variables are in the selected model.

> **Heads up:** The most accurate model will have the smallest $C_p$. 


**A.** `age`.

**B.** `weight`.

**C.** `height`.

**D.** `adipos`.

**E.**  `neck`.

**F.**  `chest`.

**G.**  `abdom`.

**H.**  `hip`.

**I.**  `thigh`.

**J.**  `knee`.

**K.**  `ankle`.

**L.**  `biceps`.

**M.**  `forearm`.

**N.**  `wrist`.

*Assign your answers to the object `answer3.1`. Your answers have to be included in a single string indicating the correct options **in alphabetical order** and surrounded by quotes.*

In [19]:
#Run this cell below before continuing.

fat_fwd_summary
summary(fat_forward_sel)

| n_input_variables | RSS | BIC | Cp |
|---|---|---|---|
| `<int>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 3819.952 | -172.0625 | 58.927345 |
| 2 | 3116.23 | -202.7281 | 18.385321 |
| 3 | 3003.401 | -204.0482 | 13.564448 |
| 4 | 2931.132 | -203.1645 | 11.195607 |
| 5 | 2889.05 | -200.5392 | 10.651578 |
| 6 | 2803.1 | -200.6842 | 7.455676 |
| 7 | 2761.383 | -198.1527 | 6.933729 |
| 8 | 2727.868 | -195.1314 | 6.907693 |
| 9 | 2705.357 | -191.4194 | 7.546799 |
| 10 | 2688.527 | -187.3472 | 8.529381 |


Subset selection object
Call: regsubsets.formula(x = brozek ~ ., nvmax = 14, data = training_fat, 
    method = "forward")
14 Variables  (and intercept)
        Forced in Forced out
age         FALSE      FALSE
weight      FALSE      FALSE
height      FALSE      FALSE
adipos      FALSE      FALSE
neck        FALSE      FALSE
chest       FALSE      FALSE
abdom       FALSE      FALSE
hip         FALSE      FALSE
thigh       FALSE      FALSE
knee        FALSE      FALSE
ankle       FALSE      FALSE
biceps      FALSE      FALSE
forearm     FALSE      FALSE
wrist       FALSE      FALSE
1 subsets of each size up to 14
Selection Algorithm: forward
          age weight height adipos neck chest abdom hip thigh knee ankle biceps
1  ( 1 )  " " " "    " "    " "    " "  " "   "*"   " " " "   " "  " "   " "   
2  ( 1 )  " " "*"    " "    " "    " "  " "   "*"   " " " "   " "  " "   " "   
3  ( 1 )  " " "*"    " "    " "    "*"  " "   "*"   " " " "   " "  " "   " "   
4  ( 1 )  " " "*"    " "    " "

In [20]:
# answer3.1 <- 

# your code here
answer3.1 <- "ABEGIKMN"

In [21]:
test_3.1()

Test passed 🥇
Test passed 😸
Test passed 😸
[1] "Success!"


**Question 3.2**
<br>{points: 1}

Use the variables selected by the forward subset algorithm to build a *predictive* model. 

1. Identify the size of the model that minimizes the $C_p$ and call it `cp_min`.

2. Find the names of the variables in the best model of size `cp_min`, selected by the forward algorithm. Store them in an object called `selected_var`. Do not include the intercept with the variable names. 

3. Select only those columns and the response `brozek` from `training_fat`. Call the reduced data frame `training_subset`. 

> The previous step allows you to conveniently fit `lm` on all variables in the data, except the response. Note that the test set can include additional variables that won't be used to predict if not included in the model.

4. Train the predictive model using `lm()` and the reduced `training_subset` data. Call it `fat_red_OLS`. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [22]:
# cp_min = which.min(...$Cp) 
# selected_var <- names(...(fat_forward_sel, ...))[-1]

# training_subset <- training_fat %>% select(all_of(selected_var),brozek)

# fat_red_OLS <- ...(...,
#   ...
# )

# summary(fat_red_OLS)
# your code here
cp_min = which.min(fat_fwd_summary$Cp) 
selected_var <- names(coef(fat_forward_sel, cp_min))[-1]
training_subset <- training_fat %>% select(all_of(selected_var),brozek)
fat_red_OLS <- lm(brozek ~ .,
  data = training_subset
)

summary(fat_red_OLS)


Call:
lm(formula = brozek ~ ., data = training_subset)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.727  -2.851  -0.173   2.894   9.779 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -32.34254   10.43232  -3.100  0.00227 ** 
age           0.08912    0.03399   2.622  0.00955 ** 
weight       -0.12138    0.03993  -3.040  0.00275 ** 
neck         -0.54808    0.26202  -2.092  0.03798 *  
abdom         0.88547    0.07769  11.397  < 2e-16 ***
thigh         0.19425    0.13019   1.492  0.13758    
ankle         0.39202    0.27368   1.432  0.15390    
forearm       0.46343    0.18916   2.450  0.01532 *  
wrist        -1.42923    0.57360  -2.492  0.01369 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.042 on 167 degrees of freedom
Multiple R-squared:  0.7467,	Adjusted R-squared:  0.7345 
F-statistic: 61.53 on 8 and 167 DF,  p-value: < 2.2e-16


In [23]:
test_3.2()

Test passed 😀
Test passed 🌈
[1] "Success!"


**Question 3.3**
<br>{points: 1}

Use the trained model `fat_red_OLS` to predict the responses of the test set `testing_fat`, and call the resulting object `fat_test_pred_red_OLS`. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [24]:
# fat_test_pred_red_OLS <- ...(..., ...)

# your code here
fat_test_pred_red_OLS <- predict(fat_red_OLS, testing_fat)

In [25]:
test_3.3()

Test passed 🥳
Test passed 🎊
[1] "Success!"


**Question 3.4**
<br>{points: 1}

Use the function `rmse()` to compute the RMSE of predicted brozek values of men in the test set stored in `fat_test_pred_red_OLS`. Add this metric as another row in the tibble `fat_RMSE_models` with `"OLS Reduced Regression"` in the column `Model` and the corresponding $\text{RMSE}_{\text{test}}$ in column `RMSE`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [26]:
# fat_RMSE_models <- rbind(
#   fat_RMSE_models,
#   tibble(
#     Model = ...
#     RMSE = ...
#     )
#   )
# fat_RMSE_models

# your code here
fat_RMSE_models <- rbind(
  fat_RMSE_models,
  tibble(
    Model = "OLS Reduced Regression",
    RMSE = rmse(fat_test_pred_red_OLS,
                testing_fat$brozek)
    )
  )
fat_RMSE_models

| Model | RMSE |
|---|---|
| `<chr>` | `<dbl>` |
| OLS Full Regression | 3.997984 |
| OLS Reduced Regression | 3.957596 |


In [27]:
test_3.4()

Test passed 😀
Test passed 🎊
Test passed 🥳
Test passed 🌈
Test passed 🥇
[1] "Success!"


**Question 3.5**
<br>{points: 1}

Based on your results in `fat_RMSE_models`, which model has the best *out-of-sample* prediction performance?

**A.** OLS Full Regression.

**B.** OLS Reduced Regression.

*Assign your answer to an object called `answer3.5`. Your answer should be one of `"A"` or `"B"` surrounded by quotes.*

In [28]:
# answer3.5 <- 

# your code here
answer3.5 <- "B"

In [29]:
test_3.5()

Test passed 😀
Test passed 😀
Test passed 😀
[1] "Success!"
