# Homework 7: Model Selection

**Due**: Monday, May 29th

- **Format**: We expect students to complete the homework notebooks using Google Colab (see Discussion 1), but this is not explicitly required and you may use whatever software you would like to run notebooks. 
- **Answers**: As a general guiding policy, you should always try to make it as clear as possible what your answer to each question is, and how you arrived at your answer. Generally speaking, this will mean including all code used to generate results, outputting the actual results to the notebook, and (when necessary) including written answers to support your code.
- **Submission**: Homeworks will be *submitted to Gradescope*, and we expect all students to do question matching on Gradescope upon submission.
- **Late Policy**: All students are allowed 7 total slip days for the quarter, and at most 5 can be used for a single HW assignment. There will be no late credit if you have used up all your slip days. Also, your lowest HW grade will be dropped.

In [2]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
sns.set_style()

## Question 1: Identifying Bias and Variance

For each of the comparisons below, identify which model has higher bias and which model has higher variance. Explain your reasoning.
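
As a reminder of the quantities being compared, one common way to write the bias and variance of a fitted prediction at a fixed input $x$ is sketched below. The symbols $f(x)$ (the true mean of $y$ at $x$) and $\hat{f}(x)$ (the fitted prediction) are notation assumed here for illustration and may differ from the lecture notes. The expectation is taken over repeated training samples.

$$
\text{Bias}\big(\hat{f}(x)\big) = \mathbb{E}\big[\hat{f}(x)\big] - f(x), \qquad \text{Var}\big(\hat{f}(x)\big) = \mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]
$$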

**Part (a)**: Suppose $y = \beta_0 + \beta_1 x + \epsilon$, where $\epsilon \sim N(0,\sigma^2)$.

We train two models on a sample of $n$ randomly generated $y_i$'s.
1. The first model is fit via least squares and has the fitted form $\hat{y} = \hat{\beta}_0$. 
2. The second model has the fitted form $\hat{y} = y_1$.

**Part (b)**: Suppose we have the same data as in part (a).

We fit two models via least squares:
1. The first model's fitted form is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \cdots + \hat{\beta}_{n-1} x^{n-1}$.
2. The second model's fitted form is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \cdots + \hat{\beta}_{n-2} x^{n-2}$.

**Part (c)**: Now suppose $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$, where $\epsilon \sim N(0,\sigma^2)$. Further suppose $z = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon$, where $\epsilon \sim N(0,\sigma^2)$.

We fit two models via least squares:
1. The first model's fitted form is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \hat{\beta}_3 x^3$.
2. The second model's fitted form is $\hat{z} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2$.

## Question 2: Bias and Variance of Linear Regression

In this question, we will explore the bias-variance tradeoff for a simple linear regression model.

**Part (a)**: Consider simple linear regression data $(x_1, y_1), \ldots, (x_n, y_n)$ that has been standardized so that $\sum_i x_i = 0, \sum_i x_i^2 = 1, \sum_i y_i = 0, \sum_i y_i^2 = 1$. What are the resulting least squares estimates for $\beta_0$ and $\beta_1$ on this standardized data?

*Hint*: You may start from the formulas for $\beta_1$ and $\beta_0$ in the lecture notes, and simplify using the extra assumption that the data is standardized.

**Part (b)**: Under the probabilistic model $y = \beta_1 x + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$, show that the prediction $\hat{y} = \hat{\beta}_1 x$ is an unbiased predictor of $y$. 

*Hint:* In the standard linear regression model, $x$ is not random, only $\epsilon$ is random.
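
A simulation can serve as a numerical sanity check on this kind of claim, although it is not a substitute for the derivation. The sketch below holds $x$ fixed, redraws only the noise, refits the no-intercept least squares slope each time, and compares the average prediction with the true mean. The values of `beta1`, `sigma`, `x`, `x_new`, and `n_sims` are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values chosen only for illustration.
beta1, sigma, n_sims = 2.0, 1.0, 20000
x = np.linspace(-1, 1, 25)   # fixed design: x is not random
x_new = 0.7                  # point at which we predict

preds = np.empty(n_sims)
for s in range(n_sims):
    # Only the noise is redrawn in each simulated data set.
    y = beta1 * x + rng.normal(0, sigma, size=x.size)
    # Least squares slope for a model with no intercept.
    beta1_hat = (x @ y) / (x @ x)
    preds[s] = beta1_hat * x_new

# Compare the average prediction with the true mean of y at x_new.
empirical_bias = preds.mean() - beta1 * x_new
empirical_var = preds.var()
print(f"empirical bias:     {empirical_bias:.4f}")
print(f"empirical variance: {empirical_var:.4f}")
```

If the predictor is unbiased, the empirical bias should shrink toward zero as `n_sims` grows, while the empirical variance settles at a positive value.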

**Part (c)**: Why is it possible that $\hat{y}$ is unbiased in part (b), and yet linear regression models can have high bias?

## Question 3: Model Selection

Let's return to the cars data that we looked at for HW6.

In [4]:
cars_df = pd.read_csv("https://raw.githubusercontent.com/stanford-mse-125/homework/main/data/used_cars.csv")
honda_df = cars_df[cars_df["make"] == "Honda"]

**Part (a)**: Recall from HW6 that we used polynomial regression in order to predict price from mileage for Honda cars. 

Use 5-fold cross-validation to estimate the out-of-sample RMSE of the following models:

- Linear: $\widehat{\text{price}} = \beta_0 + \beta_1 \cdot \text{mileage}$
- Quadratic: $\widehat{\text{price}} = \beta_0 + \beta_1 \cdot \text{mileage} + \beta_2 \cdot \text{mileage}^2$
- Cubic: $\widehat{\text{price}} = \beta_0 + \beta_1 \cdot \text{mileage} + \beta_2 \cdot \text{mileage}^2 + \beta_3 \cdot \text{mileage}^3$
- Quartic: $\widehat{\text{price}} = \beta_0 + \beta_1 \cdot \text{mileage} + \beta_2 \cdot \text{mileage}^2 + \beta_3 \cdot \text{mileage}^3 + \beta_4 \cdot \text{mileage}^4$

Which model do you think is the best for the data, based on the results of cross-validation?
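
For reference, the mechanics of k-fold cross-validation can be sketched on synthetic data as follows. This is a minimal illustration, not the intended solution: it uses `np.polyfit` on made-up data instead of the statsmodels formula interface and the cars data, and the helper name `kfold_rmse` and the arrays `x_demo` and `y_demo` are hypothetical.

```python
import numpy as np

def kfold_rmse(x, y, degree, k=5, seed=0):
    """Estimate the out-of-sample RMSE of a polynomial fit via k-fold CV."""
    rng = np.random.default_rng(seed)
    # Shuffle the row indices once, then split them into k roughly equal folds.
    folds = np.array_split(rng.permutation(len(x)), k)
    sq_errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the k-1 training folds only.
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        # Predict on the held-out fold and keep the squared errors.
        preds = np.polyval(coefs, x[test_idx])
        sq_errors.append((y[test_idx] - preds) ** 2)
    # Pool the squared errors across folds before taking the root.
    return float(np.sqrt(np.mean(np.concatenate(sq_errors))))

# Synthetic data (hypothetical, not the cars data): quadratic signal plus noise.
rng = np.random.default_rng(1)
x_demo = rng.uniform(-2, 2, size=200)
y_demo = 1.0 + 0.5 * x_demo - 2.0 * x_demo**2 + rng.normal(0, 1, size=200)

cv_rmse = {d: kfold_rmse(x_demo, y_demo, d) for d in [1, 2, 3, 4]}
for d, rmse in cv_rmse.items():
    print(f"degree {d}: CV RMSE = {rmse:.3f}")
```

The sketch pools squared errors over all held-out points before taking the square root. Averaging the per-fold RMSEs is another common convention and gives similar numbers when the folds are equal in size. Using the same `seed` for every degree means all models are compared on identical folds.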

**Part (b)**: Repeat part (a), but for each of the four models, also add terms for the $\text{year}$ and $\text{model}$ of the Honda. Do the results change? If so, how?