# Homework 5: Simple Linear Regression

**!!! IMPORTANT, DO NOT PROCEED BEFORE COMPLETING THE STEP BELOW !!!**

If you haven't already, please make a copy of this notebook and save it to your Google Drive. This is imperative so that your work is saved as you go.

**Due Date**: Thursday, May 15th at 11:59pm.

**Submission Instructions**:
- Download the notebook: Go to File --> Download --> Download .ipynb.
- Upload the notebook: Click the Files icon (left side under the Key icon) --> Click the Upload icon (left most of 4) --> Select the file you just downloaded.
- Run the last cell in this notebook.
- Find the new pdf file in the same location as your uploaded notebook.
- Click the 3 vertical dots for this pdf file --> Click Download.
- IMPORTANT: check that your pdf file has not cut off any work from your notebook.
- Upload the pdf to Gradescope.

**Learning Outcomes**:
- Build a simple linear regression model.
- Use simple linear regression for inference.
- Use bootstrapping to compute standard errors and confidence intervals for regression coefficients.
- Understand how and why we might want to log transform a variable.

## Set up

Run the cell below to import the libraries and packages we are going to use. The cleanest way to fit a linear regression model in Python is to use the formula API from the `statsmodels` package.

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
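
As a quick orientation to the formula API, here is a minimal sketch on made-up data. The data frame `toy`, its columns `x` and `y`, and the true coefficients 3 and 2 are all hypothetical and have nothing to do with the homework dataset; the point is only to show how a formula string maps onto a fitted model and which attributes hold the estimates.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical toy data: y depends linearly on x plus Gaussian noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": np.linspace(0, 10, 50)})
toy["y"] = 3.0 + 2.0 * toy["x"] + rng.normal(scale=1.0, size=len(toy))

# "y ~ x" reads as "regress y on x"; an intercept is included automatically.
toy_fit = ols("y ~ x", data=toy).fit()

print(toy_fit.params)                # estimated intercept and slope
print(toy_fit.bse)                   # standard errors of the estimates
print(toy_fit.conf_int(alpha=0.05))  # 95% confidence intervals
```

The names on the left and right of `~` must match column names in the data frame passed as `data`.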

## Exercise 1: Analyze survey responses

Review the responses you received for your poll. Create one plot and summarize an interesting takeaway from the responses in 1 sentence.

In [None]:
# Your code here!

Your answer here!

## Exercise 2: Simple Linear Model

We'll explore simple linear regression with a dataset of used cars. Our goal will be to predict the price of used cars based on the car's mileage. Use the following code to read and preview the dataset.

In [2]:
url = "https://raw.githubusercontent.com/stanford-mse-125-2025/mse-125-2025-public/refs/heads/main/data/used_cars.tsv"
cars = pd.read_csv(url, sep="\t")
cars.head()

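| Unnamed: 0 | type | year | make | model | trim | mileage | price |
|---|---|---|---|---|---|---|---|
| 0 | USED | 2010 | Acura | TL | Base | 73936 | 19388 |
| 1 | USED | 2012 | Acura | MDX | Technology Package | 32453 | 34898 |
| 2 | USED | 2010 | Acura | TL | Base | 34302 | 22000 |
| 3 | USED | 2009 | Acura | TL | SH-AWD | 98772 | 17988 |
| 4 | USED | 2007 | Acura | MDX | Base | 65677 | 22777 |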


**Part (a):** Create a new dataset called `accords`. The dataset should only include rows corresponding to used Honda Accords. Using the `accords` data and the `ols` function, fit the following linear regression model:

$y_{\text{price},i} = \beta_0 + \beta_1 x_{\text{mileage},i} + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$.


Print your regression output with the `summary` function.








In [3]:
## Your code here!


**Part (b):** Using your output from part (a), identify and interpret the following quantities in no more than one sentence each. Make sure to include units if applicable.


* $\hat{\beta}_0$
* $\hat{\beta}_1$
* se($\hat{\beta}_0$)
* se($\hat{\beta}_1$)

Your answer here!





**Part (c):** Calculate a 95% confidence interval for each coefficient in your regression model.

In [4]:
# Your code here!


**Part (d):** Create a scatterplot visualizing the linear relationship between the mileage of used Accords and their price. Add your simple linear regression line in red, and make sure to include axis labels and a title.

In [5]:
# Your code here!


## Exercise 3: Prediction

Using the predict function and your model from the previous exercise, predict the average price of a used Accord with an odometer reading of 50,000 miles. Provide a 95% confidence interval for this mean. Make sure to print your answers.
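
For reference, the sketch below shows one way to get a point prediction and a confidence interval for the mean response from a fitted `statsmodels` model. It uses made-up data (the `toy` frame, its columns, and the value `x = 4.0` are hypothetical), and `get_prediction(...).summary_frame()` is just one of several routes to these numbers.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical toy data, unrelated to the homework dataset.
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x": np.linspace(0, 10, 50)})
toy["y"] = 3.0 + 2.0 * toy["x"] + rng.normal(scale=1.0, size=len(toy))
toy_fit = ols("y ~ x", data=toy).fit()

# New observations must be a data frame with the same predictor column name.
new_x = pd.DataFrame({"x": [4.0]})

# Point prediction of the mean response.
point = toy_fit.predict(new_x)

# summary_frame adds standard errors and interval bounds;
# the mean_ci_* columns are the confidence interval for the mean.
pred = toy_fit.get_prediction(new_x).summary_frame(alpha=0.05)

print(point.iloc[0])
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])
```

The same frame also contains `obs_ci_lower` and `obs_ci_upper`, which describe a single new observation rather than the mean and are therefore wider.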

In [6]:
# Your code here!


## Exercise 4: Prediction

Repeat Exercise 3 for a used Accord with an odometer reading of 300,000 miles. Make sure to print your answers.

Using this result, can you identify a critical issue with your regression model? Answer in one or two sentences.

In [7]:
# Your code here!


Your answer here!


## Exercise 5: Log-transformed Outcome Model

One way to address the issue that you hopefully identified in Exercise 4 is to use a *log-transformed outcome* model.
This model has the form

$$
\log(Y_i) = \beta_0 + \beta_1 X_i + \epsilon_i
$$

where $\log$ is the natural logarithm.
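
The mechanics of fitting on the log scale and mapping predictions back to the original scale can be sketched as follows. Everything here is hypothetical toy data (`toy`, `x`, `y`, and the coefficients 2.0 and -0.3); the log is stored in its own column `log_y` for clarity, which is a choice made for this sketch rather than a requirement.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical toy data: log(y) is linear in x with Gaussian noise.
rng = np.random.default_rng(2)
toy = pd.DataFrame({"x": np.linspace(0, 10, 50)})
toy["y"] = np.exp(2.0 - 0.3 * toy["x"] + rng.normal(scale=0.1, size=len(toy)))

# Transform the outcome, then fit an ordinary linear model to the transformed column.
toy["log_y"] = np.log(toy["y"])
log_fit = ols("log_y ~ x", data=toy).fit()

# Predictions come out on the log scale; exponentiate to return to the scale of y.
new_x = pd.DataFrame({"x": [4.0]})
log_pred = log_fit.predict(new_x)
y_pred = np.exp(log_pred)

print(log_pred.iloc[0], y_pred.iloc[0])
```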

**Part (a)**: Fit a new linear regression model to the Accords dataset, using the log-transformed outcome model. Print a summary of the fitted model.

In [8]:
# Your code here!


**Part (b)**: Repeat Exercise 2(d) with your new model. That is, create a scatterplot of the data with a red line for your new fitted model. Make sure your plot is on the scale of the original data.

In [9]:
# Your code here!


**Part (c)**: Based on your plot from part (b), is this model a better fit to the data than the model from Exercise 2? Why or why not?

Your answer here!

**Part (d)**: Using this model, make a new prediction for the price of a used Accord with 300,000 miles. How does your prediction compare to your prediction from Exercise 4?

In [10]:
# Your code here!


Your answer here!

**Part (e)**: One downside of the log-transformed model is that we have to rethink how to interpret coefficients. In an ordinary (untransformed) linear regression model, the slope $\beta_1$ represents the expected change in $Y$ for every unit increase in $x$, because

$$
E[Y | x = x_0 + 1] - E[Y | x = x_0] = \beta_0 + \beta_1(x_0 + 1) - (\beta_0 + \beta_1 x_0) = \beta_1.
$$

In the log-transformed model, it turns out to be convenient to interpret $\exp(\beta_1)$, rather than $\beta_1$. What does $\exp(\beta_1)$ represent in this model? Use math similar to above to justify your answer.
(Hint: start by writing out $E[Y | x = x_0 + 1]$ and $E[Y | x = x_0]$ for the log-transformed model.)

Your answer here!


**Part (f)**: Using your derivation from part (e), provide an interpretation for the fitted value of $\exp(\beta_1)$ in your model from part (a).

Your answer here!


## Exercise 6: Inference

In Exercises 3 and 4, we used simple linear regression for prediction. In this exercise, we will use simple linear regression for inference: to draw conclusions about the relationship between a predictor variable (like mileage) and a response variable (like price). Rather than just estimating the slope, we ask whether the observed relationship is statistically significant or could plausibly be due to random chance.

One common goal is to test whether the slope coefficient $\beta_1$ is different from zero; in other words, whether there is evidence that the predictor $x$ (mileage) is associated with changes in the outcome $y$ (price).
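
To make the connection between the regression table and this test concrete, the sketch below recomputes the slope's t-statistic and two-sided p-value by hand and compares them to what `statsmodels` reports. The data are made up (`toy`, `x`, `y` are hypothetical names), and the hand calculation assumes the usual t distribution with $n - 2$ degrees of freedom for simple linear regression.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols

# Hypothetical toy data, unrelated to the homework dataset.
rng = np.random.default_rng(3)
toy = pd.DataFrame({"x": np.linspace(0, 10, 50)})
toy["y"] = 3.0 + 2.0 * toy["x"] + rng.normal(scale=1.0, size=len(toy))
toy_fit = ols("y ~ x", data=toy).fit()

slope = toy_fit.params["x"]
se = toy_fit.bse["x"]

# Test statistic for the null hypothesis that the slope is zero.
t_stat = slope / se

# Two-sided p-value from a t distribution with n - 2 residual degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_stat), df=toy_fit.df_resid)

print(t_stat, toy_fit.tvalues["x"])   # hand-computed vs. reported t-statistic
print(p_value, toy_fit.pvalues["x"])  # hand-computed vs. reported p-value
```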

**Part (a):** Write out the null and alternative hypotheses for testing whether mileage significantly predicts price.

Your answer here!





**Part (b):** Report the slope estimate, standard error, t-statistic, and p-value for the slope from your model output from Exercise 2.

Your answer here!











**Part (c):** Based on your results, do you reject the null hypothesis at the 5% significance level? What does this mean in the context of the data?

Your answer here!


## Exercise 7: Bootstrapping

In class, we derived the formula for regression coefficients using partial derivatives. Luckily, packages like `statsmodels` do this for us, calculating standard errors and confidence intervals for the slope and intercept using formulas that assume certain conditions — like normally distributed residuals and constant variance. But what if those assumptions don't hold?

One alternative is bootstrapping. By repeatedly sampling (with replacement) from the original dataset and refitting the model each time, we can build up an empirical distribution for each coefficient and use it to estimate standard errors and confidence intervals. In this exercise, we’ll use bootstrapping to estimate the variability of our regression coefficients and compare the results to the `statsmodels` output from Exercise 2.
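
The resampling loop itself can be sketched on a simpler statistic. The example below bootstraps the sample median of a made-up column (`toy` and `value` are hypothetical, and the median stands in for whatever statistic is of interest); it is meant only to show the resample, recompute, and summarize pattern, not a regression fit.

```python
import numpy as np
import pandas as pd

# With no random_state argument, pandas' .sample draws from NumPy's global
# random state, so seeding NumPy makes the resamples reproducible.
np.random.seed(0)

# Hypothetical skewed data.
toy = pd.DataFrame({"value": np.random.exponential(scale=2.0, size=200)})

n_boot = 1000
boot_medians = np.empty(n_boot)
for b in range(n_boot):
    # Resample all rows with replacement, then recompute the statistic.
    resample = toy.sample(frac=1, replace=True)
    boot_medians[b] = resample["value"].median()

# Bootstrap standard error: standard deviation of the resampled statistics.
boot_se = boot_medians.std(ddof=1)

# Percentile 95% confidence interval: middle 95% of the resampled statistics.
ci_lower, ci_upper = np.percentile(boot_medians, [2.5, 97.5])

print(boot_se, ci_lower, ci_upper)
```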

**Part (a):**  Using the `accords` dataset, perform 1,000 bootstrap resamples. For each resample:

- Use `.sample(frac=1, replace=True)` to generate a new bootstrap dataset

- Fit a linear regression model predicting price from mileage

After collecting all 1,000 bootstrap estimates:

- Calculate the standard error of the intercept and slope

- Construct a 95% percentile confidence interval for each coefficient (intercept and slope)

Print the standard errors and 95% CIs for each coefficient.


In [11]:
np.random.seed(125)

# Your code here!


**Part (b):** Compare your bootstrap standard errors and confidence intervals to the ones from Exercise 2.
Are they similar? Write a sentence interpreting what this tells you about uncertainty in your estimates.









Your answer here!


## Converting to PDF

Use the cell below to convert your notebook to PDF, following the instructions at the beginning of the notebook. **Before submitting, check to make sure that none of your work got cut off.**

In [None]:
!apt-get update -qq > /dev/null
!apt-get install -qq --fix-missing pandoc texlive-latex-base texlive-latex-extra > /dev/null
!jupyter nbconvert --to latex "/content/HW5.ipynb" > /dev/null
!sed -i 's/❗/!/g' /content/HW5.tex
!pdflatex -interaction=nonstopmode -halt-on-error "/content/HW5.tex" > /dev/null