# Homework 6: Linear Regression

**!!! IMPORTANT, DO NOT PROCEED BEFORE COMPLETING THE STEP BELOW !!!**

If you haven't already, please make a copy of this notebook and save it to your Google Drive. This ensures that your work is saved as you go.

**Due Date**: Thursday, May 22nd at 11:59pm.

**Submission Instructions**:
- Download the notebook: Go to File --> Download --> Download .ipynb.
- Upload the notebook: Click the Files icon (left side under the Key icon) --> Click the Upload icon (leftmost of 4) --> Select the file you just downloaded.
- Run the last cell in this notebook.
- Find the new pdf file in the same location as your uploaded notebook.
- Click the 3 vertical dots for this pdf file --> Click Download.
- IMPORTANT: check that your pdf file has not cut off any work from your notebook.
- Upload the pdf to Gradescope.

**Learning Outcomes**:
- Interpret regression output
- Understand different features that can be used to enhance regression models
- Create an algorithm to run cross-validation
- Work with the Lasso model
- Understand multicollinearity

## Set up

Just as we did in HW 5, run the cell below to import the libraries and packages we are going to use. The cleanest way to fit a linear regression model in Python is to use the formula API from the `statsmodels` package.

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
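
As a quick refresher on the formula API mentioned above, here is a minimal sketch on synthetic data. The dataframe `toy`, its columns `x` and `y`, and the true coefficients are made up for illustration; they are not part of the homework datasets.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data (hypothetical): y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=200)})
toy["y"] = 3.0 + 2.0 * toy["x"] + rng.normal(0, 1, size=200)

# The formula names the outcome to the left of "~" and the predictors to the right.
toy_fit = ols("y ~ x", data=toy).fit()

print(toy_fit.params)    # estimated intercept and slope
print(toy_fit.rsquared)  # R-squared of the fit
# toy_fit.summary() returns the full regression table.
```

The fitted object exposes individual quantities such as `params` and `rsquared`, which is handy when you want to compare models without reading the whole summary.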

## Scenario: Modeling Used Car Prices

Let's return to our used car dataset from HW 5. Use the following code to read and preview the dataset.

In [None]:
import pandas as pd
url = "https://raw.githubusercontent.com/stanford-mse-125-2025/mse-125-2025-public/refs/heads/main/data/used_cars.tsv"
cars = pd.read_csv(url, sep="\t")
cars

Unnamed: 0,type,year,make,model,trim,mileage,price
0,USED,2010,Acura,TL,Base,73936,19388
1,USED,2012,Acura,MDX,Technology Package,32453,34898
2,USED,2010,Acura,TL,Base,34302,22000
3,USED,2009,Acura,TL,SH-AWD,98772,17988
4,USED,2007,Acura,MDX,Base,65677,22777
...,...,...,...,...,...,...,...
1281,USED,2013,Volkswagen,Jetta,SE PZEV,39233,14599
1282,USED,2013,Volkswagen,Jetta,SE PZEV,35939,13599
1283,USED,2013,Volkswagen,Jetta,SE PZEV,31768,14200
1284,USED,2012,Volkswagen,Jetta,S,65985,13495


## Exercise 0

As a reminder, we created a *log-transformed outcome* model for Honda Accords, transforming the Accord prices with log, using the following formula:

$$
\log(Y_i) = \beta_0 + \beta_1 X_i + \epsilon_i
$$
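
One way to read the slope in this model is the following standard derivation. It assumes $\log$ is the natural logarithm and holds $\epsilon_i$ fixed; it is not a result specific to the car data. Exponentiating both sides gives

$$
Y_i = e^{\beta_0} \cdot e^{\beta_1 X_i} \cdot e^{\epsilon_i}
$$

so increasing $X_i$ by one unit multiplies the outcome by $e^{\beta_1}$:

$$
\frac{Y(X_i + 1)}{Y(X_i)} = e^{\beta_1} \approx 1 + \beta_1 \quad \text{when } |\beta_1| \text{ is small}
$$

In other words, a small $\beta_1$ can be read as roughly a $100 \cdot \beta_1$ percent change in $Y$ per one-unit increase in $X$.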

Create that exact same model again (feel free to copy your answer from HW 5).

In [None]:
# Your code here!



## Exercise 1

**Part (a)**: Let's refine our regression model above to include the model year as an additional predictor. You should still use just the Honda Accord data. Print the regression summary.

In [None]:
# Your code here!



**Part (b)**: Compare the R-squared of the revised regression model to our model from before. What does this tell us about the relative fits of both?

Your answer here!



**Part (c)**: Fit our new model, including year, to the entire dataset (not just Honda Accords). Print the regression summary.

In [None]:
# Your code here!



**Part (d)**: Compare the R-squared value for this model, which is applied to all used cars, to that of the other model, applied to just Honda Accords. What is the likely reason for this difference?

Your answer here!



**Part (e)**: Update the model from part (c) to address the discrepancy in the R-squared value. Print the regression summary.

In [None]:
# Your code here!



**Part (f)**: Look at the coefficient for `model[T.Civic]`. Interpret this value.
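
For background on names like `model[T.Civic]`: the formula interface labels treatment-coded (dummy) indicators as `variable[T.level]`, where each indicator is measured against a baseline level that gets no column of its own. The sketch below uses made-up groups `"A"` and `"B"` and plain `numpy` least squares (a stand-in for the formula API) to show what such a coefficient measures when the dummy is the only predictor.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: log-price for two made-up groups, "A" (baseline) and "B".
toy = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "log_price": [10.0, 10.2, 9.8, 9.5, 9.7, 9.3],
})

# Treatment (dummy) coding: one indicator column per non-baseline level.
dummies = pd.get_dummies(toy["group"], drop_first=True).astype(float)  # column "B"
X = np.column_stack([np.ones(len(toy)), dummies["B"].to_numpy()])

# Ordinary least squares via numpy.
coef, *_ = np.linalg.lstsq(X, toy["log_price"].to_numpy(), rcond=None)
intercept, b_coef = coef

group_means = toy.groupby("group")["log_price"].mean()
print(intercept, group_means["A"])                  # intercept matches the baseline mean
print(b_coef, group_means["B"] - group_means["A"])  # dummy coefficient matches the gap
```

With additional predictors in the model, the same coefficient is a gap relative to the baseline holding those other predictors fixed, rather than a raw difference in means.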

Your answer here!



**Part (g)**: Now, build on your model from part (e). This time, add an interaction term; specifically between mileage and year. Print out the regression summary.
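
As a syntax reminder, here is a small sketch of how interaction terms can be written in a formula string. The columns `x`, `z`, and `y` are synthetic and hypothetical; in a formula, `x * z` expands to both main effects plus the product term `x:z`.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data (hypothetical) with a true interaction effect of 0.5.
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x": rng.normal(size=300), "z": rng.normal(size=300)})
toy["y"] = (
    1.0 + 2.0 * toy["x"] - 1.0 * toy["z"]
    + 0.5 * toy["x"] * toy["z"]
    + rng.normal(0, 0.1, size=300)
)

# "x * z" is shorthand for "x + z + x:z".
fit_star = ols("y ~ x * z", data=toy).fit()
fit_explicit = ols("y ~ x + z + x:z", data=toy).fit()

print(list(fit_star.params.index))  # names of the fitted terms
print(fit_star.params["x:z"])       # estimated interaction coefficient
```

Both spellings fit the same model, so use whichever reads more clearly to you.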

In [None]:
# Your code here!



**Part (h)**: What is the coefficient on your interaction term? How did the coefficients on the two variables involved in your interaction term change from part (e) to part (g)? What does this say about multicollinearity?
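
To build intuition about multicollinearity and product terms, the sketch below uses two made-up, independent predictors `a` and `b` (not the car data). It checks how strongly a product term correlates with one of its components, before and after centering. Centering appears here only as an illustration; the exercise does not ask for it.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical predictors whose values sit far from zero.
a = rng.uniform(2005, 2015, size=500)
b = rng.uniform(10_000, 100_000, size=500)

# Correlation between a predictor and the raw product term.
raw_corr = np.corrcoef(b, a * b)[0, 1]

# Same check after subtracting each predictor's mean.
a_c = a - a.mean()
b_c = b - b.mean()
centered_corr = np.corrcoef(b_c, a_c * b_c)[0, 1]

print("raw:", raw_corr)
print("centered:", centered_corr)
```

When one predictor varies little relative to its mean, the raw product is almost a rescaled copy of the other predictor, which is one way a model can end up with highly correlated columns.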

Your answer here!



## Scenario: Ames Housing

Now, let's use the Ames housing dataset from lecture. Use the following code to read and preview the dataset.

In [None]:
import pandas as pd
url = "https://raw.githubusercontent.com/stanford-mse-125-2025/mse-125-2025-public/refs/heads/main/data/ames_housing.csv"
ames = pd.read_csv(url)
ames = ames.rename(columns={"1stFlrSF": "FirstFlrSF"})
ames

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


## Exercise 2

**Part (a)**: Fit the regression model from class (see the slides). This model should include only `FirstFlrSF`, `FullBath`, and `TotRmsAbvGrd` as predictors. Print the regression summary.

In [None]:
# Your code here!



**Part (b)**: Write a function that (1) takes in a generic model formula (e.g. "Y ~ X") as well as a generic dataset (e.g. Ames dataframe), and (2) conducts a cross validation test with k=5 folds using the inputted model formula string. The function should print two values: (1) the average RMSE value across the 5 test sets, as well as (2) the average RMSE value across the 5 training sets. **DO NOT use SKLearn CV tools**
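
If the bookkeeping for folds is unfamiliar, here is a sketch of just that piece, using only `numpy`. The helper names `make_folds` and `rmse` are hypothetical, and the sketch deliberately leaves out the model fitting and averaging that your function needs to do.

```python
import numpy as np

def make_folds(n_rows, k=5, seed=0):
    """Shuffle row positions 0..n_rows-1 and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    return np.array_split(shuffled, k)

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

folds = make_folds(n_rows=12, k=5)
for i, test_idx in enumerate(folds):
    # Every fold except the i-th one forms the training rows.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(i, len(train_idx), len(test_idx))

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Row positions like these can be passed to `DataFrame.iloc` to pull out the training and test rows for each fold.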

In [None]:
# Your code here!



**Part (c)**: Use your function from part (b) on the model from part (a).

In [None]:
# Your code here!



**Code review**: Now, let's take a look at the following code (from our in-class demo), creating a Lasso model for our Ames housing data with regularization parameter alpha = 10,000:

In [None]:
# Use lasso to select features

from sklearn.linear_model import Lasso

model = Lasso(alpha=10000)
X = ames.drop(columns=["SalePrice"])
y = ames["SalePrice"]

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer, make_column_selector

preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), make_column_selector(dtype_include=["int64", "float64"])),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), make_column_selector(dtype_include=["object"])),
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")
X = preprocessor.fit_transform(X)

model.fit(X, y)

# Print the selected features
selected_features = X.columns[model.coef_ != 0]
print("Number of selected features:", len(selected_features))
print("")
print("Selected features:", selected_features)
print("")
print("Coefficients, in order:")
print(model.sparse_coef_)
print("")
print("R-squared value:", model.score(X, y))
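
For intuition about how the regularization parameter drives feature selection, the sketch below uses a simplified special case rather than the scikit-learn solver. When the predictors are orthonormal, the lasso solution is the least-squares coefficient vector passed through soft-thresholding. The function name `soft_threshold` and the coefficient values are hypothetical, and the penalty here is not on the same scale as scikit-learn's `alpha`.

```python
import numpy as np

def soft_threshold(values, penalty):
    """Shrink each value toward zero by `penalty`; anything within the penalty becomes 0."""
    values = np.asarray(values, dtype=float)
    return np.sign(values) * np.maximum(np.abs(values) - penalty, 0.0)

# Hypothetical least-squares coefficients for five features.
ols_coefs = np.array([5.0, -3.0, 1.5, -0.5, 0.2])

for penalty in [0.0, 1.0, 4.0]:
    shrunk = soft_threshold(ols_coefs, penalty)
    print(penalty, shrunk, "nonzero:", int(np.count_nonzero(shrunk)))
```

In this toy setting, coefficients whose magnitude falls below the penalty are set exactly to zero, and the surviving ones are shrunk toward zero by the same amount.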

**Part (d)**: We now see the features selected by the Lasso model with alpha = 10,000. Again, use your function from part (b) to get cross validation RMSE metrics for a linear regression model that uses the features selected by the Lasso model.

In [None]:
# Your code here!



**Part (e)**: Compare your RMSE metrics in part (d) to those from part (c). What do you notice? What does this suggest?

Your answer here!





**Part (f)**: Change the alpha value to 6,000 and re-run the code above. What happens to the number of selected features and R-squared value? Explain both of these changes.

Your answer here!



**Part (g)**: Repeat the same process from part (d). Use your function from part (b) to get cross validation RMSE metrics for a linear regression model that uses the features selected by the Lasso model with alpha = 6,000.

In [None]:
# Your code here!



**Part (h)**: Compare your RMSE metrics in part (g) to those from part (c) and part (d). What do you notice? What does this suggest?

Your answer here!



**Part (i)**: Give one reason why one might prefer the model in part (d) to the model in part (g). Give another reason why the opposite might be true.

Your answer here!


## Converting to PDF

Use the cell below to convert your notebook to PDF, following the instructions at the beginning of the notebook. **Before submitting, check to make sure that none of your work got cut off.**

In [None]:
!apt-get update -qq > /dev/null
!apt-get install -qq --fix-missing pandoc texlive-latex-base texlive-latex-extra > /dev/null
!jupyter nbconvert --to latex "/content/HW6.ipynb" > /dev/null
!sed -i 's/❗/!/g' /content/HW6.tex
!pdflatex -interaction=nonstopmode -halt-on-error "/content/HW6.tex" > /dev/null
