# Homework 7: Logistic Regression and Additive Models

!!! **STOP** !!!

If you haven't already, please make a copy of this notebook and save it to your Google Drive.
This is imperative so that your work is saved as you go.

**Due Date**: Thursday May 29th at 11:59pm.

**Submission Instructions**:
- Download the notebook: Go to File --> Download --> Download .ipynb
- Upload the notebook: Click the Files icon (left side, under the Key icon) --> Click the Upload icon (leftmost of 4) --> Select the file you just downloaded.
- Run the last cell in this notebook.
- Find the new pdf file in the same location as your uploaded notebook.
- Click the 3 vertical dots for this pdf file --> Click Download.
- IMPORTANT: check that your pdf file has not cut off any work from your notebook.
- Upload the pdf to Gradescope.

**Learning Outcomes**:
- Interpret and evaluate classification models.
- Fit and interpret logistic regression models.
- Apply additive models (DNAMite) to real-world regression data.
- Compare model performance between linear regression and additive models.

## Set up

Run the cell below to import the libraries we are going to use.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
from scipy import stats
from statsmodels.formula.api import ols, logit

## Exercise 1: Classification Metrics


Consider the following confusion matrix, which compares the predictions and actual labels in a classification problem of predicting whether or not a person has COVID-19.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

cm = ConfusionMatrixDisplay(confusion_matrix=np.array([[1040, 86], [101, 356]]))
cm.plot()
plt.title("Confusion Matrix for Covid Test")

**Part (a)**: Calculate the accuracy, precision, TPR, and FPR for the above classifier. You must calculate these metrics manually, without using functions available in various packages. For each metric, interpret the result and explain what the metric is measuring in the context of the problem. Use no more than 2 sentences per metric.
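
As a reminder of the definitions, here is a minimal sketch on a hypothetical confusion matrix (the numbers are made up and are not the ones from this exercise). It assumes the scikit-learn layout, where rows are actual labels, columns are predicted labels, and the negative class comes first. The names `toy_cm` and `classification_metrics` are illustrative only.

```python
# Hypothetical 2x2 confusion matrix (NOT the matrix from this exercise).
# Assumed layout (scikit-learn convention): rows = actual, columns = predicted,
# negative class first, so the entries are [[TN, FP], [FN, TP]].
toy_cm = [[50, 10],
          [5, 35]]


def classification_metrics(cm):
    """Return accuracy, precision, TPR, and FPR from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted positives that are truly positive
        "tpr": tp / (tp + fn),           # share of actual positives that are caught (recall)
        "fpr": fp / (fp + tn),           # share of actual negatives flagged as positive
    }


toy_metrics = classification_metrics(toy_cm)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.3f}")
```

The same four formulas apply to the matrix plotted above once you read off its four cells.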

In [9]:
# Code here!



Answer here!



## Exercise 2: Logistic Regression

You decide to try to build your own model to predict whether it will rain tomorrow based on historical data. You use the following data to build your model.

In [None]:
weather_df = pd.read_csv("https://raw.githubusercontent.com/stanford-mse-125/homework/main/data/weather.csv")
weather_df = weather_df.dropna().reset_index(drop=True)
weather_df

**Part (a)**: Build a logistic regression model to predict `RainTomorrow` from the current day's `MinTemp`, `MaxTemp`, and `Rainfall`. Interpret the **coefficient for `Rainfall`** in the context of the problem and the **coefficient for `MaxTemp`** on the odds scale. Use 2 sentences for each interpretation.
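
For reference, the odds-scale reading follows from the standard form of the logistic model. In the short derivation below, $p$ denotes the predicted probability of the positive class, and $x_j$ and $\beta_j$ are generic placeholders rather than fitted values from this dataset:

$$
\log\frac{p}{1-p} = \beta_0 + \sum_j \beta_j x_j
\quad\Longrightarrow\quad
\frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\beta_j}
$$

That is, a one-unit increase in $x_j$, with the other features held fixed, multiplies the odds by $e^{\beta_j}$.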

In [17]:
# Code here!



Answer here!



**Part (b)**: Plot the ROC curve for the logistic regression model from Part (a). Generate the plot manually using `matplotlib`. Do not use any external libraries or built-in ROC plotting functions. Mark and label the threshold of 0.5. Finally, include a one-sentence explanation of how you identified this point.
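
To make the mechanics concrete, here is a minimal sketch of how ROC points can be computed by hand. It uses made-up labels and scores rather than the weather model, and it assumes the convention of predicting the positive class when the score is greater than or equal to the threshold. The function name `roc_points` and the toy data are illustrative only.

```python
def roc_points(y_true, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold, predicting 1 when score >= threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for t in thresholds:
        # Count true positives and false positives at this threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up labels and predicted probabilities, for illustration only.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.6, 0.8, 0.9]
toy_thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]

toy_roc = roc_points(toy_labels, toy_scores, toy_thresholds)
for t, (fpr, tpr) in zip(toy_thresholds, toy_roc):
    print(f"threshold={t:.2f}  FPR={fpr:.3f}  TPR={tpr:.3f}")
```

Each threshold maps to exactly one point on the curve, so the point for any particular threshold is simply the (FPR, TPR) pair computed at that value. The resulting pairs can be passed to `plt.plot` as x and y coordinates.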


In [22]:
# Code here!



Answer here!



**Part (c)**: You really hate when you expect it not to rain and then it actually does. Specifically, you decide that a false negative error, i.e. predicting it won't rain when it does, is 3 times worse than a false positive, i.e. predicting it will rain when it doesn't. Use the ROC curve to find the *optimal threshold* for this setting using the given costs for false positives and false negatives. Output the optimal threshold, as well as a plot that compares threshold vs. cost.
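
One way to set this up, sketched on made-up data rather than the weather model, is to define the total cost at a threshold as `cost_fn * FN + cost_fp * FP` and pick the threshold that minimizes it. Equivalently, in ROC terms, the cost is `cost_fp * FPR * N_neg + cost_fn * (1 - TPR) * N_pos`. The function `total_cost`, its default costs, and the toy data are assumptions for illustration.

```python
def total_cost(y_true, scores, threshold, cost_fn=3.0, cost_fp=1.0):
    """Weighted error cost when predicting 1 for score >= threshold."""
    # False negatives: actual positives scored below the threshold.
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    # False positives: actual negatives scored at or above the threshold.
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    return cost_fn * fn + cost_fp * fp


# Made-up labels and predicted probabilities, for illustration only.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.6, 0.8, 0.9]
candidate_thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]

costs = [total_cost(toy_labels, toy_scores, t) for t in candidate_thresholds]
best_threshold = candidate_thresholds[costs.index(min(costs))]
print(list(zip(candidate_thresholds, costs)))
print("lowest-cost threshold:", best_threshold)
```

Because false negatives are weighted more heavily, the lowest-cost threshold in this toy example falls below 0.5. In practice a much finer grid of thresholds would be used, and `costs` can be plotted against the thresholds directly.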

In [26]:
# Code here!



**Part (d)**: One metric that is sometimes used when false negatives and false positives have different costs is the [F-beta score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html#:~:text=Compute%20the%20F%2Dbeta%20score,recall%20in%20the%20combined%20score.), which generalizes the F1 score. Using $\beta = 3$ and the optimal threshold determined in part (c), find the out-of-sample F-beta score using 5-fold cross-validation for the given model. Set `random_state=42` in the `KFold` function.
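
For intuition, the F-beta score can be written directly in terms of confusion-matrix counts as $F_\beta = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP}$, so larger $\beta$ penalizes false negatives more heavily. The sketch below evaluates this formula on hypothetical counts; the helper `fbeta_from_counts` is illustrative and is not a replacement for the cross-validated calculation asked for here.

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta score from counts; beta > 1 weights recall more than precision."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Hypothetical counts, for illustration only.
toy_f1 = fbeta_from_counts(tp=35, fp=10, fn=5, beta=1)
toy_f3 = fbeta_from_counts(tp=35, fp=10, fn=5, beta=3)
print(f"F1 = {toy_f1:.4f}, F3 = {toy_f3:.4f}")
```

With these counts, recall (0.875) is higher than precision (about 0.778), so the recall-weighted F3 comes out higher than F1.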

In [29]:
from sklearn.model_selection import KFold
from sklearn.metrics import fbeta_score

# Code here!



## Exercise 3: Additive Models


Additive models offer a flexible alternative to linear models by capturing non-linear relationships between features and the outcome while maintaining interpretability. In this exercise, you'll use the `dnamite` package to fit an additive model on a subset of features from the Ames housing dataset to predict the sale price (`SalePrice`). Please refer to dnamite's [documentation](https://dnamite.readthedocs.io/en/latest/).
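
As a point of comparison, the general forms of the two model classes can be written side by side. Here $x_j$ are the features, $\beta_j$ are linear coefficients, and $f_j$ are the learned per-feature shape functions; this is the generic additive-model form, not a statement about dnamite's internal architecture:

$$
\text{Linear model: } \hat{y} = \beta_0 + \sum_{j} \beta_j x_j
\qquad\qquad
\text{Additive model: } \hat{y} = \beta_0 + \sum_{j} f_j(x_j)
$$

The linear model is the special case $f_j(x_j) = \beta_j x_j$. In both cases each feature's contribution can be inspected on its own, which is what keeps the additive model interpretable.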

In [None]:
# Install dnamite if not installed
%pip install dnamite

In [47]:
# Import Ames Housing dataset

data = pd.read_csv("https://raw.githubusercontent.com/stanford-mse-125-2025/mse-125-2025-public/refs/heads/main/data/ames_housing.csv")

X = data.drop(columns=["SalePrice", "Id"])
y = data["SalePrice"]

**Part (a):** Use the `dnamite` package (see [dnamite.readthedocs.io](https://dnamite.readthedocs.io)) to fit a model on the Ames housing dataset. Select 6 features by using `reg_param = 0.04` and `random_state=10`.


In [None]:
from dnamite.models import DNAMiteRegressor

# Code here!



**Part (b):** Now, fit a dnamite model using only the 6 features selected in Part (a).

In [52]:
# Code here!



**Part (c):** Fit a linear regression model using the same 6 features selected above. Compare the $R^2$ of the linear model to that of the dnamite model. Briefly comment on any differences in performance.

In [55]:
from sklearn.metrics import r2_score

# Code here!



Answer here!



**Part (d)**: Plot the shape function for `1stFlrSF` from the dnamite model. Provide one example illustrating how its interpretation differs from the coefficient of `1stFlrSF` in the linear regression model. Use no more than 3 sentences.

In [59]:
# Code here!



Answer here!



## Converting to PDF

Use the cell below to convert your notebook to PDF, following the submission instructions at the beginning of the notebook.

In [None]:
!apt-get update -qq > /dev/null
!apt-get install -qq --fix-missing pandoc texlive-latex-base texlive-latex-extra > /dev/null
!jupyter nbconvert --to latex "/content/HW7.ipynb" > /dev/null
!sed -i 's/❗/!/g' /content/HW7.tex
!pdflatex -interaction=nonstopmode -halt-on-error "/content/HW7.tex" > /dev/null