# Homework 8: Logistic Regression

**Due**: Monday June 5th.

- **Format**: We expect students to complete the homework notebooks using Google Colab (see Discussion 1), but this is not explicitly required and you may use whatever software you would like to run notebooks. 
- **Answers**: As a general guiding policy, you should always try to make it as clear as possible what your answer to each question is, and how you arrived at your answer. Generally speaking, this will mean including all code used to generate results, outputting the actual results to the notebook, and (when necessary) including written answers to support your code.
- **Submission**: Homeworks will be *submitted to Gradescope*, and we expect all students to do question matching on Gradescope upon submission.
- **Late Policy**: All students are allowed 7 total slip days for the quarter, and at most 5 can be used for a single HW assignment. There will be no late credit if you have used up all your slip days. Also, your lowest HW grade will be dropped.

In [2]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
sns.set_theme()

## Question 1: Classification Metrics

Consider the following confusion matrix, which compares the predictions and actual labels in a classification problem of predicting whether or not a person has COVID-19.

**Part (a)**: Calculate the accuracy, precision, recall, and F1 score for the above classifier. You must calculate these metrics manually, without using functions available in various packages. For each metric, explain what the metric is measuring in the context of the problem.
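As a reminder of the definitions, the four metrics can be written directly in terms of the confusion-matrix counts. The sketch below is only illustrative: the counts `tp`, `fp`, `fn`, and `tn` are hypothetical placeholders, not the values from the confusion matrix in this question.

```python
# Hypothetical counts for illustration only; substitute the values
# from the confusion matrix in the question.
tp, fp, fn, tn = 40, 10, 20, 130

# Fraction of all predictions that are correct.
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Of the people predicted positive, the fraction who are truly positive.
precision = tp / (tp + fp)

# Of the truly positive people, the fraction the classifier catches.
recall = tp / (tp + fn)

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"recall={recall:.3f}, f1={f1:.3f}")
```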

**Part (b)**: For each of the following classification problems, determine whether or not you think a false negative error or a false positive error is worse, or if the errors are comparable. Provide justification for your answer (there may be more than 1 correct answer depending on your justification).

- A model that predicts whether (True) or not (False) a vaccine should be given to a person. Assume the vaccine generally has very minor side effects.
- A model that predicts whether (True) or not (False) an email is spam and should thus be filtered. 
- A model that predicts whether (True) or not (False) a bank transaction should be marked as potentially fraudulent, and thus reviewed by the account owner.

## Question 2: Weather Data

You decide to try to build your own model to predict whether it will rain tomorrow based on historical data. You use the following data to build your model.

In [7]:
weather_df = pd.read_csv("data/weather.csv")
weather_df = weather_df.dropna().reset_index(drop=True)
weather_df

| Unnamed: 0 | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-01-01 | Cobar | 17.9 | 35.2 | 0.0 | 12.0 | 12.3 | SSW | 48.0 | ENE | ... | 20.0 | 13.0 | 1006.3 | 1004.4 | 2.0 | 5.0 | 26.6 | 33.4 | No | No |
| 1 | 2009-01-02 | Cobar | 18.4 | 28.9 | 0.0 | 14.8 | 13.0 | S | 37.0 | SSE | ... | 30.0 | 8.0 | 1012.9 | 1012.1 | 1.0 | 1.0 | 20.3 | 27.0 | No | No |
| 2 | 2009-01-04 | Cobar | 19.4 | 37.6 | 0.0 | 10.8 | 10.6 | NNE | 46.0 | NNE | ... | 42.0 | 22.0 | 1012.3 | 1009.2 | 1.0 | 6.0 | 28.7 | 34.9 | No | No |
| 3 | 2009-01-05 | Cobar | 21.9 | 38.4 | 0.0 | 11.4 | 12.2 | WNW | 31.0 | WNW | ... | 37.0 | 22.0 | 1012.7 | 1009.1 | 1.0 | 5.0 | 29.1 | 35.6 | No | No |
| 4 | 2009-01-06 | Cobar | 24.2 | 41.0 | 0.0 | 11.2 | 8.4 | WNW | 35.0 | NW | ... | 19.0 | 15.0 | 1010.7 | 1007.4 | 1.0 | 6.0 | 33.6 | 37.6 | No | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56415 | 2017-06-20 | Darwin | 19.3 | 33.4 | 0.0 | 6.0 | 11.0 | ENE | 35.0 | SE | ... | 63.0 | 32.0 | 1013.9 | 1010.5 | 0.0 | 1.0 | 24.5 | 32.3 | No | No |
| 56416 | 2017-06-21 | Darwin | 21.2 | 32.6 | 0.0 | 7.6 | 8.6 | E | 37.0 | SE | ... | 56.0 | 28.0 | 1014.6 | 1011.2 | 7.0 | 0.0 | 24.8 | 32.0 | No | No |
| 56417 | 2017-06-22 | Darwin | 20.7 | 32.8 | 0.0 | 5.6 | 11.0 | E | 33.0 | E | ... | 46.0 | 23.0 | 1015.3 | 1011.8 | 0.0 | 0.0 | 24.8 | 32.1 | No | No |
| 56418 | 2017-06-23 | Darwin | 19.5 | 31.8 | 0.0 | 6.2 | 10.6 | ESE | 26.0 | SE | ... | 62.0 | 58.0 | 1014.9 | 1010.7 | 1.0 | 1.0 | 24.8 | 29.2 | No | No |


**Part (a)**: Build a logistic regression model to predict `RainTomorrow` from the current day's `MinTemp`, `MaxTemp`, and `Rainfall`. Interpret the coefficient for `Rainfall` in the context of the problem.
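One practical detail: `smf.logit` expects a numeric 0/1 response, so a `"Yes"`/`"No"` column generally needs to be recoded first. The sketch below shows one possible workflow on a small synthetic data frame. The frame `toy_df`, the helper column `rain_tomorrow`, and the data-generating coefficients are all assumptions made up for illustration; they are not the homework data or its answer.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the weather data (hypothetical values).
rng = np.random.default_rng(0)
n = 500
toy_df = pd.DataFrame({
    "MinTemp": rng.normal(12, 5, n),
    "MaxTemp": rng.normal(23, 6, n),
    "Rainfall": rng.exponential(2.0, n),
})
# Made-up labels: more rain today raises the chance of rain tomorrow.
p_rain = 1 / (1 + np.exp(-(-1.5 + 0.3 * toy_df["Rainfall"])))
toy_df["RainTomorrow"] = np.where(rng.random(n) < p_rain, "Yes", "No")

# Recode the string label as 0/1 before fitting.
toy_df["rain_tomorrow"] = (toy_df["RainTomorrow"] == "Yes").astype(int)

model = smf.logit(
    "rain_tomorrow ~ MinTemp + MaxTemp + Rainfall", data=toy_df
).fit(disp=0)

# The coefficient is on the log-odds scale; exponentiating gives the
# multiplicative change in the odds per one-unit increase in Rainfall,
# holding the other predictors fixed.
beta_rain = model.params["Rainfall"]
odds_ratio = np.exp(beta_rain)

# Fitted probabilities of rain tomorrow for each row.
pred_probs = model.predict(toy_df)
```

Calling `model.summary()` prints the full coefficient table, including standard errors and confidence intervals.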

**Part (b)**: You really hate when you expect it not to rain and then it actually does. Specifically, you decide that a false negative error, i.e. predicting it won't rain when it does, is 3 times worse than a false positive error, i.e. predicting it will rain when it doesn't. Use the ROC curve to find the *optimal threshold* for this setting using the given costs for false positives and false negatives.

As a guide, you should:
1) Fix a threshold $t$.
2) Compute model predictions using the given threshold $t$.
3) Compute the false positive rate (fpr) and the false negative rate (fnr) using the predictions from 2).
4) Compute the *cost* of the threshold as $1 \cdot \text{fpr} + 3 \cdot \text{fnr}$.
5) Repeat 1) - 4) for many thresholds $t \in [0, 1]$, and find the threshold that minimizes the cost.

Output the optimal threshold, as well as a plot which compares threshold vs cost.
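The guide above can be sketched as a simple sweep over thresholds. This is a minimal illustration, not a reference solution: the function name `threshold_costs` is hypothetical, and the labels `y` and predicted probabilities `probs` are synthetic stand-ins for the true labels and the fitted model's predictions.

```python
import numpy as np

def threshold_costs(y_true, probs, thresholds, fp_cost=1.0, fn_cost=3.0):
    """Return fp_cost * fpr + fn_cost * fnr for each threshold."""
    y_true = np.asarray(y_true).astype(bool)
    probs = np.asarray(probs)
    n_pos = y_true.sum()
    n_neg = (~y_true).sum()
    costs = []
    for t in thresholds:
        pred = probs >= t                      # predict positive at or above t
        fpr = np.sum(pred & ~y_true) / n_neg   # false positives / actual negatives
        fnr = np.sum(~pred & y_true) / n_pos   # false negatives / actual positives
        costs.append(fp_cost * fpr + fn_cost * fnr)
    return np.array(costs)

# Synthetic labels and scores (assumed, for illustration only).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
probs = np.clip(0.5 * y + rng.normal(0.25, 0.2, size=1000), 0, 1)

thresholds = np.linspace(0, 1, 101)
costs = threshold_costs(y, probs, thresholds)
best_t = thresholds[np.argmin(costs)]
print(f"threshold with lowest cost: {best_t:.2f}")
```

With `matplotlib` imported as `plt`, `plt.plot(thresholds, costs)` followed by axis labels would give the threshold vs. cost plot.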

**Part (c)**: One metric that is sometimes used when false negatives and false positives have different costs is the [F-beta score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html), which generalizes the F1 score. Using $\beta = 3$ and the optimal threshold determined in part (b), find the out-of-sample F-beta score using 5-fold cross validation for the given model.
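For reference, the F-beta score can be written in terms of counts as $F_\beta = \frac{(1 + \beta^2)\,TP}{(1 + \beta^2)\,TP + \beta^2\,FN + FP}$, so larger $\beta$ weights recall more heavily. The sketch below computes it by hand and builds a shuffled 5-fold split with NumPy only. The helper names `fbeta` and `kfold_indices` are hypothetical, and the labels in the toy check are made up.

```python
import numpy as np

def fbeta(y_true, y_pred, beta):
    """F-beta score computed from true-positive, false-positive, false-negative counts."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom > 0 else 0.0

def kfold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split of n rows."""
    perm = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(perm, k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

# Toy check with made-up labels: tp = 3, fp = 2, fn = 1.
score = fbeta([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 1], beta=3)

# Five (train, test) index pairs over 20 rows.
splits = list(kfold_indices(20, k=5))
```

In each fold, one would refit the model on the training rows, predict probabilities for the held-out rows, apply the threshold, and compute `fbeta` on those held-out predictions; the out-of-sample score is then the average over the five folds.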