# Phase 3 Code Challenge Review

Made using resources from the Seattle team - thanks y'all.

## Overview

* Gradient Descent
* Logistic Regression
* Classification Metrics
* Decision Trees

In [None]:
# Imports
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import roc_auc_score, plot_roc_curve
from sklearn.tree import export_graphviz
import graphviz

In [None]:
# from src.call import call_on_students

## Part 1: Gradient Descent

### Set Up

In [None]:
# Grab the data from 'auto-mpg.csv'
mpg_df = pd.read_csv("data/auto-mpg.csv")

In [None]:
# Explore the data
mpg_df.head()

In [None]:
# Let's plot a simple linear regression line using just the horsepower column
plt.figure(figsize=(8, 6))
sns.regplot(x='horsepower', y='mpg', data=mpg_df, line_kws={"color":"orange"})
plt.title('Relationship Between Horsepower and MPG')
plt.xlim(0, 250)
plt.show()

The above graph shows an approximate best fit line for the relationship between `horsepower` and `mpg` in our data.


### 1) Describe the below chart: What is it showing? What does it tell us?

![Slope-RSS relationship image](images/slope-rss-relationship.png)

In [None]:
# call_on_students(1)

#### Answer: 

- Plot shows the error (RSS) on the y-axis and the slope of the model on the x-axis
- The curve bottoms out at about `m = -0.158`, so that is the optimal coefficient (slope) value: it's the point where the error term (RSS) is smallest
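
To make the shape of that chart concrete, here is a minimal sketch that computes RSS over a grid of candidate slopes and picks the lowest point. The data is synthetic (roughly `y = -2 * x`), not the auto-mpg data, and the intercept is fixed at 0 to keep things simple, so the numbers will not match the image.

```python
# Synthetic (x, y) pairs for illustration only; not the auto-mpg data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-2.1, -3.9, -6.2, -7.8, -10.1]  # roughly y = -2 * x


def rss(slope, intercept=0.0):
    """Residual sum of squares for the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Evaluate RSS over a grid of candidate slopes from -4.0 to 0.0.
candidate_slopes = [-4.0 + 0.1 * i for i in range(41)]
rss_values = [rss(m) for m in candidate_slopes]

# The best slope on the grid is the one with the smallest RSS.
best_slope = min(candidate_slopes, key=rss)
print(round(best_slope, 1), round(rss(best_slope), 3))
```

Plotting `candidate_slopes` against `rss_values` would give the same bowl shape as the chart: error is high for slopes far from the best one and lowest at the bottom of the curve.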


### 2) Imagine that you're starting at a slope towards the upper left corner. Using Zoom's annotate feature, demonstrate how gradient descent would work

In [None]:
# call_on_students(1)

### 3) What is a step size when talking about gradient descent? How does learning rate regulate step size?

In [None]:
# call_on_students(1)

#### Answer: 

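- Step size is the amount the coefficient changes on each iteration as gradient descent tries to minimize the error term
- Learning rate scales those steps: each step is the learning rate multiplied by the gradient, so steps start out large where the curve is steep and shrink as it flattens out near the minimum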
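
A minimal sketch of that relationship, using plain gradient descent on a single slope with the intercept fixed at 0. The data is synthetic and the `learning_rate` value is a hypothetical choice that happens to work for this toy example; it is not a recommended default.

```python
# Synthetic (x, y) pairs for illustration only; not the auto-mpg data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-2.1, -3.9, -6.2, -7.8, -10.1]  # roughly y = -2 * x


def rss_gradient(slope):
    """Derivative of RSS with respect to the slope (intercept fixed at 0)."""
    return sum(-2 * x * (y - slope * x) for x, y in zip(xs, ys))


learning_rate = 0.005  # hypothetical value chosen for this toy data
slope = 3.0            # deliberately bad starting guess
step_sizes = []

for _ in range(50):
    step = learning_rate * rss_gradient(slope)  # step size = learning rate * gradient
    slope -= step                               # move downhill on the RSS curve
    step_sizes.append(abs(step))

print(round(slope, 3))
print([round(s, 4) for s in step_sizes[:3]])
```

The learning rate never changes here, but the steps still get smaller on every iteration because the gradient shrinks as the slope approaches the bottom of the curve.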


## Part 2: Logistic Regression

### 4) Describe a logistic regression model:

- What kind of target is a logistic regression model used for?
- What are the predictions that a logistic regression model outputs?
- How is it different from linear regression?
- Is it a parametric or non-parametric model?

In [None]:
# call_on_students(1)

#### Answer: 

- Used for classification problems (categorical targets)
- Log-odds, which are translated into probabilities
- Linear regression predicts a continuous target, and is not bound between 0 and 1
- Parametric
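
As a rough illustration of the second and third bullets, the sketch below passes a linear combination (the log-odds) through the sigmoid function to get a probability. The `intercept` and `coef` values are made up for the example and are not fitted to any data.

```python
import math


def sigmoid(log_odds):
    """Map a log-odds value (any real number) to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical coefficients for a one-feature model, for illustration only.
intercept, coef = -1.0, 2.0

for x in [-2.0, 0.0, 0.5, 2.0]:
    log_odds = intercept + coef * x      # the linear part, just like linear regression
    proba = sigmoid(log_odds)            # squashed into the (0, 1) range
    predicted_class = int(proba >= 0.5)  # 0.5 threshold on the probability
    print(x, round(log_odds, 2), round(proba, 3), predicted_class)
```

The linear part can take any value, which is why plain linear regression is not bound between 0 and 1; the sigmoid is what turns it into a probability.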


### 5) Compare a logistic regression model to any of the other model types we've learned:

- List one benefit of logistic regression when compared to the other model type
- List one reason the other model type might be more useful

In [None]:
# call_on_students(1)

#### Answer: 

- Benefit: simple to interpret, fits quickly, not prone to overfitting
- Another model might be more useful if the target is imbalanced, or if there are interaction terms in the data


## Part 3: Logistic Regression and Classification Metrics with Code

### Set Up

In [None]:
# Let's use the same data, but now with a classification target
mpg_class = pd.read_csv('data/auto-mpg-classification.csv')

In [None]:
# Check this new dataframe out
mpg_class.head()

### 6) Prepare our data for modeling:

1. Perform a train/test split
2. Scale the inputs


In [None]:
# call_on_students(1)

In [None]:
# Train-test split
# Set test_size=0.33 and random_state=42
X = mpg_class.drop(columns='target')
y = mpg_class['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [None]:
# Scale the data
scaler = StandardScaler()

scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
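
To see why the scaler is fit on the training data only, here is a small sketch with made-up arrays standing in for `X_train` and `X_test` (the `_demo` names are hypothetical). The test rows are transformed with the mean and standard deviation learned from train, so no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up arrays standing in for X_train and X_test.
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test_demo = np.array([[2.5, 250.0], [10.0, 1000.0]])

scaler_demo = StandardScaler()
scaler_demo.fit(X_train_demo)  # learns mean and std from the training rows only

train_scaled = scaler_demo.transform(X_train_demo)
test_scaled = scaler_demo.transform(X_test_demo)

print(scaler_demo.mean_)          # per-column means learned from train
print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(test_scaled)                # test rows in train units; not necessarily centered at 0
```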

### 7) Explore the `target` column and our model-less baseline

1. What is the breakdown of the `target` column in our training data?
2. What would a model-less baseline look like in this context?
3. How accurate would that model-less baseline be on our test data?

In [None]:
# call_on_students(1)

#### Part 1: explore the target column breakdown in train data

In [None]:
# Code to explore
y_train.value_counts(normalize=True)

#### Answer:

- Imbalanced target - 74% of training data is in class 0


#### Part 2: What would a model-less baseline look like in this context?

#### Answer:

- Predicting only our majority class, 0


#### Part 3: How accurate would that baseline be on test data?


In [None]:
# Code to find the answer
y_test.value_counts(normalize=True)

#### Answer:

- 75% accurate on test data
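
If you'd rather have the baseline as an object you can score like any other model, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` does the same thing. A minimal sketch, assuming made-up labels with roughly the same class balance as ours (the `_demo` arrays are not the real data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels standing in for y_train and y_test (about 75% class 0).
y_train_demo = np.array([0] * 75 + [1] * 25)
y_test_demo = np.array([0] * 30 + [1] * 10)

# Features are ignored by this strategy, so a column of zeros is enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

# Every prediction is the majority class from the training labels.
baseline_preds = baseline.predict(X_test_demo)
baseline_accuracy = accuracy_score(y_test_demo, baseline_preds)
print(baseline_accuracy)
```

The accuracy of this baseline is simply the share of the majority class in the test labels, which is why `value_counts(normalize=True)` answers the question directly.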

### 8) What is one problem you could foresee based on this breakdown, and what is one strategy you could employ to address that problem?

In [None]:
# call_on_students(1)

#### Answer:

- Target is imbalanced
- Oversampling, synthetic oversampling (SMOTE), set `class_weight`
- Note that undersampling doesn't make sense here, since our dataset is so small
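
For the `class_weight` option, passing `class_weight='balanced'` to a model like `LogisticRegression` weights each class inversely to its frequency. The sketch below shows how those weights come out for made-up labels with a 75/25 split (an assumption for illustration, not our actual counts), both by hand and with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 75 in class 0, 25 in class 1.
y_demo = np.array([0] * 75 + [1] * 25)
classes = np.unique(y_demo)

# 'balanced' weights follow n_samples / (n_classes * count_of_each_class).
by_hand = len(y_demo) / (len(classes) * np.bincount(y_demo))
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

print(by_hand)
print(from_sklearn)
```

The minority class ends up with the larger weight, so mistakes on it cost more during fitting.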


### 9) Fit a logistic regression model, and plot a confusion matrix of the results on our test set

In [None]:
# call_on_students(1)

In [None]:
# Fit a logistic regression model
# Name the model `logreg` and set random_state = 42
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train_scaled, y_train)

In [None]:
# Plot a confusion matrix on the test data
plot_confusion_matrix(logreg, X_test_scaled, y_test);
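
If you want the raw counts rather than a plot, `confusion_matrix` returns them as an array. The sketch below uses made-up labels constructed to reproduce the counts used in the by-hand calculations in question 10 (23 TP, 97 TN, 1 FP, 9 FN); it is not run on the real predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels built to give 23 TP, 97 TN, 1 FP and 9 FN.
y_true_demo = [1] * 23 + [0] * 97 + [0] * 1 + [1] * 9
y_pred_demo = [1] * 23 + [0] * 97 + [1] * 1 + [0] * 9

cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Rows are true classes, columns are predicted classes,
# so for a binary target ravel() gives TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```

One version note: `plot_confusion_matrix` was removed in scikit-learn 1.2. On newer versions, `ConfusionMatrixDisplay.from_estimator(logreg, X_test_scaled, y_test)` should draw the equivalent plot.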

### 10) Calculate the accuracy, precision, recall and f1-score for the test set

You can use the confusion matrix above, or sklearn functions

In [None]:
# call_on_students(1)

In [None]:
# Grab predictions if using sklearn functions
test_preds = logreg.predict(X_test_scaled)

In [None]:
# Accuracy
# By hand: (TP + TN) / (TP + TN + FP + FN)
accuracy = (23 + 97) / (23 + 97 + 1 + 9)
print(accuracy)

# Using sklearn
accuracy = accuracy_score(y_test, test_preds)
print(accuracy)

In [None]:
# Precision
# By hand: TP / (TP + FP)
precision = 23 / (23 + 1)
print(precision)

# Using sklearn
precision = precision_score(y_test, test_preds)
print(precision)

In [None]:
# Recall
# By hand: TP / (TP + FN)
recall = 23 / (23 + 9)
print(recall)

# Using sklearn
recall = recall_score(y_test, test_preds)
print(recall)

In [None]:
# F1-Score
# By hand
f1score = 2 * precision * recall / (precision + recall)
print(f1score)

# Using sklearn
f1score = f1_score(y_test, test_preds)
print(f1score)
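
The four by-hand formulas can also be collected into one small helper. This is just a sketch: `classification_metrics` is a hypothetical function name, and the counts passed in are the ones read off the confusion matrix in the by-hand cells above.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts from the by-hand calculations: 23 TP, 97 TN, 1 FP, 9 FN.
metrics = classification_metrics(tp=23, tn=97, fp=1, fn=9)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```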

### 11) Calculate the ROC-AUC on the test set, and plot the ROC curve

For this you'll definitely want to use the sklearn functions!

In [None]:
# call_on_students(1)

In [None]:
# Calculate roc-auc
# Need predicted probabilities
test_probas = logreg.predict_proba(X_test_scaled)[:,1]

roc_auc_score(y_test, test_probas)

In [None]:
# Plot the ROC curve
plot_roc_curve(logreg, X_test_scaled, y_test);
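
One way to build intuition for what ROC-AUC measures: it equals the probability that a randomly chosen positive example gets a higher predicted probability than a randomly chosen negative one. The sketch below checks that by brute force on four made-up scores; `pairwise_auc` is a hypothetical helper for illustration, not how sklearn computes the score internally.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities for class 1.
y_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

auc_demo = pairwise_auc(y_demo, scores_demo)
print(auc_demo)
```

This is also why ROC-AUC needs predicted probabilities rather than hard class predictions: it scores how well the model ranks the examples. One version note: `plot_roc_curve` was removed in scikit-learn 1.2, and `RocCurveDisplay.from_estimator(logreg, X_test_scaled, y_test)` should be the equivalent call on newer versions.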

### 12) Evaluate! Based on the metrics of our test data, how is our model doing?

In [None]:
# call_on_students(1)

#### Answer:

- Doing well overall! Accuracy (about 0.92) and precision (about 0.96) are high, while recall (about 0.72) lags behind - more FN than FP (better precision than recall)


## Part 4: Decision Trees

### Set Up

Let's try a decision tree classifier. 

First, let's just have the tree split once, using just a single column. How would you set that up? Use random_state = 42.

In [None]:
# Create two different decision trees with a single split
dt_maxdepth1_v1 = DecisionTreeClassifier(max_depth=1, random_state = 42)
dt_maxdepth1_v2 = DecisionTreeClassifier(max_depth=1, random_state = 42)

# Train the two trees on different columns

# First fit dt_maxdepth1_v1 on 'weight', set it equal to dt_weight
dt_weight = dt_maxdepth1_v1.fit(X_train[['weight']], y_train)

# Then fit dt_maxdepth1_v2 on 'origin', set it equal to dt_origin
dt_origin = dt_maxdepth1_v2.fit(X_train[['origin']], y_train)

#### Images:

Here we've created two images of what the nodes should look like.

| Version 1: Weight | Version 2: Origin |
| ----------------- | ----------------- |  
| ![max depth 1 - version 1](images/dt-maxdepth1-v1.png) | ![max depth 1 - version 2](images/dt-maxdepth1-v2.png) |

### 13) Just looking at the images, which of these trees does a better job splitting the data? How can you tell?

In [None]:
# call_on_students(1)

#### Answer:

- The first DT produces more pure splits, thus is doing a better job of separating the data
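
"Purity" can be put into numbers with Gini impurity, which is the default `criterion` for sklearn's `DecisionTreeClassifier`. The sketch below compares two made-up splits of the same eight labels; the helper names are hypothetical and the labels are not taken from the images.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node: 0 means perfectly pure."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split: each child's Gini weighted by its share of the rows."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Two made-up ways to split the same 8 labels (five 0s and three 1s).
purer_split = weighted_gini([0, 0, 0, 0], [1, 1, 1, 0])
mixed_split = weighted_gini([0, 0, 1, 1], [0, 0, 0, 1])

print(purer_split, mixed_split)
```

The split with the lower weighted impurity separates the classes better, which is the comparison to make when reading the `gini` values in each node of the images.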

In [None]:
# To check your answer, compare the default .score() for the two models.
print(dt_weight.score(X_test[['weight']], y_test))

print(dt_origin.score(X_test[['origin']], y_test))

### 13 bonus) What's the default scoring metric for the sklearn DecisionTreeClassifier? Is it always the best metric to use?

#### Answer:

- The documentation says that the score will "Return the mean accuracy on the given test data and labels"
- That is, it predicts a class for each row of X values fed in, checks each prediction against the true y, and returns the fraction that match. Although this is handy, accuracy can be misleading on an imbalanced target like ours, so there may be cases when precision, recall, or F1 is more appropriate.
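
A quick way to confirm this is to compare `.score()` with `accuracy_score` directly. The sketch below uses a synthetic imbalanced dataset from `make_classification` (an assumption for illustration, not the mpg data) and prints recall alongside for comparison.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly a 75/25 class split, for illustration only.
X_demo, y_demo = make_classification(
    n_samples=200, weights=[0.75, 0.25], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

tree_demo = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_tr, y_tr)
preds_demo = tree_demo.predict(X_te)

# .score() on a classifier is the same number as accuracy_score.
default_score = tree_demo.score(X_te, y_te)
print(default_score, accuracy_score(y_te, preds_demo))

# A different metric can tell a different story on the same predictions.
print(recall_score(y_te, preds_demo))
```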

### 14) Fit a decision tree model, and plot a confusion matrix of the results on our test set

In [None]:
# call_on_students(1)

In [None]:
# Fit a decision tree model
# Name the model `dt` and set random_state = 42
dt = DecisionTreeClassifier(random_state=42)

dt.fit(X_train, y_train)

In [None]:
# Plot a confusion matrix on the test data
plot_confusion_matrix(dt, X_test, y_test)
plt.show()

In [None]:
# Visualizing the ROCs for the models we've done
fig, ax = plt.subplots()
plot_roc_curve(dt, X_test, y_test, ax=ax)
plot_roc_curve(logreg, X_test_scaled, y_test, ax=ax)

plt.title("Receiver Operating Characteristic Curves\n(Evaluated on Test Set)")
plt.show()

### 15) Which is the better model according to ROC-AUC score? How can you tell?

In [None]:
# call_on_students(1)

#### Answer:

- Logistic regression has the higher ROC-AUC score: its curve sits closer to the top left corner of the graph, so there is more area under it