In [None]:
#Run the following code to print multiple outputs from a cell
get_ipython().ast_node_interactivity = 'all'

# Predictive Performance

Over the past several classes, we've built a number of classification models that we assessed using accuracy. For example, when predicting whether or not someone paid their credit card bill, we got the following results:

* Decision Tree: 71% accurate
* Random Forest: 80% accurate
* Support Vector Machine (w/o normalized features): 78%
* Support Vector Machine (w/ normalized features): 82% accurate
* Neural Network: 82% accurate

The major drawback of accuracy, as we've seen, is that it doesn't show how accurate (or inaccurate) the model is for each class separately (paid vs. missed payments). In our example, it's more costly for the bank to misclassify someone as paying their bill when they didn't pay. However, the SVM model without normalized features predicted that *no one* would miss a payment and still achieved 78% accuracy. This often occurs with imbalanced data, like we had with the credit card customers.

Two measures that try to account for this are the ROC (receiver operating characteristic) curve and the AUC (area under the curve).
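
To see how accuracy can mislead on imbalanced data, here is a minimal sketch using made-up outcomes (78 `Paid`, 22 `Missed`, chosen only to mirror the 78% figure above; this is not the credit card data). A "model" that predicts `Paid` for everyone scores well on accuracy while catching none of the missed payments:

```python
import numpy as np

# Hypothetical imbalanced outcomes: 78 customers paid, 22 missed
actual = np.array(["Paid"] * 78 + ["Missed"] * 22)

# A "model" that predicts Paid for every customer
predicted = np.full(actual.shape, "Paid")

# Overall accuracy: share of predictions that match the actual outcome
accuracy = (predicted == actual).mean()

# Share of the customers who actually missed that the model identified
missed_mask = actual == "Missed"
missed_recall = (predicted[missed_mask] == "Missed").mean()

print(f"Accuracy: {accuracy:.2f}")                # 0.78
print(f"Recall for Missed: {missed_recall:.2f}")  # 0.00
```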

## Importing & Setting up the Data
Import the file, "creditCardDefaultReduced.csv", and save it in a variable called `df`. 

In [None]:
import pandas as pd
df = pd.read_csv("creditCardDefaultReduced.csv")
df

Next, create your `outcome` and `features` variables:

In [None]:
outcome = df["Payment"]
numericFeatures = df[["Limit_Bal", "Bill_Amt1", "Pay_Amt1", "Age"]]
dummiesMarriage = pd.get_dummies(df["Marriage"], prefix = "Marriage", drop_first = True)
dummiesCard = pd.get_dummies(df["Card"], prefix = "Card", drop_first = True)
dummiesPay_0 = pd.get_dummies(df["Pay_0"], prefix = "Pay_0", drop_first = True)
features = pd.concat([numericFeatures, dummiesMarriage, dummiesPay_0, dummiesCard], axis = 1)
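
As a small illustration of what `drop_first = True` does, the sketch below uses a made-up `Marriage` column (the category names are assumptions, not necessarily the values in the data file). The first category alphabetically is dropped because it is implied whenever all of the remaining dummy columns are 0:

```python
import pandas as pd

# Hypothetical categorical column with three levels
toyMarriage = pd.Series(["Married", "Single", "Other", "Single"], name="Marriage")

# One dummy column per category
allDummies = pd.get_dummies(toyMarriage, prefix="Marriage")

# Same, but the first category (alphabetically) is dropped
reducedDummies = pd.get_dummies(toyMarriage, prefix="Marriage", drop_first=True)

print(list(allDummies.columns))      # ['Marriage_Married', 'Marriage_Other', 'Marriage_Single']
print(list(reducedDummies.columns))  # ['Marriage_Other', 'Marriage_Single']
```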

Our next step is to partition the data into training and test data sets:

In [None]:
from sklearn.model_selection import train_test_split
featuresTrain, featuresTest, outcomeTrain, outcomeTest = train_test_split(features, 
                                                                          outcome, 
                                                                          test_size = 0.33, 
                                                                          random_state = 42)

Let's scale our features and use these features for all of our models:

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
featuresTrain_norm = scaler.fit_transform(featuresTrain)
featuresTest_norm = scaler.transform(featuresTest)
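
Note that the scaler is fit on the training data only and then reused on the test data. A toy sketch (made-up numbers, hypothetical variable names) shows the consequence: test values are scaled with the training minimum and maximum, so they can land outside the 0 to 1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data: training values range from 0 to 10
toyTrain = np.array([[0.0], [5.0], [10.0]])
toyTest = np.array([[5.0], [12.0]])

toyScaler = MinMaxScaler()
toyTrain_norm = toyScaler.fit_transform(toyTrain)  # learns min = 0 and max = 10
toyTest_norm = toyScaler.transform(toyTest)        # reuses the training min and max

print(toyTrain_norm.ravel().tolist())  # [0.0, 0.5, 1.0]
print(toyTest_norm.ravel().tolist())   # second value is about 1.2, above the training range
```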

## Building the Models

The following code cell builds our four models. We'll use them to assess each model's fit using ROC curves and the AUC:

In [None]:
# Decision Tree
import sklearn.tree
modTree = sklearn.tree.DecisionTreeClassifier(random_state = 42)
modTree.fit(featuresTrain_norm, outcomeTrain)

# Random Forest
import sklearn.ensemble
modForest = sklearn.ensemble.RandomForestClassifier(random_state = 42)
modForest.fit(featuresTrain_norm, outcomeTrain)

# Support Vector Machine
import sklearn.svm
modSVM = sklearn.svm.SVC(random_state = 42)
modSVM.fit(featuresTrain_norm, outcomeTrain)

# Neural Network
import sklearn.neural_network
modNN = sklearn.neural_network.MLPClassifier(random_state = 42, hidden_layer_sizes = (30,30))
modNN.fit(featuresTrain_norm, outcomeTrain)

## ROC Curves

Another limitation of the accuracy score is that it only provides a measure of the model's performance at a 50% threshold...i.e., the credit card customer is characterized as missing the payment unless the probability of payment exceeds 50%. There are, of course, other thresholds the bank could consider. For example, if it's costly to misclassify customers who will miss their payment, maybe the bank should use a threshold of 60% instead.

The ROC curve plots a model's true positive rate against its false positive rate at different thresholds (from 0% to 100%). We'll be using the `roc_curve()` function from `sklearn.metrics`, which returns three arrays: the false positive rates, the true positive rates, and the thresholds at which those rates were calculated.
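
To make the two rates concrete, here is a minimal sketch that computes them by hand for four made-up customers (the probabilities and the helper `rates_at()` are illustrative assumptions, not part of the credit card data or of scikit-learn). A customer is predicted `Paid` when the probability of `Paid` is at or above the threshold:

```python
import numpy as np

# Four made-up customers: actual outcome and predicted probability of Paid
actual = np.array(["Missed", "Missed", "Paid", "Paid"])
probPaid = np.array([0.10, 0.40, 0.35, 0.80])

def rates_at(threshold):
    """Return (true positive rate, false positive rate) with Paid as the positive class."""
    predictedPaid = probPaid >= threshold
    isPaid = actual == "Paid"
    tpr = (predictedPaid & isPaid).sum() / isPaid.sum()      # paid customers predicted Paid
    fpr = (predictedPaid & ~isPaid).sum() / (~isPaid).sum()  # missed customers predicted Paid
    return tpr, fpr

for t in [0.80, 0.40, 0.35, 0.10]:
    tpr, fpr = rates_at(t)
    print(f"threshold {t:.2f}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Lowering the threshold labels more customers as `Paid`, so both rates can only stay the same or rise. Each (FPR, TPR) pair is one point on the ROC curve.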

Let's see how this works, starting with the Decision Tree (`modTree`). We first need to get predictions from our model so that we can compare them to the actual outcomes. Up until now, we've calculated the predictions as follows:

In [None]:
modTree.predict(featuresTest_norm)[0:19]
# [0:19] displays the first 19 predictions (rows 0 through 18)

Notice the predictions are either `Paid` or `Missed`. But we want to be able to use probabilities in our calculations. So let's use the `predict_proba()` function instead:

In [None]:
predTree = modTree.predict_proba(featuresTest_norm)
predTree[0:19]

For each prediction, the first number in the `[]`s is the probability of `Missed` and the second number is the probability of `Paid`. (Only 0s and 1s display because, by default, a decision tree keeps splitting until nearly every leaf contains a single class, so its predicted probabilities are almost always exactly 0 or 1.)

For the `roc_curve()` function, we only need the probability of `Paid`. We can do this by providing "column criteria"...we want the 2nd column and, because Python always starts counting at 0, that means we need column 1:

In [None]:
predTree[0:19, 1]

For the "row criteria", we've been displaying rows 0 to 18 (the slice `0:19` stops just before row 19)...if you want all rows, you can just type a `:`, like this:

In [None]:
predTree[:, 1]
# this will show a preview of the first 3 and the last 3 probabilities for Paid
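
The column order of `predict_proba()` follows the model's `classes_` attribute, which scikit-learn stores in sorted order. As a minimal sketch on made-up data (the `toy` names are hypothetical), you can look up the `Paid` column instead of hard-coding the 1:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data: one feature and a Paid/Missed outcome
toyFeatures = np.array([[0.0], [0.2], [0.8], [1.0]])
toyOutcome = np.array(["Missed", "Missed", "Paid", "Paid"])

toyTree = DecisionTreeClassifier(random_state=42).fit(toyFeatures, toyOutcome)

# classes_ lists the labels in the same order as the predict_proba() columns
print(toyTree.classes_)

# Look up which column holds the probability of Paid
paidColumn = list(toyTree.classes_).index("Paid")
probPaid = toyTree.predict_proba(toyFeatures)[:, paidColumn]
print(paidColumn, probPaid.tolist())
```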

Now, we're ready to use the `roc_curve()` function. Note: Because the `outcomeTest` variable contains `Paid` and `Missed` labels instead of 0's and 1's, we need to include the `pos_label` parameter to tell the function which category equates to a 1.

In [None]:
import sklearn.metrics  # imported explicitly so roc_curve() and auc() are available

fpr_tree, tpr_tree, threshold_tree = sklearn.metrics.roc_curve(outcomeTest,
                                                               predTree[:, 1],
                                                               pos_label = "Paid")

fpr_tree   # an array of false positive rates at a range of thresholds
tpr_tree   # an array of true positive rates at a range of thresholds
threshold_tree  # an array of the thresholds used to calculate the rates
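
To see what these three arrays look like on a small scale, here is a sketch that runs `roc_curve()` on four made-up customers (toy values, not the credit card data). Each position in the arrays describes one threshold:

```python
import numpy as np
import sklearn.metrics

# Four made-up customers: actual outcome and predicted probability of Paid
toyActual = np.array(["Missed", "Missed", "Paid", "Paid"])
toyProbPaid = np.array([0.10, 0.40, 0.35, 0.80])

toyFpr, toyTpr, toyThresholds = sklearn.metrics.roc_curve(toyActual,
                                                          toyProbPaid,
                                                          pos_label = "Paid")

print(toyFpr.tolist())   # [0.0, 0.0, 0.5, 0.5, 1.0]
print(toyTpr.tolist())   # [0.0, 0.5, 0.5, 1.0, 1.0]
print(toyThresholds)     # the first entry is a placeholder above every score
```

The first threshold is a placeholder that sits above every predicted score, so that the curve starts at (0, 0) with no one classified as `Paid`. Its exact value depends on the scikit-learn version.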

Now, let's plot the curve:

In [None]:
import matplotlib.pyplot as plt
plt.plot(fpr_tree, tpr_tree, label = "Decision Tree") #plots the ROC curve
plt.plot([0,1], [0,1])                                #plots a reference line
plt.legend()                                          #displays the legend
plt.xlabel("False Positive Rate -->")                 #labels the x-axis
plt.ylabel("True Positive Rate -->")                  #labels the y-axis

Now you try...get predicted probabilities for the Random Forest model (`modForest`) and then plot the ROC curve:

Now, copy the `plt.plot()` line of code that drew the Random Forest ROC curve and paste it into the code cell with the Decision Tree ROC curve. See what happens when you combine the code.

Run the following code cell to get the predicted probabilities and ROC curve for the Neural Network model and then copy/paste the `plt.plot()` line of code into the code cell for the Decision Tree above in order to display all 3 curves on top of each other.

In [None]:
predNN = modNN.predict_proba(featuresTest_norm)
fpr_NN, tpr_NN, threshold_NN = sklearn.metrics.roc_curve(outcomeTest,
                                                     predNN[:, 1],
                                                     pos_label = "Paid")

plt.plot(fpr_NN, tpr_NN, label = "Neural Network")
plt.plot([0,1], [0,1])
plt.legend()
plt.xlabel("False Positive Rate -->")
plt.ylabel("True Positive Rate -->")

For support vector machines, the prediction of whether a customer `Missed` or `Paid` is based on the distance to the separating hyperplane rather than on a probability. Therefore, we need to use `decision_function()` instead of `predict_proba()` to get scores for the ROC curve (by default, `SVC` does not provide `predict_proba()`). This returns one number per customer instead of two, so we also don't need to include `[:, 1]` in the `roc_curve()` command. Other than that, the rest of the code is the same:

In [None]:
predSVM = modSVM.decision_function(featuresTest_norm)
fpr_SVM, tpr_SVM, threshold_SVM = sklearn.metrics.roc_curve(outcomeTest,
                                                       predSVM,
                                                       pos_label = "Paid")

plt.plot(fpr_SVM, tpr_SVM, label = "Support Vector Machine")
plt.plot([0,1], [0,1])
plt.legend()
plt.xlabel("False Positive Rate -->")
plt.ylabel("True Positive Rate -->")
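
As a quick check on what `decision_function()` returns, the sketch below fits an SVM to made-up data (the `toy` names and values are hypothetical). For a two-class model, a positive score corresponds to the second label in `classes_` (here `Paid`) and a negative score to the first, which is why the scores can be passed to `roc_curve()` with `pos_label = "Paid"`:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data: one feature and a Paid/Missed outcome
toyFeatures = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
toyOutcome = np.array(["Missed"] * 3 + ["Paid"] * 3)

toySVM = SVC(random_state=42).fit(toyFeatures, toyOutcome)

toyScores = toySVM.decision_function(toyFeatures)  # one signed score per customer
toyLabels = toySVM.predict(toyFeatures)

print(toySVM.classes_)  # label order: a positive score points to the second label
for score, label in zip(toyScores, toyLabels):
    print(f"{score:+.3f} -> {label}")
```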

Based on a comparison of the ROC curves, which model seems to be doing a better job of balancing predictions across the different thresholds?

## AUC (Area under the Curve)

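While it's nice to be able to visually compare the ROC curves, the area under the curve (AUC) summarizes each curve in a single number. The AUC is the probability that the model gives a randomly chosen customer who paid a higher score than a randomly chosen customer who missed: an AUC of 0.5 is no better than guessing and an AUC of 1.0 is a perfect ranking. We can use the `auc()` function from `sklearn.metrics` to calculate it: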

In [None]:
auc_tree = sklearn.metrics.auc(fpr_tree, tpr_tree)
auc_tree

We would interpret this as follows: if we pick one customer who `Paid` and one customer who `Missed` at random, there is a 60.1% chance that the Decision Tree assigns the higher probability of `Paid` to the customer who actually paid.
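
One way to check what the AUC measures is to compare the area computed by `auc()` with a direct count over every (paid, missed) pair of customers. This is a minimal sketch on four made-up customers (toy values, not the credit card data), with ties counted as half:

```python
import numpy as np
import sklearn.metrics

# Four made-up customers: actual outcome and predicted probability of Paid
toyActual = np.array(["Missed", "Missed", "Paid", "Paid"])
toyProbPaid = np.array([0.10, 0.40, 0.35, 0.80])

# AUC as the area under the ROC curve
toyFpr, toyTpr, _ = sklearn.metrics.roc_curve(toyActual, toyProbPaid, pos_label = "Paid")
aucFromCurve = sklearn.metrics.auc(toyFpr, toyTpr)

# AUC as the share of (paid, missed) pairs where the paid customer scores higher
paidScores = toyProbPaid[toyActual == "Paid"]
missedScores = toyProbPaid[toyActual == "Missed"]
pairResults = [1.0 if p > m else 0.5 if p == m else 0.0
               for p in paidScores for m in missedScores]
aucFromPairs = np.mean(pairResults)

print(aucFromCurve, aucFromPairs)  # both are 0.75
```

Three of the four pairs are ranked correctly (only the paid customer at 0.35 scores below the missed customer at 0.40), so both calculations give 0.75.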

Now you try...calculate the AUCs for the remaining 3 models. Which model has the highest AUC? Does this make sense based on the ROC curves?

## Youden's J Statistic

Youden's J statistic is an easy way to determine which threshold best balances the true vs. false positive rates. It's calculated as `True Positive Rate - False Positive Rate`.

We can calculate the Youden's J statistic for every threshold used when calculating the `roc_curve()` and then use the corresponding threshold for the highest J statistic. 
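
Here is the calculation on a small made-up example (eight hypothetical customers, not the credit card data), so that each step can be checked by hand:

```python
import numpy as np
import sklearn.metrics

# Eight made-up customers: actual outcome and predicted probability of Paid
toyActual = np.array(["Paid", "Paid", "Paid", "Paid", "Paid",
                      "Missed", "Missed", "Missed"])
toyProbPaid = np.array([0.40, 0.65, 0.70, 0.80, 0.90,
                        0.10, 0.30, 0.60])

toyFpr, toyTpr, toyThresholds = sklearn.metrics.roc_curve(toyActual,
                                                          toyProbPaid,
                                                          pos_label = "Paid")

# Youden's J at every threshold, then the threshold with the highest J
toyJ = toyTpr - toyFpr
bestIndex = np.argmax(toyJ)
bestThreshold = toyThresholds[bestIndex]

print(bestThreshold, toyJ[bestIndex])  # threshold 0.65 with J of about 0.8
```

At a threshold of 0.65, four of the five paid customers are classified as `Paid` (true positive rate of 0.8) and none of the missed customers are (false positive rate of 0), which gives the largest gap between the two rates.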

Based on the highest AUC, let's find the optimal threshold for the neural network:

In [None]:
# calculate the Youden's J stat for each threshold
Jstat_NN = tpr_NN - fpr_NN

# locate the index location of the highest J stat
import numpy as np
optimal_index = np.argmax(Jstat_NN)

# locate the threshold at that index location
optimal_threshold = threshold_NN[optimal_index] 
optimal_threshold

This suggests that, if we use the neural network model, any prediction with a probability of `Paid` below 0.747 should be classified as `Missed` and anything at or above that threshold should be classified as `Paid`.

Here's how to get the new classification report based on this threshold:

In [None]:
# recalculate predictions using the optimal threshold (True means "Paid")
isPaid_optimal = predNN[:, 1] >= optimal_threshold

# convert the predicted True/False values to "Paid" or "Missed"
predNN_optimal = np.where(isPaid_optimal, "Paid", "Missed")

# print the classification report
print(sklearn.metrics.classification_report(outcomeTest, predNN_optimal))

As a reminder, here's the original classification report using a 50% threshold:

In [None]:
predNN_original = modNN.predict(featuresTest_norm)
print(sklearn.metrics.classification_report(outcomeTest, predNN_original))

The overall accuracy of the model decreased from 82% to 77%. However, notice the F1 score for the `Missed` category increased from 44% to 51%. 

If it's more costly to the bank to misclassify customers who miss their payments, the increased F1 score on `Missed` customers may outweigh the decreased overall accuracy.

# Another Example

For practice, see if you can build Decision Tree, Random Forest, SVM, and Neural Network models using the `diabetes.csv` data set and then evaluate the models using ROC curves, AUC, and Youden's J statistic. For your models, use the variable `Outcome` as your outcome variable and all other variables as your features. (For the neural network model, use `hidden_layer_sizes = (40,40)` and include `max_iter = 1000` as a parameter.)