In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

**ROC AUC curves compare the TPR and the FPR for different classification thresholds for a classifier.
ROC AUC curves help us select the best model for a job, by evaluating how well a model distinguishes between classes.**

Legend:

ROC = receiver operating characteristic

AUC = area under curve

TPR = true positive rate

FPR = false positive rate

In [None]:
import pandas as pd

# Use a name other than `dir`, which would shadow Python's built-in dir()
data_path = '../input/sms-spam-collection-dataset/spam.csv'
df = pd.read_csv(data_path, encoding='ISO-8859-1')
df.head()

> Convert v1 into the labels, y, and v2 into the features, X. Labels need to be integers to feed into a model, so set spam to 1 and ham to 0. If you don’t know, ham means a message that is not spam.

In [None]:
import numpy as np
y = np.array([(1 if i=='spam' else 0) for i in df.v1.tolist()])
X = np.array(df.v2.tolist())

X is now an array of strings and y is an array of 1s and 0s.
Split the data into train and test sets. Note that the data has not been vectorized yet.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
splitter = StratifiedShuffleSplit(
    n_splits=1, test_size=0.3, random_state=0)
for train_index, test_index in splitter.split(X, y):
    X_train_pre_vectorize, X_test_pre_vectorize = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Fit a vectorizer on the train set, then use it to transform both the train and test sets.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_pre_vectorize)
X_test = vectorizer.transform(X_test_pre_vectorize)
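
A minimal sketch of why the vectorizer is fit on the train set only. The messages in `toy_train` and `toy_test` are made up for illustration and are not taken from the dataset. Words that were not seen during fitting are ignored at transform time, so the test matrix ends up with the same columns as the train matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, for illustration only
toy_train = ["free prize now", "see you now"]
toy_test = ["free free money"]

toy_vectorizer = CountVectorizer()
toy_train_counts = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
toy_test_counts = toy_vectorizer.transform(toy_test)        # reuses that vocabulary

# Words learned from the training messages, in column order
print(sorted(toy_vectorizer.vocabulary_))
# "money" was never seen during fitting, so it is dropped; "free" is counted twice
print(toy_test_counts.toarray())
```

Both matrices have five columns here, one per word in the training vocabulary.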

Choose a classifier and fit it on the train set. I’ve arbitrarily chosen LogisticRegression.

In [None]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

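Normally this is where we’d predict classes for the test set, but since we’re just interested in building the ROC curve, skip it.
Let’s predict the probabilities of the classes instead. predict_proba() already returns a numpy array with one column per class, in the order given by classifier.classes_ (here column 0 is ham and column 1 is spam).

In [None]:
# One row per test message, one column per class
y_score = classifier.predict_proba(X_test)
print(y_score)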

The following code can be a little confusing. With 3 (or more) classes, label_binarize() converts a single y value [2] into [0 0 1], or [0] into [1 0 0]. With only 2 classes it returns a single column instead, so we call numpy’s hstack to rebuild the two-column layout.

In [None]:
from sklearn.preprocessing import label_binarize
y_test_bin = label_binarize(y_test, neg_label=0, pos_label=1, classes=[0,1])
y_test_bin = np.hstack((1 - y_test_bin, y_test_bin))
print(y_test_bin)
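
To see the difference described above, here is a small sketch on made-up label lists (the variable names are illustrative only). It compares the output shape of label_binarize() for three classes and for two classes, then applies the same hstack step.

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Three classes: one column per class (one-hot rows)
three_class = label_binarize([2, 0, 1], classes=[0, 1, 2])
print(three_class.shape)

# Two classes: a single column holding the positive-class indicator
two_class = label_binarize([1, 0, 1], classes=[0, 1])
print(two_class.shape)

# Stacking (1 - column) next to the column gives one column per class again
two_class_full = np.hstack((1 - two_class, two_class))
print(two_class_full)
```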

Generate the curve.

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in [0,1]:
    # collect labels and scores for the current index
    labels = y_test_bin[:, i]
    scores = y_score[:, i]
    
    # calculates FPR and TPR for a number of thresholds
    fpr[i], tpr[i], thresholds = roc_curve(labels, scores)
    
    # given points on a curve, this calculates the area under it
    roc_auc[i] = auc(fpr[i], tpr[i])

At this point, the loop above has calculated the ROC curve for the 0 and 1 class separately. We can also combine them and generate a single curve.
Disclaimer: This makes more sense when classes are balanced. Otherwise it may obscure the fact that the model is doing poorly in one class if it does well in the other. But we’ll do it here anyway to learn how.
We’ll use “micro-averaging”: flatten the binarized labels for both classes into one array, do the same with the scores, and compute a single curve from the pooled values. .ravel() will do the flattening for us. So for example, [[1,0],[0,1]] becomes [1,0,0,1].

In [None]:
fpr["micro"], tpr["micro"], _ = roc_curve(y_test_bin.ravel(), y_score.ravel())
roc_auc['micro'] = auc(fpr["micro"], tpr["micro"])
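
A small sketch of what .ravel() does to the two arrays, using made-up labels and probabilities for three messages (`toy_labels` and `toy_scores` are illustrative names). Because both arrays are flattened in the same row-by-row order, every label stays paired with its score.

```python
import numpy as np

# Hypothetical binarized labels and predicted probabilities for 3 messages
toy_labels = np.array([[1, 0], [0, 1], [1, 0]])
toy_scores = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])

# .ravel() lays the rows end to end, so (label, score) pairs stay aligned
flat_labels = toy_labels.ravel()
flat_scores = toy_scores.ravel()
print(flat_labels)
print(flat_scores)
```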

In [None]:
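plt.figure()
lw = 2
# Plot the curve for class 1 (spam); swap in fpr["micro"], tpr["micro"]
# and roc_auc["micro"] to draw the micro-averaged curve instead
plt.plot(fpr[1], tpr[1], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[1])
# Dashed diagonal: what random guessing would look like
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')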
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()


The greater the area under the orange curve, the better the model can distinguish between classes. Another way to look at this: the closer the curve is to the top left, the better.

What do we mean by “different classification thresholds”?

If you’re used to using sklearn classifiers out of the box, you’ll know that .predict() outputs the predicted class. But you may not know that, for a binary classifier like this one, the prediction is effectively based on a 50% classification threshold by default.
Besides .predict(), most classifiers, like LogisticRegression, also have a method called predict_proba(), which predicts the probability that an example falls into each class.
Using this, you can recalculate the predicted classes with whatever threshold you specify.
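
As a minimal sketch of that idea, the helper below applies a custom threshold to a handful of made-up spam probabilities. `classify` and `spam_proba` are hypothetical names; in the notebook the probabilities would come from the spam column of predict_proba().

```python
import numpy as np

# Hypothetical spam probabilities, e.g. the second column of predict_proba()
spam_proba = np.array([0.05, 0.40, 0.55, 0.95])

def classify(proba, threshold=0.5):
    """Label a message as spam (1) when its probability reaches the threshold."""
    return (proba >= threshold).astype(int)

print(classify(spam_proba))        # default 0.5 threshold
print(classify(spam_proba, 0.3))   # lower threshold: more messages flagged
print(classify(spam_proba, 0.9))   # higher threshold: fewer messages flagged
```

Each threshold gives one (FPR, TPR) point; sweeping the threshold from high to low traces out the ROC curve.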

Why ROC and AUC?

The curve is a “ROC curve”, which plots the TPR against the FPR at different thresholds.
AUC is just the area under that curve. It’s a way to summarize in a single number how well the model separates the classes without looking at the curve, or to compare curves between 2 models when eyeballing areas under the curve doesn’t give a clear winner. Note that AUC is not the same thing as accuracy.
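
A short sketch on made-up labels and scores (not the SMS data) showing two ways to get the same number: integrating the curve with auc(), as this notebook does, or calling sklearn’s roc_auc_score() directly on the labels and scores.

```python
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Hypothetical labels and scores, for illustration only
toy_y = [0, 0, 1, 1]
toy_score = [0.1, 0.4, 0.35, 0.8]

# Area computed from the points of the curve
toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_y, toy_score)
area_from_curve = auc(toy_fpr, toy_tpr)

# Area computed directly from labels and scores
area_direct = roc_auc_score(toy_y, toy_score)
print(area_from_curve, area_direct)
```

Both values are 0.75 here. One way to read that number: of the four possible (positive, negative) pairs, the positive example gets the higher score in three.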

TPR is the number of TP divided by the sum of TP and FN. Likewise, FPR is the number of FP divided by the sum of FP and TN.

Example 1: A model that predicts if an image is of a dog.

TP: correctly predicted that an image of a dog is a dog.

FN: incorrectly predicted that an image of a dog is a cat.


Example 2: A model that predicts if a message is SPAM.

TP: correctly predicted that a SPAM message is SPAM.

FN: incorrectly predicted that a SPAM message is HAM.
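
The counts above can be turned into rates with a few lines of numpy. This is a sketch on made-up labels and predictions (1 = spam, 0 = ham), with FPR computed as FP / (FP + TN).

```python
import numpy as np

# Hypothetical true labels and predictions (1 = spam, 0 = ham)
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 0, 1, 0, 0, 0, 0])

tp = np.sum((y_true_toy == 1) & (y_pred_toy == 1))  # spam predicted as spam
fn = np.sum((y_true_toy == 1) & (y_pred_toy == 0))  # spam predicted as ham
fp = np.sum((y_true_toy == 0) & (y_pred_toy == 1))  # ham predicted as spam
tn = np.sum((y_true_toy == 0) & (y_pred_toy == 0))  # ham predicted as ham

tpr_toy = tp / (tp + fn)  # share of spam that was caught
fpr_toy = fp / (fp + tn)  # share of ham that was wrongly flagged
print(tpr_toy, fpr_toy)
```

Here 2 of the 3 spam messages are caught and 1 of the 5 ham messages is wrongly flagged.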

What the ROC AUC curve does not do well

ROC curves are not the best choice for imbalanced datasets. They’re best when there’s an even split between classes. Otherwise, the model doing well making classifications for a specific class may hide the fact the model does poorly predicting other classes.

Takeaways

The ROC curve and its AUC can give us insight into a model’s ability to discriminate between classes.

Models with higher AUC are generally more performant than models with lower AUC.

ROC AUC is not best for heavily imbalanced datasets.