# What is the "Area Under the Curve" (AUC)?

It took me a while to grasp what the **ROC curve** means when I first read the [Wikipedia article](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) referenced in the description of this competition. There are many underlying concepts that must be understood before we can have a sense of why the **area under the curve** (AUC) can be used as a metric for binary classification. This tutorial is my two cents on trying to clarify some of these concepts.

Let's start by loading libraries and data, encoding categorical variables and creating train and validation sets.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from lightgbm import LGBMClassifier

train = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/train.csv', index_col='id')
test = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/test.csv', index_col='id')
target = train.pop('target')

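# Label-encode every categorical (object) column
for c in train.columns:
    if train[c].dtype == 'object':
        lbl = LabelEncoder()
        # Fit on train and test together so both share the same integer codes
        lbl.fit(list(train[c].values) + list(test[c].values))
        train[c] = lbl.transform(train[c].values)
        test[c] = lbl.transform(test[c].values)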

X_train, X_valid, y_train, y_valid = train_test_split(train, target, test_size=0.1, random_state=0)

# Base model and ROC curve

We will use the LGBM base model as a first example. Let's start by fitting the model and making predictions.

In [None]:
model = LGBMClassifier(random_state=0, metric='auc')
model.fit(X_train, y_train)
y_pred = model.predict_proba(X_valid)[:, 1]  # predicted probability of the positive class (1)

To compute the Area Under the Curve (AUC), we can use scikit-learn's `roc_auc_score` function.

In [None]:
auc = metrics.roc_auc_score(y_valid, y_pred)
print('AUC =', f'{auc:0.4f}')

To plot the curve and the area under it, we need to compute two things: the **False Positive Rate** (`fpr`) and the **True Positive Rate** (`tpr`). I will explain what those mean shortly, but for now let's just run the following code and observe the results.

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_valid, y_pred)

plt.figure(figsize=(4, 4), dpi=100)
plt.axis('scaled')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.title('ROC curve')
plt.plot(fpr, tpr, 'b')
plt.fill_between(fpr, tpr, facecolor='lightblue', alpha=0.5)
plt.text(0.95, 0.05, 'AUC = %0.4f' % auc, ha='right', fontsize=12, weight='bold', color='red')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Notice how the graph is contained inside a unit square, and so the AUC can be at most 1. This ideal situation (i.e., AUC = 1) happens when every positive example receives a higher predicted probability than every negative example, so that some threshold separates the two classes perfectly.
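As a quick sanity check of these bounds, here is a minimal sketch on made-up labels and scores (the names `y_true`, `well_ranked` and `reversed_scores` are just illustrative, not part of the competition data). It assumes scikit-learn's `roc_auc_score`, the same function used above.

```python
from sklearn.metrics import roc_auc_score

# Made-up toy data, only for illustration
y_true = [0, 0, 0, 1, 1]

# Every positive gets a higher score than every negative
well_ranked = [0.1, 0.2, 0.3, 0.6, 0.7]

# The opposite situation: every positive scores below every negative
reversed_scores = [0.9, 0.8, 0.7, 0.4, 0.3]

best_auc = roc_auc_score(y_true, well_ranked)
worst_auc = roc_auc_score(y_true, reversed_scores)

print('AUC (well ranked) =', best_auc)   # 1.0
print('AUC (reversed)    =', worst_auc)  # 0.0
```

So the area really does live between 0 and 1, with 1 being the best we can hope for.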

# True Positives, False Positives, True Negatives and False Negatives

OK, but how do we compute `tpr` and `fpr` to build the graph? In order to explain that, I will use a much smaller example, with only 10 values (`actual_values`). In what follows we will assume that **1 means "positive"** and **0 means "negative"**. As the code below shows, the list has 4 positives and 6 negatives.

In [None]:
actual_values = [0, 1, 0, 0, 0, 1, 1, 0, 1, 0]

n_positives = sum(actual_values)
n_negatives = len(actual_values) - n_positives

print('Total positives =', n_positives)
print('Total negatives =', n_negatives)

Suppose that we have the following predicted probabilities (`predicted_probs`) for these values being equal to 1 ("positive").

In [None]:
predicted_probs = [0.40, 0.95, 0.18, 0.59, 0.81, 0.61, 0.19, 0.20, 0.24, 0.24]

These probabilities are not our final predictions. They only serve as information to help us decide whether to predict 0 or 1 for each outcome. And since they are all numbers **between 0 and 1**, we need to set a `threshold` and establish that:

* if the predicted probability is above the threshold, predict 1 (positive)
* otherwise, predict 0 (negative)

The threshold can be any number in [0, 1]. In fact, the ROC curve is built considering **all numbers** in this interval. But let's go one step at a time and choose a single number as our threshold. A natural choice is 0.5.

In [None]:
threshold = 0.5

Based on the value of the threshold and on the predicted probabilities, the code below will make binary predictions for the 10 outcomes and store them in `predicted_values`.

In [None]:
def compute_predicted_values(predicted_probs, threshold):
    predicted_values = []
    for i in range(len(predicted_probs)):
        if predicted_probs[i] > threshold:
            predicted_values.append(1)
        else:
            predicted_values.append(0)
    return predicted_values

predicted_values = compute_predicted_values(predicted_probs, threshold)

print('actual_values   :', actual_values)
print('predicted_values:', predicted_values)

If you compare the two lists shown above, you will realize that there are four types of outcomes:

1. The actual value is 1 and we predicted 1: this is called a **True Positive** (TP) ✔️
2. The actual value is 0 and we predicted 0: this is called a **True Negative** (TN) ✔️
3. The actual value is 0 and we predicted 1: this is called a **False Positive** (FP) ❌
4. The actual value is 1 and we predicted 0: this is called a **False Negative** (FN) ❌

The code below classifies each prediction into one of these four cases.

In [None]:
def compute_outcomes(actual_values, predicted_values):
    outcomes = []
    for i in range(len(actual_values)):
        if actual_values[i] == 1:
            if predicted_values[i] == 1:
                outcomes.append('TP')
            else:
                outcomes.append('FN')
        else:
            if predicted_values[i] == 1:
                outcomes.append('FP')
            else:
                outcomes.append('TN')
    return outcomes
    
outcomes = compute_outcomes(actual_values, predicted_values)
                
print('outcomes:', outcomes)
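As an optional cross-check of the hand-written loop, the same four counts can be read off scikit-learn's `confusion_matrix`. This is only a sketch: the `predicted_values` list is typed in by hand as the result of thresholding the example probabilities at 0.5.

```python
from sklearn.metrics import confusion_matrix

actual_values = [0, 1, 0, 0, 0, 1, 1, 0, 1, 0]
# Predictions obtained by thresholding the example probabilities at 0.5
predicted_values = [0, 1, 0, 1, 1, 1, 0, 0, 0, 0]

# For labels 0 and 1 the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = (int(v) for v in confusion_matrix(actual_values, predicted_values).ravel())

print('TP =', tp, '| TN =', tn, '| FP =', fp, '| FN =', fn)
# TP = 2 | TN = 4 | FP = 2 | FN = 2
```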

# True Positive Rate and False Positive Rate

Of course, we want to have as many TP and TN as possible (correct predictions) and avoid FP and FN (wrong predictions). In this example and **for a threshold of 0.5**, there were 2 TP, 4 TN, 2 FP and 2 FN.

Now, to draw the ROC curve, we need the **True Positive Rate** (`tpr`) and the **False Positive Rate** (`fpr`). These are given by:

* **True Positive Rate = True Positives / Total Positives**
* **False Positive Rate = False Positives / Total Negatives**

So let's compute those values.

In [None]:
tpr = outcomes.count('TP') / n_positives
fpr = outcomes.count('FP') / n_negatives

print('True Positive Rate = %0.2f' % tpr)
print('False Positive Rate = %0.2f' % fpr)
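To see where these two numbers come from, it may help to write the arithmetic out by hand. Every actual positive ends up as either a TP or an FN, and every actual negative as either an FP or a TN, so with the counts for a threshold of 0.5 (2 TP, 4 TN, 2 FP and 2 FN) we get:

$$
\text{TPR} = \frac{TP}{TP + FN} = \frac{2}{2 + 2} = 0.50,
\qquad
\text{FPR} = \frac{FP}{FP + TN} = \frac{2}{2 + 4} \approx 0.33
$$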

# Plotting the ROC curve and computing AUC

Now let's plot the single point we just obtained, with `fpr` as the x-coordinate and `tpr` as the y-coordinate.

In [None]:
plt.figure(figsize=(4, 4), dpi=100)
plt.axis('scaled')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.title('ROC curve')
plt.plot(fpr, tpr, 'bo')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

OK, we've calculated one point of the ROC curve. But how can we get the entire graph? As mentioned before, the curve is obtained by considering all values in the interval [0, 1] as possible thresholds. Let's first consider the two extreme points:

* By setting `threshold = 0`, all our predictions would be equal to 1 (unless we have predicted probabilities exactly equal to 0, which is rare). If everything is predicted as positive, then there can only be TP and FP (and no TN or FN). Therefore we must also have `fpr = 1` and `tpr = 1`, which corresponds to the upper right corner of the graph, i.e. the point with coordinates (1, 1).
* By setting `threshold = 1`, nothing will be predicted as positive, so TP = FP = 0 and consequently `fpr = 0` and `tpr = 0`, which corresponds to the lower left corner of the graph, i.e. the point with coordinates (0, 0).
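The two extreme cases above can be checked on the 10-value example. The helper `rates_at` below is a hypothetical name introduced only for this sketch; it applies the same rule as before (predict 1 when the probability is strictly above the threshold).

```python
actual_values = [0, 1, 0, 0, 0, 1, 1, 0, 1, 0]
predicted_probs = [0.40, 0.95, 0.18, 0.59, 0.81, 0.61, 0.19, 0.20, 0.24, 0.24]

def rates_at(threshold):
    """Return (fpr, tpr) when predicting 1 for probabilities strictly above the threshold."""
    predicted = [1 if p > threshold else 0 for p in predicted_probs]
    tp = sum(1 for a, p in zip(actual_values, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual_values, predicted) if a == 0 and p == 1)
    n_pos = sum(actual_values)
    n_neg = len(actual_values) - n_pos
    return fp / n_neg, tp / n_pos

print('threshold = 0 ->', rates_at(0))  # (1.0, 1.0), the upper right corner
print('threshold = 1 ->', rates_at(1))  # (0.0, 0.0), the lower left corner
```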

For every value of the threshold between 0 and 1, we will obtain a new point of the curve. If we start from 1 and continuously decrease the value of the threshold down to 0, the number of TP and FP will either increase or remain unchanged. Therefore, the **ROC curve goes from (0, 0) to (1, 1) continuously and is monotonically non-decreasing**: it may have flat or vertical stretches, but it never moves down or to the left.

Notice, however, that if we vary the threshold from 1 to 0, the values of `fpr` and `tpr` will only change when the threshold "crosses" one of the values in the list of predicted probabilities (`predicted_probs`). Hence we only need to use these values (plus 0 and 1) as thresholds to build the curve. Let's do just that.

In [None]:
thresholds = [1] + sorted(predicted_probs, reverse=True) + [0]

def compute_tpr_fpr(actual_values, predicted_probs, thresholds, n_positives, n_negatives):
    tpr = []
    fpr = []
    for threshold in thresholds:
        predicted_values = compute_predicted_values(predicted_probs, threshold)
        outcomes = compute_outcomes(actual_values, predicted_values)
        tpr.append(outcomes.count('TP') / n_positives)
        fpr.append(outcomes.count('FP') / n_negatives)
    return tpr, fpr

tpr, fpr = compute_tpr_fpr(actual_values, predicted_probs, thresholds, n_positives, n_negatives)

The code above is not very efficient, but I will stick to it for presentation purposes. Now let's plot the curve.

In [None]:
plt.figure(figsize=(4, 4), dpi=100)
plt.axis('scaled')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.title('ROC curve')
plt.plot(fpr, tpr, 'bo-')
plt.fill_between(fpr, tpr, facecolor='lightblue', alpha=0.5)
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
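If you are not fully convinced that only the predicted probabilities (plus 0 and 1) matter as thresholds, here is a small check, written as a sketch with NumPy. It sweeps a fine grid of 1001 thresholds and compares the set of (fpr, tpr) points with the one obtained from the short list of candidate thresholds. The function name `roc_points` and the grid size are my own choices for illustration.

```python
import numpy as np

actual_values = np.array([0, 1, 0, 0, 0, 1, 1, 0, 1, 0])
predicted_probs = np.array([0.40, 0.95, 0.18, 0.59, 0.81, 0.61, 0.19, 0.20, 0.24, 0.24])

def roc_points(actual, probs, thresholds):
    """Return the set of distinct (fpr, tpr) points reached by the given thresholds."""
    positives = actual == 1
    negatives = ~positives
    points = set()
    for t in thresholds:
        predicted_positive = probs > t  # same strict rule used in this notebook
        tpr = (predicted_positive & positives).sum() / positives.sum()
        fpr = (predicted_positive & negatives).sum() / negatives.sum()
        points.add((float(fpr), float(tpr)))
    return points

# Thresholds taken from the predictions themselves, plus the two extremes
candidate_thresholds = [1.0] + sorted(predicted_probs.tolist(), reverse=True) + [0.0]
# A much finer sweep over the whole interval [0, 1]
dense_thresholds = np.linspace(0.0, 1.0, 1001)

points_from_candidates = roc_points(actual_values, predicted_probs, candidate_thresholds)
points_from_dense_grid = roc_points(actual_values, predicted_probs, dense_thresholds)

print(points_from_candidates == points_from_dense_grid)  # True
```

The finer grid does a lot more work, but it lands on exactly the same points.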

Finally, let's compute the AUC. Because the curve is piecewise linear, we can calculate this area by dividing it into trapezoids. Thankfully, there is a numpy function that does it for us.

In [None]:
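# np.trapz was renamed to np.trapezoid in NumPy 2.0; use whichever is available
trapezoid = getattr(np, 'trapezoid', None) or np.trapz
auc = trapezoid(tpr, fpr)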
print('AUC =', f'{auc:0.4f}')
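To tie this back to the beginning of the notebook, the sketch below redoes the trapezoid sum by hand on the 10-value example and compares it with scikit-learn's `roc_auc_score`. It takes the curve points from `metrics.roc_curve`, which by default drops some intermediate points lying on a straight line, so the area is unaffected. The variable names are only illustrative.

```python
import numpy as np
from sklearn import metrics

actual_values = [0, 1, 0, 0, 0, 1, 1, 0, 1, 0]
predicted_probs = [0.40, 0.95, 0.18, 0.59, 0.81, 0.61, 0.19, 0.20, 0.24, 0.24]

fpr, tpr, _ = metrics.roc_curve(actual_values, predicted_probs)

# Trapezoid rule by hand: width of each step times the average height of its two ends
widths = np.diff(fpr)
mean_heights = (tpr[1:] + tpr[:-1]) / 2
manual_auc = float(np.sum(widths * mean_heights))

sklearn_auc = metrics.roc_auc_score(actual_values, predicted_probs)

print(f'manual AUC  = {manual_auc:0.4f}')   # 0.6042
print(f'sklearn AUC = {sklearn_auc:0.4f}')  # 0.6042
```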

This notebook presented some of the concepts necessary to understand the ROC curve and the AUC. Of course, there is more to be explored regarding the subject, but I prefer to keep it short and simple.

Thanks for reaching the end of it! 😊