# Unit 4 Building and Understanding AUC-ROC

Welcome! Today, we are diving into a fascinating metric in machine learning called AUC-ROC.

Our goal for this lesson is to understand what AUC (Area Under the Curve) and ROC (Receiver Operating Characteristic) are, how to calculate and interpret the AUC-ROC metric, and how to visualize the ROC curve using Python. Ready to explore? Let's get started!

### Understanding ROC

**ROC (Receiver Operating Characteristic):** This graph shows the performance of a classification model at different threshold settings. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR). In this context, a **threshold** is the cutoff that decides whether a predicted probability counts as a positive or a negative outcome. For example, with a threshold of 0.5, any predicted probability of 0.5 or higher is classified as positive, and anything below is classified as negative. By varying this threshold, we generate different True Positive and False Positive rates, which are then used to plot the ROC curve.
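To make the idea of a threshold concrete, here is a minimal sketch that applies three cutoffs to a handful of made-up probabilities. The `classify` helper and the sample values are illustrative assumptions, not part of the lesson's dataset, and the sketch treats a score equal to the threshold as positive.

```python
# Made-up predicted probabilities (illustrative only)
probs = [0.10, 0.45, 0.50, 0.80]

def classify(probabilities, threshold):
    """Hypothetical helper: label a score 1 when it is at or above the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, classify(probs, threshold))
# 0.3 [0, 1, 1, 1]
# 0.5 [0, 0, 1, 1]
# 0.7 [0, 0, 0, 1]
```

Raising the threshold turns more predictions negative, which is what moves a model along its ROC curve.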

Imagine you have a medical test used to detect a particular disease. **True Positive Rate (TPR)** measures how effective the test is at correctly identifying patients who have the disease (true positives). **False Positive Rate (FPR)**, on the other hand, measures how often the test incorrectly indicates the disease in healthy patients (false positives).

$TPR = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$

$FPR = \frac{\text{False Positives (FP)}}{\text{False Positives (FP)} + \text{True Negatives (TN)}}$
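As a quick worked example of the two formulas, the sketch below plugs in hypothetical counts for a disease test. The numbers are invented purely for illustration.

```python
# Hypothetical confusion-matrix counts for a disease test (made up for illustration)
tp, fn = 40, 10  # 50 patients actually have the disease
fp, tn = 5, 95   # 100 patients are actually healthy

tpr = tp / (tp + fn)  # share of sick patients the test catches
fpr = fp / (fp + tn)  # share of healthy patients wrongly flagged

print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")  # TPR = 0.80, FPR = 0.05
```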

Note that:

  * When the threshold is set above every possible score (any value greater than 1), we classify all values as negatives, resulting in both TPR and FPR being 0.
  * When the threshold is set to 0, every predicted probability is at or above it, so we classify all values as positives, resulting in both TPR and FPR being 1.
  * That means that the ROC curve will always start at point (0, 0) and end at point (1, 1).
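The two extremes can be checked directly. The sketch below uses a small illustrative sample and a hypothetical helper, `rates_at`, that predicts positive whenever a score is at or above the threshold. Because of that "at or above" rule, it uses a threshold slightly greater than 1 for the all-negative case.

```python
# Illustrative labels and scores (assumed sample data)
y_true = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]
y_scores = [0.0, 0.4, 0.5, 0.8, 0.4, 0.8, 0.5, 0.8, 0.7, 0.5, 1.0]

def rates_at(threshold):
    """Hypothetical helper: return (TPR, FPR) when score >= threshold means positive."""
    tp = sum(1 for y, s in zip(y_true, y_scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, y_scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives

print(rates_at(1.1))  # threshold above every score: (0.0, 0.0)
print(rates_at(0.0))  # threshold at or below every score: (1.0, 1.0)
```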

### Plotting the ROC Curve

Visualizing the ROC curve helps us understand model performance at different thresholds. Let's look at a Python code snippet to see these concepts in action. We'll manually calculate the ROC data and then plot it using `matplotlib`.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix

# Sample binary classification data
y_true = np.array([0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1])
y_scores = np.array([0.0, 0.4, 0.5, 0.8, 0.4, 0.8, 0.5, 0.8, 0.7, 0.5, 1])

# Get unique thresholds, plus one above every score so the curve includes (0, 0)
thresholds = np.append(np.sort(np.unique(y_scores)), np.inf)

# Initialize lists to hold TPR and FPR values
tpr = []
fpr = []

# Calculate TPR and FPR for each threshold
for thresh in thresholds:
    y_pred = (y_scores >= thresh).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    tpr.append(tp / (tp + fn))  # True Positive Rate
    fpr.append(fp / (fp + tn))  # False Positive Rate

# Plotting ROC curve
plt.plot(fpr, tpr, marker='.')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
```

In the example above, `y_true` represents the true labels, and `y_scores` is an array with the predicted probabilities.

Breaking Down the Code:

  * **Import Libraries**: Import necessary libraries. We use `numpy` for numerical operations, `matplotlib` for plotting, and `confusion_matrix` from `sklearn.metrics` to compute confusion matrix values.
  * **Define True Labels and Scores**: `y_true` holds the binary class labels (0 for class 0, 1 for class 1), and `y_scores` contains the predicted probabilities.
  * **Get Unique Thresholds**: Extract unique threshold values from `y_scores` using `np.sort` and `np.unique`.
  * **Initialize TPR and FPR Lists**: These lists will collect True Positive Rate (TPR) and False Positive Rate (FPR) values for each threshold.
  * **Calculate TPR and FPR for Each Threshold**: Iterate over the thresholds, make predictions based on the current threshold, and compute `tn`, `fp`, `fn`, and `tp` using `confusion_matrix`. Use these values to compute TPR and FPR at each threshold and append them to their respective lists.
  * **Plot the Curve**: Use `matplotlib.pyplot` to plot these values. `plt.plot(fpr, tpr, marker='.')` plots the ROC curve with points marked by dots.
  * **Add Labels**: Add labels to the x- and y-axes and a title with `plt.xlabel('False Positive Rate')`, `plt.ylabel('True Positive Rate')`, and `plt.title('ROC Curve')`.

Running this code, you'll see a graph (ROC curve) showing how TPR and FPR change with different threshold values.

### Understanding AUC-ROC

**AUC (Area Under the Curve):** This single-number summary indicates how well the model distinguishes between the two classes. An AUC of 1 means perfect distinction, while an AUC of 0.5 means the model's predictions are no better than random guessing.
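One way to build intuition for this number: the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. Here is a minimal sketch under that interpretation, using illustrative data and plain Python rather than `sklearn`:

```python
from itertools import product

# Illustrative labels and scores (assumed sample data)
y_true = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]
y_scores = [0.0, 0.4, 0.5, 0.8, 0.4, 0.8, 0.5, 0.8, 0.7, 0.5, 1.0]

pos = [s for s, y in zip(y_scores, y_true) if y == 1]
neg = [s for s, y in zip(y_scores, y_true) if y == 0]

# Score each (positive, negative) pair: 1 if ranked correctly, 0.5 for a tie, else 0
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
auc = wins / (len(pos) * len(neg))

print(f"{auc:.4f}")  # 0.6167
```

Here 18.5 of the 30 pairs are ranked correctly, which is only modestly better than the 0.5 expected from random scores.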

Why AUC-ROC is Useful:

  * **Useful for Imbalanced Classes**: AUC is particularly useful when you have imbalanced classes. While accuracy can be misleading, AUC gives a better measure of model performance by focusing on the balance between TPR and FPR.
  * **Threshold Independence**: AUC-ROC evaluates the model performance across all classification threshold values, giving a comprehensive overview compared to metrics like precision or recall, which are threshold-dependent.
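The point about imbalanced classes can be illustrated with a deliberately extreme, made-up case: a "model" that assigns the same low score to every example. The sketch below assumes a 0.5 threshold for accuracy and uses a hypothetical `pairwise_auc` helper that computes AUC as the share of correctly ranked (positive, negative) pairs, counting ties as half.

```python
from itertools import product

def pairwise_auc(y_true, y_scores):
    """Hypothetical helper: AUC as the share of correctly ranked (positive, negative) pairs."""
    pos = [s for s, y in zip(y_scores, y_true) if y == 1]
    neg = [s for s, y in zip(y_scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Made-up imbalanced data: 5 positives, 95 negatives
y_true = [1] * 5 + [0] * 95
# Every example gets the same score, so the model cannot rank anything
y_scores = [0.1] * 100

y_pred = [1 if s >= 0.5 else 0 for s in y_scores]
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
auc = pairwise_auc(y_true, y_scores)

print(f"Accuracy: {accuracy:.2f}, AUC: {auc:.2f}")  # Accuracy: 0.95, AUC: 0.50
```

Accuracy looks strong only because negatives dominate the sample, while the AUC of 0.5 shows the scores carry no ranking information.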

### Comparing Models Using AUC-ROC

Let's define another set of predictions which are more accurate:

```python
y_scores_better = np.array([0.0, 0.2, 0.6, 0.4, 0.9, 0.7, 0.9, 0.8, 0.9, 0.5, 1])
```

And plot the ROC curve for both sets of predictions. This time we will calculate the TPR and FPR lists more simply, using the `roc_curve` function from `sklearn.metrics`:

```python
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt # Ensure this is imported if not already in the scope
import numpy as np # Ensure this is imported if not already in the scope

# (y_true and y_scores, y_scores_better are defined above)

# Calculate ROC curve for first set of scores
fpr1, tpr1, _ = roc_curve(y_true, y_scores)

# Calculate ROC curve for second set of scores
fpr2, tpr2, _ = roc_curve(y_true, y_scores_better)

# Plotting both ROC curves
plt.plot(fpr1, tpr1, marker='.', label='model 1')
plt.plot(fpr2, tpr2, marker='.', label='model 2')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
```

The orange curve (Model 2) has a greater area under itself than the blue one (Model 1), which indicates a better performance of the corresponding model.
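The visual comparison can be backed by a number. As a sketch, `sklearn.metrics.auc` applies the trapezoidal rule to the points returned by `roc_curve`. The arrays are repeated here so the snippet runs on its own, and the variable names `area1` and `area2` are illustrative.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1])
y_scores = np.array([0.0, 0.4, 0.5, 0.8, 0.4, 0.8, 0.5, 0.8, 0.7, 0.5, 1])
y_scores_better = np.array([0.0, 0.2, 0.6, 0.4, 0.9, 0.7, 0.9, 0.8, 0.9, 0.5, 1])

# ROC points for each set of scores
fpr1, tpr1, _ = roc_curve(y_true, y_scores)
fpr2, tpr2, _ = roc_curve(y_true, y_scores_better)

# Trapezoidal area under each curve
area1 = auc(fpr1, tpr1)
area2 = auc(fpr2, tpr2)

print(f"Area under curve, model 1: {area1:.3f}")  # 0.617
print(f"Area under curve, model 2: {area2:.3f}")  # 0.967
```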

### Calculating AUC-ROC with `sklearn`

Let's look at how to calculate the AUC-ROC score using the `roc_auc_score` function from `sklearn.metrics`:

```python
from sklearn.metrics import roc_auc_score
# (y_true, y_scores, y_scores_better are defined above)

# Calculate AUC-ROC for the first set of scores
auc_roc_1 = roc_auc_score(y_true, y_scores)
print(f"AUC-ROC (Model 1): {auc_roc_1}")  # AUC-ROC (Model 1): 0.6166666666666667

# Calculate AUC-ROC for the second set of scores
auc_roc_2 = roc_auc_score(y_true, y_scores_better)
print(f"AUC-ROC (Model 2): {auc_roc_2}")  # AUC-ROC (Model 2): 0.9666666666666668
```

Running this code, you'll see an AUC-ROC of about 0.617 for Model 1 and about 0.967 for Model 2, indicating that the second model is better at distinguishing between the classes.

### Lesson Summary

In this lesson, we learned about AUC-ROC, an essential metric for evaluating binary classification models. We understood its components: the ROC curve and the AUC value. We also saw how to calculate these metrics using Python and `sklearn.metrics`, and how to visualize the ROC curve using `matplotlib`.

Understanding and interpreting AUC-ROC helps us evaluate how well our classification model can distinguish between different classes. By visualizing the ROC curve, we can see our model's performance at various threshold values, which is invaluable for model selection and tuning.

Now it's your turn! In the practice section, you'll get hands-on experience calculating AUC-ROC. This practice will solidify your understanding and help you apply what you've learned to real-world scenarios. Enjoy the practice, and remember: learning by doing is key!

## Calculating TPR and FPR

Hey, Space Explorer! Let's dive deeper into our medical diagnosis test. Calculate the True Positive Rate (TPR) and False Positive Rate (FPR) for the following thresholds: 0.2, 0.4, 0.6, and 0.8. This will help us understand how different thresholds affect our model's ability to detect the disease.

```python
from sklearn.metrics import confusion_matrix

# True labels indicating disease presence (1) or absence (0)
y_true = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]

# Predicted probabilities from the medical test indicating likelihood of disease
y_scores = [0.1, 0.3, 0.6, 0.4, 0.7, 0.5, 0.9, 0.8, 0.9, 0.2, 1.0]

# Thresholds to evaluate
thresholds = [0.2, 0.4, 0.6, 0.8]

# TODO: Calculate TPR and FPR for different thresholds
# TODO: Print them in the format Threshold: {threshold} – TPR: {tpr}, FPR: {fpr}

```

Sure, Space Explorer! Let's calculate the TPR and FPR for the given thresholds.

```python
from sklearn.metrics import confusion_matrix

# True labels indicating disease presence (1) or absence (0)
y_true = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]

# Predicted probabilities from the medical test indicating likelihood of disease
y_scores = [0.1, 0.3, 0.6, 0.4, 0.7, 0.5, 0.9, 0.8, 0.9, 0.2, 1.0]

# Thresholds to evaluate
thresholds = [0.2, 0.4, 0.6, 0.8]

# Calculate TPR and FPR for different thresholds
for threshold in thresholds:
    y_pred = [1 if score >= threshold else 0 for score in y_scores]
    
    # Compute confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    
    # Calculate TPR and FPR
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    
    # Print the results
    print(f"Threshold: {threshold} – TPR: {tpr:.2f}, FPR: {fpr:.2f}")

```
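As a cross-check, the same rates can be computed without `confusion_matrix` by counting directly. This is a sketch, with `tpr_fpr` as a hypothetical helper name; the values in the comments follow from counting the six positive and five negative labels by hand.

```python
y_true = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]
y_scores = [0.1, 0.3, 0.6, 0.4, 0.7, 0.5, 0.9, 0.8, 0.9, 0.2, 1.0]

def tpr_fpr(threshold):
    """Hypothetical helper: count hits directly instead of building a confusion matrix."""
    tp = sum(1 for y, s in zip(y_true, y_scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, y_scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives

for threshold in [0.2, 0.4, 0.6, 0.8]:
    tpr, fpr = tpr_fpr(threshold)
    print(f"Threshold: {threshold} – TPR: {tpr:.2f}, FPR: {fpr:.2f}")
# Threshold: 0.2 – TPR: 1.00, FPR: 0.80
# Threshold: 0.4 – TPR: 1.00, FPR: 0.40
# Threshold: 0.6 – TPR: 1.00, FPR: 0.00
# Threshold: 0.8 – TPR: 0.67, FPR: 0.00
```

Raising the threshold from 0.2 to 0.6 removes every false positive without losing any true positive on this sample; only at 0.8 does the TPR start to drop.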

## Calculate AUC-ROC for Medical Test

Space Voyage Explorer, ready to boost your coding skills? Here, we calculate TPR and FPR just like we did in the previous task, but using sklearn's built-in `roc_curve` function. You can see an example of it in the starter solution.

Your goal is to repeat the following steps for a second set of predictions called `y_scores_better`:

  * Calculate the TPR and FPR arrays using the `roc_curve` function.
  * Plot the second ROC curve on the same graph by calling `plt.plot` with the new TPR and FPR arrays.

Good luck, and may the yield of your code be ever in your favor!

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Given predictions and true labels for a medical diagnosis scenario
y_true = np.array([0, 1, 1, 1, 0, 1, 0, 0, 1, 0])
y_scores = np.array([0.8, 0.7, 0.8, 0.6, 0.4, 0.25, 0.4, 0.4, 0.95, 0.22])
y_scores_better = np.array([0.4, 0.9, 0.9, 0.7, 0.5, 0.8, 0.4, 0.4, 0.5, 0.8])
fpr, tpr, _ = roc_curve(y_true, y_scores)
# TODO: calculate the FPR and TPR arrays for the second set of scores

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve')
# TODO: plot the ROC-curve for the second set of scores
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic example')
plt.legend(loc="lower right")
plt.show()

```

## Calculating AUC-ROC for Medical Diagnosis