<a href="https://colab.research.google.com/github/rhodes-byu/cs180-winter25/blob/main/notebooks/16-binary-metrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Binary Confusion Matrix

| Actual \ Predicted | Positive (P) | Negative (N) |
|---------------------|--------------|--------------|
| Positive (P)        | True Positive (TP) | False Negative (FN) |
| Negative (N)        | False Positive (FP) | True Negative (TN) |
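
To make the four cells concrete, here is a minimal sketch that counts them by hand. The `y_true` and `y_pred` lists are made-up toy labels (1 = positive, 0 = negative), not the dataset used later in this notebook.

```python
# Hypothetical toy labels (1 = positive, 0 = negative); illustrative only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))

tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # actual positive, predicted positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # actual positive, predicted negative
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # actual negative, predicted positive
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # actual negative, predicted negative

print(tp, fn, fp, tn)  # 3 1 1 5
```

Every example lands in exactly one cell, so the four counts always sum to the number of examples.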

## Precision and Recall

### Recall (True Positive Rate, Sensitivity)

Recall answers the question, "**What proportion of the actual positives did we find?**"

A high recall score means the model finds most or all of the real positive cases, potentially at the expense of more false positives.

When missing a positive case is costly (e.g., medical diagnosis), we want high recall.

The formula for Recall is:

$$
\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
$$

(Note: The denominator is just the total number of positive values in the **original labels**.)
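
As a quick worked example, the sketch below applies the formula to hypothetical counts. The helper name `recall_from_counts` and the choice to return 0.0 when there are no actual positives are assumptions made for illustration.

```python
def recall_from_counts(tp, fn):
    """Recall = TP / (TP + FN).

    Returns 0.0 when there are no actual positives; that is a convention
    assumed here, since the ratio is undefined in that case.
    """
    actual_positives = tp + fn
    return tp / actual_positives if actual_positives else 0.0

# Hypothetical screening result: 40 patients have the disease,
# the model flags 30 of them and misses 10.
recall = recall_from_counts(tp=30, fn=10)
print(recall)  # 0.75
```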

### Precision

Precision answers the question, "**What proportion of the positive predictions were correct?**"

For example, we want high precision in a spam detector: every email predicted to be spam should indeed be spam, because we don't want to lose important emails. It is not a big deal if a few spam emails reach the inbox.

The formula for Precision is:

$$
\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
$$

(Note: The denominator is just the number of values that were **predicted** positive.)
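
The same kind of worked example for precision, again with hypothetical counts. The helper name `precision_from_counts` and the 0.0 fallback when nothing is predicted positive are illustrative assumptions.

```python
def precision_from_counts(tp, fp):
    """Precision = TP / (TP + FP).

    Returns 0.0 when nothing was predicted positive; the ratio is undefined
    in that case, so this fallback is only a convention assumed here.
    """
    predicted_positives = tp + fp
    return tp / predicted_positives if predicted_positives else 0.0

# Hypothetical spam filter: 50 emails were flagged, 45 of them really are spam.
precision = precision_from_counts(tp=45, fp=5)
print(precision)  # 0.9
```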

### Recall / Precision Tradeoff:

High recall, low precision: Finds most of the real positive cases, but at the expense of more false positives as well.

High precision, low recall: All emails that are designated spam are indeed spam, but the model misses a lot of spam.


### The precision-recall curve (PR curve)

The precision-recall curve (PR curve) is a graphical representation of the tradeoff between precision and recall for different threshold values of a classification model. It is particularly useful when dealing with imbalanced datasets, where one class significantly outnumbers the other.

- **Precision** is plotted on the y-axis.
- **Recall** is plotted on the x-axis.

A good model will have a PR curve that bows towards the top-right corner, indicating high precision and recall across different thresholds.

#### How the PR Curve is Generated:
The PR curve is generated by varying the decision threshold of the classification model. For each threshold:
- Predictions are classified as positive or negative based on whether their probability score exceeds the threshold.
- Precision and recall are calculated based on the resulting confusion matrix.
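
The threshold sweep described above can be sketched in a few lines of plain Python. The labels and scores are made up for illustration, and two conventions are assumed: a score equal to the threshold counts as positive, and precision is reported as 1.0 when nothing is predicted positive.

```python
# Hypothetical labels and predicted probabilities; illustrative only.
y_true = [1, 1, 0, 1, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

def precision_recall_at(threshold, y_true, y_scores):
    """Threshold the scores, then compute precision and recall."""
    y_pred = [1 if s >= threshold else 0 for s in y_scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: precision is 1.0 when there are no positive predictions.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# Use each distinct score as a candidate threshold, from lowest to highest.
pr_points = []
for threshold in sorted(set(y_scores)):
    precision, recall = precision_recall_at(threshold, y_true, y_scores)
    pr_points.append((threshold, precision, recall))
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Each `(precision, recall)` pair is one point on the curve. In this toy data, raising the threshold never increases recall, while precision tends to rise, though not strictly at every step.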

The area under the precision-recall curve (PR-AUC) is a single scalar value that summarizes the performance of the model. A higher PR-AUC indicates better performance, with 1.0 signifying perfect precision at every level of recall. Unlike ROC-AUC, the no-skill baseline is not 0.5: a classifier that guesses at random has a PR-AUC roughly equal to the fraction of positive examples in the data.

<figure>
    <img src="https://user-images.githubusercontent.com/26833433/76019078-0a79fb00-5ed6-11ea-8b5b-5697bbbd7e7e.png" alt="PR-Curve Example" width="400"/>
    <figcaption><a href="https://github.com/ultralytics/yolov3/issues/898">https://github.com/ultralytics/yolov3/issues/898</a></figcaption>
</figure>

### When to Use the PR Curve

- **Imbalanced Datasets**: The PR curve is more informative than the ROC curve when dealing with imbalanced datasets, where the positive class is rare. Examples include fraud detection and disease diagnosis.
- **Costly False Positives**: If false positives are more significant or costly than false negatives (e.g., spam email detection), the PR curve is preferred as it emphasizes precision.

## F1 Score

The F1 score serves as a balance between precision and recall. It is the harmonic mean of precision and recall. The F1 score is calculated as follows:

$$
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

The F1 score ranges from 0 to 1, where 1 is the best possible score. A score of 1 indicates perfect precision (no false positives) and perfect recall (no false negatives), meaning the model made no classification errors. A score of 0 means the model produced no true positives, so recall is zero and precision is zero or undefined.

In classification problems with imbalanced classes (e.g., fraud detection, rare disease diagnosis), accuracy can be misleading—predicting the majority class most of the time can yield high accuracy but poor performance on the minority class.
The F1 score, by combining precision and recall, better reflects the model’s ability to correctly identify the minority class. It penalizes models that do well on only one of these two metrics, offering a more realistic view of model performance in such settings.
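
A small sketch of why the harmonic mean penalizes lopsided models. The precision and recall values are hypothetical, and the helper name `f1_from` is chosen only for this illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical model A: balanced precision and recall.
balanced = f1_from(0.8, 0.8)

# Hypothetical model B: perfect precision but very poor recall.
lopsided = f1_from(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print(f"balanced F1   = {balanced:.3f}")         # 0.800
print(f"lopsided F1   = {lopsided:.3f}")         # 0.182
print(f"lopsided mean = {arithmetic_mean:.3f}")  # 0.550
```

The arithmetic mean would rate model B at 0.55, while F1 stays close to the weaker of the two metrics.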

## Sensitivity and Specificity

### Sensitivity

Sensitivity is another term for Recall. (True positives out of all original positive labels.)

### Specificity 

The proportion of true negatives out of the actual negatives.

The formula for Specificity is:

$$
\text{Specificity} = \frac{\text{True Negatives (TN)}}{\text{True Negatives (TN)} + \text{False Positives (FP)}}
$$

(Note: The denominator is just the total number of negative values in the **original labels**.)

### False Positive Rate (FPR)

The False Positive Rate (FPR) is the proportion of actual negative instances that are incorrectly classified as positive. It is the complement of the True Negative Rate (TNR), another name for specificity, which measures the proportion of negatives correctly identified as such.

$$
\text{FPR} = 1 - \text{Specificity}
$$
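
Written in terms of counts, FPR is FP / (FP + TN), which shares its denominator with specificity. The sketch below checks the identity on hypothetical counts.

```python
# Hypothetical counts among 100 actual negatives; illustrative only.
tn, fp = 90, 10

specificity = tn / (tn + fp)  # negatives correctly identified
fpr = fp / (fp + tn)          # negatives incorrectly flagged as positive

print(specificity, fpr)  # 0.9 0.1

# The two rates split the actual negatives between them, so they sum to 1.
assert abs(fpr - (1 - specificity)) < 1e-12
```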

### The ROC Curve:

The Receiver Operating Characteristic (ROC) curve is a graphical representation of a classifier's performance across different threshold values. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR).

- **True Positive Rate (TPR)** is also known as Recall or Sensitivity.
- **False Positive Rate (FPR)** is calculated as $\text{FPR} = 1 - \text{Specificity}$.

The area under the ROC curve (ROC-AUC) is a single scalar value that summarizes the model's performance. A higher ROC-AUC indicates better performance, with a value of 1 representing a perfect model and 0.5 representing a random classifier.
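
One way to build intuition for ROC-AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on made-up data. The function name `roc_auc_by_ranking` is hypothetical, and this pairwise loop is meant for illustration, not for large datasets.

```python
def roc_auc_by_ranking(y_true, y_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for t, s in zip(y_true, y_scores) if t == 1]
    neg = [s for t, s in zip(y_true, y_scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores; illustrative only.
y_true = [1, 1, 0, 1, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

auc = roc_auc_by_ranking(y_true, y_scores)
print(f"{auc:.3f}")  # 0.889 (8 of the 9 positive/negative pairs are ordered correctly)
```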

<figure>
    <img src="https://www.mathworks.com/help/examples/nnet/win64/CompareDeepLearningModelsUsingROCCurvesExample_01.png" alt="ROC Curve Example" width="400"/>
    <figcaption>https://www.mathworks.com/help/deeplearning/ug/compare-deep-learning-models-using-ROC-curves.html</figcaption>
</figure>

### When to use which curve?

Use a ROC curve for balanced datasets to analyze the trade-off between true positive rate (TPR) and false positive rate (FPR), providing an overall measure of a model’s ability to discriminate between classes. However, it can look overly optimistic on imbalanced datasets: FPR is computed over all actual negatives, so when negatives vastly outnumber positives, many false positives barely move the curve.

Use a PR curve for imbalanced datasets, especially when the positive class is rare or false positives are costly. Since it focuses on precision (positive predictive value) and recall (TPR), it better reflects model performance when class imbalance is a concern.

In [None]:
from sklearn.metrics import (
    precision_score, recall_score, accuracy_score, roc_curve, roc_auc_score,
    precision_recall_curve, f1_score, confusion_matrix
)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import fetch_openml

In [None]:
data = fetch_openml(name="heart-disease", version=1, as_frame=True, parser = 'auto')

X = data.data
y = X.pop('target')

# data = load_breast_cancer()
# X = pd.DataFrame(data.data, columns=data.feature_names)
# y = pd.Series(data.target)

In [None]:
sns.countplot(x = y)

In [None]:
X.info()

In [None]:
X_scaled = X.apply(lambda x: (x - x.mean()) / x.std() if x.name in data.feature_names else x)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, stratify = y, random_state = 42)

In [None]:
model = LogisticRegression()
# model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)


In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Predicted Negative', 'Predicted Positive'], yticklabels=['Actual Negative', 'Actual Positive'])

In [None]:
### Recall (Of all actual positive, how many were predicted as positive)
recall = recall_score(y_test, y_pred)
print(f'Recall: {recall}')

In [None]:
### Precision (Of all predicted positive, how many were actually positive)
precision = precision_score(y_test, y_pred)
print(f'Precision: {precision}')

In [None]:
### F1 Score (Harmonic mean of precision and recall)
f1 = f1_score(y_test, y_pred)
print(f'F1 Score: {f1}')

### The Decision Function and Predict Probabilities

In [None]:
# For logistic regression, decision_function returns the logits
# y_scores = model.decision_function(X_test)

y_proba = model.predict_proba(X_test)
y_scores = y_proba[:, 1]
np.round(y_proba, 3)
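
The two kinds of score are related. Assuming the standard binary logistic model, the positive-class probability is the sigmoid of the logit, so thresholding probabilities at 0.5 matches thresholding logits at 0. A minimal sketch with made-up logit values (the `sigmoid` helper is defined here, not taken from scikit-learn):

```python
import math

def sigmoid(logit):
    """Map a logit (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical logits; illustrative only.
for logit in (-2.0, 0.0, 2.0):
    probability = sigmoid(logit)
    predicted = int(probability >= 0.5)  # same decision as logit >= 0
    print(f"logit={logit:+.1f}  probability={probability:.3f}  predicted={predicted}")
```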

In [None]:
### PR-Curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_scores)
print(len(thresholds))

In [None]:
plt.plot(thresholds)
plt.title('Decision Boundary Thresholds')
plt.xlabel('Index')
plt.ylabel('Threshold Value')

In [None]:
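# Recall goes on the x-axis and precision on the y-axis, matching the axis labels.
# plt.plot draws the points in order; sns.lineplot would by default average
# precision over repeated recall values.
plt.plot(recalls, precisions)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
for i, threshold in enumerate(thresholds):
    if i % 5 == 0:  # Label every 5th threshold to avoid clutter
        plt.text(recalls[i], precisions[i] + 0.01, f'{threshold:.2f}', fontsize=8, color='red')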



### Sensitivity and Specificity (by "hand")

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

In [None]:
print(f'Recall: {recall}')
print(f'Precision: {precision}')
print(f'F1 Score: {f1}\n')
print(f'Sensitivity: {sensitivity}')
print(f'Specificity: {specificity}')

In [None]:
def predict_from_proba(proba, threshold=0.5):
    proba = np.asarray(proba)  # Ensure numpy array

    if proba.ndim == 1:  # Handle 1D case (implicitly binary)
        y_scores = proba
    elif proba.shape[1] == 2:
        y_scores = proba[:, 1]  # Take probabilities of the positive class
    elif proba.shape[1] == 1:
        y_scores = proba.ravel()  # Flatten single-column array
    else:
        raise ValueError("Only binary cases are supported. proba must have shape (n, d) with d <= 2")

    y_pred = (y_scores >= threshold).astype(int)  # 1 where the score meets the threshold, else 0

    return y_pred


In [None]:
y_pred_thresh = predict_from_proba(y_proba, threshold=0.5)

In [None]:
# Plot the confusion matrix
cm_thresh = confusion_matrix(y_test.values, y_pred_thresh)
sns.heatmap(cm_thresh, annot=True, fmt='d', cmap='Blues', xticklabels=['Predicted Negative', 'Predicted Positive'], yticklabels=['Actual Negative', 'Actual Positive'])
plt.title('Confusion Matrix with Threshold')
plt.xlabel('Predicted Labels')
plt.ylabel('Actual Labels')
plt.show()

In [None]:
from ipywidgets import interact

@interact(threshold=(0.0, 1.0, 0.01))
def update(threshold=0.01):
    y_pred_thresh = predict_from_proba(y_proba, threshold)
    cm = confusion_matrix(y_test, y_pred_thresh)
    ax = sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    ax.set_xlabel('Predicted Labels')
    ax.set_ylabel('Actual Labels')
    plt.title(f'Confusion Matrix at threshold = {threshold:.2f}')
    plt.show()


In [None]:
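def get_threshold_metrics(y_proba, y_true, threshold=0.5):
    y_pred = predict_from_proba(y_proba, threshold)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # F1 is the harmonic mean of precision and recall, not of sensitivity and specificity.
    # This count-based form equals 2PR / (P + R) and avoids dividing by zero
    # when nothing is predicted positive.
    f1 = 2 * tp / (2 * tp + fp + fn)

    return sensitivity, specificity, f1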


In [None]:
@interact(threshold=(0.0, 1.0, 0.01))
def plot_metrics(threshold=0.5):
    sensitivity, specificity, f1 = get_threshold_metrics(y_proba, y_test, threshold)
    sensitivity, specificity, f1 = np.round(sensitivity, 3), np.round(specificity, 3), np.round(f1, 3)

    metrics = {'Sensitivity': sensitivity, 'Specificity': specificity, 'F1 Score': f1}
    plt.bar(metrics.keys(), metrics.values(), color=['blue', 'green', 'orange'])
    plt.ylim(0, 1)
    plt.title(f'Metrics at Threshold = {threshold:.2f}')
    plt.ylabel('Score')
    plt.show()

In [None]:
### ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
roc_auc = roc_auc_score(y_test, y_scores)

plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.3f}')
plt.plot([0, 1], [0, 1], linestyle='--', color='red')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve (AUC = {roc_auc:.3f})')
plt.legend(loc='lower right')
plt.show()

In [None]:
roc_auc = roc_auc_score(y_test, y_scores)
print(f"ROC AUC: {roc_auc:.3f}")