In [None]:
# Run this cell.
from lec_utils import *
import nb4 as util
plotly.io.renderers.default = 'notebook'

from sklearn.datasets import load_breast_cancer

full = load_breast_cancer()
df = pd.DataFrame(full['data'], columns=full['feature_names'])
df['target'] = 1 - full['target']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], random_state=23, stratify=df['target'])

#### DAIR-3 Workshop, Day 2 • Building Robust ML Models

# Part 4: Model Evaluation

**Instructor**: Suraj Rampure (rampure@umich.edu)

### Outline

- Classifier evaluation.
- Choosing a threshold.
- Summary.

## Classifier evaluation

---

### Outcomes in binary classification

- When performing **binary** classification, there are four possible outcomes.<br><small>Note: A "positive prediction" is a prediction of 1, and a "negative prediction" is a prediction of 0.</small>

- We typically organize the counts of these four outcomes into a **confusion matrix**.

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| **Actually Negative** | True Negative (TN) ✅ | False Positive (FP) ❌ |
| **Actually Positive** | False Negative (FN) ❌ | True Positive (TP) ✅ |


- Note that in the four acronyms – TP, FN, TN, FP – the **first letter** is whether the prediction is correct, and the **second letter** is what the prediction is.

- **Depending on the situation, false negatives may be worse than false positives (or vice versa!).**

### Example: Accuracy of COVID tests

- The results of 100 Michigan Medicine COVID tests are given below.

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| **Actually Negative** | TN = 90 ✅ | FP = 1 ❌ |
| **Actually Positive** | FN = 8 ❌ | TP = 1 ✅ |
<center><i><small>Michigan Medicine test results</small></i></center>

- 🤔 **Question:** What is the accuracy of the test?

$$
\text{accuracy} = \frac{\text{# points classified correctly}}{\text{# points}}
$$

- **🙋 Answer:** $$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN} = \frac{1 + 90}{100} = 0.91$$

- **Followup:** At first, the test seems good. But, suppose we build a classifier that predicts that **nobody has COVID**. What would its accuracy be?

- **Answer to followup:** Also 0.91! There is severe **class imbalance** in the dataset, meaning that most of the data points are in the same class (no COVID). **Accuracy doesn't tell the full story!**
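
As a quick check of the arithmetic above, the sketch below recomputes both accuracies directly from the four counts in the table. The variable names are illustrative and are not part of the workshop's helper code.

```python
# Counts from the Michigan Medicine confusion matrix above.
TN, FP, FN, TP = 90, 1, 8, 1
total = TN + FP + FN + TP

# Accuracy of the actual test.
test_accuracy = (TP + TN) / total

# A classifier that always predicts "negative" gets every actually-negative
# point right (TN + FP of them) and every actually-positive point wrong.
always_negative_accuracy = (TN + FP) / total

print(test_accuracy, always_negative_accuracy)
```

Both values come out to 0.91, which is why accuracy alone cannot separate the real test from the trivial one.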

<div class="alert alert-danger"><h3>Warning #7: Accuracy Can Mislead</h3>
    
In cases of class imbalance, accuracy can be misleading!

</div>

### Recall

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| **Actually Negative** | TN = 90 ✅ | FP = 1 ❌ |
| <span style='color:orange'><b>Actually Positive</b></span> | <span style='color:orange'>FN = 8</span> ❌ | <span style='color:orange'>TP = 1</span> ✅ |

<center><i><small>Michigan Medicine test results</small></i></center>

- 🤔 **Question:** What proportion of individuals who actually have COVID did the test **identify**?

- **🙋 Answer:** $\frac{1}{1 + 8} = \frac{1}{9} \approx 0.11$.

- More generally, the **recall** of a binary classifier is the proportion of <span style='color:orange'><b>actually positive instances</b></span> that are correctly classified. We'd like this number to be as close to 1 (100%) as possible.

$$\text{recall} = \frac{TP}{\text{# actually positive}} = \frac{TP}{TP + FN}$$

- To compute recall, look at the <span style='color:orange'><b>bottom (positive) row</b></span> of the above confusion matrix.

### Recall isn't everything, either!

$$\text{recall} = \frac{TP}{TP + FN}$$

- 🤔 **Question:** Can you design a "COVID test" with perfect recall?

- **🙋 Answer:** Yes – **just predict that everyone has COVID!**

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| **Actually Negative** | TN = 0 ✅ | FP = 91 ❌ |
| <span style='color:orange'><b>Actually Positive</b></span> | <span style='color:orange'>FN = 0</span> ❌ | <span style='color:orange'>TP = 9</span> ✅ |

<center><i><small>everyone-has-COVID classifier</small></i></center>

$$\text{recall} = \frac{TP}{TP + FN} = \frac{9}{9 + 0} = 1$$

- Like accuracy, recall on its own is not a perfect metric. Even though the classifier we just created has perfect recall, it has 91 false positives!
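
The two recall values discussed so far can be reproduced with a small helper. This is a minimal sketch; the function name `recall` is made up for illustration.

```python
def recall(tp, fn):
    """Proportion of actually positive points that were predicted positive."""
    return tp / (tp + fn)

# The real Michigan Medicine test: TP = 1, FN = 8.
michigan_recall = recall(tp=1, fn=8)

# The "everyone-has-COVID" classifier: TP = 9, FN = 0.
everyone_recall = recall(tp=9, fn=0)

print(michigan_recall, everyone_recall)
```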

### Precision

| | Predicted Negative | <span style='color:orange'>Predicted Positive</span> |
| --- | --- | --- |
| **Actually Negative** | TN = 0 ✅ | <span style='color:orange'>FP = 91</span> ❌ |
| **Actually Positive** | FN = 0 ❌ | <span style='color:orange'>TP = 9</span> ✅ |

<center><i><small>everyone-has-COVID classifier</small></i></center>

- The **precision** of a binary classifier is the proportion of <span style='color:orange'><b>predicted positive instances</b></span> that are correctly classified. We'd like this number to be as close to 1 (100%) as possible.

$$\text{precision} = \frac{TP}{\text{# predicted positive}} = \frac{TP}{TP + FP}$$

- To compute precision, look at the <span style='color:orange'><b>right (positive) column</b></span> of the above confusion matrix.<br><small>**Tip:** A good way to remember the difference between precision and recall is that in the denominator for 🅿️recision, both terms have 🅿️ in them (TP and FP).</small>

- Note that the "everyone-has-COVID" classifier has perfect recall, but a precision of $\frac{9}{9 + 91} = 0.09$, which is quite low.

- 🚨 **Key idea:** There is a "tradeoff" between precision and recall. Ideally, you want both to be high. For a particular prediction task, one may be more important than the other.
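
The same numbers can be obtained from `sklearn.metrics`. The sketch below rebuilds the "everyone-has-COVID" classifier's predictions as arrays (the arrays are constructed here for illustration, using the counts in the table above) and passes them to `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# 91 people without COVID (0) followed by 9 people with COVID (1).
y_true = np.array([0] * 91 + [1] * 9)
# The "everyone-has-COVID" classifier predicts 1 for every person.
y_pred = np.ones(100, dtype=int)

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 9 / 100
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 9 / 9

print(precision, recall)
```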

### Precision and recall

<center><img src="images/Precisionrecall.svg.png" width=30%></center>

<center>(<a href="https://en.wikipedia.org/wiki/Precision_and_recall">source</a>)</center>

<div class="alert alert-success">
    
### Discussion
    
$$\text{precision} = \frac{TP}{TP + FP} \: \: \: \:  \: \: \: \: \text{recall} = \frac{TP}{TP + FN}$$
    
- When might high **precision** be more important than high recall?
- When might high **recall** be more important than high precision?

</div>

### Combining precision and recall
   

- If we care equally about a model's precision $PR$ and recall $RE$, we can combine the two using a single metric called the **F1-score**:

$$\text{F1-score} = \text{harmonic mean}(PR, RE) = 2\frac{PR \cdot RE}{PR + RE}$$

- Both F1-score and accuracy are overall measures of a binary classifier's performance. But remember, accuracy is misleading in the presence of class imbalance, and doesn't take into account the kinds of errors the classifier makes.
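
To make the contrast concrete, here is a sketch that computes both accuracy and F1-score for the Michigan Medicine test from earlier. The label arrays are rebuilt from the table's counts for illustration; `PR` and `RE` follow the notation above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Rebuild the 100 results in the order TN = 90, FP = 1, FN = 8, TP = 1.
y_true = np.array([0] * 90 + [0] * 1 + [1] * 8 + [1] * 1)
y_pred = np.array([0] * 90 + [1] * 1 + [0] * 8 + [1] * 1)

PR = precision_score(y_true, y_pred)  # 1 / (1 + 1)
RE = recall_score(y_true, y_pred)     # 1 / (1 + 8)
f1_by_hand = 2 * PR * RE / (PR + RE)

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(accuracy, f1_by_hand, f1)
```

Accuracy is 0.91, while the F1-score is $\frac{2}{11} \approx 0.18$, which reflects the test's poor recall.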

### Other evaluation metrics for binary classifiers

- We just scratched the surface! This [excellent table from Wikipedia](https://en.wikipedia.org/wiki/Template:Diagnostic_testing_diagram) summarizes the many other metrics that exist.

<center><img src='images/wiki-table.png' width=75%></center>

- If you're interested in exploring further, a good next metric to look at is **true negative rate (i.e. specificity)**, which is the analogue of recall for true negatives.
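
As a small illustration of specificity, the sketch below computes $\frac{TN}{TN + FP}$ for the Michigan Medicine test in two ways: by hand, and by asking `recall_score` to treat class 0 as the label of interest through its `pos_label` parameter. The arrays are rebuilt from the earlier table's counts.

```python
import numpy as np
from sklearn.metrics import recall_score

# Rebuild the 100 results in the order TN = 90, FP = 1, FN = 8, TP = 1.
y_true = np.array([0] * 90 + [0] * 1 + [1] * 8 + [1] * 1)
y_pred = np.array([0] * 90 + [1] * 1 + [0] * 8 + [1] * 1)

TN, FP = 90, 1
specificity_by_hand = TN / (TN + FP)

# Treating class 0 as the "positive" label turns recall into specificity.
specificity = recall_score(y_true, y_pred, pos_label=0)

print(specificity_by_hand, specificity)
```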

## Choosing a threshold

---

### Recap: Logistic regression for tumor classification

- Let's train a logistic regression model that uses mean radius **only** to predict tumor malignancy.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train[['mean radius']], y_train)

In [None]:
model.intercept_, model.coef_

- The fitted model makes predictions using $P(y_i = 1 | x_i) = \sigma(-16.09 + 1.09 x_i)$,

    where $x_i$ represents patient $i$'s mean radius.

- Visually, the model's <span style="color:#097054"><b>predicted probabilities</b></span> look like:

In [None]:
util.show_one_feature_plot_with_logistic(X_train, y_train)

### Thresholding

- Suppose $\vec x_i$ is a **feature vector** containing the information about individual $i$, used for predicting the probability that individual $i$'s tumor is malignant, $P(y_i = 1 | \vec x_i)$.

- In order to classify $\vec{x}_i$ as either yes ($y_i = 1$) or no ($y_i = 0$), we apply a **threshold** $T$ to the predicted probability.

<center><img src="images/threshold.svg" width=600><small>With a threshold of $T = 0.6$, a predicted probability of 0.68 is classified as <span style="color:blue">malignant (class 1)</span>,<br>and a predicted probability of 0.55 is classified as <span style="color:orange">benign (class 0)</span>.</small></center>

- More generally, if we pick a threshold of $T$, then any feature vector $\vec{x}_i$ such that:

    $$\sigma(\vec{w}^* \cdot \text{Aug}(\vec{x}_i)) \geq T$$ 

    is classified as class 1.

- **Question**: How do we choose the "right" threshold?

- `sklearn`'s default threshold of $T = 0.5$ is **not** guaranteed to yield the highest **accuracy**!<br><small>To find optimal parameters, `sklearn` minimized mean cross-entropy loss, and mean cross-entropy loss doesn't involve our threshold.</small>
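
The thresholding rule above can be sketched in code. The block rebuilds the one-feature model from scratch so that it runs on its own (the names `w0`, `w1`, and `probs_by_hand` are illustrative), computes $\sigma(w_0 + w_1 x_i)$ by hand, and checks that `predict` agrees with thresholding `predict_proba` at $T = 0.5$.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data['data'], columns=data['feature_names'])[['mean radius']]
y = 1 - data['target']  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23, stratify=y)
model = LogisticRegression().fit(X_train, y_train)

# sigma(w0 + w1 * x), computed by hand from the fitted parameters.
w0, w1 = model.intercept_[0], model.coef_[0, 0]
z = w0 + w1 * X_train['mean radius'].to_numpy()
probs_by_hand = 1 / (1 + np.exp(-z))

# The same probabilities from sklearn, and its default class predictions.
probs = model.predict_proba(X_train)[:, 1]
default_preds = model.predict(X_train)
thresholded_preds = (probs >= 0.5).astype(int)
```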

### Choosing a custom threshold

- If we want to use a custom threshold, we'll need to implement the logic ourselves.

<center><img src="images/threshold.svg" width=300></center>

In [None]:
def predict_thresholded(X, T):
    '''Calls model.predict_proba.
       For each P(y_i = 1 | x_i), returns 1 if >= T and 0 if < T.'''
    probs = model.predict_proba(X)[:, 1]
    return (probs >= T).astype(int)

- Now, we can choose any threshold we'd like, and compute the accuracy of the resulting predictions.

In [None]:
predict_thresholded([[14.5]], 0.5)

In [None]:
predict_thresholded([[14.5]], 0.4)

In [None]:
predict_thresholded(X_train[['mean radius']], 0.4)

In [None]:
# Training accuracy for the threshold T = 0.4.
(predict_thresholded(X_train[['mean radius']], 0.4) == y_train).mean()

### Accuracy vs. threshold

- Accuracy is defined as:

$$\text{accuracy} = \frac{\text{# points classified correctly}}{\text{# points}} = \frac{TP + TN}{TP + FP + FN + TN}$$

- How does the model's **training** accuracy change as the threshold changes?<br><small>Note that we'd see a similar trend with test accuracy, too.</small>

In [None]:
util.plot_vs_threshold(X_train[['mean radius']], y_train, 'Accuracy')

- The threshold with the best training accuracy (among the thresholds we tried) is $T = 0.59$, which has a training accuracy of 88.9\%.

- Remember that 63\% of tumors in the training set are benign, so we can achieve a 63\% training accuracy just by always predicting "benign"! This means that a good model's accuracy should be much higher than 63\%.

In [None]:
pd.Series(y_train).value_counts(normalize=True)
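
A threshold search like the one plotted above can be sketched as follows. This is not the implementation of `util.plot_vs_threshold` (its source isn't shown here); it is a minimal, self-contained version that assumes a grid of thresholds from 0 to 1 in steps of 0.01, and it also computes the always-predict-benign baseline for comparison.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data['data'], columns=data['feature_names'])[['mean radius']]
y = 1 - data['target']  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23, stratify=y)
model = LogisticRegression().fit(X_train, y_train)

probs = model.predict_proba(X_train)[:, 1]
thresholds = np.arange(0, 1.01, 0.01)  # assumed grid

# Training accuracy at each threshold.
accuracies = np.array([((probs >= T).astype(int) == y_train).mean() for T in thresholds])
best_T = thresholds[accuracies.argmax()]

# Accuracy of always predicting "benign" (class 0).
baseline = 1 - y_train.mean()

print(best_T, accuracies.max(), baseline)
```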

### Metrics for binary classification

$$\text{precision} = \frac{TP}{\text{# predicted positive}} = \frac{TP}{TP + FP}$$

<center><small>Here, a false positive ($FP$) is when we predict a tumor is malignant, when it is benign.</small></center>

$$\text{recall} = \frac{TP}{\text{# actually positive}} = \frac{TP}{TP + FN}$$

<center><small>Here, a false negative ($FN$) is when we predict that a tumor is benign, when it is malignant.</small></center>

- A binary classifier's **confusion matrix** displays its number of true positives ($TP$), false positives ($FP$), true negatives ($TN$), and false negatives ($FN$).
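
The four counts can also be pulled out with `sklearn.metrics.confusion_matrix`. The sketch below is self-contained and uses a threshold of 0.5; the layout comment reflects how `confusion_matrix` orders rows (actual) and columns (predicted) when `labels=[0, 1]`.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data['data'], columns=data['feature_names'])[['mean radius']]
y = 1 - data['target']  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23, stratify=y)
model = LogisticRegression().fit(X_train, y_train)

probs = model.predict_proba(X_train)[:, 1]
preds = (probs >= 0.5).astype(int)

# Rows are actual classes and columns are predicted classes,
# so the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_train, preds, labels=[0, 1]).ravel()

print(tn, fp, fn, tp)
```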

In [None]:
util.show_confusion(X_train[['mean radius']], y_train, T=0.5)

- Remember, we're predicting whether tumors are malignant (1) or benign (0). **Which is worse: a false positive or a false negative?**

Observe how the values in the confusion matrix change as the threshold changes!

In [None]:
interact(lambda T: util.show_confusion(X_train[['mean radius']], y_train, T=T), T=(0, 1.01, 0.01));

### Precision vs. threshold

- Precision is defined as:

$$\text{precision} = \frac{TP}{\text{# predicted positive}} = \frac{TP}{TP + FP}$$

<center><small>Here, a false positive ($FP$) is when we predict a tumor is malignant, when it is benign.</small></center>

- How does the model's training **precision** change as the threshold changes?

In [None]:
util.plot_vs_threshold(X_train[['mean radius']], y_train, 'Precision')

- If the "bar" is higher to predict 1, then we will have fewer positives in general, and thus fewer false positives.

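- As the **threshold increases** ⬆️, fewer points are predicted positive, so the denominator in $\text{precision} = \frac{TP}{TP + FP}$ decreases. The predictions dropped first are the least confident ones, so $FP$ tends to shrink faster than $TP$, and **precision tends to increase** ⬆️.<br><small>There are some cases where a slightly higher threshold led to a slightly lower precision; why?</small>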
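
To explore this numerically, the sketch below computes training precision on an assumed grid of thresholds. The helper `precision_at` is hypothetical and written for this example; it returns `nan` when no point is predicted positive, since precision is undefined in that case.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data['data'], columns=data['feature_names'])[['mean radius']]
y = 1 - data['target']  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23, stratify=y)
model = LogisticRegression().fit(X_train, y_train)

probs = model.predict_proba(X_train)[:, 1]
actual_positive = (y_train == 1)

def precision_at(T):
    """Training precision when predicting 1 whenever the probability is >= T."""
    predicted_positive = probs >= T
    if predicted_positive.sum() == 0:
        return np.nan  # undefined: nothing was predicted positive
    return (predicted_positive & actual_positive).sum() / predicted_positive.sum()

thresholds = np.arange(0, 1.01, 0.01)  # assumed grid
precisions = np.array([precision_at(T) for T in thresholds])

print(precision_at(0.25), precision_at(0.5), precision_at(0.75))
```

At $T = 0$ every point is predicted positive, so precision equals the proportion of malignant tumors in the training set.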

### Recall vs. threshold

- Recall is defined as:

    $$\text{recall} = \frac{TP}{\text{# actually positive}} = \frac{TP}{TP + FN}$$

<center><small>Here, a false negative ($FN$) is when we predict that a tumor is benign, when it is malignant.</small></center>

- How does the model's training **recall** change as the threshold changes?

In [None]:
util.plot_vs_threshold(X_train[['mean radius']], y_train, 'Recall')

- Note that the denominator in $\text{recall} = \frac{TP}{\text{# actually positive}}$ is constant. As the **threshold increases** ⬆️:
    - true positives get converted to false negatives, so
    - the numerator of recall ($TP$) decreases, and so
    - **recall decreases** ⬇️.
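
The argument above implies that recall can never go up as the threshold rises. The self-contained sketch below checks this on an assumed grid of thresholds by computing, for each $T$, the proportion of actually malignant training points whose predicted probability is at least $T$.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data['data'], columns=data['feature_names'])[['mean radius']]
y = 1 - data['target']  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23, stratify=y)
model = LogisticRegression().fit(X_train, y_train)

probs = model.predict_proba(X_train)[:, 1]
# Predicted probabilities for the actually positive (malignant) points only.
positive_probs = probs[y_train == 1]

thresholds = np.arange(0, 1.01, 0.01)  # assumed grid
# Recall at T = TP / (# actually positive).
recalls = np.array([(positive_probs >= T).mean() for T in thresholds])

print(recalls[[25, 50, 75]])
```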

### Precision vs. recall

- We can visualize how precision and recall vary **together**.

In [None]:
util.pr_curve(X_train[['mean radius']], y_train).update_layout(width=400, height=400)

- The curve above is called a **PR curve**.

- **Question**: Given the information above, what threshold would you choose?

- **Answer**: The threshold whose point is closest to the **top right corner** of the plot above. <br><small>Why? The top right corner is where precision = 1 and recall = 1, and we want both to be high.</small>
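
One way to turn the "closest to the top right corner" rule into code is sketched below, using `sklearn.metrics.precision_recall_curve`. Euclidean distance is an assumption made here; the slide does not specify a distance measure.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data['data'], columns=data['feature_names'])[['mean radius']]
y = 1 - data['target']  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23, stratify=y)
model = LogisticRegression().fit(X_train, y_train)

probs = model.predict_proba(X_train)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_train, probs)

# The final (precision, recall) pair has no matching threshold, so drop it.
distances = np.sqrt((1 - precision[:-1]) ** 2 + (1 - recall[:-1]) ** 2)
best = distances.argmin()
best_T = thresholds[best]

print(best_T, precision[best], recall[best])
```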

### ROC curves

- A more popular variant of the PR curve is the **ROC curve**.<br><small>ROC stands for "receiver operating characteristic."</small>

- A ROC curve plots true positive rate (TPR) vs. false positive rate (FPR) for all possible thresholds, where:

$$\underbrace{\text{true positive rate (TPR)} = \frac{TP}{\text{# actually positive}} = \frac{TP}{TP + FN} = \text{recall}}_\text{we want this to be close to 1!}$$

$$\underbrace{\text{false positive rate (FPR)} = \frac{FP}{\text{# actually negative}} = \frac{FP}{FP + TN}}_\text{we want this to be close to 0!}$$

In [None]:
util.draw_roc_curve(X_train[['mean radius']], y_train).update_layout(width=400, height=400)

- If we care about TPR and FPR equally, the best threshold is the one whose point is closest to the **top left corner** in the plot above.<br><small>Why? The top left corner is where $TPR = 1$ and $FPR = 0$, and we want $TPR$ to be high and $FPR$ to be low.</small>
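
The "closest to the top left corner" rule can be sketched with `sklearn.metrics.roc_curve`. As before, Euclidean distance is an assumption of this example, and the block rebuilds the one-feature model so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data['data'], columns=data['feature_names'])[['mean radius']]
y = 1 - data['target']  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23, stratify=y)
model = LogisticRegression().fit(X_train, y_train)

probs = model.predict_proba(X_train)[:, 1]
fpr, tpr, thresholds = roc_curve(y_train, probs)

# Distance from each (FPR, TPR) point to the top left corner (0, 1).
distances = np.sqrt(fpr ** 2 + (1 - tpr) ** 2)
best = distances.argmin()
best_T = thresholds[best]

print(best_T, fpr[best], tpr[best])
```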

Run this cell to set up the next slide.

In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
import numpy as np

# Create subplots
fig = make_subplots(rows=1, cols=2, subplot_titles=('Single Feature', 'All Features'))

# Fit logistic regression models and calculate ROC curves
# Single feature model
lr_single = LogisticRegression()
lr_single.fit(X_train[['mean radius']], y_train)
y_pred_proba_single = lr_single.predict_proba(X_train[['mean radius']])[:, 1]
fpr_single, tpr_single, _ = roc_curve(y_train, y_pred_proba_single)
auc_single = auc(fpr_single, tpr_single)

# All features model
lr_all = LogisticRegression()
lr_all.fit(X_train, y_train)
y_pred_proba_all = lr_all.predict_proba(X_train)[:, 1]
fpr_all, tpr_all, _ = roc_curve(y_train, y_pred_proba_all)
auc_all = auc(fpr_all, tpr_all)

# Add filled areas under the curves
fig.add_trace(
    go.Scatter(
        x=fpr_single, 
        y=tpr_single,
        fill='tonexty',
        fillcolor='rgba(128, 0, 128, 0.3)',  # Purple with alpha
        line=dict(color='purple', width=2),
        name=f'Single Feature (AUC = {auc_single:.3f})',
        showlegend=False
    ),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(
        x=fpr_all, 
        y=tpr_all,
        fill='tonexty',
        fillcolor='rgba(0, 128, 0, 0.3)',  # Green with alpha
        line=dict(color='green', width=2),
        name=f'All Features (AUC = {auc_all:.3f})',
        showlegend=False
    ),
    row=1, col=2
)

# Add diagonal reference lines
fig.add_trace(
    go.Scatter(
        x=[0, 1], 
        y=[0, 1],
        mode='lines',
        line=dict(dash='dash', color='gray'),
        name='Random Classifier',
        showlegend=False
    ),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(
        x=[0, 1], 
        y=[0, 1],
        mode='lines',
        line=dict(dash='dash', color='gray'),
        name='Random Classifier',
        showlegend=False
    ),
    row=1, col=2
)

# Add AUC annotations
fig.add_annotation(
    x=0.6, y=0.3,
    text=f"AUC = {auc_single:.3f}",
    showarrow=False,
    font=dict(size=14, color="purple"),
    bgcolor="rgba(255, 255, 255, 0.8)",
    bordercolor="purple",
    borderwidth=1,
    row=1, col=1
)

fig.add_annotation(
    x=0.6, y=0.3,
    text=f"AUC = {auc_all:.3f}",
    showarrow=False,
    font=dict(size=14, color="green"),
    bgcolor="rgba(255, 255, 255, 0.8)",
    bordercolor="green",
    borderwidth=1,
    row=1, col=2
)

# Update layout
fig.update_layout(
    showlegend=False, 
    width=1000, 
    height=500, 
    title='Comparison of ROC Curves'
)

fig.update_xaxes(title_text="False Positive Rate", range=[0, 1])
fig = fig.update_yaxes(title_text="True Positive Rate", range=[0, 1])

### Comparing ROC curves

- Below, we draw two ROC curves:
    - On the **left**, for the model that uses just mean radius to predict malignancy.
    - On the **right**, for a model that uses all 30 features to predict malignancy.

In [None]:
fig

- A common metric for the quality of a binary classifier is the **area under the curve (AUC)** for the ROC curve.<br><small>Larger values are better!</small>
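
Besides integrating the curve with `auc(fpr, tpr)`, `sklearn.metrics.roc_auc_score` computes the same quantity directly from labels and predicted probabilities. The sketch below compares the two for the one-feature model, rebuilt here so the block is self-contained.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data['data'], columns=data['feature_names'])[['mean radius']]
y = 1 - data['target']  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23, stratify=y)
model = LogisticRegression().fit(X_train, y_train)

probs = model.predict_proba(X_train)[:, 1]

# Area under the ROC curve, computed two ways.
fpr, tpr, _ = roc_curve(y_train, probs)
auc_from_curve = auc(fpr, tpr)
auc_direct = roc_auc_score(y_train, probs)

print(auc_from_curve, auc_direct)
```

A classifier that guesses at random has an expected ROC-AUC of 0.5 (the dashed diagonal), so that is the natural baseline to compare against.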

### ROC curves vs. PR curves

- **Discuss**: Suppose we're deciding between Model A and Model B, both of which are models that predict probabilities (like logistic regression), and suppose **both** of the following are true:

    $$\text{ROC-AUC(Model A)} > \text{ROC-AUC(Model B)}$$
    $$\text{PR-AUC(Model A)} < \text{PR-AUC(Model B)}$$

    In what scenario would we choose Model A? Model B? Why?

- See [**here**](https://stats.stackexchange.com/questions/7207/roc-vs-precision-and-recall-curves) for a good discussion on the differences between PR curves and ROC curves.

### Class imbalance

- In cases of class imbalance, we've discussed:
    - Setting `stratify` when performing a `train_test_split`.
    - Choosing a metric (precision, recall) that better suits the task at hand.
    - Using PR curves instead of ROC curves to choose a threshold.

- Another solution is to change how the model itself is trained, to account for the imbalance.

In [None]:
LogisticRegression(class_weight='balanced')

In [None]:
util.show_balancing_demo()

- By setting `class_weight='balanced'`, the minority class is given a higher "cost" in the model's optimization routine, effectively duplicating the minority class.

    <br>

    **If using `class_weight='balanced'`, refer to the documentation to see how the weights are assigned!**
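
For reference, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies it to labels with the same imbalance as the earlier COVID example (91 negatives, 9 positives; the array is built here for illustration) and compares the result with `sklearn.utils.class_weight.compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 91 negatives (0) and 9 positives (1), as in the COVID example.
y = np.array([0] * 91 + [1] * 9)
classes = np.array([0, 1])

# Weights as computed by scikit-learn's helper.
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# The same weights from the documented formula.
weights_by_hand = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes, weights)))
```

Here the minority class receives a weight of $\frac{100}{2 \cdot 9} \approx 5.56$ and the majority class a weight of $\frac{100}{2 \cdot 91} \approx 0.55$, so each class contributes equally to the weighted loss in total.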

## Summary

---

### Warnings

1. Missing random seeds.

2. Inconsistent variance/standard deviation formulas.

3. Centering with PCA.

4. Lack of cross-validation.

5. Unreported regularization.

6. Leakage with feature standardization.

7. Accuracy can mislead.

### Time-permitting: TRIPOD+AI Guidelines

Read the **TRIPOD+AI checklist [here](https://www.tripod-statement.org/wp-content/uploads/2019/12/TRIPODAI_checklist.pdf)**, in particular, sections 5, 7, 9, 12-16. How has our work today informed how we'd approach these guidelines?