# Advanced Classification Metrics

---

By default, the `.score` method of a logistic regression model in sklearn returns accuracy:

$$\text{Accuracy} = \frac{\text{total predicted correct}}{\text{total predicted}}$$
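As a quick check of that default, the sketch below fits a logistic regression on synthetic data and compares `.score` with `accuracy_score`. The dataset comes from `make_classification` and is an assumption for illustration only; it is not part of this lesson's data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data, purely illustrative (an assumption, not the lesson's dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression().fit(X, y)

# .score on a classifier reports mean accuracy on the data it is given.
score = model.score(X, y)

# The same quantity computed explicitly from the hard predictions.
manual = accuracy_score(y, model.predict(X))

print(score, manual)
```

The two printed numbers should agree, since both are the fraction of rows whose predicted label matches the true label.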

However, accuracy is not always the most relevant metric.

Consider the **confusion matrix** for a binary classification problem where we have 165 observations/rows of people who are either smokers or nonsmokers.

<table style="border: none">
<tr style="border: none">
    <td style="border: none; vertical-align: bottom">n = 165</td>
    <td style=""><b>Predicted: No</b></td>
    <td style=""><b>Predicted: Yes</b></td>
</tr>
<tr>
    <td><b>Actual: No</b></td>
    <td style="text-align: center"><font color="blue">TN = 50</font></td>
    <td style="text-align: center"><font color="red">FP = 10</font></td>
    <td style="text-align: center">60</td>
</tr>
<tr>
    <td><b>Actual: Yes</b></td>
    <td style="text-align: center"><font color="orange">FN = 5</font></td>
    <td style="text-align: center"><font color="green">TP = 100</font></td>
    <td style="text-align: center">105</td>
</tr>
<tr style="border: none">
    <td style="border: none"></td>
    <td style="text-align: center">55</td>
    <td style="text-align: center">110</td>
</tr>

</table>


- <font color="green">**True positives (TP):**</font> These are cases in which we predicted yes (smokers), and they actually are smokers.
- <font color="blue">**True negatives (TN):**</font> We predicted no, and they are nonsmokers.
- <font color="red">**False positives (FP):**</font> We predicted yes, but they were not actually smokers. (This is also known as a "Type I error.")
- <font color="orange">**False negatives (FN):**</font> We predicted no, but they are smokers. (This is also known as a "Type II error.")
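The four cells above can be reproduced with `sklearn.metrics.confusion_matrix`. The sketch below rebuilds label arrays that match the counts in the table; the encoding 1 = smoker, 0 = nonsmoker is an assumption made for the example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Rebuild labels matching the table (assumed encoding: 1 = smoker, 0 = nonsmoker).
# Order of groups: 50 TN, 10 FP, 5 FN, 100 TP.
y_true = np.array([0] * 50 + [0] * 10 + [1] * 5 + [1] * 100)
y_pred = np.array([0] * 50 + [1] * 10 + [0] * 5 + [1] * 100)

# Rows are actual classes, columns are predicted classes, both sorted 0 then 1.
cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() flattens the matrix in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```

Note that sklearn lays the matrix out the same way as the table: the negative class comes first in both rows and columns.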

**Exercise.**

Categorize these cases as TP, TN, FP, or FN. The positive class for each case is given in parentheses.
    
- We predict that a growth is malignant, and it is benign. (is_malignant=1)
- We predict that an image does not contain a cat, and it does not. (has_cat=1)
- We predict that a locomotive will fail in the next two weeks, and it does. (breaks=1)
- We predict that a user will like a song, and she does not. (likes_song=1)

<a id="accuracy-true-positive-rate-and-false-negative-rate"></a>
### Accuracy, True Positive Rate, and False Positive Rate

**Accuracy:** Overall, how often is the classifier correct?

<span>
    (<span style="color: green">TP</span>+<span style="color: blue">TN</span>)/<span style="color: purple">total</span> = (<span style="color: green">100</span>+<span style="color: blue">50</span>)/<span style="color: purple">165</span> = 0.91
</span>

<table style="border: none">
<tr style="border: none">
    <td style="border: none; vertical-align: bottom; color: purple">n = 165</td>
    <td style=""><b>Predicted: No</b></td>
    <td style=""><b>Predicted: Yes</b></td>
</tr>
<tr>
    <td><b>Actual: No</b></td>
    <td style="text-align: center; background-color: blue">TN = 50</td>
    <td style="text-align: center">FP = 10</td>
    <td style="text-align: center">60</td>
</tr>
<tr>
    <td><b>Actual: Yes</b></td>
    <td style="text-align: center">FN = 5</td>
    <td style="text-align: center; background-color: green">TP = 100</td>
    <td style="text-align: center">105</td>
</tr>
<tr style="border: none">
    <td style="border: none"></td>
    <td style="text-align: center">55</td>
    <td style="text-align: center">110</td>
</tr>

</table>

**True positive rate (TPR)** asks, “Out of all observations that actually belong to the target class, how many were correctly predicted to belong to it?”

For example, given a medical exam that tests for cancer, how often does it correctly identify patients with cancer?

<span>
<span style="color: green">TP</span>/<span style="color: aqua">actual yes</span> = <span style="color: green">100</span>/<span style="color: aqua">105</span> = 0.95
</span>

<table style="border: none">
<tr style="border: none">
    <td style="border: none; vertical-align: bottom">n = 165</td>
    <td style=""><b>Predicted: No</b></td>
    <td style=""><b>Predicted: Yes</b></td>
</tr>
<tr>
    <td><b>Actual: No</b></td>
    <td style="text-align: center">TN = 50</td>
    <td style="text-align: center">FP = 10</td>
    <td style="text-align: center">60</td>
</tr>
<tr>
    <td><b>Actual: Yes</b></td>
    <td style="text-align: center">FN = 5</td>
    <td style="text-align: center;background-color: green">TP = 100</td>
    <td style="text-align: center;color: aqua">105</td>
</tr>
<tr style="border: none">
    <td style="border: none"></td>
    <td style="text-align: center">55</td>
    <td style="text-align: center">110</td>
</tr>

</table>

**False positive rate (FPR)** asks, “Out of all observations that do not belong to the target class, how many were incorrectly predicted to belong to it?”

For example, given a medical exam that tests for cancer, how often does it trigger a “false alarm” by incorrectly saying a patient has cancer?

<span>
<span style="color: orange">FP</span>/<span style="color: fuchsia">actual no</span> = <span style="color: orange">10</span>/<span style="color: fuchsia">60</span> = 0.17
</span>

<table style="border: none">
<tr style="border: none">
    <td style="border: none; vertical-align: bottom">n = 165</td>
    <td style=""><b>Predicted: No</b></td>
    <td style=""><b>Predicted: Yes</b></td>
</tr>
<tr>
    <td><b>Actual: No</b></td>
    <td style="text-align: center">TN = 50</td>
    <td style="text-align: center;background-color: orange">FP = 10</td>
    <td style="text-align: center;color:fuchsia">60</td>
</tr>
<tr>
    <td><b>Actual: Yes</b></td>
    <td style="text-align: center">FN = 5</td>
    <td style="text-align: center">TP = 100</td>
    <td style="text-align: center">105</td>
</tr>
<tr style="border: none">
    <td style="border: none"></td>
    <td style="text-align: center">55</td>
    <td style="text-align: center">110</td>
</tr>

</table>
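The three calculations in this section can be collected into one small helper. This is a minimal sketch in plain Python; the function name `classification_rates` is hypothetical and not part of sklearn.

```python
def classification_rates(tn, fp, fn, tp):
    """Return accuracy, TPR, and FPR from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,  # correct predictions / all predictions
        "tpr": tp / (tp + fn),          # true positives / actual yes
        "fpr": fp / (fp + tn),          # false positives / actual no
    }


# Counts from the smoker example above.
rates = classification_rates(tn=50, fp=10, fn=5, tp=100)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.91
# tpr: 0.95
# fpr: 0.17
```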

**Exercise.**

We turn the probabilities output by a logistic regression model into "hard" predictions by setting a threshold. For instance, we might treat all probabilities above .5 as positive predictions and the rest as negative predictions.
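As a rough sketch of that thresholding step, the code below converts probabilities into hard predictions. The probability values are made up for illustration; in practice they might come from the positive-class column of `predict_proba` on a fitted model. The helper name `hard_predictions` is hypothetical.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class,
# e.g. model.predict_proba(X)[:, 1] for a fitted LogisticRegression.
probs = np.array([0.10, 0.45, 0.55, 0.62, 0.90])


def hard_predictions(probs, threshold=0.5):
    """Label a case positive (1) when its probability exceeds the threshold."""
    return (probs > threshold).astype(int)


preds = hard_predictions(probs, threshold=0.5)
print(preds)  # [0 0 1 1 1]
```

Passing a different `threshold` value is one way to experiment with the questions below.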

- Does the true positive rate of a logistic regression model increase or decrease if we change the threshold probability for treating a prediction as positive from .5 to .6?

- Does the false positive rate of a logistic regression model increase or decrease if we change the threshold probability for treating a prediction as positive from .5 to .6?

- Describe a situation in which you would want to use a high threshold probability.

- Describe a situation in which you would want to use a low threshold probability.

- Calculate the accuracy, true positive rate, and false positive rate for the confusion matrix below.

<table style="border: none">
<tr style="border: none">
    <td style="border: none; vertical-align: bottom">n = 140</td>
    <td style=""><b>Predicted: No</b></td>
    <td style=""><b>Predicted: Yes</b></td>
</tr>
<tr>
    <td><b>Actual: No</b></td>
    <td style="text-align: center">30</td>
    <td style="text-align: center">10</td>
    <td style="text-align: center">40</td>
</tr>
<tr>
    <td><b>Actual: Yes</b></td>
    <td style="text-align: center">60</td>
    <td style="text-align: center">40</td>
    <td style="text-align: center">100</td>
</tr>
<tr style="border: none">
    <td style="border: none"></td>
    <td style="text-align: center">90</td>
    <td style="text-align: center">50</td>
</tr>

</table>

### Accuracy

**Advantages:**

- Intuitive: it's a lot like an exam score where you get total correct/total attempted.

**Disadvantages:**

- Potentially misleading: accuracy can look fine even when the model just outputs the most common label.
    - This is particularly bad when classes are imbalanced. For example, if a train doesn't break 99% of the time, a model that always says "won't break" has 99% accuracy, but it fails exactly when we need it!
- Doesn't account for relative costs of false positives and false negatives.
- Doesn't say anything about how far predicted probabilities are from the correct labels.
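The imbalanced-class problem is easy to reproduce. The sketch below uses made-up data (1,000 trains, 10 of which break) and a "model" that always predicts the majority class; both are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical data: 1,000 trains, of which 10 (1%) break (label 1).
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that always predicts "won't break" (label 0).
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
tpr = recall_score(y_true, y_pred)  # recall is the true positive rate

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.99
print(f"true positive rate: {tpr:.2f}")  # true positive rate: 0.00
```

The accuracy looks excellent, yet the true positive rate shows that not a single breakdown was caught.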

**Other metrics to investigate:**
    
- **Classification error:** Proportion of incorrect predictions (1-accuracy, lower is better).
- **Receiver Operating Characteristic (ROC) curves:** True positive rate vs. false positive rate across all possible threshold probabilities. The **area under the ROC curve** (AUC) is a measure of how well your model performs overall across those thresholds.
  - Allows you to visualize the performance of your classifier across all possible classification thresholds, thus helping you to choose a threshold that appropriately balances true positives and false positives.
  - Still useful when there is high class imbalance (unlike classification accuracy/error).
  - Harder to use when there are more than two response classes.
- **Log loss**: Measures how far the output probabilities are from the correct labels. (Useful when you want to make expected value calculations with those probabilities or triage cases for further attention.)
- **True Negative Rate**, **False Negative Rate**
- **Recall** (a.k.a. True Positive Rate), **Precision** (proportion of positive predictions that are true)

These measures are all readily available in sklearn.
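
As a starting point, the sketch below calls several of these functions from `sklearn.metrics` on a tiny set of made-up labels and probabilities (assumed values, chosen only to keep the example small). There is no dedicated true negative rate function as far as this example assumes, so it is computed as recall on the negative class.

```python
import numpy as np
from sklearn.metrics import (
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])
y_pred = (y_prob > 0.5).astype(int)  # hard predictions at a 0.5 threshold

# Threshold-dependent metrics use the hard predictions.
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)          # true positive rate
tnr = recall_score(y_true, y_pred, pos_label=0)  # true negative rate
fnr = 1 - recall                               # false negative rate

# Threshold-free metrics use the probabilities themselves.
auc = roc_auc_score(y_true, y_prob)
loss = log_loss(y_true, y_prob)
fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # points on the ROC curve

print(precision, recall, tnr, fnr, auc, loss)
```

The arrays returned by `roc_curve` can be passed to a plotting library to draw the ROC curve itself.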