# Evaluation Metrics for Classification

## 1. Definition

There are ***MANY*** metrics available for classification problems. It may be a bit ***confusing*** at first, so let's look at the ***confusion matrix*** to understand it better (pun intended!).

### 1.1 Confusion Matrix

The ***confusion matrix*** is the contingency table of ***actual*** (rows) vs ***predicted*** (columns) ***classes***.

Some representations put the positive class in the first row and first column. ***Scikit-Learn***, however, returns its results with ***negative samples first***. So, we're sticking with its convention to avoid confusion!

Therefore, for a binary problem, the matrix has 4 values (TN, FP, FN, and TP), as shown in the picture:

![](./img/confusion_matrix.png)
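As a minimal sketch of the layout above, the snippet below builds the matrix with Scikit-Learn's `confusion_matrix` and unpacks its four values. The labels and predictions are made-up toy data, used only for illustration.

```python
from sklearn.metrics import confusion_matrix

# Toy data (an assumption for illustration): 6 negatives followed by 4 positives
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]

# Rows are actual classes, columns are predicted classes, negatives first
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[4 2]
#  [1 3]]

# Flattening the 2x2 matrix row by row yields TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 4 2 1 3
```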

The confusion matrix provides the necessary information to build a lot of different metrics.

&nbsp; | &nbsp;
:---:|:---:
![](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/264px-Precisionrecall.svg.png) | ![](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Sensitivity_and_specificity.svg/264px-Sensitivity_and_specificity.svg.png)
<center>Source: Wikipedia</center> | <center>Source: Wikipedia</center>

Notice that the matrix is built on top of ***predicted classes***, not ***probabilities***. It means you should first decide on a ***threshold*** to convert probabilities into classes and only then compute the matrix.

Changing the ***threshold*** will change the matrix and, consequently, the metrics that depend on its values.

So, it is possible to ***tweak the threshold*** to achieve a better performance on a given metric.
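To illustrate the effect of the threshold, here is a small sketch that converts the same probabilities into classes using three different thresholds. The probabilities are toy values and `matrix_at` is a hypothetical helper, not part of Scikit-Learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predicted probabilities (assumptions for illustration)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.4, 0.55, 0.8, 0.9])

def matrix_at(threshold):
    """Hypothetical helper: returns (TN, FP, FN, TP) for a given threshold."""
    # A sample is predicted positive if its probability reaches the threshold
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(tn), int(fp), int(fn), int(tp)

for threshold in (0.3, 0.5, 0.75):
    print(threshold, matrix_at(threshold))
# 0.3 (2, 4, 0, 4)
# 0.5 (4, 2, 1, 3)
# 0.75 (6, 0, 2, 2)
```

A low threshold catches every positive but lets many false positives through; a high threshold does the opposite.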

### 1.2 Accuracy

***How often is my classifier right?***

This is the most straightforward metric of all - how often a classifier is right, generally speaking.

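It may be a ***misleading*** metric, though, if the dataset is ***imbalanced***: a classifier that always predicts the majority class gets a high accuracy while never detecting the minority class.

$$
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
$$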
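The sketch below shows how accuracy can mislead on an imbalanced dataset. It assumes a made-up dataset with 95 negatives and 5 positives, and a "classifier" that always predicts the negative class.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy imbalanced dataset (an assumption): 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
# A lazy classifier that always predicts the majority (negative) class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(accuracy)  # 0.95, which looks great...
print(recall)    # 0.0, ...but not a single positive was found
```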

### 1.3 Precision

***My classifier says it's positive - how often is it right?***

If ***False Positives*** are a ***problem***, this is the metric you should pay attention to.

Example: if you want to classify videos as ***appropriate for kids*** (positive) or not (negative), you ***really*** don't want a ***false positive***, that is, an ***inappropriate video*** showing up. You will end up ***rejecting good videos***, but that's a lesser problem.

$$
Precision = \frac{TP}{TP + FP}
$$
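As a rough illustration of the formula, the snippet below computes precision with Scikit-Learn's `precision_score` at two thresholds. The labels and probabilities are toy values chosen for this example.

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy labels and predicted probabilities (assumptions for illustration)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.4, 0.55, 0.8, 0.9])

# Threshold 0.5: TP = 3, FP = 2
precision_default = precision_score(y_true, (y_prob >= 0.5).astype(int))
# Threshold 0.75: TP = 2, FP = 0
precision_strict = precision_score(y_true, (y_prob >= 0.75).astype(int))

print(precision_default)  # 0.6
print(precision_strict)   # 1.0
```

In this toy case, the stricter threshold removes every false positive, at the cost of predicting fewer positives overall.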

### 1.4 True Positive Rate (TPR) / Recall / Sensitivity

***It IS a positive sample - how often does my classifier get it right?***

If ***False Negatives*** are a ***problem***, this is the metric you should pay attention to.

Example: if you want to detect if someone has a ***rare and fatal disease*** (positive) or not (negative), you ***really*** don't want a ***false negative***, that is, ***dismissing a sick person***. You will end up ***further investigating healthy people***, but that's a lesser problem.

$$
Recall = \frac{TP}{TP + FN}
$$
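A minimal sketch of the recall formula using Scikit-Learn's `recall_score`, again at two thresholds. The data are toy values assumed for illustration.

```python
import numpy as np
from sklearn.metrics import recall_score

# Toy labels and predicted probabilities (assumptions for illustration)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.4, 0.55, 0.8, 0.9])

# Threshold 0.5: TP = 3, FN = 1
recall_default = recall_score(y_true, (y_prob >= 0.5).astype(int))
# Threshold 0.3: TP = 4, FN = 0
recall_lenient = recall_score(y_true, (y_prob >= 0.3).astype(int))

print(recall_default)  # 0.75
print(recall_lenient)  # 1.0
```

Here, the more lenient threshold catches every positive sample, but it also flags more negatives as positive.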

### 1.5 False Positive Rate (FPR) / Specificity

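***It IS a negative sample - how often does my classifier get it wrong?***

If ***False Positives*** are a ***problem***, this is the metric you should pay attention to.

***Specificity*** (the True Negative Rate) answers the opposite question, how often a negative sample is classified correctly, so FPR is its complement.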

$$
FPR = 1 - Specificity = 1 - \frac{TN}{TN + FP} = \frac{FP}{TN + FP}
$$
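Scikit-Learn has no dedicated function for FPR, so one possible sketch derives it from the confusion matrix. Specificity is computed here as the recall of the negative class, using the `pos_label` argument of `recall_score`. The data are toy values.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy data (an assumption for illustration)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# FPR straight from its definition: FP / (TN + FP)
fpr = fp / (tn + fp)
# Specificity is the recall of the negative class: TN / (TN + FP)
specificity = recall_score(y_true, y_pred, pos_label=0)

print(round(fpr, 4))          # 0.3333
print(round(specificity, 4))  # 0.6667
```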

### 1.6 F1-Score

It is the ***harmonic mean*** of precision and recall, so it combines both metrics into a single value.

It favors classifiers that deliver similar levels of precision and recall.

$$
F_1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}
$$
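To see why the harmonic mean favors similar levels of precision and recall, the sketch below compares a balanced pair with a lopsided one, then checks the formula against Scikit-Learn's `f1_score`. `harmonic_mean` is a hypothetical helper and the data are toy values.

```python
from sklearn.metrics import f1_score

def harmonic_mean(precision, recall):
    """Hypothetical helper implementing the F1 formula directly."""
    return 2 / (1 / precision + 1 / recall)

# Both pairs have the same arithmetic mean (0.6)...
balanced = harmonic_mean(0.6, 0.6)
lopsided = harmonic_mean(1.0, 0.2)
print(round(balanced, 4))  # 0.6
print(round(lopsided, 4))  # 0.3333, the low recall drags the score down

# Toy data (an assumption): precision = 0.6 and recall = 0.75
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]
f1 = f1_score(y_true, y_pred)
print(round(f1, 4))  # 0.6667
```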

### Tweaking the Threshold

The metrics so far were computed for a given threshold. If we want to see how they fare as we sweep the ***threshold*** over all its possible values, we need to construct one of the ***curves*** below.

They are especially useful to evaluate classifiers on ***imbalanced datasets***.

### 1.7 Precision-Recall Curve (Recall x Precision)

The ***PR Curve*** depicts the trade-off between ***Recall*** on the horizontal axis and ***Precision*** on the vertical axis.

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_precision_recall_001.png)
<center>Source: Scikit-Learn</center>

You may have noticed the curve is somewhat ***bumpy***.

If you ***raise the threshold***, you will move to the ***left*** on the curve.

It means you're trying to ***avoid False Positives*** at the expense of ***trading True Positives for False Negatives***.

1. More FN reduces Recall (TPR): its denominator, TP + FN, is the fixed number of actual positives, so Recall can only drop (or stay the same) as TP falls
2. Fewer FP increases Precision, but fewer TP reduces it

So, as you shift the threshold, you may ***lose proportionally more TPs than FPs***, and then your precision will drop momentarily. That's what makes the curve bumpy.
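The behavior described above can be checked with Scikit-Learn's `precision_recall_curve`. This is a small sketch on toy data (the labels and probabilities are assumptions): recall never rises as the threshold goes up, while precision occasionally dips.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and predicted probabilities (assumptions for illustration)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.4, 0.55, 0.8, 0.9])

# Thresholds come back in increasing order; precision and recall have one
# extra trailing entry (precision = 1, recall = 0) with no matching threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# Recall is monotonic: it never increases as the threshold is raised
recall_never_rises = bool(np.all(np.diff(recall) <= 0))
# Precision is not: at some step it decreases, producing the bumps
precision_dips = bool(np.any(np.diff(precision) < 0))

print(recall_never_rises)  # True
print(precision_dips)      # True
```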

### 1.8 ROC Curve (FPR x TPR)

The ***ROC Curve*** depicts the trade-off between ***False Positive Rate*** on the horizontal axis and ***True Positive Rate*** on the vertical axis.

The shape of the curve will depend on ***how separable*** the classes are:
- perfectly separable: the "curve" would actually be a square, going straight up to 1 and staying there
- completely overlapped: the "curve" would actually be a diagonal line, from the origin to the upper right corner
- somewhat separable: a curve like the one in the figure below

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_roc_001.png)
<center>Source: Scikit-Learn</center>

If you ***raise the threshold***, you will move to the ***left*** on the curve.

It means you're trying to ***avoid False Positives*** at the expense of ***trading True Positives for False Negatives***.

1. More FN reduces TPR (Recall): its denominator, TP + FN, is the fixed number of actual positives
2. Fewer FP reduces FPR: its denominator, TN + FP, is the fixed number of actual negatives

Since both denominators are fixed, raising the threshold can only lower TPR and FPR (or leave them unchanged), so we ***do not*** observe the bumpiness as in the PR Curve.
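As a quick sanity check of the absence of bumps, the sketch below computes the curve with Scikit-Learn's `roc_curve` on toy data (the labels and probabilities are assumptions) and verifies that both rates move in one direction only.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and predicted probabilities (assumptions for illustration)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.4, 0.55, 0.8, 0.9])

# Thresholds come back in decreasing order, so the arrays walk the curve
# from the bottom left (0, 0) to the top right (1, 1)
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Lowering the threshold never decreases either rate
fpr_monotonic = bool(np.all(np.diff(fpr) >= 0))
tpr_monotonic = bool(np.all(np.diff(tpr) >= 0))

print(fpr_monotonic, tpr_monotonic)  # True True
```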

### 1.9 Area Under ROC

The ROC Curve is a very popular method of evaluating a binary classifier. But how does one compare two curves? Unless one of them is strictly better than the other, this would be a difficult task.

To make it easier to compare classifiers, one can use the ***area*** under the ROC Curve. The closer it is to ***one***, the better the classifier, as it achieves a high ***TPR*** with a low ***FPR***.

🔹 Intuitive Meaning

AUC = the probability that your model ranks a random positive higher than a random negative.

Imagine you randomly pick 1 positive example (say, a real spam email) and 1 negative example (a normal email).

You look at your model’s predicted probability for each.

If the spam gets a higher score → that’s a “win” for the model.

If the normal email gets a higher score → that’s a “loss.”

Do this many times: the fraction of “wins” (counting a tie as half a win) is the AUC.

So if AUC = 0.85, it means that 85% of the time, the model ranks a random positive higher than a random negative.
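The "wins" interpretation can be verified directly. The sketch below compares every positive with every negative on toy data (the labels and scores are assumptions), counts ties as half a win, and checks the result against Scikit-Learn's `roc_auc_score`.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (assumptions for illustration)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.4, 0.55, 0.8, 0.9]

pos = [p for p, y in zip(y_prob, y_true) if y == 1]
neg = [p for p, y in zip(y_prob, y_true) if y == 0]

# One point per pair where the positive scores higher, half a point for a tie
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(pos, neg)
)
pairwise_auc = wins / (len(pos) * len(neg))
auc = roc_auc_score(y_true, y_prob)

print(round(pairwise_auc, 4))  # 0.8333 (20 wins out of 24 pairs)
print(round(auc, 4))           # 0.8333
```

Enumerating all pairs is quadratic in the number of samples, so it is only practical for small datasets; it is shown here just to make the interpretation concrete.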