$$
    \text{accuracy} = \frac{\text{\# correct predictions}}{\text{\# predictions}}
$$

Limitations:
- Inappropriate for imbalanced classification problems. Say almost all targets in our _sample_ test set happen to be 0. A classifier that ignores its input and always outputs 0 will have a very high accuracy, yet it never detects the minority class, so that accuracy will not generalise to data with a different class balance.
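
To make the limitation concrete, here is a minimal sketch in plain Python. The 95/5 class split and the constant predictor are made-up values chosen for illustration, not data from these notes.

```python
# Hypothetical imbalanced test set: 95 negatives and 5 positives.
targets = [0] * 95 + [1] * 5

# A "classifier" that ignores its input and always predicts 0.
predictions = [0] * len(targets)

correct = sum(t == p for t, p in zip(targets, predictions))
accuracy = correct / len(targets)

# Fraction of the positive examples that were actually found.
found = sum(t == 1 and p == 1 for t, p in zip(targets, predictions))
recall_positive = found / targets.count(1)

print(accuracy)         # 0.95
print(recall_positive)  # 0.0
```

The accuracy looks excellent even though not a single positive example is recognised.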

Alternatives:
- <u>Precision</u> is the fraction of positive class predictions that actually belong to the positive class. Out of the times we said label = 1, how many times were we right?
- <u>Recall</u> is the fraction of positive examples in the dataset that were predicted as positive. Out of all the label = 1 in the dataset, how many did we recognise?
- <u>F-measure</u> or <u>F-score</u> combines precision and recall into a single number. The F1 score is their harmonic mean, so it is pulled towards the smaller of the two and penalises large differences between precision and recall.

A <u>confusion matrix</u> looks like this for two classes:

|            | prediction = 1      | prediction = 0      |
| ---------- | ------------------- | ------------------- |
| target = 1 | True Positive (TP)  | False Negative (FN) |
| target = 0 | False Positive (FP) | True Negative (TN)  |

$$
\begin{align*}
    \text{precision}_1 &= \frac{\text{TP}}{\text{TP} + \text{FP}}\\
    \text{recall}_1 &= \frac{ \text{TP} }{ \text{TP} + \text{FN} }\\
    \text{f1}_1 &= \frac{2 \cdot \text{precision}_1 \cdot \text{recall}_1}{\text{precision}_1 + \text{recall}_1}
\end{align*}
$$
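
The three formulas can be sketched directly from the confusion-matrix counts. The helper `binary_metrics` and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch rather than part of the definitions.

```python
def binary_metrics(targets, predictions):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    pairs = list(zip(targets, predictions))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

    # Assumption: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


targets     = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall, f1 = binary_metrics(targets, predictions)
print(round(precision, 4))  # 0.6667  (TP = 2, FP = 1)
print(round(recall, 4))     # 0.5     (TP = 2, FN = 2)
print(round(f1, 4))         # 0.5714
```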

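So, for instance, a true positive is a prediction of 1 whose target is also 1, and a false negative is a prediction of 0 whose target is 1.

For multiple classes, a confusion matrix looks like this, where T<i>c</i> counts correct predictions of class c and F<i>c</i> counts examples wrongly predicted as class c: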

|            | prediction = 0                      | prediction = 1 | prediction = 2 |
| ---------- | ----------------------------------- | -------------- | -------------- |
| target = 0 | T0                                  | F1             | F2             |
| target = 1 | F0 #(prediction = 0 and target = 1) | T1             | F2             |
| target = 2 | F0                                  | F1             | T2             |

## Macro average
$$
\begin{align*}
    \text{precision}_c &= \frac{ \#(\text{pred} = c \text{ and } \text{target} = c) }{ \sum_i \#(\text{pred} = c \text{ and } \text{target} = i) } = \frac{M_{cc}}{\sum_{j} M_{jc}}\\
    \text{recall}_c &= \frac{ \#(\text{pred} = c \text{ and } \text{target} = c) }{ \sum_i \#(\text{pred} = i \text{ and } \text{target} = c) } = \frac{M_{cc}}{ \sum_j M_{cj} }
\end{align*}
$$

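So, with M the confusion matrix (rows are targets, columns are predictions):
- precision for class c is the diagonal element on row c, divided by the sum of <u>column</u> c;
- recall for class c is the diagonal element on row c, divided by the sum of <u>row</u> c.

The macro average is the unweighted mean of these per-class scores, so every class counts equally regardless of how many examples it has.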
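
The row and column sums can be sketched in plain Python. The helpers `confusion_matrix` and `per_class_precision_recall` are hypothetical names, the data reuses the toy targets and outputs from this section's PyTorch snippet, and scoring a never-predicted class as 0.0 is an assumption of the sketch.

```python
def confusion_matrix(targets, predictions, num_classes):
    """M[i][j] counts the examples with target i and prediction j."""
    M = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(targets, predictions):
        M[t][p] += 1
    return M


def per_class_precision_recall(M):
    n = len(M)
    precisions, recalls = [], []
    for c in range(n):
        col_sum = sum(M[j][c] for j in range(n))  # everything predicted as c
        row_sum = sum(M[c])                       # everything whose target is c
        # Assumption: a class with an empty column or row scores 0.0.
        precisions.append(M[c][c] / col_sum if col_sum else 0.0)
        recalls.append(M[c][c] / row_sum if row_sum else 0.0)
    return precisions, recalls


targets     = [1, 0, 2, 0, 2, 0, 2, 0, 1, 0, 2, 0, 1]
predictions = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

M = confusion_matrix(targets, predictions, num_classes=3)
precisions, recalls = per_class_precision_recall(M)

# Macro average: unweighted mean over classes.
macro_precision = sum(precisions) / len(precisions)
macro_recall = sum(recalls) / len(recalls)

print(M)                               # [[6, 0, 0], [2, 1, 0], [4, 0, 0]]
print(precisions)                      # [0.5, 1.0, 0.0]
print([round(r, 4) for r in recalls])  # [1.0, 0.3333, 0.0]
print(macro_precision)                 # 0.5
print(round(macro_recall, 4))          # 0.4444
```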

Limitations:
- Not great for imbalanced classes, because every class gets the same weight. Say we have 2 classes: 2 examples from class A and a lot from class B, and we get precision 0.5 on both classes. The 0.5 on B should be a bigger deal, since it is based on far more examples.

Ways to address:
- weight each precision by the ratio of examples: prec_a * #A / (#A + #B) + prec_b * #B / (#A + #B).
- report not just the mean, but also the standard deviation of the per-class scores (how far they typically spread around the mean).
- compute the micro average.

## Micro average
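The micro average pools the counts over all classes before dividing: sum the TP, FP, and FN of every class, then compute precision, recall, and f1 from the totals. But this will produce equal precision, recall, and f1 if we consider all classes: when every example has exactly one label, each misclassified example is a false positive for the predicted class and a false negative for the target class, so the total FP equals the total FN and all three scores equal the accuracy.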
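
A minimal sketch in plain Python of why the three micro-averaged scores coincide, assuming single-label classification and reusing the toy targets and outputs from this section's PyTorch snippet:

```python
targets     = [1, 0, 2, 0, 2, 0, 2, 0, 1, 0, 2, 0, 1]
predictions = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
classes = sorted(set(targets))

# Pool the counts over all classes before dividing.
tp = fp = fn = 0
for c in classes:
    for t, p in zip(targets, predictions):
        tp += (t == c and p == c)
        fp += (t != c and p == c)
        fn += (t == c and p != c)

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
accuracy = sum(t == p for t, p in zip(targets, predictions)) / len(targets)

print(tp, fp, fn)                 # 7 6 6
print(round(micro_precision, 4))  # 0.5385
print(round(micro_recall, 4))     # 0.5385
print(round(micro_f1, 4))         # 0.5385
print(round(accuracy, 4))         # 0.5385
```

Each of the 6 wrong predictions is counted once as a false positive and once as a false negative, which is why the pooled FP and FN totals match.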

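The snippet below computes the accuracy and the precision weighted by the share of examples in each class:

```python
import torch

targets = torch.tensor([1, 0, 2, 0, 2, 0, 2, 0, 1, 0, 2, 0, 1]).float()
outputs = torch.tensor([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]).float()

# Accuracy: fraction of predictions that match the target.
accuracy = (targets == outputs).float().mean()
print(accuracy)

precision_lst = []
epsilon = 1e-8  # avoids division by zero for classes that are never predicted

for c in targets.unique():
    TP = ((targets == c) & (outputs == c)).sum()
    FP = ((targets != c) & (outputs == c)).sum()
    precision = TP / (TP + FP + epsilon)
    precision_lst.append(precision)

# Weight each class by its share of the examples (#c / #total).
weights = torch.bincount(targets.int()).float()
weights = weights / weights.sum()

# Weighted average: sum of per-class precision times class weight.
precisions = torch.stack(precision_lst)
precision = (precisions * weights).sum()

print(precision)
```

Output:

```
tensor(0.5385)
tensor(0.4615)
```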
