# The Confusion Matrix

In our $k$NN notebook we introduced accuracy, but accuracy is not always the best metric. Let's introduce some additional metrics for classification problems now.

## What we will accomplish

In this notebook we will:
- Mention some deficiencies with accuracy as a metric,
- Introduce the confusion matrix,
- Derive some new performance metrics and discuss when they are appropriate,
- Define:
    - Precision,
    - Recall,
    - Specificity,
    - Sensitivity, and
    - Various other rate-based metrics, and
- Provide a useful summary table that you can use as a "cheat sheet".

In [1]:
# to get the iris data
from sklearn.datasets import load_iris

# for data handling 
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("dark")

## Problems with accuracy

In the $k$-nearest neighbors notebook we defined the accuracy metric as:

$$
\text{Accuracy } = \ \frac{\text{The number of correct predictions}}{\text{Total number of predictions made}}.
$$

This can be a misleading metric because it obscures which kinds of observations the model got correct. For example, suppose $10\%$ of the observations are of class $1$ and $90\%$ are of class $0$. Then a model that predicts every observation to be $0$ would have $90\%$ accuracy. While we would generally assume that $90\%$ accuracy indicates good performance, in this situation we have failed to identify any observation of class $1$. This would be terrible if, for instance, class $1$ represented the diagnosis of a treatable or curable disease.
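To make this concrete, here is a minimal sketch using made-up labels (1,000 observations with 10% in class 1; these numbers are illustrative assumptions, not data from this notebook). A model that always predicts 0 reaches 90% accuracy while identifying none of the class 1 observations.

```python
import numpy as np

# Hypothetical imbalanced labels: 100 observations of class 1, 900 of class 0
y_true = np.array([1] * 100 + [0] * 900)

# A "model" that predicts class 0 for every observation
y_pred = np.zeros_like(y_true)

# Accuracy: fraction of predictions that match the actual labels
accuracy = np.mean(y_pred == y_true)

# Fraction of actual class 1 observations that were predicted to be class 1
found_class_1 = np.mean(y_pred[y_true == 1] == 1)

print("Accuracy:", accuracy)
print("Fraction of class 1 observations identified:", found_class_1)
```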

We will thus look to develop additional performance metrics for classification problems that will allow us to think about the ways in which our models are correct or incorrect.

## The confusion matrix

Additional performance measures can be derived from the confusion matrix, pictured for binary problems below.

<img src="conf_mat.png" alt="Confusion Matrix Image" width="50%;">

Each box of the confusion matrix contains a count of how the algorithm sorted the observations. For instance, the TP box holds the total number of correct positive classifications (observations of class $1$ correctly classified as $1$) the algorithm made. The diagonal thus represents data points that are correctly predicted by the algorithm, and the off-diagonal represents points that are incorrectly predicted by the algorithm.
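As a rough illustration of where the four counts come from, the sketch below tallies them with boolean masks on a small pair of made-up label arrays (the arrays are assumptions for illustration only). The counts are arranged with actual classes along the rows and predicted classes along the columns, which is the layout `sklearn`'s `confusion_matrix` uses; the picture above may order the boxes differently.

```python
import numpy as np

# Small made-up example: actual labels and a classifier's predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # actual 1, predicted 1
tn = np.sum((y_true == 0) & (y_pred == 0))  # actual 0, predicted 0
fp = np.sum((y_true == 0) & (y_pred == 1))  # actual 0, predicted 1
fn = np.sum((y_true == 1) & (y_pred == 0))  # actual 1, predicted 0

# Rows: actual class (0 then 1); columns: predicted class (0 then 1)
conf_mat = np.array([[tn, fp],
                     [fn, tp]])
print(conf_mat)
```

The diagonal entries (`tn` and `tp`) count the correct predictions, so their sum divided by the total number of observations recovers the accuracy.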

For those of you more familiar with frequentist statistics, we can think of the false negatives as the classifier making a type II error and the false positives as the classifier making a type I error.

A confusion matrix is referred to as a <i>contingency table</i> in some fields.

<i>Note that you can extend the confusion matrix to a multiclass problem by just adding rows and columns accordingly. However, we will lose the true positive, true negative nomenclature. We will see this extension in a later notebook.</i>
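A minimal sketch of that extension, assuming integer class labels 0, 1, 2 and made-up predictions: entry (i, j) counts the observations whose actual class is i and whose predicted class is j, so the diagonal still holds the correct predictions.

```python
import numpy as np

# Made-up three-class example (labels 0, 1, 2)
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0, 2])

n_classes = 3
conf_mat = np.zeros((n_classes, n_classes), dtype=int)

# Add 1 to entry (actual, predicted) for every observation;
# np.add.at accumulates correctly when an index pair repeats
np.add.at(conf_mat, (y_true, y_pred), 1)

print(conf_mat)

# Correct predictions sit on the diagonal
n_correct = np.trace(conf_mat)
print("Number correct:", n_correct)
```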

### Metrics derived from the confusion matrix

It can be difficult to convey classifier performance by just looking at the confusion matrix. Moreover, in your particular problem you may only be interested in a certain kind of performance. As an example, consider the case where you work for a company that builds software to flag hate speech in forum posts. In this situation your primary concern is to accurately flag hate speech when it is posted, while limiting incorrect hate speech flags may be a secondary concern.

There has thus been extensive development of metrics derived from the confusion matrix that assess different types of classification performance.

#### Precision and recall

Two popular measures derived from the confusion matrix are the algorithm's <i>precision</i> and <i>recall</i>:

$$
\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \text{ out of all points predicted to be class } 1, \text{ what fraction were actually class } 1?
$$

$$
\text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \text{ out of all the actual data points in class } 1 \text{, what fraction did the algorithm correctly predict?}
$$

You can think of precision as how much you should trust the algorithm when it says something is class $1$. 

Recall estimates the probability that the algorithm correctly detects class $1$ data points.
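For a quick check of the two formulas, here is a small sketch that computes them from raw counts. The counts are made up for illustration, and the helper name `precision_recall` is hypothetical. Note that precision is undefined when the algorithm never predicts class $1$ (TP + FP = 0), and recall is undefined when there are no actual class $1$ points; the sketch returns `None` in those cases, which is just one possible convention.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion matrix counts.

    A metric whose denominator is zero is returned as None.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall


# Made-up counts: 30 true positives, 10 false positives, 20 false negatives
precision, recall = precision_recall(tp=30, fp=10, fn=20)
print("precision:", precision)  # 30 / (30 + 10)
print("recall:", recall)        # 30 / (30 + 20)

# A classifier that never predicts class 1 has no defined precision
print(precision_recall(tp=0, fp=0, fn=50))
```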


Let's examine the training precision and recall for a virginica classifier using the iris data.

In [2]:
## Load the data
iris = load_iris()
iris_df = pd.DataFrame(iris['data'],columns = ['sepal_length','sepal_width','petal_length','petal_width'])

## Create a virginica variable
## this will be our target
iris_df['virginica'] = 0 
iris_df.loc[iris['target'] == 2,'virginica'] = 1

X = iris_df[['sepal_length','sepal_width','petal_length','petal_width']].to_numpy()
y = iris_df['virginica'].to_numpy()

In [4]:
iris_df.tail()

| | sepal_length | sepal_width | petal_length | petal_width | virginica |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 1 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 1 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 1 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 1 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 1 |


In [5]:
from sklearn.model_selection import train_test_split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=111,
                                                    stratify=y)

Now we will build a $k$-nearest neighbor classifier using $k=5$. We'll then examine the confusion matrix on the training data.

In [7]:
## import Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier

In [8]:
## Make the model object
knn = KNeighborsClassifier(n_neighbors = 5)

## Fit the model object
knn.fit(X_train,y_train)

## get the predictions
y_train_pred = knn.predict(X_train)

In [9]:
y_train_pred

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 1])

`sklearn` provides a quick way to get the confusion matrix for a classifier, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html">https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html</a>.

In [10]:
## now we can import the confusion matrix
## function from sklearn
from sklearn.metrics import confusion_matrix

In [11]:
## just like mse, actual then prediction
confusion_matrix(y_train, y_train_pred)

array([[77,  3],
       [ 2, 38]])
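The argument order matters here. As a small sketch with made-up labels: `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns, so for 0/1 labels the output reads `[[TN, FP], [FN, TP]]`, and swapping the two arguments transposes the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels chosen so that the FP and FN counts differ
y_actual = np.array([0, 0, 0, 0, 1, 1, 1])
y_hat = np.array([0, 0, 1, 1, 1, 1, 0])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_actual, y_hat)
print(cm)

# Swapping the arguments transposes the matrix, exchanging FP and FN
cm_swapped = confusion_matrix(y_hat, y_actual)
print(cm_swapped)
```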

In [12]:
## Unpack the four counts from the confusion matrix
## for 0/1 labels the flattened order is TN, FP, FN, TP
TN, FP, FN, TP = confusion_matrix(y_train, y_train_pred).ravel()

In [14]:
## calculate recall and precision here
print("The training recall is", 
         np.round(TP/(FN + TP), 4))

print("The training precision is", 
         np.round(TP/(FP + TP), 4))


The training recall is 0.95
The training precision is 0.9268


Alternatively we could use `sklearn`'s precision and recall metrics.

- `precision_score` docs, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html">https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html</a>
- `recall_score` docs, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html">https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html</a>.

In [16]:
## import precision and recall
from sklearn.metrics import precision_score, recall_score

## print the precision and recall here
print("The training recall is",
         np.round(recall_score(y_train, y_train_pred), 5))

print("The training precision is",
         np.round(precision_score(y_train, y_train_pred), 5))



The training recall is 0.95
The training precision is 0.92683


#### True positive rate, false positive rate, true negative rate and false negative rate

We may also be interested in metrics whose names are easier to remember. For example:
- Given that an observation is actually positive:
    - what is the probability that we correctly predict it is a positive? This is estimated with the <i>true positive rate</i>. (Note that this is the same as recall.)
    - what is the probability that we incorrectly predict it is a negative? This is estimated with the <i>false negative rate</i>.
- Given that an observation is actually negative:
    - what is the probability that we correctly predict it is a negative? This is estimated with the <i>true negative rate</i>.
    - what is the probability that we incorrectly predict it is a positive? This is estimated with the <i>false positive rate</i>.
    
Depending on the application we may be especially interested in optimizing one or more of these measures. For example, if we were trying to predict whether someone has a serious infectious disease, we may be most interested in keeping the false negative rate low.

The formulae for these are given by:

$$
\text{true positive rate } = \text{ TPR } = \frac{\text{TP}}{\text{TP} + \text{FN}}
$$

$$
\text{false negative rate } = \text{ FNR } = \frac{\text{FN}}{\text{TP} + \text{FN}}
$$

$$
\text{true negative rate } = \text{ TNR } = \frac{\text{TN}}{\text{TN} + \text{FP}}
$$

$$
\text{false positive rate } = \text{ FPR } = \frac{\text{FP}}{\text{TN} + \text{FP}}
$$

We can calculate these by hand from the entries of the matrix returned by `confusion_matrix`.

In [17]:
## Calculate All four for our virginica classifier

## TPR
print("The training true positive rate is",
         np.round(TP/(TP+FN),4))


## FNR
print("The training false negative rate is",
         np.round(FN/(TP+FN),4))



## TNR
print("The training true negative rate is",
         np.round(TN/(TN+FP),4))



## FPR
print("The training false positive rate is",
         np.round(FP/(FP+TN),4))




The training true positive rate is 0.95
The training false negative rate is 0.05
The training true negative rate is 0.9625
The training false positive rate is 0.0375
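As a sanity check on these numbers, the sketch below recomputes the four rates from the counts in the training confusion matrix shown earlier (TN = 77, FP = 3, FN = 2, TP = 38), using lowercase names so as not to overwrite the notebook's variables. It also illustrates that the rates come in complementary pairs, TPR + FNR = 1 and TNR + FPR = 1, so one rate from each pair determines the other.

```python
# Counts copied from the training confusion matrix above
tn, fp, fn, tp = 77, 3, 2, 38

tpr = tp / (tp + fn)  # true positive rate
fnr = fn / (tp + fn)  # false negative rate
tnr = tn / (tn + fp)  # true negative rate
fpr = fp / (tn + fp)  # false positive rate

# Each pair shares a denominator, so the pair sums to 1
print("TPR + FNR =", tpr + fnr)
print("TNR + FPR =", tnr + fpr)
```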


#### Sensitivity and specificity

These two have a long history of use in public health when assessing the performance of screening and diagnostic tests.

- The <i>sensitivity</i> of a classifier is the probability that it correctly identifies a positive observation (note that this is the same as the true positive rate and recall) and
- The <i>specificity</i> of a classifier is the probability that it correctly identifies a negative observation (note that this is the same as the true negative rate).

The formulae for both are given by:

$$
\text{Sensitivity } = \frac{\text{TP}}{\text{TP} + \text{FN}}, 
$$

$$
\text{Specificity } = \frac{\text{TN}}{\text{TN} + \text{FP}}.
$$

While these two are identical to metrics given above, these are common enough names that it is important for you to be formally introduced to them.
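Because specificity is just recall computed with class $0$ treated as the positive class, one way to get both from `sklearn` is `recall_score` with its `pos_label` argument. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import recall_score

# Made-up actual labels and predictions
y_actual = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# Sensitivity: recall for class 1, TP / (TP + FN)
sensitivity = recall_score(y_actual, y_hat, pos_label=1)

# Specificity: recall for class 0, TN / (TN + FP)
specificity = recall_score(y_actual, y_hat, pos_label=0)

print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
```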

## Too much to remember

That is a lot of metrics to remember. It is okay if you cannot perfectly remember what name goes with what formula (I still have to look up precision and recall). To help you out we have provided a "cheat sheet" with a table of metrics derived from the confusion matrix. You can find it here <a href="confusion_matrix_cheat_sheet.pdf">confusion_matrix_cheat_sheet.pdf</a>. You can also find metrics that we did not cover in this notebook here, <a href="https://en.wikipedia.org/wiki/Confusion_matrix">https://en.wikipedia.org/wiki/Confusion_matrix</a>.

## Careful consideration

In real world settings it is important to give careful consideration to which performance metric(s) are optimized in model selection. When selecting a metric, try to work out what it translates into for the real world problem you are considering.

For example, public health practitioners often focus on sensitivity and specificity because they can be translated into real world health impacts.

In the case of a deadly disease for which we have successful treatment regimens, we may choose tests that have high sensitivity. On the other hand, we may opt for high specificity if the disease or condition in question does not tend to cause severe outcomes in the individual and the test or treatment is highly invasive or expensive.
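To see how these two metrics translate into counts of people, here is a back-of-the-envelope sketch. All of the numbers (population size, prevalence, sensitivity, specificity) are hypothetical and chosen only for illustration.

```python
# Hypothetical screening scenario
population = 10_000
prevalence = 0.01      # assumed fraction of people who have the condition
sensitivity = 0.90     # assumed P(test positive | has condition)
specificity = 0.95     # assumed P(test negative | does not have condition)

n_sick = population * prevalence
n_healthy = population - n_sick

# Expected counts in each box of the confusion matrix
expected_tp = sensitivity * n_sick
expected_fn = n_sick - expected_tp        # missed cases
expected_tn = specificity * n_healthy
expected_fp = n_healthy - expected_tn     # false alarms

print("Expected missed cases (FN):", round(expected_fn))
print("Expected false alarms (FP):", round(expected_fp))
```

Under these assumed numbers the false alarms far outnumber the missed cases, which is why the relative cost of each kind of error should guide which metric you prioritize.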

A careful consideration of metrics can also contribute to your understanding of what the classifier is capable of, which should help you frame your findings in terms that stakeholders can better understand.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph. D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)