# Iris Dataset

---

Let's fit a *k*-NN classifier on the Iris training data and generate predictions on the test features to build a confusion matrix.
tgt_preds = (KNeighborsClassifier()
             .fit(iris_train_ftrs, iris_train_tgt)
             .predict(iris_test_ftrs))

print("accuracy:", accuracy_score(iris_test_tgt, tgt_preds))
cm = confusion_matrix(iris_test_tgt, tgt_preds)
print("confusion matrix:", cm, sep="\n")

Let's use `seaborn` to display the confusion matrix as a heatmap.

sns.set(font_scale=2.0)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
cm = confusion_matrix(iris_test_tgt, tgt_preds)
ax = sns.heatmap(cm, annot=True, square=True,
                 xticklabels=iris.target_names,
                 yticklabels=iris.target_names,
                 fmt='g', cmap='rocket_r')
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix for the Iris Dataset')
plt.show()

From the confusion matrix above, we see the Setosa class is the easiest to classify since the classifier predicts 18 out of 18 of them correctly. Regarding Versicolor, the classifier misclassified one as a Virginica. Regarding the Virginicas, the classifier misclassified two as Versicolor. The remaining 16 Versicolors and 13 Virginicas were classified correctly.

## Dealing with Multiple Classes: Multiclass Averaging

---
When dealing with the Iris dataset, we are no longer dealing with two classes like we were in the MNIST dataset. Therefore, our dichotomous formulas for precision and recall no longer work in this instance.

However, from the confusion matrix above, we see we made three mistakes in our classifications. We classified one Versicolor as Virginica and two Virginica as Versicolor. Let's think about the value of a prediction for a moment. In the two-class setting, precision indicates a prediction's worth: we calculate it by dividing our correct positive predictions by our total positive predictions. In other words, the denominator is the sum of all the values in the `PredP` column.

Implementing a similar approach in the multiclass confusion matrix above, we can consider each column independently and divide the number of predictions we got right per column by the total number of predictions per column, referred to as a one-versus-rest approach. Considering the confusion matrix above, we were correct 16 times and wrong 2 when predicting Versicolor, $\frac{16}{18}$. Likewise, we were correct 13 times and incorrect 1 when predicting Virginica, $\frac{13}{14}$. Since the classifier always predicted the Setosa class correctly, its one-versus-rest precision score equates to $1$. In total, $\{1, \frac{16}{18}, \frac{13}{14}\}$. Let's take the mean of these three numbers.

np.mean([1, 16/18, 13/14])

### Macro Precision

---

In `sklearn`, this way of averaging is called `macro`. To calculate the _macro precision_, for each column in the confusion matrix, we take the diagonal entry, which counts the predictions we got right, and divide by the sum of all values in the column. We then sum these values and divide by the number of columns to get the average.
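The column-wise recipe just described can also be sketched in plain Python, without `sklearn`. The matrix here is an assumption: it is reconstructed from the counts quoted in the discussion of the heatmap (rows are actual classes, columns are predicted classes, in the order Setosa, Versicolor, Virginica), not read from the fitted classifier.

```python
# Confusion matrix reconstructed from the counts quoted in the text
# (assumption: rows = actual, columns = predicted).
cm = [[18, 0, 0],
      [0, 16, 1],
      [0, 2, 13]]

n_classes = len(cm)

# Total number of predictions made for each class (column sums).
col_sums = [sum(row[j] for row in cm) for j in range(n_classes)]

# One-versus-rest precision: diagonal entry divided by its column sum.
per_class_precision = [cm[j][j] / col_sums[j] for j in range(n_classes)]

# Macro precision: unweighted mean of the per-class values.
macro_precision = sum(per_class_precision) / n_classes
print(round(macro_precision, 4))  # 0.9392
```

If the reconstruction is right, this value should agree with the `average='macro'` result from `precision_score`.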

macro_prec = precision_score(iris_test_tgt,
                             tgt_preds,
                             average='macro')
print(f'Macro Precision: {macro_prec}')

cm = confusion_matrix(iris_test_tgt, tgt_preds)
n_labels = len(iris.target_names)
print(
    f"Should Equal 'Macro Precision' {(np.diag(cm) / cm.sum(axis=0)).sum() / n_labels}")
### Micro Precision

---
According to the `sklearn` documentation, despite the name, the `micro` average "calculates metrics globally by counting the total true positives, false negatives, and false positives." In other words, the `micro` average takes all the *correct* predictions and divides them by _all_ the predictions. We can calculate the micro average manually by summing the values on the diagonal of the confusion matrix and dividing by the sum of all values in the confusion matrix.

print("micro:", precision_score(iris_test_tgt, tgt_preds, average='micro'))

cm = confusion_matrix(iris_test_tgt, tgt_preds)
print("should equal avg='micro':",
      np.diag(cm).sum() / cm.sum())
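For a quick sanity check of the arithmetic, here is a minimal plain-Python sketch of the same diagonal-over-total calculation. The matrix is an assumption reconstructed from the counts quoted earlier in the text (rows actual, columns predicted).

```python
# Confusion matrix reconstructed from the counts quoted in the text
# (assumption: rows = actual, columns = predicted).
cm = [[18, 0, 0],
      [0, 16, 1],
      [0, 2, 13]]

# Correct predictions sit on the diagonal.
correct = sum(cm[i][i] for i in range(len(cm)))

# Every entry of the matrix is one prediction.
total = sum(sum(row) for row in cm)

micro_precision = correct / total
print(f"{correct} / {total} = {micro_precision:.2f}")  # 47 / 50 = 0.94
```

Because every example receives exactly one label here, this ratio is the same number as the overall accuracy.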

### Classification Report

---

The `classification_report` wraps several of these metrics together. It computes the one-versus-all statistics and then a _weighted_ average of the values--like `macro` except with different weights. The weights come from the _support_. The support of a classification rule, such as _if x is a cat and x is striped, and x is big, then x is a tiger_, is the count of examples where the rule applies. If 35 instances out of 100 meet the constraints on the left-hand side of the _if_, then the support is 35. The classification report considers the _support_ in terms of reality. In other words, it is equivalent to the total counts in each _row_ of the confusion matrix.

Let's take a look at some simple examples.

#### Simple Examples

---

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

y_true = [1, 1, 1]
y_pred = [0, 1, 1]
print(classification_report(y_true, y_pred, zero_division=0))

print(classification_report(iris_test_tgt, tgt_preds))
# average is a weighted macro average

# verify sums-across-rows
cm = confusion_matrix(iris_test_tgt, tgt_preds)
print("row counts equal support:", cm.sum(axis=1))
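The support weighting described above can be sketched by hand. This is an illustration under an assumption: the confusion matrix is reconstructed from the counts quoted earlier in the text, and only the precision column of the report is reproduced.

```python
# Confusion matrix reconstructed from the counts quoted in the text
# (assumption: rows = actual, columns = predicted).
cm = [[18, 0, 0],
      [0, 16, 1],
      [0, 2, 13]]
n_classes = len(cm)

# Support: how many examples of each class exist in reality (row sums).
support = [sum(row) for row in cm]

# Per-class precision: diagonal entry over its column sum.
col_sums = [sum(row[j] for row in cm) for j in range(n_classes)]
precision = [cm[j][j] / col_sums[j] for j in range(n_classes)]

# Macro average treats every class equally.
macro_avg = sum(precision) / n_classes

# Weighted average weights each class by its support.
weighted_avg = sum(p * s for p, s in zip(precision, support)) / sum(support)

print("support:     ", support)
print(f"macro avg:    {macro_avg:.4f}")
print(f"weighted avg: {weighted_avg:.4f}")
```

With nearly balanced supports (18, 17, 15) the two averages are close; they drift apart as the class counts become more uneven.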

# $F_1$ Score

---

The precision and the recall measure the two types of errors we can make for the positive class. Maximizing the precision minimizes the false positives and maximizing the recall minimizes the false negatives.

The $F_1$ score is one of the standard measures to rate a classifier's success. The $F_1$ score is the harmonic mean of two other metrics: precision and recall. A harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the data. We can express this using the following formula:

$$ H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \frac{1}{x_3} + \ldots + \frac{1}{x_n}} $$

$H$ is the harmonic mean, $n$ is the number of data points, and $x_i$ is the *i*th value in the dataset, for $i = 1, \ldots, n$.
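To make the definition concrete, the following minimal sketch computes the harmonic mean by hand and compares it with `statistics.harmonic_mean` from the Python standard library. The helper name `harmonic_mean_by_hand` and the two sample values are illustrative choices, not part of the original notebook.

```python
from statistics import harmonic_mean


def harmonic_mean_by_hand(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    n = len(values)
    return n / sum(1 / x for x in values)


rates = [0.5, 1.0]  # illustrative values, e.g. a precision and a recall

by_hand = harmonic_mean_by_hand(rates)
from_stdlib = harmonic_mean(rates)
arithmetic = sum(rates) / len(rates)

print(f"harmonic (by hand): {by_hand:.4f}")
print(f"harmonic (stdlib):  {from_stdlib:.4f}")
print(f"arithmetic:         {arithmetic:.4f}")
```

The harmonic mean (about 0.667) sits below the arithmetic mean (0.75), which is why a single low value drags the combined score down.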

Applying the harmonic mean to the precision and recall, we get the following:

$$ F_1 = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = \frac{\text{tp}}{\text{tp} + \frac{1}{2}(\text{fp} + \text{fn})} $$

The formula represents an _equal_ tradeoff between precision and recall. We want to be equally right in the value of our predictions and concerning the real world. The `fbeta_score` is the weighted harmonic mean of precision and recall. It reaches its optimal value at 1 and its worst value at 0. The `beta` parameter determines the weight of recall in the combined score. `beta < 1` lends more weight to precision, whereas `beta > 1` favors recall ( `beta -> 0` considers only precision, `beta -> +inf` only recall). 

$$ F_\beta = (1 + \beta^2) \times \frac{\text{precision} \times \text{recall}}{(\beta^2 \times \text{precision}) + \text{recall}} = \frac{(1 + \beta^2)\,\text{tp}}{(1 + \beta^2)\,\text{tp} + \beta^2\,\text{fn} + \text{fp}} $$

The intuition for the F-score is that both measures are balanced in importance and that only a good precision and a good recall together result in a good F-measure.
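The two forms of the $F_\beta$ formula above can be checked against each other with a short sketch. The function names and the counts (`tp`, `fp`, `fn`) are hypothetical, chosen only for illustration, and returning 0.0 when the denominator is zero is an assumed convention, similar in spirit to passing `zero_division=0`.

```python
def f_beta_from_rates(precision, recall, beta=1.0):
    """F-beta from precision and recall (first form of the formula)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    # Assumed convention: report 0.0 when the score is undefined.
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom


def f_beta_from_counts(tp, fp, fn, beta=1.0):
    """F-beta from raw confusion-matrix counts (second form)."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return 0.0 if denom == 0 else (1 + b2) * tp / denom


# Hypothetical counts for the positive class.
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)  # 0.8
recall = tp / (tp + fn)     # 0.666...

for beta in (0.5, 1.0, 2.0):
    from_rates = f_beta_from_rates(precision, recall, beta)
    from_counts = f_beta_from_counts(tp, fp, fn, beta)
    print(f"beta={beta}: rates={from_rates:.4f} counts={from_counts:.4f}")
```

Since precision exceeds recall in this example, the score falls as `beta` grows and recall gains weight.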

## Worst Case

---

Let's take a look at a simple example. If the classifier perfectly mispredicts all instances, we have zero precision and zero recall, resulting in a zero F-measure.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f = f1_score(y_true, y_pred)
print(f'No Precision or Recall: p={p:.3f} r={r:.3f} f={f:.3f}')

Given that the precision and recall are only concerned with the positive class, we can achieve the same worst-case precision, recall, and F-measure by predicting the negative class for all examples:
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
p = precision_score(y_true, y_pred, zero_division=0)
r = recall_score(y_true, y_pred)
f = f1_score(y_true, y_pred)
print(f'No Precision or Recall: p={p:.3f} r={r:.3f} f={f:.3f}')

Given that no positive cases were predicted, we must output a zero precision and recall and, in turn, F-measure.

## Best Case

---

Conversely, perfect predictions result in a perfect precision and recall, and, in turn, a perfect F-Score.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f = f1_score(y_true, y_pred)
print(f'Perfect Precision and Recall: p={p:.3f} r={r:.3f} f={f:.3f}')

## 50% Precision, Perfect Recall

---
It is not possible to have perfect precision and no recall, or no precision and perfect recall. Both precision and recall require true positives to be predicted. Let's consider the case where we predict the positive class for all cases.

This gives us 50% precision as half of the predictions are false positives. It gives us perfect recall because there are no false negatives. For the balanced dataset we are using in our examples, the precision ratio would be 0.5. Combining 50 percent precision with perfect recall results in a penalized F-measure, specifically the harmonic mean between 50 percent and 100 percent.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f = f1_score(y_true, y_pred)
print(f'Result: p={p:.3f} r={r:.3f} f={f:.3f}')
## Fbeta-Measure

---

The F-measure balances precision and recall. However, on some problems, we might be interested in an F-measure that puts more attention on precision, such as when false positives are more important to minimize, but false negatives are still important. An example where it might be important to minimize false positives is search engine results.

On other problems, we might be interested in an F-measure that puts more attention on recall, such as when it is more important to minimize false negatives, but false positives are still important.

The solution is the Fbeta-measure.

The Fbeta-measure is a generalization of the F-measure where the balance of precision and recall in the calculation of the harmonic mean is controlled by a coefficient called _beta_.

$$ F_\beta = \frac{((1 + \beta^2) \times \text{Precision} \times \text{Recall})}{\beta^2 \times \text{Precision} + \text{Recall}} $$

Three common values for the beta parameter are as follows:

* F0.5-Measure (beta=0.5): More weight on precision, less weight on recall
* F1-Measure (beta=1.0): Balance the weight on precision and recall
* F2-Measure (beta=2.0): Less weight on precision, more weight on recall

Let's take a closer look at each of these cases.
### F1-Measure

---

The F-measure discussed in the previous section is an example of the Fbeta-measure with a _beta_ value of 1. The F-measure and F1-measure calculate the same thing.

$$ F_1 = \frac{(1 + 1^2) \times \text{Precision} \times \text{Recall}}{1^2 \times \text{Precision} + \text{Recall}} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
Consider the case where we have 50 percent precision and perfect recall. Let's manually calculate the F1-measure for this case.
from decimal import Decimal
Decimal((2 * p * r) / (p + r)).quantize(Decimal("0.01"))

### F0.5-Measure

---

The F0.5-measure raises the importance of precision and lowers the importance of recall. It focuses more on minimizing false positives than on reducing false negatives because it emphasizes precision, which penalizes false positives (incorrect hits).

Consider the case where we have 50 percent precision and perfect recall. We can manually calculate the F0.5-measure for this case as follows:

$$ F_{0.5} = \frac{(1 + 0.5^2) \times \text{Precision} \times \text{Recall}}{0.5^2 \times \text{Precision} + \text{Recall}} = \frac{1.25 \times \text{Precision} \times \text{Recall}}{0.25 \times \text{Precision} + \text{Recall}} $$
Decimal((1.25 * p * r) / (0.25 * p + r)).quantize(Decimal("0.01"))
Let's confirm this calculation.
# 50% precision, perfect recall
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f = fbeta_score(y_true, y_pred, beta=0.5)
print(f'Result: p={p:.3f} r={r:.3f} f={f:.3f}')
### F2-Measure

---

The F2-measure has the effect of lowering the importance of precision and increasing the importance of recall. Because maximizing precision minimizes false positives and maximizing the recall minimizes false negatives, the F2-measure puts more attention on minimizing false negatives than minimizing false positives. 

Let's consider the case where we have 50 percent precision and perfect recall. 
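A minimal sketch of that calculation follows, done by hand in plain Python so each step is visible. It reuses the all-positive predictions from the earlier examples; the counting logic is an illustration, not the `sklearn` implementation, although `fbeta_score(y_true, y_pred, beta=2.0)` should give the same value.

```python
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [1] * 10  # predict the positive class for every example

# Count the outcomes for the positive class.
tp = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 1)
fp = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 1)
fn = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 0)

precision = tp / (tp + fp)  # 5 / 10 = 0.5
recall = tp / (tp + fn)     # 5 / 5  = 1.0

# F-beta with beta = 2 weights recall more heavily than precision.
beta = 2.0
f2 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f'Result: p={precision:.3f} r={recall:.3f} f={f2:.3f}')
# Result: p=0.500 r=1.000 f=0.833
```

With the same precision and recall, the F2 score (0.833) is higher than the F1 score (0.667) and the F0.5 score (0.556), because the perfect recall now counts for more.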
The $F_1$ Score metric is preferable when:
* We have imbalanced class distribution
* We're looking for a balanced measure between precision and recall (Type I and Type II errors)

Because the $F_1$ score ignores true negatives, it is more sensitive than accuracy to how the classes are distributed, which makes it a suitable measure for classification problems on imbalanced datasets.

Unlike binary classification, multi-class classification generates an $F_1$ score for each class separately. We can also compute an averaged $F_1$ score per classifier in Python.

$F_1$ computes a different kind of average from the confusion matrix. By average, we are talking about a measure of center. We know about the mean (the arithmetic average) and median (the middle-most of sorted values). However, there are other types of averages out there. The ancient Greeks cared about three averages: the arithmetic mean, the geometric mean, and the harmonic mean.

One way to view the geometric and harmonic means is as wrappers around a converted arithmetic mean. The geometric mean is computed by taking the arithmetic mean of the logarithms of the values and then exponentiating the result. What we are concerned with here is the harmonic mean. To compute it, we (1) take the arithmetic mean of the reciprocals and then (2) take the reciprocal of that. The harmonic mean is very useful when summarizing rates like speed or comparing different fractions. $F_1$ is the harmonic mean of precision and recall; with two values, $n = 2$ shows up as a constant in front. The formula is as follows:

$$ F_1 = 2 \times \frac{1}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} $$

If we apply some algebra by taking common denominators and doing an invert-and-multiply, we get the usual textbook formula for $F_1$:

$$ F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} $$

The formula represents an *equal* tradeoff between precision and recall, which means we want to be equally right in the value of our predictions and concerning the real world.

# Receiver Operating Characteristic (ROC) with cross-validation

---

The ROC metric can be used to evaluate classifier output quality. The ROC curve typically features the true positive rate on the Y-axis and the false positive rate on the X-axis. The top left corner of the plot is the "ideal" point, representing a false positive rate of zero and a true positive rate of one. This result is often unrealistic, but a larger area under the curve is usually better. Also, the "steepness" of the ROC curve is important since it is ideal to maximize the true positive rate while minimizing the false positive rate.

The following shows the ROC response of different datasets created from *k*-fold cross-validation. Taking all of these curves, we can calculate the mean area under the curve and see the variance of the curve when the training set is split into different subsets. This shows how the classifier output is affected by changes in the training data, and how different the splits generated by *k*-fold cross-validation are from one another.

## Data IO and generation

---

import numpy as np

from sklearn import datasets

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

X, y = X[y != 2], y[y != 2]
n_samples, n_features = X.shape

# Add noisy features
random_state = np.random.RandomState(0)
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
X

## Classification and ROC analysis

---

import matplotlib.pyplot as plt

from sklearn import svm
from sklearn.metrics import auc
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import StratifiedKFold

# Run classifier with cross-validation and plot ROC curves
cv = StratifiedKFold(n_splits=6)
classifier = svm.SVC(kernel="linear", probability=True,
                     random_state=random_state)

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
fig, ax = plt.subplots()
for i, (train, test) in enumerate(cv.split(X, y)):
    classifier.fit(X[train], y[train])
    viz = RocCurveDisplay.from_estimator(
        classifier,
        X[test],
        y[test],
        name="ROC fold {}".format(i),
        alpha=0.3,
        lw=1,
        ax=ax,
    )
    interp_tpr = np.interp(mean_fpr, viz.fpr, viz.tpr)
    interp_tpr[0] = 0.0
    tprs.append(interp_tpr)
    aucs.append(viz.roc_auc)

ax.plot([0, 1], [0, 1], linestyle="--", lw=2,
        color="r", label="Chance", alpha=0.8)

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
ax.plot(
    mean_fpr,
    mean_tpr,
    color="b",
    label=r"Mean ROC (AUC = %0.2f $\pm$ %0.2f)" % (mean_auc, std_auc),
    lw=2,
    alpha=0.8,
)

std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
ax.fill_between(
    mean_fpr,
    tprs_lower,
    tprs_upper,
    color="grey",
    alpha=0.2,
    label=r"$\pm$ 1 std. dev.",
)

ax.set(
    xlim=[-0.05, 1.05],
    ylim=[-0.05, 1.05],
    title="Receiver operating characteristic example",
)
ax.legend(loc="lower right")
plt.show()

How do we compress the rich information in a many-values confusion matrix into simpler values?

In the classification above, we made three mistakes, predicting one Versicolor as Virginica and two Virginica as Versicolor. In our two-class metrics, it was the precision that drew out information about the positive prediction column. When we predict Versicolor, we are correct 16 times and wrong 2. Considering Versicolor by itself, we can calculate something similar to precision, giving us $\frac{16}{18} \approx .89$. Similarly, for the Virginica class we get $\frac{13}{14} \approx .93$. Finally, for the Setosa class, we get $\frac{18}{18} = 1.0$. The mean of $\{\frac{16}{18}, \frac{13}{14}, 1\}$ is about .9392.

import numpy as np

# per-class (one-versus-rest) precision values read off the confusion matrix
versicolor_prec = 16/18
virginica_prec = 13/14
setosa_prec = 1
precisions = np.array([versicolor_prec, virginica_prec, setosa_prec])
avg = np.average(precisions)
round(avg, 4)

### ROC Curves

---

Classification methods can do more than just slap a label on an example. They can give a probability to each prediction. Imagine a classifier that comes up with scores for ten individuals who might have a disease. The scores are .05, .15, ..., .95. Based on training, it is determined that .7 is the best break point between folks that have the disease (higher scores) and folks that are healthy (lower scores). By moving the bar to the left, that is, by lowering the numerical break point, we increase the number of hits (sick claims) that the classifier is making. Let's add in the truth as to whether these people are sick and create a confusion matrix.

import pandas as pd

d = {'Pred Positive': ['.05 .15 .25', '.35 .45'],
     'Pred Negative': ['.55 .65', '.75 .85 .95']}

df = pd.DataFrame(data=d, index=['Real Positive', 'Real Negative'])
df

Imagine in the confusion matrix that we can move the bar between predicted positives and predicted negatives. Predicted positives are to the left and predicted negatives are to the right. Let's look at the confusion matrix again.

import pandas as pd

d = {'Predicted Positive': ['TP', 'FP'],
     'Predicted Negative': ['FN', 'TN']}

df = pd.DataFrame(data=d, index=['Real Positive', 'Real Negative'])
df

By moving the PredictionBar far enough to the right we change predicted negatives to predicted positives. If we slam the PredictionBar all the way to the right, we predict everything as a predicted positive or sick person. As a side effect, there are no predicted negatives. As a result, there are no false negatives -- we predicted everything as a 1. By predicting everything as a positive, we do great on real positives and terrible on real negatives.

Let's imagine a corresponding scenario where the PredictionBar is moved all the way to the left. Now we predict nothing as positive and everything as negative. In the top row, real positives are all predicted negative. In the bottom row, all real negatives are predicted negative. There is no equivalent setup with a *horizontal* bar between real positives and real negatives. Real cats cannot become real dogs. Our predictions can change; reality can't.

In learning systems, there are often tradeoffs that must be made. Here, the tradeoff is between how many false positives we will tolerate versus how many false negatives we will tolerate. We can control this tradeoff by moving our prediction bar, by setting a threshold. 

We can be hyper-risk-averse, labeling everyone sick so we don't miss catching a sick person (at the expense that we label some healthy people sick), or we can label everyone healthy, and not treat anyone, even the sick. There are two questions to answer:

1. How do we evaluate and select our threshold? How do we pick a specific tradeoff between false positives and false negatives?
2. How do we compare two different classification systems, both of which have a whole range of possible tradeoffs?

Fortunately, there is a graphical tool that lets us answer these questions: the *ROC curve*. The long-winded name is the *Receiver Operating Characteristic curve*. Originally it was used to quantify radar tracking of bombers headed toward England during World War II. Operators needed to determine whether a blip on the radar screen was a real threat (a bomber) or not (a ghosted echo of a plane or a bird): to tell true positives from false positives.

ROC curves are normally drawn in terms of *sensitivity*, also called the *true positive rate* (TPR), and $1 - \text{specificity}$, which is the *false positive rate* (FPR). These both measure performance *with respect to* the breakdown in the real world. They care how we do based on what is out there in reality. We want a *high* TPR: 1.0 is perfect. We want a *low* FPR: 0.0 is great. We can game the system and guarantee a high TPR by making the prediction bar so low that we say *everyone* is positive, but that sends the FPR up to one. If we say no one is sick, we get a great FPR of zero. There are no false claims of sickness, but our TPR is also zero, while we wanted that value to be near 1.0.
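The tradeoff between TPR and FPR can be sketched with the ten scores from the disease example. The ground-truth labels in `is_sick` are hypothetical, invented only so the rates can be computed, and the helper `rates_at` is an illustrative name rather than a library function.

```python
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

# Hypothetical ground truth (1 = sick); an assumption for illustration only.
is_sick = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]


def rates_at(threshold, scores, labels):
    """Return (TPR, FPR) when scores >= threshold are called positive."""
    preds = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg


# Sweep the prediction bar from "everyone is sick" to "no one is sick".
for threshold in (0.0, 0.3, 0.7, 1.0):
    tpr, fpr = rates_at(threshold, scores, is_sick)
    print(f"threshold={threshold:.1f}  TPR={tpr:.3f}  FPR={fpr:.3f}")
```

At a threshold of 0.0 both rates are 1.0, and at 1.0 both are 0.0; the intermediate thresholds trace out the points in between, which is exactly what an ROC curve plots.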

### Patterns in the ROC 

---

### Binary ROC

How do we make ROC work? There's a single call, `roc_curve` , in the `sklearn.metrics` module that does the heavy lifting after a couple of setup steps. First, let's convert the iris problem into a *binary* classification task to simplify the interpretation of the results. The binary question asks, "Is it *Versicolor*?" The answer is yes or no. Also, we need to invoke the classification scoring mechanism of our classifier so we can tell who is on which side of our prediction bar. Instead of outputting a class like *Versicolor*, we need to know a probability, such as a .7 likelihood of *Versicolor*. We do this by using `predict_proba` instead of the typical `predict` method. `predict_proba` returns probabilities for *False* and *True* in two columns. We are interested in the probability from the *True* column for building the ROC curve.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
is_versicolor = iris.target == 1

tts_1c = train_test_split(iris.data, is_versicolor,
                          test_size=.33, random_state=21)

(iris_1c_train_ftrs, iris_1c_test_ftrs,
 iris_1c_train_tgt, iris_1c_test_tgt) = tts_1c

# build, fit, predict (probability scores) for NB model
gnb = GaussianNB()  # imported above from sklearn.naive_bayes
prob_true = (gnb.fit(iris_1c_train_ftrs, iris_1c_train_tgt)
             .predict_proba(iris_1c_test_ftrs)[:, 1])  # [:, 1]=="True"

With the setup done, we can do the calculations for the ROC curve and display it.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresh = roc_curve(iris_1c_test_tgt, prob_true)
# use a separate name so the imported `auc` function is not shadowed
roc_auc = auc(fpr, tpr)
print(f"FPR : {fpr}")
print(f"TPR : {tpr}")

# create the main graph
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(fpr, tpr, 'o--')
ax.set_title(f"1-Class Iris ROC Curve\nAUC:{roc_auc:.3f}")
ax.set_xlabel("FPR")
ax.set_ylabel("TPR")

# do a bit of work to label some points with their respective thresholds
investigate = np.array([1, 3, 5])
for idx in investigate:
    th, f, t = thresh[idx], fpr[idx], tpr[idx]
    ax.annotate(f'thresh = {th:.3f}', xy=(f+.01, t-.01),
                xytext=(f+.1, t), arrowprops={'arrowstyle': '->'})

Most of the FPR values are between 0.0 and 0.2, while the TPR values quickly jump into the range of 0.9 to 1.0. Let's take a look at the calculation of those values. Each point represents a different confusion matrix based on its own unique threshold. The following shows the confusion matrices for the second, fourth, and sixth thresholds labeled in the last graph. Due to zero-based indexing, these occur at indices 1, 3, and 5, which were assigned to the variable `investigate` in the previous cell. We could have picked any of the eight thresholds that `sklearn` found. Let's look at these values.

title_fmt = "Threshold {}\n~{:5.3f}\nTPR : {:.3f}\nFPR : {:.3f}"

pn = ['Positive', 'Negative']
add_args = {'xticklabels': pn,
            'yticklabels': pn,
            'square': True}
fig, axes = plt.subplots(1, 3, sharey=True, figsize=(12, 4))
for ax, thresh_idx in zip(axes.flat, investigate):
    preds_at_th = prob_true < thresh[thresh_idx]
    cm = sklearn.metrics.confusion_matrix(1-iris_1c_test_tgt, preds_at_th)
    sns.heatmap(cm, annot=True, cbar=False, ax=ax, **add_args)

    ax.set_xlabel('Predicted')
    ax.set_title(title_fmt.format(thresh_idx,
                                  thresh[thresh_idx],
                                  tpr[thresh_idx],
                                  fpr[thresh_idx]))

axes[0].set_ylabel('Actual')

Let's say we want to use the confusion matrix to determine how well a classifier identifies sick people.

### AUC: Area-Under-the-(ROC)-Curve

---

How can we summarize an ROC curve as a single value? We answer by calculating the *area under the curve* (AUC) that we've just drawn. The AUC is an *overall* measure of classifier performance at a series of thresholds. The benefit of single-value summaries is that we can easily compute other statistics on them and summarize them graphically. Let's look at several cross-validated AUCs displayed simultaneously on a strip plot.

fig, ax = plt.subplots(1, 1, figsize=(3, 3))
model = sklearn.neighbors.KNeighborsClassifier(3)
cv_auc = sklearn.model_selection.cross_val_score(
    model, iris.data, iris.target == 1, scoring='roc_auc', cv=10)
ax = sns.swarmplot(cv_auc, orient='v')
ax.set_title('10-Fold AUCs')

Many folds return perfect results.
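For reference, the area itself can be approximated with the trapezoidal rule, which is the rule `sklearn.metrics.auc` is documented to use. The sketch below is a plain-Python illustration; the function name `trapezoid_auc` and the third set of ROC points are hypothetical.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a piecewise-linear curve; fpr must be sorted ascending."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        avg_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * avg_height
    return area


# A perfect classifier: straight up the left edge, then across the top.
perfect = trapezoid_auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])

# The chance diagonal.
chance = trapezoid_auc([0.0, 1.0], [0.0, 1.0])

# A hypothetical in-between curve.
middling = trapezoid_auc([0.0, 1 / 6, 0.5, 1.0], [0.0, 0.5, 1.0, 1.0])

print(f"perfect={perfect:.3f} chance={chance:.3f} middling={middling:.3f}")
```

A fold that reports an AUC of 1.0 therefore corresponds to a curve that hugs the top-left corner, as in the first case.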

## Multiclass Learners, One-versus-Rest, and ROC

---

`sklearn.metrics.roc_curve` is ill-equipped to deal with multiclass problems. We can work around this by recoding our tri-class problem into a series of me-versus-the-world or one-versus-rest (OvR) alternatives. OvR means we compare each of the following binary problems: 0 versus [1, 2], 1 versus [0, 2], and 2 versus [0, 1]. The difference from the earlier *Versicolor* example is that we do it for all three possibilities. The basic tool to encode these comparisons into our data is `label_binarize`. Let's look at examples 0, 50, and 100 from the original multiclass data.

Examples 0, 50, and 100 correspond to classes 0, 1, and 2. When we binarize, the classes become:

checkout = [0, 50, 100]  # one example from each class
print("Original Encoding")
print(iris.target[checkout])

print("'Multi-label' Encoding")
print(sklearn.preprocessing.label_binarize(
    y=iris.target, classes=[0, 1, 2])[checkout])

import pandas as pd

d = {'0': ['1', '0', '0'],
     '1': ['0', '1', '0'],
     '2': ['0', '0', '1']}

df = pd.DataFrame(data=d, index=['0', '1', '2'])
df

Let's look at another example.

from sklearn.preprocessing import label_binarize
label_binarize(y=[1, 6], classes=[1, 2, 4, 6])

import pandas as pd

d = {'1': ['1', '0'],
     '2': ['0', '0'],
     '4': ['0', '0'],
     '6': ['0', '1']}

df = pd.DataFrame(data=d, index=['1', '6'])
df

The class ordering is preserved:

label_binarize([1, 6], classes=[1, 6, 4, 2])

Binary targets transform to a column vector:

sklearn.preprocessing.label_binarize(
    ['yes', 'no', 'no', 'yes'], classes=['no', 'yes'])
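The multiclass encoding in the examples above can be sketched by hand. This is a simplified illustration, not the `sklearn` implementation: the helper `binarize_labels` is a hypothetical name, it returns nested lists rather than an array, and it does not reproduce the special case where a two-class target collapses to a single column.

```python
def binarize_labels(y, classes):
    """One row per label, one 0/1 column per class, in `classes` order."""
    return [[1 if label == cls else 0 for cls in classes] for label in y]


# One example from each of the three iris classes.
print(binarize_labels([0, 1, 2], classes=[0, 1, 2]))
# [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Classes that never occur in y still get a column.
print(binarize_labels([1, 6], classes=[1, 2, 4, 6]))
# [[1, 0, 0, 0], [0, 0, 0, 1]]

# Column order follows the order given in `classes`.
print(binarize_labels([1, 6], classes=[1, 6, 4, 2]))
# [[1, 0, 0, 0], [0, 1, 0, 0]]
```

Each column is a separate yes/no target, which is what lets a one-versus-rest wrapper train one binary classifier per column.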

These encodings are columns of Boolean flags--yes/no for "Is it class _x_?". The first column answers, "Is it class 0?" and the answers are yes, no, and no. Now, we add a layer of complexity to our classifier. Instead of a *single* classifier, we are going to make one classifier for each target class that was just added, the three new target columns. These become (1) a classifier for class 0 versus the rest, (2) a classifier for class 1 versus the rest, and (3) a classifier for class 2 versus the rest. Then, we can look at the individual performance of the three classifiers.

iris_multi_tgt = sklearn.preprocessing.label_binarize(
    y=iris.target, classes=[0, 1, 2])

# im --> "iris multi"

(im_train_ftrs, im_test_ftrs,
 im_train_tgt, im_test_tgt) = sklearn.model_selection.train_test_split(
    iris.data, iris_multi_tgt, test_size=.33, random_state=21)

# knn wrapped up in one-versus-rest (3 classifiers)
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)
ovr_knn = sklearn.multiclass.OneVsRestClassifier(knn)
pred_probs = (ovr_knn.fit(im_train_ftrs, im_train_tgt)
              .predict_proba(im_test_ftrs))

# make ROC plots
fig, ax = plt.subplots(figsize=(8, 4))
for cls in [0, 1, 2]:
    fpr, tpr, _ = sklearn.metrics.roc_curve(im_test_tgt[:, cls],
                                            pred_probs[:, cls])

    label = f'Class {cls} vs Rest (AUC = {sklearn.metrics.auc(fpr,tpr):.2f})'
    ax.plot(fpr, tpr, 'o--', label=label)

ax.legend()
ax.set_xlabel("FPR")
ax.set_ylabel("TPR")

## Another Take on Multiclass: One-versus-One

There is another take on dealing with the sometimes negative interaction between multiclass problems and learning systems. In one-versus-rest, we chunk off apples against all other fruit in one grand binary problem. For apples, we create *one* one-versus-rest classifier. Another way to do this is to chunk off apple-versus-banana, apple-versus-orange, and so on. Then, instead of one grand Boolean comparison for apples, we make $n - 1$ comparisons, where $n$ is the number of classes we have. This alternative is called *one-versus-one*. How do we wrap the one-versus-one winners into a grand winner for making a single prediction? We can take the sums of the individual wins, and the class with the biggest number of wins is the class we predict. The one-versus-one wrapper gives us *classification scores* for each individual class. The values are *not* probabilities. We take the index of the maximum classification score to find the single best-predicted class.
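The vote-counting idea can be sketched in a few lines of plain Python. The pairwise winners below are hypothetical outcomes for a single example, made up for illustration, and the sketch counts bare votes only; the scores returned by `sklearn`'s one-versus-one wrapper may include additional tie-breaking adjustments.

```python
from itertools import combinations

classes = ["apple", "banana", "orange"]

# Every unordered pair of classes gets its own binary contest.
pairs = list(combinations(classes, 2))

# Hypothetical winner of each pairwise classifier for one example.
pairwise_winner = {
    ("apple", "banana"): "apple",
    ("apple", "orange"): "orange",
    ("banana", "orange"): "orange",
}

# Tally one vote per contest for whichever class won it.
votes = {cls: 0 for cls in classes}
for pair in pairs:
    votes[pairwise_winner[pair]] += 1

# The class with the most wins is the overall prediction.
predicted = max(votes, key=votes.get)
print(votes)      # {'apple': 1, 'banana': 0, 'orange': 2}
print(predicted)  # orange
```

Each class takes part in $n - 1$ contests, so with three classes there are three pairwise classifiers in total.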

knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)
ovo_knn = sklearn.multiclass.OneVsOneClassifier(knn)
pred_scores = (ovo_knn.fit(iris_train_ftrs, iris_train_tgt)
                      .decision_function(iris_test_ftrs))
df = pd.DataFrame(pred_scores)
df['class'] = df.values.argmax(axis=1)
display(df.head())

Let's put the actual classes beside the one-versus-one classification scores:

# note: ugly way to make column headers
mi = pd.MultiIndex([['Class Indicator', 'Vote'], [0, 1, 2]],
                   [[0]*3+[1]*3, list(range(3)) * 2])
df = pd.DataFrame(np.c_[im_test_tgt, pred_scores], columns=mi)
display(df.head())

# print the MultiIndex docstring starting at its "Parameters" section
multiindex_doc = pd.MultiIndex.__doc__
parameters_index = multiindex_doc.index('Parameters')
print(multiindex_doc[parameters_index:])

### DataFrames

There are multiple ways to construct a DataFrame. For instance, we can build one from a dictionary whose keys become the column names.

import pandas as pd

d = {'Age': ['25', '26'],
     'Name': ['Scott', 'Linda']}

df = pd.DataFrame(data=d)
df

import pandas as pd

d = {'Name': ['Scott', 'Linda'],
     'Age': ['25', '26']}

df = pd.DataFrame(data=d, index=['Subject1', 'Subject2'])
df.index.name = 'Person'
df

import pandas as pd

data = {'Medical Claim': ['Easy to call sick', 'Hard to call sick'],
        'Prediction': ['Easy to predict True', 'Hard to predict False']}

df = pd.DataFrame(data=data, index=['Low', 'High'])
df.index.name = 'Bar'
df

Examples_index = '\n'.join(pd.DataFrame.__doc__.splitlines()).index('Examples')
print('\n'.join(pd.DataFrame.__doc__.splitlines())[Examples_index:])

 

Each row in a confusion matrix represents an *actual class*, while each column represents a *predicted class*.
In the above matrix, the number in the first row and first column indicates that the classifier correctly identified 53,124 images as non-5s. These are known as *true negatives*. The remaining 1,455 images in the first row and second column are *false positives*: images the classifier incorrectly categorized as 5s when they were not.

Next, in the second row, the number in the first column (949) counts the images that the classifier incorrectly identified as non-5s. In other words, the images were 5s, but the classifier said they were not. These are known as *false negatives*. Finally, the remaining 4,472 images in the bottom right corner are *true positives*, the 5s that the classifier identified correctly. A perfect classifier would have only true positives and true negatives.
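As a quick sanity check on the four cells described above, the sketch below lays the quoted counts out in the same row/column arrangement and confirms how they add up. The variable names are illustrative; the counts are the ones quoted in the text.

```python
# rows are actual classes, columns are predicted classes
#      predicted non-5, predicted 5
cm = [[53124, 1455],   # actual non-5
      [949, 4472]]     # actual 5

(tn, fp), (fn, tp) = cm

total_images = tn + fp + fn + tp
actual_fives = fn + tp       # second row: every image that really is a 5
predicted_fives = fp + tp    # second column: every image called a 5

print(f"total images: {total_images}")
print(f"actual 5s:    {actual_fives}")
print(f"predicted 5s: {predicted_fives}")
```

Summing a row gives the number of images that truly belong to that class, while summing a column gives the number of images the classifier assigned to it.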

Using a confusion matrix, we can compute additional metrics, one of which is the *precision* or the accuracy of the positive predictions. Let's take a look at a few more simple examples.

from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
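To make the row/column convention concrete, here is a minimal hand-rolled version of the counting behind the two examples above. The helper name `count_confusion` is hypothetical and only meant for illustration; `sklearn.metrics.confusion_matrix` remains the tool to use in practice.

```python
def count_confusion(y_true, y_pred, labels):
    """Count (actual, predicted) pairs; rows are actual, columns predicted."""
    position = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[position[actual]][position[predicted]] += 1
    return matrix


# integer labels
cm_int = count_confusion([2, 0, 2, 2, 0, 1],
                         [0, 0, 2, 2, 0, 2],
                         labels=[0, 1, 2])

# string labels; the `labels` list fixes the row and column order
cm_str = count_confusion(["cat", "ant", "cat", "cat", "ant", "bird"],
                         ["ant", "ant", "cat", "cat", "ant", "cat"],
                         labels=["ant", "bird", "cat"])

for row in cm_int:
    print(row)
```

Both calls yield the same counts because the string example is the integer example with `0`, `1`, `2` renamed to `ant`, `bird`, `cat`.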

In the case of a binary classifier, we can extract the true positives, false positives, true negatives, and false negatives.

Not all of these metrics are designed for classifiers. How can we identify the scorer used for a particular classifier, say, *k*-NN? You can see the whole output with `help(knn.score)`. However, let's trim it down a bit.

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

knn = KNeighborsClassifier()

# help(knn.score) # verbose, but complete

# print(knn.score.__doc__.splitlines()[0])
# print('\n--and--\n')
# print the np.ravel docstring starting at its "Examples" section
ravel_doc = np.ravel.__doc__
examples_index = ravel_doc.index('Examples')
print(ravel_doc[examples_index:])

tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
(tn, fp, fn, tp)
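The unpacking order `tn, fp, fn, tp` follows from flattening the 2×2 matrix row by row, which is what `ravel()` does by default. Below is a minimal pure-Python sketch of that flattening for the same toy labels, offered as an illustration and not as a replacement for the NumPy call.

```python
y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]

# build the 2x2 matrix by hand: rows are actual, columns are predicted
cm = [[0, 0], [0, 0]]
for actual, predicted in zip(y_true, y_pred):
    cm[actual][predicted] += 1

# flatten row by row, mimicking the default (row-major) order of ravel()
flat = [count for row in cm for count in row]
tn, fp, fn, tp = flat
print(cm, "->", (tn, fp, fn, tp))
```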

Using the 5 and non-5 classifier, we get the following results.

tn, fp, fn, tp = confusion_matrix(y_train_5, y_train_pred).ravel()
(tn, fp, fn, tp)

## `ravel()`

---

Using the 5 and non-5 classifier, we get the following results.

confusion_matrix(y_train_5, y_train_pred)

We can use $\text{TN}$, $\text{FP}$, $\text{FN}$, and $\text{TP}$ to calculate additional metrics. For instance, **precision** represents the accuracy of the positive predictions and is calculated via the following equation:

$$ \text{precision} = \frac{TP}{TP + FP} $$

Let's calculate the precision.

precision = tp / (tp + fp)
print(f'Precision: {precision:.2f}')
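As a worked check, plugging in the counts quoted earlier for the 5 versus non-5 classifier (4,472 true positives and 1,455 false positives) gives the value below. This assumes `tp` and `fp` hold those same counts.

$$ \text{precision} = \frac{4472}{4472 + 1455} = \frac{4472}{5927} \approx 0.75 $$

In other words, roughly three out of every four images the classifier labels as a 5 really are 5s.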

 `numpy.ravel()`
