# Evaluating Classification Models

In this chapter we discuss various ways of evaluating the performance of classification models.

##  Hard vs. Soft Classification

Classification models are concerned with predicting a **categorical response**.  The most common case is a **binary response**, which can only take one of two values.  In this chapter we will assume that these two values are encoded numerically as zero and one.  We will call observations with response one **positive classes**, and those with response zero **negative classes**.

Classification models can generally be divided into two groups: those that attempt to estimate probabilistic information, and those that only attempt to estimate class membership.

### Soft Classification

A soft classification model attempts to estimate the probability that an observation belongs to a class.  In the binary case, this means that a soft classification model attempts to estimate

$$ P(y = 1 \mid X) $$

Note that this is a conditional probability: if we change the data X, then our estimate of the probability changes.  Said differently, a soft classification model produces a function of the data, and we interpret the values of this function as the probabilities of class membership.

The basic example of a soft classification model is **logistic regression**

$$ P(y = 1 \mid X) = \frac{1}{1 + e^{ -(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k) } } $$

As the values of the features change, our estimate of the class membership probability changes.
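As a minimal sketch of the formula above, the snippet below evaluates the logistic function for a few observations. The coefficients `beta_0` and `beta`, the data `X`, and the helper name `predict_proba` are all made up for illustration; they do not come from a fitted model.

```python
import numpy as np

def predict_proba(X, beta_0, beta):
    """Estimate P(y = 1 | X) using the logistic regression formula."""
    linear = beta_0 + X @ beta            # beta_0 + beta_1 x_1 + ... + beta_k x_k
    return 1.0 / (1.0 + np.exp(-linear))

# Hypothetical coefficients and feature values, chosen only for illustration.
X = np.array([[0.0, 0.0],
              [1.0, 2.0],
              [-3.0, 1.0]])
beta_0, beta = 0.0, np.array([1.0, 0.5])

probs = predict_proba(X, beta_0, beta)    # one probability per row of X
```

Each row of `X` gets its own probability, strictly between zero and one.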

### Hard Classification

In contrast, a **hard classification** model attempts only to estimate class membership.  That is, its predictions are **not** probabilities; they are exact zeros or ones.

Below, we compare a soft and a hard classifier on a simple data set.  The image on the left shows a soft classifier; the color of the dots encodes the predicted probability of class membership, which varies smoothly with increasing distance from the dashed line.  The image on the right shows a hard classifier; the dots come in exactly two colors, encoding only which *side* of the line we are on.

![Soft vs. Hard Classifier](images/hard-vs-soft-classification.png)

You may be in the habit of thinking of **all** classification models as hard classifiers, and it is very important to realize that this is not the case.  

There **are** some models that are strictly designed to be hard classifiers, but in many respects soft classification is a better strategy.  For this reason, much effort goes into researching how hard classification algorithms can be recast as soft classifiers (AdaBoost and Neural Networks both have this history).  

The most popular model that is **not** usually used as a soft classifier is the Support Vector Machine.  It can be very difficult to derive probabilistic predictions from these models due to the slippery nature of kernels.

### The Relationship Between Soft and Hard Classification

There are good reasons statisticians attempt to find ways to interpret their classification models as soft classifiers.  One is that a soft classifier can always be turned into a hard classifier if hard classification is the ultimate goal of an analysis.

The process of taking a (binary) soft classification model and converting it into a hard classification model is called **thresholding**.  We pick a **cutoff** or **threshold** and classify all data points whose predicted probability is at or above the threshold as positive classes, and all whose predicted probability is below it as negative classes.

$$ \text{If } P(y = 1 \mid X) < T \text{ then classify } \hat y = 0 $$

$$ \text{If } P(y = 1 \mid X) \geq T \text{ then classify } \hat y =1 $$

In the plots below we take a single soft classification model, and show the results of varying the threshold to produce different hard classifiers.

![Changing Classification Threshold](images/varying-thresholds.png)

As shown above, a single soft classification model can be used to produce multiple hard classifiers, simply by varying where we threshold the predicted probabilities.
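The thresholding rule can be sketched in a couple of lines. The probabilities below are invented for illustration, and `threshold` is a hypothetical helper name.

```python
import numpy as np

def threshold(probs, T):
    """Convert predicted probabilities into hard 0/1 predictions."""
    return (np.asarray(probs) >= T).astype(int)

# Hypothetical predicted probabilities from one soft classifier.
probs = np.array([0.10, 0.40, 0.50, 0.80])

y_hat_low = threshold(probs, 0.3)    # array([0, 1, 1, 1])
y_hat_mid = threshold(probs, 0.5)    # array([0, 0, 1, 1])
y_hat_high = threshold(probs, 0.9)   # array([0, 0, 0, 0])
```

The same vector of probabilities yields three different hard classifiers, one per threshold.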

## Evaluating Classification Models

In this section, we discuss ways to measure and compare the performance of soft classification models. There are two basic approaches to this. In the first approach we evaluate the probabilities directly. In the second we only care about the ordering of the probabilities, that is, whether something predicted as more probable than something else actually is.

First we'll discuss evaluating probability predictions of soft classifiers. We'll then discuss evaluating hard classifiers, and finally talk about using the ROC curve to evaluate soft classifiers. 

### Evaluating Probability Predictions

Soft classification models are best evaluated and compared on the basis of their predicted probabilities (*not* by converting them to hard classifiers and evaluating the result).

The gold standard metric for evaluating soft classification models is the log-loss (aka Bernoulli likelihood, aka logistic loss, aka cross entropy):

$$ \text{log-loss}(y, p) = - \sum_i \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] $$

In the picture below, we plot the log loss for varying values of p.  Since y can only take on two values, we simply draw one curve for y=0, and another for y=1.

![Log Loss Graph](images/log-loss.png)

Focusing on the y = 0 curve, we see that the log-loss decreases to zero as p approaches zero, and the log-loss approaches infinity as p approaches 1.  In this way, the log loss is measuring how well the probabilities p *reflect* the data.

The formula for the log-loss may look strange, but it has a simple motivation.  Imagine that our y data is generated by sampling from independent Bernoulli distributions whose probability parameter is the predicted probabilities p

$$ y_i \sim \text{Bernoulli}(p = p_i) $$

The negative log-likelihood of the data under this model **is** the log-loss, so the log-loss measures how surprising the given data is if it was all independently sampled according to the predicted probabilities.  So, if we are comparing two soft classification models on a test dataset, and the log-loss of the first is smaller than that of the second, then the test data is more likely to have been sampled under the predicted probabilities of the first model.
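A minimal sketch of this comparison follows. The labels and the two sets of predicted probabilities are invented, and the `eps` clipping is an added safeguard against taking `log(0)`, not part of the definition.

```python
import numpy as np

def log_loss(y, p, eps=1e-15):
    """Summed log-loss; the eps clipping is an added guard against log(0)."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 0])
p_model_a = np.array([0.9, 0.1, 0.8, 0.2])   # hypothetical model A
p_model_b = np.array([0.6, 0.4, 0.5, 0.5])   # hypothetical model B

loss_a = log_loss(y, p_model_a)
loss_b = log_loss(y, p_model_b)

# The likelihood of the observed labels under each model is exp(-loss).
likelihood_a = np.exp(-loss_a)   # 0.9 * 0.9 * 0.8 * 0.8
likelihood_b = np.exp(-loss_b)   # 0.6 * 0.6 * 0.5 * 0.5
```

Model A has the smaller log-loss, which is the same as saying the observed labels are more likely under its probabilities.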

The log-loss is an example of a **proper scoring rule**.  Proper scoring rules are evaluation metrics that are in a certain sense "optimal" (though the exact way in which they are optimal is a bit too technical to get into here).  Another example of a proper scoring rule is the **Brier score**. The corresponding loss function (the negative of the score) is

$$ \text{brier}(y, p) = \sum_i (y_i - p_i)^2 $$

![Brier Score Graph](images/brier-score.png)

The Brier score is much like the log-loss, with the major difference that the contribution of any single data point to the Brier score cannot be larger than one.  Since the log-loss can get arbitrarily large (for example, when the predicted probability is close to one but y = 0), it is quite intolerant of data points whose predicted probabilities are *very* wrong.  The Brier score, on the other hand, will tolerate a few very poorly predicted points if it can make up for them with better predictions on other points.
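To make the contrast concrete, the sketch below evaluates both losses for a single observation with y = 0 as the predicted probability gets worse. The probabilities are chosen arbitrarily for illustration.

```python
import numpy as np

y = 0                                          # true label
p = np.array([0.5, 0.9, 0.99, 0.999999])       # increasingly wrong predictions

brier_terms = (y - p) ** 2                     # bounded above by 1
log_loss_terms = -np.log(1 - p)                # grows without bound as p -> 1
```

The Brier contribution levels off just below one, while the log-loss contribution keeps growing as the prediction approaches certainty.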

It is best practice to always use a proper scoring rule when evaluating soft classification models.  The log-loss is by far the most common choice, due to its attractive interpretation as a log-likelihood.

### Evaluating Hard Classification Models

When evaluating hard classifications, there are only four possibilities for each data point:

  - **True Negative**: Actual 0, Predicted 0.
  - **False Positive**: Actual 0, Predicted 1.
  - **False Negative**: Actual 1, Predicted 0.
  - **True Positive**: Actual 1, Predicted 1.
  
These four possibilities can be arranged into a grid, which makes them easy to remember and keep track of

|                  | Predicted 1    | Predicted 0    |
| ---------------- |:--------------:| --------------:|
| **Actual 1**     | True Positive  | False Negative |
| **Actual 0**     | False Positive | True Negative  |

Note that the qualifier true/false modifies the *prediction*, so "false positive" means "we made a positive prediction and we were wrong".

A matrix that records the various outcomes of a hard classifier is called a **confusion** matrix.
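A small sketch of how the four counts could be tallied from arrays of actual and predicted labels is given below. The data and the helper name `confusion_counts` are hypothetical.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for 0/1 labels and 0/1 predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fn, fp, tn

# Hypothetical actual labels and hard predictions.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

tp, fn, fp, tn = confusion_counts(y_true, y_pred)

# Laid out like the grid above: rows are Actual 1 / Actual 0,
# columns are Predicted 1 / Predicted 0.
confusion = np.array([[tp, fn],
                      [fp, tn]])
```

Note that other tools may order the rows and columns differently, so it is worth checking the layout before reading counts off a confusion matrix.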

There are many hard classification evaluation metrics that can be derived from this table.

#### Accuracy

The simplest and most natural thing to do is compute the proportion of predictions that we got correct; this is called the **Accuracy** of our classifier.

$$ \text{Accuracy} = \frac{ \text{# True Positives} + \text{# True Negatives} } { \text{Total # of Data Points } } $$

Of course, the accuracy is also one minus the proportion of predictions we got wrong

$$ \text{Accuracy} =  1 - \frac{ \text{# False Positives} + \text{# False Negatives} } { \text{Total # of Data Points } } $$

The accuracy implicitly assumes that false positives and false negatives should be treated *equally*, i.e. are equally severe errors.  This is often not the case; we will later study principled ways to build these costs into our decision procedures.

#### False Positive and True Positive Rate

While the accuracy measures how our predictions perform across our whole dataset, the **false positive rate** and **true positive rate** focus on the negative and positive classes individually.

The **false positive rate** is the proportion of negative classes that we incorrectly classify (incorrectly classified negative classes are of course, false positives, hence the name).

$$ \text{FP Rate} = \frac{ \text{# False Positives} } { \text{# False Positives} + \text{# True Negatives} } $$

One minus the false positive rate is the **true negative rate**, which is also called the **specificity**.

$$ \text{TN Rate} = 1 - \frac{ \text{# False Positives} } { \text{# False Positives} + \text{# True Negatives} } = 
   \frac{ \text{# True Negatives} } { \text{# False Positives} + \text{# True Negatives} } $$

The **true positive rate** or **sensitivity** or **recall** is the proportion of positive classes that are correctly identified.

$$ \text{TP Rate} = \frac{ \text{# True Positives} } { \text{# False Negatives} + \text{# True Positives} } $$
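The formulas so far can be checked with a few lines of arithmetic. The four counts below are invented for illustration.

```python
# Hypothetical counts from a confusion matrix.
tp, fn, fp, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + fn + fp + tn)   # 85 / 100
fp_rate = fp / (fp + tn)                     # 5 / 50
tn_rate = tn / (fp + tn)                     # 45 / 50, the specificity
tp_rate = tp / (fn + tp)                     # 40 / 50, the sensitivity or recall
```

The false positive and true negative rates share a denominator (the number of actual negatives), which is why they sum to one.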

Clearly, there are more ratios that we can form by combining the counts of each of the four prediction/actual possibilities, and many go by multiple names.

![confusion matrix](Confusion_Matrix.png)

Each of the measures above is the ratio of the count of one of the four outcomes (TP, TN, FP, FN) to the sum of that count and one of the others. The numerator is given by the location of the label; the denominator by the locations spanned by the arrow. So Precision (also known as Positive Predictive Value, PPV) is the number of True Positives divided by the number of True Positives plus False Positives. Terms used together are colored the same.

### Hard Classification Metrics for Soft Classification Models

We've discussed some metrics that can be used to evaluate both hard and soft classifiers. Since a soft classifier can be made into a hard classifier by thresholding the predicted probabilities at a fixed level, any metric designed for hard classification models can, after thresholding, be applied to a soft classification model.

It will be interesting to see how each of these metrics behaves as we vary the threshold. In the following pictures, we plot some of these measures for a logistic regression on a testing data set.

#### Accuracy

![Accuracy at Different Thresholds](images/accuracy-at-thresholds.png)

Above we see how the accuracy for our regression varies as we change our hard classification threshold.  

For example, when the threshold is zero, we predict every test data point as a member of the positive class.  Therefore, every positive class is predicted correctly, and every negative class is predicted incorrectly, so the accuracy in this case is just the proportion of positive classes in the data set.

```
print("Accuracy at T = 0.0: {:2.2f}".format(
    np.sum(y_test) / len(y_test)))

Accuracy at T = 0.0: 0.89
```

When the threshold is one, exactly the opposite is true: all and only the negative classes are predicted correctly, so the accuracy is the complement of the accuracy when the threshold is zero

```
print("Accuracy at T = 1.0: {:2.2f}".format(
    1 - np.sum(y_test) / len(y_test)))

Accuracy at T = 1.0: 0.11
```

It is a common false belief that a threshold of 0.5 is in some sense natural or optimal, but our accuracy graph shows clearly that this is not the case.  Instead, the accuracy for this model is maximized at approximately 0.58, which is **not** the class ratio in the training data for the model (another common false belief).

```
print("Class ratio in training data: {:2.2f}".format(np.sum(y_train)/ len(y_train)))

Class ratio in training data: 0.90
```

Finally, it is another false belief that many machine learning models have trouble on **imbalanced data**.  This is generally **not** true for soft classification models: they naturally take into account the class balance in their training data, since their job is only to estimate *probabilities* of class membership.  (If there is a very small **absolute** number of positive classes, the model will of course have a hard time finding patterns in the training data, so there is a general difficulty in learning about **extremely rare** events.)  This is another point in favor of using soft classification models as a core tool in machine learning.

On the other hand, if one is using accuracy as a final measure of classifier fit, then accuracy can be a misleading metric, especially on highly imbalanced data.  We will have more to say on this point in the afternoon.
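One way to see why accuracy can mislead on imbalanced data is to score a classifier that ignores its input entirely. The labels below are a made-up test set with 5% positive classes.

```python
import numpy as np

# Hypothetical test set: 95 negative observations and 5 positive ones.
y_true = np.array([0] * 95 + [1] * 5)

# A "classifier" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
tp_rate = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
```

The accuracy is 0.95 even though the true positive rate is zero: not a single positive class is found.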

#### False Positive and True Positive Rates

We can do the same kind of thing for other metrics, like the false positive and true positive rates.

![False Positive Rates at Thresholds](images/fp-rate-at-thresholds.png)

Here, we see that the false positive rate decreases as we increase the threshold.  This makes intuitive sense: as we increase the threshold we become more strict about which observations we classify as positive, meaning fewer negative observations are mistakenly classified as positive.  At the extremes, a threshold of zero classifies everything as positive, so all our negative classes are falsely classified; at a threshold of one we classify nothing as positive, so all the negative classes are correctly classified.

![True Positive Rates at Thresholds](images/tp-rate-at-thresholds.png)

If you've understood everything up to this point, it should be easy for you to interpret the true positive rate curve above.
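Curves like these can be traced by sweeping the threshold and recomputing the rates each time. The sketch below uses a small invented data set and a hypothetical helper, `rates_at_threshold`; it is not the code that produced the figures.

```python
import numpy as np

def rates_at_threshold(y_true, probs, T):
    """Return (FP rate, TP rate) after thresholding probs at T."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(probs) >= T).astype(int)
    fp_rate = np.sum((y_pred == 1) & (y_true == 0)) / np.sum(y_true == 0)
    tp_rate = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
    return fp_rate, tp_rate

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
probs = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

thresholds = np.linspace(0.0, 1.0, 11)
rates = np.array([rates_at_threshold(y_true, probs, T) for T in thresholds])
fp_rates, tp_rates = rates[:, 0], rates[:, 1]
```

Both rates start at one when the threshold is zero and fall to zero as the threshold rises, never increasing along the way.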

### ROC Curves

Above, we saw that there is a trade-off when setting the threshold of a soft classifier if a hard classification is needed.  If the threshold is set too low, then we will be very liberal in what observations we classify as positive, which drives up the false positive rate of the model.  If the threshold is set too high, then we will be very conservative in what observations we classify as positive, which will drive down the true positive rate.  We suspect that in the middle there is some sweet spot, where we have balanced a trade-off between the false positive and true positive rates.

An **ROC** (receiver operating characteristic, one of the all time great names) **curve** expresses this relationship in one picture.

![ROC Curve for Logistic Model](images/roc.png)

It can be a bit tricky to understand what is going on in this picture at first.

An ROC curve visualizes the trade-offs between false positives and true positives as we vary the threshold on a soft classification model.  Each choice of the threshold results in **one** point on the ROC curve

![ROC Curve with Points Labeled With Thresholds](images/roc-with-points.png)

On a standard ROC curve, the thresholds themselves are not shown; the curve shows what happens to the false positive and true positive rates as the threshold is *dynamically* changed between zero and one.

An ROC curve always runs between (0, 0) and (1, 1).  The (0, 0) point corresponds to a threshold of 1.0, no observations are classified as positive.  The (1, 1) point corresponds to a threshold of 0.0, all observations are classified as positive.
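A minimal sketch of how the points on an ROC curve could be computed is shown below, using invented data. One assumption differs slightly from the text: the first threshold is set to infinity rather than 1.0, so that nothing is classified as positive even if some predicted probability equals exactly one.

```python
import numpy as np

def roc_points(y_true, probs):
    """FP and TP rates at each distinct threshold, running from (0, 0) to (1, 1)."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs)
    # A threshold above every prediction, then each distinct prediction, descending.
    thresholds = np.concatenate(([np.inf], np.unique(probs)[::-1]))
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr = np.array([np.sum((probs >= T) & (y_true == 0)) / n_neg for T in thresholds])
    tpr = np.array([np.sum((probs >= T) & (y_true == 1)) / n_pos for T in thresholds])
    return fpr, tpr, thresholds

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
probs = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_points(y_true, probs)
```

Plotting `tpr` against `fpr` gives the ROC curve; each entry of `thresholds` corresponds to exactly one point on it.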

It is traditional to include the dashed line running along the diagonal between (0, 0) and (1, 1).  This line represents **random class assignment**.  If our model just guessed the predicted class completely at random, then the false positive and true positive rates would always be the same (the chance of guessing "positive" is the same whether the observation is truly positive or truly negative).  Therefore, a model that holds predictive power **above** random guessing (we should hope all our models have at least this power) will have an ROC curve that lies **above** the diagonal line.

When we have two competing classifiers, it's common to study the ROC curves of both models to get a feeling for how the models make trade-offs between the false positive and false negative rates.

![ROC Curves For Two Models, Not Crossing](images/roc-two-models.png)

In this case, the random forest does a better job of distinguishing the positive from negative classes.

In some cases, the ROC curves for the models being compared will cross

![ROC Curves For Two Models, Crossing](images/roc-two-models-crossing.png)

In these cases, the ROC curves give us no statistical reason to prefer one model over another; the choice depends on the costs of false positive and false negative errors in our decision problem.  Again, we will have more to say on this point in the afternoon.

### The AUC

The **area** under the ROC curve is a numeric measure of a model's overall ability to distinguish between positive and negative classes.

![Shaded ROC For Logistic and Random Forest Models](images/roc-shaded-auc.png)

The AUC is a nice measure of model fit, as it averages over **all** thresholds.

#### Interpretation of AUC

The AUC has a very nice [probabilistic interpretation](http://madrury.github.io/jekyll/update/statistics/2017/06/21/auc-proof.html)

> The area under the ROC curve is the probability of our model predictions *ranking* a randomly chosen positive class higher than a random negative class.

To elaborate on this, consider the distribution of our model predictions broken down by the true class

![Predicted Probability by True Class Label](images/predicted-positive-vs-negative-classes.png)

In this plot we have colored the true positive classes blue, and the true negative classes red.  The model's predicted probability is shown on the x-axis (the y-axis is jittered randomly up and down to keep the individual points distinguishable).

We see that the model is doing a good job distinguishing positive from negative classes: the estimated probabilities for the majority of our positive classes cluster on the right side of the plot, and those for the negative classes cluster on the left.

Suppose we select one random blue observation, and one random red observation.  There is a large chance that our model will rank the observation from the positive class higher than that from the negative class (by *rank* we mean that the predicted probability is higher in one case than the other).  The AUC statistic measures exactly this, **the chance that a random pair of (red, blue) points is ranked in the correct order**.
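The ranking interpretation can be checked numerically on a small invented data set: count the fraction of (positive, negative) pairs that are ranked correctly, and compare it with the area under the ROC curve computed by the trapezoid rule. Counting ties as one half is a common convention, assumed here.

```python
import numpy as np

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
probs = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

pos = probs[y_true == 1]   # predictions for the positive classes
neg = probs[y_true == 0]   # predictions for the negative classes

# Fraction of (positive, negative) pairs ranked correctly; ties count one half.
diffs = pos[:, None] - neg[None, :]
auc_pairs = np.mean((diffs > 0) + 0.5 * (diffs == 0))

# Area under the ROC curve by the trapezoid rule.
thresholds = np.concatenate(([np.inf], np.unique(probs)[::-1]))
fpr = np.array([np.mean(neg >= T) for T in thresholds])
tpr = np.array([np.mean(pos >= T) for T in thresholds])
auc_area = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
```

Here 15 of the 16 pairs are ordered correctly, so both calculations give 0.9375.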

### Precision-Recall Curve

Another chart used to describe the trade-off at different thresholds is the **Precision-Recall Curve**. Like the ROC curve it's parameterized by the threshold, but instead plots the precision (the ratio of true positives over predicted positives) against the recall (a.k.a. sensitivity a.k.a. TPR).
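As a rough sketch, the points on a precision-recall curve can be computed the same way as those on an ROC curve, by sweeping the threshold over the distinct predicted probabilities. The data and the helper name `precision_recall_points` are hypothetical.

```python
import numpy as np

def precision_recall_points(y_true, probs):
    """Precision and recall at each distinct predicted probability, descending."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs)
    thresholds = np.unique(probs)[::-1]
    n_pos = np.sum(y_true == 1)
    precision, recall = [], []
    for T in thresholds:
        y_pred = probs >= T
        tp = np.sum(y_pred & (y_true == 1))
        # At least one observation is predicted positive at every threshold used here.
        precision.append(tp / np.sum(y_pred))
        recall.append(tp / n_pos)
    return np.array(precision), np.array(recall), thresholds

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
probs = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

precision, recall, thresholds = precision_recall_points(y_true, probs)
```

Recall can only rise as the threshold is lowered, while precision may move in either direction; at the lowest threshold the precision equals the proportion of positive classes in the data.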

## Discussion: Do You Really Need Hard Classification?

The most persistent mistake in this area is the *overuse of hard classification models*.  We have been hinting at our preference for soft classification throughout this chapter, but let's be explicit: soft classification models are more flexible in practice, and they should generally be preferred.

Nonetheless, many business or scientific problems **do** call for hard classification, especially when some decision must be made (we are going to do this or not, we are going to invest in this or that, etc).  It is important to distinguish in our problem solving process between:

- Estimating the probability of uncertain events.
- Using those probabilities to make hard decisions or develop decision rules.

The best general principle is to

> Convert your probabilities into decisions as late into the process as possible.

Probabilities give you much more flexibility in decision making, and much more knowledge about the uncertainties and trade-offs in a given situation, than hard class assignments.  Consider the case of an insurance company attempting to maximize its profits by managing customer churn.

When a customer's contract comes up for renewal, the company has the option of offering them a new price.  After receiving the new offer, the customer decides to either accept the new price, and purchase a contract renewal, or reject, and find someone else to insure them.

The company, after changing many customers' prices over much time, finds (through fitting a soft classification model) that there is a simple relationship between how much a customer's price has been changed, and the *probability* of them staying; though they can, of course, never say definitely whether the customer will or will not leave.

But, having this probability is sufficient for the company to calculate the expected profit for a customer at a given price

$$ E[\text{profit}] = (\text{price} - \text{cost}) \times P(\text{stay} \mid \text{price}) $$

If the price is too low, the customer will almost definitely stay, but the company will not cover its costs.  If the price is too high, the customer will almost certainly leave, and the company will get nothing.  Somewhere in the middle there is a sweet spot, where the expected profit is maximized.  This allows the company to *optimize* the price for a customer.  This would *not be possible* if we had collapsed the probability into a class assignment; knowing the probability is *essential*.
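The sweet spot can be found by evaluating the expected profit over a grid of candidate prices. In the sketch below, the function `p_stay`, its `midpoint` and `scale` parameters, and the cost are all invented stand-ins for a fitted soft classification model and real business figures.

```python
import numpy as np

def p_stay(price, midpoint=120.0, scale=10.0):
    """Hypothetical soft classifier: the probability of renewing falls as price rises."""
    return 1.0 / (1.0 + np.exp((price - midpoint) / scale))

cost = 80.0                                 # hypothetical cost of serving the customer
prices = np.linspace(60.0, 200.0, 1401)     # candidate prices, in steps of 0.1

# E[profit] = (price - cost) * P(stay | price), evaluated at every candidate price.
expected_profit = (prices - cost) * p_stay(prices)

best_price = prices[np.argmax(expected_profit)]
```

With these made-up numbers the best price lands strictly between the cost and the highest candidate price, which is the sweet spot described above.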

If we think of this as applying to every customer in our business, the total profit would be

$$ E[\text{total profit}] = \sum_i (\text{price}_i - \text{cost}_i) \times P(\text{stay}_i \mid \text{price}_i) $$

If we make the small assumption that customers leave independently, it's now easy to evaluate the overall effect of various pricing strategies.  We can answer questions like "what is the optimal pricing strategy under the constraint that I end up with the same number of customers as before", or "If legislation forces me to increase overall prices by 5%, what is the best way to distribute those prices amongst the customer base", or "if legislation forces me to raise prices by 5% uniformly, what is the 95% worst case scenario in terms of how many customers leave"?
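The "95% worst case" question can be approached by simulation once we assume customers leave independently. The sketch below uses invented staying probabilities for 200 customers and a fixed random seed; it illustrates the idea and is not a calibrated model.

```python
import numpy as np

rng = np.random.default_rng(0)              # fixed seed so the sketch is repeatable

# Hypothetical P(stay) for 200 customers after a uniform price increase.
p_stay = np.linspace(0.70, 0.95, 200)

# Simulate many futures; each customer leaves independently with probability 1 - P(stay).
n_sims = 10_000
leaves = rng.random((n_sims, p_stay.size)) > p_stay
n_leaving = leaves.sum(axis=1)              # number of customers lost in each simulation

expected_leaving = np.sum(1 - p_stay)       # about 35 customers on average
worst_case_95 = np.quantile(n_leaving, 0.95)
```

`worst_case_95` is the number of departures that is exceeded in only about 5% of the simulated futures. None of this would be available from hard class assignments alone.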

Sometimes we have scenarios calling for hard classification.  For example, we have put in place a price change, which will affect our customer base as their contracts come up for renewal.  We have a fixed number of resources at our disposal to call customers and discuss their price changes, so which customers should we call?

If we know the *probability* of a customer leaving, we can simply target the customers with the highest probability of leaving.  That is, we order all our customers by their probability of leaving, and work our way down the list.  This solution easily adapts if we gain or lose resources: we just take more or fewer customers from the top of the list.  On the other hand, if we had insisted upon fitting a hard classification model initially, one that simply told us which customers to call (or which customers will leave), we would not be in this happy situation; we would have to go back and refit our model, and possibly redo lots of analysis.
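Working down the list is a one-line sort. The probabilities below are invented, and `customers_to_call` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical predicted probabilities of leaving, one per customer.
p_leave = np.array([0.05, 0.60, 0.30, 0.90, 0.45, 0.75])

def customers_to_call(p_leave, capacity):
    """Indices of the `capacity` customers most likely to leave."""
    order = np.argsort(-p_leave)     # highest probability first
    return order[:capacity]

call_with_two = customers_to_call(p_leave, 2)     # array([3, 5])
call_with_four = customers_to_call(p_leave, 4)    # array([3, 5, 1, 4])
```

Changing the available resources only changes `capacity`; the model and its predicted probabilities stay exactly as they are.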