In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
%matplotlib inline

In [3]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import random
import patsy

sns.set(style="whitegrid")

In [4]:
import sys
sys.path.append('resources')
import models

# Loss and Evaluation Metrics

If we are interested solely in prediction, what does it mean for a model to be good or bad? We talked about this before in terms of the generalized linear model but the question--and possible answers--is more general. 

For example, we could simply predict $y$ by using some constant value of $y$, $\breve{y}$. What $\breve{y}$ should we use? This depends on how much it costs us to be wrong. What "wrong" means depends on the value we're trying to predict or, more generally, on whether we have a regression problem (predicting a numeric value) or a classification problem (predicting a categorical value).

## Regression

We talked about loss and regression before. If we define loss (or error) as $y - \hat{y}$ and we wish to summarize this loss over some training set, then we have the following "usual suspects":

* **squared loss**

$L(y, \hat{y}) = (y - \hat{y})^2$

* **absolute loss**

$L(y, \hat{y}) = |y - \hat{y}|$

* **0/1 loss**

$L(y, \hat{y}) = 0.0$ if $y = \hat{y}$ else $1.0$

All of these can be turned into a normalized summary statistic by taking the sum and averaging:

$\frac{1}{n} \sum L(y, \hat{y})$
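As a minimal sketch of the three losses and their averaged summaries, the NumPy code below scores two constant predictions, the mean and the median of $y$. The data values and the function names are made up for illustration, and the 0/1 loss uses the convention that a miss costs 1 and an exact match costs 0.

```python
import numpy as np

def squared_loss(y, y_hat):
    """Per-observation squared loss."""
    return (np.asarray(y) - np.asarray(y_hat)) ** 2

def absolute_loss(y, y_hat):
    """Per-observation absolute loss."""
    return np.abs(np.asarray(y) - np.asarray(y_hat))

def zero_one_loss(y, y_hat):
    """Per-observation 0/1 loss: 1.0 for a miss, 0.0 for an exact match."""
    return (np.asarray(y) != np.asarray(y_hat)).astype(float)

def mean_loss(loss, y, y_hat):
    """Average a per-observation loss over the whole data set."""
    return float(np.mean(loss(y, y_hat)))

# Hypothetical data with one large value.
y = np.array([1.0, 2.0, 2.0, 3.0, 10.0])

# Two constant predictions: the mean and the median of y.
constant_mean = np.full_like(y, y.mean())
constant_median = np.full_like(y, np.median(y))

for name, y_hat in [("mean", constant_mean), ("median", constant_median)]:
    print(name,
          mean_loss(squared_loss, y, y_hat),
          mean_loss(absolute_loss, y, y_hat),
          mean_loss(zero_one_loss, y, y_hat))
```

The mean gives the smaller mean squared loss and the median gives the smaller mean absolute loss, which is one way to see that the "best" constant depends on the loss you choose.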

But these aren't the only possibilities. You can use other metrics such as $\sigma$ or $R^2$. Additionally, you can define your own loss function. For example, imagine that you want to penalize over-estimation *more* than under-estimation (here the squared branch is the harsher penalty whenever the error exceeds 1 in magnitude):

$$L(y, \hat{y}) =
\begin{cases}
(y - \hat{y})^2,  & \hat{y} > y \\
|y - \hat{y}|, & \text{otherwise}
\end{cases}
$$

This last approach can be a bit tricky. The algorithm to calculate the coefficients of the typical linear regression finds coefficients that minimize *mean squared error* or MSE. If you have your own metric you wish to minimize, you may need to implement your own algorithm using, for example, stochastic gradient techniques.
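To make this concrete, here is a rough sketch of fitting a one-variable linear model to the asymmetric loss above. It uses `scipy.optimize.minimize` with the derivative-free Nelder-Mead method rather than a stochastic gradient technique, simply because that is the shortest thing to write. The synthetic data, the function names, and the choice of optimizer are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def asymmetric_loss(y, y_hat):
    """Squared loss when over-estimating (y_hat > y), absolute loss otherwise."""
    error = y - y_hat
    return np.where(y_hat > y, error ** 2, np.abs(error))

def mean_asymmetric_loss(beta, x, y):
    """Average asymmetric loss of the line beta[0] + beta[1] * x."""
    y_hat = beta[0] + beta[1] * x
    return float(np.mean(asymmetric_loss(y, y_hat)))

# Synthetic data (an assumption for this sketch): y = 1 + 2x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(0, 3.0, size=200)

# Start from the ordinary least squares line (np.polyfit returns slope first).
slope, intercept = np.polyfit(x, y, deg=1)
beta_ols = np.array([intercept, slope])

# Search for coefficients that lower the custom loss, starting at the OLS fit.
result = minimize(mean_asymmetric_loss, beta_ols, args=(x, y), method="Nelder-Mead")
beta_custom = result.x

print("OLS coefficients:   ", beta_ols, mean_asymmetric_loss(beta_ols, x, y))
print("Custom coefficients:", beta_custom, result.fun)
```

Because the search starts at the least squares solution, the custom fit can only match or improve on it under the asymmetric loss, even though it will generally do worse on MSE.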

Most of the time when talking about regression problems (and not just linear regression) we will be using mean squared error, $\sigma$, or $R^2$ as our evaluation metrics. However, I don't want you to get the impression that these are the only possibilities.

## Classification

Like regression, classification has a baseline for predicting the class (or class *label*). We talked about this Null model as the relative frequency of the most common class. Technically, this would be $p$ if $p$ > 50% or $1-p=q$ if $q$ > 50%. For example, suppose that for a binary problem, the relative frequency of the most common class is 87.3%. If you always guess that class (say, "1") for every observation you see, then you are right 87.3% of the time and wrong 12.7% of the time. This is the Null model for binary classification.

For a multi-class problem, we pick the $p_i$ with the highest value. Although we called this the Null model in the previous chapters, in machine learning this majority-class baseline is known as *ZeroR* (*OneR* is the related baseline that builds a single rule from one feature). We estimate ZeroR simply by calculating the relative frequencies of the class labels and picking the label with the highest relative frequency.
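A minimal sketch of this majority-class baseline with pandas follows. The labels are made up for illustration.

```python
import pandas as pd

# Hypothetical class labels for a three-class problem.
labels = pd.Series(["a", "b", "a", "c", "a", "b"])

# Relative frequency of each label, then the most common one.
relative_frequencies = labels.value_counts(normalize=True)
majority_class = relative_frequencies.idxmax()
baseline_accuracy = relative_frequencies.max()

print(relative_frequencies)
print(f"Always predict {majority_class!r}: accuracy = {baseline_accuracy:.1%}")
```

Any model worth keeping should beat `baseline_accuracy` on the same data.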

It is possible to formulate a variety of loss functions for the classification task.  Taking the *binary* case, *cross entropy* is one such function and the one used to find the $\beta$s in logistic regression.

$L(y, \hat{y}) = -\left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]$

and there are [others](https://en.wikipedia.org/wiki/Loss_functions_for_classification). However, I never see these functions used to *evaluate* classification models. They are normally used by the *algorithms*, $g(X, y)$, to learn the classification models, $f(X)$, from data.
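For reference, binary cross entropy can be sketched in a few lines of NumPy. It is written here with a leading minus sign so that smaller values are better, and the clipping constant `eps` is an assumption added only to avoid taking `log(0)`.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Per-observation cross entropy for labels y in {0, 1} and predicted probabilities y_hat."""
    y = np.asarray(y, dtype=float)
    # Keep probabilities strictly inside (0, 1) so the logarithms are finite.
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical labels and predicted probabilities.
y = np.array([1, 1, 0, 0])
y_hat = np.array([0.9, 0.2, 0.1, 0.7])

print(binary_cross_entropy(y, y_hat))         # small for confident, correct predictions
print(binary_cross_entropy(y, y_hat).mean())  # averaged over the data set
```

Confident but wrong predictions (the second and fourth observations) are penalized much more heavily than confident and correct ones.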

Instead, there are a variety of classification *metrics* that are used to evaluate how good or bad a classification model is. The reason for this plethora of metrics is that there appear to be a number of ways that a classification prediction can go right or wrong and different ways to summarize these outcomes. For a binary classification task, where "1" is taken to mean "positive" or "in the class" and "0" is taken to be "negative" or "not in the class", the possible cases for a classification model are:

1. The true class can be "1" and the model can predict "1". This is a *true positive* or TP.
2. The true class can be "1" and the model can predict "0". This is a *false negative* or FN.
3. The true class can be "0" and the model can predict "1". This is a *false positive* or FP.
4. The true class can be "0" and the model can predict "0". This is a *true negative* or TN.

We can summarize these results in a table called a *confusion matrix*...it summarizes how confused (or not) our model is:

|  &nbsp;      | Predict 1 | Predict 0|
|:------------:|:--------:|:---------:|
| **Actual 1** | TP       | FN        |
| **Actual 0** | FP       | TN        |

Since $N$ is simply the number of observations, TP + FP + FN + TN, we have the following:

**accuracy** = $\frac{TP + TN}{N}$ = 1 - error rate

**error rate** = $\frac{FP + FN}{N}$ = 1 - accuracy

which are the two most common metrics for evaluating classification models. We used these in previous chapters to evaluate logistic regression.

However, they may not be sufficient. As we see above, there are two types of errors: classifying something that's "1" as "0" and classifying something that's "0" as "1". The impact of these errors may not be symmetric or equally costly. We saw this when talking about Bayes Theorem. A test may be good at telling you there's a problem when there really is one, but it may also tell you there's a problem when there really *isn't* one, i.e., a false alarm.

These various possibilities each have their own names (sometimes several) and can be discussed in terms of the confusion matrix. We're going to look at three of the more common ones.

The first is **sensitivity**. This is basically the "disease" case: if you have the disease, how good is the model (test) at detecting it? It is also called **true positive rate**, **hit rate**, and **recall**. It is defined only in terms of the positive observations, both those the model predicted correctly and those it did not.

$sensitivity = \frac{TP}{TP+FN}$ (1 - sensitivity is the **false negative rate**)

The next one is **specificity**. This is the other case. If you do not have the disease, how good is the model (test) at determining that (and not telling you that you do have it!)?

$specificity = \frac{TN}{TN+FP}$ (1 - specificity is the **false positive rate**)

The final case is **precision**  or **positive predictive value**. Basically, of the positive ("1") predictions that we made, how many were right?

$precision = \frac{TP}{TP + FP}$

When we're not looking at accuracy/error rate, we generally look at sensitivity and precision. In fact, there's a harmonic mean of the two called $F1$:

$F1 = \frac{2TP}{2TP + FP + FN}$
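To see where this expression comes from, write $P$ for precision and $R$ for sensitivity (recall), two shorthand symbols introduced only for this derivation, and take the harmonic mean of the two:

$$
\begin{aligned}
\frac{1}{P} + \frac{1}{R} &= \frac{TP + FP}{TP} + \frac{TP + FN}{TP} = \frac{2TP + FP + FN}{TP} \\
F1 &= \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R} = \frac{2TP}{2TP + FP + FN}
\end{aligned}
$$

Note that TN does not appear anywhere, so $F1$ ignores how well the model handles the negative class.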

## Example

Let's look at our "switching" logistic regression and evaluate it in terms of these new metrics. Let's load the data:

In [5]:
wells = pd.read_csv( "resources/arsenic.wells.tsv", sep=" ")

In [6]:
wells[ "dist10"] = wells[ "dist"].apply( lambda x: int( round( x / 10, 0)))

One possible model might have been:

In [7]:
result1 = models.logistic_regression( "switch ~ dist10 + arsenic + assoc + educ", data = wells)
print(models.simple_describe_lgr(result1))

Model: switch ~ dist10 + arsenic + assoc + educ
-------------  ---------  -----
Coefficients              Value
               $\beta_0$  -0.16
dist10         $\beta_1$  -0.09
arsenic        $\beta_2$  0.47
assoc          $\beta_3$  -0.13
educ           $\beta_4$  0.04

Metrics        Value
Error ($\%$)   38.41
Efron's $R^2$  0.07
-------------  ---------  -----


The error rate here was calculated to be 38.41%. The `logistic_regression` function has been enriched to return both the $y$ and $\hat{y}$ values to us. Let's use them to generate a confusion matrix:

In [8]:
from tabulate import tabulate

In [9]:
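def binary_confusion_matrix(result):
    """Tabulate TP, FP, FN and TN from the 0/1 values in result["y"] and result["y_hat"]."""
    tp = tn = fn = fp = 0
    for y, y_hat in zip(result["y"], result["y_hat"]):
        if y == 0 and y_hat == 0:
            tn += 1  # actual 0, predicted 0
        elif y == 0 and y_hat == 1:
            fp += 1  # actual 0, predicted 1
        elif y == 1 and y_hat == 1:
            tp += 1  # actual 1, predicted 1
        else:
            fn += 1  # actual 1, predicted 0 (assumes values are only 0 or 1)
    # Rows are predictions and columns are actual classes.
    return tabulate([["Predicted 1", tp, fp], ["Predicted 0", fn, tn]], headers=["", "Actual 1", "Actual 0"])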

In [10]:
print(binary_confusion_matrix(result1))

               Actual 1    Actual 0
-----------  ----------  ----------
Predicted 1        1391         814
Predicted 0         346         469


Here we can see that our true positives were 1391. Our true negatives were 469. For our errors, our false positives were 814 and our false negatives were 346. 

* Of all the actual negatives, how many did we predict to be positive?

This is the *false positive rate*: FP/(TN + FP) = 814/(814+469) = 63.4%.

* Of all the actual positives, how many did we predict to be negative?

The false negatives were 346 and the *false negative rate* was FN/(TP + FN) = 346/(1391+346) = 19.9%. 

By comparing these two rates, we can see that our model errs more often by predicting someone will switch when they didn't rather than predicting that someone will stay when they switched.

* Of all the positive predictions, how many were correct?

Finally, we can look at precision: TP/(TP+FP) = 1391/(1391 + 814) = 63.1%. Of our positive predictions (predicting switch), we were 63.1% correct.
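The arithmetic above is easy to check in code. This sketch starts from the four counts in the confusion matrix and recomputes each rate; the dictionary layout is just one convenient way to organize them.

```python
# Counts taken from the confusion matrix for the switching model.
tp, fp, fn, tn = 1391, 814, 346, 469
n = tp + fp + fn + tn

metrics = {
    "accuracy": (tp + tn) / n,
    "error rate": (fp + fn) / n,
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "false positive rate": fp / (tn + fp),
    "false negative rate": fn / (tp + fn),
    "precision": tp / (tp + fp),
    "F1": 2 * tp / (2 * tp + fp + fn),
}

for name, value in metrics.items():
    print(f"{name:>20}: {value:.1%}")
```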

As you can see, this gives a more nuanced view of the model than just *error rate*. The main problem here is the false positive rate. We are much more likely to predict a switch than is actually reflected in the data on people switching. For something like arsenic poisoning, this is probably more important than underestimating those who do switch (false negative rate). If this were a government program, you'd need a two-pronged approach: a better model for understanding the behavior that leads to switching, and monitoring of actual switches to know whether the program is successful.

Confusion matrices work even for multi-class problems. Although the language isn't quite the same (you can't have true "positives" with three or more classes), there is still the sense of:

1. Of actual class $i$, how many did we predict to be something else?
2. Of predicted class $i$, how many were correct?
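As a small sketch of the multi-class case, `pd.crosstab` can build the confusion matrix directly, and the two questions above become row-wise and column-wise ratios. The labels are made up, and the code assumes every class appears at least once among both the actual and the predicted labels so that the matrix is square.

```python
import numpy as np
import pandas as pd

# Hypothetical actual and predicted labels for a three-class problem.
actual = pd.Series(["a", "a", "a", "b", "b", "c", "c", "c"], name="Actual")
predicted = pd.Series(["a", "a", "b", "b", "c", "c", "c", "a"], name="Predicted")

# Rows are actual classes, columns are predicted classes.
confusion = pd.crosstab(actual, predicted)
print(confusion)

# Correct predictions sit on the diagonal.
correct = np.diag(confusion.values)

# 1. Of actual class i, what fraction did we predict correctly (1 minus the fraction predicted as something else)?
recall_per_class = correct / confusion.sum(axis=1)

# 2. Of predicted class i, what fraction were correct?
precision_per_class = correct / confusion.sum(axis=0)

print(recall_per_class)
print(precision_per_class)
```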