# Risk in Classification

##### Keywords: classification, supervised learning, decision risk, decision theory, bayes risk

```python
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
```

## The multiple risks
There are *two* risks in learning that we must consider, one to *estimate probabilities*, which we call **estimation risk**, and one to *make decisions*, which we call **decision risk**.

#### Estimation Risk
Estimation risk is what we minimize in order to fit a model. For instance we might pick (log) likelihood as our utility function: we'd pick the parameter settings that maximize the log likelihood. Or we might choose (vertical) squared error loss in fitting a line, and choose the parameters which minimize the overall sum of squares on the training data.
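As a rough illustration of the squared-error case, here is a minimal sketch. The data are synthetic (a noisy line invented for this example), and `squared_error_risk` is a hypothetical helper name, not something defined elsewhere in these notes.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a noisy line y = 2x + 1.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

def squared_error_risk(params, x, y):
    """Sum of squared vertical errors for the line y = slope * x + intercept."""
    slope, intercept = params
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))

# For this model/objective pair a closed-form minimizer exists (least squares).
design = np.column_stack([x, np.ones_like(x)])
fitted, *_ = np.linalg.lstsq(design, y, rcond=None)

perturbed = fitted + np.array([0.3, 0.0])  # any other parameter setting
print("fitted slope, intercept:", fitted)
print("risk at fitted parameters:", squared_error_risk(fitted, x, y))
print("risk at a perturbed setting:", squared_error_risk(perturbed, x, y))
```

The fitted parameters are "best" only in the sense of this particular estimation risk; a different objective would generally pick different parameters.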

Estimation risk is about how we fit a model: defining what it means for a particular parameter setting to be good, and hunting for the best setting. 

You are completely free to specify new goals for your model fitting, but it's then your job to figure out how to find good parameter settings. Maybe there's a closed-form solution for your model/utility combination, maybe SGD will work well, maybe it's an extremely thorny and hopeless optimization task. In general, though, we stick with pre-existing models and objectives because huge amounts of time have been sunk into developing each pair.

At the end of the model estimation (fitting) process we have a model with tuned parameters. Maybe it's an outcome model and just spits out a single answer for a given input that we just run with. Maybe it's a probability model and gives a (tuned) distribution over possible responses. In the second case we need to translate those probabilities into an actual action. We face Decision Risk as we transition from the uncertainty of a probability distribution to a locked-in action.

#### Decision Risk
What do we mean by a "decision" exactly? We'll use the letter $a$ here to indicate a decision, in both the regression and classification problems. In the classification problem, one example of a decision is the process used to choose the class of a sample, given the probability of being in each class.  We must mix these probabilities with "business knowledge" or "domain knowledge" to make a decision. 

The extra knowledge we supply is the **decision loss** $l(y,a)$ or **utility** $u(y,a)$ (profit, or benefit) in making a decision $a$ when the predicted variable actually had value $y$. For example, we must provide all of the losses $l$(no-cancer, biopsy), $l$(cancer, biopsy), $l$(no-cancer, no-biopsy), and $l$(cancer, no-biopsy). One set of choices for these losses may be 20, 0, 0, 200 respectively.

To simplify matters though, let's presently insist that the **decision space** consists of the (finitely many) possible values of $y$. In other words, the decision to be made is a classification. Then we can use these losses to penalize misclassification asymmetrically if we desire.

In the cancer example, we then set $l$(observed-no-cancer, predicted-cancer) to be 20 and $l$(observed-cancer, predicted-no-cancer) to be 200. This is the situation we talked about much earlier in the class, where we penalize the false negative (observed cancer not predicted to be cancer) much more than the false positive (observed non-cancer predicted to be cancer).
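One convenient way to hold these numbers is a small loss matrix. The sketch below assumes a layout `loss[y, a]` (rows are the true class, columns are the decision) and the encoding 0 = no cancer, 1 = cancer; both conventions are choices made for this example.

```python
import numpy as np

NO_CANCER, CANCER = 0, 1  # assumed integer encoding of the two classes

# loss[y, a]: cost of deciding a when the true class is y.
# Values follow the example in the text: 20, 0, 0, 200.
loss = np.array([
    [0.0, 20.0],    # true no-cancer: correct call costs 0, false positive costs 20
    [200.0, 0.0],   # true cancer: false negative costs 200, correct call costs 0
])

false_positive_cost = loss[NO_CANCER, CANCER]
false_negative_cost = loss[CANCER, NO_CANCER]
print("false positive:", false_positive_cost, "false negative:", false_negative_cost)
```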

## Combining Risks or Losses
The loss we face when making a particular decision $a$ is rarely a single number. The decision might be right or close to right (small loss), or it might be wrong (large loss), each with a certain probability. That is: we're about to lock in a decision and then roll a die to determine the actual state of the world, and we will feel either a small loss or a big loss. How should we evaluate the decision $a$ overall, given that the actual loss we'll feel is a bit of a gamble?

There are as many ways to combine these losses as you can think of, but two are common and well studied. Maximum or "scaredy-cat" risk takes the overall loss in an uncertain situation to be the maximum possible loss. Average risk takes it to be the expected loss over all outcomes.

We'll use expected loss.
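To see how the two summaries can disagree, here is a minimal sketch. It assumes the biopsy losses from earlier (20 for a false positive, 200 for a false negative, 0 otherwise), a hypothetical predictive distribution of 95% no-cancer and 5% cancer, and an array layout `loss[y, a]` chosen for this example.

```python
import numpy as np

# loss[y, a]: cost of deciding a when the truth is y (0 = no cancer, 1 = cancer).
loss = np.array([[0.0, 20.0],
                 [200.0, 0.0]])

# Hypothetical predictive probabilities p(y | x) for one patient.
p_y = np.array([0.95, 0.05])

# Expected loss of each decision: sum over y of p(y | x) * loss[y, a].
expected_risk = p_y @ loss

# Maximum ("scaredy-cat") loss of each decision: worst case over y.
max_risk = loss.max(axis=0)

best_expected = int(np.argmin(expected_risk))
best_minimax = int(np.argmin(max_risk))
print("expected risk per decision:", expected_risk, "-> choose", best_expected)
print("maximum risk per decision:", max_risk, "-> choose", best_minimax)
```

With these numbers the expected-loss rule prefers decision 0 (about 10 versus 19), while the maximum-loss rule prefers decision 1 (worst case 20 versus 200), because it ignores how unlikely the bad outcome is.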

## Average Risk (a.k.a. Expected Loss)

#### Risk of decision $a$ at X=x
We simply weight each outcome's loss by the predictive probability of that outcome; the integral from the risks notes reduces to a sum:

$$ R_{a}(x) = \sum_y l(y,a(x)) p(y|x)$$

That is, we calculate the **average risk**, over all possible values of $y$, of making choice $a$ for a given data point.

#### Risk of decision $a$ overall
Then, if we want to calculate the overall risk, given all the samples in our set, we calculate:

$$R(a) = \int R_{a}(x) p(x)  dx $$

(Since we usually treat the covariates as fixed, with an unknown distribution, we can replace this integral by a sum over the empirical distribution, i.e. a sum over the data points.)

To minimize the overall risk, it is sufficient to minimize the risk at each point or sample: since $p(x)$ is never negative, lowering $R_a(x)$ at any point can only lower the weighted total.
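The pointwise argument can be checked numerically. The sketch below assumes the biopsy losses from earlier, five hypothetical predicted probabilities, and the empirical distribution over those five points (each weighted equally); `overall_risk` is a helper defined only for this example.

```python
import numpy as np

# loss[y, a]: cost of deciding a when the truth is y.
loss = np.array([[0.0, 20.0],
                 [200.0, 0.0]])

# Hypothetical predicted p(y=1 | x_i) for five data points.
p1 = np.array([0.02, 0.05, 0.10, 0.50, 0.90])
probs = np.column_stack([1.0 - p1, p1])   # shape (n, 2): p(y | x_i)

# risk[i, a] = R_a(x_i) = sum over y of loss[y, a] * p(y | x_i)
risk = probs @ loss

def overall_risk(actions):
    """Average of R_a(x_i) over the data points for a given vector of decisions."""
    return float(risk[np.arange(len(actions)), actions].mean())

pointwise_best = risk.argmin(axis=1)       # minimize the risk at each point
always_0 = np.zeros(len(p1), dtype=int)    # constant rules for comparison
always_1 = np.ones(len(p1), dtype=int)

print("pointwise decisions:", pointwise_best)
print("overall risk, pointwise rule:", overall_risk(pointwise_best))
print("overall risk, always 0:", overall_risk(always_0))
print("overall risk, always 1:", overall_risk(always_1))
```

Any other assignment of decisions differs from the pointwise rule at some points, and at each of those points it pays at least as much risk, so its average cannot be lower.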

## Example: Optimal rule for two-class classification
Suppose we have a two-class classification problem: our goal is to label each point as "cat" or "not cat", "will default" or "won't default", "cancer" or "not cancer". The cost of making a misclassification differs from context to context. If the model predicts a 61% chance of category 1, should we label that point as 1 or as 0? Moreover, what's the best rule for mapping the probability of class=1 into a labeling of either 1 or 0?

The model reports $p(class=1|x)$ and $p(class=0|x)$. The average risk of assuming class=a is:

$$R_a(x) = l(1, a)p(1|x) + l(0, a)p(0|x).$$

Then for the "decision" $a=1$ we have:

$$R_1(x) = l(1,1)p(1|x) + l(0,1)p(0|x),$$

and for the "decision" $a=0$ we have:

$$R_0(x) = l(1,0)p(1|x) + l(0,0)p(0|x).$$

#### What should we choose?
Well, if we believe in expected risk, we'd choose $1$ for the sample at $x$ if:

$$R_1(x) \lt R_0(x).$$

$$l(1,1)\,p(1|x) + l(0,1)\,p(0|x) < l(1,0)\,p(1|x) + l(0,0)\,p(0|x) $$

$$ p(1|x)(l(1,1) - l(1,0)) \lt p(0|x)(l(0,0) - l(0,1))$$

$$ \frac{p(1|x)}{p(0|x)} > \frac{l(0,0) - l(0,1)}{l(1,1) - l(1,0)}$$

(The last step divides by $p(0|x) > 0$ and by $l(1,1) - l(1,0)$. The direction of the inequality assumes $l(1,1)<l(1,0)$, as should be common; it reverses otherwise.)

This gives us a handy quantity $r$ to use as a decision criterion:
$$r=\frac{l(0,1) - l(0,0)}{l(1,0) - l(1,1)} =\frac{c_{FP} - c_{TN}}{c_{FN} - c_{TP}}$$

So, the best policy (i.e. the Bayes decision rule, whose risk is the Bayes risk) is to label as 1 whenever the ratio $\frac{p(1|x)}{p(0|x)}$ exceeds $r$.

If we want to work only in terms of $p(1|x)$, this may also be written as:

$$p(1|x) \gt \frac{r}{1+r}.$$
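The rule is short enough to write as a function. This is a minimal sketch: the names `decision_threshold` and `classify` are hypothetical, and it assumes $c_{FN} > c_{TP}$ and $c_{FP} \ge c_{TN}$ so that the inequality points in the direction derived above.

```python
def decision_threshold(c_tn, c_fp, c_fn, c_tp):
    """Probability threshold t such that we label 1 when p(1|x) > t."""
    if c_fn <= c_tp:
        raise ValueError("expected c_fn > c_tp; otherwise the inequality reverses")
    r = (c_fp - c_tn) / (c_fn - c_tp)
    return r / (1.0 + r)

def classify(p1, c_tn, c_fp, c_fn, c_tp):
    """Label 1 if the predicted p(1|x) exceeds the cost-based threshold, else 0."""
    return int(p1 > decision_threshold(c_tn, c_fp, c_fn, c_tp))

# Biopsy example from earlier: false positive costs 20, false negative costs 200.
t_biopsy = decision_threshold(c_tn=0.0, c_fp=20.0, c_fn=200.0, c_tp=0.0)
print("biopsy threshold:", t_biopsy)
print("label for p(1|x) = 0.15:", classify(0.15, 0.0, 20.0, 200.0, 0.0))
```

For the biopsy costs, $r = 20/200 = 0.1$, so the threshold is $0.1/1.1 \approx 0.09$: even a 15% predicted chance of cancer is enough to label the sample as 1.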

As a sanity check: if you assume that true positives and true negatives have no cost, and the cost of a false positive is equal to that of a false negative, then $r=1$ and the threshold is the usual intuitive $t=0.5$.

Alternatively, suppose the losses are: $l(0,0)=0$, $l(0,1)=10$, $l(1,1)=20$, $l(1,0)=200$. Then $r = 10/180$, and we'd classify as 1 if $p(1|x) > \frac{10/180}{1+10/180} = \frac{10}{190} \approx 0.053$. Intuitively, this is because classifying as 0 when the true class is 1 is so inordinately costly (200) that we'd rather classify as 1 and deal with losses like 10 and 20 instead, unless we're really confident that 0 is the correct label.
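This number can be double-checked by comparing the two risks directly on a grid of probabilities, without using the threshold formula. The sketch below assumes the losses just quoted, stored as `loss[y, a]`, and a grid spacing of 0.0001 chosen for this example.

```python
import numpy as np

# loss[y, a] with l(0,0)=0, l(0,1)=10, l(1,0)=200, l(1,1)=20.
loss = np.array([[0.0, 10.0],
                 [200.0, 20.0]])

# Grid of candidate values for p(1|x).
p1 = np.linspace(0.0, 1.0, 10001)
probs = np.column_stack([1.0 - p1, p1])

# risk[i, a] = R_a for the i-th grid value of p(1|x).
risk = probs @ loss

# Smallest grid probability at which deciding 1 has strictly lower risk than deciding 0.
choose_1 = risk[:, 1] < risk[:, 0]
empirical_threshold = float(p1[choose_1].min())

# Threshold from the formula r / (1 + r).
r = (loss[0, 1] - loss[0, 0]) / (loss[1, 0] - loss[1, 1])
analytic_threshold = r / (1.0 + r)

print("crossover found on the grid:", empirical_threshold)
print("threshold from the formula:", analytic_threshold)
```

The two values should agree up to the grid spacing, which is a quick way to catch a sign or index slip in the loss matrix.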