## 23. Classification

### 23.1 Introduction

The problem of predicting a discrete variable $Y$ from another random variable $X$ is called **classification**, **supervised learning**, **discrimination** or **pattern recognition**.

In more detail, consider IID data $(X_1, Y_1), \dots, (X_n, Y_n)$ where

$$ X_i = (X_{i1}, \dots, X_{id}) \in \mathcal{X} \subset \mathbb{R}^d $$

is a $d$-dimensional vector and $Y_i$ takes values in some finite set $\mathcal{Y}$.  A **classification rule** is a function $h : \mathcal{X} \rightarrow \mathcal{Y} $.  When we observe a new $X$, we predict $Y$ to be $h(X)$.

It is worth revisiting the vocabulary:

| Statistics     | Computer Science    | Meaning                                      |
|----------------|---------------------|----------------------------------------------|
| classification | supervised learning | predicting a discrete $Y$ from $X$           |
| data           | training sample     | $(X_1, Y_1), \dots, (X_n, Y_n)$              |
| covariates     | features            | the $X_i$'s                                  |
| classifier     | hypothesis          | map $h: \mathcal{X} \rightarrow \mathcal{Y}$ |
| estimation     | learning            | finding a good classifier                    |

For most of this chapter, we deal with the case $\mathcal{Y} = \{ 0, 1 \}$.

### 23.2 Error Rates and The Bayes Classifier

The **true error rate** of a classifier is 

$$ L(h) = \mathbb{P}( \{ h(X) \neq Y\} ) $$

and the **empirical error rate** or **training error rate** is

$$ \hat{L}_n(h) = \frac{1}{n} \sum_{i=1}^n I(h(X_i) \neq Y_i) $$
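The empirical error rate is simply the fraction of training points that a rule misclassifies. A minimal sketch, assuming a hypothetical one-dimensional threshold rule and made-up toy data (neither comes from the text):

```python
def empirical_error_rate(h, xs, ys):
    """Return (1/n) * sum_i I(h(x_i) != y_i) for a classification rule h."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)


def threshold_rule(x):
    """Hypothetical classifier: predict 1 when x is positive, else 0."""
    return 1 if x > 0 else 0


# Illustrative toy data (assumed, not from the text).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

training_error = empirical_error_rate(threshold_rule, xs, ys)
print(training_error)  # two of the six points are misclassified
```

Note that $\hat{L}_n(h)$ is computed on the same data that would typically be used to choose $h$, so it tends to underestimate the true error rate $L(h)$.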

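Consider the special case where $\mathcal{Y} = \{0, 1\}$.  Let

$$ r(x) = \mathbb{E}(Y | X = x) = \mathbb{P}(Y = 1 | X = x) $$

denote the regression function.  By Bayes' theorem,

$$ r(x) = \frac{\pi f_1(x)}{\pi f_1(x) + (1 - \pi) f_0(x)} $$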

where

$$ f_0(x) = f(x | Y = 0)
\quad \text{and} \quad
f_1(x) = f(x | Y = 1)$$

and $\pi = \mathbb{P}(Y = 1)$.

The **Bayes classification rule** $h^*$ is defined to be

$$
h^*(x) = \begin{cases}
1 & \text{if } r(x) > \frac{1}{2} \\
0 & \text{otherwise}
\end{cases}
$$

The set $\mathcal{D}(h) = \{ x : \mathbb{P}(Y = 1 | X = x) = \mathbb{P}(Y = 0 | X = x) \}$ is called the **decision boundary**.

**Warning**: the Bayes rule has nothing to do with Bayesian inference.  We could estimate the Bayes rule using either frequentist or Bayesian methods.

The Bayes rule may be written in several different forms:

$$
h^*(x) = \begin{cases}
1 & \text{if } \mathbb{P}(Y = 1 | X = x) > \mathbb{P}(Y = 0 | X  = x)\\
0 & \text{otherwise}
\end{cases}
$$

and

$$
h^*(x) = \begin{cases}
1 & \text{if } \pi f_1(x) > (1 - \pi) f_0(x) \\
0 & \text{otherwise}
\end{cases}
$$
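To see that these forms describe the same rule, the following sketch evaluates two of them on a grid. It assumes, purely for illustration, that the class densities are known Gaussians $f_0 = N(0, 1)$ and $f_1 = N(2, 1)$ with prior $\pi = 0.3$; these choices are hypothetical and not taken from the text.

```python
import math

PRIOR = 0.3  # assumed value of pi = P(Y = 1)


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def f0(x):
    return normal_pdf(x, 0.0, 1.0)  # assumed density of X given Y = 0


def f1(x):
    return normal_pdf(x, 2.0, 1.0)  # assumed density of X given Y = 1


def r(x):
    """P(Y = 1 | X = x) obtained from Bayes' theorem."""
    numerator = PRIOR * f1(x)
    return numerator / (numerator + (1.0 - PRIOR) * f0(x))


def bayes_rule_posterior(x):
    """Form 1: predict 1 when r(x) > 1/2."""
    return 1 if r(x) > 0.5 else 0


def bayes_rule_densities(x):
    """Form 2: predict 1 when pi * f1(x) > (1 - pi) * f0(x)."""
    return 1 if PRIOR * f1(x) > (1.0 - PRIOR) * f0(x) else 0


grid = [i / 10.0 for i in range(-50, 71)]
forms_agree = all(bayes_rule_posterior(x) == bayes_rule_densities(x) for x in grid)

# For these two unit-variance Gaussians, solving pi * f1(x) = (1 - pi) * f0(x)
# gives a single boundary point x = 1 + 0.5 * log((1 - pi) / pi).
boundary = 1.0 + 0.5 * math.log((1.0 - PRIOR) / PRIOR)
print(forms_agree, boundary)
```

In this example the decision boundary is a single point; because $\pi < 1/2$, it sits to the right of the midpoint between the two class means, so the rule demands more evidence before predicting the rarer class.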

**Theorem 23.5**.  The Bayes rule is optimal, that is, if $h$ is any classification rule then $L(h^*) \leq L(h)$.

The Bayes rule depends on unknown quantities so we need to use the data to find some approximation to the Bayes rule.  At the risk of oversimplifying, there are three main approaches:

1. **Empirical Risk Minimization**.  Choose a set of classifiers $\mathcal{H}$ and find $\hat{h} \in \mathcal{H}$ that minimizes some estimate of $L(h)$.

2. **Regression**.  Find an estimate $\hat{r}$ of the regression function $r$ and define

$$ 
\hat{h}(x) = \begin{cases}
1 & \text{if } \hat{r}(x) > \frac{1}{2} \\
0 & \text{otherwise}
\end{cases}
$$

3. **Density Estimation**.  Estimate $f_0$ from the $X_i$'s for which $Y_i = 0$, estimate $f_1$ from the $X_i$'s for which $Y_i = 1$, and let $\hat{\pi} = n^{-1} \sum_{i=1}^n Y_i$.  Define

$$ \hat{r}(x) = \hat{\mathbb{P}}(Y = 1 | X = x) = \frac{\hat{\pi} \hat{f}_1(x)}{\hat{\pi} \hat{f}_1(x) + (1 - \hat{\pi}) \hat{f}_0(x)} $$

and

$$ 
\hat{h}(x) = \begin{cases}
1 & \text{if } \hat{r}(x) > \frac{1}{2} \\
0 & \text{otherwise}
\end{cases}
$$
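The density estimation approach can be sketched as follows. The text does not say how $f_0$ and $f_1$ are estimated; this sketch assumes, as one simple option, a one-dimensional Gaussian fitted to each class by its sample mean and standard deviation. The function names and the toy data are hypothetical.

```python
import math
import statistics


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_density_classifier(xs, ys):
    """Plug-in classifier: estimate pi, f0, f1 from data, then threshold r_hat at 1/2.

    Assumes scalar x, labels in {0, 1}, and at least two points per class.
    """
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    x1 = [x for x, y in zip(xs, ys) if y == 1]

    pi_hat = sum(ys) / len(ys)  # n^{-1} * sum of Y_i
    m0, s0 = statistics.mean(x0), statistics.stdev(x0)  # Gaussian fit for f0 (assumption)
    m1, s1 = statistics.mean(x1), statistics.stdev(x1)  # Gaussian fit for f1 (assumption)

    def r_hat(x):
        # Estimated P(Y = 1 | X = x). Far from both class means the two
        # densities can underflow to zero; this sketch does not guard against that.
        numerator = pi_hat * normal_pdf(x, m1, s1)
        return numerator / (numerator + (1.0 - pi_hat) * normal_pdf(x, m0, s0))

    def h_hat(x):
        return 1 if r_hat(x) > 0.5 else 0

    return pi_hat, r_hat, h_hat


# Illustrative toy data (assumed, not from the text).
xs = [-1.2, -0.4, 0.1, 0.6, 1.1, 1.9, 2.4, 3.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

pi_hat, r_hat, h_hat = fit_density_classifier(xs, ys)
print(pi_hat, h_hat(-1.0), h_hat(2.5))
```

Any other density estimator (for example a kernel density estimate) could replace the Gaussian fit without changing the rest of the construction.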

We now generalize to the case where $Y$ takes more than two values:

**Theorem 23.6**.  Suppose that $Y \in \mathcal{Y} = \{ 1, \dots, K \}$.  The optimal rule is

$$ h(x) = \text{argmax}_k \mathbb{P}(Y = k | X = x) = \text{argmax}_k \pi_k f_k(x) $$

where

$$ \mathbb{P}(Y = k | X = x) = \frac{f_k(x) \pi_k}{\sum_r f_r(x) \pi_r} $$

and $\pi_r = \mathbb{P}(Y = r)$, $f_r(x) = f(x | Y = r)$.
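A minimal sketch of the multiclass rule, assuming a hypothetical setting with $K = 3$ classes, known priors, and unit-variance Gaussian class densities (all values are made up for illustration). Because the denominator $\sum_r f_r(x) \pi_r$ is the same for every $k$, maximizing $\pi_k f_k(x)$ gives the same answer as maximizing the posterior.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Assumed priors pi_k and class means; each f_k is taken to be N(mean_k, 1).
priors = {1: 0.5, 2: 0.3, 3: 0.2}
means = {1: -2.0, 2: 0.0, 3: 2.0}


def joint_weight(k, x):
    """pi_k * f_k(x), the unnormalized posterior of class k at x."""
    return priors[k] * normal_pdf(x, means[k], 1.0)


def posterior(x):
    """P(Y = k | X = x) for every class k."""
    weights = {k: joint_weight(k, x) for k in priors}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def bayes_rule_multiclass(x):
    """Return the class k that maximizes pi_k * f_k(x)."""
    return max(priors, key=lambda k: joint_weight(k, x))


for x in (-2.0, 0.0, 3.0):
    print(x, bayes_rule_multiclass(x), posterior(x))
```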