## 23. Classification

### 23.1 Introduction

The problem of predicting a discrete variable $Y$ from another random variable $X$ is called **classification**, **supervised learning**, **discrimination** or **pattern recognition**.

In more detail, consider IID data $(X_1, Y_1), \dots, (X_n, Y_n)$ where

$$ X_i = (X_{i1}, \dots, X_{id}) \in \mathcal{X} \subset \mathbb{R}^d $$

is a $d$-dimensional vector and $Y_i$ takes values in some finite set $\mathcal{Y}$.  A **classification rule** is a function $h : \mathcal{X} \rightarrow \mathcal{Y} $.  When we observe a new $X$, we predict $Y$ to be $h(X)$.

It is worth revisiting the vocabulary:

| Statistics     | Computer Science    | Meaning                                      |
|----------------|---------------------|----------------------------------------------|
| classification | supervised learning | predicting a discrete $Y$ from $X$           |
| data           | training sample     | $(X_1, Y_1), \dots, (X_n, Y_n)$              |
| covariates     | features            | the $X_i$'s                                  |
| classifier     | hypothesis          | map $h: \mathcal{X} \rightarrow \mathcal{Y}$ |
| estimation     | learning            | finding a good classifier                    |

In most of this chapter, we deal with the case $\mathcal{Y} = \{ 0, 1 \}$.

### 23.2 Error Rates and The Bayes Classifier

The **true error rate** of a classifier is 

$$ L(h) = \mathbb{P}( \{ h(X) \neq Y\} ) $$

and the **empirical error rate** or **training error rate** is

$$ \hat{L}_n(h) = \frac{1}{n} \sum_{i=1}^n I(h(X_i) \neq Y_i) $$
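As a small illustration, the empirical error rate is just the fraction of training points that a rule misclassifies. The sketch below assumes a hypothetical one-dimensional data set and a hypothetical threshold rule `threshold_rule`; neither comes from the text.

```python
import numpy as np

def empirical_error_rate(h, X, Y):
    """Fraction of training points (X_i, Y_i) with h(X_i) != Y_i."""
    predictions = np.array([h(x) for x in X])
    return float(np.mean(predictions != np.asarray(Y)))

# Hypothetical toy data: one covariate and binary labels.
X = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
Y = np.array([0, 1, 1, 1, 0])

# Hypothetical rule: predict 1 when x is positive.
threshold_rule = lambda x: int(x > 0)

# 2 of the 5 points are misclassified, so this prints 0.4.
print(empirical_error_rate(threshold_rule, X, Y))
```

The true error rate $L(h)$ cannot be computed this way because it is a probability under the unknown joint distribution of $(X, Y)$.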

Consider the special case where $\mathcal{Y} = \{0, 1\}$.  Let

$$ r(x) = \mathbb{E}(Y | X = x) = \mathbb{P}(Y = 1 | X = x) = \frac{\pi f_1(x)}{\pi f_1(x) + (1 - \pi) f_0(x)} $$

where

$$ f_0(x) = f(x | Y = 0)
\quad \text{and} \quad
f_1(x) = f(x | Y = 1)$$

and $\pi = \mathbb{P}(Y = 1)$.

The **Bayes classification rule** $h^*$ is defined to be

$$
h^*(x) = \begin{cases}
1 & \text{if } r(x) > \frac{1}{2} \\
0 & \text{otherwise}
\end{cases}
$$

The set $\mathcal{D}(h) = \{ x : \mathbb{P}(Y = 1 | X = x) = \mathbb{P}(Y = 0 | X = x) \}$ is called the **decision boundary**.

**Warning**: the Bayes rule has nothing to do with Bayesian inference.  We could estimate the Bayes rule using either frequentist or Bayesian methods.

The Bayes rule may be written in several different forms:

$$
h^*(x) = \begin{cases}
1 & \text{if } \mathbb{P}(Y = 1 | X = x) > \mathbb{P}(Y = 0 | X  = x)\\
0 & \text{otherwise}
\end{cases}
$$

and

$$
h^*(x) = \begin{cases}
1 & \text{if } \pi f_1(x) > (1 - \pi) f_0(x) \\
0 & \text{otherwise}
\end{cases}
$$
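The following sketch evaluates the Bayes rule when the class densities and the prior are known. The specific choices ($X | Y = 0 \sim N(0, 1)$, $X | Y = 1 \sim N(2, 1)$, $\mathbb{P}(Y = 1) = 0.5$) and all function names are assumptions made for illustration only.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def regression_function(x, prior1, f0, f1):
    """r(x) = P(Y = 1 | X = x), computed with Bayes' theorem."""
    num = prior1 * f1(x)
    return num / (num + (1.0 - prior1) * f0(x))

def bayes_rule(x, prior1, f0, f1):
    """h*(x) = 1 if r(x) > 1/2, else 0."""
    return 1 if regression_function(x, prior1, f0, f1) > 0.5 else 0

# Assumed example densities and prior (not from the text).
f0 = lambda x: normal_pdf(x, 0.0, 1.0)
f1 = lambda x: normal_pdf(x, 2.0, 1.0)
prior1 = 0.5

for x in (-1.0, 0.5, 1.5, 3.0):
    print(x, round(regression_function(x, prior1, f0, f1), 3), bayes_rule(x, prior1, f0, f1))
```

With equal priors and equal variances the decision boundary in this example is the midpoint $x = 1$ between the two means.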

**Theorem 23.5**.  The Bayes rule is optimal, that is, if $h$ is any classification rule then $L(h^*) \leq L(h)$.

The Bayes rule depends on unknown quantities so we need to use the data to find some approximation to the Bayes rule.  At the risk of oversimplifying, there are three main approaches:

1. **Empirical Risk Minimization**.  Choose a set of classifiers $\mathcal{H}$ and find $\hat{h} \in \mathcal{H}$ that minimizes some estimate of $L(h)$.

2. **Regression**.  Find an estimate $\hat{r}$ of the regression function $r$ and define

$$ 
\hat{h}(x) = \begin{cases}
1 & \text{if } \hat{r}(x) > \frac{1}{2} \\
0 & \text{otherwise}
\end{cases}
$$

3. **Density Estimation**.  Estimate $f_0$ from the $X_i$'s for which $Y_i = 0$, estimate $f_1$ from the $X_i$'s for which $Y_i = 1$, and let $\hat{\pi} = n^{-1} \sum_{i=1}^n Y_i$.  Define

$$ \hat{r}(x) = \hat{\mathbb{P}}(Y = 1 | X = x) = \frac{\hat{\pi} \hat{f}_1(x)}{\hat{\pi} \hat{f}_1(x) + (1 - \hat{\pi}) \hat{f}_0(x)} $$

and

$$ 
\hat{h}(x) = \begin{cases}
1 & \text{if } \hat{r}(x) > \frac{1}{2} \\
0 & \text{otherwise}
\end{cases}
$$
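The density estimation approach can be sketched as follows. The text does not say how $f_0$ and $f_1$ are estimated; this sketch assumes a one-dimensional Gaussian kernel density estimator with a hand-picked bandwidth, and the simulated data and function names are hypothetical.

```python
import numpy as np

def kde(sample, bandwidth):
    """Return a Gaussian kernel density estimate built from a 1-D sample."""
    sample = np.asarray(sample, dtype=float)
    def f_hat(x):
        z = (x - sample) / bandwidth
        kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
        return float(np.mean(kernel) / bandwidth)
    return f_hat

def fit_density_classifier(X, Y, bandwidth=0.5):
    """Plug estimates of f_0, f_1 and pi into the formula for r(x)."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y)
    f0_hat = kde(X[Y == 0], bandwidth)   # uses only the X_i with Y_i = 0
    f1_hat = kde(X[Y == 1], bandwidth)   # uses only the X_i with Y_i = 1
    pi_hat = Y.mean()                    # n^{-1} * sum of Y_i
    def r_hat(x):
        num = pi_hat * f1_hat(x)
        return num / (num + (1.0 - pi_hat) * f0_hat(x))
    def h_hat(x):
        return int(r_hat(x) > 0.5)
    return r_hat, h_hat

# Simulated data (an assumption): class 0 ~ N(0, 1), class 1 ~ N(3, 1).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(3, 1, 200)])
Y = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

r_hat, h_hat = fit_density_classifier(X, Y)
print(h_hat(-1.0), h_hat(4.0))
```

The bandwidth controls how smooth the estimated densities are; in practice it would be chosen from the data rather than fixed in advance.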

We now generalize to the case where $Y$ takes more than two values:

**Theorem 23.6**.  Suppose that $Y \in \mathcal{Y} = \{ 1, \dots, K \}$.  The optimal rule is

$$ h(x) = \text{argmax}_k \mathbb{P}(Y = k | X = x) = \text{argmax}_k \pi_k f_k(x) $$

where

$$ \mathbb{P}(Y = k | X = x) = \frac{f_k(x) \pi_k}{\sum_r f_r(x) \pi_r} $$

and $ \pi_r = \mathbb{P}(Y = r)$, $f_r(x) = f(x | Y = r)$.
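A minimal sketch of the multiclass rule, assuming the priors and class densities are known. The three Gaussian classes and their priors below are made up for illustration.

```python
import math
import numpy as np

def normal_density(mean, sd):
    """Return the density function of N(mean, sd^2)."""
    def f(x):
        z = (x - mean) / sd
        return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return f

def posterior(x, priors, densities):
    """Vector of P(Y = k | X = x), k = 1..K, from priors pi_k and densities f_k."""
    joint = np.array([p * f(x) for p, f in zip(priors, densities)])
    return joint / joint.sum()

def bayes_rule_multiclass(x, priors, densities):
    # Classes are labelled 1..K as in the theorem, hence the +1.
    return int(np.argmax(posterior(x, priors, densities))) + 1

# Assumed example: K = 3 classes with unit variance and different means.
priors = [0.2, 0.5, 0.3]
densities = [normal_density(-2.0, 1.0), normal_density(0.0, 1.0), normal_density(2.0, 1.0)]

for x in (-3.0, 0.0, 3.0):
    print(x, bayes_rule_multiclass(x, priors, densities))
```

Because the denominator $\sum_r f_r(x) \pi_r$ is the same for every $k$, taking the argmax of the posterior and the argmax of $\pi_k f_k(x)$ give the same class.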

### 23.3 Gaussian and Linear Classifiers

Perhaps the simplest approach to classification is to use the density estimation strategy and assume a parametric model for the densities.  Suppose that $\mathcal{Y} = \{ 0, 1 \}$ and that $f_0(x) = f(x | Y = 0)$ and $f_1(x) = f(x | Y = 1)$ are both multivariate Gaussians:

$$ f_k(x) = \frac{1}{(2\pi)^{d/2} | \Sigma_k |^{1/2}} \exp \left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right\}, \quad k = 0, 1$$

Thus, $X | Y = 0 \sim N(\mu_0, \Sigma_0)$ and $X | Y = 1 \sim N(\mu_1, \Sigma_1)$.

**Theorem 23.7**.  If $X | Y = 0 \sim N(\mu_0, \Sigma_0)$ and $X | Y = 1 \sim N(\mu_1, \Sigma_1)$, then the Bayes rule is

$$
h^*(x) = \begin{cases}
1 & \text{if } r_1^2 < r_0^2 + 2 \log \left( \frac{\pi_1}{\pi_0} \right) + \log \left( \frac{| \Sigma_0 | }{ | \Sigma_1| }
\right) \\
0 & \text{otherwise} 
\end{cases}
$$

where

$$ r_i^2 = (x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i), \quad i = 0, 1 $$

is the **Mahalanobis distance**.  An equivalent way of expressing the Bayes rule is 

$$ h(x) = \text{argmax}_k \delta_k(x) $$

where

$$ \delta_k(x) = -\frac{1}{2} \log | \Sigma_k | - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k $$

and $|A|$ denotes the determinant of matrix $A$.
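To see why the two forms agree, here is a short derivation sketch. Write $\pi_1 = \pi$, $\pi_0 = 1 - \pi$ and $r_k^2 = (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)$. The Bayes rule predicts 1 when $\pi_1 f_1(x) > \pi_0 f_0(x)$. Taking logs and cancelling the common factor $(2\pi)^{-d/2}$ gives

$$ \log \pi_1 - \frac{1}{2} \log | \Sigma_1 | - \frac{1}{2} r_1^2 > \log \pi_0 - \frac{1}{2} \log | \Sigma_0 | - \frac{1}{2} r_0^2 $$

which is exactly $\delta_1(x) > \delta_0(x)$.  Multiplying by 2 and rearranging,

$$ r_1^2 < r_0^2 + 2 \log \left( \frac{\pi_1}{\pi_0} \right) + \log \left( \frac{| \Sigma_0 |}{| \Sigma_1 |} \right) $$

which is the condition in Theorem 23.7.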

The decision boundary of the above classifier is quadratic so this procedure is often called **quadratic discriminant analysis (QDA)**.  In practice, we use sample estimates of $\pi_0, \pi_1, \mu_0, \mu_1, \Sigma_0, \Sigma_1$ in place of the true values, namely:

$$
\begin{array}{cc}
\hat{\pi}_0 = \frac{1}{n} \sum_{i=1}^n (1 - Y_i) & \hat{\pi}_1 = \frac{1}{n} \sum_{i=1}^n Y_i \\
\hat{\mu}_0 = \frac{1}{n_0} \sum_{i: Y_i = 0} X_i & \hat{\mu}_1 = \frac{1}{n_1} \sum_{i: Y_i = 1} X_i \\
S_0 = \frac{1}{n_0} \sum_{i: Y_i = 0} (X_i - \hat{\mu}_0) (X_i - \hat{\mu}_0)^T & 
S_1 = \frac{1}{n_1} \sum_{i: Y_i = 1} (X_i - \hat{\mu}_1) (X_i - \hat{\mu}_1)^T
\end{array}
$$

where $n_0 = \sum_i (1 - Y_i)$ and $n_1 = \sum_i Y_i$ are the numbers of observations with $Y_i = 0$ and $Y_i = 1$, respectively.
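A minimal NumPy sketch of QDA using these plug-in estimates. The simulated data and the function names are assumptions for illustration; the covariance estimates divide by $n_k$, matching the formulas above.

```python
import numpy as np

def fit_qda(X, Y):
    """Return {k: (pi_hat_k, mu_hat_k, S_k)} for k in {0, 1}."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y)
    params = {}
    for k in (0, 1):
        Xk = X[Y == k]
        mu = Xk.mean(axis=0)
        centered = Xk - mu
        S = centered.T @ centered / len(Xk)   # divides by n_k, not n_k - 1
        params[k] = (len(Xk) / len(X), mu, S)
    return params

def qda_discriminant(x, prior, mu, S):
    """delta_k(x) = -1/2 log|S| - 1/2 (x - mu)^T S^{-1} (x - mu) + log prior."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(S, diff) + np.log(prior)

def qda_predict(x, params):
    x = np.asarray(x, dtype=float)
    # Ties go to class 0, matching the "otherwise" branch of the rule.
    return max(params, key=lambda k: qda_discriminant(x, *params[k]))

# Simulated data (an assumption): the classes differ in mean and covariance.
rng = np.random.default_rng(1)
X0 = rng.normal(size=(100, 2))
X1 = 2.0 * rng.normal(size=(100, 2)) + np.array([3.0, 3.0])
X = np.vstack([X0, X1])
Y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

params = fit_qda(X, Y)
print(qda_predict([0.0, 0.0], params), qda_predict([3.0, 3.0], params))
```

Using `np.linalg.solve` and `np.linalg.slogdet` avoids forming $S_k^{-1}$ and $|S_k|$ explicitly, which tends to be more stable numerically.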

A simplification occurs if we assume $\Sigma_0 = \Sigma_1 = \Sigma$.  In that case, the Bayes rule is

$$ h(x) = \text{argmax}_k \delta_k(x) $$

where now

$$ \delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k $$
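The linear form follows from expanding the quadratic in the QDA discriminant.  With $\Sigma_k = \Sigma$ for every $k$, and using the symmetry of $\Sigma^{-1}$,

$$ -\frac{1}{2} \log | \Sigma | - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) + \log \pi_k
= \left[ -\frac{1}{2} \log | \Sigma | - \frac{1}{2} x^T \Sigma^{-1} x \right] + x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k $$

The bracketed term does not depend on $k$, so dropping it leaves the argmax unchanged.  What remains is linear in $x$.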

The parameters are estimated as before, except that the MLE of $\Sigma$ is now

$$ S = \frac{n_0 S_0 + n_1 S_1}{n_0 + n_1} $$

The classification rule is

$$
h^*(x) = \begin{cases}
1 &\text{if } \delta_1(x) > \delta_0(x) \\
0 &\text{otherwise}
\end{cases}
$$

where

$$ \delta_j(x) = x^T S^{-1} \hat{\mu}_j - \frac{1}{2} \hat{\mu}_j^T S^{-1} \hat{\mu}_j + \log \hat{\pi}_j $$

is called the **discriminant function**.  The decision boundary $ \{ x : \delta_0(x) = \delta_1(x) \}$ is linear so this method is called **linear discriminant analysis (LDA)**.
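A minimal NumPy sketch of LDA with the pooled covariance estimate $S = (n_0 S_0 + n_1 S_1)/(n_0 + n_1)$. The simulated data and the function names are assumptions for illustration.

```python
import numpy as np

def fit_lda(X, Y):
    """Return class priors, class means and the pooled covariance S."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y)
    n, d = X.shape
    priors, means = {}, {}
    scatter = np.zeros((d, d))
    for k in (0, 1):
        Xk = X[Y == k]
        priors[k] = len(Xk) / n
        means[k] = Xk.mean(axis=0)
        centered = Xk - means[k]
        scatter += centered.T @ centered     # equals n_k * S_k
    S = scatter / n                          # (n_0 S_0 + n_1 S_1) / (n_0 + n_1)
    return priors, means, S

def lda_discriminant(x, prior, mu, S):
    """delta_j(x) = x^T S^{-1} mu - 1/2 mu^T S^{-1} mu + log prior."""
    S_inv_mu = np.linalg.solve(S, mu)
    return x @ S_inv_mu - 0.5 * mu @ S_inv_mu + np.log(prior)

def lda_predict(x, priors, means, S):
    x = np.asarray(x, dtype=float)
    d0 = lda_discriminant(x, priors[0], means[0], S)
    d1 = lda_discriminant(x, priors[1], means[1], S)
    return int(d1 > d0)

# Simulated data (an assumption): same covariance, different means.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(150, 2))
X1 = rng.normal(size=(150, 2)) + np.array([2.0, 2.0])
X = np.vstack([X0, X1])
Y = np.concatenate([np.zeros(150, dtype=int), np.ones(150, dtype=int)])

priors, means, S = fit_lda(X, Y)
print(lda_predict([-1.0, -1.0], priors, means, S), lda_predict([3.0, 3.0], priors, means, S))
```

The difference $\delta_1(x) - \delta_0(x)$ has the form $w^T x + b$ with $w = S^{-1}(\hat{\mu}_1 - \hat{\mu}_0)$, which is why the boundary is a hyperplane.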

Now we generalize to the case where $Y$ takes on more than two values.

**Theorem 23.9**.  Suppose that $Y \in \{ 1, \dots, K \}$.  If $f_k(x) = f(x | Y = k)$ is Gaussian, the Bayes rule is

$$ h(x) = \text{argmax}_k \delta_k(x) $$

where

$$ \delta_k(x) = -\frac{1}{2} \log | \Sigma_k | - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k $$

If the covariance matrices of the Gaussians are equal then

$$ \delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k $$

We estimate $\delta_k(x)$ by inserting estimates of $\mu_k$, $\Sigma_k$, and $\pi_k$.
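The plug-in version of the $K$-class rule can be sketched as below, with a separate covariance estimate per class. The three simulated classes and all function names are assumptions for illustration.

```python
import numpy as np

def fit_gaussian_classes(X, Y):
    """Return {k: (pi_hat_k, mu_hat_k, Sigma_hat_k)} for each label k in Y."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y)
    model = {}
    for k in np.unique(Y):
        Xk = X[Y == k]
        mu = Xk.mean(axis=0)
        centered = Xk - mu
        model[int(k)] = (len(Xk) / len(X), mu, centered.T @ centered / len(Xk))
    return model

def delta(x, prior, mu, Sigma):
    """Estimated Gaussian discriminant delta_k(x)."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sigma, diff) + np.log(prior)

def predict(x, model):
    """Return the label k that maximizes the estimated delta_k(x)."""
    x = np.asarray(x, dtype=float)
    return max(model, key=lambda k: delta(x, *model[k]))

# Simulated data (an assumption): K = 3 classes centred at different points.
rng = np.random.default_rng(2)
centers = {1: (0.0, 0.0), 2: (4.0, 0.0), 3: (0.0, 4.0)}
X = np.vstack([rng.normal(size=(60, 2)) + np.array(c) for c in centers.values()])
Y = np.repeat(list(centers.keys()), 60)

model = fit_gaussian_classes(X, Y)
print([predict(c, model) for c in centers.values()])
```

For the equal-covariance case one would instead pool the class scatter matrices into a single estimate of $\Sigma$ and use the linear form of $\delta_k$.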