# Naive Bayes Classifiers

We want to classify vectors of discrete-valued features, $\mathbf{x}\in\{1,\ldots,K\}^D$, where $K$ is the number of values each feature can take and $D$ is the number of features. If we use a generative approach, we need to specify the class-conditional distribution $p(\mathbf{x}|y=c)$ for each class $c\in\{1, 2, \ldots, C\}$, where $C$ is the number of classes.

The simplest approach is to assume that the features are **conditionally independent** given the class label. This means we can write the class conditional density as a product of one-dimensional densities:

$$
p(\mathbf{x}|y=c, \theta) = \prod_{j=1}^Dp(x_j|y=c,\theta_{jc})
$$

The resulting model is called a Naive Bayes Classifier (NBC). This has $O(CD)$ parameters, for $C$ classes and $D$ features.

## Model fitting
'Training' a Naive Bayes classifier usually means computing the MLE or MAP estimate for the parameters. We can also compute the full posterior $p(\theta|\mathcal{D})$.

Let's assume throughout that each vector $\mathbf{x}$ is in the set $\{0, 1\}^D$ for the integer $D$, which is the number of features. In other words, each component $x_j$ for $j=1,\ldots,D$ can take one of two possible values.

### Example: classifying documents into $K$ different categories
For example, $y=1$ might correspond to a *sports* category, $y=2$ might correspond to a *politics* category, and so on.

The label $y^{(i)}$ represents the category of the $i$-th document in the dataset. Each component $x_j^{(i)}$ for $j=1,\ldots,D$ might represent the presence or absence of a particular word. For example, we might define $x_1^{(i)}$ to be 1 if the $i$-th document contains the word *Giants*, or zero otherwise; $x_2^{(i)}$ to be 1 if the $i$-th document contains the word *Obama* and zero otherwise, and so on.
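The encoding described above can be sketched in a few lines of Python. This is only an illustration: the vocabulary, the example sentences, and the `encode` helper are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
import numpy as np

# Hypothetical vocabulary: feature j is 1 if the document contains vocab[j].
vocab = ["giants", "obama", "election", "touchdown"]

def encode(document, vocab):
    """Return a binary vector x with x[j] = 1 if vocab[j] occurs in the document."""
    tokens = set(document.lower().split())
    return np.array([1 if word in tokens else 0 for word in vocab])

# Two made-up documents, one per category.
docs = [
    "The Giants scored a late touchdown",
    "Obama spoke about the election",
]

# X has one row per document and one column per vocabulary word.
X = np.stack([encode(d, vocab) for d in docs])
print(X)
```

Each row of `X` is one vector $\mathbf{x}^{(i)}\in\{0,1\}^D$ with $D$ equal to the vocabulary size.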

The Naive Bayes model is derived as follows: we assume random variables $Y$ and $\mathbf{x}_1,\ldots,\mathbf{x}_D$ corresponding to the label $y$ and vector components $x_1,x_2,\ldots,x_D$. Our task is to model the joint probability

$$
p(y=c, \mathbf{x}_1=x_1, \mathbf{x}_2=x_2, \ldots, \mathbf{x}_D = x_D)
$$

for any label $y$ paired with attribute values $x_1,\ldots,x_D$. The key idea in the Naive Bayes model is the following assumption:

$$
p(y=c, \mathbf{x}_1=x_1, \mathbf{x}_2=x_2, \ldots, \mathbf{x}_D = x_D) = p(y=c)\prod_{j=1}^D p(\mathbf{x}_j=x_j|y=c)
$$

This means we can write

$$
p(\mathbf{x}|y = c) = \prod_{j=1}^Dp(\mathbf{x}_j=x_j|y=c)
$$

Following this equation, the NB model has two types of parameters: $\pi(y)$ for $y \in\{1,\ldots,K\}$, with 

$$
p(y=c) = \pi(c)
$$

and 

$\theta_j(x|y)$ for $j\in\{1,\ldots,D\}$, $x\in\{0, 1\}$, $y\in\{1,\ldots,K\}$, with

$$
p(\mathbf{x}_j = x_j|y=c) = \theta_j(x_j|c)
$$

We then have

$$
p(y, x_1,\ldots, x_D) = \pi(y)\prod_{j=1}^D \theta_j(x_j|y)
$$
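As a rough numerical illustration of this factorization, the sketch below evaluates the joint probability for binary features. The array layout is an assumption made here, not part of the model: labels are 0-indexed, `pi[y]` stores $\pi(y)$, and `theta[y, j]` stores $\theta_j(1|y)$, so $\theta_j(0|y)$ is `1 - theta[y, j]`. The parameter values are made up.

```python
import numpy as np

def joint_prob(x, y, pi, theta):
    """Evaluate p(y, x_1..x_D) = pi(y) * prod_j theta_j(x_j | y) for binary x."""
    x = np.asarray(x)
    # Pick theta_j(1|y) where x_j = 1 and theta_j(0|y) = 1 - theta_j(1|y) otherwise.
    per_feature = np.where(x == 1, theta[y], 1.0 - theta[y])
    return pi[y] * np.prod(per_feature)

# Made-up parameters for K = 2 labels and D = 2 features.
pi = np.array([0.5, 0.5])
theta = np.array([[0.8, 0.1],
                  [0.2, 0.7]])

p = joint_prob([1, 0], 0, pi, theta)  # 0.5 * 0.8 * (1 - 0.1)

# Summing the joint over every label and every binary vector should give 1.
total = sum(joint_prob([a, b], y, pi, theta)
            for y in range(2) for a in (0, 1) for b in (0, 1))
```

The final sum is a quick sanity check that the factored expression defines a valid joint distribution.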

### Definition 1: Naive Bayes Model
A Naive Bayes model consists of an integer $K$ specifying the number of possible labels, an integer $D$ specifying the number of features, and in addition the following parameters:
* $\pi(y)$ for any $y\in\{1,\ldots,K\}$. The parameter $\pi(y)$ can be interpreted as the probability of seeing the label $y$. We have the constraints $\pi(y)\ge 0$ and $\sum_{y=1}^K\pi(y)=1$.
* A parameter $\theta_j(x|y)$ for any  $j\in\{1,\ldots,D\}, x\in\{0, 1\}, y\in\{1,\ldots,K\}$. 

The value for $\theta_j(x|y)$ can be interpreted as the probability of feature $j$ taking the value $x$, conditioned on the underlying label being $y$. We have the constraints that $\theta_j(x|y)\ge 0$, and for all $y, j$, $\sum_{x\in\{0, 1\}}\theta_j(x|y) = 1$.

Once the parameters have been estimated, given a new test example $\hat{\mathbf{x}} = (x_1,x_2,\ldots,x_D)$, the output of the NB classifier is

$$
\mathrm{argmax}_{y\in\{1,\ldots,K\}}p(y,x_1,\ldots,x_D) = \mathrm{argmax}_{y\in\{1,\ldots,K\}}\left(\pi(y)\prod_{j=1}^D\theta_j(x_j|y)\right)
$$
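A minimal sketch of this decision rule follows. It works with log-probabilities rather than the raw product, a common choice (not stated in the text above) that avoids numerical underflow when $D$ is large and does not change the argmax. It assumes 0-indexed labels, `theta[y, j]` holding $\theta_j(1|y)$, and all parameters strictly between 0 and 1; the parameter values are made up.

```python
import numpy as np

def predict(x, pi, theta):
    """Return argmax_y of log pi(y) + sum_j log theta_j(x_j | y) for binary x."""
    x = np.asarray(x)
    # Row y of the bracketed array holds log theta_j(x_j | y) for each feature j.
    log_joint = np.log(pi) + (x * np.log(theta)
                              + (1 - x) * np.log1p(-theta)).sum(axis=1)
    return int(np.argmax(log_joint))

# Made-up parameters for K = 2 labels and D = 2 features.
pi = np.array([0.5, 0.5])
theta = np.array([[0.8, 0.1],
                  [0.2, 0.7]])

label_a = predict([1, 0], pi, theta)  # joint is 0.36 for label 0 vs 0.03 for label 1
label_b = predict([0, 1], pi, theta)  # joint is 0.01 for label 0 vs 0.28 for label 1
```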

### MLE
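In this section we write $\pi_c$ for $\pi(c)$, $\boldsymbol\theta_{jc}$ for the parameters of feature $j$ in class $c$, and $C$ for the number of classes (denoted $K$ in Definition 1). The probability for a single data case is given by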

$$
p(\mathbf{x}_i, y_i|\theta) = p(y_i|\pi)\prod_jp(x_{ij}|\theta_j) = \prod_c\pi_c^{\mathbb{I}(y_i=c)}\prod_j\prod_c p(x_{ij}|\boldsymbol\theta_{jc})^{\mathbb{I}(y_i=c)}
$$

Hence the log-likelihood is given by

$$
\log p(\mathcal{D}|\boldsymbol\theta) = \sum_{c=1}^C N_c\log\pi_c + \sum_{j=1}^D\sum_{c=1}^C\sum_{i:y_i=c}\log p(x_{ij}|\boldsymbol\theta_{jc})
$$

We see that this expression decomposes into a series of terms, one concerning $\boldsymbol\pi$, and $DC$ terms containing the $\boldsymbol\theta_{jc}$s. Hence we can optimize all parameters separately.
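To make the decomposition concrete, here is a small sketch that evaluates the log-likelihood for binary features as the sum of the class-prior term and the feature term. It assumes 0-indexed labels, `theta[c, j]` holding $p(x_j=1|y=c)$, and parameters strictly between 0 and 1; the data and parameter values are made up.

```python
import numpy as np

def log_likelihood(X, y, pi, theta):
    """Log-likelihood of binary data under a Naive Bayes model."""
    X = np.asarray(X)
    y = np.asarray(y)
    # Class-prior term: sum_c N_c * log pi_c.
    N_c = np.bincount(y, minlength=len(pi))
    prior_term = np.sum(N_c * np.log(pi))
    # Feature term: each example uses the parameters of its own class.
    th = theta[y]  # shape (N, D)
    feature_term = np.sum(X * np.log(th) + (1 - X) * np.log1p(-th))
    return prior_term + feature_term

pi = np.array([0.5, 0.5])
theta = np.array([[0.8, 0.1],
                  [0.2, 0.7]])
X = [[1, 0], [0, 1]]
y = [0, 1]

ll = log_likelihood(X, y, pi, theta)  # log(0.36) + log(0.28)
```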

The MLE for the class prior is given by 

$$
\hat{\pi}_c = \frac{N_c}{N}
$$

where $N_c\triangleq \sum_i\mathbb{I}(y_i=c)$ is the number of examples in class $c$.
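One way to obtain this result, sketched here with a Lagrange multiplier $\lambda$ (a symbol introduced only for this derivation) that enforces $\sum_c\pi_c=1$, is to maximize the part of the log-likelihood that involves $\boldsymbol\pi$:

$$
\begin{aligned}
\mathcal{L}(\boldsymbol\pi,\lambda) &= \sum_{c=1}^C N_c\log\pi_c + \lambda\left(1-\sum_{c=1}^C\pi_c\right) \\
\frac{\partial \mathcal{L}}{\partial \pi_c} &= \frac{N_c}{\pi_c} - \lambda = 0 \;\Rightarrow\; \pi_c = \frac{N_c}{\lambda} \\
\sum_{c=1}^C \pi_c = 1 &\;\Rightarrow\; \lambda = \sum_{c=1}^C N_c = N \;\Rightarrow\; \hat{\pi}_c = \frac{N_c}{N}
\end{aligned}
$$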

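The MLE for the likelihood depends on the type of distribution we choose to use for each feature. For simplicity, let us suppose all features are binary, so $x_j|y = c\sim \mathrm{Ber}(\theta_{jc})$. In this case, writing $N_{jc}\triangleq\sum_i\mathbb{I}(x_{ij}=1, y_i=c)$ for the number of examples in class $c$ in which feature $j$ is on, the MLE becomes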

$$
\hat{\theta}_{jc} = \frac{N_{jc}}{N_c}
$$
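Both estimates reduce to counting, as the following sketch illustrates. The function name `fit_mle` and the toy data are hypothetical; labels are assumed to be 0-indexed integers, and every class is assumed to appear at least once (otherwise the division by $N_c$ is undefined).

```python
import numpy as np

def fit_mle(X, y, num_classes):
    """Return (pi_hat, theta_hat) with theta_hat[c, j] = N_jc / N_c."""
    X = np.asarray(X)
    y = np.asarray(y)
    N = X.shape[0]
    # N_c: number of examples in each class.
    N_c = np.bincount(y, minlength=num_classes)
    # N_jc: number of examples in class c in which feature j equals 1.
    N_jc = np.stack([X[y == c].sum(axis=0) for c in range(num_classes)])
    pi_hat = N_c / N
    theta_hat = N_jc / N_c[:, None]
    return pi_hat, theta_hat

# Toy data: 4 examples, 2 binary features, 2 classes.
X = [[1, 0], [1, 1], [0, 1], [0, 1]]
y = [0, 0, 1, 1]

pi_hat, theta_hat = fit_mle(X, y, num_classes=2)
# pi_hat    -> [0.5, 0.5]
# theta_hat -> [[1.0, 0.5],
#               [0.0, 1.0]]
```

Note that with so little data some estimates are exactly 0 or 1, because feature 0 is always on in class 0 and never on in class 1.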

## Bayesian Naive Bayes
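The trouble with maximum likelihood is that it can overfit. For example, if a word never appears in the training documents of class $c$, then $\hat{\theta}_{jc}=0$, and any test document containing that word is assigned probability zero under class $c$. A simple solution to overfitting is to be Bayesian. For simplicity, we will use a factored prior: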

$$
p(\boldsymbol\theta) = p(\boldsymbol\pi)\prod_{j=1}^D\prod_{c=1}^Cp(\theta_{jc})
$$

We will use a $\mathrm{Dir}(\boldsymbol\alpha)$ prior for $\boldsymbol\pi$ and a $\mathrm{Beta}(\beta_0, \beta_1)$ prior for each $\theta_{jc}$. Often we just take $\boldsymbol\alpha=1$ and $\boldsymbol\beta=1$, corresponding to add-one or Laplace smoothing.

Combining the factored likelihood with the factored prior above gives the following factored posterior:

$$
\begin{aligned}
p(\boldsymbol\theta|\mathcal{D}) & = p(\boldsymbol\pi|\mathcal{D})\prod_{j=1}^D\prod_{c=1}^C p(\theta_{jc}|\mathcal{D}) \\
p(\boldsymbol\pi|\mathcal{D}) & = \mathrm{Dir}(N_1 + \alpha_1, \ldots, N_C+\alpha_C) \\
p(\theta_{jc}|\mathcal{D}) & = \mathrm{Beta}((N_c - N_{jc}) + \beta_0, N_{jc} +\beta_1)
\end{aligned}
$$

In other words, to compute the posterior, we just update the prior counts with the empirical counts from the likelihood.
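The count update can be sketched as follows. As a point summary the code reports posterior means, which is a choice made here for illustration rather than something the text prescribes. It assumes that $\beta_1$ is the pseudo-count for $x_j=1$ and $\beta_0$ the pseudo-count for $x_j=0$, that all $\alpha_c$ share one value `alpha`, and that labels are 0-indexed; the function name and toy data are hypothetical.

```python
import numpy as np

def posterior_means(X, y, num_classes, alpha=1.0, beta0=1.0, beta1=1.0):
    """Posterior means of pi and theta after adding pseudo-counts to the data counts."""
    X = np.asarray(X)
    y = np.asarray(y)
    N = X.shape[0]
    N_c = np.bincount(y, minlength=num_classes)
    N_jc = np.stack([X[y == c].sum(axis=0) for c in range(num_classes)])
    # Mean of Dir(N_1 + alpha, ..., N_C + alpha).
    pi_mean = (N_c + alpha) / (N + num_classes * alpha)
    # Mean probability that x_j = 1 in class c: (N_jc + beta1) / (N_c + beta0 + beta1).
    theta_mean = (N_jc + beta1) / (N_c[:, None] + beta0 + beta1)
    return pi_mean, theta_mean

# Toy data: 4 examples, 2 binary features, 2 classes.
X = [[1, 0], [1, 1], [0, 1], [0, 1]]
y = [0, 0, 1, 1]

pi_mean, theta_mean = posterior_means(X, y, num_classes=2)
# pi_mean    -> [0.5, 0.5]
# theta_mean -> [[0.75, 0.5],
#                [0.25, 0.75]]
```

With all pseudo-counts set to 1, this is add-one smoothing: feature 0 was never on in class 1, yet its estimated probability is 0.25 rather than 0.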