# Naive Bayes

A simple (naive) classifier that tends to work well with unstructured text data (e.g. spam detection).

### Pros

* Works well when # of features >> # of observations
* Good at online learning (e.g. streaming data), since the model is just counts that can be updated incrementally
* Simple to implement

### Cons

* Works poorly with irrelevant features (unlike trees)
* Can be outperformed by more complicated models

## How does it work?

Example with spam detection in email:

Suppose $c$ is the classification of the email, "spam" or "not spam", and $X$ is our document (a set of words $X = \{ x_i \}$ in our vocabulary).

Remember Bayes' Theorem:

$$ P(c | X) = \frac{P(X | c) P(c)}{P(X)} $$

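If we assume that each word $x_i$ appears in $X$ independently of the others given the class $c$ (which is probably not true, but...), then $P(X | c)$ factors into a product over words. Since $P(X)$ is the same for every class, we can drop it: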

$$ P(c | X) \propto P(c) \times P(x_1 | c) \times \cdots \times P(x_n | c) $$

### Training the model

To compute $P(c | X)$ we need to know $P(c)$ and $P(x_i | c)$. 

* $P(c)$ = the probability that any given email is of class $c$ (e.g. spam). Estimate this from your data or prior knowledge (e.g. you know 4 of every 10 emails are spam).

* $\displaystyle P(x | c) = \frac{\text{# of times x appears in emails of class c}}{\text{# of words in emails of class c}}$
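As a minimal sketch of these two estimates, the snippet below counts classes and words in a tiny made-up corpus. The `emails` data, the `train` function, and the choice to estimate $P(c)$ from class frequencies (rather than from prior knowledge) are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: (class label, tokenized email) pairs.
emails = [
    ("spam", ["win", "money", "now"]),
    ("spam", ["win", "prize", "money"]),
    ("ham", ["meeting", "at", "noon"]),
    ("ham", ["lunch", "money", "later"]),
    ("ham", ["see", "you", "at", "lunch"]),
]

def train(emails):
    """Estimate P(c) and unsmoothed P(x | c) from raw counts."""
    class_counts = Counter(label for label, _ in emails)
    word_counts = {label: Counter() for label in class_counts}
    for label, words in emails:
        word_counts[label].update(words)

    # P(c): fraction of emails with label c (one possible choice of prior).
    priors = {c: n / len(emails) for c, n in class_counts.items()}

    # P(x | c): occurrences of x in class c / total words in class c.
    likelihoods = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        likelihoods[c] = {w: n / total for w, n in counts.items()}
    return priors, likelihoods

priors, likelihoods = train(emails)
print(priors["spam"])                # 2 of the 5 emails are spam
print(likelihoods["spam"]["money"])  # "money" is 2 of the 6 spam words
```

Note that a word such as "meeting" has no entry at all under "spam" here, because it never occurred in a spam email.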

__This doesn't quite work. Why not?__

### Laplace Smoothing

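* Add a small constant $\alpha$ to every count, so that a word never seen in class $c$ during training doesn't get probability 0 and wipe out the whole product: $$P(x | c) = \frac{(\text{# of times x appears in emails of class c}) + \alpha}{(\text{# of words in emails of class c}) + \alpha \times (\text{# of distinct words in vocabulary})}$$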
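A small sketch of the smoothed estimate, using made-up counts. The names `word_counts`, `vocabulary`, and `smoothed_likelihood` are hypothetical. Here the $\alpha$ multiplier in the denominator is taken to be the number of distinct words in the vocabulary, which makes the smoothed probabilities for each class sum to 1.

```python
from collections import Counter

# Hypothetical word counts per class for a toy corpus.
word_counts = {
    "spam": Counter({"win": 2, "money": 2, "now": 1, "prize": 1}),
    "ham": Counter({"meeting": 1, "lunch": 2, "money": 1, "at": 2}),
}

# Vocabulary: every distinct word seen in any class.
vocabulary = set().union(*word_counts.values())

def smoothed_likelihood(word, c, alpha=1.0):
    """P(word | c) with additive (Laplace) smoothing."""
    total = sum(word_counts[c].values())  # of words in emails of class c
    # Counter returns 0 for words it has never seen.
    return (word_counts[c][word] + alpha) / (total + alpha * len(vocabulary))

# "lunch" never appears in spam: alpha=0 gives 0, alpha=1 gives 1/13.
print(smoothed_likelihood("lunch", "spam", alpha=0.0))
print(smoothed_likelihood("lunch", "spam", alpha=1.0))
```

With `alpha=0` the estimate falls back to the raw ratio of counts, so a single unseen word would zero out the product for that class; any `alpha > 0` avoids this.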


## Log Likelihood

* Probability values can be very close to 0, and a product of many of them is smaller still.
* Mathematically, no problem
* Computationally, can get numerical underflow problems.
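To make the computational problem concrete, here is a quick sketch with made-up numbers: a hypothetical 500-word document where every word has likelihood 0.001. The direct product underflows to zero in double precision, while the sum of logs stays well within range.

```python
import math

# Hypothetical document: 500 words, each with P(x_i | c) = 0.001.
probs = [1e-3] * 500

product = 1.0
for p in probs:
    product *= p  # true value is 1e-1500, far below the float range

log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0 (underflow)
print(log_sum)  # about -3453.88
```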

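Since

$$ P(c | X) \propto P(c) \times P(x_1 | c) \times \cdots \times P(x_n | c) $$

we have

$$ \log P(c | X) = \log P(c) + \sum_i \log P(x_i | c) + \text{const} $$

where the constant, $-\log P(X)$, does not depend on $c$ and can be ignored when comparing classes.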

The result of our classifier will be

$$ \arg\max_c P(c | X) = \arg\max_c \left[ \log P(c) + \sum_i \log P(x_i | c) \right] $$
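Putting the pieces together, the sketch below scores each class by its log prior plus summed log likelihoods and returns the argmax. Everything here is illustrative: the `train` and `classify` functions, the toy `emails` corpus, the smoothing with $\alpha = 1$ over the distinct-word vocabulary, and the choice to skip words never seen in training are assumptions, not a reference implementation.

```python
import math
from collections import Counter

def train(emails, alpha=1.0):
    """Collect the counts needed to score a document."""
    class_counts = Counter(label for label, _ in emails)
    word_counts = {c: Counter() for c in class_counts}
    for label, words in emails:
        word_counts[label].update(words)
    vocabulary = set().union(*word_counts.values())
    log_priors = {c: math.log(n / len(emails)) for c, n in class_counts.items()}
    return log_priors, word_counts, vocabulary, alpha

def classify(words, model):
    """Return argmax_c of log P(c) + sum_i log P(x_i | c)."""
    log_priors, word_counts, vocabulary, alpha = model
    scores = {}
    for c, log_prior in log_priors.items():
        denom = sum(word_counts[c].values()) + alpha * len(vocabulary)
        scores[c] = log_prior + sum(
            math.log((word_counts[c][w] + alpha) / denom)
            for w in words
            if w in vocabulary  # assumption: ignore words unseen in training
        )
    return max(scores, key=scores.get)

# Hypothetical toy corpus: (class label, tokenized email) pairs.
emails = [
    ("spam", ["win", "money", "now"]),
    ("spam", ["win", "prize", "money"]),
    ("ham", ["meeting", "at", "noon"]),
    ("ham", ["lunch", "money", "later"]),
    ("ham", ["see", "you", "at", "lunch"]),
]
model = train(emails)
print(classify(["win", "money"], model))
print(classify(["lunch", "at", "noon"], model))
```

On this toy data the first document scores higher under "spam" and the second under "ham".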

