# Naive Bayes Classification

## Jack Bennetto
#### July 12, 2018
Based on Moses Marsh's notebook, with small changes


## Which of these subject lines are from spam emails?

- Need just a little help? Independent senior living might be the answer
- Hard sql from today's interview
- Beautiful Women, Discrete Service
- Ever consider driving with Lyft? Apply here.
- 3/2/18 All Hands Unanswered Questions
- Sizwe Documents


What tipped you off?

Say we have a collection of labeled text documents, each one belonging to a category (for example, news articles & which section of the paper they're in). Now we get a new, unlabeled article. How do we use what we've seen so far to predict which category it belongs to?

- Do a bunch of complicated part-of-speech tagging & account for the sequence of words?
- Make the dumbest "bag of words" assumption possible and hope it works?
  - yep this is what we do

Naive Bayes is a simple-but-effective parametric supervised-learning classifier. It's often used in NLP to predict from a bag of words, but can be used anywhere.


Unlike the other models we've talked about, it's a **generative** (as opposed to **discriminative**) model, meaning that it models the joint probability $P(y \cap x)$ rather than modeling $P(y | x)$ directly.

## Pros/Cons

Pros
* Works with n << p
* Gives class probabilities (good for ranking classes, though often poorly calibrated)
* very fast
* low memory
* handles non-linear boundaries
* automatically handles multiple classes

Cons
* Ignores interactions between features
* Assumes distribution

## What is it really?

Naive Bayes gives the **posterior probability** that a data point is in each class, based on the distribution of the points in each class, *assuming the distributions of the features are independent.*


### Naive Bayes and documents

Notation: a document $\vec{x}$ is a list of words $\{w_1, w_2, \ldots, w_k\}$, and we want the probability that it belongs to class $y$

#### Examples: 
- e-mail: spam or not spam
- news article: is it from the World section, or Sports, or Arts, etc.?
- essay from anonymous author: is the author really a famous person (whose corpus I have access to)?

$$P(y|\vec{x}) = \frac{P(\vec{x}|y)P(y)}{P(\vec{x})}$$

Now here's the wacky part: let's assume the features (words) are totally independent of each other, given the class. Then we can write

$$P(\vec{x}|y) = P(w_1|y)\times P(w_2|y)\times\cdots\times P(w_k|y) $$

This is a pretty brazenly naive assumption. It is assuming, for example, that the probability of seeing the word "ball" in an article about sports $P(\text{ball}|\text{sports})$ is totally independent of the presence of the word "soccer" in the article.

But if it works, it works.

To classify a document $\vec{x}$, we just need to calculate the posterior probabilities for each class $\{y_j\}$, then see which class has the highest posterior probability.

$$P(y_j|\vec{x}) \propto P(w_1|y_j)\times P(w_2|y_j)\times\cdots\times P(w_k|y_j)\times P(y_j)  $$

Note that we've dropped the denominator $P(\vec{x})$ since it doesn't depend on $y_j$, and we're only interested in the $y_j$ that maximizes $P(y_j|\vec{x})$.
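Here's a minimal sketch of that argmax, assuming we already know the priors and the per-word likelihoods. The class names and all the numbers are made up for illustration; nothing here is estimated from data.

```python
# Hypothetical priors P(y) and per-class word probabilities P(w | y).
# These numbers are invented for illustration only.
priors = {"sports": 0.5, "arts": 0.5}
word_probs = {
    "sports": {"ball": 0.20, "soccer": 0.10, "paint": 0.01},
    "arts":   {"ball": 0.02, "soccer": 0.01, "paint": 0.15},
}

def unnormalized_posterior(words, label):
    """Prior times the product of per-word likelihoods (the naive assumption)."""
    score = priors[label]
    for w in words:
        score *= word_probs[label][w]
    return score

def classify(words):
    """Return the class with the highest unnormalized posterior."""
    return max(priors, key=lambda label: unnormalized_posterior(words, label))

doc = ["soccer", "ball", "ball"]
scores = {label: unnormalized_posterior(doc, label) for label in priors}

# Dividing by the sum over classes restores the dropped denominator P(x).
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}

prediction = classify(doc)
```

The normalization step is optional if all we want is the winning class: dividing every score by the same number doesn't change which one is largest.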

# What's the prior, $P(y_j)$ ?

The prior probability of class $y_j$ is simply how frequent that class is among our training documents:
$$P(y_j) = \frac{\text{# of documents of class}\, y_j}{\text{total # of documents}} $$
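As a quick sketch, the prior is just a normalized count of the training labels. The label list below is a made-up toy example, and `class_priors` is a hypothetical helper name.

```python
from collections import Counter

def class_priors(labels):
    """Estimate P(y_j) as the fraction of training documents in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

# Toy training labels (invented for illustration).
labels = ["spam", "ham", "ham", "spam", "ham"]
priors = class_priors(labels)
```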

# Calculating the likelihood

### Option 1: Multinomial Naive Bayes

Imagine that a document $\vec{x}$ of class $y_j$ is generated by drawing words from a bag (with replacement) with probabilities $P(w_i | y_j)$

Let our vocabulary (the set of unique words observed across the entire training corpus) be $p$ words long. Let's write $\vec{x}$ as a p-dimensional vector of word counts: $\vec{x} = [x_1, x_2, \ldots, x_p]$

Then we can write our posterior (which, remember, is the prior times the likelihood) as

$$P(y_j|\vec{x}) \propto P(y_j)\times\prod_{i=1}^p P(w_i|y_j)^{x_i} $$

$$\text{log}(P(y_j|\vec{x})) = \text{log}(P(y_j)) + \sum_{i=1}^p x_i \text{log}(P(w_i|y_j)) + C$$

where $C$ is a constant that doesn't depend on $y_j$.
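One practical reason to prefer the log form is floating-point underflow: a product of many small probabilities rounds to exactly zero, while the sum of their logs stays finite. A small sketch, using a made-up word probability and document length:

```python
import math

word_prob = 0.01   # hypothetical P(w | y), the same for every word
n_words = 200      # hypothetical document length

# Multiply the probabilities directly: 0.01**200 = 1e-400,
# which is smaller than the smallest positive float.
product = 1.0
for _ in range(n_words):
    product *= word_prob

# Add the logs instead: the result is an ordinary negative number.
log_sum = n_words * math.log(word_prob)
```

With the direct product every class would score zero and the argmax would be meaningless; the log scores can still be compared.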

To estimate $P(w_i|y_j)$, we could just use 
$$\frac{N_{ji}}{N_j} = \frac{\text{total count of}\, w_i \,\text{across all documents of class}\, y_j}{\text{total count of all words across all documents of class}\,y_j} $$

But what if our new document contains a word that never appeared in the training documents of class $y$? Then our estimate of $P(w|y)$ for that word would be zero, and then the entire product $P(y|\vec{x})$ would be zero.

To avoid this, we employ **Laplace Smoothing**
$$ P(w_i | y_j) = \frac{\alpha + N_{ji}}{\alpha p + N_j} $$

Usually, $\alpha = 1$
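Putting the pieces together, here's a from-scratch sketch of multinomial Naive Bayes with Laplace smoothing, working in log space. The function names and the toy corpus are made up for illustration, and skipping words that are outside the training vocabulary at prediction time is an assumption of this sketch, not something the formulas above dictate.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(docs, labels, alpha=1.0):
    """docs is a list of token lists. Returns (log_priors, log_likelihoods)."""
    vocab = sorted({w for doc in docs for w in doc})
    p = len(vocab)
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)   # word_counts[y][w] plays the role of N_ji
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)

    log_priors = {y: math.log(n / len(docs)) for y, n in class_doc_counts.items()}
    log_likelihoods = {}
    for y in class_doc_counts:
        n_j = sum(word_counts[y].values())   # N_j: all word tokens in class y
        log_likelihoods[y] = {
            w: math.log((alpha + word_counts[y][w]) / (alpha * p + n_j))
            for w in vocab
        }
    return log_priors, log_likelihoods

def predict_multinomial_nb(doc, log_priors, log_likelihoods):
    """Return (best_class, log_scores). Out-of-vocabulary words are skipped."""
    scores = {}
    for y in log_priors:
        scores[y] = log_priors[y] + sum(
            log_likelihoods[y][w] for w in doc if w in log_likelihoods[y]
        )
    return max(scores, key=scores.get), scores

# Toy corpus (invented for illustration).
docs = [["win", "money", "now"], ["free", "money"],
        ["meeting", "notes"], ["sql", "interview", "notes"]]
labels = ["spam", "spam", "ham", "ham"]

log_priors, log_likelihoods = fit_multinomial_nb(docs, labels)
label, scores = predict_multinomial_nb(["free", "money", "notes"],
                                       log_priors, log_likelihoods)
```

Note that the smoothed probabilities for each class still sum to 1 over the vocabulary, because the $\alpha p$ in the denominator matches the $\alpha$ added to each of the $p$ numerators.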

### Option 2: Bernoulli Naive Bayes


Imagine, now, that a document of class $y_j$ has a probability $P(w_i|y_j)$ of containing word $w_i$ at least once (regardless of word count).

Again let our vocabulary (the set of unique words observed across the entire training corpus) be $p$ words long. Let's write $\vec{x}$ as a p-dimensional vector of word *occurrences*: $x_i = 1$ if $w_i$ is in that document at all, $x_i = 0$ if not: $\vec{x} = [x_1, x_2, \ldots, x_p]$

Then we can think of our document as a series of Bernoulli trials, one for each word, and our posterior is

$$P(y_j|\vec{x}) \propto P(y_j) \times\prod_{i=1}^p P(w_i|y_j)^{x_i}(1 - P(w_i|y_j))^{1 - x_i} $$

$$\text{log}(P(y_j|\vec{x})) = \text{log}(P(y_j)) + \sum_{i=1}^p \left[ x_i \text{log}(P(w_i|y_j)) + (1 - x_i)\text{log}(1 - P(w_i|y_j)) \right] + C$$

where $C$ is a constant that doesn't depend on $y_j$.

To estimate $P(w_i|y_j)$, we could just use 
$$\frac{D_{ji}}{D_j} = \frac{\text{total # of documents of class}\,y_j\text{ containing}\, w_i}{\text{total number of documents of class}\,y_j} $$

But again we'd hit the "unseen word" problem. So our smoothing here looks like
$$ P(w_i | y_j) = \frac{\alpha + D_{ji}}{2\alpha + D_j} $$

Again with $\alpha = 1$ usually.
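The Bernoulli version can be sketched the same way. The key difference is that every vocabulary word contributes to the score, including the ones that are absent from the document. Function names and the toy corpus are again made up for illustration.

```python
import math
from collections import Counter, defaultdict

def fit_bernoulli_nb(docs, labels, alpha=1.0):
    """docs is a list of token lists. Returns (log_priors, probs, vocab)."""
    vocab = sorted({w for doc in docs for w in doc})
    class_doc_counts = Counter(labels)   # D_j
    doc_freq = defaultdict(Counter)      # doc_freq[y][w] plays the role of D_ji
    for doc, y in zip(docs, labels):
        doc_freq[y].update(set(doc))     # count each word once per document

    log_priors = {y: math.log(d / len(docs)) for y, d in class_doc_counts.items()}
    probs = {
        y: {w: (alpha + doc_freq[y][w]) / (2 * alpha + d_j) for w in vocab}
        for y, d_j in class_doc_counts.items()
    }
    return log_priors, probs, vocab

def bernoulli_log_posterior(doc, y, log_priors, probs, vocab):
    """Log prior plus one Bernoulli term per vocabulary word."""
    present = set(doc)
    score = log_priors[y]
    for w in vocab:
        p_w = probs[y][w]
        # Present words contribute log(p); absent words contribute log(1 - p).
        score += math.log(p_w) if w in present else math.log(1.0 - p_w)
    return score

# Toy corpus (invented for illustration).
docs = [["win", "money", "now"], ["free", "money"],
        ["meeting", "notes"], ["sql", "interview", "notes"]]
labels = ["spam", "spam", "ham", "ham"]

log_priors, probs, vocab = fit_bernoulli_nb(docs, labels)
new_doc = ["free", "money", "notes"]
scores = {y: bernoulli_log_posterior(new_doc, y, log_priors, probs, vocab)
          for y in log_priors}
prediction = max(scores, key=scores.get)
```

With $\alpha > 0$, every smoothed probability lies strictly between 0 and 1, so neither logarithm can blow up.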

## Which is better?
- Bernoulli tends to work better with shorter documents...
- ...but sklearn says "It is advisable to evaluate both models, if time permits."

References:

[sklearn User Guide: Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html)

[Spam Filtering with Naive Bayes – Which Naive Bayes?](http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=E870DFA148786F5A27F88CF80FB4D73B?doi=10.1.1.61.5542&rep=rep1&type=pdf)

## Other flavors: Gaussian Naive Bayes
When my features are continuous (but my target is still a class), I can go ahead and assume that data for a given class has normally distributed values for each feature.

Then my likelihood above is just a product of Gaussian probability density functions. Same process!
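Here's a rough sketch of that process for continuous features: estimate a mean and variance per class and per feature, then sum the log Gaussian densities. The data points are invented, and `var_floor` is a hypothetical small constant added here only to avoid dividing by zero when a feature has no spread within a class.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Return {label: (log_prior, means, variances)} from rows X and labels y."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    params = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))   # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        params[label] = (math.log(n / len(X)), means, variances)
    return params

def gaussian_log_posterior(x, log_prior, means, variances):
    """Log prior plus the sum of per-feature log Gaussian densities."""
    score = log_prior
    for v, m, s2 in zip(x, means, variances):
        score += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
    return score

def predict_gaussian_nb(x, params):
    """Return the class with the highest log posterior."""
    return max(params, key=lambda label: gaussian_log_posterior(x, *params[label]))

# Toy data: two well-separated classes (invented for illustration).
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
     [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y = ["a", "a", "a", "b", "b", "b"]

params = fit_gaussian_nb(X, y)
pred_near_a = predict_gaussian_nb([1.1, 2.1], params)
pred_near_b = predict_gaussian_nb([5.1, 5.9], params)
```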

[See here for bugs](http://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf)