# Naive Bayes Classification

## Which of these subject lines are from spam emails?

- Need just a little help? Independent senior living might be the answer
- Hard sql from today's interview
- Beautiful Women, Discrete Service
- Ever consider driving with Lyft? Apply here.
- 3/2/18 All Hands Unanswered Questions
- Sizwe Documents


What tipped you off?

Say we have a collection of labeled text documents, each one belonging to a category (for example, news articles & which section of the paper they're in). Now we get a new, unlabeled article. How do we use what we've seen so far to predict which category it belongs to?

- Do a bunch of complicated part of speech tagging & account for the sequence of words?
- Make the dumbest "bag of words" assumption possible and hope it works?
  - yep this is what we do

Naive Bayes is an extremely simple, amazingly effective machine learning technique.

## When do we use it?
1. n << p (far fewer training examples than features)
2. n small
3. n large
4. streams of input data (online learning)
5. multi-class
6. low memory applications

## What is it really?

It is just [Maximum A Posteriori estimation](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation), with a fun approximation to make it easy to calculate.

### Notation: a document $\vec{x}$ is a list of words $\{w_1, w_2, \ldots, w_k\}$, and we want the probability that it belongs to class $y$

#### Examples: 
- e-mail: spam or not spam
- news article: is it from the World section, Sports, Arts, etc.?
- essay from anonymous author: is the author really a famous person (whose corpus I have access to)?

## $$P(y|\vec{x}) = \frac{P(\vec{x}|y)P(y)}{P(\vec{x})}$$

Now here's the wacky part: let's assume the features (words) are totally independent of each other, given the class. Then we can write

## $$P(\vec{x}|y) = P(w_1|y)\times P(w_2|y)\times\cdots\times P(w_k|y) $$

This is a pretty brazenly naive assumption. It is assuming, for example, that the probability of seeing the word "ball" in an article about sports $P(ball|\text{sports})$ is totally independent of the presence of the word "soccer" in the article.

But if it works, it works.

To classify a document $\vec{x}$, we just need to calculate the posterior probabilities for each class $\{y_j\}$, then see which class has the highest posterior probability.

# $$P(y_j|\vec{x}) \propto P(w_1|y_j)\times P(w_2|y_j)\times\cdots\times P(w_k|y_j)\times P(y_j)  $$

Note that we've dropped the denominator $P(\vec{x})$ since it doesn't depend on $y_j$, and we're only interested in the $y_j$ that maximizes $P(y_j|\vec{x})$.
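Here is a minimal sketch of that decision rule. The class names, words, and probabilities are all made up for illustration; the point is only that we multiply the prior by the per-word likelihoods and pick the class with the largest result.

```python
# Hypothetical priors P(y) and per-class word probabilities P(w | y).
# These numbers are invented for illustration, not estimated from data.
priors = {"sports": 0.5, "arts": 0.5}
likelihoods = {
    "sports": {"ball": 0.20, "goal": 0.15, "paint": 0.01},
    "arts":   {"ball": 0.02, "goal": 0.03, "paint": 0.25},
}

def unnormalized_posterior(words, label):
    """P(y) * prod_i P(w_i | y): the posterior without the P(x) denominator."""
    score = priors[label]
    for w in words:
        score *= likelihoods[label][w]
    return score

doc = ["ball", "goal", "ball"]
scores = {label: unnormalized_posterior(doc, label) for label in priors}

# The predicted class is the one with the largest unnormalized posterior.
prediction = max(scores, key=scores.get)
print(scores)
print(prediction)
```

Because every class shares the same denominator, comparing these unnormalized scores gives the same winner as comparing the true posteriors.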

# What's the prior, $P(y_j)$ ?

The prior probability of class $y_j$ is simply how frequent that class is among our training documents:
$$P(y_j) = \frac{\text{# of documents of class}\, y_j}{\text{total # of documents}} $$

# What about these likelihoods? How do we calculate 'em?

## Option 1: Multinomial Naive Bayes

Imagine that a document $\vec{x}$ of class $y_j$ is generated by drawing words from a bag (with replacement) with probabilities $P(w_i | y_j)$.

Let our vocabulary (the set of unique words observed across the entire training corpus) be $p$ words long. Let's write $\vec{x}$ as a p-dimensional vector of word counts: $\vec{x} = [x_1, x_2, \ldots, x_p]$

Then we can write our posterior (which, remember, is the prior times the likelihood) as
### $$P(y_j|\vec{x}) \propto P(y_j)\times\prod_{i=1}^p P(w_i|y_j)^{x_i} $$

$$\text{log}(P(y_j|\vec{x})) = \text{log}(P(y_j)) + \sum_{i=1}^p x_i \text{log}(P(w_i|y_j)) + \text{const}$$

where the constant does not depend on $y_j$, so we can ignore it when comparing classes.
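One practical reason to work with the log form: a product of many small probabilities underflows to zero in floating point, while a sum of logs stays finite. A small sketch, with made-up numbers (the same hypothetical probability for every word, just to keep it short):

```python
import math

p_word = 1e-4   # hypothetical P(w_i | y_j), identical for every word
n_words = 100   # hypothetical document length
prior = 0.5     # hypothetical P(y_j)

# Direct product: 0.5 * (1e-4)**100 is far below the smallest float, so it collapses to 0.0
direct = prior
for _ in range(n_words):
    direct *= p_word

# Log form: the same quantity as a sum of logs, which is a perfectly ordinary number
log_score = math.log(prior) + n_words * math.log(p_word)

print(direct)
print(log_score)
```

Since log is monotonic, the class with the largest log score is also the class with the largest posterior.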

To estimate $P(w_i|y_j)$, we could just use 
$$\frac{N_{ji}}{N_j} = \frac{\text{total count of}\, w_i \,\text{across all documents of class}\, y_j}{\text{total count of all words across all documents of class}\,y_j} $$

But what if our new document contains a word we never saw in the training documents of class $y_j$? Then our estimate of $P(w|y_j)$ for that word would be zero, and then the entire product $P(y_j|\vec{x})$ would be zero.

To avoid this, we employ *Laplace Smoothing*
### $$ P(w_i | y_j) = \frac{\alpha + N_{ji}}{\alpha p + N_j} $$

Usually, $\alpha = 1$
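A quick sketch of why the denominator gets $\alpha p$ added to it: with that choice, the smoothed estimates for one class still sum to 1 over the vocabulary. The vocabulary and counts below are hypothetical, and `smoothed_likelihood` is just an illustrative helper name.

```python
from collections import Counter

def smoothed_likelihood(word, class_word_counts, vocab, alpha=1.0):
    """(alpha + N_ji) / (alpha * p + N_j) for a single class."""
    n_ji = class_word_counts[word]          # a Counter returns 0 for unseen words
    n_j = sum(class_word_counts.values())   # total words seen in this class
    return (alpha + n_ji) / (alpha * len(vocab) + n_j)

# Hypothetical 4-word vocabulary and word counts for one class
vocab = ["party", "hard", "mind", "change"]
class_word_counts = Counter({"party": 2, "hard": 1})

probs = {w: smoothed_likelihood(w, class_word_counts, vocab) for w in vocab}
total = sum(probs.values())

print(probs)   # "mind" and "change" get a small nonzero probability
print(total)   # the estimates still form a probability distribution
```

Words with a zero count now get probability $\alpha / (\alpha p + N_j)$ instead of zero, so a single unseen word no longer wipes out the whole product.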

## Option 2: Bernoulli Naive Bayes


Imagine, now, that a document of class $y_j$ has a probability $P(w_i|y_j)$ of containing word $w_i$ at least once (regardless of word count).

Again let our vocabulary (the set of unique words observed across the entire training corpus) be $p$ words long. Let's write $\vec{x}$ as a p-dimensional vector of word *occurrences*: $x_i = 1$ if $w_i$ is in that document at all, $x_i = 0$ if not: $\vec{x} = [x_1, x_2, \ldots, x_p]$

Then we can think of our document as a series of Bernoulli trials, one for each word, and our posterior is
### $$P(y_j|\vec{x}) \propto P(y_j) \times\prod_{i=1}^p P(w_i|y_j)^{x_i}(1 - P(w_i|y_j))^{1 - x_i} $$

$$\text{log}(P(y_j|\vec{x})) = \text{log}(P(y_j)) + \sum_{i=1}^p \left[ x_i \text{log}(P(w_i|y_j)) + (1 - x_i)\text{log}(1 - P(w_i|y_j)) \right] + \text{const}$$

To estimate $P(w_i|y_j)$, we could just use 
$$\frac{D_{ji}}{D_j} = \frac{\text{total # of documents of class}\,y_j\text{ containing}\, w_i}{\text{total # of documents of class}\,y_j} $$

But again we'd hit the "unseen word" problem. So our smoothing here looks like
### $$ P(w_i | y_j) = \frac{\alpha + D_{ji}}{2\alpha + D_j} $$

Again with $\alpha = 1$ usually.
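Here is a minimal sketch of the Bernoulli version on a tiny made-up corpus (the documents and the labels "A" and "B" are hypothetical). Note two differences from the multinomial case: each word is counted at most once per document, and every vocabulary word contributes to the score, including the ones that are absent from the document.

```python
import math
from collections import Counter

# Hypothetical training corpus
train_docs = ["party hard party", "mind change", "party mind"]
train_labels = ["A", "B", "B"]

vocab = sorted({w for doc in train_docs for w in doc.split()})
docs_per_class = Counter(train_labels)                      # D_j
doc_freq = {label: Counter() for label in docs_per_class}   # D_ji
for doc, label in zip(train_docs, train_labels):
    doc_freq[label].update(set(doc.split()))  # set(): count each word once per document

def bernoulli_log_posterior(doc, label, alpha=1.0):
    """log P(y_j) + sum over the whole vocabulary of the Bernoulli log-likelihood terms."""
    present = set(doc.split())
    score = math.log(docs_per_class[label] / len(train_labels))
    for w in vocab:
        p = (alpha + doc_freq[label][w]) / (2 * alpha + docs_per_class[label])
        score += math.log(p) if w in present else math.log(1 - p)
    return score

scores = {label: bernoulli_log_posterior("mind mind change", label)
          for label in docs_per_class}
prediction = max(scores, key=scores.get)
print(scores)
print(prediction)
```

Repeating "mind" in the test document changes nothing here, since only presence or absence matters.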

## Which is better?
- Bernoulli tends to work better with shorter documents...
- ...but sklearn says "It is advisable to evaluate both models, if time permits."

References:

[sklearn User Guide: Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html)

[Spam Filtering with Naive Bayes – Which Naive Bayes?](http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=E870DFA148786F5A27F88CF80FB4D73B?doi=10.1.1.61.5542&rep=rep1&type=pdf)

## Other flavors: Gaussian Naive Bayes
When my features are continuous (but my target is still a class), I can go ahead and assume that data for a given class has normally distributed values for each feature.

You cannot stop me from making this assumption.

Then my likelihood above is just a product of Gaussian probability density functions. Same process!
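A minimal sketch of that process using only the standard library. The two classes, the feature values, and the helper names are all hypothetical; the variance here is the plain population variance per class and feature, with no smoothing added.

```python
import math
from statistics import mean, pvariance

# Hypothetical training data: class -> rows of [feature_1, feature_2]
train = {
    "small": [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]],
    "large": [[3.0, 4.0], [3.2, 4.4], [2.8, 3.6]],
}
n_total = sum(len(rows) for rows in train.values())

# Fit: a prior per class, plus one mean and one variance per (class, feature)
params = {}
for label, rows in train.items():
    columns = list(zip(*rows))
    params[label] = {
        "prior": len(rows) / n_total,
        "means": [mean(col) for col in columns],
        "vars": [pvariance(col) for col in columns],
    }

def gaussian_log_pdf(x, mu, var):
    """Log of the normal density with mean mu and variance var, evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def log_posterior(x, label):
    """log P(y) + sum of per-feature Gaussian log-likelihoods (up to a constant)."""
    p = params[label]
    return math.log(p["prior"]) + sum(
        gaussian_log_pdf(xi, mu, var)
        for xi, mu, var in zip(x, p["means"], p["vars"])
    )

x_new = [1.1, 2.1]
scores = {label: log_posterior(x_new, label) for label in train}
prediction = max(scores, key=scores.get)
print(prediction)
```

The structure is the same as before: log prior plus a sum of per-feature log-likelihoods, with the word probabilities swapped out for Gaussian densities.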

[See here for bugs](http://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf)

### Let's code!

In [134]:
from functools import reduce
import numpy as np

In [15]:
X = ['when its time to party we will party hard',
     'theres a party in my mind',
     'i need something to change my mind']
y = ['andrew wk', 
     'talking heads', 
     'talking heads']

In [16]:
x_test = ['a change in the weather']

In [17]:
from collections import Counter, defaultdict

### Finding the vocabulary

There's more than one way!

In [30]:
# Simple, readable, efficient, bulky

vocab = set()
for row in X:
    for word in row.split():
        vocab.add(word)

print( vocab )

{'to', 'time', 'my', 'need', 'i', 'its', 'will', 'hard', 'in', 'theres', 'we', 'a', 'change', 'when', 'something', 'party', 'mind'}


In [31]:
# Simple, clever, memory inefficient

print( set(' '.join(X).split()) )

{'to', 'time', 'my', 'need', 'i', 'its', 'will', 'hard', 'in', 'theres', 'we', 'a', 'change', 'when', 'something', 'party', 'mind'}


In [32]:
# Compact, efficient, unreadable

print( reduce( lambda a,b: a|b, [set(x.split()) for x in X] ) )

{'to', 'time', 'my', 'need', 'i', 'its', 'will', 'hard', 'in', 'theres', 'we', 'a', 'change', 'when', 'something', 'party', 'mind'}


### Get priors

In [37]:
class_counts = Counter(y)

In [38]:
class_counts

Counter({'andrew wk': 1, 'talking heads': 2})

In [39]:
total_docs = sum(class_counts.values())

In [45]:
words_per_class = defaultdict(int)
word_count_per_class = defaultdict(Counter)

for doc, label in zip(X,y):
    doc_words = doc.split()
    
    words_per_class[label] += len(doc_words)
    
    word_count_per_class[label].update(Counter(doc_words))

In [46]:
words_per_class

defaultdict(int, {'andrew wk': 9, 'talking heads': 13})

In [47]:
word_count_per_class

defaultdict(collections.Counter,
            {'andrew wk': Counter({'when': 1,
                      'its': 1,
                      'time': 1,
                      'to': 1,
                      'party': 2,
                      'we': 1,
                      'will': 1,
                      'hard': 1}),
             'talking heads': Counter({'theres': 1,
                      'a': 1,
                      'party': 1,
                      'in': 1,
                      'my': 2,
                      'mind': 2,
                      'i': 1,
                      'need': 1,
                      'something': 1,
                      'to': 1,
                      'change': 1})})

### Calculate posterior

$\vec{x}$ = "a change in the weather"

$y_1$ = andrew wk

$y_2$ = talking heads

$$P(y_1 | \vec{x}) \propto P(y_1) \prod P( w_i | y_1 )^{x_i}$$

Let's start by computing the likelihoods $$P(w_i | y_j) = \frac{\alpha + N_{ji}}{\alpha p + N_j} $$

In [111]:
def P_w_given_class(word, classname, alpha=1):
    
    p = len(vocab)
    
    N_ji = word_count_per_class[classname][word]
    N_j = words_per_class[classname]
    
    return (alpha + N_ji) / (alpha*p + N_j)

In [128]:
x = x_test[0].split()
x

['a', 'change', 'in', 'the', 'weather']

### First we'll do andrew wk

In [130]:
P_w_given_andrewwk = [ P_w_given_class(w_i, "andrew wk") for w_i in x ]
P_w_given_andrewwk

[0.038461538461538464,
 0.038461538461538464,
 0.038461538461538464,
 0.038461538461538464,
 0.038461538461538464]

In [132]:
P_andrewwk = class_counts["andrew wk"]/total_docs
P_andrewwk

0.3333333333333333

In [139]:
# log of the unnormalized posterior: log P(y) + sum_i log P(w_i | y)
log_post_andrewwk = np.log(P_andrewwk) + np.log(P_w_given_andrewwk).sum()
log_post_andrewwk

-17.38909497877552

### And now the talking heads

In [140]:
P_w_given_th = [ P_w_given_class(w_i, "talking heads") for w_i in x ]
P_w_given_th

[0.06666666666666667,
 0.06666666666666667,
 0.06666666666666667,
 0.03333333333333333,
 0.03333333333333333]

In [141]:
P_th = class_counts["talking heads"]/total_docs
P_th

0.6666666666666666

In [142]:
# log of the unnormalized posterior: log P(y) + sum_i log P(w_i | y)
log_post_th = np.log(P_th) + np.log(P_w_given_th).sum()
log_post_th

-15.332010474739105

This log posterior is higher than the one for andrew wk (about -15.33 vs. -17.39), and so we declare talking heads the most likely origin of the lyric "a change in the weather".
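The two hand-rolled calculations above can be folded into one loop over classes. This is a sketch rather than a polished implementation: `fit` and `log_posteriors` are illustrative names, and it uses `math.log` from the standard library instead of numpy so it runs on its own.

```python
import math
from collections import Counter, defaultdict

X = ['when its time to party we will party hard',
     'theres a party in my mind',
     'i need something to change my mind']
y = ['andrew wk', 'talking heads', 'talking heads']

def fit(docs, labels):
    """Collect documents per class, word counts per class, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def log_posteriors(doc, class_counts, word_counts, vocab, alpha=1):
    """Multinomial log posterior (up to a constant) for every class."""
    total_docs = sum(class_counts.values())
    p = len(vocab)
    scores = {}
    for label, n_docs in class_counts.items():
        n_j = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)          # log prior
        for word in doc.split():                        # Laplace-smoothed log likelihoods
            score += math.log((alpha + word_counts[label][word]) / (alpha * p + n_j))
        scores[label] = score
    return scores

model = fit(X, y)
scores = log_posteriors('a change in the weather', *model)
prediction = max(scores, key=scores.get)
print(scores)
print(prediction)
```

This reproduces the same two log posteriors as the cells above and picks talking heads, and it extends to any number of classes without more copy-pasting.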