Today we're going to learn about a classifier that has no business being as good as it is, given how simple it is - the naive Bayes classifier, which is a powerful and robust tool for classifying categorical data.

**Naive Bayes**

The naive Bayes algorithm begins with Bayes' law, which is a simple deduction from the basic formula for conditional probability:

&nbsp;
<center>
    $\displaystyle P(X|Y) = \frac{P(X \cap Y)}{P(Y)}$
</center>

If we turn this formula around, we get:

&nbsp;
<center>
    $\displaystyle P(Y|X) = \frac{P(X \cap Y)}{P(X)}$,
</center>

from which we get:

&nbsp;
<center>
    $\displaystyle P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$.
</center>

This last formula is known as Bayes' theorem, and it relates how likely an event $Y$ is before we know anything else (the *prior* probability) to how likely it is given additional information $X$ (the *posterior* probability). The term $P(X|Y)$ is called the *likelihood*, and the term $P(X)$ is called the *marginal*.

In his book "Thinking, Fast and Slow", Daniel Kahneman presents the following question from his research with Amos Tversky. Suppose you pick a person at random from the population of the United States, and are given the following description of the person: "Steve is very shy and withdrawn, invariably helpful but with little interest in people or in the world of reality. A meek and tidy soul, he has a need for order and structure, and a passion for detail." Is Steve more likely to be a librarian or a farmer?

According to Kahneman, most people think Steve is more likely to be a librarian, focusing on an occupational stereotype while ignoring a much more important fact - there are far more farmers than librarians.

Let's see how Bayes' theorem relates to this question. Suppose in a population of 10,000 there are 200 farmers and 10 librarians, and suppose that 40% of librarians, 5% of farmers, and 3% of the general population meet the above description. In that case, what is the probability that a randomly chosen person who meets this description is a librarian? Well, we know the probabilities of a random person from the population being a librarian, a farmer, and fitting this description are, respectively:

<center>
    $P(L) = .1\%$,
    $P(F) = 2\%$,
    $P(D) = 3\%$.
</center>

So, according to Bayes' theorem, the conditional probability of the randomly selected person being a librarian given that description is:

&nbsp;
<center>
    $\displaystyle P(L|D) = \frac{P(D|L)P(L)}{P(D)} = \frac{.4 \times .001}{.03} = .013$,
</center>

while the probability of the randomly selected person being a farmer given that description is:

&nbsp;
<center>
    $\displaystyle P(F|D) = \frac{P(D|F)P(F)}{P(D)} = \frac{.05 \times .02}{.03} = .033$.
</center>

So, the randomly selected person is over twice as likely to be a farmer, even though their description applies to a much higher percentage of librarians than farmers.
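
The arithmetic above is easy to check in a few lines of Python. This is only a sketch using the made-up numbers from the example; the function name `posterior` is our own choice, not something from a library.

```python
def posterior(likelihood, prior, marginal):
    """Bayes' theorem: P(Y|X) = P(X|Y) * P(Y) / P(X)."""
    return likelihood * prior / marginal

# Numbers from the example above (hypothetical population of 10,000)
p_librarian = 10 / 10_000      # P(L)
p_farmer = 200 / 10_000        # P(F)
p_description = 0.03           # P(D)

p_librarian_given_d = posterior(0.40, p_librarian, p_description)
p_farmer_given_d = posterior(0.05, p_farmer, p_description)

print(round(p_librarian_given_d, 3))                      # 0.013
print(round(p_farmer_given_d, 3))                         # 0.033
print(round(p_farmer_given_d / p_librarian_given_d, 2))   # 2.5
```

Note that $P(D)$ cancels when we take the ratio, so the comparison only depends on the likelihoods and the priors.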

Alright, so what does this have to do with categorical prediction? Well, suppose we're trying to predict the category of an observation from a set, where the possible categories are $y_{1},y_{2},\ldots,y_{k}$, and we're given as our input $X = (x_{1},x_{2},\ldots,x_{n})$. In that case, Bayes' theorem tells us that the respective probabilities of the different outcomes are:

&nbsp;
<center>
    $\displaystyle P(y_{1}|x_{1},x_{2},\ldots,x_{n}) = \frac{P(x_{1},x_{2},\ldots,x_{n}|y_{1})P(y_{1})}{P(x_{1},x_{2},\ldots,x_{n})} = \frac{P(X|y_{1})P(y_{1})}{P(X)}$,
</center>
&nbsp;
<center>    
    $\displaystyle P(y_{2}|x_{1},x_{2},\ldots,x_{n}) = \frac{P(x_{1},x_{2},\ldots,x_{n}|y_{2})P(y_{2})}{P(x_{1},x_{2},\ldots,x_{n})} = \frac{P(X|y_{2})P(y_{2})}{P(X)}$,
</center>
&nbsp;
<center>
    $\vdots$
</center>    
    &nbsp;
<center>
    $\displaystyle P(y_{k}|x_{1},x_{2},\ldots,x_{n}) = \frac{P(x_{1},x_{2},\ldots,x_{n}|y_{k})P(y_{k})}{P(x_{1},x_{2},\ldots,x_{n})} = \frac{P(X|y_{k})P(y_{k})}{P(X)}$.
</center>

So, in order to pick the most likely outcome, we just need to calculate the above probabilities and determine which is largest. How do we do this? Well, we just count their frequencies in the training data!

This might seem like an overly simplistic model, but the amazing thing is that, with enough data, this is literally the best possible model there is: if we knew these probabilities exactly, no classifier could have a lower error rate than always picking the most probable category. I'm not kidding. This can't be beat.

OK, well then why don't we just always use this model? Because we very, very rarely have enough data. That is to say, for a given set of inputs $(x_{1},x_{2},\ldots,x_{n})$ we rarely have much, if any, training data, and if we have no training data for exactly that specific set of inputs, then the model above completely breaks down.
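
To make both the counting and the way it breaks down concrete, here is a minimal sketch on a tiny made-up dataset. The data and the helper name `estimate_by_counting` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical toy training data: (inputs, label) pairs
training = [
    (("sunny", "weekend"), "beach"),
    (("sunny", "weekend"), "beach"),
    (("sunny", "weekend"), "home"),
    (("rainy", "weekend"), "home"),
    (("rainy", "weekday"), "home"),
]

def estimate_by_counting(x, data):
    """Estimate P(y | X = x) as the fraction of rows with inputs x that have label y."""
    labels = Counter(label for inputs, label in data if inputs == x)
    total = sum(labels.values())
    if total == 0:
        return None  # no training rows match x exactly, so there is nothing to count
    return {label: count / total for label, count in labels.items()}

seen = estimate_by_counting(("sunny", "weekend"), training)
unseen = estimate_by_counting(("sunny", "weekday"), training)
print(seen)     # beach: 2/3, home: 1/3
print(unseen)   # None
```

With only two inputs there are already combinations we never observed. With thousands of possible words per message, almost every new message is a combination we have never seen before.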

How do we fix this problem? We need to make some assumptions. Now, these assumptions are rarely if ever completely true, but they can be close enough to true to be useful. The assumption we make in the naive Bayes model is that our input variables are conditionally independent given the category:

&nbsp;
<center>
    $P(x_{1},x_{2},\ldots,x_{n}|y_{i}) = P(x_{1}|y_{i})P(x_{2}|y_{i}) \cdots P(x_{n}|y_{i})$.
</center>



What do we mean by conditionally independent? Well, suppose we had two variables $X$ and $Y$, where $X$ represents going to the beach on a given Saturday, and $Y$ represents getting sunburned on that same Saturday. It would make sense to say that the probabilities of these events are not independent - if you go to the beach you're likely to get sunburned. However, it could be that these events *are* independent given knowledge of a third variable $H$ - whether it's hot outside. It could be that you're equally likely to get a sunburn on a hot day whether or not you go to the beach, but you're also more likely to go to the beach on a hot day. Mathematically, we'd write this as:

<center>
    $P(X \cap Y) > P(X)P(Y)$,
</center>

&nbsp;
<center>
but
</center>

&nbsp;
<center>
    $P(X \cap Y | H) = P(X|H)P(Y|H)$.
</center>

This would mean that $X$ and $Y$ are conditionally independent given $H$. This conditional independence is the assumption we make in the naive Bayes model.
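
We can check both statements numerically. The sketch below builds a small joint distribution over $(H, X, Y)$ from made-up probabilities, constructed so that $X$ and $Y$ are independent once $H$ is known, and then recovers the marginal and conditional probabilities from the joint table. All numbers and names here are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical numbers: H = hot day, X = go to the beach, Y = get sunburned
p_h = {True: 0.5, False: 0.5}
p_x_given_h = {True: 0.8, False: 0.2}   # P(X = True | H)
p_y_given_h = {True: 0.6, False: 0.1}   # P(Y = True | H)

def bernoulli(p_true, value):
    """Probability of a True/False value when P(True) = p_true."""
    return p_true if value else 1 - p_true

# Joint distribution built so that X and Y are independent once H is known
joint = {
    (h, x, y): p_h[h] * bernoulli(p_x_given_h[h], x) * bernoulli(p_y_given_h[h], y)
    for h, x, y in product([True, False], repeat=3)
}

def prob(event):
    """Sum the joint probability over all outcomes (h, x, y) where event holds."""
    return sum(p for outcome, p in joint.items() if event(*outcome))

p_x = prob(lambda h, x, y: x)
p_y = prob(lambda h, x, y: y)
p_xy = prob(lambda h, x, y: x and y)
print(p_xy, p_x * p_y)   # not independent: P(X and Y) > P(X) P(Y)

p_hot = prob(lambda h, x, y: h)
p_xy_hot = prob(lambda h, x, y: h and x and y) / p_hot
p_x_hot = prob(lambda h, x, y: h and x) / p_hot
p_y_hot = prob(lambda h, x, y: h and y) / p_hot
print(p_xy_hot, p_x_hot * p_y_hot)   # equal up to rounding: independent given H
```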

One final thing to note is that if the input value $x_{i}$ never shows up with an associated output value $y_{j}$, then we'll have $P(x_{i}|y_{j}) = 0$, and so $P(x_{1},x_{2},\ldots,x_{i},\ldots,x_{n}|y_{j}) = P(x_{1}|y_{j})P(x_{2}|y_{j}) \cdots P(x_{n}|y_{j}) = 0$, regardless of the other conditional probabilities. This can cause problems in our model, so the way we deal with this is to introduce a smoothing parameter.

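One commonly used smoothing method, and the one we will use in our spam filter below, is called "Laplace smoothing", and it defines:

&nbsp;
<center>
    $\displaystyle P(x_{i}|y) = \frac{N_{x_{i}|y} + \alpha}{N_{y} + \alpha n}$.
</center>

Here, $N_{x_{i}|y}$ is the number of times the feature $x_{i}$ appears in training observations of category $y$, $N_{y}$ is the total count of all features in category $y$, $\alpha > 0$ is the smoothing parameter, and $n$ is the number of features in the data.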
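
As a quick illustration of the formula, here is a sketch with made-up counts for three features in a single category. The function name `laplace_smoothed` and the counts are hypothetical.

```python
def laplace_smoothed(count, class_total, n_features, alpha=1):
    """(N_{x_i|y} + alpha) / (N_y + alpha * n)."""
    return (count + alpha) / (class_total + alpha * n_features)

# Hypothetical counts of three features within one category
counts = {"free": 3, "win": 1, "hello": 0}
class_total = sum(counts.values())   # N_y = 4
n_features = len(counts)             # n = 3

smoothed = {x: laplace_smoothed(c, class_total, n_features) for x, c in counts.items()}
print(smoothed)                 # "hello" gets 1/7 instead of 0
print(sum(smoothed.values()))   # the smoothed estimates still sum to 1 (up to rounding)
```

Adding $\alpha n$ to the denominator is what keeps the smoothed values a valid probability distribution over the features.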

OK, let's see how we could use the naive Bayes model to build a spam filter.

To build our spam filter, we'll use a dataset of 5,572 SMS messages. Tiago A. Almeida and José María Gómez Hidalgo put together the dataset, and you can download it from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [14]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

sms_spam = pd.read_csv('SMSSpamCollection', sep='\t',
header=None, names=['Label', 'SMS'])

print(sms_spam.shape)
sms_spam.head()

(5572, 2)


| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


We see that about 87% of the messages are ham (non-spam), and the remaining 13% are spam.

In [16]:
sms_spam['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

We're going to split our dataset into a training set and a testing set, this time without using the train_test_split function, to see what's going on under the hood.

In [18]:
# Randomize the dataset
data_randomized = sms_spam.sample(frac=1, random_state=42)

# Calculate index for split
training_test_index = round(len(data_randomized) * 0.8)

# Split into training and test sets
training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [19]:
test_set['Label'].value_counts(normalize=True)


ham     0.861759
spam    0.138241
Name: Label, dtype: float64
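
The shuffle-then-slice idea does not depend on pandas. Here is a rough standard-library sketch of the same steps; it is not how `sample` is implemented internally, and the helper name `shuffle_split` is our own.

```python
import random

def shuffle_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows, then cut it at round(len(rows) * train_fraction)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(5572))          # stand-in for the 5,572 messages
train, test = shuffle_split(rows)
print(len(train), len(test))      # 4458 1114
```

Shuffling first matters: if the file happened to be ordered by label or by date, slicing without shuffling would give training and test sets with different mixes of spam and ham.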

Now we're going to remove the punctuation from our messages, and convert everything to lowercase.

In [21]:
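# Replace punctuation (any non-word character) with a space.
# regex=True is required in recent pandas versions, where the pattern
# is otherwise treated as a literal string.
training_set['SMS'] = training_set['SMS'].str.replace(
   r'\W', ' ', regex=True)
# Convert to lowercase
training_set['SMS'] = training_set['SMS'].str.lower()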
training_set.head(3)

| | Label | SMS |
|---|---|---|
| 0 | ham | squeeeeeze this is christmas hug if u lik ... |
| 1 | ham | and also i ve sorta blown him off a couple tim... |
| 2 | ham | mmm thats better now i got a roast down me i ... |
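
To see what this cleaning step does to a single message, here is a small standalone sketch using Python's `re` module with the same `\W` pattern. The example message and the helper name `clean` are made up.

```python
import re

def clean(message):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r'\W', ' ', message).lower()

cleaned = clean("WINNER!! Call 0800-123 now.")
print(cleaned.split())   # ['winner', 'call', '0800', '123', 'now']
```

Note that `\W` matches anything that is not a letter, digit, or underscore, so digits survive the cleaning, and contractions such as "I've" are split into two tokens ("i" and "ve"), which is what we see in the output above.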


Next, we create the vocabulary, which is the set of words in the training data.

In [23]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []
for sms in training_set['SMS']:
   for word in sms:
      vocabulary.append(word)

vocabulary = list(set(vocabulary))

In [24]:
len(vocabulary)

7816

Next, we create a dictionary (a key / value pair data object) that maps each word in the vocabulary to a list with one entry per message, where each entry counts the number of times the word appears in that message. Note that most entries will be 0. We then convert it to a dataframe.

In [26]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
   for word in sms:
      word_counts_per_sms[word][index] += 1

In [27]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

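| | downstem | pounds | addicted | reach | lautech | situation | conform | heron | malaria | ac | ... | dane | had | docks | babies | balloon | ibn | kaypoh | clover | wan | compensation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |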
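
If the shape of this structure is hard to picture, here is the same idea on two made-up tokenized messages, using `collections.Counter` from the standard library. The names `toy_messages`, `toy_vocabulary`, and `toy_counts` are hypothetical and deliberately different from the variables used in the notebook.

```python
from collections import Counter

# Hypothetical tokenized messages, in the same shape as the split SMS column
toy_messages = [
    ["free", "entry", "free", "prize"],
    ["see", "you", "there"],
]

# Unique words across all messages (sorted only to make the output stable)
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One Counter per message; a Counter returns 0 for words it has not seen
per_message = [Counter(sms) for sms in toy_messages]
toy_counts = {
    word: [counts[word] for counts in per_message] for word in toy_vocabulary
}

print(toy_vocabulary)       # ['entry', 'free', 'prize', 'see', 'there', 'you']
print(toy_counts["free"])   # [2, 0]
```

Each key becomes a column and each list position becomes a row, which is exactly how the dataframe above is laid out.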


We don't have the label column in this dataframe, so we can add it by concatenating the dataframe we just built with the dataframe containing our training set.

In [29]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

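| | Label | SMS | downstem | pounds | addicted | reach | lautech | situation | conform | heron | ... | dane | had | docks | babies | balloon | ibn | kaypoh | clover | wan | compensation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [squeeeeeze, this, is, christmas, hug, if, u, ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [and, also, i, ve, sorta, blown, him, off, a, ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [mmm, thats, better, now, i, got, a, roast, do... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [mm, have, some, kanji, dont, eat, anything, h... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [so, there, s, a, ring, that, comes, with, the... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |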


Now that we're done cleaning and preparing our data, we can code the spam filter. The naive Bayes algorithm will need, for a given set of words, to calculate:

&nbsp;
<center>
    $\displaystyle P(Spam|w_{1},w_{2},\ldots,w_{n}) \propto P(Spam) \cdot \prod_{i = 1}^{n}P(w_{i}|Spam)$
</center>

and

&nbsp;
<center>
    $\displaystyle P(Ham|w_{1},w_{2},\ldots,w_{n}) \propto P(Ham) \cdot \prod_{i = 1}^{n}P(w_{i}|Ham)$
</center>

To calculate $P(w_{i}|Spam)$ and $P(w_{i}|Ham)$ inside the formulas above, we'll need to use these equations:

&nbsp;
<center>
    $\displaystyle P(w_{i}|Spam) = \frac{N_{w_{i}|Spam} + \alpha}{N_{Spam} + \alpha N_{Vocabulary}}$
</center>

and

&nbsp;
<center>
    $\displaystyle P(w_{i}|Ham) = \frac{N_{w_{i}|Ham} + \alpha}{N_{Ham} + \alpha N_{Vocabulary}}$
</center>

Some of the terms in the four equations above will have the same value for every new message. We can calculate the value of these terms once and avoid repeating the computations each time a new message comes in. As a start, let's first calculate:

* $P(Spam)$ and $P(Ham)$
* $N_{Spam}$, $N_{Ham}$, and $N_{Vocabulary}$

It's important to note:

* $N_{Spam}$ is equal to the number of words in all the spam messages — it's not equal to the number of spam messages, and it's not equal to the total number of unique words in spam messages.
* $N_{Ham}$ is equal to the number of words in all the non-spam messages — it's not equal to the number of non-spam messages, and it's not equal to the total number of unique words in non-spam messages.
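
The three quantities are easy to confuse, so here is a tiny sketch that computes all of them on two made-up spam messages. The variable `toy_spam` is hypothetical.

```python
# Hypothetical tokenized spam messages
toy_spam = [
    ["win", "free", "prize", "free"],
    ["free", "entry", "now"],
]

n_messages = len(toy_spam)                                   # number of spam messages
n_words = sum(len(sms) for sms in toy_spam)                  # this is what N_Spam means
n_unique = len({word for sms in toy_spam for word in sms})   # unique words in spam

print(n_messages, n_words, n_unique)   # 2 7 5
```

Repeated words count every time they appear, which is why "free" contributes three to the total of seven.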

We'll also use Laplace smoothing and set $\alpha = 1$.

In [35]:
# Isolating spam and ham messages first
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

Now that we have the constant terms calculated above, we can move on with calculating the parameters $P(w_{i}|Spam)$ and $P(w_{i}|Ham)$.

$P(w_{i}|Spam)$ and $P(w_{i}|Ham)$ will vary depending on the individual words. For instance, $P(secret|Spam)$ will have a certain probability value, while $P(cousin|Spam)$ or $P(lovely|Spam)$ will most likely have other values.

Therefore, each parameter will be a conditional probability value associated with each word in the vocabulary, calculated using the formulas we saw above.

&nbsp;
<center>
    $\displaystyle P(w_{i}|Spam) = \frac{N_{w_{i}|Spam} + \alpha}{N_{Spam} + \alpha N_{Vocabulary}}$
</center>

and

&nbsp;
<center>
    $\displaystyle P(w_{i}|Ham) = \frac{N_{w_{i}|Ham} + \alpha}{N_{Ham} + \alpha N_{Vocabulary}}$
</center>

In [37]:
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
   n_word_given_spam = spam_messages[word].sum() # spam_messages already defined
   p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
   parameters_spam[word] = p_word_given_spam

   n_word_given_ham = ham_messages[word].sum() # ham_messages already defined
   p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
   parameters_ham[word] = p_word_given_ham

Now that we have all our parameters calculated, we can start creating the spam filter. The spam filter can be understood as a function that:

* Takes in as input a new message $(w_{1}, w_{2}, ..., w_{n})$.
* Calculates $P(Spam|w_{1}, w_{2}, ..., w_{n})$ and $P(Ham|w_{1}, w_{2}, ..., w_{n})$.
* Compares the values of $P(Spam|w_{1}, w_{2}, ..., w_{n})$ and $P(Ham|w_{1}, w_{2}, ..., w_{n})$, and:
    * If $P(Ham|w_{1}, w_{2}, ..., w_{n}) > P(Spam|w_{1}, w_{2}, ..., w_{n})$, then the message is classified as ham.
    * If $P(Ham|w_{1}, w_{2}, ..., w_{n}) < P(Spam|w_{1}, w_{2}, ..., w_{n})$, then the message is classified as spam.
    * If $P(Ham|w_{1}, w_{2}, ..., w_{n}) = P(Spam|w_{1}, w_{2}, ..., w_{n})$, then the algorithm may request human help.

Note that some new messages will contain words that are not part of the vocabulary. We will simply ignore these words when we're calculating the probabilities.

In [39]:
import re

def classify(message):
   '''
   message: a string
   '''

   message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
   message = message.lower().split()

   p_spam_given_message = p_spam
   p_ham_given_message = p_ham

   for word in message:
      if word in parameters_spam:
         p_spam_given_message *= parameters_spam[word]

      if word in parameters_ham: 
         p_ham_given_message *= parameters_ham[word]

   print('P(Spam|message):', p_spam_given_message)
   print('P(Ham|message):', p_ham_given_message)

   if p_ham_given_message > p_spam_given_message:
      print('Label: Ham')
   elif p_ham_given_message < p_spam_given_message:
      print('Label: Spam')
   else:
      print('Equal probabilities, have a human classify this!')

Let's try this on a few examples:

In [41]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')


P(Spam|message): 1.5223001843661562e-25
P(Ham|message): 1.2176099861344542e-27
Label: Spam


In [42]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 4.74693224259436e-25
P(Ham|message): 2.7878169823581606e-21
Label: Ham
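
Notice how small these numbers already are for short messages. Every extra word multiplies in another small probability, so a long enough message would eventually underflow to exactly zero for both classes. A common way around this, which the filter in this post does not use, is to compare sums of logarithms instead of products; since the logarithm is increasing, the comparison comes out the same. Here is a minimal sketch with made-up priors and word probabilities (all `toy_` names are hypothetical).

```python
import math

def log_score(words, prior, word_probs):
    """log P(class) + sum of log P(word | class), skipping out-of-vocabulary words."""
    score = math.log(prior)
    for word in words:
        if word in word_probs:
            score += math.log(word_probs[word])
    return score

# Hypothetical priors and smoothed word probabilities
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_params_spam = {"free": 0.02, "prize": 0.01, "see": 0.001}
toy_params_ham = {"free": 0.002, "prize": 0.0005, "see": 0.01}

words = ["free", "prize", "unknownword"] * 200   # a very long message

# Plain product, as in classify(): unknown words are skipped (factor of 1)
product_spam = toy_p_spam
for w in words:
    product_spam *= toy_params_spam.get(w, 1)
print(product_spam)   # 0.0: the plain product underflows

spam_score = log_score(words, toy_p_spam, toy_params_spam)
ham_score = log_score(words, toy_p_ham, toy_params_ham)
print('spam' if spam_score > ham_score else 'ham')   # spam
```

SMS messages are short enough that the plain product works fine here, but this is worth keeping in mind before applying the same code to emails or longer documents.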


Let's modify our classification function so that instead of printing the results, it returns the predicted label.

In [44]:
def classify_test_set(message):
   '''
   message: a string
   '''

   message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
   message = message.lower().split()

   p_spam_given_message = p_spam
   p_ham_given_message = p_ham

   for word in message:
      if word in parameters_spam:
         p_spam_given_message *= parameters_spam[word]

      if word in parameters_ham:
         p_ham_given_message *= parameters_ham[word]

   if p_ham_given_message > p_spam_given_message:
      return 'ham'
   elif p_spam_given_message > p_ham_given_message:
      return 'spam'
   else:
      return 'needs human classification'

Now let's apply this predictive algorithm to the test set.

In [46]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Was playng 9 doors game and gt racing on phone...,ham
1,ham,I dont thnk its a wrong calling between us,ham
2,ham,All e best 4 ur exam later.,ham
3,ham,Hey what how about your project. Started aha da.,ham
4,ham,"Dunno, my dad said he coming home 2 bring us o...",ham


Finally, let's calculate the accuracy of our model on the test set.

In [48]:
correct = 0
total = test_set.shape[0]

# iterrows() yields (index, row) pairs; we only need the row
for _, row in test_set.iterrows():
   if row['Label'] == row['predicted']:
      correct += 1

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1092
Incorrect: 22
Accuracy: 0.9802513464991023


Pretty good!
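
For context, remember that about 86% of the test set is ham, so a "filter" that labels every message as ham would already score about 86% accuracy. One way to put an accuracy figure in perspective is to compare it against that majority-class baseline and to count each kind of (actual, predicted) pair. The sketch below does this on a handful of made-up labels, not on the real test set; the function name `accuracy` is our own.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of positions where the predicted label matches the actual one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical labels, not the real test set
actual    = ["ham", "ham", "ham", "spam", "ham", "spam", "ham", "ham"]
predicted = ["ham", "ham", "ham", "spam", "ham", "ham",  "ham", "ham"]

# Baseline: always predict the most common actual label
majority = Counter(actual).most_common(1)[0][0]
baseline = [majority] * len(actual)

print(accuracy(actual, predicted))       # 0.875
print(accuracy(actual, baseline))        # 0.75
print(Counter(zip(actual, predicted)))   # counts of (actual, predicted) pairs
```

Against an 86% baseline, our 98% is still a big improvement.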