## Building a Spam Filter with Naive Bayes

In this project, we're going to study the practical side of the Naive Bayes algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we saw in the previous courses that the computer:
   + Learns how humans classify messages.
   + Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
   + Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).
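
The decision rule in the last step can be sketched as a small function. The name `decide` and the returned strings are illustrative assumptions, not part of the project code that follows.

```python
def decide(p_spam_given_message, p_ham_given_message):
    """Return a label by comparing the two probability values."""
    if p_spam_given_message > p_ham_given_message:
        return 'spam'
    if p_spam_given_message < p_ham_given_message:
        return 'ham'
    # Equal values: the rule is undecided, so defer to a human.
    return 'needs human review'

print(decide(0.7, 0.3))
print(decide(0.2, 0.8))
print(decide(0.5, 0.5))
```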

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more detail on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

## Exploring the Dataset

Let's start by reading in the dataset.

In [1]:
import pandas as pd
spam_collection = pd.read_csv('SMSSpamCollection', 
                              sep='\t', #The data are tab separated.
                              header=None, #The dataset doesn't have a header row.
                              names=['Label', 'SMS']) #Name the columns as 'Label' and 'SMS'.

In [2]:
#Explore the dataset.
spam_collection.shape

(5572, 2)

In [3]:
spam_collection.head()

|   | Label | SMS |
|---|-------|-----|
| 0 | ham   | Go until jurong point, crazy.. Available only ... |
| 1 | ham   | Ok lar... Joking wif u oni... |
| 2 | spam  | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham   | U dun say so early hor... U c already then say... |
| 4 | ham   | Nah I don't think he goes to usf, he lives aro... |


In [4]:
spam_collection['Label'].value_counts(normalize = True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

The dataset contains 5572 messages. Among all the messages, about 87% are non-spam (labeled as 'ham'), while the remaining 13% of the messages are spam. 

Now that we've become a bit familiar with the dataset, we can move on to building the spam filter.

## Training and Test Set

Before creating the spam filter, it's very helpful to first think of a way of testing how well it works. When creating software (a spam filter is software), a good rule of thumb is that designing the test comes before creating the software. If we write the software first, then it's tempting to come up with a biased test just to make sure the software passes it.

Once our spam filter is done, we'll need to test how good it is at classifying new messages. To test the spam filter, we're first going to split our dataset into two categories:
   + A training set, which we'll use to "train" the computer how to classify messages.
   + A test set, which we'll use to test how good the spam filter is at classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:
   + The training set will have 4,458 messages (about 80% of the dataset).
   + The test set will have 1,114 messages (about 20% of the dataset).

To better understand the purpose of putting a test set aside, let's begin by observing that all 1,114 messages in our test set are already classified by a human. When the spam filter is ready, we're going to treat these messages as new and have the filter classify them. Once we have the results, we'll be able to compare the algorithm's classification with the one made by a human, and this way we'll see how good the spam filter really is.

For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

We'll come back to testing toward the end of this project, but for now, let's create a training and a test set. We're going to start by randomizing the entire dataset to ensure that spam and ham messages are spread properly throughout the dataset. 

In [5]:
#Randomize the entire dataset.
spam_collection_randomized = spam_collection.sample(frac = 1, random_state = 1)

#Get the index for training set.
training_in = round(len(spam_collection_randomized)*0.8)
training = spam_collection_randomized[:training_in].reset_index(drop=True)
test = spam_collection_randomized[training_in:].reset_index(drop=True)
print(training.shape)
print(test.shape)

(4458, 2)
(1114, 2)


In [6]:
#Explore the percentage of spam and ham in both of the training and test set.
print(training['Label'].value_counts(normalize = True)*100)
print("-----------------------")
print(test['Label'].value_counts(normalize = True)*100)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64
-----------------------
ham     86.804309
spam    13.195691
Name: Label, dtype: float64


We can conclude that the percentages of spam and ham in both the training and test sets are similar to those in the full dataset. These percentages are representative, since in reality most of the messages people receive are ham.

## Letter Case and Punctuation

Previously, we split our dataset into a training set and a test set. The next big step is to use the training set to teach the algorithm to classify new messages.

Recall from the previous course that when a new message comes in, our Naive Bayes algorithm makes the classification using the probabilities P(w<sub>i</sub>|Spam) and P(w<sub>i</sub>|Ham), and that there are equations for calculating these probabilities.

However, to calculate all these probabilities, we'll first need to perform a bit of data cleaning to bring the data into a format that allows us to easily extract all the information we need. Right now, our training and test sets have this format (the messages are fictitious to make the example easier to understand):

|   | Label | SMS                                    |
|---|-------|----------------------------------------|
| 0 | spam  | SECRET PRIZE! CLAIM SECRET PRIZE NOW!! |
| 1 | ham   | Coming to my secret party?             |
| 2 | spam  | Winner! Claim secret prize now!        |

To make the calculations easier, we want to bring the data to this format (the table below is a transformation of the table you see above):

|   | Label | secret | prize | claim | now | coming | to | my | party | winner |
|---|-------|--------|-------|-------|-----|--------|----|----|-------|--------|
| 0 | spam  | 2      | 2     | 1     | 1   | 0      | 0  | 0  | 0     | 0      |
| 1 | ham   | 1      | 0     | 0     | 0   | 1      | 1  | 1  | 1     | 0      |
| 2 | spam  | 1      | 1     | 1     | 1   | 0      | 0  | 0  | 0     | 1      |

About the transformation above, notice that:
   + The SMS column doesn't exist anymore.
   + Instead, the SMS column is replaced by a series of new columns, where each column represents a unique word from the vocabulary.
   + Each row describes a single message. For instance, the first row corresponds to the message "SECRET PRIZE! CLAIM SECRET PRIZE NOW!!", and it has the values spam, 2, 2, 1, 1, 0, 0, 0, 0, 0. These values tell us that:
      + The message is spam.
      + The word "secret" occurs two times inside the message.
      + The word "prize" occurs two times inside the message.
      + The word "claim" occurs one time inside the message.
      + The word "now" occurs one time inside the message.
      + The words "coming", "to", "my", "party", and "winner" occur zero times inside the message.
   + All words in the vocabulary are in lower case, so "SECRET" and "secret" are considered the same word.
   + Punctuation is not taken into account anymore (for instance, we can't look at the table and conclude that the first message initially had three exclamation marks).
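
The transformation described above can be sketched on the three fictitious messages using only the standard library. The helper `tokenize` and the variable names are illustrative assumptions; the project itself builds the table with pandas further on.

```python
import re
from collections import Counter

# The three fictitious messages from the example table above.
messages = [
    ('spam', 'SECRET PRIZE! CLAIM SECRET PRIZE NOW!!'),
    ('ham', 'Coming to my secret party?'),
    ('spam', 'Winner! Claim secret prize now!'),
]

def tokenize(text):
    """Replace non-word characters with spaces, lower-case, and split."""
    return re.sub(r'\W', ' ', text).lower().split()

# Build the vocabulary in order of first appearance.
vocabulary = []
for _, sms in messages:
    for word in tokenize(sms):
        if word not in vocabulary:
            vocabulary.append(word)

# One row per message: the label followed by one count per vocabulary word.
rows = []
for label, sms in messages:
    counts = Counter(tokenize(sms))
    rows.append([label] + [counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

Each printed row should line up with the corresponding row of the transformed table: the label first, then the count of every vocabulary word in that message.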
   
   
Let's begin the data cleaning process by removing the punctuation and bringing all the words to lower case.

In [7]:
training.head()

|   | Label | SMS |
|---|-------|-----|
| 0 | ham   | Yep, by the pretty sculpture |
| 1 | ham   | Yes, princess. Are you going to make me moan? |
| 2 | ham   | Welp apparently he retired |
| 3 | ham   | Havent. |
| 4 | ham   | I forgot 2 ask ü all smth.. There's a card on ... |


In [8]:
test.head()

|   | Label | SMS |
|---|-------|-----|
| 0 | ham   | Later i guess. I needa do mcat study too. |
| 1 | ham   | But i haf enuff space got like 4 mb... |
| 2 | spam  | Had your mobile 10 mths? Update to latest Oran... |
| 3 | ham   | All sounds good. Fingers . Makes it difficult ... |
| 4 | ham   | All done, all handed in. Don't know if mega sh... |


In [9]:
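#Remove the punctuation from the SMS column and bring all letters to lower case.
#regex=True makes pandas treat \W as a pattern rather than as literal text.
training['SMS'] = training['SMS'].str.replace(r'\W', ' ', regex=True)
training['SMS'] = training['SMS'].str.lower()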
training.head()

|   | Label | SMS |
|---|-------|-----|
| 0 | ham   | yep by the pretty sculpture |
| 1 | ham   | yes princess are you going to make me moan |
| 2 | ham   | welp apparently he retired |
| 3 | ham   | havent |
| 4 | ham   | i forgot 2 ask ü all smth there s a card on ... |


In [10]:
#Apply the same cleaning to the test set.
test['SMS'] = test['SMS'].str.replace(r'\W', ' ', regex=True)
test['SMS'] = test['SMS'].str.lower()
test.head()

|   | Label | SMS |
|---|-------|-----|
| 0 | ham   | later i guess i needa do mcat study too |
| 1 | ham   | but i haf enuff space got like 4 mb |
| 2 | spam  | had your mobile 10 mths update to latest oran... |
| 3 | ham   | all sounds good fingers makes it difficult ... |
| 4 | ham   | all done all handed in don t know if mega sh... |


## Creating the Vocabulary

We just removed the punctuation and changed all letters to lowercase. Our end goal with this data cleaning process is to bring our training set to the table format as shown above.

With the exception of the "Label" column, every other column in the transformed table above represents a unique word in our vocabulary (more specifically, each column shows the frequency of that unique word for any given message). We call the set of unique words a vocabulary.

We'll eventually bring the training set to that format ourselves, but first, let's create a list with all of the unique words that occur in the messages of our training set.

In [11]:
#Transform each message from the SMS column into a list.
training['SMS'] = training['SMS'].str.split()
print(training['SMS'].head())

#Create an empty list to store the words.
vocabulary = []

#Iterate over each message in the SMS column using a nested loop.
for row in training['SMS']:
    for word in row:
        vocabulary.append(word)
        
#Transform the vocabulary list into a set to remove the duplicates.
vocabulary_set = set(vocabulary)

#Transform vocabulary_set set back into a list.
vocabulary_list = list(vocabulary_set)

print(len(vocabulary_list))

0                    [yep, by, the, pretty, sculpture]
1    [yes, princess, are, you, going, to, make, me,...
2                      [welp, apparently, he, retired]
3                                             [havent]
4    [i, forgot, 2, ask, ü, all, smth, there, s, a,...
Name: SMS, dtype: object
7783


Our vocabulary contains 7,783 unique words in total.
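
As a side note, the nested loop followed by the `set()` conversion can be collapsed into a single set comprehension. The sketch below uses a hypothetical list of tokenized messages in place of `training['SMS']`, and sorts the result only to make the printed order reproducible.

```python
# Hypothetical tokenized messages standing in for training['SMS'].
tokenized_messages = [
    ['secret', 'prize', 'claim', 'secret', 'prize', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
    ['winner', 'claim', 'secret', 'prize', 'now'],
]

# Collect every word once; the set removes duplicates as it is built.
toy_vocabulary = sorted({word for sms in tokenized_messages for word in sms})
print(len(toy_vocabulary))
print(toy_vocabulary)
```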

## The Final Training Set

Now we are going to use the vocabulary to make the data transformation we need. Eventually, we need to create a new DataFrame. However, we'll first build a dictionary that we'll then convert to the DataFrame we need.

We start by initializing a dictionary named "word_counts_per_sms", where each key is a unique word (a string) from the vocabulary, and each value is a list as long as the training set, with every element initialized to 0.

Then we loop over training['SMS'] to generate a frequency table for the words from all the messages. 

In [12]:
word_counts_per_sms = {unique_word : [0] * len(training['SMS']) for unique_word in vocabulary_list}

for index, sms in enumerate(training['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

#Print how many times 'yes' occurs in each of the first 10 messages.
print(word_counts_per_sms['yes'][:10])

[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

Now that we have the dictionary we need, let's do the final transformations to our training set and then move forward with creating the spam filter.

In [13]:
#Transform the dictionary into a DataFrame.
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

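|   | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|----|-----|--------|--------------|------|-------------|----|------|-------------|-----|--------|-----|-----------|------|-------|---|----|---|-----|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |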


In [14]:
#Concatenate the DataFrame we built with training dataframe.
training_table = pd.concat([training, word_counts], axis = 1)
training_table.head()

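|   | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|-------|-----|---|----|-----|--------|--------------|------|-------------|----|-----|--------|-----|-----------|------|-------|---|----|---|-----|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |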


## Calculate Constants First

Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter. Recall that the Naive Bayes algorithm will need to know the probability values of the two equations below to be able to classify new messages:
[![Capture1.jpg](https://i.postimg.cc/Fz8r4Mxz/Capture1.jpg)](https://postimg.cc/m13f36JW)
Also, to calculate P(w<sub>i</sub>|Spam) and P(w<sub>i</sub>|Ham) inside the formulas above, recall that we need to use these additive smoothing equations:
[![Capture2.jpg](https://i.postimg.cc/MZv9Kn0N/Capture2.jpg)](https://postimg.cc/5jWBsNRS)

Some of the terms in the four equations above will have the same value for every new message. As a start, let's first calculate:
   + P(Spam) and P(Ham)
   + N<sub>Spam</sub>, N<sub>Ham</sub>, N<sub>Vocabulary</sub>
   
Recall from the courses that:
   + N<sub>Spam</sub> is equal to the number of words in all the spam messages - it's not equal to the number of spam messages, and it's not equal to the total number of unique words in spam messages.
   + N<sub>Ham</sub> is equal to the number of words in all the non-spam messages - it's not equal to the number of non-spam messages, and it's not equal to the total number of unique words in non-spam messages.
   
Here we'll use Laplace smoothing and set α=1.
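
For reference, the four equations shown in the images above can be written out as follows. This is a transcription inferred from the symbols used in this section and from the code later in the project, so treat the exact notation as an assumption. Here N<sub>w<sub>i</sub>|Spam</sub> and N<sub>w<sub>i</sub>|Ham</sub> denote the number of times the word w<sub>i</sub> occurs in spam and in ham messages, respectively.

$$
P(\text{Spam} \mid w_1, w_2, \ldots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})
$$

$$
P(\text{Ham} \mid w_1, w_2, \ldots, w_n) \propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})
$$

$$
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
$$

$$
P(w_i \mid \text{Ham}) = \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
$$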

In [15]:
#Calculate P(Spam) and P(Ham).
p_spam = training['Label'].value_counts(normalize = True)['spam']
print(p_spam)

0.13458950201884254


In [16]:
p_ham = training['Label'].value_counts(normalize = True)['ham']
print(p_ham)

0.8654104979811574


In [17]:
#Calculate Nspam, Nham, and Nvocabulary.
n_spam = 0
n_ham = 0
for i in range(len(training)):
    if training['Label'][i] == 'spam':
        n_spam += len(training['SMS'][i])
    else:
        n_ham += len(training['SMS'][i])
print(n_spam)
print(n_ham)

15190
57237
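
The same word totals can also be obtained without an explicit loop. The sketch below is one possible vectorized variant, shown on a hypothetical miniature training set (`toy_training`) rather than on the real data.

```python
import pandas as pd

# Hypothetical miniature training set with already-tokenized messages.
toy_training = pd.DataFrame({
    'Label': ['spam', 'ham', 'spam'],
    'SMS': [
        ['secret', 'prize', 'claim', 'secret', 'prize', 'now'],
        ['coming', 'to', 'my', 'secret', 'party'],
        ['winner', 'claim', 'secret', 'prize', 'now'],
    ],
})

# Number of words in each message, then summed per label.
words_per_message = toy_training['SMS'].apply(len)
words_per_label = words_per_message.groupby(toy_training['Label']).sum()

n_spam_toy = int(words_per_label['spam'])
n_ham_toy = int(words_per_label['ham'])
print(n_spam_toy, n_ham_toy)
```

On this toy set the spam messages contain 11 words and the ham message contains 5, counting repeated words each time they occur.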


In [18]:
n_vocabulary = len(vocabulary_list)
print(n_vocabulary)

7783


In [19]:
#Initialize a variable named alpha.
alpha = 1

## Calculating Parameters

Previously, we managed to calculate a few terms for our equations:
   + P(Spam) and P(Ham)
   + N<sub>Spam</sub>, N<sub>Ham</sub>, N<sub>Vocabulary</sub>

As we've already mentioned, all these terms will have constant values in our equations for every new message (regardless of the message or each individual word in the message).

However, P(w<sub>i</sub>|Spam) and P(w<sub>i</sub>|Ham) will vary depending on the individual words. For instance, P("secret"|Spam) will have a certain probability value, while P("cousin"|Spam) or P("lovely"|Spam) will most likely have other values.

Although both P(w<sub>i</sub>|Spam) and P(w<sub>i</sub>|Ham) vary depending on the word, the probability for each individual word is constant for every new message.

For instance, let's say we receive two new messages:
   + "secret code"
   + "secret party 2night"
   
We'll need to calculate P("secret"|Spam) for both these messages, and we can use the training set to get the values we need to find a result for the equation below:
[![Capture3.jpg](https://i.postimg.cc/KzrYwvVc/Capture3.jpg)](https://postimg.cc/8JcDrDNx)

The steps we take to calculate P("secret"|Spam) will be identical for both of our new messages above, or for any other new message that contains the word "secret". The key detail here is that calculating P("secret"|Spam) only depends on the training set, and as long as we don't make changes to the training set, P("secret"|Spam) stays constant. The same reasoning also applies to P("secret"|Ham).
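
As a worked example, here is a sketch of the calculation of P("secret"|Spam) for the three fictitious messages from the example table earlier in this project, with α = 1. The variable names ending in `_toy` are illustrative, and the result applies only to that toy table, not to the real training set.

```python
from fractions import Fraction

# Tokenized spam messages from the fictitious example table.
spam_messages = [
    ['secret', 'prize', 'claim', 'secret', 'prize', 'now'],
    ['winner', 'claim', 'secret', 'prize', 'now'],
]
n_vocabulary_toy = 9   # secret, prize, claim, now, coming, to, my, party, winner
alpha_toy = 1          # Laplace smoothing

# Total number of words in spam messages, and occurrences of "secret" in them.
n_spam_toy = sum(len(sms) for sms in spam_messages)
n_secret_given_spam = sum(sms.count('secret') for sms in spam_messages)

p_secret_given_spam = Fraction(n_secret_given_spam + alpha_toy,
                               n_spam_toy + alpha_toy * n_vocabulary_toy)
print(p_secret_given_spam, float(p_secret_given_spam))
```

With 3 occurrences of "secret" among 11 spam words and a vocabulary of 9 words, the result is (3 + 1) / (11 + 9) = 1/5.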

This means that we can use our training set to calculate the probability for each word in our vocabulary. If our vocabulary contained only the words "lost", "navigate", and "sea", then we'd need to calculate six probabilities:

   + P("lost"|Spam) and P("lost"|Ham)
   + P("navigate"|Spam) and P("navigate"|Ham)
   + P("sea"|Spam) and P("sea"|Ham)

We have 7,783 words in our vocabulary, which means we'll need to calculate a total of 15,566 probabilities. For each word, we need to calculate both P(w<sub>i</sub>|Spam) and P(w<sub>i</sub>|Ham).

In more technical language, the probability values that P(w<sub>i</sub>|Spam) and P(w<sub>i</sub>|Ham) will take are called parameters.

The fact that we calculate so many values before even beginning the classification of new messages makes the Naive Bayes algorithm very fast (especially compared to other algorithms). When a new message comes in, most of the needed computations are already done, which enables the algorithm to almost instantly classify the new message.

If we didn't calculate all these values beforehand, then all these calculations would need to be done every time a new message comes in. Imagine the algorithm will be used to classify 1,000,000 new messages. Why repeat all these calculations 1,000,000 times when we could just do them once at the beginning?

Let's now calculate all the parameters using the equations below:
[![Capture2.jpg](https://i.postimg.cc/MZv9Kn0N/Capture2.jpg)](https://postimg.cc/5jWBsNRS)

In [20]:
#Create two dictionaries to store the parameters for P(wi|Spam) and P(wi|Ham).
p_word_given_spam = {unique_word: 0 for unique_word in vocabulary_list}
p_word_given_ham = {unique_word: 0 for unique_word in vocabulary_list}

#Isolate the spam and the ham messages into two dataframes.
training_spam = training_table[training_table['Label']=='spam']
training_ham = training_table[training_table['Label']=='ham']

#Iterate over the vocabulary and calculate the parameters.
for word in vocabulary_list:
    n_word_given_spam = training_spam[word].sum()
    p_word_given_spam[word] = (n_word_given_spam + alpha)/(n_spam + alpha * n_vocabulary)
    n_word_given_ham = training_ham[word].sum()
    p_word_given_ham[word] = (n_word_given_ham + alpha)/(n_ham + alpha * n_vocabulary)

## Classifying A New Message

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:
   + Takes in as input a new message (w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>).
   + Calculates P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>).
   + Compares the values of P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), and:
      + If P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) > P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the message is classified as spam.
      + If P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) < P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the message is classified as ham.
      + If P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) = P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), then the algorithm may request human help.
      
Let's start building the spam filter function. Note that some new messages will contain words that are not part of the vocabulary. Recall from the previous course that we simply ignore these words when we're calculating the probabilities.

In [21]:
import re

def classify(message):
    #message should be a string.
    #Remove the punctuation.
    message = re.sub(r'\W', ' ', message)
    #Bring all letters to lower case.
    message = message.lower()
    #Split the string at the space character and transform it into a Python list.
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in p_word_given_spam:
            p_spam_given_message = p_spam_given_message * p_word_given_spam[word]
        if word in p_word_given_ham:
            p_ham_given_message = p_ham_given_message * p_word_given_ham[word]
            
    print('P(Spam|message): ', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
            

Let's test our function by classifying two new example messages.

In [22]:
m1 = 'WINNER!! This is the secret code to unlock the money: C3421.'
m2 = "Sounds good, Tom, then see u there"
classify(m1)
classify(m2)

P(Spam|message):  1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
P(Spam|message):  2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


Intuitively, the first message (m1) is clearly spam and the second one (m2) is ham, and the classify() function assigns the same labels.
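
The values printed above are already as small as 1e-25 for short messages. For much longer messages, a product of many small probabilities can underflow to 0.0 in floating-point arithmetic. A common variant, which is not what classify() does, is to add logarithms instead of multiplying probabilities. The sketch below uses hypothetical priors and parameters (the `toy_` names) purely for illustration.

```python
import math

# Hypothetical priors and word parameters for illustration only;
# they are not the values learned from the SMS dataset.
toy_p_spam, toy_p_ham = 0.25, 0.75
toy_p_word_given_spam = {'secret': 0.2, 'prize': 0.15, 'party': 0.05}
toy_p_word_given_ham = {'secret': 0.05, 'prize': 0.02, 'party': 0.2}

def log_scores(words):
    """Return (log score for spam, log score for ham) for a list of words."""
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are ignored, as in classify().
        if word in toy_p_word_given_spam:
            log_spam += math.log(toy_p_word_given_spam[word])
        if word in toy_p_word_given_ham:
            log_ham += math.log(toy_p_word_given_ham[word])
    return log_spam, log_ham

log_spam, log_ham = log_scores(['secret', 'prize', 'unknown'])
print('spam' if log_spam > log_ham else 'ham')

# A very long message still yields finite scores in log space.
long_spam, long_ham = log_scores(['secret'] * 500)
print(math.isfinite(long_spam), math.isfinite(long_ham))
```

Because the logarithm is an increasing function, comparing the two log scores gives the same label as comparing the original products whenever those products can be represented.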

Next, we'll classify all the 1114 messages in our test set.

## Measuring the Spam Filter's Accuracy

Previously, we created a spam filter and classified two new messages. We'll now determine how well the spam filter does on our test set of 1,114 messages.

The algorithm will output a classification label for every message in our test set, which we'll be able to compare with the actual label (given by a human). Note that, in training, our algorithm didn't see these 1,114 messages, so every message in the test set is practically new from the perspective of the algorithm.

First off, we'll change the classify() function that we wrote previously to return the labels instead of printing them. 

In [23]:
def classify_test(message):
    #message should be a string.
    #Remove the punctuation.
    message = re.sub(r'\W', ' ', message)
    #Bring all letters to lower case.
    message = message.lower()
    #Split the string at the space character and transform it into a Python list.
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in p_word_given_spam:
            p_spam_given_message = p_spam_given_message * p_word_given_spam[word]
        if word in p_word_given_ham:
            p_ham_given_message = p_ham_given_message * p_word_given_ham[word]
            
    #print('P(Spam|message): ', p_spam_given_message)
    #print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return "needs human's help!"

Now that we have a function that returns labels instead of printing them, we can use it to create a new column in our test set.

In [24]:
test['predicted'] = test['SMS'].apply(classify_test)
test.head()

|   | Label | SMS | predicted |
|---|-------|-----|-----------|
| 0 | ham   | later i guess i needa do mcat study too | ham |
| 1 | ham   | but i haf enuff space got like 4 mb | ham |
| 2 | spam  | had your mobile 10 mths update to latest oran... | spam |
| 3 | ham   | all sounds good fingers makes it difficult ... | ham |
| 4 | ham   | all done all handed in don t know if mega sh... | ham |


Now we can compare the predicted values with the actual values to measure how good our spam filter is at classifying new messages. To make the measurement, we'll use accuracy as a metric:
[![Capture4.jpg](https://i.postimg.cc/GhHN218h/Capture4.jpg)](https://postimg.cc/D835jHPt)
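
Written out, the accuracy metric is the ratio below. This is a transcription that assumes the image shows the usual definition, which is also what the code in the next cell computes as `correct/total`.

$$
\text{Accuracy} = \frac{\text{number of correctly classified messages}}{\text{total number of classified messages}}
$$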

In [25]:
total = test.shape[0]
correct_test = test[test['Label'] == test['predicted']]
correct = correct_test.shape[0]
accuracy = correct/total
print("Correct: ", correct)
print('Accuracy: ', accuracy)

Correct:  1100
Accuracy:  0.9874326750448833


We classified 1,114 messages that the algorithm hadn't seen in training, and the spam filter classified 1,100 of them correctly. That corresponds to an accuracy of 98.74%, which is pretty good.

## Conclusions and Future Directions

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter has an accuracy of 98.74% on the test set, which is an excellent result. We initially aimed for an accuracy of over 80%, but we managed to do way better than that.

Here are a few next steps that can be taken to work on this project in the future:

   + Isolate the 14 messages that were classified incorrectly and try to figure out why the algorithm reached the wrong conclusions.
   + Make the filtering process more complex by making the algorithm sensitive to letter case.
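
As a starting point for the first of these steps, the misclassified rows can be isolated with a boolean mask. The sketch below runs on a hypothetical miniature test set (`toy_test`) whose `predicted` column mimics the one added by classify_test(); the same two expressions should carry over to the real `test` DataFrame.

```python
import pandas as pd

# Hypothetical miniature test set; 'predicted' mimics the column
# added by classify_test() in this project.
toy_test = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham', 'spam'],
    'SMS': ['see you there', 'win a prize now', 'call me later', 'free entry'],
    'predicted': ['ham', 'spam', 'spam', 'ham'],
})

# Keep only the rows where the prediction disagrees with the human label.
misclassified = toy_test[toy_test['Label'] != toy_test['predicted']]
print(misclassified)

# Break the results down by true label (rows) and predicted label (columns).
print(pd.crosstab(toy_test['Label'], toy_test['predicted']))
```

The cross-tabulation shows whether the errors are mostly ham flagged as spam or spam that slipped through, which helps decide where to look first.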