# Spam or Ham? Creating an SMS Filter with Naive Bayes

In this project, I will create a filter that uses the multinomial Naive Bayes algorithm to sort SMS messages as spam or ham (non-spam). The goal is to write a program that classifies new messages with an accuracy greater than 80%. The program works in the following steps:
1. Learns how humans classify messages.
2. Uses that human knowledge to estimate probabilities for new messages: the probability of spam and the probability of non-spam.
3. Classifies a new message based on these probability values. If the probability for spam is greater, it classifies the message as spam; otherwise, it classifies it as non-spam (if the two probability values are equal, we may need a human to classify the message).

So the first task is to "teach" the computer how to classify messages. To do that, I'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo. It can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection), and more details can be found [here](https://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition).

# Summary of Results
After building the spam filter and evaluating it on the held-out test set, it reached an accuracy of 98.74%, well above the 80% target.

For more details, please refer to the full analysis below.

In [1]:
# import pandas library
import pandas as pd

# read the dataset into python
SMS = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])  # file is tab-separated and has no header row
SMS.shape #quick check of number of rows and columns

(5572, 2)

Use .head( ) to perform a quick check of the data set 

In [2]:
# quick exploration of data set
SMS.head(3)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...


There are 5,572 rows, or 5,572 messages, that need to be sorted. First, I'll find the percentage of messages that are spam and the percentage that are ham ("ham" means non-spam).

In [3]:
# Count the values in the 'Label' column
SMS['Label'].value_counts()

ham     4825
spam     747
Name: Label, dtype: int64

There are 4825 "ham" messages and 747 "spam" messages

In [4]:
# find the percentage of ham and spam messages relative to the total amount 
spam_per = (747 / 5572) * 100
ham_per = (4825 / 5572) * 100

print(spam_per)
print(ham_per)

13.406317300789663
86.59368269921033


From the dataset, there are about 13% spam messages and 87% ham messages. 
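
The same percentages can be read straight from the data instead of being typed in by hand. Here is a minimal sketch, assuming a small made-up `labels` Series that stands in for `SMS['Label']` (the toy values are hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for SMS['Label']: 6 ham and 2 spam messages
labels = pd.Series(['ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham'])

# normalize=True returns fractions of the total instead of raw counts
label_percentages = labels.value_counts(normalize=True) * 100
print(label_percentages)
```

Computing the shares this way means the numbers stay correct if the dataset or the split changes.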

# Training and Test Set

Before relying on the spam filter, I need a way to check how well it classifies new messages. To do that, I can split the dataset into two categories:
- A training set, which we'll use to "train" the computer how to classify messages.
- A test set, which we'll use to test how good the spam filter is with classifying new messages.

I decided to keep 80% of the dataset for training and 20% for testing (the algorithm should train on as much data as possible, but the test set also needs to be large enough to be meaningful). The dataset has 5,572 messages, which means:
- The training set will have 4,458 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).

To better understand the purpose of putting a test set aside, let's begin by observing that all 1,114 messages in our test set are already classified by a human. When the spam filter is ready, we're going to treat these messages as new and have the filter classify them. Once we have the results, we'll be able to compare the algorithm classification with that done by a human, and this way we'll see how good the spam filter really is.

The goal is to create a spam filter that classifies new messages with an accuracy greater than 80%.

In [5]:
# Randomize the data
SMS_rand = SMS.sample(frac = 1, random_state = 1)

# Calculate the index for split 
training_set_index = round(len(SMS_rand) * 0.8)

#Split into Training/Test
training_set = SMS_rand[: training_set_index].reset_index(drop= True)
test_set = SMS_rand[training_set_index:].reset_index(drop = True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


The data set has been split into a training set and test set. I will calculate the percentage of spam and ham in both sets to check if it is similar to the base data set percentages.

In [6]:
# Count values in training set
training_set['Label'].value_counts()

ham     3858
spam     600
Name: Label, dtype: int64

There are 3858 "ham" messages and 600 "spam" messages in the training set

In [7]:
# Calculate percentage of ham and spam messages relative to total in the training set
training_ham_per = (3858 / 4458) * 100 
training_spam_per = (600 / 4458) * 100

In [8]:
# Count values in test set
test_set['Label'].value_counts()

ham     967
spam    147
Name: Label, dtype: int64

In [9]:
# Calculate percentage of ham and spam messages relative to total in the test set
test_ham_per = (967 / 1114) * 100
test_spam_per = (147 / 1114) * 100

Now to take a look at what the percentages are like in the two sets

In [10]:
print('training')
print(training_ham_per)
print(training_spam_per)
print('-------------------')
print('test')
print(test_ham_per)
print(test_spam_per)

training
86.54104979811575
13.458950201884253
-------------------
test
86.80430879712748
13.195691202872531


The training and test sets both have percentages that are essentially the same as the full dataset's 87% ham and 13% spam.

# Letter Case and Punctuation

Before building the spam filter, some quick data cleaning is needed to remove the punctuation and bring all the words to lower case. 

I will use a regular expression with the pandas string methods str.replace( ) and str.lower( ) to accomplish this.

In [11]:
# Before cleaning
training_set.head(3)

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired


Looks like I need to remove punctuation and make sure all the words are lower case so that everything is standardized.

In [12]:
# After cleaning 
training_set["SMS"] = training_set["SMS"].str.replace(r'\W', ' ', regex=True).str.lower()  # regex=True so \W is treated as a pattern
training_set.head(3)


Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
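
To see what the cleaning step does in isolation, here is a small sketch on two made-up messages (the `messages` Series is hypothetical). It passes `regex=True` explicitly because, as far as I know, recent pandas releases treat the pattern as literal text when that argument is left out:

```python
import pandas as pd

# Hypothetical sample messages, not taken from the dataset
messages = pd.Series(['Hello, World!', 'FREE entry: call NOW'])

# \W matches any non-word character (punctuation, whitespace, symbols);
# each match is replaced by a space, then everything is lowercased
cleaned = messages.str.replace(r'\W', ' ', regex=True).str.lower()

# Splitting on whitespace afterwards drops the doubled spaces
tokens = cleaned.str.split()
print(tokens.tolist())
```

Replacing with a space rather than an empty string keeps words such as "camera/video" from being fused into one token.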


# Creating the Vocabulary

With the data cleaned, the spam filter needs a list of all the unique words in the training set.

In [13]:
# Split values in 'SMS' column by the whitespace
training_set['SMS'] = training_set['SMS'].str.split()
training_set['SMS'].head()

0                    [yep, by, the, pretty, sculpture]
1    [yes, princess, are, you, going, to, make, me,...
2                      [welp, apparently, he, retired]
3                                             [havent]
4    [i, forgot, 2, ask, ü, all, smth, there, s, a,...
Name: SMS, dtype: object

In [14]:
# Create an empty list 
vocabulary = []
for sms in training_set['SMS']: # Use a for loop to append all the unique words into the vocabulary list
    for word in sms:
        vocabulary.append(word)

vocabulary = list(set(vocabulary))

In [15]:
# Display the length of the list
len(vocabulary)

7783

There are 7,783 unique words across all the messages in the training set.
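
The loop-then-set approach above can also be written as a single set comprehension. This is a sketch on made-up data: `tokenised` is a hypothetical stand-in for `training_set['SMS']` after splitting.

```python
# Hypothetical tokenised messages standing in for training_set['SMS']
tokenised = [
    ['free', 'prize', 'now'],
    ['see', 'you', 'now'],
    ['free', 'free', 'call'],
]

# A set keeps each word once, no matter how often it occurs;
# sorting only makes the result easier to read
vocab = sorted({word for sms in tokenised for word in sms})
print(vocab)
print(len(vocab))
```

Both versions give the same vocabulary; the comprehension simply avoids building the intermediate list with duplicates.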

# The Final Training Set

Now that there's a list of all the unique words from the messages in the training set, I can create a dictionary that records how often each word appears in each message and convert it into a DataFrame. After creating the new DataFrame, I can combine it with the training set into an updated training set to use for the spam filter.

In [16]:
# Create a dictionary for the unique values
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

# Use a for loop to count the unique values and how often each appears
for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

Now that I have the dictionary made, I can convert it into a DataFrame.

In [17]:
# Convert the dictionary into a DataFrame
word_count = pd.DataFrame(word_counts_per_sms)
word_count.head() 

Unnamed: 0,offdam,maybe,tellmiss,cage,fr,strange,drunk,langport,barrel,18,...,la3,reference,real1,visiting,london,trackmarque,08715203649,anot,freaky,eighth
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


With the new DataFrame, I can concatenate it to the training set to use for the spam filter.

In [18]:
# Concatenate the newly created DataFrame to the cleaned training set
training_set_clean = pd.concat([training_set, word_count], axis = 1)
training_set_clean.head()

Unnamed: 0,Label,SMS,offdam,maybe,tellmiss,cage,fr,strange,drunk,langport,...,la3,reference,real1,visiting,london,trackmarque,08715203649,anot,freaky,eighth
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
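
For comparison, the per-message word counts can also be built with `collections.Counter`. This is an alternative sketch on made-up data (the `tokenised` list is hypothetical), not the approach used in the cells above:

```python
from collections import Counter

import pandas as pd

# Hypothetical tokenised messages standing in for training_set['SMS']
tokenised = [['free', 'prize', 'free'], ['see', 'you'], ['free', 'call']]

# One Counter per message; pandas lines the words up as columns and
# leaves NaN where a word is absent, which is then filled with 0
counts = pd.DataFrame([Counter(sms) for sms in tokenised]).fillna(0).astype(int)
print(counts)
```

Each row is one message and each column is one vocabulary word, which is the same layout as the table above.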


# Calculating Constants

I have a clean data set to use, so I can begin to create the spam filter. I will use the Naive Bayes algorithm to classify new messages, and for it to work the algorithm needs the probability values from the following two equations:

$$ P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) $$

$$ P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham) $$

I also need the following equations to find $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above:

$$ P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} $$

$$ P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}} $$

Some of the terms in these equations have the same value for every new message, so I can compute them once and reuse them whenever a new message comes in. The constants I want to calculate are:
- $P(Spam)$ and $P(Ham)$
- $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$

Here $N_{Spam}$ and $N_{Ham}$ are the total number of words in all spam and all ham messages, and $N_{Vocabulary}$ is the number of unique words in the vocabulary. I will also use Laplace smoothing and set $\alpha = 1$.
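
A small worked example may help show what the smoothing term does. The counts below are hypothetical and chosen only to illustrate the formula; all names ending in `_toy` are made up for this sketch.

```python
# Hypothetical counts, chosen only to illustrate the formula
n_word_given_spam_toy = {'free': 3, 'call': 1, 'hello': 0}  # N_{w_i|Spam}
n_spam_toy = 4        # N_Spam: total words across all spam messages
n_vocabulary_toy = 3  # N_Vocabulary: number of unique words
alpha_toy = 1         # Laplace smoothing

# P(w_i|Spam) = (N_{w_i|Spam} + alpha) / (N_Spam + alpha * N_Vocabulary)
smoothed = {
    word: (count + alpha_toy) / (n_spam_toy + alpha_toy * n_vocabulary_toy)
    for word, count in n_word_given_spam_toy.items()
}
print(smoothed)
```

The word "hello" never appears in the toy spam messages, yet it gets a probability of 1/7 rather than 0, so a single unseen word cannot force the whole product to zero.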

In [19]:
# Isolate spam and ham messages first
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

# P(Spam) & P(Ham)
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

#N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace Smoothing
alpha = 1

# Calculating Parameters

With my constants defined, I can move on to computing the parameters for the spam filter, which are the conditional probability values for each word in the vocabulary. I can calculate them with the following formulas:

$$ P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} $$

$$ P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}} $$



In [20]:
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Use a for loop to calculate parameters
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()   # spam_messages already defined in a cell above
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   # ham_messages already defined in a cell above
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham

# Classifying A New Message
At this point, I have my constants and parameters defined, so I can begin creating the spam filter. The spam filter will essentially work as a function that does the following:
- Takes in as input a new message (w1, w2, ..., wn).
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
 - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
 - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
 - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.
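
The steps above can be sketched on toy numbers. Note that this sketch is a variation, not the method used in this notebook: it adds logarithms instead of multiplying probabilities, which avoids very small products for long messages while leaving the comparison unchanged. All priors, word probabilities, and names ending in `_toy` are hypothetical.

```python
import math
import re

# Hypothetical priors and word probabilities, for illustration only
p_spam_toy, p_ham_toy = 0.2, 0.8
params_spam_toy = {'free': 0.5, 'prize': 0.3, 'hello': 0.2}
params_ham_toy = {'free': 0.1, 'prize': 0.1, 'hello': 0.8}

def classify_log(message):
    """Return 'spam', 'ham' or 'needs human classification' from log scores."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(p_spam_toy)
    log_ham = math.log(p_ham_toy)
    for word in words:
        # Words outside the vocabulary are skipped
        if word in params_spam_toy:
            log_spam += math.log(params_spam_toy[word])
        if word in params_ham_toy:
            log_ham += math.log(params_ham_toy[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log('FREE prize!'))
print(classify_log('Hello'))
```

Because the logarithm is increasing, comparing the two log scores gives the same label as comparing the two products.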

In [21]:
# Import the regular expressions library 
import re

# Define a function that classifies new messages as Spam/Ham
def classify(message):
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

Above, I defined a function that classifies any new message as spam or ham. Let's test whether it works properly.

In [22]:
# Test the function to see if it classifies the messages inputted correctly
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


I used a message that was already confirmed to be spam to see if the filter would classify it accurately. It seems to be working correctly, so let's try another one, this time a message that has already been confirmed to be ham.

In [23]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


The spam filter works on the two messages I tested, so I can move on to the test set.

# Measuring the Spam Filter's Accuracy

As stated above, the two results look promising, but I want to know how accurate the filter is on the test set, which has 1,114 messages.

I'll test this by writing an updated function that returns classification labels instead of printing them.

In [24]:
# Create a function that returns classification labels instead of printing them
def classify_test_set(message):    
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

Now that I've written my updated function, I'll apply it to my test set and add the results into a new column. 

In [25]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


I want to see how accurate the spam filter is, so I'll count how many predictions match the human labels.

In [26]:
correct = 0
total = test_set.shape[0]
    
for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833


The spam filter has an accuracy of about 98.74% which is great. Out of 1114 messages that it hasn't seen in the training set, it classified 1100 correctly. 
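
The same accuracy figure can be computed without an explicit loop, and a cross-tabulation shows which kind of mistake was made. This is a sketch on a made-up `results` DataFrame (hypothetical data with the same column names as the test set):

```python
import pandas as pd

# Hypothetical labels and predictions, for illustration only
results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'spam', 'spam', 'ham', 'ham'],
})

# Comparing the columns gives a boolean Series; its mean is the accuracy
accuracy = (results['Label'] == results['predicted']).mean()
print(accuracy)

# Rows are the human labels, columns are the filter's predictions
confusion = pd.crosstab(results['Label'], results['predicted'])
print(confusion)
```

With a class split as uneven as 87% ham to 13% spam, the cross-tabulation is worth a look alongside accuracy, since it separates spam that slipped through from ham that was wrongly flagged.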

# Isolating the Messages that were Classified Incorrectly
Although the spam filter has great accuracy, I should take a look at the 14 messages that were classified incorrectly to see why.

I'll begin by isolating the 14 incorrect messages and doing a quick exploration of them.

In [27]:
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [28]:
# Create three empty lists to hold values 
incorrect_messages = []
incorrect_label = []
incorrect_predicted = []
for row in test_set.iterrows(): # Use a for loop to append values to respective lists
    row = row[1]
    if row['Label'] != row['predicted']:
        incorrect_messages.append(row['SMS'])
        incorrect_label.append(row['Label'])
        incorrect_predicted.append(row['predicted'])

Check if the messages pulled look right.

In [29]:
incorrect_messages

['Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net',
 "More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB",
 'Unlimited texts. Limited minutes.',
 '26th OF JULY',
 'Nokia phone is lovly..',
 'A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d girl,d boy ran like hell n saved her. She asked \'hw cn u run so fast?\' D boy replied "Boost is d secret of my energy" n instantly d girl shouted "our energy" n Thy lived happily 2gthr drinking boost evrydy Moral of d story:- I hv free msgs:D;): gud ni8',
 'No calls..messages..missed calls',
 'We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us',
 "Oh

I will create a dataframe that displays the values for the incorrect messages, their label, and their predicted label.

In [30]:
# Convert the lists into dataframes
message_db = pd.DataFrame(incorrect_messages, columns = ['incorrect messages'])
label_db = pd.DataFrame(incorrect_label, columns = ['Label'])
predicted_db = pd.DataFrame(incorrect_predicted, columns = ['predicted'])

# Merge all three dataframes into one 
incorrect_db = message_db.merge(label_db, how = 'outer', left_index = True, right_index = True)
incorrect_db = incorrect_db.merge(predicted_db, how = 'outer', left_index = True, right_index = True)
pd.set_option("display.max_colwidth", None) # Displays full text in columns
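
The three lists and two merges above can also be collapsed into one boolean filter. This is a sketch on a made-up `results` DataFrame (hypothetical rows with the same column names as the test set):

```python
import pandas as pd

# Hypothetical rows standing in for test_set
results = pd.DataFrame({
    'Label':     ['ham', 'spam', 'ham'],
    'SMS':       ['see you soon', 'win a prize', 'limited minutes'],
    'predicted': ['ham', 'ham', 'spam'],
})

# Keep only the rows where the prediction disagrees with the human label
misclassified = results[results['Label'] != results['predicted']].reset_index(drop=True)
print(misclassified)
```

Filtering keeps the message, its label, and its prediction together in one step, so there is nothing to merge afterwards.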

In [31]:
incorrect_db

Unnamed: 0,incorrect messages,Label,predicted
0,Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net,spam,ham
1,More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB,spam,ham
2,Unlimited texts. Limited minutes.,ham,spam
3,26th OF JULY,ham,spam
4,Nokia phone is lovly..,ham,spam
5,"A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d girl,d boy ran like hell n saved her. She asked 'hw cn u run so fast?' D boy replied ""Boost is d secret of my energy"" n instantly d girl shouted ""our energy"" n Thy lived happily 2gthr drinking boost evrydy Moral of d story:- I hv free msgs:D;): gud ni8",ham,needs human classification
6,No calls..messages..missed calls,ham,spam
7,"We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us",ham,spam
8,"Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50",spam,ham
9,"Hi babe its Chloe, how r u? I was smashed on saturday night, it was great! How was your weekend? U been missing me? SP visionsms.com Text stop to stop 150p/text",spam,ham


The spam filter labeled these 14 messages incorrectly. One possible reason:
- The test set wasn't standardized like the training set was, so it may contain tokens that the filter could not match against the vocabulary.

# Cleaning the Test Set & Rerunning Spam Filter
After exploring the incorrect messages, I can see that the test set wasn't standardized, so I'm going to go back, clean it up, and rerun the spam filter.

In [32]:
test_set_up = test_set.drop(columns = ['predicted']).copy()

# Before cleaning
test_set_up.head()

Unnamed: 0,Label,SMS
0,ham,Later i guess. I needa do mcat study too.
1,ham,But i haf enuff space got like 4 mb...
2,spam,Had your mobile 10 mths? Update to latest Orange camera/video phones for FREE. Save £s with Free texts/weekend calls. Text YES for a callback orno to opt out
3,ham,All sounds good. Fingers . Makes it difficult to type
4,ham,"All done, all handed in. Don't know if mega shop in asda counts as celebration but thats what i'm doing!"


In [33]:
# After cleaning 
test_set_up["SMS"] = test_set_up["SMS"].str.replace(r'\W', ' ', regex=True).str.lower()  # regex=True so \W is treated as a pattern
test_set_up.head(3)


Unnamed: 0,Label,SMS
0,ham,later i guess i needa do mcat study too
1,ham,but i haf enuff space got like 4 mb
2,spam,had your mobile 10 mths update to latest orange camera video phones for free save s with free texts weekend calls text yes for a callback orno to opt out


I'm going to create a new 'predicted' column in the updated test set

In [34]:
# Create new 'predicted' column
test_set_up['predicted'] = test_set_up['SMS'].apply(classify_test_set)
test_set_up

Unnamed: 0,Label,SMS,predicted
0,ham,later i guess i needa do mcat study too,ham
1,ham,but i haf enuff space got like 4 mb,ham
2,spam,had your mobile 10 mths update to latest orange camera video phones for free save s with free texts weekend calls text yes for a callback orno to opt out,spam
3,ham,all sounds good fingers makes it difficult to type,ham
4,ham,all done all handed in don t know if mega shop in asda counts as celebration but thats what i m doing,ham
...,...,...,...
1109,ham,we re all getting worried over here derek and taylor have already assumed the worst,ham
1110,ham,oh oh den muz change plan liao go back have to yan jiu again,ham
1111,ham,ceri u rebel sweet dreamz me little buddy c ya 2moro who needs blokes,ham
1112,spam,text meet someone sexy today u can find a date or even flirt its up to u join 4 just 10p reply with name age eg sam 25 18 msg recd thirtyeight pence,spam


Measure the spam filter's accuracy on the updated test set.

In [35]:
correct = 0
total = test_set_up.shape[0]
    
for row in test_set_up.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833


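Even after standardizing the test set, the spam filter still has an accuracy of 98.74%, which means that punctuation and letter case weren't the issue. This is expected: classify_test_set already strips non-word characters and lowercases each message before scoring it, so cleaning the test set beforehand does not change what the filter sees.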
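
A quick check illustrates why the rerun could not change the result. The helper `tokenise` below is hypothetical; it only repeats the cleaning steps that classify_test_set applies to every message, here on one of the misclassified messages listed earlier.

```python
import re

def tokenise(message):
    # Same cleaning steps as classify_test_set:
    # strip non-word characters, lowercase, split on whitespace
    return re.sub(r'\W', ' ', message).lower().split()

raw = 'Unlimited texts. Limited minutes.'
pre_cleaned = re.sub(r'\W', ' ', raw).lower()  # what the cleaned test set contains

print(tokenise(raw))
print(tokenise(raw) == tokenise(pre_cleaned))
```

Cleaning an already cleaned message yields the same word list, so the filter scores both versions identically.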

# Conclusion
In this project, the goal was to create an SMS filter that would sort messages as spam or ham (non-spam) with an accuracy of more than 80%. After cleaning the data set and computing the constants and parameters needed for the multinomial Naive Bayes algorithm, the spam filter was created and reached an accuracy of 98.74%.

Taking a closer look at the 14 messages that were classified incorrectly, I noticed that the test_set was not standardized like the training_set was, which could have led to issues with sorting in the spam filter. After standardizing the test set, however, the accuracy of the spam filter did not change, which suggests that the spam filter isn't complex enough to sort abbreviations/acronyms in the messages. All in all, even with the incorrectly filtered messages, the spam filter still had an accuracy of 98.74%, which is a very good result.