# Building a Spam Filter with Naive Bayes

In this guided project, we're going to study the practical side of the algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, the computer:

1) Learns how humans classify messages.

2) Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.

3) Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.



# Open and Read File


In [1]:
import pandas as pd


In [2]:
#The data points are tab separated, so we'll need to use the sep='\t' parameter

# The dataset doesn't have a header row, which means we need to use the header=None parameter,
#otherwise the first row will be wrongly used as the header row.

# Use the names=['Label', 'SMS'] parameter to name the columns 
#as Label and SMS.


sms_spam = pd.read_csv('SMSSpamCollection', sep = '\t', header = None,
                      names = ['Label', 'SMS'])
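
To illustrate what `sep = '\t'` and `header = None` are dealing with, here is a minimal sketch that parses two made-up lines laid out like the dataset file (label, tab, message). It uses only the standard library's csv module rather than pandas, and the sample text is an assumption for illustration, not taken from the dataset.

```python
import csv
import io

# Two made-up lines in the same layout as the dataset file: label, tab, message.
sample = "ham\tSee you at lunch?\nspam\tWIN a FREE prize, call now!\n"

# delimiter="\t" plays the role that sep='\t' plays in pd.read_csv.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))

# The file has no header row, so every line is data and we name the columns ourselves.
records = [dict(zip(["Label", "SMS"], row)) for row in rows]
print(records)
```

If the first line were treated as a header, the first message would be lost as data, which is what `header = None` prevents.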

# Explore Dataset

Find how many rows and columns it has.

Find what percentage of the messages is spam and what percentage is ham ("ham" means non-spam).

In [3]:
rows, columns = sms_spam.shape
print('The SMS Spam Collection dataset has {} rows and {} columns'.format(
    rows, columns))

The SMS Spam Collection dataset has 5572 rows and 2 columns


In [4]:
sms_spam.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
sms_spam['Label'].value_counts(normalize = True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

About 86.6% of the messages in the SMS Spam Collection dataset are ham (i.e., non-spam), while about 13.4% are spam.
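
As a rough cross-check of these proportions, the sketch below rebuilds them with plain Python. The counts 4,825 and 747 are not printed anywhere in this notebook; they are an assumption reconstructed from the 5,572 rows and the proportions shown above.

```python
from collections import Counter

# Assumed counts, reconstructed from 5,572 rows and the printed proportions.
labels = ["ham"] * 4825 + ["spam"] * 747

counts = Counter(labels)
total = sum(counts.values())

# Same idea as value_counts(normalize = True): each count divided by the total.
proportions = {label: count / total for label, count in counts.items()}
print(proportions)
```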

# Training and Test Set

Now that we've become a bit familiar with the dataset, we can move on to building the spam filter.

However, before creating it, it's very helpful to first think of a way of testing how well it works. When creating software (a spam filter is software), a good rule of thumb is that designing the test comes before creating the software. If we write the software first, then it's tempting to come up with a biased test just to make sure the software passes it.

Once our spam filter is done, we'll need to test how good it is with classifying new messages. To test the spam filter, we're first going to split our dataset into two categories:

* A training set, which we'll use to "train" the computer how to classify messages.

* A test set, which we'll use to test how good the spam filter is with classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

* The training set will have 4,458 messages (about 80% of the dataset).

* The test set will have 1,114 messages (about 20% of the dataset).
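
The two sizes follow directly from the 80/20 split. A small sketch of the arithmetic (the variable names here are illustrative):

```python
total_messages = 5572

# 80% of 5,572 is 4,457.6, which rounds to 4,458 training messages.
train_size = round(total_messages * 0.80)

# Everything that is not used for training goes to the test set.
test_size = total_messages - train_size

print(train_size, test_size)
```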

To better understand the purpose of putting a test set aside, let's begin by observing that all 1,114 messages in our test set are already classified by a human. When the spam filter is ready, we're going to treat these messages as new and have the filter classify them. Once we have the results, we'll be able to compare the algorithm classification with that done by a human, and this way we'll see how good the spam filter really is.


For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

Let's create a training set and a test set. We're going to start by randomizing the entire dataset so that spam and ham messages are spread evenly throughout it.


# Randomize the Dataset

In [6]:
sms_spam_random = sms_spam.sample(frac = 1, random_state = 1)
sms_spam_random.head()

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


# Split randomized data into train-test

The training set should account for 80% of the dataset, and the remaining 20% of the data should be the test set.

Reset the index labels for both data sets — the index labels remained unordered after randomization. You can use the DataFrame.reset_index() method.

In [7]:
# calculate index for split

training_test_index = round(len(sms_spam_random) * 0.80)
training_test_index

4458

In [8]:
training_set = sms_spam_random[:training_test_index]
training_set.head()

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [9]:
training_set = sms_spam_random[:training_test_index].reset_index()
training_set.head(3)

Unnamed: 0,index,Label,SMS
0,1078,ham,"Yep, by the pretty sculpture"
1,4028,ham,"Yes, princess. Are you going to make me moan?"
2,958,ham,Welp apparently he retired


In [10]:
training_set = sms_spam_random[:training_test_index].reset_index(drop = True)
training_set.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [11]:
test_set = sms_spam_random[training_test_index:].reset_index(drop = True)
test_set.head(3)

Unnamed: 0,Label,SMS
0,ham,Later i guess. I needa do mcat study too.
1,ham,But i haf enuff space got like 4 mb...
2,spam,Had your mobile 10 mths? Update to latest Oran...


Find the percentage of spam and ham in both the training and the test set. Are the percentages similar to what we have in the full dataset?

In [12]:
training_set['Label'].value_counts(normalize = True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [13]:
test_set['Label'].value_counts(normalize = True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

The percentages of ham (non-spam) and spam messages in the training set and the test set are very close to those in the full dataset.

# Data Cleaning

To calculate all the probabilities the algorithm needs, we'll first perform a bit of data cleaning to bring the data into a format that lets us easily extract all the information we need.




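1) Remove all the punctuation from the SMS column. You can use the regex '\W' to detect any non-word character, that is, any character that is not a letter, a digit or an underscore (in Python 3 this also keeps non-ASCII letters such as ü).

* For instance, re.sub('\W', ' ', 'Secret!! Money, goods.') replaces each punctuation mark with a space and outputs the string 'Secret   Money  goods ' (the extra spaces disappear later, when we split each message into words).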


* For simplicity, you can use the Series.str.replace() method.

2) For each message, transform every letter in every word to lower case. You may want to use the Series.str.lower() method.
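
The two cleaning steps can be sketched on single strings with the standard library. The helper name clean_message is hypothetical and only used for illustration; the notebook itself applies the same steps to the whole SMS column.

```python
import re


def clean_message(text):
    """Replace every non-word character with a space, then lower-case the result."""
    no_punctuation = re.sub(r"\W", " ", text)
    return no_punctuation.lower()


cleaned = clean_message("Secret!! Money, goods.")

# Splitting on whitespace discards the extra spaces left behind by the substitution.
words = cleaned.split()
print(repr(cleaned), words)
```

Note that underscores and non-ASCII letters count as word characters, so they are left in place.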

In [14]:
# Before Cleaning

training_set.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [15]:
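# regex=True is needed in pandas 2.0 and later, where str.replace() treats the pattern literally by default
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)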
training_set['SMS'] = training_set['SMS'].str.lower()

In [16]:
training_set.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


In [17]:
training_set.tail()

Unnamed: 0,Label,SMS
4453,ham,sorry i ll call later in meeting any thing re...
4454,ham,babe i fucking love you too you know fuck...
4455,spam,u ve been selected to stay in 1 of 250 top bri...
4456,ham,hello my boytoy geeee i miss you already a...
4457,ham,wherre s my boytoy


# Creating The Vocabulary

With the exception of the "Label" column, every other column in the transformed table above represents a unique word in our vocabulary (more specifically, each column shows the frequency of that unique word for any given message).

We'll eventually bring the training set to that format ourselves, but first, let's create a list with all of the unique words that occur in the messages of our training set.

# Instructions

Create a vocabulary for the messages in the training set. The vocabulary should be a Python list containing all the unique words across all messages, where each word is represented as a string.

* Begin by transforming each message from the SMS column into a list by splitting the string at the space character — use the Series.str.split() method.


* Initiate an empty list named vocabulary.


* Iterate over the SMS column (each message in this column should be a list of strings by the time you start this loop).

* Using a nested loop, iterate over each message in the SMS column (each message should be a list of strings) and append each string (word) to the vocabulary list.


* Transform the vocabulary list into a set using the set() function. This will remove the duplicates from the vocabulary list.


* Transform the vocabulary set back into a list using the list() function.
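
The same steps can be condensed into a set comprehension. This is a sketch on three tiny messages (the third one is made up), not the approach used in the cells that follow; sorting is optional and only makes the result reproducible.

```python
# Each message is already a list of lower-case words, as after str.split().
messages = [
    ["yep", "by", "the", "pretty", "sculpture"],
    ["welp", "apparently", "he", "retired"],
    ["he", "retired", "by", "the", "sea"],  # made-up message with repeated words
]

# The set removes duplicates; sorted() turns it back into a list in a stable order.
vocabulary = sorted({word for sms in messages for word in sms})
print(len(vocabulary), vocabulary)
```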

In [18]:
training_set['SMS'] = training_set['SMS'].str.split()

In [19]:
training_set['SMS'].head(3)

0                    [yep, by, the, pretty, sculpture]
1    [yes, princess, are, you, going, to, make, me,...
2                      [welp, apparently, he, retired]
Name: SMS, dtype: object

In [20]:
vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))


In [21]:
len(vocabulary)

7783

There are 7783 unique words in our vocabulary

# The Final Training Set

We managed to create the vocabulary for the messages in the training set. Now we're going to use the vocabulary to make the data transformation we need.

Eventually, we're going to create a new DataFrame. However, we'll first build a dictionary that we'll then convert to the DataFrame we need.
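
To make the target structure concrete, here is a minimal sketch of that dictionary for two made-up messages. Each key is a word, and each value is a list with one count per message.

```python
# Two made-up, already cleaned and split messages.
messages = [
    ["secret", "party", "secret"],
    ["secret", "code"],
]
vocabulary = sorted({word for sms in messages for word in sms})

# Start every word with a zero count for every message.
word_counts_per_sms = {word: [0] * len(messages) for word in vocabulary}

# Increment the count at the position of the message the word appears in.
for index, sms in enumerate(messages):
    for word in sms:
        word_counts_per_sms[word][index] += 1

print(word_counts_per_sms)
```

Passing a dictionary of this shape to pd.DataFrame() yields one column per word and one row per message.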



In [22]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [23]:
words_counts = pd.DataFrame(word_counts_per_sms)
words_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


Concatenate the DataFrame we just built above with the DataFrame containing the training set (this way, we'll also have the Label and the SMS columns). Use the pd.concat() function.

In [40]:
training_set_clean = pd.concat([training_set, words_counts], axis = 1)

In [41]:
training_set_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


# Calculating Constants First

Using the training set only:

1) Calculate P(Spam) and P(Ham). There's more than one way to write the code that can calculate this — feel free to choose any solution you want.

2) Calculate NSpam, NHam, NVocabulary. Feel free to choose any programming solution you like.

3) Initiate a variable named alpha with a value of 1.
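
As a sketch of what these constants mean, the block below computes them for a made-up training set of four messages using plain Python. The data and variable names are illustrative assumptions; the cells that follow compute the real values from training_set_clean.

```python
# Made-up training data: (label, list of words).
training = [
    ("spam", ["secret", "prize"]),
    ("ham", ["see", "you", "soon"]),
    ("ham", ["secret", "party"]),
    ("spam", ["prize", "prize", "now"]),
]

spam = [words for label, words in training if label == "spam"]
ham = [words for label, words in training if label == "ham"]

# P(Spam) and P(Ham): share of messages with each label.
p_spam = len(spam) / len(training)
p_ham = len(ham) / len(training)

# NSpam and NHam: total number of words (with repeats) in each class.
n_spam = sum(len(words) for words in spam)
n_ham = sum(len(words) for words in ham)

# NVocabulary: number of unique words across all messages.
n_vocabulary = len({word for _, words in training for word in words})

alpha = 1  # Laplace smoothing
print(p_spam, p_ham, n_spam, n_ham, n_vocabulary)
```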

In [42]:
# Separate the spam messages

spam = training_set_clean[training_set_clean['Label'] == 'spam']
spam.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
16,spam,"[freemsg, why, haven, t, you, replied, to, my,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18,spam,"[congrats, 2, mobile, 3g, videophones, r, your...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
56,spam,"[free, message, activate, your, 500, free, tex...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60,spam,"[call, from, 08702490080, tells, u, 2, call, 0...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
61,spam,"[someone, has, conacted, our, dating, service,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:
spam["Label"].value_counts()

spam    600
Name: Label, dtype: int64

In [44]:
training_set_clean.shape

(4458, 7785)

In [45]:
len(training_set_clean)

4458

# Probability of Spam

In [49]:
p_spam = len(spam)/len(training_set_clean)
p_spam

0.13458950201884254

In [47]:
# separate the ham(non-spam) labels

ham = training_set_clean[training_set_clean['Label'] == 'ham']
ham.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


# Probability of ham (non-spam) 

In [48]:
p_ham = len(ham) / len(training_set_clean)
p_ham

0.8654104979811574

# Calculate number of words present in spam messages

In [51]:
n_words_spam_message = spam['SMS'].apply(len)
n_words_spam = n_words_spam_message.sum()
n_words_spam

15190

# Calculate Number of words present in ham messages

In [52]:
n_words_ham_message = ham['SMS'].apply(len)
n_words_ham = n_words_ham_message.sum()
n_words_ham

57237

# Length of vocabulary

In [60]:
n_vocabulary = len(vocabulary)

# Initiate a variable named alpha with a value of 1 (Laplace Smoothing)

In [54]:
alpha = 1

# Calculating Parameters

The steps we take to calculate P("secret"|Spam) will be identical for both of our new messages above, or for any other new message that contains the word "secret". The key detail here is that calculating P("secret"|Spam) only depends on the training set, and as long as we don't make changes to the training set, P("secret"|Spam) stays constant. The same reasoning also applies to P("secret"|Ham).

This means that we can use our training set to calculate the probability for each word in our vocabulary. If our vocabulary contained only the words "lost", "navigate", and "sea", then we'd need to calculate six probabilities:

* P("lost"|Spam) and P("lost"|Ham)

* P("navigate"|Spam) and P("navigate"|Ham)

* P("sea"|Spam) and P("sea"|Ham)

We have 7,783 words in our vocabulary, which means we'll need to calculate a total of 15,566 probabilities. For each word, we need to calculate both P(wi|Spam) and P(wi|Ham).

In more technical language, the probability values that P(wi|Spam) and P(wi|Ham) will take are called parameters.

Initialize two dictionaries, where each key is a unique word from our vocabulary (represented as a string) and each value is 0. We'll need one dictionary to store the parameters for P(wi|Spam), and the other for P(wi|Ham).

If the entire vocabulary were ['sea', 'navigate'], we'd need to initialize two dictionaries, one for spam and one for ham, and both should look like this: {'sea': 0, 'navigate': 0}.

In [57]:
parameters_spam = {unique_word : 0 for unique_word in vocabulary}
parameters_ham = {unique_word : 0 for unique_word in vocabulary}

Iterate over the vocabulary, and, for each word, calculate P(wi|Spam) and P(wi|Ham) using the formulas we mentioned above.

* Recall that NSpam, NHam, NVocabulary, and α were already calculated in the previous section.

* Recall that Nwi|Spam is equal to the number of times the word wi occurs in all the spam messages, while Nwi|Ham is equal to the number of times the word wi occurs in all the ham messages.

* Once you're done with calculating an individual parameter, update the probability value in the two dictionaries you created initially.
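
The formulas are not reproduced in this section. Judging from how alpha, n_words_spam, n_words_ham and n_vocabulary are used in this project, they are presumably the usual Laplace-smoothed estimates:

```latex
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
\qquad
P(w_i \mid \text{Ham}) = \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
```

With α = 1, a word that never occurs in one class still gets a small non-zero probability for that class instead of zeroing out the whole product.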

In [62]:
for word in vocabulary:
    n_words_given_spam = spam[word].sum() # spam defined above
    p_words_given_spam = (n_words_given_spam + alpha)/ (n_words_spam + alpha * n_vocabulary)
    parameters_spam[word] = p_words_given_spam
    
    # for non-spam messages
    n_words_given_ham = ham[word].sum()
    p_words_given_ham = (n_words_given_ham + alpha)/(n_words_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_words_given_ham
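
To see the smoothing at work on numbers small enough to check by hand, here is a sketch on a made-up dataset. It uses Counter and Fraction from the standard library so the results are exact; the helper name smoothed is hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Made-up, already cleaned messages.
spam_messages = [["secret", "prize"], ["prize", "prize", "now"]]
ham_messages = [["see", "you", "soon"], ["secret", "party"]]

vocabulary = sorted({w for sms in spam_messages + ham_messages for w in sms})
n_vocabulary = len(vocabulary)                    # 7 unique words
n_spam = sum(len(sms) for sms in spam_messages)   # 5 words in spam
n_ham = sum(len(sms) for sms in ham_messages)     # 5 words in ham
alpha = 1

spam_counts = Counter(w for sms in spam_messages for w in sms)
ham_counts = Counter(w for sms in ham_messages for w in sms)


def smoothed(count, n_class):
    """Laplace-smoothed estimate: (count + alpha) / (n_class + alpha * n_vocabulary)."""
    return Fraction(count + alpha, n_class + alpha * n_vocabulary)


parameters_spam = {w: smoothed(spam_counts[w], n_spam) for w in vocabulary}
parameters_ham = {w: smoothed(ham_counts[w], n_ham) for w in vocabulary}

print(parameters_spam["prize"], parameters_ham["prize"])
```

Here "prize" never occurs in ham, yet P("prize"|Ham) is 1/12 rather than 0, and the parameters for each class still sum to 1 over the vocabulary.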

# Classifying a new message

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:

* Takes in as input a new message (w1, w2, ..., wn)


* Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn)


* Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:

    * If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.

    * If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.

    * If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.
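
For reference, the two quantities being compared can be written as follows. This is the standard Naive Bayes form, which treats the words as conditionally independent given the class; it is stated here as a reminder rather than derived.

```latex
P(\text{Spam} \mid w_1, w_2, \ldots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})

P(\text{Ham} \mid w_1, w_2, \ldots, w_n) \propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})
```

Both expressions omit the same normalizing constant, so comparing them is enough to pick a label even though the values themselves are not true probabilities.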


For the classify() function below, note that:

1) The input variable message is assumed to be a string.

2) We perform a bit of data cleaning on the string message:

* We remove the punctuation using the re.sub() function.

* We bring all letters to lower case using the str.lower() method.

* We split the string at whitespace and transform it into a Python list using the str.split() method.

3) We calculate p_spam_given_message and p_ham_given_message by starting from p_spam and p_ham and multiplying by the parameter of each word in the message. Words that are not in the vocabulary are ignored.

4) We compare p_spam_given_message with p_ham_given_message and then print a classification label.

In [66]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]


    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

# Test the function



In [67]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [68]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
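
The values printed above are already around 1e-25 for short messages. For much longer messages, a product of many small factors can underflow to 0.0 in floating point. A common workaround, which this notebook does not use, is to add logarithms instead of multiplying probabilities. The sketch below shows that variant with made-up parameters; the function name classify_log and all the numbers are assumptions for illustration.

```python
import math


def classify_log(words, p_spam, p_ham, parameters_spam, parameters_ham):
    """Compare log-probabilities instead of raw products; unknown words are ignored."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        if word in parameters_spam:
            log_spam += math.log(parameters_spam[word])
        if word in parameters_ham:
            log_ham += math.log(parameters_ham[word])
    if log_ham > log_spam:
        return "ham"
    if log_spam > log_ham:
        return "spam"
    return "needs human classification"


# Made-up parameters for a two-word vocabulary.
toy_spam = {"secret": 0.4, "hello": 0.1}
toy_ham = {"secret": 0.1, "hello": 0.4}
print(classify_log(["secret", "secret"], 0.5, 0.5, toy_spam, toy_ham))

# Why logs help: 80 factors of 1e-5 underflow as a product but not as a sum of logs.
product = 1.0
for _ in range(80):
    product *= 1e-5
log_sum = 80 * math.log(1e-5)
print(product, log_sum)
```

Because the logarithm is strictly increasing, comparing the sums gives the same label as comparing the products whenever the products do not underflow.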


# Measuring the spam filter accuracy

We managed to create a spam filter, and we classified two new messages. We'll now try to determine how well the spam filter does on our test set of 1,114 messages.

The algorithm will output a classification label for every message in our test set, which we'll be able to compare with the actual label (given by a human). Note that, in training, our algorithm didn't see these 1,114 messages, so every message in the test set is practically new from the perspective of the algorithm.

First off, we'll change the classify() function that we wrote previously to return the labels instead of printing them.

In [69]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

# Create a new column named predicted using test set

In [70]:
test_set['Predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

Unnamed: 0,Label,SMS,Predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


# Accuracy

Now we can compare the predicted values with the actual values to measure how good our spam filter is with classifying new messages. To make the measurement, we'll use accuracy as a metric:

Accuracy = number of correctly classified messages / total number of classified messages
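
As a quick illustration of the metric, here is a sketch on four made-up label pairs, using plain lists rather than the test set:

```python
# Made-up actual and predicted labels for four messages.
actual = ["ham", "spam", "ham", "ham"]
predicted = ["ham", "spam", "spam", "ham"]

# Count the positions where the prediction matches the human label.
correct = sum(a == p for a, p in zip(actual, predicted))
total = len(actual)

accuracy = correct / total
print(correct, total, accuracy)
```

Three of the four predictions match, so the accuracy is 0.75.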


In [71]:
correct = 0
total = len(test_set['SMS'])
total

1114

In [78]:
# iterrows() yields (index, row) pairs; we only need the row
for _, row in test_set.iterrows():
    if row['Label'] == row['Predicted']:
        correct += 1

In [85]:
print('Correct = {}'.format(correct))
print('Incorrect = {}'.format(total - correct))
print('Accuracy = {} %'.format(round(correct / total * 100, 2)))

Correct = 1100
Incorrect = 14
Accuracy = 98.74 %



The accuracy is 98.74%, well above our goal of 80%. Our spam filter looked at 1,114 messages that it hadn't seen in training and classified 1,100 of them correctly.