# Building a Spam Filter with Naive Bayes

In this project, we're going to study the practical side of the algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we saw that the computer:

- Learns how humans classify messages.
- Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
- Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the The [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). You can also download the dataset directly [from this link](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection). The data collection process is described in more details on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

Let's start by reading in the dataset.

In [1]:
import pandas as pd
import numpy as np

spam_collection = pd.read_csv('SMSSpamCollection', sep='\t')
print(spam_collection.shape)
print('\n')
print(spam_collection.head())
print('\n')
print(spam_collection.tail())

(5571, 2)


    ham  \
0   ham   
1  spam   
2   ham   
3   ham   
4  spam   

  Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...  
0                      Ok lar... Joking wif u oni...                                                               
1  Free entry in 2 a wkly comp to win FA Cup fina...                                                               
2  U dun say so early hor... U c already then say...                                                               
3  Nah I don't think he goes to usf, he lives aro...                                                               
4  FreeMsg Hey there darling it's been 3 week's n...                                                               


       ham  \
5566  spam   
5567   ham   
5568   ham   
5569   ham   
5570   ham   

     Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...  
5566  This is the 2

The file structure is very simple. The first column contains information about how a human rater classified an SMS message, either as 'spam' or as 'ham' (not spam). The second columb contains the text message as strings. We will assign column names to work with them.

In [2]:
spam_collection.columns = ['label', 'message']
print(spam_collection.head())
spam_collection.describe()

  label                                            message
0   ham                      Ok lar... Joking wif u oni...
1  spam  Free entry in 2 a wkly comp to win FA Cup fina...
2   ham  U dun say so early hor... U c already then say...
3   ham  Nah I don't think he goes to usf, he lives aro...
4  spam  FreeMsg Hey there darling it's been 3 week's n...


Unnamed: 0,label,message
count,5571,5571
unique,2,5168
top,ham,"Sorry, I'll call later"
freq,4824,30


In [3]:
print(spam_collection['label'].value_counts(normalize='True'))

ham     0.865913
spam    0.134087
Name: label, dtype: float64


So we have 5571 labelled messages, of which 87% percent are being labelled as 'ham', so genuine non-spam messages. Amon the message content, we see that the most frequent unique message is "Sorry, I'll call later" (probably a preset response message).

## Training and test set

In order to test how well our spam-filter works, we start by setting up a training set and a test set:
- A training set, which we'll use to "train" the computer how to classify messages.
- A test set, which we'll use to test how good the spam filter is with classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,571 messages, which means that:

- The training set will have 4,457 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).

For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

In [4]:
# Randomize the dataset
randomized = spam_collection.sample(frac=1, random_state=1)
randomized.reset_index(drop=True, inplace=True)

# Split into training set and test set
training_set = randomized.iloc[0:4457, :]
training_set.reset_index(drop=True, inplace=True)
test_set = randomized.iloc[4457:, :]
test_set.reset_index(drop=True, inplace=True)

print(training_set['label'].value_counts(normalize='True'))
print('\n')
print(test_set['label'].value_counts(normalize='True'))

ham     0.865829
spam    0.134171
Name: label, dtype: float64


ham     0.866248
spam    0.133752
Name: label, dtype: float64


Looks like our randomization into the training and test set was successful. Each have the same share of spam as the original set.

# Letter case and punctuation

The next big step is to use the training set to teach the algorithm to classify new messages. But we'll first need to perform a bit of data cleaning to bring the data in a format that will allow us to extract easily all the information we need. Let's begin the data cleaning process by removing the punctuation and bringing all the words to lower case.

In [5]:
import re

# Removing all punctuation from the message column
training_set = training_set.copy()
training_set.loc[:, 'no_punc'] = training_set['message'].apply(lambda x: re.sub(r'\W', ' ', str(x)))

In [6]:
# Setting everything into lower case
training_set['no_punc'] = training_set['no_punc'].str.lower()

print(training_set.head())

  label                                            message  \
0   ham           Dunno he jus say go lido. Same time 930.   
1   ham  Slaaaaave ! Where are you ? Must I summon you ...   
2  spam  Call from 08702490080 - tells u 2 call 0906635...   
3  spam  You are guaranteed the latest Nokia Phone, a 4...   
4   ham                          TaKe CaRE n gET WeLL sOOn   

                                             no_punc  
0           dunno he jus say go lido  same time 930   
1  slaaaaave   where are you   must i summon you ...  
2  call from 08702490080   tells u 2 call 0906635...  
3  you are guaranteed the latest nokia phone  a 4...  
4                          take care n get well soon  


In [7]:
# Remove the original message column
training_set.drop('message', axis=1, inplace=True)

## Creating the vocabulary

Eventually, we want to create a new column for every word in the messages. In order to achieve that, we need to create a set that contains the complete vocabulary in the messages. This is what we will do now.

In [8]:
# Removing leading, trailing and multiple spaces
training_set['cleaned_no_punc'] = training_set['no_punc'].apply(lambda x: " ".join(x.split()))

# Splitting every message up
training_set['msg_content'] = training_set['cleaned_no_punc'].str.split(' ')

In [9]:
# Adding each word to a vocabulary list (training set only)
vocabulary = []
for item in training_set['msg_content']:
    for word in item:
        vocabulary.append(word)
        
# Transforming it to a set (removes duplicates)
vocabulary = set(vocabulary)

# Transforming back to list
vocabulary = list(vocabulary)

## The final training set

Having created the vocabulary, we now need to create a dictionary that counts each word in a message and appends the count as a value. This dictionary can then be turned into the formatted dataframe that we need.

In [10]:
# Initalizing a dictionary
# Each unique word gets a key and a list of 0 elements the length of the words in a messsage
word_counts_per_sms = {unique_word: [0] * len(training_set['msg_content']) for unique_word in vocabulary}

# Going through each word in the 'split' column, counting each word
for index, sms in enumerate(training_set['msg_content']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [11]:
# Transforming this into a DataFrame
word_counts_per_sms = pd.DataFrame(word_counts_per_sms)

# Deleting empty column
word_counts_per_sms = word_counts_per_sms.iloc[:, 1:]

# Concatenating this with the training set
training_set = pd.concat([training_set, word_counts_per_sms], axis=1)

In [12]:
print(training_set.head())

  label                                            no_punc  \
0   ham           dunno he jus say go lido  same time 930    
1   ham  slaaaaave   where are you   must i summon you ...   
2  spam  call from 08702490080   tells u 2 call 0906635...   
3  spam  you are guaranteed the latest nokia phone  a 4...   
4   ham                          take care n get well soon   

                                     cleaned_no_punc  \
0             dunno he jus say go lido same time 930   
1  slaaaaave where are you must i summon you to m...   
2  call from 08702490080 tells u 2 call 090663581...   
3  you are guaranteed the latest nokia phone a 40...   
4                          take care n get well soon   

                                         msg_content  0  00  000  000pes  \
0   [dunno, he, jus, say, go, lido, same, time, 930]  0   0    0       0   
1  [slaaaaave, where, are, you, must, i, summon, ...  0   0    0       0   
2  [call, from, 08702490080, tells, u, 2, call, 0...  0   0   

## Calculating constants first

Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter. Recall that the Naive Bayes algorithm will need to know the probability values of the two equations below to be able to classify new messages. We will calculate all the necessary quantities now.

In [13]:
p_spam = training_set['label'].value_counts(normalize='True')[0]
print(p_spam)

0.8658290329818263


In [14]:
p_ham = training_set['label'].value_counts(normalize='True')[1]
print(p_ham)

0.13417096701817366


In [15]:
training_set['msg_n_words'] = training_set.iloc[:, 4:].gt(0).sum(axis=1)

In [16]:
# Total words in spam messages
n_spam = training_set[training_set['label'] == 'spam']['msg_n_words'].sum()

# Total words in non-spam messages
n_ham = training_set[training_set['label'] == 'ham']['msg_n_words'].sum()

In [17]:
# Total vocabulary
n_vocabulary = len(vocabulary)

In [18]:
# Initiate a variable for Laplace smoothing
alpha = 1 

## Calculating parameters

With these constant terms out of the way, we can calculate the probability parameters for each word.

In [19]:
# Initializing two dictionaries to store the parameters
del vocabulary[0] # There was an empty string in the dictionary
parameters_spam = {unique_word: 0 for unique_word in vocabulary}
parameters_ham  = {unique_word: 0 for unique_word in vocabulary}

In [20]:
# Isolate spam and ham meassages in seperate DataFrames
spam = training_set[training_set['label'] == 'spam']
spam = spam.iloc[:, 4:7688] # only retaining the word columns
ham = training_set[training_set['label'] == 'ham']
ham = ham.iloc[:, 4:7688] # only retaining the word columns

In [21]:
# Calculate parameters
for word in vocabulary:
    n_word_given_spam = spam[word].sum()   
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham[word].sum()  
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham

## Classifying a new message

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. It will be a function that takes in a new message, calculates the relevant parameters, compares with the parameters that we already calculates and assess, whether it's more likely to be spam, ham, or if human help is needed.

The function cleasn the incoming message data before calculating the parameters. If a message contains vocabulary that we don't have in our vocabulary list, it will be ignored.

In [22]:
import re

def classify(message):
    '''
    message: a string
    '''
    
    message = re.sub('\W', ' ', message) # remove punctuation
    message = message.lower().split() # put into lower case
    
    p_spam_given_message = p_spam # initiate with some prior value
    p_ham_given_message = p_ham # initiate with some prior value

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word] # update the prior with the parameter calculated before
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word] # update the prior with the parameter calculated before
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message: # compare p_spam_given_message with p_ham_given_message
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')

In [25]:
# Test with a spam message
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 6.5927014002685205e-25
P(Ham|message): 1.084032851149266e-27
Label: Spam


In [27]:
# Test with a ham message
classify('Sounds good, Tom, then see u there')

P(Spam|message): 2.6554782486861542e-24
P(Ham|message): 9.505319981703068e-22
Label: Ham


In [28]:
# Test with a another ham message
classify('Argh, I miss you! When are you coming?')

P(Spam|message): 2.5753149323731747e-25
P(Ham|message): 3.623916458242027e-21
Label: Ham


## Measuring the spam filter's accuracy

We'll now try to determine how well the spam filter does on our test set of 1,114 messages.

The algorithm will output a classification label for every message in our test set, which we'll be able to compare with the actual label (given by a human). Note that, in training, our algorithm didn't see these 1,114 messages, so every message in the test set is practically new from the perspective of the algorithm. 

In [31]:
def classify_test_set(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

Now that we have a function that returns labels instead of printing them, we can use it to create a new column in our test set.

In [32]:
test_set['predicted'] = test_set['message'].apply(classify_test_set)
test_set.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


Unnamed: 0,label,message,predicted
0,ham,S...from the training manual it show there is ...,ham
1,spam,LIFE has never been this much fun and great un...,ham
2,spam,Am new 2 club & dont fink we met yet Will B gr...,spam
3,ham,Midnight at the earliest,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [34]:
# Measuring the accuracy
correct = 0
total = test_set.shape[0]

for row in test_set.iterrows():
    row = row[1]
    if row['label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1065
Incorrect: 49
Accuracy: 0.9560143626570916


The accuracy is close to 95.60%, which is pretty good. Our spam filter looked at 1,114 messages that it hasn't seen in training, and classified 1,065 correctly.

## Next Steps

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 95.60% on the test set we used, which is a pretty good result. Our initial goal was an accuracy of over 80%, and we managed to do way better than that.

Next steps include:

- Analyze the 49 messages that were classified incorrectly and try to figure out why the algorithm classified them incorrectly
- Make the filtering process more complex by making the algorithm sensitive to letter case