# Building a Spam Filter with the Naive Bayes Algorithm



In this project, we are going to build a spam filter for SMS messages using the Naive Bayes algorithm, specifically the multinomial Naive Bayes algorithm. The general workflow for building this spam filter consists of the following:
* Having the computer learn how humans classify messages by making use of an already labeled dataset of spam and ham (non-spam) messages.
* Using the training data to estimate the probabilities the algorithm needs to score new incoming messages, which here are the test data.
* Classifying each new message in the test data based on the estimated probabilities from the previous step: if the probability for spam is greater, the message is classified as spam. Otherwise, it is classified as ham.

The dataset to be used in this project can be downloaded from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more detail [here](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition). For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%, so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

# Exploring the Dataset

In [1]:
# Importing the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# Reading in the dataset as a pandas DataFrame
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [3]:
# Checking the dimensions and overview of the dataframe
sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [4]:
# First 5 rows
sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
# Checking out what percentage of messages is spam and ham
sms['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

From the above relative frequency distribution table, about 86.6% of the SMS messages in the dataset are ham and 13.4% of the SMS messages are spam. Now that we are more familiar with the dataset, we can move on to building the spam filter.

# Train Test Split

First, we need to split our dataset into a training and test set. The training set would be 80% of our dataset and the test set would be 20% of our dataset.

In [6]:
# Randomize the rows in the dataset to ensure the spam and ham messages are spread across the entire dataset evenly
randomized = sms.sample(frac=1, random_state=1)

# Calculate the number of messages to be included in the training set
training_num = round(randomized.shape[0] * 0.8)

# Splitting into training and test set
train = randomized.iloc[:training_num].reset_index(drop=True)
test = randomized.iloc[training_num:].reset_index(drop=True)

In [7]:
print(train.shape)
print(test.shape)

(4458, 2)
(1114, 2)


After the train-test split, we need to check whether the percentages of spam and ham in the two sets remain largely similar to those in the original dataset.

In [8]:
# Checking percentages of spam and ham in training set
train['Label'].value_counts(normalize=True) * 100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [9]:
# Checking percentages of spam and ham in test set
test['Label'].value_counts(normalize=True) * 100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

From the above relative frequency distribution tables, we can see that the percentages are similar to what we have in the full dataset.

# Cleaning the Dataset

## Dealing with Letter Case and Punctuation

In order to calculate all the required probabilities for the Naive Bayes algorithm, the data must first be cleaned into a format that allows us to easily extract the information we need, such as the unique words in the vocabulary and the counts of the words in each SMS message. The data cleaning includes converting all the words in all the SMS messages to lower case and removing punctuation. Ultimately, the end goal of the data cleaning is to get a dataset that looks like this:

![End Result](https://dq-content.s3.amazonaws.com/433/cpgp_dataset_3.png)

That is, each column would represent a unique word in the vocabulary and each row would represent an SMS message. The values in the columns would indicate how many times the word appears in the SMS message.
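To make the target format concrete, here is a minimal sketch on three made-up messages. The messages and the names `messages`, `tokenized` and `count_rows` are assumptions for illustration only. They are not part of the dataset or of the code used later in this project.

```python
from collections import Counter

# Hypothetical toy messages, already lower-cased and stripped of punctuation
messages = [
    "secret prize claim now",
    "coming to my secret party",
    "winner claim your prize prize",
]

# Tokenize on whitespace and collect the sorted set of unique words
tokenized = [m.split() for m in messages]
vocabulary = sorted({word for msg in tokenized for word in msg})

# One row per message, one column per vocabulary word
count_rows = []
for msg in tokenized:
    counts = Counter(msg)  # a Counter returns 0 for words that are absent
    count_rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_rows:
    print(row)
```

Each printed row lines up with the printed vocabulary, so the value under `prize` in the third row is 2 because that word appears twice in the third message.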

In [10]:
# Before cleaning
train.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [11]:
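# Removing punctuation and converting everything to lowercase in the SMS column of the training set
# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()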
train.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


## Creating the Vocabulary

In [12]:
# Transform the message into a list of strings, splitting at the space character for all the SMS messages
train['SMS'] = train['SMS'].str.split()

# Initiate an empty list as the vocabulary
vocabulary = []

# Iterate over the new SMS column which should be a list of strings
for msg in train['SMS']:
    for word in msg:
        vocabulary.append(word)
        
# Transform the vocabulary list into a set and then back into a list to remove duplicate words
vocabulary = list(set(vocabulary))

In [13]:
# Checking the number of unique words in the training dataset
len(vocabulary)

7783

There are 7,783 unique words in the vocabulary built from the training set.

## Transforming Into the Desired Final Training Set

With the vocabulary of unique words for the training set's messages created, we can transform the training dataset into the desired form mentioned earlier. First, we create a dictionary whose keys are the unique words and whose values are lists holding the count of that word in each SMS message of the training set. This dictionary is then used to create a pandas DataFrame.

In [14]:
# Creating a dictionary first where the keys are the unique words and the values are lists with the same length as the
# training set, where each element in the list is initially 0.

word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

# Using `message` as the loop variable so the `sms` DataFrame loaded earlier is not overwritten
for index, message in enumerate(train['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1

In [15]:
# Transform the dictionary into a pandas DataFrame
final_train = pd.DataFrame(word_counts_per_sms)

In [16]:
# Concatenate the DataFrame built above with the original train dataset
final_train = pd.concat([train, final_train], axis=1)
final_train.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


With data cleaned and a proper training set to work with, we can move on to begin creating the spam filter using the Naive Bayes algorithm.

# Creating the Spam Filter

## Calculating the Constants First

First, a few values remain constant and can be calculated up front. These terms have the same values in the probability equations for every new message, regardless of the message or the individual words in it. They are P(Spam), P(Ham), N_spam, N_ham and N_vocabulary. P(Spam) and P(Ham) are the prior probabilities of a message being spam or ham respectively, estimated from the training set. N_spam and N_ham are the total numbers of words in all the spam messages and in all the ham messages, and N_vocabulary is the number of unique words in the vocabulary. Laplace smoothing with a smoothing parameter `alpha` of 1 will be used in the algorithm.
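For reference, the equations these constants feed into can be written as follows. This is a sketch of the standard multinomial Naive Bayes formulation with additive smoothing. The notation $w_1, \dots, w_n$ for the words of a new message and $N_{w_i \mid \text{Spam}}$ for the number of times $w_i$ occurs in all spam messages is an assumption introduced here for illustration.

```latex
\begin{align*}
P(\text{Spam} \mid w_1, \dots, w_n) &\propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(\text{Ham} \mid w_1, \dots, w_n) &\propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham}) \\
P(w_i \mid \text{Spam}) &= \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}} \\
P(w_i \mid \text{Ham}) &= \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
\end{align*}
```

The two products are only proportional to the posterior probabilities. Because both share the same normalizing factor, comparing them is enough to choose a label.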

In [17]:
# First, calculating P(Spam) and P(Ham) and assigning smoothing parameter of 1 to a variable named alpha

p_spam = final_train['Label'].value_counts(normalize=True).loc['spam']
p_ham = final_train['Label'].value_counts(normalize=True).loc['ham']
alpha = 1

In [18]:
# Calculate N_vocabulary
N_vocabulary = len(vocabulary)

# Initialize with 0
N_spam = 0
N_ham = 0

# Filtering out the spam and ham messages from the training set into two separate dataframes
train_spam = final_train[final_train['Label'] == 'spam']
train_ham = final_train[final_train['Label'] == 'ham']

# Looping over spam and ham training dataframe and counting the number of words in all spam and ham messages
for msg in train_spam['SMS']:
    N_spam += len(msg)

for msg in train_ham['SMS']:
    N_ham += len(msg)

In [19]:
# Checking all constant values
print('P(Spam): {}'.format(p_spam))
print('P(Ham): {}'.format(p_ham))
print('N_vocabulary: {}'.format(N_vocabulary))
print('N_spam: {}'.format(N_spam))
print('N_ham: {}'.format(N_ham))

P(Spam): 0.13458950201884254
P(Ham): 0.8654104979811574
N_vocabulary: 7783
N_spam: 15190
N_ham: 57237


## Calculating the Parameters

Next, we need to calculate P(word|Spam) and P(word|Ham) for every unique word in the vocabulary. In more technical language, these probability values are called parameters. For a given word, the two values stay the same for every new message that comes in, so they can be computed once from the training set.
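As a small worked example of the smoothed parameter calculation, the sketch below applies the formula to made-up counts. The helper name `smoothed_probability` and all the numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """P(word|class) with additive (Laplace) smoothing."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Hypothetical class with 20 words in total and a vocabulary of 10 unique words
p_seen = smoothed_probability(3, 20, 10)    # word occurs 3 times: (3 + 1) / (20 + 10)
p_unseen = smoothed_probability(0, 20, 10)  # word never occurs:   (0 + 1) / (20 + 10)

print(p_seen, p_unseen)
```

The second value shows the purpose of the smoothing: a word that never occurs in a class still gets a small non-zero probability, so it cannot force the whole product for that class to zero.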

In [20]:
# Initializing two dictionaries to store the parameters 
spam_params = {}
ham_params = {}

for word in vocabulary:
    spam_params[word] = 0
    ham_params[word] = 0

In [21]:
# Iterating over the vocabulary to calculate P(word|Spam) and P(word|Ham)
for word in vocabulary:
    N_word_spam = train_spam[word].sum()
    N_word_ham = train_ham[word].sum()
    p_word_given_spam = (N_word_spam + alpha)/(N_spam + alpha * N_vocabulary)
    p_word_given_ham = (N_word_ham + alpha)/(N_ham + alpha * N_vocabulary)
    spam_params[word] = p_word_given_spam
    ham_params[word] = p_word_given_ham

## Building the Spam Filter Function

With all the constants and parameters calculated, we can finally create a function that acts as a spam filter for new messages. This function is named `classify` and takes the new message in the form of a string. Essentially, the `classify` function does the following:

* Takes in the new SMS message as input in the form of a string.
* Calculates the probability of the message being spam and the probability of it being ham, given the words in the message.
* Compares the two probabilities to decide whether the message should be classified as ham or spam, or handed to a human to classify when the two are equal.

In [22]:
# Importing the regular expressions module

import re

# Creating the classify function, which is the spam filter

def classify(message):

    message = re.sub(r'\W', ' ', message) # Removing punctuation (the raw string avoids an invalid escape warning)
    message = message.lower() # Converting all letters to lower case
    message = message.split() # Converting the string into a list of the words in the new message

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_params:
            p_spam_given_message *= spam_params[word]
        if word in ham_params:
            p_ham_given_message *= ham_params[word]
    
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [23]:
# Testing out the classify spam filter function on obvious spam and ham examples
classify('WINNER!! This is the secret code to unlock the money: C3421')
classify('Sounds good, Tom, then see u there')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
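The probabilities printed above are already very small for short SMS messages, because many values below 1 are multiplied together. For much longer texts, such a product can eventually underflow to 0.0 in floating point. A common way around this is to add logarithms instead of multiplying probabilities. The sketch below is a variant under that assumption, not the function used in the rest of this project. The name `classify_log` and the toy priors and parameters are hypothetical.

```python
import math
import re

def classify_log(message, p_spam, p_ham, spam_params, ham_params):
    """Return 'spam', 'ham' or 'human to classify' by comparing log scores."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the training vocabulary are skipped, as in classify()
        if word in spam_params:
            log_spam += math.log(spam_params[word])
        if word in ham_params:
            log_ham += math.log(ham_params[word])

    if log_ham > log_spam:
        return 'ham'
    if log_ham < log_spam:
        return 'spam'
    return 'human to classify'

# Hypothetical priors and parameters, for illustration only
toy_spam_params = {'prize': 0.2, 'party': 0.05}
toy_ham_params = {'prize': 0.02, 'party': 0.3}
print(classify_log('Claim your PRIZE!', 0.13, 0.87, toy_spam_params, toy_ham_params))
```

Because the logarithm is strictly increasing, comparing the summed logs picks the same label as comparing the products whenever the products can be represented without underflow.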


# Evaluation of Spam Filter's Accuracy

Now, we will evaluate how well the spam filter does on our test set, which consists of 1,114 messages.
To calculate the spam filter's accuracy, we need to compare the actual labels with the labels predicted by the spam filter for all the messages in the test set. Since the algorithm did not see these 1,114 messages during training, every message in the test set is new from the perspective of the algorithm. Accuracy is defined as the ratio of the number of correctly classified messages to the total number of classified messages.
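The accuracy definition can be sketched on a handful of made-up labels. The lists `actual` and `predicted` and the helper `accuracy` below are hypothetical and only illustrate the ratio. The sketch also computes the accuracy of always predicting the most common label, which is a useful reference point when one class dominates.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Share of positions where the predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical labels, for illustration only
actual = ['ham', 'ham', 'spam', 'ham', 'spam', 'ham', 'ham', 'ham']
predicted = ['ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham']

# Baseline: always predict the most common actual label
majority_label = Counter(actual).most_common(1)[0][0]
baseline = accuracy(actual, [majority_label] * len(actual))

print(accuracy(actual, predicted), baseline)
```

In this dataset, about 86.8% of the test messages are ham, so a filter that labels everything as ham would already score roughly that accuracy. A useful filter should do clearly better than that baseline.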

In [24]:
# Modifying the classify function from earlier slightly to output the label instead of printing out statements
def classify_test(message):

    message = re.sub(r'\W', ' ', message) # Removing punctuation (the raw string avoids an invalid escape warning)
    message = message.lower() # Converting all letters to lower case
    message = message.split() # Convert string into a list of strings of which each element is the word in the new message

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_params:
            p_spam_given_message *= spam_params[word]
        if word in ham_params:
            p_ham_given_message *= ham_params[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'human to classify'

In [25]:
# Creating a new column in the test set to show the predicted label from the spam filter

test['predicted'] = test['SMS'].apply(classify_test)
test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [26]:
# Measuring the accuracy of the spam filter
correct = 0 # Initialize a variable that counts the number of correctly classified messages
total = len(test) # Total number of classified messages

for row in test.iterrows():
    actual = row[1]['Label']
    predicted = row[1]['predicted']
    if actual == predicted:
        correct += 1
        
accuracy = correct/total * 100  
print('Accuracy Score: {:.2f}%'.format(accuracy))

Accuracy Score: 98.74%


The multinomial Naive Bayes algorithm used to build the spam filter for SMS messages achieved an accuracy score of 98.74% on the test set, which is well above our goal of an accuracy greater than 80%.

# Conclusion

In this project, we built a spam filter for SMS messages using the multinomial Naive Bayes algorithm. It achieved an accuracy score of 98.74% on the test set, which is well above our goal of 80%.