# Building an SMS Spam Filter with Naive Bayes

In this project, we build a spam filter for SMS messages by following a three step process:

1. Learn how humans classify (previous) messages
2. Use the human classification knowledge from step 1 to calculate probabilities for new messages
3. Use the probabilities from step 2 to classify new messages as spam or not spam

We will use the multinomial Naive Bayes algorithm and [this](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) dataset of 5,572 messages to model classification of messages.

## A brief data orientation

We begin by briefly orienting ourselves with the data.

In [1]:
# Import pandas library and read in the dataset (specify no headers and manually name the fields)
import pandas as pd

data_raw = pd.read_csv("SMSSpamCollection", sep = "\t", header = None, names = ["Label", "SMS"])

In [2]:
# Print the number of rows and columns of the dataset
data_raw.shape

(5572, 2)

In [3]:
# Preview the first five rows of the dataset
data_raw.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
# View proportion of spam messages vs. ham messages
data_raw["Label"].value_counts(normalize = True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

We can see from the above cells there are 5,572 rows, one for each SMS message, and there are two columns, one containing the message and the other indicating whether or not it was spam.  We can also see that 87% of the messages are not spam (i.e. ham) and the remaining 13% are spam.

## Splitting the data into training and testing sets

The majority of our dataset (80%) will be used to train the classification model, while the remainder (20%) will be "held out" and used to test the performance it when ready.  Before doing this split, we randomize the messages to randomly sample.

In [5]:
# Randomize the dataset (frac = 1 parameter randomizes the entire dataset)
data_raw.sample(frac = 1, random_state = 1)

# Define the training data vs. testing data cutoff index at 80%/20%
index_cutoff = round(len(data_raw) * 0.8)

# Split the whole dataset into training/testing datasets and reset their indexes
training_data = data_raw[:index_cutoff].reset_index(drop = True)
test_data = data_raw[index_cutoff:].reset_index(drop = True)

In [6]:
# View proportion of spam messages vs. ham messages in the training dataset
training_data["Label"].value_counts(normalize = True)

ham     0.864962
spam    0.135038
Name: Label, dtype: float64

In [7]:
# View proportion of spam messages vs. ham messages in the testing dataset
test_data["Label"].value_counts(normalize = True)

ham     0.869838
spam    0.130162
Name: Label, dtype: float64

Both the training and testing datasets are representative of the total in terms of the percentage of spam and non-spam messages, which is great.

## Data preparation to calculate probabilities

We now prepare our training dataset to more easily calculate probabilities later.  The following steps are taken in the below cells:

1. Remove punctuation, since we're only interested in the *words* of each message and not other characters such as exclamation marks
2. Make everything lowercase; again, we're only interested in words and not case
3. Create a list of all unique words over all messages (i.e. a vocabulary list)
4. Create a new dataset where each column is a word in the vocabulary list
5. Count the number of each word in each message
6. Combine this dataset with the training dataset

In [8]:
# Remove all punctuation from the SMS column and make everything lowercase
training_data["SMS"] = training_data["SMS"].str.replace("\W", " ")
training_data["SMS"] = training_data["SMS"].str.lower()

  training_data["SMS"] = training_data["SMS"].str.replace("\W", " ")


In [9]:
# For each SMS message, put all its words in a list, then store it back to the SMS column
training_data["SMS"] = training_data["SMS"].str.split(" ")

# Make a big list of all the words in all the messages (i.e. its vocabulary)
vocabulary = []

for message in training_data["SMS"]:
    for word in message:
        vocabulary.append(word)

# Using the `set` function removes duplicates
vocabulary = list(set(vocabulary))

In [10]:
# Create a column for each word in the vocabulary, with each cell being the count of each word in each message
word_counts_per_sms = {unique_word: [0] * len(training_data['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_data['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

# Convert the huge dictionary to pandas dataframe
word_counts_per_sms = pd.DataFrame(word_counts_per_sms)

# Combine (concatenate) the new dataframe with the training data(frame)
training_data_final = pd.concat([training_data, word_counts_per_sms], axis = 1)

## Calculating various probabilities

We now use our dataframe of messages, their words, and its classification (spam or not) to calculate the following probabilities needed for the SMS spam filter:

* Probability of a message in the data being spam
* Probability of a message in the data being ham
* Probability of each word occurring given the message is spam
* Probability of each word occurring given the message is ham
* Probability of a given user message being spam given the words that makeup the message
* Probability of a given user message being ham given the words that makeup the message

The Naive Bayes algorithm is used to calculate the last two above bullets.  Additionally, Laplace smoothing is used with alpha parameter = 1 in the calculation of the third and fourth above bullets.

In [11]:
# Split out the spam and ham messages in the training set
spam_messages = training_data_final[training_data_final["Label"] == "spam"]
ham_messages = training_data_final[training_data_final["Label"] == "ham"]

# Calculate the probabilities of ham and spam messages
p_spam = len(spam_messages) / len(training_data_final)
p_ham = len(ham_messages) / len(training_data_final)

# Calculate the number of words in all spam messages
words_per_spam = spam_messages["SMS"].apply(len)
n_spam = words_per_spam.sum()

# Calculate the number of words in all ham messages
words_per_ham = ham_messages["SMS"].apply(len)
n_ham = words_per_ham.sum()

# Calculate the number of unique words in all messages
n_vocabulary = len(vocabulary)

# Set the Laplace smoothing parameter
alpha = 1

In [12]:
# Initialize dictionaries for vocabulary/word counts in spam and ham messages
spam_params = {word: 0 for word in vocabulary}
ham_params = {word: 0 for word in vocabulary}

# Calculate the probability of each word given they are in either a spam or ham message
# Note the Laplace smoothing technique
for word in vocabulary:
    p_word_given_spam = (spam_messages[word].sum() + alpha) / (n_spam + alpha * n_vocabulary)
    spam_params[word] = p_word_given_spam
    
    p_word_given_ham = (ham_messages[word].sum() + alpha) / (n_ham + alpha * n_vocabulary)
    ham_params[word] = p_word_given_ham

In [13]:
# Import regular expression library
import re

# Define the function that classifies a given message as spam or ham
def classify(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_params:
            p_spam_given_message *= spam_params[word]
        if word in ham_params:
            p_ham_given_message *= ham_params[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')

## Testing the SMS spam classification

In the following two cells, we test the SMS spam classification function defined above on two messages.  If the probability of it being spam (given the message and its words) is *higher* than the probability of it *not* being spam, then it's labeled as spam.  Otherwise, it's labeled as ham.

In [14]:
# Test the above defined function on a spam message
classify("WINNER!! This is the secret code to unlock the money: C3421.")

P(Spam|message): 4.331304993203445e-26
P(Ham|message): 4.446139382491623e-28
Label: Spam


In [15]:
# Test the above defined function on a ham message
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.865015300207463e-25
P(Ham|message): 1.1394224653464133e-21
Label: Ham


In [16]:
# Define the function that classifies a given message as spam or ham
# This is similar to the above `classify` function except it returns output instead of simply printing output
def classify_test_data(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_params:
            p_spam_given_message *= spam_params[word]
        if word in ham_params:
            p_ham_given_message *= ham_params[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        return "ham"
    elif p_ham_given_message < p_spam_given_message:
        return "spam"
    else:
        return "needs human classification"

## Running the SMS spam classification on the test set

We now run the messages in our testing set through the model to obtain its spam or ham predictions.  These predictions, combined with the actual labels, will allow us to calculate the performance of our model.

In [17]:
test_data["predicted"] = test_data["SMS"].apply(classify_test_data)
# test_data.head()

P(Spam|message): 2.518910596094843e-36
P(Ham|message): 5.793740108943922e-28
P(Spam|message): 4.661854530962673e-57
P(Ham|message): 4.776648857158814e-50
P(Spam|message): 1.3994601278490503e-71
P(Ham|message): 7.4753752198374e-83
P(Spam|message): 6.000120267954505e-144
P(Ham|message): 2.829463648775995e-125
P(Spam|message): 2.1274783618872758e-159
P(Ham|message): 9.519950742451555e-138
P(Spam|message): 9.022577473553515e-64
P(Ham|message): 5.436036793743203e-53
P(Spam|message): 5.6963398180106594e-80
P(Ham|message): 2.8430458049254497e-61
P(Spam|message): 1.4992948059812705e-34
P(Ham|message): 2.0840647550214532e-33
P(Spam|message): 7.740366104109491e-40
P(Ham|message): 9.988071413550733e-37
P(Spam|message): 1.1194103042134842e-20
P(Ham|message): 8.098375591718174e-17
P(Spam|message): 5.36028633067445e-114
P(Ham|message): 1.84469987078149e-97
P(Spam|message): 1.118837240999022e-19
P(Ham|message): 6.324159747882893e-16
P(Spam|message): 3.2715037452868636e-132
P(Ham|message): 5.517104464

P(Ham|message): 1.1439079642143612e-56
P(Spam|message): 6.3355338410602525e-112
P(Ham|message): 6.700314632046369e-105
P(Spam|message): 3.3721420441433427e-28
P(Ham|message): 1.8097224529143878e-23
P(Spam|message): 2.697138479290908e-32
P(Ham|message): 1.9985733330314187e-25
P(Spam|message): 3.3388195814406326e-63
P(Ham|message): 5.6982393981273545e-55
P(Spam|message): 8.439942625050705e-21
P(Ham|message): 1.341233833082816e-16
P(Spam|message): 2.2755319647835668e-73
P(Ham|message): 4.553009852912366e-93
P(Spam|message): 1.9591587691483013e-116
P(Ham|message): 1.0378847038093139e-106
P(Spam|message): 9.911798158055295e-31
P(Ham|message): 1.4184511939517094e-24
P(Spam|message): 1.2687685504995427e-95
P(Ham|message): 8.188118395127997e-87
P(Spam|message): 2.7735627101090684e-30
P(Ham|message): 2.697568319738469e-27
P(Spam|message): 2.7516581411868545e-79
P(Ham|message): 7.521081488572513e-103
P(Spam|message): 1.0696121309698282e-26
P(Ham|message): 1.095959395325471e-19
P(Spam|message): 1

In [18]:
# Calculate the accuracy of the above spam/ham classification
correct = 0
total = len(test_data)

for row in test_data.iterrows():
    row = row[1]
    if row["Label"] == row["predicted"]:
        correct += 1
        
print("Correct:", correct)
print("Incorrect:", total - correct)
print("Accuracy:", correct / total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833


## Performance of the SMS spam classifier

We can see that 98.7% of messages (1,110 out of 1,114) in our test set were correctly classified.  In other words, if the SMS spam classifier predicted spam (or ham), then the actual label of spam (or ham) matched 98.7% of the time.  This is very good performance and we're done!