# Classifying spam messages

We're going to build a spam filter for SMS messages with the Naive Bayes algorithm, using a dataset put together by Tiago A. Almeida and José María Gómez Hidalgo. The data can be downloaded from [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) or from [this link](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection). The data collection process is described in more detail on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

Broadly, the algorithm will do the following:

1. Learn how humans classify messages.
2. Use that human knowledge to estimate two probabilities for each new message: one for spam and one for non-spam.
3. Classify the new message based on these probability values. If the probability for spam is greater, the message is classified as spam; otherwise, it is classified as non-spam. If the two probabilities are equal, we may need a human to classify the message.

## Understanding the data

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import re

In [3]:
messages = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
messages.columns = ['Label','SMS']
messages.head(5)

| | Label | SMS |
| --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
# Getting the number of rows and columns
messages.shape

(5572, 2)

In [5]:
# Seeing what % of the messages are spam
messages['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

## Splitting the data for testing

Once our spam filter is done, we'll need to test how good it is at classifying new messages. To do that, we're first going to split our dataset into two sets:

- A training set, which we'll use to "train" the computer to classify messages.
- A test set, which we'll use to test how good the spam filter is at classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

- The training set will have 4,458 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).
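
The split sizes above follow from simple arithmetic. A minimal sketch, where the variable names are illustrative and not taken from the notebook:

```python
# Worked arithmetic for the 80/20 split described above.
n_messages = 5572        # size of the full dataset
train_fraction = 0.8     # share of messages kept for training

# Round to the nearest whole message, then give the remainder to the test set
n_train = round(n_messages * train_fraction)
n_test = n_messages - n_train

print(n_train, n_test)
```

Because 5,572 × 0.8 is not a whole number, the training set is only approximately 80% of the data, which is why the list above says "about".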

When the spam filter is ready, we'll be able to compare the algorithm's classifications with those made by a human. Our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%.

Let's start by randomizing the entire dataset to ensure that spam and ham messages are spread properly throughout the dataset.

In [6]:
random = messages.sample(frac=1,random_state=1)
random.head(5)

| | Label | SMS |
| --- | --- | --- |
| 1078 | ham | Yep, by the pretty sculpture |
| 4028 | ham | Yes, princess. Are you going to make me moan? |
| 958 | ham | Welp apparently he retired |
| 4642 | ham | Havent. |
| 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


In [7]:
# train, test = train_test_split(messages, test_size=0.2)
# train.reset_index(inplace=True)
# test.reset_index(inplace=True)

In [8]:
# msk = np.random.rand(len(random)) < 0.8
# train = random[msk]
# test = random[~msk]
# train.reset_index(inplace=True,drop=True)
# test.reset_index(inplace=True,drop=True)

In [9]:
# Calculate index for split
training_test_index = round(len(random) * 0.8)

# Training/Test split
train = random[:training_test_index].reset_index(drop=True)
test = random[training_test_index:].reset_index(drop=True)

In [10]:
train['Label'].value_counts(normalize=True)*100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [11]:
test['Label'].value_counts(normalize=True)*100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

Both sets have a percentage of spam messages similar to that of the full dataset (about 13.4%), so we can continue with this split.
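
The shuffle, slice, and compare procedure can be sketched with the standard library alone. This is a toy illustration on made-up labels, not the notebook's pandas code; `toy_labels` and `spam_share` are hypothetical names.

```python
import random

# Toy labels with roughly the same spam share as the real dataset (illustrative only)
toy_labels = ['spam'] * 13 + ['ham'] * 87

# Shuffle a copy with a fixed seed so the result is reproducible
rng = random.Random(1)
shuffled = toy_labels[:]
rng.shuffle(shuffled)

# Slice at the 80% mark
split_index = round(len(shuffled) * 0.8)
toy_train = shuffled[:split_index]
toy_test = shuffled[split_index:]

def spam_share(labels):
    """Return the percentage of labels that are 'spam'."""
    return 100 * labels.count('spam') / len(labels)

print(spam_share(toy_labels), spam_share(toy_train), spam_share(toy_test))
```

With only 100 toy labels the two shares can drift noticeably from the overall 13%. The real dataset is large enough that a plain random split keeps them close, as the percentages above show.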

## Cleaning the training data

As the messages contain a mix of upper and lower case letters and a variety of punctuation, we'll start by cleaning the data: converting everything to lower case and replacing punctuation with spaces.

In [12]:
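# Converting everything to lower case
train['SMS'] = train['SMS'].str.lower()
# Replacing every non-word character (punctuation) with a space;
# regex=True is required because recent pandas versions treat the pattern literally by default
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)
train.head(5)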

| | Label | SMS |
| --- | --- | --- |
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |
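
To see what the cleaning step does to a single message, here is a small sketch using `re` from the standard library instead of pandas. The helper name `clean` is an assumption made for illustration.

```python
import re

def clean(text):
    """Lowercase the text and replace every non-word character with a space."""
    return re.sub(r'\W', ' ', text.lower())

raw = "Yep, by the pretty sculpture"
cleaned = clean(raw)

# Splitting on whitespace afterwards discards the extra spaces left by punctuation
print(cleaned.split())

# Apostrophes are non-word characters too, so contractions are split in two
print(clean("There's a card").split())
```

In Python 3, `\W` is Unicode-aware for `str` patterns, so letters such as `ü` count as word characters and are kept.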


## Re-formatting the training data

We want to reformat our table so that it has a 'Label' column, followed by a column for every unique word in the dataframe that indicates how many times that word appears in each SMS.

In [13]:
# Creating a list of the unique words
train['SMS'] = train['SMS'].str.split()

vocabulary = []
for sms in train['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))
print(len(vocabulary))

7783


In [14]:
# creating a dictionary from the vocabulary list and storing the word counts
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

# Converting the dictionary to a dataframe
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head(5)

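| | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |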
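
With thousands of columns the result is hard to inspect, so here is the same transformation on three made-up messages. This is a sketch using only the standard library; the `toy_` names and the messages themselves are invented for illustration.

```python
from collections import Counter

# Three already-cleaned and split toy messages (hypothetical data)
toy_messages = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
    ['secret', 'secret', 'code'],
]

# Unique words across all messages
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list of per-message counts for every word, initialised to zero
toy_word_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

for index, sms in enumerate(toy_messages):
    for word, count in Counter(sms).items():
        toy_word_counts[word][index] = count

print(toy_word_counts['secret'])  # one count per message
```

Each key becomes a column and each list position a row, which is the layout `pd.DataFrame` builds from a dictionary of equal-length lists.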


In [15]:
# Merging vocab data with original dataframe
combined = pd.concat([train,word_counts],axis=1)

In [16]:
combined.head(5)

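| | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |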


## Creating the spam filter

Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter.

We'll need to evaluate the two expressions below to classify new messages:

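- $P(\text{Spam} \mid w_1, w_2, \ldots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})$
- $P(\text{Ham} \mid w_1, w_2, \ldots, w_n) \propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})$

And to calculate $P(w_i \mid \text{Spam})$ and $P(w_i \mid \text{Ham})$ we'll need to use these equations:

- $P(w_i \mid \text{Spam}) = \dfrac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}$
- $P(w_i \mid \text{Ham}) = \dfrac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}$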
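
To make the smoothed estimate concrete, here is a minimal sketch with made-up counts. The function name `smoothed_word_probability` and the toy numbers are assumptions for illustration; the formula is the one given above.

```python
def smoothed_word_probability(n_word_in_class, n_class, n_vocabulary, alpha=1):
    """Estimate P(word | class) with additive smoothing.

    n_word_in_class: how many times the word occurs in messages of the class
    n_class:         total number of words in messages of the class
    n_vocabulary:    number of unique words in the training vocabulary
    """
    return (n_word_in_class + alpha) / (n_class + alpha * n_vocabulary)

# Toy class with 7 words in total and a 9-word vocabulary (hypothetical numbers)
p_seen = smoothed_word_probability(4, 7, 9)      # (4 + 1) / (7 + 9)
p_unseen = smoothed_word_probability(0, 7, 9)    # (0 + 1) / (7 + 9)
p_unsmoothed = smoothed_word_probability(0, 7, 9, alpha=0)

print(p_seen, p_unseen, p_unsmoothed)
```

Without smoothing, a word that never occurs in a class gets probability 0, which would wipe out the whole product for any message containing it. With α = 1 it gets a small but nonzero value instead.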

### Calculating required terms

We'll start with calculating:

- $P(\text{Spam})$ and $P(\text{Ham})$
- $N_{\text{Spam}}$, $N_{\text{Ham}}$, $N_{\text{Vocabulary}}$

We'll also use Laplace smoothing and set α=1. All these terms will have constant values in our equations for every new message (regardless of the message or each individual word in the message).

In [26]:
# Isolating spam and ham messages first
spam = combined[combined['Label'] == 'spam']
ham = combined[combined['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam) / len(combined)
p_ham = len(ham) / len(combined)

# N_Spam
n_words_per_spam_message = spam['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

print(p_spam)
print(p_ham)
print(n_spam)
print(n_ham)
print(n_vocabulary)

0.13458950201884254
0.8654104979811574
15190
57237
7783


P(wi|Spam) and P(wi|Ham) vary from word to word, but for any given word they are the same for every new message: they depend only on the training set, so they stay constant as long as the training set doesn't change.

These per-word probabilities are called the parameters. For each word in the vocabulary, we need to calculate both P(wi|Spam) and P(wi|Ham).

Because we calculate all of these values before classifying any new message, the Naive Bayes algorithm is very fast at prediction time. When a new message comes in, most of the needed computations are already done, so the algorithm only has to look up and multiply the stored values to classify it.

### Calculating parameters

In [27]:
para_spam = dict.fromkeys(vocabulary, 0)
para_ham = dict.fromkeys(vocabulary, 0)

In [28]:
# Estimate P(w|Spam) and P(w|Ham) for every word, with Laplace smoothing
for word in vocabulary:
    n_w_given_spam = spam[word].sum()
    p_w_given_spam = (n_w_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
    para_spam[word] = p_w_given_spam

    n_w_given_ham = ham[word].sum()
    p_w_given_ham = (n_w_given_ham + alpha) / (n_ham + alpha * n_vocabulary)
    para_ham[word] = p_w_given_ham

### Creating spam filter

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn)
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn)
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:

    - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.

Note that some new messages will contain words that are not part of the vocabulary. We simply ignore these words when we're calculating the probabilities.
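
Skipping unknown words can be sketched as follows. The parameter values and the names `toy_para_spam` and `score` are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
# Toy parameters for two known words (made-up values, not from the training set)
toy_para_spam = {'secret': 0.25, 'prize': 0.125}
toy_p_spam = 0.5

def score(words, prior, parameters):
    """Multiply the prior by the parameter of each word found in the vocabulary."""
    result = prior
    for word in words:
        if word in parameters:   # words outside the vocabulary are skipped
            result *= parameters[word]
    return result

known_only = score(['secret', 'prize'], toy_p_spam, toy_para_spam)
with_unknown = score(['secret', 'unicorn', 'prize'], toy_p_spam, toy_para_spam)

print(known_only, with_unknown)
```

Both calls give the same value, because `'unicorn'` is not in the toy vocabulary and contributes nothing to the product.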

In [29]:
# Writing a classification function
def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in para_spam:
            p_spam_given_message *= para_spam[word]
            
        if word in para_ham:
            p_ham_given_message *= para_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [30]:
# Testing the function on two messages
message_1 = 'WINNER!! This is the secret code to unlock the money: C3421.'
message_2 = "Sounds good, Tom, then see u there"


In [31]:
classify(message_1)
classify(message_2)

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
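
The values printed above are already very small for messages of a handful of words. For much longer messages, a product of many small numbers can underflow to `0.0` in floating point, which would make spam and ham look equal. A common workaround, which this notebook does not use, is to compare sums of logarithms instead. A sketch under that assumption, with hypothetical names and toy values:

```python
import math

def product_score(words, prior, parameters):
    """Multiply the prior by the parameter of every known word."""
    score = prior
    for word in words:
        if word in parameters:
            score *= parameters[word]
    return score

def log_score(words, prior, parameters):
    """Same quantity on a log scale: add logarithms instead of multiplying."""
    total = math.log(prior)
    for word in words:
        if word in parameters:
            total += math.log(parameters[word])
    return total

# One toy word with a small probability, repeated 100 times (made-up values)
toy_parameters = {'free': 1e-4}
long_message = ['free'] * 100

underflowed = product_score(long_message, 0.5, toy_parameters)
stable = log_score(long_message, 0.5, toy_parameters)

print(underflowed)  # the product is too small to represent
print(stable)       # the log-scale score stays finite
```

Since the logarithm is increasing, comparing the two log scores gives the same label as comparing the two products whenever the products can be represented.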


### Testing the filter

We'll now try to determine how well the spam filter does on our test set of 1,114 messages.

The algorithm will output a classification label for every message in our test set, which we'll be able to compare with the actual label (given by a human). Note that, in training, our algorithm didn't see these 1,114 messages, so every message in the test set is practically new from the perspective of the algorithm.

In [33]:
# Modifying the classify function above
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in para_spam:
            p_spam_given_message *= para_spam[word]

        if word in para_ham:
            p_ham_given_message *= para_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

Now that we have a function that returns labels instead of printing them, we can use it to create a new column in our test set.

In [34]:
test['predicted'] = test['SMS'].apply(classify_test_set)
test.head()

| | Label | SMS | predicted |
| --- | --- | --- | --- |
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


Now we can compare the predicted values with the actual values to measure how good our spam filter is at classifying new messages. To make the measurement, we'll use accuracy as a metric, where accuracy = number of correctly classified messages / total number of classified messages.

In [38]:
# Count the rows where the predicted label matches the human label
correct = (test['Label'] == test['predicted']).sum()
total = len(test)

accuracy = correct / total * 100
print(accuracy)

98.74326750448833
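
The accuracy formula can also be written as a small standalone function. This is a sketch on toy labels; the name `accuracy_percent` is an assumption. The 1,100 correct predictions in the last line are inferred from the 1,114 test messages and the 14 misclassified messages mentioned in the conclusion.

```python
def accuracy_percent(actual, predicted):
    """Percentage of positions where the predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual) * 100

# Toy labels (hypothetical): three of the four predictions are right
toy_actual = ['ham', 'ham', 'spam', 'ham']
toy_predicted = ['ham', 'spam', 'spam', 'ham']
toy_accuracy = accuracy_percent(toy_actual, toy_predicted)
print(toy_accuracy)

# Same arithmetic for the test set: 1,114 messages, 14 of them misclassified
test_accuracy = (1114 - 14) / 1114 * 100
print(round(test_accuracy, 2))
```

A message labelled `'needs human classification'` never matches `'ham'` or `'spam'`, so under this definition it counts as incorrect.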


Using the Naive Bayes algorithm, we were able to classify the SMS messages in the test set with an accuracy of 98.74%, which is much higher than the 80% we were aiming for.

## Conclusion

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.74% on the test set, which is an excellent result. We initially aimed for an accuracy of over 80%, but we managed to do way better than that.

If I wanted to continue working on this, I could:

- Isolate the 14 messages that were classified incorrectly and try to figure out why the algorithm reached the wrong conclusions.
- Make the filtering process more complex by making the algorithm sensitive to letter case.
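
As a starting point for the first idea, the misclassified messages can be isolated by keeping the rows where the two labels differ. A minimal sketch on made-up rows; `toy_results` and its contents are hypothetical, with keys that mirror the column names of the test set.

```python
# Toy results with the same keys as the test set columns (made-up messages)
toy_results = [
    {'Label': 'ham', 'predicted': 'ham', 'SMS': 'see you at noon'},
    {'Label': 'spam', 'predicted': 'ham', 'SMS': 'call this number to claim'},
    {'Label': 'ham', 'predicted': 'needs human classification', 'SMS': 'ok'},
    {'Label': 'spam', 'predicted': 'spam', 'SMS': 'win a free prize now'},
]

# Keep only the rows where the prediction disagrees with the human label
misclassified = [row for row in toy_results if row['Label'] != row['predicted']]

for row in misclassified:
    print(row['Label'], '->', row['predicted'], '|', row['SMS'])
```

On the real test set, the equivalent filter would be a boolean mask comparing the `Label` and `predicted` columns.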