# SpamGuard: Developing a High-Accuracy SMS Spam Filter Using Naive Bayes

In this project, we build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. We use a dataset of 5,572 SMS messages that humans have already labeled as spam or ham (non-spam).

The dataset, curated by Tiago A. Almeida and José María Gómez Hidalgo, can be obtained from the UCI Machine Learning Repository. Our approach involved dividing the dataset into 80% for training and 20% for testing. We employed Laplace smoothing with an alpha value of 1 and implemented a custom function for message classification based on the Naive Bayes algorithm.

The filter reached an accuracy of 98.74% on the test set, well above our initial objective of 80%.

In [173]:
import pandas as pd

In [174]:
# Reading in the data set
spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
spam

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [175]:
spam['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

The data set contains 5,572 messages, about 87% of which are labeled as non-spam ('ham') and the remaining 13% as spam.

## Data Cleaning

We will replace every non-word character (punctuation and other symbols) in the SMS column with a space and convert the text to lowercase.

In [176]:
# Using a raw string so that \W is passed to the regex engine unchanged
spam['SMS'] = spam['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()
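To see what this cleaning step does to a single message, here is a minimal sketch using only the standard library. The helper name `clean_and_tokenize` and the sample messages are made up for illustration; they are not part of the dataset or the notebook.

```python
import re

def clean_and_tokenize(message):
    """Replace non-word characters with spaces, lowercase, and split into words."""
    # \W matches any character that is not a letter, digit, or underscore
    no_punct = re.sub(r'\W', ' ', message)
    return no_punct.lower().split()

print(clean_and_tokenize("WINNER!! Call 0800-123 now."))
# ['winner', 'call', '0800', '123', 'now']

# A message made only of punctuation ends up with no words at all
print(clean_and_tokenize("..."))
# []
```

Note that a message consisting only of punctuation becomes an empty word list, which matters later when the vocabulary is built.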

## Splitting the Data Set into Training and Test Sets

Once our spam filter is ready, we'll need to test how good it is at classifying new messages. To test the spam filter, we'll first split our dataset into two parts:

- A training set, which we'll use to "train" the computer to classify messages.
- A test set, which we'll use to test how good the spam filter is at classifying new messages.

We'll keep 80% of our dataset for training and 20% for testing: we want to train the algorithm on as much data as possible, but we also need enough test data for a meaningful evaluation. The data set has 5,572 messages, which means that:

- The training set will have 4,458 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).
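The two sizes follow from rounding 80% of 5,572 to the nearest whole message. A small sketch of that arithmetic, where `split_sizes` is a hypothetical helper name:

```python
def split_sizes(n_total, train_fraction=0.8):
    """Return (train size, test size) for a given total and training fraction."""
    n_train = round(n_total * train_fraction)
    return n_train, n_total - n_train

n_train, n_test = split_sizes(5572)
print(n_train, n_test)  # 4458 1114
```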

For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%. In other words, we expect more than 80% of the new messages to be correctly classified as spam or ham (non-spam).

We will start by randomizing the dataset.

In [177]:
spam1 = spam.sample(frac=1, random_state=1).copy()
spam1

Unnamed: 0,Label,SMS
1078,ham,yep by the pretty sculpture
4028,ham,yes princess are you going to make me moan
958,ham,welp apparently he retired
4642,ham,havent
4674,ham,i forgot 2 ask ü all smth there s a card on ...
...,...,...
905,ham,we re all getting worried over here derek and...
5192,ham,oh oh den muz change plan liao go back h...
3980,ham,ceri u rebel sweet dreamz me little buddy c...
235,spam,text meet someone sexy today u can find a d...


In [178]:
# Splitting the randomized data set into training (80%, 4458 entries) and test (20%, 1114 entries) sets
train = spam1.head(4458).copy().reset_index()

test = spam1.tail(1114).copy().reset_index()

We can verify that the percentages of spam and ham in the training and test sets are similar to those in the original dataset.

In [179]:
test['Label'].value_counts(normalize=True)*100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

In [180]:
train['Label'].value_counts(normalize=True)*100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

## Creating the Vocabulary

When a new message arrives, our Naive Bayes algorithm will classify it based on how often its words appeared in the spam and ham messages of the training set.

So first we need to build a vocabulary of all the unique words that occur in the messages in our training set.
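As a rough illustration of what "vocabulary" means here, the sketch below collects the unique words from a few tokenized messages using plain Python. The toy messages are invented for this example, and sorting the result is an assumption made here so that the word order is the same on every run.

```python
# Toy tokenized messages, invented for illustration; the third one is empty
toy_messages = [
    ['free', 'entry', 'win'],
    ['ok', 'see', 'u'],
    [],
    ['win', 'free', 'cash'],
]

# A set comprehension keeps each word once; sorting fixes the order
toy_vocabulary = sorted({word for message in toy_messages for word in message})
print(toy_vocabulary)
# ['cash', 'entry', 'free', 'ok', 'see', 'u', 'win']
```

An empty message simply contributes no words in this version.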

In [181]:
# Splitting the SMS column into list of words
train['SMS'] = train['SMS'].str.split()

In [182]:
# Extracting all unique words from SMS column into vocabulary
vocabulary = list(set(train['SMS'].explode()))

In [183]:
# Creating a dictionary with words from the vocabulary as keys;
# each value is a list of zeros, one entry per message in the training set
word_counts_d = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

# Looping over the list of words from each SMS together with its index
for index, sms in enumerate(train['SMS']):
    # Looping over each word in the SMS
    for word in sms:
        # Incrementing the count of that word for that SMS
        word_counts_d[word][index] += 1
        
# Transforming the dictionary to dataframe
word_counts = pd.DataFrame(word_counts_d)
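The table above stores one count per word per message. If only the per-class totals are needed, a lighter alternative is to accumulate counts directly with `collections.Counter`. This is a different technique from the one used in this notebook, shown as a sketch on invented toy data:

```python
from collections import Counter

# Toy labeled, tokenized messages, invented for illustration
toy_train = [
    ('spam', ['free', 'entry', 'win']),
    ('ham', ['ok', 'see', 'u']),
    ('spam', ['win', 'free', 'cash']),
]

# One counter per class, updated with the words of each message
toy_counts = {'spam': Counter(), 'ham': Counter()}
for label, words in toy_train:
    toy_counts[label].update(words)

# Total number of words seen in each class
toy_totals = {label: sum(counter.values()) for label, counter in toy_counts.items()}

print(toy_counts['spam']['free'])  # 2
print(toy_totals)                  # {'spam': 6, 'ham': 3}
```

The trade-off is that per-message counts are no longer available, which is fine when only class-level word frequencies are used.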

In [184]:
# Concatenating train and word_counts dataframes
train_1 = pd.concat([train, word_counts], axis=1)
train_1

Unnamed: 0,index,Label,SMS,NaN,elaborate,l8r,actual,music,downon,somewhat,...,drpd,neighbor,granite,search,version,wherre,7am,it,85069,attracts
0,1078,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,4028,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,958,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4642,ham,[havent],0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4674,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4453,1982,ham,"[sorry, i, ll, call, later, in, meeting, any, ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,5180,ham,"[babe, i, fucking, love, you, too, you, know, ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
4455,4020,spam,"[u, ve, been, selected, to, stay, in, 1, of, 2...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4456,371,ham,"[hello, my, boytoy, geeee, i, miss, you, alrea...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


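The NaN column appears because explode() produces NaN for messages that were left with no words after cleaning. It contains only zeros, so we can remove it.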

In [185]:
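# Dropping the NaN column by label rather than by position,
# because the order of columns derived from a set is not guaranteed
train_2 = train_1.loc[:, train_1.columns.notna()].copy()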
train_2

Unnamed: 0,index,Label,SMS,elaborate,l8r,actual,music,downon,somewhat,24m,...,drpd,neighbor,granite,search,version,wherre,7am,it,85069,attracts
0,1078,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,4028,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,958,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4642,ham,[havent],0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4674,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4453,1982,ham,"[sorry, i, ll, call, later, in, meeting, any, ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,5180,ham,"[babe, i, fucking, love, you, too, you, know, ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
4455,4020,spam,"[u, ve, been, selected, to, stay, in, 1, of, 2...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4456,371,ham,"[hello, my, boytoy, geeee, i, miss, you, alrea...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Creating the Spam Filter

First, we compute the prior probabilities of spam and ham, the number of words in all spam messages and all ham messages, and the number of words in the vocabulary. We'll also use Laplace smoothing and set alpha = 1.

In [186]:
# Prior probabilities of spam and ham are the shares of spam and ham messages in the training set
p_ham = train_2['Label'].value_counts(normalize=True)['ham']
p_spam = 1 - p_ham

# Calculating total number of words in spam and ham messages
n_ham = train_2[train_2['Label'] == 'ham'].iloc[:, 3:].sum().sum()
n_spam = train_2[train_2['Label'] == 'spam'].iloc[:, 3:].sum().sum()

# Number of words in vocabulary is equal to number of columns
# minus the first three columns (index, Label and SMS)
n_voc = train_2.shape[1] - 3

# Setting the smoothing parameter
alpha = 1

Now we calculate the probabilities P(wi|Spam) and P(wi|Ham) for each word in the vocabulary. These values differ from word to word, but for a given word they are fixed once training is done, so we can compute them once and reuse them for every new message.
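Written out, the smoothed estimates computed in the next cells take the following form, where $N_{w_i \mid \text{Spam}}$ is the number of times word $w_i$ occurs in spam messages (`n_word_spam` in the code), $N_{\text{Spam}}$ is the total number of words in spam messages (`n_spam`), $N_{\text{Vocabulary}}$ is the vocabulary size (`n_voc`), and $\alpha$ is the smoothing parameter:

$$
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
\qquad
P(w_i \mid \text{Ham}) = \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
$$

With $\alpha = 1$, a word that never occurs in one class still receives a small non-zero probability for that class.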

In [189]:
# Creating the vocabulary
vocabulary_1 = train_2.columns[3:]

# Creating new dictionaries that will be filled with the conditional probabilities for each word
prob_ham = {unique_word: 0 for unique_word in vocabulary_1}
prob_spam = prob_ham.copy()

# Dividing the dataframe into spam and ham messages
df_spam = train_2[train_2['Label'] == 'spam']
df_ham = train_2[train_2['Label'] == 'ham']

In [190]:
for word in vocabulary_1:
    # Calculating how many times the word is encountered in ham and spam messages
    n_word_ham = df_ham[word].sum()
    n_word_spam = df_spam[word].sum()
    
    # Calculating the smoothed conditional probability of the word given ham or spam
    p_word_ham = (n_word_ham + alpha) / (n_ham + alpha * n_voc)
    p_word_spam = (n_word_spam + alpha) / (n_spam + alpha * n_voc)
    
    # Updating the probabilities in the dictionaries
    prob_ham[word] = p_word_ham
    prob_spam[word] = p_word_spam
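To make the smoothing step concrete, here is a small worked example on invented counts, using exact fractions so the arithmetic is easy to check by hand. All `toy_` names and numbers are hypothetical and unrelated to the real training set.

```python
from fractions import Fraction

toy_alpha = 1

# Invented counts of each vocabulary word across all spam messages
toy_spam_counts = {'free': 2, 'win': 2, 'cash': 1, 'ok': 0}

toy_n_spam = sum(toy_spam_counts.values())  # 5 words in spam messages
toy_n_voc = len(toy_spam_counts)            # 4 words in the vocabulary

# Same formula as in the loop above: (count + alpha) / (n_spam + alpha * n_voc)
toy_prob_spam = {
    word: Fraction(count + toy_alpha, toy_n_spam + toy_alpha * toy_n_voc)
    for word, count in toy_spam_counts.items()
}

print(toy_prob_spam['free'])        # 1/3
print(toy_prob_spam['ok'])          # 1/9
print(sum(toy_prob_spam.values()))  # 1
```

The word 'ok' never occurs in the toy spam messages but still gets probability 1/9, and the smoothed probabilities over the whole vocabulary sum to 1.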

Now we can create a function that classifies new messages.
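For a message made of the words $w_1, \dots, w_n$, the comparison such a function performs can be written as:

$$
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})
$$

$$
P(\text{Ham} \mid w_1, \dots, w_n) \propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})
$$

The message is labeled with the class that gives the larger value. Both expressions share the same normalizing constant, so it can be left out of the comparison.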

In [191]:
import re

def classify(message):

    # Cleaning the message and transforming it into a list of words
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    # Starting each class score from its prior probability
    p_spam_message = p_spam
    p_ham_message = p_ham
    
    for word in message:
        # Words that are not in our vocabulary are ignored
        if word in vocabulary_1:
            p_spam_message = p_spam_message * prob_spam[word]
            p_ham_message = p_ham_message * prob_ham[word]

    if p_ham_message > p_spam_message:
        return 'ham'
    elif p_ham_message < p_spam_message:
        return 'spam'
    else:
        return 'needs human classification'
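Multiplying many small probabilities can underflow to 0.0 for long messages, which would make both scores equal. A common variant, not the one used in this notebook, sums logarithms instead. The sketch below is self-contained and uses invented toy parameters (all `toy_` names and values are hypothetical); the function name `classify_log` is also an assumption.

```python
import math
import re

# Toy parameters, invented for illustration only
toy_p_spam, toy_p_ham = 0.5, 0.5
toy_prob_spam = {'free': 0.4, 'win': 0.4, 'see': 0.1, 'u': 0.1}
toy_prob_ham = {'free': 0.1, 'win': 0.1, 'see': 0.4, 'u': 0.4}

def classify_log(message):
    """Classify a message by comparing sums of log-probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()

    # Start from the log of each prior
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    for word in words:
        # Words outside the toy vocabulary are ignored
        if word in toy_prob_spam:
            log_spam += math.log(toy_prob_spam[word])
            log_ham += math.log(toy_prob_ham[word])

    if log_ham > log_spam:
        return 'ham'
    if log_ham < log_spam:
        return 'spam'
    return 'needs human classification'

print(classify_log('Free win!!'))  # spam
print(classify_log('see u'))       # ham

# A plain product of 400 factors of 0.1 underflows to zero,
# while the log-based score still separates the classes
print(0.1 ** 400)                  # 0.0
print(classify_log('free ' * 500)) # spam
```

Because the logarithm is increasing, comparing the log scores gives the same decision as comparing the products whenever the products do not underflow.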

We can check how this function works.

In [192]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

'spam'

In [193]:
classify('Sounds good, Tom, then see u there')

'ham'

## Measuring the Spam Filter's Accuracy

We'll now try to determine how well the spam filter performs on our test set of 1,114 messages and compare the predicted values with the actual values.

In [194]:
# Applying the function and creating the corresponding column
test['predicted'] = test['SMS'].apply(classify)
test

Unnamed: 0,index,Label,SMS,predicted
0,2131,ham,later i guess i needa do mcat study too,ham
1,3418,ham,but i haf enuff space got like 4 mb,ham
2,3424,spam,had your mobile 10 mths update to latest oran...,spam
3,1538,ham,all sounds good fingers makes it difficult ...,ham
4,5393,ham,all done all handed in don t know if mega sh...,ham
...,...,...,...,...
1109,905,ham,we re all getting worried over here derek and...,ham
1110,5192,ham,oh oh den muz change plan liao go back h...,ham
1111,3980,ham,ceri u rebel sweet dreamz me little buddy c...,ham
1112,235,spam,text meet someone sexy today u can find a d...,spam


In [195]:
# Calculating the accuracy
accuracy = (test['Label'] == test['predicted']).sum() / len(test) * 100
accuracy

98.74326750448833

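The filter reached an accuracy of 98.74% on the test set, far above our original target of 80%. For context, always predicting 'ham' would already score about 86.8% on this test set, because that is the share of ham messages in it, so the majority-class share is the more meaningful baseline to beat.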
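Because ham messages dominate the data, accuracy alone can hide how well spam specifically is detected. The sketch below computes precision and recall for the spam class alongside accuracy. The helper name `spam_metrics` and the toy label lists are invented for illustration and are not results from this project.

```python
def spam_metrics(actual, predicted):
    """Return (accuracy, spam precision, spam recall) for two label lists."""
    pairs = list(zip(actual, predicted))

    # Counting true positives, false positives, and false negatives for spam
    tp = sum(a == 'spam' and p == 'spam' for a, p in pairs)
    fp = sum(a == 'ham' and p == 'spam' for a, p in pairs)
    fn = sum(a == 'spam' and p != 'spam' for a, p in pairs)

    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy labels, invented for illustration only
toy_actual = ['ham', 'ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'ham']
toy_predicted = ['ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham']

print(spam_metrics(toy_actual, toy_predicted))  # (0.75, 0.5, 0.5)
```

In the toy example, accuracy is 75% even though only half of the spam messages are caught, which is the kind of gap these extra metrics expose.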

## Conclusion

In this project, we built a spam filter for SMS messages using the multinomial Naive Bayes algorithm, working from a dataset of 5,572 messages classified by humans.

We trained the model on 80% of the dataset and reserved the remaining 20% for testing. Using Laplace smoothing with an alpha value of 1, we wrote a function that classifies messages as spam or ham.

On the test set, the filter reached an accuracy of 98.74%, well above our initial objective of 80%. This result indicates that the approach identifies spam reliably on this dataset and is a reasonable candidate for practical use.