# Building a Spam Filter with Naive Bayes

In this project I will create a spam filter for SMS messages using the multinomial Naive Bayes algorithm.

To classify messages as spam or non-spam, a computer:

1. Learns how humans classify messages.
2. Uses that human knowledge to estimate two probabilities for each new message: one for spam and one for non-spam.
3. Classifies the new message by comparing these two values. If the probability for spam is greater, the message is classified as spam; if the probability for non-spam is greater, it is classified as non-spam. If the two values are equal, a human may need to classify the message.

So my first task is to "teach" the computer how to classify messages. To do that, I'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

## Exploring the Dataset

In [1]:
import warnings

import pandas as pd
# Assumes pandas 1.5 or later, where this warning is exposed in pandas.errors
from pandas.errors import SettingWithCopyWarning

warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

sms = pd.read_csv('SMSSpamCollection',sep='\t',header=None,names=['Label', 'SMS'])
sms.shape

(5572, 2)

In [2]:
sms.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
ham_pct = (sms['Label']=='ham').sum()/sms['Label'].count()
print('The percentage of messages that are non-spam are {:.2f}%'.format(ham_pct*100))

The percentage of messages that are non-spam are 86.59%


In [4]:
spam_pct = (sms['Label']=='spam').sum()/sms['Label'].count()
print('The percentage of messages that are spam are {:.2f}%'.format(spam_pct*100))

The percentage of messages that are spam are 13.41%


## Training and Test Set

In the previous section, I read in the dataset and saw that about 87% of the messages are ham ("ham" means non-spam), and the remaining 13% are spam. Now that I'm a bit familiar with the dataset, I can move on to building the spam filter.

However, before creating it, I think it is best to decide how I will test it, so that I can see how well it works once it is built. To test the spam filter, I am first going to split the dataset into two parts:

- A training set, which I'll use to "train" the computer how to classify messages.
- A test set, which I'll use to test how good the spam filter is with classifying new messages.

I am going to keep 80% of the dataset for training, and 20% for testing. The dataset has 5,572 messages, which means that:

- The training set will have 4,457 messages (about 80% of the dataset).
- The test set will have 1,115 messages (about 20% of the dataset).
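
The exact sizes depend on how the 20% is rounded. As far as I know, scikit-learn's `train_test_split` rounds a fractional `test_size` up and gives the remainder to the training set. The short sketch below reproduces that with plain arithmetic; the variable names are only for illustration.

```python
import math

n_messages = 5572      # total messages in the dataset
test_fraction = 0.2    # share kept aside for testing

# Assumption: the test size is rounded up and the rest goes to training
n_test = math.ceil(test_fraction * n_messages)
n_train = n_messages - n_test

print(n_train, n_test)  # 4457 1115
```

These values match the shapes reported by the notebook cells that follow.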

In [5]:
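from sklearn.model_selection import train_test_split

# Shuffle the rows reproducibly, then split 80/20. train_test_split reshuffles
# without a random_state here, so the exact split and numbers below vary per run.
sms_random = sms.sample(frac=1, random_state=1)
train, test = train_test_split(sms_random, test_size=0.2)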

In [6]:
train.reset_index(inplace=True)
test.reset_index(inplace=True)
train.shape

(4457, 3)

In [7]:
test.shape

(1115, 3)

In [8]:
ham_pct = (train['Label']=='ham').sum()/train['Label'].count()
print('The percentage of messages that are non-spam in the training set are {:.2f}%'.format(ham_pct*100))

spam_pct = (train['Label']=='spam').sum()/train['Label'].count()
print('The percentage of messages that are spam in the training set are {:.2f}%'.format(spam_pct*100))

The percentage of messages that are non-spam in the training set are 86.31%
The percentage of messages that are spam in the training set are 13.69%


In [9]:
ham_pct = (test['Label']=='ham').sum()/test['Label'].count()
print('The percentage of messages that are non-spam in the test set are {:.2f}%'.format(ham_pct*100))

spam_pct = (test['Label']=='spam').sum()/test['Label'].count()
print('The percentage of messages that are spam in the test set are {:.2f}%'.format(spam_pct*100))

The percentage of messages that are non-spam in the test set are 87.71%
The percentage of messages that are spam in the test set are 12.29%


## Preparing the Data

### Letter Case and Punctuation

In this section I will teach the algorithm to classify new messages. To calculate the probabilities the algorithm needs, I'll first perform a bit of data cleaning to bring the data into a format from which I can easily extract all the information I need.

In [10]:
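# regex=True is required in recent pandas versions, where str.replace is literal by default
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()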
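
To see what this cleaning step does, here is a small standalone sketch that applies the same `\W` pattern with Python's `re` module. The sample text is made up, not taken from the dataset.

```python
import re

sample = "Hello, World!! It's 5pm."   # made-up message for illustration

# Replace every non-word character with a space, then lowercase
cleaned = re.sub(r'\W', ' ', sample).lower()

# Splitting on whitespace drops the extra spaces left behind
tokens = cleaned.split()

print(tokens)  # ['hello', 'world', 'it', 's', '5pm']
```

Note that apostrophes are replaced as well, so "It's" becomes the two tokens `it` and `s`.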

### Creating the Vocabulary

Now that I have removed the punctuation and changed all letters to lowercase, I will bring the training set into a format that can be used by the algorithm. The first step is to split each message into words and collect every unique word into a vocabulary.

In [11]:
train['SMS'] = train['SMS'].str.split()

vocabulary = []

for row in train['SMS']:
    for word in row:
        vocabulary.append(word)

vocabulary = set(vocabulary)
vocabulary = list(vocabulary)
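
As a minimal sketch of the same idea on toy data (the two messages below are hypothetical), a set comprehension collects each distinct word once, which is equivalent to appending every word to a list and then converting it to a set:

```python
# Hypothetical tokenized messages, for illustration only
toy_messages = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
]

# Each distinct word appears once; sorting just makes the output stable
toy_vocabulary = sorted({word for message in toy_messages for word in message})

print(toy_vocabulary)
```

The word `secret` occurs in both messages but appears only once in the vocabulary.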

### The Final Training Set

I have managed to create the vocabulary for the messages in the training set. Now I am going to use the vocabulary to transform the data so that each message is a row and each vocabulary word is a column holding the number of times that word occurs in the message. To do that, I'll first build a dictionary that I'll then convert to the DataFrame I need.

In [12]:
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

# The loop variable is named 'message' so it does not overwrite the sms DataFrame
for index, message in enumerate(train['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1
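
The structure of this dictionary is easier to see on a tiny example. The sketch below uses two hypothetical messages and illustrative names (`toy_counts` and so on) rather than the real training data:

```python
# Hypothetical tokenized messages, for illustration only
toy_messages = [
    ['win', 'money', 'win'],
    ['see', 'you', 'soon'],
]
toy_vocabulary = sorted({w for m in toy_messages for w in m})

# One list per word, with one counter slot per message
toy_counts = {w: [0] * len(toy_messages) for w in toy_vocabulary}

for i, message in enumerate(toy_messages):
    for w in message:
        toy_counts[w][i] += 1

for w in toy_vocabulary:
    print(w, toy_counts[w])
```

Each key becomes a column and each list position becomes a row once the dictionary is passed to `pd.DataFrame`.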

In [13]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

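| | 0 | 00 | 000 | 008704050406 | 0121 | 01223585236 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zoom | zouk | zyada | é | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |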


In [14]:
train_counts = pd.concat([train,word_counts],axis=1)
train_counts.head()

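| | index | Label | SMS | 0 | 00 | 000 | 008704050406 | 0121 | 01223585236 | 01223585334 | ... | zindgi | zoe | zogtorius | zoom | zouk | zyada | é | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3224 | ham | [well, that, must, be, a, pain, to, catch] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2813 | ham | [say, this, slowly, god, i, love, you, amp, i,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4839 | ham | [all, boys, made, fun, of, me, today, ok, i, h... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2891 | ham | [shuhui, has, bought, ron, s, present, it, s, ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4170 | ham | [haven, t, heard, anything, and, he, s, not, a... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |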


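## Building the Algorithm

### Calculating Constants First

Now that I am done with data cleaning and have a training set to work with, I can begin creating the spam filter. Some values are the same for every new message, so I calculate them once up front: the prior probabilities of spam and ham, the total number of words in spam and in ham messages, and the size of the vocabulary. I also set the smoothing parameter `alpha` to 1.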

In [15]:
spam = train_counts['Label']=='spam'
ham = train_counts['Label']=='ham'

p_spam = spam.sum()/len(train_counts) 
p_ham = ham.sum()/len(train_counts) 

n_spam = (train_counts[spam]['SMS'].str.len()).sum()
n_ham = (train_counts[ham]['SMS'].str.len()).sum()
n_vocabulary = len(vocabulary)

alpha = 1
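
For reference, these constants feed the multinomial Naive Bayes equations with additive (Laplace) smoothing, as I understand them. The notation is my own shorthand for the variables in the code: $N_{Spam}$ corresponds to `n_spam`, $N_{Vocabulary}$ to `n_vocabulary`, $\alpha$ to `alpha`, and $N_{w_i \mid Spam}$ to the number of times the word $w_i$ occurs in spam messages. The equations for ham are analogous.

```latex
P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)

P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
```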

### Calculating Parameters

In [16]:
p_w_spam = {}
p_w_ham = {}

for word in vocabulary:
    p_w_ham[word] = 0
    p_w_spam[word] = 0
    
df_ham = pd.DataFrame(train_counts[ham])
df_spam = pd.DataFrame(train_counts[spam])

In [17]:
for word in vocabulary:
    n_w_ham = df_ham[word].sum()
    n_w_spam = df_spam[word].sum()
    
    p_w_ham[word] = (n_w_ham + alpha)/(n_ham + (alpha * n_vocabulary))
    p_w_spam[word] = (n_w_spam + alpha)/(n_spam + (alpha * n_vocabulary))
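
The role of `alpha` is easiest to see with small numbers: without smoothing, a word never seen in spam would get probability 0 and wipe out the whole product for any message containing it. Here is a minimal sketch with made-up counts (the values and the helper name `smoothed_probability` are hypothetical):

```python
alpha = 1            # Laplace smoothing, as in the notebook
n_spam = 10          # hypothetical: total words across all spam messages
n_vocabulary = 5     # hypothetical: number of distinct words

def smoothed_probability(n_word_given_spam):
    """Return P(word|Spam) with additive smoothing."""
    return (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)

p_seen = smoothed_probability(3)     # word seen 3 times in spam: 4/15
p_unseen = smoothed_probability(0)   # word never seen in spam: 1/15

print(p_seen, p_unseen)
```

Even the unseen word ends up with a small non-zero probability.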

## Classifying A New Message

### Creating the Spam Filter

Now that I've calculated all the constants and parameters I need, I can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn)
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn)
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.


In [18]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in vocabulary:
            p_spam_given_message *= p_w_spam[word]
            p_ham_given_message *= p_w_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [19]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 2.299815708641833e-25
P(Ham|message): 2.893605406481076e-27
Label: Spam


In [20]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.3338574380410145e-25
P(Ham|message): 3.97687797635986e-21
Label: Ham
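
The values printed above are already tiny, and each extra word multiplies in another small factor, so a very long message could underflow to 0.0 in floating point. A common workaround, which is not what the notebook's `classify` does, is to add logarithms instead of multiplying probabilities. The sketch below uses made-up parameter values and a hypothetical `classify_log` helper to show the idea.

```python
import math

# Hypothetical parameters for illustration (not the trained values)
p_spam, p_ham = 0.13, 0.87
p_w_spam = {'win': 0.05, 'money': 0.04, 'see': 0.001}
p_w_ham = {'win': 0.002, 'money': 0.003, 'see': 0.02}

def classify_log(words):
    """Compare log-scores; summing logs avoids floating-point underflow."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        if word in p_w_spam:   # skip words outside the vocabulary
            log_spam += math.log(p_w_spam[word])
            log_ham += math.log(p_w_ham[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log(['win', 'money']))    # spam
print(classify_log(['see', 'unknown']))  # ham
```

Because the logarithm is monotonic, comparing the log-scores gives the same label as comparing the products whenever the products do not underflow.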


## Measuring the Spam Filter's Accuracy

In the previous section I managed to create a spam filter, and I classified two new messages. I'll now try to determine how well the spam filter does on the test set of 1,115 messages.

The algorithm will output a classification label for every message in the test set, which I'll be able to compare with the actual label (given by a human). Note that, in training, the algorithm didn't see these 1,115 messages, so every message in the test set is practically new from the perspective of the algorithm.

In [30]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in vocabulary:
            p_spam_given_message *= p_w_spam[word]
            p_ham_given_message *= p_w_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [32]:
predicted = test['SMS'].apply(classify_test_set)
test['predicted'] = predicted

In [33]:
test.head()

| | index | Label | SMS | predicted |
|---|---|---|---|---|
| 0 | 3160 | ham | Are you up for the challenge? I know i am :) | ham |
| 1 | 3599 | ham | Aight, we'll head out in a few | ham |
| 2 | 2212 | ham | Just gettin a bit arty with my collages at the... | ham |
| 3 | 620 | ham | Let there be snow. Let there be snow. This kin... | ham |
| 4 | 2956 | ham | Id have to check but there's only like 1 bowls... | ham |


In [34]:
correct = 0
total = len(test['SMS'])

for index, row in test.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1
prct = correct/total       
print('The accuracy of the algorithm is {:.2f}%'.format(prct*100))

The accuracy of the algorithm is 98.83%
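
To put an accuracy figure in context, it helps to compare it with a filter that always answers "ham", since most messages in this dataset are ham. The sketch below shows the calculation on a handful of hypothetical labels; the lists and the `accuracy` helper are for illustration only and are not the notebook's test data.

```python
# Hypothetical labels, for illustration only
actual =    ['ham', 'ham', 'spam', 'ham', 'spam', 'ham', 'ham', 'ham']
predicted = ['ham', 'ham', 'spam', 'ham', 'ham',  'ham', 'ham', 'ham']

def accuracy(actual_labels, predicted_labels):
    """Share of positions where the predicted label equals the actual one."""
    matches = sum(a == p for a, p in zip(actual_labels, predicted_labels))
    return matches / len(actual_labels)

filter_accuracy = accuracy(actual, predicted)                  # 7 of 8 correct
baseline_accuracy = accuracy(actual, ['ham'] * len(actual))    # always "ham": 6 of 8

print('Filter: {:.2f}%'.format(filter_accuracy * 100))
print('Always-ham baseline: {:.2f}%'.format(baseline_accuracy * 100))
```

On the real test set, where about 88% of the messages are ham, the always-ham baseline would score about 88%.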


## Conclusion

In this project, I managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.83% on the test set I used, which is a pretty good result. My initial goal was an accuracy of over 80%, and I managed to do way better than that.