# Building a spam filter with Naive Bayes

In this project, we're going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

To train the algorithm, we'll use a dataset of 5,572 SMS messages that have already been classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [1]:
import pandas as pd

sms_spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label', 'SMS_column'])
print(sms_spam.shape)
sms_spam.head()

(5572, 2)


Unnamed: 0,label,SMS_column
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
sms_spam['label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: label, dtype: float64

Above, we see that about 87% of the messages are ham, and the remaining 13% are spam. This sample looks representative, since in practice most messages that people receive are ham.

# Training and test set

Before creating the spam filter, it's very helpful to first think of a way of testing how well it works. When creating software (a spam filter is software), a good rule of thumb is that designing the test comes before creating the software. If we write the software first, then it's tempting to come up with a biased test just to make sure the software passes it.

Once our spam filter is done, we'll need to test how well it classifies new messages. To do that, we're first going to split our dataset into two parts:

- A training set, which we'll use to "train" the computer how to classify messages.
- A test set, which we'll use to test how well the spam filter classifies new messages.

To better understand the purpose of putting a test set aside, let's begin by observing that all 1,114 messages in our test set are already classified by a human. When the spam filter is ready, we're going to treat these messages as new and have the filter classify them. Once we have the results, we'll be able to compare the algorithm classification with that done by a human, and this way we'll see how good the spam filter really is.

As stated at the start, our goal is a spam filter that classifies new messages with an accuracy greater than 80%, so we expect more than 80% of the messages in the test set to be classified correctly as spam or ham.

In [3]:
# Shuffle the whole dataset; random_state=1 makes the shuffle reproducible
data_randomised = sms_spam.sample(frac=1, random_state=1).reset_index(drop=True)
data_randomised.head()

Unnamed: 0,label,SMS_column
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [4]:
# Use the first 80% of the shuffled rows for training and the rest for testing
split_index = round(len(data_randomised) * 0.8)
train_df = data_randomised[:split_index].copy()
test_df = data_randomised[split_index:].copy().reset_index(drop=True)
print(train_df.shape)
print(test_df.shape)
print(train_df.shape[0] + test_df.shape[0])

(4458, 2)
(1114, 2)
5572


We'll now analyze the percentage of spam and ham messages in the training and test sets. We expect the percentages to be close to what we have in the full dataset, where about 87% of the messages are ham, and the remaining 13% are spam.

In [5]:
train_df['label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: label, dtype: float64

In [6]:
test_df['label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: label, dtype: float64

The results look good! We'll now move on to cleaning the dataset.

# Data cleaning

To calculate all the probabilities required by the algorithm, we'll first need to do a bit of data cleaning to bring the data into a format that lets us easily extract the information we need.

## Letter case and punctuation

We'll begin with removing all the punctuation and bringing every letter to lower case.

In [7]:
import re

# Lower-case the text, then replace every non-word character with a space
train_df['SMS_column'] = train_df['SMS_column'].str.lower().str.strip()
train_df['SMS_column'] = train_df['SMS_column'].apply(lambda x: re.sub(r'\W', ' ', x))

train_df.head()

Unnamed: 0,label,SMS_column
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


In [8]:
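# Same cleaning for the test set; regex=True is needed in recent pandas versions,
# where str.replace treats the pattern as a literal string by default
test_df['SMS_column'] = test_df['SMS_column'].str.lower().str.strip()
test_df['SMS_column'] = test_df['SMS_column'].str.replace(r'\W', ' ', regex=True)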

test_df.head()

Unnamed: 0,label,SMS_column
0,ham,later i guess i needa do mcat study too
1,ham,but i haf enuff space got like 4 mb
2,spam,had your mobile 10 mths update to latest oran...
3,ham,all sounds good fingers makes it difficult ...
4,ham,all done all handed in don t know if mega sh...
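
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same two operations to a single string. The helper name `clean_message` is ours and is not part of the notebook; the sample text is the second message of the dataset as shown in the first table.

```python
import re

def clean_message(text):
    """Lower-case a message and replace every non-word character with a space."""
    return re.sub(r'\W', ' ', text.lower().strip())

sample = "Ok lar... Joking wif u oni..."

# Punctuation becomes spaces, so splitting afterwards yields only the words
print(clean_message(sample).split())

# Apostrophes are non-word characters too, which is why "don't" ends up as "don t"
print(clean_message("Don't"))
```

Note that contractions are split into two tokens this way (`don` and `t`), which matches what we see in the cleaned tables above.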


## Creating the vocabulary

Let's now move to creating the vocabulary, which in this context means a list with all the unique words in our training set.

In [9]:
train_df['SMS_column'] = train_df['SMS_column'].str.split()
test_df['SMS_column'] = test_df['SMS_column'].str.split()

In [10]:
vocabulary_train = []
vocabulary_test = []

# Series.iteritems() was removed in pandas 2.0; items() yields the same (index, value) pairs
for index, sms in train_df['SMS_column'].items():
    for word_train in sms:
        vocabulary_train.append(word_train)

for index, sms in test_df['SMS_column'].items():
    for word_test in sms:
        vocabulary_test.append(word_test)

vocabulary_train = set(vocabulary_train)
vocabulary_train = list(vocabulary_train)
vocabulary_test = set(vocabulary_test)
vocabulary_test = list(vocabulary_test)
print(len(vocabulary_train))
print(len(vocabulary_test))

7783
3605


In [11]:
word_counts_per_sms_train = {}
word_counts_per_sms_test = {}
for key in vocabulary_train:
    word_counts_per_sms_train[key] = [0]*len(train_df)
    
for key in vocabulary_test:
    word_counts_per_sms_test[key] = [0]*len(test_df)

#word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}
#for index, sms in enumerate(training_set['SMS']):
#     for word in sms:
#         word_counts_per_sms[word][index] += 1

In [12]:

# Every word is already a key in the dictionaries built above,
# so we only need to increment the slot of the current message.
for index, sms in train_df['SMS_column'].items():
    for word_train in sms:
        word_counts_per_sms_train[word_train][index] += 1

for index, sms in enumerate(test_df['SMS_column']):
    for word_test in sms:
        word_counts_per_sms_test[word_test][index] += 1


In [13]:
word_counts_train = pd.DataFrame(word_counts_per_sms_train)
word_counts_test = pd.DataFrame(word_counts_per_sms_test)
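
The vocabulary and the per-message word counts built in the cells above can be hard to picture with thousands of columns, so here is a small sketch of the same idea on three made-up messages. The messages and every `toy_` name are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy messages, already cleaned and split into words
toy_messages = [['secret', 'prize', 'claim', 'now'],
                ['coming', 'to', 'my', 'secret', 'party'],
                ['winner', 'claim', 'prize', 'prize']]

# Vocabulary: every unique word, sorted so the column order is reproducible
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list of zeros per word, with one slot per message
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

# Rows are messages, columns are words, cells are occurrence counts
toy_table = pd.DataFrame(toy_counts)
print(toy_table)
```

Each row of `toy_table` corresponds to one message, and a word that appears twice in a message (here `prize` in the last one) gets a count of 2.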

In [14]:
pd.options.display.max_columns = 200
word_counts_train.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,02073162414,02085076972,021,03,04,0430,05,050703,0578,06,07008009200,07046744435,07090201529,07090298926,07099833605,07123456789,0721072,07734396839,07742676969,07753741225,07781482378,07786200117,077xxx,078,07801543489,07808726822,07815296484,07821230901,078498,07880867867,0789xxxxxxx,07946746291,0796xxxxxx,07973788240,07xxxxxxxxx,08,0800,08000407165,08000776320,08000839402,08000930705,08000938767,08001950382,08002888812,08002986030,08002986906,08002988890,08006344447,0808,08081263000,08081560665,0825,083,0844,08448350055,08448714184,0845,08450542832,08452810073,08452810075over18,0870,08700435505150p,08700469649,08700621170150p,08701237397,08701417012,08701417012150p,0870141701216,087016248,087018728737,0870241182716,08702490080,08702840625,08704050406,08704439680ts,08706091795,0870737910216yrs,08707500020,08707509020,0870753331018,08708034412,08708800282,08709222922,08709501522,0871,087104711148,08712101358,08712103738,0871212025016,08712300220,...,xin,xmas,xuhui,xx,xxsp,xxuk,xxx,xxxx,xxxxx,xxxxxxx,xxxxxxxx,xxxxxxxxxxxxxx,xy,y,ya,yah,yahoo,yalrigu,yalru,yan,yar,yarasu,yards,yavnt,yaxx,yaxxx,yay,yck,yeah,year,years,yeh,yelling,yelow,yeovil,yep,yer,yes,yest,yesterday,yet,yetty,yetunde,yhl,yi,yifeng,yijue,ym,ymca,yo,yoga,yogasana,yor,yorge,you,youdoing,youi,young,younger,youphone,your,youre,yourinclusive,yourjob,yours,yourself,youuuuu,youwanna,yoville,yowifes,yoyyooo,yr,yrs,ystrday,ything,yummy,yun,yunny,yuo,yuou,yup,yupz,z,zac,zaher,zealand,zebra,zed,zeros,zhong,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0


In [15]:
train = pd.concat([train_df, word_counts_train], axis=1)
train.head()

Unnamed: 0,label,SMS_column,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,02073162414,02085076972,021,03,04,0430,05,050703,0578,06,07008009200,07046744435,07090201529,07090298926,07099833605,07123456789,0721072,07734396839,07742676969,07753741225,07781482378,07786200117,077xxx,078,07801543489,07808726822,07815296484,07821230901,078498,07880867867,0789xxxxxxx,07946746291,0796xxxxxx,07973788240,07xxxxxxxxx,08,0800,08000407165,08000776320,08000839402,08000930705,08000938767,08001950382,08002888812,08002986030,08002986906,08002988890,08006344447,0808,08081263000,08081560665,0825,083,0844,08448350055,08448714184,0845,08450542832,08452810073,08452810075over18,0870,08700435505150p,08700469649,08700621170150p,08701237397,08701417012,08701417012150p,0870141701216,087016248,087018728737,0870241182716,08702490080,08702840625,08704050406,08704439680ts,08706091795,0870737910216yrs,08707500020,08707509020,0870753331018,08708034412,08708800282,08709222922,08709501522,0871,087104711148,08712101358,08712103738,...,xin,xmas,xuhui,xx,xxsp,xxuk,xxx,xxxx,xxxxx,xxxxxxx,xxxxxxxx,xxxxxxxxxxxxxx,xy,y,ya,yah,yahoo,yalrigu,yalru,yan,yar,yarasu,yards,yavnt,yaxx,yaxxx,yay,yck,yeah,year,years,yeh,yelling,yelow,yeovil,yep,yer,yes,yest,yesterday,yet,yetty,yetunde,yhl,yi,yifeng,yijue,ym,ymca,yo,yoga,yogasana,yor,yorge,you,youdoing,youi,young,younger,youphone,your,youre,yourinclusive,yourjob,yours,yourself,youuuuu,youwanna,yoville,yowifes,yoyyooo,yr,yrs,ystrday,ything,yummy,yun,yunny,yuo,yuou,yup,yupz,z,zac,zaher,zealand,zebra,zed,zeros,zhong,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0


In [16]:
test = pd.concat([test_df, word_counts_test], axis=1)
test.head()

Unnamed: 0,label,SMS_column,0,00,000,008704050406,0121,01223585236,0125698789,02,021,03,04,06,07,07732584351,0776xxxxxxx,07786200117,07808,07808247860,07821230901,0800,08000839402,08000930705,08000938767,08001950382,08002888812,08002986030,08002986906,0808,08081560665,0844,0845,08452810071,08452810073,08452810075over18,08701213186,08701417012,08701752560,0870241182716,08704439680,08707808226,08712101358,08712300220,08712317606,08712404000,08712405020,0871277810710p,0871277810910p,08714712379,08715203028,08715203694,08715705022,08718727868,08718727870,08718738002,08719899217,08719899229,08719899230,09050000555,09050001808,09050002311,09058094507,09058094565,09058094594,09058098002,09058099801,09061221066,09061702893,09061743386,09061743806,09061743810,09061743811,09061749602,09063458130,09064012160,09064019788,09065069120,09065174042,09065394514,09065989180,09066350750,09066358361,09066612661,09090204448,09099726553,0a,1,10,100,1000,1000s,100percent,1030,10am,10p,10th,11,11mths,11pm,...,wks,wld,wn,wnevr,wnt,woke,woman,women,won,wondar,wonder,wonderful,wondering,wont,word,words,work,working,works,world,worried,worry,worse,worst,worth,wot,would,wouldn,wounds,wow,wrc,wrenching,write,wrk,wrnog,wrong,wt,wtf,wud,wudn,wuldnt,www,wylie,x,x29,xafter,xam,xchat,xmas,xoxo,xt,xx,xxx,xxxmobilemovieclub,xxxx,xxxxx,xxxxxx,xy,y,y87,ya,yam,yan,yar,yay,yck,yeah,year,years,yeesh,yeh,yellow,yen,yep,yes,yest,yesterday,yet,yetunde,yijue,ym,yo,yoga,yor,you,your,yours,yourself,yr,yrs,yummmm,yun,yunny,yuo,yup,zed,zoe,zoom,èn,ü
0,ham,"[later, i, guess, i, needa, do, mcat, study, too]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,ham,"[but, i, haf, enuff, space, got, like, 4, mb]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,spam,"[had, your, mobile, 10, mths, update, to, late...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,ham,"[all, sounds, good, fingers, makes, it, diffic...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,ham,"[all, done, all, handed, in, don, t, know, if,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# Calculating constants first
We're now done with cleaning the training set, and we can begin creating the spam filter. To classify a new message, the Naive Bayes algorithm needs to compute the following two quantities:

$$
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
$$

$$
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
$$

Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, we'll need to use these equations:

$$
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
$$

$$
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$

Some of the terms in the four equations above have the same value for every new message. We can calculate these terms once and reuse them whenever a new message comes in. Below, we'll use our training set to calculate:

- $P(Spam)$ and $P(Ham)$
- $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$<br>
  Note that 
  - $N_{Spam}$ is equal to the number of words in all the spam messages — it's not equal to the number of spam messages, and it's not equal to the total number of unique words in spam messages.
  - $N_{Ham}$ is equal to the number of words in all the non-spam messages — it's not equal to the number of non-spam messages, and it's not equal to the total number of unique words in non-spam messages.
  - $N_{Vocabulary}$ represents the number of unique words in all the messages — both spam and non-spam

We'll also use Laplace smoothing and set $\alpha = 1$.
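
To see what the smoothing does, here is a small worked sketch of the formula above with made-up counts. The function name `smoothed_probability` and all the toy numbers are assumptions for illustration, not values from our dataset.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Additive (Laplace) smoothing, following the formula in the text."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Hypothetical counts: 7 words in all spam messages, vocabulary of 9 words
toy_vocabulary = ['secret', 'prize', 'claim', 'now', 'coming',
                  'to', 'my', 'party', 'winner']
toy_spam_counts = {'prize': 3, 'claim': 2, 'secret': 1, 'now': 1}
n_spam = sum(toy_spam_counts.values())
n_vocabulary = len(toy_vocabulary)

p_prize = smoothed_probability(toy_spam_counts['prize'], n_spam, n_vocabulary)
p_party = smoothed_probability(0, n_spam, n_vocabulary)  # word never seen in spam

print(p_prize)  # (3 + 1) / (7 + 9)
print(p_party)  # (0 + 1) / (7 + 9)

# The smoothed probabilities over the whole vocabulary still add up to 1
total = sum(smoothed_probability(toy_spam_counts.get(word, 0), n_spam, n_vocabulary)
            for word in toy_vocabulary)
print(total)
```

Without smoothing, a word that never occurs in spam would get a probability of 0 and wipe out the whole product for any message containing it. With $\alpha = 1$ it gets a small but non-zero value instead.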

In [17]:
train_spam_messages = train[train['label']=='spam']
train_ham_messages = train[train['label']=='ham']

# P(Spam) and P(Ham)
p_spam_train = len(train_spam_messages)/len(train)
p_ham_train = len(train_ham_messages)/len(train)

# N_Spam
n_words_per_spam_message = train_spam_messages['SMS_column'].apply(len)
n_spam_train = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = train_ham_messages['SMS_column'].apply(len)
n_ham_train = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary_train = len(vocabulary_train)

# Laplace smoothing
alpha = 1


# Calculating parameters

Now that we have the constant terms calculated above, we can move on with calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter will thus be a conditional probability value associated with each word in the vocabulary.

The parameters are calculated using the formulas:

$$
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
$$

$$
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$

Note that $N_{w_i|Spam}$ is equal to the number of times the word $w_i$ occurs in all the spam messages, while $N_{w_i|Ham}$ is equal to the number of times the word $w_i$ occurs in all the ham messages.

In [18]:
spam_parameters = {unique_word: 0 for unique_word in vocabulary_train}
ham_parameters = {unique_word: 0 for unique_word in vocabulary_train}

In [19]:
for word in train.columns[2:]:
    # train_spam_messages was defined in a cell above
    n_word_given_spam = train_spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam_train + alpha * n_vocabulary_train)
    spam_parameters[word] = p_word_given_spam

    # train_ham_messages was defined in a cell above
    n_word_given_ham = train_ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham_train + alpha * n_vocabulary_train)
    ham_parameters[word] = p_word_given_ham
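
The loop above visits one column at a time. As an alternative, the same parameters can be computed for all words at once by summing the columns of each class. The sketch below shows this on a tiny made-up table; the `toy` frame and its values are hypothetical, and this is one possible variant rather than the approach used in this notebook.

```python
import pandas as pd

# Hypothetical word-count table: one row per message, one column per word
toy = pd.DataFrame({
    'label':  ['spam', 'ham', 'spam'],
    'secret': [1, 1, 0],
    'prize':  [1, 0, 2],
    'party':  [0, 1, 0],
})
alpha = 1
word_columns = toy.columns[1:]
n_vocabulary = len(word_columns)

# Column sums give N_{w_i|Spam} and N_{w_i|Ham} for every word in one step
spam_counts = toy.loc[toy['label'] == 'spam', word_columns].sum()
ham_counts = toy.loc[toy['label'] == 'ham', word_columns].sum()

# Every word is a column, so the totals equal the number of words per class
n_spam = spam_counts.sum()
n_ham = ham_counts.sum()

spam_params = ((spam_counts + alpha) / (n_spam + alpha * n_vocabulary)).to_dict()
ham_params = ((ham_counts + alpha) / (n_ham + alpha * n_vocabulary)).to_dict()

print(spam_params)
print(ham_params)
```

The resulting dictionaries have the same shape as `spam_parameters` and `ham_parameters`: one smoothed probability per word.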

# Classifying a new message

Now that we have all our parameters calculated, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn).
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
  - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
  - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
  - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.

In [27]:
def classify(message):
    
    p_spam_given_message = p_spam_train
    p_ham_given_message = p_ham_train
    
    for word in message:
        if word in spam_parameters:
            p_spam_given_message *= spam_parameters[word]
        if word in ham_parameters:
            p_ham_given_message *= ham_parameters[word]
            
#     print('P(Spam|message) = ', p_spam_given_message)
#     print('P(Ham|message) = ', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Equal probabilities, have a human classify this!'
    


In [30]:
test['predicted'] = test['SMS_column'].apply(classify)
test[['label','SMS_column','predicted']].head()

Unnamed: 0,label,SMS_column,predicted
0,ham,"[later, i, guess, i, needa, do, mcat, study, too]",ham
1,ham,"[but, i, haf, enuff, space, got, like, 4, mb]",ham
2,spam,"[had, your, mobile, 10, mths, update, to, late...",spam
3,ham,"[all, sounds, good, fingers, makes, it, diffic...",ham
4,ham,"[all, done, all, handed, in, don, t, know, if,...",ham
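
The `classify` function multiplies one probability per word. For short SMS messages this is fine, but for very long inputs the product can become so small that it is rounded to zero. A common way around this is to add logarithms instead of multiplying probabilities, since the comparison between the two classes stays the same. Below is a minimal sketch of that variant; the function `classify_log` and the `toy_` parameters are hypothetical and not part of the notebook.

```python
import math

# Hypothetical priors and parameters, for illustration only
toy_p_spam, toy_p_ham = 0.5, 0.5
toy_spam_parameters = {'prize': 0.4, 'claim': 0.3, 'party': 0.1, 'secret': 0.2}
toy_ham_parameters = {'prize': 0.1, 'claim': 0.1, 'party': 0.5, 'secret': 0.3}

def classify_log(message):
    """Compare log P(Spam|message) and log P(Ham|message), up to a shared constant."""
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    for word in message:
        # Words outside the vocabulary are skipped, as in classify()
        if word in toy_spam_parameters:
            log_spam += math.log(toy_spam_parameters[word])
        if word in toy_ham_parameters:
            log_ham += math.log(toy_ham_parameters[word])

    if log_ham > log_spam:
        return 'ham'
    elif log_ham < log_spam:
        return 'spam'
    return 'needs human classification'

print(classify_log(['claim', 'prize']))
print(classify_log(['secret', 'party']))

# A plain product underflows to 0.0 here, while the log version still decides
print(0.4 ** 1000 == 0.0)
print(classify_log(['prize'] * 1000))
```

Because the logarithm is an increasing function, the class with the larger log score is also the class with the larger probability.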


Now, we'll measure the accuracy of our spam filter on the test set to find out how well it does.

In [31]:
correct = 0

for index, row in test.iterrows():
    if row['label'] == row['predicted']:
        correct += 1
        
print('Correct: ', correct)
print('Incorrect: ', len(test)-correct)
print('Accuracy: ', correct/len(test))

Correct:  1100
Incorrect:  14
Accuracy:  0.9874326750448833


The accuracy is close to 98.74%, which is really good. Our spam filter looked at 1,114 messages that it hasn't seen in training, and classified 1,100 correctly.
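
Accuracy alone can hide how the errors are distributed. Since about 87% of the messages are ham, a filter that labels everything as ham would already reach roughly 87% accuracy. The sketch below shows, on a made-up results table, how the accuracy can be computed without a loop and how a cross-tabulation separates the two kinds of errors. The `toy_results` frame is hypothetical.

```python
import pandas as pd

# Hypothetical labels and predictions, for illustration only
toy_results = pd.DataFrame({
    'label':     ['ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'spam', 'spam', 'ham', 'ham'],
})

# The mean of a boolean Series is the share of True values, i.e. the accuracy
accuracy = (toy_results['label'] == toy_results['predicted']).mean()
print(accuracy)

# Rows are the true labels, columns are the predictions
confusion = pd.crosstab(toy_results['label'], toy_results['predicted'])
print(confusion)
```

In the cross-tabulation, the off-diagonal cells count the ham messages flagged as spam and the spam messages that slipped through as ham.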

# Next Steps

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.74% on the test set we used, which is a pretty good result. Our initial goal was an accuracy of over 80%, and we managed to do way better than that.

Next steps include:

- Analyze the 14 messages that were classified incorrectly and try to figure out why the algorithm classified them incorrectly
- Make the filtering process more complex by making the algorithm sensitive to letter case