In this project, we're going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm.

Our goal is to write a program that classifies new messages with an accuracy greater than 80%. In other words, we expect more than 80% of new messages to be classified correctly as either spam or ham (non-spam, legitimate).

To train the algorithm, we'll use a dataset of 5,572 SMS messages that have already been classified by humans.
* The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository. The data collection process is described in more detail on this page, where you can also find some of the authors' papers.

In [1]:
import numpy as np
import pandas as pd
import re

#  Exploring the Dataset

In [2]:
classified_SMS_messages = pd.read_csv('SMSSpamCollection',sep='\t',header=None,names=['Label', 'SMS'])

In [3]:
classified_SMS_messages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [4]:
classified_SMS_messages

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


## Message Label Distribution in the dataset

In [5]:
dataset_msgs_label_distribution = classified_SMS_messages['Label'].value_counts(normalize=True)*100
dataset_msgs_label_distribution

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

About 87% of the messages in the dataset are ham (non-spam), and roughly 13% are spam (unwanted or unsolicited messages).
* This sample looks representative, since in practice most messages that people receive are ham.

# Training and Test Set

We're going to start by randomizing the entire dataset to ensure that spam and ham messages are spread properly throughout the dataset.

## Randomizing the dataset

In [6]:
randomized_classified_SMS_messages = classified_SMS_messages.sample(frac=1,random_state=1)
number_of_SMS_messages = randomized_classified_SMS_messages.shape[0]
randomized_classified_SMS_messages

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...
...,...,...
905,ham,"We're all getting worried over here, derek and..."
5192,ham,Oh oh... Den muz change plan liao... Go back h...
3980,ham,CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C...
235,spam,Text & meet someone sexy today. U can find a d...


We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:
* The training set will have 4,458 messages (about 80% of the dataset).
* The test set will have 1,114 messages (about 20% of the dataset).
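
The split sizes above follow directly from rounding 80% of the dataset size. A minimal sketch of that arithmetic, using only the figures quoted in this section (the variable names are illustrative):

```python
# Reproduce the 80/20 split sizes quoted above.
total_messages = 5572      # dataset size reported by .info()
train_fraction = 0.8

# 0.8 * 5572 = 4457.6, which rounds to 4458
split_index = round(train_fraction * total_messages)

train_size = split_index
test_size = total_messages - split_index

print(train_size, test_size)  # 4458 1114
```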

## Splitting the randomized dataset into a training and a test set

In [7]:
training_test_division_index = round(0.8*number_of_SMS_messages)

training_set = randomized_classified_SMS_messages.iloc[:training_test_division_index].reset_index(drop=True)
test_set = randomized_classified_SMS_messages.iloc[training_test_division_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [8]:
training_set_msgs_label_distribution = training_set['Label'].value_counts(normalize=True)*100
training_set_msgs_label_distribution

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [9]:
test_set_msgs_label_distribution = test_set['Label'].value_counts(normalize=True)*100
test_set_msgs_label_distribution

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

In both the training and test sets, ham and spam messages account for about 87% and 13% of the entries, respectively. This is similar to what we observed in the entire dataset, so both the training and test sets are representative.

# Letter Case and Punctuation

To calculate all the probabilities required by the algorithm, we'll first need to clean the data and bring it into a format that makes it easy to extract the information we need.

Essentially, we want a table with one row per message and one column per unique word in the vocabulary, where each cell holds the number of times that word occurs in that message.
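
To make the target format concrete, here is a minimal sketch on three made-up messages that are assumed to be already lowercased and free of punctuation. It uses only the standard library rather than pandas, and the names `messages`, `vocabulary` and `count_rows` are illustrative, not part of the project code.

```python
from collections import Counter

# Three made-up messages, already lowercased and stripped of punctuation.
messages = [
    "secret prize claim now",
    "coming to my secret party",
    "winner claim your prize prize",
]

tokenized = [message.split() for message in messages]

# The vocabulary is the set of unique words across all messages.
vocabulary = sorted({word for words in tokenized for word in words})

# One row per message, one column per vocabulary word.
count_rows = []
for words in tokenized:
    counts = Counter(words)  # a missing word counts as 0
    count_rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_rows:
    print(row)
```

Each row sums to the number of words in the corresponding message, and a repeated word (such as "prize" in the third message) shows up as a count greater than 1.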

## Cleaning the training set

We'll first remove all punctuation and convert every message to lowercase.

In [10]:
# Non-word characters are replaced by a space rather than removed because of apostrophes:
# a contraction such as "there's" stands for two words, so the space splits it into
# "there" and "s" (the second word in short form).
# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default.
training_set_unpunctuated = training_set['SMS'].str.replace(r'\W', ' ', regex=True)

# lowercasing ensures that the same word written in different letter cases is counted as one word
training_set_unpunctuated_lowercase = training_set_unpunctuated.str.lower()

training_set['cleaned_SMS'] = training_set_unpunctuated_lowercase
training_set

Unnamed: 0,Label,SMS,cleaned_SMS
0,ham,"Yep, by the pretty sculpture",yep by the pretty sculpture
1,ham,"Yes, princess. Are you going to make me moan?",yes princess are you going to make me moan
2,ham,Welp apparently he retired,welp apparently he retired
3,ham,Havent.,havent
4,ham,I forgot 2 ask ü all smth.. There's a card on ...,i forgot 2 ask ü all smth there s a card on ...
...,...,...,...
4453,ham,"Sorry, I'll call later in meeting any thing re...",sorry i ll call later in meeting any thing re...
4454,ham,Babe! I fucking love you too !! You know? Fuck...,babe i fucking love you too you know fuck...
4455,spam,U've been selected to stay in 1 of 250 top Bri...,u ve been selected to stay in 1 of 250 top bri...
4456,ham,Hello my boytoy ... Geeee I miss you already a...,hello my boytoy geeee i miss you already a...
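
The same cleaning step can be tried on a single string with the standard `re` module. This is a minimal sketch: `clean_message` is a hypothetical helper and the input string is made up for illustration.

```python
import re

def clean_message(text):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r'\W', ' ', text).lower()

raw = "There's a card on da present!"
cleaned = clean_message(raw)
tokens = cleaned.split()

print(tokens)  # ['there', 's', 'a', 'card', 'on', 'da', 'present']
```

Note how the apostrophe turns "There's" into the two tokens "there" and "s", which is the behavior described in the code comments above.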


# Creating the Vocabulary

In [11]:
# will contain the set of unique words in the messages belonging to the training set
SMS_vocabulary = list()

In [12]:
training_set['SMS_words'] = training_set['cleaned_SMS'].str.split()
training_set

Unnamed: 0,Label,SMS,cleaned_SMS,SMS_words
0,ham,"Yep, by the pretty sculpture",yep by the pretty sculpture,"[yep, by, the, pretty, sculpture]"
1,ham,"Yes, princess. Are you going to make me moan?",yes princess are you going to make me moan,"[yes, princess, are, you, going, to, make, me,..."
2,ham,Welp apparently he retired,welp apparently he retired,"[welp, apparently, he, retired]"
3,ham,Havent.,havent,[havent]
4,ham,I forgot 2 ask ü all smth.. There's a card on ...,i forgot 2 ask ü all smth there s a card on ...,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."
...,...,...,...,...
4453,ham,"Sorry, I'll call later in meeting any thing re...",sorry i ll call later in meeting any thing re...,"[sorry, i, ll, call, later, in, meeting, any, ..."
4454,ham,Babe! I fucking love you too !! You know? Fuck...,babe i fucking love you too you know fuck...,"[babe, i, fucking, love, you, too, you, know, ..."
4455,spam,U've been selected to stay in 1 of 250 top Bri...,u ve been selected to stay in 1 of 250 top bri...,"[u, ve, been, selected, to, stay, in, 1, of, 2..."
4456,ham,Hello my boytoy ... Geeee I miss you already a...,hello my boytoy geeee i miss you already a...,"[hello, my, boytoy, geeee, i, miss, you, alrea..."


In [13]:
for sms in training_set['SMS_words']:
    SMS_vocabulary.extend(sms)

SMS_vocabulary

['yep',
 'by',
 'the',
 'pretty',
 'sculpture',
 'yes',
 'princess',
 'are',
 'you',
 'going',
 'to',
 'make',
 'me',
 'moan',
 'welp',
 'apparently',
 'he',
 'retired',
 'havent',
 'i',
 'forgot',
 '2',
 'ask',
 'ü',
 'all',
 'smth',
 'there',
 's',
 'a',
 'card',
 'on',
 'da',
 'present',
 'lei',
 'how',
 'ü',
 'all',
 'want',
 '2',
 'write',
 'smth',
 'or',
 'sign',
 'on',
 'it',
 'ok',
 'i',
 'thk',
 'i',
 'got',
 'it',
 'then',
 'u',
 'wan',
 'me',
 '2',
 'come',
 'now',
 'or',
 'wat',
 'i',
 'want',
 'kfc',
 'its',
 'tuesday',
 'only',
 'buy',
 '2',
 'meals',
 'only',
 '2',
 'no',
 'gravy',
 'only',
 '2',
 'mark',
 '2',
 'no',
 'dear',
 'i',
 'was',
 'sleeping',
 'p',
 'ok',
 'pa',
 'nothing',
 'problem',
 'ill',
 'be',
 'there',
 'on',
 'lt',
 'gt',
 'ok',
 'my',
 'uncles',
 'in',
 'atlanta',
 'wish',
 'you',
 'guys',
 'a',
 'great',
 'semester',
 'my',
 'phone',
 'ok',
 'which',
 'your',
 'another',
 'number',
 'the',
 'greatest',
 'test',
 'of',
 'courage',
 'on',
 'earth',
 ...

In [14]:
# vocabulary may contain duplicates, hence convert to set and back to list    
SMS_vocabulary = list(set(SMS_vocabulary))
SMS_vocabulary

['acknowledgement',
 'managed',
 'married',
 'fuckin',
 'bringing',
 'dice',
 'drugdealer',
 'control',
 'japanese',
 'xxxxx',
 'make',
 'cash',
 'imprtant',
 '09041940223',
 'da',
 'gender',
 'taught',
 'hunks',
 'anniversary',
 'sporadically',
 'rr',
 '08709222922',
 'tiz',
 'jamz',
 '09065174042',
 '3lp',
 'pile',
 'england',
 '08712402972',
 'or2optout',
 'onam',
 'hug',
 'settling',
 'tke',
 'stamped',
 'footprints',
 'verify',
 'matthew',
 'date',
 '04',
 'goodnoon',
 'oble',
 'convincing',
 'text',
 'mornings',
 'stash',
 'nowhere',
 'jjc',
 'mahfuuz',
 'christmas',
 '09058094455',
 'convince',
 'pence',
 'astne',
 'specialisation',
 'later',
 '400mins',
 'roles',
 'loves',
 'subject',
 'sleep',
 'able',
 'fro',
 '2bold',
 'funs',
 'pic',
 'rental',
 'blocked',
 'wa',
 'same',
 'hav2hear',
 '07xxxxxxxxx',
 '07880867867',
 'sathy',
 'hair',
 'well',
 'vu',
 'ru',
 'which',
 'dosomething',
 'jungle',
 'plyr',
 'noun',
 'outgoing',
 'finishd',
 'laxinorficated',
 'semiobscure',
 ...

In [15]:
# drop any empty strings left over from cleaning (for example, a message that consisted only of punctuation)
SMS_vocabulary = [word for word in SMS_vocabulary if word != '']
SMS_vocabulary

['acknowledgement',
 'managed',
 'married',
 'fuckin',
 'bringing',
 'dice',
 'drugdealer',
 'control',
 'japanese',
 'xxxxx',
 'make',
 'cash',
 'imprtant',
 '09041940223',
 'da',
 'gender',
 'taught',
 'hunks',
 'anniversary',
 'sporadically',
 'rr',
 '08709222922',
 'tiz',
 'jamz',
 '09065174042',
 '3lp',
 'pile',
 'england',
 '08712402972',
 'or2optout',
 'onam',
 'hug',
 'settling',
 'tke',
 'stamped',
 'footprints',
 'verify',
 'matthew',
 'date',
 '04',
 'goodnoon',
 'oble',
 'convincing',
 'text',
 'mornings',
 'stash',
 'nowhere',
 'jjc',
 'mahfuuz',
 'christmas',
 '09058094455',
 'convince',
 'pence',
 'astne',
 'specialisation',
 'later',
 '400mins',
 'roles',
 'loves',
 'subject',
 'sleep',
 'able',
 'fro',
 '2bold',
 'funs',
 'pic',
 'rental',
 'blocked',
 'wa',
 'same',
 'hav2hear',
 '07xxxxxxxxx',
 '07880867867',
 'sathy',
 'hair',
 'well',
 'vu',
 'ru',
 'which',
 'dosomething',
 'jungle',
 'plyr',
 'noun',
 'outgoing',
 'finishd',
 'laxinorficated',
 'semiobscure',
 ...

In [16]:
vocabulary_length = len(SMS_vocabulary)

There are 7,783 unique words in our vocabulary.

# The Final Training Set

We're now going to use the vocabulary we just created to transform the training set into a table of word counts, with one column per vocabulary word.

In [17]:
training_set_msgs_word_distribution = {unique_word: [0] * len(training_set['SMS']) for unique_word in SMS_vocabulary}

for index, sms in enumerate(training_set['SMS_words']):
    for word in sms:
        training_set_msgs_word_distribution[word][index] += 1

training_set_msgs_word_distribution = pd.DataFrame(training_set_msgs_word_distribution)
training_set_msgs_word_distribution

Unnamed: 0,acknowledgement,managed,married,fuckin,bringing,dice,drugdealer,control,japanese,xxxxx,...,breeze,republic,wan2,smidgin,dizzamn,courageous,actor,maid,box326,vodka
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4453,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4456,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Concatenating the messages' word distribution with the training set

In [18]:
training_set = pd.concat([training_set,training_set_msgs_word_distribution], axis=1)
training_set

Unnamed: 0,Label,SMS,cleaned_SMS,SMS_words,acknowledgement,managed,married,fuckin,bringing,dice,...,breeze,republic,wan2,smidgin,dizzamn,courageous,actor,maid,box326,vodka
0,ham,"Yep, by the pretty sculpture",yep by the pretty sculpture,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"Yes, princess. Are you going to make me moan?",yes princess are you going to make me moan,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,Welp apparently he retired,welp apparently he retired,"[welp, apparently, he, retired]",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,Havent.,havent,[havent],0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,I forgot 2 ask ü all smth.. There's a card on ...,i forgot 2 ask ü all smth there s a card on ...,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4453,ham,"Sorry, I'll call later in meeting any thing re...",sorry i ll call later in meeting any thing re...,"[sorry, i, ll, call, later, in, meeting, any, ...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,ham,Babe! I fucking love you too !! You know? Fuck...,babe i fucking love you too you know fuck...,"[babe, i, fucking, love, you, too, you, know, ...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,spam,U've been selected to stay in 1 of 250 top Bri...,u ve been selected to stay in 1 of 250 top bri...,"[u, ve, been, selected, to, stay, in, 1, of, 2...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4456,ham,Hello my boytoy ... Geeee I miss you already a...,hello my boytoy geeee i miss you already a...,"[hello, my, boytoy, geeee, i, miss, you, alrea...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Calculating Constants First

The multinomial Naive Bayes algorithm will need to know the probability values of the two equations below to be able to classify new messages:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}
\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, we need to use these equations:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}
\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

Some of the terms in the four equations above will have the same value for every new message. Below, we'll use our training set to calculate:
* $P(Spam)$ and $P(Ham)$
* $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$

We'll also use Laplace smoothing and set $\alpha = 1$.


## Calculating P(Spam) and P(Ham)

In [19]:
p_spam = training_set_msgs_label_distribution['spam']/100
p_spam

0.13458950201884254

In [20]:
p_ham = training_set_msgs_label_distribution['ham']/100
p_ham

0.8654104979811575

## Calculating NSpam and NHam

### NSpam and vocabulary word count in spam messages

In [21]:
spam_sms_messages = training_set.loc[training_set['Label'] == 'spam']
spam_sms_messages

Unnamed: 0,Label,SMS,cleaned_SMS,SMS_words,acknowledgement,managed,married,fuckin,bringing,dice,...,breeze,republic,wan2,smidgin,dizzamn,courageous,actor,maid,box326,vodka
16,spam,FreeMsg Why haven't you replied to my text? I'...,freemsg why haven t you replied to my text i ...,"[freemsg, why, haven, t, you, replied, to, my,...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18,spam,Congrats! 2 mobile 3G Videophones R yours. cal...,congrats 2 mobile 3g videophones r yours cal...,"[congrats, 2, mobile, 3g, videophones, r, your...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
56,spam,FREE MESSAGE Activate your 500 FREE Text Messa...,free message activate your 500 free text messa...,"[free, message, activate, your, 500, free, tex...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60,spam,Call from 08702490080 - tells u 2 call 0906635...,call from 08702490080 tells u 2 call 0906635...,"[call, from, 08702490080, tells, u, 2, call, 0...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
61,spam,Someone has conacted our dating service and en...,someone has conacted our dating service and en...,"[someone, has, conacted, our, dating, service,...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4437,spam,Congratulations YOU'VE Won. You're a Winner in...,congratulations you ve won you re a winner in...,"[congratulations, you, ve, won, you, re, a, wi...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4439,spam,Win the newest “Harry Potter and the Order of ...,win the newest harry potter and the order of ...,"[win, the, newest, harry, potter, and, the, or...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4443,spam,Someone U know has asked our dating service 2 ...,someone u know has asked our dating service 2 ...,"[someone, u, know, has, asked, our, dating, se...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4449,spam,YOUR CHANCE TO BE ON A REALITY FANTASY SHOW ca...,your chance to be on a reality fantasy show ca...,"[your, chance, to, be, on, a, reality, fantasy...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
spam_sms_messages_word_count = spam_sms_messages.iloc[:,4:].sum()
spam_sms_messages_word_count

acknowledgement    0
managed            0
married            2
fuckin             0
bringing           1
                  ..
courageous         0
actor              0
maid               0
box326             1
vodka              0
Length: 7783, dtype: int64

In [23]:
n_spam = spam_sms_messages_word_count.sum() 
n_spam

15190

### NHam and vocabulary word count in ham messages 

In [24]:
ham_sms_messages = training_set.loc[training_set['Label'] == 'ham']
ham_sms_messages

Unnamed: 0,Label,SMS,cleaned_SMS,SMS_words,acknowledgement,managed,married,fuckin,bringing,dice,...,breeze,republic,wan2,smidgin,dizzamn,courageous,actor,maid,box326,vodka
0,ham,"Yep, by the pretty sculpture",yep by the pretty sculpture,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"Yes, princess. Are you going to make me moan?",yes princess are you going to make me moan,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,Welp apparently he retired,welp apparently he retired,"[welp, apparently, he, retired]",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,Havent.,havent,[havent],0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,I forgot 2 ask ü all smth.. There's a card on ...,i forgot 2 ask ü all smth there s a card on ...,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4452,ham,"How about clothes, jewelry, and trips?",how about clothes jewelry and trips,"[how, about, clothes, jewelry, and, trips]",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4453,ham,"Sorry, I'll call later in meeting any thing re...",sorry i ll call later in meeting any thing re...,"[sorry, i, ll, call, later, in, meeting, any, ...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,ham,Babe! I fucking love you too !! You know? Fuck...,babe i fucking love you too you know fuck...,"[babe, i, fucking, love, you, too, you, know, ...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4456,ham,Hello my boytoy ... Geeee I miss you already a...,hello my boytoy geeee i miss you already a...,"[hello, my, boytoy, geeee, i, miss, you, alrea...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
ham_sms_messages_word_count = ham_sms_messages.iloc[:,4:].sum()
ham_sms_messages_word_count

acknowledgement    1
managed            2
married            2
fuckin             3
bringing           1
                  ..
courageous         1
actor              2
maid               7
box326             0
vodka              3
Length: 7783, dtype: int64

In [26]:
n_ham = ham_sms_messages_word_count.sum()
n_ham

57237

## Calculating NVocabulary 

In [27]:
n_vocabulary = vocabulary_length

## Setting smoothing parameter as 1 (i.e. Laplace Smoothing)

In [28]:
alpha = 1

# Calculating Parameters

The probability values that $P(w_i|Spam)$ and $P(w_i|Ham)$ take are called parameters.

Let's now calculate all the parameters using the equations below:
\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}
\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

In [29]:
p_vocabulary_word_given_spam = (spam_sms_messages_word_count + alpha)/(n_spam + alpha*n_vocabulary)
p_vocabulary_word_given_spam

acknowledgement    0.000044
managed            0.000044
married            0.000131
fuckin             0.000044
bringing           0.000087
                     ...   
courageous         0.000044
actor              0.000044
maid               0.000044
box326             0.000087
vodka              0.000044
Length: 7783, dtype: float64

In [30]:
p_vocabulary_word_given_ham = (ham_sms_messages_word_count + alpha)/(n_ham + alpha*n_vocabulary)
p_vocabulary_word_given_ham

acknowledgement    0.000031
managed            0.000046
married            0.000046
fuckin             0.000062
bringing           0.000031
                     ...   
courageous         0.000031
actor              0.000046
maid               0.000123
box326             0.000015
vodka              0.000062
Length: 7783, dtype: float64
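
As a quick sanity check, a few of these parameters can be recomputed by hand from the constants found earlier ($N_{Spam} = 15190$, $N_{Ham} = 57237$, $N_{Vocabulary} = 7783$, $\alpha = 1$) and the word counts printed above. The helper name `smoothed_probability` is hypothetical; it simply restates the smoothing formula.

```python
def smoothed_probability(word_count, class_word_total, vocabulary_size, alpha=1):
    """Additive (Laplace) smoothing: (N_w|class + alpha) / (N_class + alpha * N_vocabulary)."""
    return (word_count + alpha) / (class_word_total + alpha * vocabulary_size)

# Constants taken from the outputs earlier in this notebook.
n_spam, n_ham, n_vocabulary = 15190, 57237, 7783

p_married_spam = smoothed_probability(2, n_spam, n_vocabulary)  # 3 / 22973
p_vodka_spam = smoothed_probability(0, n_spam, n_vocabulary)    # 1 / 22973
p_maid_ham = smoothed_probability(7, n_ham, n_vocabulary)       # 8 / 65020

print(f"{p_married_spam:.6f} {p_vodka_spam:.6f} {p_maid_ham:.6f}")
# 0.000131 0.000044 0.000123
```

The word "vodka" never appears in a spam message in the training set, yet smoothing still gives it a small non-zero probability, so a single unseen word cannot force the whole product to zero.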

# Classifying A New Message

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:

1. Takes in as input a new message $(w_1, w_2, ..., w_n)$
2. Calculates $P(Spam|w_1, w_2, ..., w_n)$ and $P(Ham|w_1, w_2, ..., w_n)$
3. Compares the values of $P(Spam|w_1, w_2, ..., w_n)$ and $P(Ham|w_1, w_2, ..., w_n)$, and:
   * If $P(Ham|w_1, w_2, ..., w_n) > P(Spam|w_1, w_2, ..., w_n)$, then the message is classified as ham.
   * If $P(Ham|w_1, w_2, ..., w_n) < P(Spam|w_1, w_2, ..., w_n)$, then the message is classified as spam.
   * If $P(Ham|w_1, w_2, ..., w_n) = P(Spam|w_1, w_2, ..., w_n)$, then the algorithm may request human help.
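
The three steps can be sketched on a toy problem before applying them to the real parameters. Everything below is hypothetical: the priors and word probabilities are hand-picked for a three-word vocabulary, not taken from the training set, and `classify` is an illustrative name.

```python
# Hypothetical, hand-picked parameters for a three-word vocabulary.
P_SPAM, P_HAM = 0.4, 0.6
P_WORD_GIVEN_SPAM = {"prize": 0.5, "meeting": 0.1, "now": 0.4}
P_WORD_GIVEN_HAM = {"prize": 0.1, "meeting": 0.6, "now": 0.3}

def classify(words):
    # Step 2: start from the priors and multiply in P(word|class) for each known word.
    score_spam, score_ham = P_SPAM, P_HAM
    for word in words:
        if word in P_WORD_GIVEN_SPAM:  # words outside the vocabulary are skipped
            score_spam *= P_WORD_GIVEN_SPAM[word]
            score_ham *= P_WORD_GIVEN_HAM[word]
    # Step 3: compare the two scores.
    if score_ham > score_spam:
        return "ham"
    if score_spam > score_ham:
        return "spam"
    return "needs human review"

print(classify(["prize", "now"]))               # spam
print(classify(["meeting", "now", "unknown"]))  # ham
```

For the first message the scores are 0.4 × 0.5 × 0.4 = 0.08 for spam against 0.6 × 0.1 × 0.3 = 0.018 for ham, so it is classified as spam.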

##  Multinomial Naive Bayes Algorithm

In [31]:
def classify_SMS_message(msg):
    # clean the message the same way as the training data:
    # replace non-word characters with spaces, lowercase, and split into words
    msg = re.sub(r'\W', ' ', msg)
    msg = msg.lower()
    msg = msg.split()
    
    # start from the priors P(Spam) and P(Ham)
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    # multiply in P(word|class) for every known word; words outside the vocabulary are ignored
    for word in msg:
        if word in SMS_vocabulary:
            p_spam_given_message *= p_vocabulary_word_given_spam[word]
            p_ham_given_message *= p_vocabulary_word_given_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Equal probabilities, have a human classify this!'

## Example SMS messages 

Testing out a few example SMS messages:
* SPAM : 'WINNER!! This is the secret code to unlock the money: C3421.'
* HAM : "Sounds good, Tom, then see u there"

In [32]:
test_result_1 = classify_SMS_message('WINNER!! This is the secret code to unlock the money: C3421.')
test_result_1

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.936804902858988e-27


'spam'

In [33]:
test_result_2 = classify_SMS_message("Sounds good, Tom, then see u there")
test_result_2

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.6875304350092384e-21


'ham'
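
The scores printed above are already tiny, because each known word multiplies the running product by a value around 0.0001 or smaller. For a long enough message the product can underflow to exactly 0.0 for both classes, which would make the comparison meaningless. A common remedy, shown here as a sketch rather than as part of this project's code, is to add logarithms instead of multiplying probabilities. The word probability and message length below are made-up values chosen to trigger the underflow.

```python
import math

word_probability = 1e-4   # hypothetical P(word|class), similar in size to the parameters above
message_length = 100      # hypothetical, unusually long message

product = 1.0
log_sum = 0.0
for _ in range(message_length):
    product *= word_probability            # shrinks toward 0 and eventually underflows
    log_sum += math.log(word_probability)  # stays a moderate negative number

print(product)   # 0.0
print(log_sum)   # roughly -921.03
```

Because the logarithm is monotonic, comparing the two log scores gives the same classification as comparing the two products whenever the products are still representable.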

# Measuring the Spam Filter's Accuracy

We'll now try to determine how well the spam filter does on our test set of 1,114 messages.
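
Accuracy here means the share of test messages whose predicted label matches the label assigned by humans. A minimal sketch of that calculation on made-up labels (the lists `actual` and `predicted` are illustrative, not taken from the test set):

```python
# Made-up labels for five messages.
actual = ["ham", "spam", "ham", "ham", "spam"]
predicted = ["ham", "spam", "spam", "ham", "spam"]

# Count the positions where the prediction matches the human label.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

print(f"{accuracy:.0%}")  # 80%
```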

In [34]:
test_set['predicted'] = test_set['SMS'].apply(classify_SMS_message)

P(Spam|message): 3.4831070937898343e-26
P(Ham|message): 4.2532451305346546e-19
P(Spam|message): 3.113880725982859e-34
P(Ham|message): 9.669410959057878e-29
P(Spam|message): 7.548549643070596e-83
P(Ham|message): 4.338466063216561e-98
P(Spam|message): 3.608707853113582e-34
P(Ham|message): 1.4814957224618124e-28
P(Spam|message): 2.764395015074001e-68
P(Ham|message): 6.58114288387539e-58
P(Spam|message): 3.003832099003537e-110
P(Ham|message): 1.3968662892114072e-88
P(Spam|message): 6.630543201285272e-08
P(Ham|message): 1.5368220027591872e-05
P(Spam|message): 1.6750161514573394e-44
P(Ham|message): 9.822271357766385e-39
P(Spam|message): 1.2938388793330702e-42
P(Ham|message): 5.700659615758871e-36
P(Spam|message): 1.0298350092198955e-15
P(Ham|message): 6.458118857256102e-15
P(Spam|message): 6.697358800236469e-16
P(Ham|message): 1.8371872356584518e-12
P(Spam|message): 1.0054992015050658e-41
P(Ham|message): 3.978760177910221e-34
...
P(Spam|message): 1.0017655262318273e-75
P(Ham|message): 1.3051851587461705e-59
P(Spam|message): 6.250595262285196e-62
P(Ham|message): 5.227700160407619e-53
P(Spam|message): 5.2583742204923065e-55
P(Ham|message): 7.622734849489379e-47
P(Spam|message): 2.8000509396766564e-17
P(Ham|message): 5.211962113253055e-14
P(Spam|message): 3.118431873017779e-90
P(Ham|message): 1.3845042384319287e-79
P(Spam|message): 3.782448814659465e-105
P(Ham|message): 8.985839698274243e-96
P(Spam|message): 5.125943878276225e-14
P(Ham|message): 5.259870873593938e-12
P(Spam|message): 3.6596998863925e-61
P(Ham|message): 4.57639111240514
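
The probabilities printed above are extremely small (some are below 1e-160) because each one is a product of many per-word probabilities. For longer messages such a product can underflow to exactly 0.0 in floating point, which would leave the two classes tied. A common workaround is to add log-probabilities instead of multiplying raw probabilities. The sketch below is only an illustration under assumptions: `log_score`, the priors, and the per-word probabilities are hypothetical toy values, not the parameters or code of the filter trained in this notebook.

```python
import math

def log_score(words, log_prior, word_probs):
    """Sum log-probabilities instead of multiplying raw probabilities."""
    score = log_prior
    for word in words:
        if word in word_probs:  # ignore words that are not in the vocabulary
            score += math.log(word_probs[word])
    return score

# Hypothetical toy parameters (assumed for illustration only)
p_spam, p_ham = 0.13, 0.87
spam_word_probs = {"win": 0.02, "prize": 0.015, "now": 0.01}
ham_word_probs = {"win": 0.001, "prize": 0.0005, "now": 0.008}

# A deliberately long message so that the raw product underflows
message = ["win", "prize", "now"] * 100

raw_product = p_spam
for word in message:
    raw_product *= spam_word_probs[word]
print(raw_product)  # the raw product has underflowed to 0.0

log_spam = log_score(message, math.log(p_spam), spam_word_probs)
log_ham = log_score(message, math.log(p_ham), ham_word_probs)
predicted = "spam" if log_spam > log_ham else "ham"
print(log_spam, log_ham, predicted)
```

Because the logarithm is monotonic, comparing the two log scores gives the same decision as comparing the raw products whenever the products are representable.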

In [35]:
test_set

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham
...,...,...,...
1109,ham,"We're all getting worried over here, derek and...",ham
1110,ham,Oh oh... Den muz change plan liao... Go back h...,ham
1111,ham,CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C...,ham
1112,spam,Text & meet someone sexy today. U can find a d...,spam


Now we can compare the predicted values with the actual values to measure how good our spam filter is at classifying new messages. We'll use accuracy as the metric:

\begin{equation}
\text{Accuracy} = \frac{\text{number of correctly classified messages}}{\text{total number of classified messages}}
\end{equation}

In [36]:
num_correctly_classified_messages = (test_set['Label'] == test_set['predicted']).sum()
num_correctly_classified_messages

1100

In [37]:
total_num_classified_messages = test_set.shape[0]
total_num_classified_messages

1114

In [38]:
accuracy = num_correctly_classified_messages/total_num_classified_messages
accuracy_pct = accuracy*100
print("{}%".format(accuracy_pct))

98.74326750448833%


The accuracy is about 98.74%, well above our 80% target. Our spam filter looked at 1,114 messages that it hadn't seen in training and classified 1,100 of them correctly.
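
Since roughly 87% of the messages in the dataset are ham, accuracy alone does not show how the errors are split between the two classes. One way to check is a table of actual versus predicted labels. The sketch below assumes a DataFrame with `Label` and `predicted` columns like `test_set`; the helper name `evaluate` and the toy rows are hypothetical and stand in for the real data.

```python
import pandas as pd

def evaluate(df, actual="Label", predicted="predicted"):
    """Return the accuracy and a table of actual vs. predicted label counts."""
    accuracy = (df[actual] == df[predicted]).mean()
    confusion = pd.crosstab(df[actual], df[predicted])
    return accuracy, confusion

# Toy stand-in for test_set (hypothetical rows, not real results)
toy = pd.DataFrame({
    "Label":     ["ham", "ham", "ham", "spam", "spam"],
    "predicted": ["ham", "ham", "spam", "spam", "ham"],
})

accuracy, confusion = evaluate(toy)
print("{:.2%}".format(accuracy))
print(confusion)
```

Calling `evaluate(test_set)` in the notebook should reproduce the accuracy computed above and show how many ham messages were flagged as spam and how many spam messages slipped through.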

In this project, we built a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter reached an accuracy of 98.74% on the test set, misclassifying only 14 of 1,114 messages. Our initial goal was an accuracy of over 80%, and the filter comfortably exceeded it.