# Multinomial Naive Bayes Spam Filter

Today we are making an sms spam filter using the multinomial Naive Bayes algorithm. 

## Setup 

In [2]:
import pandas as pd 
import numpy as np 

In [3]:
df = pd.read_csv('SMSSpamCollection', 
                 sep='\t',
                 header=None, 
                 names=['Label','SMS']
                )

In [4]:
print(df.shape)

(5572, 2)


In [5]:
df.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [6]:
#finding out what pctg is ham 

ham = df[df['Label']=='ham'].sum()

p_ham = len(ham) / len(df)

## Training

Before we write our filter, we must split the data into training and test data so that we can verify that it works in the way it's supposed to. 

Our training set will contain ~80% of the data. 

Our test set will contain the remaining ~20%. 

An important fact to notice is that 20% of the data we have has been classified by a human, we are going to use this to compare against our filter when we are done writing it. 

Our goal with this filter is for it to correctly classify more than 80% of the spam messages as spam.

In [7]:
#lets randomize the entire set to start

#I think we can skip the below step if we use the sample method like we 
#did farther down

    #df = df.sample(frac=1, random_state=1)
    #these two params 
        #randomize the entire dataset 
        #ensure our results are reproducible
    
#now we must split the dataset into training and test 

train = df.sample(frac=.8, random_state=1).reset_index(drop=True)

#notice how we didnt use sample here also or else we could have populated 
#the testing set with some of the same values as the training set
test = df.drop(train.index).reset_index(drop=True)

print(train.shape)
print(test.shape)

(4458, 2)
(1114, 2)


In [8]:
test['Label'].value_counts(normalize=True)

ham     0.869838
spam    0.130162
Name: Label, dtype: float64

In [9]:
train['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

The above mixtures of ham to spam are similar to our main dataset!

## Cleaning

Let's bring the data into a useable format!

In [10]:
train.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [11]:
#we want to replace the SMS column with a series of columns that capture
#each word from our vocabulary

#lets first strip every col of punctuation
    #using the vectorized string method .str.replace() along with the 
    #regex \W to get rid of punctuation
    
train['SMS'] = train['SMS'].str.replace('\W', ' ')

#using vectorized string method .str.lower() to convert to lower case
train['SMS'] = train['SMS'].str.lower()

train.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


In [12]:
#now we can create our vocab

#first we transform each msg from SMS into a list
    #can do this by split string at space char 
    #note: the method already splits on whitespace so we dont have to pass it anything

train['SMS'] = train['SMS'].str.split()

vocabulary = []

for msg in train['SMS']:
    
    for word in msg: 
        
        vocabulary.append(word)
    
#now transform the vocab list into a set using set()
    #this removes the duplicates 
    
vocabulary = set(vocabulary)

#now transform vocab list back into list using list()

vocabulary = list(vocabulary)

In [13]:
#lets check the length of our vocab 

print(len(vocabulary))

7783


Wow! looks like we have almost 800 unique words in our vocabulary!

In [14]:
#now we are going to use our vocab to create those enumerated columns we wanted

#lets start by creating a dictionary

    #in this dictionary the keys are the words from our vocab (remember they're all unique because we turned the list into a set)
    
    #the values are lists the length of the training set with each element equal to zero
        #the values are lists because we are using a list comprehension... hence the 'for...' in our dictionary
        
word_counts = {vocab_word: [0] * len(train['SMS']) for vocab_word in vocabulary}

#now we can loop thru our training data... 
    #at the end we increment the value corresponding to the word in our dictionary by 1 
for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts[word][index] += 1 

In [15]:
#turning the above dictionary into a df 

word_counts = pd.DataFrame(word_counts)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [16]:
#now we must combine the new dataframe from above with our training data so we can use it

train_final = pd.concat([train, word_counts], axis=1)
train_final.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


## Creating The Spam Filter

Now that our training data is prepared we can train our spam filter!

Lets first review our main equations...


$P(Spam|w_{1},w_{2},...,w_{n})\alpha P(Spam)\cdot \prod_{i=1}^{n}P(w_{i}|Spam)$

$P(Ham|w_{1},w_{2},...,w_{n}) \alpha P(Ham)\cdot \prod_{i=1}^{n}P(w_{i}|Ham)$

And the equations for the parameters...

$P(w_{i}|Spam)=\frac{N_{w_{i}|Spam}+α}{N_{Spam}+α⋅N_{Vocabulary}}$

$P(w_{i}|Ham)=\frac{N_{w_{i}|Ham}+α}{N_{Ham}+α⋅N_{Vocabulary}}$

**Constants**

In [17]:
#lets calulate some of the variables in our equations for the spam filter
    #to see these two equations (complete with additive smoothing) check the 
    #Naive Bayes notes in dataquest folder

    
#we need to find out how many msgs are spam using a loop 
    #we can then take this away from the total len of msgs to find numb of non spam msgs

spam_msg = 0 

for msg in train_final['Label']: 
    
    if msg == 'spam': 
        spam_msg += 1 

print(spam_msg)



600


In [18]:
#this is the probability that a new message is spam
p_spam = spam_msg / len(train_final)

#this is the probability that a new message is NOT spam
p_ham = (len(train_final) - spam_msg) / len(train_final)

print("this is the probability a msg is spam:", round(p_spam, 2))
print("this is the probability a msg is not spam:", round(p_ham, 2))

this is the probability a msg is spam: 0.13
this is the probability a msg is not spam: 0.87


In [19]:
#now we want the n_spam value 
    #this value is the total words there are in all spam messages
    
    #we can first isolate the spam messages
    
spam_msgs = train_final[train_final['Label'] == 'spam']

    #now we can apply the len function to the sms column to count the elements in each list
    #this will result in a list that we have to sum 
    
numb_words_per_spam_msg = spam_msgs['SMS'].apply(len)
n_spam = numb_words_per_spam_msg.sum()

In [20]:
#now we can do the same things but with non-spam msgs

ham_msgs = train_final[train_final['Label'] == 'ham']


numb_words_per_non_spam_msg = ham_msgs['SMS'].apply(len)
n_ham = numb_words_per_non_spam_msg.sum()

In [21]:
#to find the number of words in our vocab all we have to do is add 
#n_spam and n_ham 

n_vocab = len(vocabulary)

In [22]:
#we also must define our alpha parameter for additive smoothing 

alpha = 1 

In [23]:
#lastly lets print all these constants off 

print('\n', n_spam, '\n', n_ham, '\n', n_vocab)


 15190 
 57237 
 7783


Quick sanity check... these values seem reasonable! We can now move on to calculating our parameters

**Parameters**

Now that our constants are calculated we can move on to calculating our parameters. 

Recall...

$P(w_{i}|Spam)=\frac{N_{w_{i}|Spam}+α}{N_{Spam}+α⋅N_{Vocabulary}}$

$P(w_{i}|Ham)=\frac{N_{w_{i}|Ham}+α}{N_{Ham}+α⋅N_{Vocabulary}}$

In [24]:
#it's time to calculate our params 
    #these are the probabilities that P(wi|Spam) and P(wi|Ham) are going to take

#initializing two dicts to store our params

p_wi_spam = {} 
p_wi_ham = {}

    #we are going to fill these dicts with the entire vocab and each value set to 0 
    
for word in vocabulary: 
    
    p_wi_spam[word] = 0 
    p_wi_ham[word] = 0

#note that spam and ham messages have already been split into two dfs from when we calculated constants

#iterating thru the VOCAB!!! (not the training data), calculating P(wi|Spam) and P(wi|Ham), and assigning to relevant dicts
for word in vocabulary: 
    
    #recall N_wi|spam is the number of times the word wi occurs in all spam messages
    #the same logic applies for N_wi|ham
    
    n_wi_spam = spam_msgs[word].sum()
    p_wi_spam[word] = (n_wi_spam + alpha) / (n_spam + alpha * n_vocab)
    
    n_wi_ham = ham_msgs[word].sum()
    p_wi_ham[word] = (n_wi_ham + alpha) / (n_ham + alpha * n_vocab)
    
#we can grab a few parameter values from the p_wi_spam dict
x=list(p_wi_spam.values())

print(x[0:5])

#seems to work...

[0.00021764680276846734, 4.3529360553693465e-05, 4.3529360553693465e-05, 4.3529360553693465e-05, 4.3529360553693465e-05]


## Testing

We've just calculated all the constants and parameters we will need. Now we'd like to put this filter to the test by classifying a new message. 

From the lesson: 

The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn)
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn)
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
- If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
- If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
- If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.




In [26]:
import re

def classify(message):
    
    message = re.sub('\W', ' ', message) #remember \W matches a word character so here we are removing punctuation
    message = message.lower() 
    message = message.split() #splitting at space char and transforming the string into a list

    
    #these are just the main equations we listed above 
    
        #initate with a value -- we are going to use our constants
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
        #now iterate through message and include the appropriate values into
        #the above equations
        
    for word in message: 
        
        if word in p_wi_spam:
            #if in the vocab we need to include it in our equation above
            p_spam_given_message *= p_wi_spam[word] #the *= multiplies with left operand and assigns val to left operand
            
        if word in p_wi_ham:
            #same logic as above 
            p_ham_given_message *= p_wi_ham[word]
            
        else:
            pass 
       
    #this code below compares the results and classifies our result
    
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')

In [27]:
#now lets test it with a few phrases...

classify('WINNER!! This is the secret code to unlock the money: C3421.')

classify("Sounds good, Tom, then see u there")


#weeee it works!

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


## Measuring Accuracy

We would now like to see how accurate our spam filter is. This is going to involve us using the testing data we split off at the beginning of the project. 

In [28]:
# first we are going to change our spam filter to return the labels rather than print them... 


def classify_test(message):
    
    message = re.sub('\W', ' ', message) #remember \W matches a word character so here we are removing punctuation
    message = message.lower() 
    message = message.split() #splitting at space char and transforming the string into a list

    
    #these are just the main equations we listed above 
    
        #initate with a value -- we are going to use our constants
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
        #now iterate through message and include the appropriate values into
        #the above equations
        
    for word in message: 
        
        if word in p_wi_spam:
            #if in the vocab we need to include it in our equation above
            p_spam_given_message *= p_wi_spam[word] #the *= multiplies with left operand and assigns val to left operand
            
        if word in p_wi_ham:
            #same logic as above 
            p_ham_given_message *= p_wi_ham[word]
            
        else:
            pass 
    
    #notice we are changing this portion of the code to return strings so that we can 
    #evaluate the accuracy of our classifier against the labeled testing data
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'ask human'

In [30]:
#now lets add a column to our test data so that we can store the newly classified values

test['predictions'] = test['SMS'].apply(classify_test) #applying our test funciton that returns values
test.head()

Unnamed: 0,Label,SMS,predictions
0,ham,Aight should I just plan to come up later toni...,ham
1,ham,Die... I accidentally deleted e msg i suppose ...,ham
2,spam,Welcome to UK-mobile-date this msg is FREE giv...,spam
3,ham,This is wishing you a great day. Moji told me ...,ham
4,ham,Thanks again for your reply today. When is ur ...,ham


This looks promising... lets measure the accuracy using the following formula: 

Accuracy = # correctly classified messages / # incorrectly classified messages

In [48]:
correct = 0 
total = len(test)

#we are iterating over each row in our df using .iterrows()
#for each row we are checking if the label and predictions values match 
    #if they do we are increasing our value of correct
    
for row in test.iterrows():
    #selecting the row in our iterable 'row' --> this is because iterrows returns index series pairs 
    row = row[1]
    
    if row['Label'] == row['predictions']:
        correct += 1 
        
print('Accuracy:', correct / total)
print('Wrong:', total - correct)
print('Total', total)

Accuracy: 0.9910233393177738
Wrong: 10
Total 1114


Wow! What an impressive result!