<font color=red>TITLE:</font> Bayesian Filter for SMS Spam

<font color=red>DESCRIPTION</font>

This notebook uses a dataset of SMS messages labeled as spam or ham to build a filter for future spam messages. The filter was developed as follows:

- Import the dataset and split it into training and test DataFrames
- Clean the SMS messages in the training DataFrame and split them into words
- Create a column for each unique word in the SMS messages
- Count the number of times each word appears in each message of the training DataFrame
- Calculate the values needed for the Naive Bayes algorithm
- Create a function that removes non-word characters from an SMS message and classifies it as ham or spam
- Run the function on the test DataFrame and measure the accuracy of the ham/spam classification




In [1]:
#import pandas, read file
import pandas as pd
df = pd.read_csv(r"C:\Users\drrdm\Data Quest Guided Projects\11th Guided Project - SMS Spam filter\SMSSpamCollection", 
                 header = None, sep = '\t', names = ['Label', 'SMS'])

In [2]:
#how many rows and columns
df.shape

(5572, 2)

In [3]:
#explore dataset
df.head(3)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...


In [4]:
#relative frequency of spam and ham
df['Label'].value_counts(normalize= True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

In [5]:
#split data into training and test sets
df = df.sample(frac = 1, random_state = 1) #randomize full dataset

dftest = df.iloc[0:1113].copy().reset_index(drop = True)
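#NOTE: slice ends are exclusive, so the test set is rows 0-1112 and row 1113 lands in neither set;
#df.iloc[1113:] would include it (the outputs below come from the slice as written)
dftrain = df.iloc[1114:5572].copy().reset_index(drop = True)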

print(dftest['Label'].value_counts(normalize = True))
print(dftrain['Label'].value_counts(normalize = True))

ham     0.867925
spam    0.132075
Name: Label, dtype: float64
ham     0.86541
spam    0.13459
Name: Label, dtype: float64


<font color=red>NOTE:</font> The distribution of ham and spam in the test and training datasets is similar 
to the distribution in `SMSSpamCollection`.
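
As a hedged illustration of the slicing arithmetic behind the split: plain lists stand in for the DataFrame here, and `split_rows` is a hypothetical helper that is not part of the notebook. A test slice that ends at `n_test` and a training slice that starts at the same `n_test` cover every row exactly once.

```python
def split_rows(rows, n_test):
    """Return (test, train) so that every row lands in exactly one part."""
    test = rows[:n_test]      # positions 0 .. n_test - 1
    train = rows[n_test:]     # positions n_test .. end, no gap
    return test, train

rows = list(range(5572))      # stand-in for the 5572 shuffled messages
test, train = split_rows(rows, 1113)
print(len(test), len(train))  # 1113 4459
```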

In [7]:
#replace non-word characters in the SMS column of the train dataset with spaces, then lower-case

dftrain['SMS'] = dftrain['SMS'].str.replace(pat = r'\W', repl = ' ', 
                                             regex=True).str.lower()



In [8]:
#split each message into a list of words

train_SMSsplit = dftrain['SMS'].str.split()

#list of all words in SMS
vocabulary = []
for row in train_SMSsplit:
    for word in row:
        vocabulary.append(word)

#remove duplicates, convert back to list
s = set(vocabulary)
vocabulary = list(s)

In [10]:
#check for duplicates in vocabulary
import collections
counter=collections.Counter(vocabulary)
print(counter.most_common(20))

[('eg', 1), ('nookii', 1), ('09061209465', 1), ('pussy', 1), ('db', 1), ('gravel', 1), ('yo', 1), ('mns', 1), ('avoid', 1), ('maintain', 1), ('goldviking', 1), ('exeter', 1), ('jiu', 1), ('ignoring', 1), ('appreciate', 1), ('sticky', 1), ('careers', 1), ('mr', 1), ('mix', 1), ('anything', 1)]
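
The vocabulary step can also be sketched in a few lines without pandas. This is only an illustration on invented messages; `tokenize` is a hypothetical helper that mirrors the cleaning used above (non-word characters replaced by spaces, lower-casing, splitting on whitespace).

```python
import re

# Toy messages, invented for illustration.
messages = ["Free entry, WIN a prize!", "Ok lar... see u there", "win WIN win"]

def tokenize(message):
    """Replace non-word characters with spaces, lower-case, and split."""
    return re.sub(r"\W", " ", message).lower().split()

# A set comprehension removes duplicates; sorting makes the order reproducible.
vocabulary = sorted({word for message in messages for word in tokenize(message)})
print(vocabulary)
# ['a', 'entry', 'free', 'lar', 'ok', 'prize', 'see', 'there', 'u', 'win']
```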


In [20]:
#use code provided by Dataquest to create dictionary of unique words

word_counts_per_sms = {unique_word: [0] * len(train_SMSsplit) 
                       for unique_word in vocabulary}

#the dict comprehension above creates a dict (word_counts_per_sms) with the unique words as keys;
#each value is a list of zeros as long as the train df

#train_SMSsplit is a Series whose rows are lists of the words in the SMS column of the df

for index, sms in enumerate(train_SMSsplit): #loop through the rows of the Series: index is the row position, sms the word list
    for word in sms:                         #loop through the words in that row's list
        word_counts_per_sms[word][index] += 1 #add 1 at position [index] of the list stored under key [word]
        
#the inner loop walks through the word list in each row of train_SMSsplit;
#every word is a key in word_counts_per_sms, so 1 is added to that word's list at position [index], which started at 0
#The outer loop then goes to the next row and the process is repeated.
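
A sparse alternative to the dense lists of zeros, sketched under the assumption that only per-message lookups are needed: one `collections.Counter` per message stores just the words that occur in it. The names `tokenized` and `counts_per_sms` are illustrative and do not appear in the notebook.

```python
from collections import Counter

# Toy stand-in for train_SMSsplit: one list of words per message.
tokenized = [["free", "win", "win"], ["ok", "see", "u"]]

# One Counter per message; words that do not occur take no space.
counts_per_sms = [Counter(words) for words in tokenized]

print(counts_per_sms[0]["win"])  # 2
print(counts_per_sms[1]["win"])  # 0, because a Counter returns 0 for missing keys
```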

In [None]:
#NOTE to self:

#what does word_counts_per_sms look like?
#firstpair = {k: word_counts_per_sms[k] 
#               for k in list(word_counts_per_sms)[:1]}

#Another way to look at word_counts_per_sms first 2 entries
# import itertools
# dict(itertools.islice(word_counts_per_sms.items(), 2))


In [33]:
# Turn word_counts_per_sms into a dataframe
train_dict = pd.DataFrame(data = word_counts_per_sms)

In [24]:
#concat dftrain (label, SMS) with train_dict (individual words in SMS)

dftrain_final = pd.concat([dftrain,train_dict], axis = 1, sort = False)

In [34]:
# the train df now has the ham/spam label, the cleaned SMS message, and, for every word in the vocabulary,
# a count of how often that word appears in the message
dftrain_final.head(1)

Unnamed: 0,Label,SMS,eg,nookii,09061209465,pussy,db,gravel,yo,mns,...,loyal,mmmm,barolla,gravity,winds,annoyin,detroit,fromwrk,god,killing
0,ham,yeah do don t stand to close tho you ll catc...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
# check randomly chosen words (recently,pin,soon,to) to see if the column count is > 0 (validate this was done right)

print('recently: ', dftrain_final['recently'].sum())
print('pin: ', dftrain_final['pin'].sum())
print('soon: ',dftrain_final['soon'].sum())
print('to: ', dftrain_final['to'].sum())

recently:  4
pin:  5
soon:  46
to:  1788


<br><br><br>
Calculate the values needed for the Bayes filtering algorithm
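
For reference, the quantities computed in the next cells correspond to the usual multinomial Naive Bayes formulas with additive (Laplace) smoothing. The symbol names below are chosen for illustration: $P(\text{Spam})$ maps to `pspam`, $N_{\text{Spam}}$ to `spam_no_words`, $N_{\text{Vocabulary}}$ to `no_vocabulary`, and $\alpha$ to `alpha`. The ham formulas are analogous.

```latex
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}),
\qquad
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \, N_{\text{Vocabulary}}}
```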

In [36]:
pspam = .1345 #from value counts earlier
pham = .8654 #from value counts earlier
alpha = 1

#number of words in the spam and ham messages

spam_df = dftrain_final[dftrain_final['Label'] == 'spam'] #boolean index for spam labeled rows
spam_no_words = spam_df['SMS'].str.split().apply(len).sum() #split SMS message by words, calculate # of words, sum column 

ham_df = dftrain_final[dftrain_final['Label'] == 'ham'] #boolean index for ham labeled rows
ham_no_words = ham_df['SMS'].str.split().apply(len).sum() #split SMS message by words, calculate # of words, sum column

#number of words in the vocabulary

no_vocabulary = len(vocabulary)

In [29]:
#create dicts of ham and spam words 
ham_words = {}
spam_words = {}
for word in vocabulary: #vocabulary is a list of unique words from the SMS column of the df
    ham_words[word] = 0 #making a ham dict with all the unique words as keys and value 0
    spam_words[word] = 0 #ditto 

In [37]:
#count word occurrences in ham and spam, then store the smoothed probabilities P(word|Ham) and P(word|Spam)

for word in vocabulary:
    n_ham_word = ham_df[word].sum() # number of times the word occurs in the ham df

    # (n_ham_word + alpha) / (total number of words in ham df + alpha * number of words in the vocabulary list)
    p_ham_word = (n_ham_word + alpha)/(ham_no_words + alpha * no_vocabulary) 

    ham_words[word] = p_ham_word # update the ham words dict with the probability of the word given a ham SMS 
    
    # same as above, for spam
    n_spam_word = spam_df[word].sum()
    p_spam_word = (n_spam_word + alpha)/(spam_no_words + alpha * no_vocabulary)
    spam_words[word] = p_spam_word

In [31]:
#check the first two entries of one of the dictionaries to verify
import itertools
dict(itertools.islice(spam_words.items(), 2))

{'eg': 0.0003908285565398645, 'nookii': 8.685079034219212e-05}
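
A small worked example of the smoothing formula may help when checking values like these by hand. The numbers are invented, and `smoothed_probability` is a hypothetical helper, not a function from the notebook.

```python
def smoothed_probability(word_count, total_words, vocabulary_size, alpha=1):
    """P(word | class) with additive (Laplace) smoothing."""
    return (word_count + alpha) / (total_words + alpha * vocabulary_size)

# Toy numbers, not taken from the dataset.
p_unseen = smoothed_probability(0, total_words=90, vocabulary_size=10)
p_seen = smoothed_probability(4, total_words=90, vocabulary_size=10)
print(p_unseen, p_seen)  # 0.01 0.05
```

Because of the `alpha` term, a word that never occurs in a class still gets a small non-zero probability, so a single unseen word cannot zero out the whole product.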

In [38]:
#some of this code provided by Dataquest

import re  #import regex

def classify(message):

    message = re.sub(r'\W', ' ', message) #replace non-word characters in the message with spaces
    message = message.lower() 
    message = message.split()

    p_spam_given_message = pspam #start from the prior P(Spam)
    p_ham_given_message = pham   #start from the prior P(Ham)
    for word in message:
        if word in spam_words:
            p_spam_given_message *= spam_words[word] #multiply by the probability of the word given a spam message
        if word in ham_words:
            p_ham_given_message *= ham_words[word] #same for ham

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [39]:
#Test the filter on a ham message
classify('Sounds good Tom. Then see u there')

P(Spam|message): 4.771573236579387e-25
P(Ham|message): 3.4555424517038425e-21
Label: Ham


In [40]:
#Test the filter on a spam message
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.2776455587679249e-25
P(Ham|message): 2.5841115002039434e-27
Label: Spam
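
The products above are already around 1e-25 for short messages, and multiplying many small probabilities can underflow to zero for long ones. A common alternative, which is not what this notebook does, is to add logarithms instead of multiplying. The sketch below assumes the word probabilities are passed in as dicts; `classify_log` and the toy probabilities are hypothetical.

```python
import math

def classify_log(words, p_spam, p_ham, spam_probs, ham_probs):
    """Compare log-scores for spam and ham; words missing from a dict are skipped."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        if word in spam_probs:
            log_spam += math.log(spam_probs[word])
        if word in ham_probs:
            log_ham += math.log(ham_probs[word])
    if log_ham > log_spam:
        return "ham"
    if log_spam > log_ham:
        return "spam"
    return "needs human classification"

# Toy probabilities, invented for illustration.
spam_probs = {"win": 0.05, "see": 0.001}
ham_probs = {"win": 0.001, "see": 0.02}
print(classify_log(["win", "win"], 0.1345, 0.8654, spam_probs, ham_probs))  # spam
print(classify_log(["see", "u"], 0.1345, 0.8654, spam_probs, ham_probs))    # ham
```

Since the logarithm is monotonic, comparing the sums gives the same label as comparing the products whenever the products do not underflow.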



Run the filter on the test dataset

In [41]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = pspam
    p_ham_given_message = pham
    for word in message:
        if word in spam_words:
            p_spam_given_message *= spam_words[word]
        if word in ham_words:
            p_ham_given_message *= ham_words[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [42]:
#create a new column showing results of filtering test set

dftest['predicted'] = dftest['SMS'].apply(classify_test_set)
dftest.head()

Unnamed: 0,Label,SMS,predicted
0,ham,"Yep, by the pretty sculpture",ham
1,ham,"Yes, princess. Are you going to make me moan?",ham
2,ham,Welp apparently he retired,ham
3,ham,Havent.,ham
4,ham,I forgot 2 ask ü all smth.. There's a card on ...,ham


In [45]:
correct = 0
total = dftest.shape[0] #number of rows in dftest

for row in dftest.iterrows():  #iterrows returns a tuple of the index and the row data as a Series
    row = row[1]               #row is the Series (indexed by Label, SMS, predicted, with their values)
    if row['Label'] == row['predicted']:
        correct += 1
        
print('correct:', correct)
print('total:', total)
print('accuracy:{0:5.1f}%'.format(correct/total*100))

correct: 1100
total: 1113
accuracy: 98.8%
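
Since about 87% of the test messages are ham, a filter that always answers "ham" would already reach roughly 87% accuracy, so precision and recall for the spam class are a useful complement. The sketch below works on plain lists of labels with invented data; `spam_metrics` is a hypothetical helper, not part of the notebook.

```python
def spam_metrics(actual, predicted):
    """Return accuracy plus precision and recall for the 'spam' class."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    true_pos = sum(a == "spam" and p == "spam" for a, p in pairs)
    predicted_spam = sum(p == "spam" for _, p in pairs)
    actual_spam = sum(a == "spam" for a, _ in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": true_pos / predicted_spam if predicted_spam else 0.0,
        "recall": true_pos / actual_spam if actual_spam else 0.0,
    }

# Toy labels, invented for illustration.
actual = ["ham", "ham", "spam", "spam", "ham"]
predicted = ["ham", "ham", "spam", "ham", "ham"]
print(spam_metrics(actual, predicted))
# {'accuracy': 0.8, 'precision': 1.0, 'recall': 0.5}
```

With the notebook's data, the same function could be called on `dftest['Label']` and `dftest['predicted']`.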


I'm pleasantly surprised by the accuracy of the filter: 1100 of the 1113 test messages (98.8%) were classified correctly.