# Dataquest Guided Project: Building a Spam Filter with Naive Bayes

The goal of this project is to build a spam filter with the Naive Bayes algorithm. We will use a dataset of 5,572 SMS messages that were classified as spam or not spam by humans. The dataset is available at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). 

We will begin by importing and exploring the dataset.

In [1]:
import pandas as pd
#read in dataset as pandas dataframe
#specify tab separated values
#there is no header so we add our own names
df = pd.read_csv('SMSSpamCollection', sep='\t', 
              header=None, names=['Label', 'SMS'])
df.head() #'ham' means non-spam

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
[rows, columns] = df.shape
print('''The dataset has {} rows and {} columns.'''.format(rows,columns))


The dataset has 5572 rows and 2 columns.


In [3]:
#percentage spam and ham
freq_table = df['Label'].value_counts(normalize=True) * 100
#pretty print the table
freq_table = pd.DataFrame(freq_table).rename(columns={'Label':'Percent'})
freq_table


Unnamed: 0,Percent
ham,86.593683
spam,13.406317


About 87% of the messages are non-spam (aka "ham") and about 13% are spam.

Now we need to split the data into a training set and a test set. We will train our algorithm on 80% of the data, and set aside 20% for testing later. To ensure we split the data in an unbiased way, we must first randomize the dataset.

In [4]:
df_random = df.sample(frac=1, random_state=1) #randomize the whole dataset
train_test_index = round(len(df_random)*0.8) #split point of dataset
train = df_random[:train_test_index].reset_index(drop=True)
test = df_random[train_test_index:].reset_index(drop=True)
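
The shuffle-then-slice logic can be seen on a toy example. This sketch uses the standard library's `random` module on a made-up list rather than pandas, purely to illustrate the idea; `toy_rows`, `toy_train`, and `toy_test` are hypothetical names.

```python
import random

# Ten made-up "rows" standing in for messages
toy_rows = list(range(10))

# Shuffle a copy with a fixed seed so the result is reproducible
shuffled = toy_rows[:]
random.Random(1).shuffle(shuffled)

# Slice at the 80% mark: first part for training, remainder for testing
split_index = round(len(shuffled) * 0.8)
toy_train = shuffled[:split_index]
toy_test = shuffled[split_index:]

print(len(toy_train), len(toy_test))
```

Every row ends up in exactly one of the two parts, and the fixed seed plays the same role as `random_state=1` above.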


Now we will check the percentage of spam and ham in both the training and the test sets, to ensure they are similar.

In [5]:
train_freq_table = train['Label'].value_counts(normalize=True) * 100

train_freq_table = pd.DataFrame(train_freq_table).rename(columns={'Label':'Percent Train'})
train_freq_table

Unnamed: 0,Percent Train
ham,86.54105
spam,13.45895


In [6]:
test_freq_table = test['Label'].value_counts(normalize=True) * 100

test_freq_table = pd.DataFrame(test_freq_table).rename(columns={'Label':'Percent Test'})
test_freq_table

Unnamed: 0,Percent Test
ham,86.804309
spam,13.195691


To prepare the data for analysis, we have to do some data cleaning. Eventually, we want a dataframe that has one row per message, a column of labels, and one column for each unique word observed in the training set. The set of unique words is called the **vocabulary**. Each cell in a word column will hold the number of times that word occurs in that message.

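To accomplish this, we first need to remove all punctuation and convert all words to lower case. Below, we use the regex pattern `\W`, which matches any non-word character, that is, any character other than a letter, a digit, or an underscore. In Python 3, accented letters such as ü also count as word characters, so they are kept.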

In [7]:
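#raw string avoids an invalid-escape warning; regex=True is required in recent pandas versions
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()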
train.head()


Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
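
To see what the cleaning step does to a single string, here is a small sketch on a made-up message (the string and variable names are illustrative and not taken from the dataset). It uses the standard `re` module rather than pandas.

```python
import re

# Hypothetical message, not from the dataset
raw = "Free entry!! Call_now ü 2day"

# Replace every non-word character with a space, then lower-case and split
cleaned = re.sub(r'\W', ' ', raw).lower()
tokens = cleaned.split()

print(tokens)
```

The punctuation disappears, while the underscore and the ü survive because `\W` does not match them.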


Now we need to create a vocabulary for the messages in the training set. This will be a Python list of all unique words across all messages.

In [8]:
#transform each SMS into a list
train['SMS'] = train['SMS'].str.split() 

#append every word of every SMS to the vocabulary list
vocabulary = []
for SMS in train['SMS']:
    for word in SMS:
        vocabulary.append(word)

#transform vocabulary into a set to remove duplicates, then back into a list
vocabulary = set(vocabulary)
vocabulary = list(vocabulary)
vocabulary[:10]

['murder',
 'leh',
 '8800',
 'listen',
 'resort',
 'woken',
 'something',
 'released',
 '09058094597',
 'persolvo']
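
The loop above can also be written as a set comprehension. A minimal sketch on a toy list of tokenized messages (`toy_messages` and its contents are made up for illustration); sorting is optional and only makes the order reproducible, since set ordering is arbitrary.

```python
# Hypothetical tokenized messages
toy_messages = [
    ['free', 'entry', 'free'],
    ['see', 'u', 'there'],
    ['free', 'u'],
]

# Collect each distinct word once, then sort for a stable order
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

print(toy_vocabulary)
```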

Now we will use the vocabulary list to build a dictionary of word counts, with one key per word and one count per message. We loop over the training set, count the words in each message, and record the counts in the dictionary. We then turn the dictionary into a dataframe and concatenate it with the original training set dataframe, so that we keep the Label and SMS columns.

In [9]:
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
word_counts_per_sms = pd.DataFrame(word_counts_per_sms)
train_wc = pd.concat([train, word_counts_per_sms], axis=1)
train_wc.head()


Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
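
The counting step can be illustrated on a toy example with `collections.Counter` from the standard library. This is a sketch of the same idea, not the notebook's code; `toy_messages`, `toy_vocabulary`, and `word_counts` are hypothetical names.

```python
from collections import Counter

# Hypothetical tokenized messages
toy_messages = [['free', 'entry', 'free'], ['see', 'u', 'there']]
toy_vocabulary = sorted({w for sms in toy_messages for w in sms})

# One list of counts per word, with one position per message
word_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, sms in enumerate(toy_messages):
    for word, count in Counter(sms).items():
        word_counts[word][index] = count

print(word_counts['free'])
```

Each key would become a column and each list position a row once the dictionary is passed to `pd.DataFrame`.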


Now, we need to calculate the constants that do not depend on any individual word: the probability that a given message is spam, P(Spam), or ham, P(Ham); the number of words in all spam messages, N<sub>Spam</sub>; the number of words in all ham messages, N<sub>Ham</sub>; and the number of words in the vocabulary, N<sub>Vocabulary</sub>. We also set the $\alpha$ used in Laplace smoothing.

In [10]:
# Isolating spam and ham messages first
spam_messages = train_wc[train_wc['Label'] == 'spam']
ham_messages = train_wc[train_wc['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(train_wc)
p_ham = len(ham_messages) / len(train_wc)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

Now, we need to calculate the probability of each word in the vocabulary occurring in a message. This probability depends on whether the message is spam or ham. In other words, for each word *w* we must calculate the probability of that word occurring in a spam message, *P(w|Spam)*, and the probability of the word occurring in a ham message, *P(w|Ham)*.
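
With Laplace smoothing, these conditional probabilities can be written as follows. Here $N_{w|\text{Spam}}$ and $N_{w|\text{Ham}}$ denote the number of times *w* appears in spam and ham messages respectively; these two symbols are notation introduced for this formula, while the others are the constants defined above.

$$P(w|\text{Spam}) = \frac{N_{w|\text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}$$

$$P(w|\text{Ham}) = \frac{N_{w|\text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}$$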


In [14]:
#initialize probability dictionaries for spam and ham (values are overwritten below)
p_words_spam = {unique_word: 0 for unique_word in vocabulary}
p_words_ham = {unique_word: 0 for unique_word in vocabulary}

#calculate p_words_given_spam and p_words_given_ham for each word
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha)/(n_spam + (alpha*n_vocabulary))
    p_words_spam[word] = p_word_given_spam
    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha)/(n_ham + (alpha*n_vocabulary))
    p_words_ham[word] = p_word_given_ham
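
A quick numeric check shows why the smoothing term matters. The sketch below uses made-up totals (`toy_n_spam`, `toy_n_vocabulary`) rather than the values computed from the dataset; the helper `smoothed` is hypothetical.

```python
alpha = 1
toy_n_spam = 10        # hypothetical total number of words in spam messages
toy_n_vocabulary = 5   # hypothetical vocabulary size

def smoothed(n_word_given_spam):
    """Laplace-smoothed estimate of P(w|Spam) for a given word count."""
    return (n_word_given_spam + alpha) / (toy_n_spam + alpha * toy_n_vocabulary)

p_unseen = smoothed(0)  # word that never appears in spam
p_seen = smoothed(4)    # word that appears four times in spam

print(p_unseen, p_seen)
```

Without smoothing, the unseen word would get probability zero and wipe out the whole product for any message containing it.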


Finally, we will write a function that calculates the probability that a given message is spam or ham, using the probabilities we calculated above.
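
For a message made of the words $w_1, \dots, w_n$, the two quantities being compared can be written as below. They are proportional to the true posterior probabilities rather than equal to them, because the shared normalizing constant is dropped; this does not affect which of the two is larger.

$$P(\text{Spam}|w_1, \dots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i|\text{Spam})$$

$$P(\text{Ham}|w_1, \dots, w_n) \propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i|\text{Ham})$$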

In [15]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    #initialize values
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if (word in p_words_spam):
            p_spam_given_message *= p_words_spam[word]
        if (word in p_words_ham):
            p_ham_given_message *= p_words_ham[word]
            
    
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [16]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [17]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
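
The values above are already on the order of 1e-21 to 1e-27 for short messages. For much longer texts, multiplying many small probabilities could underflow to zero in floating point. A common workaround, which this notebook does not use, is to add logarithms instead of multiplying. The sketch below assumes hypothetical toy priors and probability tables (`toy_p_spam`, `toy_p_ham`, `toy_p_words_spam`, `toy_p_words_ham`) in place of the ones computed above.

```python
import math

# Hypothetical priors and word probabilities, not derived from the dataset
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_p_words_spam = {'win': 0.05, 'money': 0.04, 'see': 0.001}
toy_p_words_ham = {'win': 0.002, 'money': 0.003, 'see': 0.02}

def classify_log(words):
    """Compare log-scores instead of raw products to avoid underflow."""
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify()
        if word in toy_p_words_spam:
            log_spam += math.log(toy_p_words_spam[word])
        if word in toy_p_words_ham:
            log_ham += math.log(toy_p_words_ham[word])
    if log_ham > log_spam:
        return 'ham'
    elif log_ham < log_spam:
        return 'spam'
    return 'needs human classification'

print(classify_log(['win', 'money']))
print(classify_log(['see', 'unknownword']))
```

Because the logarithm is monotonic, comparing the sums gives the same label as comparing the products.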


Now we will test the accuracy of our algorithm. First we will change the print statements in our function to return statements. Then we will apply the function to each row of our test dataset. We will compare the predicted labels to the actual labels to calculate the accuracy.

In [20]:
#change print to return
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    #initialize values
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if (word in p_words_spam):
            p_spam_given_message *= p_words_spam[word]
        if (word in p_words_ham):
            p_ham_given_message *= p_words_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [21]:
#apply to test set
test['predicted'] = test['SMS'].apply(classify_test_set)
test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [22]:
#calculate accuracy
correct = 0
total = test.shape[0]

for row in test.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print("Number correct: ", correct)
print("Accuracy: ", correct/total)

Number correct:  1100
Accuracy:  0.9874326750448833
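
The same kind of accuracy figure can be computed without an explicit counter, and it can be useful to see which kinds of mistakes are made. A minimal sketch on made-up labels (the `actual` and `predicted` lists are hypothetical and are not the notebook's results):

```python
from collections import Counter

# Hypothetical true and predicted labels
actual =    ['ham', 'ham', 'spam', 'ham', 'spam']
predicted = ['ham', 'ham', 'spam', 'spam', 'ham']

# Fraction of positions where the prediction matches the true label
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Count each (actual, predicted) pair to separate the two error types
outcomes = Counter(zip(actual, predicted))

print(accuracy)
print(outcomes[('ham', 'spam')], outcomes[('spam', 'ham')])
```

On the real test set, the pandas equivalent of the first calculation would be `(test['Label'] == test['predicted']).mean()`.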


With this relatively simple algorithm, we were able to build a spam filter that classifies about 98.7% of the test messages correctly (1,100 of 1,114).