# Building a Spam Filter with Naive Bayes  

In this guided project, I'm going to put the Naive Bayes algorithm to practical use by building a spam filter for SMS messages. 

I'll be using the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that have already been classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository.

In [1]:
import pandas as pd
# import the data: tab-delimited, no header row
sms = pd.read_csv('SMSSpamCollection',sep='\t',header=None,names=['Label', 'SMS'])
# number of rows and columns
print(sms.shape)

(5572, 2)


In [2]:
# display percentage of each Label
sms['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

Our dataset has 5,572 rows and only 2 columns: Label (`spam`, or `ham` for not spam) and SMS (the message text).  
In the dataset, about 87% of messages are not spam and about 13% are spam.

### Training and Test Set

To properly test the end product, I am going to split the dataset into a training set and a test set. The training set will be used to build the spam filter and the test set will be used to evaluate it. I'm going to keep 80% of the dataset for training and 20% for testing.

In [3]:
# Randomize the dataset
data_randomized = sms.sample(frac=1, random_state=1)

# Calculate index for split
training_test_index = round(len(data_randomized) * 0.8)

# Training/Test split
sms_train = data_randomized[:training_test_index].reset_index(drop=True)
sms_test = data_randomized[training_test_index:].reset_index(drop=True)

print(sms_train.shape)
print(sms_test.shape)

(4458, 2)
(1114, 2)


In [4]:
# display percentage of each Label in test set
print('Test set:')
sms_test['Label'].value_counts(normalize=True)

Test set:


ham     0.868043
spam    0.131957
Name: Label, dtype: float64

In [5]:
# display percentage of each Label in train set
print('Train set:')
sms_train['Label'].value_counts(normalize=True)

Train set:


ham     0.86541
spam    0.13459
Name: Label, dtype: float64

Both the test and train sets have roughly the same 87/13 split between non-spam and spam as the full dataset.

### Letter Case and Punctuation

In [6]:
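# remove all punctuation from the SMS column and normalize case
# regex=True is explicit because newer pandas versions treat the pattern as a literal string by default
sms_train['SMS'] = sms_train['SMS'].str.replace(r'\W', ' ', regex=True)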
sms_train['SMS'] = sms_train['SMS'].str.lower()
sms_train.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


### Creating the Vocabulary

The goal is to create a column for each unique word in the SMS column (known as the vocabulary). First I will create a list for the vocabulary.

In [7]:
# split the SMS column on space
sms_train['SMS'] = sms_train['SMS'].str.split()
sms_train.head()

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [8]:
# iterate over the transformed SMS column to create the vocabulary list
vocabulary = []
for row in sms_train['SMS']:
    for word in row:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))

### The Final Training Set

I will now use the vocabulary to transform the SMS column. This will be achieved by creating a dictionary.

In [9]:
# initialize a dictionary with one list of zeros per vocabulary word, one slot per message
word_counts_per_sms = {unique_word: [0] * len(sms_train['SMS']) for unique_word in vocabulary}
# iterate through the SMS column, incrementing each word's count at that message's index
# (the loop variable is named message so it does not overwrite the sms DataFrame)
for index, message in enumerate(sms_train['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1
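
To make the transformation concrete, here is a minimal sketch of the same counting step on a tiny made-up corpus. The names `toy_messages`, `toy_vocabulary` and `toy_counts` are hypothetical and the messages are not taken from the dataset; the sketch only illustrates the shape of the resulting dictionary.

```python
# Hypothetical toy corpus: two already-tokenized messages (not from the dataset)
toy_messages = [['secret', 'prize', 'claim', 'now'],
                ['coming', 'to', 'my', 'secret', 'party']]

# Vocabulary: every unique word across the toy messages
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list of zeros per word, with one slot per message
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

# Increment the slot of each word for the message it appears in
for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][index] += 1

print(toy_counts['secret'])  # [1, 1]
print(toy_counts['prize'])   # [1, 0]
```

Each key becomes a column and each list position becomes a row once the dictionary is passed to `pd.DataFrame`.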

In [10]:
# convert above dictionary into a dataframe
word_counts = pd.DataFrame(word_counts_per_sms)
# concat the word_counts to the original train set
sms_train_clean = pd.concat([sms_train,word_counts],axis=1)
sms_train_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


### Calculating Constants

In [11]:
# Isolating spam and ham messages first
spam_messages = sms_train_clean[sms_train_clean['Label'] == 'spam']
ham_messages = sms_train_clean[sms_train_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(sms_train_clean)
p_ham = len(ham_messages) / len(sms_train_clean)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

### Calculating Parameters

Next I will calculate the probability of each word given spam, P(w|Spam), and given ham, P(w|Ham), for every word in the vocabulary.

In [12]:
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()   # spam_messages already defined in a cell above
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   # ham_messages already defined in a cell above
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham
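
For reference, the loop above applies the Laplace-smoothed estimate, which can be written as follows. The symbols map onto the code's variables: $N_{w_i \mid Spam}$ is `n_word_given_spam`, $N_{Spam}$ is `n_spam`, $N_{Vocabulary}$ is `n_vocabulary`, and $\alpha$ is `alpha`.

$$P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

$$P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$

With $\alpha = 1$, a word that never appears in spam messages still gets a small non-zero probability instead of zeroing out the whole product.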

### Classifying a New Message

Next I will build the function that calculates the probability of spam and of ham for a new message and compares the two.

In [13]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)   # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
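
Written out, the function above compares the two quantities below for a message made of the words $w_1, \dots, w_n$. Words that are not in the vocabulary are skipped.

$$P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)$$

$$P(Ham \mid w_1, \dots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)$$

Because of the proportionality sign, the printed values are scores rather than normalized probabilities: they do not sum to 1, but comparing them is enough to pick a label.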

In [14]:
# first test 
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [15]:
# second test 
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
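
The scores above are already around 1e-25 for short messages, and multiplying many small numbers can eventually underflow to 0.0 for long messages. A common variation, not used in this project, is to add logarithms instead of multiplying probabilities. The sketch below assumes hypothetical toy values (`toy_p_spam`, `toy_p_ham`, `toy_parameters_spam`, `toy_parameters_ham`) in place of the real parameters, and `classify_log` is an illustrative name.

```python
import math
import re

# Hypothetical toy parameters standing in for p_spam, p_ham,
# parameters_spam and parameters_ham from the cells above
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {'winner': 0.02, 'money': 0.03, 'see': 0.001}
toy_parameters_ham = {'winner': 0.0005, 'money': 0.002, 'see': 0.02}

def classify_log(message):
    """Return a label by comparing sums of log-probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()

    # Start from the log of the priors
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    # Adding logs is equivalent to multiplying the probabilities
    for word in words:
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
        if word in toy_parameters_ham:
            log_ham += math.log(toy_parameters_ham[word])

    if log_ham > log_spam:
        return 'ham'
    elif log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log('WINNER!! Claim the money'))  # spam
print(classify_log('See you there'))             # ham
```

Since the logarithm is monotonic, the comparison gives the same label as the product-based version whenever the products do not underflow.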


### Measuring the Spam Filter's Accuracy

In [16]:
# function to test a set
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)   # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [17]:
# run test on test set
sms_test['predicted'] = sms_test['SMS'].apply(classify_test_set)
sms_test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [None]:
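correct = 0
total = sms_test.shape[0]   # sms_test holds the predictions from the cell above

# count the rows where the predicted label matches the human label
for _, row in sms_test.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1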
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)
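
The row-by-row loop above can also be expressed as a vectorized comparison. This is a minimal sketch on a hypothetical four-row frame (`toy_results`), assuming the same `Label` and `predicted` column names; it is an alternative way to compute the same numbers, not the result of running the filter.

```python
import pandas as pd

# Hypothetical toy results standing in for the test set with predictions
toy_results = pd.DataFrame({
    'Label':     ['ham', 'spam', 'ham', 'spam'],
    'predicted': ['ham', 'spam', 'spam', 'spam'],
})

# Element-wise comparison gives a boolean Series; True counts as 1
matches = toy_results['Label'] == toy_results['predicted']
toy_correct = int(matches.sum())
toy_accuracy = matches.mean()

print('Correct:', toy_correct)                       # Correct: 3
print('Incorrect:', len(toy_results) - toy_correct)  # Incorrect: 1
print('Accuracy:', toy_accuracy)                     # Accuracy: 0.75
```

Rows labelled 'needs human classification' never match a human label, so they count as incorrect under both approaches.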