# Building a Spam Filter with Naive Bayes


In this guided project, we're going to study the practical side of the algorithm by building a spam filter for SMS messages. To classify messages as spam or non-spam, we saw in the previous lesson that the computer:

1. Learns how humans classify messages.
2. Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
3. Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more details on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition),  where you can also find some of the authors' papers.

In [1]:
import pandas as pd
import re

In [2]:
df = pd.read_csv('SMSSpamCollection', sep='\t', 
                 header=None, names=['Label', 'SMS'])

In [3]:
df.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
print('The data set has {} rows and {} columns'.format(df.shape[0], df.shape[1]))

The data set has 5572 rows and 2 columns


In [5]:
ham_spam = df.Label.value_counts()
ham = (ham_spam.ham/df.shape[0]*100).round(2)
spam = (ham_spam.spam/df.shape[0]*100).round(2)

In [6]:
print('The data set has {}% spam messages and {}% non-spam messages'.format(spam, ham))

The data set has 13.41% spam messages and 86.59% non-spam messages


Once our spam filter is done, we'll need to test how good it is with classifying new messages. To test the spam filter, we're first going to split our dataset into two categories:

- A training set, which we'll use to "train" the computer how to classify messages.
- A test set, which we'll use to test how good the spam filter is with classifying new messages.

The training set will have about 80% of the dataset and the testset about 20%.

For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

In [7]:
random_df = df.sample(frac=1, random_state=1) #randomizing dataset

# divide datset between training (80%) and testing (20%) sets
training = random_df[:round(random_df.shape[0]*0.8)].reset_index(drop=True)
testing = random_df[:round(random_df.shape[0]*0.2)].reset_index(drop=True)

## Creating our vocabulary

The next big step is to use the training set to teach the algorithm to classify new messages. We will be using the Naive Bayes algorithm; in order to use it though, we first need to create a vocabulary of all the words in the dataset.

In [8]:
# drop all characters that are not a a-z, A-Z or 0-9 word and make everything lowercase
training.SMS = training.SMS.str.replace(r'(\W)',' ').str.lower()

In [9]:
# create a list of all words in the dataset 
vocabulary = [] #empty list
words_list = training.SMS.str.split().tolist() # list of lists of words in each row
training.SMS = training.SMS.str.split()

# iterate every list of words in each row and fill vocabulary list
for l in words_list:
    for word in l:
        if word not in vocabulary:
            vocabulary.append(word)


Now that we have our vocabulary we can derive the training set.

In [10]:
word_counts_per_sms = {unique_word: [0] * len(training['SMS']) 
                       for unique_word in vocabulary}

for index, sms in enumerate(training['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [11]:
word_counts_df = pd.DataFrame(word_counts_per_sms)
word_counts_df.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [12]:
# final dataframe made up of original and df of all words
training_final = pd.concat([training, word_counts_df], axis=1)

In [13]:
training_final.head(5)

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


## xxx

We'll also use Laplace smoothing and set $\alpha = 1$.



In [14]:
# function to flatten list of lists into single list
def flatten(t):
    return [item for sublist in t for item in sublist]

In [15]:
alpha = 1 #smoothing factor

# dfs for spam and non-spam messages
spam = training_final[training_final.Label == 'spam']
ham = training_final[training_final.Label == 'ham']

# probability of any message being spam or non-spam
p_spam = len(spam)/len(training_final)
p_ham = len(ham)/len(training_final)

# number of words total per spam and ham messages
n_spam = len(flatten(spam.SMS.tolist())) # Nspam
n_ham = len(flatten(ham.SMS.tolist())) # Nham
n_vocabulary = len(vocabulary) # Nvocabulary

## Calculating Parameters

In [18]:
param_spam = {unique_word:0 for unique_word in vocabulary}
param_ham = {unique_word:0 for unique_word in vocabulary}

for word in vocabulary:
    n_word_given_spam = spam[word].sum()
    n_word_given_ham = ham[word].sum()

    p_word_given_spam = (n_word_given_spam+alpha)/(n_spam+alpha*n_vocabulary)
    p_word_given_ham = (n_word_given_ham+alpha)/(n_ham+alpha*n_vocabulary)

    param_spam[word] = p_word_given_spam
    param_ham[word] = p_word_given_ham
    

## Classifying A New Message

In [21]:
import re

def classify(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in param_spam:
            p_spam_given_message *= param_spam[word]
            
        if word in param_ham:
            p_ham_given_message *= param_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')

In [22]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')


P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


## Measuring the Spam Filter's Accuracy

In [26]:
def classify_test_set(message):    
    '''
    message: a string
    '''
    
    message = re.sub('\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in param_spam:
            p_spam_given_message *= param_spam[word]
            
        if word in param_ham:
            p_ham_given_message *= param_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [27]:
testing['predicted'] = testing['SMS'].apply(classify_test_set)
testing.head()

Unnamed: 0,Label,SMS,predicted
0,ham,"Yep, by the pretty sculpture",ham
1,ham,"Yes, princess. Are you going to make me moan?",ham
2,ham,Welp apparently he retired,ham
3,ham,Havent.,ham
4,ham,I forgot 2 ask ü all smth.. There's a card on ...,ham


In [29]:
correct = 0
total = testing.shape[0]
    
for row in testing.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1106
Incorrect: 8
Accuracy: 0.992818671454219
