# Building a Spam Filter with Naive Bayes

### Purpose of this project

In this project, we use a Naive Bayes algorithm to classify SMS messages as spam or non-spam based on the words they contain. The project illustrates a practical use of Bayesian statistics, specifically the multinomial Naive Bayes algorithm.

Ultimately, we will calculate the probability that a message is or is not spam. If the probability for spam is greater, then the classifier assigns the message as spam. If the probability for non-spam is greater, the classifier assigns the message as non-spam.
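
The comparison described above can be written compactly. As a sketch, assume a message consists of the words $w_1, \dots, w_n$ (notation introduced here for illustration). Naive Bayes treats the words as conditionally independent given the class, so the two quantities being compared are:

$$
\begin{aligned}
P(\text{Spam} \mid w_1, \dots, w_n) &\propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(\text{Ham} \mid w_1, \dots, w_n) &\propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
\end{aligned}
$$

Both expressions leave out the same denominator, $P(w_1, \dots, w_n)$, so they are scores rather than true probabilities, but comparing them yields the same label.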

The dataset comes from the UCI Machine Learning Repository and can be found [here](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection).

### Opening the data

In [1]:
# import packages
import pandas as pd

# read dataset
data = pd.read_csv('SMSSpamCollection',sep='\t',header=None,names=['Label','SMS'])


In [2]:
# Check out the data
data.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
data.info

<bound method DataFrame.info of      Label                                                SMS
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
5     spam  FreeMsg Hey there darling it's been 3 week's n...
6      ham  Even my brother is not like to speak with me. ...
7      ham  As per your request 'Melle Melle (Oru Minnamin...
8     spam  WINNER!! As a valued network customer you have...
9     spam  Had your mobile 11 months or more? U R entitle...
10     ham  I'm gonna be home soon and i don't want to tal...
11    spam  SIX chances to win CASH! From 100 to 20,000 po...
12    spam  URGENT! You have won a 1 week FREE membership ...
13     ham  I've been searching for the right words to tha...
14     ham                I HAVE A DAT

Here, ham means non-spam.

In [4]:
data['Label'].value_counts()

ham     4825
spam     747
Name: Label, dtype: int64

In [5]:
data['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

### Splitting the data into a training set and testing set

Our training set will have 4,458 messages, about 80% of the data.

Our test set will have 1,114 messages, about 20% of the data.

In [6]:
# Randomize the data 
random = data.sample(frac=1,random_state=1)

# Split the data
# Estimate an index of 80% of the data
index_80 = round(len(random)*.8)

train_set = random[:index_80].reset_index(drop=True)
test_set = random[index_80:].reset_index(drop=True)

print(train_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)
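
One sanity check worth running after a random split is whether both parts keep roughly the same spam share as the full dataset (about 13% spam). The sketch below shows the idea on toy labels using only the standard library; `label_shares` is a hypothetical helper, and the toy lists merely stand in for `train_set['Label']` and `test_set['Label']`.

```python
from collections import Counter

def label_shares(labels):
    """Return each label's share of the total, e.g. {'ham': 0.8, 'spam': 0.2}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy labels standing in for train_set['Label'] and test_set['Label'].
toy_train = ["ham"] * 8 + ["spam"] * 2
toy_test = ["ham"] * 4 + ["spam"] * 1

print(label_shares(toy_train))
print(label_shares(toy_test))
```

With the notebook's dataframes, the same check could be done with `value_counts(normalize=True)` on each set's `Label` column.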


## Data Cleaning

Since we want to analyze the word contents of each text to determine if it's spam or not, let's separate out each word from each text and remove the punctuation.

In [7]:
# Remove punctuation by replacing every non-word character with a space
train_set['SMS'] = train_set['SMS'].str.replace(r'\W', ' ', regex=True)

# Convert to all lowercase
train_set['SMS'] = train_set['SMS'].str.lower()

In [8]:
train_set.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
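
To see what this cleaning step does to a single message, here is a minimal standalone sketch that uses Python's `re` module rather than pandas. The helper name `clean` and the sample text are made up for illustration.

```python
import re

def clean(message):
    """Replace non-word characters with spaces, lowercase, and split into words."""
    message = re.sub(r"\W", " ", message)
    return message.lower().split()

print(clean("WINNER!! You've won a FREE prize."))
# ['winner', 'you', 've', 'won', 'a', 'free', 'prize']
```

Because an apostrophe counts as a non-word character, a contraction such as "You've" is split into `you` and `ve`, which is likely why fragments such as a lone `s` appear in the cleaned output above. Digits and underscores are word characters, so they are kept.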


Now we need to create a *vocabulary*: a list of every unique word that appears in the SMS column. Later, each word in the vocabulary gets its own column so that we can count how often it occurs in each message.

In [9]:
train_set['SMS'] = train_set['SMS'].str.split()

vocabulary = []
for line in train_set['SMS']:
    for word in line:
        vocabulary.append(word)

# Remove duplicates        
vocabulary = list(set(vocabulary))     

In [10]:
len(vocabulary)

7783

We have 7,783 unique words in our SMS training set.

Now let's create a dictionary to count how often each word occurs in our training set.

In [11]:
word_counts_per_sms = {unique_word: [0] * 
                       len(train_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
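
The shape of this dictionary is easier to see on a tiny example. The sketch below repeats the same counting logic on two made-up tokenized messages; `toy_sms`, `toy_vocabulary`, and `toy_counts` are illustrative names, not part of the project.

```python
# Toy tokenized messages standing in for train_set['SMS'].
toy_sms = [["free", "prize", "free"], ["see", "you", "there"]]

# Unique words across all toy messages, sorted for a stable order.
toy_vocabulary = sorted({word for sms in toy_sms for word in sms})

# One list per word, with one slot per message.
toy_counts = {word: [0] * len(toy_sms) for word in toy_vocabulary}
for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts[word][index] += 1

print(toy_counts)
# {'free': [2, 0], 'prize': [1, 0], 'see': [0, 1], 'there': [0, 1], 'you': [0, 1]}
```

Each key becomes a column and each list position becomes a row once the dictionary is turned into a dataframe.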

In [12]:
# Transform this large dictionary into a dataframe
word_count_df = pd.DataFrame(word_counts_per_sms)
word_count_df.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [13]:
# Concatenate with training set
train_set_clean = pd.concat([train_set,word_count_df],axis=1)
train_set_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


Next we need to calculate P(spam) and P(ham).

We also need to calculate N<sub>Spam</sub>, N<sub>Ham</sub>, and N<sub>Vocabulary</sub>, where

* N<sub>Spam</sub> is the total number of words in all spam messages,
* N<sub>Ham</sub> is the total number of words in all non-spam messages, and
* N<sub>Vocabulary</sub> is the number of unique words in the training set (the size of the vocabulary).

In [14]:
print(train_set['Label'].value_counts())

ham     3858
spam     600
Name: Label, dtype: int64


In [15]:
# Separate spam and non-spam
spam = train_set_clean[train_set_clean['Label'] == 'spam']
ham = train_set_clean[train_set_clean['Label'] == 'ham']

# Calculate P(spam) and P(ham)
p_spam = len(spam) / len(train_set_clean)
p_ham = len(ham) / len(train_set_clean)

# Calculate N's
n_words_per_spam = spam['SMS'].apply(len)
n_spam = n_words_per_spam.sum()

n_words_per_ham = ham['SMS'].apply(len)
n_ham = n_words_per_ham.sum()

n_vocab = len(vocabulary)

We also set the Laplace smoothing parameter, alpha, to 1.

In [16]:
alpha = 1

## Calculating Parameters

Next we calculate the model's parameters: for each word in the vocabulary, the probability of that word occurring given that a message is spam, P(word|Spam), and given that it is non-spam, P(word|Ham).
With the constants above in place, we can compute these now.
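
Written out, the smoothed estimates computed in the next cell take the following form, where $N_{w \mid \text{Spam}}$ and $N_{w \mid \text{Ham}}$ (symbols introduced here for illustration) denote the number of times the word $w$ appears in spam and non-spam messages respectively:

$$
P(w \mid \text{Spam}) = \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}},
\qquad
P(w \mid \text{Ham}) = \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
$$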

In [21]:
# Initialize  dictionaries for each outcome
parameters_spam = {unique_word: 0 for unique_word in vocabulary}
parameters_ham = {unique_word: 0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
    # Spam
    n_word_given_spam = spam[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocab)
    parameters_spam[word] = p_word_given_spam
    
    # Ham
    n_word_given_ham = ham[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocab)
    parameters_ham[word] = p_word_given_ham
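
To illustrate why the smoothing term matters, the sketch below compares unsmoothed and smoothed estimates on hypothetical counts. The numbers are invented; `smoothed` is an illustrative helper that mirrors the formula used in the loop above.

```python
alpha = 1

# Hypothetical toy counts: how often each word appears in spam messages.
toy_spam_counts = {"free": 3, "prize": 1, "hello": 0}
n_spam = sum(toy_spam_counts.values())   # 4 words in spam overall
n_vocab = len(toy_spam_counts)           # 3 unique words

def smoothed(count, n_class, n_vocab, alpha):
    """Laplace-smoothed estimate of P(word | class)."""
    return (count + alpha) / (n_class + alpha * n_vocab)

for word, count in toy_spam_counts.items():
    # Print the word, its unsmoothed estimate, and its smoothed estimate.
    print(word, count / n_spam, smoothed(count, n_spam, n_vocab, alpha))
```

Without smoothing, "hello" would get a probability of 0, and a single occurrence would zero out the whole product for that class. With alpha set to 1 it gets 1/7 instead, and the smoothed estimates still sum to 1 across the vocabulary.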

## Classifying a new message

Now that we've calculated the constants and parameters we need, we can start testing the spam filter.

This function will:

* Take a new message as input
* Calculate P(spam|words) and P(ham|words)
* Compare the values of P(spam|words) and P(ham|words)

In [22]:
# Clean up and classify an input message
import re

def classify(message):
    # Clean up message text: replace non-word characters, lowercase, split into words
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    # Probability of spam or non-spam given message
    # Initialize values
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
       
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
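
Multiplying many small probabilities produces very small numbers, and for long messages the product can underflow to 0.0 in floating point. A common alternative, not used in this project, is to add logarithms instead of multiplying. The sketch below shows that variant with hypothetical toy parameters; `log_scores` and the `toy_` names are made up, and in the notebook the real values would come from `p_spam`, `p_ham`, `parameters_spam`, and `parameters_ham`.

```python
import math
import re

# Hypothetical toy parameters, sharing one vocabulary as in the notebook.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_params_spam = {"free": 0.05, "prize": 0.03, "see": 0.001}
toy_params_ham = {"free": 0.002, "prize": 0.0005, "see": 0.02}

def log_scores(message):
    """Return (log-score for spam, log-score for ham) for a raw message."""
    words = re.sub(r"\W", " ", message).lower().split()
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        if word in toy_params_spam:      # words outside the vocabulary are skipped
            log_spam += math.log(toy_params_spam[word])
            log_ham += math.log(toy_params_ham[word])
    return log_spam, log_ham

log_spam, log_ham = log_scores("FREE prize!!")
print("spam" if log_spam > log_ham else "ham")
```

Because the logarithm is monotonic, comparing the log-scores gives the same label as comparing the products.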

Let's test it out! We'll use this function to classify two new messages:

1. 'WINNER!! This is the secret code to unlock the money: C3421'
2. 'Sounds good, Tom, then see u there'

In [23]:
classify('WINNER!! This is the secret code to unlock the money: C3421')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [24]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


It works! The first message definitely seems like spam and was classified as such. The second message looks like non-spam, and the classifier identified it correctly.

## Measuring the Spam Filter's Accuracy

Now we have to see how well our spam filter performs on unseen test data.

In [25]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [28]:
# Add a column in the test set with the predicted values from the classifier
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [38]:
total = test_set.shape[0]

# Count the rows where the predicted label matches the true label
correct = (test_set['Label'] == test_set['predicted']).sum()
        
print('Correct:',correct)
print('Incorrect:',total-correct)
print('Accuracy:',correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833


The accuracy of this spam filter is 98.74%, which is very good: only 14 of the 1,114 test messages were classified incorrectly.
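
Accuracy alone can flatter a classifier on imbalanced data: since ham makes up about 87% of the dataset, a filter that labels everything as ham would already be right roughly 87% of the time. As a possible follow-up, the sketch below computes precision and recall for the spam class on toy labels. `spam_metrics` is a hypothetical helper, and the toy lists stand in for `test_set['Label']` and `test_set['predicted']`.

```python
def spam_metrics(actual, predicted):
    """Return accuracy, precision, and recall, treating 'spam' as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == "spam" and p == "spam" for a, p in pairs)   # spam caught
    fp = sum(a == "ham" and p == "spam" for a, p in pairs)    # ham flagged as spam
    fn = sum(a == "spam" and p != "spam" for a, p in pairs)   # spam missed
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy labels standing in for test_set['Label'] and test_set['predicted'].
toy_actual = ["ham", "ham", "spam", "spam", "ham", "spam"]
toy_predicted = ["ham", "spam", "spam", "ham", "ham", "spam"]

print(spam_metrics(toy_actual, toy_predicted))
```

Precision answers "of the messages flagged as spam, how many really were spam?", while recall answers "of the real spam messages, how many were caught?".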

## Conclusion
In this project, we built a spam filter for SMS messages using a multinomial Naive Bayes algorithm. The filter performed well, correctly classifying 98.74% of the messages in the test set.