# Building a spam filter with the Naive Bayes algorithm

In this project, we're going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80%, meaning that more than 80% of new messages should be classified correctly as spam or ham (non-spam).
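For orientation, the decision rule behind Naive Bayes can be written as below. This is the standard textbook form, given here as a sketch rather than as the exact formula used later in the project; the symbols $w_1, \dots, w_n$ (the words of a message) are notation assumed for this illustration.

$$
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})
$$

$$
P(\text{Ham} \mid w_1, \dots, w_n) \propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
$$

A message is labelled spam when the first quantity is larger than the second. The "naive" part is the assumption that words are conditionally independent given the class, which is what allows the product over individual words.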

To train the algorithm, we'll use a dataset of 5,572 SMS messages that have already been classified by humans. The dataset was put together by *Tiago A. Almeida and José María Gómez Hidalgo*, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/228/sms+spam+collection).

This project is based on and extends a similar project shared by [DataQuest](https://github.com/dataquestio/solutions/blob/master/Mission433Solutions.ipynb).

We will use **pandas** to manipulate the data and **re** to work with text.

In [30]:
import pandas as pd
import re

## Exploring the Dataset

First, we read in the dataset and run some basic checks: its shape, and its first and last few rows.

In [31]:
sms_data = pd.read_csv('data/SMSSpamCollection', sep='\t', header=None, names = ['Label', 'SMS'])
print(sms_data.shape)
print( sms_data.head() )
print( sms_data.tail() )

(5572, 2)
  Label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
     Label                                                SMS
5567  spam  This is the 2nd time we have tried 2 contact u...
5568   ham               Will ü b going to esplanade fr home?
5569   ham  Pity, * was in mood for that. So...any other s...
5570   ham  The guy did some bitching but I acted like i'd...
5571   ham                         Rofl. Its true to its name


In the head and tail, most messages are labelled ham and only a few are labelled spam, which is what we would expect. Let's check the exact proportions.

In [32]:
sms_data['Label'].value_counts(normalize=True)

Label
ham     0.865937
spam    0.134063
Name: proportion, dtype: float64

We see above that 86.6% of the messages are ham and 13.4% are spam. This sample looks representative: in everyday experience, most messages people receive are not spam.
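To put these proportions next to the 80% accuracy goal, it can help to compute what a trivial classifier would score. The sketch below is an illustration only and not part of the original project: the function name `majority_baseline_accuracy` and the toy labels are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    most_common_label, most_common_count = counts.most_common(1)[0]
    return most_common_label, most_common_count / len(labels)


# Hypothetical toy labels, not the real dataset
toy_labels = ["ham"] * 13 + ["spam"] * 2

label, accuracy = majority_baseline_accuracy(toy_labels)
print(label, accuracy)
```

Applied to the full dataset, the same calculation would give roughly 86.6% for a classifier that always answers "ham", since that is the ham share shown above.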

# Working with the data
## Training/Test split
Now we split the data into a training set and a test set using the usual 80/20 split: 80% of the messages for training and 20% for testing.

In [33]:
# First randomise the dataset
data_randomised = sms_data.sample(frac = 1, random_state= 1)

# Calculate index for split at 80% of data
split_index = round(len(data_randomised) * 0.8 )

# Training/ test split at the split_index
training_data = data_randomised[ : split_index ].reset_index(drop=True)
test_data = data_randomised[ split_index: ].reset_index(drop=True)

print("Shape of training data is ", training_data.shape, " and the shape of test data is ", test_data.shape)

# Sanity check: both sets should have roughly the same ham/spam ratio as the original data.
print ("Training set: \n", training_data['Label'].value_counts(normalize=True) )
print ("Test set: \n", test_data['Label'].value_counts(normalize=True) )

Shape of training data is  (4458, 2)  and the shape of test data is  (1114, 2)
Training set: 
 Label
ham     0.86541
spam    0.13459
Name: proportion, dtype: float64
Test set: 
 Label
ham     0.868043
spam    0.131957
Name: proportion, dtype: float64


The ratios look similar to those of the original dataset, which is good. We now move on to cleaning the dataset.
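The sanity check above relies on random shuffling to keep the ham/spam ratio similar in both sets. An alternative, sketched below in plain Python, is a stratified split, which splits each label separately so that the ratio is preserved by construction. This is not what the notebook does; `stratified_split` and the toy labels are hypothetical names used only for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=1):
    """Return (train_indices, test_indices), splitting each label separately."""
    rng = random.Random(seed)

    # Group row positions by their label
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Hypothetical toy labels: 80 ham and 20 spam
toy_labels = ["ham"] * 80 + ["spam"] * 20
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

With a simple shuffle, as used in this project, the ratios only match approximately, which is why the check on the printed proportions is worth keeping.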

## Data cleaning
To calculate the probabilities the algorithm needs, we have to convert each text message into word counts: one column per unique word, with each row recording how many times that word appears in the message. The image below captures the essence of what we need to do.

![](data_cleaning.png)
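The transformation can also be sketched on a tiny scale with the standard library alone. The messages below are made up, and the helper `tokenize` is a hypothetical name; the notebook itself does the same steps with pandas in the cells that follow.

```python
import re
from collections import Counter


def tokenize(message):
    """Replace non-word characters with spaces, lower-case, and split into words."""
    return re.sub(r"\W", " ", message).lower().split()


# Hypothetical toy messages, not taken from the dataset
toy_messages = [
    "Free prize! Claim your FREE prize now.",
    "Are you free tonight?",
]

tokenized = [tokenize(message) for message in toy_messages]

# Sorted list of the unique words across all messages
vocabulary = sorted({word for words in tokenized for word in words})

# One row per message, one column per vocabulary word
rows = []
for words in tokenized:
    counts = Counter(words)  # a missing word counts as 0
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

Each printed row lines up with the vocabulary list, so the first message has a 2 in the "free" and "prize" columns.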

### Remove punctuation and convert everything to lower case

In [34]:
# Remove all the punctuation and bring all letters to lower case
print("Data before cleaning: \n", training_data.head())

training_data['SMS'] = training_data['SMS'].str.replace(r'\W', ' ', regex=True) # replace any non-word characters with a space
training_data['SMS'] = training_data['SMS'].str.lower() # convert everything to lower case

print("Data after cleaning: \n", training_data.head())

# print(training_data['SMS'].str.replace(r'\W', ' ', regex=True).head())

Data before cleaning: 
   Label                                                SMS
0   ham                       Yep, by the pretty sculpture
1   ham      Yes, princess. Are you going to make me moan?
2   ham                         Welp apparently he retired
3   ham                                            Havent.
4   ham  I forgot 2 ask ü all smth.. There's a card on ...
Data after cleaning: 
   Label                                                SMS
0   ham                       yep  by the pretty sculpture
1   ham      yes  princess  are you going to make me moan 
2   ham                         welp apparently he retired
3   ham                                            havent 
4   ham  i forgot 2 ask ü all smth   there s a card on ...
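The output above shows two side effects of `\W`: the apostrophe in "There's" becomes a space, leaving a stray token "s", while "ü" survives because Python's `\w` is Unicode-aware for text patterns. The sketch below reproduces this with the standard `re` module on made-up strings. The helper `clean` and the alternative pattern `[^\w']`, which keeps apostrophes, are hypothetical and not part of the notebook.

```python
import re


def clean(text, pattern=r"\W"):
    """Replace every match of `pattern` with a space, lower-case, and split."""
    return re.sub(pattern, " ", text).lower().split()


samples = ["There's a card", "Will ü b going?", "win_100p now!!"]

for text in samples:
    # Left: the notebook's pattern. Right: a variant that keeps apostrophes.
    print(clean(text), clean(text, pattern=r"[^\w']"))
```

Note that the underscore counts as a word character, so a token such as `win_100p` is left intact by both patterns.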


### Create a vocabulary of all the unique words

In [35]:
training_data['SMS'] = training_data['SMS'].str.split() # split each message into a list of words

# Collect the words of every message; a set keeps only one copy of each word
vocabulary = set()

for sms in training_data['SMS']:
    vocabulary.update(sms)
print("The total number of words in our vocabulary is", len(vocabulary))


The total number of words in our vocabulary is  7783


In [38]:
# Initialise a dictionary mapping each unique word to a list of zeros, one entry per SMS.
word_count_per_sms = {unique_word: [0] * len(training_data['SMS']) for unique_word in vocabulary}

# Count how many times each word appears in each SMS.
for index, sms in enumerate(training_data['SMS']):
    for word in sms:
        word_count_per_sms[word][index] += 1

# Now convert this dictionary to a dataframe
word_counts = pd.DataFrame(word_count_per_sms)
word_counts.head()


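| | babygoodbye | checkup | hi | 86688 | izzit | install | anythingtomorrow | year | great | days | ... | activities | dartboard | when | gd | understanding | tool | thangam | 100p | swell | 09061790121 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |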


In [40]:
# Now merge the above with the modified training set dataframe to get the final dataset to work with
training_data_clean = pd.concat([training_data, word_counts], axis= 1)
training_data_clean.head()

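| | Label | SMS | babygoodbye | checkup | hi | 86688 | izzit | install | anythingtomorrow | year | ... | activities | dartboard | when | gd | understanding | tool | thangam | 100p | swell | 09061790121 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |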
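`pd.concat(..., axis=1)` pairs rows by index label rather than by position, which is presumably why the training set was given a fresh index with `reset_index(drop=True)` after shuffling. The sketch below uses hypothetical toy frames (the names `shuffled` and `counts` are illustrative only) to show what can go wrong when the indexes do not match.

```python
import pandas as pd

# Hypothetical toy frames: `shuffled` keeps its old index after shuffling,
# while `counts` has the default index 0, 1, 2.
shuffled = pd.DataFrame({"Label": ["ham", "spam", "ham"]}, index=[2, 0, 1])
counts = pd.DataFrame({"free": [0, 2, 0]})

# Rows are matched on index labels, so the spam row (label 0) is paired
# with the first row of `counts` instead of the second.
misaligned = pd.concat([shuffled, counts], axis=1)

# Resetting the index first makes the rows pair up by position.
aligned = pd.concat([shuffled.reset_index(drop=True), counts], axis=1)

print(misaligned)
print(aligned)
```

In the misaligned frame the spam message ends up with a count of 0 for "free", whereas in the aligned frame it correctly gets 2.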
