# SMS spam filter

In this project we'll build a spam filter for SMS messages that can distinguish spam from non-spam.

As a reference, we'll use a set of messages that humans have already classified as spam or non-spam.

## Choosing an algorithm for a classification problem

Since this is a classic binary classification problem, the **Naive Bayes algorithm** should do the job well, so we'll use it throughout this work.

Let's load the libraries we need to do the job.

In [1]:
import pandas as pd
import numpy as np
import re

## The SMS Spam Collection

Now, let's load the spam collection file into a pandas DataFrame and take a quick look at it.

In [2]:
colec = pd.read_csv('SMSSpamCollection', sep='\t', header=None)

In [3]:
colec

Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


These columns are named "0" and "1", which is not very intuitive. Let's rename them to "Label" and "SMS".

In [4]:
colec.columns = ['Label', 'SMS']
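As a side note, the column names can also be set while loading, through the `names` parameter of `pd.read_csv`. The sketch below uses a small in-memory stand-in for the file (the two messages and the `toy` name are assumptions for illustration, not the real collection).

```python
import io
import pandas as pd

# Two tab-separated lines standing in for the real SMSSpamCollection file
raw = io.StringIO(
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tWINNER!! Claim your prize now\n"
)

# names= assigns the column labels at load time, so no renaming step is needed
toy = pd.read_csv(raw, sep='\t', header=None, names=['Label', 'SMS'])
print(toy)
```

Either approach should lead to the same DataFrame; renaming afterwards simply keeps the loading step and the naming step separate.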

Let's check that everything is OK with the DataFrame and its length.

In [5]:
colec.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [6]:
colec.shape

(5572, 2)

We have 5,572 rows and 2 columns in this collection. Let's check how the entries are distributed between 'ham' (non-spam) and 'spam', and look at the percentages.

In [7]:
ham_perc = ((len(colec[colec['Label'] == 'ham']))/len(colec))*100
spam_perc = ((len(colec[colec['Label'] == 'spam']))/len(colec))*100

In [8]:
ham_perc

86.59368269921033

In [9]:
spam_perc

13.406317300789663

This dataset is 86.59% non-spam and 13.41% spam SMS.
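The same percentages can be obtained in one step with `value_counts(normalize=True)`. This is a minimal sketch on a toy DataFrame (the four rows and the names `toy` and `label_perc` are made up for illustration).

```python
import pandas as pd

# Toy stand-in for the collection: three ham messages and one spam message
toy = pd.DataFrame({
    'Label': ['ham', 'ham', 'ham', 'spam'],
    'SMS': ['see you soon', 'ok lar', 'call me later', 'WINNER claim your prize'],
})

# normalize=True returns shares instead of raw counts; multiply by 100 for percentages
label_perc = toy['Label'].value_counts(normalize=True) * 100
print(label_perc)
```

On the real collection, `colec['Label'].value_counts(normalize=True) * 100` should give the two figures computed above.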

## Creating a training set and a test set

Using this collection, we'll now create two separate sets: one to train our filter and another to test it. Since all these messages are already classified as spam or non-spam, we can use the test set to check the accuracy of our solution.

The training set will have 80% of the messages, and the test set will have 20% of the total.

In [10]:
# Randomize the dataset
data_randomized = colec.sample(frac=1, random_state=1)

# Calculate index for split
training_test_index = round(len(data_randomized) * 0.8)

# Training/Test split
training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


Let's verify that the split was made correctly.

In [11]:
len(training_set)

4458

In [12]:
len(test_set)

1114

Everything looks right. We'll now check whether the spam:ham ratio from the original dataset was kept. If nothing went wrong, the spam percentage below should be close to the 13.41% we saw in the full collection.

In [13]:
(len(training_set[training_set['Label'] == 'spam'])/len(training_set))*100

13.458950201884253

The percentages match. We'll now do the data cleansing on the training_set.
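To compare the ratio in both halves of a split at once, a small helper can be handy. The sketch below repeats the shuffle-and-split steps on a toy DataFrame; `spam_share`, `toy`, `toy_train` and `toy_test` are hypothetical names used only for this illustration.

```python
import pandas as pd

def spam_share(df):
    """Percentage of rows labelled 'spam' (hypothetical helper)."""
    return (df['Label'] == 'spam').mean() * 100

# Toy data: 20 spam and 80 ham labels
toy = pd.DataFrame({'Label': ['spam'] * 20 + ['ham'] * 80})

# Same steps as above: shuffle, then cut at 80%
shuffled = toy.sample(frac=1, random_state=1)
split = round(len(shuffled) * 0.8)
toy_train = shuffled[:split].reset_index(drop=True)
toy_test = shuffled[split:].reset_index(drop=True)

for name, part in [('all', toy), ('train', toy_train), ('test', toy_test)]:
    print(name, len(part), round(spam_share(part), 2))
```

Because the split is random rather than stratified, the shares in the two parts are only expected to be close to the overall share, not identical.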

## Data cleansing on the training set

To standardize this training set, we'll need to do some cleaning: remove punctuation and make everything lower case.

In [14]:
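# \W is a regex, so pass regex=True explicitly (pandas 2.0+ defaults to literal matching)
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)
training_set['SMS'] = training_set['SMS'].str.lower()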

In [15]:
training_set.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


This dataset is clean and ready for further work.
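To see what the cleaning step does to a single message, here is a minimal sketch on a two-element Series (the sample messages and the names `sample` and `cleaned` are assumptions for illustration).

```python
import pandas as pd

sample = pd.Series([
    "WINNER!! Call now: 0800-123",
    "Ok lar... Joking wif u oni",
])

# \W matches any character that is not a letter, digit or underscore;
# each match is replaced by a space, then everything is lower-cased
cleaned = sample.str.replace(r'\W', ' ', regex=True).str.lower()
print(cleaned.tolist())
```

Punctuation turns into spaces rather than disappearing, so runs of spaces can remain. That should be harmless here, since splitting on whitespace later collapses them.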

## Creating a vocabulary

To build a vocabulary and compute a probability for every word in our dataset, we need to record every word we have. To do so, we'll loop through every row of the dataset and through every word in each row.

In [16]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []

for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)

In [17]:
training_set['SMS'].head()

0                    [yep, by, the, pretty, sculpture]
1    [yes, princess, are, you, going, to, make, me,...
2                      [welp, apparently, he, retired]
3                                             [havent]
4    [i, forgot, 2, ask, ü, all, smth, there, s, a,...
Name: SMS, dtype: object

Now, we'll use a trick to remove duplicates from our vocabulary. We'll transform the list into a set, and into a list again.

In [18]:
vocabulary = list(set(vocabulary))

In [19]:
len(vocabulary)

7783

Our vocabulary has 7,783 unique words.
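The collect-then-deduplicate steps can also be written as a single set comprehension. A minimal sketch on two toy tokenized messages (the `messages` list is made up for illustration):

```python
# Two toy messages, already cleaned and split into words
messages = [
    ['free', 'prize', 'call', 'now'],
    ['call', 'me', 'now'],
]

# The set comprehension keeps each word once; sorted() gives a stable order
vocabulary = sorted({word for sms in messages for word in sms})
print(vocabulary)
```

Sorting is optional. It only makes the column order of later tables reproducible, since a plain `set` has no guaranteed order.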

## Creating a table for counting word frequencies

Now that we have a vocabulary, we need to count the frequency of every word. We'll do this by creating a table where each row represents an SMS and each column represents a word. Each cell holds the number of occurrences of that word in that SMS.
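To make the layout concrete, here is a small sketch of such a table built from two toy messages (the `messages`, `counts` and `table` names are assumptions for illustration).

```python
import pandas as pd

# Two toy messages, already cleaned and split into words
messages = [
    ['free', 'prize', 'free'],
    ['call', 'me'],
]
vocabulary = sorted({word for sms in messages for word in sms})

# One list of zeros per word, with one slot per message
counts = {word: [0] * len(messages) for word in vocabulary}
for index, sms in enumerate(messages):
    for word in sms:
        counts[word][index] += 1

# Rows are messages, columns are words
table = pd.DataFrame(counts)
print(table)
```

The first row has a 2 under `free` because that word appears twice in the first message; every word that does not appear in a message stays at 0.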

In [20]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

Let's now transform this dictionary into a dataframe.

In [21]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


Our DataFrame has 4,458 rows (one for each SMS) and 7,783 columns (one for each word). We'll now concatenate these word counts with the training set, so that each row also carries its label and its tokenized SMS.

In [22]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)

In [23]:
training_set_clean.shape

(4458, 7785)

In [24]:
training_set_clean = training_set_clean[training_set_clean['Label'].notnull()]

In [25]:
training_set_clean.shape

(4458, 7785)

Our new DataFrame has 4,458 rows (one per SMS) and 7,785 columns: the 7,783 word columns plus `Label` and `SMS`. Filtering out rows with a missing label did not remove anything.

## Calculating the constants

Let's now calculate the constant values we need for the probability calculations:
* p_spam (probability that an SMS is spam)
* p_ham (probability that an SMS is non-spam)
* n_spam (number of words in all spam SMS)
* n_ham (number of words in all non-spam SMS)
* n_vocabulary (number of unique words in the vocabulary)
* alpha (the Laplace smoothing factor)
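
For reference, these constants enter the calculations in the following way, matching the code cells further down. The notation is an assumption chosen for this note: $N_{w_i \mid \text{Spam}}$ is the number of times word $w_i$ occurs in spam messages, $N_{\text{Spam}}$ corresponds to `n_spam`, $N_{\text{Vocabulary}}$ to `n_vocabulary`, and $\alpha$ to `alpha`.

$$P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}$$

$$P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})$$

The non-spam case uses the same two expressions with `n_ham` and `p_ham` in place of `n_spam` and `p_spam`.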

In [26]:
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

In [27]:
p_spam = len(spam_messages)/len(training_set_clean)
p_spam

0.13458950201884254

In [28]:
len(training_set_clean)

4458

In [29]:
p_ham = len(ham_messages)/len(training_set_clean)
p_ham

0.8654104979811574

In [30]:
n_spam = spam_messages['SMS'].apply(len).sum()
n_spam

15190

In [31]:
n_ham = ham_messages['SMS'].apply(len).sum()
n_ham

57237

In [32]:
n_vocabulary = len(vocabulary)

n_vocabulary

7783

In [33]:
alpha = 1

### Extracting word probabilities, creating parameters

Now that we have our constants, it's time to calculate the probability of every word in our vocabulary, for both spam and non-spam. We'll store these in dictionaries, where each key is a word and each value is its probability.

We'll start with the probability of each word given that a message is spam.

In [35]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}

In [36]:
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha)/(n_spam + alpha * n_vocabulary)
    parameters_spam[word] = p_word_given_spam

And now, the probability of each word given that a message is not spam.

In [37]:
parameters_ham = {unique_word:0 for unique_word in vocabulary}

In [38]:
for word in vocabulary:
    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha)/(n_ham + alpha * n_vocabulary)
    parameters_ham[word] = p_word_given_ham
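As a small worked example of the smoothed formula used in the two loops above, the sketch below plugs in toy numbers (the counts and totals are made up and much smaller than the real ones).

```python
alpha = 1          # Laplace smoothing factor, as above
n_spam = 7         # toy total number of words across spam messages
n_vocabulary = 9   # toy vocabulary size

# Toy counts of each word inside spam messages; 'tom' never appears in spam
n_word_given_spam = {'secret': 4, 'money': 2, 'the': 1, 'tom': 0}

p_word_given_spam = {
    word: (count + alpha) / (n_spam + alpha * n_vocabulary)
    for word, count in n_word_given_spam.items()
}
print(p_word_given_spam)
```

With these numbers the denominator is 16, so `secret` gets 5/16 = 0.3125 and the unseen word `tom` gets 1/16 = 0.0625 instead of 0. Without smoothing, a single unseen word would zero out the whole product for that class.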

### Creating the filter itself

In [51]:
def classify(message):
    '''
    message: a string
    '''

    # Same cleaning as the training set: strip punctuation, lower-case, tokenize
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()

    # Start from the class priors
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # Multiply in the probability of each known word; unknown words are skipped
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'
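One caveat with multiplying many small probabilities is floating-point underflow: for a very long message both products can reach 0.0, and the comparison then ends in a tie. A common variation, not used in this notebook, is to add logarithms instead. The sketch below shows the idea with hypothetical toy parameters (`P_SPAM`, `P_HAM`, `PARAMS_SPAM`, `PARAMS_HAM` and `classify_log` are all assumptions standing in for the real values).

```python
import math
import re

# Hypothetical toy parameters standing in for p_spam, p_ham and the two dictionaries
P_SPAM, P_HAM = 0.5, 0.5
PARAMS_SPAM = {'winner': 0.4, 'money': 0.4, 'see': 0.1, 'you': 0.1}
PARAMS_HAM = {'winner': 0.1, 'money': 0.1, 'see': 0.4, 'you': 0.4}

def classify_log(message):
    """Same decision rule as classify, but summing log-probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(P_SPAM)
    log_ham = math.log(P_HAM)

    for word in words:
        # Words outside the vocabulary are skipped, as in classify
        if word in PARAMS_SPAM:
            log_spam += math.log(PARAMS_SPAM[word])
        if word in PARAMS_HAM:
            log_ham += math.log(PARAMS_HAM[word])

    if log_ham > log_spam:
        return 'ham'
    elif log_spam > log_ham:
        return 'spam'
    else:
        return 'needs human classification'

print(classify_log('WINNER!! money'))
print(classify_log('see you'))
print(classify_log('winner ' * 2000))
```

Since the logarithm is monotonic, comparing sums of logs gives the same answer as comparing products whenever the products are representable. The last call repeats a word 2,000 times, a case where the plain product would underflow to 0.0 for both classes.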

### Testing the filter

We'll now test the filter with two sample messages:
* WINNER!! This is the secret code to unlock the money: C3421.
* Sounds good, Tom, then see u there

Let's see if the results look good enough. The filter will return "spam" for a spam message, and "ham" for a non-spam message.

In [40]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

'spam'

In [52]:
classify('Sounds good, Tom, then see u there')

'ham'

Looks good. Let's check the accuracy of the filter, using the test set, which has pre-classified values to compare.

In [42]:
test_set['predicted'] = test_set['SMS'].apply(classify)

In [43]:
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [55]:
correct = 0
total = len(test_set)

Below, we'll count the rows where the original label matches the one generated by the filter.

In [57]:
corr = len(test_set[test_set['Label'] == test_set['predicted']])
corr

1100

And now, the rows where the filter's classification differs from the original label.

In [58]:
wrong = len(test_set[test_set['Label'] != test_set['predicted']])
wrong

14

Below, we have the percentage of correct answers given by our filter.

In [59]:
(corr/total)*100

98.74326750448833
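Accuracy alone does not say which kind of mistake the filter makes. A cross-tabulation of true labels against predictions separates spam that slipped through from ham that was flagged. Below is a minimal sketch on toy results (the five rows and the names `toy_results`, `confusion` and `accuracy` are made up for illustration).

```python
import pandas as pd

# Toy true labels and predictions
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'spam', 'spam', 'ham', 'ham'],
})

# Rows are the true labels, columns are the predicted labels
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])
print(confusion)

# The mean of a boolean Series is the share of True values
accuracy = (toy_results['Label'] == toy_results['predicted']).mean() * 100
print(accuracy)
```

On the real test set, `pd.crosstab(test_set['Label'], test_set['predicted'])` should show how the 14 errors are split between the two directions.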

And now, let's look at the wrong predictions.

In [60]:
test_set[test_set['Label'] != test_set['predicted']]

Unnamed: 0,Label,SMS,predicted
114,spam,Not heard from U4 a while. Call me now am here...,ham
135,spam,More people are dogging in your area now. Call...,ham
152,ham,Unlimited texts. Limited minutes.,spam
159,ham,26th OF JULY,spam
284,ham,Nokia phone is lovly..,spam
293,ham,A Boy loved a gal. He propsd bt she didnt mind...,needs human classification
302,ham,No calls..messages..missed calls,spam
319,ham,We have sent JD for Customer Service cum Accou...,spam
504,spam,Oh my god! I've found your number again! I'm s...,ham
546,spam,"Hi babe its Chloe, how r u? I was smashed on s...",ham


With these results (1,100 of 1,114 test messages classified correctly, an accuracy of 98.74%), our spam filter is functional and good to go.