# Spam Filter

In this project, we're going to build a **spam filter for SMS messages** using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an **accuracy greater than 80%** — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

To train the algorithm, we'll use a dataset of 5,572 SMS messages that have already been classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from <a href="https://archive.ics.uci.edu/ml/datasets/sms+spam+collection">the UCI Machine Learning Repository</a>. The data collection process is described in more detail on <a href="https://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition">this page</a>, where you can also find some of the authors' papers.

# Requirements

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np

# Exploring the dataset

In [2]:
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [3]:
df

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [5]:
df.describe()

Unnamed: 0,Label,SMS
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


In [6]:
df["Label"].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

From the output above, we can see that about **87%** of the messages are ham, and the remaining **13%** are spam. This sample looks representative, since in practice most messages that people receive are ham.

# Training and test sets

In [7]:
data_randomized = df.sample(frac=1, random_state=1)
train_test_portion = round(len(data_randomized) * 0.8)

training_set = data_randomized[:train_test_portion].reset_index(drop=True)
test_set = data_randomized[train_test_portion:].reset_index(drop=True)

We'll now analyze the percentage of spam and ham messages in the training and test sets. We expect the percentages to be close to what we have in the full dataset, where about 87% of the messages are ham, and the remaining 13% are spam.

Training set shape and percentages:

In [8]:
training_set.shape

(4458, 2)

In [9]:
training_set["Label"].value_counts(normalize=True) * 100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

Test set shape and percentages:

In [10]:
test_set.shape

(1114, 2)

In [11]:
test_set["Label"].value_counts(normalize=True) * 100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

The results look good! We'll now move on to cleaning the dataset.

# Data cleaning

To calculate all the probabilities required by the algorithm, we first need to clean the data and bring it into a format that lets us easily extract all the information we need.

In [12]:
training_set

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...
...,...,...
4453,ham,"Sorry, I'll call later in meeting any thing re..."
4454,ham,Babe! I fucking love you too !! You know? Fuck...
4455,spam,U've been selected to stay in 1 of 250 top Bri...
4456,ham,Hello my boytoy ... Geeee I miss you already a...


### Letter case & punctuation

We start by removing all punctuation and converting every letter to lowercase.

In [13]:
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)
training_set['SMS'] = training_set['SMS'].str.lower()
training_set.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
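
The same cleaning step can be tried on a single string with the standard `re` module. This is a minimal sketch; `clean_message` is a hypothetical helper name that does not appear in the notebook.

```python
import re

def clean_message(text):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r"\W", " ", text).lower()

sample = "Yep, by the pretty sculpture"
cleaned = clean_message(sample)

# The comma becomes a space, so two spaces remain between "yep" and "by";
# str.split() later collapses any run of whitespace.
print(repr(cleaned))
print(cleaned.split())  # ['yep', 'by', 'the', 'pretty', 'sculpture']
```

Because `\W` is Unicode-aware for Python strings, letters such as `ü` count as word characters and are kept, which agrees with the cleaned rows shown above.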


### Creating Vocabulary

In [14]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))            

In [15]:
len(vocabulary)

7783

It looks like there are 7,783 unique words in all the messages of the training set.

### Counting words per SMS

In [16]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

In [17]:
for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
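
To see what this dictionary looks like, here is a minimal sketch on three made-up tokenized messages (toy data, not taken from the dataset). Each vocabulary word maps to a list with one count per message.

```python
# Toy tokenized messages, invented for illustration only.
toy_messages = [
    ["secret", "prize", "claim", "now"],
    ["coming", "to", "my", "secret", "party"],
    ["winner", "claim", "your", "secret", "prize", "prize"],
]

# Vocabulary: every unique word across all messages.
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list of zeros per word, with one slot per message.
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

print(toy_counts["secret"])  # [1, 1, 1]
print(toy_counts["prize"])   # [1, 0, 2]
```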

### Final training set

In [18]:
training_set = pd.concat([training_set, pd.DataFrame(word_counts_per_sms)], axis=1)
training_set.head()

Unnamed: 0,Label,SMS,caroline,erode,poor,bakrid,f4q,spell,cried,gay,...,selling,vouch4me,splat,replied,gastroenteritis,rentl,wahala,sophas,percent,kiefer
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Calculating constants

The Naive Bayes algorithm will need to answer these two probability questions to be able to classify new messages:
<img src="eq1.png"/>
Also, to calculate P(wi|Spam) and P(wi|Ham) inside the formulas above, we'll need to use these equations:
<img src="eq2.png"/>
Some of the terms in the four equations above have the same value for every new message. We can calculate these terms once and avoid repeating the computation when a new message comes in. Below, we'll use our training set to calculate:
<ul>
    <li><b>P(Spam)</b></li>
    <li><b>P(Ham)</b></li>
    <li><b>NSpam</b></li>
    <li><b>NHam</b></li>
    <li><b>NVocabulary</b></li>
    <li>$\alpha = 1$</li>
</ul>
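
Since the equations are embedded as images, a LaTeX transcription may help. It is reconstructed from the code in this notebook (the classifier multiplies the prior by each word's parameter, and the parameters use additive smoothing), so treat it as an assumption about what the images show.

```latex
P(Spam \mid w_1, w_2, \ldots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)

P(Ham \mid w_1, w_2, \ldots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)

P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}

P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
```

Here $N_{w_i|Spam}$ is the number of times the word $w_i$ occurs in spam messages, and likewise $N_{w_i|Ham}$ for ham messages.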

In [19]:
alpha = 1

First let's compute **P(Spam)** and **P(Ham)**:

In [20]:
p_spam = len(training_set[training_set['Label'] == 'spam']) / len(training_set)
p_ham = len(training_set[training_set['Label'] == 'ham']) / len(training_set)

**NSpam** is equal to the number of words in all the spam messages — it's not equal to the number of spam messages, and it's not equal to the total number of unique words in spam messages.
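
The difference between these three quantities is easy to check on a toy example. The two tokenized messages below are invented for illustration.

```python
# Two made-up spam messages, already tokenized.
toy_spam = [
    ["win", "a", "prize", "now"],
    ["claim", "your", "prize", "prize"],
]

n_messages = len(toy_spam)                                    # number of spam messages
n_words = sum(len(sms) for sms in toy_spam)                   # all words, repeats included
n_unique_words = len({word for sms in toy_spam for word in sms})  # distinct words only

print(n_messages, n_words, n_unique_words)  # 2 8 6
```

Only the middle value, the total word count with repeats, plays the role of NSpam.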

In [21]:
n_spam = training_set[training_set['Label'] == 'spam']['SMS'].apply(len).sum()
n_spam

15190

**NHam** is equal to the number of words in all the non-spam messages — it's not equal to the number of non-spam messages, and it's not equal to the total number of unique words in non-spam messages.

In [22]:
n_ham = training_set[training_set['Label'] == 'ham']['SMS'].apply(len).sum()
n_ham

57237

**NVocabulary** is equal to the number of unique words previously stored in the `vocabulary` list.

In [23]:
n_vocabulary = len(vocabulary)
n_vocabulary

7783

### Calculating Parameters

Now that we have the constant terms calculated above, we can move on with calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter will thus be a conditional probability value associated with each word in the vocabulary.

The parameters are calculated using the formulas:
<img src="eq2.png"/>


In [24]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

In [25]:
spam_messages = training_set[training_set['Label'] == 'spam']
ham_messages = training_set[training_set['Label'] == 'ham']
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + (alpha * n_vocabulary)) 
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + (alpha * n_vocabulary)) 
    parameters_ham[word] = p_word_given_ham

In [26]:
parameters_spam

{'caroline': 4.3529360553693465e-05,
 'erode': 4.3529360553693465e-05,
 'poor': 4.3529360553693465e-05,
 'bakrid': 4.3529360553693465e-05,
 'f4q': 8.705872110738693e-05,
 'spell': 4.3529360553693465e-05,
 'cried': 4.3529360553693465e-05,
 'gay': 0.00030470552387585427,
 'courageous': 4.3529360553693465e-05,
 'spook': 0.0002611761633221608,
 'arrange': 8.705872110738693e-05,
 'sheets': 4.3529360553693465e-05,
 'iwana': 4.3529360553693465e-05,
 'darren': 4.3529360553693465e-05,
 'aunties': 4.3529360553693465e-05,
 'tue': 4.3529360553693465e-05,
 'trying': 0.0005223523266443216,
 '08715705022': 0.00017411744221477386,
 'theirs': 8.705872110738693e-05,
 'wedlunch': 4.3529360553693465e-05,
 'information': 0.00030470552387585427,
 'swollen': 4.3529360553693465e-05,
 '09064018838': 8.705872110738693e-05,
 'cat': 4.3529360553693465e-05,
 'dec': 4.3529360553693465e-05,
 'cold': 4.3529360553693465e-05,
 'karo': 4.3529360553693465e-05,
 'ki': 4.3529360553693465e-05,
 '08719181259': 8.705872110738

In [27]:
parameters_ham

{'caroline': 4.6139649338665025e-05,
 'erode': 3.075976622577668e-05,
 'poor': 0.00012303906490310673,
 'bakrid': 3.075976622577668e-05,
 'f4q': 1.537988311288834e-05,
 'spell': 3.075976622577668e-05,
 'cried': 3.075976622577668e-05,
 'gay': 4.6139649338665025e-05,
 'courageous': 3.075976622577668e-05,
 'spook': 1.537988311288834e-05,
 'arrange': 4.6139649338665025e-05,
 'sheets': 6.151953245155337e-05,
 'iwana': 3.075976622577668e-05,
 'darren': 0.00016917871424177176,
 'aunties': 3.075976622577668e-05,
 'tue': 3.075976622577668e-05,
 'trying': 0.00039987696093509687,
 '08715705022': 1.537988311288834e-05,
 'theirs': 1.537988311288834e-05,
 'wedlunch': 3.075976622577668e-05,
 'information': 4.6139649338665025e-05,
 'swollen': 3.075976622577668e-05,
 '09064018838': 1.537988311288834e-05,
 'cat': 7.689941556444171e-05,
 'dec': 3.075976622577668e-05,
 'cold': 0.00012303906490310673,
 'karo': 3.075976622577668e-05,
 'ki': 3.075976622577668e-05,
 '08719181259': 1.537988311288834e-05,
 'eve
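
As a quick arithmetic check of the smoothing formula, the sketch below plugs in the constants computed earlier (`n_spam = 15190`, `n_ham = 57237`, `n_vocabulary = 7783`, `alpha = 1`). The helper name `smoothed` is hypothetical and used only here.

```python
alpha = 1
n_spam = 15190
n_ham = 57237
n_vocabulary = 7783

def smoothed(word_count, n_class):
    """Additive (Laplace) smoothing as used for the parameters above."""
    return (word_count + alpha) / (n_class + alpha * n_vocabulary)

# A word that never occurs in spam still gets 1 / (15190 + 7783) = 1 / 22973.
print(smoothed(0, n_spam))
# A word that never occurs in ham gets 1 / (57237 + 7783) = 1 / 65020.
print(smoothed(0, n_ham))
# A word seen 6 times in spam gets 7 / 22973.
print(smoothed(6, n_spam))
```

The first two values are the smallest entries in the `parameters_spam` and `parameters_ham` outputs above (for example `'caroline'` in spam and `'f4q'` in ham), which shows that smoothing keeps every parameter above zero.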

### Messages classifier

In [28]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)        

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [29]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [30]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
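
The printed values are already around 1e-25 for short messages. For much longer messages, multiplying many small factors can underflow to `0.0` in floating point. A common workaround, not used in this notebook, is to add logarithms instead of multiplying probabilities. The sketch below assumes hypothetical toy priors and parameters (the `toy_` names are invented, not the values learned above).

```python
import math
import re

# Hypothetical toy values, not the ones learned from the training set.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {"secret": 0.003, "money": 0.004, "see": 0.0005}
toy_parameters_ham = {"secret": 0.0005, "money": 0.0004, "see": 0.003}

def classify_log(message):
    """Same decision rule as classify(), computed with sums of logs."""
    words = re.sub(r"\W", " ", message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
        if word in toy_parameters_ham:
            log_ham += math.log(toy_parameters_ham[word])

    if log_ham > log_spam:
        return "ham"
    if log_spam > log_ham:
        return "spam"
    return "needs human classification"

print(classify_log("Secret money!"))   # spam
print(classify_log("See you there"))   # ham
```

Because the logarithm is strictly increasing, comparing the two sums gives the same label as comparing the two products.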


### Measuring the Spam Filter's Accuracy

The two results above look promising, but let's see how well the filter does on our test set, which has 1,114 messages.
We'll start by modifying the previous function to return classification labels instead of printing them.

In [31]:
import re

def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]    

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [32]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham
...,...,...,...
1109,ham,"We're all getting worried over here, derek and...",ham
1110,ham,Oh oh... Den muz change plan liao... Go back h...,ham
1111,ham,CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C...,ham
1112,spam,Text & meet someone sexy today. U can find a d...,spam


Now we can compare the predicted values with the actual values to measure how good our spam filter is with classifying new messages. To make the measurement, we'll use **accuracy** as a metric:
<img src="eq3.png"/>
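
Written out in LaTeX, and assuming the image shows the usual definition, the metric is:

```latex
\text{Accuracy} = \frac{\text{number of correctly classified messages}}{\text{total number of classified messages}}
```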

In [33]:
correct = test_set[test_set['predicted'] == test_set['Label']].shape[0]
total = test_set.shape[0]

In [34]:
print('Correct: ', correct)
print('Incorrect: ', total - correct)
print('Accuracy: ', round((correct / total) * 100, 2), "%")

Correct:  1100
Incorrect:  14
Accuracy:  98.74 %



The accuracy is 98.74%, well above our goal of 80%. Our spam filter looked at 1,114 messages that it had not seen during training and classified 1,100 of them correctly.

### Exploring Misclassified Messages

### Case Sensitive

### Scikit Learn Naive Bayes

In [67]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

In [37]:
training_set

Unnamed: 0,Label,SMS,caroline,erode,poor,bakrid,f4q,spell,cried,gay,...,selling,vouch4me,splat,replied,gastroenteritis,rentl,wahala,sophas,percent,kiefer
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4453,ham,"[sorry, i, ll, call, later, in, meeting, any, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,ham,"[babe, i, fucking, love, you, too, you, know, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,spam,"[u, ve, been, selected, to, stay, in, 1, of, 2...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4456,ham,"[hello, my, boytoy, geeee, i, miss, you, alrea...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [36]:
clf = MultinomialNB()

In [46]:
clf.fit(training_set[training_set.columns.tolist()[2:]], training_set[training_set.columns.tolist()[0]])

MultinomialNB()

In [57]:
test_set

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham
...,...,...,...
1109,ham,"We're all getting worried over here, derek and...",ham
1110,ham,Oh oh... Den muz change plan liao... Go back h...,ham
1111,ham,CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C...,ham
1112,spam,Text & meet someone sexy today. U can find a d...,spam


In [58]:
word_counts_per_sms = {unique_word: [0] * len(test_set['SMS']) for unique_word in vocabulary}

In [59]:
for index, sms in enumerate(test_set['SMS']):
    for word in sms:
        if word in vocabulary:
            word_counts_per_sms[word][index] += 1

In [60]:
test_set_arranged = pd.concat([test_set, pd.DataFrame(word_counts_per_sms)], axis=1)
test_set_arranged.head()

Unnamed: 0,Label,SMS,predicted,caroline,erode,poor,bakrid,f4q,spell,cried,...,selling,vouch4me,splat,replied,gastroenteritis,rentl,wahala,sophas,percent,kiefer
0,ham,Later i guess. I needa do mcat study too.,ham,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,But i haf enuff space got like 4 mb...,ham,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,All sounds good. Fingers . Makes it difficult ...,ham,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"All done, all handed in. Don't know if mega sh...",ham,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [65]:
predictions = clf.predict(test_set_arranged[test_set_arranged.columns.tolist()[3:]])

In [69]:
accuracy_score(test_set_arranged[test_set_arranged.columns.tolist()[0]], predictions)

0.8671454219030521
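
This score of about 0.867 is far below the 98.74% of the hand-written filter and close to the share of ham in the test set (86.8%). One likely contributor is visible in the code above: `test_set['SMS']` still holds raw strings, so `for word in sms` iterates over single characters rather than words, and only one-character vocabulary entries are ever counted. A minimal sketch of tokenizing each message first is shown below; `tokenize` and `count_words` are hypothetical helper names, and the vocabulary is a toy one.

```python
import re

def tokenize(message):
    """Clean a raw message the same way the training set was cleaned."""
    return re.sub(r"\W", " ", message).lower().split()

def count_words(messages, vocabulary):
    """Build {word: [count per message]} for raw (untokenized) messages."""
    vocabulary_set = set(vocabulary)  # set membership is faster than a list scan
    counts = {word: [0] * len(messages) for word in vocabulary}
    for index, message in enumerate(messages):
        for word in tokenize(message):
            if word in vocabulary_set:
                counts[word][index] += 1
    return counts

# Toy vocabulary; the two messages are the first test-set rows shown above.
toy_vocabulary = ["later", "i", "guess", "study", "space"]
toy_messages = [
    "Later i guess. I needa do mcat study too.",
    "But i haf enuff space got like 4 mb...",
]

toy_counts = count_words(toy_messages, toy_vocabulary)
print(toy_counts["i"])      # [2, 1]
print(toy_counts["space"])  # [0, 1]
```

A dictionary built this way from the full vocabulary could be passed to `pd.DataFrame` and then to `clf.predict` as in the cells above; the accuracy would need to be measured again after that change.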