## Building a Spam Filter with Naive Bayes

#### The goal of this script is to build an SMS filter that classifies incoming messages as spam or ham (not spam).

In [27]:
import pandas as pd

### Collect data, split training/test data and clean.

In [2]:
sms_collection_data = pd.read_csv('data/SMSSpamCollection', sep='\t', header=None, names=['Label','SMS'])

In [3]:
sms_collection_data.shape

(5572, 2)

In [4]:
sms_collection_data.isnull().sum()

Label    0
SMS      0
dtype: int64

In [5]:
sms_collection_data['Label'].value_counts() / len(sms_collection_data) * 100

Label
ham     86.593683
spam    13.406317
Name: count, dtype: float64

The dataset consists of 5,572 SMS messages in 2 columns: `Label`, which marks each message as spam or ham (not spam), and `SMS`, which holds the message body.

Roughly 86.6% of the messages are ham and 13.4% are spam.

There are no NaN values in the dataset.

The dataset is shuffled and then split: 80% for training the model and 20% for testing.

In [6]:
sms_collection_data_randomized = sms_collection_data.sample(frac=1, random_state=1)

training_test_index = round(len(sms_collection_data_randomized) * 0.8)

training_data = sms_collection_data_randomized[:training_test_index].reset_index(drop=True)
test_data = sms_collection_data_randomized[training_test_index:].reset_index(drop=True)

print("Training data has a shape of: ", training_data.shape)
print(training_data['Label'].value_counts() / len(training_data) * 100)

print("Test data has a shape of: ", test_data.shape)
print(test_data['Label'].value_counts() / len(test_data) * 100)

Training data has a shape of:  (4458, 2)
Label
ham     86.54105
spam    13.45895
Name: count, dtype: float64
Test data has a shape of:  (1114, 2)
Label
ham     86.804309
spam    13.195691
Name: count, dtype: float64


The training and test sets have spam and ham percentages very close to those of the full dataset.
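
The proportions above come out close because the shuffle happened to preserve them. If they need to be guaranteed, each label can be sampled separately (a stratified split). The sketch below is an illustration on made-up data, not part of the original pipeline; `stratified_split` and the toy frame are hypothetical names.

```python
import pandas as pd

def stratified_split(frame, label_column, train_fraction, seed):
    """Split a frame so each label keeps the same share in both parts."""
    # Sample the training rows separately within each label group.
    train = frame.groupby(label_column).sample(frac=train_fraction, random_state=seed)
    # Everything not drawn for training becomes the test set.
    test = frame.drop(train.index)
    # Shuffle the training rows so the labels are not grouped together.
    train = train.sample(frac=1, random_state=seed)
    return train.reset_index(drop=True), test.reset_index(drop=True)

# Made-up data: 80 ham and 20 spam messages.
toy = pd.DataFrame({
    "Label": ["ham"] * 80 + ["spam"] * 20,
    "SMS": [f"message {i}" for i in range(100)],
})

toy_train, toy_test = stratified_split(toy, "Label", 0.8, seed=1)
print(toy_train["Label"].value_counts())
print(toy_test["Label"].value_counts())
```

With this toy frame both parts keep the 80/20 ham-to-spam ratio exactly, whereas a plain shuffle only approximates it.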

Next, the training messages are cleaned: punctuation is replaced with spaces and all text is lower-cased.

In [7]:
training_data['SMS'] = training_data['SMS'].replace(to_replace=r'\W', value=' ', regex=True)

training_data['SMS'] = training_data['SMS'].str.lower()
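
To see what this cleaning does to a single message, the same two steps can be applied to one string. This is a small sketch; the example message and the `clean` helper are made up for illustration.

```python
import re

def clean(message):
    """Replace every non-word character with a space, then lower-case."""
    return re.sub(r'\W', ' ', message).lower()

# Made-up example message.
example = "WINNER!! You've won a prize. Call 09061701461"
tokens = clean(example).split()
print(tokens)
```

Note that `\W` keeps digits and underscores, and that an apostrophe becomes a space, so a contraction such as "You've" is split into two tokens ("you" and "ve").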

Each unique word in the training messages becomes a column, and each row holds the number of times that word occurs in the corresponding message.

In [8]:
words_in_sms = []

for message in training_data['SMS']:
    list_of_words_in_sms = message.split()

    for word in list_of_words_in_sms:
        words_in_sms.append(word)

unique_words_in_sms = list(set(words_in_sms))

In [9]:
count_of_unique_words = {unique_word: [0] * len(training_data['SMS']) for unique_word in unique_words_in_sms}

for index, message in enumerate(training_data['SMS']):
    for word in message.split():
        count_of_unique_words[word][index] += 1


In [10]:
word_counts = pd.DataFrame(count_of_unique_words)
word_counts.head()

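| | brighten | frosty | suply | bridgwater | gave | pax | lighters | enters | se | 82242 | ... | tarot | bbd | pooja | 1450 | prepared | wont | checked | infront | kb | reach |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |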


In [11]:
training_data_clean = pd.concat([training_data, word_counts], axis=1)
training_data_clean.head()

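| | Label | SMS | brighten | frosty | suply | bridgwater | gave | pax | lighters | enters | ... | tarot | bbd | pooja | 1450 | prepared | wont | checked | infront | kb | reach |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | yep by the pretty sculpture | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | yes princess are you going to make me moan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | welp apparently he retired | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | havent | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |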
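
Because the vocabulary is large, the first rows shown above contain only zeros in the visible columns. The same construction on a tiny made-up corpus makes the non-zero cells visible. The two toy messages below are hypothetical and only illustrate the layout of the table.

```python
import pandas as pd

# Made-up, already-cleaned messages.
toy_messages = pd.Series(["free prize now", "see you now now"])

# Sorted so the column order is predictable in this small example.
toy_vocabulary = sorted({word for message in toy_messages for word in message.split()})

# One list of per-message counts for every word in the vocabulary.
toy_count_lists = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, message in enumerate(toy_messages):
    for word in message.split():
        toy_count_lists[word][index] += 1

toy_word_counts = pd.DataFrame(toy_count_lists)
print(toy_word_counts)
```

Each row corresponds to one message and each column to one word; the second message contains "now" twice, so that cell holds 2.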


***
### Calculate Probabilities

These are the formulae of the Naive Bayes algorithm that will be used to score how likely an SMS is to be spam or ham, given the words it contains.

$$
P(\text{Spam} \mid w_1, w_2, \ldots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})
$$

$$
P(\text{Ham} \mid w_1, w_2, \ldots, w_n) \propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})
$$


To calculate the probability of each word given spam and given ham, these formulae are used:

$$
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
$$

$$
P(w_i \mid \text{Ham}) = \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
$$

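In these formulae, $N_{w_i \mid \text{Spam}}$ is the number of times the word $w_i$ appears in spam messages, $N_{\text{Spam}}$ is the total number of words in spam messages, and $N_{\text{Vocabulary}}$ is the number of unique words in the training set; the ham terms are defined the same way. Some of these terms are constant for every new message, so they are calculated now.

α = 1 (Laplace smoothing) is used because the vocabulary is large and the dataset fairly small, so many words occur rarely or not at all in one of the classes. Without smoothing, a single word with a zero count would make the whole product zero.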
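
The effect of α can be checked on a tiny example. The sketch below uses made-up counts (not taken from the SMS data), and `smoothed_probability` is a hypothetical helper that just evaluates the formula above.

```python
def smoothed_probability(word_count, class_word_total, vocabulary_size, alpha):
    """Evaluate (N_word + alpha) / (N_class + alpha * N_vocabulary)."""
    return (word_count + alpha) / (class_word_total + alpha * vocabulary_size)

# Made-up counts of three vocabulary words inside one class.
toy_class_counts = {"free": 3, "prize": 1, "hello": 0}
toy_class_total = sum(toy_class_counts.values())   # 4 words in total
toy_vocabulary_size = len(toy_class_counts)        # 3 unique words

for alpha in (0, 1):
    probabilities = {
        word: smoothed_probability(count, toy_class_total, toy_vocabulary_size, alpha)
        for word, count in toy_class_counts.items()
    }
    print(alpha, probabilities, sum(probabilities.values()))
```

With α = 0 the unseen word "hello" gets probability 0, which would zero out any message containing it. With α = 1 it gets 1/7, and the probabilities over the vocabulary still sum to 1.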

In [12]:
alpha = 1

In [13]:
only_spam = training_data[training_data['Label']=='spam']
only_ham = training_data[training_data['Label']=='ham']

In [14]:
p_spam = only_spam['Label'].count()/training_data['Label'].count()

p_spam

0.13458950201884254

In [15]:
p_ham = only_ham['Label'].count()/training_data['Label'].count()

p_ham

0.8654104979811574

In [16]:
n_spam = 0

for message in only_spam['SMS']:
    n_spam += (len(message.split()))

n_spam

15190

In [17]:
n_ham = 0

for message in only_ham['SMS']:
    n_ham += (len(message.split()))

n_ham

57237

In [18]:
n_volcab = len(count_of_unique_words)

n_volcab

7783

In [19]:
p_w_spam_parameters = {unique_word: 0 for unique_word in unique_words_in_sms}

p_w_ham_parameters = {unique_word: 0 for unique_word in unique_words_in_sms}

## Creating counts of each unique word in SMS that are marked as spam
for message in only_spam['SMS']:
    list_of_words = message.split()

    for word in list_of_words:
        p_w_spam_parameters[word] +=1

## Creating the probabilities of each unique word in SMS appearing in spam.

for key, value in p_w_spam_parameters.items():
    p_w_spam_parameters[key] = (value + alpha)/(n_spam + alpha * n_volcab)

## Creating counts of each unique word in SMS that are marked as ham
for message in only_ham['SMS']:
    list_of_words = message.split()

    for word in list_of_words:
        p_w_ham_parameters[word] +=1    

## Creating the probabilities of each unique word in SMS appearing in ham.

for key, value in p_w_ham_parameters.items():
    p_w_ham_parameters[key] = (value + alpha)/(n_ham + alpha * n_volcab)


***
### Classifying a New SMS

All parameters are now calculated, so the spam filter can be built. The filter is a function that:

- Takes in as input a new message (w1, w2, ..., wn)
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn)
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:

    - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham

    - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam

    - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help

In [20]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    ## message words have been split into list
    spam_word_probabilities = 1
    for word in message:
        if word in p_w_spam_parameters:
            spam_word_probabilities *= p_w_spam_parameters[word]
    
    p_spam_given_message = p_spam* spam_word_probabilities

    ham_word_probabilities = 1
    for word in message:
        if word in p_w_ham_parameters:
            ham_word_probabilities *= p_w_ham_parameters[word]

    p_ham_given_message = p_ham* ham_word_probabilities

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    
    else:
        print('Equal probabilities, flag')
        return 'equal'

In [21]:
classify('Click the link now to claim your free prize')

'spam'

In [22]:
classify('Hi mate just nipping down for a couple. Want to join?')

'ham'

***
### Evaluating the Performance of the Filter

The test dataset of 1114 SMS will now be used against the filter to calculate its performance.

In [23]:
test_data.shape

(1114, 2)

In [24]:
test_data['spam_or_ham_filter'] = test_data['SMS'].apply(classify)

Equal probabilities, flag
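
The output above shows that at least one test message produced a tie. One possible cause, which is an assumption and has not been checked against this data, is floating-point underflow: multiplying many small word probabilities can round both scores down to exactly 0.0. A common workaround is to add logarithms instead of multiplying probabilities. The sketch below uses made-up probabilities, and `product_score` and `log_score` are hypothetical helpers.

```python
import math

def product_score(prior, word_probabilities):
    """Score as in the formulae: prior times the product of word probabilities."""
    score = prior
    for probability in word_probabilities:
        score *= probability
    return score

def log_score(prior, word_probabilities):
    """Same score in log space: log(prior) plus the sum of log probabilities."""
    return math.log(prior) + sum(math.log(p) for p in word_probabilities)

# Made-up values: a 100-word message whose words are all rare in both classes.
toy_spam_probabilities = [1e-5] * 100
toy_ham_probabilities = [2e-5] * 100

print(product_score(0.13, toy_spam_probabilities), product_score(0.87, toy_ham_probabilities))
print(log_score(0.13, toy_spam_probabilities), log_score(0.87, toy_ham_probabilities))
```

In this toy case both products underflow to 0.0 and would be reported as equal, while the log scores remain finite and still rank ham above spam. Since the logarithm is increasing, comparing log scores gives the same decision as comparing the products whenever the products are representable.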


In [25]:
count_correct_labels = 0

total = test_data.shape[0]
    
for row in test_data.iterrows():
    row = row[1]
    if row['Label'] == row['spam_or_ham_filter']:
        count_correct_labels += 1

percentage_correct = round(count_correct_labels/total*100,2)

print(f"The simple filter worked with a certain level of accuracy of {percentage_correct}% against the test data provided.\nThis equates to {count_correct_labels} correct classifications out of a total of {total} messages.\n\nFurther investigations could include looking at the messages that were mis-classified.\nIncreasing the volume of training data could also be considered.")

The simple filter worked with a certain level of accuracy of 98.74% against the test data provided.
This equates to 1100 correct classifications out of a total of 1114 messages.

Further investigations could include looking at the messages that were mis-classified.
Increasing the volume of training data could also be considered.
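
Since about 86.8% of the test messages are ham, a filter that always answered "ham" would already score about 86.8% accuracy, so accuracy alone says little about how spam is handled. Precision and recall for the spam class are one way to look closer. The sketch below runs on made-up labels (not the real test results); the column names match the ones used above, and everything else is hypothetical.

```python
import pandas as pd

# Made-up true labels and filter outputs for six messages.
toy_results = pd.DataFrame({
    "Label":              ["ham", "ham", "ham",  "spam", "spam", "ham"],
    "spam_or_ham_filter": ["ham", "ham", "spam", "spam", "ham",  "ham"],
})

# Accuracy without an explicit loop: share of rows where both columns agree.
accuracy = (toy_results["Label"] == toy_results["spam_or_ham_filter"]).mean()

predicted_spam = toy_results["spam_or_ham_filter"] == "spam"
actual_spam = toy_results["Label"] == "spam"
true_positives = (predicted_spam & actual_spam).sum()

# Precision: of the messages flagged as spam, how many really were spam.
precision = true_positives / predicted_spam.sum()
# Recall: of the real spam messages, how many were flagged.
recall = true_positives / actual_spam.sum()

# Rows are the true labels, columns are the filter's answers.
print(pd.crosstab(toy_results["Label"], toy_results["spam_or_ham_filter"]))
print(accuracy, precision, recall)
```

The cross-tabulation also shows directly which kind of mistake dominates: ham flagged as spam, or spam let through as ham.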


***
### Evaluating increasing training data by 10%

As a quick validation exercise, this section checks what effect (if any) increasing the training share from 80% to 90% of the total data has on the accuracy of the filter.

_n.b. increasing training data will reduce our testing data and this must be considered if comparing accuracy._

In [26]:
sms_collection_data_randomized = sms_collection_data.sample(frac=1, random_state=1)

training_test_index = round(len(sms_collection_data_randomized) * 0.9)

training_data = sms_collection_data_randomized[:training_test_index].reset_index(drop=True)
test_data = sms_collection_data_randomized[training_test_index:].reset_index(drop=True)

print("Training data has a shape of: ", training_data.shape)
print(training_data['Label'].value_counts() / len(training_data) * 100)

print("Test data has a shape of: ", test_data.shape)
print(test_data['Label'].value_counts() / len(test_data) * 100)

training_data['SMS'] = training_data['SMS'].replace(to_replace=r'\W', value=' ', regex=True)

training_data['SMS'] = training_data['SMS'].str.lower()

words_in_sms = []

for message in training_data['SMS']:
    list_of_words_in_sms = message.split()

    for word in list_of_words_in_sms:
        words_in_sms.append(word)

unique_words_in_sms = list(set(words_in_sms))

count_of_unique_words = {unique_word: [0] * len(training_data['SMS']) for unique_word in unique_words_in_sms}

for index, message in enumerate(training_data['SMS']):
    for word in message.split():
        count_of_unique_words[word][index] += 1
word_counts = pd.DataFrame(count_of_unique_words)
training_data_clean = pd.concat([training_data, word_counts], axis=1)
training_data_clean.head()

alpha = 1

only_spam = training_data[training_data['Label']=='spam']
only_ham = training_data[training_data['Label']=='ham']

p_spam = only_spam['Label'].count()/training_data['Label'].count()

p_ham = only_ham['Label'].count()/training_data['Label'].count()

n_spam = 0

for message in only_spam['SMS']:
    n_spam += (len(message.split()))

n_ham = 0

for message in only_ham['SMS']:
    n_ham += (len(message.split()))

n_volcab = len(count_of_unique_words)

p_w_spam_parameters = {unique_word: 0 for unique_word in unique_words_in_sms}

p_w_ham_parameters = {unique_word: 0 for unique_word in unique_words_in_sms}

## Creating counts of each unique word in SMS that are marked as spam
for message in only_spam['SMS']:
    list_of_words = message.split()

    for word in list_of_words:
        p_w_spam_parameters[word] +=1

## Creating the probabilities of each unique word in SMS appearing in spam.

for key, value in p_w_spam_parameters.items():
    p_w_spam_parameters[key] = (value + alpha)/(n_spam + alpha * n_volcab)

## Creating counts of each unique word in SMS that are marked as ham
for message in only_ham['SMS']:
    list_of_words = message.split()

    for word in list_of_words:
        p_w_ham_parameters[word] +=1    

## Creating the probabilities of each unique word in SMS appearing in ham.

for key, value in p_w_ham_parameters.items():
    p_w_ham_parameters[key] = (value + alpha)/(n_ham + alpha * n_volcab)

test_data['spam_or_ham_filter'] = test_data['SMS'].apply(classify)

count_correct_labels = 0

total = test_data.shape[0]
    
for row in test_data.iterrows():
    row = row[1]
    if row['Label'] == row['spam_or_ham_filter']:
        count_correct_labels += 1

percentage_correct = round(count_correct_labels/total*100,2)

print(f"The simple filter worked with a certain level of accuracy of {percentage_correct}% against the test data provided.\nThis equates to {count_correct_labels} correct classifications out of a total of {total} messages.")

Training data has a shape of:  (5015, 2)
Label
ham     86.66002
spam    13.33998
Name: count, dtype: float64
Test data has a shape of:  (557, 2)
Label
ham     85.996409
spam    14.003591
Name: count, dtype: float64
The simple filter worked with a certain level of accuracy of 99.28% against the test data provided.
This equates to 553 correct classifications out of a total of 557 messages.


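Accuracy rose from 98.74% (1100 of 1114) to 99.28% (553 of 557). This is consistent with more training data helping, but the two runs are scored on different test sets and the second is half the size, so a difference of a few messages is not strong evidence on its own.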
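
The second run repeats every cell by hand. One possible way to make such comparisons easier to repeat is to wrap training and classification in functions that take the data as arguments. The sketch below is not the code used above: `tokenize`, `train`, `classify_with` and the toy messages are hypothetical, although the smoothing formula and the decision rule follow the ones in this notebook.

```python
import re
from collections import Counter

def tokenize(message):
    """Same cleaning as above: non-word characters to spaces, lower-case, split."""
    return re.sub(r'\W', ' ', message).lower().split()

def train(messages, labels, alpha=1):
    """Estimate priors and smoothed word probabilities for 'spam' and 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    for message, label in zip(messages, labels):
        word_counts[label].update(tokenize(message))
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    label_counts = Counter(labels)
    model = {"prior": {}, "word_probability": {}}
    for label in ("spam", "ham"):
        denominator = sum(word_counts[label].values()) + alpha * len(vocabulary)
        model["prior"][label] = label_counts[label] / len(labels)
        model["word_probability"][label] = {
            word: (word_counts[label][word] + alpha) / denominator
            for word in vocabulary
        }
    return model

def classify_with(model, message):
    """Compare the two scores; words outside the vocabulary are skipped."""
    scores = {}
    for label in ("spam", "ham"):
        score = model["prior"][label]
        for word in tokenize(message):
            if word in model["word_probability"][label]:
                score *= model["word_probability"][label][word]
        scores[label] = score
    if scores["ham"] > scores["spam"]:
        return "ham"
    if scores["ham"] < scores["spam"]:
        return "spam"
    return "equal"

# Made-up training data.
toy_messages = ["win a free prize now", "free cash prize",
                "see you at lunch", "are you free for lunch"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_model = train(toy_messages, toy_labels)
print(classify_with(toy_model, "free prize"))
print(classify_with(toy_model, "see you at lunch"))
```

With functions like these, each split size only needs one call to `train` on its training rows followed by `classify_with` on its test rows, which removes the risk of a copied cell silently using parameters from an earlier run.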