# Building a Spam Filter with Naive Bayes

1. The **SMS Spam Collection** dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection#). 


2. The SMS Spam Collection v.1 is a public set of labeled SMS messages collected for mobile phone spam research. It consists of a single collection of 5,574 real, non-encoded English messages, each tagged as either legitimate (ham) or spam.

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
# let's read the dataset and study the data
file_loc = '/Users/sni/Documents/Python/Dataquest-Online-Courses-2022/Datasets/SMSSpamCollection.csv'
df = pd.read_csv(file_loc,sep='\t',header=None)
names=['Label','SMS']
df.columns=names
df.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
# let's get the basic information about the dataset
print(df.shape)
print('\n')
print(df['Label'].value_counts())
print('\n')
print(df['Label'].value_counts(normalize=True))

(5572, 2)


ham     4825
spam     747
Name: Label, dtype: int64


ham     0.865937
spam    0.134063
Name: Label, dtype: float64


Before we move on and create a Spam Filter, it's very helpful to first think of a way of testing how well it works. A good rule of thumb is that designing the test comes before creating the software. 

Once our Spam Filter is done, we need to test how good it is at classifying new messages. To test the spam filter, we are going to split our dataset into two categories:

- A **Training Set**, which we'll use to train the computer how to classify messages.

- A **Test Set**, which we'll use to test how good the spam filter is at classifying new messages.

We are going to keep 80% of our dataset for training, and 20% for testing.

- The training set will have 4,458 messages
- The testing set will have 1,114 messages

### For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%. In other words, we expect more than 80% of new messages to be classified correctly as spam or ham (non-spam).

In [90]:
'''let's randomize df, then split it into a training data set 
called training and a test data set called test'''

df_randomized = df.sample(frac=1,random_state=1)

training_rows = round(len(df_randomized)*0.8)

training = df_randomized[:training_rows].copy()
test = df_randomized[training_rows:].copy()

training.reset_index(drop=True,inplace=True)
test.reset_index(drop=True,inplace=True)

print(f'training dataset shape {training.shape}')
print(f'test dataset shape {test.shape}')

training dataset shape (4458, 2)
test dataset shape (1114, 2)


In [5]:
#Check the percentage split between spam and non-spam for each dataset
training['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [6]:
test['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

In [7]:
training.head(2)

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"


In [8]:
'''the dataset contains messages in different formats: some have capital
letters (which need to be converted to lower case) and some have 
punctuation (which needs to be removed), so let's first clean up
the dataset'''

# let's first transform every letter in every word to lower case
training['SMS'] = training['SMS'].str.lower()

# write a function that replaces every non-word character (punctuation)
# with a space using re.sub() from the regular expression module, 
# and then apply it to the 'SMS' series.
def sub_W(x):
    return re.sub(r'\W',' ',x)

training['SMS'] = training['SMS'].apply(sub_W)
training

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
...,...,...
4453,ham,sorry i ll call later in meeting any thing re...
4454,ham,babe i fucking love you too you know fuck...
4455,spam,u ve been selected to stay in 1 of 250 top bri...
4456,ham,hello my boytoy geeee i miss you already a...
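
To see what the two cleaning steps do to a single message, here is a small standalone sketch. It applies the same lowercasing and `\W` substitution to the first training message shown earlier; the variable names `raw`, `cleaned`, and `tokens` are illustrative only and are not used elsewhere in this notebook.

```python
import re

raw = "Yep, by the pretty sculpture"

# Lowercase first, then replace every non-word character with a space.
# \W matches anything other than letters, digits, and the underscore.
cleaned = re.sub(r'\W', ' ', raw.lower())

# The comma becomes a space, so a double space is left behind;
# str.split() with no arguments collapses runs of whitespace.
tokens = cleaned.split()

print(repr(cleaned))
print(tokens)
```

Apostrophes are replaced as well, which is why contractions such as "i'll" show up as "i ll" in the cleaned output above.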


To make the calculations easier, we want to bring the data to this format (the table below is a transformation of the table you see above):

![probability-pic-32](https://raw.githubusercontent.com/tongNJ/Dataquest-Online-Courses-2022/main/Pictures/probability-pic-32.png)
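
As a rough illustration of that layout (one row per message, one column per vocabulary word), the sketch below builds the same kind of count table for two made-up messages using only the standard library. The messages and the names `messages`, `vocab`, and `rows` are assumptions for this example, not data from the SMS collection.

```python
from collections import Counter

# Two made-up, already-cleaned messages (not taken from the dataset).
messages = [
    ["secret", "prize", "claim", "now"],
    ["coming", "to", "my", "secret", "party"],
]

# Sorted so the column order is reproducible.
vocab = sorted({word for sms in messages for word in sms})

# One row per message, one count per vocabulary word.
rows = []
for sms in messages:
    counts = Counter(sms)
    rows.append([counts[word] for word in vocab])

print(vocab)
for row in rows:
    print(row)
```

Each row has as many entries as there are vocabulary words, and most entries are zero, which is also what the real training table looks like.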

In [9]:
'''Create a vocabulary for the messages in the training set. 
The vocabulary should be a Python list containing all the unique 
words across all messages, where each word is represented as a string.'''

dirty_list = training['SMS'].str.split().sum()
clean_set = set(dirty_list)
vocabulary = list(clean_set)
vocabulary[:20]
print(f'there are total {len(vocabulary)} words in training vocabulary')

there are total 7783 words in training vocabulary


In [10]:
'''now let's create a dictionary that we will then convert to the 
DataFrame we need'''

word_counts_per_sms = {}

training['SMS'] = training['SMS'].str.split()
training.head()
# for i in dirty_list:
#     if word_counts_per_sms[i]

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [13]:
# convert vocabulary list to vocabulary dictionary
vocabulary_dict = {unique_word:[0]*len(training['SMS']) for unique_word in vocabulary}

for index,sms in enumerate(training['SMS']):
    for word in sms:
        vocabulary_dict[word][index] +=1

In [24]:
training_set=pd.DataFrame(vocabulary_dict)
print(training_set.shape)
training_set.head()


(4458, 7783)


Unnamed: 0,parents,08714712412,functions,arestaurant,discussed,cali,nottingham,usher,face,tog,...,paining,erupt,txt,show,exposed,possessive,requests,widelive,incredible,nimya
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
training_set_final = pd.concat([training,training_set],axis=1)

In [25]:
print(training_set_final.shape)
training_set_final.head(5)

(4458, 7785)


Unnamed: 0,Label,SMS,parents,08714712412,functions,arestaurant,discussed,cali,nottingham,usher,...,paining,erupt,txt,show,exposed,possessive,requests,widelive,incredible,nimya
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
#checking data
training_set_final.loc[training_set_final['yep']>0,['Label','SMS','yep']]

Unnamed: 0,Label,SMS,yep
0,ham,"[yep, by, the, pretty, sculpture]",1
875,ham,"[yep, get, with, the, program, you, re, slacking]",1
2203,ham,"[er, yep, sure, props]",1
2594,ham,"[lol, yep, did, that, yesterday, already, got,...",1
2764,ham,"[yep, i, do, like, the, pink, furniture, tho]",1
3096,ham,"[yep, then, is, fine, 7, 30, or, 8, 30, for, i...",1
3385,ham,"[oh, ok, i, didnt, know, what, you, meant, yep...",1
4145,ham,"[er, enjoyin, indians, at, the, mo, yep, sall,...",1
4221,ham,"[nimbomsons, yep, phone, knows, that, one, obv...",1


Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter. Recall that the Naive Bayes algorithm will need to know the probability values of the two equations below to be able to classify new messages:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) \\
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}


Also, to calculate P(wi|Spam) and P(wi|Ham) inside the formulas above, recall that we need to use these equations:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}


Some of the terms in the four equations above will have the same value for every new message. As a start, let's first calculate:

- P(Spam) and P(Ham)


- $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$

In [33]:
p_spam = training_set_final['Label'].value_counts(normalize=True)['spam']
p_ham = 1- p_spam
print(f'P(Spam) = {p_spam}')
print('\n')
print(f'P(Ham) = {p_ham}')
print('\n')

P(Spam) = 0.13458950201884254


P(Ham) = 0.8654104979811574




In [40]:
# now calculate N_spam, N_ham, and N_vocabulary
 
n_spam = len(training.loc[training['Label']=='spam','SMS'].sum())
n_ham = len(training.loc[training['Label']=='ham','SMS'].sum())
n_vocabulary = len(vocabulary)
alpha = 1
print(f'N_spam = {n_spam}')
print(f'N_ham = {n_ham}')
print(f'N_vocabulary = {n_vocabulary}')

N_spam = 15190
N_ham = 57237
N_vocabulary = 7783
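
To make the role of the smoothing term concrete, the sketch below plugs the three counts printed above into the smoothing equations for a word that never occurs in a given class. The helper `smoothed` is a hypothetical name introduced only for this example.

```python
# Values printed by the cell above.
n_spam = 15190
n_ham = 57237
n_vocabulary = 7783
alpha = 1  # Laplace smoothing

def smoothed(count, n_class):
    """P(w|class) with additive smoothing (hypothetical helper)."""
    return (count + alpha) / (n_class + alpha * n_vocabulary)

# A word with zero occurrences in a class still gets a small,
# non-zero probability instead of wiping out the whole product.
p_unseen_spam = smoothed(0, n_spam)  # 1 / 22973
p_unseen_ham = smoothed(0, n_ham)    # 1 / 65020

print(p_unseen_spam, p_unseen_ham)
```

Without the `alpha` term both values would be exactly 0, and a single unseen word would force the corresponding class probability to 0.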


In [68]:
# Now we will move on to calculate P(wi|Spam) and P(wi|Ham)
'''
Initialize two dictionaries that map each unique word in our 
vocabulary (a string) to an initial value of 0. We'll need one 
dictionary to store the parameters for P(wi|Spam), and the other 
for P(wi|Ham).
'''
p_word_given_spam = {each_word: 0 for each_word in vocabulary}
p_word_given_ham = p_word_given_spam.copy()

'''
Isolate the spam and the ham messages in the training set into two 
different DataFrames. The Label column will help you isolate the 
messages.
'''
spam_df = training_set_final[training_set_final['Label']=='spam'].copy()
ham_df = training_set_final[training_set_final['Label']=='ham'].copy()



In [63]:
spam_df.head()

Unnamed: 0,Label,SMS,parents,08714712412,functions,arestaurant,discussed,cali,nottingham,usher,...,paining,erupt,txt,show,exposed,possessive,requests,widelive,incredible,nimya
16,spam,"[freemsg, why, haven, t, you, replied, to, my,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18,spam,"[congrats, 2, mobile, 3g, videophones, r, your...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
56,spam,"[free, message, activate, your, 500, free, tex...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60,spam,"[call, from, 08702490080, tells, u, 2, call, 0...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
61,spam,"[someone, has, conacted, our, dating, service,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [77]:
# calculate the P(wi|Spam) parameter
n_spam_denominator = n_spam + n_vocabulary * alpha
# sum only the word-count columns (skip 'Label' and 'SMS')
spam_df_sum = spam_df.iloc[:, 2:].sum()
for w in p_word_given_spam:
    p_word_given_spam[w] = (spam_df_sum[w]+alpha)/n_spam_denominator


In [78]:
# calculate the P(wi|ham) parameter
n_ham_denominator = n_ham + n_vocabulary * alpha
# sum only the word-count columns (skip 'Label' and 'SMS')
ham_df_sum = ham_df.iloc[:, 2:].sum()
for w in p_word_given_ham:
    p_word_given_ham[w] = (ham_df_sum[w]+alpha)/n_ham_denominator

## Classifying A New Message

Now that we have all our parameters calculated, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn).

- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).

- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
  - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham
  - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam
  - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.

In [92]:
# let's build a spam filter
def classify(message):

    # apply the same cleaning steps used on the training set
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    '''    
    This is where we calculate P(Spam|message) and P(Ham|message)
    '''
    # start from the priors P(Spam) and P(Ham)
    p_spam_given_message = p_spam 
    p_ham_given_message = p_ham
    
    for word in message:
        # words outside the training vocabulary are ignored
        if word in p_word_given_spam:
            p_spam_given_message *= p_word_given_spam[word]
            p_ham_given_message *= p_word_given_ham[word]

#     print('P(Spam|message):', p_spam_given_message)
#     print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
#         print('Label: Ham')
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
#         print('Label: Spam')
        return 'spam'
    else:
#         print('Equal probabilities, have a human classify this!')
        return 'needs human classification'
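
One caveat about multiplying many small probabilities: for long messages the product can underflow to 0.0 in floating point, at which point both classes tie. A common workaround, which the function above does not use, is to compare sums of logarithms instead. The sketch below is a minimal illustration under that assumption; the `toy_` parameters are made-up values and `log_score` is a hypothetical helper, neither of which comes from the trained filter.

```python
import math

# Made-up parameters for illustration; not the values learned above.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_word_given_spam = {"free": 0.02, "prize": 0.01, "hello": 0.001}
toy_word_given_ham = {"free": 0.002, "prize": 0.0005, "hello": 0.01}

def log_score(words, prior, p_word_given_class):
    """Sum of log-probabilities; unknown words are skipped."""
    score = math.log(prior)
    for word in words:
        if word in p_word_given_class:
            score += math.log(p_word_given_class[word])
    return score

words = ["free", "prize", "unknownword"]
log_spam = log_score(words, toy_p_spam, toy_word_given_spam)
log_ham = log_score(words, toy_p_ham, toy_word_given_ham)

if log_ham > log_spam:
    label = "ham"
elif log_ham < log_spam:
    label = "spam"
else:
    label = "needs human classification"
print(label)

# Direct multiplication underflows for long inputs; the log sum stays finite.
underflowed = 0.001 ** 200
log_version = 200 * math.log(0.001)
print(underflowed, log_version)
```

Because the logarithm is monotonic, comparing the two sums gives the same decision as comparing the two products whenever the products have not underflowed.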

In [80]:
message = 'WINNER!! This is the secret code to unlock the money: C3421.'
classify(message)

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [81]:
message = 'Sounds good, Tom, then see u there'
classify(message)

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


In [102]:
# test['predicted'] = test['SMS'].apply(classify)
# test.head()
test_copy = test.copy()
test_copy['predicted'] = test_copy['SMS'].apply(classify)
test_copy.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [110]:
for index,row in test_copy.head().iterrows():
#     print(row['Label'])
    print(row['Label'])

ham
ham
spam
ham
ham


In [111]:
total = len(test_copy)
correct=0

for index,row in test_copy.iterrows():
    if row['Label'] == row['predicted']:
        correct +=1

print('Correct: ', correct)
print('Incorrect: ', total-correct)
print('Accuracy: ', correct/total)

Correct:  1100
Incorrect:  14
Accuracy:  0.9874326750448833
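
Since most messages are ham, it helps to read this accuracy next to a baseline that labels every message as ham. The sketch below recomputes both figures from numbers printed in this notebook. The test-set ham count is not printed anywhere above, so `ham_in_test` is an assumption reconstructed from the 0.868043 ham share of the 1,114 test messages.

```python
total = 1114
correct = 1100

# Assumption: reconstructed from the printed ham share of the test set.
ham_in_test = round(0.868043 * total)

accuracy = correct / total
baseline = ham_in_test / total  # accuracy of always answering 'ham'

print(f"model accuracy:      {accuracy:.4f}")
print(f"always-ham baseline: {baseline:.4f}")
```

With the DataFrame from the cell above, the same accuracy could likely be obtained in one line with `(test_copy['Label'] == test_copy['predicted']).mean()`.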
