# Building An SMS Spam Filter Using Multinomial Naive Bayes.

## Introduction:

In this project, we are going to be using the `Multinomial Naive Bayes` algorithm to classify SMS as either spam messages or non-spam messages. The `Multinomial Naive Bayes` algorithm is based on `Bayes' Theorem`; this theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event.
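For reference, Bayes' theorem and the way a Naive Bayes classifier applies it to a message can be written as below. The symbols $w_1, \ldots, w_n$ (the words of a message) are notation introduced here for illustration, not names used elsewhere in this project.

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

$$P(\text{Spam} \mid w_1, \ldots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})$$

$$P(\text{Ham} \mid w_1, \ldots, w_n) \propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})$$

A message is assigned to whichever class gives the larger value. The "naive" part is the assumption that the words are independent of each other given the class.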

Our goal in this project is to teach the computer to classify messages using the `Multinomial Naive Bayes` algorithm and a dataset of 5,572 SMS messages that have already been classified by humans. 
The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from <a href='https://archive.ics.uci.edu/ml/datasets/sms+spam+collection'>the UCI Machine Learning Repository</a>. 

Our algorithm should have an accuracy of at least 80% for us to be confident in it.
    

## Initial Data Exploration

In [1]:
import pandas as pd
import re

In [2]:
sms = pd.read_csv('SMSSpamCollection',
                 sep ='\t',
                 header = None,
                 names = ['Label', 'SMS']
                 )
sms.head() # displays the first 5 rows

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
sms.shape # displays the number of rows and columns in the dataset

(5572, 2)

In [4]:
sms['Label'].value_counts(normalize=True) * 100 # ( normalize = True) displays the frequencies as percentages

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

## Splitting Our Dataset Into Training and Test Datasets:

Nearly 87% of the messages in our dataset are ham (non-spam) messages and about 13% of them are spam messages. Now that we are familiar with our dataset, we are going to split it into a training set and a test set. 
We are going to keep 80% of our dataset for training and 20% for testing, because we want to train the algorithm on as much data as possible and still have enough data left to test it. The dataset has 5,572 messages, which means that:

* The training set will have 4,458 messages (about 80% of the dataset).
* The test set will have 1,114 messages (about 20% of the dataset).
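The split sizes above follow from simple arithmetic. The snippet below is a small check of that arithmetic; the variable names are chosen here for illustration and are not used elsewhere in the notebook.

```python
# Sizes for an 80/20 split of the 5,572 messages.
n_messages = 5572
train_size = round(n_messages * 0.8)   # 4457.6 rounds up to 4458
test_size = n_messages - train_size    # the remaining messages go to the test set

print(train_size, test_size)  # 4458 1114
```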

In [5]:
sample = sms.sample(frac=1, random_state=1) # randomizes the dataset
training_set = sample.iloc[0:4458].reset_index(drop=True) # (drop=True) drops old index after resetting index 
test_set = sample.iloc[4458:].reset_index(drop=True)

In [6]:
print(training_set.shape)

print(test_set.shape)

(4458, 2)
(1114, 2)


In [7]:
#getting the percentages of spam and non-spam messages in our training and test sets
print('training set:', '\n',
      training_set['Label'].value_counts(normalize=True) * 100
     )
print('\n')
print('test set:', '\n',
      test_set['Label'].value_counts(normalize=True) * 100
     )

training set: 
 ham     86.54105
spam    13.45895
Name: Label, dtype: float64


test set: 
 ham     86.804309
spam    13.195691
Name: Label, dtype: float64


The percentages of spam and non-spam messages in both our training and test sets are roughly equal to the percentages of spam and non-spam messages in our original `sms` dataset.

## Data Cleaning.
Before training the algorithm, we are going to clean our dataset to make it easier to calculate the probability of each individual word in it. We will transform the dataset so that every unique word gets its own column holding the number of times the word occurs in each message. The result is going to look like this:

<img src = 'https://dq-content.s3.amazonaws.com/433/cpgp_dataset_3.png'/>
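As a minimal sketch of the cleaning step, the helper below applies the same three operations used in this project (replace non-word characters with spaces, lowercase, split into words) to a single string. The function name `clean_message` and the first sample string are made up for illustration.

```python
import re

def clean_message(text):
    """Replace non-word characters with spaces, lowercase, and split into words."""
    return re.sub(r'\W', ' ', text).lower().split()

print(clean_message("SECRET prize! Claim secret prize now!!"))
# ['secret', 'prize', 'claim', 'secret', 'prize', 'now']

print(clean_message("There's a card"))
# ['there', 's', 'a', 'card']
```

Note that the apostrophe in "There's" is treated as punctuation, so the word is split into `there` and `s`.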

In [8]:
#before transformation
training_set.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [9]:
#removing punctuation (a raw string keeps \W from being read as a string escape)
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)

#turning every text to lower case
training_set['SMS'] = training_set['SMS'].str.lower()

In [10]:
training_set.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


In [11]:
#extracting every individual word in the sms column
training_set['SMS'] = training_set['SMS'].str.split()
vocabulary = []
for value in training_set['SMS']:
    for i in value:
        vocabulary.append(i)
vocabulary = set(vocabulary) # using the set function gets rid of duplicate values
vocabulary = list(vocabulary) 

len(vocabulary)

7783

## Final Transformation Of The Training Dataset.

There are 7,783 unique words in our vocabulary. We are going to create a dictionary with each unique word as a key and, as its value, a list holding the number of times the word occurs in each message. Then we are going to convert the dictionary to a pandas DataFrame and combine it with our `training_set` DataFrame.
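To make the shape of that dictionary concrete, here is a small sketch on two made-up, already tokenized messages. All `toy_` names are hypothetical and exist only for this illustration.

```python
# Two already-tokenized toy messages (made up for illustration).
toy_messages = [
    ['secret', 'prize', 'claim', 'secret', 'prize', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
]

# Unique words across all toy messages.
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list per word, with one counter slot per message.
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][index] += 1

print(toy_counts['secret'])  # [2, 1]
print(toy_counts['party'])   # [0, 1]
```

Each list becomes one column of the final DataFrame, and each position in the list corresponds to one message (one row).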

In [12]:
#creates a dictionary with every unique word in the vocabulary list
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, message in enumerate(training_set['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1

In [13]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,character,gas,prayers,09061213237,invaders,nor,representative,through,telling,88066,...,bcz,payed,sound,brats,wipro,approaching,turkeys,canteen,8552,ihave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
clean_training_set = pd.concat([training_set, word_counts], axis=1) # joins the dataset together
clean_training_set.head()

Unnamed: 0,Label,SMS,character,gas,prayers,09061213237,invaders,nor,representative,through,...,bcz,payed,sound,brats,wipro,approaching,turkeys,canteen,8552,ihave
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Probability Of Spam & Non-Spam Messages.
Here, we are going to do the following:

* Split our training dataset into datasets for ham and spam messages.
* Calculate the constants. To do this, we are going to:
  1. Calculate the probability of both ham and spam messages.
  2. Create a variable `alpha` with a value of 1.
  3. Calculate the number of words in both ham and spam messages.
  4. Calculate the number of words in the vocabulary.
  
* Calculate the parameters. This includes the probability of every word given that a message is either spam or ham. 
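Each parameter is computed with Laplace smoothing. For a word $w$, the spam parameter is

$$P(w \mid \text{Spam}) = \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}$$

and the ham parameter has the same form with the ham counts. The sketch below works through the formula with made-up counts; every `toy_` name and number is an assumption for illustration only. It uses `Fraction` so the results are exact.

```python
from fractions import Fraction

toy_alpha = 1          # Laplace smoothing constant, same value as in this project
toy_n_spam = 6         # made-up total number of words across all spam messages
toy_n_vocabulary = 8   # made-up vocabulary size

def toy_smoothed_probability(n_word_given_spam):
    """Smoothed estimate of P(word | Spam) for a word seen n_word_given_spam times in spam."""
    return Fraction(n_word_given_spam + toy_alpha,
                    toy_n_spam + toy_alpha * toy_n_vocabulary)

print(toy_smoothed_probability(2))  # 3/14
print(toy_smoothed_probability(0))  # 1/14
```

The second call shows why smoothing matters: a word that never occurs in spam still gets a small non-zero probability, so it cannot force the whole product to zero.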

### Calculating the constants:

In [15]:
spam_messages = clean_training_set[clean_training_set['Label']=='spam'].shape[0] # returns the number of rows with spam messages

ham_messages = clean_training_set[clean_training_set['Label']=='ham'].shape[0] # returns the number of rows with ham messages

total_messages = clean_training_set.shape[0]

probability_spam = spam_messages / total_messages
probability_ham = ham_messages / total_messages

print('probability of spam messages:', '\n', probability_spam)
print('\n')
print('probability of non-spam messages:', '\n', probability_ham)

probability of spam messages: 
 0.13458950201884254


probability of non-spam messages: 
 0.8654104979811574


In [16]:
spam_words = clean_training_set[clean_training_set['Label']=='spam']['SMS'].apply(len) # len function counts the individual words in the sms column
n_spam = spam_words.sum() # sums them to get the total amount of spam words in the SMS column

ham_words = clean_training_set[clean_training_set['Label']=='ham']['SMS'].apply(len)
n_ham = ham_words.sum()

n_vocabulary = len(vocabulary)
alpha = 1 #laplace smoothing

### Calculating the parameters:

In [17]:
spam_dict = {unique_word:0 for unique_word in vocabulary} # initializes  a dictionary with every word in the vocabulary list as a key.
ham_dict = {unique_word:0 for unique_word in vocabulary}

#creating new DataFrames for both Spam and Ham messages
spam_df = clean_training_set[clean_training_set['Label']=='spam'].copy() #using copy() method avoids SettingWithCopy Warning.
ham_df = clean_training_set[clean_training_set['Label']=='ham'].copy()

#calculating the probability for each word in spam messages
for word in vocabulary:
    n_word_spam_messages = spam_df[word].sum() # the number of times the word occurs in the spam DataFrame
    p_word_spam_messages = (n_word_spam_messages + alpha) / (n_spam + alpha * n_vocabulary)
    spam_dict[word] =  p_word_spam_messages # updates the dictionary values with the probability of each unique word
    

#calculating the probability for each word in ham messages
for word in vocabulary:
    n_word_ham_messages = ham_df[word].sum() # the number of times the word occurs in the ham DataFrame
    p_word_ham_messages = (n_word_ham_messages + alpha) / (n_ham + alpha * n_vocabulary)
    ham_dict[word] =  p_word_ham_messages # updates the dictionary values with the probability of each unique word
        


## Classifying Messages:

* First, we are going to write a `classify()` function that cleans a message, classifies it, and returns a readable verdict, and confirm that it can classify messages as spam or non-spam.
* Then we are going to write a variant, `classify_test_set()`, that returns plain labels, and use it on our `test_set` DataFrame.
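Before looking at the real functions, here is a toy version of the decision rule with made-up priors and word probabilities. All `toy_` names and numbers are hypothetical and only illustrate how the two products are built and compared.

```python
# Hypothetical priors and word probabilities, made up for illustration only.
toy_p_spam, toy_p_ham = 0.5, 0.5
toy_spam_probs = {'secret': 0.3, 'prize': 0.3, 'party': 0.05}
toy_ham_probs = {'secret': 0.1, 'prize': 0.05, 'party': 0.3}

def toy_classify(words):
    """Classify a list of already-cleaned words using the toy probabilities."""
    p_spam_given_message = toy_p_spam
    p_ham_given_message = toy_p_ham

    for word in words:
        # Words that are not in the vocabulary are skipped.
        if word in toy_spam_probs:
            p_spam_given_message *= toy_spam_probs[word]
        if word in toy_ham_probs:
            p_ham_given_message *= toy_ham_probs[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    return 'needs human classification'

print(toy_classify(['secret', 'prize', 'unknownword']))  # spam
print(toy_classify(['party']))                            # ham
print(toy_classify(['unknownword']))                      # needs human classification
```

In the last call no word is known, so both products stay equal to their priors, and because the toy priors are equal the function cannot decide.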

In [18]:
def classify(message):
    '''Takes in a message string, cleans it,
    and reports whether it is spam or not.'''

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = probability_spam
    p_ham_given_message = probability_ham

    for word in message:
        # multiplies the running spam probability by the probability of the word given spam
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]

        # multiplies the running ham probability by the probability of the word given ham
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'This is not a spam message.'
    elif p_ham_given_message < p_spam_given_message:
        return 'This is a spam message.'
    else:
        return 'Unsure, have a human classify this!'

In [19]:
# Testing the function
test_message1 = 'WINNER!! This is the secret code to unlock the money: C3421.'
test_message2 = "Sounds good, Tom, then see u there"

classify(test_message1)
print('\n')
classify(test_message2)





'This is not a spam message.'

In [20]:
def classify_test_set(message):
    '''Same as classify(), but returns the labels
    used in the dataset ('ham' or 'spam').'''

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = probability_spam
    p_ham_given_message = probability_ham

    for word in message:
        # multiplies the running spam probability by the probability of the word given spam
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]

        # multiplies the running ham probability by the probability of the word given ham
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [21]:
#applying the classify_test_set function to the test_set DataFrame
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


## Calculating The Accuracy Of The Algorithm.
To calculate the accuracy of the algorithm, we are going to do the following:

* Initialize a variable `correct` with a value of 0 and a variable `total` with the total number of messages in the test set.
* Loop through our `test_set` DataFrame and increment `correct` by 1 whenever the `Label` column is the same as the `predicted` column.
* Finally, divide `correct` by `total` to get our accuracy score.
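The steps above can be sketched on a handful of made-up labels. The `toy_` lists below are hypothetical and stand in for the `Label` and `predicted` columns.

```python
# Made-up true labels and predictions, for illustration only.
toy_labels    = ['ham', 'spam', 'ham', 'ham']
toy_predicted = ['ham', 'spam', 'spam', 'ham']

# Count the positions where the prediction matches the true label.
toy_correct = sum(1 for label, prediction in zip(toy_labels, toy_predicted)
                  if label == prediction)
toy_total = len(toy_labels)

toy_accuracy = toy_correct / toy_total * 100
print(toy_accuracy)  # 75.0
```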

In [22]:
# labels each row as 'correct' if the prediction matches the label, otherwise 'incorrect'
def classification(row):
    if row['Label'] != row['predicted']:
        return 'incorrect'
    else:
        return 'correct'

In [23]:
# creates a new column that shows the classification of each row
test_set['classification'] = test_set.apply(classification, axis=1)

test_set.head()

Unnamed: 0,Label,SMS,predicted,classification
0,ham,Later i guess. I needa do mcat study too.,ham,correct
1,ham,But i haf enuff space got like 4 mb...,ham,correct
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam,correct
3,ham,All sounds good. Fingers . Makes it difficult ...,ham,correct
4,ham,"All done, all handed in. Don't know if mega sh...",ham,correct


In [24]:
# selecting the rows that were not classified correctly
incorrect = test_set[test_set['classification'] == 'incorrect']
incorrect

Unnamed: 0,Label,SMS,predicted,classification
114,spam,Not heard from U4 a while. Call me now am here...,ham,incorrect
135,spam,More people are dogging in your area now. Call...,ham,incorrect
152,ham,Unlimited texts. Limited minutes.,spam,incorrect
159,ham,26th OF JULY,spam,incorrect
284,ham,Nokia phone is lovly..,spam,incorrect
293,ham,A Boy loved a gal. He propsd bt she didnt mind...,needs human classification,incorrect
302,ham,No calls..messages..missed calls,spam,incorrect
319,ham,We have sent JD for Customer Service cum Accou...,spam,incorrect
504,spam,Oh my god! I've found your number again! I'm s...,ham,incorrect
546,spam,"Hi babe its Chloe, how r u? I was smashed on s...",ham,incorrect


In [25]:
# getting the number of incorrectly classified messages
incorrect.shape[0] - 1 # subtract one to account for the message that needs human classification

13

In [26]:
# calculating the accuracy of the algorithm
correct = 0
total = test_set.shape[0]

for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1

accuracy = correct / total * 100
accuracy

98.74326750448833

Judging by the accuracy score, our algorithm is performing well. We managed to classify 98.7% of the messages in the test set correctly.

## Conclusion:
We wanted to create an algorithm that classifies messages as either spam or non-spam using the multinomial Naive Bayes algorithm. We were able to do that, and we reached an accuracy of 98.7% on the test set, well above our 80% target. This makes us confident that the algorithm would do well on similar messages in the real world.
Although our algorithm is doing quite well, it misclassified 13 messages and left one more for human classification. There are a few things we could do to improve it: 
1. We could make the algorithm case-sensitive to see if that increases the accuracy.
2. We could calculate the probability of certain phrases appearing in spam messages instead of just individual words.
3. We could collect more data on SMS classification and retrain the model.

In [27]:
import gradio as gr

In [28]:
interface = gr.Interface(fn=classify,
                         inputs = gr.Textbox(lines=10,
                                             placeholder = 'enter your message here....'),
                         outputs='text',
                         description = 'If your message is wrongly classified, please click the flag button.'
                        )

In [30]:
interface.launch()

Rerunning server... use `close()` to stop if you need to change `launch()` parameters.
----
Running on local URL:  http://127.0.0.1:7860/

To create a public link, set `share=True` in `launch()`.


(<gradio.routes.App at 0x16201218940>, 'http://127.0.0.1:7860/', None)