# SMS Spam Filter Using Probability Theory

This project builds an SMS spam filter based on the Naive Bayes algorithm. The algorithm learns from a dataset of SMS messages that have already been flagged as spam or non-spam (ham). Based on what it learns from that data, it classifies new messages as spam or ham.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). You can also download the dataset directly [from this link](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection). The data collection process is described in more detail on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

In [None]:
#loading the necessary modules and loading the data
import pandas as pd
import numpy as np

# reading in the data with ISO-8859-1 encoding; utf-8 throws an error.
messages = pd.read_csv('../input/sms-spam-collection-dataset/spam.csv',encoding='ISO-8859-1')

# removing two unnecessary cols
messages = messages[['v1', 'v2']]
messages = messages.rename({'v1': 'Label',
                           'v2': 'SMS'}, axis=1)
messages.head()

Now we have made the columns easy to tell apart by renaming them to `Label` and `SMS`.

Our dataset contains 2 columns and 5,572 rows, and it has no NA values, so no missing-value handling is needed.
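
Claims like these are easy to check with `shape` and `isna()`. The sketch below uses a small hypothetical stand-in DataFrame called `sample` so that it runs on its own; on the real data the same calls would be made on `messages`.

```python
import pandas as pd

# Hypothetical stand-in for `messages`; the real dataset has 5,572 rows.
sample = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham'],
    'SMS': ['See you at 5', 'WIN a free prize now', 'Ok lar'],
})

n_rows, n_cols = sample.shape        # (number of rows, number of columns)
na_per_column = sample.isna().sum()  # count of missing values in each column

print(n_rows, n_cols)
print(na_per_column)
```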

**Let's find out the proportion of spam vs ham messages**

In [None]:
round(messages['Label'].value_counts(normalize=True)*100)

We can see that **87%** of messages are non-spam, and **13%** are flagged as spam. These are the prior probabilities that will be useful in our calculations.

## Separating the dataset into train and test

It is important to think about how we will test the algorithm before we create it. A good approach is to split the data into *train* and *test* sets so that:
- The train dataset is used to train the model
- The test dataset is used to measure how good our model actually is

For training we will use about **80%** of the data (**4,458 messages**), and the remaining **20%** (**1,114 messages**) will be used for testing.

### Our final goal

Our goal is to achieve 80%+ accuracy with our model.

We will split the data into two parts, but first we will randomize the dataset so that spam and ham messages are spread across both parts.

In [None]:
#randomizing the dataset before we split. 
messages = messages.sample(frac=1, random_state=1)

#importing display so that one cell can show several tables
from IPython.display import display

#separating train 80%, and test 20%
train = messages.iloc[:4458, :]
test = messages.iloc[4458:, :]

display(round(test['Label'].value_counts(normalize=True)*100))
display(round(train['Label'].value_counts(normalize=True)*100))

We successfully split the dataset, and both resulting sets keep the same rounded proportions of spam/ham messages as the full dataset.

## Training the Algorithm on *train* dataset

### Data cleaning

Before we start building our algorithm, we need to clean our data to make the calculations easier.
Here are the things our model needs:
- A vocabulary: all the unique words in our entire train data
- The number of times each word occurs in each message

To get these word counts, we will create one column per unique word, and each row will show how many times each word occurs in that message.

First, we need to remove the punctuation from all messages, as we don't need it for the analysis.

In [None]:
# Removing punctuation from messages.
# The regex \W matches any character that is not a letter, a digit, or an
# underscore; every match is replaced with a space.
pd.options.mode.chained_assignment = None

# regex=True is needed: since pandas 2.0, str.replace treats the pattern
# as a literal string by default.
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)
train['SMS'] = train['SMS'].str.lower()

We have removed the punctuation and made all letters lowercase.
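
To see what this cleaning does to a single message, here is a minimal sketch with Python's `re` module. The message in `raw` is a made-up example, not a row from the dataset. Note that `\W` leaves underscores in place, because it treats them as word characters.

```python
import re

# Hypothetical example message, not taken from the dataset.
raw = "WINNER!! Claim your_prize now: call 0800-123."

# Replace every non-word character with a space, lowercase, then split on whitespace.
cleaned = re.sub(r'\W', ' ', raw).lower()
tokens = cleaned.split()

print(tokens)
# ['winner', 'claim', 'your_prize', 'now', 'call', '0800', '123']
```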

Now it is time to transform our dataset so that each column represents one unique word in our vocabulary.

In [None]:
#creating a vocabulary containing all unique words from all messages.
vocabulary = []

for sms in train['SMS'].str.split():
    for word in sms:
        vocabulary.append(word)

vocabulary = list(set(vocabulary))
len(vocabulary)

Splitting each message into words, converting the combined list to a set to remove duplicates, and converting it back to a list gave us a list of unique words.

### Transforming the train dataset into word count of each unique word for each sms/row

The below code will help create a frequency dictionary of each row in our train dataset.

In [None]:
#Building a dictionary that maps each word to a list of its counts, one entry per SMS.
word_counts_per_sms = {unique_word: [0] * (len(train['SMS'])) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS'].str.split()):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
# transforming dictionary to a dataframe.
final_training = pd.DataFrame(word_counts_per_sms)

We have created a frequency count of each word for every message in our train dataset. To see the data better, we will combine this new dataframe with our existing train dataset. The resulting dataset will be called **final_training_set**.

In [None]:
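#concatenating the train and final_training datasets.
#train still carries its shuffled index, so we reset it to line up row by row with final_training.
dfs = [train.reset_index(drop=True), final_training]
final_training_set = pd.concat(dfs, axis=1)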
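
One thing to watch for here: `pd.concat` with `axis=1` matches rows by index label, not by position. After shuffling, the train data carries index labels that differ from the fresh `0..n-1` index of a newly built dataframe. The sketch below uses two hypothetical toy frames, `left` and `right`, to show the difference.

```python
import pandas as pd

# Hypothetical toy frames: `left` keeps shuffled index labels, `right` has 0..n-1.
left = pd.DataFrame({'SMS': ['a', 'b', 'c']}, index=[5, 0, 1])
right = pd.DataFrame({'count': [10, 20, 30]})

# Matched by index label: unmatched labels produce extra rows filled with NaN.
by_label = pd.concat([left, right], axis=1)

# Matched by position: reset the shuffled index first.
by_position = pd.concat([left.reset_index(drop=True), right], axis=1)

print(by_label)
print(by_position)
```

In `by_label`, the message `'a'` (label 5) has no count and the count `30` (label 2) has no message, while `by_position` pairs the rows in order.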

### Starting the algorithm

Now that we have our training dataset ready, we can start building our model based on the Multinomial Naive Bayes algorithm.

The formulas for the Multinomial Naive Bayes algorithm are the following.

![P(Spam|Words)](https://i.ibb.co/jhVKFb2/p-spam-given-words.png)
![P(Ham|Words)](https://i.ibb.co/tssQ9Pm/p-ham-given-words.png)

We will compare the result of the first formula to the second, and
- **if P(Spam|Words) > P(Ham|Words)**, the message will be marked **spam**
- **if P(Spam|Words) < P(Ham|Words)**, the message will be marked **ham**
- **if P(Spam|Words) = P(Ham|Words)**, we will need **human judgement** for these messages, as the algorithm could not classify them
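
For reference, the two quantities being compared can be written out in text form. This is a sketch that assumes the images above show the usual multinomial Naive Bayes relations, which is also what the `classify` function later in this notebook computes: the class prior multiplied by the conditional probability of each word in the message.

```latex
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})

P(\text{Ham} \mid w_1, \dots, w_n) \propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})
```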

In [None]:
# finding probabilities of spam and ham messages
p_ham_spam = round(final_training_set['Label'].value_counts(normalize=True),2)
p_ham = p_ham_spam['ham']
p_spam = p_ham_spam['spam']


#separating ham and spam messages
ham_messages = final_training_set.loc[final_training_set['Label']=='ham', 'SMS']
spam_messages = final_training_set.loc[final_training_set['Label']=='spam', 'SMS']


#counting number of words for ham and spam messages separately
n_ham = 0
n_spam = 0


for message in ham_messages:
    message = str(message)
    message = message.split()
    n_ham += len(message)

for message in spam_messages:
    message = str(message)
    message = message.split()
    n_spam += len(message)
    
# Laplace smoothing with a = 1
a = 1


# number of words in vocabulary
n_vocabulary = len(vocabulary)

Above, we calculated constants that will be used in our model. Now we will calculate two important probabilities.

- **P of a word given spam message**
- **P of a word given ham message**

Here is the formula for calculating these values for each word.

![p_words_given_ham_spam](https://i.ibb.co/yNgZ4QC/p-words-given-ham-spam.png)

- **N(w|spam)** - number of times the word appears in spam messages
- **N(w|ham)** - number of times the word appears in ham messages
- **N(spam)** - total number of words in spam messages
- **N(ham)** - total number of words in ham messages
- **a** - the alpha parameter for Laplace smoothing

[Laplace/additive smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) is a technique used to avoid zero probabilities, as some words might appear in only one of the two categories. Without it, a single such word would force the whole product for the other category to 0.
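
A small worked example shows the effect. The counts below are hypothetical and chosen only for illustration; the names mirror the constants used in this notebook.

```python
# Hypothetical counts, for illustration only.
alpha = 1
n_spam = 100         # total number of words in spam messages
n_vocabulary = 50    # number of unique words in the vocabulary
n_word_in_spam = 0   # the word never appears in a spam message

# Without smoothing the probability is exactly zero.
unsmoothed = n_word_in_spam / n_spam

# With smoothing the word gets a small but non-zero probability: 1 / 150.
smoothed = (n_word_in_spam + alpha) / (n_spam + alpha * n_vocabulary)

print(unsmoothed, smoothed)
```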

In [None]:
# we will create two dictionaries to store probabilities for each type of message
p_word_given_spam = {unique_word:0 for unique_word in vocabulary} 
p_word_given_ham = {unique_word:0 for unique_word in vocabulary} 


# we will create a counter of each type of word using collections.Counter object
import collections


spam_words = []
ham_words = []
for message in ham_messages.str.split():
    for word in message:
        ham_words.append(word)
for message in spam_messages.str.split():
    for word in message:
        spam_words.append(word)
     
    
# Now that we have a list of all spam/ham words, we will create a dictionary of their count. 
ham_word_count = collections.Counter(ham_words)
spam_word_count = collections.Counter(spam_words)

for word in vocabulary:
    n_ham_word = ham_word_count[word]
    n_spam_word = spam_word_count[word]
    
    p_word_given_spam[word] = (n_spam_word + a) / (n_spam + a * n_vocabulary)
    p_word_given_ham[word] = (n_ham_word + a) / (n_ham + a * n_vocabulary)

## Classifying a new Message

We are ready to write our final algorithm and function that classifies a new message as spam or ham.

The function takes a string as input and cleans it in the same way as the training messages. It then calculates the probabilities for both spam and ham and classifies the message according to which one is larger.

In [None]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    
    # calculation of probabilities for spam and ham
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    
    # multiply in the probability of each known word; words outside the vocabulary are skipped
    for word in message:
        if word in p_word_given_spam:
            p_spam_given_message *= p_word_given_spam[word]
        if word in p_word_given_ham:
            p_ham_given_message *= p_word_given_ham[word]
    

    #classification based on comparison results
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification!'

We have created the classify function. Now we will apply it to two custom messages. Note that these messages are deliberately easy to recognize as spam or ham, for testing purposes only.

In [None]:
message_1 = 'WINNER!! This is the secret code to unlock the money: C3421.'
message_2 = "Sounds good, Tom, then see u there"
classify(message_1), classify(message_2)
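
Because `classify` multiplies many probabilities that are each far below 1, the product for a very long message can shrink until it underflows to `0.0` for both classes. A common workaround is to add logarithms instead of multiplying. The sketch below is not part of the original project: `classify_log` and the `toy_` dictionaries are hypothetical names with made-up values, used only to keep the example self-contained.

```python
import math
import re

# Hypothetical toy parameters; in the notebook these come from the training step.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_p_word_given_spam = {'win': 0.05, 'money': 0.04, 'see': 0.001}
toy_p_word_given_ham = {'win': 0.001, 'money': 0.002, 'see': 0.03}

def classify_log(message):
    """Compare sums of log-probabilities instead of raw products."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in toy_p_word_given_spam:
            log_spam += math.log(toy_p_word_given_spam[word])
        if word in toy_p_word_given_ham:
            log_ham += math.log(toy_p_word_given_ham[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification!'

print(classify_log('Win money!'), classify_log('See you'))
# spam ham
```

Since the logarithm is an increasing function, comparing the log-scores gives the same decision as comparing the products whenever the products are representable.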

## Applying our function on test data

Our algorithm is now ready to run on the test dataset. It has never seen the test messages, so from its perspective they are new messages.

We will create a new column in the test dataset called **'predicted'** that stores the output of our function. After that, we can compare our predictions with the human-assigned labels.

In [None]:
# applying our function to our test dataset
test['predicted'] = test['SMS'].apply(classify)
test.head()

### Checking accuracy of our function

In [None]:
# we will check the accuracy of our model.
correct = 0
total = len(test)

# access columns by name; positional access such as row[1][0] is deprecated in recent pandas
for _, row in test.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1
accuracy = round((correct / total) * 100, 2)
accuracy
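
Accuracy alone can be flattering on this dataset: since about 87% of messages are ham, a filter that always answers "ham" would already be about 87% accurate. Looking at spam separately gives a fuller picture. The sketch below uses short hypothetical lists, `actual` and `predicted`, rather than the real test results.

```python
# Hypothetical labels and predictions, for illustration only.
actual = ['ham', 'ham', 'spam', 'spam', 'ham', 'spam']
predicted = ['ham', 'ham', 'spam', 'ham', 'ham', 'spam']

pairs = list(zip(actual, predicted))
true_spam = sum(a == 'spam' and p == 'spam' for a, p in pairs)    # spam caught
false_spam = sum(a == 'ham' and p == 'spam' for a, p in pairs)    # ham wrongly flagged
missed_spam = sum(a == 'spam' and p != 'spam' for a, p in pairs)  # spam let through

accuracy = sum(a == p for a, p in pairs) / len(pairs)
recall = true_spam / (true_spam + missed_spam)     # share of real spam that was caught
precision = true_spam / (true_spam + false_spam)   # share of spam flags that were right

print(accuracy, recall, precision)
```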

### Final Result

Our model classifies the test messages with 98.74% accuracy, which is well above our 80% goal and much higher than I expected. This was a fun project to work on, thank you.