# Building an SMS Spam Filter with Naive Bayes

## Introduction
This work is an exercise from the DataQuest machine learning course. The aim is to build a good SMS spam classifier using the Naive Bayes algorithm and the dataset available at [this url](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition).

**Note for Portuguese speakers reading this work:** I did not build a spam classifier for messages written in Portuguese because the **DataQuest** course is American and works with English-language content. Even so, the logic of the code used in this project can produce a spam filter in any language, provided it is given a good dataset.

You can learn more about this method [here](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) or [here in Portuguese](https://www.organicadigital.com/blog/algoritmo-de-classificacao-naive-bayes/).

## Conclusions
In this work, we build an SMS spam classifier with 98.74% accuracy and an F1 score of about 95.5% using the Multinomial Naive Bayes algorithm. The method is simple and requires only concepts that are easy to learn.


## Importing Libraries

In [1]:
# Essential libraries
import pandas as pd
import numpy as np
import regex as re
#Graphical Libraries
import plotly.express as px
import plotly.graph_objects as pyo
import plotly.figure_factory as ff
import matplotlib.pyplot as plt
import seaborn as sns





In [0]:
#Importing the data set
df = pd.read_table('/content/drive/My Drive/Colab Notebooks/SMSSpamCollection.txt',header = None,names = ['spam_status','SMS_msg'])

In [3]:
df.head(10)

| | spam_status | SMS_msg |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [4]:
df['spam_status'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: spam_status, dtype: float64

## Splitting Train and Test Sets

After designing a spam classifier, it is essential to assess its performance on completely new data. We can do this by splitting the dataframe into two unequal parts.

The first part is the **training set**. We use it to calculate the parameters of the Naive Bayes algorithm (the prior and conditional probabilities). The second part is the **test set**: we feed its messages to our designed classifier and evaluate how well it classifies unseen messages.

This is important in practice because a classifier that performs well in the design phase but poorly during testing is a bad system!

Think of it as a car that was designed under controlled conditions and performs really well in that scenario. But when it goes out on the streets, it crashes rapidly! Would you like such a car?
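
To make the idea concrete, here is a minimal plain-Python sketch of a shuffled 80/20 split. The helper name `shuffled_split` and the toy messages are made up for illustration; the notebook itself uses `df.sample` from pandas, and the shuffle order produced here will not match it.

```python
import random

def shuffled_split(rows, train_fraction=0.8, seed=1):
    """Shuffle the rows reproducibly, then cut them into train and test parts."""
    rows = list(rows)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)      # fixed seed makes the split repeatable
    split_idx = round(len(rows) * train_fraction)
    return rows[:split_idx], rows[split_idx:]

# Toy labelled messages (hypothetical data, not from the SMS collection)
messages = [("ham", f"msg {i}") for i in range(8)] + [("spam", f"offer {i}") for i in range(2)]

train_rows, test_rows = shuffled_split(messages)
print(len(train_rows), len(test_rows))  # 8 2
```

Every message ends up in exactly one of the two parts, so the test set really is unseen during training.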

In [0]:
# Randomly shuffling the DataFrame and splitting data
df_sorted = df.sample(frac=1,random_state = 1) #shuffling the dataframe
split_idx = round(len(df)*0.8) #Index to split the data frame 80% train
 
train_df = df_sorted.iloc[0:split_idx] #Training dataframe
test_df = df_sorted.iloc[split_idx:]   #Test dataframe

In [6]:
#Checking the proportion of Spam and Non Spam messages
train_df['spam_status'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: spam_status, dtype: float64

The output shows that the proportion of spam to non-spam messages in the training set is about the same as in the entire data set.

## Data Cleaning in the message column

In [0]:
# #testing scikitlearn
# from sklearn.feature_extraction.text import CountVectorizer
# x = train_df.message
# y= train_df.spam_status
# vectorize = CountVectorizer()
# a = vectorize.fit_transform(x)
# len(vectorize.get_feature_names())

In order to build the classifier, a bit of data cleaning is necessary! Here we want a classifier that takes words as input and outputs the probabilities of the message being spam or not.

If we take a look at the table, we can see that some messages contain punctuation symbols, which are characters we will not use as features to build the classifier. So, the first step is to strip those characters from the `SMS_msg` column and convert the text to lowercase.
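
As a small illustration of this cleaning step, the sketch below applies the same `\W` pattern to a single string with Python's standard `re` module. The helper name `clean_message` and the sample message are made up for this example.

```python
import re

def clean_message(text):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r"\W", " ", text).lower()

# Hypothetical message, written only for this example
raw = "Free entry!! Txt WIN_NOW to 80086, it's 100% FREE."
cleaned = clean_message(raw)
print(cleaned.split())
# ['free', 'entry', 'txt', 'win_now', 'to', '80086', 'it', 's', '100', 'free']
```

Note that `\W` does not match the underscore, so tokens made of underscores survive the cleaning.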

In [0]:
train_df = train_df.copy()
test_df = test_df.copy()
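# regex=True is needed in pandas >= 2.0, where str.replace treats the pattern as a literal string by default
train_df['SMS_msg'] = train_df['SMS_msg'].str.replace(r'\W', ' ', regex=True).str.lower()
#test_df['SMS_msg'] = test_df['SMS_msg'].str.replace(r'\W', ' ', regex=True).str.lower()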

## Building the vocabulary

The vocabulary will be useful to calculate the total number of unique words in our data set. In order to do this, we must first split each message in the `SMS_msg` column into words on whitespace.
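
The idea can be sketched on a few toy messages using only built-in Python. The messages and variable names below are invented for illustration and are not part of the data set.

```python
# Toy, already-cleaned messages (hypothetical)
toy_messages = ["free prize now", "see you now", "free free entry"]

# Flatten every message into one long list of words
all_words = [word for message in toy_messages for word in message.split()]

# A set keeps each word once; sorting only makes the output deterministic
unique_words = sorted(set(all_words))

print(len(all_words), len(unique_words))  # 9 6
print(unique_words)  # ['entry', 'free', 'now', 'prize', 'see', 'you']
```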

In [0]:
#splitting each message into words on whitespace
vocabulary = []
separate_words = train_df['SMS_msg'].str.split()

In [0]:
# Building a nested loop to append every word from every row of the 'SMS_msg' column
for i in separate_words:
    for j in i:
        vocabulary.append(j)

In [11]:
# Using the set function to eliminate duplicate entries - A set is defined as a collection of non repeated objects
unique_vocabulary = set(vocabulary)
print('the size of the list with all words is',len(vocabulary))
print('\n')
print('the size of the set with all unique words is',len(unique_vocabulary))

the size of the list with all words is 72427


the size of the set with all unique words is 7783


In [12]:
unique_vocabulary = list(unique_vocabulary)
unique_vocabulary = [x for x in unique_vocabulary if x != " "]
len(unique_vocabulary)

7783

## Creating a dataframe to be used as a supervised learning problem

In [0]:
#Initialize a dictionary that maps each unique word to a list of zeros with one entry per row of the training set
word_counts_per_sms = {unique_word: [0] * len(train_df['SMS_msg']) for unique_word in unique_vocabulary}

In [0]:
# Very important: break each message string into its separate words
train_df['SMS_msg'] = train_df['SMS_msg'].str.split()

In [0]:
for index,sms in enumerate(train_df['SMS_msg']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
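
To see what this loop produces, here is the same counting logic on three toy messages. The messages and the names `toy_tokens`, `toy_vocab` and `toy_counts` are assumptions made for this sketch.

```python
# Toy tokenized messages (hypothetical)
toy_tokens = [["free", "prize", "free"], ["see", "you"], ["free", "you"]]
toy_vocab = sorted({word for tokens in toy_tokens for word in tokens})

# One list of zeros per word, with one slot per message
toy_counts = {word: [0] * len(toy_tokens) for word in toy_vocab}

# Slot i of each list counts how often the word occurs in message i
for index, tokens in enumerate(toy_tokens):
    for word in tokens:
        toy_counts[word][index] += 1

print(toy_counts)
# {'free': [2, 0, 1], 'prize': [1, 0, 0], 'see': [0, 1, 0], 'you': [0, 1, 1]}
```

Each dictionary key becomes one column of the feature table, and each list position corresponds to one message.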

In [0]:
# Transforming the features into a DataFrame
feature_df = pd.DataFrame(word_counts_per_sms)

In [0]:
train_df.reset_index(inplace=True)

In [0]:
#Concatenating with the original training data set
new_train = pd.concat([train_df,feature_df],join='outer',axis=1)

In [19]:
new_train.head(5)

Unnamed: 0,index,spam_status,SMS_msg,bless,program,duchess,aretaking,lacs,evil,08715705022,timi,nevering,mesages,fassyole,bawling,sachin,sharing,ability,60,kiosk,____,contribute,voda,find,tonexs,yavnt,blowing,replied,zac,brownies,smoking,ay,mys,bao,fellow,amongst,2nd,role,valentines,hol,...,against,msg150p,cheating,2wks,inshah,bad,08712103738,revealed,09066364349,smsco,mojibiola,locks,perfume,meg,eatin,rumour,fats,fridge,wit,gonnamissu,pass,tallahassee,wewa,nd,opinion,your,sitting,fathima,success,neva,darling,grace,69669,school,subtoitles,banter,laughing,espe,situations,apologize
0,1078,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,4028,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,958,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,4642,ham,[havent],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,4674,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Estimating the prior Probabilities

In [0]:
p_ham = new_train['spam_status'].value_counts(normalize=True)['ham'] # Prior probability of a non-spam (ham) message
p_spam = new_train['spam_status'].value_counts(normalize=True)['spam'] # Prior probability of a spam message

# Calculating the number of words in spam messages
spam_df = new_train[new_train['spam_status'] == 'spam'] # Filtering rows that contain spam messages
n_spam = spam_df['SMS_msg'].apply(len).sum()

# Calculating the number of words in non-spam messages
non_spam_df = new_train[new_train['spam_status'] == 'ham'] # Filtering rows that contain non-spam messages
n_non_spam = non_spam_df['SMS_msg'].apply(len).sum()
n_vocab = len(unique_vocabulary)
alpha=1 # Laplace smoothing parameter
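
The quantities computed in this cell can be checked by hand on a tiny example. The sketch below uses made-up, already tokenized messages and hypothetical `*_toy` names to avoid clashing with the notebook's variables.

```python
# Toy labelled messages, already split into words (hypothetical data)
toy = [
    ("ham", ["see", "you", "later"]),
    ("ham", ["call", "me", "later"]),
    ("ham", ["ok"]),
    ("spam", ["win", "a", "prize", "now"]),
]

labels = [label for label, _ in toy]

# Priors: share of each class among all messages
p_spam_toy = labels.count("spam") / len(labels)
p_ham_toy = labels.count("ham") / len(labels)

# Total number of words (with repeats) in each class
n_spam_toy = sum(len(words) for label, words in toy if label == "spam")
n_ham_toy = sum(len(words) for label, words in toy if label == "ham")

# Number of distinct words over all messages
n_vocab_toy = len({word for _, words in toy for word in words})

print(p_spam_toy, p_ham_toy, n_spam_toy, n_ham_toy, n_vocab_toy)  # 0.25 0.75 4 7 10
```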

## Conditional Probabilities Estimation

In [0]:
pes_spam = {word:0 for word in unique_vocabulary}
pes_ham = {word:0 for word in unique_vocabulary}

In [0]:
for word in unique_vocabulary:
    prob_word_spam = (spam_df[word].sum() + alpha)/(n_spam + n_vocab*alpha)
    prob_word_non_spam = (non_spam_df[word].sum() + alpha)/(n_non_spam + n_vocab*alpha)
    pes_spam[word]=prob_word_spam
    pes_ham[word] = prob_word_non_spam
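
Written as formulas, the loop above estimates the following smoothed conditional probabilities. The symbols are notation introduced here: $N_{w_i \mid \text{Spam}}$ stands for `spam_df[word].sum()`, $N_{\text{Spam}}$ for `n_spam`, $N_{\text{Ham}}$ for `n_non_spam`, $N_{\text{Vocabulary}}$ for `n_vocab`, and $\alpha$ for `alpha`.

$$
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \, N_{\text{Vocabulary}}},
\qquad
P(w_i \mid \text{Ham}) = \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \, N_{\text{Vocabulary}}}
$$

With $\alpha = 1$ (Laplace smoothing), a word that never appears in one class still gets a small non-zero probability, so a single unseen word cannot force the whole product to zero. For a message with words $w_1, \dots, w_n$, the two scores being compared are

$$
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}),
\qquad
P(\text{Ham} \mid w_1, \dots, w_n) \propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
$$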

## Deploying the classifier

In [0]:
import re

def classify_test_set(message):
    # Apply the same cleaning used on the training set
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the priors and multiply by the conditional probability of each word
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        # Words never seen during training are ignored (dictionary lookup is fast)
        if word in pes_spam:
            p_spam_given_message *= pes_spam[word]
            p_ham_given_message *= pes_ham[word]

    #print('P(Spam|message):', p_spam_given_message)
    #print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
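
`classify_test_set` multiplies many small probabilities, which can underflow to zero for very long messages. A common alternative, not used in this notebook, is to add logarithms instead. The sketch below is one possible variant under that assumption; the function name `classify_log` and the toy probability tables are made up for illustration.

```python
import math
import re

def classify_log(message, p_spam, p_ham, pes_spam, pes_ham):
    """Same decision rule as the product version, computed with sums of logs."""
    words = re.sub(r"\W", " ", message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)

    for word in words:
        if word in pes_spam:  # skip words that were never seen in training
            log_spam += math.log(pes_spam[word])
            log_ham += math.log(pes_ham[word])

    if log_ham > log_spam:
        return "ham"
    if log_ham < log_spam:
        return "spam"
    return "needs human classification"

# Toy conditional probabilities (hypothetical values, not estimated from data)
toy_pes_spam = {"free": 0.30, "prize": 0.20, "you": 0.05}
toy_pes_ham = {"free": 0.02, "prize": 0.01, "you": 0.30}

print(classify_log("FREE prize!!", 0.13, 0.87, toy_pes_spam, toy_pes_ham))  # spam
print(classify_log("see you", 0.13, 0.87, toy_pes_spam, toy_pes_ham))       # ham
```

Because the logarithm is strictly increasing, comparing the log scores gives the same answer as comparing the products. The smoothing with `alpha` keeps every probability above zero, so `math.log` is always defined.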

## Assessing the algorithm with the test set

In [24]:
test_df['predicted'] = test_df['SMS_msg'].apply(classify_test_set)
test_df.head()

| | spam_status | SMS_msg | predicted |
|---|---|---|---|
| 2131 | ham | Later i guess. I needa do mcat study too. | ham |
| 3418 | ham | But i haf enuff space got like 4 mb... | ham |
| 3424 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 1538 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 5393 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [25]:
boolean_acc = test_df['predicted'] == test_df['spam_status']
counts = boolean_acc.value_counts()
accuracy = counts[True]/counts.sum()
accuracy

0.9874326750448833

In [0]:
boolean_recall =  (test_df.loc[test_df['spam_status']=='spam','predicted']) == (test_df.loc[test_df['spam_status']=='spam','spam_status'])

In [28]:
boolean_recall.value_counts()

True     139
False      8
dtype: int64

In [30]:
recall = boolean_recall.value_counts()[True]/boolean_recall.value_counts().sum()
recall

0.9455782312925171

In [33]:
boolean_precision = (test_df.loc[test_df['predicted']=='spam','predicted']) == (test_df.loc[test_df['predicted']=='spam','spam_status'])
precision =  boolean_precision.value_counts()[True]/boolean_precision.value_counts().sum()
precision

0.9652777777777778

In [0]:
F1_score = 2*precision*recall/(precision+recall)

In [37]:
F1_score

0.9553264604810997
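
The three metrics can be cross-checked from raw counts. The recall cell reports 139 spam messages caught and 8 missed; the count of 5 ham messages flagged as spam is inferred here from the reported precision (139/144 ≈ 0.9653) and is not printed anywhere in the notebook. The variable names are chosen for this sketch.

```python
tp = 139  # spam messages correctly predicted as spam (from the recall output)
fn = 8    # spam messages not predicted as spam (from the recall output)
fp = 5    # ham messages predicted as spam (inferred from precision = 139/144)

precision_check = tp / (tp + fp)   # how many predicted spam are really spam
recall_check = tp / (tp + fn)      # how many real spam messages were caught
f1_check = 2 * precision_check * recall_check / (precision_check + recall_check)

print(round(precision_check, 4), round(recall_check, 4), round(f1_check, 4))
# 0.9653 0.9456 0.9553
```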