# A Naive Bayes approach to filter SMS messages and discard Spam

Spam text messages and emails are not only used to market new products to individuals; on occasion they also carry malicious intent, aiming to gather sensitive information about the recipient in unethical ways.

Combating spam manually can be a fairly tedious process, but it can be automated: new incoming messages can be classified as legitimate (ham) or spam using a probabilistic approach.

This project discusses the approach to building a Multinomial Naive Bayes spam filter. The training dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). You can also download the dataset directly from this [link](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection). The data collection process is described in detail [on this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition).

In [1]:
import pandas as pd
import numpy as np

sms = pd.read_csv("SMSSpamCollection",sep="\t",header=None,names=["Label","SMS"])

In [2]:
sms.head(5)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
sms.tail(5)

Unnamed: 0,Label,SMS
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


In [4]:
sms["Label"].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

The dataset has 5572 rows (SMS messages), of which ~86.6% are labelled "ham", meaning they are genuine messages, whereas the remaining ~13.4% are labelled "spam".

## Splitting the dataset into training and testing sets

To assess the performance of our approach we split the dataset into training and testing sets. We employ an 80/20 training to testing split i.e. 80% of the dataset is assigned for training our algorithm while 20% is reserved for testing.

In [5]:
sms_random = sms.sample(frac=1,random_state=1)

In [6]:
train_index = round(len(sms_random)*0.8)

In [7]:
sms_train = sms_random[:train_index].reset_index(drop=True)
sms_test = sms_random[train_index:].reset_index(drop=True)

In [8]:
sms_train["Label"].value_counts(normalize=True)*100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [9]:
sms_test["Label"].value_counts(normalize=True)*100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

The distribution between "ham" and "spam" messages in both training and testing datasets is very similar to the distribution in the original dataset.

## Removing punctuation and other irrelevant characters

In [10]:
print(sms_train.head(10))

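# Replace every non-word character with a space (regex=True is required in pandas 2.0+)
sms_train["SMS"] = sms_train["SMS"].str.replace(r"\W", " ", regex=True)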
sms_train["SMS"] = sms_train["SMS"].str.lower()

sms_train.head(10)

  Label                                                SMS
0   ham                       Yep, by the pretty sculpture
1   ham      Yes, princess. Are you going to make me moan?
2   ham                         Welp apparently he retired
3   ham                                            Havent.
4   ham  I forgot 2 ask ü all smth.. There's a card on ...
5   ham  Ok i thk i got it. Then u wan me 2 come now or...
6   ham  I want kfc its Tuesday. Only buy 2 meals ONLY ...
7   ham                         No dear i was sleeping :-P
8   ham                          Ok pa. Nothing problem:-)
9   ham                    Ill be there on  &lt;#&gt;  ok.


Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
5,ham,ok i thk i got it then u wan me 2 come now or...
6,ham,i want kfc its tuesday only buy 2 meals only ...
7,ham,no dear i was sleeping p
8,ham,ok pa nothing problem
9,ham,ill be there on lt gt ok


## Creating a vocabulary for the dataset

To implement the Naive Bayes approach, we need a comprehensive vocabulary of the words used in the SMS messages that make up our training set.

In [11]:
vocabulary = []
sms_train["SMS"] = sms_train["SMS"].str.split()

for each in sms_train['SMS']:
    for every in each:
        vocabulary.append(every)
        
vocabulary = list(set(vocabulary))

In [12]:
n_vocab = len(vocabulary)
n_vocab

7783

In [13]:
word_counts = {unique_word: [0] * len(sms_train['SMS']) for unique_word in vocabulary}

# Loop variable is named `message` so it does not overwrite the `sms` DataFrame
for index, message in enumerate(sms_train['SMS']):
    for word in message:
        word_counts[word][index] += 1

In [14]:
word_counts_df = pd.DataFrame(word_counts)

sms_train = pd.concat([sms_train,word_counts_df],axis=1)
sms_train.shape

(4458, 7785)
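
To make the shape of the word-count table concrete, here is a minimal sketch on two made-up messages. The messages and the names `toy_messages`, `toy_vocabulary` and `toy_counts` are illustrative assumptions, not part of the dataset; the logic mirrors the cells above, with one entry per vocabulary word and one count per message.

```python
# Hypothetical, already cleaned and tokenised messages (not from the dataset).
toy_messages = [["win", "money", "now"], ["see", "you", "now", "now"]]

# Sorted so that the word order is reproducible.
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list of per-message counts for every vocabulary word.
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][index] += 1

for word in toy_vocabulary:
    print(word, toy_counts[word])
```

Passing a dictionary shaped like `toy_counts` to `pd.DataFrame` gives one column per word, which is how the training table above ends up with 7783 word columns in addition to `Label` and `SMS`.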

## Determining the probability constants

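To apply the Naive Bayes approach to our spam filtering problem, we first need some prior information estimated from the training set: the prior probability of each class and the number of words in each class. These are then used to compute the posterior probability of a new incoming message being ham or spam.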
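
For reference, the quantities computed in the following cells correspond to the standard multinomial Naive Bayes rule with additive (Laplace) smoothing. The symbols are chosen here to mirror the notebook's variable names (`p_spam`, `n_spam`, `n_vocab`, `alpha`); $N_{w_i \mid Spam}$ denotes the number of times word $w_i$ occurs in spam messages.

```latex
\[
P(Spam \mid w_1, \dots, w_n) \;\propto\; P(Spam) \prod_{i=1}^{n} P(w_i \mid Spam),
\qquad
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \, N_{Vocabulary}}
\]
```

The expressions for ham are analogous, and a message receives the label of whichever class yields the larger product.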

In [15]:
def probability_constants(data):
    spam = data[data["Label"]=="spam"]
    ham = data[data["Label"]=="ham"]
    n = data.shape[0]
    
    p_ham = (ham.shape[0]/n)
    p_spam = (spam.shape[0]/n)
    
    n_ham = ham["SMS"].apply(len).sum()
    n_spam = spam["SMS"].apply(len).sum()
    
            
    return([p_ham,n_ham,p_spam,n_spam])

# Call the function once and unpack all four values
p_ham, n_ham, p_spam, n_spam = probability_constants(sms_train)

print("The prior probability of a message being 'ham' is {0:.4f}".format(p_ham))
print("The prior probability of a message being 'spam' is {0:.4f}".format(p_spam))


The prior probability of a message being 'ham' is 0.8654
The prior probability of a message being 'spam' is 0.1346


In [16]:
alpha = 1

spam = sms_train[sms_train["Label"]=="spam"]
ham = sms_train[sms_train["Label"]=="ham"]

ham_word_probability = {unique:0 for unique in vocabulary}
spam_word_probability = {unique:0 for unique in vocabulary}

for each in vocabulary:
    n_word_spam = spam[each].sum()
    p_word_spam = (n_word_spam + alpha)/(n_spam + (alpha*n_vocab))
    spam_word_probability[each] = p_word_spam
    
    n_word_ham = ham[each].sum()
    p_word_ham = (n_word_ham + alpha)/(n_ham + (alpha*n_vocab))
    ham_word_probability[each] = p_word_ham
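
The role of `alpha` is easiest to see with small numbers: without smoothing, a word that never appears in a class would get probability zero and wipe out the whole product. The sketch below uses a hypothetical helper, `smoothed_probability`, and made-up counts; it is only meant to illustrate the formula used in the cell above.

```python
def smoothed_probability(word_count, class_word_total, vocabulary_size, alpha=1):
    """Additive smoothing: (count + alpha) / (total + alpha * vocabulary size)."""
    return (word_count + alpha) / (class_word_total + alpha * vocabulary_size)

# Hypothetical counts chosen only to keep the arithmetic easy to follow.
unseen = smoothed_probability(word_count=0, class_word_total=90, vocabulary_size=10)
seen = smoothed_probability(word_count=4, class_word_total=90, vocabulary_size=10)

print(unseen)  # (0 + 1) / (90 + 10) = 0.01
print(seen)    # (4 + 1) / (90 + 10) = 0.05
```

With `alpha=0` the unseen word would get probability 0, which is the situation the smoothing term avoids.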

In [17]:
import re

def spam_filter(new_text):

    new_text = re.sub(r'\W', ' ', new_text)
    new_text = new_text.lower().split()
    
    p_spam_new_text = p_spam
    p_ham_new_text = p_ham
    
    for each in new_text:
        if each in spam_word_probability:
            p_spam_new_text *= spam_word_probability[each]
        
        if each in ham_word_probability:
            p_ham_new_text *= ham_word_probability[each]
    
    if p_ham_new_text > p_spam_new_text:
        return ("ham")
    elif p_ham_new_text < p_spam_new_text:
        return ('spam')
    else:
        # Equal scores: report the tie; the function then implicitly returns None
        print('It is difficult to classify this message as legitimate or spam.')

In [18]:
spam_filter('WINNER!! This is the secret code to unlock the money: C3421.')

'spam'

In [19]:
spam_filter("Sounds good, Tom, then see u there")

'ham'
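
One practical caveat: multiplying many probabilities smaller than 1 can underflow to zero for long messages. A common remedy is to add logarithms instead of multiplying. The sketch below shows the idea with a hypothetical function, `classify_log`, and made-up probability tables; it is not the notebook's `spam_filter`, only a variant of the same comparison.

```python
import math
import re

def classify_log(text, p_spam, p_ham, spam_probs, ham_probs):
    """Compare log-posteriors instead of raw products to avoid underflow."""
    words = re.sub(r"\W", " ", text).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)

    for word in words:
        # Words outside the vocabulary are skipped, as in spam_filter.
        if word in spam_probs:
            log_spam += math.log(spam_probs[word])
        if word in ham_probs:
            log_ham += math.log(ham_probs[word])

    if log_ham > log_spam:
        return "ham"
    if log_spam > log_ham:
        return "spam"
    return "undecided"

# Hypothetical priors and smoothed word probabilities, for illustration only.
toy_spam_probs = {"win": 0.05, "money": 0.04, "see": 0.01}
toy_ham_probs = {"win": 0.01, "money": 0.01, "see": 0.05}

print(classify_log("WIN money!!", 0.13, 0.87, toy_spam_probs, toy_ham_probs))
print(classify_log("See you", 0.13, 0.87, toy_spam_probs, toy_ham_probs))
```

Because the logarithm is monotonic, comparing sums of logs gives the same decision as comparing the products, as long as the products themselves have not underflowed.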

We have built a classifier function that takes any new incoming message as input and labels it as either safe (ham) or spam.

We can now test this function against the testing dataset we set aside earlier and judge how accurate our algorithm is.

In [21]:
sms_test["Predicted_Label"] = sms_test["SMS"].apply(spam_filter)
sms_test.head()

It is difficult to classify this message as legitimate or spam.


Unnamed: 0,Label,SMS,Predicted_Label
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


## Measuring the performance

We can assess the performance of the filter by measuring the accuracy of the classification, that is, how many messages were classified correctly and how many incorrectly.

In [29]:
def accuracy(data):
    correctly_done = 0
    n = data.shape[0]
    
    for index,row in data.iterrows():
        if (row['Label'] == row["Predicted_Label"]):
            correctly_done += 1
    
    incorrect = n - correctly_done
    accuracy = (correctly_done/n)*100
    return([accuracy,correctly_done,incorrect,n])


In [30]:
# Call the function once and unpack accuracy, correct count, incorrect count and total
filter_performance, positives, negatives, total = accuracy(sms_test)
print("""The Naive Bayes filter designed to classify new incoming messages as 'Spam' or 'Ham' is accurate {0:.3f}% of the time""".format(filter_performance))

The Naive Bayes filter designed to classify new incoming messages as 'Spam' or 'Ham' is accurate 98.743% of the time


In [31]:
print("Correctly classified {0} of {1} messages".format(positives,total))

Correctly classified 1100 of 1114 messages


In [32]:
print("Incorrectly classified {0} of {1} messages".format(negatives,total))

Incorrectly classified 14 of 1114 messages
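
Because roughly 87% of the messages are ham, overall accuracy does not say which kind of error the filter makes. One way to look closer is to tally (actual, predicted) label pairs. The sketch below uses a hypothetical helper, `confusion_counts`, and made-up labels; the numbers are not the notebook's results.

```python
from collections import Counter

def confusion_counts(actual_labels, predicted_labels):
    """Tally (actual, predicted) label pairs."""
    return Counter(zip(actual_labels, predicted_labels))

# Hypothetical labels, for illustration only (not the notebook's results).
toy_actual = ["ham", "ham", "spam", "spam", "ham"]
toy_predicted = ["ham", "spam", "spam", "ham", "ham"]

counts = confusion_counts(toy_actual, toy_predicted)
correct = sum(n for (actual, predicted), n in counts.items() if actual == predicted)

print(counts[("spam", "ham")])    # spam that slipped through as ham
print(counts[("ham", "spam")])    # ham wrongly flagged as spam
print(correct / len(toy_actual))  # overall accuracy
```

On the notebook's data, the same helper could be called with `sms_test["Label"]` and `sms_test["Predicted_Label"]`; a message the filter could not decide on would show up with a predicted label of `None`.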


## Conclusions

The Naive Bayes filter performs very well with an accuracy of ~98.7%. The performance can further be enhanced by looking at the 14 misclassified messages and deducing what might have caused their erroneous classification.

The overall approach could be extended to take letter case, punctuation, and potentially text-based emoticons into account, all of which the current preprocessing discards.