# Creating a Spam Filter Using the Naive Bayes Algorithm
The goal of this project is to practice using the multinomial Naive Bayes algorithm by building a spam filter for SMS messages.

We are going to teach the computer how humans classify messages as spam or non-spam, and then use this knowledge to classify new messages automatically by estimating the probability that each one is spam or non-spam.

## Teaching
Our first task is to "teach" the computer how to classify messages. To do this, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that have already been classified by humans (the label "ham" means "non-spam").

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

Let's explore the dataset:

In [1]:
import pandas as pd
sms_db = pd.read_csv('SMSSpamCollection',
                     sep = '\t',
                     header = None,
                     names = ['Label', 'SMS']                   
                    )
spam_percentage = sms_db[sms_db['Label'] == 'spam'].shape[0] / sms_db.shape[0]
    
print('Total messages {}'.format(sms_db.shape[0]))
print('Spam percentage is {}'.format(spam_percentage * 100))

sms_db.head()

Total messages 5572
Spam percentage is 13.406317300789663


|   | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Establishing test and training sets
The dataset contains 5,572 messages, 13.4% of which are spam. All of them were labeled by humans.

Before creating a spam filter, we should think of a way to test it. We should also decide what result counts as a success: here, we aim for an accuracy above 80%.

Let's randomly split our dataset into two parts:
* A **training set**, which we'll use to "train" the computer to classify messages. It will take **80%** of our initial dataset
* A **test set** made of the remaining **20%**, which we'll use to test the performance of our spam filter once it is finished

In [2]:
# Shuffle the dataset
randomized_db = sms_db.sample(frac = 1,
                              random_state = 1
                             )
num_rows = randomized_db.shape[0]

# Getting 80% for the training set
training_set = randomized_db.iloc[:int(round(num_rows*0.8,0))].copy()
training_set.reset_index(drop = True, inplace = True)

# Getting remaining 20% for the test set
test_set = randomized_db.iloc[int(round(num_rows*0.8,0)):].copy()
test_set.reset_index(drop = True, inplace = True)

# Calculating spam percentages in both samples
spam_percentage_training = (training_set[training_set['Label'] == 'spam'].shape[0]
                            / training_set.shape[0])

spam_percentage_test = (test_set[test_set['Label'] == 'spam'].shape[0]
                              / test_set.shape[0])

print('Spam percentage in the training set is {}'.format(spam_percentage_training * 100))
print('Spam percentage in the test set is {}'.format(spam_percentage_test * 100))

Spam percentage in the training set is 13.458950201884253
Spam percentage in the test set is 13.195691202872531


The spam percentages in both new datasets are very close to that of the full dataset, which suggests that the random split kept both sets representative.

## Data cleaning. No punctuation, lowercase
The Naive Bayes algorithm estimates, for every word in a message, how likely that word is to appear in spam and in non-spam messages, and combines these estimates to decide which class the whole message more likely belongs to.
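
As a sketch of the standard decision rule (the notation $w_1, \dots, w_n$ for the words of a message is an assumption introduced here, not taken from the notebook), the two quantities being compared can be written as:

```latex
\begin{align*}
P(\text{spam} \mid w_1, \dots, w_n) &\propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}) \\
P(\text{ham} \mid w_1, \dots, w_n) &\propto P(\text{ham}) \prod_{i=1}^{n} P(w_i \mid \text{ham})
\end{align*}
```

The message is assigned to whichever class gives the larger value. The "naive" part is the assumption that the words are independent of each other given the class.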

To calculate this we need to consider each word regardless of case and punctuation.

So we are going to clean our data by:
1. Deleting all the punctuation
2. Making all letters lowercase
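
To see what these two steps do to a single message, here is a minimal sketch using Python's `re` module. The helper name `clean_message` is hypothetical and not part of the notebook, and it also strips leading and trailing spaces, which the notebook code does not do.

```python
import re

def clean_message(text):
    """Hypothetical helper: remove punctuation and lowercase one message."""
    text = re.sub(r'\W', ' ', text)    # every non-word character becomes a space
    text = re.sub(r'\s+', ' ', text)   # collapse runs of whitespace into one space
    return text.strip().lower()

print(clean_message("Yes, princess. Are you going?"))
```

Because `\W` is Unicode-aware for Python strings, letters such as `ü` are treated as word characters and kept.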

In [3]:
# Before cleaning
training_set.head()

|   | Label | SMS |
|---|---|---|
| 0 | ham | Yep, by the pretty sculpture |
| 1 | ham | Yes, princess. Are you going to make me moan? |
| 2 | ham | Welp apparently he retired |
| 3 | ham | Havent. |
| 4 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


In [4]:
# Cleaning
# regex=True is passed explicitly: pandas 2.0+ treats the pattern as a literal string by default
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)
training_set['SMS'] = training_set['SMS'].str.replace(r'\s+', ' ', regex=True)
training_set['SMS'] = training_set['SMS'].str.lower()


In [5]:
# After cleaning
training_set.head(5)

|   | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on da... |


## Creating a vocabulary
Now we need to create a vocabulary containing all the words from our training set. First, we will split the 'SMS' column, transforming every SMS into a list of words.

In [6]:
training_set['SMS'] = training_set['SMS'].str.split()
training_set['SMS'].head()

0                    [yep, by, the, pretty, sculpture]
1    [yes, princess, are, you, going, to, make, me,...
2                      [welp, apparently, he, retired]
3                                             [havent]
4    [i, forgot, 2, ask, ü, all, smth, there, s, a,...
Name: SMS, dtype: object

Then we'll iterate over every SMS and append every word to a new list called "vocabulary". In other words, we'll get a list of every word in every SMS.

The next step is to get rid of the duplicates. We can do this by transforming our list into a set and back into a list.

In [7]:
vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))    

print(len(vocabulary)) 

7783


Well done. Now we have a vocabulary that contains every unique word from all the SMS messages in the training set.

## Creating a dictionary
The next step is to count how many times each word appears in each message. We are going to do it by:
1. Creating a dictionary with each unique word as a key
2. Filling this dictionary with lists of '0' values, one value per message
3. Iterating over the training dataset and counting the occurrences of every word in every SMS
4. Transforming the dictionary into a DataFrame which will have all the unique words as columns and all SMS messages as rows. The values will show how many times a specific word was mentioned in a specific message

So from the data like this:

![image](https://dq-content.s3.amazonaws.com/433/cpgp_dataset_1.png)

we will get something like this:

![image](https://dq-content.s3.amazonaws.com/433/cpgp_dataset_2.png)
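
The four steps above can be sketched on a toy example. The three messages below are made up for illustration, and all `toy_*` names are hypothetical; the approach mirrors the one used on the real training set.

```python
import pandas as pd

# Toy stand-in for training_set['SMS'] after cleaning and splitting (made-up messages)
toy_sms = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
    ['winner', 'claim', 'your', 'prize', 'prize'],
]

# Unique words; sorted only to make the column order reproducible
toy_vocabulary = sorted({word for sms in toy_sms for word in sms})

# Steps 1 and 2: one list of zeros per word, with one slot per message
toy_counts = {word: [0] * len(toy_sms) for word in toy_vocabulary}

# Step 3: count every word of every message
for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts[word][index] += 1

# Step 4: words become columns, messages become rows
toy_word_counts = pd.DataFrame(toy_counts)
print(toy_word_counts)
```

Each row of the result corresponds to one message, so the word "prize" gets a 2 in the last row and a 0 in the second one.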

In [8]:
word_counts_per_sms = {}
for unique_word in vocabulary:
    word_counts_per_sms[unique_word] = [0] * len(training_set['SMS'])

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

word_counts = pd.DataFrame(word_counts_per_sms)

Let's print 3 random rows, keeping only the columns with at least one non-zero value, to check that we got what we wanted.

In [9]:
example = word_counts.sample(3)
example.loc[:, example.loc[:, :].sum() != 0]

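|   | disappeared | which | you | yeah | nice | worse | babe | and | hey | saw | ... | is | second | rather | i | a | came | for | salmon | happened | online |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3612 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 1 | 0 | 0 |
| 2553 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3879 | 1 | 0 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | ... | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |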


Exactly what we wanted! But without the SMS itself we can't verify that the counts are right. Let's concatenate this data frame with the initial training set. First, though, we need to be sure that both datasets have the same number of rows.

In [10]:
print(word_counts.shape)
print(training_set.shape)

(4458, 7783)
(4458, 2)


Looks good. Now it's time to concatenate.

In [11]:
word_counts = pd.concat([training_set, word_counts], axis=1)

# We'll take the first 3 rows to make sure everything's fine
head = word_counts.head(3)

# Let's print only those columns which have at least one non-zero value
head.loc[:, head.loc[:, :].sum() != 0]

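|   | Label | SMS | you | retired | moan | me | to | he | yes | make | by | sculpture | yep | are | princess | welp | apparently | the | pretty | going |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |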


## Calculating constants
Now all the preparatory calculations are done and we can start building the spam filter.
The main advantage of the Naive Bayes algorithm is that most computations are made before the classification itself, so the classification process is almost instant.

At this stage we can identify the parts of the algorithm that need to be calculated only once and stay constant for every new SMS, as long as we don't change the training set.

For this purpose we are going to split our training set into two parts:
1. Spam
2. Non spam

and then calculate some constants

In [12]:
spam_data = word_counts[word_counts['Label'] == 'spam'].copy()
ham_data = word_counts[word_counts['Label'] == 'ham'].copy()

p_spam = spam_data.shape[0] / word_counts.shape[0]
p_ham = 1 - p_spam

# n_spam - a number of words in all spam messages
n_spam = 0
for sms in spam_data['SMS']:
    n_spam += len(sms)

# n_ham - the number of words in all ham messages
n_ham = 0
for sms in ham_data['SMS']:
    n_ham += len(sms)

# n_vocabulary - a number of unique words in all messages both spam and ham
n_vocabulary = len(vocabulary)

# alpha = 1, as we'll use Laplace smoothing
alpha = 1

## Creating dictionaries with the probability of every word given spam or ham
We can also treat as constants the probabilities of every word in our vocabulary given spam and given non-spam, because they stay the same whichever new SMS message we consider.

Let's calculate these probabilities for every word. We'll create two dictionaries, one for spam and one for non-spam. The keys will be the unique words from our vocabulary, and the values will initially be '0'.

In [13]:
p_word_given_spam = dict.fromkeys(vocabulary, 0)
p_word_given_ham = dict.fromkeys(vocabulary, 0)

Then we'll update every value using the formula from the Naive Bayes algorithm (shown here for spam; the ham version uses the ham counts in the same way):

P (word | spam) = (N (word | spam) + alpha) / (N (spam) + alpha * N (vocabulary))
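
To make the role of `alpha` concrete, here is a small sketch with made-up counts. The function name `smoothed_probability` and the numbers are assumptions chosen for illustration only. Without smoothing, a word that never occurs in spam would get a probability of zero and would zero out the whole product for any message containing it.

```python
def smoothed_probability(n_word_in_class, n_words_in_class, n_vocabulary, alpha=1):
    """P(word | class) with additive (Laplace) smoothing."""
    return (n_word_in_class + alpha) / (n_words_in_class + alpha * n_vocabulary)

# Made-up counts: 100 words in all spam messages, 50 unique words in the vocabulary
seen = smoothed_probability(n_word_in_class=4, n_words_in_class=100, n_vocabulary=50)
unseen = smoothed_probability(n_word_in_class=0, n_words_in_class=100, n_vocabulary=50)

print(seen)    # (4 + 1) / (100 + 50)
print(unseen)  # (0 + 1) / (100 + 50): small, but not zero
```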

In [14]:
for word in p_word_given_spam:
    n_word_spam = spam_data[word].sum()
    p_word_given_spam[word] = ((n_word_spam + alpha)
                               / (n_spam + alpha * n_vocabulary)
                              )
    
    n_word_ham = ham_data[word].sum()
    p_word_given_ham[word] = ((n_word_ham + alpha)
                               / (n_ham + alpha * n_vocabulary)
                             )

## Creating a spam filter
Finally, we have everything ready to create the spam filter itself. It will be a function that takes in a message and classifies it as spam or non-spam using the knowledge from the training set.

In [15]:
import re

def classify(message):
    
    # Clean the new SMS the same way we cleaned the training set
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    # Start from the prior probabilities and multiply in
    # the probability of every known word
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        # Words that are not in the vocabulary are ignored
        if word in p_word_given_spam:
            p_spam_given_message *= p_word_given_spam[word]
            p_ham_given_message *= p_word_given_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
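
One possible refinement, which is not part of the notebook's method: multiplying many small probabilities can underflow to zero for very long messages, so a common variant adds logarithms instead. The sketch below is self-contained and uses made-up probabilities; `classify_log` and every `toy_*` name are hypothetical.

```python
import math

# Made-up values for illustration; in the notebook they would come from
# p_spam, p_ham, p_word_given_spam and p_word_given_ham
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_p_word_given_spam = {'free': 0.02, 'prize': 0.015, 'meeting': 0.0005}
toy_p_word_given_ham = {'free': 0.002, 'prize': 0.0004, 'meeting': 0.004}

def classify_log(words):
    """Hypothetical variant of classify() that adds logs instead of multiplying."""
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        if word in toy_p_word_given_spam:  # words outside the vocabulary are skipped
            log_spam += math.log(toy_p_word_given_spam[word])
            log_ham += math.log(toy_p_word_given_ham[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log(['free', 'prize']))
print(classify_log(['meeting']))
```

Since the logarithm is monotonic, comparing the sums gives the same decision as comparing the products whenever the products do not underflow.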

## Testing
And that's it! Now that the function is ready, we'll test it on our test set. We'll create a new column in the test set that holds the result of our algorithm's classification.

In [16]:
test_set['predicted'] = test_set['SMS'].apply(classify)

In [17]:
test_set.head()

|   | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


The final step is to measure the accuracy. We'll compare the algorithm's classifications with the human labels.

In [18]:
test_set['correct'] = test_set['Label'] == test_set['predicted']
test_set.head()
test_set['correct'].value_counts(normalize = True) * 100

True     98.743268
False     1.256732
Name: correct, dtype: float64
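
A single accuracy figure does not show which kind of mistake the filter makes. As a sketch on made-up labels (the `toy_results` frame is hypothetical and stands in for `test_set`), `pd.crosstab` can break the results down by true and predicted class:

```python
import pandas as pd

# Made-up labels and predictions standing in for
# test_set['Label'] and test_set['predicted']
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'ham', 'spam', 'spam', 'ham', 'ham'],
})

# Share of rows where the prediction matches the human label
accuracy = (toy_results['Label'] == toy_results['predicted']).mean()

# Rows are the human labels, columns are the predictions
breakdown = pd.crosstab(toy_results['Label'], toy_results['predicted'])

print('Accuracy: {:.1f}%'.format(accuracy * 100))
print(breakdown)
```

The off-diagonal cells count ham messages flagged as spam and spam messages that slipped through as ham.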

## Conclusion
Wow! That's far above our expected level of 80%.
This project shows how effective mathematics can be in practice.

One important thing to bear in mind: the multinomial Naive Bayes algorithm is not sensitive to combinations of words. It estimates the probability of each word independently of the other words in a message.

Nevertheless, the accuracy of the algorithm on our test set is 98.7%, and classification is fast.