### Introduction
This tutorial covers spam detection using the Multinomial Naive Bayes algorithm.
Naive Bayes is a simple technique for constructing classifiers. The fundamental rule that the Naive Bayes algorithm uses
is Bayes' theorem.
\begin{equation}
p(Y \mid X) = \frac{p(X \mid Y)p(Y)}{p(X)}
\end{equation}
where $X$ and $Y$ are events and $p(\cdot)$ denotes probability.

Now we restate Bayes' theorem in the context of the text classification problem we are trying to solve, spam detection.
Our hypothesis $H$ is "the given text is spam" and the evidence $E$ is the text of the email.
We are trying to find the probability that an email is spam given its text.
\begin{equation}
p(H \mid E) = \frac{p(E \mid H)p(H)}{p(E)}
\end{equation}
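
As a small illustration of the formula above, the following sketch plugs made-up numbers into Bayes' theorem. The prior and the two likelihoods are assumed values chosen only for the arithmetic, not figures from the Enron data, and `posterior` is a hypothetical helper name.

```python
def posterior(p_e_given_h, p_h, p_e_given_not_h):
    """Return p(H | E), expanding p(E) with the law of total probability."""
    p_not_h = 1.0 - p_h
    p_e = p_e_given_h * p_h + p_e_given_not_h * p_not_h
    return p_e_given_h * p_h / p_e

# Assumed numbers: 40% of mail is spam; the observed text occurs
# in 5% of spam emails and in 1% of ham emails.
p_spam_given_text = posterior(p_e_given_h=0.05, p_h=0.4, p_e_given_not_h=0.01)
print(round(p_spam_given_text, 3))  # 0.769
```

Even though the text is rare in both classes, it is five times more likely under spam, which lifts the posterior well above the 40% prior.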


### Naive Bayes Assumption
We assume that, given the class, each word is independent of all other words. Hence, we can rewrite the above equation as:
\begin{equation}
p(\text{Spam} \mid w_1, \dots, w_n) = \frac{p(w_1 \mid \text{Spam}) \, p(w_2 \mid \text{Spam}) \cdots p(w_n \mid \text{Spam}) \, p(\text{Spam})}{p(w_1, \dots, w_n)}
\end{equation}
Each term $p(w_i \mid \text{Spam})$ is the probability of finding the word $w_i$ in a spam email.

This is the Naive Bayes formulation, which returns the probability that an email message is spam given the words in that email.
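
To make the factorisation concrete, the sketch below scores a three-word message against both classes. The per-word probabilities and priors are hand-picked assumptions, not estimates from data. It works in log space, where the product becomes a sum, and it skips the denominator because that term is the same for both classes; normalising the two numerators recovers the posterior.

```python
import math

# Hypothetical per-class word probabilities and priors, for illustration only
p_word = {
    "spam": {"free": 0.30, "meeting": 0.05, "offer": 0.20},
    "ham": {"free": 0.05, "meeting": 0.30, "offer": 0.05},
}
prior = {"spam": 0.5, "ham": 0.5}

def log_joint(words, label):
    """log p(label) + sum_i log p(w_i | label): the log of the numerator."""
    return math.log(prior[label]) + sum(math.log(p_word[label][w]) for w in words)

message = ["free", "offer", "free"]
scores = {label: log_joint(message, label) for label in prior}

# p(w_1, ..., w_n) is the sum of both numerators, so normalising gives p(Spam | words)
total = sum(math.exp(s) for s in scores.values())
p_spam = math.exp(scores["spam"]) / total
print(scores, p_spam)
```

For classification alone, comparing `scores["spam"]` with `scores["ham"]` is enough; the normalisation step is only needed when the probability itself is wanted.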

### Loading the Dataset
The dataset used for this tutorial is the Enron email dataset, which serves as the training data.
It can be downloaded from this Google Drive location: "https://drive.google.com/file/d/1KMqathKI_U-mDahA0m2C7XkHBippdGQA/view?usp=sharing" (use Google Chrome if the download fails in IE).

Before starting, download the data. Put the folders 'spam' and 'ham' in a top-level folder 'enronAll', and put this enronAll folder in the same directory as your source code.

In [1]:
import os
import re
import string
import math
import pandas as pd
import random

PATH = 'enronAll'
target_names = ['ham', 'spam']

log_class_priors = {}
word_counts = {}
vocab = set()
word_counts['spam'] = {}
word_counts['ham'] = {}

def load_data(PATH):
    
    text = []
    spam_flag = []
    
    spams = os.listdir(os.path.join(PATH, 'spam'))
    for spam in spams:
        with open(os.path.join(PATH, 'spam', spam), encoding="latin-1") as file:
            text.append(file.read())
            spam_flag.append(1)

    not_spams = os.listdir(os.path.join(PATH, 'ham'))
    for not_spam in not_spams:
        with open(os.path.join(PATH, 'ham', not_spam), encoding="latin-1") as file:
            text.append(file.read())
            spam_flag.append(0)
            
    text_df = pd.Series(text)
    spam_flag_df = pd.Series(spam_flag)
    
    return text_df, spam_flag_df

In [2]:
text_df, spam_flag_df = load_data(PATH)
print(text_df.head())

0    Subject: fw : this is the solution i mentioned...
1    Subject: adv : space saving computer to replac...
2    Subject: advs\ngreetings ,\ni am benedicta lin...
3    Subject: fw : account over due wfxu ppmfztdtet...
4    Subject: whats new in summer ? bawled\ncarolyn...
dtype: object


This produces two pandas Series:
1. text_df: each element is the text of one email
2. spam_flag_df: a binary series where 1 means 'spam' and 0 means 'ham' (not spam)

### Cleaning the dataset
Here we clean the dataset by removing punctuation and other special characters from the text, lower-casing it, and then tokenizing each string into words. After text_df is cleaned and tokenized, a new data frame is created by combining the tokenized text with the spam flags.

In [3]:
def clean_data(text_df):

    result = []
    
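    # Build the punctuation-stripping table once, outside the loop
    translator = str.maketrans("", "", string.punctuation)

    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for ind, val in text_df.items():
        text = val.translate(translator)
        text = text.lower()
        # Raw string so that \W is not treated as a string escape
        text = re.split(r"\W+", text)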
        
        result.append(text) 
        
    clean_text_df = pd.Series((t for t in result))
        
    return clean_text_df
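
As a quick check of the cleaning steps on a single made-up string, the sketch below applies the same punctuation stripping, lower-casing and `\W+` split. It also shows one quirk worth knowing: when the text ends in whitespace, `re.split` returns a trailing empty string, which can be filtered out if it is not wanted. The sample string and the filtering step are illustrative assumptions, not part of the tutorial's pipeline.

```python
import re
import string

raw = "Subject: FREE offer!!!\nCall now.\n"  # made-up email text

translator = str.maketrans("", "", string.punctuation)
tokens = re.split(r"\W+", raw.translate(translator).lower())
print(tokens)  # ['subject', 'free', 'offer', 'call', 'now', '']

# Optional: drop empty strings left by leading or trailing separators
non_empty = [t for t in tokens if t]
print(non_empty)  # ['subject', 'free', 'offer', 'call', 'now']
```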

In [4]:
result = clean_data(text_df)
combined_df = pd.concat([result, spam_flag_df], axis=1)
print(combined_df.head())

                                                   0  1
0  [subject, fw, this, is, the, solution, i, ment...  1
1  [subject, adv, space, saving, computer, to, re...  1
2  [subject, advs, greetings, i, am, benedicta, l...  1
3  [subject, fw, account, over, due, wfxu, ppmfzt...  1
4  [subject, whats, new, in, summer, bawled, caro...  1


This method counts how many times each word appears in the given list of words.

In [5]:
def get_word_counts(words):
    
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0.0) + 1.0
    return word_counts
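
For a quick sanity check of what such a counter returns, the sketch below uses `collections.Counter` from the standard library, which yields the same word-to-count mapping (with integer rather than float counts). The token list is made up.

```python
from collections import Counter

tokens = ["subject", "free", "offer", "free"]  # made-up token list
counts = Counter(tokens)
print(dict(counts))  # {'subject': 1, 'free': 2, 'offer': 1}
```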

Before moving ahead, let's first understand the algorithm. Training requires three things: the log class priors, i.e. the log of the probability that any given message is spam or ham (not spam); a vocabulary of words; and word frequencies, i.e. how many times each word appears in spam messages and, separately, in ham messages.

1. Compute the log class priors. Count the number of spam messages and the number of ham messages, divide each count by the total number of messages, and take the log of the result.
2. For each word in a message, add it to the word counts of that message's class (spam or ham) with a count of zero if it is not already there, then increase its count by the number of times it occurs in the message. Also add the word to the global vocabulary.

Example: Let's assume that we have a spam message. First, we count how many times each word appears in the message, and then we add those counts to the spam word counts.

We track the frequency of each word separately for spam and ham. Suppose the word "science" appears in both spam and ham messages. If it accounts for a larger share of the words in spam than in ham, its likelihood under spam is higher than under ham, so seeing it pushes the prediction towards spam.
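
The training steps described above can be sketched on a toy corpus. The four tokenised messages and the `train` helper are hypothetical and exist only to show the three quantities being computed; the function returns them instead of storing them in globals.

```python
import math
from collections import Counter

def train(messages, labels):
    """Return (log_priors, per-class word counts, vocabulary) for tokenised messages."""
    n = len(messages)
    classes = set(labels)
    log_priors = {c: math.log(labels.count(c) / n) for c in classes}
    counts = {c: Counter() for c in classes}
    vocab = set()
    for words, c in zip(messages, labels):
        counts[c].update(words)  # adds 1 for every occurrence of each word
        vocab.update(words)
    return log_priors, counts, vocab

toy_messages = [
    ["free", "offer"],
    ["free", "free", "money"],
    ["meeting", "today"],
    ["project", "meeting"],
]
toy_labels = ["spam", "spam", "ham", "ham"]

log_priors, counts, vocab = train(toy_messages, toy_labels)
print(log_priors["spam"], counts["spam"]["free"], len(vocab))
```

Here half of the messages are spam, so both log priors equal `log(0.5)`, "free" is counted three times in spam, and the vocabulary holds six distinct words.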


In [6]:
def fit_data(combined_df):
    
    n = len(combined_df)
    
    X = []
    Y = []
    spam_sum = 0
    ham_sum = 0
    
    for ind, val in combined_df.iterrows():
        X.append(val[0])
        Y.append(val[1])
    
    for label in Y:
        if label == 1:
            spam_sum += 1
        else:
            ham_sum += 1
            
            
    global log_class_priors
    log_class_priors['spam'] = math.log(spam_sum / n)
    log_class_priors['ham'] = math.log(ham_sum / n)
  
    global word_counts
    word_counts['spam'] = {}
    word_counts['ham'] = {}
       
    global vocab
    
    for x, y in zip(X, Y):
        c = 'spam' if y == 1 else 'ham'
        
        
        counts = get_word_counts(x)
        for word, count in counts.items():
            if word not in vocab:
                vocab.add(word)
            if c not in word_counts:
                continue
            if word not in word_counts[c]:
                word_counts[c][word] = 0.0
                
            word_counts[c][word] += count
            
    return X, Y

In [7]:
X, Y = fit_data(combined_df)

Now that we have extracted the required data from the training data, we can move ahead with the Naive Bayes classification.

1. Given a document, we iterate over its words, compute $\log p(w_i|\text{Spam})$ for each and sum them up, and do the same with $\log p(w_i|\text{Ham})$.
2. Then we add the log class prior to each sum and check which score is bigger for that document. Whichever is larger gives the predicted label.
3. To estimate $p(w_i|\text{Spam})$, we divide the number of times we have seen $w_i$ in "spam" messages by the total count of all words in all "spam" messages, and then take the log.
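
Written out, the three steps above amount to the following decision rule. The shorthand $\text{count}(w, c)$ is introduced here for the number of times word $w$ occurs in the training messages of class $c$; it is notation added for this summary only.
\begin{equation}
\hat{c} = \arg\max_{c \in \{\text{Spam}, \text{Ham}\}} \left[ \log p(c) + \sum_{i=1}^{n} \log p(w_i \mid c) \right], \qquad p(w_i \mid c) = \frac{\text{count}(w_i, c)}{\sum_{w} \text{count}(w, c)}
\end{equation}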

#### Laplace Smoothing
One thing we need to take care of: if a word is present in the spam word counts but not in the ham word counts (or vice versa), then $p(w_i|\text{Ham})$ is 0 and $\log 0$ is undefined. To overcome this we use Laplace smoothing: we add 1 to the numerator and add the size of the vocabulary to the denominator to balance it, so that the smoothed probabilities over the vocabulary still sum to 1.
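
A minimal sketch of the smoothed estimate on made-up counts follows. `smoothed_log_prob` is a hypothetical helper, and the counts and vocabulary size are assumed values.

```python
import math

def smoothed_log_prob(word, class_counts, vocab_size):
    """log p(word | class) with add-one (Laplace) smoothing."""
    total = sum(class_counts.values())
    return math.log((class_counts.get(word, 0) + 1) / (total + vocab_size))

# Made-up counts: "meeting" never appears in the spam training messages
spam_counts = {"free": 3, "offer": 1, "money": 1}
vocab_size = 6  # assumed size of the combined spam + ham vocabulary

seen = smoothed_log_prob("free", spam_counts, vocab_size)       # log(4 / 11)
unseen = smoothed_log_prob("meeting", spam_counts, vocab_size)  # log(1 / 11)
print(seen, unseen)
```

The unseen word gets a small but finite log probability instead of raising a math domain error on `log(0)`.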

In [8]:
def predict_data(X):
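    result = []

    # Class-wide totals do not depend on the email, so compute them once
    spam_total = sum(word_counts['spam'].values())
    ham_total = sum(word_counts['ham'].values())
    vocab_size = len(vocab)

    for x in X:
        counts = get_word_counts(x)
        spam_score = 0
        ham_score = 0

        # Each distinct word contributes once, however often it repeats in the email
        for word in counts:
            # Skip words never seen during training
            if word not in vocab: continue
            # add Laplace smoothing
            log_w_given_spam = math.log((word_counts['spam'].get(word, 0.0) + 1) / (spam_total + vocab_size))
            log_w_given_ham = math.log((word_counts['ham'].get(word, 0.0) + 1) / (ham_total + vocab_size))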

            spam_score += log_w_given_spam
            ham_score += log_w_given_ham

        spam_score += log_class_priors['spam']
        ham_score += log_class_priors['ham']

        if spam_score > ham_score:
            result.append(1)
        else:
            result.append(0)
    return result

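In [9]:
# Draw one set of indices so each prediction is compared with the label of the same email
idx = random.sample(range(len(X)), 100)
pred = predict_data([X[i] for i in idx])
true = [Y[i] for i in idx]

accuracy = sum(1 for p, t in zip(pred, true) if p == t) / float(len(pred))

print("Accuracy of correctly identifying a mail as spam or not = {0:.0f}%".format(accuracy * 100))

Accuracy of correctly identifying a mail as spam or not = 54%

The 54% recorded above comes from a run in which the 100 emails and the 100 labels were drawn by two separate `random.sample` calls, so predictions were scored against the labels of unrelated emails and the result is close to chance. With paired sampling, as in the cell above, the accuracy needs to be measured again.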
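
Because the sample above is drawn from the same emails that were used for fitting, the figure says little about unseen mail. One way to hold some emails back, using only the standard library, is sketched below. `split_indices` is a hypothetical helper, and the 80/20 ratio and the seed are arbitrary choices.

```python
import random

def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle range(n) and return (train_idx, test_idx) with no overlap."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_fraction)
    return idx[n_test:], idx[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```

Fitting on the rows selected by `train_idx` and predicting on the rows selected by `test_idx` keeps every prediction paired with its own label and gives a held-out estimate.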


### Naive Bayes using sklearn library

In this part we use the sklearn package and its predefined methods, and compare the accuracy of the resulting model with that of the previous one.

a. Extraction of features:- We convert the dataset into numerical feature vectors using the 'bag of words' model: each document is segmented into words, and we count how many times each word appears in it. The CountVectorizer.fit_transform() method learns the vocabulary and returns the matrix of counts.
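
To see what a bag-of-words matrix looks like, here is a pure-Python sketch on two made-up documents. It only mimics the idea (a sorted vocabulary and one row of counts per document) and is not how `CountVectorizer` is implemented; in particular, the whitespace tokenisation is a simplifying assumption.

```python
from collections import Counter

docs = ["free offer free", "meeting offer today"]  # made-up documents

# Learn the vocabulary: every distinct token, in sorted order, gets a column index
vocabulary = sorted({token for doc in docs for token in doc.split()})
column = {token: j for j, token in enumerate(vocabulary)}

# Build one row of counts per document
matrix = []
for doc in docs:
    row = [0] * len(vocabulary)
    for token, count in Counter(doc.split()).items():
        row[column[token]] = count
    matrix.append(row)

print(vocabulary)  # ['free', 'meeting', 'offer', 'today']
print(matrix)      # [[2, 0, 1, 0], [0, 1, 1, 1]]
```

On real data most entries are zero, which is why the library returns a sparse matrix rather than nested lists.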

In [10]:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import numpy as np


new_combined_df = pd.concat([text_df, spam_flag_df], axis=1)
train_df, test_df = train_test_split(new_combined_df, test_size=0.2)

In [11]:
def fit_transform(train_df):
    
    count_vect = CountVectorizer()
    counts = count_vect.fit_transform(train_df[0])
    
    return counts

In [12]:
train_counts = fit_transform(train_df)

b. Reducing the weight of common words:- Raw counts give more weight to longer documents than to shorter ones. To overcome this we use term frequency (TF), i.e. the count of a word divided by the total number of words in the document. We then use TF-IDF, which additionally scales down words that occur in many documents, to represent the documents via a weighted bag-of-words model.
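
The weighting can be sketched in pure Python on a tiny count matrix. This sketch assumes the smoothed form $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$ followed by L2 normalisation of each row, which is what scikit-learn documents for the default settings of `TfidfTransformer`; other TF-IDF variants exist, and the counts below are made up.

```python
import math

counts = [[2, 0, 1, 0],   # made-up count rows, one per document
          [0, 1, 1, 1]]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term occurs
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    # Scaling each row to unit length removes the document-length effect
    tfidf.append([v / norm for v in weighted])

print([round(v, 3) for v in tfidf[0]])
```

The third term occurs in both documents, so its idf is the minimum value of 1 and it ends up with less weight than terms that appear in only one document.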

In [13]:
def tfidf(train_counts):
    
    tfidf_transformer = TfidfTransformer()
    tfidf = tfidf_transformer.fit_transform(train_counts)
    
    return tfidf

In [14]:
train_tfidf = tfidf(train_counts)
print(train_tfidf.shape)

(26961, 142761)


c. Building a Naive Bayes Classifier:
1. Build a Multinomial Naive Bayes classifier and train it on the training data.
2. Build a pipeline. The purpose of the pipeline is to assemble several steps (here vectorizing, TF-IDF weighting and classification) so they can be fitted and cross-validated together while setting different parameters.
3. Measure the performance of the Multinomial Naive Bayes classifier against the test data.

In [15]:
def NBclassifier(train_df, train_tfidf):
    
    clf = MultinomialNB().fit(train_tfidf, train_df[1])

    text_clf = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
    
    text_clf = text_clf.fit(train_df[0], train_df[1])
    
    predicted = text_clf.predict(test_df[0])

    return(np.mean(predicted == test_df[1]))
        

In [16]:
accuracy = NBclassifier(train_df, train_tfidf)
print("Accuracy of correctly identifying a mail as spam or not = {0:.0f}%".format(accuracy * 100))

Accuracy of correctly identifying a mail as spam or not = 99%


### References

1. Pandas: https://pandas.pydata.org/
2. String: https://docs.python.org/2/library/string.html
3. Random: https://docs.python.org/2/library/random.html
4. Sklearn: http://scikit-learn.org/stable/documentation.html
5. http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
6. https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a
7. http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
8. https://pythonmachinelearning.pro/text-classification-tutorial-with-naive-bayes/
9. http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

