# Building a Spam Filter with Multinomial Naive Bayes

This notebook continues the guided project from [Dataquest](https://www.dataquest.io)'s course on conditional probability. The goal is to build a spam filter using multinomial [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html).

In [None]:
import numpy as np
import pandas as pd
import re
from sklearn.metrics import recall_score, precision_score

In [None]:
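# Caution: combined with explicit `names`, `header=1` discards the file's own header line
# and the first data row; `header=0` would keep every message.
sms_df = pd.read_csv('../input/sms-spam-collection-dataset/spam.csv', header=1, encoding='latin-1', names=['Label', 'SMS', 'Unknown1', 'Unknown2', 'Unknown3'])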
sms_df.head()

In [None]:
sms_df.drop(['Unknown1','Unknown2','Unknown3'], axis=1, inplace=True)

In [None]:
sms_df.info()

In [None]:
sms_df.groupby('Label').describe()

None of the 5571 entries has missing values. However, only about 13% of the text messages are labelled as spam, so the two classes are imbalanced.

In [None]:
list(sms_df[sms_df['Label'] == 'spam']['SMS'])

In [None]:
data_randomized = sms_df.sample(frac=1, random_state=1)
split_index = round(len(data_randomized) * 0.8)
sms_train = data_randomized[:split_index].reset_index(drop=True)
sms_test = data_randomized[split_index:].reset_index(drop=True)
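
Since spam makes up only a small share of the data, it is worth checking that the shuffle left a similar class balance in both subsets. The following is a minimal sketch of such a check; the `toy` frame and all `toy_*` names are hypothetical stand-ins so that the snippet runs on its own, and the same `value_counts(normalize=True)` call could be applied to `sms_train` and `sms_test`.

```python
import pandas as pd

# Hypothetical toy data standing in for sms_df: 8 ham and 2 spam messages.
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': [f'message {i}' for i in range(10)],
})

# Same shuffle-and-slice split as above (80% training, 20% test).
toy_shuffled = toy.sample(frac=1, random_state=1)
toy_split = round(len(toy_shuffled) * 0.8)
toy_train = toy_shuffled[:toy_split].reset_index(drop=True)
toy_test = toy_shuffled[toy_split:].reset_index(drop=True)

# Relative class frequencies in each subset.
toy_train_share = toy_train['Label'].value_counts(normalize=True)
toy_test_share = toy_test['Label'].value_counts(normalize=True)
print(toy_train_share)
print(toy_test_share)
```

If the proportions differ a lot between the two subsets, a stratified split would be one option to consider.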

The next step is to expand the 'SMS' column into a set of columns, one for each word in the training set's vocabulary. In each row, a word's column holds the number of times that word appears in the given SMS.
To simplify the vocabulary, all text messages are stripped of punctuation and converted to lowercase.

In [None]:
def clean_and_split_message(message):
    message = message.lower()
    message = re.sub(r'\W', ' ', message)
    return message.split()

sms_train['SMS'] = sms_train['SMS'].apply(clean_and_split_message)
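
To see what the cleaning step produces, here is a small self-contained sketch. The function is repeated so that the snippet runs on its own, and the `example` message is made up for illustration.

```python
import re

def clean_and_split_message(message):
    message = message.lower()
    # \W matches every character that is not a letter, digit or underscore.
    message = re.sub(r'\W', ' ', message)
    return message.split()

# Hypothetical message, not taken from the dataset.
example = "WINNER!! Claim your FREE prize, call 0906-170-1461 now."
tokens = clean_and_split_message(example)
print(tokens)
```

Note that digits survive the cleaning, so the hyphenated phone number is split into three separate numeric tokens.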

In [None]:
vocabulary = {word for sms_words in list(sms_train['SMS']) for word in sms_words}
word_counts_per_sms = {unique_word: [0] * len(sms_train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(sms_train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
sms_train = pd.concat([sms_train, pd.DataFrame(word_counts_per_sms)], axis=1)

In [None]:
sms_train.head()

Now that the data is in a suitable format, the next objective is to calculate the probability that a new message (decomposed into its words $w_1$, $w_2$, etc.) is a spam message or a ham message, using the following equations respectively:
\begin{equation}
P(Spam  |  w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}
\begin{equation}
P(Ham  |  w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

The higher of the two values determines the class to which our message belongs.

Since we want to apply Multinomial Naive Bayes, each parameter $ P(w_i|Spam) $ and $ P(w_i|Ham) $ is calculated using the following formulas: 

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}
\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}  
where $ N_{w_i|Spam} $ and $ N_{w_i|Ham} $ are the number of times the word $ w_i $ occurs in all spam and all ham messages respectively, $ N_{Vocabulary} $ is the size of our vocabulary, $ N_{Ham} $ is the total number of words in all ham messages, $ N_{Spam} $ is the total number of words in all spam messages, and $ \alpha $ is our smoothing parameter. We will use $ \alpha = 1 $ (Laplace smoothing).
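
As a worked example of these formulas, the sketch below computes the smoothed parameters for one word on a tiny corpus. The four messages and all `toy_*` names are hypothetical; `Fraction` is used only to keep the arithmetic exact and readable.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical, already cleaned and split toy messages.
toy_spam_messages = [['free', 'prize', 'now'], ['free', 'entry']]
toy_ham_messages = [['see', 'you', 'now'], ['call', 'me', 'later', 'now']]

toy_spam_counts = Counter(w for m in toy_spam_messages for w in m)
toy_ham_counts = Counter(w for m in toy_ham_messages for w in m)

toy_n_spam = sum(toy_spam_counts.values())    # 5 words in spam messages
toy_n_ham = sum(toy_ham_counts.values())      # 7 words in ham messages
toy_n_vocab = len(set(toy_spam_counts) | set(toy_ham_counts))  # 9 unique words
toy_alpha = 1                                 # Laplace smoothing

def smoothed(count, n_class):
    """(N_wi|class + alpha) / (N_class + alpha * N_vocabulary)."""
    return Fraction(count + toy_alpha, n_class + toy_alpha * toy_n_vocab)

# 'free' occurs twice in spam and never in ham.
p_free_given_spam = smoothed(toy_spam_counts['free'], toy_n_spam)  # (2 + 1) / (5 + 9)
p_free_given_ham = smoothed(toy_ham_counts['free'], toy_n_ham)     # (0 + 1) / (7 + 9)
print(p_free_given_spam, p_free_given_ham)
```

Thanks to the smoothing term, a word that never appears in ham messages still receives a small non-zero probability instead of forcing the whole product to zero.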

Let us start by calculating the constants:

In [None]:
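vocab_cols = sms_train.columns[2:]  # every column after 'Label' and 'SMS' is a word count
# Select the word columns by label; recent pandas versions reject a set such as `vocabulary` as an indexer.
N_ham = sms_train[sms_train['Label'] == 'ham'][vocab_cols].sum(axis=1).sum()
N_spam = sms_train[sms_train['Label'] == 'spam'][vocab_cols].sum(axis=1).sum()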
alpha = 1
P_ham = sms_train[sms_train['Label'] == 'ham'].shape[0]/sms_train.shape[0]
P_spam = sms_train[sms_train['Label'] == 'spam'].shape[0]/sms_train.shape[0]
N_vocab = len(vocabulary)

As for the parameters, we will initialize two dictionaries, one for spam and one for ham messages, where each key is a word of our vocabulary, and its value is the associated probability as given by the formulas above. 

In [None]:
P_wi_given_ham = { wi:0 for wi in vocabulary}
P_wi_given_spam = { wi:0 for wi in vocabulary}
sms_train_ham = sms_train[sms_train['Label'] == 'ham']
sms_train_spam = sms_train[sms_train['Label'] == 'spam']

for wi in vocabulary:
    N_wi_given_ham = sms_train_ham[wi].sum()
    N_wi_given_spam = sms_train_spam[wi].sum()
    P_wi_given_ham[wi] = (N_wi_given_ham + alpha)/(N_ham + alpha*N_vocab)
    P_wi_given_spam[wi] = (N_wi_given_spam + alpha)/(N_spam + alpha*N_vocab)
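
Summing one column at a time can be slow for a large vocabulary. As a possible alternative, the sketch below computes all parameters in one pass with `groupby`. The `toy_train` frame and the other `toy_*` names are hypothetical, and this is only one way to vectorize the calculation, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical word-count table with the same layout as sms_train.
toy_train = pd.DataFrame({
    'Label': ['spam', 'ham', 'ham'],
    'free': [2, 0, 0],
    'now': [1, 1, 1],
    'call': [0, 1, 2],
})
toy_vocab_cols = ['free', 'now', 'call']
toy_alpha = 1
toy_n_vocab = len(toy_vocab_cols)

# One row per class, one column per word: N_wi|class.
toy_class_counts = toy_train.groupby('Label')[toy_vocab_cols].sum()
# Total number of words per class: N_class.
toy_n_words = toy_class_counts.sum(axis=1)
# Smoothed parameters, dividing each row by its own denominator.
toy_params = toy_class_counts.add(toy_alpha).div(toy_n_words + toy_alpha * toy_n_vocab, axis=0)
print(toy_params)
```

Each row of `toy_params` sums to 1, which is a quick sanity check that the smoothing was applied consistently.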

Note that the most computationally expensive calculations only need to be done once, and on the training set alone, so there is little left to calculate when a new message comes in.
In fact, all we need to do now is clean the new message and multiply the relevant probabilities that we calculated just above.

In [None]:
def classify(message):
    message = clean_and_split_message(message)
    
    P_ham_given_message = P_ham
    P_spam_given_message = P_spam
    
    for word in message:
        if word in P_wi_given_ham:
            P_ham_given_message *= P_wi_given_ham[word]
        if word in P_wi_given_spam:
            P_spam_given_message *= P_wi_given_spam[word]

    if P_ham_given_message > P_spam_given_message:
        return 'ham'
    elif P_spam_given_message > P_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'
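
Multiplying many probabilities smaller than 1 can underflow to `0.0` for very long messages, in which case both scores would tie. A common workaround is to add logarithms instead of multiplying. The sketch below illustrates the idea; `classify_log` and the toy parameters are hypothetical and are not part of the pipeline above.

```python
import math

# Hypothetical toy parameters standing in for P_ham, P_spam and the two dictionaries.
toy_p_ham, toy_p_spam = 0.87, 0.13
toy_p_wi_given_ham = {'free': 0.001, 'call': 0.01, 'later': 0.02}
toy_p_wi_given_spam = {'free': 0.03, 'call': 0.02, 'later': 0.001}

def classify_log(words):
    """Compare log-scores; the ordering is the same as for the raw products."""
    log_ham = math.log(toy_p_ham)
    log_spam = math.log(toy_p_spam)
    for word in words:
        # Both dictionaries share the same keys; unknown words are skipped.
        if word in toy_p_wi_given_ham:
            log_ham += math.log(toy_p_wi_given_ham[word])
            log_spam += math.log(toy_p_wi_given_spam[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log(['free', 'call']))
print(classify_log(['call', 'later']))
# A raw product of 500 such factors would underflow to 0.0 for both classes.
print(classify_log(['free'] * 500))
```

Because the logarithm is strictly increasing, comparing the sums gives the same decision as comparing the products whenever the products are representable.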

In [None]:
def accuracy(y_true, predicted):
    return len(y_true[y_true == predicted])/len(y_true)

In [None]:
predictions = sms_test['SMS'].apply(classify)
print(accuracy(sms_test['Label'], predictions))
print(recall_score(sms_test['Label'], predictions, pos_label='spam'))
print(precision_score(sms_test['Label'], predictions, pos_label='spam'))
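
For reference, the three metrics can be derived by hand from the four outcome counts. The sketch below uses made-up labels (all names are hypothetical) with 'spam' as the positive class, matching the `pos_label='spam'` argument used above.

```python
# Hypothetical true labels and predictions for eight messages.
toy_true = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham']
toy_pred = ['spam', 'spam', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham']
pairs = list(zip(toy_true, toy_pred))

tp = sum(t == 'spam' and p == 'spam' for t, p in pairs)  # spam caught
fp = sum(t == 'ham' and p == 'spam' for t, p in pairs)   # ham flagged as spam
fn = sum(t == 'spam' and p == 'ham' for t, p in pairs)   # spam missed
tn = sum(t == 'ham' and p == 'ham' for t, p in pairs)    # ham let through

toy_accuracy = (tp + tn) / len(pairs)  # share of correct predictions
toy_precision = tp / (tp + fp)         # share of predicted spam that is spam
toy_recall = tp / (tp + fn)            # share of actual spam that was caught
print(toy_accuracy, toy_precision, toy_recall)
```

With imbalanced classes, accuracy alone can look good even when much spam is missed, which is why precision and recall are reported as well.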

In [None]:
predictions.value_counts()

That is more than an acceptable score. Also note that there were no instances where the classifier needed human help.

Let us look at the messages that were wrongly classified: 

In [None]:
sms_test[sms_test['Label'] != predictions].values

Two things stand out: the message 'We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us' appears twice, and the last message is labelled 'spam' although it does not look like a typical spam message. I personally would have assigned it to the ham messages. It is possible that the training set contains other mislabelled messages that could affect the final result.
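
Repeated messages like the one above can be located with pandas' `duplicated`. The following is a minimal sketch on a hypothetical toy frame; the same calls could be applied to `sms_df` to see how many messages occur more than once.

```python
import pandas as pd

# Hypothetical toy data with one repeated message.
toy = pd.DataFrame({
    'Label': ['ham', 'ham', 'spam', 'ham'],
    'SMS': ['see you soon', 'call me later', 'free prize now', 'call me later'],
})

# Mark every row whose text occurs more than once.
toy_duplicate_mask = toy.duplicated(subset='SMS', keep=False)
# Count only the redundant copies (first occurrences are not counted).
toy_n_redundant = int(toy.duplicated(subset='SMS').sum())
# Keep the first occurrence of each message.
toy_deduplicated = toy.drop_duplicates(subset='SMS').reset_index(drop=True)

print(toy[toy_duplicate_mask])
print(toy_n_redundant, len(toy_deduplicated))
```

Whether duplicates should be removed is a judgment call, since repeated spam may reflect how often such messages are really sent.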