# Building a Spam Filter with Naive Bayes

In this project, we'll build a spam filter that classifies new messages with an accuracy above 80%; that is, we expect more than 80% of new messages to be classified correctly as spam or ham (non-spam).
To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that have already been classified by humans. The dataset can be downloaded from [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label','SMS'])

In [3]:
data.head()

| | label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [5]:
data['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

So we can see that, of the 5,572 messages, 747 are labeled as spam, that is, 13.4% of the messages.

## Creating a Training and Test Set

We'll split our dataset into two sets:

* A training set, which we'll use to "train" the computer to classify messages.
* A test set, which we'll use to test how good the spam filter is at classifying new messages.

Before splitting, we'll randomize the entire dataset so that spam and ham messages are spread evenly throughout it.

In [6]:
# Randomizing

data_rand = data.sample(frac=1, random_state=1)

In [7]:
# Creating a training and testing set
training_data = data_rand.sample(frac=0.8, random_state=2)
testing_data = data_rand.drop(training_data.index)

In [8]:
# Reset index
training_data = training_data.reset_index(drop=True)
testing_data = testing_data.reset_index(drop=True)
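
The split above can be sanity-checked by confirming that the two sets do not overlap and by comparing the share of each label in them. The sketch below repeats the same shuffle-and-split pattern on a small made-up DataFrame (`toy` and its contents are hypothetical, not the SMS data), so it can be run on its own:

```python
import pandas as pd

# Toy stand-in for the SMS data: 20 rows, 4 of them spam (hypothetical values).
toy = pd.DataFrame({
    'label': ['spam'] * 4 + ['ham'] * 16,
    'SMS': [f'message {i}' for i in range(20)],
})

# Same pattern as above: shuffle, take 80% for training, keep the rest for testing.
toy_rand = toy.sample(frac=1, random_state=1)
toy_train = toy_rand.sample(frac=0.8, random_state=2)
toy_test = toy_rand.drop(toy_train.index)

# The two sets should be disjoint and together cover every row.
overlap = toy_train.index.intersection(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))

# Share of each label in both sets, to compare against the full dataset.
print(toy_train['label'].value_counts(normalize=True))
print(toy_test['label'].value_counts(normalize=True))
```

Applied to `training_data` and `testing_data`, the same `value_counts(normalize=True)` call would show whether both sets keep roughly the spam share seen in the full dataset.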

## Data Cleaning 

To make both datasets easier to work with, we're going to remove the punctuation and convert all the words to lower case.

In [29]:
training_data['SMS'] = training_data['SMS'].str.lower().str.replace(r'\W',' ', regex=True)
testing_data['SMS'] = testing_data['SMS'].str.lower().str.replace(r'\W',' ', regex=True)
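
To see what this cleaning step does to a single message, the sketch below applies the same pattern with Python's `re` module. The helper `clean` and the sample text are made up for illustration. Note that `\W` matches anything other than letters, digits, and underscores, so digits and underscores are kept.

```python
import re

def clean(text):
    # Mirror of the cell above for one string: lower-case the text,
    # then replace every non-word character with a space.
    return re.sub(r'\W', ' ', text.lower())

# Hypothetical message, not taken from the dataset.
sample = "WINNER!! Call 0800-123 now, it's FREE_entry"

print(clean(sample))
print(clean(sample).split())
```

Splitting on whitespace afterwards absorbs the runs of spaces left where punctuation used to be, which is why the later cells can call `.str.split()` directly. Apostrophes are also replaced, so a word like "it's" becomes the two tokens "it" and "s".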

The next step is to create a dataset that records how many times each word appears in each message. For this, we need to know which unique words appear so that we can build our vocabulary: a list of all the words in the training set.

In [10]:
vocabulary=[]
for row in training_data['SMS'].str.split():
    for i in row:
        if i not in vocabulary:
            vocabulary.append(i)

len(vocabulary)

7751

In [11]:
dic={}

for word in vocabulary:
    total=[]
    for row in training_data['SMS'].str.split():
        count=0
        for i in row:
            if i == word:
                count+=1
        total.append(count)
    dic[word] = total

In [12]:
# Transforming the dictionary into a DataFrame
training_data2 = pd.DataFrame(dic)

In [13]:
training_data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4458 entries, 0 to 4457
Columns: 7751 entries, i to guoyang
dtypes: int64(7751)
memory usage: 263.6 MB


In [14]:
training = pd.concat([training_data,training_data2], axis=1)
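
The nested loops above rescan every message once per vocabulary word, which gets slow as the vocabulary grows. One possible alternative, sketched here on three made-up messages, is to count each message once with `collections.Counter` and then read the per-word counts from those counters. The names `messages`, `vocab`, and `word_counts` are hypothetical and stand in for the cleaned `SMS` column, `vocabulary`, and `dic`.

```python
from collections import Counter

# Hypothetical cleaned messages standing in for training_data['SMS'].
messages = ["free entry free prize", "see you at home", "free call now"]

# Unique words in first-seen order (dict keys preserve insertion order).
vocab = list(dict.fromkeys(w for m in messages for w in m.split()))

# Count the words of each message exactly once.
counts = [Counter(m.split()) for m in messages]

# Same layout as `dic`: word -> list with one count per message.
# A Counter returns 0 for words that are missing from a message.
word_counts = {w: [c[w] for c in counts] for w in vocab}

print(len(vocab))
print(word_counts['free'])
```

Passing `word_counts` to `pd.DataFrame` would give one column per word and one row per message, the same shape as `training_data2`.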

## Naive Bayes: Calculating Constants First

Before starting the calculation, we can define the values of some constants that appear in the Naive Bayes equation. They are:

* $P(Spam)$ - the probability that a message is spam.
* $P(Ham)$ - the probability that a message is ham.
* $N_{Spam}$ - the total number of words in all the spam messages.
* $N_{Ham}$ - the total number of words in all the ham messages.
* $N_{Vocabulary}$ - the number of unique words in the training data.

$P(Spam)$

In [15]:
P_spam = len(training[training['label']=='spam'])/len(training)
print(P_spam)

0.13301929116195604


$P(Ham)$

In [16]:
P_ham = len(training[training['label']=='ham'])/len(training)
print(P_ham)

0.866980708838044


$N_{Spam}$

In [17]:
N_spam = 0
for row in training[training['label']=='spam']['SMS'].str.split():
    N_spam += len(row)
    
print(N_spam)

15148


$N_{Ham}$

In [18]:
N_ham = 0
for row in training[training['label']=='ham']['SMS'].str.split():
    N_ham += len(row)
    
print(N_ham)

57122


$N_{Vocabulary}$

In [19]:
N_voc = len(vocabulary)
print(N_voc)

7751


In [20]:
alpha = 1  # Laplace (add-one) smoothing parameter

## Naive Bayes: Calculating $P(w_i|Spam)$ and $P(w_i|Ham)$

Following the equation:

$$P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

Using this equation, and the analogous one for $P(w_i|Ham)$, we can calculate the probability of each word in the vocabulary given that the message is spam or ham.

In [21]:
Spam = training[training['label']=='spam']
Spam = Spam.reset_index(drop=True)

Ham = training[training['label']=='ham']
Ham = Ham.reset_index(drop=True)

In [22]:
word_given_spam = {}
word_given_ham = {}

for word in vocabulary:
    N_w_spam = Spam[word].sum()
    N_w_ham = Ham[word].sum()
    P_word_given_spam = (N_w_spam+alpha)/(N_spam+(alpha*N_voc))
    P_word_given_ham = (N_w_ham+alpha)/(N_ham+(alpha*N_voc))
    word_given_spam[word] = P_word_given_spam
    word_given_ham[word] = P_word_given_ham

In [23]:
word_given_spam['secret']

0.0003930302633302764
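
As a sanity check on the formula, the value printed for `'secret'` can be reproduced by hand. The sketch below plugs in the constants printed earlier and assumes that "secret" occurs 8 times in the spam training messages. That count is not shown in the notebook; it is inferred from the printed probability.

```python
# Constants printed earlier in the notebook.
N_spam = 15148
N_voc = 7751
alpha = 1

# Assumption: number of times 'secret' appears in the spam training messages.
# This count is inferred from the printed probability, not read from the data.
n_secret_spam = 8

p_secret_spam = (n_secret_spam + alpha) / (N_spam + alpha * N_voc)
print(p_secret_spam)

# A vocabulary word that never appears in spam still gets a small,
# non-zero probability because alpha is added to its count.
p_unseen_spam = (0 + alpha) / (N_spam + alpha * N_voc)
print(p_unseen_spam)
```

The second value shows the purpose of the smoothing term: without `alpha`, a single word that was never seen in spam would force the whole spam product to zero.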

## Naive Bayes: Classifying A New Message

Following the Naive Bayes equation, we will create a function that uses the values calculated above to score a message as spam and as ham, and returns the label with the higher score.

In [24]:
def classify(message):
    message = message.lower().split()
    
    P_spam_message = P_spam
    P_ham_message = P_ham

    for w in message:
        if w in vocabulary:
            P_spam_message *= word_given_spam[w]
            P_ham_message *= word_given_ham[w]

    if P_spam_message>P_ham_message:
        return 'spam'
    elif P_ham_message>P_spam_message:
        return 'ham'
    else:
        return 'Equal'
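
Multiplying many small probabilities can underflow to zero for long messages, in which case both products become 0.0 and the comparison ends in a tie. A common workaround is to add logarithms instead of multiplying. The sketch below is one way to do that, not the method used in this notebook. It uses small made-up tables (`p_spam`, `p_ham`, `word_given_spam_toy`, and `word_given_ham_toy` are hypothetical values) so that it runs on its own.

```python
import math

# Hypothetical toy parameters, not the values computed in this notebook.
p_spam, p_ham = 0.2, 0.8
word_given_spam_toy = {'free': 0.05, 'prize': 0.03, 'home': 0.001}
word_given_ham_toy = {'free': 0.005, 'prize': 0.001, 'home': 0.02}

def classify_log(message):
    # Start from the log of the class priors.
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)

    # Add log-probabilities instead of multiplying probabilities.
    # Words outside the vocabulary are skipped, as in classify().
    for w in message.lower().split():
        if w in word_given_spam_toy:
            log_spam += math.log(word_given_spam_toy[w])
            log_ham += math.log(word_given_ham_toy[w])

    if log_spam > log_ham:
        return 'spam'
    elif log_ham > log_spam:
        return 'ham'
    return 'Equal'

print(classify_log('free prize'))
print(classify_log('home'))

# A very long message: the plain product 0.05 ** 400 underflows to 0.0,
# but the sum of logs stays finite and can still be compared.
long_message = ' '.join(['free'] * 400)
print(0.05 ** 400)
print(classify_log(long_message))
```

Because the logarithm is increasing, comparing the log scores gives the same label as comparing the products whenever the products do not underflow.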

In [25]:
testing_data['predict'] = testing_data['SMS'].apply(classify)

In [26]:
testing_data.head()

| | label | SMS | predict |
|---|---|---|---|
| 0 | ham | yep by the pretty sculpture | ham |
| 1 | ham | yes princess are you going to make me moan | ham |
| 2 | ham | my uncles in atlanta wish you guys a great se... | ham |
| 3 | ham | ok which your another number | ham |
| 4 | spam | freemsg why haven t you replied to my text i ... | spam |


In [27]:
# Checking the accuracy
acc = testing_data['label'] == testing_data['predict'] 

In [28]:
acc.value_counts()

True     1102
False      12
dtype: int64

The filter classified 1,102 of the 1,114 test messages correctly, an accuracy of 98.9%, well above the 80% target we set at the start.
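
The accuracy figure follows directly from the counts printed above. Because only about 13% of the messages in the full dataset are spam, it can also help to break the results down by actual and predicted label. The sketch below computes the accuracy from the printed counts and then shows such a breakdown on a few made-up labels (`actual` and `predicted` are hypothetical, not the notebook's test results).

```python
from collections import Counter

# Counts taken from the value_counts() output above.
correct, incorrect = 1102, 12
accuracy = correct / (correct + incorrect)
print(f"{accuracy:.1%}")

# Hypothetical labels, only to illustrate the breakdown.
actual = ['ham', 'ham', 'spam', 'spam', 'ham']
predicted = ['ham', 'spam', 'spam', 'ham', 'ham']

# Count each (actual, predicted) pair, a minimal confusion table.
breakdown = Counter(zip(actual, predicted))
for (a, p), n in sorted(breakdown.items()):
    print(f"actual={a:<4} predicted={p:<4} count={n}")
```

On the real data, the same counting applied to `testing_data['label']` and `testing_data['predict']` would show how the 12 errors split between spam that was missed and ham that was flagged.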