# Spam Detection using Naive Bayes

## Introduction 

Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag of words features to identify spam e-mail, an approach commonly used in text classification.

Naive Bayes classifiers work by correlating the use of tokens (typically words, or sometimes other things), with spam and non-spam e-mails and then using Bayes' theorem to calculate a probability that an email is or is not spam.
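As a rough illustration of the Bayes' theorem step, the sketch below computes the probability that a message is spam given that it contains one particular word. All three input probabilities are made-up values chosen for the example; they are not estimated from any dataset.

```python
# Hypothetical probabilities, chosen only to illustrate the calculation.
p_spam = 0.2                # prior: share of all messages that are spam
p_ham = 1.0 - p_spam        # prior: share of all messages that are ham
p_word_given_spam = 0.4     # how often the word appears in spam messages
p_word_given_ham = 0.01     # how often the word appears in ham messages

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
p_spam_given_word = p_word_given_spam * p_spam / p_word

print("P(spam | word) = {:.3f}".format(p_spam_given_word))
```

Even though only a fifth of the messages are assumed to be spam, seeing a word that is far more common in spam than in ham pushes the posterior probability above 0.9.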

Naive Bayes spam filtering is a baseline technique for dealing with spam that can tailor itself to the email needs of individual users and give low false positive spam detection rates that are generally acceptable to users. It is one of the oldest ways of doing spam filtering, with roots in the 1990s. (Wikipedia)

In [3]:
# Import the libraries
import pandas as pd
import re
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

## Loading the Dataset
We will use the Naive Bayes algorithm to build a model that classifies SMS messages from the [SMS Spam Collection dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) as spam or not spam, based on the labeled examples we train it on.

In [4]:
# Loading the dataset
dataset = pd.read_table("./smsspamcollection/SMSSpamCollection", 
                        sep = "\t", 
                        header = None, 
                        names = ["label", "sms_message"])
dataset.head(10)

|   | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [5]:
print("dataset shape: ", dataset.shape)

dataset shape:  (5572, 2)


## Data Preprocessing

Now that the dataset is loaded, let's convert the labels into binary numbers: 0 for ham and 1 for spam.

In [6]:
# Converting the labels to binary numbers
dataset["label"] = dataset["label"].map({"ham": 0, "spam": 1})
dataset.head(10)

|   | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
| 5 | 1 | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | 0 | Even my brother is not like to speak with me. ... |
| 7 | 0 | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | 1 | WINNER!! As a valued network customer you have... |
| 9 | 1 | Had your mobile 11 months or more? U R entitle... |


## Splitting the Dataset

To estimate how well the model performs on messages it has not seen, we split the dataset into two parts: a training set used to fit the model and a test set used only to evaluate it.

In [7]:
# Splitting the dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(dataset["sms_message"],
                                                    dataset["label"],
                                                    test_size =0.25,
                                                    random_state = 1)

print("Training set size: \n", X_train.shape[0])
print("Test set size: \n", X_test.shape[0])

Training set size: 
 4179
Test set size: 
 1393
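One optional refinement, which the split above does not use, is to pass `stratify` to `train_test_split` so that both parts keep the same spam-to-ham ratio. The sketch below shows the idea on hypothetical toy data (the `messages` and `labels` lists are invented for the example), assuming spam is the minority class.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 ham (0) and 20 spam (1) placeholder messages.
messages = ["message {}".format(i) for i in range(100)]
labels = [0] * 80 + [1] * 20

# stratify=labels keeps the class proportions equal in both parts.
msg_train, msg_test, lab_train, lab_test = train_test_split(
    messages, labels, test_size=0.25, random_state=1, stratify=labels)

print("Spam share in training set:", sum(lab_train) / len(lab_train))
print("Spam share in test set:", sum(lab_test) / len(lab_test))
```

Without stratification the shares in the two parts can drift apart by chance, which matters more the rarer the spam class is.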


## Feature Extraction - Bag of Words
Now that we have split the data, the next step is to build the bag-of-words representation and convert the messages into a matrix of token counts. We will use `CountVectorizer()` for this. The steps are as follows:
* First we fit the `CountVectorizer()` on the training data (`X_train`) and transform it in the same step.
* Then we transform the test data (`X_test`) using the vocabulary learned from the training data.
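The two steps above can be sketched on a pair of invented sentences. The example is a minimal illustration, not the notebook's actual configuration: stop-word removal is left out so the vocabulary stays easy to follow, and the names `train_docs`, `test_docs`, and `toy_vectorizer` are made up for the sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents standing in for X_train and X_test.
train_docs = ["Win a free prize now", "Are we meeting for lunch now"]
test_docs = ["Free lunch prize tomorrow"]

toy_vectorizer = CountVectorizer(lowercase=True)

# Step 1: learn the vocabulary from the training documents and count tokens.
train_matrix = toy_vectorizer.fit_transform(train_docs)

# Step 2: count tokens in the test documents using that same vocabulary.
test_matrix = toy_vectorizer.transform(test_docs)

print("Vocabulary:", sorted(toy_vectorizer.vocabulary_))
print("Training matrix shape:", train_matrix.shape)
print("Test matrix:", test_matrix.toarray())
```

The word "tomorrow" never appears in the training documents, so it has no column and is silently ignored in the test matrix. Fitting only on the training data is what keeps information from the test set out of the model.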

In [8]:
# Initializing the count vectorizer
count_vector = CountVectorizer(lowercase = True, 
                               token_pattern = "(?u)\\b\\w\\w+\\b",
                               stop_words = "english")

# Fit and transform the training data
training_data = count_vector.fit_transform(X_train)
#print("Vocabulary in Training Set: \n", count_vector.get_feature_names_out(), "\n\n")

# Transform the test data
test_data = count_vector.transform(X_test)

In [101]:
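# Checking the training data as a DataFrame
# Note: get_feature_names() was removed in recent scikit-learn releases;
# use it instead of get_feature_names_out() only on old versions.
pd.DataFrame(data = training_data.toarray(), 
             columns = count_vector.get_feature_names_out()).head(10)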

Unnamed: 0,00,000,008704050406,0121,01223585236,01223585334,0125698789,02,0207,02072069400,...,zed,zeros,zhong,zindgi,zoe,zoom,zouk,zyada,èn,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Naive Bayes

scikit-learn provides several Naive Bayes implementations, so we do not have to do the math from scratch. We will use the `sklearn.naive_bayes` module.

We will use the multinomial Naive Bayes implementation (`MultinomialNB`). This classifier is suited to classification with discrete features, such as the word counts in our bag-of-words matrix.
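To give a feel for the math that the library handles for us, here is a minimal from-scratch sketch of multinomial Naive Bayes scoring with add-one (Laplace) smoothing. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the four training messages are invented, tokenization is a plain whitespace split, and the helper names `fit_counts` and `log_scores` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy training set: (message, label) pairs.
train = [("free prize now", "spam"), ("free entry win", "spam"),
         ("see you at lunch", "ham"), ("lunch at noon", "ham")]

def fit_counts(examples):
    """Count messages per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = {}
    for text, label in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary

def log_scores(text, class_counts, word_counts, vocabulary, alpha=1.0):
    """Return log P(class) + sum of log P(word | class) for each class."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in text.split():
            if word not in vocabulary:
                continue  # words never seen in training are skipped
            smoothed = word_counts[label][word] + alpha
            score += math.log(smoothed / (total_words + alpha * len(vocabulary)))
        scores[label] = score
    return scores

class_counts, word_counts, vocabulary = fit_counts(train)
scores = log_scores("free lunch prize", class_counts, word_counts, vocabulary)
predicted = max(scores, key=scores.get)
print("Log scores:", scores)
print("Predicted class:", predicted)
```

The smoothing term `alpha` plays the same role as the `alpha` parameter of `MultinomialNB`: it keeps a word that never occurred in one class from forcing that class's probability to zero.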

In [129]:
# Applying Naive Bayes
naive_bayes = MultinomialNB()

naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Prediction

Let's make predictions on the test set. Later on we will evaluate how well the model did on this data.

In [130]:
# Predicting the test set
predictions = naive_bayes.predict(test_data)
print("predictions: ", predictions)

predictions:  [0 0 0 ... 0 1 0]
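The same pattern applies to messages that arrive after training: they must go through the already fitted vectorizer's `transform` before being passed to `predict`. The sketch below is self-contained and uses an invented four-message training set, so the names `toy_messages`, `toy_vectorizer`, and `toy_model` are assumptions for the example rather than objects from the cells above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data: 1 = spam, 0 = ham.
toy_messages = ["free prize now", "free entry win",
                "see you at lunch", "lunch at noon"]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

# New messages are transformed with the fitted vectorizer, never re-fitted.
new_messages = ["free prize", "lunch at noon tomorrow"]
new_counts = toy_vectorizer.transform(new_messages)

new_predictions = toy_model.predict(new_counts)
new_probabilities = toy_model.predict_proba(new_counts)

for message, label, probs in zip(new_messages, new_predictions, new_probabilities):
    print(message, "->", "spam" if label == 1 else "ham", probs)
```

`predict_proba` returns one column per class, ordered as in `toy_model.classes_`, which is useful when a filter should only act above a chosen confidence threshold.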


## Evaluating the Model

Now we want to evaluate how well our model is doing. There are several metrics for doing so:

1. **Accuracy**: `[Correct Predictions/Total Number of Predictions]`

2. **Precision**: `[True Positives/(True Positives + False Positives)]`

3. **Recall (sensitivity)**: `[True Positives/(True Positives + False Negatives)]`

4. **F1 score**: the harmonic mean of the precision and recall scores, `[2 * Precision * Recall/(Precision + Recall)]`. This score can range from 0 to 1, with 1 being the best possible F1 score.
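The four metrics can be worked through by hand on a small example. The labels and predictions below are invented for illustration (1 = spam, 0 = ham) and have nothing to do with the model's actual output.

```python
# Hypothetical true labels and predictions for ten messages.
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(true_labels, predicted_labels))
true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (true_pos + true_neg) / len(pairs)
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy: {:.2f}".format(accuracy))
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
print("F1: {:.2f}".format(f1))
```

Here the accuracy looks reasonable even though half of the spam messages are missed, which is why precision, recall, and F1 are reported alongside accuracy when one class is much rarer than the other.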

In [131]:
# Evaluation
print("Accuracy: {:.2f}%".format(accuracy_score(y_test, predictions)*100))
print("Precision Score: {:.2f}%".format(precision_score(y_test, predictions)*100))
print("Recall Score: {:.2f}%".format(recall_score(y_test, predictions)*100))
print("F1 Score: {:.2f}%".format(f1_score(y_test, predictions)*100))

Accuracy: 98.78%
Precision Score: 96.15%
Recall Score: 94.59%
F1 Score: 95.37%
