# Spam Classifier with Naive Bayes

## Our Mission
Spam detection is one of the major applications of machine learning on the web today. Nearly all major email service providers have built-in spam detection systems that automatically classify such mail as 'Junk Mail'.

In this mission we will use the Naive Bayes algorithm to create a model that classifies SMS messages from a dataset as spam or not spam, based on the training we give the model. It helps to have some intuition about what a spammy text message looks like. Such messages often contain words like 'free', 'win', 'winner', 'cash' and 'prize', because they are designed to catch your eye and tempt you to open them. Spam messages also tend to use words written in all capitals and a lot of exclamation marks. To the recipient, a spam text is usually easy to identify; our objective here is to train a model to do that for us!

Identifying spam messages is a binary classification problem, as messages are classified as either 'Spam' or 'Not Spam' and nothing else. It is also a supervised learning problem, as we will feed a labelled dataset into the model, which it can learn from to make future predictions.

## Overview
This project has been broken down into the following steps:

Step 0: Introduction to the Naive Bayes Theorem
Step 1.1: Understanding our dataset
Step 1.2: Data Preprocessing
Step 2.1: Bag of Words (BoW)
Step 2.2: Implementing BoW from scratch
Step 2.3: Implementing Bag of Words in scikit-learn
Step 3.1: Training and testing sets
Step 3.2: Applying Bag of Words processing to our dataset
Step 4.1: Bayes Theorem implementation from scratch
Step 4.2: Naive Bayes implementation from scratch
Step 5: Naive Bayes implementation using scikit-learn
Step 6: Evaluating our model
Step 7: Conclusion

## Understanding Dataset

In [44]:
import pandas as pd
from zipfile import ZipFile

In [47]:
file_name = "smsspamcollection.zip"

In [49]:
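# Open the archive, list its contents, and extract them into the working directory.
# The handle is named `archive` so it does not shadow the built-in zip().
with ZipFile(file_name, 'r') as archive:
    archive.printdir()
    archive.extractall()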

File Name                                             Modified             Size
SMSSpamCollection                              2011-03-15 22:36:02       477907
readme                                         2011-04-18 14:53:56         5868


In [50]:
import os
print(os.listdir())
file_name_comp = "SMSSpamCollection"

['.git', 'data.csv', 'decision_trees.ipynb', 'Naivebayes_spamClassifier.ipynb', 'Naive_bayes.ipynb', 'readme', 'SMSSpamCollection', 'smsspamcollection.zip', 'titanic_data.csv', 'titanic_survival_exploration.ipynb']


In [51]:
# The file is tab-separated with no header row, so column names are supplied explicitly.
df = pd.read_table(file_name_comp, sep='\t', header=None, names=['label', 'sms_message'])
df.head()

Unnamed: 0,label,sms_message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Data Preprocessing

In [52]:
# Encode the labels numerically: ham -> 0, spam -> 1
df["label"] = df.label.map({'ham':0, 'spam':1})
df["label"].shape

(5572,)

## Bag of Words

In [53]:
# Example 
# Convert all strings to their lower case form.
documents = ['Hello,!!! how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
lower_case_documents = []
for i in documents:
    lower_case_documents.append(i.lower())
print(lower_case_documents)

['hello,!!! how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']


In [54]:
# Remove all punctuation
import string

sans_punctuation_documents = []
for i in lower_case_documents:
    # str.translate with a deletion table strips every punctuation character
    # (Python 3 form; the Python 2 call was i.translate(None, string.punctuation)).
    sans_punctuation_documents.append(i.translate(str.maketrans('', '', string.punctuation)))
print(sans_punctuation_documents)

['hello how are you', 'win money win from home', 'call me now', 'hello call hello you tomorrow']


In [55]:
# Tokenization 
preprocessed_documents = []
for i in sans_punctuation_documents:
    preprocessed_documents.append(i.split(" "))
print(preprocessed_documents)

[['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]


In [56]:
# Count the frequency of each word in a document
import pprint
from collections import Counter

frequency_list = []
for i in preprocessed_documents:
    frequency_list.append(Counter(i))
pprint.pprint(frequency_list)

[Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'win': 2, 'money': 1, 'from': 1, 'home': 1}),
 Counter({'call': 1, 'me': 1, 'now': 1}),
 Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1})]
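
The four steps above (lowercasing, punctuation removal, tokenization, counting) can be folded into one helper that also lines the counts up into a matrix. This is a minimal sketch rather than the notebook's own code; the function name `bag_of_words` is an assumption made for illustration.

```python
import string
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, matrix) with one row of word counts per document.

    `bag_of_words` is a hypothetical helper name, not part of any library.
    """
    strip_punct = str.maketrans('', '', string.punctuation)
    # Lowercase, drop punctuation, split on whitespace, then count tokens.
    counts = [Counter(doc.lower().translate(strip_punct).split())
              for doc in documents]
    # A sorted vocabulary gives every word a fixed column position.
    vocabulary = sorted(set().union(*counts))
    # Counter returns 0 for words that a document does not contain.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


documents = ['Hello,!!! how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so every document becomes a fixed-length vector regardless of how many words it contains.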


In [57]:
# Implementing BoW in scikit-learn using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

count_vector = CountVectorizer(lowercase=True)

In [58]:
count_vector.fit(documents)

CountVectorizer()

In [59]:
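# get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions call count_vector.get_feature_names_out() instead.
count_vector.get_feature_names()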

['are',
 'call',
 'from',
 'hello',
 'home',
 'how',
 'me',
 'money',
 'now',
 'tomorrow',
 'win',
 'you']

In [60]:
doc_array = count_vector.transform(documents).toarray()
doc_array

array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)

In [61]:
# On scikit-learn >= 1.0, count_vector.get_feature_names_out() replaces get_feature_names().
frequency_matrix = pd.DataFrame(doc_array, columns=count_vector.get_feature_names())
frequency_matrix

Unnamed: 0,are,call,from,hello,home,how,me,money,now,tomorrow,win,you
0,1,0,0,1,0,1,0,0,0,0,0,1
1,0,0,1,0,1,0,0,1,0,0,2,0
2,0,1,0,0,0,0,1,0,1,0,0,0
3,0,1,0,2,0,0,0,0,0,1,0,1
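
One detail worth knowing for SMS text: with its default settings, `CountVectorizer` lowercases the input and keeps only tokens of two or more word characters, so single-letter words such as 'u' never become features. The sketch below approximates that tokenization with the standard `re` module; the `tokenize` helper is hypothetical, and the pattern is assumed to match the library's default `token_pattern`.

```python
import re

# Assumption: this mirrors CountVectorizer's default token_pattern,
# r"(?u)\b\w\w+\b", which keeps tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and return its tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())


tokens_plain = tokenize('Hello, Call hello you tomorrow?')
tokens_sms = tokenize('Ok lar... Joking wif u oni...')
print(tokens_plain)
print(tokens_sms)   # the single-letter token 'u' is dropped
```

If single-character tokens matter for a dataset, the `token_pattern` parameter of `CountVectorizer` can be changed when the vectorizer is instantiated.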


## Training and Testing

In [62]:
from sklearn.model_selection import train_test_split

In [63]:
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)
print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393


In [64]:
# Instantiate the count vector method
count_vector = CountVectorizer()

In [65]:
# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)
training_data

<4179x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

In [66]:
testing_data = count_vector.transform(X_test)
testing_data

<1393x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 17604 stored elements in Compressed Sparse Row format>
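
Both matrices have 7456 columns because the vocabulary is learned from the training messages only and then reused for the test messages; that is why the test set goes through `transform` rather than `fit_transform`. The pure-Python sketch below illustrates the idea on toy data. The helper names `fit_vocabulary` and `transform` are assumptions for illustration and are not the scikit-learn implementation.

```python
from collections import Counter


def fit_vocabulary(train_docs):
    """Learn a word -> column index mapping from the training documents only."""
    words = sorted({w for doc in train_docs for w in doc.lower().split()})
    return {word: idx for idx, word in enumerate(words)}


def transform(docs, vocabulary):
    """Count known words per document; words outside the vocabulary are dropped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word, n in Counter(doc.lower().split()).items():
            if word in vocabulary:
                row[vocabulary[word]] = n
        rows.append(row)
    return rows


train_docs = ['win money now', 'call me now']
test_docs = ['win a prize now']   # 'a' and 'prize' never appear in training

vocabulary = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocabulary)
test_matrix = transform(test_docs, vocabulary)
print(vocabulary)
print(train_matrix)
print(test_matrix)
```

Fitting the vocabulary on the test set as well would leak information about unseen data into the features and would change the column layout the model was trained on.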

### Bayes Theorem implementation from scratch

In [67]:
# P(D)
p_diabetes = 0.01

# P(~D)
p_no_diabetes = 0.99

# Sensitivity or P(Pos|D)
p_pos_diabetes = 0.9

# Specificity or P(Neg|~D)
p_neg_no_diabetes = 0.9

# P(Pos)
p_pos = (p_diabetes  *  p_pos_diabetes ) + (p_no_diabetes * (1 - p_neg_no_diabetes))
print('The probability of getting a positive test result P(Pos) is: {}'.format(p_pos))

The probability of getting a positive test result P(Pos) is: 0.10799999999999998


In [68]:
# P(D|Pos) = P(D) * P(Pos|D) / P(Pos)
p_diabetes_pos = (p_diabetes * p_pos_diabetes) / p_pos
print('Probability of an individual having diabetes, given that that individual got a positive test result is:',
      format(p_diabetes_pos))

Probability of an individual having diabetes, given that that individual got a positive test result is: 0.08333333333333336


In [77]:
# P(Pos|~D) is the false positive rate: 1 - specificity
p_pos_no_diabetes = 1 - p_neg_no_diabetes

# P(~D|Pos) = P(~D) * P(Pos|~D) / P(Pos)
p_no_diabetes_pos = (p_no_diabetes * p_pos_no_diabetes) / p_pos

print('Probability of an individual not having diabetes, given that that individual got a positive test result is: {:.4f}'.format(p_no_diabetes_pos))

Probability of an individual not having diabetes, given that that individual got a positive test result is: 0.9167
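
The same calculation can be wrapped in a small function, which makes a useful sanity check easy: the two posteriors, P(D|Pos) and P(~D|Pos), must add up to 1. This is a sketch under the numbers used above; the function name `posterior_given_positive` is hypothetical.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """Return (P(D|Pos), P(~D|Pos)) for a binary test via Bayes' theorem."""
    p_pos_given_d = sensitivity
    p_pos_given_not_d = 1 - specificity   # false positive rate
    # Total probability of a positive result, P(Pos).
    p_pos = prior * p_pos_given_d + (1 - prior) * p_pos_given_not_d
    p_d_given_pos = prior * p_pos_given_d / p_pos
    p_not_d_given_pos = (1 - prior) * p_pos_given_not_d / p_pos
    return p_d_given_pos, p_not_d_given_pos


p_d, p_not_d = posterior_given_positive(prior=0.01, sensitivity=0.9, specificity=0.9)
print('P(D|Pos)  = {:.4f}'.format(p_d))
print('P(~D|Pos) = {:.4f}'.format(p_not_d))
print('Sum       = {:.4f}'.format(p_d + p_not_d))
```

A posterior greater than 1, or a pair that does not sum to 1, signals that a wrong conditional probability was plugged into the numerator.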


### Naive Bayes implementation from scratch for multiple variables

In [78]:
# P(J)
p_j = 0.5

# P(F|J)
p_j_f = 0.1

# P(I|J)
p_j_i = 0.1

# P(J) * P(F|J) * P(I|J)
p_j_text = p_j * p_j_i * p_j_f
print(p_j_text)

0.005000000000000001


In [79]:
# P(G)
p_g = 0.5

# P(F|G)
p_g_f = 0.7

# P(I|G)
p_g_i = 0.2

# P(G) * P(F|G) * P(I|G)
p_g_text = p_g * p_g_f * p_g_i
print(p_g_text)

0.06999999999999999


In [80]:
# P(F,I): total probability of both words being said, summed over both speakers
p_f_i = p_g_text + p_j_text
print('Probability of the words freedom and immigration being said is: ', format(p_f_i))

Probability of the words freedom and immigration being said is:  0.075


In [82]:
# P(J|F,I)
p_j_fi = p_j_text / p_f_i
print('The probability that Jill Stein is the speaker, given the words Freedom and Immigration: ', format(p_j_fi))

The probability that Jill Stein is the speaker, given the words Freedom and Immigration:  0.06666666666666668


In [84]:
# P(G|F,I)
p_g_fi = p_g_text / p_f_i
print('The probability that Gary Johnson is the speaker, given the words Freedom and Immigration: ', format(p_g_fi))

The probability that Gary Johnson is the speaker, given the words Freedom and Immigration:  0.9333333333333332
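
The cells above follow one pattern: multiply each class prior by the per-word likelihoods (the "naive" independence assumption), then divide by the sum over all classes. A minimal sketch of that pattern for any number of classes and words follows; the function name and the dictionary layout are assumptions chosen for illustration.

```python
from math import prod


def naive_bayes_posteriors(priors, likelihoods):
    """Compute P(class | words) for every class.

    priors:      {class: P(class)}
    likelihoods: {class: [P(word_1 | class), P(word_2 | class), ...]}
    """
    # Joint score per class: P(class) * product of P(word | class).
    joint = {c: priors[c] * prod(likelihoods[c]) for c in priors}
    # Normalizing constant: total probability of the observed words.
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


posteriors = naive_bayes_posteriors(
    priors={'jill': 0.5, 'gary': 0.5},
    likelihoods={'jill': [0.1, 0.1],    # P(F|J), P(I|J)
                 'gary': [0.7, 0.2]},   # P(F|G), P(I|G)
)
for name, p in posteriors.items():
    print('{}: {:.4f}'.format(name, p))
```

Because the scores are normalized by their sum, the posteriors always add up to 1, which matches the two results printed above.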


### Naive Bayes Implementation using Scikit-learn

In [87]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB()

In [90]:
predictions = naive_bayes.predict(testing_data)
predictions

array([0, 0, 0, ..., 0, 1, 0], dtype=int64)
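
To connect the library call back to the from-scratch calculation, here is a rough pure-Python sketch of a multinomial Naive Bayes classifier on word counts. It assumes add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of `MultinomialNB`, and works in log space to avoid underflow. The function names and toy messages are made up for illustration, and this is not the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocabulary = {w for doc in docs for w in doc}
    word_counts = defaultdict(Counter)   # class -> word -> count
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Smoothed denominator: class word total plus alpha per vocabulary word.
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                             for w in vocabulary}
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_likelihood[c][w]
                                       for w in doc if w in log_likelihood[c])
    return max(scores, key=scores.get)


# Toy, hand-written training messages (1 = spam, 0 = ham).
train_docs = [['win', 'cash', 'prize'], ['free', 'cash', 'now'],
              ['call', 'me', 'tomorrow'], ['see', 'you', 'at', 'home']]
train_labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(train_docs, train_labels)
pred_spam = predict(['free', 'prize'], log_prior, log_likelihood)
pred_ham = predict(['call', 'you', 'tomorrow'], log_prior, log_likelihood)
print(pred_spam, pred_ham)
```

Smoothing matters because a word that never appeared in one class would otherwise contribute a zero probability and wipe out the whole product for that class.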

## Evaluating our model

In [91]:
## Accuracy : ratio of the number of correct predictions to the total number of predictions 
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [92]:
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))

Accuracy score:  0.9885139985642498


In [94]:
## Precision : ratio of true positives (messages classified as spam that are actually spam) to all positives
## (all messages classified as spam, whether or not that was correct): True Positives / (True Positives + False Positives)
print('Precision score: ', format(precision_score(y_test, predictions)))

Precision score:  0.9720670391061452


In [95]:
## Recall : ratio of true positives (messages classified as spam that are actually spam) to all messages
## that were actually spam: True Positives / (True Positives + False Negatives)
print('Recall score: ', format(recall_score(y_test, predictions)))

Recall score:  0.9405405405405406


In [97]:
## F1 score : the harmonic mean of precision and recall, ranging from 0 to 1
print('F1 score: ', format(f1_score(y_test, predictions)))

F1 score:  0.9560439560439562
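
The four scores above can all be derived from the counts of true/false positives and negatives. The sketch below computes them by hand on a tiny made-up label set (not the SMS data), assuming 1 marks spam as in the label mapping used earlier; the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) with spam (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Illustrative labels only: 3 spam and 5 ham messages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
print('Accuracy:  {:.4f}'.format(accuracy))
print('Precision: {:.4f}'.format(precision))
print('Recall:    {:.4f}'.format(recall))
print('F1:        {:.4f}'.format(f1))
```

On a skewed dataset like this one, where most messages are ham, accuracy alone can look high even when many spam messages are missed, which is why precision, recall and F1 are reported alongside it.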
