<link rel='stylesheet' href='../assets/css/main.css'/>

[<< back to main index](../README.md)

# Naive Bayes Spam Filtering

### Overview

Nobody likes spam, so a classifier that labels a message as spam or not spam ("ham") is useful. In this lab we train a Naive Bayes classifier on a collection of labeled SMS messages.

### Builds on
None

### Run time
approx. 20-30 minutes

### Notes

scikit-learn has a class called MultinomialNB (in sklearn.naive_bayes) that can be used to do Naive Bayes classification on count-like features such as word frequencies.

In [2]:
%matplotlib inline

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


## Step 1: Let's load the dataframe

We will load the data into a pandas dataframe.  Since the outcome is either "ham" or "spam", the "isspam" column will be our label.

In [5]:
t1 = time.perf_counter()

dataset = pd.read_csv("/data/spam/SMSSpamCollection.txt", sep='\t')

t2 = time.perf_counter() 

print("read {:,} records in {:,.2f} ms".format(len(dataset), (t2-t1)*1000))

dataset

read 5,572 records in 14.78 ms


|   | isspam | text |
|---|--------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [6]:
## Count spam/ham
dataset.groupby("isspam").size()

isspam
ham     4825
spam     747
dtype: int64

In [7]:
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(dataset.text, 
                                                   dataset.isspam,
                                                   test_size=0.25)
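
Only about 13% of the messages are spam (747 of 5,572), so a purely random split can give the training and test sets slightly different class ratios, and the split changes on every run. The following is a minimal sketch of a stratified, repeatable split. The toy dataframe and the `random_state=42` value are assumptions made for illustration and are not part of the lab.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data using the same column names as the lab's dataset.
toy = pd.DataFrame({
    "isspam": ["ham"] * 16 + ["spam"] * 4,
    "text": ["message {}".format(i) for i in range(20)],
})

# stratify keeps the ham/spam ratio the same in both splits;
# random_state makes the split repeatable across runs.
toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy.text, toy.isspam,
    test_size=0.25,
    stratify=toy.isspam,
    random_state=42)

print(toy_train_y.value_counts().to_dict())
print(toy_test_y.value_counts().to_dict())
```

Applying the same two arguments to the lab's split would make results comparable between runs, at the cost of producing numbers that differ from the outputs shown here.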

## Step 2: Vectorize using tf/idf

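Let's use TF/IDF for vectorization at first.  TF/IDF counts the occurrences of each term in a message (the term frequency), and then scales each count down according to how many messages in the entire dataset contain that term (the inverse document frequency). Terms that appear in almost every message therefore carry less weight than rare ones.

This leads to very high-dimensional data, because every distinct word in the dataset becomes a dimension in the data.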
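
To make the weighting concrete, here is a small hand-rolled sketch of TF/IDF on a three-message corpus. It follows the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn's TfidfTransformer does with its default settings. The corpus and the whitespace tokenizer are assumptions for illustration; CountVectorizer tokenizes differently (for example, it lowercases text and drops one-character tokens).

```python
import math
from collections import Counter

# Hypothetical three-message corpus, for illustration only.
docs = ["free prize now", "call me now", "free free entry"]
tokenized = [d.split() for d in docs]

# Every distinct word becomes one dimension of the feature space.
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of messages that contain each term.
df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}

# Smoothed inverse document frequency.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(doc):
    """Return the L2-normalized TF/IDF vector of one tokenized message."""
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]

print(vocab)
print(len(vectors), "messages x", len(vocab), "dimensions")
```

Even this tiny corpus produces six dimensions. "free" and "now" each appear in two messages, so they get a lower idf than words that appear in only one.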

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB


pipeline = Pipeline([('vec', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

pipeline.fit(train_x, train_y)

Pipeline(memory=None,
     steps=[('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

## Step 6: Run test data

Let's call .predict on our pipeline to make predictions on our test data. The output is an array of predicted labels ("ham" or "spam"), one per test message.

We will be able to evaluate the model by comparing these predictions with the correct labels in test_y.

In [11]:
# Predict a label for every message in the test set.
predictions = pipeline.predict(test_x)
predictions


array(['ham', 'ham', 'spam', ..., 'ham', 'ham', 'ham'], 
      dtype='<U4')

## Step 7: Evaluate the model

Let's look at how our model performs.  We will start with accuracy, the fraction of test messages that were labeled correctly.

In [13]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)

0.94974874371859297

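Next, let's compute a confusion matrix. Rows are the true labels and columns are the predicted labels, both in the order ham, spam.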

In [14]:
from sklearn.metrics import confusion_matrix
confusion_matrix(test_y, predictions)

array([[1203,    0],
       [  70,  120]])
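
With classes this imbalanced, accuracy on its own can look better than it is. The sketch below recomputes the accuracy from the confusion matrix printed above and compares it with a trivial baseline that labels every message as ham. The numbers are copied from this particular run, and the baseline is an illustrative assumption rather than part of the lab.

```python
# Confusion matrix from the run above: rows are true labels, columns are
# predicted labels, both in alphabetical order (ham, spam).
cm = [[1203, 0],
      [70, 120]]

total = sum(sum(row) for row in cm)

# Correct predictions sit on the diagonal.
accuracy = (cm[0][0] + cm[1][1]) / total

# Baseline: a classifier that labels every message as ham is right
# exactly as often as a message really is ham.
baseline_accuracy = sum(cm[0]) / total

print("accuracy: {:.4f}".format(accuracy))
print("always-ham baseline: {:.4f}".format(baseline_accuracy))
```

The gap between the two figures is a fairer measure of what the model has learned than the raw accuracy.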

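Hmmm.. the positive case (spam) didn't do as well as the negative case (ham): 70 of the 190 spam messages were predicted as ham, while no ham message was predicted as spam. Let's calculate precision, recall, and F1.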

In [33]:
from sklearn.metrics import precision_recall_fscore_support
pd.DataFrame(list(precision_recall_fscore_support(test_y, predictions)),
             columns=['ham', 'spam'],
             index=['Precision', 'Recall', "F1", "Support"])

|           | ham      | spam     |
|-----------|----------|----------|
| Precision | 0.945012 | 1.0      |
| Recall    | 1.0      | 0.631579 |
| F1        | 0.971729 | 0.774194 |
| Support   | 1203.0   | 190.0    |
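
The spam column can be checked by hand from the confusion matrix. This is a minimal sketch that treats spam as the positive class and uses the counts from the run shown above; the variable names are illustrative.

```python
# Counts taken from the confusion matrix of this run, with spam as the
# positive class.
tn, fp = 1203, 0   # true ham:  predicted ham, predicted spam
fn, tp = 70, 120   # true spam: predicted ham, predicted spam

# Precision: of the messages flagged as spam, how many really were spam?
precision = tp / (tp + fp)

# Recall: of the real spam messages, how many did we catch?
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print("precision={:.6f} recall={:.6f} f1={:.6f}".format(precision, recall, f1))
```

Perfect precision with low recall means the filter never flags a legitimate message, but it lets more than a third of the spam through.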


## Step 8:  Run your own test

Now it's your turn!   Make a new dataframe with some sample test data of your own creation.  Make some "spammy" SMSes and some ordinary ones.  See how our spam filter does.

In [None]:
# TODO: make a dataframe with some of your own data.

mydata = pd.DataFrame({'isspam' : ['spam', 'ham', ...],
              'text' : ['My text here', 'My Text Here 2', ...]
             })
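
One possible way to fill in the cell above is sketched here. So that it runs on its own, it fits a stand-in `demo_pipeline` on a tiny made-up corpus; in the lab you would call `pipeline.predict` on the pipeline trained earlier instead. All messages, labels, and the `demo_pipeline` and `predicted` names are hypothetical.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Stand-in for the lab's trained pipeline, fitted on a tiny hypothetical
# corpus so this sketch is self-contained. In the lab, use `pipeline`.
demo_pipeline = Pipeline([('vec', CountVectorizer()),
                          ('tfidf', TfidfTransformer()),
                          ('clf', MultinomialNB())])
demo_pipeline.fit(
    ["win a free prize now", "free entry call now to claim",
     "are we still meeting for lunch", "see you at home tonight"],
    ["spam", "spam", "ham", "ham"])

# Hypothetical hand-written messages with the labels we expect.
mydata = pd.DataFrame({
    'isspam': ['spam', 'ham', 'spam', 'ham'],
    'text': ['Claim your free prize now',
             'Are you home for lunch',
             'Free entry, call now',
             'See you tonight'],
})

# Store the predictions next to the expected labels for easy comparison.
mydata['predicted'] = demo_pipeline.predict(mydata.text)
print(mydata)
```

Comparing the "isspam" and "predicted" columns row by row shows which of your messages fooled the filter.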


## BONUS: Word2Vec Instead of TF/IDF

We used the TF/IDF encoding. We might get better results if we use Word2Vec instead. Run with Word2Vec and see if you get a better accuracy rate.