# Spam Filter Using Naive Bayes

We'll start by using sklearn.naive_bayes to train a spam classifier! Most of the code is just loading our training data into a pandas DataFrame that we can play with:

In [None]:
import os
import io
import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

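def readFiles(path):
    # Walk the directory tree and yield (file path, message body) for each email.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)

            inBody = False
            lines = []
            # latin1 maps every byte to a character, so decoding never fails.
            with io.open(filepath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        # The first blank line separates the headers from the body.
                        inBody = True
            # Each line keeps its own '\n', so this join doubles the line breaks.
            message = '\n'.join(lines)
            yield filepath, message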


def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

import pandas as pd

# DataFrame.append was removed in pandas 2.0; pd.concat builds the same combined frame.
data = pd.concat([
    dataFrameFromDirectory('/Data/emails/spam', 'spam'),
    dataFrameFromDirectory('/Data/emails/ham', 'ham'),
])


Let's have a look at that DataFrame:

In [None]:
data.message.head(10)

Now we will use a CountVectorizer to convert each message into a vector of word counts, and feed those counts into a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go! It's just that easy. First, a quick look at the size of the data and the class balance.
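To make the idea concrete, here is a from-scratch sketch of word counting followed by multinomial Naive Bayes with add-one smoothing. It is not scikit-learn's implementation; `tokenize`, `train`, `predict`, and the toy messages are hypothetical names and data invented for illustration, and the token pattern only approximates CountVectorizer's default.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of two or more word characters,
    # roughly what CountVectorizer does by default.
    return re.findall(r"\b\w\w+\b", text.lower())


def train(messages, labels, alpha=1.0):
    # Count word occurrences per class (the "bag of words").
    word_counts = {}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set().union(*word_counts.values())

    model = {}
    for label, counts in word_counts.items():
        # Smoothing (alpha) keeps unseen class/word pairs from getting zero probability.
        total = sum(counts.values()) + alpha * len(vocab)
        log_prior = math.log(class_counts[label] / len(labels))
        log_like = {w: math.log((counts[w] + alpha) / total) for w in vocab}
        model[label] = (log_prior, log_like)
    return model


def predict(model, text):
    scores = {}
    for label, (log_prior, log_like) in model.items():
        # Words outside the training vocabulary are ignored.
        scores[label] = log_prior + sum(
            log_like[w] for w in tokenize(text) if w in log_like
        )
    return max(scores, key=scores.get)


# Toy data, invented for this sketch
toy_messages = ["free money now", "win free prize now",
                "lunch meeting tomorrow", "project status meeting"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_model = train(toy_messages, toy_labels)
print(predict(toy_model, "free prize"))             # spam
print(predict(toy_model, "status of the meeting"))  # ham
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.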

In [None]:
data.shape

In [None]:
data['class'].value_counts()

In [None]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

In [None]:
len(vectorizer.vocabulary_)

Let's try it out:

In [37]:
examples = ['Free Viagra now!!!',
            "Hi Bob, how about a game of golf tomorrow?",
            "Get discounts on medicines and free offer if you buy movies.",
            "GIFTING you FREE now"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham', 'spam', 'spam'], dtype='<U4')

In [38]:
examples = ["Protect your family's financial future from GreatOffers.com"]

example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam'], dtype='<U4')

## Train Test Split and Accuracy

Our data set is small, so our spam classifier isn't actually very good. Try running some different test emails through it and see if you get the results you expect.

If you really want to challenge yourself, try applying a train/test split to this spam classifier and see how well it predicts a held-out subset of the ham and spam emails.

In [39]:
len(vectorizer.vocabulary_)

62964

In [40]:
vectorizer.vocabulary_

{'doctype': 20407,
 'html': 28844,
 'public': 44554,
 'w3c': 57486,
 'dtd': 21111,
 'transitional': 54131,
 'en': 22319,
 'head': 27856,
 'meta': 36946,
 'content': 17466,
 '3d': 3669,
 'text': 53112,
 'charset': 15912,
 '3dwindows': 3865,
 '1252': 1193,
 'http': 28855,
 'equiv': 22714,
 '3dcontent': 3777,
 'ype': 60830,
 'mshtml': 38010,
 '00': 0,
 '2314': 2507,
 '1000': 878,
 'name': 38606,
 '3dgenerator': 3793,
 'body': 13853,
 'inserted': 30798,
 'by': 14755,
 'calypso': 15154,
 'table': 52559,
 'border': 13981,
 '3d0': 3670,
 'cellpadding': 15671,
 'cellspacing': 15676,
 '3d2': 3687,
 'id': 29548,
 '3d_calyprintheader_': 3751,
 'ules': 55210,
 '3dnone': 3822,
 'style': 51599,
 'color': 16855,
 'black': 13539,
 'display': 20085,
 'none': 39475,
 'width': 58579,
 '100': 877,
 'tbody': 52779,
 'tr': 54034,
 'td': 52828,
 'colspan': 16868,
 '3d3': 3701,
 'hr': 28772,
 '3dblack': 3768,
 'noshade': 39549,
 'size': 50047,
 '3d1': 3672,
 'end': 22373,
 'font': 24768,
 '000000': 3,
 'face'

In [41]:
from sklearn.model_selection import train_test_split

# Hold out 30% of the messages for testing; random_state makes the split reproducible.
X = counts
y = targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

In [43]:
X_test.shape

(900, 62964)

In [42]:
classifier.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [44]:
print('Train Accuracy:{0: .3f}%'.format(classifier.score(X_train, y_train) * 100))
print('Test Accuracy:{0: .3f}%'.format(classifier.score(X_test, y_test) * 100))

Train Accuracy: 97.762%
Test Accuracy: 94.667%
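Note that `counts` was produced by a vectorizer fitted on every message before the split, so the vocabulary already includes words from the test emails. One way to avoid that, sketched here under the assumption that scikit-learn's `make_pipeline` is available, is to split the raw text first and let a pipeline fit the vectorizer on the training messages only. The toy messages and the name `spam_pipeline` are invented for illustration; in the notebook you would pass `data['message']` and `data['class']` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for data['message'] and data['class']
toy_messages = [
    "free prize claim now", "cheap pills free offer",
    "win cash prize today", "limited offer buy now",
    "lunch meeting tomorrow noon", "project status report attached",
    "golf game this weekend", "family dinner on sunday",
]
toy_labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw text, not the count matrix; stratify keeps both classes in each part.
msg_train, msg_test, lbl_train, lbl_test = train_test_split(
    toy_messages, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# fit() learns the vocabulary from the training messages only;
# score() reuses that vocabulary to transform the test messages.
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(msg_train, lbl_train)
print(spam_pipeline.score(msg_test, lbl_test))

learned_vocab = spam_pipeline.named_steps["countvectorizer"].vocabulary_
print(len(learned_vocab))
```

With a corpus this tiny the score itself means little; the point is only that words appearing solely in the held-out messages never enter the vocabulary.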


## Cleaning Up HTML

In [45]:
from bs4 import BeautifulSoup

In [46]:
import re

In [47]:
def subst(text):
    # Parse the message as HTML and keep only the visible text.
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

# Optional further cleanup (unused): re.sub("[^a-zA-Z]", " ", text)

In [48]:
data['cleanmsg']=data['message'].apply(subst)

"..." looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.

In [49]:
data.head(10)

| | message | class | cleanmsg |
|---|---|---|---|
| `/Data/emails/spam\00001.7848dde101aa985090474a91ec93fcf0` | `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...` | spam | `\n\n\n\n\n\n\n\n\n\n\n\n\n\n<=\n\n/TR>\nSave u...` |
| `/Data/emails/spam\00002.d94f1b97e48ed3b553b3508d116e6a09` | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` | spam | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` |
| `/Data/emails/spam\00003.2ee33bc6eacdb11f38d052c44819ba6c` | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` | spam | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` |
| `/Data/emails/spam\00004.eac8de8d759b7e74154f142194282724` | `##############################################...` | spam | `##############################################...` |
| `/Data/emails/spam\00005.57696a39d7d84318ce497886896bf90d` | `I thought you might like these:\n\n1) Slim Dow...` | spam | `I thought you might like these:\n\n1) Slim Dow...` |
| `/Data/emails/spam\00006.5ab5620d3d7c6c0db76234556a16f6c1` | `A POWERHOUSE GIFTING PROGRAM You Don't Want To...` | spam | `A POWERHOUSE GIFTING PROGRAM You Don't Want To...` |
| `/Data/emails/spam\00007.d8521faf753ff9ee989122f6816f87d7` | `Help wanted.  We are a 14 year old fortune 500...` | spam | `Help wanted.  We are a 14 year old fortune 500...` |
| `/Data/emails/spam\00008.dfd941deb10f5eed78b1594b131c9266` | `<html>\n\n<head>\n\n<title>ReliaQuote - Save U...` | spam | `\n\nReliaQuote - Save Up To 70% On Life Insura...` |
| `/Data/emails/spam\00009.027bf6e0b0c4ab34db3ce0ea4bf2edab` | `TIRED OF THE BULL OUT THERE?\n\nWant To Stop L...` | spam | `TIRED OF THE BULL OUT THERE?\n\nWant To Stop L...` |
| `/Data/emails/spam\00010.445affef4c70feec58f9198cfbc22997` | `Dear ricardo1 ,\n\n\n\n<html>\n\n<body>\n\n<ce...` | spam | `Dear ricardo1 ,\n\n\n\n\n\n\nCOST EFFECTIVE Di...` |
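The first cleaned message still contains fragments such as `<=` and `/TR>`, and the earlier vocabulary listed tokens like `3dcontent`. Both suggest that some bodies are quoted-printable encoded, where `=3D` stands for `=` and a trailing `=` marks a soft line break. A minimal sketch, assuming that is the cause, decodes the text with the standard-library `quopri` module before stripping tags. `decode_quoted_printable` and the sample string are hypothetical and not part of the notebook.

```python
import quopri


def decode_quoted_printable(text, encoding="latin1"):
    # quopri works on bytes, so round-trip through the same encoding
    # that was used to read the files.
    raw = text.encode(encoding)
    return quopri.decodestring(raw).decode(encoding)


# Invented sample: "=3D" encodes "=", and "=\n" is a soft line break.
sample = '<td width=3D"100%" bgcolor=3D"black">Save up to 70%</=\nTD>'
print(decode_quoted_printable(sample))
# <td width="100%" bgcolor="black">Save up to 70%</TD>
```

Applying this to every message is only safe if the message really is quoted-printable; a plain-text body that happens to contain `=` followed by two hex digits would be altered, so checking the `Content-Transfer-Encoding` header first would be the more careful route.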


In [50]:
from sklearn.model_selection import train_test_split

# Cap the vocabulary at the 20,000 most frequent terms of the cleaned messages.
vectorizer = CountVectorizer(max_features=20000)
counts = vectorizer.fit_transform(data['cleanmsg'].values)
targets = data['class'].values

X = counts
y = targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Fit on the training split only so the test score reflects unseen messages.
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
print('Train Accuracy:{0: .3f}%'.format(classifier.score(X_train, y_train) * 100))
print('Test Accuracy:{0: .3f}%'.format(classifier.score(X_test, y_test) * 100))


Train Accuracy: 99.810%
Test Accuracy: 99.111%
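Accuracy alone does not show how the errors split between the two classes, and for a spam filter a ham email flagged as spam is usually the costlier mistake. As a rough sketch, precision and recall for the spam class can be computed from the true and predicted labels. `precision_recall` and the toy labels are hypothetical, made up for illustration.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    pairs = list(zip(y_true, y_pred))
    # True positives: spam correctly flagged as spam.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: ham wrongly flagged as spam.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: spam that slipped through as ham.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels, invented for this sketch
toy_true = ["spam", "ham", "ham", "spam", "ham"]
toy_pred = ["spam", "ham", "spam", "ham", "ham"]
print(precision_recall(toy_true, toy_pred))  # (0.5, 0.5)
```

In the notebook this could be called as `precision_recall(y_test, classifier.predict(X_test))`.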


In [None]:
len(vectorizer.vocabulary_)

In [None]:
examples = ['Free Viagra now!!!',
            "Hi Bob, how about a game of golf tomorrow?",
            "Get discounts on medicines and free offer if you buy movies.",
            "GIFTING you FREE now"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

In [None]:
examples = ['Offers rush to nearest store to buy your favorite car',
            "What is the status of the work",
            "Unlock rewards for your awesome profile"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions