# Building a Spam Classifier

__NOTE: This is the same notebook we covered in lesson 4. We will rerun the code to refresh our memories.__

Let's say we have a collection of 1000 emails, each already labeled as spam or not spam (ham).

We'll be using sklearn.naive_bayes to train a spam classifier. Most of the code here just loads the training data into a pandas DataFrame that we can play with:

In [5]:
import os
import io
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    # Walk the directory tree and yield (file path, message body) for every file.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)

            inBody = False
            lines = []
            # The headers end at the first blank line; keep only what follows.
            with io.open(filepath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True
            # Each line keeps its trailing newline, so joining doubles the line breaks.
            message = '\n'.join(lines)
            yield filepath, message

def dataFrameFromDirectory(path, classification):
    # Build a DataFrame with one row per email, indexed by file path.
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'class': classification, 'message': message})
        index.append(filename)

    return DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames instead.
data = pd.concat([
    dataFrameFromDirectory('emails/spam', 'spam'),
    dataFrameFromDirectory('emails/ham', 'ham'),
])





Let's have a look at our data set.

In [6]:
data.head()

| | class | message |
|---|---|---|
| emails/spam/00249.5f45607c1bffe89f60ba1ec9f878039a | spam | Dear Homeowner,\n\n \n\nInterest Rates are at ... |
| emails/spam/00373.ebe8670ac56b04125c25100a36ab0510 | spam | ATTENTION: This is a MUST for ALL Computer Use... |
| emails/spam/00214.1367039e50dc6b7adb0f2aa8aba83216 | spam | This is a multi-part message in MIME format.\n... |
| emails/spam/00210.050ffd105bd4e006771ee63cabc59978 | spam | IMPORTANT INFORMATION:\n\n\n\nThe new domain n... |
| emails/spam/00033.9babb58d9298daa2963d4f514193d7d6 | spam | This is the bottom line. If you can GIVE AWAY... |


Python can quickly carry out all the conditional probability steps we have learnt up to this point.

The following code has two main variables to look at:
1. counts: the number of times each word appears in each message
2. targets: the classification of each message, in this case spam or ham

We will use CountVectorizer to split up each message into its list of words, and throw that into a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go!
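To make the word-count idea concrete, here is a minimal sketch of the bag-of-words step using only the standard library. It approximates CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, an alphabetically sorted vocabulary) and is not the library's implementation. The names `tokenize`, `build_vocabulary`, and `count_matrix` are hypothetical helpers made up for this illustration.

```python
import re
from collections import Counter

# Assumption: mimic CountVectorizer's default tokens (two or more word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    # Lowercase the text and split it into word tokens.
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(messages):
    # Map every distinct word to a column index, in alphabetical order.
    words = sorted({word for message in messages for word in tokenize(message)})
    return {word: column for column, word in enumerate(words)}

def count_matrix(messages, vocabulary):
    # One row per message, one column per vocabulary word.
    rows = []
    for message in messages:
        counts = Counter(tokenize(message))
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

messages = ["Free money now", "Free lunch, free coffee"]
vocabulary = build_vocabulary(messages)
matrix = count_matrix(messages, vocabulary)

print(vocabulary)  # {'coffee': 0, 'free': 1, 'lunch': 2, 'money': 3, 'now': 4}
print(matrix)      # [[0, 1, 0, 1, 1], [1, 2, 1, 0, 0]]
```

Each row is what the classifier sees for one message: word order is thrown away and only the counts remain.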

In [7]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
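To connect this back to conditional probability, here is a rough from-scratch sketch of the calculation a multinomial Naive Bayes classifier performs. It is a simplified illustration on made-up toy data, not scikit-learn's code: `train` and `predict` are hypothetical functions, tokenizing is a plain whitespace split, and `alpha=1.0` is the same add-one (Laplace) smoothing shown in the output above.

```python
import math
from collections import Counter, defaultdict

def train(messages, labels, alpha=1.0):
    # Count how many messages fall in each class and how often each word appears per class.
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(messages, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary, alpha

def predict(model, text):
    class_counts, word_counts, vocabulary, alpha = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Start from the log prior, log P(class).
        log_prob = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add log P(word | class), smoothed so unseen pairs never get probability zero.
            numerator = word_counts[label][word] + alpha
            denominator = total_words + alpha * len(vocabulary)
            log_prob += math.log(numerator / denominator)
        scores[label] = log_prob
    # Pick the class with the highest posterior score.
    return max(scores, key=scores.get)

# Toy data, invented purely for this illustration.
toy_messages = ["free money now", "win free prize", "lunch at noon", "meeting at noon tomorrow"]
toy_labels = ["spam", "spam", "ham", "ham"]
model = train(toy_messages, toy_labels)

print(predict(model, "free prize"))      # spam
print(predict(model, "lunch tomorrow"))  # ham
```

Working in log space avoids multiplying many tiny probabilities together, which would underflow on real emails with hundreds of words.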

Let's try it out!

We will create our own list of strings, convert it into word counts with the same vectorizer, and pass the result to the classifier's predict method.

In [13]:
examples = ['Free Viagra now!!!', "Hey Bob, how about a game of golf tomorrow?", "I am a prince from Nigeria"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham', 'spam'], dtype='<U4')