# Naive Bayes (the easy way)

We'll cheat by using sklearn.naive_bayes to train a spam classifier! Most of the code is just loading our training data into a pandas DataFrame that we can play with:

In [46]:
import os
import io
import numpy
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

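def readFiles(path):
    # Walk the directory tree and yield (file path, message text) for every file
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            file_path = os.path.join(root, filename)

            # inBody starts as True, so every line (including the Subject line) is kept.
            # Set it to False to skip everything up to the first blank line instead.
            inBody = True
            lines = []
            with io.open(file_path, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True
            # Each line keeps its trailing newline, so this join doubles the line breaks
            message = '\n'.join(lines)
            yield file_path, message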


def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

data = DataFrame({'message': [], 'class': []})

# Alternative dataset location:
#data = pd.concat([data, dataFrameFromDirectory("emails/spam", "spam")])
#data = pd.concat([data, dataFrameFromDirectory("emails/ham", "ham")])
data = pd.concat([data, dataFrameFromDirectory("emails/enron1/spam", "spam")])
data = pd.concat([data, dataFrameFromDirectory("emails/enron1/ham", "ham")])

# For pandas older than 2.0 (DataFrame.append was removed in pandas 2.0):
#data = data.append(dataFrameFromDirectory('emails/spam', 'spam'))
#data = data.append(dataFrameFromDirectory('emails/ham', 'ham'))


Let's have a look at that DataFrame:

In [47]:
data.head()

| | message | class |
|---|---|---|
| emails/enron1/spam/4743.2005-06-25.GP.spam.txt | Subject: what up , , your cam babe\n\nwhat are... | spam |
| emails/enron1/spam/1309.2004-06-08.GP.spam.txt | Subject: want to make more money ?\n\norder co... | spam |
| emails/enron1/spam/0726.2004-03-26.GP.spam.txt | Subject: food for thoughts\n\n[\n\njoin now - ... | spam |
| emails/enron1/spam/0202.2004-01-13.GP.spam.txt | Subject: miningnews . net newsletter - tuesday... | spam |
| emails/enron1/spam/3988.2005-03-06.GP.spam.txt | Subject: your pharmacy ta\n\nwould you want ch... | spam |


Now we will use a CountVectorizer to turn each message into a vector of word counts (one column per word in the vocabulary), and feed those counts into a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go! It's just that easy.

In [48]:
vectorizer = CountVectorizer()
print(data['message'].values)
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

["Subject: what up , , your cam babe\n\nwhat are you looking for ?\n\nif your looking for a companion for friendship , love , a date , or just good ole '\n\nfashioned * * * * * * , then try our brand new site ; it was developed and created\n\nto help anyone find what they ' re looking for . a quick bio form and you ' re\n\non the road to satisfaction in every sense of the word . . . . no matter what\n\nthat may be !\n\ntry it out and youll be amazed .\n\nhave a terrific time this evening\n\ncopy and pa ste the add . ress you see on the line below into your browser to come to the site .\n\nhttp : / / www . meganbang . biz / bld / acc /\n\nno more plz\n\nhttp : / / www . naturalgolden . com / retract /\n\ncounterattack aitken step preemptive shoehorn scaup . electrocardiograph movie honeycomb . monster war brandywine pietism byrne catatonia . encomia lookup intervenor skeleton turn catfish .\n"
 'Subject: want to make more money ?\n\norder confirmation . your order should be shipped by j
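To make the word-count step concrete, here is a minimal sketch on three made-up messages. The names `toy_messages`, `toy_vectorizer`, `toy_counts`, `vocabulary`, and `dense_counts` are hypothetical and not part of the notebook; the point is only to show what `fit_transform()` hands to the classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, not taken from the Enron data
toy_messages = ["free money now", "meeting at noon", "free free offer"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse matrix:
# one row per message, one column per distinct word
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each word to its column index; sort the words by that index
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
dense_counts = toy_counts.toarray()

print(vocabulary)
print(dense_counts)
```

Each row of `dense_counts` is one message, and the repeated word "free" in the third message shows up as a count of 2 in its column. The real `counts` matrix has the same layout, just with far more rows and columns.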

Let's try it out:

In [50]:
examples = [
    "Your package has been detained because the address and the zipcode are not matched",
    "UPS delivery failed",
    "Free Viagra now!!!",
    "hey julien how are you doing",
    "my husband just died and we found your name online",
    "you've been selected!!",
    "Hi Bob, how about a game of golf tomorrow?",
]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['ham', 'ham', 'spam', 'ham', 'spam', 'spam', 'ham'], dtype='<U4')
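If you want to see how confident the classifier is rather than just the label, `predict_proba()` returns one probability per class. The sketch below trains a tiny stand-in model on made-up messages so that it runs on its own; every `toy_*` name is hypothetical, and in the notebook you would call `classifier.predict_proba(example_counts)` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in training data; the notebook's model is trained on the Enron emails
toy_messages = ["free viagra now", "win free money",
                "golf tomorrow with bob", "lunch meeting tomorrow"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

toy_examples = ["free money", "golf tomorrow?"]
# Reuse the fitted vocabulary with transform(); do not call fit_transform() again
toy_example_counts = toy_vectorizer.transform(toy_examples)
toy_predictions = toy_classifier.predict(toy_example_counts)
toy_probabilities = toy_classifier.predict_proba(toy_example_counts)

# Columns of predict_proba follow the order of classifier.classes_
for text, label, probs in zip(toy_examples, toy_predictions, toy_probabilities):
    print(text, "->", label, dict(zip(toy_classifier.classes_, probs.round(3))))
```

Printing the text next to its label also makes a long list of test emails much easier to read than a bare array.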

## Activity

Our data set is small, so our spam classifier isn't actually very good. Try running some different test emails through it and see if you get the results you expect.

If you really want to challenge yourself, try applying a train/test split to this spam classifier: hold out a subset of the ham and spam emails, train on the rest, and see how well it predicts the held-out messages.
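One possible starting point for the train/test challenge, sketched on a tiny made-up corpus so it runs without the email files. The `messages` and `labels` lists and the `split_*` names are hypothetical; in the notebook you would pass `data['message'].values` and `data['class'].values` instead. The vectorizer is fitted on the training messages only, so the test set stays unseen.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in corpus; replace with the real messages and classes
messages = [
    "free viagra now", "win free money today", "claim your free prize", "cheap pills online",
    "golf tomorrow with bob", "lunch meeting at noon", "see you at the office", "project report attached",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Hold out 25% of the messages, keeping the spam/ham ratio the same in both parts
train_msgs, test_msgs, train_labels, test_labels = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels)

# Learn the vocabulary from the training messages only
split_vectorizer = CountVectorizer()
train_counts = split_vectorizer.fit_transform(train_msgs)
test_counts = split_vectorizer.transform(test_msgs)

split_classifier = MultinomialNB()
split_classifier.fit(train_counts, train_labels)

accuracy = accuracy_score(test_labels, split_classifier.predict(test_counts))
print(f"Held-out accuracy: {accuracy:.2f}")
```

With only a handful of messages the accuracy figure means very little; the structure is what carries over to the full data set.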