# Naive Bayes Classifiers: Spam Filter

Naive Bayes classifiers work by correlating the use of tokens (typically words, though other features can be used) with spam and non-spam emails, and then applying Bayes' theorem to calculate the probability that a given email is spam.
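To make the idea concrete, here is a minimal from-scratch sketch of that calculation. The four training messages are a hypothetical toy corpus invented for illustration, and the add-one (Laplace) smoothing is an assumption chosen so that unseen words do not produce zero probabilities; a real filter would be trained on far more data.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("lunch meeting tomorrow", "ham"),
    ("see you at the meeting", "ham"),
]

# Count how often each token appears in each class, and how many
# messages belong to each class.
token_counts = {"spam": Counter(), "ham": Counter()}
message_counts = Counter()
for text, label in training:
    token_counts[label].update(text.split())
    message_counts[label] += 1

vocabulary = set(token_counts["spam"]) | set(token_counts["ham"])

def log_score(text, label):
    """Return log P(label) + sum of log P(token | label), with add-one smoothing."""
    total_tokens = sum(token_counts[label].values())
    score = math.log(message_counts[label] / sum(message_counts.values()))
    for token in text.split():
        count = token_counts[label][token]  # 0 for tokens never seen in this class
        score += math.log((count + 1) / (total_tokens + len(vocabulary)))
    return score

def spam_probability(text):
    """Posterior P(spam | text), normalized over the two classes."""
    spam = math.exp(log_score(text, "spam"))
    ham = math.exp(log_score(text, "ham"))
    return spam / (spam + ham)

def classify(text):
    return "spam" if spam_probability(text) > 0.5 else "ham"

print(classify("free money"))        # spam
print(classify("meeting tomorrow"))  # ham
```

Working in log space avoids numerical underflow when a message contains many tokens, since the product of many small probabilities quickly becomes too small to represent.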

Naive Bayes spam filtering is a baseline technique for dealing with spam. It can tailor itself to the email habits of individual users, and its false positive rate is low enough to be generally acceptable to them. It is one of the oldest approaches to spam filtering, with roots in the 1990s.

The model uses a bag-of-words representation together with the "naive" independence assumption:

- The order of words in the message is not important.
- Each word is conditionally independent of the others given the message class (spam / non-spam).
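As a small illustration of the first point, the sketch below builds a bag of words with the standard library's `Counter`. The helper name `bag_of_words` and the example messages are made up here; the tokenization (lowercase, split on whitespace) is a simplifying assumption.

```python
from collections import Counter

def bag_of_words(message):
    """Lowercase the message, split on whitespace, and count each token."""
    return Counter(message.lower().split())

first = bag_of_words("free money for you")
second = bag_of_words("money for you free")

# Same words in a different order give the same bag.
print(first == second)  # True
print(bag_of_words("free free money")["free"])  # 2
```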

Bayes' rule is:

<img src="http://www.saedsayad.com/images/Bayes_rule.png" width="385" height="220" alt="Bayes' rule">

Where we could say:

<img style="display:block;float:none;margin-left:auto;margin-right:auto;border:0;" title="naivebayeseq2" src="https://computersciencesource.files.wordpress.com/2010/01/naivebayeseq2_thumb.png?w=360&amp;h=83" alt="naivebayeseq2" width="240" height="55" border="0">

<img src="https://i.stack.imgur.com/0QOII.png" width="685" height="355" alt="Naive Bayes spam filter">
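Because the formulas above are available only as images, here is a hedged LaTeX restatement of Bayes' rule and of how the independence assumption turns it into a spam score. The symbols $c$, $x$, and $w_i$ are notation chosen here and may differ from the notation used in the images.

```latex
% Bayes' rule for a class c and observed features x
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}

% Naive assumption: the words w_1, ..., w_n are conditionally
% independent given the class c
P(w_1, \dots, w_n \mid c) = \prod_{i=1}^{n} P(w_i \mid c)

% Resulting posterior probability that a message is spam
P(\text{spam} \mid w_1, \dots, w_n) =
  \frac{P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})}
       {\sum_{c \in \{\text{spam},\, \text{ham}\}} P(c) \prod_{i=1}^{n} P(w_i \mid c)}
```

The denominator is the same for both classes, so a classifier only needs to compare the numerators to decide between spam and ham.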

In [6]:
import os
import io
import numpy
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    """Yield (file path, message body) for every file under `path`."""
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            file_path = os.path.join(root, filename)

            # Skip the headers: the body starts after the first blank line.
            inBody = False
            lines = []
            with io.open(file_path, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True
            message = '\n'.join(lines)
            yield file_path, message

def dataFrameFromDirectory(path, classification):
    """Build a DataFrame of messages from `path`, all labeled `classification`."""
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0, so combine the frames with concat.
data = pd.concat([
    dataFrameFromDirectory('spam', 'spam'),
    dataFrameFromDirectory('ham', 'ham'),
])
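The loop inside `readFiles` keeps only the body of each email by discarding everything up to the first blank line. The sketch below isolates that step on an in-memory string so it can be tried without the dataset. The helper name `extract_body` and the sample message are hypothetical. One deliberate difference is that this sketch joins the lines with an empty string; `readFiles` joins with `'\n'` even though each line already ends in a newline, which is why its messages contain doubled line breaks.

```python
import io

def extract_body(stream):
    """Return the text that follows the first blank line of an email-like stream."""
    in_body = False
    lines = []
    for line in stream:
        if in_body:
            lines.append(line)
        elif line == "\n":
            # The first blank line separates the headers from the body.
            in_body = True
    return "".join(lines)

# Hypothetical raw message, made up for illustration.
raw = "From: alice@example.com\nSubject: Hello\n\nHi Bob,\nSee you tomorrow.\n"
print(repr(extract_body(io.StringIO(raw))))  # 'Hi Bob,\nSee you tomorrow.\n'
```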

Let's take a look at the data.

In [2]:
data

| | class | message |
|---|---|---|
| `spam\00001.7848dde101aa985090474a91ec93fcf0` | spam | `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...` |
| `spam\00002.d94f1b97e48ed3b553b3508d116e6a09` | spam | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` |
| `spam\00003.2ee33bc6eacdb11f38d052c44819ba6c` | spam | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` |
| `spam\00004.eac8de8d759b7e74154f142194282724` | spam | `##############################################...` |
| `spam\00005.57696a39d7d84318ce497886896bf90d` | spam | `I thought you might like these:\n\n1) Slim Dow...` |
| `spam\00006.5ab5620d3d7c6c0db76234556a16f6c1` | spam | `A POWERHOUSE GIFTING PROGRAM You Don't Want To...` |
| `spam\00007.d8521faf753ff9ee989122f6816f87d7` | spam | `Help wanted. We are a 14 year old fortune 500...` |
| `spam\00008.dfd941deb10f5eed78b1594b131c9266` | spam | `<html>\n\n<head>\n\n<title>ReliaQuote - Save U...` |
| `spam\00009.027bf6e0b0c4ab34db3ce0ea4bf2edab` | spam | `TIRED OF THE BULL OUT THERE?\n\nWant To Stop L...` |
| `spam\00010.445affef4c70feec58f9198cfbc22997` | spam | `Dear ricardo1 ,\n\n\n\n<html>\n\n<body>\n\n<ce...` |

Now we use a CountVectorizer to convert each message into a vector of word counts, and pass those counts to a MultinomialNB classifier. After calling fit(), we have a trained spam filter ready to go.

In [3]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
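The two steps above can also be chained into a single scikit-learn pipeline, so that raw text goes in and labels come out. The sketch below is one possible arrangement, not the approach used in this notebook; the four training messages and the variable names are hypothetical and exist only to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages, invented for illustration only.
messages = [
    "free money now",
    "win free prize",
    "meeting tomorrow at noon",
    "lunch with bob tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes the raw text and feeds the counts to the classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

new_messages = ["free money", "lunch tomorrow"]
predictions = model.predict(new_messages)

# Each row of probabilities follows the column order given by model.classes_.
probabilities = model.predict_proba(new_messages)

print(model.classes_)
print(predictions)
print(probabilities.round(3))
```

Bundling the vectorizer with the classifier ensures that new messages are always transformed with the vocabulary learned during training, which is the same reason the notebook calls `vectorizer.transform` rather than `fit_transform` on new examples.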

Let's try out a couple of examples.

In [5]:
examples = ['free viagra', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham'], 
      dtype='<U4')

The first message, 'free viagra', was correctly classified as **spam**, and the second message, which reads like a genuine question, was classified as **ham** (non-spam).