# Naive Bayes classifiers: Spam Filter

Naive Bayes classifiers work by correlating the use of tokens (typically words, but sometimes other features) with spam and non-spam e-mails, and then using Bayes' theorem to calculate the probability that an e-mail is spam.

Naive Bayes spam filtering is a baseline technique for dealing with spam that can tailor itself to the email needs of individual users and give low false positive spam detection rates that are generally acceptable to users. It is one of the oldest ways of doing spam filtering, with roots in the 1990s.

The model relies on a bag-of-words representation and on the "naive" independence assumption:

- The order of words in the message is not important.
- Each word is conditionally independent of the others given the message class (spam / non-spam).
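
As a minimal sketch of the first point, the snippet below builds a bag of words with Python's standard library. The helper name `bag_of_words` and the whitespace tokenizer are assumptions made for illustration; real tokenizers are more careful about punctuation.

```python
from collections import Counter

def bag_of_words(message):
    # Lowercase and split on whitespace: a deliberately simple tokenizer.
    return Counter(message.lower().split())

first = bag_of_words("free money free")
second = bag_of_words("money free free")

# Same words, different order: the two bags are identical.
print(first == second)
print(first)
```

Only the word counts survive, so any reordering of a message produces the same representation.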

Bayes' rule is:

<img src="http://www.saedsayad.com/images/Bayes_rule.png" width="385" height="220" style="margin-left: 0px;" alt="Bayes' rule">

<img src="http://www.saedsayad.com/images/Bayes_3.png" width="900" height="900" style="margin-left: 0px;" alt="Naive Bayes illustration">

<img src="https://i.stack.imgur.com/0QOII.png" width="685" height="355" style="margin-top: 1px;" alt="Naive Bayes spam filter illustration">
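
For readers who cannot load the images, here is a sketch of the same rule applied to spam filtering. The notation is an assumption of this sketch and not necessarily that of the figures: $w_1, \dots, w_n$ are the words of a message.

$$
P(\text{spam} \mid w_1, \dots, w_n) = \frac{P(\text{spam}) \, P(w_1, \dots, w_n \mid \text{spam})}{P(w_1, \dots, w_n)}
$$

Under the conditional independence assumption, the likelihood factorizes into per-word terms:

$$
P(w_1, \dots, w_n \mid \text{spam}) = \prod_{i=1}^{n} P(w_i \mid \text{spam})
$$

The denominator is the same for both classes, so a message is labelled spam when $P(\text{spam}) \prod_i P(w_i \mid \text{spam})$ exceeds $P(\text{ham}) \prod_i P(w_i \mid \text{ham})$.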

In [1]:
import os
import io
import numpy
import pandas
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    # Walk the directory tree and yield (file path, message body) pairs.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)

            # Skip the e-mail headers: the body starts after the first blank line.
            inBody = False
            lines = []
            with io.open(filepath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True
            # Each line keeps its own '\n', so joining with '\n' doubles the line breaks.
            message = '\n'.join(lines)
            yield filepath, message

def dataFrameFromDirectory(path, classification):
    # Build a DataFrame with one row per message, indexed by file path.
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0; pandas.concat does the same job.
data = pandas.concat([dataFrameFromDirectory('spam', 'spam'),
                      dataFrameFromDirectory('ham', 'ham')])

Let's take a look at the data.

In [2]:
data

| | class | message |
|---|---|---|
| spam/00178.cdecf0f56ddc0bf61e922a131dc806c2 | spam | `Protect your financial well-being.\n\nPurchase...` |
| spam/00063.2334fb4e465fc61e8406c75918ff72ed | spam | `IS YOUR BUSINESS MAKING MONEY!\n\nSet Up To Ac...` |
| spam/00474.30772a1ac9e824976fc6676844d68b76 | spam | `<html><head>\n\n<title>Congratulations! You Ge...` |
| spam/00401.309e29417819ce39d8599047d50933cc | spam | `A great sponsor will not make you money.\n\nA ...` |
| spam/00367.9688cdee9dfe720c297672c8f60d998f | spam | `++++++++++++++++++++++++++++++++++++++++++++++...` |
| spam/00466.ecb11c98ec4511b5422b20476d935bd1 | spam | `Dear Sir, \n\nWith due respect and humility I ...` |
| spam/00026.da18dbed27ae933172f7a70f860c6ad0 | spam | `DEAR FRIEND,I AM MRS. SESE-SEKO WIDOW OF LATE...` |
| spam/00128.721b6b20d5834d490662e2ae8c5c0684 | spam | `------=_NextPart_000_00A0_03E30A1A.B1804B54\n\...` |
| spam/00245.f129d5e7df2eebd03948bb4f33fa7107 | spam | `\n\nSent e-mail message \n\n \n\nFrom: enen...` |
| spam/00224.0654fe0af51e1dcefa0eb66eb932f55f | spam | `Dear sir,,\n\n\n\n\n\nMy name is DR Steven M D...` |


Now we will use a CountVectorizer to convert each message into a vector of word counts, and feed those counts into a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go.

In [3]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
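
To make the training step less of a black box, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, which is what `alpha=1.0` refers to. The function names, the toy messages, and the whitespace tokenizer are all assumptions made for illustration; this is not scikit-learn's implementation.

```python
import math
from collections import Counter

def train_naive_bayes(messages, labels, alpha=1.0):
    """Return log priors and smoothed log likelihoods for each class."""
    vocabulary = {w for m in messages for w in m.lower().split()}
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_messages = [m for m, label in zip(messages, labels) if label == c]
        log_prior[c] = math.log(len(class_messages) / len(messages))
        word_counts = Counter(w for m in class_messages for w in m.lower().split())
        # Smoothing adds alpha to every word count, so no probability is zero.
        total = sum(word_counts.values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[w] + alpha) / total) for w in vocabulary
        }
    return log_prior, log_likelihood

def predict(message, log_prior, log_likelihood):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in message.lower().split():
            if w in log_likelihood[c]:  # words unseen in training are ignored
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy corpus, not taken from the spam/ham directories.
toy_messages = ["free money now", "win free prize",
                "lunch at noon", "meeting at noon tomorrow"]
toy_labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_naive_bayes(toy_messages, toy_labels)
print(predict("free prize money", log_prior, log_likelihood))
print(predict("lunch tomorrow", log_prior, log_likelihood))
```

Working with log probabilities avoids numerical underflow when many small per-word probabilities are multiplied together.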

Let's try out a couple of examples.

In [4]:
examples = ['free viagra', "Hi Bob, how about a game of golf tomorrow?", "Come to Sam GU's free lunch!"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham', 'spam'],
      dtype='<U4')

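The first message, 'free viagra', was correctly classified as **spam**, and the second message, which reads like a genuine question, was classified as **ham** (non-spam). The third message was also flagged as **spam**, presumably because it contains the word "free"; this looks like a false positive.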

### Let's dig deeper into the magic above

In [5]:
counts

<1804x55330 sparse matrix of type '<class 'numpy.int64'>'
	with 285221 stored elements in Compressed Sparse Row format>

In [6]:
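# Size of the same matrix if stored densely, assuming 1 byte per entry
# (with the int64 dtype shown above it would be 8 times larger)
print('size in megabytes:')
1804 * 55330 / 1024 / 1024

size in megabytes:


95.1913070678711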
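
The sketch below compares the dense footprint with a rough estimate for the sparse format, using the figures reported for `counts`. The storage widths are assumptions, not values stated in the notebook: 8-byte `int64` values and 4-byte indices, which is a common layout for SciPy's Compressed Sparse Row matrices.

```python
# Figures reported by the notebook for the `counts` matrix.
n_rows, n_cols, n_stored = 1804, 55330, 285221

# Assumed storage widths: 8-byte values, 4-byte column indices and row pointers.
VALUE_BYTES, INDEX_BYTES = 8, 4

# Dense storage keeps every entry, zero or not.
dense_bytes = n_rows * n_cols * VALUE_BYTES

# CSR keeps one value and one column index per stored element,
# plus one row pointer per row (and a final end marker).
sparse_bytes = n_stored * (VALUE_BYTES + INDEX_BYTES) + (n_rows + 1) * INDEX_BYTES

density = n_stored / (n_rows * n_cols)

print(f"dense:  {dense_bytes / 1024**2:.1f} MiB")
print(f"sparse: {sparse_bytes / 1024**2:.1f} MiB")
print(f"non-zero entries: {density:.2%}")
```

Under these assumptions the dense matrix needs roughly 760 MiB while the sparse one needs about 3 MiB, because well under 1% of the entries are non-zero.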

### How many web pages were there on the internet in 2016?

The size of the indexed World Wide Web:

- The Indexed Web contains at least **4.62 billion** pages (Wednesday, 30 March, 2016).
- The Dutch Indexed Web contains at least **231.99 million** pages (Wednesday, 30 March, 2016).

At that scale a dense count matrix quickly becomes impractical, as the next calculation shows.

In [7]:
# What if we wanted a dense count matrix for every page of the Dutch indexed web?
# Assumes the same 55330-word vocabulary and 1 byte per entry.
print('size in terabytes:')
round(231.99 * 1000000 * 55330 / 1024 / 1024 / 1024 / 1024, 2)

size in terabytes:


11.67

In [8]:
counts[0]

<1x55330 sparse matrix of type '<class 'numpy.int64'>'
	with 100 stored elements in Compressed Sparse Row format>

In [9]:
counts[0].todense()

matrix([[0, 1, 0, ..., 0, 0, 0]])

In [10]:
counts[1]

<1x55330 sparse matrix of type '<class 'numpy.int64'>'
	with 76 stored elements in Compressed Sparse Row format>

In [11]:
example_counts

<3x55330 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

---