# Naive Bayes

Naive Bayes classifiers are a family of classification algorithms based on Bayes' theorem. They are not a single algorithm but a group of algorithms that share one assumption: given the class label, every feature is independent of every other feature.
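
To make the shared assumption concrete, the decision rule can be written out as follows. The symbols are introduced here for illustration: $c$ is a class label (for example spam or ham) and $x_1, \dots, x_n$ are the feature values of one message.

$$
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The product over individual features is where the "naive" conditional-independence assumption enters: without it, the joint likelihood $P(x_1, \dots, x_n \mid c)$ would not factor this way.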

In [8]:
#os provides a portable way to use operating-system-dependent functionality (used here to walk directories)
import os

import io
import numpy
from pandas import DataFrame

#convert a collection of text documents to a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer

#multinomial Naive Bayes classifier is used for classification with discrete features
from sklearn.naive_bayes import MultinomialNB

#generator that yields (file path, message body) for every file under path
def readFiles(path):
    #os.walk visits every file in the given directory and its subdirectories
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            file_path = os.path.join(root, filename)

            #only the body is used for training; the headers end at the first blank line
            inBody = False

            #collects the lines of the body
            lines = []

            #latin1 can decode any byte sequence, so reading never fails on unusual characters
            with io.open(file_path, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True
            message = '\n'.join(lines)

            #yield makes this function a generator that produces one (path, message) pair per file
            yield file_path, message


def dataFrameFromDirectory(path, classification):
    #one dict per message, holding its text and its class label
    rows = []

    #file paths, used as the row labels of the data frame
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)
    #build a data frame with one row per file
    return DataFrame(rows, index=index)

#DataFrame.append was removed in pandas 2.0, so the frames are combined with concat instead
from pandas import concat

#read the messages stored in the spam and ham directories into one data frame
#NOTE: the two base paths differ ('L00151073' vs 'L0015107'); os.walk silently yields
#nothing for a missing directory, so check that both directories exist
data = concat([
    dataFrameFromDirectory('C:/ML - L00151073/ham and spam data set/spam', 'spam'),
    dataFrameFromDirectory('C:/ML - L0015107/ham and spam data set/ham', 'ham'),
])
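
The header-skipping logic in readFiles can be tried without any files on disk. The sketch below applies the same rule (everything after the first blank line is the body) to an in-memory string. The helper name `extract_body` and the sample message are made up for illustration and are not part of the notebook.

```python
import io

def extract_body(raw_message):
    """Return the text that follows the first blank line of a raw email."""
    in_body = False
    body_lines = []
    for line in io.StringIO(raw_message):
        if in_body:
            body_lines.append(line)
        elif line == '\n':
            in_body = True  # the first blank line ends the headers
    # Each kept line still ends in '\n', so joining with '\n' doubles the
    # line breaks, exactly as the join in readFiles does.
    return '\n'.join(body_lines)

# Hypothetical message, for illustration only.
sample = (
    "From: someone@example.com\n"
    "Subject: Hello\n"
    "\n"
    "First body line\n"
    "Second body line\n"
)
body = extract_body(sample)
print(repr(body))
```

The doubled line breaks explain the `\n\n` sequences visible in the message column of the data frame.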


Let's have a look at that DataFrame:

In [9]:
#show the first five rows (the default for head())
data.head()

| | class | message |
|---|---|---|
| `C:/ML - L00151073/ham and spam data set/spam\0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1` | spam | |
| `C:/ML - L00151073/ham and spam data set/spam\0001.bfc8d64d12b325ff385cca8d07b84288` | spam | `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...` |
| `C:/ML - L00151073/ham and spam data set/spam\0002.24b47bb3ce90708ae29d0aec1da08610` | spam | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` |
| `C:/ML - L00151073/ham and spam data set/spam\0003.4b3d943b8df71af248d12f8b2e7a224a` | spam | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` |
| `C:/ML - L00151073/ham and spam data set/spam\0004.1874ab60c71f0b31b580f313a3f6e777` | spam | `##############################################...` |


We will use a CountVectorizer to turn each message into a vector of word counts, and pass those counts to a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go! It's just that easy.
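
To see what "word counts" means here, the following standard-library sketch builds a small document-term matrix by hand. It borrows CountVectorizer's documented default token pattern (runs of two or more word characters) and its default lowercasing, but it is only an approximation for illustration: the function names and the two sample sentences are made up, and the real class returns a sparse matrix and has many more options.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens of two or more characters."""
    return TOKEN_PATTERN.findall(text.lower())

def count_matrix(documents):
    """Return a sorted vocabulary and one row of counts per document."""
    token_counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*token_counts))
    rows = [[counts[word] for word in vocabulary] for counts in token_counts]
    return vocabulary, rows

# Hypothetical sentences, for illustration only.
docs = ["Buy cheap insurance, buy now", "Meeting moved to Monday"]
vocabulary, rows = count_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row has one entry per vocabulary word, so every message ends up as a vector of the same length no matter how long the original text was.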

In [10]:
vectorizer = CountVectorizer()

#tokenize the message column and count how many times each word occurs in each email
counts = vectorizer.fit_transform(data['message'].values)

#fit takes two inputs: the word counts to train on and the target labels
classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
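
The parameters echoed above hint at what fit() computed: `alpha=1.0` is additive (Laplace) smoothing and `fit_prior=True` means the class priors are estimated from the training data. The sketch below is a from-scratch illustration of that idea in plain Python, not scikit-learn's implementation; the function names and the tiny training set are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
    vocabulary = {word for counts in word_counts.values() for word in counts}

    model = {}
    for label, n_docs in class_docs.items():
        # Smoothed denominator: tokens seen in this class plus alpha per vocabulary word.
        total = sum(word_counts[label].values()) + alpha * len(vocabulary)
        model[label] = {
            'log_prior': math.log(n_docs / len(labels)),
            'log_likelihood': {
                word: math.log((word_counts[label][word] + alpha) / total)
                for word in vocabulary
            },
        }
    return model

def predict(model, tokens):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params['log_prior']
        for word in tokens:
            # Words outside the training vocabulary are skipped.
            if word in params['log_likelihood']:
                score += params['log_likelihood'][word]
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data, for illustration only.
train_docs = [
    ['cheap', 'insurance', 'offer'],
    ['free', 'offer', 'now'],
    ['meeting', 'on', 'monday'],
    ['lunch', 'on', 'monday'],
]
train_labels = ['spam', 'spam', 'ham', 'ham']
model = train(train_docs, train_labels)

print(predict(model, ['free', 'insurance']))
print(predict(model, ['monday', 'meeting']))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.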

Let's try it out:

In [12]:
#sample texts to classify as ham or spam
examples = ['Man Threatens Explosion In Moscow', 'buying life insurance simple and affordable']
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'spam'], dtype='<U4')

The classifier labels both examples as spam!