# Building a Spam Filter Using a Naive Bayes Classifier

# Analyzing the Enron Email corpus

The Enron Email Corpus is one of the largest publicly available email datasets. It has roughly 600,000 emails spread over 2.5 GB.

The difficult part of the project is reading the large number of individual files: the Enron-Spam subset used below contains 33,716 of them. All the emails are in MIME (Multipurpose Internet Mail Extensions) format.

The dataset was obtained from: http://www2.aueb.gr/users/ion/data/enron-spam/

In [3]:
import os
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

The dataset comprises six folders, each containing emails labelled as spam or ham. Let us have a look at the directories and files in these folders.

Python's os module provides a portable way of using operating-system-dependent
functionality.

Its functions let you interface with the underlying operating system that
Python is running on, for example to walk a directory tree.
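As a minimal sketch of how `os.walk` behaves, the snippet below builds a small throwaway directory tree and counts the files in each folder. The folder names mimic the dataset layout, but the tree itself and the `file_counts` name are assumptions made for illustration.

```python
import os
import tempfile

# Build a tiny temporary tree that mimics the layout: root/enron1/{ham,spam}
with tempfile.TemporaryDirectory() as root:
    for label, count in (("ham", 2), ("spam", 3)):
        folder = os.path.join(root, "enron1", label)
        os.makedirs(folder)
        for i in range(count):
            with open(os.path.join(folder, f"{i}.txt"), "w") as f:
                f.write("Subject: example\n")

    # os.walk yields one (dirpath, dirnames, filenames) tuple per directory
    file_counts = {}
    for dirpath, dirnames, filenames in os.walk(root):
        file_counts[os.path.relpath(dirpath, root)] = len(filenames)

print(file_counts)
```

Each directory is visited exactly once, so the folders that only contain other folders report zero files.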

In [9]:
for directories, subdirs, files in os.walk("/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron"):
    print(directories, subdirs, len(files))

/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron ['enron1', 'enron6', 'enron5', 'enron2', 'enron3', 'enron4'] 1
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron1 ['spam', 'ham'] 2
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron1/spam [] 1500
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron1/ham [] 3672
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron6 ['spam', 'ham'] 1
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron6/spam [] 4500
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron6/ham [] 1500
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron5 ['spam', 'ham'] 1
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron5/spam [] 3675
/Users/saiyesaswymylavarapu/Documen

In the next step, instead of printing every folder, we keep only the directories named ham and spam.
This makes sure that we later read only the email files themselves.

In [11]:
for directories, subdirs, files in os.walk("/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron"):
    # Keep only the leaf folders that actually hold emails
    if os.path.split(directories)[1] in ('ham', 'spam'):
        print(directories, subdirs, len(files))

/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron1/spam [] 1500
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron1/ham [] 3672
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron6/spam [] 4500
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron6/ham [] 1500
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron5/spam [] 3675
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron5/ham [] 1500
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron2/spam [] 1496
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron2/ham [] 4361
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron3/spam [] 1500
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron3/ham [] 

# Reading the Data

Now let us read the spam and ham emails, appending each one to its respective list.

In [12]:
ham_list = []
spam_list = []

In [15]:
for directories,subdirs,files in os.walk("/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron"):
    if(os.path.split(directories)[1]=='ham'):
        for filename in files:
            with open(os.path.join(directories,filename),encoding='latin-1') as f:
                data = f.read()
                ham_list.append(data)

In [16]:
for directories,subdirs,files in os.walk("/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron"):
    if(os.path.split(directories)[1]=='spam'):
        for filename in files:
            with open(os.path.join(directories,filename),encoding='latin-1') as f:
                data = f.read()
                spam_list.append(data)
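The two loops above walk the same tree twice. One possible alternative, sketched here with a hypothetical helper named `read_emails`, collects both classes in a single pass by using the folder name as the label.

```python
import os

def read_emails(root):
    """Return {"ham": [...], "spam": [...]} with the raw text of every email under root."""
    emails = {"ham": [], "spam": []}
    for dirpath, _dirnames, filenames in os.walk(root):
        label = os.path.basename(dirpath)
        if label not in emails:
            continue  # skip the root and the enronN folders themselves
        for filename in filenames:
            # latin-1 maps every byte to a character, so decoding never fails
            with open(os.path.join(dirpath, filename), encoding="latin-1") as f:
                emails[label].append(f.read())
    return emails
```

Calling `read_emails` on the dataset root would give the same contents as `ham_list` and `spam_list`, under the keys `"ham"` and `"spam"`.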

In [20]:
print(spam_list[0])

Subject: what up , , your cam babe
what are you looking for ?
if your looking for a companion for friendship , love , a date , or just good ole '
fashioned * * * * * * , then try our brand new site ; it was developed and created
to help anyone find what they ' re looking for . a quick bio form and you ' re
on the road to satisfaction in every sense of the word . . . . no matter what
that may be !
try it out and youll be amazed .
have a terrific time this evening
copy and pa ste the add . ress you see on the line below into your browser to come to the site .
http : / / www . meganbang . biz / bld / acc /
no more plz
http : / / www . naturalgolden . com / retract /
counterattack aitken step preemptive shoehorn scaup . electrocardiograph movie honeycomb . monster war brandywine pietism byrne catatonia . encomia lookup intervenor skeleton turn catfish .



# Data cleaning

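The next task is preparing the data for the Naive Bayes classifier. Naive Bayes is a fairly simple machine learning algorithm that works mainly with probabilities: it estimates how likely each word is to appear in spam and in ham, and combines those estimates to score a new email.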
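To make the probability idea concrete, here is a small hand-rolled sketch of a Naive Bayes calculation on a made-up training set. It is not NLTK's implementation: the toy data, the function names and the add-one smoothing are all assumptions chosen to keep the arithmetic easy to follow.

```python
import math

# Hypothetical toy training set: (set of words in the email, label)
training = [
    ({"cheap", "meds"}, "spam"),
    ({"meds", "offer"}, "spam"),
    ({"meeting", "schedule"}, "ham"),
    ({"schedule", "offer"}, "ham"),
    ({"meeting", "report"}, "ham"),
]

def train_counts(samples):
    """Count emails per label and, per label, the emails containing each word."""
    label_counts, word_counts = {}, {}
    for words, label in samples:
        label_counts[label] = label_counts.get(label, 0) + 1
        per_label = word_counts.setdefault(label, {})
        for word in words:
            per_label[word] = per_label.get(word, 0) + 1
    return label_counts, word_counts

def posterior(words, label_counts, word_counts):
    """Return P(label | words), using only the words present in the email."""
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # prior P(label)
        for word in words:
            # P(word present | label) with add-one smoothing over two outcomes
            p = (word_counts[label].get(word, 0) + 1) / (n + 2)
            score += math.log(p)  # "naive" step: words treated as independent
        log_scores[label] = score
    norm = sum(math.exp(s) for s in log_scores.values())
    return {label: math.exp(s) / norm for label, s in log_scores.items()}

label_counts, word_counts = train_counts(training)
result = posterior({"meds", "offer"}, label_counts, word_counts)
print(result)
```

For the email containing "meds" and "offer", the spam score is 0.4 × 0.75 × 0.5 = 0.15 and the ham score is 0.6 × 0.2 × 0.4 = 0.048, so spam wins with a posterior of roughly 0.76.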

Preparing the data involves multiple steps.

The first step is to define a function that creates word features, converting each email's words into the input format the classifier requires.

In [24]:
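# The NLTK Naive Bayes classifier expects each sample as a dict of {feature: value}
# Build the stopword set once; looking it up per word is slow on a large corpus
english_stopwords = set(stopwords.words("english"))

def create_word_features(words):
    useful_words = [word for word in words if word not in english_stopwords]
    return {word: True for word in useful_words}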
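The sketch below shows the shape of the resulting feature dictionary on a handful of tokens. To stay runnable without the NLTK data files it uses a tiny stand-in stopword set, `TOY_STOPWORDS`, which is an assumption for illustration only; the notebook itself uses NLTK's English list.

```python
# A tiny stand-in stopword set, assumed for illustration
TOY_STOPWORDS = {"the", "is", "a", "for", "you"}

def toy_word_features(words):
    # Each remaining word becomes a key; the value True just marks "present"
    return {word: True for word in words if word not in TOY_STOPWORDS}

features = toy_word_features(["the", "meeting", "is", "moved", "for", "you", "meeting"])
print(features)  # {'meeting': True, 'moved': True}
```

Note that repeated words collapse into a single key, so this representation records only whether a word occurs, not how often.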

This time, instead of appending each raw string to a list, we tokenize each email into words and convert those words into features.

In [32]:
ham_data=[]
spam_data=[]
for directories,subdirs,files in os.walk("/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron"):
    print(directories)
    if(os.path.split(directories)[1]=='ham'):
        for filename in files:
            with open(os.path.join(directories,filename),encoding='latin-1') as f:
                data = f.read()
                #Tokenizing sentences into words
                words = word_tokenize(data)
                #Creating word features for all the words
                ham_data.append((create_word_features(words),"ham"))
    
    if(os.path.split(directories)[1]=='spam'):
        for filename in files:
            with open(os.path.join(directories,filename),encoding='latin-1') as f:
                data = f.read()
                #Tokenizing sentences into words
                words = word_tokenize(data)
                #Creating word features for all the words
                spam_data.append((create_word_features(words),"spam"))

/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron1
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron1/spam
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron1/ham
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron6
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron6/spam
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron6/ham
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron5
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron5/spam
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron5/ham
/Users/saiyesaswymylavarapu/Documents/Documents/PythonBootCamp/Spam Filter/Enron/enron2
/Users/saiye

In [33]:
ham_data[0]

({'!': True,
  '&': True,
  "'": True,
  '(': True,
  ')': True,
  ',': True,
  '-': True,
  '.': True,
  '/': True,
  '04': True,
  '05': True,
  '07': True,
  '08': True,
  '10': True,
  '11': True,
  '2000': True,
  '56': True,
  '7': True,
  '99': True,
  ':': True,
  '?': True,
  '@': True,
  'Subject': True,
  'accomplish': True,
  'activity': True,
  'advance': True,
  'agreement': True,
  'allow': True,
  'altrade': True,
  'anything': True,
  'based': True,
  'begin': True,
  'behalf': True,
  'brazoria': True,
  'brenda': True,
  'buy': True,
  'buys': True,
  'c': True,
  'carbide': True,
  'cc': True,
  'central': True,
  'ces': True,
  'check': True,
  'chemicals': True,
  'cheryl': True,
  'christi': True,
  'city': True,
  'clynes': True,
  'come': True,
  'company': True,
  'contracts': True,
  'contractual': True,
  'corporation': True,
  'corpus': True,
  'could': True,
  'counterparties': True,
  'customer': True,
  'customers': True,
  'daren': True,
  'deals': True

# Training the Naive Bayes

The first step is creating the training and test samples.

The ham and spam data are first combined and shuffled.

In [37]:
import random
total_data = ham_data + spam_data
random.shuffle(total_data)

Create the training and test datasets by splitting the shuffled data 70% / 30%.

In [38]:
train_part = int(len(total_data) * .7)

train = total_data[:train_part]
test = total_data[train_part:]

print(len(train))
print(len(test))

23601
10115
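Because `random.shuffle` is unseeded here, each run produces a different split and a slightly different accuracy. A minimal sketch of a reproducible variant follows; the helper name `split_dataset` and the seed value are assumptions, not part of the original notebook.

```python
import random

def split_dataset(samples, train_fraction=0.7, seed=42):
    """Shuffle a copy of samples with a fixed seed and split it into two parts."""
    shuffled = list(samples)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train_demo, test_demo = split_dataset(range(100))
print(len(train_demo), len(test_demo))  # 70 30
```

With the notebook's data, `split_dataset(total_data)` would give the same 70/30 proportions while returning the same split on every run.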


In [39]:
# Create the Naive Bayes filter
 
classifier = NaiveBayesClassifier.train(train)
 
# Find the accuracy, using the test data
 
accuracy = nltk.classify.util.accuracy(classifier, test)
 
print("Accuracy is: ", accuracy * 100)

Accuracy is:  98.80375679683638


So the accuracy of our model on the held-out test set turned out to be 98.8%.
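Accuracy is simply the fraction of test emails whose predicted label matches the true one. The sketch below computes it by hand on made-up labels and also counts each (true, predicted) pair, which shows whether the errors are missed spam or wrongly flagged ham. The `evaluate` helper and the toy labels are assumptions; with the notebook's objects, the predictions could be obtained with something like `classifier.classify_many([features for features, label in test])`.

```python
from collections import Counter

def evaluate(gold_labels, predicted_labels):
    """Return the accuracy and a Counter of (gold, predicted) label pairs."""
    pairs = list(zip(gold_labels, predicted_labels))
    accuracy = sum(gold == pred for gold, pred in pairs) / len(pairs)
    confusion = Counter(pairs)
    return accuracy, confusion

# Hypothetical labels for five test emails
gold = ["spam", "spam", "ham", "ham", "ham"]
pred = ["spam", "ham", "ham", "ham", "spam"]
accuracy, confusion = evaluate(gold, pred)
print(accuracy)  # 0.6
print(confusion[("ham", "spam")])  # 1 ham email wrongly flagged as spam
```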

Let us look at the most informative features:

In [41]:
classifier.show_most_informative_features(20)

Most Informative Features
                    meds = True             spam : ham    =    299.0 : 1.0
                     xls = True              ham : spam   =    287.4 : 1.0
                     713 = True              ham : spam   =    271.1 : 1.0
                 stinson = True              ham : spam   =    268.4 : 1.0
                crenshaw = True              ham : spam   =    268.4 : 1.0
                     ect = True              ham : spam   =    226.2 : 1.0
                     eol = True              ham : spam   =    212.9 : 1.0
             medications = True             spam : ham    =    209.1 : 1.0
              scheduling = True              ham : spam   =    205.6 : 1.0
                  louise = True              ham : spam   =    200.5 : 1.0
                     hpl = True              ham : spam   =    194.4 : 1.0
                 parsing = True              ham : spam   =    192.7 : 1.0
                     oem = True             spam : ham    =    173.6 : 1.0

This shows that words like meds and medications are far more likely to occur in spam emails, while Enron-specific tokens such as xls, ect and scheduling point strongly towards ham.
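Each ratio in that table compares how likely a word is to appear under one label versus the other. The sketch below reproduces the idea on hypothetical counts. It assumes add-0.5 smoothing over two outcomes (word present or absent), which is my understanding of NLTK's default estimator rather than something the notebook states, and the function names are made up for illustration.

```python
def smoothed_probability(count, total, bins=2, gamma=0.5):
    """Estimate P(word present | label) from counts, with additive smoothing."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of the larger to the smaller smoothed probability."""
    p_a = smoothed_probability(count_a, total_a)
    p_b = smoothed_probability(count_b, total_b)
    return max(p_a, p_b) / min(p_a, p_b)

# Hypothetical counts: the word appears in 300 of 1000 spam emails
# and in 1 of 1000 ham emails
ratio = likelihood_ratio(300, 1000, 1, 1000)
print(round(ratio, 1))  # 200.3
```

The smoothing matters most for words that never occur under one label: without it the smaller probability would be zero and the ratio undefined.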

# Checking on some examples

I have taken some examples of SPAM and HAM emails from the web and tried to determine their class based on my model.

In [42]:
mail1 = '''It is truly incredible - People from all around the world are using
their webcams to get off. Now is your chance to watch Men and women,
boys and girls show off just for you. Best of all, it's FREE, LIVE
and UN-F*CKING-BELIEVABLE. Either peep in on the sexy activity or
participate with your own webcam! You've got to try this out!
Open Amateur Webcam Feeds are Active RIGHT NOW!!!
'''

In [43]:
words = word_tokenize(mail1)
features = create_word_features(words)
print("Email 1 is :" ,classifier.classify(features))

Email 1 is : spam


In [44]:
mail2 = '''Dear recipient,
Avangar Technologies announces the beginning of a new unprecendented global employment campaign. 
reviser yeller winers butchery twenties
Due to company's exploding growth Avangar is expanding business to the European region.
During last employment campaign over 1500 people worldwide took part in Avangar's business
and more than half of them are currently employed by the company. And now we are offering you
one more opportunity to earn extra money working with Avangar Technologies.
druggists blame classy gentry Aladdin

We are looking for honest, responsible, hard-working people that can dedicate 2-4 hours of their
time per day and earn extra £300-500 weekly. All offered positions are currently part-time
and give you a chance to work mainly from home.
lovelies hockey Malton meager reordered

Please visit Avangar's corporate web site (http://www.avangar.com/sta/home/0077.htm) for more details regarding these vacancies.
'''

In [45]:
words = word_tokenize(mail2)
features = create_word_features(words)
print("Email 2 is :" ,classifier.classify(features))

Email 2 is : spam


In [46]:
mail3 = '''Dear Mr. James:

I sincerely enjoyed meeting with you yesterday and learning more about the Position at Employer.

Our conversation confirmed my interest in becoming part of Employer's staff. I was particularly pleased at the prospect of being able to develop my own article ideas with the head of the bureau, and develop my multi-media skills.

I feel confident that my experiences both in the workplace and in the classroom would enable me to fill the job requirements effectively.

Please feel free to contact me if I can provide you with any further information.

I look forward eagerly to hearing from you, and thank you again for the courtesy you extended to me.

Sincerely'''

In [47]:
words = word_tokenize(mail3)
features = create_word_features(words)
print("Email 3 is :" ,classifier.classify(features))

Email 3 is : ham
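The three checks above repeat the same tokenize, featurize and classify steps. One way to avoid the repetition is a small wrapper such as the hypothetical `classify_emails` below. So that the sketch runs without a trained model, it uses a stand-in `KeywordClassifier` and `str.split` as the tokenizer; both are assumptions for illustration only.

```python
def classify_emails(emails, classifier, tokenize, featurize):
    """Return the predicted label for each raw email string."""
    return [classifier.classify(featurize(tokenize(text))) for text in emails]

class KeywordClassifier:
    """Hypothetical stand-in that exposes the same classify() method."""
    def classify(self, features):
        return "spam" if "webcam" in features else "ham"

labels = classify_emails(
    ["free webcam feeds", "meeting moved to monday"],
    KeywordClassifier(),
    tokenize=str.split,
    featurize=lambda words: {word: True for word in words},
)
print(labels)  # ['spam', 'ham']
```

With the notebook's own objects, the equivalent call would be `classify_emails([mail1, mail2, mail3], classifier, word_tokenize, create_word_features)`.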
