![Spamector Logo](jupyter_images/logo.png)


# Introduction
***
Text mining is a broad field that has gained popularity as the volume of generated text data has grown.
Machine learning models now automate many applications, such as sentiment analysis, document classification, topic classification, text summarization, and machine translation.

Spam filtering is an example of a document classification task: it classifies an email as spam or non-spam (a.k.a. ham).

In this notebook, we’ll walk step by step through implementing a spam detection system, then expose the model through an API and a web interface.

# Table of Contents
***
1. [Obtaining Data](#Obtaining-Data)
2. [Exploring Data](#Exploring-Data)
3. [Preparing Data](#Preparing-Data)
4. [Modeling Data](#Modeling-Data)
5. [Evaluation](#Evaluation)
6. [Deployment](#Deployment)
7. [Web Application](#Web-Application)

# Let's Get Started

![One Bite](jupyter_images/one_bite.jpg)

## Obtaining Data 
***
* We will be using the Enron Spam dataset in its pre-processed form. You can get it here: http://www.aueb.gr/users/ion/data/enron-spam
![Enron](jupyter_images/enron.png)

## Exploring Data 
***
* All the spam emails are in a folder called spam, while the non-spam emails are in a folder called ham.

**Step 1**

* Printing all the directories, sub directories, and files in our data folder.

In [1]:
# Let’s import the necessary libraries.
import os, random, time, pickle

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier, SklearnClassifier
from nltk.tokenize import word_tokenize
from sklearn.svm import LinearSVC

In [2]:
# This is our data directory
data_directory = "data"

In [3]:
# Let’s now loop through all the directories, subdirectories, and files in the above folder, and print them. 
# For files, we print the count.
for directories, subdirs, files in os.walk(data_directory):
    print(directories, subdirs, len(files))

data ['enron1', 'enron2', 'enron3', 'enron4', 'enron5', 'enron6'] 0
data\enron1 ['ham', 'spam'] 0
data\enron1\ham [] 3672
data\enron1\spam [] 1500
data\enron2 ['ham', 'spam'] 0
data\enron2\ham [] 4361
data\enron2\spam [] 1496
data\enron3 ['ham', 'spam'] 0
data\enron3\ham [] 4012
data\enron3\spam [] 1500
data\enron4 ['ham', 'spam'] 0
data\enron4\ham [] 1500
data\enron4\spam [] 4500
data\enron5 ['ham', 'spam'] 0
data\enron5\ham [] 1500
data\enron5\spam [] 3675
data\enron6 ['ham', 'spam'] 0
data\enron6\ham [] 1500
data\enron6\spam [] 4500


**Step 2**

* Printing the files when we are in the ham or spam folder.

In [4]:
# Now instead of printing all files and folders, we only print the files when we are in the ham or spam folder.
for directories, subdirs, files in os.walk(data_directory):
    if (os.path.split(directories)[1]  == 'spam'):
        print(directories, subdirs, len(files))

    if (os.path.split(directories)[1]  == 'ham'):
        print(directories, subdirs, len(files))

data\enron1\ham [] 3672
data\enron1\spam [] 1500
data\enron2\ham [] 4361
data\enron2\spam [] 1496
data\enron3\ham [] 4012
data\enron3\spam [] 1500
data\enron4\ham [] 1500
data\enron4\spam [] 4500
data\enron5\ham [] 1500
data\enron5\spam [] 3675
data\enron6\ham [] 1500
data\enron6\spam [] 4500


**Why does it matter?**

* When we start reading the files, we want to make sure we are only reading them from the spam and ham folders.
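The folder check from Step 2 can be wrapped in a small helper so that the later reading steps reuse it. This is a minimal sketch rather than part of the original notebook: the name *iter_label_dirs* is hypothetical, and it assumes that every folder named ham or spam under the data directory contains only email files.

```python
import os

def iter_label_dirs(root, labels=("ham", "spam")):
    """Yield (directory, label, file_names) for every folder named after a label."""
    for directory, _subdirs, files in os.walk(root):
        label = os.path.basename(directory)
        if label in labels:
            yield directory, label, files

# Usage: count the files per label without touching any other folder.
# for directory, label, files in iter_label_dirs("data"):
#     print(directory, label, len(files))
```

Because the label is taken from the folder name, the same loop can later produce the "ham" or "spam" tag for each message.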

**Step 3**

* Reading all the files in those folders.

In [5]:
# We read the files and append them to the ham and spam list
hamList = []
spamList = []
for directories, subdirs, files in os.walk(data_directory):
    if (os.path.split(directories)[1]  == 'ham'):
        for fileName in files:      
            with open(os.path.join(directories, fileName), encoding="latin-1") as f:
                message = f.read()
                hamList.append(message)

    if (os.path.split(directories)[1]  == 'spam'):
        for fileName in files:
            with open(os.path.join(directories, fileName), encoding="latin-1") as f:
                message = f.read()
                spamList.append(message)
print(hamList[0])
print('--------- ')
print(spamList[0])

Subject: christmas tree farm pictures

--------- 
Subject: dobmeos with hgh my energy level has gone up ! stukm
introducing
doctor - formulated
hgh
human growth hormone - also called hgh
is referred to in medical science as the master hormone . it is very plentiful
when we are young , but near the age of twenty - one our bodies begin to produce
less of it . by the time we are forty nearly everyone is deficient in hgh ,
and at eighty our production has normally diminished at least 90 - 95 % .
advantages of hgh :
- increased muscle strength
- loss in body fat
- increased bone density
- lower blood pressure
- quickens wound healing
- reduces cellulite
- improved vision
- wrinkle disappearance
- increased skin thickness texture
- increased energy levels
- improved sleep and emotional stability
- improved memory and mental alertness
- increased sexual potency
- resistance to common illness
- strengthened heart muscle
- controlled cholesterol
- controlled mood swings
- new hair growth and co

- Why **encoding="latin-1"**?

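Some files contain bytes that are not valid in Python 3’s default text encoding (often UTF-8), so reading them without an explicit encoding raises a Unicode decoding error. Python 2 let us get away with this because it read files as raw bytes. Latin-1 maps every possible byte value to a character, so decoding with it never fails.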

Reference - [Processing Text Files in Python 3](http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html)
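To make the encoding point concrete, here is a small sketch that is independent of the dataset. The byte string and the helper name *decodes_as* are made up for illustration.

```python
# A byte sequence as it might appear in an email saved with a Latin-1 encoding.
raw = b"caf\xe9 offer"

def decodes_as(data, encoding):
    """Return True if the bytes can be decoded with the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

print(decodes_as(raw, "utf-8"))    # False: 0xE9 starts a multi-byte sequence that never completes
print(decodes_as(raw, "latin-1"))  # True: every byte value maps to a character
print(raw.decode("latin-1"))       # café offer
```

The trade-off is that Latin-1 never complains: a file written in another encoding still decodes, but some characters may come out wrong.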

## Preparing Data 
***
* Now we need to prepare our data for the classifiers. But which classifiers will we use?

Naïve Bayes and the support vector machine are two popular classification approaches: the former is a typical generative model, the latter a typical discriminative one. We will train both and compare them.

**- Naive Bayes?**

Naive Bayes classifiers are a popular statistical technique for e-mail filtering. They typically use bag-of-words features to identify spam e-mail, an approach commonly used in text classification.

Naive Bayes classifiers work by correlating the use of tokens (typically words, or sometimes other things), with spam and non-spam e-mails and then using Bayes' theorem to calculate a probability that an email is or is not spam.

**- How does it work?**
![Naive Bayes Diagram](jupyter_images/naive_bayes.png)

**- Formula?**
![Naive Bayes Formula](jupyter_images/formula.png)
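The idea can be sketched with a toy classifier written in plain Python. This is a simplified illustration, not NLTK's implementation: the function names and the four training messages are made up, and it uses add-one smoothing, which is an assumption of this sketch and not necessarily the estimator NLTK applies.

```python
import math
from collections import Counter

def train_toy_nb(labelled_docs):
    """Count documents per label and, per label, documents containing each word."""
    label_counts = Counter(label for _words, label in labelled_docs)
    word_counts = {label: Counter() for label in label_counts}
    for words, label in labelled_docs:
        word_counts[label].update(set(words))
    return label_counts, word_counts

def posterior(words, label_counts, word_counts):
    """Apply Bayes' theorem with the naive assumption that words are independent."""
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)                 # prior P(label)
        for word in set(words):
            # P(word | label) with add-one smoothing (word present / absent)
            score += math.log((word_counts[label][word] + 1) / (n_docs + 2))
        log_scores[label] = score
    # Normalise so that the probabilities sum to 1
    highest = max(log_scores.values())
    weights = {label: math.exp(s - highest) for label, s in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}

docs = [
    (["free", "money"], "spam"),
    (["free", "offer"], "spam"),
    (["meeting", "schedule"], "ham"),
    (["schedule", "money"], "ham"),
]
label_counts, word_counts = train_toy_nb(docs)
print(posterior(["free"], label_counts, word_counts))  # spam gets a probability of about 0.75
```

Here "free" appears in both spam messages and in no ham message, so the smoothed likelihoods are 3/4 and 1/4; with equal priors, the posterior for spam is about 0.75.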

**- Support Vector Machine?**

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression, although it is mostly used for classification. In our case, we will use the linear support vector classifier (*LinearSVC*) from scikit-learn.
![SVM](jupyter_images/sphx_glr_plot_iris_0012.png)



- Reference

[Naive Bayes spam filtering](https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering)

[Naïve Bayes vs. Support Vector Machine: Resilience to Missing Data](https://link.springer.com/chapter/10.1007/978-3-642-23887-1_86)

[scikit-learn.org - Support Vector Machines](http://scikit-learn.org/stable/modules/svm.html)

***
**Step 1**

* Naive Bayes expects the input in a particular format: 

{Word1: True, Word2: True, Word3: True}

* So here, we are just creating a dictionary that returns True for each word.

In [6]:
def createWordFeatures(words):
    my_dict = dict( [ (word, True) for word in words] )
    return my_dict

* Let's test our function

In [7]:
createWordFeatures(["the", "MIS 5470", "class", "has", "python", "and","R","modules"])

{'MIS 5470': True,
 'R': True,
 'and': True,
 'class': True,
 'has': True,
 'modules': True,
 'python': True,
 'the': True}

**Step 2**

* Now, with the help of the *createWordFeatures()* function, we pair each feature dictionary with the label “ham”. This tells the algorithm that the text is of type ham, and we’ll do the same for “spam”.

In [8]:
# First, using word_tokenize(), we break the message into words.
# Second, we pass the words to createWordFeatures() and attach the label.
hamList = []
spamList = []
for directories, subdirs, files in os.walk(data_directory):
    if (os.path.split(directories)[1]  == 'ham'):
        for fileName in files:      
            with open(os.path.join(directories, fileName), encoding="latin-1") as f:
                message = f.read()               
                wordList = word_tokenize(message)               
                hamList.append((createWordFeatures(wordList), "ham"))
    
    if (os.path.split(directories)[1]  == 'spam'):
        for fileName in files:
            with open(os.path.join(directories, fileName), encoding="latin-1") as f:
                message = f.read()               
                wordList = word_tokenize(message)               
                spamList.append((createWordFeatures(wordList), "spam"))
print(hamList[0])
print(spamList[0])

({'Subject': True, ':': True, 'christmas': True, 'tree': True, 'farm': True, 'pictures': True}, 'ham')
({'Subject': True, ':': True, 'dobmeos': True, 'with': True, 'hgh': True, 'my': True, 'energy': True, 'level': True, 'has': True, 'gone': True, 'up': True, '!': True, 'stukm': True, 'introducing': True, 'doctor': True, '-': True, 'formulated': True, 'human': True, 'growth': True, 'hormone': True, 'also': True, 'called': True, 'is': True, 'referred': True, 'to': True, 'in': True, 'medical': True, 'science': True, 'as': True, 'the': True, 'master': True, '.': True, 'it': True, 'very': True, 'plentiful': True, 'when': True, 'we': True, 'are': True, 'young': True, ',': True, 'but': True, 'near': True, 'age': True, 'of': True, 'twenty': True, 'one': True, 'our': True, 'bodies': True, 'begin': True, 'produce': True, 'less': True, 'by': True, 'time': True, 'forty': True, 'nearly': True, 'everyone': True, 'deficient': True, 'and': True, 'at': True, 'eighty': True, 'production': True, 'normall

**Step 3**

* Merging spam and ham lists.
* Shuffling the result to make it randomised.

In [9]:
mergedList = hamList + spamList
random.shuffle(mergedList)
print("Number of files: {}".format(len(mergedList)))

Number of files: 33716
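Since *random.shuffle* gives a different order on every run, the split and the accuracy figures that follow will vary slightly between runs. If reproducible results are wanted, one option is a seeded shuffle. This is only a sketch: the helper name *shuffled_copy* and the seed value 42 are arbitrary choices, not part of the original notebook.

```python
import random

def shuffled_copy(items, seed=42):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)   # private generator, leaves the global random state alone
    result = list(items)
    rng.shuffle(result)
    return result

# In the notebook this would be: mergedList = shuffled_copy(hamList + spamList)
```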


## Modeling Data 
***
**Step 1**

* Creating test and train splits.

In [10]:
# We choose to have 60% of the data as training and 40% as test.
trainingPart = int(len(mergedList) * .6)
trainingData = mergedList[:trainingPart]
testData =  mergedList[trainingPart:]

print("Total Number of files:    {}".format(len(mergedList)))
print("Number of Training files: {}".format(len(trainingData)))
print("Number of Test files:     {}".format(len(testData)))

Total Number of files:    33716
Number of Training files: 20229
Number of Test files:     13487
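The folder counts printed earlier add up to 16,545 ham and 17,171 spam messages, so the data is close to balanced. Because the split is a plain slice of the shuffled list, it can be worth checking that both parts keep a similar balance. The helper below is a hypothetical sketch, shown on toy data.

```python
from collections import Counter

def label_share(labelled_data):
    """Return the fraction of examples per label in a list of (features, label) pairs."""
    counts = Counter(label for _features, label in labelled_data)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# In the notebook this would be called as label_share(trainingData) and label_share(testData).
toy = [({"a": True}, "ham"), ({"b": True}, "spam"), ({"c": True}, "spam"), ({"d": True}, "spam")]
print(label_share(toy))  # {'ham': 0.25, 'spam': 0.75}
```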


**Step 2**

* Calling the Naive Bayes classifier.
* Finding its accuracy.

In [11]:
start_time = time.time()
nb_classifier = NaiveBayesClassifier.train(trainingData)
print("Trained the Naive Bayes classifier in {:,.2f} seconds".format(time.time()-start_time))
accuracy = nltk.classify.util.accuracy(nb_classifier, testData)
print("Accuracy is: ", accuracy * 100)

Trained the Naive Bayes classifier in 7.60 seconds
Accuracy is:  98.61347964706755


* We are getting an accuracy of about 98.6%, which is great.
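Accuracy counts both kinds of mistakes equally, whereas in spam filtering a ham message flagged as spam is usually the costlier error. Precision and recall for the spam label separate the two. The function below is a plain-Python sketch with a hypothetical name, shown on made-up labels rather than on the notebook's results.

```python
def precision_recall(gold, predicted, positive="spam"):
    """Compute precision and recall for one label from two parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# In the notebook, the two lists could be built from testData like this:
# gold = [label for _features, label in testData]
# predicted = [nb_classifier.classify(features) for features, _label in testData]
gold = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "spam", "spam"]
print(precision_recall(gold, predicted))
```

In the toy lists there are two true positives, one false positive, and one false negative, so both precision and recall are 2/3.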

**Step 3**

* Let's have a look at the most informative features:

In [13]:
nb_classifier.show_most_informative_features(20)

Most Informative Features
                   enron = True              ham : spam   =   3115.4 : 1.0
                   daren = True              ham : spam   =    447.4 : 1.0
                     php = True             spam : ham    =    363.0 : 1.0
                     hpl = True              ham : spam   =    305.3 : 1.0
                     ect = True              ham : spam   =    219.8 : 1.0
                crenshaw = True              ham : spam   =    217.0 : 1.0
                   corel = True             spam : ham    =    200.7 : 1.0
                     713 = True              ham : spam   =    188.3 : 1.0
              scheduling = True              ham : spam   =    181.1 : 1.0
                     eol = True              ham : spam   =    179.0 : 1.0
                  louise = True              ham : spam   =    177.8 : 1.0
                 parsing = True              ham : spam   =    172.0 : 1.0
                     sex = True             spam : ham    =    158.7 : 1.0

**Step 4**

* Calling the Linear SVC.
* Finding its accuracy.

In [14]:
start_time = time.time()
SVM_classifier = SklearnClassifier(LinearSVC(), sparse=False).train(trainingData)
print("Trained the SVM classifier in {:,.2f} seconds".format(time.time()-start_time))
SVM_accuracy = nltk.classify.util.accuracy(SVM_classifier, testData)
print("Accuracy is: ", SVM_accuracy * 100)

Trained the SVM classifier in 100.69 seconds
Accuracy is:  98.32431230073404


## Evaluation
***
<img src="jupyter_images/evaluation.gif" style="width: 300px;"/>
### Comparing Models
![SVM_NB](jupyter_images/svm_nb.png)

- The two models reach almost the same accuracy (98.6% for naïve Bayes vs. 98.3% for the SVM), but the SVM took about 100 seconds to train compared with under 8 seconds for naïve Bayes. So, our winner is the naïve Bayes model.

### Predicting New Emails

* Let's classify the messages below as spam or ham. How do we do it?


1. Break the message into words using *word_tokenize*
2. Generate features using *createWordFeatures*
3. Use the *classify* function
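The three steps above can be bundled into one function. This is a sketch under a few assumptions: *classify_message* and *KeywordClassifier* are hypothetical names, the default tokenizer is a plain whitespace split so that the example runs without NLTK, and the stand-in classifier only mimics the *classify* method of the trained model.

```python
def classify_message(classifier, message, tokenize=str.split):
    """Tokenize a message, build the word features, and return the predicted label."""
    words = tokenize(message)
    features = {word: True for word in words}   # same shape as createWordFeatures()
    return classifier.classify(features)

class KeywordClassifier:
    """Hypothetical stand-in for nb_classifier, only here to make the sketch runnable."""
    def classify(self, features):
        return "spam" if "prize" in features else "ham"

print(classify_message(KeywordClassifier(), "claim your prize now"))    # spam
print(classify_message(KeywordClassifier(), "see you at the meeting"))  # ham
```

In the notebook, the call would look like `classify_message(nb_classifier, msg1, tokenize=word_tokenize)`.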

In [15]:
# Example of Nigerian Scam email
msg1 = '''Sir, we are honourably seeking your assistance in the following ways.
        1) To provide a Bank account where this money would be transferred to.
        2) To serve as the guardian of this since I am a girl of 26 years.
        Moreover Sir, we are willing to offer you 15% of the sum as compensation for effort input after the successful transfer of this fund to your designate account overseas. please feel free to contact ,me via this email address
        wumi1000abdul@yahoo.comAnticipating to hear from you soon.Thanks and God Bless.'''

# Note from Professor regarding Python + Tableau for plotting flight pattern
msg2 = '''I've mentioned that Python is widely used for data pre-processing tasks. 
        Here's a really nice use of Python for preparing flight data for plotting in Tableau. 
        The same principles can be used for plotting all kinds of geographic related "routes". 
        Plus, you learn about KML files and working with spatial data.'''

In [16]:
words = word_tokenize(msg1)
features = createWordFeatures(words)
print("Message 1 is" ,nb_classifier.classify(features))

Message 1 is spam


<img src="jupyter_images/spam.gif" style="width: 300px;"/>

In [17]:
words = word_tokenize(msg2)
features = createWordFeatures(words)
print("Message 2 is" ,nb_classifier.classify(features))

Message 2 is ham


<img src="jupyter_images/ham.gif" style="width: 300px;"/>

# Deployment
***
## Saving Model
<img src="jupyter_images/picklejar.png"  style="width: 250px;"/>
* **Pickle** is the standard way of serializing objects in Python. We can use the pickle operation to serialize our machine learning algorithms and save the serialized format to a file.

In [18]:
# Saving the model to disk
filename = 'nb_model.sav'
# The context manager closes the file even if pickling fails
with open(filename, 'wb') as f:
    pickle.dump(nb_classifier, f)

## Loading Model

* Here we can load this file to deserialize our model and use it to make new predictions.

In [19]:
# Load the model from disk
start_time_pickle = time.time()
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
print("Loaded the model with Pickle in {:,.2f} seconds".format(time.time()-start_time_pickle))

Loaded the model with Pickle in 2.54 seconds
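A quick way to gain confidence in the saved file is a round trip: save, load, and compare. The sketch below uses a small dictionary as a stand-in for the classifier and writes to a temporary folder; the helper names are hypothetical.

```python
import os
import pickle
import tempfile

def save_object(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_object(path):
    """Deserialize and return the object stored at path."""
    # Only unpickle files you created yourself: loading a pickle can run arbitrary code.
    with open(path, "rb") as f:
        return pickle.load(f)

toy_model = {"enron": "ham", "php": "spam"}
path = os.path.join(tempfile.mkdtemp(), "toy_model.sav")
save_object(toy_model, path)
restored = load_object(path)
print(restored == toy_model)  # True
```

For the real classifier, the equivalent check would be to compare the predictions of *nb_classifier* and *loaded_model* on a few test messages.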


* We can also use **Joblib**.

In [20]:
# Saving the model with Joblib
joblib.dump(nb_classifier, "nb_model.joblib", compress=1)  # compress=1: low compression level, single output file

['nb_model.joblib']

In [21]:
# Loading the model with Joblib
start_time_joblib = time.time()
joblib_model = joblib.load("nb_model.joblib")
print("Loaded the model with Joblib in {:,.2f} seconds".format(time.time()-start_time_joblib))

Loaded the model with Joblib in 13.73 seconds


### Pickle Vs Joblib Performance
<img src="jupyter_images/diff.png"/>

# Web Application
***
<img src="jupyter_images/web.gif"  style="width: 250px;"/>
*  Let's now create a simple web application that lets us enter any message and tells us whether it is spam or ham. We'll be using Flask, a Python web application micro-framework.


**“Hello World” application in Flask**

In [22]:
from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello 5470 World!"

#if __name__ == "__main__":
#    app.run()

## Structure

This is the basic structure of our web application:
![Structure](jupyter_images/structure.png)

* The **templates** folder is the place where the templates will be put. The **static** folder is the place where any files (images, css, javascript) needed by the web application will be put.

* **app.py**: It's our main Python Application.

* **engine.py**: It's the Python class containing the predictive model and its related functions.
***
## Engine ~ Spam Detector as a Class
***

In [23]:
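# Imports needed when this class lives in its own engine.py module
import os, pickle, random
from nltk.classify import NaiveBayesClassifier
from nltk.tokenize import word_tokenize

class SpamDetector: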

	def createWordFeatures(self,words):
		"""
		Function
		--------
		createWordFeatures

		Create a dictionary that returns True for each word.

		Parameters
		----------
		words : list
		The list of words

		Returns
		-------
		resDict : Dict

		Example
		-------
		createWordFeatures(["the", "MIS 5470", "class", "has", "python", "and","R","modules"])
		
		{'MIS 5470': True,
		 'R': True,
		 'and': True,
		 'class': True,
		 'has': True,
		 'modules': True,
		 'python': True,
		 'the': True}
		"""
		resDict = dict( [ (word, True) for word in words] )
		return resDict
		
	def tokenizeCreateWordFeatures(self):
		"""
		Function
		--------
		tokenizeCreateWordFeatures

		Append a "ham" or "spam" at the end of each dictionary created by createWordFeatures function. 
		This is to tell the algorithm that this text is of type ham or spam.
		Merge spam and ham lists.
		Shuffle the result to make it randomised.

		Returns
		-------
		mergedList : list

		Example
		-------
		output = tokenizeCreateWordFeatures()
		print(output[0])
		
		({'Subject': True, ':': True, 'fw': True, 'nice': True, 'mhoter': True, 'fucking': True, 
		'top': True, 'of': True, 'the': True, 'morning': True, 'to': True, 'you': True, '!': True,
		')': True, 'matayaa': True}, 'spam')
		"""
		data_directory = "data"		
		hamList = []
		spamList = []
		for directories, subdirs, files in os.walk(data_directory):
			if (os.path.split(directories)[1]  == 'ham'):
				for fileName in files:      
					with open(os.path.join(directories, fileName), encoding="latin-1") as f:
						message = f.read()					
						words = word_tokenize(message)						
						hamList.append((self.createWordFeatures(words), "ham"))
			
			if (os.path.split(directories)[1]  == 'spam'):
				for fileName in files:
					with open(os.path.join(directories, fileName), encoding="latin-1") as f:
						message = f.read()						
						words = word_tokenize(message)						
						spamList.append((self.createWordFeatures(words), "spam"))
		mergedList = hamList + spamList
		random.shuffle(mergedList)		
		return mergedList

	def createTestTrain(self):
		"""
		Function
		--------
		createTestTrain

		Create test/train splits.

		Returns
		-------
		(trainingData, testData) : tuple
		"""
		mergedList = self.tokenizeCreateWordFeatures()
		trainingPart = int(len(mergedList) * .6)
		trainingData = mergedList[:trainingPart]
		testData =  mergedList[trainingPart:]		
		return (trainingData, testData)


	def createModel(self):
		"""
		Function
		--------
		createModel

		Create the naive Bayes classifier

		Returns
		-------
		classifier : nltk.classify.naivebayes.NaiveBayesClassifier
		"""
		trainingData, testData = self.createTestTrain()		
		classifier = NaiveBayesClassifier.train(trainingData)
		return classifier

	def saveModel(self):
		"""
		Function
		--------
		saveModel

		Save the model to disk

		Returns
		-------
		outputMsg : str
		
		Example
		-------
		saveModel()
		'Model saved successfully !'
		"""
		classifier = self.createModel()
		fileName = 'nb_model.sav'
		outputMsg = ""
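		try:
			# The context manager closes the file even if pickling fails
			with open(fileName, 'wb') as f:
				pickle.dump(classifier, f)
			outputMsg = "Model saved successfully !"
		except (OSError, pickle.PicklingError):
			outputMsg = "Error when saving model !"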
		return outputMsg

	def predictMessage(self,msg):
		"""
		Function
		--------
		predictMessage

		Classify the message by telling if it is a ham or spam.

		Parameters
		----------
		msg : str
		A message we want to classify
		
		Returns
		-------
		str: 'ham' or 'spam'
		
		Example
		-------
		message = "Hi welcome to this session, please log in using your username and password then click on the start button."
		predictMessage(message)
		
		'ham'
		"""
		fileName = 'nb_model.sav'
		with open(fileName, 'rb') as f:
			loaded_model = pickle.load(f)
		words = word_tokenize(msg)
		features = self.createWordFeatures(words)
		return loaded_model.classify(features)
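As written, *predictMessage* reads the pickled model from disk on every call, and the timing above suggests that each load takes a couple of seconds. One possible refinement is to cache the loaded model. This is a sketch, not the original design: the function *load_model* is hypothetical, and it relies on *functools.lru_cache* from the standard library.

```python
import functools
import pickle

@functools.lru_cache(maxsize=None)
def load_model(path="nb_model.sav"):
    """Load the pickled model once; later calls with the same path reuse it."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Inside *predictMessage*, the loading lines would then become `loaded_model = load_model(fileName)`. After the model is rebuilt and saved again, calling `load_model.cache_clear()` makes the next prediction pick up the new file.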

## App ~ Flask Server
***

In [None]:
from flask import Flask, request, render_template, redirect, url_for, session
from markupsafe import Markup  # flask.Markup was removed in Flask 3.0
from engine import SpamDetector
app = Flask(__name__)
app.secret_key = 'F12Zr47j\3yX R~X@H!jmM]Lwf/,?KT'

# Defining the basic route and its corresponding request handler
@app.route("/")
def index():

    return render_template('index.html')


# Run the engine
@app.route("/RunEngine", methods=['POST', 'GET'])
def RunEngine():

    clearsession()
    OutputMsg = None
    message = None

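    # .get() avoids a 400 error when the route is requested without form data (e.g. a plain GET)
    if request.form.get('inputMessage', '') != "":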

        message = request.form['inputMessage']
        spam_detector = SpamDetector()
        OutputMsg = spam_detector.predictMessage(message)
        if OutputMsg == "ham":
            OutputMsg = Markup("<img src='../static/img/ham.png' width='50' height='50'/><br>No worries, your message is fine 😉")
        else:
            OutputMsg = Markup("<img src='../static/img/spam.png' width='50' height='50'/><br>Be careful! Your message is spam!")
        return render_template('index.html', message=message, OutputMsg=OutputMsg)

    else:
        return redirect(url_for('index'))

# Clear the session 
@app.route('/clear')
def clearsession():
    # Clear the session
    session.clear()
    # Redirect the user to the main page
    return redirect(url_for('index'))

# Build the model 
@app.route('/build')
def build_model():
    spam_detector = SpamDetector()
    build_results = spam_detector.saveModel()
    return build_results

# Predict the message passed in the URL
@app.route('/predict/<message>')
def predict_message(message):
    spam_detector = SpamDetector()
    return spam_detector.predictMessage(message)

# Checking if the executed file is the main program and run the app	
#if __name__ == "__main__":
    #app.debug=True
    #app.run()

## Web Page
***
<img src="jupyter_images/webpage.png"/>

## Let's Run It
***

In [None]:
!python app.py

## Spamector as API
***

Spamector can also be used as an API. Here is an example using [Postman](https://www.getpostman.com/) to classify a message passed in the URL.
<img src="jupyter_images/postman.png"/>
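When the API is called from code rather than from Postman, the message has to be percent-encoded before it is placed in the URL. Here is a minimal sketch using only the standard library. The function name *build_predict_url* is hypothetical, and the base address is an assumption: it is the usual default of Flask's development server and should be adjusted to wherever the app is actually running.

```python
from urllib.parse import quote

def build_predict_url(message, base_url="http://127.0.0.1:5000"):
    """Return the /predict URL for a message, percent-encoding spaces and punctuation."""
    return "{}/predict/{}".format(base_url, quote(message, safe=""))

print(build_predict_url("win a free prize now!"))
# http://127.0.0.1:5000/predict/win%20a%20free%20prize%20now%21
```

Messages that contain a slash may not match the `/predict/<message>` route, because Flask's default string converter does not accept slashes. For long or arbitrary messages, the form-based *RunEngine* route is the safer choice.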