# **Spam Email Classification**

Several supervised learning algorithms can be applied to classification problems, and spam filtering is a common application. This notebook trains five classifiers on the same email features and compares their results: decision tree, random forest, SVC (support vector classifier), logistic regression, and MLP (multilayer perceptron). The emails come from the Apache SpamAssassin public corpus (https://spamassassin.apache.org/old/publiccorpus/).

## Downloading dataset

In [18]:
import os

SPAM_URLS = ["https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2", "https://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2", "https://spamassassin.apache.org/old/publiccorpus/20030228_spam_2.tar.bz2", "https://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2"]
HAM_URLS = ["https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2", "https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2", "https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2", "https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham_2.tar.bz2","https://spamassassin.apache.org/old/publiccorpus/20030228_hard_ham.tar.bz2"]
DOWNLOAD_PATH = os.path.join("dataset")
EXTRACTION_PATH = os.path.join(DOWNLOAD_PATH, "extracted")

In [19]:
from urllib.request import urlretrieve
import tarfile

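# download one archive from the spam corpus and extract it under EXTRACTION_PATH
def download_file(url):
    filename = url.split('/')[-1]

    # create the download folder if it does not exist yet
    os.makedirs(DOWNLOAD_PATH, exist_ok=True)

    filepath, headers = urlretrieve(url, os.path.join(DOWNLOAD_PATH, filename))

    # extract into a folder named after the archive, e.g. "20021010_spam";
    # the context manager closes the archive once extraction finishes
    with tarfile.open(filepath) as tar:
        tar.extractall(os.path.join(EXTRACTION_PATH, filename.split('.')[0]))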
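
The extraction step can be tried without network access. The sketch below is a minimal, hypothetical variant of it: `extract_archive` is not part of the notebook, and the tiny archive is built on the fly purely for illustration. It shows how the target folder name is derived from the archive name, which is the directory layout the mail-loading code depends on.

```python
import os
import tarfile
import tempfile

def extract_archive(archive_path, extraction_root):
    """Extract a .tar.bz2 archive into a folder named after the archive."""
    # "20021010_spam.tar.bz2" -> "20021010_spam"
    folder_name = os.path.basename(archive_path).split(".")[0]
    target_dir = os.path.join(extraction_root, folder_name)
    # The context manager closes the archive even if extraction fails.
    with tarfile.open(archive_path, "r:bz2") as tar:
        tar.extractall(target_dir)
    return target_dir

# Offline demonstration: build a tiny archive, then extract it.
with tempfile.TemporaryDirectory() as workdir:
    mail_dir = os.path.join(workdir, "spam")
    os.makedirs(mail_dir)
    with open(os.path.join(mail_dir, "0001.sample"), "w") as fh:
        fh.write("Subject: test\n\nhello\n")

    archive_path = os.path.join(workdir, "20021010_spam.tar.bz2")
    with tarfile.open(archive_path, "w:bz2") as tar:
        tar.add(mail_dir, arcname="spam")

    target = extract_archive(archive_path, os.path.join(workdir, "extracted"))
    extracted_files = sorted(os.listdir(os.path.join(target, "spam")))
    print(extracted_files)
```

The extracted mail ends up at `<extraction root>/<archive name>/<folder inside the archive>/<mail file>`, three levels below the extraction root.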

## Preprocessing and Feature Extraction

In [20]:
import re

# function to preprocess the email message body to remove unnecessary contents
def get_processed_msg_content(message):
    
    # replace line breaks with spaces
    message = message.replace("\n", " ")
    
    # change the texts to lower case
    message = message.lower()
    
    # strip HTML tags, including any quoted attribute values inside them
    message = re.sub(r"<(\"[^\"]*\"|'[^']*'|[^'\">])*>", " ", message)
    
    # replace email addresses with the token EMAIL
    message = re.sub(r"[\S]+@[\S]+\.[\S]+", "EMAIL", message)
    
    # replace URLs with the token URL
    message = re.sub(r"http[s]?://[\S]+", "URL", message)
    
    # replace dollar signs, and any amount that follows them, with the token AMOUNT
    message = re.sub(r"\$([ ]?(\d)+)?", "AMOUNT", message)
    
    # replace standalone numbers with the token NUMBER
    message = re.sub(r"\b(\d)+\b", "NUMBER", message)
    
    # remove unnecessary punctuations and special characters
    message = re.sub(r"[!@#$%^&*()_+\-=\[\]{};`~':\"\\|,.<>\/?]+", " ", message)
    
    return message
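
To see what these substitutions do to a message, the sketch below applies them to a made-up sample. `preprocess_demo` and `SUBSTITUTIONS` are hypothetical names: they are a condensed copy of the same regular expressions in the same order, with the HTML pattern written using straight quotes, which is an assumption about what the original pattern intended.

```python
import re

# Hypothetical condensed copy of the substitutions above, in the same order.
SUBSTITUTIONS = [
    (r"<(\"[^\"]*\"|'[^']*'|[^'\">])*>", " "),      # HTML tags (straight quotes assumed)
    (r"[\S]+@[\S]+\.[\S]+", "EMAIL"),               # email addresses
    (r"http[s]?://[\S]+", "URL"),                   # URLs
    (r"\$([ ]?(\d)+)?", "AMOUNT"),                  # dollar sign plus optional amount
    (r"\b(\d)+\b", "NUMBER"),                       # standalone numbers
    (r"[!@#$%^&*()_+\-=\[\]{};`~':\"\\|,.<>\/?]+", " "),  # leftover punctuation
]

def preprocess_demo(message):
    # join lines and lowercase before applying the substitutions
    message = message.replace("\n", " ").lower()
    for pattern, replacement in SUBSTITUTIONS:
        message = re.sub(pattern, replacement, message)
    return message

sample = "Click <b>here</b>: http://example.com/offer or mail bob@example.com\nWin $1000 in 7 days!"
tokens = preprocess_demo(sample).split()
print(tokens)
```

The placeholder tokens are inserted in upper case after the text has been lowercased. Because `CountVectorizer` lowercases its input by default, they end up as ordinary lowercase vocabulary entries.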

In [21]:
import glob
import email

def get_mail_contents(mail_type):
    messages = []
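    # extracted layout: <EXTRACTION_PATH>/<archive name>/<folder>/<mail file>
    filelists = glob.glob(EXTRACTION_PATH + "/*" + mail_type + "*/*/*")
    for email_file in filelists:
        message = ""
        try:
            # latin-1 maps every byte to a character, so reading never fails on encoding
            with open(email_file, encoding='latin-1') as fp:
                email_content = email.message_from_file(fp)
            # keep the payload of the last text/plain part of the mail
            for part in email_content.walk():
                if part.get_content_type() == 'text/plain':
                    message = part.get_payload()
        except Exception:
            print("Error in parsing document %r" % email_file)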
        messages.append(get_processed_msg_content(message))

    return messages
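
The walk over MIME parts can be illustrated on a small hand-written message. The sketch below is a variation rather than the notebook's exact logic: `extract_plain_text` and `RAW_MAIL` are hypothetical, it joins every `text/plain` part instead of keeping only the last one, and it passes `decode=True` so that a base64 or quoted-printable body is decoded, which the notebook's code does not do.

```python
import email

# A made-up multipart mail: a base64-encoded plain-text part and an HTML part.
RAW_MAIL = """\
From: sender@example.com
Subject: Demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

SGVsbG8gd29ybGQ=
--XYZ
Content-Type: text/html; charset="utf-8"

<html><body><p>Hello world</p></body></html>
--XYZ--
"""

def extract_plain_text(raw_mail):
    msg = email.message_from_string(raw_mail)
    texts = []
    for part in msg.walk():
        # skip the multipart container and the HTML alternative
        if part.get_content_type() != "text/plain":
            continue
        # decode=True undoes the Content-Transfer-Encoding and returns bytes
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "latin-1"
        texts.append(payload.decode(charset, errors="replace"))
    return "\n".join(texts)

plain_text = extract_plain_text(RAW_MAIL)
print(plain_text)
```

Without `decode=True`, the payload of this part would be the raw base64 string, and the vectorizer would see it as one long meaningless token.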

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn.utils import shuffle

# creating features using the 5000 most frequently found words in the mail dataset
def create_features():
    ham = get_mail_contents("ham")
    spam = get_mail_contents("spam")
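    messages = ham + spam
    # bag-of-words counts, limited to the 5000 most frequent non-stop-word terms
    vectorizer = CountVectorizer(stop_words = 'english', max_features = 5000)
    X = vectorizer.fit_transform(messages).toarray()
    # labels: 0 for ham, 1 for spam, in the same order as the rows of X
    y = np.concatenate((np.zeros(len(ham)), np.ones(len(spam))))
    # shuffle rows and labels together so the two classes are interleaved
    return shuffle(X, y, random_state = 4)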
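
The idea behind the count features can be shown at toy scale in plain Python. The sketch below is a simplified stand-in, not scikit-learn's implementation: `bag_of_words` is a hypothetical helper, the three messages are invented, and the alphabetical tie-breaking between equally frequent words is an assumption of the sketch.

```python
import re
from collections import Counter

def bag_of_words(messages, max_features, stop_words=frozenset()):
    """Simplified stand-in for a count vectorizer; not scikit-learn's implementation."""
    # lowercase, then keep tokens of two or more word characters
    token_pattern = re.compile(r"\b\w\w+\b")
    tokenized = [
        [tok for tok in token_pattern.findall(msg.lower()) if tok not in stop_words]
        for msg in messages
    ]
    # corpus-wide term frequencies decide which words become features
    totals = Counter(tok for tokens in tokenized for tok in tokens)
    # ties are broken alphabetically here; that choice is an assumption of this sketch
    kept = sorted(totals, key=lambda tok: (-totals[tok], tok))[:max_features]
    vocabulary = sorted(kept)
    # one row per message, one column per vocabulary word
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[tok] for tok in vocabulary])
    return vocabulary, rows

ham = ["meeting agenda for monday", "free meeting"]
spam = ["free money free offer"]
vocabulary, X = bag_of_words(ham + spam, max_features=3, stop_words={"for"})
y = [0] * len(ham) + [1] * len(spam)   # 0 = ham, 1 = spam, as in the notebook
print(vocabulary)
print(X)
print(y)
```

Each row counts how often each vocabulary word occurs in one message. Words outside the kept vocabulary, such as "money" and "offer" here, contribute nothing to the features.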

In [23]:
def load_dataset():
    for url in SPAM_URLS + HAM_URLS:
        download_file(url)
    return create_features()

In [24]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

def display_classifier_metrics(classifier, y_actual, y_pred):
    print("\n\n", classifier)
    print("Confusion matrix : ", confusion_matrix(y_actual, y_pred))
    print("Precision : ", precision_score(y_actual, y_pred))
    print("Recall : ", recall_score(y_actual, y_pred))
    print("F1 score : ", f1_score(y_actual, y_pred))
    print("Accuracy score : ", accuracy_score(y_actual, y_pred))
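
All four scores follow from the confusion matrix. As a worked example, the sketch below recomputes them by hand from the SVC confusion matrix reported in the Result section, assuming scikit-learn's layout of rows for actual classes and columns for predicted classes, with spam (label 1) as the positive class. `metrics_from_confusion_matrix` is a hypothetical helper, not part of the notebook.

```python
def metrics_from_confusion_matrix(cm):
    # layout assumed: [[true negatives, false positives],
    #                  [false negatives, true positives]]
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)          # share of predicted spam that is spam
    recall = tp / (tp + fn)             # share of actual spam that was caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# SVC confusion matrix copied from the Result section
svc_cm = [[2267, 22], [783, 476]]
svc_metrics = metrics_from_confusion_matrix(svc_cm)
for name, value in svc_metrics.items():
    print(name, round(value, 4))
```

The values agree with the scores printed for SVC: only 22 ham emails are flagged as spam, which gives a high precision, but 783 of the 1259 spam emails are missed, which gives a low recall.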

## Training the models

In [25]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = load_dataset()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 4)

classifiers = {
    'svc' : SVC(),
    'decision tree' : DecisionTreeClassifier(),
    'random forest' : RandomForestClassifier(),
    'logistic regression' : LogisticRegression(solver = 'newton-cg'),
    'mlp' : MLPClassifier()
    }

## Result

In [26]:
for classifier_name, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    display_classifier_metrics(classifier_name, y_test, y_pred)



 svc
Confusion matrix :  [[2267   22]
 [ 783  476]]
Precision :  0.9558232931726908
Recall :  0.3780778395552025
F1 score :  0.5418326693227091
Accuracy score :  0.7731116121758738


 decision tree
Confusion matrix :  [[2141  148]
 [  19 1240]]
Precision :  0.8933717579250721
Recall :  0.9849086576648134
F1 score :  0.9369097091046469
Accuracy score :  0.9529312288613303


 random forest
Confusion matrix :  [[2199   90]
 [  15 1244]]
Precision :  0.9325337331334332
Recall :  0.9880857823669579
F1 score :  0.9595063632857693
Accuracy score :  0.9704058624577226


 logistic regression
Confusion matrix :  [[2192   97]
 [  12 1247]]
Precision :  0.9278273809523809
Recall :  0.9904686258935663
F1 score :  0.9581252401075682
Accuracy score :  0.9692784667418264


 mlp
Confusion matrix :  [[2202   87]
 [  12 1247]]
Precision :  0.9347826086956522
Recall :  0.9904686258935663
F1 score :  0.9618202853837254
Accuracy score :  0.9720969560315671


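On the held-out test set, SVC performs worst, with an accuracy of 0.773 and an F1 score of 0.542. Its precision is high (0.956), but its recall is only 0.378, so it misses 783 of the 1259 spam emails. The other four classifiers all reach an accuracy above 0.95 and an F1 score above 0.93, with MLP scoring highest on both (accuracy 0.972, F1 0.962).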