<a href="https://colab.research.google.com/github/tanishq150802/Bizanalytix_Intern_SpamClassifier/blob/main/spam_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Fetching the data**

In [None]:
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
SPAM_PATH = os.path.join("datasets", "spam")

def fetch_data(ham_url=HAM_URL, spam_url=SPAM_URL, spam_path=SPAM_PATH):
    # Download each archive if it is missing, then unpack it into spam_path.
    os.makedirs(spam_path, exist_ok=True)
    for filename, url in (("ham.tar.bz2", ham_url), ("spam.tar.bz2", spam_url)):
        path = os.path.join(spam_path, filename)
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        with tarfile.open(path) as tar_bz2_file:
            tar_bz2_file.extractall(path=spam_path)

fetch_data()

**Listing the email files (only filenames longer than 21 characters are kept)**

In [None]:
HAM_DIR = os.path.join(SPAM_PATH, "easy_ham")
SPAM_DIR = os.path.join(SPAM_PATH, "spam")
ham_filenames = [name for name in sorted(os.listdir(HAM_DIR)) if len(name) > 21]
spam_filenames = [name for name in sorted(os.listdir(SPAM_DIR)) if len(name) > 21]

print(len(ham_filenames),len(spam_filenames))

2500 500


**Parsing the emails with Python's `email` library (thanks to Vikram sir for his help)**

In [None]:
import email
import email.parser
import email.policy

def load_email(is_spam, filename, spam_path=SPAM_PATH):
    directory = "spam" if is_spam else "easy_ham"
    with open(os.path.join(spam_path, directory, filename), "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)

ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filenames]
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filenames]
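
As a small illustration of what `BytesParser` returns, the sketch below parses a hand-written message from bytes instead of a file. The addresses, subject and body are made up for the example and do not come from the corpus.

```python
import email.parser
import email.policy

# Hypothetical raw message: headers, a blank line, then the body.
raw = (b"From: alice@example.com\r\n"
       b"To: bob@example.com\r\n"
       b"Subject: Hello\r\n"
       b"\r\n"
       b"Just a short body.\r\n")

msg = email.parser.BytesParser(policy=email.policy.default).parsebytes(raw)

subject = msg["Subject"]            # headers are read like dictionary keys
content_type = msg.get_content_type()
body = msg.get_content().strip()    # decoded text of a non-multipart message

print(subject, "|", content_type, "|", body)
```

With `policy=email.policy.default` the parser builds `EmailMessage` objects, which is what makes `get_content()` available later in the notebook.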

**Printing a few examples**

In [None]:
print(ham_emails[0].get_content().strip())
print(spam_emails[0].get_content().strip())

Date:        Wed, 21 Aug 2002 10:54:46 -0500
    From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>
    Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>


  | I can't reproduce this error.

For me it is very repeatable... (like every time, without fail).

This is the debug log of the pick happening ...

18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}
18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury
18:19:04 Ftoc_PickMsgs {{1 hit}}
18:19:04 Marking 1 hits
18:19:04 tkerror: syntax error in expression "int ...

Note, if I run the pick command by hand ...

delta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace  4852-4852 -sequence mercury
1 hit

That's where the "1 hit" comes from (obviously).  The version of nmh I'm
using is ...

delta$ pick -version
pick -- nmh-1.0.4 [compiled on fuchsia.cs.mu.OZ.AU at Sun Mar 17 14:55:56 

**Counting the different email structures (plain text, HTML, multipart)**

In [None]:
def get_email_structure(email):
    if isinstance(email, str):  # a bare string payload is returned unchanged
        return email
    payload = email.get_payload()
    if isinstance(payload, list):  # a list payload means the message is multipart
        return "multipart({})".format(", ".join([
            get_email_structure(sub_email)
            for sub_email in payload]))
    else:
        return email.get_content_type()

from collections import Counter

def structures_counter(emails):
    # Count how many emails share each structure string.
    structures = Counter()
    for email in emails:
        struc = get_email_structure(email)
        structures[struc] += 1
    return structures

structures_counter(ham_emails).most_common()

structures_counter(spam_emails).most_common()

[('text/plain', 218),
 ('text/html', 183),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(text/html, text/plain)', 1),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]
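
To see where a label such as `multipart(text/plain, text/html)` comes from, here is a self-contained sketch that builds a two-part message by hand and describes it with the same recursive idea. `describe_structure` and the demo message are illustrative names, not part of the notebook.

```python
from email.message import EmailMessage

def describe_structure(message):
    # Recurse into multipart payloads; report the content type of leaf parts.
    payload = message.get_payload()
    if isinstance(payload, list):
        inner = ", ".join(describe_structure(part) for part in payload)
        return "multipart({})".format(inner)
    return message.get_content_type()

# Hypothetical message with a plain-text body and an HTML alternative.
demo = EmailMessage()
demo.set_content("Plain-text body")
demo.add_alternative("<html><body><p>HTML body</p></body></html>", subtype="html")

demo_structure = describe_structure(demo)
print(demo_structure)
```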

**Taking a look at the headers of one email**

In [None]:
for header, value in spam_emails[1].items(): 
    print(header,":",value)

Return-Path : <ilug-admin@linux.ie>
Delivered-To : zzzz@localhost.spamassassin.taint.org
Received : from localhost (localhost [127.0.0.1])	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id A7FD7454F6	for <zzzz@localhost>; Thu, 22 Aug 2002 08:27:38 -0400 (EDT)
Received : from phobos [127.0.0.1]	by localhost with IMAP (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:27:38 +0100 (IST)
Received : from lugh.tuatha.org (root@lugh.tuatha.org [194.125.145.45]) by    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7MCJiZ06043 for    <zzzz-ilug@jmason.org>; Thu, 22 Aug 2002 13:19:44 +0100
Received : from lugh (root@localhost [127.0.0.1]) by lugh.tuatha.org    (8.9.3/8.9.3) with ESMTP id NAA29323; Thu, 22 Aug 2002 13:18:52 +0100
Received : from email.qves.com ([67.104.83.251]) by lugh.tuatha.org    (8.9.3/8.9.3) with ESMTP id NAA29282 for <ilug@linux.ie>; Thu,    22 Aug 2002 13:18:37 +0100
Received : from qvp0091 ([169.254.6.22]) by email.qves.com with Micros

In [None]:
spam_emails[1]["Subject"]

'[ILUG] Guaranteed to lose 10-12 lbs in 30 days 10.206'

**70-30 Train-Test Split**

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
X = np.array(ham_emails + spam_emails, dtype=object)  # object array: each element is one parsed email
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=None)



**Function to convert HTML to plain text (without BeautifulSoup)**

In [None]:
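import re
from html import unescape
def html_to_plain_text(html):
    # Raw strings keep backslashes such as \s intact for the regex engine.
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)  # drop the <head> section
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)  # mark every link
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)                       # strip remaining tags
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)                  # collapse blank lines
    return unescape(text)                                                      # decode entities such as &amp;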

html_spam_emails = [email for email in X_train[y_train==1] if get_email_structure(email) == "text/html"]
sample_html_spam = html_spam_emails[0]

print(sample_html_spam.get_content().strip()[:1000], "...") 
print(html_to_plain_text(sample_html_spam.get_content())[:1000], "...") #printing a converted example

<HTML><TABLE WIDTH=100% BORDER=0 CELLPADDING=0 CELLSPACING=0><TR><TD align=center valign=middle BGCOLOR=#0A0A5A><center><a href=http://www.freepornsecrets.net/bnr/3001J86020 target=_blank><font color=#FFFF00 size=5 face="Geneva, Arial, Helvetica, san-serif"><strong>GET FREE ACCESS TO XXX PORN!</strong></font></a><br><table width=100 border=3 cellspacing=0 cellpadding=0><tr><td><TABLE WIDTH=550 BORDER=0 CELLPADDING=0 CELLSPACING=0><TR><TD COLSPAN=3><a href=http://www.freepornsecrets.net/bnr/3001J86020 target=_blank><IMG SRC=http://www.freepornsecrets.net/art/freepornsecrets/HC_FPS_01.jpg WIDTH=550 HEIGHT=112 border=0></a></TD></TR><TR><TD><a href=http://www.freepornsecrets.net/bnr/3001J86020 target=_blank><IMG SRC=http://www.freepornsecrets.net/art/freepornsecrets/HC_FPS_02.gif WIDTH=104 HEIGHT=231 border=0></a></TD><TD><a href=http://www.freepornsecrets.net/bnr/3001J86020 target=_blank><IMG SRC=http://www.freepornsecrets.net/art/freepornsecrets/HC_FPS_03.jpg WIDTH=339 HEIGHT!
 =231 bor
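
The four substitutions are easier to follow on a tiny input. The sketch below applies the same sequence of steps to a made-up HTML snippet; `strip_html` and the sample markup are assumptions for the example only.

```python
import re
from html import unescape

def strip_html(html):
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)

# Hypothetical snippet: a title in <head>, one link, one HTML entity.
sample = ('<html><head><title>Offer</title></head><body>'
          '<p>Visit <a href="http://example.com">our site</a> &amp; save</p>'
          '</body></html>')

plain = strip_html(sample)
print(plain.split())
```

The title disappears with the `<head>` block, the opening `<a ...>` tag becomes the token `HYPERLINK`, and `&amp;` is decoded to `&`.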

**Generalized function to convert emails to plain text**

In [None]:
def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue  # skip attachments, images and multipart containers
        try:
            content = part.get_content()
        except Exception:  # fall back to the raw payload on encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content  # prefer the first plain-text part
        else:
            html = content
    if html:
        return html_to_plain_text(html)  # no plain-text part found: convert the HTML

print(email_to_text(sample_html_spam)[:100], "...")

 HYPERLINK GET FREE ACCESS TO XXX PORN! HYPERLINK  HYPERLINK  HYPERLINK  HYPERLINK  HYPERLINK  HYPER ...


**NLTK for stemming with some examples**

In [None]:
import nltk
stemmer = nltk.PorterStemmer()
for word in ("Computations", "Computation", "Computing", "Computed", "Compute", "Compulsive"):
    print(word, "=>", stemmer.stem(word))

Computations => comput
Computation => comput
Computing => comput
Computed => comput
Compute => comput
Compulsive => compuls


**urlextract to replace URLs with the word "URL"**

In [None]:
!pip install urlextract

import urlextract  # requires an Internet connection to download the list of top-level domain names
url_extractor = urlextract.URLExtract()
print(url_extractor.find_urls("Will it detect github.com and https://youtu.be/7Pq-S557XQU?t=3m32s"))

['github.com', 'https://youtu.be/7Pq-S557XQU?t=3m32s']


**Transformer to convert all the emails to word counter arrays**

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strip_headers=True, lower_case=True, remove_punctuation=True,
                 replace_urls=True, replace_numbers=True, stemming=True):
        self.strip_headers = strip_headers
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.stemming = stemming
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_transformed = []
        for email in X:
            text = email_to_text(email) or ""
            if self.lower_case:
                text = text.lower()
            if self.replace_urls and url_extractor is not None:
                urls = list(set(url_extractor.find_urls(text)))
                urls.sort(key=lambda url: len(url), reverse=True)
                for url in urls:
                    text = text.replace(url, " URL ")
            if self.replace_numbers:
                # integers, decimals (12.5) and exponents (3e5) each become one NUMBER token
                text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            word_counts = Counter(text.split())
            if self.stemming and stemmer is not None:
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)
        return np.array(X_transformed)


X_few = X_train[:4] #testing 4 examples
X_few_wordcounts = EmailToWordCounterTransformer().fit_transform(X_few)
X_few_wordcounts

array([Counter({'number': 6, 'is': 3, 'the': 3, 'list': 3, 'liblit': 2, 'if': 2, 'font': 2, 'ugli': 2, 'of': 2, 'are': 2, 'gnomenumb': 2, 'for': 2, 'rpm': 2, 'on': 1, 'fri': 1, 'oct': 1, 'ben': 1, 'eec': 1, 'berkeley': 1, 'edu': 1, 'wrote': 1, 'so': 1, 'your': 1, 'look': 1, 'lack': 1, 'bytecod': 1, 'hint': 1, 'not': 1, 'caus': 1, 'well': 1, 'you': 1, 'right': 1, 'sorri': 1, 'i': 1, 'didn': 1, 't': 1, 'have': 1, 'ani': 1, 'better': 1, 'idea': 1, 'yesterday': 1, 'late': 1, 'in': 1, 'even': 1, 'onli': 1, 'insid': 1, 'and': 1, 'clean': 1, 'app': 1, 'antialias': 1, 'disabl': 1, 'thi': 1, 'difficult': 1, 'to': 1, 'understand': 1, 'me': 1, 'regard': 1, 'from': 1, 'germani': 1, 'matthia': 1, '_______________________________________________': 1, 'mail': 1, 'freshrpm': 1, 'net': 1, 'url': 1}),
       Counter({'i': 10, 'the': 7, 'a': 6, 'in': 5, 'to': 4, 'like': 3, 'get': 3, 'work': 3, 'number': 3, 'that': 3, 's': 3, 'thi': 3, 'it': 3, 'with': 3, 't': 3, 'tag': 3, 'url': 2, 'we': 2, 'of': 2, 'iss
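
The core of `transform` is a short chain of text normalisations followed by a word count. Here is a minimal sketch of that chain on one made-up sentence, leaving out URL replacement and stemming so that it needs only the standard library. The number pattern is an assumption chosen for the example.

```python
import re
from collections import Counter

# Hypothetical input text.
text = "Call 555 now, save 12.5 percent or 3e5 dollars!"

text = text.lower()
# Integers, decimals and exponent forms are each replaced by a single token.
text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
# Any run of non-word characters (punctuation, spaces) becomes one space.
text = re.sub(r'\W+', ' ', text)

word_counts = Counter(text.split())
print(word_counts)
```

In the notebook the stemmer lowercases tokens afterwards, which is why the recorded output shows `number` rather than `NUMBER`.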

**Most models expect numerical inputs, so the word counts are converted to vectors**

In [None]:
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] = total_count[word]+min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.most_common_ = most_common
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))

vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors

X_few_vectors.toarray()

vocab_transformer.vocabulary_

{'a': 8,
 'hyperlink': 9,
 'i': 3,
 'if': 10,
 'in': 7,
 'list': 5,
 'number': 2,
 'the': 1,
 'to': 4,
 'you': 6}

In [None]:
t = X_few_vectors.toarray()  # dense view of the sparse output; column 0 counts words outside the vocabulary
t

array([[ 67,   3,   6,   1,   1,   3,   1,   1,   0,   0,   2],
       [154,   7,   3,  10,   4,   1,   1,   5,   6,   0,   2],
       [ 26,   2,   1,   0,   3,   1,   1,   0,   0,   7,   1],
       [117,   5,   4,   2,   3,   6,   6,   1,   1,   0,   1]],
      dtype=int64)
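
The layout of these rows can be reproduced by hand. The sketch below uses plain lists instead of `csr_matrix` and two made-up word counters; it follows the same rules (per-email counts capped at 10 when ranking words, index 0 reserved for out-of-vocabulary words).

```python
from collections import Counter

# Hypothetical word counts for two emails.
docs = [Counter({"free": 3, "money": 1}),
        Counter({"meeting": 2, "free": 1, "agenda": 1})]
vocabulary_size = 2

# Rank words by total count, capping each email's contribution at 10.
totals = Counter()
for counts in docs:
    for word, count in counts.items():
        totals[word] += min(count, 10)

# Indices start at 1; index 0 is kept for words outside the vocabulary.
vocabulary = {word: index + 1
              for index, (word, _) in enumerate(totals.most_common(vocabulary_size))}

rows = []
for counts in docs:
    row = [0] * (vocabulary_size + 1)
    for word, count in counts.items():
        row[vocabulary.get(word, 0)] += count
    rows.append(row)

print(vocabulary)
print(rows)
```

Here `money` and `agenda` fall outside the two-word vocabulary, so each adds 1 to column 0 of its row.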

**Making a pipeline for all the preprocessing (emails to wordcounters and subsequently, wordcounter to vectors)**

In [None]:
from sklearn.pipeline import Pipeline

preprocess_pipeline = Pipeline([
    ("email_to_wordcount", EmailToWordCounterTransformer()),
    ("wordcount_to_vector", WordCounterToVectorTransformer()),
])
X_train_transformed = preprocess_pipeline.fit_transform(X_train)
type(X_train_transformed)

scipy.sparse.csr.csr_matrix

In [None]:
X_train_transformed.shape

(2100, 1001)

**Converting the sparse matrix to a dense array for training**

In [None]:
X_train_conv=X_train_transformed.toarray()
print(X_train_conv.shape,y_train.shape)

(2100, 1001) (2100,)


In [None]:
X_train_conv

array([[22,  6,  3, ...,  0,  0,  0],
       [30,  3,  7, ...,  0,  0,  0],
       [ 5,  1,  2, ...,  0,  0,  0],
       ...,
       [ 7,  3,  3, ...,  0,  0,  0],
       [13,  5,  4, ...,  0,  0,  0],
       [71, 23,  8, ...,  0,  0,  0]], dtype=int64)

**Training an artificial neural network**

In [None]:
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(units=2100,
                input_shape=(1001,),  # each sample is a vector of 1001 word-count features
                activation='relu'))
model.add(Dense(units=1024, activation='relu'))
model.add(Dense(units=512, activation='relu'))
model.add(Dense(units=256, activation='relu'))
model.add(Dense(units=128, activation='relu'))
model.add(Dense(units=64, activation='relu'))
model.add(Dense(units=16, activation='relu'))
model.add(Dense(units=8, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.fit(X_train_conv,y_train,batch_size=32,epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f66bd02f350>

In [None]:
# Reuse the vocabulary fitted on the training set; refitting on X_test would remap the columns.
X_test_transformed = preprocess_pipeline.transform(X_test)
X_test_conv=X_test_transformed.toarray()

In [None]:
print(X_test_conv.shape,y_test.shape)

(900, 1001) (900,)
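
Test emails have to be encoded with the vocabulary learned from the training emails, otherwise column k stands for one word during training and another during testing. A toy sketch of the problem, using made-up documents and a hypothetical `fit_vocabulary` helper:

```python
from collections import Counter

def fit_vocabulary(docs, size):
    # Map the `size` most frequent words to column indices.
    totals = Counter()
    for doc in docs:
        totals.update(doc.split())
    return {word: index for index, (word, _) in enumerate(totals.most_common(size))}

train_docs = ["free money free", "free money offer"]
test_docs = ["meeting agenda meeting", "meeting agenda notes"]

train_vocab = fit_vocabulary(train_docs, 2)   # learned once, on training data
refit_vocab = fit_vocabulary(test_docs, 2)    # what refitting on test data would give

print(train_vocab)
print(refit_vocab)
```

Column 0 means `free` under the training vocabulary but `meeting` after refitting, so a model trained on the first encoding cannot interpret the second.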


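**Evaluating the ANN on the test set**

The figures recorded below (accuracy 0.8378, weighted precision 0.7019, weighted recall 0.8378) are exactly what a classifier that labels every email as ham scores on this split, so they are a majority-class baseline rather than a measure of what the network learned.

In [None]:
from sklearn.metrics import accuracy_score
# The output layer is a single sigmoid unit, so model.predict returns P(spam) with shape (n, 1).
# Threshold at 0.5 to get labels; argmax over the last axis of an (n, 1) array is always 0 (ham).
yhat = (model.predict(X_test_conv) > 0.5).astype(int).ravel()
acc = accuracy_score(y_test, yhat)
print(acc)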

0.8377777777777777
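
With a single sigmoid output unit, class labels come from thresholding the predicted probability. Taking the arg-max across a row that holds only one value always returns index 0. The sketch below shows the difference in plain Python on made-up probabilities; `np.argmax(..., axis=-1)` behaves the same way on an array of shape `(n, 1)`.

```python
# Hypothetical sigmoid outputs for four emails, shape (4, 1).
probabilities = [[0.05], [0.92], [0.40], [0.77]]

# Arg-max within each one-element row: the only possible index is 0.
argmax_labels = [max(range(len(row)), key=row.__getitem__) for row in probabilities]

# Thresholding P(spam) at 0.5 gives the intended 0/1 labels.
threshold_labels = [int(row[0] > 0.5) for row in probabilities]

print(argmax_labels)
print(threshold_labels)
```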


In [None]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

print("The precision is ", precision_score(y_test, yhat, average='weighted'))
print("The recall is ", recall_score(y_test, yhat, average='weighted'))

The precision is  0.7018716049382716
The recall is  0.8377777777777777
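
These three numbers can be reproduced by arithmetic alone for a predictor that answers "ham" every time. The class counts used below (754 ham, 146 spam) are inferred from the printed accuracy (0.8377... × 900 = 754) and are not shown anywhere in the notebook.

```python
# Inferred test-set composition (assumption, see above).
n_ham, n_spam = 754, 146
n_total = n_ham + n_spam

# Predicting ham for every email:
accuracy = n_ham / n_total          # every ham is right, every spam is wrong
precision_ham = n_ham / n_total     # all 900 predictions are "ham", 754 of them correct
precision_spam = 0.0                # no spam predictions at all; scored as 0

# Support-weighted averages, as with average='weighted'.
weighted_precision = (n_ham * precision_ham + n_spam * precision_spam) / n_total
weighted_recall = (n_ham * 1.0 + n_spam * 0.0) / n_total

print(accuracy, weighted_precision, weighted_recall)
```

Weighted precision comes out as the square of the accuracy, which matches the recorded 0.7019 against 0.8378.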


