## Exercise Solution from Chapter 3 of Hands-On Machine Learning

### Salient Points:

* Uses Apache SpamAssassin's public spam/ham dataset ([link](https://homl.info/spamassassin))
* Parses the raw files, which are stored in email format, and cleans their content
* Uses custom sklearn transformers for data cleaning and vectorization

| Model                   | Precision | Recall |
|-------------------------|-----------|--------|
| Logistic Regression     | 92.0%     | 96.8%  |
| Multinomial Naive Bayes | 93.9%     | 97.9%  |

In [1]:
import tarfile
import os
from pathlib import Path

### Step 1: Downloading the Dataset -> Done

### Step 2: EDA

In [2]:
spam_data_path = Path("./data/20030228_spam.tar.bz2")
ham_data_path = Path("./data/20030228_easy_ham.tar.bz2")

In [3]:
# Decompress both archives into data/
for path in [spam_data_path, ham_data_path]:
    with tarfile.open(path) as tar_obj:
        tar_obj.extractall(path="data/")

In [4]:
# Data has been decompressed
os.listdir("data/")

['spam', '20030228_spam.tar.bz2', '20030228_easy_ham.tar.bz2', 'easy_ham']

In [5]:
# Path for decompressed data
spam_path = Path("./data/spam/")
ham_path = Path("./data/easy_ham/")

In [6]:
# List of all spam and ham files
spam_files = [spam_path/file for file in os.listdir(spam_path)  if file != 'cmds']
ham_files = [ham_path/file for file in os.listdir(ham_path)  if file != 'cmds']

In [7]:
print(spam_files[:5])
print(ham_files[:5])

[PosixPath('data/spam/00075.28a918cd03a0ef5aa2f1e0551a798108'), PosixPath('data/spam/00376.f4ed5f002f9b6b320a67f1da9cacbe72'), PosixPath('data/spam/00029.de865ad8d5ad0df985ae2f72388befba'), PosixPath('data/spam/00192.e5a6bb15ae1e965f3b823c75e435651a'), PosixPath('data/spam/00409.e59f63e813b6766a9a4ddf0790634ca3')]
[PosixPath('data/easy_ham/01263.40cec40ea12c55f2ac9a98dc07c55d1c'), PosixPath('data/easy_ham/00238.dab1868a3b43de1e01ebdfd0e53de50f'), PosixPath('data/easy_ham/00088.945614c3f6213f59548ab21306451675'), PosixPath('data/easy_ham/02051.58e196144807bd76d7b77d4b7efb6d32'), PosixPath('data/easy_ham/01232.2f44f5a2186e97cf4d65cf191d98e646')]


In [8]:
len(spam_files), len(ham_files)

(500, 2500)

In [9]:
# File format -> Email format
with open(spam_files[0]) as f:
    content = f.read()
print(content)

From iiu-owner@taint.org  Mon Aug 26 15:48:26 2002
Return-Path: <iiu-owner@taint.org>
Delivered-To: zzzz@localhost.spamassassin.taint.org
Received: from localhost (localhost [127.0.0.1])
	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 35D3247C86
	for <zzzz@localhost>; Mon, 26 Aug 2002 10:41:37 -0400 (EDT)
Received: from phobos [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for zzzz@localhost (single-drop); Mon, 26 Aug 2002 15:41:37 +0100 (IST)
Received: from dogma.slashnull.org (localhost [127.0.0.1]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7NIi2Z03983 for
    <zzzz-list-admin-iiu@jmason.org>; Fri, 23 Aug 2002 19:44:02 +0100
Received: from linux.local ([213.9.245.86]) by dogma.slashnull.org
    (8.11.6/8.11.6) with SMTP id g7NIh1Z03950 for <iiu-admin@taint.org>;
    Fri, 23 Aug 2002 19:43:02 +0100
Message-Id: <200208231843.g7NIh1Z03950@dogma.slashnull.org>
Received: (qmail 28875 invoked from network); 23 Aug 2002 18:18:58 -0000
Received: from un

In [10]:
# Using python's email library as it makes parsing easy
# https://docs.python.org/3/library/email.examples.html
from email import policy
from email.parser import BytesParser

In [11]:
def email_loader(filenames):
    parsed_email = []
    for file in filenames:
        with open(file, 'rb') as fp:
            parsed_email.append(BytesParser(policy=policy.default).parse(fp))
    return parsed_email

In [12]:
spam_email = email_loader(spam_files)
ham_email = email_loader(ham_files)

In [13]:
# Same email as above; the headers can be accessed as key-value pairs
print(spam_email[0].items())
print(spam_email[0].get_content().strip())

[('Return-Path', '<iiu-owner@taint.org>'), ('Delivered-To', 'zzzz@localhost.spamassassin.taint.org'), ('Received', 'from localhost (localhost [127.0.0.1])\tby phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 35D3247C86\tfor <zzzz@localhost>; Mon, 26 Aug 2002 10:41:37 -0400 (EDT)'), ('Received', 'from phobos [127.0.0.1]\tby localhost with IMAP (fetchmail-5.9.0)\tfor zzzz@localhost (single-drop); Mon, 26 Aug 2002 15:41:37 +0100 (IST)'), ('Received', 'from dogma.slashnull.org (localhost [127.0.0.1]) by    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7NIi2Z03983 for    <zzzz-list-admin-iiu@jmason.org>; Fri, 23 Aug 2002 19:44:02 +0100'), ('Received', 'from linux.local ([213.9.245.86]) by dogma.slashnull.org    (8.11.6/8.11.6) with SMTP id g7NIh1Z03950 for <iiu-admin@taint.org>;    Fri, 23 Aug 2002 19:43:02 +0100'), ('Message-Id', '<200208231843.g7NIh1Z03950@dogma.slashnull.org>'), ('Received', '(qmail 28875 invoked from network); 23 Aug 2002 18:18:58 -0000'), ('Received', 'f

In [14]:
# Checking the headers and items for the files
for key, val in spam_email[23].items():
    print("{} : {}".format(key, val))

Return-Path : <evtwqmigru@datcon.co.uk>
Delivered-To : zzzz@localhost.spamassassin.taint.org
Received : from localhost (jalapeno [127.0.0.1])	by zzzzason.org (Postfix) with ESMTP id F3CF116F1B	for <zzzz@localhost>; Tue,  8 Oct 2002 11:02:13 +0100 (IST)
Received : from jalapeno [127.0.0.1]	by localhost with IMAP (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Tue, 08 Oct 2002 11:02:13 +0100 (IST)
Received : from webnote.net (mail.webnote.net [193.120.211.219]) by    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g989wGK10152 for    <zzzz@jmason.org>; Tue, 8 Oct 2002 10:58:16 +0100
Received : from sanaga.camtel.cm ([195.24.194.61]) by webnote.net    (8.9.3/8.9.3) with ESMTP id KAA21887 for <zzzz@spamassassin.taint.org>;    Tue, 8 Oct 2002 10:59:00 +0100
Received : from ens.fr (host42-226.pool8173.interbusiness.it    [81.73.226.42]) by sanaga.camtel.cm with SMTP (Microsoft Exchange Internet    Mail Service Version 5.5.1960.3) id 431SJ0H6; Tue, 8 Oct 2002 06:55:54    -0000
Message-I

The `Content-Type` varies across emails (plain text, HTML, and several multipart types), so we use Beautiful Soup to extract clean text from them.
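
As a small illustration of why the content type matters, the sketch below parses a hand-written multipart message with Python's standard `email` package and lists the content type of each part. The message text and the `XYZ` boundary are made up for this example and are not taken from the dataset.

```python
from email import message_from_string, policy

# Hypothetical two-part message: one plain-text body and one HTML body
raw = (
    "From: a@example.com\n"
    "Subject: demo\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "Hello in plain text\n"
    "--XYZ\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "\n"
    "<html><body><p>Hello in <b>HTML</b></p></body></html>\n"
    "--XYZ--\n"
)

msg = message_from_string(raw, policy=policy.default)

# walk() visits the container first, then every nested part
part_types = [part.get_content_type() for part in msg.walk()]

# Only the leaf parts actually carry text to clean
leaf_types = [part.get_content_type()
              for part in msg.walk() if not part.is_multipart()]

print(part_types)
print(leaf_types)
```

The container itself has no text of its own, so any cleaning step has to descend into the parts and handle `text/plain` and `text/html` separately.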

In [15]:
dict(spam_email[23].items())['Content-Type']

'text/plain; charset="Windows-1252"'

In [16]:
from collections import Counter
def email_content_type(emails):
    content_type = Counter()
    for email in emails:
        content_type[email.get_content_type()] += 1
    return content_type

In [17]:
print(email_content_type(ham_email))
print(email_content_type(spam_email))

Counter({'text/plain': 2408, 'multipart/signed': 68, 'multipart/mixed': 10, 'multipart/alternative': 9, 'multipart/related': 3, 'multipart/report': 2})
Counter({'text/plain': 218, 'text/html': 183, 'multipart/alternative': 47, 'multipart/mixed': 43, 'multipart/related': 9})


The counts above show that spam has a much smaller share of plain-text emails (218 of 500) than ham (2408 of 2500).
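
To put numbers on that comparison, here is a minimal sketch that recomputes the plain-text share from the two counters printed above. The helper name `plain_text_share` is introduced only for this example.

```python
from collections import Counter

# Counts copied from the output of the previous cell
ham_types = Counter({'text/plain': 2408, 'multipart/signed': 68,
                     'multipart/mixed': 10, 'multipart/alternative': 9,
                     'multipart/related': 3, 'multipart/report': 2})
spam_types = Counter({'text/plain': 218, 'text/html': 183,
                      'multipart/alternative': 47, 'multipart/mixed': 43,
                      'multipart/related': 9})

def plain_text_share(counts):
    """Fraction of emails whose top-level content type is text/plain."""
    return counts['text/plain'] / sum(counts.values())

print("ham:  {:.1%}".format(plain_text_share(ham_types)))   # ham:  96.3%
print("spam: {:.1%}".format(plain_text_share(spam_types)))  # spam: 43.6%
```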

In [18]:
# This is a good way of getting the content structures as this gives
# a more detailed output (From solutions)


# def get_email_structure(email):
#     if isinstance(email, str):
#         return email
#     payload = email.get_payload()
#     if isinstance(payload, list):
#         return "multipart({})".format(", ".join([
#             get_email_structure(sub_email)
#             for sub_email in payload
#         ]))
#     else:
#         return email.get_content_type()

# from collections import Counter

# def structures_counter(emails):
#     structures = Counter()
#     for email in emails:
#         structure = get_email_structure(email)
#         structures[structure] += 1
#     return structures

In [36]:
from bs4 import BeautifulSoup
# https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python

In [102]:
def multi_(email):
    """Recursively extract text from the parts of a multipart email."""
    clean_text = ""
    # Heuristic: a payload longer than 100 is treated as one raw string
    # rather than a list of sub-messages
    if len(email.get_payload()) > 100:
        clean_text = ''.join(BeautifulSoup(
            email.get_payload(), "html.parser").stripped_strings)
    else:
        for submail in email.get_payload():
            if submail.get_content_type() in ('multipart/alternative', 'multipart/related', 'multipart/mixed', 'multipart/report', 'multipart/signed'):
                # Nested multipart: recurse into its parts
                clean_text += multi_(submail)
            elif submail.get_content_type() in ("application/octet-stream", "image/jpeg", "image/gif"):
                # Attachments: keep only the content type as a token
                clean_text += submail.get_content_type()
            elif submail.get_content_type() in ("text/plain", "text/html"):
                clean_text += ''.join(BeautifulSoup(
                    submail.get_payload(), "html.parser").stripped_strings)
            # Any other content type is skipped
    return clean_text

def html2text(emails):
    data = []
    for email in emails:
        if email.get_content_type() in ("text/plain", "text/html") or len(email.get_payload()) > 100:
            clean_text = ''.join(BeautifulSoup
                (email.get_payload(), "html.parser").stripped_strings)
        elif email.get_content_type() in ('multipart/alternative', 'multipart/related', 'multipart/mixed', 'multipart/report', 'multipart/signed'):
            clean_text = multi_(email)
        else:
            clean_text = ""
            for submail in email.get_payload():
                clean_text += ''.join(BeautifulSoup
                (submail.get_payload(), "html.parser").stripped_strings)
        data.append(clean_text)
    return data

In [103]:
all_spam_clean = html2text(spam_email)
all_ham_clean = html2text(ham_email)
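
For readers without Beautiful Soup installed, the tag-stripping idea can be sketched with the standard library's `html.parser`. This is not what the notebook runs: `TextExtractor` and `html_to_text` are hypothetical names for this example, and the sketch joins the text chunks with a space, whereas the cells above join `stripped_strings` with an empty string.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the non-empty text chunks found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)

sample = ("<html><body><h1>Offer</h1>"
          "<p>Click <a href='http://example.com'>here</a> now</p>"
          "</body></html>")
print(html_to_text(sample))             # Offer Click here now
print(html_to_text("just plain text"))  # just plain text
```

Joining with a space keeps words from neighbouring tags apart, which affects the tokens that the later cleaning step sees.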

In [106]:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array(all_ham_clean + all_spam_clean)
y = np.array([0] * len(all_ham_clean) + [1] * len(all_spam_clean))

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Creating Custom Transformer

In [117]:
from sklearn.base import TransformerMixin, BaseEstimator
from collections import Counter
import urlextract #Taken from sols as regex approach was not great
import nltk
import re

url_extractor = urlextract.URLExtract()
stemmer = nltk.PorterStemmer()

class data_prep(BaseEstimator, TransformerMixin):
    def __init__(self, lowercase=True, punctuation=True, urls=True, 
                numbers=True, stemming=True):
        self.lowercase = lowercase
        self.punctuation = punctuation
        self.urls = urls
        self.numbers = numbers
        self.stemming = stemming
    def fit(self, x, y=None):
        return self
    def transform(self, X, y=None):
        data = []
        for text in X:
            if self.lowercase:
                text = text.lower()
            if self.urls:
                ext_urls = list(set(url_extractor.find_urls(text)))
                ext_urls.sort(key=lambda x: len(x), reverse=True)
                for url in ext_urls:
                    text = text.replace(url, "URL")
            if self.punctuation:
                text = re.sub(r'[^\w\s]', ' ', text)
            # From solution
            if self.numbers:
                text = re.sub(r'\d+(?:\.\d*(?:[eE]\d+))?', 'NUMBER', text)
            words = Counter(text.split())
            if self.stemming:
                word_stemmed = Counter()
                for word, count in words.items():
                    word_stemmed[stemmer.stem(word)] += count
                words = word_stemmed
            data.append(words)
        return np.array(data)

In [119]:
# It's working
data_prep().fit_transform(x_test[5:10])

array([Counter({'number': 6, 'url': 3, 'date': 1, 'numbertnumb': 1}),
       Counter({'to': 10, 'the': 6, 'url': 5, 'in': 5, 'line': 4, 'i': 4, 'it': 4, 'a': 4, 'mail': 4, 'list': 4, 'm': 3, 'sa': 3, 'and': 3, 'spamassassin': 3, 'user_pref': 3, 'is': 3, 'be': 3, 'have': 2, 'get': 2, 'work': 2, 'like': 2, 'on': 2, 'if': 2, 'user': 2, 'ha': 2, 'file': 2, 'doe': 2, 'that': 2, 'all': 2, 'against': 2, 'ani': 2, 'still': 2, 'here': 2, 'their': 2, 'of': 2, 'for': 2, 'net': 2, 'talk': 2, 'as': 1, 'subject': 1, 'indic': 1, 'sure': 1, 'these': 1, 'are': 1, 'stupid': 1, 'question': 1, 'but': 1, 'troubl': 1, 'understand': 1, 'should': 1, 'about': 1, 'given': 1, 'up': 1, 'tri': 1, 'figur': 1, 'out': 1, 'myself': 1, 'whitelist_from': 1, 'hi': 1, 'not': 1, 'effect': 1, 'tell': 1, 'program': 1, 'take': 1, 'no': 1, 'action': 1, 'at': 1, 'come': 1, 'from': 1, 'or': 1, 'there': 1, 'check': 1, 'done': 1, 'latter': 1, 'what': 1, 'he': 1, 'need': 1, 'place': 1, 'caus': 1, 'such': 1, 'ignor': 1, 'test': 1, '
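
The order of the cleaning steps can be followed on a single sentence with a standard-library-only sketch. It assumes a deliberately simple URL regex in place of `urlextract` and skips stemming, so `URL` and `NUMBER` stay uppercase here, while in the output above the stemmer lowercases them. The `normalize` function and the sample sentence are made up for this example.

```python
import re
from collections import Counter

# Assumption: a crude URL pattern, far less thorough than urlextract
URL_PATTERN = re.compile(r'https?://\S+')

def normalize(text):
    """Lowercase, replace URLs, strip punctuation, replace numbers, count words."""
    text = text.lower()
    text = URL_PATTERN.sub('URL', text)       # before punctuation removal
    text = re.sub(r'[^\w\s]', ' ', text)      # punctuation becomes spaces
    text = re.sub(r'\d+', 'NUMBER', text)     # remaining digit runs
    return Counter(text.split())

counts = normalize("Visit http://example.com NOW, only 20 dollars!")
print(counts)
```

URLs are replaced before punctuation is stripped because removing `:`, `/` and `.` first would break them into pieces that no longer look like URLs.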

In [132]:
# From solutions
from scipy.sparse import csr_matrix

class vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vocab_size = 1000):
        self.vocab_size = vocab_size
    def fit(self, x, y=None):
        total_words = Counter()
        for text in x:
            for word, count in text.items():
                total_words[word] += count
        most_common = total_words.most_common()[:self.vocab_size]
        self.most_common = most_common
        self.vocabulary_ = {word:index+1 for index, (word, count) in enumerate(most_common)}
        return self
    def transform(self, x, y=None):
        rows = []
        cols = []
        data = []
        for index, text in enumerate(x):
            for word, count in text.items():
                rows.append(index)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(x), self.vocab_size + 1))

In [136]:
vectorizer(vocab_size=20).fit_transform(data_prep().fit_transform(x_test[5:10]))

<5x21 sparse matrix of type '<class 'numpy.longlong'>'
	with 65 stored elements in Compressed Sparse Row format>
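
The indexing scheme used by the vectorizer is easier to see with plain lists. The sketch below mirrors the logic with dense rows instead of a `csr_matrix`: the most frequent words get indices starting at 1, and every other word is counted in column 0. The function names and the toy documents are assumptions made for this example.

```python
from collections import Counter

def build_vocabulary(word_counts, vocab_size):
    """Map the vocab_size most frequent words to indices 1..vocab_size."""
    totals = Counter()
    for counts in word_counts:
        totals.update(counts)
    return {word: index + 1
            for index, (word, _) in enumerate(totals.most_common(vocab_size))}

def to_dense_rows(word_counts, vocabulary, vocab_size):
    """One row per document; column 0 collects out-of-vocabulary words."""
    rows = []
    for counts in word_counts:
        row = [0] * (vocab_size + 1)
        for word, count in counts.items():
            row[vocabulary.get(word, 0)] += count
        rows.append(row)
    return rows

docs = [Counter({'url': 3, 'free': 2, 'hello': 1}),
        Counter({'free': 2, 'meeting': 1})]
vocab = build_vocabulary(docs, vocab_size=2)
rows = to_dense_rows(docs, vocab, vocab_size=2)
print(vocab)  # {'free': 1, 'url': 2}
print(rows)   # [[1, 2, 3], [1, 2, 0]]
```

This explains the `5x21` shape above: a vocabulary of 20 words plus one extra column for everything else.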

### Transforming Data for Training

In [137]:
from sklearn.pipeline import Pipeline

data_preprocess_pipe = Pipeline([
    ("text_clean", data_prep()),
    ("text_vectorizer", vectorizer()),
])

In [138]:
x_train_trans = data_preprocess_pipe.fit_transform(x_train)
x_test_trans = data_preprocess_pipe.transform(x_test)

In [140]:
# How the transformed data in vectorized form looks
x_train_trans[0].toarray()

array([[17,  5,  6, ...,  0,  0,  0]])

### Training

In [151]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score

log = LogisticRegression(random_state=42)
mnb = MultinomialNB()

In [147]:
def score(y_test, y_pred):
    print("Precision: {:.1f}%".format(100 * precision_score(y_test, y_pred)))
    print("Recall: {:.1f}%".format(100 * recall_score(y_test, y_pred)))

In [148]:
# Logistic Regression
log.fit(x_train_trans, y_train)

y_pred = log.predict(x_test_trans)
score(y_test, y_pred)

Precision: 92.0%
Recall: 96.8%


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [152]:
# Multinomial Naive Bayes
mnb.fit(x_train_trans, y_train)

y_pred = mnb.predict(x_test_trans)
score(y_test, y_pred)

Precision: 93.9%
Recall: 97.9%
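
As a reminder of what the two reported metrics measure, here is a minimal sketch that computes them by hand, with spam as the positive class (label 1), matching the labels used for `y`. The labels below are toy values, not results from this notebook.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels for illustration only
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 1, 0, 0, 1, 1, 0, 1]
precision, recall = precision_recall(y_true_toy, y_pred_toy)
print("Precision: {:.1f}%".format(100 * precision))  # Precision: 60.0%
print("Recall: {:.1f}%".format(100 * recall))        # Recall: 75.0%
```

For a spam filter, precision reflects how often a flagged email really is spam, and recall reflects how much of the spam is caught.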
