# Spam Classifier

## About Dataset 

Build a spam classifier using data from [Apache SpamAssassin's public datasets](https://spamassassin.apache.org/old/publiccorpus/). Two archives are used: "20030228_easy_ham.tar.bz2" and "20030228_spam.tar.bz2".
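If the two archives are not yet on disk, they could be fetched with something like the sketch below. The helper names `archive_url` and `fetch_archives` are hypothetical, and the sketch assumes the archives sit directly under the corpus URL linked above.

```python
import os
import urllib.request

# Base URL taken from the link above; file names are the two archives named in the text.
CORPUS_URL = "https://spamassassin.apache.org/old/publiccorpus/"
ARCHIVES = ("20030228_easy_ham.tar.bz2", "20030228_spam.tar.bz2")

def archive_url(filename, base_url=CORPUS_URL):
    """Build the download URL for one archive (hypothetical helper)."""
    return base_url + filename

def fetch_archives(target_dir=".", archives=ARCHIVES):
    """Download every archive that is not already present in target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    for filename in archives:
        path = os.path.join(target_dir, filename)
        if not os.path.isfile(path):
            urllib.request.urlretrieve(archive_url(filename), path)

# fetch_archives()  # uncomment to download into the current directory
```

The download call is left commented out so that nothing touches the network until it is explicitly requested.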

In [1]:
# Extract and load emails from tarfiles
import os
import tarfile

spam_path = "spam_dataset"
with tarfile.open("20030228_easy_ham.tar.bz2", "r:bz2") as ham_tar:
    ham_tar.extractall(path=spam_path)
with tarfile.open("20030228_spam.tar.bz2", "r:bz2") as spam_tar:
    spam_tar.extractall(path=spam_path)

# Keep only entries with long file names; the length filter skips any short-named non-email files
ham_filenames = [name for name in sorted(os.listdir(os.path.join(spam_path, "easy_ham"))) if len(name) > 20]
spam_filenames = [name for name in sorted(os.listdir(os.path.join(spam_path, "spam"))) if len(name) > 20]

In [2]:
len(ham_filenames)

2500

In [3]:
len(spam_filenames)

500

In [4]:
# Use `email` module from python to parse these emails.
import email
import email.parser  # imported explicitly so that email.parser.BytesParser is always available
import email.policy

def load_email(is_spam, filename, spam_path=spam_path):
    directory = "spam" if is_spam else "easy_ham"
    with open(os.path.join(spam_path, directory, filename), "rb") as file:
        return email.parser.BytesParser(policy=email.policy.default).parse(file)

In [5]:
ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filenames]
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filenames]

## Data Exploration

In [6]:
# Look at an example of ham emails
print(ham_emails[5].get_content().strip())

> I just had to jump in here as Carbonara is one of my favourites to make and 
> ask 
> what the hell are you supposed to use instead of cream? 

Isn't it just basically a mixture of beaten egg and bacon (or pancetta, 
really)? You mix in the raw egg to the cooked pasta and the heat of the pasta 
cooks the egg. That's my understanding.

Martin

------------------------ Yahoo! Groups Sponsor ---------------------~-->
4 DVDs Free +s&p Join Now
http://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM
---------------------------------------------------------------------~->

To unsubscribe from this group, send an email to:
forteana-unsubscribe@egroups.com

 

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/


In [7]:
# Look at an example of spam emails
print(spam_emails[5].get_content().strip())

A POWERHOUSE GIFTING PROGRAM You Don't Want To Miss! 
 
  GET IN WITH THE FOUNDERS! 
The MAJOR PLAYERS are on This ONE
For ONCE be where the PlayerS are
This is YOUR Private Invitation

EXPERTS ARE CALLING THIS THE FASTEST WAY 
TO HUGE CASH FLOW EVER CONCEIVED
Leverage $1,000 into $50,000 Over and Over Again

THE QUESTION HERE IS:
YOU EITHER WANT TO BE WEALTHY 
OR YOU DON'T!!!
WHICH ONE ARE YOU?
I am tossing you a financial lifeline and for your sake I 
Hope you GRAB onto it and hold on tight For the Ride of youR life!

Testimonials

Hear what average people are doing their first few days:
"We've received 8,000 in 1 day and we are doing that over and over again!' Q.S. in AL
 "I'm a single mother in FL and I've received 12,000 in the last 4 days." D. S. in FL
"I was not sure about this when I sent off my $1,000 pledge, but I got back $2,000 the very next day!" L.L. in KY
"I didn't have the money, so I found myself a partner to work this with. We have received $4,000 over the last 2 days

In [8]:
# Emails come in different structures, with images or attachments

def get_email_structure(email):
    if isinstance(email, str):
        return email
    payload = email.get_payload()
    if isinstance(payload, list):
        return "multipart({})".format(", ".join([get_email_structure(sub_email) for sub_email in payload]))
    else:
        return email.get_content_type()

In [9]:
from collections import Counter

def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

In [10]:
structures_counter(ham_emails).most_common()

[('text/plain', 2408),
 ('multipart(text/plain, application/pgp-signature)', 66),
 ('multipart(text/plain, text/html)', 8),
 ('multipart(text/plain, text/plain)', 4),
 ('multipart(text/plain)', 3),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, text/enriched)', 1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)',
  1),
 ('multipart(text/plain, video/mng)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)',
  1),
 ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart(text/plain, application/x-java-applet)', 1)]

In [11]:
structures_counter(spam_emails).most_common()

[('text/plain', 218),
 ('text/html', 183),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(text/html, text/plain)', 1),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]

The output shows that ham emails are almost all plain text, whereas roughly half of the spam emails contain HTML. Some ham emails are signed with PGP, whereas no spam email is.
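To put a number on this comparison, one option is to compute the share of emails whose structure string mentions `text/html`. The sketch below is illustrative: `html_share` is a hypothetical helper, and the counter it runs on is toy data, not the real corpus counts.

```python
from collections import Counter

def html_share(structures):
    """Fraction of emails whose structure string mentions text/html."""
    total = sum(structures.values())
    with_html = sum(n for s, n in structures.items() if "text/html" in s)
    return with_html / total if total else 0.0

# Toy counter for illustration only (not the real corpus counts)
toy = Counter({"text/plain": 6, "text/html": 3, "multipart(text/plain, text/html)": 1})
print(html_share(toy))  # 0.4
```

Summing the counts printed above the same way gives 258 of 500 spam emails (about 52%) with an HTML part, against 8 of 2,500 ham emails.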

In [12]:
# Check email headers
for header, value in spam_emails[0].items():
    print(header, ":", value)

Return-Path : <12a1mailbot1@web.de>
Delivered-To : zzzz@localhost.spamassassin.taint.org
Received : from localhost (localhost [127.0.0.1])	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32	for <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
Received : from mail.webnote.net [193.120.211.219]	by localhost with POP3 (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)
Received : from dd_it7 ([210.97.77.167])	by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623	for <zzzz@spamassassin.taint.org>; Thu, 22 Aug 2002 13:09:41 +0100
From : 12a1mailbot1@web.de
Received : from r-smtp.korea.com - 203.122.2.197 by dd_it7  with Microsoft SMTPSVC(5.5.1775.675.6);	 Sat, 24 Aug 2002 09:42:10 +0900
To : dcek1a1@netsgo.com
Subject : Life Insurance - Why Pay More?
Date : Wed, 21 Aug 2002 20:31:57 -1600
MIME-Version : 1.0
Message-ID : <0103c1042001882DD_IT7@dd_it7>
Content-Type : text/html; charset="iso-8859-1"
Content-Transfer-Encoding : qu

Among other information, the headers show the sender's email address, which looks suspicious, and the subject line.
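If only a few headers are of interest, they can be read by name instead of printing everything. The following is a minimal sketch, assuming a message parsed with `email.policy.default`. The raw message is a made-up stand-in that reuses two header values from the output above, and `pick_headers` is a hypothetical helper.

```python
import email
import email.policy

# Hypothetical raw message, reusing two header values from the output above
raw = (
    "From: 12a1mailbot1@web.de\n"
    "Subject: Life Insurance - Why Pay More?\n"
    "\n"
    "body text\n"
)
msg = email.message_from_string(raw, policy=email.policy.default)

def pick_headers(message, names=("From", "Subject")):
    """Return the requested headers as plain strings (None if a header is missing)."""
    result = {}
    for name in names:
        value = message[name]  # header lookup is case-insensitive
        result[name] = str(value) if value is not None else None
    return result

selected = pick_headers(msg)
print(selected)
```

The same helper should work on the messages loaded earlier, for example `pick_headers(spam_emails[0])`.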

## Split Dataset Into Training and Test Sets

In [13]:
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array(ham_emails + spam_emails, dtype=object)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Data Preprocessing

Convert HTML to plain text: remove the `<head>` section, replace each `<a>` tag with the word HYPERLINK, remove all remaining HTML tags, replace multiple newlines with a single one, and unescape HTML entities such as `&gt;` or `&nbsp;`.

In [14]:
import re
from html import unescape

def html_to_plain_text(html):
    # Raw strings keep regex escapes such as \s from being read as string escapes
    text = re.sub(r"<head.*?>.*?</head>", "", html, flags=re.M | re.S | re.I)
    text = re.sub(r"<a\s.*?>", " HYPERLINK ", text, flags=re.M | re.S | re.I)
    text = re.sub(r"<.*?>", "", text, flags=re.M | re.S)
    text = re.sub(r"(\s*\n)+", "\n", text, flags=re.M | re.S)
    return unescape(text)

In [28]:
# Testing using HTML spam email
html_spam = [email for email in X_train[y_train==1] if get_email_structure(email)=="text/html"]
sample_html_spam = html_spam[100]
print(sample_html_spam.get_content().strip()[:1000], "...")

<html>

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>Does Your Computer Need an Oil Change</title>
</head>

<body>

<table border="0" width="538" height="1">
  <tr>
    <td width="538" height="1" align="center" bgcolor="#000000"><b><font face="Century Gothic" size="5" color="#FFFFFF">Does Your Computer Need an Oil
      Change?</font></b></td>
  </tr>
</table>
<table border="0" width="538" height="151">
  <tr>
    <td width="530" height="145"><b><font face="Tahoma" size="5">Norton</font><font color="#006600" face="Verdana" size="7"><br></font><i><font face="Verdana" color="#CC0000" size="7">SystemWorks
      2002</font></i><font size="4" face="Verdana"><br> </font><font face="Tahoma" size="5">Professional
      Edition</font> </b></td>
  </tr>
</table>
<table border="0" width="

In [29]:
# The resulting plain text
print(html_to_plain_text(sample_html_spam.get_content())[:1000], "...")


    Does Your Computer Need an Oil
      Change?
    NortonSystemWorks
      2002 Professional
      Edition
    Made
      by the Creators of the #1 Anti-Virus Software on the Market!
    This
      UNBEATABLE software suite comes with  EVERY
      program you'll  ever need to answer the problems or threats that your
      computer faces each day of it's Life!Included in this magnificent deal
      are the following programs:
    Norton
      AntiVirusÿFFFF99 2002 - THE #1
      ANTI-VIRUS PROTECION EVER!Norton UtilitiesÿFFFF99 2002
      - DIAGNOSE ANY PROBLEM WITH YOUR
      SYSTEM!
      Norton GhostÿFFFF99 2002 - MAKES
      BACKING UP YOUR VALUABLE DATA EASY!
      Norton CleanSweepÿFFFF99 2002 - CLEANS
      OUT EXCESS INTERNET FILE BUILDUP!
      Norton WinFaxÿFFFF99 Basic - TURNS YOUR
      CPU INTO A FAX MACHINE!
      GoBackÿFFFFAE 3 Personal - HELPS
      PREVENT YOU FROM MAKING ANY MISTAKES!
    *ALL
      this has a retail price of $99.95*  Get it
      Now for ONLY $29.

In [30]:
# Transform emails into plain text regardless of format
def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        # Fall back to the raw payload if there are encoding issues
        try:
            content = part.get_content()
        except Exception:
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content
        else:
            html = content
    if html:
        return html_to_plain_text(html)

In [32]:
print(email_to_text(sample_html_spam)[:1000], "...")


    Does Your Computer Need an Oil
      Change?
    NortonSystemWorks
      2002 Professional
      Edition
    Made
      by the Creators of the #1 Anti-Virus Software on the Market!
    This
      UNBEATABLE software suite comes with  EVERY
      program you'll  ever need to answer the problems or threats that your
      computer faces each day of it's Life!Included in this magnificent deal
      are the following programs:
    Norton
      AntiVirusÿFFFF99 2002 - THE #1
      ANTI-VIRUS PROTECION EVER!Norton UtilitiesÿFFFF99 2002
      - DIAGNOSE ANY PROBLEM WITH YOUR
      SYSTEM!
      Norton GhostÿFFFF99 2002 - MAKES
      BACKING UP YOUR VALUABLE DATA EASY!
      Norton CleanSweepÿFFFF99 2002 - CLEANS
      OUT EXCESS INTERNET FILE BUILDUP!
      Norton WinFaxÿFFFF99 Basic - TURNS YOUR
      CPU INTO A FAX MACHINE!
      GoBackÿFFFFAE 3 Personal - HELPS
      PREVENT YOU FROM MAKING ANY MISTAKES!
    *ALL
      this has a retail price of $99.95*  Get it
      Now for ONLY $29.

In [33]:
# Add stemming using the Natural Language Toolkit (NLTK) and test it
from nltk import PorterStemmer

stemmer = PorterStemmer()
for word in ("Computations", "Computation", "Computing", "Computed", "Compute", "Compulsive"):
    print(word, "=>", stemmer.stem(word))

Computations => comput
Computation => comput
Computing => comput
Computed => comput
Compute => comput
Compulsive => compuls


In [37]:
# %pip install urlextract

In [38]:
# URLs will be replaced with the token URL using the `urlextract` library. Test the extractor
import urlextract

url_extractor = urlextract.URLExtract()
print(url_extractor.find_urls("Will it detect github.com"))

['github.com']


## Prepare Transformer to Convert Emails to Word Counters 

In [41]:
from sklearn.base import BaseEstimator, TransformerMixin

class EmailWordCounter(BaseEstimator, TransformerMixin):
    
    def __init__(self, strip_headers=True, remove_punctuation=True, lower_case=True, stemming=True, 
                 replace_numbers=True, replace_urls=True):
        self.strip_headers = strip_headers
        self.remove_punctuation = remove_punctuation
        self.lower_case = lower_case
        self.stemming = stemming
        self.replace_numbers = replace_numbers
        self.replace_urls = replace_urls
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X_transformed = []
        for email in X:
            text = email_to_text(email) or ""
            if self.lower_case:
                text = text.lower()
            if self.replace_urls and url_extractor is not None:
                urls = list(set(url_extractor.find_urls(text)))
                urls.sort(key=lambda url: len(url), reverse=True)
                for url in urls:
                    text = text.replace(url, " URL ")
            if self.replace_numbers:
                text = re.sub(r"\d+(?:\.\d*)?(?:[eE][+-]?\d+)?", "NUMBER", text)
            if self.remove_punctuation:
                text = re.sub(r"\W+", " ", text, flags=re.M)
            word_counts = Counter(text.split())
            if self.stemming and stemmer is not None:
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)
        return np.array(X_transformed)

In [44]:
# Test it on a few emails. The goal is to get word counts that will later be converted to vectors
X_few = X_train[:5]
X_few_wordcounts = EmailWordCounter().fit_transform(X_few)
X_few_wordcounts

array([Counter({'chuck': 1, 'murcko': 1, 'wrote': 1, 'stuff': 1, 'yawn': 1, 'r': 1}),
       Counter({'the': 11, 'of': 9, 'and': 8, 'all': 3, 'christian': 3, 'to': 3, 'by': 3, 'jefferson': 2, 'i': 2, 'have': 2, 'superstit': 2, 'one': 2, 'on': 2, 'been': 2, 'ha': 2, 'half': 2, 'rogueri': 2, 'teach': 2, 'jesu': 2, 'some': 1, 'interest': 1, 'quot': 1, 'url': 1, 'thoma': 1, 'examin': 1, 'known': 1, 'word': 1, 'do': 1, 'not': 1, 'find': 1, 'in': 1, 'our': 1, 'particular': 1, 'redeem': 1, 'featur': 1, 'they': 1, 'are': 1, 'alik': 1, 'found': 1, 'fabl': 1, 'mytholog': 1, 'million': 1, 'innoc': 1, 'men': 1, 'women': 1, 'children': 1, 'sinc': 1, 'introduct': 1, 'burnt': 1, 'tortur': 1, 'fine': 1, 'imprison': 1, 'what': 1, 'effect': 1, 'thi': 1, 'coercion': 1, 'make': 1, 'world': 1, 'fool': 1, 'other': 1, 'hypocrit': 1, 'support': 1, 'error': 1, 'over': 1, 'earth': 1, 'six': 1, 'histor': 1, 'american': 1, 'john': 1, 'e': 1, 'remsburg': 1, 'letter': 1, 'william': 1, 'short': 1, 'again': 1, 'becom

## Prepare Transformer to Convert Word Counts to Vectors

In [46]:
# Build a transformer that converts word count dictionaries into sparse vectors using a fixed vocabulary size
# The fit method builds the vocabulary and the transform method uses it to convert word counts to vectors
from scipy.sparse import csr_matrix

class WordCounterVector(BaseEstimator, TransformerMixin):
    
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
        
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))

In [47]:
# Test the transformer
vocab_transformer = WordCounterVector(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors

<5x11 sparse matrix of type '<class 'numpy.intc'>'
	with 39 stored elements in Compressed Sparse Row format>

In [48]:
# Check what the vectors look like
X_few_vectors.toarray()

array([[  6,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [105,  11,   3,   8,   9,   1,   1,   1,   2,   2,   0],
       [ 67,   0,   3,   2,   1,   0,   4,   2,   0,   1,   1],
       [ 48,   1,   6,   1,   1,   2,   2,   1,   2,   0,   0],
       [ 88,   6,   2,   2,   1,   8,   1,   3,   1,   2,   4]],
      dtype=int32)

In the second row, the 105 in the first column counts the word occurrences in that email that are not part of the vocabulary. The 11 next to it means that the first word in the vocabulary appears 11 times in that email, and so on.
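The mapping from one word counter to one row can be traced by hand. The sketch below uses plain lists and a toy three-word vocabulary, both assumptions made for illustration, and `counts_to_row` is a hypothetical helper rather than part of the transformer above.

```python
from collections import Counter

# Toy vocabulary in the same format as `vocabulary_`: indices start at 1,
# and index 0 is reserved for out-of-vocabulary words.
vocabulary = {"the": 1, "to": 2, "and": 3}

def counts_to_row(word_count, vocabulary):
    """Turn one word counter into a dense row; unknown words accumulate in column 0."""
    row = [0] * (len(vocabulary) + 1)
    for word, count in word_count.items():
        row[vocabulary.get(word, 0)] += count
    return row

row = counts_to_row(Counter({"the": 2, "and": 1, "spam": 4, "offer": 1}), vocabulary)
print(row)  # [5, 2, 0, 1]
```

Here "spam" and "offer" are not in the vocabulary, so their counts (4 + 1) add up in column 0. In the sparse version the same accumulation happens because `csr_matrix` sums entries that share a (row, column) pair.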

In [49]:
# Check which words are in the vocabulary
vocab_transformer.vocabulary_

{'the': 1,
 'to': 2,
 'and': 3,
 'of': 4,
 'a': 5,
 'url': 6,
 'in': 7,
 'i': 8,
 'on': 9,
 'number': 10}

## Logistic Regression Classifier