# Spam Classifier: Preparing the SpamAssassin Corpus

The open [SpamAssassin Public Corpus](https://spamassassin.apache.org/old/publiccorpus/) consists of several sets of text files, where each file is a raw dump of a single email message (protocol headers plus body). Our first goal is therefore to convert this data into a format that is easier to model.

We will perform the following steps:

1. Keep only the email bodies and discard all email protocol headers.
2. Remove all stop words (articles, prepositions, and so on).
3. Replace all URLs, numbers, and email addresses with the placeholder tokens `HTTPADDRESS`, `NUMBER`, and `EMAILADDRESS`.
4. Normalize all remaining words with [stemming](https://en.wikipedia.org/wiki/Stemming).
5. Save the words of all emails as one large JSON array (one list of words per email).
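
Step 1 can be illustrated with the standard-library `email` module, which the notebook uses further down. The snippet below is a minimal sketch on a made-up message (the addresses, subject, and body are assumptions for illustration, not data from the corpus):

```python
import email

# A made-up raw message: headers, an empty line, then the body.
raw_message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch\n"
    "\n"
    "See you at noon.\n"
)

msg = email.message_from_string(raw_message)

print(msg["Subject"])       # a header value, discarded in step 1
print(msg.is_multipart())   # single-part messages report False
print(msg.get_payload())    # the body, which is what we keep
```

For a single-part message `get_payload()` returns the body as a string, so the headers never reach the later text-processing steps.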

In [6]:
import email
import glob
import itertools
import json
import os
import re

In [7]:
import numpy as np

In [8]:
# tqdm.notebook replaces the deprecated tqdm.tqdm_notebook entry point.
from tqdm.notebook import tqdm as progressbar

Load the required language models and dictionaries:

In [9]:
import nltk

# Uncomment on the first run to download the stop-word list.
# nltk.download("stopwords")
stopwords = set(nltk.corpus.stopwords.words("english"))  # a set keeps membership tests fast
stemmer = nltk.stem.snowball.EnglishStemmer()

## Splitting emails into words

We extract the email bodies and split them into words.

In [10]:
re_email_filename = re.compile(r"[0-9a-f]{5}\.[0-9a-f]{32}", re.UNICODE)
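
To see what this pattern accepts, here is a small check. The first name is the sample file used later in this notebook; the other two names are hypothetical examples of non-email files that might sit in the same folder:

```python
import re

re_email_filename = re.compile(r"[0-9a-f]{5}\.[0-9a-f]{32}", re.UNICODE)

candidates = [
    "01306.01273f7d32eaabde7b20f220e13eb927",  # sample email used later on
    "cmds",                                     # hypothetical helper file
    "notes.txt",                                # hypothetical helper file
]

# match() anchors only at the start of the string.
matching = [name for name in candidates if re_email_filename.match(name)]
print(matching)
```

Because `match()` does not anchor at the end, a name with extra trailing characters would also pass; `fullmatch()` would be the stricter choice if that ever matters.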

In [11]:
re_url = re.compile(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", flags=re.MULTILINE | re.UNICODE)
re_email = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", flags=re.MULTILINE | re.UNICODE)
re_number = re.compile(r"\b\d+\b", flags=re.MULTILINE | re.UNICODE)
re_hash = re.compile(r"[0-9a-f]{32}", flags=re.MULTILINE | re.UNICODE)
re_non_word = re.compile(r"\W+", flags=re.MULTILINE | re.UNICODE)
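
The effect of these patterns is easiest to see on one line of text. The sketch below applies them in the same order as the tokenizer function, but leaves out stop-word removal and stemming so that it runs without NLTK; the input sentence is made up:

```python
import re

# Same patterns as in the cell above.
re_url = re.compile(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", flags=re.MULTILINE | re.UNICODE)
re_email = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", flags=re.MULTILINE | re.UNICODE)
re_number = re.compile(r"\b\d+\b", flags=re.MULTILINE | re.UNICODE)
re_non_word = re.compile(r"\W+", flags=re.MULTILINE | re.UNICODE)

text = "Visit http://example.com/page or mail bob@example.com, order 42 items"

text = text.lower()
text = re_url.sub("HTTPADDRESS", text)      # URLs first, so their digits are not touched later
text = re_email.sub("EMAILADDRESS", text)
text = re_number.sub("NUMBER", text)
text = re_non_word.sub(" ", text).strip()   # punctuation becomes a single space

tokens = text.split()
print(tokens)
```

At this stage the placeholders are still uppercase; in the notebook's final output they appear in lowercase (`httpaddress`, `number`, `emailaddress`), apparently because the stemmer lowercases its input.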

In [12]:
def email_body_to_words(body, stopwords, stemmer):
    # Convert to lowercase.
    body = body.lower()

    # Replace URLs.
    body = re_url.sub("HTTPADDRESS", body)

    # Replace email addresses.
    body = re_email.sub("EMAILADDRESS", body)

    # Replace hashes.
    body = re_hash.sub("HASH", body)
    
    # Replace numbers.
    body = re_number.sub("NUMBER", body)

    # Remove non-word characters (punctuation, spacers, etc.)
    body = re_non_word.sub(" ", body)

    # Remove leading and trailing whitespace.
    body = body.strip()

    # Remove stopwords and stem the remaining words.
    words = [stemmer.stem(word) for word in body.split() if word not in stopwords]
    
    # Drop technical-looking tokens (base64 fragments, names with underscores or digits, etc.).
    words = [word for word in words if len(word) <= 20 and "_" not in word and not any(ch.isdigit() for ch in word)]
    
    return words
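
The last filter in this function can be checked in isolation. The helper name `looks_like_plain_word` and the sample tokens are assumptions made for this sketch; the condition itself is the one used above:

```python
def looks_like_plain_word(word, max_length=20):
    """Hypothetical helper wrapping the filter condition from email_body_to_words."""
    return (
        len(word) <= max_length
        and "_" not in word
        and not any(ch.isdigit() for ch in word)
    )

sample_tokens = ["hello", "rh8", "mime_part", "a" * 40, "perlio"]
kept = [word for word in sample_tokens if looks_like_plain_word(word)]
print(kept)
```

Tokens with digits, underscores, or more than 20 characters are dropped, which removes most encoded attachments and identifiers while leaving ordinary words alone.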

In [13]:
def tokenize_emails_in_folders(folders, filename_regex, stopwords, stemmer):
    emails = []
    
    # Iterate through every folder in our dataset.
    for folder in progressbar(folders, desc="Folders"):
        email_filenames = [
            filename
            for filename in sorted(os.listdir(folder))
            if filename_regex.match(os.path.basename(filename))
        ]

        # Each email is kept as a single file, iterate through them.
        for file in progressbar(email_filenames, desc=os.path.basename(folder)):
            email_path = os.path.join(folder, file)
            
            with open(email_path, "r", encoding="utf-8", errors="ignore") as email_file:
                msg = email.message_from_file(email_file)
                # Only single-part messages are tokenized; multipart messages are skipped.
                if not msg.is_multipart():
                    body = msg.get_payload()
                    words = email_body_to_words(body, stopwords, stemmer)
                    emails.append(words)
    
    return emails
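
The function above skips every multipart message. If those messages should be included as well, one possible approach (not what this notebook does) is to walk the message tree and collect the `text/plain` parts. The sketch below builds a small multipart message in memory; `plain_text_parts` is a hypothetical helper name:

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Build a made-up multipart message with a plain-text and an HTML version.
outer = MIMEMultipart("alternative")
outer["Subject"] = "Demo"
outer.attach(MIMEText("Plain text version.", "plain"))
outer.attach(MIMEText("<p>HTML version.</p>", "html"))

msg = email.message_from_string(outer.as_string())

def plain_text_parts(message):
    """Hypothetical helper: return the decoded text/plain parts of a message."""
    parts = []
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)         # bytes after transfer decoding
            charset = part.get_content_charset() or "utf-8"  # fall back to UTF-8 (assumption)
            parts.append(payload.decode(charset, errors="ignore"))
    return parts

print(msg.is_multipart())
print(plain_text_parts(msg))
```

For a multipart message `get_payload()` without arguments returns a list of sub-messages rather than a string, which is why such messages cannot be passed to `email_body_to_words` directly.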

In [14]:
folder = "C:\\Users\\user\\ML_Basecamp\\Naive Bayes\\data\\"

In [15]:
emails_tokenized_ham = tokenize_emails_in_folders(
    [
        folder + "easy_ham",
        folder + "easy_ham_2",
        folder + "hard_ham"
    ],
    re_email_filename,
    stopwords,
    stemmer
)

[Progress-bar widgets omitted: Folders (3), easy_ham (2500 files), easy_ham_2 (1400 files), hard_ham (250 files)]

In [16]:
emails_tokenized_spam = tokenize_emails_in_folders(
    [
        folder + "spam",
        folder + "spam_2"
    ],
    re_email_filename,
    stopwords,
    stemmer
)

[Progress-bar widgets omitted: Folders (2), spam (500 files), spam_2 (1396 files)]

In [17]:
print("# Ham emails:  ", len(emails_tokenized_ham))
print("# Spam emails: ", len(emails_tokenized_spam))

# Ham emails:   3952
# Spam emails:  1590
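
These counts are lower than the number of files in the folders because multipart messages are skipped. Using the per-folder totals reported by the progress bars during the run, the number of skipped messages can be worked out as follows:

```python
# Files per folder, as reported by the progress bars during the run.
ham_files = {"easy_ham": 2500, "easy_ham_2": 1400, "hard_ham": 250}
spam_files = {"spam": 500, "spam_2": 1396}

# Tokenized emails, as printed above.
ham_kept, spam_kept = 3952, 1590

ham_skipped = sum(ham_files.values()) - ham_kept      # 4150 - 3952
spam_skipped = sum(spam_files.values()) - spam_kept   # 1896 - 1590

print("Skipped ham: ", ham_skipped)
print("Skipped spam:", spam_skipped)
```

So 198 ham and 306 spam messages never reach the tokenizer, which is worth keeping in mind when interpreting the class balance.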


In [18]:
def save_as_json_file(obj, filename):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(obj, f)
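
A later step will need to read these files back. The sketch below shows a round trip through a temporary directory; `load_json_file` is a hypothetical counterpart that is not defined in this notebook, and the token lists are toy data:

```python
import json
import os
import tempfile

def save_as_json_file(obj, filename):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(obj, f)

def load_json_file(filename):
    """Hypothetical counterpart to save_as_json_file."""
    with open(filename, "r", encoding="utf-8") as f:
        return json.load(f)

# Toy data in the same shape as the tokenized corpus: one list of words per email.
emails_tokenized = [["hello", "anyon", "made"], ["buy", "discount"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "emails-tokenized-demo.json")
    save_as_json_file(emails_tokenized, path)
    restored = load_json_file(path)

print(restored == emails_tokenized)
```

Lists of strings survive the round trip unchanged, so the saved file is a faithful copy of the tokenized corpus.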

In [19]:
save_as_json_file(emails_tokenized_ham, folder + "emails-tokenized-ham.json")
save_as_json_file(emails_tokenized_spam, folder + "emails-tokenized-spam.json")

## Building the vocabulary

Our goal is to assign an index to every unique word in the corpus.

First, collect the set of all unique words:

In [20]:
vocab_set = set(itertools.chain.from_iterable(emails_tokenized_ham + emails_tokenized_spam))

Then build a dictionary from it: _word_ -> _index_.

In [21]:
vocab = {
    word: index
    for index, word in enumerate(sorted(vocab_set))
}
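
To show what the word-to-index mapping is for, the sketch below builds a tiny vocabulary the same way and converts one email into a list of indices and into a vector of word counts. The helper names `to_indices` and `to_count_vector` and the toy emails are assumptions for illustration; the notebook itself may encode emails differently later on:

```python
import itertools
from collections import Counter

# Toy stand-ins for the tokenized corpora.
emails_tokenized_ham = [["watch", "list", "list"]]
emails_tokenized_spam = [["buy", "discount", "buy"]]

vocab_set = set(itertools.chain.from_iterable(emails_tokenized_ham + emails_tokenized_spam))
vocab = {word: index for index, word in enumerate(sorted(vocab_set))}

def to_indices(words, vocab):
    """Hypothetical helper: map known words to their vocabulary indices."""
    return [vocab[word] for word in words if word in vocab]

def to_count_vector(words, vocab):
    """Hypothetical helper: count how often each vocabulary word occurs."""
    counts = Counter(to_indices(words, vocab))
    return [counts.get(index, 0) for index in range(len(vocab))]

print(vocab)
print(to_indices(emails_tokenized_spam[0], vocab))
print(to_count_vector(emails_tokenized_spam[0], vocab))
```

Sorting the words before enumerating them makes the indices reproducible between runs, since iteration order over a set of strings is not stable across Python sessions.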

In [22]:
print("Vocabulary length:", len(vocab))

Vocabulary length: 34133


Let's check the vocabulary on a few sample words.

In [23]:
for word in ["buy", "watch", "discount"]:
    print(word.ljust(10), vocab[word])

buy        3953
watch      32213
discount   7616


Save the vocabulary to a file.

In [24]:
save_as_json_file(vocab, folder + "vocab.json")

## Reviewing the results

Let's check how our cleaning method works on an arbitrary email.

In [20]:
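sample_path = os.path.join(folder, "easy_ham", "01306.01273f7d32eaabde7b20f220e13eb927")
# No explicit encoding is passed here, so the platform default is used.
with open(sample_path) as sample_file:
    sample_email = email.message_from_file(sample_file)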

In [21]:
sample_body = sample_email.get_payload()
print(sample_body)

Hello,

Has anyone made a working source RPM for dvd::rip for Red Hat 8.0?
Matthias has a spec file on the site for 0.46, and there are a couple of
spec files lying around on the dvd::rip website, including one I patched
a while ago, but it appears that the Makefile automatically generated is
trying to install the Perl libraries into the system's, and also at the
moment dvd::rip needs to be called with PERLIO=stdio as it seems to not
work with PerlIO on RH8's Perl.

Not too sure what the cleanest way to fix this is - anyone working on
this?

Thanks,

-- 
MichÃ¨l Alexandre Salim
Web:		http://salimma.freeshell.org
GPG/PGP key:	http://salimma.freeshell.org/files/crypto/publickey.asc

__________________________________________________
Do You Yahoo!?
Everything you'll ever need on one web page
from News and Sport to Email and Music Charts
http://uk.my.yahoo.com

_______________________________________________
RPM-List mailing list <RPM-List@freshrpms.net>
http://lists.freshrpms.net/mailman/

In [22]:
print(email_body_to_words(sample_body, stopwords, stemmer))

['hello', 'anyon', 'made', 'work', 'sourc', 'rpm', 'dvd', 'rip', 'red', 'hat', 'number', 'number', 'matthia', 'spec', 'file', 'site', 'number', 'number', 'coupl', 'spec', 'file', 'lie', 'around', 'dvd', 'rip', 'websit', 'includ', 'one', 'patch', 'ago', 'appear', 'makefil', 'automat', 'generat', 'tri', 'instal', 'perl', 'librari', 'system', 'also', 'moment', 'dvd', 'rip', 'need', 'call', 'perlio', 'stdio', 'seem', 'work', 'perlio', 'perl', 'sure', 'cleanest', 'way', 'fix', 'anyon', 'work', 'thank', 'michã', 'l', 'alexandr', 'salim', 'web', 'httpaddress', 'gpg', 'pgp', 'key', 'httpaddress', 'yahoo', 'everyth', 'ever', 'need', 'one', 'web', 'page', 'news', 'sport', 'email', 'music', 'chart', 'httpaddress', 'rpm', 'list', 'mail', 'list', 'emailaddress', 'httpaddress']
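
The tokens `'michã'` and `'l'` in this output come from the garbled name in the signature. It looks like UTF-8 bytes decoded with a single-byte Windows code page, which can happen when a file is opened without an explicit `encoding`. The sketch below reproduces the effect under that assumption (cp1252 as the platform default is a guess, not something the notebook states):

```python
import re

re_non_word = re.compile(r"\W+", flags=re.MULTILINE | re.UNICODE)

original = "Michèl"

# UTF-8 bytes read back as cp1252: "è" (0xC3 0xA8) becomes two characters.
garbled = original.encode("utf-8").decode("cp1252")

tokens_garbled = re_non_word.sub(" ", garbled.lower()).split()
tokens_clean = re_non_word.sub(" ", original.lower()).split()

print(garbled)
print(tokens_garbled)
print(tokens_clean)
```

The second character of the garbled pair is not a word character, so the name is split in two. Opening the sample file with `encoding="utf-8"`, as `tokenize_emails_in_folders` already does, would probably keep the name as a single token.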
