In [None]:
from IPython.core.display import HTML
with open ("../style.css", "r") as file:
    css = file.read()
HTML(css)

# Spam Detection Using the Naive Bayes Algorithm

The process of creating a spam detector using the naive Bayes algorithm is split up into four steps.

  - Create a set of the most common words occurring in spam and ham (i.e. non-spam) emails.
  - For every word occurring in this set, compute the conditional probability that this word occurs in a spam or ham email.
  - Create a function that takes an email and the conditional probabilities computed before and that then computes the probability
    that the given email is spam.
  - Evaluate the <em style='color:blue;'>precision</em> and the <em style='color:blue;'>recall</em> of the spam classifier.

## Step 1: Create Word Dictionary

We need the module `os` for reading directories, the module `re` for 
<em style='color:blue;'>regular expressions</em>, and the module `math` for computing logarithms.

In [None]:
import os
import re
import math

An object of class <a href='https://docs.python.org/2/library/collections.html#counter-objects'>`Counter`</a> is a special form of a `dictionary` that is used for counting.  We need a counter to figure out what the most common words are.

In [None]:
from collections import Counter
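
To see how a `Counter` behaves, consider the following minimal sketch.  The three toy emails and the name `doc_freq` are made up for illustration and are not part of the data set.  Updating a counter with a *set* of words increments every word by one, so the counter ends up holding the number of emails that contain each word.

```python
from collections import Counter

# Three toy "emails", each already reduced to its set of words (illustrative data).
toy_emails = [
    {"free", "money", "now"},
    {"free", "offer"},
    {"meeting", "now", "free"},
]

doc_freq = Counter()
for words in toy_emails:
    # update() adds 1 for every element of the set, so each email
    # contributes at most 1 to the count of any word.
    doc_freq.update(words)

print(doc_freq["free"])         # number of toy emails containing 'free'
print(doc_freq["unknown"])      # missing keys count as 0 instead of raising an error
print(doc_freq.most_common(1))  # the most frequent word together with its count
```

The method `most_common(n)` returns the `n` entries with the highest counts, which is what we need in order to pick the most common words.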

The directory 
https://github.com/karlstroetmann/Artificial-Intelligence/tree/master/Python/EmailData
contains 960 emails that are divided into four subdirectories:

  - `spam-train` contains 350 spam emails for training,
  - `ham-train`  contains 350 non-spam emails for training,
  - `spam-test`  contains 130 spam emails for testing,
  - `ham-test`   contains 130 non-spam emails for testing.

This data was originally collected by **Ion Androutsopoulos**.  I found it on the page 
http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html provided by Andrew Ng.

We declare some variables so this notebook can be adapted to other data sets.

In [None]:
spam_dir_train = 'EmailData/spam-train/'
ham__dir_train = 'EmailData/ham-train/'
spam_dir_test  = 'EmailData/spam-test/'
ham__dir_test  = 'EmailData/ham-test/'
Directories    = [spam_dir_train, ham__dir_train, spam_dir_test, ham__dir_test]

In order to compute the <em style='color:blue;'>prior probability</em> that an email is ham or spam we need to count the number of spam and ham emails.

In [None]:
no_spam    = len(os.listdir(spam_dir_train))
no_ham     = len(os.listdir(ham__dir_train))
spam_prior = no_spam / (no_spam + no_ham)
ham__prior = no_ham  / (no_spam + no_ham)
spam_prior, ham__prior

I have checked that the proportion of spam and ham emails in the test directory is also $1:1$.  If the proportion of spam and ham emails in real life differs from $1:1$, we would have to use the real proportion as the prior in the spam filter to be developed.
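
As a small illustration of how the priors depend on the class proportions, the following sketch computes them from raw counts.  The helper `priors_from_counts` and the second pair of counts are assumptions made for this example only.

```python
def priors_from_counts(n_spam, n_ham):
    """Return (spam_prior, ham_prior) estimated from the class counts."""
    total = n_spam + n_ham
    return n_spam / total, n_ham / total

# The training set of this notebook: 350 spam and 350 ham emails.
print(priors_from_counts(350, 350))
# Hypothetical mailbox in which only one email in five is spam.
print(priors_from_counts(200, 800))
```

With unbalanced classes the two priors differ and therefore influence the final decision of the classifier.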

The function $\texttt{get_words}(\texttt{fn})$ takes a filename $\texttt{fn}$ as its argument.  It reads the file and returns a set of all words that are found in this file.  The words are transformed to lower case.

In [None]:
def get_words(fn):
    # The context manager closes the file once it has been read.
    with open(fn) as file:
        text = file.read()
    text = text.lower()
    return set(re.findall(r"[\w']+", text))

Let us test this function with a small example mail.

In [None]:
get_words('EmailData/ham-train/3-380msg4.txt')
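
The tokenisation can also be tried without any file.  The sketch below applies the same regular expression to a string; the helper `words_of` and the sample text are hypothetical and only serve to show which character sequences count as words.

```python
import re

def words_of(text):
    """Hypothetical helper: same tokenisation as get_words, but for a string."""
    return set(re.findall(r"[\w']+", text.lower()))

sample = "Subject: Don't miss this offer! Offer ends 31.12."
# Sorting is only for a stable display; the function itself returns a set.
print(sorted(words_of(sample)))
```

Note that the apostrophe is part of a word, that digits form words as well, and that the two spellings of "offer" collapse into one entry because of the conversion to lower case.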

The function `read_all_files` reads all files contained in those directories that are stored in the list `Directories`. 
It returns a `Counter`.  For every word $w$ this counter contains the number of files that contain $w$. 

In [None]:
def read_all_files():
    Words = Counter()
    for directory in Directories:
        for file_name in os.listdir(directory):
            Words.update(get_words(directory + file_name))
    return Words

`Common_Words` is the set of the 2500 most common words found in all of our emails.

In [None]:
N            = 2500             # number of the most common words to use
Word_Counter = read_all_files()
Word_Counter

In [None]:
Common_Words = { w for w, _ in Word_Counter.most_common(N) }
Common_Words

## Computing the Conditional Probabilities

Having computed the most common words, we are now ready to compute the conditional probability that a given word occurs in a spam email.

The function $\texttt{get_common_words}(\texttt{fn})$ takes a filename $\texttt{fn}$ 
as its argument.  It reads the file and returns the set of all words in `Common_Words` that are found in the given file.  

In [None]:
def get_common_words(fn):
    return get_words(fn) & Common_Words

We test this function for a small email.

In [None]:
get_common_words('EmailData/ham-train/3-380msg4.txt')

The function `count_common_words` takes a string specifying a `directory`.  It returns a 
`Counter` that stores, for every word in `Common_Words`, the number of files in `directory` that contain this word.

In [None]:
def count_common_words(directory):
    Words = Counter()
    for file_name in os.listdir(directory):
        Words.update(get_common_words(directory + file_name))
    return Words

Next, we compute counters that store, for every common word, the number of training emails containing this word.

In [None]:
Spam_Counter = count_common_words(spam_dir_train)
Spam_Counter

In [None]:
Ham__Counter = count_common_words(ham__dir_train)
Ham__Counter

For every common word $w$ we compute the probability that $w$ occurs in a spam or ham email.  The formula for spam is:
$$ P(w \in\texttt{Spam}) = \frac{\mbox{number of spam emails containing $w$}}{\mbox{number of all spam emails}} $$
The formula for ham is similar:
$$ P(w \in\texttt{Ham}) = \frac{\mbox{number of ham emails containing $w$}}{\mbox{number of all ham emails}} $$
However, if we used these formulas, then a common word $w$ that, for some reason, has not yet occurred in any spam email would have a 
probability of $0$ of occurring in a spam email.  Hence, our classifier would never classify an email containing the word $w$ as spam.  As this
cannot be right, we assume that there is one further spam email that contains every common word.  This 
<em style='color:blue;'>Laplace smoothing</em> assumption changes the formula for $P(w \in\texttt{Spam})$ as follows:
$$ P(w \in\texttt{Spam}) = \frac{\mbox{number of spam emails containing $w$} + 1}{\mbox{number of all spam emails} + 1} $$
The formula for $P(w \in\texttt{Ham})$ is smoothed in the same way.

In [None]:
Spam_Probability = {}
Ham__Probability = {}
for w in Common_Words:
    Spam_Probability[w] = (Spam_Counter[w] + 1) / (no_spam + 1) 
    Ham__Probability[w] = (Ham__Counter[w] + 1) / (no_ham  + 1) 
Spam_Probability

According to our computation, the probability that a spam email contains the word `'consonant'` is about $0.28\%$, while the probability that this word occurs in a ham email is $2.55\%$.

In [None]:
Spam_Probability['consonant'], Ham__Probability['consonant']

For the word `'dollar'` the probability that a spam email contains this word is about $21.1\%$, while the probability that this word occurs in a ham email is $1.99\%$.

In [None]:
Spam_Probability['dollar'], Ham__Probability['dollar']
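
The effect of the smoothing can be checked by hand.  The following sketch uses the $350$ spam training emails mentioned above; the helper `smoothed_probability` and the count of $73$ emails are assumptions chosen for illustration, not values read from the data.

```python
def smoothed_probability(count, n_emails):
    """Smoothed estimate as described above: one extra email containing every word."""
    return (count + 1) / (n_emails + 1)

n_spam = 350   # number of spam training emails stated above

# A word that occurs in no spam email no longer gets the probability 0.
p_unseen = smoothed_probability(0, n_spam)
# A word that occurs in 73 spam emails (count assumed for illustration).
p_seen = smoothed_probability(73, n_spam)
print(f"{p_unseen:.4%}  {p_seen:.2%}")
```

A word that has never been seen in spam thus receives a small but positive probability, so a single such word can no longer force the product of all probabilities to zero.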

## Deciding whether an Email is Spam

Given a file name `fn`, the function `spam_probability` returns the probability that the message contained in the given file is spam.  

When implementing the formula 
$$\arg\max\limits_{C \in \mathcal{C}}  \left(\prod\limits_{i=1}^m P(f_i \;|\; C)\right) \cdot P(C) $$
we have to be careful, because a naive implementation will evaluate the product
$$\prod\limits_{i=1}^m P(f_i \;|\; C)$$
as the number $0$ due to numerical underflow.  The trick to compute this product is to remember that
$$ \ln(a \cdot b) = \ln(a) + \ln(b) $$
and therefore transform the product into a sum of logarithms:
$$ \prod\limits_{i=1}^m P(f_i \;|\; C) = \exp\left(\alpha + \sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; C)\bigr) \right) \cdot \exp(-\alpha)$$
Here, the constant $\alpha$ has to be chosen such that the application of the function `exp` to the value
$$ \alpha + \sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; C)\bigr) $$
does not lead to an underflow error.

As we want to compute a probability, we have to be aware that the term
$$ \left(\prod\limits_{i=1}^m P(f_i \;|\; C)\right) \cdot P(C) $$
is not the probability that the object is of class $C$ but is only *proportional* to this probability.  Since the probability that an email
is spam and the probability that it is ham must add up to $1$, we can normalise the two terms to obtain the actual probabilities.
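
The underflow and its remedy can be reproduced in isolation.  The sketch below is a toy setting and not the classifier itself: it assumes $2500$ features that all have the conditional probability $0.5$ under a class `A` and $0.4999$ under a class `B`, and a prior of $0.5$ for both classes.  All of these numbers are made up.

```python
import math

m = 2500                      # number of features (assumed)
p_a, p_b = 0.5, 0.4999        # per-feature probabilities (assumed)
prior = 0.5                   # prior of both classes (assumed)

# Naive products: both underflow to 0.0 in double precision,
# so normalising them would amount to computing 0 / 0.
naive_a = naive_b = 1.0
for _ in range(m):
    naive_a *= p_a
    naive_b *= p_b
print(naive_a, naive_b)

# Sums of logarithms stay in a harmless range.
log_a = m * math.log(p_a)
log_b = m * math.log(p_b)

# Shift both sums by alpha so that the larger one becomes 0.
alpha = -max(log_a, log_b)
score_a = math.exp(log_a + alpha) * prior
score_b = math.exp(log_b + alpha) * prior

# The common factor exp(-alpha) cancels when we normalise.
posterior_a = score_a / (score_a + score_b)
print(posterior_a)
```

Although both products are far too small to be represented as floating point numbers, their ratio is moderate, and this ratio is all that is needed for the normalisation.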

In [None]:
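def spam_probability(fn):
    log_p_spam = 0.0
    log_p_ham  = 0.0
    words = get_common_words(fn)
    # Every common word is a feature that is either present in the email or absent.
    for w in Common_Words:
        if w in words:
            log_p_spam += math.log(Spam_Probability[w])
            log_p_ham  += math.log(Ham__Probability[w])
        else:
            log_p_spam += math.log(1.0 - Spam_Probability[w])
            log_p_ham  += math.log(1.0 - Ham__Probability[w])
    # Both sums are non-positive; shifting by alpha makes the larger one 0,
    # so that exp cannot underflow for both classes at once.
    alpha  = abs(max(log_p_spam, log_p_ham))
    p_spam = math.exp(log_p_spam + alpha) * spam_prior
    p_ham  = math.exp(log_p_ham  + alpha) * ham__prior
    # The common factor exp(-alpha) cancels in the normalisation.
    return p_spam / (p_spam + p_ham)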

Let us test this with a ham email.

In [None]:
spam_probability('EmailData/ham-train/3-430msg1.txt')

Ok, we got this one right.  Let us check the general performance.

## Evaluate Precision and Recall

In order to evaluate the performance of this algorithm, we need to define two new concepts: <em style='color:blue;'>precision</em> and 
<em style='color:blue;'>recall</em>.  Let us call the ham emails the <em style='color:blue;'>positives</em>, while the spam emails are called the
<em style='color:blue;'>negatives</em>.  Then we define

  - <em style='color:blue;'>true positives</em>: ham emails that are classified as ham,
  - <em style='color:blue;'>false positives</em>: spam emails that are classified as ham,
  - <em style='color:blue;'>true negatives</em>: spam emails that are classified as spam,
  - <em style='color:blue;'>false negatives</em>: ham emails that are classified as spam.
  
The <em style='color:blue;'>precision</em> of the spam classifier is then defined as
$$ \texttt{precision} = \frac{\mbox{number of true positives}}{\mbox{number of true positives} + \mbox{number of false positives}} $$
Therefore, the **precision** measures the percentage of the ham emails in the set of all emails that are classified as ham.
The <em style='color:blue;'>recall</em> of the spam classifier is defined as
$$ \texttt{recall} = \frac{\mbox{number of true positives}}{\mbox{number of true positives} + \mbox{number of false negatives}} $$
Therefore, the **recall** measures the percentage of those ham emails that are indeed classified as ham.  

Usually, it is very important that the recall is high as we don't want to miss a ham email because our classifier has incorrectly classified it as a spam email.  
On the other hand, having a high precision is not that important.  After all, if $10\%$ of the emails offered to us as ham are, in fact, spam, we might tolerate this.  However, we would certainly not tolerate missing $10\%$ of our ham emails because they are incorrectly classified as spam.
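
To make these definitions concrete, here is a small worked example.  The helper `precision_recall_from_counts` and the counts are hypothetical; they describe an imagined test run on $130$ ham emails and are not results of our classifier.

```python
def precision_recall_from_counts(tp, fp, fn):
    """Precision and recall where ham is the positive class."""
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    return precision, recall

# Hypothetical outcome: 126 ham emails are kept (true positives),
# 4 ham emails are lost (false negatives), and
# 14 spam emails are let through (false positives).
precision, recall = precision_recall_from_counts(tp=126, fp=14, fn=4)
print(precision, recall)
```

In this imagined run $10\%$ of the emails presented as ham are spam, while only about $3\%$ of the ham emails are lost, which is the kind of trade-off described above.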

The function `precision_recall` takes two directories as arguments: `spam_dir` is supposed to contain spam emails, while `ham_dir` contains ham emails.  It computes the **precision**, the **recall**, and the **accuracy** of our spam classifier with respect to the emails in these directories.  The accuracy is the fraction of all emails that are classified correctly.

In [None]:
def precision_recall(spam_dir, ham_dir):
    TN = 0 # true negatives
    FP = 0 # false positives
    for email in os.listdir(spam_dir):
        if spam_probability(spam_dir + email) > 0.5:
            TN += 1
        else:
            FP += 1
    FN = 0 # false negatives
    TP = 0 # true positives
    for email in os.listdir(ham_dir):
        if spam_probability(ham_dir + email) > 0.5:
            FN += 1
        else:
            TP += 1
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    accuracy  = (TN + TP) / (TN + TP + FN + FP)
    return precision, recall, accuracy

In [None]:
precision_recall(spam_dir_train, ham__dir_train)

In [None]:
precision_recall(spam_dir_test, ham__dir_test)