# Homework 02

## 2.1 - Spam Filter


This week's assignment was to build a spam filter using the method
described by Paul Graham (http://www.paulgraham.com/spam.html).
It takes a statistical approach to detecting spam.

Below are functions used to help implement the spam filter.

In [10]:
def build_hash(corpus):
    """Builds a hashtable of occurrence counts for each word in a corpus"""
    table = {}
    for email in corpus:
        for token in email:
            # lowercase for consistency across the two hash tables
            key = token.lower()
            # a single pass is enough: add one to the count each time the word is seen
            table[key] = table.get(key, 0) + 1
    return table


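def spam_probability(token, good, bad, ngood, nbad):
    """Estimates the probability that an email containing the given token is spam"""
    # occurrences in good mail are doubled to bias the filter against false positives
    g = 2 * good.get(token, 0)
    b = bad.get(token, 0)
    if g + b < 1:
        # token was never seen in either corpus
        return 0
    good_freq = min(1.0, g / ngood)
    bad_freq = min(1.0, b / nbad)
    # clamp to [0.01, 0.99] so no single token counts as absolute proof
    return max(0.01, min(0.99, bad_freq / (good_freq + bad_freq)))

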
def build_probability_table(good, bad, ngood, nbad):
    """Builds the spam probability table for every token in the good and bad corpora"""
    probs_table = {}
    for word in good:
        probs_table[word] = spam_probability(word, good, bad, ngood, nbad)
    # words seen only in the bad corpus
    for word in bad:
        if word not in probs_table:
            probs_table[word] = spam_probability(word, good, bad, ngood, nbad)
    return probs_table


def new_mail_probability(mail, probs_table):
    """Given an email and the probabilities of each token, calculates the probability the individual email is spam."""
    prod = 1
    complements = 1
    # finds product of all the elements, and complements of the elements
    for token in mail:
        # tokens missing from the table would otherwise come back as None and raise a TypeError;
        # 0.4 is the value Graham's essay assigns to words never seen before
        prob = probs_table.get(token.lower(), 0.4)
        prod *= prob
        complements *= (1 - prob)

    return prod / (prod + complements)

The spam filter is tested using these two predefined corpora:


In [11]:
spam_corpus = [["I", "am", "spam", "spam", "I", "am"], ["I", "do", "not", "like", "that", "spamiam"]]
ham_corpus = [["do", "i", "like", "green", "eggs", "and", "ham"], ["i", "do"]]

Using the build_hash() function, the number of occurrences of
each word in each corpus is recorded.

In [12]:
spam = build_hash(spam_corpus)
ham = build_hash(ham_corpus)

print(spam)
print(ham)

{'i': 3, 'am': 2, 'spam': 2, 'do': 1, 'not': 1, 'like': 1, 'that': 1, 'spamiam': 1}
{'do': 2, 'i': 2, 'like': 1, 'green': 1, 'eggs': 1, 'and': 1, 'ham': 1}
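
The same counts can also be produced with `collections.Counter` from the standard library. The sketch below is an alternative to build_hash(), not the code used in this notebook; `count_tokens` is a hypothetical name chosen for illustration.

```python
from collections import Counter


def count_tokens(corpus):
    """Count lowercase token occurrences across every email in a corpus."""
    return Counter(token.lower() for email in corpus for token in email)


spam_corpus = [["I", "am", "spam", "spam", "I", "am"],
               ["I", "do", "not", "like", "that", "spamiam"]]
spam_counts = count_tokens(spam_corpus)
print(dict(spam_counts))
```

A `Counter` returns 0 for keys it has never seen, so a lookup such as `spam_counts["ham"]` needs no fallback value.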


The probability table gives, for each word, the estimated
probability that an email containing that word is spam.

In [13]:
# number of emails in each corpus
nbad = len(spam_corpus)
ngood = len(ham_corpus)

probabilities = build_probability_table(ham, spam, ngood, nbad)
print(probabilities)

{'do': 0.3333333333333333, 'i': 0.5, 'like': 0.3333333333333333, 'green': 0.01, 'eggs': 0.01, 'and': 0.01, 'ham': 0.01, 'am': 0.99, 'spam': 0.99, 'not': 0.99, 'that': 0.99, 'spamiam': 0.99}
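
As a worked check of one table entry, take the token `do`. It appears twice in the ham corpus, so g = 2 · 2 = 4, and once in the spam corpus, so b = 1, with ngood = nbad = 2. Substituting into the expression used in spam_probability() (the symbols below mirror that function's variable names):

$$
P(\text{spam} \mid \texttt{do}) = \frac{\min(1,\, b / n_{bad})}{\min(1,\, g / n_{good}) + \min(1,\, b / n_{bad})} = \frac{0.5}{1 + 0.5} = \frac{1}{3}
$$

Tokens such as `green`, which occur only in ham, have a numerator of 0 and are clamped up to 0.01. Tokens such as `spam`, which occur only in spam, evaluate to 1 and are clamped down to 0.99.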


To test the filter on the original "emails" we started with, we can
use the new_mail_probability() function. For each of the four emails,
the resulting spam probability falls on the correct side of 0.5.

In [14]:
spam_prob = new_mail_probability(spam_corpus[0], probabilities)
print("spam_corpus[0]: " + str(spam_prob))
spam_prob = new_mail_probability(spam_corpus[1], probabilities)
print("spam_corpus[1]: " + str(spam_prob))
spam_prob = new_mail_probability(ham_corpus[0], probabilities)
print("ham_corpus[0]: " + str(spam_prob))
spam_prob = new_mail_probability(ham_corpus[1], probabilities)
print("ham_corpus[1]: " + str(spam_prob))

spam_corpus[0]: 0.9999999895897965
spam_corpus[1]: 0.999995877576386
ham_corpus[0]: 2.6025508824397714e-09
ham_corpus[1]: 0.3333333333333333


All the numbers are either very high or very low, except for
ham_corpus[1]. That's because the "email" there was simply
"i do". Even with so little information to go on,
the spam filter still leans the right way: at 0.33, the email
is judged more likely ham than spam.
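
To make the ham_corpus[1] result concrete, here is a minimal standalone sketch of the combining step. The helper name `combine` is hypothetical, and `math.prod` assumes Python 3.8 or newer; the per-token probabilities are copied from the table printed above.

```python
from math import prod


def combine(probs):
    """Combine per-token spam probabilities into one probability for the email."""
    spam = prod(probs)                 # product of the token probabilities
    ham = prod(1 - p for p in probs)   # product of their complements
    return spam / (spam + ham)


# "i" -> 0.5 and "do" -> 1/3, as in the probability table
short_email = combine([0.5, 1 / 3])
# the six tokens of spam_corpus[0]: four at 0.99 and two at 0.5
long_email = combine([0.99, 0.99, 0.99, 0.99, 0.5, 0.5])
print(short_email, long_email)
```

A token with probability 0.5 contributes the same factor to both products and so carries no evidence. The two-word email is therefore decided by `do` alone, while four strongly spammy tokens push the longer email very close to 1.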

Graham calls this approach a Bayesian one because it assigns a spam probability
to each email, not just a score. Other spam filters use scores,
but it is hard to say what a score actually means. Because this
approach uses probability, both good and bad evidence can
be used to determine whether an email is spam, and the
probability given for an individual email gives much more insight
than a simple score does.

Bayesian statistics also describes probability as a degree
of belief in an event. This degree of belief is based on
prior knowledge or experience. In this case, the degree
of belief is the probability that a given email is spam, and the
prior knowledge is the frequency of words in other
spam and non-spam emails. By this definition of Bayesian
statistics, this is very much a Bayesian approach to spam
detection.
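
One way to see the connection is to write out Bayes' rule for a single word $w$, with $S$ and $H$ standing for spam and ham. The sketch below assumes that spam and ham are equally likely a priori, which is an assumption made here for illustration and not something the code above states:

$$
P(S \mid w) = \frac{P(w \mid S)\,P(S)}{P(w \mid S)\,P(S) + P(w \mid H)\,P(H)} = \frac{P(w \mid S)}{P(w \mid S) + P(w \mid H)} \quad \text{when } P(S) = P(H)
$$

Estimating $P(w \mid S)$ by b / nbad and $P(w \mid H)$ by g / ngood recovers the ratio computed in spam_probability(), apart from the doubling of the good counts and the clamping to [0.01, 0.99].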

## 2.2 - Bayesian Network


The following Bayesian network is built from Figure 14.12a
of the AIMA textbook:

In [15]:
from probability import BayesNet, enumeration_ask, elimination_ask, gibbs_ask, likelihood_weighting, rejection_sampling

# Utility variables
T, F = True, False

weather = BayesNet([
    ('Cloudy', '', 0.5),
    ('Sprinkler', 'Cloudy', {T: 0.1, F: 0.5}),
    ('Rain', 'Cloudy', {T: 0.8, F: 0.2}),
    ('WetGrass', 'Sprinkler Rain', {(T, T): 0.99, (T, F): 0.9, (F, T): 0.9, (F, F): 0.0})
])
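
As a sanity check on the numbers in this network, the sketch below answers one query, P(Rain | Sprinkler = true), by brute-force enumeration of the joint distribution in plain Python. It does not use the `probability` module; the constant and function names are hypothetical, and the conditional probability tables are copied from the definition above.

```python
from itertools import product

T, F = True, False

# P(variable = True | parents), copied from the network definition
P_CLOUDY = 0.5
P_SPRINKLER = {T: 0.1, F: 0.5}   # keyed by Cloudy
P_RAIN = {T: 0.8, F: 0.2}        # keyed by Cloudy
P_WETGRASS = {(T, T): 0.99, (T, F): 0.9, (F, T): 0.9, (F, F): 0.0}  # keyed by (Sprinkler, Rain)


def p(value, p_true):
    """Probability of a Boolean value, given the probability that it is True."""
    return p_true if value else 1 - p_true


def joint(c, s, r, w):
    """Joint probability of one full assignment, using the chain rule over the network."""
    return (p(c, P_CLOUDY) * p(s, P_SPRINKLER[c])
            * p(r, P_RAIN[c]) * p(w, P_WETGRASS[(s, r)]))


def rain_given_sprinkler(sprinkler=T):
    """Sum out Cloudy and WetGrass, then normalize over the two values of Rain."""
    totals = {T: 0.0, F: 0.0}
    for c, r, w in product((T, F), repeat=3):
        totals[r] += joint(c, sprinkler, r, w)
    norm = totals[T] + totals[F]
    return {r: totals[r] / norm for r in totals}


posterior = rain_given_sprinkler()
print(posterior)
```

The result is approximately 0.3 for rain and 0.7 for no rain. Assuming the `probability` module imported above follows the aima-python signature `enumeration_ask(X, e, bn)`, the call `enumeration_ask('Rain', dict(Sprinkler=T), weather)` should answer the same query.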