# Homework 2

## 2.1

The code below implements a spam filter based on Paul Graham's "A Plan for Spam". A small spam corpus and a
small non-spam (ham) corpus serve as test data. The code prints the spam-probability table for every unique word
in these corpora and reports whether each message in them is classified as spam.

This approach is Bayesian because it uses Bayes' rule to calculate the probability of an event (an email being
spam) given the evidence (the individual words it contains). The spam probabilities of the individual words are
combined to give the overall spam probability of the email. Each word pushes this probability up or down: words
seen mostly in ham reduce it, while words seen mostly in spam increase it.
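
As a rough illustration of how per-word probabilities are combined, the sketch below applies the combining formula from Graham's essay: the product of the word probabilities divided by that product plus the product of their complements. The function name `combine` and the example probabilities are hypothetical and chosen only for illustration.

```python
from math import prod


def combine(word_probs):
    """Combine per-word spam probabilities into one message-level probability."""
    spam_product = prod(word_probs)                 # product of p_i
    ham_product = prod(1 - p for p in word_probs)   # product of (1 - p_i)
    return spam_product / (spam_product + ham_product)


# Hypothetical message: one spam-leaning word and two ham-leaning words.
print(combine([0.99, 0.01, 0.01]))
# A neutral word (0.5) leaves the other word's probability unchanged.
print(combine([0.5, 0.99]))
```

Two ham-leaning words outweigh one spam-leaning word here, and a word with probability 0.5 contributes no evidence either way.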


In [2]:
# test messages for spam and not spam in order to determine spam probabilities
spam_corpus = [["I", "am", "spam", "spam", "I", "am"], ["I", "do", "not", "like", "that", "spamiam"]]
ham_corpus = [["do", "i", "like", "green", "eggs", "and", "ham"], ["i", "do"]]


# This function returns a dictionary with words as the keys and spam probabilities as the values
def create_probabilityHT(spam, ham):
    # initialize counters for spam and non-spam messages
    nbad = 0
    ngood = 0
    # initialize list to record all unique words used
    allwords = []
    # initialize a dictionary of bad words
    bad = {}
    for phrase in spam:
        nbad += 1
        # count how many times each word is used
        for word in phrase:
            # make everything lowercase for uniformity
            word = word.lower()
            # dict.get with a default of 0 covers the first occurrence
            bad[word] = bad.get(word, 0) + 1
            if word not in allwords:
                allwords.append(word)
    # initialize a dictionary of good words
    good = {}
    for phrase in ham:
        ngood += 1
        # count how many times each word is used
        for word in phrase:
            # make everything lowercase for uniformity
            word = word.lower()
            # dict.get with a default of 0 covers the first occurrence
            good[word] = good.get(word, 0) + 1
            if word not in allwords:
                allwords.append(word)
    # initialize probability dictionary for all unique words
    probabilities = {}
    for word in allwords:
        # weight good words twice as much as bad words to reduce false positives
        g = 2 * good.get(word, 0)
        b = bad.get(word, 0)
        # Graham only considers words that occur more than five times in total; these
        # corpora are tiny, so every word that occurs at least once is evaluated
        # occurrences per message of each kind, capped at 1
        bad_ratio = min(1.0, b / nbad)
        good_ratio = min(1.0, g / ngood)
        # spam probability of the word, clamped to the range [0.01, 0.99]
        probabilities[word] = max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))
    return probabilities

# Graham combines only the 15 most interesting words of a message; none of our test
# messages has 15 or more words, so every word is used.
# This function decides whether a message is spam by combining the spam probabilities of its words
def is_spam(message, probabilities):
    # keep track of words used so each one is counted once
    usedwords = set()
    # running products start at 1, the multiplicative identity
    prod = 1.0
    comp_prod = 1.0
    for word in message:
        # make everything lowercase for uniformity
        word = word.lower()
        if word not in usedwords:
            # multiply all the probabilities together
            prod *= probabilities[word]
            # multiply all the complements together
            comp_prod *= 1 - probabilities[word]
            usedwords.add(word)
    # calculate the combined probability and flag the message as spam above 0.9
    return prod / (prod + comp_prod) > 0.9


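# named word_probs rather than hash to avoid shadowing the built-in hash()
word_probs = create_probabilityHT(spam_corpus, ham_corpus)
test1 = is_spam(spam_corpus[0], word_probs)
test2 = is_spam(spam_corpus[1], word_probs)
test3 = is_spam(ham_corpus[0], word_probs)
test4 = is_spam(ham_corpus[1], word_probs)

print(word_probs)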
print("First message spam? ", test1)
print("Second message spam? ", test2)
print("Third message spam? ", test3)
print("Fourth message spam? ", test4)


{'i': 0.5, 'am': 0.99, 'spam': 0.99, 'do': 0.3333333333333333, 'not': 0.99, 'like': 0.3333333333333333, 'that': 0.99, 'spamiam': 0.99, 'green': 0.01, 'eggs': 0.01, 'and': 0.01, 'ham': 0.01}
First message spam?  True
Second message spam?  True
Third message spam?  False
Fourth message spam?  False
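
To see where the entries in the printed table come from, the sketch below recomputes a few of them from raw counts. The helper `word_spam_probability` is a hypothetical name introduced here; it takes the undoubled ham count and doubles it internally, which is an assumption about how to package the same arithmetic. The counts come from the two corpora above, each of which has 2 messages.

```python
def word_spam_probability(b, g, nbad, ngood):
    """Spam probability of one word.

    b: occurrences in spam, g: occurrences in ham (doubled below),
    nbad / ngood: number of spam / ham messages.
    """
    bad_ratio = min(1.0, b / nbad)
    good_ratio = min(1.0, 2 * g / ngood)
    # clamp so that no single word is treated as conclusive evidence
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


# "do": once in spam, twice in ham
print(word_spam_probability(b=1, g=2, nbad=2, ngood=2))
# "green": never in spam, once in ham, so the result is clamped up to 0.01
print(word_spam_probability(b=0, g=1, nbad=2, ngood=2))
# "spam": twice in spam, never in ham, so the result is clamped down to 0.99
print(word_spam_probability(b=2, g=0, nbad=2, ngood=2))
```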


## 2.2

This problem uses the domain from Figure 14.12 of the AIMA text. The full joint probability distribution for
this domain has 2^4 - 1 = 15 independent values if no conditional independence is assumed. This is because there
are 4 variables with 2 possible states each, giving 2^4 = 16 joint probabilities, and these must sum to 1, which
removes one independent value. The Bayesian network for this domain needs fewer independent values because it
encodes that Sprinkler and Rain each depend only on Cloudy, and that WetGrass depends only on Sprinkler and Rain.
Therefore, Cloudy has 1 independent value, Sprinkler and Rain each have 2 (one per value of Cloudy), and WetGrass
has 4 (one per combination of Sprinkler and Rain), for a total of 1 + 2 + 2 + 4 = 9 independent values, 6 fewer
than the 15 in the full joint probability distribution.
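
The counting argument above can be checked with a short sketch. It assumes every variable is Boolean, so a node with k parents needs 2^k independent values. The `parents` dictionary is a hypothetical encoding of the network structure written for this illustration.

```python
# Parents of each Boolean variable in the network structure described above.
parents = {
    "Cloudy": [],
    "Sprinkler": ["Cloudy"],
    "Rain": ["Cloudy"],
    "WetGrass": ["Sprinkler", "Rain"],
}

# Full joint: one probability per assignment, minus one for the sum-to-1 constraint.
full_joint = 2 ** len(parents) - 1

# Bayesian network: one independent value per combination of parent values.
network = sum(2 ** len(p) for p in parents.values())

print(full_joint, network)  # prints: 15 9
```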

The code below shows the implementation of the Bayesian network for this domain, along with different probability
calculations. These calculations were also done by hand and can be found on the sheet submitted.

In [1]:
from probability import BayesNet, enumeration_ask

# Utility variables
T, F = True, False

# From AIMA - Fig. 14.12
weather = BayesNet([
    ('Cloudy', '', 0.5),
    ('Sprinkler', 'Cloudy', {T: 0.1, F: 0.5}),
    ('Rain', 'Cloudy', {T: 0.80, F: 0.20}),
    ('WetGrass', 'Sprinkler Rain', {(T, T): 0.99, (T, F): 0.90, (F, T): 0.90, (F, F): 0.0})
    ])

# Compute P(Cloudy).
print(enumeration_ask('Cloudy', dict(), weather).show_approx())
# Compute P(Sprinkler | cloudy).
print(enumeration_ask('Sprinkler', dict(Cloudy=T), weather).show_approx())
# Compute P(Cloudy | the sprinkler is running and it's not raining).
print(enumeration_ask('Cloudy', dict(Sprinkler=T, Rain=F), weather).show_approx())
# Compute P(WetGrass | it's cloudy, the sprinkler is running and it's raining).
print(enumeration_ask('WetGrass', dict(Cloudy=T, Sprinkler=T, Rain=T), weather).show_approx())
# Compute P(Cloudy | grass is not wet).
print(enumeration_ask('Cloudy', dict(WetGrass=F), weather).show_approx())

False: 0.5, True: 0.5
False: 0.9, True: 0.1
False: 0.952, True: 0.0476
False: 0.01, True: 0.99
False: 0.639, True: 0.361
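
As a cross-check on the output above, the following sketch recomputes two of the posteriors by brute-force enumeration of the joint distribution, without using the `probability` module. The names `joint` and `posterior_cloudy` are hypothetical helpers written for this illustration; the conditional probability tables are copied from the network definition above.

```python
from itertools import product

P_C = 0.5                                  # P(Cloudy=T)
P_S = {True: 0.1, False: 0.5}              # P(Sprinkler=T | Cloudy)
P_R = {True: 0.8, False: 0.2}              # P(Rain=T | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}  # P(WetGrass=T | Sprinkler, Rain)


def p(value, prob_true):
    """Probability of a Boolean value given the probability that it is True."""
    return prob_true if value else 1 - prob_true


def joint(c, s, r, w):
    """Joint probability of one full assignment, using the chain rule of the network."""
    return p(c, P_C) * p(s, P_S[c]) * p(r, P_R[c]) * p(w, P_W[(s, r)])


def posterior_cloudy(**evidence):
    """P(Cloudy | evidence), summing the joint over all consistent assignments."""
    totals = {True: 0.0, False: 0.0}
    for c, s, r, w in product([True, False], repeat=4):
        world = {"Sprinkler": s, "Rain": r, "WetGrass": w}
        if all(world[name] == value for name, value in evidence.items()):
            totals[c] += joint(c, s, r, w)
    norm = totals[True] + totals[False]
    return {c: t / norm for c, t in totals.items()}


print(posterior_cloudy(Sprinkler=True, Rain=False))
print(posterior_cloudy(WetGrass=False))
```

Rounded to three significant figures, the `True` entries should agree with the 0.0476 and 0.361 printed by `enumeration_ask`.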
