# Homework 2

Spam filter and Bayesian network exercises for CS-344, Professor Keith Vander Linden, Calvin University.

Completed by Nathan Meyer, 3/10/2020.

## 2.1

Implementation of a spam filter based on the "A Plan for Spam" algorithm by Paul Graham.

### Implementation
First, each word in the corpora's messages is counted, ignoring case. A word's count starts at 1 the first time it is seen.

In [1]:
def hash_corpus(corpus):
    '''
    Reads a given corpus (list of message lists) and returns a hash table for
    the number of occurrences of each word
    '''
    hashed = {}

    for message in corpus:
        for word in message:
            entry = word.lower()    # ignore case
            if entry not in hashed:
                hashed[entry] = 1
            else:
                hashed[entry] += 1

    return hashed


spam_corpus = [["I", "am", "spam", "spam", "I", "am"],
               ["I", "do", "not", "like", "that", "spamiam"]]
ham_corpus = [["do", "i", "like", "green", "eggs", "and", "ham"], ["i", "do"]]

spam = hash_corpus(spam_corpus)
ham = hash_corpus(ham_corpus)

print("Spam hash table:\n\t" + str(spam))
print("Ham hash table:\n\t" + str(ham))

Spam hash table:
	{'i': 3, 'am': 2, 'spam': 2, 'do': 1, 'not': 1, 'like': 1, 'that': 1, 'spamiam': 1}
Ham hash table:
	{'do': 2, 'i': 2, 'like': 1, 'green': 1, 'eggs': 1, 'and': 1, 'ham': 1}
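
The same counting step can also be sketched with `collections.Counter` from the standard library. This is only an illustrative alternative, not the code used for the results in this notebook; the name `count_words` is hypothetical.

```python
from collections import Counter


def count_words(corpus):
    """Hypothetical alternative to hash_corpus(): case-insensitive word counts."""
    # Flatten the list of messages and lowercase each word before counting.
    return Counter(word.lower() for message in corpus for word in message)


spam_corpus = [["I", "am", "spam", "spam", "I", "am"],
               ["I", "do", "not", "like", "that", "spamiam"]]

spam_counts = count_words(spam_corpus)
print(dict(spam_counts))
```

A `Counter` returns 0 for words it has never seen, which would remove the need for the explicit membership checks when looking up counts.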


Next, build a hash table that maps each word to the probability that it indicates spam.

In [2]:
def hash_prob_table(good, bad, ngood, nbad):
    '''Creates a hash table of the probabilities that each word is spam'''
    probs = {}
    probs.update(good)
    probs.update(bad)

    for word in probs:
        probs[word] = word_spam_prob(word, good, bad, ngood, nbad)

    return probs

This calls word_spam_prob(), which, following Graham's algorithm, computes a "good" value (the word's ham count, doubled) and a "bad" value (its spam count) from the two hash tables, then clamps the resulting ratio to the range 0.01 to 0.99.

In [3]:
def word_spam_prob(word, good, bad, ngood, nbad):
    '''Uses Paul Graham's algorithm to determine how likely a word is spam'''
    if word in good:
        g = 2 * good[word]
    else:
        g = 0
    if word in bad:
        b = bad[word]
    else:
        b = 0

    if (g + b) >= 1:
        return max(0.01,
                   min(0.99,
                       min(1.0, b / nbad) / (min(1.0, g / ngood) +
                                                 min(1.0, b / nbad))))
    else:
        return 0
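
As a quick check of the arithmetic, the sketch below restates the same formula on raw counts and applies it to three words from the example corpora (2 spam messages, 2 ham messages). The helper name `graham_word_prob` and its parameters are hypothetical and exist only for this illustration.

```python
def graham_word_prob(good_count, bad_count, ngood, nbad):
    """Hypothetical restatement of word_spam_prob() that takes raw counts."""
    g = 2 * good_count  # ham occurrences are double-weighted
    b = bad_count
    if g + b < 1:
        return 0
    bad_ratio = min(1.0, b / nbad)
    good_ratio = min(1.0, g / ngood)
    # Clamp so that no single word is treated as absolute proof either way.
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


# 'do': twice in ham, once in spam -> 0.5 / (1.0 + 0.5)
p_do = graham_word_prob(good_count=2, bad_count=1, ngood=2, nbad=2)
# 'green': once in ham, never in spam -> raw ratio 0, clamped up to 0.01
p_green = graham_word_prob(good_count=1, bad_count=0, ngood=2, nbad=2)
# 'spam': never in ham, twice in spam -> raw ratio 1, clamped down to 0.99
p_spam = graham_word_prob(good_count=0, bad_count=2, ngood=2, nbad=2)

print(p_do, p_green, p_spam)
```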

Calling hash_prob_table() on the spam and ham hash tables, together with the number of messages in each corpus, produces these results:

In [5]:
nspam = len(spam_corpus)
nham = len(ham_corpus)

probabilities = hash_prob_table(ham, spam, nham, nspam)
print("Probability table:\n\t" + str(probabilities))

Probability table:
	{'do': 0.3333333333333333, 'i': 0.5, 'like': 0.3333333333333333, 'green': 0.01, 'eggs': 0.01, 'and': 0.01, 'ham': 0.01, 'am': 0.99, 'spam': 0.99, 'not': 0.99, 'that': 0.99, 'spamiam': 0.99}


Then we can determine the probability that each multi-word message is spam, using the second section of Paul Graham's algorithm.

In [6]:
def msg_spam_prob(message, probs):
    '''
    Determines the probability that an entire message is spam
    based on Paul Graham's algorithm
    '''
    product = 1
    complement = 1

    for word in message:
        # Words not in the table get 0.4, the value Graham assigns to unseen words
        word_prob = probs.get(word.lower(), 0.4)
        product *= word_prob
        complement *= (1 - word_prob)

    return product / (product + complement)

Running msg_spam_prob() on all of the messages in the corpora produces these results:

In [7]:
first_spam_msg = msg_spam_prob(spam_corpus[0], probabilities)
second_spam_msg = msg_spam_prob(spam_corpus[1], probabilities)
first_ham_msg = msg_spam_prob(ham_corpus[0], probabilities)
second_ham_msg = msg_spam_prob(ham_corpus[1], probabilities)

print("Messages and their probabilities of spam:")
print("\t" + str(spam_corpus[0]) + ": " + str(first_spam_msg))
print("\t" + str(spam_corpus[1]) + ": " + str(second_spam_msg))
print("\t" + str(ham_corpus[0]) + ": " + str(first_ham_msg))
print("\t" + str(ham_corpus[1]) + ": " + str(second_ham_msg))

Messages and their probabilities of spam:
	['I', 'am', 'spam', 'spam', 'I', 'am']: 0.9999999895897965
	['I', 'do', 'not', 'like', 'that', 'spamiam']: 0.999995877576386
	['do', 'i', 'like', 'green', 'eggs', 'and', 'ham']: 2.6025508824397714e-09
	['i', 'do']: 0.3333333333333333
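
To see where a value such as 0.333 for `['i', 'do']` comes from, the combination step can be reproduced directly from the per-word probabilities in the table above. This is a minimal sketch: the helper name `combine` is hypothetical, and `math.prod` assumes Python 3.8 or later.

```python
from math import prod


def combine(word_probs):
    """Hypothetical helper: combine per-word spam probabilities into one score."""
    product = prod(word_probs)                    # evidence for spam
    complement = prod(1 - p for p in word_probs)  # evidence against spam
    return product / (product + complement)


# 'i' -> 0.5 and 'do' -> 1/3, taken from the probability table
p_i_do = combine([0.5, 1 / 3])

# "I am spam spam I am": 'i' -> 0.5, 'am' -> 0.99, 'spam' -> 0.99
p_first_spam = combine([0.5, 0.99, 0.99, 0.99, 0.5, 0.99])

print(p_i_do, p_first_spam)
```

A word with probability 0.5 contributes the same factor to both products, so 'i' has no effect on either score.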


### What makes it Bayesian?

The filter computes the probability that a message is spam from only the words that actually appear in it. Whether 'do' tends to indicate spam is irrelevant to the "I am spam, spam I am" message, and to any other message that does not contain 'do', so that probability never enters the calculation. Each word's spam probability is estimated from its observed frequency in the two corpora, and those per-word probabilities are then combined into the probability of spam given the words in the message. In this way, a Bayesian network of sorts is formed for each message: the words present act as the evidence, and words missing from the message play no part.
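
The combination formula can also be read as Bayes' rule. This is a sketch under two assumptions that are not stated in the write-up itself: the words of a message are conditionally independent given its class, and spam and ham are equally likely a priori. Treating $p_i$ as an estimate of $P(\text{spam} \mid w_i)$ for the $i$-th word of the message, those assumptions give

$$
P(\text{spam} \mid w_1, \dots, w_n) = \frac{\prod_{i=1}^{n} p_i}{\prod_{i=1}^{n} p_i + \prod_{i=1}^{n} (1 - p_i)}
$$

which is the ratio msg_spam_prob() returns as product / (product + complement).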

## 2.2