# Notebook E-tivity 3 CE4021 Task 2

Student name: Peter O'Mahony

Student ID: 8361967

<hr style="border:2px solid gray"> </hr>

## Problem Description
Create a Naive Bayes SPAM filter. Test your filter using the messages in new_emails. You may add as many cells as you require to complete the task.


## Initial Thoughts
There will be an initial training task and then an analysis of new emails. I guess we create one dictionary of the words in the spam emails and another of the words in the ham emails. Does the context or position matter? Is our software supposed to understand the difference between 'our' and 'your' in terms of ownership, or are we simply looking at word frequency?

## Later Thoughts
Is this a question of simply taking each word in an email and summing the frequency of that word in spam (and using the negative amount) and ham (using the positive amount) and if the result is not less than zero, we conclude that it is ham? Seems too basic and not using enough theory.

Is there a clear definition of what a Naive Bayes Spam Filter actually is? Does it just score a block of text based on proportion of spamwords versus hamwords?  I'll sleep on it after a Friday pint.

...OK I had the pint and now I'm thinking that I need a function to calculate the probability of a word indicating that the message is spam. If that works, then I should do it for every word in the email and multiply those probabilities together to score the whole email.

I'm very curious to see how much better that algorithm will be than the intuitive one I started with.

## Probability
We are exploring the application of Bayes' Rule, so let's be clear on what that is. It looks like this:

$$ P(H|E)^\color{green}{\text{"posterior"}} = P(H)^\color{green}{\text{"prior"}}  \frac{P(E|H)^\color{green}{\text{"likelihood"}} }{P(E)^\color{green}{\text{"marginal"}}} $$

where H represents the Hypothesis and E is the Evidence.  The real challenge (for me at least) is that of re-stating a task correctly in these terms.

In English we can articulate this rule as giving us the probability of the hypothesis being true, given that the evidence is present. This is also called the Posterior, and it is impolite to look further into this.

Can we say H is "this email is spam" and E is "this word appears in the email"? Let's try...

Writing S for "the email is spam" and W for "the word appears in the email", we'll rewrite it as:
$$ P(S|W) = P(S)  \frac{P(W|S)}{P(W)} $$

Looking at "Notes on Naive Bayes Classifiers for Spam Filtering" by Jonathan Lee, he expands this more usefully as:
$$ P(S|W) = P(S)  \frac{P(W|S)}{P(W|S)P(S) + P(W|\lnot S)P(\lnot S)} $$


In our case the Posterior is the probability that an email is spam given that it contains the word.

The Prior is the probability of the Hypothesis being true before we look at any evidence, that is, of an email being spam.

The Likelihood is the probability of seeing the Evidence if the Hypothesis is true, that is, of the word appearing in an email that is spam.

The Marginal is the probability of the word appearing in any email. It has been expanded from P(W) into useful components that say "the probability of the word appearing in a spam message times the probability of the email being spam, plus the probability of the word appearing in a ham message times the probability of the email being ham".
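
The expanded formula translates fairly directly into code. The sketch below is only an illustration: the function name `posterior_spam_given_word` and the example probabilities are made up for this purpose, and `Fraction` is used so the arithmetic stays exact.

```python
from fractions import Fraction


def posterior_spam_given_word(p_spam, p_word_given_spam, p_word_given_ham):
    """Return P(S|W) using the expanded marginal, or None if it is undefined."""
    p_ham = 1 - p_spam
    # Marginal: P(W|S)P(S) + P(W|not S)P(not S)
    marginal = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    if marginal == 0:
        return None  # the word was never seen, so the posterior is undefined
    return p_spam * p_word_given_spam / marginal


# Hypothetical numbers: half of all email is spam, and the word appears
# in 3/4 of spam emails but only 1/4 of ham emails.
print(posterior_spam_given_word(Fraction(1, 2), Fraction(3, 4), Fraction(1, 4)))  # 3/4
```

Returning `None` for a zero marginal is just one way to flag a word that appears in no training email.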

Let's work it out by hand for the message 'benefits of our account'.  After some visual analysis, we can make this reference table:

| |Ham|Spam|Total|
|-|---|----|------|
| all messages | 3 | 4 | 7 |
| msgs with 'benefits' | 1 | 0 | 1|
| msgs with 'of' | 0 | 0 | 0|
| msgs with 'our' | 0 | 1 | 1|
| msgs with 'account' | 0 | 1 | 1 |
--------------
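
Counting by eye is easy to get wrong, so the message counts can also be checked in code. This is a small sketch that assumes the `previous_spam` and `previous_ham` lists from the code cells further down; the helper name `messages_containing` is hypothetical.

```python
previous_spam = ['send us your password', 'review our website',
                 'send your password', 'send us your account']
previous_ham = ['Your activity report', 'benefits physical activity',
                'the importance vows']


def messages_containing(word, messages):
    """Count the messages that contain the word at least once."""
    return sum(word in message.lower().split() for message in messages)


# One row per word of the message being analysed
for word in 'benefits of our account'.split():
    ham = messages_containing(word, previous_ham)
    spam = messages_containing(word, previous_spam)
    print(f"{word:10} ham={ham} spam={spam} total={ham + spam}")
```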

P(H), the Prior, is the probability that an email is spam and we calculate this as 4/7 because 4 out of our 7 emails are spam.

P(E|H), the Likelihood, is the probability that the word appears in an email given that the email is spam, and we calculate this for 'benefits' as 0 because it never appears in any spam email.

P(E), the Marginal, is the probability of the word occurring in an email, and for 'benefits' that is 1/7. _(But how do we handle words that don't appear in any email? That would give us a divide by zero error!)_

so we get
$$ P(H|E) = \frac {4}{7} * \frac {0}{\frac {1}{7}} = 0 $$


and this tells us that there is 0% chance that this word indicates the email is spam.

I say this with 60% certainty so I need to recheck this later.

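For the next word, 'of', which appears in no training email, both the likelihood and the marginal are zero:
$$ P(H|E) = \frac {4}{7} * \frac {0}{0} $$
_How can this be? Dividing by zero leaves the result undefined rather than zero._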

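For the next word, 'our', which appears in 1 of the 4 spam emails and in 1 of the 7 emails overall, we get:
$$ P(H|E) = \frac {4}{7} * \frac {\frac {1}{4}}{\frac {1}{7}} = 1 $$

For the last word, 'account', which appears only in the spam email 'send us your account' (again 1 of the 4 spam emails and 1 of the 7 emails overall), we get:
$$ P(H|E) = \frac {4}{7} * \frac {\frac {1}{4}}{\frac {1}{7}} = 1 $$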

Intuitively, I think that if we multiply P(H|E) for each word in an email then we get the probability that the email is spam, but that makes no sense if we can get a zero probability for any word. We can't add the probabilities together either, because that could result in a number greater than 1 (using my current calculations at least).

Clearly more work is needed here. My maths are wrong right now. No pint for me tonight.

Ah, I have now discovered Laplace Smoothing and that might address my zero denominator concern. More to be done here after I absorb the mandatory reading.
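
To see how Laplace smoothing removes the zeros, here is a minimal sketch of one common formulation. It is an illustration under stated assumptions, not the assignment's reference solution: it counts word occurrences per class (the multinomial variant), gives words never seen in training a count of zero, uses add-one smoothing (`alpha=1`), and sums logarithms instead of multiplying probabilities. The function names `train`, `log_score` and `classify` are hypothetical.

```python
import math
from collections import Counter

previous_spam = ['send us your password', 'review our website',
                 'send your password', 'send us your account']
previous_ham = ['Your activity report', 'benefits physical activity',
                'the importance vows']


def train(messages):
    """Return the word counts and the total number of words for one class."""
    counts = Counter(word for m in messages for word in m.lower().split())
    return counts, sum(counts.values())


def log_score(phrase, counts, total_words, prior, vocab_size, alpha=1):
    """log P(class) plus the sum of log P(word|class), with add-alpha smoothing."""
    score = math.log(prior)
    for word in phrase.lower().split():
        # Smoothing keeps every probability above zero, even for unseen words
        score += math.log((counts[word] + alpha) / (total_words + alpha * vocab_size))
    return score


def classify(phrase):
    """Label a phrase 'spam' or 'ham' by comparing the two smoothed log scores."""
    spam_counts, spam_total = train(previous_spam)
    ham_counts, ham_total = train(previous_ham)
    vocab_size = len(set(spam_counts) | set(ham_counts))
    n_messages = len(previous_spam) + len(previous_ham)
    spam = log_score(phrase, spam_counts, spam_total,
                     len(previous_spam) / n_messages, vocab_size)
    ham = log_score(phrase, ham_counts, ham_total,
                    len(previous_ham) / n_messages, vocab_size)
    return 'spam' if spam > ham else 'ham'


print(classify('renew your password'))
```

With smoothing, a word such as 'of' no longer causes a division by zero, and a word missing from one class only lowers that class's score instead of forcing it to zero.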

## Note on Initial Implementation
I wrote most of the code below before learning about Bayes Rule.  It is based on intuition and I thought that I would find it closely aligned with Bayes but it appears not.  The code tries to identify spam simply based on the spamminess and hamminess of each word in the email. I am leaving it here because I am curious to compare it to the Bayesian results.  If I have time, I will find a much larger dataset and apply both algorithms and compare the results.

<hr style="border:2px solid gray"> </hr>

In [201]:
previous_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
previous_ham = ['Your activity report','benefits physical activity', 'the importance vows']
new_emails = {'spam':['renew your password', 'renew your vows'], 'ham':['benefits of our account', 'the importance of physical activity']}

In [202]:
def build_sorted_dict(phrases: list[str]) -> dict:
    """
    Count the frequency of each word in each phrase on the list and return a dictionary of
    words sorted by descending frequency.
    """
    ordered_words = {}
    for phrase in phrases:
        # assume the phrases have no punctuation and words are delimited by spaces only. Force lowercase for consistency.
        words = phrase.lower().split(' ')  
        for word in words:
            if (word in ordered_words):    # if we know the word already
                ordered_words[word] += 1
            else:                          # otherwise this is a new word
                ordered_words[word] = 1

    # sort the words in descending order of frequency
    return (dict(sorted(ordered_words.items(), key=lambda item: item[1], reverse=True)))
    

In [203]:
def get_word_freq(word: str, built_dict: dict) -> int:
    """
    Return the frequency of a word from the dictionary or zero if not found.
    """
    return built_dict[word] if (word in built_dict) else 0

In [204]:
def test_phrase(phrase: str, built_dict: dict) -> int:
    """
    Given a string of words, look up each word in the dictionary.  If found, sum the frequency value
    from the dictionary of that word and return the total value/score for all words in the string.
    """
    words = phrase.lower().split(' ')  # split into words and convert to lowercase for consistency
    freq = 0
    for word in words:
        freq += get_word_freq(word, built_dict)  # returns 0 for words not in the dictionary
    return freq

In [205]:
def superscript(power: int) -> str:
    """
    Return the number as a string of Unicode superscript characters (for example 2 becomes ²).
    I knew it would come in handy again.
    """
    # created a translation table using https://symbl.cc/en/unicode/blocks/latin-1-supplement/
    translate_table = str.maketrans("0123456789-.", 
                                    "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079\u207B\u00B7")
    return str(power).translate(translate_table)

In [206]:
def colour_phrase(phrase: str, ham_dict: dict, spam_dict: dict) -> str:
    """
    This returns a colour encoded string highlighting the spam and ham 
    words with an indicator of the frequency of that word in the relevant dictionary.
    More of my messing with python.  It is not very efficient but, hey, I had fun writing it.
    """
    red     = '\x1b[1;31m'
    green   = '\x1b[1;32m'
    off     = '\x1b[0m'
    spam    = red
    ham     = green
    neutral = ''
    
    pretty_phrase = ''
    for word in phrase.split(' '):
        # look up the lowercase form so this matches build_sorted_dict and test_phrase
        ham_score  = get_word_freq(word.lower(), ham_dict)
        spam_score = get_word_freq(word.lower(), spam_dict)
        if ham_score == spam_score:    # unknown word, or equally common in both dictionaries
            colour = neutral
        elif ham_score > spam_score:
            colour = ham
            word += superscript(ham_score-spam_score)
        else:
            colour = spam
            word += superscript(spam_score-ham_score)
        pretty_phrase += f"{colour}{word}{off} "
    return pretty_phrase

In [207]:
def spam_or_ham(phrase: str, ham_dict: dict, spam_dict: dict) -> str:
    """
    Given a string of words, calculate the ham score and subtract the spam score and 
    return a string explaining the result.
    """
    score = test_phrase(phrase,ham_dict) - test_phrase(phrase,spam_dict)
    result_text = f'Score is {score:3} so '
    result_text += 'Spam' if score < 0 else 'Ham '
    return result_text

In [208]:
def print_pretty_result(phrase: str, ham_dict: dict, spam_dict: dict) -> None:
    """
    Display the result of our analysis, the score and a colour coded version of the email 
    indicating the words and weights that influenced the analysis.
    """
    print(f'{spam_or_ham(phrase, ham_dict, spam_dict)}: {colour_phrase(phrase, ham_dict, spam_dict)}')

In [209]:
def process_new_emails(new_emails: dict[str, list[str]], ham_dict: dict, spam_dict: dict) -> None:
    """
    Given a dictionary keyed by an expected result (spam or ham) with a collection of associated emails,
    test each email with the spam_or_ham function to determine its quality.
    """
    for expected_result, emails in new_emails.items():
        print(f'\n---- EXPECTING {expected_result}')
        for email in emails:
            print_pretty_result(email, ham_dict, spam_dict)

In [210]:
# Train/Build the Spam dictionary
spam_dict = build_sorted_dict(previous_spam)
print(f'SPAM WORDS ({len(spam_dict)}):',spam_dict)
# Train/Build the Ham dictionary
ham_dict = build_sorted_dict(previous_ham)
print(f'MEATY HAM WORDS ({len(ham_dict)}):',ham_dict)
# Test all new emails and praise or condemn them
print('\n==================== REQUIRED TESTS')
process_new_emails(new_emails, ham_dict, spam_dict)

print('\n==================== OTHER TESTS')
my_emails = {
    "spam":[
        "get your drugs here",
        "buy some viagra from this super web site"
        ],
    "ham":[
        "send us the physical benefits of important account activity"
        ]
    }
process_new_emails(my_emails, ham_dict, spam_dict)


SPAM WORDS (8): {'send': 3, 'your': 3, 'us': 2, 'password': 2, 'review': 1, 'our': 1, 'website': 1, 'account': 1}
MEATY HAM WORDS (8): {'activity': 2, 'your': 1, 'report': 1, 'benefits': 1, 'physical': 1, 'the': 1, 'importance': 1, 'vows': 1}


---- EXPECTING spam
Score is  -4 so Spam: renew[0m [1;31myour²[0m [1;31mpassword²[0m 
Score is  -1 so Spam: renew[0m [1;31myour²[0m [1;32mvows¹[0m 

---- EXPECTING ham
Score is  -1 so Spam: [1;32mbenefits¹[0m of[0m [1;31mour¹[0m [1;31maccount¹[0m 
Score is   5 so Ham : [1;32mthe¹[0m [1;32mimportance¹[0m of[0m [1;32mphysical¹[0m [1;32mactivity²[0m 


---- EXPECTING spam
Score is  -2 so Spam: get[0m [1;31myour²[0m drugs[0m here[0m 
Score is   0 so Ham : buy[0m some[0m viagra[0m from[0m this[0m super[0m web[0m site[0m 

---- EXPECTING ham
Score is  -1 so Spam: [1;31msend³[0m [1;31mus²[0m [1;32mthe¹[0m [1;32mphysical¹[0m [1;32mbenefits¹[0m of[0m important[0m [1;31maccount¹[0m [1;32mactivity²[0m 

### Thoughts
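In the required tests we correctly identified both spam messages as spam (a 100% true positive rate) but incorrectly classified one of the two ham messages as spam (a 50% false positive rate). In my own tests, 'buy some viagra from this super web site' scored 0 and was classified as ham because none of its words appear in either dictionary, and the ham message was classified as spam.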
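
The two rates for the required tests can be computed rather than counted by eye. This is a small sketch: the helper name `rates` is hypothetical, the (expected, predicted) pairs are copied from the required test output, and it assumes at least one spam and one ham message so that neither division is by zero.

```python
def rates(results):
    """Return (true positive rate, false positive rate), treating 'spam' as positive."""
    true_pos = sum(e == 'spam' and p == 'spam' for e, p in results)
    false_pos = sum(e == 'ham' and p == 'spam' for e, p in results)
    n_spam = sum(e == 'spam' for e, _ in results)
    n_ham = len(results) - n_spam
    return true_pos / n_spam, false_pos / n_ham


# (expected, predicted) for the four required test emails
required = [('spam', 'spam'), ('spam', 'spam'), ('ham', 'spam'), ('ham', 'ham')]
print(rates(required))  # (1.0, 0.5)
```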

I guess we start to apply the probability and Naive Bayes story here. To be done...

<hr style="border:2px solid gray"> </hr>

## Reflection

To be done.