# Notebook E-tivity 3 CE4021 Task 2

**Student name:** Jason Coleman

**Student ID:** 9539719

## Imports

In [1]:
#None

If you believe required imports are missing, please contact your moderator.

<hr style="border:2px solid gray"> </hr>

## Task 2

Use the below information to create a Naive Bayes SPAM filter. Test your filter using the messages in new_emails. You may add as many cells as you require to complete the task.

In [2]:
previous_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
previous_ham = ['Your activity report','benefits physical activity', 'the importance vows']
new_emails = {'spam':['renew your password', 'renew your vows'], 'ham':['benefits of our account', 'the importance of physical activity']}

<hr style="border:2px solid gray"> </hr>

## Introduction

This will be a very naive Naive Bayes classifier. It will follow a so-called `Bag-of-words` approach where I will classify a message as `SPAM` or `HAM` based on the presence of the words in the string. Notably, the bag of words does not consider the context in which a word is used (i.e. you can reorder the words into complete nonsense and get the same result).

## Background

### Bayes' Theorem
Let us start with the formula for Bayes' Theorem.

$$
P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}
$$

* where $P(A \mid B)$ is the probability of event A happening given that event B has happened.
* and $P(B \mid A)$ is the probability of event B given A.
* and $P(A)$ and $P(B)$ are the probabilities of events A and B, respectively.

In the context of our spam filter:


*  where $A$ could be the event "The email is spam".
*  and $B$ could be the event "The email contains the word 'password'".
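As a quick illustration, Bayes' theorem can be evaluated directly on the training messages from Task 2. The sketch below counts at the message level (a word either appears in a message or it does not); the variable names are introduced here for illustration and are not part of the later implementation.

```python
# Training messages copied from the Task 2 cell.
previous_spam = ['send us your password', 'review our website',
                 'send your password', 'send us your account']
previous_ham = ['Your activity report', 'benefits physical activity',
                'the importance vows']

word = 'password'
all_msgs = previous_spam + previous_ham

# P(A): fraction of training messages that are spam.
p_spam = len(previous_spam) / len(all_msgs)
# P(B | A): fraction of spam messages that contain the word.
p_word_given_spam = sum(word in m.split() for m in previous_spam) / len(previous_spam)
# P(B): fraction of all messages that contain the word.
p_word = sum(word in m.split() for m in all_msgs) / len(all_msgs)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | '{word}') = {p_spam_given_word:.2f}")
```

In this tiny training set 'password' only ever appears in spam, so the result is 1.0. That kind of overconfidence is one reason smoothing is applied later on.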

### Naive Bayes Classifier (NB)

NB is a classification technique based on Bayes' theorem, with the "naive" assumption that the features are conditionally independent of each other given the class. Even though it is called naive, it can perform surprisingly well. I worked as an analyst in the AV industry a few years ago and, in the early days, when we had less data (and fewer features), NB worked really well.

Reframing this a little, in the context of SPAM/HAM.

* Where $S$ represents the event that an email is spam.
* and $W$ represents the event that a specific word appears in the email.

We can write:

$$
P(S \mid W) = \frac{P(W \mid S) \times P(S)}{P(W)}
$$


*  Where $P(S \mid W)$ is the probability that an email is spam given that it contains the word $W$.
*  and $P(W \mid S)$ is the probability that the word $W$ appears in a spam email.
*  and $P(S)$ is the overall probability that any email is spam. This is the `prior`.
*  and $P(W)$ is the probability that the word $W$ appears in any email, regardless of it being spam or not. This is the `evidence` (or marginal probability), not a prior.

#### Working on messages with more than one word

Emails are rarely composed of just one word, so our implementation will need to account for all of the words in the message. So, for a given email message, $m$, consisting of words $w_1$, $w_2$, $\ldots$, $w_n$, we need to modify the method to account for multiple words.

0. We will need to calculate the likelihood of each word under both SPAM and HAM. That is: $P(w_i | S)$ and $P(w_i | H)$ for each word $w_i$ in the training data.

Then, for a given message $m$:

1. Calculate the likelihood of the message given that it is Spam, based on word frequency:

$$
P(m | S) = P(w_1 | S) \times P(w_2 | S) \times \ldots \times P(w_n | S)
$$

2. Calculate the likelihood of the message given that it is Ham, based on word frequency:

$$
P(m | H) = P(w_1 | H) \times P(w_2 | H) \times \ldots \times P(w_n | H)
$$

Then, finally, calculate the posterior probabilities for being Ham and Spam.

$$
P(S | m) = \frac{P(m | S) \times P(S)}{P(m)}
$$

and

$$
P(H | m) = \frac{P(m | H) \times P(H)}{P(m)}
$$

* Where $P(S)$ and $P(H)$ are the prior probabilities of a message being spam and ham, respectively (we know this from the initial training set).
* and $P(m)$ is the total probability of observing message $m$. 

Note, I don't need to calculate $P(m)$ if I am only comparing $P(S | m)$ and $P(H | m)$, because it is the same denominator in both. So the method I will implement actually computes unnormalised scores that are proportional to the posteriors:

$$
P(S | m) \propto P(w_1 | S) \times P(w_2 | S) \times \ldots \times P(w_n | S) \times P(S)
$$

and

$$
P(H | m) \propto P(w_1 | H) \times P(w_2 | H) \times \ldots \times P(w_n | H)\times P(H)
$$

#### Classification
Then, given a new message, if $P(S | m) > P(H | m)$, I will classify the message as SPAM, else I will classify the message as Ham. 

By doing this, I am side-stepping the need for a threshold and just comparing the relative posterior probabilities for the message being HAM and SPAM.
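To illustrate why $P(m)$ can be dropped, here is a minimal sketch with two made-up scores (the numbers are hypothetical, not taken from the data). With only two classes, $P(m)$ is just the sum of the two unnormalised scores, and dividing by it rescales both values without changing which one is larger.

```python
# Hypothetical unnormalised scores: prior * product of word likelihoods.
spam_score = 0.003
ham_score = 0.001

# With two classes, the evidence P(m) is the sum of the two scores.
evidence = spam_score + ham_score
p_spam_given_m = spam_score / evidence
p_ham_given_m = ham_score / evidence

# The decision is the same before and after normalising.
label_from_scores = 'spam' if spam_score > ham_score else 'ham'
label_from_posteriors = 'spam' if p_spam_given_m > p_ham_given_m else 'ham'

print(label_from_scores, label_from_posteriors)
print(f"P(S | m) = {p_spam_given_m:.2f}, P(H | m) = {p_ham_given_m:.2f}")
```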

## Python Implementation
We will need to preprocess the data a little; to get it into the right shape. We will start by getting the components of the Bayes formula: the `priors` and the `likelihoods` for the words within the messages. 

**Tokenise:** break each message into words. Each word is called a `token`. It's the fundamental unit of classification in a Bag of Words classifier.

We will need to calculate Priors and likelihoods. To do this we first need to construct a vocabulary. 

**Notes**:

* I will reference the steps in the reference implementation
* My implementation will use list and dictionary comprehensions to make the code more concise.
* I will print intermediate output to aid understanding and debugging.

### Build a vocabulary
The vocabulary is the set of unique words that appear anywhere in the training messages (both SPAM and HAM).

1. Create a set called *vocabulary*.
2. For each email in the SPAM and HAM training messages, split the words
3. Add all words to the set.

We now have a vocabulary.

In [3]:
vocabulary = set()

for email in previous_spam + previous_ham:
    words = email.split()
    for word in words:
        vocabulary.add(word)

print('Vocabulary:', end=' ')
for word in vocabulary:
    print(word, end=', ')


Vocabulary: report, benefits, password, importance, send, account, the, your, vows, us, activity, physical, review, website, our, Your, 
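The same vocabulary can also be built with a set comprehension, in line with the note above about preferring comprehensions. This is an equivalent sketch rather than a change to the cell above; `vocabulary_alt` is a name introduced here for illustration.

```python
# Training messages copied from the Task 2 cell.
previous_spam = ['send us your password', 'review our website',
                 'send your password', 'send us your account']
previous_ham = ['Your activity report', 'benefits physical activity',
                'the importance vows']

# One pass over every training message, one pass over each message's words.
vocabulary_alt = {word
                  for email in previous_spam + previous_ham
                  for word in email.split()}

print(len(vocabulary_alt))
print(sorted(vocabulary_alt))
```

Note that `Your` and `your` are kept as separate entries because no case folding is applied.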

## Calculate the Priors

The prior probability of an email being spam or ham is estimated from the training data. In this implementation it is calculated from the proportion of spam or ham *words* in the training set (rather than the proportion of messages).

$$
\texttt{prior\_spam} = \frac{\texttt{num\_spam\_words}}{\texttt{total\_words}}
$$

and

$$
\texttt{prior\_ham} = \frac{\texttt{num\_ham\_words}}{\texttt{total\_words}}
$$

In [4]:
#get words from previous_spam
spam_words = []
for email in previous_spam:
    words = email.split()
    for word in words:
        spam_words.append(word)
num_spam_words = len(spam_words)

print (f"There are {len(spam_words)} spam words in the training set.")
        
# get words from previous_ham
ham_words = []
for email in previous_ham:
    words = email.split()
    for word in words:
        ham_words.append(word)
num_ham_words = len(ham_words)

print (f"There are {len(ham_words)} ham words in the training set.")

total_words = len(spam_words) + len(ham_words) # total number of words in all training emails
print( f"Total words in the training set: {total_words}")

prior_spam = num_spam_words / (num_spam_words + num_ham_words)
prior_ham = 1 - prior_spam

print(f"Prior Spam: {prior_spam}")
print(f"Prior Ham: {prior_ham}")

There are 14 spam words in the training set.
There are 9 ham words in the training set.
Total words in the training set: 23
Prior Spam: 0.6086956521739131
Prior Ham: 0.3913043478260869
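The cell above estimates the priors from word counts. A common alternative estimates them from message counts instead. The sketch below shows that variant for comparison only; the `_by_msg` names are introduced here, and the rest of this notebook keeps the word-based priors.

```python
# Training messages copied from the Task 2 cell.
previous_spam = ['send us your password', 'review our website',
                 'send your password', 'send us your account']
previous_ham = ['Your activity report', 'benefits physical activity',
                'the importance vows']

num_spam_msgs = len(previous_spam)  # 4 spam messages
num_ham_msgs = len(previous_ham)    # 3 ham messages

# Priors as the share of messages (not words) in each class.
prior_spam_by_msg = num_spam_msgs / (num_spam_msgs + num_ham_msgs)
prior_ham_by_msg = 1 - prior_spam_by_msg

print(f"Prior Spam (by message): {prior_spam_by_msg:.4f}")
print(f"Prior Ham (by message): {prior_ham_by_msg:.4f}")
```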


## Calculate the Conditional Probabilities (aka Likelihoods)
Here, for each word in the vocabulary, the probability of the word given that an email is spam or ham is calculated. We do this by:

1. Counting the occurrences of each word in both the spam and ham emails.
2. Apply Laplace Smoothing to avoid zero probabilities. If we don't add this, we will eventually end up multiplying by zero (if a word has never been seen in a class, its conditional probability will be zero).
3. Calculate the conditional probabilities (likelihoods) for each word being both SPAM and HAM.

In [5]:
# Create dictionaries from the vocabulary to store word counts, initialise the count to zero
spam_word_counts = {word: 0 for word in vocabulary}
ham_word_counts = {word: 0 for word in vocabulary}
print(f"Vocabulary: \n{vocabulary}\n")

# Determine frequencies of words in spam and ham emails (I consider repeated words here)
for email in previous_spam:
    words = email.split()
    for word in words:
        spam_word_counts[word] += 1

for email in previous_ham:
    words = email.split()
    for word in words:
        ham_word_counts[word] += 1

print("Spam/HAM dictionaries are as follows:\n")

# print the dictionaries for spam ham in one table
print('\tWords\t\tSpam Count \tHam Count')
print('\t'+'-' * 40)
for word in vocabulary:
    print(f'\t{word:10s}\t{spam_word_counts[word]:2d}\t\t{ham_word_counts[word]:2d}')

print()

#  Preparing the (smoothed) likelihoods
alpha = 1 # Laplace smoothing (in this case 'add-one' smoothing) parameter

# Compute the likelihoods for SPAM/HAM.
spam_likelihoods = {word: (count + alpha) / num_spam_words for word, count in spam_word_counts.items()}
ham_likelihoods = {word: (count + alpha) / num_ham_words for word, count in ham_word_counts.items()}

print("The condtional probabilities (likelihoods) for spam and ham are as follows (.md format):\n")

# print a table of spam_probs and ham_probs
print('|Words\t\t|P(w\|S)\t|P(w\|H)|')
print('|------------|--------------|-------------|')
for word in vocabulary:
    print(f'|{word:10s}\t|{spam_likelihoods[word]:.4f}|\t\t{ham_likelihoods[word]:.4f}|')


Vocabulary: 
{'report', 'benefits', 'password', 'importance', 'send', 'account', 'the', 'your', 'vows', 'us', 'activity', 'physical', 'review', 'website', 'our', 'Your'}

Spam/HAM dictionaries are as follows:

	Words		Spam Count 	Ham Count
	----------------------------------------
	report    	 0		 1
	benefits  	 0		 1
	password  	 2		 0
	importance	 0		 1
	send      	 3		 0
	account   	 1		 0
	the       	 0		 1
	your      	 3		 0
	vows      	 0		 1
	us        	 2		 0
	activity  	 0		 2
	physical  	 0		 1
	review    	 1		 0
	website   	 1		 0
	our       	 1		 0
	Your      	 0		 1

The condtional probabilities (likelihoods) for spam and ham are as follows (.md format):

|Words		|P(w\|S)	|P(w\|H)|
|------------|--------------|-------------|
|report    	|0.0714|		0.2222|
|benefits  	|0.0714|		0.2222|
|password  	|0.2143|		0.1111|
|importance	|0.0714|		0.2222|
|send      	|0.2857|		0.1111|
|account   	|0.1429|		0.1111|
|the       	|0.0714|		0.2222|
|your      	|0.2857|		0.1111|
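|vows      	|0.0714|		0.2222|
|us        	|0.2143|		0.1111|
|activity  	|0.0714|		0.3333|
|physical  	|0.0714|		0.2222|
|review    	|0.1429|		0.1111|
|website   	|0.1429|		0.1111|
|our       	|0.1429|		0.1111|
|Your      	|0.0714|		0.2222|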
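The likelihoods above add `alpha` to each count but divide by the unsmoothed class total, so they do not sum to 1 over the vocabulary. The textbook add-alpha estimate also adds `alpha * len(vocabulary)` to the denominator. The sketch below shows that variant for the spam class only. It is a comparison under that assumption; `spam_likelihoods_std` is a name introduced here, and the results elsewhere in this notebook were produced without this change.

```python
# Training messages copied from the Task 2 cell.
previous_spam = ['send us your password', 'review our website',
                 'send your password', 'send us your account']
previous_ham = ['Your activity report', 'benefits physical activity',
                'the importance vows']

alpha = 1
vocabulary = {w for email in previous_spam + previous_ham for w in email.split()}
spam_tokens = [w for email in previous_spam for w in email.split()]

# Smoothed denominator: class word total plus alpha for every vocabulary entry.
denominator = len(spam_tokens) + alpha * len(vocabulary)  # 14 + 16

spam_likelihoods_std = {w: (spam_tokens.count(w) + alpha) / denominator
                        for w in vocabulary}

print(f"P(send | S) = {spam_likelihoods_std['send']:.4f}")
print(f"Sum over vocabulary = {sum(spam_likelihoods_std.values()):.4f}")
```

Words outside the vocabulary (such as "renew" in the test set) would still need a fallback value at prediction time, for example `alpha / denominator`.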

## Test the Classifier
We can now test the classifier on the test set. We use the previously calculated priors and likelihoods to calculate the posterior probabilities for both SPAM and HAM, which allows us to classify new emails as spam or ham. The classification is done by comparing the posterior probability for HAM to that of SPAM: the higher one is the output label.

We can assess the accuracy of the classifier against data it has not seen before. We do this with a labelled test set.

In [6]:
num_correct = 0
num_total = 0

# for all labelled messages in the test set, grouped by label
for expected_label, msgs in new_emails.items():
    for msg in msgs:

        print(f'\n{expected_label.upper()} msg: "{msg}"')

        words = msg.split()

        # initialise with the priors
        spam_posterior_prob = prior_spam
        ham_posterior_prob = prior_ham

      
        # Calculate the posterior probabilities based on the multiplied likelihoods
        for word in words:
            if word in spam_likelihoods:
                spam_posterior_prob *= spam_likelihoods[word]
            else:
                spam_posterior_prob *= alpha / num_spam_words 
                
            if word in ham_likelihoods:
                ham_posterior_prob *= ham_likelihoods[word]
            else:
                ham_posterior_prob *= alpha / num_ham_words

        print(f'\tSpam Posterior Probability: {spam_posterior_prob}')
        print(f'\tHam Posterior Probability: {ham_posterior_prob}')
                
        # I think it is better to compare the relative probabilities instead of
        # picking a threshold. You always get into a trade-off between false
        # positives and false negatives.        
        if spam_posterior_prob > ham_posterior_prob:
            predicted_label = 'spam'
        else:
            predicted_label = 'ham'

        status = 'WRONG!' if predicted_label != expected_label else 'CORRECT!'
        print(f'{msg:30s} {predicted_label:5s} {status}')
        
        if predicted_label == expected_label:
            num_correct += 1

        num_total += 1

accuracy = num_correct / num_total
print(f'\nAccuracy: {accuracy:.2f}')


SPAM msg: "renew your password"
	Spam Posterior Probability: 0.0026619343389529724
	Ham Posterior Probability: 0.0005367686527106816
renew your password            spam  CORRECT!

SPAM msg: "renew your vows"
	Spam Posterior Probability: 0.0008873114463176574
	Ham Posterior Probability: 0.0010735373054213632
renew your vows                ham   WRONG!

HAM msg: "benefits of our account"
	Spam Posterior Probability: 6.337938902268981e-05
	Ham Posterior Probability: 0.0001192819228245959
benefits of our account        ham   CORRECT!

HAM msg: "the importance of physical activity"
	Spam Posterior Probability: 1.1317748039766037e-06
	Ham Posterior Probability: 0.00015904256376612786
the importance of physical activity ham   CORRECT!

Accuracy: 0.75


## How does this all work?
The approach relies on Bayes' Rule, which relates the probability of a message being SPAM based on prior knowledge (that is, having seen examples of HAM and SPAM). 

The approach taken relies on pre-processing the data to ensure I have a vocabulary of words and an understanding of their occurrence in terms of being SPAM or HAM (i.e. P(w|S) and P(w|H)). I classify a message by breaking the message up into its constituent words and working on these components, as outlined here. 

There are two main stages to this process: fitting and prediction.

**Fitting:**

* Compute Priors: Calculate the prior probabilities, based on the share of spam and ham words in the training data.
* Compute Likelihoods: For each word in our vocabulary, compute the likelihood it appears in spam and ham emails. Note that I use Laplace smoothing so that a word with a zero count in one class doesn't zero out that class's probability when making predictions (multiplying by zero pulls the whole product down to zero).

In our case, for our training data set the Priors are calculated as:

$$
P(S) = \frac{14}{23} = 0.6086956521739131
$$

$$
P(H) = \frac{9}{23} = 0.3913043478260869
$$

The conditional probabilities (or likelihoods) are calculated to be:

| Words       | Fractional P(w\|S) | P(w\|S) | Fractional P(w\|H) | P(w\|H) |
|-------------|-------------------|---------|-------------------|---------|
| password    | 3/14              | 0.2143  | 1/9               | 0.1111  |
| our         | 1/7               | 0.1429  | 1/9               | 0.1111  |
| vows        | 1/14              | 0.0714  | 2/9               | 0.2222  |
| us          | 3/14              | 0.2143  | 1/9               | 0.1111  |
| the         | 1/14              | 0.0714  | 2/9               | 0.2222  |
| benefits    | 1/14              | 0.0714  | 2/9               | 0.2222  |
| website     | 1/7               | 0.1429  | 1/9               | 0.1111  |
| importance  | 1/14              | 0.0714  | 2/9               | 0.2222  |
| activity    | 1/14              | 0.0714  | 1/3               | 0.3333  |
| send        | 2/7               | 0.2857  | 1/9               | 0.1111  |
| physical    | 1/14              | 0.0714  | 2/9               | 0.2222  |
| review      | 1/7               | 0.1429  | 1/9               | 0.1111  |
| your        | 2/7               | 0.2857  | 1/9               | 0.1111  |
| report      | 1/14              | 0.0714  | 2/9               | 0.2222  |
| account     | 1/7               | 0.1429  | 1/9               | 0.1111  |
| Your        | 1/14              | 0.0714  | 2/9               | 0.2222  |

When calculating posterior probabilities, unknown words will be assigned $\frac{1}{14}$ for spam and $\frac{1}{9}$ for ham.

**Prediction:**

* For a new, unlabeled email, break it down into its constituent words.
* For each word, multiply the likelihoods, initialising the product with the prior (I do two passes: one assuming the email is spam and separately assuming it's ham). 
* Apply Bayes' Rule. Combine the prior probabilities with the computed likelihoods to get the (unnormalised) posterior probabilities that the email is spam or ham. Classify the email based on whichever probability is higher, SPAM or HAM.

I then evaluate the accuracy of the model by predicting the labels of a test set of data and measuring how many we predict correctly. 

Note, you can always re-fit the model as you get more data, improving the system's awareness of what is (and is not) SPAM (see the `update` method of the `NaiveBayesClassifier` class).

### Classify (or Predict the Class) for the message: "renew your password"

We take the test message, "renew your password" and compute the following:

$$
\begin{align*}
    P(S|\text{renew your password}) &\propto P(S) \times P(\text{renew} | S) \times P(\text{your} | S) \times P(\text{password} | S) \\
    &= \frac{14}{23} \times \frac{1}{14} \times  \frac{2}{7} \times \frac{3}{14}  \\
    &= \frac{3}{1127} \\
    &\approx 0.0026619343389529724 \\
\end{align*} 
$$

Here "renew" is unknown in the spam corpus and is smoothed to $\frac{1}{14}$. Similarly:

$$
\begin{align*}
    P(H|\text{renew your password}) &\propto P(H) \times P(\text{renew} | H) \times P(\text{your} | H) \times P(\text{password} | H) \\
    &= \frac{9}{23} \times \frac{1}{9}   \times  \frac{1}{9} \times \frac{1}{9}  \\
    &= \frac{1}{1863} \\
    &\approx 0.0005367686527106816 \\
\end{align*} 
$$

Here "renew" is unknown in the ham corpus and is smoothed to $\frac{1}{9}$.

If $P(S|\text{renew your password}) > P(H|\text{renew your password})$ then SPAM, else HAM. In this case, `renew your password` is predicted to be `spam`.
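The same arithmetic explains the one misclassification in the test output, "renew your vows". The sketch below plugs in the smoothed likelihoods from the table above ("renew" is unseen in both classes); the variable names are just for illustration.

```python
prior_spam, prior_ham = 14 / 23, 9 / 23

# Likelihoods in word order: renew (unseen), your, vows.
spam_score = prior_spam * (1 / 14) * (2 / 7) * (1 / 14)
ham_score = prior_ham * (1 / 9) * (1 / 9) * (2 / 9)

predicted = 'spam' if spam_score > ham_score else 'ham'

print(f"Spam score: {spam_score:.10f}")
print(f"Ham score:  {ham_score:.10f}")
print(f"Predicted:  {predicted}")
```

"vows" only appears in the ham training data, so its ham likelihood ($\frac{2}{9}$) outweighs the spam evidence contributed by "your", and the message is labelled `ham` even though it is expected to be `spam`.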

Let's package all of the previous code into classes, so it's more modular. Note, I will do some refactoring but keep the same behaviour as before.

In [7]:
class ClassifierTester:
    """ 
    The purpose of this class is to test the classifier. If I was doing 
    this properly, I would use a notion of abstract methods so the
    tester could test the accuracy of any approach that uses the same 
    signatures for fit and predict.
    """
    def __init__(self, classifier):
        self.classifier = classifier

    def test(self, test_data, print_results=True):
        """
        Test the classifier with test data. This just measures
        how many correct predictions the classifier makes (divided
        by the total number of possible predictions).     

        :param test_data: a dictionary of test data, where 
                          the key is the label and the value 
                          is a list of emails
        :param print_results: whether to print the results 
                          (highlight mismatches)
        :return: the accuracy of the classifier
        """
        num_correct = 0
        num_total = 0

        max_len = max([len(email) for emails in test_data.values() for email in emails])

        if print_results:
            print(f'\t{"Message":{max_len}} {"Expected":10} {"Predicted":10}')
            print(('\t'*1 + '-' * max_len) + ' ' + '-' * 10 + ' ' + '-' * 10)

        for label, emails in test_data.items():
            for email in emails:
                predicted_label = self.classifier.predict(email)

                if print_results:
                    print(f'\t{email:{max_len}} {label:10} {predicted_label:10} {"WRONG!" if predicted_label != label else "CORRECT!"}')

                if predicted_label == label:
                    num_correct += 1
                num_total += 1

        accuracy = num_correct / num_total
        
        return accuracy


In [8]:
class NaiveBayesClassifier:

    def __init__(self):
        """
        Initialize the NaiveBayesClassifier.
        """
        self._spam_list = []
        self._ham_list = []
        self._vocabulary = set()
        self._alpha = 1  # Laplace smoothing parameter
        self._spam_word_counts = {}
        self._ham_word_counts = {}
        self._num_spam_words = 0
        self._num_ham_words = 0
        self._prior_spam = 0
        self._prior_ham = 0
        self._spam_likelihoods = {}
        self._ham_likelihoods = {}

    def __repr__(self) -> str:
        """
        This just helps pretty print the class to help understand the internals a little better.

        :return: a string representation of the classifier.
        """
        return (f'NB Classifier with {len(self._vocabulary)} words.\n'
                f'\tA priori probabilities: spam={self._prior_spam:.4f}, ham={self._prior_ham:.4f}.\n'
                f'\tAlpha parameter: {self._alpha}')

    def fit(self, spam_list: list, ham_list: list):
        """
        Train the classifier with the provided spam and ham lists.
        
        :param spam_list: List of spam sentences.
        :param ham_list: List of ham sentences.
        """
        # Copy the inputs so that update() does not mutate the caller's lists.
        self._spam_list = list(spam_list)
        self._ham_list = list(ham_list)

        self._vocabulary = self._build_vocabulary()
        self._spam_word_counts, self._ham_word_counts = self._build_word_counts()
        self._num_spam_words = sum(self._spam_word_counts.values())
        self._num_ham_words = sum(self._ham_word_counts.values())
        self._prior_spam = self._num_spam_words / (self._num_spam_words + self._num_ham_words)
        self._prior_ham = 1 - self._prior_spam
        self._spam_likelihoods = self._compute_likelihoods(self._spam_word_counts, self._num_spam_words)
        self._ham_likelihoods = self._compute_likelihoods(self._ham_word_counts, self._num_ham_words)

    def _build_vocabulary(self) -> set:
        """
        Build a vocabulary from the spam and ham lists.

        :return: The vocabulary as a set (unique words).
        """
        vocabulary = set()
        for email in self._spam_list + self._ham_list:
            words = email.split()
            for word in words:
                vocabulary.add(word)
        return vocabulary

    def _build_word_counts(self) -> tuple:
        """
        Build dictionaries of word counts for spam and ham.

        :return: A tuple (spam_word_counts, ham_word_counts).
        """
        spam_word_counts = {word: 0 for word in self._vocabulary}
        ham_word_counts = {word: 0 for word in self._vocabulary}

        for email in self._spam_list:
            for word in email.split():
                if word in spam_word_counts:
                    spam_word_counts[word] += 1

        for email in self._ham_list:
            for word in email.split():
                if word in ham_word_counts:
                    ham_word_counts[word] += 1

        return spam_word_counts, ham_word_counts

    def _compute_likelihoods(self, word_counts: dict, total_words: int) -> dict:
        """
        Compute the likelihood of each word.

        :param word_counts: The frequency of each word.
        :param total_words: Total number of words.
        :return: Dictionary of likelihoods.
        """
        return {word: (count + self._alpha) / total_words for word, count in word_counts.items()}

    def update(self, new_spam: list, new_ham: list):
        """
        Update the classifier with new spam and ham messages.

        :param new_spam: List of new spam messages.
        :param new_ham: List of new ham messages.
        """
        self._spam_list.extend(new_spam)
        self._ham_list.extend(new_ham)

        # Now refit for priors and likelihoods
        self.fit(self._spam_list, self._ham_list)

    def predict(self, msg: str) -> str:
        """
        Classify a single email as spam or ham.

        :param msg: The email message to classify.
        :return: Classification as 'spam' or 'ham'.
        """
        words = msg.split()

        spam_posterior_prob = self._prior_spam
        ham_posterior_prob = self._prior_ham

        for word in words:
            spam_posterior_prob *= self._spam_likelihoods.get(word, self._alpha / self._num_spam_words)
            ham_posterior_prob *= self._ham_likelihoods.get(word, self._alpha / self._num_ham_words)

        return 'spam' if spam_posterior_prob > ham_posterior_prob else 'ham'



In [9]:
# Train the classifier
nb_classifier = NaiveBayesClassifier()
nb_classifier.fit(previous_spam, previous_ham)

print(f"{nb_classifier}\n")

# Test the classifier
nb_tester = ClassifierTester(nb_classifier)
print("Testing the classifier...\n")

#How good is the classifier?
accuracy = nb_tester.test(new_emails)
print(f'\nAccuracy: {accuracy:.2f}')

NB Classifier with 16 words.
	A priori probabilities: spam=0.6087, ham=0.3913.
	Alpha parameter: 1

Testing the classifier...

	Message                             Expected   Predicted 
	----------------------------------- ---------- ----------
	renew your password                 spam       spam       CORRECT!
	renew your vows                     spam       ham        WRONG!
	benefits of our account             ham        ham        CORRECT!
	the importance of physical activity ham        ham        CORRECT!

Accuracy: 0.75


### Update the Classifier


Here I demonstrate a super-naive way to update an existing model and re-test. Again, there's no notion of serialisation/deserialisation, so if the classifier goes out of scope, you have to rebuild everything. There really are better ways to do this...

In [10]:
# Update the classifier
nb_classifier.update(new_emails['spam'], new_emails['ham'])

# Present new data to the classifier
new_msg = "Give me your password"
result = nb_classifier.predict(new_msg)

print(f'\nPrediction for \"{new_msg}\": {result}')

new_msg = "Send me your data"
result = nb_classifier.predict(new_msg)

print(f'Prediction for \"{new_msg}\": {result}')

new_msg = "Give me your money"
result = nb_classifier.predict(new_msg)

print(f'Prediction for \"{new_msg}\": {result}')

new_msg = "I would love a car"
result = nb_classifier.predict(new_msg)

print(f'Prediction for \"{new_msg}\": {result}')


Prediction for "Give me your password": spam
Prediction for "Send me your data": spam
Prediction for "Give me your money": spam
Prediction for "I would love a car": ham


## Reflection

This toy NB approach demonstrates the use of prior knowledge of SPAM/HAM in the classification of new messages. In practice, we would keep updating the likelihoods as we get new, correctly labelled data.

### Sources
I used all of the references below to build this classifier.

### Feedback
`PENDING`

### System Design
The NaiveBayesClassifier class allows:

* Training an NB classifier. Training (or fitting) generates the priors and likelihoods for the data set.
* Updating the priors and likelihoods based on new, labelled data.

The ClassifierTester class enables us to establish the accuracy of the classifier using labelled but previously unseen data.

#### Advantages 
* Surprisingly good even though it assumes all words are independent (which is clearly not the case in most language).
* Easy to understand.
* Relatively quick to implement.

#### Disadvantages
* The NB classifier has no sense of order (or context).
* The data set is far too small.
* We assume all input is lowercase (no case folding is applied, so `Your` and `your` are treated as different words). There's no facility for stop words.
* The system has no notion of punctuation or spelling mistakes.
* The current design calls for the NB class to retain the labelled data that it has previously seen. This is poor as, once the object goes out of scope, the garbage collector will clean it up and all training data will be lost. Ideally, we would at the very least save the core data to disk and deserialise it when we reinstantiate an object.
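As a rough sketch of the last point, the labelled training messages could be written to a JSON file and read back before calling `fit`. The helper names `save_training_data` and `load_training_data` are hypothetical, and the sketch needs standard-library imports that the Imports cell of this notebook does not currently include.

```python
import json
import os
import tempfile


def save_training_data(path, spam_list, ham_list):
    """Write the labelled training messages to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump({'spam': spam_list, 'ham': ham_list}, f)


def load_training_data(path):
    """Read the labelled training messages back as (spam_list, ham_list)."""
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    return data['spam'], data['ham']


# Round trip through a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'nb_training_data.json')
    save_training_data(path, ['send your password'], ['Your activity report'])
    spam_list, ham_list = load_training_data(path)

print(spam_list, ham_list)
```

The reloaded lists could then be passed straight to `NaiveBayesClassifier.fit`, which rebuilds the priors and likelihoods from them.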

### Discussion
#### Using a Threshold
I have included two ways to predict whether a message is SPAM or HAM:

* Comparing the relative posterior probabilities (picking the highest as the most likely)
* Using a threshold

I prefer the former as the latter approach raises the question of what threshold I should set; i.e. how good is good enough.
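For reference, a threshold-based decision might look like the sketch below. The function `predict_with_threshold` is a hypothetical helper, not part of the classes above; it normalises the two scores and labels a message as spam when the spam share exceeds the threshold. The two scores are the values printed for "renew your vows" in the test output earlier.

```python
def predict_with_threshold(spam_score, ham_score, threshold=0.5):
    """Label a message 'spam' when its normalised spam posterior exceeds threshold."""
    total = spam_score + ham_score
    if total == 0:
        # Assumption: fall back to 'ham' if both scores have underflowed to zero.
        return 'ham'
    p_spam = spam_score / total
    return 'spam' if p_spam > threshold else 'ham'


# Scores for "renew your vows" from the earlier test output.
spam_score = 0.0008873114463176574
ham_score = 0.0010735373054213632

default_label = predict_with_threshold(spam_score, ham_score)       # threshold 0.5
strict_label = predict_with_threshold(spam_score, ham_score, 0.4)   # lower threshold

print(default_label, strict_label)
```

A threshold of 0.5 is equivalent to comparing the two scores directly. Lowering it to 0.4 catches this particular spam message (its spam share is roughly 0.45), but it would also make false positives on ham more likely.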

#### Text Pre-processing: Stopwords
Removing stop words actually drops the accuracy to 0.50. Toolkits like NLTK have modules with many stop words (for English and many other languages). In this case, with so few words, removing stop words may actually remove information from the training and test sets.
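A minimal sketch of stop-word removal, without any toolkit, is shown below. The stop-word list is a deliberately tiny, hypothetical one chosen for illustration, and `remove_stop_words` is a helper name introduced here.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {'the', 'of', 'us', 'our', 'your'}


def remove_stop_words(msg, stop_words=STOP_WORDS):
    """Return the message's tokens with stop words removed (case-insensitive)."""
    return [w for w in msg.split() if w.lower() not in stop_words]


filtered_spam = remove_stop_words('renew your password')
filtered_ham = remove_stop_words('the importance of physical activity')

print(filtered_spam)
print(filtered_ham)
```

With a list like this, "your" is dropped even though it is one of the most frequent spam words in the training set, which is one plausible way that removing stop words can cost information here.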

#### Text Pre-processing: Remove punctuation, force lower case
We could pre-process the text to remove punctuation and force all text to be lowercase.
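One possible sketch of that pre-processing, without imports, is below. The `tokenise` helper and the `PUNCTUATION` string are assumptions introduced here; the set of characters to strip is a choice, not a standard.

```python
# Characters treated as punctuation in this sketch (an arbitrary choice).
PUNCTUATION = '.,;:!?"\'()[]{}'


def tokenise(msg):
    """Lowercase a message, strip punctuation characters and split on whitespace."""
    cleaned = ''.join(ch for ch in msg.lower() if ch not in PUNCTUATION)
    return cleaned.split()


print(tokenise('Your activity report!'))
print(tokenise('Send us your password, please.'))
```

Using this in place of `split()` would merge `Your` and `your` into one token, shrinking the training vocabulary from 16 to 15 words. The results in this notebook were not re-run with it.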

#### Accuracy (with different test/train splits or with new data)
Note that the accuracy of the system can be affected by changing how we divide the data into training and test sets. See here:

In [11]:
# take a SPAM message from previous and move to test and vice versa.
previous_spam = ['renew your vows', 'review our website', 'send your password', 'send us your account']
previous_ham = ['benefits of our account','Your activity report', 'the importance vows']
new_emails = {'spam':['send us your password', 'renew your password' ], 'ham':['benefits physical activity', 'the importance of physical activity']}

# Train the classifier
nb_classifier = NaiveBayesClassifier()
nb_classifier.fit(previous_spam, previous_ham)

# Test the classifier
nb_tester = ClassifierTester(nb_classifier)
print("Testing the classifier...\n")

accuracy = nb_tester.test(new_emails)
print(f'\nAccuracy: {accuracy:.2f}')

Testing the classifier...

	Message                             Expected   Predicted 
	----------------------------------- ---------- ----------
	send us your password               spam       spam       CORRECT!
	renew your password                 spam       spam       CORRECT!
	benefits physical activity          ham        ham        CORRECT!
	the importance of physical activity ham        ham        CORRECT!

Accuracy: 1.00


Here I get an accuracy of 1.00, which is very suspect. The same pool of labelled messages is used in both cases; I have only changed which messages are used for training and which for testing, and I get a different accuracy depending on the split.

Also, changing the number of samples in the test set changes the accuracy (see the example below, where adding one new message to the test set takes the accuracy from 0.75 to 0.80). With so few test messages, each one moves the accuracy by a large step. This is shown next:

In [12]:
previous_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
previous_ham = ['Your activity report','benefits physical activity', 'the importance vows']
new_emails = {'spam':['renew your password', 'renew your vows'], 'ham':['benefits of our account', 'your exam results are in', 'the importance of physical activity']}

# Train the classifier
nb_classifier = NaiveBayesClassifier()
nb_classifier.fit(previous_spam, previous_ham)

# Test the classifier
nb_tester = ClassifierTester(nb_classifier)
print("Testing the classifier...\n")

accuracy = nb_tester.test(new_emails)
print(f'\nAccuracy: {accuracy:.2f}')

Testing the classifier...

	Message                             Expected   Predicted 
	----------------------------------- ---------- ----------
	renew your password                 spam       spam       CORRECT!
	renew your vows                     spam       ham        WRONG!
	benefits of our account             ham        ham        CORRECT!
	your exam results are in            ham        ham        CORRECT!
	the importance of physical activity ham        ham        CORRECT!

Accuracy: 0.80




#### False Positives
A few years ago, I worked in the AV industry as a malware analyst and we always had to deal with misclassification, in particular false positives (the four possible outcomes being TP, TN, FP and FN); think how many times your AV thought a valid, reputable application was malware. Misclassification is a fact of life with these methods.
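As a small sketch, the four outcomes can be counted from paired expected and predicted labels. The `confusion_counts` helper is introduced here for illustration, with `spam` treated as the positive class; the labels are the ones from the first test run of the `NaiveBayesClassifier` above.

```python
def confusion_counts(expected, predicted, positive='spam'):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {'TP': 0, 'TN': 0, 'FP': 0, 'FN': 0}
    for exp, pred in zip(expected, predicted):
        if pred == positive:
            counts['TP' if exp == positive else 'FP'] += 1
        else:
            counts['FN' if exp == positive else 'TN'] += 1
    return counts


# Expected and predicted labels from the first test run above.
expected = ['spam', 'spam', 'ham', 'ham']
predicted = ['spam', 'ham', 'ham', 'ham']

counts = confusion_counts(expected, predicted)
print(counts)
```

In that run there are no false positives; the single error is a false negative ("renew your vows" slipping through as ham).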

#### Bags of Words and (Not) Understanding Context
Note that this approach is naive in many ways. It assumes the features are independent and ignores word order. Consider what happens if we jumble up the words in our training and test data.

In [13]:
previous_spam = ['us send password your', 'website our review', 'your password send', 'send us your account']
previous_ham = ['activity report Your', 'benefits activity physical', 'the vows importance']
new_emails = {'spam': ['your renew  password', 'renew vows your'], 'ham': ['our benefits of account', 'the importance of physical activity']}

# Train the classifier
nb_classifier = NaiveBayesClassifier()
nb_classifier.fit(previous_spam, previous_ham)

# Test the classifier
nb_tester = ClassifierTester(nb_classifier)
print("Testing the classifier...\n")

accuracy = nb_tester.test(new_emails)
print(f'\nAccuracy: {accuracy:.2f}')

Testing the classifier...

	Message                             Expected   Predicted 
	----------------------------------- ---------- ----------
	your renew  password                spam       spam       CORRECT!
	renew vows your                     spam       ham        WRONG!
	our benefits of account             ham        ham        CORRECT!
	the importance of physical activity ham        ham        CORRECT!

Accuracy: 0.75


Same result. This is not an exhaustive test but demonstrates the intuition that the method ignores word order.

### Investigating floating point underflow
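Multiplying many probabilities that are each smaller than 1 can underflow to zero for long messages. The usual remedy is to sum log-probabilities instead of multiplying probabilities. Since we cannot import libraries, we will need to approximate the log function.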

In [14]:
# TODO: 
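One possible way to fill in this TODO is sketched below. It is not the notebook's implementation: `approx_ln` and `log_score` are hypothetical helpers, and the approximation (range reduction by powers of 2, followed by the series $\ln m = 2 \sum_n \frac{z^{2n+1}}{2n+1}$ with $z = \frac{m-1}{m+1}$) is just one of several options. The constant for $\ln 2$ is hard-coded because `math` is assumed to be unavailable.

```python
LN2 = 0.6931471805599453  # natural log of 2, hard-coded since math is not imported


def approx_ln(x, terms=20):
    """Approximate ln(x) for x > 0 without importing math."""
    if x <= 0:
        raise ValueError('approx_ln is only defined for x > 0')
    # Range reduction: rewrite x as m * 2**k with m in [0.75, 1.5).
    k = 0
    while x >= 1.5:
        x /= 2
        k += 1
    while x < 0.75:
        x *= 2
        k -= 1
    # Series for ln(m) in terms of z = (m - 1) / (m + 1); converges fast for small z.
    z = (x - 1) / (x + 1)
    total, power = 0.0, z
    for n in range(terms):
        total += power / (2 * n + 1)
        power *= z * z
    return 2 * total + k * LN2


def log_score(prior, likelihoods):
    """Sum of logs, replacing the product prior * l1 * l2 * ..."""
    return approx_ln(prior) + sum(approx_ln(p) for p in likelihoods)


# "renew your password" with the smoothed likelihoods used earlier.
spam_log = log_score(14 / 23, [1 / 14, 2 / 7, 3 / 14])
ham_log = log_score(9 / 23, [1 / 9, 1 / 9, 1 / 9])

print('spam' if spam_log > ham_log else 'ham')
```

Because the logarithm is monotonic, comparing the two log scores gives the same label as comparing the products, but the sums stay well within floating point range even for very long messages.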


## References
* [Naive Bayes, Clearly Explain!!!](https://www.youtube.com/watch?v=O2L2Uv9pdDA&ab_channel=StatQuestwithJoshStarmer)
* [Bayes theorem, the geometry of changing beliefs](https://www.youtube.com/watch?v=HZGCoVF3YvM&ab_channel=3Blue1Brown)
* [Notes on Naive Bayes Classifiers for Spam Filtering](https://courses.cs.washington.edu/courses/cse312/18sp/lectures/naive-bayes/naivebayesnotes.pdf)

## Appendix
