# Notebook E-tivity 3 CE4021 Task 2

Student name: Mitchell de Bruyn

Student ID: 23296828

<hr style="border:2px solid gray"> </hr>

## Imports

In [1]:
#None

If you believe required imports are missing, please contact your moderator.

<hr style="border:2px solid gray"> </hr>

## Task 2

Use the information below to create a Naive Bayes SPAM filter. Test your filter using the messages in new_emails. You may add as many cells as you require to complete the task.

In [2]:
previous_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
previous_ham = ['Your activity report','benefits physical activity', 'the importance vows']
new_emails = {'spam':['renew your password', 'renew your vows'], 'ham':['benefits of our account', 'the importance of physical activity']}

# Bayes' Theorem


$$
P(S|W) = \frac{P(W|S) \times P(S)} {P(W)}
$$

Where:

- P(S) - **<u>prior</u>** - the probability that an email is spam
- P(W) - the probability that an email contains the word, from the law of total probability: $P(W) = P(W|S)P(S) + P(W|H)P(H)$, where H denotes ham
- P(W|S) - **<u>likelihood</u>** - the probability that an email contains a specific word given that the email is spam
- P(S|W) - **<u>posterior</u>** - (the answer) the probability that an email is spam given that it contains a specific word
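
As a minimal sketch of the theorem for a single word, the cell below estimates each term by plain counting on the Task 2 training lists and applies the formula to the word `password`. The helper name `fraction_containing` is a hypothetical choice for illustration, and no smoothing is applied.

```python
# Training data from the task statement
previous_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
previous_ham = ['Your activity report', 'benefits physical activity', 'the importance vows']

def fraction_containing(word, emails):
    """Hypothetical helper: fraction of emails whose tokens include `word`."""
    return sum(word in email.split() for email in emails) / len(emails)

word = 'password'
p_s = len(previous_spam) / (len(previous_spam) + len(previous_ham))  # prior P(S)
p_h = 1 - p_s                                                        # prior P(H)
p_w_given_s = fraction_containing(word, previous_spam)               # likelihood P(W|S)
p_w_given_h = fraction_containing(word, previous_ham)                # P(W|H)

# Total probability: P(W) = P(W|S)P(S) + P(W|H)P(H)
p_w = p_w_given_s * p_s + p_w_given_h * p_h

# Posterior P(S|W) by Bayes' theorem
p_s_given_w = p_w_given_s * p_s / p_w
print(f"P(S|'{word}') = {p_s_given_w:.3f}")
```

Because `password` never appears in the ham examples, the unsmoothed estimate of P(W|H) is 0 and the posterior comes out as exactly 1.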


# Naive Bayes' formula

This formula uses the naive assumption that the occurrences of different words are conditionally independent given the class (spam or ham), which allows us to multiply the per-word probabilities together.

$$
P(S | w_1, \ldots, w_n) = \frac{P(S) \prod_{i=1}^{n} P(w_i | S)}{P(S) \prod_{i=1}^{n} P(w_i | S) + P(H) \prod_{i=1}^{n} P(w_i | H)}
$$


# Laplace smoothing

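If a word never appears in one class, its estimated probability is 0, and a single zero factor wipes out the whole product. To prevent this, we use Laplace smoothing: add 1 to the count of emails containing the word and 2 to the total number of emails, one for each possible outcome (word present or word absent). The same formula applies to ham emails.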

$$
P(w | S) = \frac{| \text{spam emails containing } w|+1}{|\text{spam emails}|+2}
$$
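
A minimal sketch of the smoothed estimate, assuming the Task 2 training lists and whitespace tokenisation. The function name `smoothed_likelihood` is hypothetical and used only for illustration.

```python
# Training data from the task statement
previous_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
previous_ham = ['Your activity report', 'benefits physical activity', 'the importance vows']

def smoothed_likelihood(word, emails):
    """Hypothetical helper: Laplace-smoothed P(word | class) from a list of emails."""
    containing = sum(word in email.split() for email in emails)
    return (containing + 1) / (len(emails) + 2)

# 'password' occurs in 2 of the 4 spam emails and in none of the 3 ham emails
p_password_given_spam = smoothed_likelihood('password', previous_spam)  # (2 + 1) / (4 + 2)
p_password_given_ham = smoothed_likelihood('password', previous_ham)    # (0 + 1) / (3 + 2)

print(p_password_given_spam, p_password_given_ham)
```

Even though `password` is absent from every ham email, its smoothed ham probability stays strictly positive, so it can no longer zero out a product.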

# Final approximations

We classify an email as spam when:

$$
P(S | w_1, \ldots , w_n) > 0.5
$$

The spam and ham posteriors share the same denominator and sum to 1, so this is equivalent to comparing the two numerators:

$$
P(S) \prod_{i=1}^{n} P(w_i | S) > P(H) \prod_{i=1}^{n} P(w_i | H)
$$

To prevent numerical underflow when multiplying many small probabilities, we take the logarithm of both sides. The logarithm is monotonically increasing, so the inequality is preserved:

$$
\log(P(S)) + \sum_{i=1}^{n} \log(P(w_i | S)) > \log(P(H)) + \sum_{i=1}^{n} \log(P(w_i | H))
$$
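
The log-space comparison can be sketched as follows for the test email 'renew your password'. This is an illustrative sketch rather than part of the solution cell: it imports `math`, which the notebook's import cell does not include, and the likelihood values are Laplace-smoothed estimates worked out by hand from the Task 2 training data.

```python
import math

# Priors from the training data: 4 spam and 3 ham emails
p_s, p_h = 4 / 7, 3 / 7

# Laplace-smoothed likelihoods for the known words 'your' and 'password'.
# 'renew' does not occur in the training data and is skipped; the ham email
# 'Your activity report' contains 'Your', which is a different token from 'your'.
p_word_given_spam = {'your': 4 / 6, 'password': 3 / 6}
p_word_given_ham = {'your': 1 / 5, 'password': 1 / 5}

words = ['your', 'password']

# Sums of logs replace products of probabilities
log_spam = math.log(p_s) + sum(math.log(p_word_given_spam[w]) for w in words)
log_ham = math.log(p_h) + sum(math.log(p_word_given_ham[w]) for w in words)

print("spam" if log_spam > log_ham else "ham")
```

Only the ordering of the two scores matters for the decision, so the log scores never need to be converted back to probabilities.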



# Naive Bayes Spam Classification Algorithm

## 1. Feature Gathering:
- Gather all distinct words from the training data to create a training word set. This set should include words from both spam (`S`) and ham (`H`) emails.

## 2. Calculate Probability of Word Given Spam (`P(w|S)`):
- For each word in the set, calculate its probability of being in a spam email:
    - Count the number of spam emails containing that word.
    - Use Laplace smoothing to calculate the probability of the word being in a spam email:
    $$ P(\text{word given spam}) = \frac{\text{Number of spam emails containing the word} + 1}{\text{Total number of spam emails} + 2} $$
- Save the results in a dictionary, `p_word_given_spam`.

## 3. Calculate Probability of Word Given Ham (`P(w|H)`):
- Use the method above to calculate and save the results in a dictionary, `p_word_given_ham`.

## 4. Calculate Prior Probabilities:
- Calculate the overall probability of an email being spam based on the training data:
  $$ P(S) = \frac{\text{Number of emails in the spam set}}{\text{Total number of emails in the training set}} $$
- The ham prior is the complement: $P(H) = 1 - P(S)$.
  
## 5. Classify Emails in the Test Set:
For each email in the test set (the labels are used only to check the predictions):

  i. Create a set of distinct words from the email, ignoring words not present in the training word set.

  ii. Compute the posterior probability of spam using:
  $$ P(S | w_1, \ldots, w_n) = \frac{P(S) \prod_{i=1}^{n} P(w_i | S)}{P(S) \prod_{i=1}^{n} P(w_i | S) + P(H) \prod_{i=1}^{n} P(w_i | H)} $$

  iii. If the posterior is greater than 0.5, print spam; otherwise print ham.
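
As a worked check of step 5, consider the test email 'benefits of our account', using the Laplace-smoothed estimates from steps 2 to 4 on the Task 2 training data (4 spam and 3 ham emails). The word 'of' does not appear in the training set and is ignored, leaving 'benefits', 'our' and 'account':

$$
\begin{aligned}
P(S)\prod_{i} P(w_i | S) &= \tfrac{4}{7}\cdot\tfrac{1}{6}\cdot\tfrac{2}{6}\cdot\tfrac{2}{6} = \tfrac{2}{189} \approx 0.0106 \\
P(H)\prod_{i} P(w_i | H) &= \tfrac{3}{7}\cdot\tfrac{2}{5}\cdot\tfrac{1}{5}\cdot\tfrac{1}{5} = \tfrac{6}{875} \approx 0.0069 \\
P(S | w_1, \ldots, w_n) &= \frac{2/189}{2/189 + 6/875} = \frac{875}{1442} \approx 0.607
\end{aligned}
$$

Since 0.607 is greater than 0.5, this email is classified as spam even though its label is ham.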


In [3]:
# Training data
previous_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
previous_ham = ['Your activity report','benefits physical activity', 'the importance vows']

# 1. Tokenize emails and create the training word set
# (tokens are case-sensitive, so 'Your' and 'your' count as different words)
training_words = set()
for email in previous_spam + previous_ham:
    training_words.update(email.split())

# 2-3. Calculate P(w|S) and P(w|H) with Laplace smoothing
p_word_given_spam = {}
p_word_given_ham = {}

for word in training_words:
    p_word_given_spam[word] = (sum(word in email.split() for email in previous_spam) + 1) / (len(previous_spam) + 2)
    p_word_given_ham[word] = (sum(word in email.split() for email in previous_ham) + 1) / (len(previous_ham) + 2)

# 4. Calculate P(S) and P(H)
p_s = len(previous_spam) / (len(previous_spam) + len(previous_ham))
p_h = 1 - p_s

# 5. Classify new emails
new_emails = {'spam':['renew your password', 'renew your vows'], 
              'ham':['benefits of our account', 'the importance of physical activity']}

def prod(numbers):
    """
    Calculate the product of an iterable of numbers.

    Stands in for math.prod(), since this notebook uses no imports.

    Args:
        numbers: An iterable of numbers.

    Returns:
        The product of the numbers (1 if the iterable is empty).
    """
    result = 1
    for num in numbers:
        result *= num
    return result


for label, emails in new_emails.items():
    for email in emails:
        # Distinct words in the email that also appear in the training set
        words = {word for word in email.split() if word in training_words}
        
        # Calculate P(S | w_1, ..., w_n) using the given formula
        numerator_spam = p_s * prod(p_word_given_spam[word] for word in words)
        denominator = numerator_spam + p_h * prod(p_word_given_ham[word] for word in words)
        
        p_spam_given_email = numerator_spam / denominator

        print("=============================================================")
        print(f"Email: '{email}'")
        print('-------------------------------------------------------------')
        print("label:", label, end='')
        print("\tprediction:", "spam" if p_spam_given_email > 0.5 else "ham")
        

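=============================================================
Email: 'renew your password'
-------------------------------------------------------------
label: spam	prediction: spam
=============================================================
Email: 'renew your vows'
-------------------------------------------------------------
label: spam	prediction: spam
=============================================================
Email: 'benefits of our account'
-------------------------------------------------------------
label: ham	prediction: spam
=============================================================
Email: 'the importance of physical activity'
-------------------------------------------------------------
label: ham	prediction: ham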


<hr style="border:2px solid gray"> </hr>

## Reflection

Write your reflection in the cell below.