# Notebook E-tivity 3 CE4021 Task 2

Student name: Bartlomiej Mlynarkiewicz

Student ID: 17241782

<hr style="border:2px solid gray"> </hr>

## Imports

If you believe required imports are missing, please contact your moderator.

<hr style="border:2px solid gray"> </hr>

## Task 2

Using Etivity3-Task2.ipynb from the Gitlab repository and the dataset contained therein, create a Naive Bayes Classifier to filter incoming mail for SPAM.

The notebook provides a small dataset of previous 'emails' (please note the absence of punctuation which simplifies the coding challenge somewhat). Previous wanted emails are contained in previous_ham. Previous unwanted emails are contained in previous_spam. 

Write code using Bayes' Rule to determine whether the messages contained in new_emails are HAM or SPAM. Compare the decisions your classifier takes with the label associated with the messages (indicated by the key under which they are stored in the new_emails dictionary).

If time permits, add the code required to allow your classifier to learn from the email messages contained in new_emails. Note that this functionality is required to be graded in the Exemplary category. 

HINTS:

1. Use functions to divide up the task in smaller components. It is useful to work through the problem by hand to get a handle on what functions would be useful.
2. Choose a suitable threshold of 'spamicity' (or 'spaminess') to distinguish between spam and ham messages in this dataset. 

Use the below information to create a Naive Bayes SPAM filter. Test your filter using the messages in new_emails. You may add as many cells as you require to complete the task.

### Background

A Naive Bayes classifier is a probabilistic machine learning model used for classification tasks. It is built on Bayes' theorem:


$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$

Bayes' theorem gives the probability of A given that B has occurred. Here B is the evidence and A is the hypothesis. The classifier additionally assumes that the features are independent of one another given the class, that is, the presence of one feature does not affect the presence of another. This assumption is what makes the classifier "naive".
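
Applied to this task, the hypothesis is "the message is spam" and the evidence is the words of the message. As a sketch (the symbols $S$ for spam, $\neg S$ for ham, and $w_1, \dots, w_n$ for the words of a message are notation assumed here, not taken from the task description), the independence assumption lets the likelihood factor into one term per word:

$$
P(S \mid w_1, \dots, w_n) = \frac{P(S)\prod_{i=1}^{n} P(w_i \mid S)}{P(w_1, \dots, w_n)}
$$

The denominator is the same for spam and ham, so it cancels when the two posteriors are compared:

$$
\frac{P(S \mid w_1, \dots, w_n)}{P(\neg S \mid w_1, \dots, w_n)} = \frac{P(S)\prod_{i=1}^{n} P(w_i \mid S)}{P(\neg S)\prod_{i=1}^{n} P(w_i \mid \neg S)}
$$

A ratio above 1 favours spam and a ratio below 1 favours ham.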

### Data

In [5]:
previous_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
previous_ham = ['Your activity report','benefits physical activity', 'the importance vows']
new_emails = {'spam':['renew your password', 'renew your vows'], 'ham':['benefits of our account', 'the importance of physical activity']}

### Implementation

Below is a minimal `defaultdict` implementation, modelled on the `defaultdict` class in the standard `collections` module. A plain `dict` offers `setdefault()`, which retrieves a value and stores a default if the key is missing, but the default has to be supplied on every call. `defaultdict` instead takes the default factory once, when the container is created, so looking up a missing key returns a fresh default value rather than raising `KeyError`.
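
For comparison, here is a small sketch of the two approaches using the standard library's `collections.defaultdict` rather than the hand-written version in this notebook. The variable names are illustrative only. It also shows a side effect that is easy to overlook: reading a missing key inserts it.

```python
from collections import defaultdict

# Plain dict: setdefault supplies the default at each call site.
plain = {}
plain.setdefault("send", 0)
plain["send"] += 1

# defaultdict: the default factory is fixed once, at construction.
counts = defaultdict(int)
counts["send"] += 1

# Merely reading a missing key stores it with the default value.
before = len(counts)
_ = counts["unseen"]
after = len(counts)

print(plain, dict(counts), before, after)
```

The hand-written `__getitem__` in this notebook stores missing keys in the same way. When a lookup should not modify the dictionary, `counts.get(key, 0)` reads without inserting.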

In [110]:
def defaultdict(default_type):
    class DefaultDict(dict):
        def __getitem__(self, key):
            if key not in self:
                dict.__setitem__(self, key, default_type())
            return dict.__getitem__(self, key)
    return DefaultDict()

Tokenize a text into words and convert each word to lowercase, so that comparisons are case-insensitive.

In [111]:
def to_lower_tokens(text):
    return [word.lower() for word in text.split()]
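
A quick usage sketch of the tokenizer. The function is repeated so the cell runs on its own, and the sample string is made up for illustration. Because `split()` is called without arguments, any run of whitespace acts as a single separator.

```python
def to_lower_tokens(text):
    # Same definition as above, repeated so this cell is self-contained.
    return [word.lower() for word in text.split()]

# Mixed case, a double space and a tab are all normalised away.
tokens = to_lower_tokens("Your  activity\treport")
print(tokens)  # ['your', 'activity', 'report']
```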

Define the training function, which collects the message counts and word frequencies that the classifier needs.

In [177]:
def train_naive_bayes(previous_ham, previous_spam):
    spam_word_counts = defaultdict(int)
    ham_word_counts = defaultdict(int)
    
    # Count the number of previous spam and ham email
    total_spam_messages = len(previous_spam)
    total_ham_messages = len(previous_ham)

    # Count the number of words in previous spam and ham emails
    total_words_in_spam = sum(len(email.split()) for email in previous_spam)
    total_words_in_ham = sum(len(email.split()) for email in previous_ham)
    
    # Count the frequency of each word in previous spam emails
    for email in previous_spam:
        words = to_lower_tokens(email)
        for word in words:
            spam_word_counts[word] += 1

    # Count the frequency of each word in previous ham emails
    for email in previous_ham:
        words = to_lower_tokens(email)
        for word in words:
            ham_word_counts[word] += 1
    
    return spam_word_counts, ham_word_counts, total_spam_messages, total_ham_messages, total_words_in_spam, total_words_in_ham
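
As a cross-check on what training should produce for this dataset, the tallies can be recomputed independently. This is a sketch using `collections.Counter`, not the function above. `count_words` is a hypothetical helper introduced only for this check.

```python
from collections import Counter

previous_spam = ['send us your password', 'review our website',
                 'send your password', 'send us your account']
previous_ham = ['Your activity report', 'benefits physical activity',
                'the importance vows']

def count_words(emails):
    # Hypothetical helper: lowercase, split on whitespace, tally each word.
    return Counter(word.lower() for email in emails for word in email.split())

spam_counts = count_words(previous_spam)
ham_counts = count_words(previous_ham)
vocabulary = set(spam_counts) | set(ham_counts)

# Total words per class and the size of the combined vocabulary.
print(sum(spam_counts.values()), sum(ham_counts.values()), len(vocabulary))  # 14 9 15
print(spam_counts["send"], ham_counts["activity"])  # 3 2
```

The two classes share only the word "your", which is why 8 distinct spam words and 8 distinct ham words give a vocabulary of 15.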

Use Naive Bayes to predict whether an email is SPAM or HAM

In [209]:
def predict_naive_bayes(email, spam_word_counts, ham_word_counts, total_spam_messages, total_ham_messages, total_words_in_spam, total_words_in_ham):
    words = to_lower_tokens(email)
    
    # Laplace smoothing
    alpha = 1

    # Calculate P(S)
    spam_score = total_spam_messages / (total_spam_messages + total_ham_messages)
    # Calculate P(¬S)
    ham_score = total_ham_messages / (total_spam_messages + total_ham_messages)
    
    # Vocabulary: every distinct word seen in either class
    unique_words = set().union(spam_word_counts.keys(), ham_word_counts.keys())

    for word in words:
        # Calculate the smoothed conditional probabilities P(word | S) and P(word | ¬S)
        prob_word_given_spam = (spam_word_counts[word] + alpha) / (total_words_in_spam + alpha * len(unique_words))
        prob_word_given_ham = (ham_word_counts[word] + alpha) / (total_words_in_ham + alpha * len(unique_words))
        
        spam_score *= prob_word_given_spam
        ham_score *= prob_word_given_ham
    
    # The evidence P(words) is common to both posteriors, so it cancels:
    # the ratio of the two unnormalised scores is P(S | words) / P(¬S | words).
    spamminess_score = spam_score / ham_score
    prediction = 'spam' if spam_score > ham_score else 'ham'
    
    return prediction, spamminess_score
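
Multiplying many small probabilities can underflow for long messages. A common alternative is to add logarithms instead. The following is a minimal sketch of that idea, not the notebook's method: `log_score` is a hypothetical helper, and the counts are typed in by hand from the training data (14 spam words, 9 ham words, priors 4/7 and 3/7).

```python
import math

def log_score(words, word_counts, total_words, vocab_size, prior, alpha=1):
    # Hypothetical helper: log prior plus the sum of smoothed log-likelihoods.
    score = math.log(prior)
    for word in words:
        count = word_counts.get(word, 0)  # .get does not insert unseen words
        score += math.log((count + alpha) / (total_words + alpha * vocab_size))
    return score

spam_counts = {"send": 3, "us": 2, "your": 3, "password": 2,
               "review": 1, "our": 1, "website": 1, "account": 1}
ham_counts = {"your": 1, "activity": 2, "report": 1, "benefits": 1,
              "physical": 1, "the": 1, "importance": 1, "vows": 1}
vocab_size = len(set(spam_counts) | set(ham_counts))

words = ["renew", "your", "password"]
log_spam = log_score(words, spam_counts, 14, vocab_size, 4 / 7)
log_ham = log_score(words, ham_counts, 9, vocab_size, 3 / 7)

# exp of the log difference recovers the same odds ratio as the product form.
ratio = math.exp(log_spam - log_ham)
print("spam" if log_spam > log_ham else "ham", round(ratio, 4))
```

The decision only needs the comparison `log_spam > log_ham`. The exponential is taken here solely to show that the ratio agrees with the product form.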

Train the Naive Bayes Classifier

In [210]:
spam_word_counts, ham_word_counts, total_spam_messages, total_ham_messages, total_words_in_spam, total_words_in_ham = train_naive_bayes(previous_ham, previous_spam)

Use the classifier to predict whether new emails are SPAM or HAM

In [212]:
classified_emails = {'spam': [], 'ham': []}
spamminess_scores = {'spam': [], 'ham': []}

for label, emails in new_emails.items():
    for email in emails:
        prediction, spamminess_score = predict_naive_bayes(email, spam_word_counts, ham_word_counts, total_spam_messages, total_ham_messages, total_words_in_spam, total_words_in_ham)
        classified_emails[prediction].append(email)
        spamminess_scores[prediction].append(spamminess_score)

In [213]:
print("Classified SPAM:", classified_emails['spam'])
print("Classified HAM:", classified_emails['ham'])
print("Actual Labels - SPAM:", new_emails['spam'])
print("Actual Labels - HAM:", new_emails['ham'])

print("Spamminess Scores - SPAM:", spamminess_scores['spam'])
print("Spamminess Scores - HAM:", spamminess_scores['ham'])

Classified SPAM: ['renew your password', 'benefits of our account']
Classified HAM: ['renew your vows', 'the importance of physical activity']
Actual Labels - SPAM: ['renew your password', 'renew your vows']
Actual Labels - HAM: ['benefits of our account', 'the importance of physical activity']
Spamminess Scores - SPAM: [4.534503259666243, 1.2860082304526745]
Spamminess Scores - HAM: [0.7716049382716047, 0.02305609567131223]
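
The scores above can be checked by hand with exact fractions. The sketch below hard-codes the priors (4/7 and 3/7), the word totals (14 and 9) and the per-word training counts. `odds` is a hypothetical helper. One observation, which is an inference from the numbers rather than something stated in the notebook: the second score matches a vocabulary of 16 words, not 15. That is consistent with the first prediction having inserted "renew" into the count dictionaries when it looked the word up.

```python
from fractions import Fraction

def odds(spam_counts, ham_counts, vocab_size):
    # spam_counts / ham_counts: training counts of each word in one message.
    spam = Fraction(4, 7)
    ham = Fraction(3, 7)
    for s, h in zip(spam_counts, ham_counts):
        spam *= Fraction(s + 1, 14 + vocab_size)  # Laplace smoothing, alpha = 1
        ham *= Fraction(h + 1, 9 + vocab_size)
    return spam / ham

# 'renew your password' with the original 15-word vocabulary
first = odds([0, 3, 2], [0, 1, 0], 15)

# 'renew your vows' with 15 words, and with 16 (after 'renew' was inserted)
second_15 = odds([0, 3, 0], [0, 1, 1], 15)
second_16 = odds([0, 3, 0], [0, 1, 1], 16)

print(float(first), float(second_15), float(second_16))
```

Here `first` agrees with the reported 4.5345 and `second_16` with the reported 0.7716, while `second_15` comes out slightly lower, at roughly 0.756.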


In [214]:
correct = 0
total = sum(len(emails) for emails in new_emails.values())

# A prediction is correct when the email appears under the same label in new_emails
for label, emails in classified_emails.items():
    for email in emails:
        if email in new_emails[label]:
            correct += 1

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 2
Incorrect: 2
Accuracy: 0.5
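
The classifier above predicts spam whenever the spamminess score exceeds 1. To see how the choice of threshold affects accuracy on these four messages, the reported scores can be replayed against a few candidate thresholds. This is a sketch: the scores are copied from the output above, and `accuracy_at` and the candidate thresholds are assumptions made for illustration.

```python
# (true label, spamminess score) pairs copied from the output above.
scored = [
    ("spam", 4.534503259666243),   # 'renew your password'
    ("spam", 0.7716049382716047),  # 'renew your vows'
    ("ham", 1.2860082304526745),   # 'benefits of our account'
    ("ham", 0.02305609567131223),  # 'the importance of physical activity'
]

def accuracy_at(threshold):
    # Predict spam when the score exceeds the threshold, then compare to the label.
    hits = sum(("spam" if score > threshold else "ham") == label
               for label, score in scored)
    return hits / len(scored)

for threshold in (0.5, 1.0, 2.0):
    print(threshold, accuracy_at(threshold))
```

Thresholds of 0.5 and 2.0 each get three of the four messages right, against two at 1.0. No single threshold classifies all four correctly, because the spam message 'renew your vows' scores lower than the ham message 'benefits of our account'.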


<hr style="border:2px solid gray"> </hr>

## Reflection