# Notebook E-tivity 3 CE4021 Task 2

Student name: Bartlomiej Mlynarkiewicz

Student ID: 17241782

<hr style="border:2px solid gray"> </hr>

## Imports

If you believe required imports are missing, please contact your moderator.

<hr style="border:2px solid gray"> </hr>

## Task 2

Using Etivity3-Task2.ipynb from the Gitlab repository and the dataset contained therein, create a Naive Bayes Classifier to filter incoming mail for SPAM.

The notebook provides a small dataset of previous 'emails' (please note the absence of punctuation which simplifies the coding challenge somewhat). Previous wanted emails are contained in previous_ham. Previous unwanted emails are contained in previous_spam. 

Write code using Bayes' Rule to determine whether the messages contained in new_emails are HAM or SPAM. Compare the decisions your classifier takes with the label associated with the messages (indicated by the key under which they are stored in the new_emails dictionary).

If time permits, add the code required to allow your classifier to learn from the email messages contained in new_emails. Note that this functionality is required to be graded in the Exemplary category. 

HINTS:

1. Use functions to divide up the task into smaller components. It is useful to work through the problem by hand to get a handle on what functions would be useful.
2. Choose a suitable threshold of 'spamicity' (or 'spaminess') to distinguish between spam and ham messages in this dataset. 

Use the below information to create a Naive Bayes SPAM filter. Test your filter using the messages in new_emails. You may add as many cells as you require to complete the task.

### Background

A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The classifier is built on Bayes' theorem.


$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$

Using Bayes' theorem, we can find the probability of A happening given that B has occurred. Here, B is the evidence and A is the hypothesis. The classifier assumes that the features are independent of one another given the class, that is, the presence of one particular feature does not affect any other. This assumption is why the classifier is called naive.
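
As a minimal illustration of the rule (separate from the classifier in this notebook), the sketch below computes $P(Spam|\text{"your"})$ at the message level, without smoothing. The helper name `posterior` is an assumption made for this example. The fractions come from this notebook's dataset: 4 of the 7 previous emails are spam, "your" occurs in 3 of the 4 spam emails and in 1 of the 3 ham emails.

```python
from fractions import Fraction


def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B), expanding P(B) with the law of total probability."""
    prior_not_a = 1 - prior_a
    evidence = likelihood_b_given_a * prior_a + likelihood_b_given_not_a * prior_not_a
    return likelihood_b_given_a * prior_a / evidence


p_spam = Fraction(4, 7)             # P(Spam): 4 spam emails out of 7
p_your_given_spam = Fraction(3, 4)  # "your" appears in 3 of 4 spam emails
p_your_given_ham = Fraction(1, 3)   # "your" appears in 1 of 3 ham emails

p_spam_given_your = posterior(p_spam, p_your_given_spam, p_your_given_ham)
print(p_spam_given_your)  # 3/4
```

The result agrees with a direct count: four previous emails contain "your" and three of them are spam.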

### Data

In [1]:
previous_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
previous_ham = ['Your activity report','benefits physical activity', 'the importance vows']
new_emails = {'spam':['renew your password', 'renew your vows'], 'ham':['benefits of our account', 'the importance of physical activity']}

### Implementation

Below is a `defaultdict` implementation, similar to the `defaultdict` provided by the `collections` library. The standard dictionary includes the method `setdefault()` for retrieving a value and establishing a default if the value does not exist. By contrast, `defaultdict` lets the caller specify the default (the value to be returned) up front when the container is initialized. This avoids a `KeyError` being raised when a key doesn't exist within the `dict`.

In [2]:
def defaultdict(default_type) -> 'DefaultDict':
    """
    Returns a defaultdict object.
    
    Parameters
    ----------
    default_type
        A default type which is used to set the value of item.
            
    Returns
    -------
    DefaultDict
        An instance of DefaultDict.
    """
    class DefaultDict(dict):
        def __getitem__(self, key):
            if key not in self:
                dict.__setitem__(self, key, default_type())
            return dict.__getitem__(self, key)
    return DefaultDict()
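
For comparison, the short sketch below contrasts `dict.setdefault()` with the standard library's `collections.defaultdict` (imported under the alias `std_defaultdict` so that it does not shadow the function defined above). It also shows a side effect worth keeping in mind: reading a missing key with `[]` inserts it, whereas `.get()` does not.

```python
from collections import defaultdict as std_defaultdict

words = "send us your password send".split()

# Plain dict: the default is supplied at every call site via setdefault().
plain_counts = {}
for word in words:
    plain_counts[word] = plain_counts.setdefault(word, 0) + 1

# collections.defaultdict: the default factory is given once, at construction.
auto_counts = std_defaultdict(int)
for word in words:
    auto_counts[word] += 1

print(plain_counts == dict(auto_counts))  # True

# Reading a missing key with [] inserts it with the default value...
_ = auto_counts["renew"]
print("renew" in auto_counts)  # True

# ...whereas .get() returns the fallback without inserting anything.
print(auto_counts.get("vows", 0), "vows" in auto_counts)  # 0 False
```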

### Tokenize all words within a text, remove fill words and punctuation

In [3]:
def remove_fill_words(text):
    fill_words = [
        "as",
        "to",
        "and",
        "the",
        "a",
        "of",
        "our"
    ]
        
    return [word for word in text.split() if word not in fill_words]

In [4]:
def remove_punctuation_and_fill_words(text: str) -> list[str]:
    """
    Removes punctuation and fill words from the passed-in text and lower-cases it.
    
    Parameters
    ----------
    text: str
        The text to clean and tokenize.
            
    Returns
    -------
    list[str]
        The lower-cased words of the text, without punctuation or fill words.
    """
    # Raw string so that the backslash is taken literally rather than as an escape.
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~,.-'''
    clean_text = ''.join(char for char in text if char not in punctuations).lower()
    return remove_fill_words(clean_text)
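
An alternative way to write the same cleaning step is sketched below using `str.translate` with `string.punctuation`. This is an illustration rather than the notebook's method: the names `tokenize`, `FILL_WORDS` and `_STRIP_PUNCTUATION` are assumptions, and `string.punctuation` covers a slightly larger character set than the hand-written list above.

```python
import string

# Same fill words as in remove_fill_words; a frozenset gives constant-time lookups.
FILL_WORDS = frozenset({"as", "to", "and", "the", "a", "of", "our"})

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def tokenize(text: str) -> list[str]:
    """Lower-case the text, drop punctuation, split on whitespace, drop fill words."""
    cleaned = text.translate(_STRIP_PUNCTUATION).lower()
    return [word for word in cleaned.split() if word not in FILL_WORDS]


print(tokenize("Review our website!"))  # ['review', 'website']
```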

### Train the Naive Bayes Classifier

In [5]:
def train_naive_bayes(previous_ham: list[str], previous_spam: list[str]):
    """
    Computes the counts required for classification from the previous emails.
    
    Parameters
    ----------
    previous_ham: list[str]
        A list of previous emails that were categorised as HAM
        
    previous_spam: list[str]
        A list of previous emails that were categorised as SPAM
            
    Returns
    -------
    spam_word_counts: DefaultDict
        Dictionary with frequency of each word in previous spam emails - Nwi|spam
        
    ham_word_counts: DefaultDict
        Dictionary with frequency of each word in previous ham emails - Nwi|ham
        
    total_spam_messages: int
        The count of previous spam emails
        
    total_ham_messages: int
        The count of previous ham emails
        
    total_words_in_spam: int 
        The count of words in previous spam emails, taken before fill words are removed - Nspam
        
    total_words_in_ham: int
        The count of words in previous ham emails, taken before fill words are removed - Nham
    """
    spam_word_counts = defaultdict(int)
    ham_word_counts = defaultdict(int)
    
    # Count the number of previous spam and ham email
    total_spam_messages = len(previous_spam)
    total_ham_messages = len(previous_ham)

    # Count the number of words in previous spam and ham emails
    total_words_in_spam = sum(len(email.split()) for email in previous_spam)
    total_words_in_ham = sum(len(email.split()) for email in previous_ham)
    
    # Count the frequency of each word in previous spam emails
    for email in previous_spam:
        words = remove_punctuation_and_fill_words(email)
        for word in words:
            spam_word_counts[word] += 1

    # Count the frequency of each word in previous ham emails
    for email in previous_ham:
        words = remove_punctuation_and_fill_words(email)
        for word in words:
            ham_word_counts[word] += 1
    
    return spam_word_counts, ham_word_counts, total_spam_messages, total_ham_messages, total_words_in_spam, total_words_in_ham
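
The counting step can also be sketched with `collections.Counter`, shown here on the previous spam emails only. The helper `count_words` and the constant `FILL_WORDS` are assumptions made for this example, and punctuation handling is omitted because the dataset contains none.

```python
from collections import Counter

FILL_WORDS = {"as", "to", "and", "the", "a", "of", "our"}


def count_words(emails):
    """Count lower-cased words across all emails, skipping fill words."""
    counts = Counter()
    for email in emails:
        counts.update(word for word in email.lower().split() if word not in FILL_WORDS)
    return counts


previous_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
spam_counts = count_words(previous_spam)

# A Counter returns 0 for unseen words without inserting them.
print(spam_counts["send"], spam_counts["password"], spam_counts["renew"])  # 3 2 0
print(sum(spam_counts.values()))  # 13
```

Note that the filtered token total (13) differs from the raw word total of 14 that `train_naive_bayes` obtains with `email.split()`, because "our" is excluded from the word counts but not from the total.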

### Laplace Smoothing

Laplace smoothing is a technique that handles the problem of zero probability in Naïve Bayes. Without it, a single word that never occurred in a class has a conditional probability of zero, which forces the whole product for that class to zero.

$$
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{UniqueWords}}
$$

- $\alpha$ represents the smoothing parameter.
- $N_{w_i|Spam}$ represents the number of times the word occurs in SPAM emails.
- $N_{Spam}$ represents the count of words in SPAM emails.
- $N_{UniqueWords}$ represents the count of unique words in both SPAM and HAM previous emails.

$$
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{UniqueWords}}
$$

- $\alpha$ represents the smoothing parameter.
- $N_{w_i|Ham}$ represents the number of times the word occurs in HAM emails.
- $N_{Ham}$ represents the count of words in HAM emails.
- $N_{UniqueWords}$ represents the count of unique words in both SPAM and HAM previous emails.


If a value of $\alpha \neq 0$ is chosen, the probability is no longer zero even if a word is not present in the training dataset.
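
As a worked example of the two formulas, the sketch below evaluates them for the word "password" with $\alpha = 1$. The helper name `smoothed_probability` is an assumption. The counts are taken from this notebook's data: "password" occurs 2 times in previous spam and 0 times in previous ham, there are 14 words in spam and 9 in ham, and 13 unique words overall.

```python
from fractions import Fraction


def smoothed_probability(word_count, total_words, vocabulary_size, alpha=1):
    """Laplace-smoothed estimate of P(word | class) as an exact fraction."""
    return Fraction(word_count + alpha, total_words + alpha * vocabulary_size)


p_password_given_spam = smoothed_probability(2, 14, 13)  # (2 + 1) / (14 + 13)
p_password_given_ham = smoothed_probability(0, 9, 13)    # (0 + 1) / (9 + 13)

print(p_password_given_spam, p_password_given_ham)  # 1/9 1/22
```

Although "password" never appears in the ham emails, its smoothed probability given HAM is 1/22 rather than zero.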

The implementation below classifies an email as SPAM when the score for $P(Spam|w_1, \dots, w_n)$ is greater than the score for $P(Ham|w_1, \dots, w_n)$; otherwise it classifies the email as HAM.

In [6]:
def predict_naive_bayes(email: str, spam_word_counts: 'DefaultDict', ham_word_counts: 'DefaultDict', total_spam_messages: int, total_ham_messages: int, total_words_in_spam: int, total_words_in_ham: int):
    """
    Classifies an email as SPAM or HAM based on the comparison between the spam and ham scores.
    
    Parameters
    ----------
    email: str
        The email text to classify.
        
    spam_word_counts: DefaultDict
        Dictionary with frequency of each word in previous spam emails - Nwi|spam
        
    ham_word_counts: DefaultDict
        Dictionary with frequency of each word in previous ham emails - Nwi|ham
        
    total_spam_messages: int
        The count of previous spam emails
        
    total_ham_messages: int
        The count of previous ham emails
        
    total_words_in_spam: int 
        The count of words in previous spam emails - Nspam
        
    total_words_in_ham: int
        The count of words in previous ham emails - Nham
            
    Returns
    -------
    prediction: str
        The email classification: 'spam' if the spam score is greater than the ham score, otherwise 'ham'.
        
    spamminess_score: float
        The ratio of the spam score to the ham score.
    """
    # Tokenize email and lower case each token
    email_tokenized = remove_punctuation_and_fill_words(email)
    
    # Laplace smoothing
    alpha = 1

    # Calculate P(S) -> P(Spam)
    spam_score = total_spam_messages / (total_spam_messages + total_ham_messages)
    # Calculate P(¬S) -> P(Ham)
    ham_score = total_ham_messages / (total_spam_messages + total_ham_messages)
    
    # Count of unique words in both Spam and Ham
    no_of_unique_words = len(set().union(spam_word_counts.keys(), ham_word_counts.keys()))

    for word in email_tokenized:
        # Calculate the conditional probabilities P(w_i|Spam) and P(w_i|Ham).
        # .get() is used instead of [] so that looking up an unseen word does not
        # insert it into the DefaultDict and grow the vocabulary between predictions.
        prob_word_given_spam = (spam_word_counts.get(word, 0) + alpha) / (total_words_in_spam + (alpha*no_of_unique_words))
        prob_word_given_ham = (ham_word_counts.get(word, 0) + alpha) / (total_words_in_ham + (alpha*no_of_unique_words))
        
        # Update the spam and ham scores by multiplying them with the calculated conditional probabilities for each word in the email.
        spam_score *= prob_word_given_spam
        ham_score *= prob_word_given_ham
    
    # The spamminess score is calculated as the ratio of the spam score to the ham score.
    spamminess_score = spam_score / ham_score
    
    # Classify SPAM if spam_score > ham_score
    prediction = 'spam' if spam_score > ham_score else 'ham' 
    
    return prediction, spamminess_score
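
Multiplying many small probabilities can underflow to zero for long emails. A common remedy, sketched here as an alternative and not as the method used above, is to add logarithms instead. The function `log_score` is an assumption for illustration. The counts are those of this notebook's previous emails, with the totals 14 and 9 and the vocabulary size 13 hard-coded.

```python
import math


def log_score(tokens, word_counts, total_words, vocabulary_size, prior, alpha=1):
    """Return log(prior) + sum of log Laplace-smoothed word probabilities."""
    score = math.log(prior)
    denominator = total_words + alpha * vocabulary_size
    for word in tokens:
        score += math.log((word_counts.get(word, 0) + alpha) / denominator)
    return score


spam_counts = {"send": 3, "us": 2, "your": 3, "password": 2,
               "review": 1, "website": 1, "account": 1}
ham_counts = {"your": 1, "activity": 2, "report": 1, "benefits": 1,
              "physical": 1, "importance": 1, "vows": 1}

tokens = ["renew", "your", "password"]
log_spam = log_score(tokens, spam_counts, 14, 13, prior=4 / 7)
log_ham = log_score(tokens, ham_counts, 9, 13, prior=3 / 7)

# Comparing log scores gives the same decision as comparing the products.
prediction = "spam" if log_spam > log_ham else "ham"
print(prediction)  # spam
```

The difference `log_spam - log_ham` is the logarithm of the spamminess ratio, so a threshold other than 1 on the ratio corresponds to comparing this difference with the logarithm of that threshold.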

In [7]:
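def pritty_print(lst, msg):
    print(f"\033[1m{msg}\033[0m\n")
    for index, obj in enumerate(lst):
        # Entries are either plain strings or dicts with 'email' and 'spamminess_score' keys.
        # isinstance avoids a substring match such as 'email' in 'some email text'.
        is_record = isinstance(obj, dict)
        print(f"{index+1}.\033[1m{'Email:':<40}\033[0m {obj['email'] if is_record else obj}")
        if is_record and 'spamminess_score' in obj:
            print(f"{index+1}.\033[1m{'Spaminess Score:':<40}\033[0m {obj['spamminess_score']}\n")
    print()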

### Compare the predicted values with the actual values to measure how well the spam filter classifies new emails

$$
\text{Accuracy} = \frac{\text{no. of correctly classified emails}}{\text{total number of classified emails}}
$$

In [8]:
def calculate_accuracy(classified_emails: dict, new_emails: dict):
    """Prints how many new emails were classified under their actual label."""
    total = sum(len(emails) for emails in new_emails.values())
    
    # An email counts as correct when it appears under the same key in both dictionaries.
    spam_count = len(set(new_emails['spam']) & {email['email'] for email in classified_emails['spam']})
    ham_count = len(set(new_emails['ham']) & {email['email'] for email in classified_emails['ham']})
    correct = spam_count + ham_count
    
    print('Correct:', correct)
    print('Incorrect:', total - correct)
    print('Accuracy:', correct / total)
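
The accuracy formula can also be applied directly to (actual, predicted) label pairs, which keeps duplicate emails from collapsing into one set element. The sketch below is illustrative: the function name `accuracy` and the four example pairs are assumptions, chosen to show three correct predictions out of four.

```python
def accuracy(labelled_predictions):
    """Return the fraction of (actual, predicted) pairs that agree."""
    if not labelled_predictions:
        raise ValueError("no predictions to score")
    correct = sum(actual == predicted for actual, predicted in labelled_predictions)
    return correct / len(labelled_predictions)


results = [("spam", "spam"), ("spam", "ham"), ("ham", "ham"), ("ham", "ham")]
print(accuracy(results))  # 0.75
```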

### Use the classifier to predict whether new emails are SPAM or HAM

In [9]:
spam_word_counts, ham_word_counts, total_spam_messages, total_ham_messages, total_words_in_spam, total_words_in_ham = train_naive_bayes(previous_ham, previous_spam)
    
classified_emails = {'spam': [], 'ham': []}

for label, emails in new_emails.items():
    for email in emails:
        prediction, spamminess_score = predict_naive_bayes(email, spam_word_counts, ham_word_counts, total_spam_messages, total_ham_messages, total_words_in_spam, total_words_in_ham)
        classified_emails[prediction].append({"email": email, "spamminess_score": spamminess_score})

In [10]:
pritty_print(previous_spam, "Previous SPAM")
pritty_print(previous_ham, "Previous HAM")

pritty_print(new_emails['spam'], "Actual SPAM")
pritty_print(new_emails['ham'], "Actual HAM")

pritty_print(classified_emails['spam'], "Bayes Classified SPAM")
pritty_print(classified_emails['ham'], "Bayes Classified HAM")

Previous SPAM

1.Email:                                   send us your password
2.Email:                                   review our website
3.Email:                                   send your password
4.Email:                                   send us your account

Previous HAM

1.Email:                                   Your activity report
2.Email:                                   benefits physical activity
3.Email:                                   the importance vows

Actual SPAM

1.Email:                                   renew your password
2.Email:                                   renew your vows

Actual HAM

1.Email:                                   benefits of our account
2.Email:                                   the importance of physical activity

Bayes Classified SPAM

1.Email:                                   renew your password
1.

In [11]:
calculate_accuracy(classified_emails, new_emails)

Correct: 3
Incorrect: 1
Accuracy: 0.75


### Generated email subject lines to enlarge the training set and improve the model's accuracy

In [12]:
extra_spa_samples = new_emails['spam'] + [
    "Exciting Product Launch: Don't Miss Our Latest Gadgets!",
    "Join Our Webinar: Digital Marketing Strategies for Success",
    "Adventure Retreats: Explore Remote Wilderness with Us",
    "Exclusive Sale: Get 40% Off on Latest Fashion Trends",
    "Mastering Time Management: Boost Your Productivity",
    "Culinary Journey: Explore International Cuisine with Us",
    "Career Advancement: Strategies for Success in the Workplace",
    "Sustainable Living: Eco-Friendly Tips for a Greener Future",
    "Art & Culture Festival: Celebrating Diversity in Our City",
    "Discover Hidden Treasures: Antique Collecting Insights",
    "Space Exploration News: Unprecedented Discoveries",
    "Smart Living: Transform Your Home with Technology",
    "Hiking Adventures: Conquer Majestic Peaks and Trails",
    "Women in Leadership: Empowering Female Entrepreneurs",
    "Historical Mysteries: Explore Ancient Civilizations",
    "eSports Phenomenon: Dive into Competitive Gaming",
    "Photography Mastery: Learn from the Pros",
    "Financial Planning: Tips for a Secure Future",
    "Mindfulness & Meditation: Find Inner Peace and Resilience",
    "Fashion Evolution: Styles through the Decades"
]

extra_ham_samples = new_emails['ham'] + [
    "Explore the Secrets of Ancient Egypt: Unearth Mysteries of the Pharaohs, Pyramids, and Hieroglyphs in Our Archaeological Expedition",
    "Revolutionize Your Home Office: Discover Ergonomic Furniture, High-Tech Gadgets, and Office Equipment for Ultimate Comfort and Productivity",
    "Journey to the Stars: Join Our Astronaut Training Program and Experience the Rigorous Preparations for Space Exploration",
    "Immerse Yourself in World Literature: Our Acclaimed Book Club Delves Deep into Timeless Classics and Engages in Thoughtful Literary Discussions",
    "Embark on an Epic Culinary Adventure: Experience the Art of Cooking with Michelin-Starred Chefs and Master the Secrets of Gourmet Cuisine",
    "Unleash Your Inner Explorer: Join Our Adventure Club and Traverse the Most Remote and Breathtaking Corners of the Globe",
    "Redefine Sustainable Living: Eco-Friendly Practices and Green Innovations for a More Environmentally Conscious Lifestyle",
    "Elevate Your Career: Uncover the Secrets of Effective Leadership and Team Management in Our Comprehensive Leadership Development Program",
    "Escape to Tropical Paradise: Our Luxury Island Retreat in the Maldives Offers Pristine Beaches, Sunshine, and Overwater Bungalows",
    "Transform Your Backyard into a Serene Oasis: Explore Our Luxurious Outdoor Furniture and Garden Decor Made from Sustainable Materials",
    "Dive Deep into Ocean Conservation: Join Our Marine Biology Expedition for Hands-On Research, Diving, and Wildlife Encounters",
    "Experience the Power of Giving: Our Philanthropy Symposium Focuses on Transforming Lives and Making a Positive Impact in Your Community",
    "A Journey Through Ancient History: Explore Mythology, Civilizations, and Archaeological Sites That Shaped Our World",
    "Discover the Future of Transportation: Get Behind the Wheel of Electric Vehicles with Advanced Autonomous Features and Eco-Friendly Mobility Solutions",
    "Charting New Horizons in Medicine: Explore the Latest Healthcare Innovations and Breakthroughs with Insights from Medical Professionals",
    "Unleash Your Creativity: Join Our Workshop Series with Renowned Artists and Innovators to Explore New Perspectives in Art and Design",
    "Elevate Your Financial Literacy: Enroll in Our Personal Finance Course Covering Budgeting, Investing, Retirement Planning, and Wealth Management",
    "Witness the Majesty of Opera: Our Exclusive Opera Tour Takes You to Prestigious Venues in Europe, Where You'll Meet Opera Stars and Enjoy World-Class Performances",
    "Savor the Flavors of the World: Join Our International Food Festival and Delight in Culinary Delights from Around the Globe",
    "The Future of Education: Explore Innovations in Learning and Teaching in Our Symposium with Expert Educators and Visionaries"
]

In [13]:
spam_word_counts, ham_word_counts, total_spam_messages, total_ham_messages, total_words_in_spam, total_words_in_ham = train_naive_bayes(previous_ham + extra_ham_samples, previous_spam + extra_spa_samples)
    
classified_emails = {'spam': [], 'ham': []}

for label, emails in new_emails.items():
    for email in emails:
        prediction, spamminess_score = predict_naive_bayes(email, spam_word_counts, ham_word_counts, total_spam_messages, total_ham_messages, total_words_in_spam, total_words_in_ham)
        classified_emails[prediction].append({"email": email, "spamminess_score": spamminess_score}) 
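
The cell above retrains from scratch on the enlarged lists. An alternative is to update the counts one labelled email at a time. The sketch below shows one way to do that, assuming a hypothetical `model` dictionary and `learn` helper that are not part of this notebook, and assuming the emails have already been tokenized.

```python
from collections import Counter


def learn(model, email_tokens, label):
    """Fold one labelled, tokenized email into the running counts."""
    model["word_counts"][label].update(email_tokens)
    model["message_counts"][label] += 1


# Hypothetical model layout: per-class word counts and per-class message counts.
model = {
    "word_counts": {"spam": Counter(), "ham": Counter()},
    "message_counts": Counter(),
}

learn(model, ["renew", "your", "vows"], "spam")
learn(model, ["benefits", "account"], "ham")
learn(model, ["renew", "your", "password"], "spam")

print(model["message_counts"]["spam"], model["word_counts"]["spam"]["renew"])  # 2 2
```

With this layout, the priors and the smoothed word probabilities can be recomputed from the running counts whenever a prediction is needed.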

In [14]:
pritty_print(previous_spam, "Previous SPAM")
pritty_print(previous_ham, "Previous HAM")

pritty_print(new_emails['spam'], "Actual SPAM")
pritty_print(new_emails['ham'], "Actual HAM")

pritty_print(classified_emails['spam'], "Bayes Classified SPAM")
pritty_print(classified_emails['ham'], "Bayes Classified HAM")

Previous SPAM

1.Email:                                   send us your password
2.Email:                                   review our website
3.Email:                                   send your password
4.Email:                                   send us your account

Previous HAM

1.Email:                                   Your activity report
2.Email:                                   benefits physical activity
3.Email:                                   the importance vows

Actual SPAM

1.Email:                                   renew your password
2.Email:                                   renew your vows

Actual HAM

1.Email:                                   benefits of our account
2.Email:                                   the importance of physical activity

Bayes Classified SPAM

1.Email:                                   renew your password
1.

In [15]:
calculate_accuracy(classified_emails, new_emails)

Correct: 4
Incorrect: 0
Accuracy: 1.0


<hr style="border:2px solid gray"> </hr>

## Reflection