# Notebook E-tivity 3 CE4021 Task 2

Student name: Vaclav Krol

Student ID: 23307102

<hr style="border:2px solid gray"> </hr>

## Imports

In [576]:
#None

If you believe required imports are missing, please contact your moderator.

<hr style="border:2px solid gray"> </hr>

## Task 2

Use the information below to create a Naive Bayes SPAM filter. Test your filter using the messages in new_emails. You may add as many cells as you require to complete the task.

In [581]:
previous_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
previous_ham = ['Your activity report','benefits physical activity', 'the importance vows']
new_emails = {'spam':['renew your password', 'renew your vows'], 'ham':['benefits of our account', 'the importance of physical activity']}

<hr style="border:2px solid gray"> </hr>

### Bayes Theorem

In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule) describes the probability of an event based on prior knowledge of conditions that might be related to the event.

Bayes' theorem is stated mathematically as the following equation:

$$
P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}
$$

where $A$ and $B$ are events and $P(B) \neq 0$.

* $P(A \mid B)$ is the conditional probability of event A given that event B is true
* $P(B \mid A)$ is the conditional probability of event B given that event A is true
* $P(A)$ and $P(B)$ are the probabilities of observing A and B respectively without any given conditions; they are known as the prior probability and marginal probability.
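
As a small illustration of the equation above, the following sketch evaluates Bayes' theorem for one set of made-up numbers. The function name `bayes_posterior` and the probabilities used are assumptions chosen only for this example; they are not taken from the task data.

```python
def bayes_posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A | B) computed from P(B | A), P(A) and P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be non-zero")
    return p_b_given_a * p_a / p_b


# Hypothetical numbers: 40% of all emails are spam (P(A)),
# a given word appears in 50% of spam emails (P(B | A))
# and in 25% of all emails (P(B)).
p_spam_given_word = bayes_posterior(p_b_given_a=0.5, p_a=0.4, p_b=0.25)
print(f"P(spam | word) = {p_spam_given_word:.2f}")
```

With these assumed inputs the posterior is 0.5 × 0.4 / 0.25 = 0.8, so seeing the word raises the spam probability from the 40% prior to 80%.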


### Naive Bayes classifier

The Naive Bayes classifier is a classification technique based on Bayes' theorem with an independence assumption among predictors. In simple terms, it assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Despite this simplification, the algorithm often performs well in practice, particularly for text classification tasks such as spam filtering.

This is the Naive Bayes formula:

$$P(S \mid x_1, \ldots, x_n) \approx \frac{P(S) \prod_{i=1}^n P(x_i \mid S)}{P(S) \prod_{i=1}^n P(x_i \mid S) \quad + \quad P(H) \prod_{i=1}^n P(x_i \mid H)}$$
<br>

In plain English, using Bayesian probability terminology, the above equation can be written as:

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$


The prior for a given class is then:
$$\text{prior for a given class} = \frac{\text{no. of samples in that class}}{\text{total no. of samples}}$$
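
To make the prior formula concrete, here is a minimal sketch that applies it to the labelled emails given in the task. The variable names `prior_spam` and `prior_ham` are illustrative only.

```python
# Training data as given in the task description
previous_spam = ['send us your password', 'review our website',
                 'send your password', 'send us your account']
previous_ham = ['Your activity report', 'benefits physical activity',
                'the importance vows']

n_spam = len(previous_spam)
n_ham = len(previous_ham)

# prior = samples in the class / total samples
prior_spam = n_spam / (n_spam + n_ham)
prior_ham = n_ham / (n_spam + n_ham)

print(f"P(S) = {prior_spam:.4f}, P(H) = {prior_ham:.4f}")
```

With four spam and three ham emails, the priors are 4/7 and 3/7.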



#### In the context of our spam classifier I will apply the Naive Bayes formula as follows:

1. Compute the prior probability of spam from the hand-labelled data:
$$ P(S) = \frac{|spam\, emails|} {(|spam\, emails|\, +\, |ham\, emails|)}$$
<br>
2. Compute the prior probability of ham from the hand-labelled data:
$$ P(H) = \frac{|ham\, emails|} {(|spam\, emails|\, +\, |ham\, emails|)}$$
<br>

3. Iterate over the previously labelled spam emails and, for each word $w$ in the entire training set, count how many of the spam emails contain $w$. Compute the conditional probability:
$$P(w \mid S) = \frac{|spam\, emails\, containing\, w|\, +\, 1} {|spam\, emails|\, +\, 2} $$
<br>
4. Compute $P(w \mid H)$ the same way for ham emails:
$$P(w \mid H) = \frac{|ham\, emails\, containing\, w|\, +\, 1} {|ham\, emails|\, +\, 2} $$
<br><br>
5. Given a set of new unlabelled test emails, iterate over each email and:
- Create a set $\{x_1, \ldots, x_n\}$ of the distinct words in the email. Ignore the words that you haven't seen in the labelled training data.
- Compute<br>
$$P(S \mid x_1, \ldots, x_n) \approx \frac{P(S) \prod_{i=1}^n P(x_i \mid S)}{P(S) \prod_{i=1}^n P(x_i \mid S) \quad + \quad P(H) \prod_{i=1}^n P(x_i \mid H)}$$
<br>
- If $P(S \mid x_1, \ldots, x_n)$ exceeds a defined threshold, the email is considered "spam"; otherwise it is "ham".
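
The five steps above can be sketched as two small functions. This is a minimal illustration that applies the +1 / +2 smoothing to every word exactly as written in steps 3 and 4, so its numbers can differ from an implementation that smooths only unseen words. The function names `train` and `spam_probability` and the 0.5 threshold are assumptions made for this sketch.

```python
def train(spam_emails, ham_emails):
    """Steps 1-4: return priors and smoothed per-word probabilities."""
    def doc_counts(emails):
        # Number of emails containing each word (each word counted once per email)
        counts = {}
        for email in emails:
            for word in set(email.lower().split()):
                counts[word] = counts.get(word, 0) + 1
        return counts

    n_spam, n_ham = len(spam_emails), len(ham_emails)
    spam_counts, ham_counts = doc_counts(spam_emails), doc_counts(ham_emails)
    vocabulary = set(spam_counts) | set(ham_counts)

    p_w_spam = {w: (spam_counts.get(w, 0) + 1) / (n_spam + 2) for w in vocabulary}
    p_w_ham = {w: (ham_counts.get(w, 0) + 1) / (n_ham + 2) for w in vocabulary}
    p_spam = n_spam / (n_spam + n_ham)
    return p_spam, 1 - p_spam, p_w_spam, p_w_ham


def spam_probability(email, model):
    """Step 5: return P(S | x1, ..., xn) for one email."""
    p_spam, p_ham, p_w_spam, p_w_ham = model
    score_spam, score_ham = p_spam, p_ham
    for word in set(email.lower().split()):
        if word in p_w_spam:  # ignore words never seen in training
            score_spam *= p_w_spam[word]
            score_ham *= p_w_ham[word]
    return score_spam / (score_spam + score_ham)


previous_spam = ['send us your password', 'review our website',
                 'send your password', 'send us your account']
previous_ham = ['Your activity report', 'benefits physical activity',
                'the importance vows']

model = train(previous_spam, previous_ham)
THRESHOLD = 0.5  # assumed threshold for this sketch
for email in ['renew your password', 'the importance of physical activity']:
    p = spam_probability(email, model)
    print(f"{email!r}: {p:.4f} -> {'spam' if p > THRESHOLD else 'ham'}")
```

Under these assumptions, 'renew your password' scores about 0.85 (spam) and 'the importance of physical activity' scores well below 0.5 (ham).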


<hr style="border:2px solid gray"> </hr>

First, let's set a couple of constants.

In [582]:
### Constants

# Key name for new unlabelled spam emails
SPAM_DICT_KEY = 'spam'

# Key name for new unlabelled ham emails
HAM_DICT_KEY = 'ham'



<hr style="border:2px solid gray"> </hr>

In [583]:
class Bayes_Spam_Filter():
    '''
    Class implementing the Naive Bayes Filter used for computing probability of an email being a SPAM
    '''
    
    # Constants used by class methods

    # Laplace smoothing - the "alpha" parameter added to the numerator so that
    # a word not seen in one class does not produce a zero probability
    SMOOTH_ALPHA = 1

    # Laplace smoothing - the "K" parameter added to the denominator to avoid division by zero
    # It is set to "2" because each word has 2 possible outcomes in an email - present or absent
    SMOOTH_K = 2
    
    # A simple list of unimportant (stop) words
    # The user can set whether they should be removed from the algorithm or not
    UNIMPORTANT_WORDS = ['us', 'the', 'of', 'your', 'our']


    def __init__(self) -> None:
        '''
        Constructor - no parameters expected
        Initializes variables and builds structures that are used later during the learning process
        
        ::Params - none
        '''
    
        # Numbers of emails
        self.__nb_train_emails_spam = 0
        self.__nb_train_emails_ham = 0
        self.__nb_train_emails_all = 0

        # Under normal circumstances it would be better to define all variables as private (__x),
        # but I want to have public access to them for printing results,
        # and creating getters does not make much sense for values or dictionaries

        # General probability of spam and ham based on train data evidence
        self.pr_s_prior = 0
        self.pr_h_prior = 0

        # Dictionary with spamicity - the probability of each word occurring in spam emails based on evidence
        # This dictionary is structured as "{word1: spamicity, word2: spamicity, etc.}"
        self.word_spamicity = {}

        # Dictionary with hamicity - the probability of each word occurring in ham emails based on evidence
        # This dictionary is structured as "{word1: hamicity, word2: hamicity, etc.}"
        self.word_hamicity = {}


       
    def learn(self, train_emails_spam: list[str], train_emails_ham: list[str]) -> None:
        '''
        This method implements the learning process based on the input spam and ham emails.
        It builds all necessary variables and structures used for computing bayes probability.
        
        ::Params
        train_emails_spam - list of emails labeled as SPAM
        train_emails_ham  - list of emails labeled as HAM
        '''
        
        # Number of emails in train lists of emails
        self.__nb_train_emails_spam = len(train_emails_spam)
        self.__nb_train_emails_ham = len(train_emails_ham)
        self.__nb_train_emails_all = self.__nb_train_emails_spam + self.__nb_train_emails_ham

        # General probability of spam and ham based on train data evidence
        self.pr_s_prior = self.__nb_train_emails_spam / self.__nb_train_emails_all
        self.pr_h_prior = self.__nb_train_emails_ham / self.__nb_train_emails_all

        # Building the dictionary with spamicity for each word in SPAM emails
        self.build_word_probs(train_emails_spam, self.word_spamicity)

        # Building the dictionary with hamicity for each word in HAM emails
        self.build_word_probs(train_emails_ham, self.word_hamicity)


        
    def build_word_counts(self, email_list: list[str]) -> dict:
        '''
        Returns a dictionary of "word:frequency" items based on the given list of emails.
        Each unique word from the email_list is a dictionary "key" and the value for that key
        represents the number of occurrences of this word in the whole email_list.
        All words are changed to lowercase.

        ::Params
        email_list: A list of emails represented as list[str] to build the dictionary on 
        ( for example: ['Your activity report','Your benefits activity'] )
        It is assumed that words in strings are split by space
    
        ::Returns
        A dictionary of "word:frequency" items based on the given list of emails.
        Output for the example above: {"activity": 2, "benefits": 1, "your": 2, "report": 1}
        '''
        
        # Init the dictionary
        word_counts = {}
        
        # Let's loop through the list and create items
        for word_list in [sentence.split() for sentence in email_list]:
            for word in word_list:
                word = word.lower()
                if (word in word_counts):
                    word_counts[word] += 1
                else:
                    word_counts[word] = 1
                    
        return word_counts
    


    def build_word_probs(self, email_list: list[str], word_probs: dict) -> None:
        '''
        Fills the word_probs dictionary with "word:probability" items based on the given list of emails.
        Each unique word from the email_list is a dictionary "key" and the value for that key
        is the number of occurrences of this word divided by the number of emails in the email_list.
        Any existing content of word_probs is cleared first.
        Case insensitive

        ::Params
        email_list: A list of emails to learn from represented as list[str]
        ( for example: ['Your activity report','Your benefits activity'] )
        It is assumed that words in strings are split by space
        word_probs: The dictionary to fill in place with "word:probability" items
    
        ::Returns
        None - the result is written into word_probs.
        Output for the example above: {"your": 1.0, "activity": 1.0, "report": 0.5, "benefits": 0.5}
        '''
        
        # At first let's build the word counts from the given list
        word_counts = self.build_word_counts(email_list)
        
        # Clearing the probabilities if there is anything in it
        word_probs.clear()

        # An open question: should the denominator be "the number of all emails" or "the number of all words"?
        # Different students take different approaches here; this implementation uses the number of emails.
        # Alternative: denom_total = sum(word_counts.values())
        denom_total = len(email_list)
        for word, count in word_counts.items():
            # No smoothing is applied to seen words here; unseen words are smoothed in calc_bayes_spam_prob.
            # Smoothed variant: (count + self.SMOOTH_ALPHA) / (denom_total + self.SMOOTH_K)
            word_probs[word] = count / denom_total



    def calc_bayes_spam_prob(self, email: str, remove_stop_words: bool) -> float:
        '''
        Calculates the probability of the input email being a spam. Returns a float value in the range (0, 1)
        The function is based on the Naive Bayes Theorem.
        IMPORTANT: The algorithm has to be trained first by running the "learn" method.
        Only then this function can be called.
        
        ::Params
        email: the email for which to check the spam probability
        remove_stop_words: whether the algorithm should exclude the stop words from the computations

        ::Returns
        Returns a float value representing the spam probability of the given email in the range (0, 1),
        based on the train set of SPAM and HAM emails.
        '''
    
        # Create a list of all words from the given email (in lowercase)
        email_words = email.lower().split()

        # To get better accuracy, let's remove words not seen in the train sets of SPAM or HAM emails
        email_words = [word for word in email_words 
                            if word in self.word_spamicity or word in self.word_hamicity]
        
        # Removing unimportant (stop) words - if asked
        if (remove_stop_words):
            email_words = [word for word in email_words if word not in self.UNIMPORTANT_WORDS]
            
        # This is the initial spam and ham probability based on prior evidence
        p_s = self.pr_s_prior
        p_h = self.pr_h_prior
    
        # Calculating the probability for each word and multiplying it into the running SPAM and HAM products
        for word in email_words:
            
            # Computing p_w_s - word in SPAM
            if word in self.word_spamicity:
                p_w_s = self.word_spamicity[word]
            else:
                # Smoothing for word not seen in spam training data
                p_w_s = self.SMOOTH_ALPHA / (self.__nb_train_emails_spam + self.SMOOTH_K)
            
            # And the SPAM probability of this word gets multiplied with the probabilities of previous words
            p_s *= p_w_s
          

            # Computing p_w_h - word in HAM
            if word in self.word_hamicity:
                p_w_h = self.word_hamicity[word]
            else:
                # Smoothing for word not seen in ham training data
                p_w_h = self.SMOOTH_ALPHA / (self.__nb_train_emails_ham + self.SMOOTH_K)

            # And the HAM probability of this word gets multiplied with the probabilities of previous words            
            p_h *= p_w_h

        # Calculating the final probability
        p_s_final = p_s / (p_s + p_h)

        return p_s_final
       


<hr style="border:2px solid black"> </hr>

### Running the classifier on the new set of emails and printing results

In [591]:
# Creating an instance of the Naive Bayes filter
bayes_spam_filter = Bayes_Spam_Filter()

# Running the learning process on the training emails
bayes_spam_filter.learn(previous_spam, previous_ham)

# Just for printing purposes
max_email_len = max([len(message) for emails in new_emails.values() for message in emails])

# We are running two tests - one for removed stop-words and one with them included
for remove_stop_words in (True, False):

    if (remove_stop_words):
        print('\n          Test results with stop words removed')
    else:
        print('\n\n\n          Test results with stop words included')  

    print((max_email_len+43)*'-')
    print(f'{"Email":{max_email_len}} {"Expected":10} {"Spam prediction":15} {"Result":15}')
    print((max_email_len * '-') + ' ' + '-' * 10 + ' ' + '-' * 15 + ' ' + '-' * 15)
    # Checking allegedly SPAM emails and their probability of being a spam
    for new_email in new_emails[SPAM_DICT_KEY]:
        p_s = bayes_spam_filter.calc_bayes_spam_prob(new_email, remove_stop_words)
        p_s_perc = f"{(p_s * 100.0):.2f} %"
        print(f'{new_email:{max_email_len}} {"spam":10} {p_s_perc:15} {"SPAM (p > 50%)!" if p_s > 0.5 else "HAM (p <= 50%)"}')

    # Checking allegedly HAM emails and their probability of being a spam
    for new_email in new_emails[HAM_DICT_KEY]:
        p_s = bayes_spam_filter.calc_bayes_spam_prob(new_email, remove_stop_words)
        p_s_perc = f"{(p_s * 100.0):.2f} %"
        print(f'{new_email:{max_email_len}} {"ham":10} {p_s_perc:15} {"SPAM (p > 50%)!" if p_s > 0.5 else "HAM (p <= 50%)"}')

        
# Printing prior probabilities
print("\n\n")
print("Prior probabilities")
print(20 * '-')
print(f'Spam: {bayes_spam_filter.pr_s_prior:.4f}')
print(f'Ham : {bayes_spam_filter.pr_h_prior:.4f}')
        
        
# Printing conditional word probabilities
print("\n\n")
print("Conditional probabilities (spamicity and hamicity):")
print(51 * '-')
print(f'{"Word":15} {"Spamicity":15} {"Hamicity":15}')
print((15 * '-') + ' ' + '-' * 15 + ' ' + '-' * 15)


# Combining all words from spamicity and hamicity
words_spam = set(bayes_spam_filter.word_spamicity.keys())
words_ham  = set(bayes_spam_filter.word_hamicity.keys())
words_all = words_spam.union(words_ham)

for word in words_all:
    spamicity = bayes_spam_filter.word_spamicity.get(word, "")
    spamicity_f = (f'{spamicity:.4f}' if spamicity else "")
    hamicity = bayes_spam_filter.word_hamicity.get(word, "")
    hamicity_f = (f'{hamicity:.4f}' if hamicity else "")
    print(f'{word:15} {spamicity_f:15} {hamicity_f:15}')




          Test results with stop words removed
------------------------------------------------------------------------------
Email                               Expected   Spam prediction Result         
----------------------------------- ---------- --------------- ---------------
renew your password                 spam       76.92 %         SPAM (p > 50%)!
renew your vows                     spam       40.00 %         HAM (p <= 50%)
benefits of our account             ham        45.45 %         HAM (p <= 50%)
the importance of physical activity ham        7.69 %          HAM (p <= 50%)



          Test results with stop words included
------------------------------------------------------------------------------
Email                               Expected   Spam prediction Result         
----------------------------------- ---------- --------------- ---------------
renew your password                 spam       88.24 %         SPAM (p > 50%)!
renew your vows                    

<hr style="border:2px solid gray"> </hr>
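
As a sanity check, the 76.92 % reported for 'renew your password' with stop words removed can be reproduced by hand. The sketch below follows the logic of `calc_bayes_spam_prob` as I read it: 'renew' is dropped because it is not in the training data, 'your' is dropped as a stop word, and only 'password' remains. Exact fractions are used to avoid rounding; the variable names are illustrative.

```python
from fractions import Fraction

# Priors: 4 spam and 3 ham training emails
prior_spam = Fraction(4, 7)
prior_ham = Fraction(3, 7)

# "password" occurs in 2 of the 4 spam emails (no smoothing for seen words)
p_password_spam = Fraction(2, 4)

# "password" never occurs in ham, so the smoothed value is used:
# SMOOTH_ALPHA / (number of ham emails + SMOOTH_K) = 1 / (3 + 2)
p_password_ham = Fraction(1, 3 + 2)

score_spam = prior_spam * p_password_spam
score_ham = prior_ham * p_password_ham
posterior = score_spam / (score_spam + score_ham)

print(posterior, f"= {float(posterior) * 100:.2f} %")
```

The result is 10/13, which is about 76.92 % and matches the first row of the table.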

## Reflection

Write your reflection in the cell below.

TO DO
