# Notebook E-tivity 3 CE4021 Task 2

Student name: Yvonne Ryan

Student ID: 21208298

<hr style="border:2px solid gray"> </hr>

## Imports

In [None]:
#None

If you believe required imports are missing, please contact your moderator.

<hr style="border:2px solid gray"> </hr>

## Task 2

Use the information below to create a Naive Bayes SPAM filter. Test your filter using the messages in `new_emails`. You may add as many cells as you need to complete the task.

In [2]:
previous_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
previous_ham = ['Your activity report','benefits physical activity', 'the importance vows']
new_emails = {'spam':['renew your password', 'renew your vows'], 'ham':['benefits of our account', 'the importance of physical activity']}

### Solution

*Notes to self:*

Bayes' Theorem:
$$ P(H|E) = \frac{P(E|H)P(H)}{P(E)} = \frac{P(E|H)P(H)}{P(E|H)P(H) + P(E|\overline{H})P(\overline{H})} $$

Naive Bayes:
$$ P(S|x_1, x_2, \ldots, x_n) \approx \frac{P(S)\prod_{i=1}^n P(x_i|S)}{P(S)\prod_{i=1}^n P(x_i|S) + P(H)\prod_{i=1}^n P(x_i|H)} $$
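
Worked check of the formula for one message, as a minimal sketch rather than the notebook's implementation. The helper names `smoothed` and `posterior_spam` are made up for illustration. Assumptions: words never seen in training (e.g. 'renew') are skipped, and a word seen in only one class takes the Laplace-smoothed value 1/(N+2) in the other class, which is not the same as the default of 1 used by `dict_prod` further down.

```python
from fractions import Fraction

# Training data copied from the task cell.
previous_spam = ['send us your password', 'review our website',
                 'send your password', 'send us your account']
previous_ham = ['Your activity report', 'benefits physical activity',
                'the importance vows']


def smoothed(word, msgs):
    """Laplace-smoothed P(word | class): (messages containing word + 1) / (messages + 2)."""
    hits = sum(1 for m in msgs if word in m.split())
    return Fraction(hits + 1, len(msgs) + 2)


def posterior_spam(msg, spam, ham):
    """P(spam | words of msg) under the naive independence assumption."""
    spam = [m.lower() for m in spam]
    ham = [m.lower() for m in ham]
    total = len(spam) + len(ham)
    score_spam = Fraction(len(spam), total)  # prior P(S)
    score_ham = Fraction(len(ham), total)    # prior P(H)
    vocabulary = {w for m in spam + ham for w in m.split()}
    # Assumption: only words that occur somewhere in the training data count.
    for word in set(msg.lower().split()) & vocabulary:
        score_spam *= smoothed(word, spam)
        score_ham *= smoothed(word, ham)
    return score_spam / (score_spam + score_ham)


p = posterior_spam('renew your password', previous_spam, previous_ham)
print(p)  # 50/59
```

By hand: P(S)P(your|S)P(password|S) = 4/7 · 2/3 · 1/2 = 4/21 and P(H)P(your|H)P(password|H) = 3/7 · 2/5 · 1/5 = 6/175, so the posterior is 50/59, above the 0.5 threshold.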

##### Define functions

In [89]:
### Preparing data ###

def word_set(msg_list):
    '''
    Function to create set of unique words from a
    list of messages.
    
    Returns a set of unique strings.
    '''
    
    def msg_lower(msg_list):
        '''
        Converts all characters in a list of messages
        to lower case. Conversion is performed in-place.

        Returns None.
        '''
        for i in range(0,len(msg_list)):
            msg_list[i] = msg_list[i].lower()

        return None
    
    if isinstance(msg_list, list):
        msg_lower(msg_list)
    
    words = set()
    for msg in msg_list:
        for word in msg.split():
            words.add(word)
    
    return words    


### Naive Bayes classifier ### 

def naive_bayes(msg, previous_spam, previous_ham):
    '''
    Implementation of a naive Bayes classifier.
    Takes a single message string (msg) & classifies it as 
    either 'SPAM' or 'HAM', based on labelled training data
    previous_spam & previous_ham (lists of message strings).
    
    Subfunctions:
        word_prob_dict(word_set_A, msg_list_A)
        word_smooth_prob(w, msg_list_A)
        marginal_prob(msg_list_A,msg_list_B)
        dict_prod(word_set, prob_dict)
        bayes_product(common_words, previous_words_A, 
            previous_msgs_A, marg_prob_A)
    
    '''
    
    def word_prob_dict(word_set_A, msg_list_A):
            '''
            Returns a dictionary of word:probability pairs for the
            words in the input word_set_A, based on the frequency of
            that word in the training dataset (msg_list_A).
            '''

            def word_smooth_prob(w, msg_list_A):
                '''
                Calculate the Laplace smoothed probability for a word (w) to 
                appear in messages of type A, based on the frequency of w in 
                a training dataset of this type of message (msg_list_A). 
                In probability notation: P(w|A).
                e.g. P(w|spam) = Laplace smoothed probability of w appearing in 
                spam messages.

                1. Count how many messages in msg_list_A contain w & add 1. 
                2. Divide this by the number of messages in msg_list_A plus 2.

                Returns a float.
                '''
                count = 0
                for msg in msg_list_A:
                    if w in msg.split():
                        count += 1

                return (float(count) + 1) / (float(len(msg_list_A)) + 2)

            prob_dict = {}
            for word in word_set_A:
                prob_dict[word] = word_smooth_prob(word, msg_list_A)

            return prob_dict    
    
    def marginal_prob(msg_list_A,msg_list_B):
        '''
        Calculates the marginal probability of messages of type A
        (assuming only messages of type A or B are possible). 
        In probability notation: P(A)

        The number of messages in msg_list_A is divided by the 
        total number of messages in msg_list_A & msg_list_B.

        Returns a float.
        '''
        return float(len(msg_list_A) / (len(msg_list_A) + len(msg_list_B)))


    def dict_prod(word_set, prob_dict):
        '''
        Computes the product of the probabilities of the words
        in word_set, pulling these values from a dictionary of 
        word:probability pairs. A word that is not in the dictionary
        contributes a factor of 1 (i.e. it is ignored).
        In probability notation: Product[P(w_i|A)] from i=1 to n.

        Returns 0 if word_set is empty.
        '''
        if len(word_set) > 0:
            product = 1
            for w in word_set:
                product *= prob_dict.get(w,1)

            return product

        else: 
            return 0    
    
    
    def bayes_product(common_words, previous_words_A, previous_msgs_A, marg_prob_A):
        '''
        Calculates a product term for use in the naive Bayes algorithm.
        In probability notation: P(A)Product[P(x_i|A)] from i=1 to n.
        '''
        p_xi_A = dict_prod(common_words, word_prob_dict(previous_words_A, previous_msgs_A))

        return p_xi_A * marg_prob_A    
    
    
    ### Training classifier ###
    
    # Create sets of unique words per spam / ham list
    previous_spam_words, previous_ham_words = word_set(previous_spam), word_set(previous_ham)
    
    # Build dictionaries of probabilities for each word per spam / ham list
    word_prob_spam = word_prob_dict(previous_spam_words, previous_spam)
    word_prob_ham = word_prob_dict(previous_ham_words, previous_ham)
      
    # Calculate the marginal probabilities for spam & ham
    marg_prob_spam = marginal_prob(previous_spam, previous_ham)
    marg_prob_ham = marginal_prob(previous_ham, previous_spam)  
    
    
    ### Running classifier ###
    
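    # Make a set of the unique words in msg that also occur in the spam vocabulary.
    # Note: the union below combines previous_spam_words with itself, so words seen
    # only in ham messages are not included in common_words.
    common_words = (set(msg.split())).intersection((previous_spam_words).union(previous_spam_words))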
    
    
    # Compute the products P(S)*Product[P(x_i|S)] and P(H)*Product[P(x_i|H)] for this message
    p_xi_S = bayes_product(common_words, previous_spam_words, previous_spam, marg_prob_spam)
    p_xi_H = bayes_product(common_words, previous_ham_words, previous_ham, marg_prob_ham) 

    
    # Compute probability P(S|x_i) for this message & classify as 'SPAM' or 'HAM'
    if p_xi_S != 0 and p_xi_S / (p_xi_S + p_xi_H) > 0.5:
        return 'SPAM'
    else:
        return 'HAM' 
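
In the notation of the notes above, `word_smooth_prob` and `marginal_prob` compute the estimates below. The symbols $n_A(w)$ (number of type-A training messages containing $w$) and $N_A$ (number of type-A training messages) are introduced here for illustration only.

$$ P(w|A) \approx \frac{n_A(w) + 1}{N_A + 2}, \qquad P(A) = \frac{N_A}{N_A + N_B} $$

For the training data in this task (4 spam, 3 ham messages): $P(S) = 4/7$, $P(H) = 3/7$, and since 'your' appears in 3 of the 4 spam messages, $P(\text{your}|S) \approx (3 + 1)/(4 + 2) = 2/3$.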

##### Run classifier on new messages

In [85]:
# Run each message in the new_emails dataset through the classifier
for msg in (new_emails['spam'] + new_emails['ham']):
    print('Message text: "{}"'.format(msg))
    print('-> Classified as', naive_bayes(msg, previous_spam, previous_ham),'\n')

Message text: "renew your password"
-> Classified as SPAM 

Message text: "renew your vows"
-> Classified as SPAM 

Message text: "benefits of our account"
-> Classified as HAM 

Message text: "the importance of physical activity"
-> Classified as HAM 
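
The printed classifications can be compared with the labels in `new_emails`. The following is a minimal sketch: the `recorded` dictionary simply copies the output above, and the helper name `accuracy` is made up for illustration.

```python
# Labels from the task cell.
new_emails = {'spam': ['renew your password', 'renew your vows'],
              'ham': ['benefits of our account',
                      'the importance of physical activity']}

# Predictions copied from the printed output above.
recorded = {'renew your password': 'SPAM',
            'renew your vows': 'SPAM',
            'benefits of our account': 'HAM',
            'the importance of physical activity': 'HAM'}


def accuracy(labelled, predictions):
    """Fraction of messages whose prediction matches its (upper-cased) label."""
    pairs = [(label.upper(), predictions[msg])
             for label, msgs in labelled.items()
             for msg in msgs]
    correct = sum(1 for expected, predicted in pairs if expected == predicted)
    return correct / len(pairs)


print(accuracy(new_emails, recorded))  # 1.0
```

With only four test messages, this figure says little about how the filter would perform on unseen mail.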



<hr style="border:2px solid gray"> </hr>

## Reflection

Write your reflection in the cell below.