# Naive Bayes: Spam Filtering

## INDEX
- Bayes Theorem 
- Imports
- Import Dataset CSV from Prepare Data
- Update/Fix Dataframe 
    - Separate Data Spam/Ham
- Calculate Probabilities
- Create Spam/Ham Dataframes Containing Word Frequency, Word Probabilities, Spam/Ham Probability
- Hardcoded Calculation Using Naive Bayes to Classify Spam or Normal Message ("Ham")
    - Pseudo Code for Naive Bayes Calculation 
    - Use pandas dataframe to calculate the prior 
    - Test with Random Examples from Dataset 
    - Test with Real 'Ham' Email 
- References

## Using Naive Bayes to Classify Spam or Normal Message ("Ham")
In this file I have hardcoded the calculation that classifies a message as a normal email ("ham") or a spam email, based on the probability that each word in the message's tokenized word list appears in spam or ham messages.

#### Previous Files in this Folder 

__Prepare_Data.ipynb :__
Used to pre-process, clean and prepare the dataset used to test and train in this file. It exports a CSV that is read into this file. 

__EDA_Text_Data_Exploration :__
Used to explore the most frequently used words, to look for commonalities, and to help fine-tune the Prepare_Data file so that I cleaned the words properly. 


## Bayes Theorem & Naive Bayes Explained 

Probability of A given B 

    P ( A | B ) 
    
The probability of A given B is the probability that A and B (the intersection) have both happened divided by the probability that B has happened, that is:

                      P ( A ∩ B ) 
    P ( A | B )  =   -------------        (Eq. 1)
                        P ( B )
                        
Now, what about the probability that B has happened, given that A has also happened? Following the previous formula, we have that:

                      P ( B ∩ A ) 
    P ( B | A )  =   -------------        (Eq. 2)
                        P ( A )
                     
Notice that the intersection is symmetric:

    P ( B ∩ A ) == P ( A ∩ B ) 
    
Thus, by solving Eq. 1 and Eq. 2 for the intersection and equating them, we get Bayes' theorem:

                    P ( B | A ) P ( A )
    P ( A | B ) =  ---------------------
                        P ( B )

Given the significance of Bayes theorem in the theory of probability, each term has a name:

                    likelihood x prior
    posterior  =  -----------------------
                        evidence 
                        
In simple terms, the “prior” P(A) and the “evidence” P(B) refer to the probabilities of observing A and B independently from each other, whereas the “posterior” and the “likelihood” are the conditional probabilities of observing A given B and vice versa.

                                        P ( Message_Word | Spam ) P ( Spam) 
            P ( Spam | Message_Word ) = ------------------------------------
                                                    P ( Message_Word ) 
            

            Message_Word = { w1, w2, w3, ..., wi } 
            
Where Message_Word is a feature vector containing the words coming from the Spam (or Ham) emails.


The “Naive” assumption that the Naive Bayes classifier makes is that the words in a message are independent of one another given the class (Spam or Ham). The result is that the “likelihood” is the product of the individual probabilities of seeing each word in the set of Spam or Ham emails:

            P ( Message_Word | Spam ) = P ( w1 | Spam ) x P ( w2 | Spam ) x ... x P ( wi | Spam )

In this notebook these per-word probabilities are stored in the "Probability_Word_in_Spam" and "Probability_Word_in_Ham" columns.
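To make the formula concrete, here is a small worked sketch. The per-word probabilities are invented for illustration only (they are not taken from the dataset), and the priors are rounded from the values computed later in this notebook. It multiplies the word likelihoods, applies the prior, and divides by the evidence to get a posterior for each class.

```python
import math

# Hypothetical per-word likelihoods, invented for illustration only
p_word_given_spam = {"cheap": 0.020, "office": 0.010, "meter": 0.001}
p_word_given_ham = {"cheap": 0.002, "office": 0.004, "meter": 0.015}

# Priors, rounded from the values computed later in this notebook
p_spam, p_ham = 0.29, 0.71

message = ["cheap", "office"]

# Naive assumption: the message likelihood is the product of the word likelihoods
likelihood_spam = math.prod(p_word_given_spam[w] for w in message)
likelihood_ham = math.prod(p_word_given_ham[w] for w in message)

# Numerators of Bayes' theorem (likelihood x prior)
score_spam = likelihood_spam * p_spam
score_ham = likelihood_ham * p_ham

# The evidence P(Message_Word) is the sum of the numerators over both classes
evidence = score_spam + score_ham
posterior_spam = score_spam / evidence
posterior_ham = score_ham / evidence

print("P(spam | message) =", posterior_spam)
print("P(ham  | message) =", posterior_ham)
```

Because the evidence is the same for both classes, comparing `score_spam` with `score_ham` gives the same decision as comparing the two posteriors, which is why the classifier in this notebook never computes the evidence.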


In [79]:
#pip uninstall scikit-learn -y

In [80]:
#pip install scikit-learn 

In [81]:
#Computation
import pandas as pd
import collections
import numpy as np
import random
import math
import urllib

#Sklearn
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

### Import Data Set 
This dataset has already been processed. See the Jupyter notebook titled "Prepare_Data" for the methods used to process this data. 

In [82]:
df = pd.read_csv("spam_ham_processed.csv")
df.head()

Unnamed: 0,Unknown,ham/spam,Original_Text,is_Spam,Word_List
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0,"['meter', 'follow', 'note', 'gave', 'prelimina..."
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0,"['see', 'attached', 'file']"
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0,"['neon', 'retreat', 'around', 'wonderful', 'ti..."
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1,"['office', 'cheap', 'main', 'darer', 'prudentl..."
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0,"['deal', 'book', 'revenue', 'understanding', '..."


#### Update/Fix Dataframe
When the dataframe was exported to CSV from the Prepare_Data file, the Word_List column, which originally held a list of words in each row, was written out as one long string per row. The following code fixes the dataframe so that each row holds a list of words again.

In [83]:
def processString(txt):
    specialChars = "[]',"
    for specialChar in specialChars:
        txt = txt.replace(specialChar, '')
    return txt.split()
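As an alternative sketch (not what this notebook uses), the standard library's `ast.literal_eval` can parse the stringified list directly. This assumes every cell is a valid Python list literal such as `"['see', 'attached', 'file']"`; the helper name `parse_word_list` is made up for this example.

```python
import ast

def parse_word_list(txt):
    """Parse a stringified Python list back into a real list of words."""
    return ast.literal_eval(txt)

# Example cell value, in the shape shown in the dataframe preview above
example = "['see', 'attached', 'file']"
words = parse_word_list(example)
print(words)
```

Unlike stripping characters and splitting on whitespace, this keeps any token that itself contains a space, comma, or apostrophe intact, but it raises an error if a cell is not a well-formed list literal.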

##### Fix Dataframe

In [84]:
# Fix the Data Frame Column "Word_List" 
# Apply processString to every row to turn each string back into a list of words
df["Word_List"] = df["Word_List"].apply(processString)
df.head(5)

Unnamed: 0,Unknown,ham/spam,Original_Text,is_Spam,Word_List
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0,"[meter, follow, note, gave, preliminary, flow,..."
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0,"[see, attached, file]"
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0,"[neon, retreat, around, wonderful, time, year,..."
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1,"[office, cheap, main, darer, prudently, fortui..."
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0,"[deal, book, revenue, understanding, u, check,..."


#### Separate Data into Spam and Ham

In [85]:
# Separate the data into ham and spam 
df_spam = df.loc[(df["ham/spam"] == "spam")]
df_spam = df_spam.reset_index()
df_ham = df.loc[(df["ham/spam"] == "ham")]
df_ham = df_ham.reset_index()

### Create New Dataframes 
#### Containing Word Frequency, Word Probabilities, Spam/Ham Probability
Creating a dataframe of all the words and the probability that each word appears in a message.

__Columns__ 
- "Word" : The word we are calculating the probability for
- "Word_Frequency" : How often the word occurs in either spam or ham 
- "Probability_Word_in_Spam" : The word's share of all word occurrences in spam messages, used as the probability of seeing the word given spam (the ham dataframe has the matching "Probability_Word_in_Ham" column)
- "Probability_is_Spam" : The probability that a message is spam 

#### Calculate Alpha

In [86]:
alpha = 1/ len(df)
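In this notebook `alpha` is used later as a fallback probability for words that never appear in a class, so that a single unseen word does not force the whole product to zero. A more common way to handle this is additive (Laplace) smoothing, sketched below on toy data. This is an alternative for comparison, not what the notebook does; the function name and the toy word lists are made up for the example.

```python
from collections import Counter

def smoothed_word_probabilities(words, vocabulary, k=1.0):
    """Return P(word | class) with add-k smoothing over a fixed vocabulary."""
    counts = Counter(words)
    # Every vocabulary word gets k extra pseudo-counts
    total = len(words) + k * len(vocabulary)
    return {w: (counts[w] + k) / total for w in vocabulary}

# Toy data: all words seen in one class, and the vocabulary across both classes
spam_words = ["cheap", "cheap", "office"]
vocabulary = {"cheap", "office", "meter"}

probs = smoothed_word_probabilities(spam_words, vocabulary)
for word in sorted(probs):
    print(word, probs[word])
```

With smoothing, the unseen word "meter" gets a small non-zero probability and the probabilities over the vocabulary still sum to 1, which the fixed `alpha` fallback does not guarantee.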

#### Calculating the Probabilities of A Message Being Spam or Ham

In [87]:
# The probability a message is spam or ham 
p_ham = len(df_ham)/ len(df)
p_spam = len(df_spam)/ len(df)

print('The probability that the message IS NOT spam is ' + str(p_ham))
print('The probability that the message IS spam is ' + str(p_spam))

The probability that the message IS NOT spam is 0.7101140978534133
The probability that the message IS spam is 0.2898859021465867


#### Calculating the Probability That a Word Appears in a Spam or Ham Message

In [88]:
# Creates a list of every word in all of the ham emails 
ham_list = []
for i in range(len(df_ham['Word_List'])):
    for j in range(len(df_ham['Word_List'][i])):
        ham_list.append(df_ham['Word_List'][i][j])

In [89]:
# Creates a list of every word in all of the spam emails 
spam_list = []
for i in range(len(df_spam['Word_List'])):
    for j in range(len(df_spam['Word_List'][i])):
        spam_list.append(df_spam['Word_List'][i][j])


In [90]:
def get_word_frequency_df(list_of_words):
    counter=collections.Counter(list_of_words)
    word_frequency = list(counter.values())
    word = list(counter.keys())
    data_tuples = list(zip(word,word_frequency))
    word_frequency = pd.DataFrame(data_tuples, columns=['Word','Word_Frequency'])
    wf_df = word_frequency.copy()
    wf_df = wf_df.sort_values('Word_Frequency', ascending=False)
    return wf_df

#### Dataframes for Words, Word_Frequency, Probability_Word_in_Spam, Probability_is_Spam

In [91]:
# Word frequencies in spam emails, most frequent first (head() shows the top 5)
spam_wf = get_word_frequency_df(spam_list)
spam_wf = spam_wf.rename(columns={'Word': 'Word_Spam', 'Word_Frequency': 'Word_Frequency_Spam'})

# This adds to the frequency table the probability that each word is in a spam email 
spam_wf['Probability_Word_in_Spam'] = spam_wf.Word_Frequency_Spam / sum(spam_wf.Word_Frequency_Spam)
spam_wf['Probability_is_Spam'] = len(df_spam)/ len(df)
spam_wf.head()

Unnamed: 0,Word_Spam,Word_Frequency_Spam,Probability_Word_in_Spam,Probability_is_Spam
110,company,728,0.006832,0.289886
140,information,517,0.004852,0.289886
2936,font,515,0.004833,0.289886
251,please,483,0.004532,0.289886
293,get,481,0.004514,0.289886


#### Dataframes for Words, Word_Frequency, Probability_Word_in_Ham, Probability_is_Ham

In [92]:
# Word frequencies in ham emails, most frequent first (head() shows the top 5)
ham_wf = get_word_frequency_df(ham_list)
ham_wf = ham_wf.rename(columns={'Word': 'Word_Ham', 'Word_Frequency': 'Word_Frequency_Ham'})
# This adds to the frequency table the probability that each word is in a ham email 
ham_wf['Probability_Word_in_Ham'] = ham_wf.Word_Frequency_Ham / sum(ham_wf.Word_Frequency_Ham)
ham_wf['Probability_is_Ham'] = len(df_ham)/ len(df)
ham_wf.head()

Unnamed: 0,Word_Ham,Word_Frequency_Ham,Probability_Word_in_Ham,Probability_is_Ham
18,gas,2856,0.01724,0.710114
143,deal,2786,0.016818,0.710114
248,subject,2731,0.016486,0.710114
8,please,2715,0.016389,0.710114
0,meter,2452,0.014801,0.710114


### Classifying Spam or Ham using Naive Bayes Probability Calculation
Given a message, we can look at its words and ask: given these words, what is the probability that the message is spam or ham?

### Pseudo Code for Calculating Naive Bayes 

Suppose we pick a random message from the original dataframe.

__Prior Probability__
    
    Probability the Message is Ham 
    p_ham = num_ham_messages / total_num_messages
    
    Probability the Message is Spam 
    p_spam = num_spam_messages / total_num_messages

__Probability Word is in Message Given that it is in Spam or Ham__

    For each class, start from the prior and multiply in one factor per word
    (the shared evidence term P( random_message ) is dropped because it is the same for both classes)

        score_spam = p_spam
        For each word i in random_message
            score_spam = score_spam * Probability_Word_in_Spam[i]   (alpha if the word never appears in spam)
        score_spam is proportional to P( spam | random_message )
            
        score_ham = p_ham
        For each word i in random_message
            score_ham = score_ham * Probability_Word_in_Ham[i]   (alpha if the word never appears in ham)
        score_ham is proportional to P( ham | random_message )
        
__Classify Based on Probability__ 

        if P( ham | random_message ) > P( spam | random_message )
            then, random_message == classify_as_ham
            
        if P( spam | random_message ) > P( ham | random_message ) 
            then, random_message == classify_as_spam

In [93]:
# Calculates p_spam times the product of P(word | spam) over the words in the message
# (the unnormalized probability that the message is spam)
def probability_message_is_Spam(random_message):
    p_spam_word_df = spam_wf.copy()
    p_spam_word_df = p_spam_word_df.loc[spam_wf['Word_Spam'].isin(random_message)]
    counter=collections.Counter(random_message)
    count = list(counter.values())
    word = list(counter.keys())
    temp_df = pd.DataFrame(list(zip(word,count)), columns = ['Word_Spam','Word_Frequency_in_Message'])
    merged_df = p_spam_word_df.merge(temp_df, on='Word_Spam', how='right', indicator=True)
    merged_df['Probability_Word_in_Spam'] =merged_df['Probability_Word_in_Spam'].fillna(alpha)
    merged_df['Update_Prob_Word_in_Spam'] = merged_df['Probability_Word_in_Spam']**merged_df['Word_Frequency_in_Message']
    p_list_spam = list(merged_df['Update_Prob_Word_in_Spam'])
    return np.prod(p_list_spam)*p_spam

In [94]:
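# Calculates p_ham times the product of P(word | ham) over the words in the message
# (the unnormalized probability that the message is ham)
def probability_message_is_Ham(random_message):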
    p_ham_word_df = ham_wf.copy()
    p_ham_word_df = p_ham_word_df.loc[ham_wf['Word_Ham'].isin(random_message)]
    counter=collections.Counter(random_message)
    count = list(counter.values())
    word = list(counter.keys())
    temp_df = pd.DataFrame(list(zip(word,count)), columns = ['Word_Ham','Word_Frequency_in_Message'])
    merged_df = p_ham_word_df.merge(temp_df, on='Word_Ham', how='right', indicator=True)
    merged_df['Probability_Word_in_Ham'] =merged_df['Probability_Word_in_Ham'].fillna(alpha)
    merged_df['Update_Prob_Word_in_Ham'] = merged_df['Probability_Word_in_Ham']**merged_df['Word_Frequency_in_Message']
    p_list_ham = list(merged_df['Update_Prob_Word_in_Ham'])
    return np.prod(p_list_ham)*p_ham

In [95]:
# Classify Based on Probability
def classify_ham_or_spam(random_message, row_n):
    is_Spam = df['is_Spam'][row_n]
    # Compute each score once instead of re-running the merges in every comparison
    p_msg_ham = probability_message_is_Ham(random_message)
    p_msg_spam = probability_message_is_Spam(random_message)

    if p_msg_ham > p_msg_spam:
        is_classified = 0
        correct_classification = is_classified == is_Spam
        print("The message is ham, which is " + str(correct_classification))
    elif p_msg_spam > p_msg_ham:
        is_classified = 1
        correct_classification = is_classified == is_Spam
        print("The message is spam, which is " + str(correct_classification))
    # A score of exactly 0.0 means the product of small probabilities underflowed
    if p_msg_spam == 0.0 or p_msg_ham == 0.0:
        print("This returned too small of a number to identify") 
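The "too small of a number" case happens because multiplying many probabilities below 1 eventually underflows to 0.0 in floating point. A common workaround is to add logarithms instead of multiplying probabilities. The sketch below shows the idea on toy dictionaries; `log_score`, the toy probabilities, and the `fallback` value are all made up for this example and are not part of the notebook's code.

```python
import math

def log_score(message_words, word_probs, prior, fallback):
    """Return log(prior) + sum of log P(word | class), using fallback for unseen words."""
    total = math.log(prior)
    for word in message_words:
        total += math.log(word_probs.get(word, fallback))
    return total

# Toy per-word probabilities, invented for illustration only
spam_probs = {"cheap": 0.02, "office": 0.01}
ham_probs = {"cheap": 0.002, "office": 0.004}

# A long message: 400 words
long_message = ["cheap", "office"] * 200

# Multiplying directly underflows to 0.0 for a message this long
direct_product = math.prod(spam_probs[w] for w in long_message) * 0.29

# Summing logs stays finite, so the two classes can still be compared
log_spam = log_score(long_message, spam_probs, prior=0.29, fallback=1e-4)
log_ham = log_score(long_message, ham_probs, prior=0.71, fallback=1e-4)
label = "spam" if log_spam > log_ham else "ham"

print(direct_product, log_spam, log_ham, label)
```

Since the logarithm is increasing, the class with the larger log score is also the class with the larger product, so the decision rule is unchanged.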

#### Test with Random Examples from Dataset 

This tests a randomly chosen message that already exists in the dataset. It calculates whether the message is more likely to be spam or ham and, after classifying it, compares the result with the actual label. 

In [96]:
# Generate a random list of pre-processed words from a message already present in dataset
# randrange excludes len(df), so the row index is always valid (and row 0 can be picked)
random_n = random.randrange(len(df['Word_List']))
random_message = df['Word_List'][random_n]

#Classify Random Message as Spam or Ham 
classify_ham_or_spam(random_message,random_n)


The message is ham, which is True


__Explanation__

Output: 
    "The message is [Classified As Spam or Ham], 
    which is [Boolean dependent on whether Classified == Actual from Dataset]"
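A single random message only gives one True/False result. To summarize how often the classifier is right, the same comparison can be repeated over many labelled messages. The sketch below shows the bookkeeping with a toy stand-in classifier; `accuracy`, `toy_predict`, and the toy data are made up for this example, and in the notebook the prediction would come from comparing the ham and spam scores.

```python
def accuracy(predict, labelled_messages):
    """Fraction of (words, true_label) pairs where predict(words) == true_label."""
    correct = sum(1 for words, label in labelled_messages if predict(words) == label)
    return correct / len(labelled_messages)

# Toy stand-in for the classifier: flags any message containing "cheap" as spam (1)
def toy_predict(words):
    return 1 if "cheap" in words else 0

# Toy labelled data: (list of words, is_Spam)
toy_data = [
    (["office", "cheap"], 1),
    (["see", "attached", "file"], 0),
    (["deal", "book", "revenue"], 0),
    (["neon", "retreat"], 1),  # labelled spam here on purpose, so the toy rule misses it
]

acc = accuracy(toy_predict, toy_data)
print("Accuracy:", acc)
```

Note that the messages tested in this notebook are the same ones used to build the word frequency tables, so an accuracy measured this way will look better than it would on unseen emails. Holding out a test set first (for example with the imported `train_test_split`) would give a fairer number.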

#### Test with Real "Ham" Email 

For fun, I took a real email which WAS NOT spam and tested it in this model. I wanted to see whether it would classify the email as spam or not. This Naive Bayes classifier failed to recognize it as a real message, and there are many reasons why that may be the case. 

In [97]:
# Testing a real email that should be classified as ham, which I received from General Atomics
testing_real_ham = "Thank you for your interest in a career opportunity with General Atomics."
test_real_ham = testing_real_ham.split()
# Row 0 of the dataset is a ham message, so its label (0) serves as the true label for this email
classify_ham_or_spam(test_real_ham,0)

The message is spam, which is False
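One likely contributor is that the test email is only split on whitespace, while the words in Word_List appear lowercased and free of punctuation. Tokens such as "Thank" or "Atomics." then cannot match anything in the frequency tables and fall back to `alpha` in both classes. The sketch below is a rough stand-in for the cleaning step; the real steps live in Prepare_Data and are not shown here, so `simple_tokenize` is an assumption rather than that notebook's method.

```python
import string

def simple_tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace (a rough stand-in only)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

tokens = simple_tokenize(
    "Thank you for your interest in a career opportunity with General Atomics."
)
print(tokens)
```

Passing tokens cleaned the same way as the training data gives the classifier a fairer chance, although words that the original pipeline would have removed (such as stop words) may still need to be filtered out.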


#### References

  __WEBLINKS__
  - Performing Sentiment Analysis With Naive Bayes Classifier!
      - https://www.analyticsvidhya.com/blog/2021/07/performing-sentiment-analysis-with-naive-bayes-classifier/
  - Naive Bayes Using SciKit-Learn
      - https://scikit-learn.org/stable/modules/naive_bayes.html

  __VIDEOS__
  - Naive Bayes Tutorial: 
      - https://www.youtube.com/watch?v=O2L2Uv9pdDA
  - Text Classification Using Naive Bayes | Naive Bayes Algorithm In Machine Learning | Simplilearn :
      - https://www.youtube.com/watch?v=60pqgfT5tZM
        