## Assignment 5a
Using a Naive Bayes classifier, predict whether a given mail is spam or not. Use the data set provided for this purpose.   
Submitted by: **IEC2016012**

### Importing libraries and dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("spam.csv",  encoding = "ISO-8859-1", usecols = ["v1","v2"])
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


### Preprocessing

In [3]:
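# Removing punctuation and special characters using regex
# (regex=True is required on pandas >= 2.0, where the default is a literal match)
df['v2'] = df['v2'].str.replace(r'[^\w\s]+', '', regex=True)
# Lowercasing so that differently cased forms of a word share one key
df['v2'] = df['v2'].str.lower()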

# Removing repeated entries
print("Entries before removing duplicates: " +  str(len(df)))
df.drop_duplicates(subset=['v2'], inplace=True)
print("Entries after removing duplicates: " +  str(len(df)))
df.reset_index(drop=True, inplace=True)
df.head()

Entries before removing duplicates: 5572
Entries after removing duplicates: 5142


Unnamed: 0,v1,v2
0,ham,go until jurong point crazy available only in ...
1,ham,ok lar joking wif u oni
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor u c already then say
4,ham,nah i dont think he goes to usf he lives aroun...
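
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same regex and lowercasing to a single string with the standard `re` module. The helper name `clean` is an assumption introduced for illustration; the notebook itself works on the whole column with pandas.

```python
import re

def clean(text):
    # Hypothetical helper: drop every run of characters that is neither a
    # word character (\w) nor whitespace (\s), then lowercase the result.
    return re.sub(r"[^\w\s]+", "", text).lower()

sample = "Ok lar... Joking wif u oni..."
cleaned = clean(sample)
print(cleaned)
```

The cleaned string matches row 1 of the table above. Apostrophes are removed as well, which is why "don't" appears as "dont" in row 4.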


In [4]:
mask = np.random.rand(len(df))<0.7
X_train = df.v2[mask].values
y_train = df.v1[mask].values
X_test = df.v2[~mask].values
y_test = df.v1[~mask].values
print(X_train[:4], y_train[:4], X_test[:4], y_test[:4])


['go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat'
 'ok lar joking wif u oni'
 'nah i dont think he goes to usf he lives around here though'
 'freemsg hey there darling its been 3 weeks now and no word back id like some fun you up for it still tb ok xxx std chgs to send å150 to rcv'] ['ham' 'ham' 'ham' 'spam'] ['free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s'
 'u dun say so early hor u c already then say'
 'even my brother is not like to speak with me they treat me like aids patent'
 'winner as a valued network customer you have been selected to receivea å900 prize reward to claim call 09061701461 claim code kl341 valid 12 hours only'] ['spam' 'ham' 'ham' 'spam']
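
Because the mask is drawn from an unseeded random generator, each run produces a different split and slightly different results. One way to make the split repeatable is sketched below with a seeded NumPy generator; the function name `split_mask` and the seed value are assumptions, not part of the original notebook.

```python
import numpy as np

def split_mask(n_rows, train_fraction=0.7, seed=0):
    # Hypothetical helper: boolean mask that is True for training rows.
    # The same seed always yields the same mask.
    rng = np.random.default_rng(seed)
    return rng.random(n_rows) < train_fraction

mask = split_mask(10)
train_idx = np.flatnonzero(mask)    # row positions used for training
test_idx = np.flatnonzero(~mask)    # remaining rows used for testing
print(train_idx, test_idx)
```

As in the cell above, the training share is only approximately 70%, since each row is assigned independently.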


### Creating dictionary of all words


In [5]:
messages = df.v2.values
words = []
for m in messages:
    words += (m.split(" "))
unique = set(words)
dic_words = {i:words.count(i) for i in unique}
dic_words

{'': 1678,
 '150pmtmsgrcvd18': 2,
 '101mega': 1,
 '11': 7,
 'speed': 3,
 'httpwwwwtlpcouktext': 1,
 'way': 95,
 'tncs': 5,
 'franyxxxxx': 1,
 'entrepreneurs': 1,
 'prabha': 2,
 'thanku': 1,
 '30s': 1,
 'jenne': 1,
 'display': 3,
 'matches': 7,
 'kkthis': 1,
 'prefer': 3,
 'technologies': 1,
 'urgh': 1,
 'lets': 20,
 'bad': 30,
 'noisy': 1,
 'first': 58,
 'bunkers': 1,
 'woods': 1,
 '32323': 1,
 'hits': 1,
 'rules': 1,
 '3cover': 1,
 'euro': 2,
 'uploaded': 1,
 'chrgd50p': 1,
 'pockets': 1,
 'themobyo': 1,
 'convince': 1,
 'nte': 1,
 'presnts': 1,
 'when': 264,
 'imat': 1,
 'nt': 15,
 'stability': 1,
 'approve': 1,
 'prin': 1,
 'ie': 2,
 '09063440451': 1,
 '146tf150p': 1,
 'mind': 35,
 'costa': 4,
 'noooooooo': 1,
 '8lovable': 1,
 'surprised': 4,
 'software': 3,
 'gay': 8,
 'else': 24,
 'bears': 2,
 'passed': 4,
 'classmates': 1,
 'bec': 2,
 'familiar': 1,
 'call2optout4qf2': 1,
 'lined': 1,
 'x2': 1,
 'haven': 1,
 'fgkslpo': 1,
 'taylor': 2,
 'watching': 31,
 'mrur': 1,
 'dry': 4,
 'st

### Creating dictionary of spam words


In [6]:
spam_messages = df.v2.values[df.v1 == "spam"]
spam_words = []
for m in spam_messages:
    spam_words += (m.split(" "))
unique_spam = set(spam_words)
dic_spam = {i:spam_words.count(i) for i in unique_spam}
dic_spam

{'': 223,
 '150pmtmsgrcvd18': 2,
 '11': 3,
 '0808': 1,
 'speed': 1,
 'way': 1,
 'httpwwwwtlpcouktext': 1,
 'lily': 1,
 'xx': 4,
 'tncs': 5,
 '09050000301': 1,
 'rcvd': 4,
 'w14rg': 1,
 'december': 4,
 'phones': 8,
 '08000930705': 15,
 '30s': 1,
 '449050000301': 1,
 'app': 4,
 'matches': 7,
 'removal': 2,
 'energy': 2,
 'php': 2,
 'there': 11,
 'pc': 6,
 'lapdancer': 1,
 'ultimate': 1,
 'lets': 3,
 'bad': 1,
 'recorder': 1,
 'first': 6,
 'woods': 1,
 'videochat': 3,
 'jamster': 2,
 'mths': 3,
 'gent': 2,
 '80488biz': 1,
 '2morro': 1,
 '32323': 1,
 'wiv': 2,
 'prizeto': 1,
 'euro': 2,
 'chrgd50p': 1,
 'themobyo': 1,
 'hi': 13,
 'keep': 5,
 '1st': 19,
 'when': 11,
 '09066362220': 1,
 '2u': 1,
 'luxury': 2,
 'nt': 1,
 'just': 65,
 'holder': 6,
 'area': 8,
 'vpod': 1,
 'hava': 1,
 '09063440451': 1,
 'pick': 2,
 '146tf150p': 1,
 'mind': 1,
 'matthew': 1,
 '21': 1,
 'costa': 4,
 'flirt': 4,
 'deluxe': 1,
 'travel': 1,
 'nowsend': 1,
 'noworriesloanscom': 1,
 'team': 2,
 '2untamed': 1,
 'colle

### Creating dictionary of ham words


In [7]:
ham_messages = df.v2.values[df.v1 == "ham"]
ham_words = []
for m in ham_messages:
    ham_words += (m.split(" "))
unique_ham = set(ham_words)
dic_ham = {i:ham_words.count(i) for i in unique_ham}
dic_ham

{'': 1455,
 '101mega': 1,
 '11': 4,
 'speed': 2,
 'way': 94,
 'franyxxxxx': 1,
 'entrepreneurs': 1,
 'prabha': 2,
 'thanku': 1,
 'jenne': 1,
 'display': 3,
 'kkthis': 1,
 'prefer': 3,
 'technologies': 1,
 'urgh': 1,
 'lets': 17,
 'bad': 29,
 'noisy': 1,
 'first': 52,
 'bunkers': 1,
 'hits': 1,
 'rules': 1,
 '3cover': 1,
 'uploaded': 1,
 'pockets': 1,
 'convince': 1,
 'nte': 1,
 'presnts': 1,
 'when': 253,
 'imat': 1,
 'nt': 14,
 'stability': 1,
 'approve': 1,
 'prin': 1,
 'ie': 2,
 'mind': 34,
 'noooooooo': 1,
 '8lovable': 1,
 'surprised': 4,
 'software': 3,
 'gay': 2,
 'else': 23,
 'passed': 4,
 'classmates': 1,
 'bec': 2,
 'familiar': 1,
 'lined': 1,
 'x2': 1,
 'haven': 1,
 'taylor': 2,
 'watching': 31,
 'mrur': 1,
 'dry': 4,
 'store': 2,
 'mentor': 1,
 'prepaid': 1,
 'sachin': 1,
 'mesages': 1,
 'adding': 1,
 'finds': 1,
 'further': 1,
 'dramatic': 1,
 'satisfied': 1,
 'nothin': 2,
 'understanding': 3,
 'boyy': 1,
 'distract': 2,
 'buy': 58,
 'forgot': 28,
 '5gently': 1,
 'abel': 1,

In [8]:
total_words = len(words)
total_spam = len(spam_words)
total_ham = len(ham_words)
print(total_words, total_spam, total_ham)

79567 15235 64332
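
The three dictionaries above call `list.count` once per unique word, which rescans the full token list every time. As a hedged alternative, `collections.Counter` produces the same counts in a single pass. The toy messages below are made up for illustration; the sketch keeps the `split(" ")` tokenisation used above.

```python
from collections import Counter

# Made-up messages; note the double space in the first one
toy_messages = ["free entry  now", "ok lar", "free prize now"]

tokens = []
for m in toy_messages:
    tokens += m.split(" ")          # same tokenisation as the cells above

counts_fast = Counter(tokens)       # one pass over the tokens
counts_slow = {w: tokens.count(w) for w in set(tokens)}  # approach used above

print(counts_fast)
```

The double space yields an empty-string token, which is also where the `''` key in the dictionaries above comes from. Calling `split()` with no argument would drop those empty tokens, at the cost of changing the totals printed above.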


### Various probability functions

In [9]:
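def P_w_given_spam(w):
    # Relative frequency of w among all spam tokens
    return dic_spam[w] / total_spam

def P_w_given_ham(w):
    # Relative frequency of w among all ham tokens
    return dic_ham[w] / total_ham

def P_w(w):
    # Relative frequency of w among all tokens
    return dic_words[w] / total_words

def P_mess_is_spam(mess):
    # Score: product of P(w|spam) / P(w) over the words that occur in spam
    num = den = 1
    for w in mess.split():
        if w in dic_spam:  # dict lookup; same result as scanning spam_words
            num *= P_w_given_spam(w)
            den *= P_w(w)
    if den == 0:  # product underflowed to 0.0; avoid dividing by zero
        num += 1
        den += 1
    return num / den

def P_mess_is_ham(mess):
    # Score: product of P(w|ham) / P(w) over the words that occur in ham
    num = den = 1
    for w in mess.split():
        if w in dic_ham:  # dict lookup; same result as scanning ham_words
            num *= P_w_given_ham(w)
            den *= P_w(w)
    if den == 0:  # product underflowed to 0.0; avoid dividing by zero
        num += 1
        den += 1
    return num / den

def predict(mess):
    # Pick the class with the larger score; ties go to ham
    if P_mess_is_spam(mess) > P_mess_is_ham(mess):
        return "spam"
    else:
        return "ham"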
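
The `den == 0` guard above exists because a product of many small probabilities can underflow to zero. A common variant, which is not the method used in this notebook, sums log-probabilities instead, adds Laplace (add-one) smoothing so unseen words do not zero out a class, and includes the class prior. The sketch below is self-contained; the names `train`, `log_score` and `predict_log`, and the toy messages, are assumptions for illustration.

```python
import math
from collections import Counter

def train(messages, labels):
    # Per-class word counts, per-class message counts, and the vocabulary
    counts = {"spam": Counter(), "ham": Counter()}
    n_msgs = Counter(labels)
    for text, label in zip(messages, labels):
        counts[label].update(text.split())
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, n_msgs, vocab

def log_score(text, label, model):
    counts, n_msgs, vocab = model
    total = sum(counts[label].values())
    # log prior: share of training messages with this label
    score = math.log(n_msgs[label] / sum(n_msgs.values()))
    for w in text.split():
        # add-one smoothing keeps the argument of log strictly positive
        score += math.log((counts[label][w] + 1) / (total + len(vocab)))
    return score

def predict_log(text, model):
    return max(("spam", "ham"), key=lambda c: log_score(text, c, model))

# Made-up training data, for illustration only
model = train(
    ["win free prize", "free entry now", "ok see you now", "are you free now"],
    ["spam", "spam", "ham", "ham"],
)
print(predict_log("free prize", model), predict_log("ok see you", model))
```

In this variant the counts would normally come from the training messages only, so that the test messages play no part in fitting the model.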

# Testing

##### Model for spam/ham classifier:   
The message would be classified as *spam* if:   
$$P(spam | w_1 \cap w_2 \cap .. \cap w_n) > P(ham | w_1 \cap w_2 \cap .. \cap w_n)$$  
where
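\begin{align}
P(spam | w_1 \cap w_2 \cap .. \cap w_n) &= \frac{P(w_1|spam)*P(w_2|spam)*..*P(w_n|spam)}{P(w_1)*P(w_2)*..*P(w_n)} \\
P(ham | w_1 \cap w_2 \cap .. \cap w_n) &= \frac{P(w_1|ham)*P(w_2|ham)*..*P(w_n|ham)}{P(w_1)*P(w_2)*..*P(w_n)} \\
P(w_i) &= \frac{\text{frequency of } w_i \text{ in } words}{total\_words}\\
P(w_i|spam) &= \frac{\text{frequency of } w_i \text{ in } spam\_words}{total\_spam}\\
P(w_i|ham) &= \frac{\text{frequency of } w_i \text{ in } ham\_words}{total\_ham}\\
\end{align}

Note that, as in the functions above, the class priors $P(spam)$ and $P(ham)$ are not included in these scores, and a word that never occurs in a class is skipped for that class rather than contributing a zero.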
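
As a worked illustration of the scores defined above, the sketch below evaluates them on a tiny made-up corpus. The messages and the helper name `score` are assumptions for illustration; the arithmetic follows the same ratio of products, skipping words that a class has never seen.

```python
from collections import Counter

# Made-up corpus, for illustration only
spam_tokens = "win free prize free entry now".split()      # 6 tokens
ham_tokens = "ok see you now are you free now".split()     # 8 tokens
all_tokens = spam_tokens + ham_tokens                      # 14 tokens

spam_counts = Counter(spam_tokens)
ham_counts = Counter(ham_tokens)
all_counts = Counter(all_tokens)

def score(message, class_counts, class_total):
    # Product of P(w|class) / P(w) over the words seen in that class
    num = den = 1.0
    for w in message.split():
        if w in class_counts:
            num *= class_counts[w] / class_total
            den *= all_counts[w] / len(all_tokens)
    return num / den

spam_score = score("free prize", spam_counts, len(spam_tokens))
ham_score = score("free prize", ham_counts, len(ham_tokens))
label = "spam" if spam_score > ham_score else "ham"
print(spam_score, ham_score, label)
```

Here "prize" never occurs in the ham tokens, so it is skipped in the ham score: the ham score is based on "free" alone, while the spam score uses both words.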

### Calculating accuracy

In [10]:
fp = fn = tp = tn = 0
# "ham" is treated as the positive class and "spam" as the negative class
for i,m in enumerate(X_test):
    pred = predict(m)
    act = y_test[i]
    if pred == "spam" and act == "spam":
        tn+=1
    if pred == "spam" and act == "ham":
        fn+=1
    if pred == "ham" and act == "spam":
        fp+=1
    if pred == "ham" and act == "ham":
        tp+=1
acc = (tp+tn)/len(X_test)
print("Accuracy = " + str(acc))

Accuracy = 0.9654485049833887


### Confusion Matrix

In [11]:
print("Pre                  Actual Values")
print("dic")
print("ted              Positive     Negative")
print("val       Positive   " + str(tp) + "           " + str(fp))
print("ues       Negative   " + str(fn) + "           " + str(tn))

Pre                  Actual Values
dic
ted              Positive     Negative
val       Positive   1291           4
ues       Negative   48           162
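
The printed matrix has predicted values as rows and actual values as columns, with ham as the positive class. Read that way, 48 ham messages were predicted as spam and 162 spam messages were predicted as spam. The sketch below derives a few standard metrics from those four counts; the variable names are chosen for illustration.

```python
# Counts read from the printed matrix, with ham as the positive class
tp, fp = 1291, 4    # predicted ham:  actual ham, actual spam
fn, tn = 48, 162    # predicted spam: actual ham, actual spam

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # of messages predicted ham, share that are ham
recall = tp / (tp + fn)         # of ham messages, share predicted ham
spam_recall = tn / (tn + fp)    # of spam messages, share predicted spam

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} spam_recall={spam_recall:.4f}")
```

The accuracy computed this way agrees with the value printed in the previous cell.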


# Conclusion
The spam/ham classifier was applied to the provided dataset of messages. The text was first preprocessed:
1. The file was read with the ISO-8859-1 encoding.
2. Punctuation was removed using the regex `[^\w\s]+`.
    A. `\w` matches word characters (letters, digits and underscore).
    B. `\s` matches whitespace.
    The negated class `[^\w\s]` matches everything else, so punctuation of any form is removed.
3. All the text was converted to lowercase, so that differently cased forms of the same word map to a single key.

The cleaned dataset contained a number of *duplicate messages*; removing them reduced it from 5572 to 5142 entries.

Then, using Bayes' theorem, the classification was done on the testing data (about $30$%).    
The **accuracy** obtained was about $96.5$%.    
The confusion matrix, with *ham* as the positive class, gave the following counts:   
- True Positive: 1291
- True Negative: 162
- False Positive: 4
- False Negative: 48