# Naive Bayes spam filtering

Suppose you are given a data set of text messages, each labeled as ham or spam. We will train on a sample of about 3,400 text messages, but first let's work through a few examples to get familiar with the naive Bayes idea.

| Class| Message|Bag of words|
| -|:-:|:-:|
| Spam| Send us your password| send, password|
| Ham| I will send you the letter| send, letter|
| Ham| I wrote a letter | write, letter|

We want to compute P(Spam|Bag of words). From Bayes' rule, which we covered last session:


<div align="center">$P(\text{Spam|Bag of words})= \frac{P(\text{Bag of words|Spam})P(\text{Spam})}{P(\text{Bag of words|Spam})P(\text{Spam})+P(\text{Bag of words|Ham})P(\text{Ham})}$</div>



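P(word|spam) and P(word|ham) can be estimated from the training sample; the "naive" assumption is that words occur independently given the class, so P(Bag of words|class) is the product of the per-word probabilities. To avoid zero probabilities, we start each word count at 1 (add-one, or Laplace, smoothing), which also adds the vocabulary size (4 here) to each denominator. Note that the priors are P(ham)=$\frac{2}{3}$ and P(spam)=$\frac{1}{3}$, since two of the three training messages are ham.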

| Spam (raw) | Ham (raw) | Word | Spam (add-one) | Ham (add-one) |
| -|:-:|:-:|:-:|:-:|
|$\frac{1}{2}$ |$\frac{1}{4}$ |send|$\frac{1+1}{2+4}$ |$\frac{1+1}{4+4}$ |
|$\frac{1}{2}$ |$\frac{0}{4}$ |password|$\frac{1+1}{2+4}$ |$\frac{0+1}{4+4}$ |
|$\frac{0}{2}$| $\frac{2}{4}$|letter|$\frac{0+1}{2+4}$| $\frac{2+1}{4+4}$|
|$\frac{0}{2}$ |$\frac{1}{4}$ |write|$\frac{0+1}{2+4}$ |$\frac{1+1}{4+4}$ |
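
Written as a formula, the smoothed columns follow the add-one rule below. The symbols are introduced here for illustration only: $n_{w,c}$ is the number of times word $w$ occurs in class $c$, $N_c$ is the total number of words in class $c$, and $|V|$ is the vocabulary size (4 in this example).

<div align="center">$P(w|c)= \frac{n_{w,c}+1}{N_c+|V|}, \qquad \text{e.g.}\quad P(\text{password|Ham})= \frac{0+1}{4+4}=\frac{1}{8}$</div>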


Now consider a new text message, "*write your password in the password box*". The word "*box*" does not appear in our training sample, so the safe choice is to drop it from the bag of words and base the decision on the other two words, "*write*" and "*password*". Note that "*password*" occurs twice.

<div align="center">$P(\text{spam|write,password,password})= \frac{P(\text{write|spam})P(\text{password|spam})P(\text{password|spam})P(\text{spam})}{P(\text{write|spam})P(\text{password|spam})P(\text{password|spam})P(\text{spam})+P(\text{write|ham})P(\text{password|ham})P(\text{password|ham})P(\text{ham})}$</div>


<div align="center">$P(\text{spam|write,password,password})= \frac{\frac{1}{6}\times\frac{2}{6}\times\frac{2}{6}\times\frac{1}{3}}{\frac{1}{6}\times\frac{2}{6}\times\frac{2}{6}\times\frac{1}{3}+\frac{2}{8}\times\frac{1}{8}\times\frac{1}{8}\times\frac{2}{3}}\approx 70\%$</div>


and $P(\text{ham|write,password,password})=1-P(\text{spam|write,password,password})\approx 30\%$, so we classify this text message as spam. This was just a demonstration of the naive Bayes method. Let's use a larger data set to build a model and evaluate its performance.
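
The hand calculation above can be reproduced in a few lines. This is a minimal sketch using only the three toy messages from the table; the helper names (`smoothed_likelihoods`, `posterior_spam`) are made up for illustration, and exact fractions are used so the result can be compared with the formula directly.

```python
from collections import Counter
from fractions import Fraction

# Toy training set from the table above: (label, bag of words)
training = [
    ("spam", ["send", "password"]),
    ("ham", ["send", "letter"]),
    ("ham", ["write", "letter"]),
]

vocabulary = {word for _, bag in training for word in bag}
labels = [label for label, _ in training]


def smoothed_likelihoods(label):
    """Add-one smoothed P(word | label) for every word in the vocabulary."""
    counts = Counter(word for lab, bag in training if lab == label for word in bag)
    total = sum(counts.values()) + len(vocabulary)
    return {word: Fraction(counts[word] + 1, total) for word in vocabulary}


likelihood = {label: smoothed_likelihoods(label) for label in set(labels)}
prior = {label: Fraction(labels.count(label), len(labels)) for label in set(labels)}


def posterior_spam(words):
    """P(spam | words); words missing from the vocabulary are ignored."""
    score = {}
    for label in prior:
        score[label] = prior[label]
        for word in words:
            if word in vocabulary:
                score[label] *= likelihood[label][word]
    return score["spam"] / (score["spam"] + score["ham"])


p = posterior_spam(["write", "password", "password", "box"])
print(p, float(p))
```

The exact value is 64/91, which rounds to the 70% quoted above.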

In [1]:
import numpy as np

In [2]:
import pandas as pd
from collections import Counter

NLTK (Natural Language Toolkit) is a set of Python libraries for natural language processing (NLP).

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/zahra/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Stop words are the most common words in a language, and they carry little information. We will filter them out before any further processing.

In [4]:
stopwords=nltk.corpus.stopwords.words('english')
print(stopwords[:5])

['i', 'me', 'my', 'myself', 'we']


A word can have many variations with the same meaning, so we will use NLTK's `stem` package to reduce each word to a common stem.

In [5]:
from nltk.stem import PorterStemmer
Ps=PorterStemmer()
Ps.stem('cook'),Ps.stem('cooking'),Ps.stem('cooked')

('cook', 'cook', 'cook')

We also need to remove punctuation, which is not informative for our classification.

In [6]:
import string
punctuations=string.punctuation
print(punctuations)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Let's load the data:

In [7]:
data=pd.read_csv('Spam.csv')
data.head()

Unnamed: 0,Class,Text
0,spam,SMS. ac Blind Date 4U!: Rodds1 is 21/m from Ab...
1,ham,Yup... From what i remb... I think should be c...
2,ham,Jos ask if u wana meet up?
3,ham,Lol yes. Our friendship is hanging on a thread...
4,spam,TheMob> Check out our newest selection of cont...


Convert the categorical labels into numbers that the code can process (spam = 1, ham = 0):

In [8]:
# spam -> 1, ham -> 0 (an explicit comparison gives integer codes regardless of pandas version)
data['Class_code']=(data.Class=='spam').astype(int)
data.head()

Unnamed: 0,Class,Text,Class_code
0,spam,SMS. ac Blind Date 4U!: Rodds1 is 21/m from Ab...,1
1,ham,Yup... From what i remb... I think should be c...,0
2,ham,Jos ask if u wana meet up?,0
3,ham,Lol yes. Our friendship is hanging on a thread...,0
4,spam,TheMob> Check out our newest selection of cont...,1


In [9]:
def train_test_split(dataframe,test_size=0.3,rs=None):
    """Split a pandas DataFrame into train and test samples; test_size is the test fraction, rs the random seed."""
    dataframe_test=dataframe.sample(frac=test_size,random_state=rs)
    dataframe_train=dataframe.loc[dataframe.index.difference(dataframe_test.index)]
    
    return (dataframe_train.reset_index(drop=True),dataframe_test.reset_index(drop=True))
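
A quick sanity check of this kind of split is to confirm that every row ends up in exactly one of the two samples. The sketch below repeats the same logic under the hypothetical name `split_frame` so that it runs on its own, and applies it to an invented ten-row frame (not the real data).

```python
import pandas as pd


def split_frame(dataframe, test_size=0.3, rs=None):
    """Same logic as train_test_split above, repeated so this check is self-contained."""
    test = dataframe.sample(frac=test_size, random_state=rs)
    train = dataframe.loc[dataframe.index.difference(test.index)]
    return train.reset_index(drop=True), test.reset_index(drop=True)


# Invented toy frame, for illustration only
toy = pd.DataFrame({
    "Class": ["ham", "spam"] * 5,
    "Text": [f"message {i}" for i in range(10)],
})
toy_train, toy_test = split_frame(toy, test_size=0.3, rs=4)

# Every row lands in exactly one of the two samples
assert len(toy_train) + len(toy_test) == len(toy)
assert set(toy_train.Text).isdisjoint(toy_test.Text)
print(len(toy_train), len(toy_test))
```

Fixing `rs` makes the split reproducible, which is why the notebook passes `rs=4`.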

In [10]:
data_train,data_test=train_test_split(data,test_size=0.3,rs=4)

In [11]:
data_train.head()

Unnamed: 0,Class,Text,Class_code
0,spam,SMS. ac Blind Date 4U!: Rodds1 is 21/m from Ab...,1
1,ham,Lol yes. Our friendship is hanging on a thread...,0
2,spam,TheMob> Check out our newest selection of cont...,1
3,ham,Where are the garage keys? They aren't on the ...,0
4,ham,Today is ACCEPT DAY..U Accept me as? Brother S...,0


In [12]:
data_test.head()

Unnamed: 0,Class,Text,Class_code
0,ham,I also thk too fast... Xy suggest one not me. ...,0
1,ham,I not busy juz dun wan 2 go so early.. Hee..,0
2,ham,Thanks honey but still haven't heard anything ...,0
3,spam,You will recieve your tone within the next 24h...,1
4,ham,Lol you won't feel bad when I use her money to...,0


Cleaning up one of the text messages as an example:

In [19]:
message=data_train.Text[44]
print(message)

Here is your discount code RP176781. To stop further messages reply stop. www.regalportfolio.co.uk. Customer Services 08717205546


In [20]:
# convert to lower case
message=message.lower()
print(message)

here is your discount code rp176781. to stop further messages reply stop. www.regalportfolio.co.uk. customer services 08717205546


In [21]:
# remove punctuation characters
message=''.join([x for x in message if x not in punctuations])
print(message)

here is your discount code rp176781 to stop further messages reply stop wwwregalportfoliocouk customer services 08717205546
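
An alternative way to strip punctuation is `str.translate` with a deletion table, which avoids the character-by-character join. This is only a sketch; `drop_punctuation` and `strip_punctuation` are names chosen here for illustration.

```python
import string

# Translation table that deletes every character listed in string.punctuation
drop_punctuation = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Remove all punctuation characters from text."""
    return text.translate(drop_punctuation)


sample = "reply stop. www.regalportfolio.co.uk."
print(strip_punctuation(sample))
```

Either way, note the side effect visible in the output above: a URL collapses into a single token once its dots are removed.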


In [22]:
# split into words and drop stop words
message=[x for x in message.split() if x not in stopwords]
print(message)

['discount', 'code', 'rp176781', 'stop', 'messages', 'reply', 'stop', 'wwwregalportfoliocouk', 'customer', 'services', '08717205546']


In [23]:
# reduce each word to its stem
message=[Ps.stem(x) for x in message]
print(message)

['discount', 'code', 'rp176781', 'stop', 'messag', 'repli', 'stop', 'wwwregalportfoliocouk', 'custom', 'servic', '08717205546']


In [24]:
print(Counter(message))

Counter({'stop': 2, 'discount': 1, 'code': 1, 'rp176781': 1, 'messag': 1, 'repli': 1, 'wwwregalportfoliocouk': 1, 'custom': 1, 'servic': 1, '08717205546': 1})


Now put these steps together in a function:

In [25]:
def clean_message(message):
    """Clean up a message and return a Counter mapping each stemmed word to its number of occurrences."""
    message=message.lower()
    message=''.join([x for x in message if x not in punctuations])
    message=[x for x in message.split() if x not in stopwords]
    message=[Ps.stem(x) for x in message]
    return Counter(message)

In [26]:
print(data_train.Text[80])
print(clean_message(data_train.Text[80]))

Alright i have a new goal now
Counter({'alright': 1, 'new': 1, 'goal': 1})


Apply the function to the whole training set:

In [27]:
data_train['bag_of_words']=data_train['Text'].apply(clean_message)
data_train.head()

Unnamed: 0,Class,Text,Class_code,bag_of_words
0,spam,SMS. ac Blind Date 4U!: Rodds1 is 21/m from Ab...,1,"{'sm': 2, 'ac': 1, 'blind': 2, 'date': 2, '4u'..."
1,ham,Lol yes. Our friendship is hanging on a thread...,0,"{'lol': 1, 'ye': 1, 'friendship': 1, 'hang': 1..."
2,spam,TheMob> Check out our newest selection of cont...,1,"{'themob': 1, 'check': 1, 'newest': 1, 'select..."
3,ham,Where are the garage keys? They aren't on the ...,0,"{'garag': 1, 'key': 1, 'arent': 1, 'bookshelf'..."
4,ham,Today is ACCEPT DAY..U Accept me as? Brother S...,0,"{'today': 1, 'accept': 2, 'dayu': 1, 'brother'..."


In [28]:
bows=data_train.bag_of_words

In [29]:
bows_ham=data_train[data_train.Class_code==0].bag_of_words
bows_spam=data_train[data_train.Class_code==1].bag_of_words

In [30]:
# vocabulary: every distinct word seen in the training sample (a set keeps membership tests fast)
words=set().union(*bows)

In [31]:
# start every count at 1 (add-one smoothing), then add the occurrences in ham messages
number_of_occurence_ham={key:1 for key in words}
for word in words:
    for bow in bows_ham:
        if word in bow:
            number_of_occurence_ham[word]+=bow[word]

In [32]:
number_of_occurence_ham['soon']

39

In [33]:
# same smoothed counts for spam messages
number_of_occurence_spam={key:1 for key in words}
for word in words:
    for bow in bows_spam:
        if word in bow:
            number_of_occurence_spam[word]+=bow[word]

In [34]:
number_of_occurence_spam['free']

131

Probability of a word given that the text is ham/spam:

In [35]:
# total (smoothed) word counts per class, computed once instead of once per word
total_ham=sum(number_of_occurence_ham.values())
total_spam=sum(number_of_occurence_spam.values())
P_word_h={key:count/total_ham for key,count in number_of_occurence_ham.items()}
P_word_s={key:count/total_spam for key,count in number_of_occurence_spam.items()}
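
The nested loops used for counting visit every (word, message) pair, which gets slow as the vocabulary grows. A possible single-pass alternative is to let `Counter.update` accumulate the per-message counts. The sketch below uses invented toy bags standing in for `bows_ham`, and `smoothed_word_probabilities` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def smoothed_word_probabilities(bags, vocabulary):
    """Return add-one smoothed P(word | class) for the bags of one class."""
    counts = Counter({word: 1 for word in vocabulary})  # every word starts at 1
    for bag in bags:
        counts.update(bag)  # adds this message's word counts
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Invented toy bags, for illustration only
toy_bags = [Counter({"send": 1, "letter": 1}), Counter({"write": 1, "letter": 1})]
toy_vocabulary = {"send", "password", "letter", "write"}

probabilities = smoothed_word_probabilities(toy_bags, toy_vocabulary)
print(probabilities["letter"], probabilities["password"])
```

This assumes every word in the bags also appears in the vocabulary, which holds in the notebook because `words` is built from all training bags.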

Finding the priors

In [36]:
P_h=bows_ham.size/bows.size
P_s=bows_spam.size/bows.size

In [37]:
print(P_s)
print(P_h)

0.1238262910798122
0.8761737089201878


Define the main classifier function

In [38]:
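def classifier(document):
    """Return 1 (spam) or 0 (ham) for a single text message."""
    document_bag_of_words=clean_message(document)
    # likelihoods P(document|ham) and P(document|spam), built word by word
    P_document_h=1
    P_document_s=1
    for key in document_bag_of_words:
        # skip words that never appeared in the training sample
        if key in words:
            P_document_h=P_document_h*P_word_h[key]
            P_document_s=P_document_s*P_word_s[key]
    # multiply by the priors
    P_document_h=P_document_h*P_h
    P_document_s=P_document_s*P_s
    
    # normalize to get the posterior P(ham|document)
    Pr_doc_h_normalized=P_document_h/(P_document_h+P_document_s)
    
    if Pr_doc_h_normalized>0.5:
        return 0
    else:
        return 1
# vectorize so the classifier also accepts an array of messages
classifier=np.vectorize(classifier)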

In [39]:
classifier('congratulations! you won $500')

array(1)

In [40]:
classifier("Let's apply this model to the test sample")

array(0)
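
For long messages, a product of many small probabilities can underflow to zero. A common remedy is to sum log-probabilities instead. The sketch below is a variation, not the notebook's method: `log_posterior_spam` is a hypothetical helper, it is fed the toy probabilities from the worked example, and unlike `classifier` above (which multiplies once per distinct word) it weights each word by its count, as the hand calculation did.

```python
import math
from collections import Counter


def log_posterior_spam(bag, p_word_h, p_word_s, p_h, p_s):
    """Return P(spam | bag) computed from sums of log-probabilities."""
    log_h = math.log(p_h)
    log_s = math.log(p_s)
    for word, count in bag.items():
        if word in p_word_h:  # skip words unseen in training
            log_h += count * math.log(p_word_h[word])
            log_s += count * math.log(p_word_s[word])
    # normalize safely: subtract the larger log before exponentiating
    top = max(log_h, log_s)
    exp_h = math.exp(log_h - top)
    exp_s = math.exp(log_s - top)
    return exp_s / (exp_h + exp_s)


# Smoothed toy probabilities from the worked example at the top
toy_p_word_s = {"send": 2 / 6, "password": 2 / 6, "letter": 1 / 6, "write": 1 / 6}
toy_p_word_h = {"send": 2 / 8, "password": 1 / 8, "letter": 3 / 8, "write": 2 / 8}

bag = Counter(["write", "password", "password", "box"])
p_spam = log_posterior_spam(bag, toy_p_word_h, toy_p_word_s, p_h=2 / 3, p_s=1 / 3)
print(round(p_spam, 3))
```

On the toy message this reproduces the roughly 70% spam probability from the hand calculation.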

In [41]:
prediction=classifier(data_test.Text.values)

In [42]:
T=data_test.Class_code

In [43]:
TP,TN,FP,FN=0,0,0,0
for i in range(len(T)):
    if T[i]==1:
        if prediction[i]==1:
            TP+=1
        if prediction[i]==0:
            FN+=1
    if T[i]==0:
        if prediction[i]==1:
            FP+=1
        if prediction[i]==0:
            TN+=1

Confusion matrix (rows: predicted spam, predicted ham; columns: actual spam, actual ham):

In [44]:
print(np.array([[TP,FP],[FN,TN]]))

[[ 168    8]
 [  19 1266]]
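
The four counts can also be obtained without nested `if` statements by counting (actual, predicted) pairs. This is a sketch with invented labels; `confusion_counts` is a hypothetical helper that treats 1 (spam) as the positive class.

```python
from collections import Counter


def confusion_counts(truth, predicted):
    """Return (TP, TN, FP, FN) with 1 = spam as the positive class."""
    pairs = Counter(zip(truth, predicted))  # keys are (actual, predicted)
    return pairs[(1, 1)], pairs[(0, 0)], pairs[(0, 1)], pairs[(1, 0)]


# Invented labels, for illustration only
truth = [1, 1, 0, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1]
print(confusion_counts(truth, predicted))
```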


In [45]:
precision=TP/(TP+FP)
print("precision=",precision)

precision= 0.9545454545454546


In [46]:
recall=TP/(TP+FN)
print("recall=",recall)

recall= 0.8983957219251337


In [47]:
F1_score=2*precision*recall/(precision+recall)
print("F1_score=",F1_score)

F1_score= 0.9256198347107438


In [48]:
accuracy=(TP+TN)/(TP+FP+FN+TN)
print("accuracy=",accuracy)

accuracy= 0.9815195071868583
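
The four metrics can be gathered into one helper and checked against the counts in the confusion matrix above. The function name `classification_metrics` is made up for this sketch. It also computes the accuracy of a baseline that labels every message as ham, which is useful context because the classes are unbalanced.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1 and accuracy from the four confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Counts taken from the confusion matrix printed above
metrics = classification_metrics(tp=168, tn=1266, fp=8, fn=19)

# Accuracy of a baseline that labels every test message as ham
always_ham_accuracy = (8 + 1266) / (168 + 1266 + 8 + 19)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(f"always-ham accuracy: {always_ham_accuracy:.4f}")
```

Since ham makes up about 87% of the test sample, accuracy alone flatters any classifier here; precision and recall on the spam class say more about how well the filter works.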
