---
# <center> **Introduction to Applied Data Science**

### <center> *Prof. Bahram Mobasher*
#### <center> Teaching Assistant: *[Sina Taamoli](https://sinataamoli.github.io/)* | email: *sina.taamoli@email.ucr.edu*
### <center> **Week 5: Naive Bayes**
---

## Naive Bayes spam filtering

Suppose you are given a data set of text messages, each labeled as ham or spam. We will use a training sample with ~4000 text messages, but first let’s consider a few examples to get familiar with the naive Bayes idea. <br>
<center>

| Class | Message | Bag of words |
| -------- | -------- | -------- |
| Spam   | Send us your password   | send, password   |
| Ham   | I will send you the letter   | send, letter   |
| Ham   | I wrote a letter   | write, letter   |
</center>

We want to compute P(Spam|Bag of words). Last session, we learned from Bayes’ rule: <br>

$P(Spam | \text{Bag of words}) = \frac{P(\text{Bag of words}|Spam)P(Spam)}{P(\text{Bag of words}|Spam)P(Spam)+P(\text{Bag of words}|Ham)P(Ham)}$ <br>

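P(word|spam) and P(word|ham) can be estimated from the training sample. To avoid zero probabilities, we initialize the number of occurrences of every word to 1 in each class, so each smoothed estimate is (count + 1)/(total words in the class + 4), where 4 is the number of distinct words in the training sample. Note that the priors are $P(ham)=\frac{2}{3}$ and $P(spam)=\frac{1}{3}$. <br>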


<center>

| Word | Spam | Ham | Spam (i=1) | Ham (i=1) |
| -------- | -------- | -------- | -------- | -------- |
| send   | $\frac{1}{2}$   | $\frac{1}{4}$   | $\frac{1+1}{2+4}$ | $\frac{1+1}{4+4}$ |
| password   | $\frac{1}{2}$   | $\frac{0}{4}$   | $\frac{1+1}{2+4}$ | $\frac{0+1}{4+4}$ |
| letter   | $\frac{0}{2}$   | $\frac{2}{4}$   | $\frac{0+1}{2+4}$ | $\frac{2+1}{4+4}$ |
| write   | $\frac{0}{2}$   | $\frac{1}{4}$   | $\frac{0+1}{2+4}$ | $\frac{1+1}{4+4}$ |

</center> <br>

Now, consider a new text message “write your password in the password box”. The word “box” does not appear in our training sample, so the safe choice is to remove it from the bag of words and base the decision on the other two words, “write” and “password”. Note that “password” occurs twice, so its probability enters the product twice. <br>

$P(spam|write,password,password) = \frac{P(write|spam)P(password|spam)P(password|spam)P(spam)}{P(write|spam)P(password|spam)P(password|spam)P(spam)+P(write|ham)P(password|ham)P(password|ham)P(ham)}$ <br>

$P(spam|write,password,password) = \frac{\frac{1}{6} \times \frac{2}{6} \times \frac{2}{6} \times \frac{1}{3}}{\frac{1}{6} \times \frac{2}{6} \times \frac{2}{6} \times \frac{1}{3} + \frac{2}{8} \times \frac{1}{8} \times \frac{1}{8} \times \frac{2}{3}} \sim 70\%$ <br>

and $P(ham|write,password,password) = 1 - P(spam|write,password,password) \sim 30\%$, so we classify this text message as spam. This was just a demonstration of the naive Bayes method. Let’s use a large data set to build a model and evaluate its performance.
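The arithmetic of this toy example can be checked with a few lines of standard-library Python. This is only a sketch of the calculation above, not the model built in the rest of the notebook; all `toy_` names are hypothetical and chosen so they do not clash with variables defined later.

```python
from collections import Counter
from fractions import Fraction

# Toy training sample from the table above, already reduced to bags of words.
toy_training = [
    ("spam", ["send", "password"]),
    ("ham", ["send", "letter"]),
    ("ham", ["write", "letter"]),
]

toy_vocabulary = {word for _, bag in toy_training for word in bag}
toy_counts = {"spam": Counter(), "ham": Counter()}
toy_messages = Counter()
for label, bag in toy_training:
    toy_counts[label].update(bag)  # word occurrences per class
    toy_messages[label] += 1       # messages per class, used for the priors

def toy_word_prob(word, label):
    """P(word | label) with every word count initialized to 1."""
    total = sum(toy_counts[label].values())
    return Fraction(toy_counts[label][word] + 1, total + len(toy_vocabulary))

def toy_posterior_spam(words):
    """P(spam | words); words outside the training vocabulary are ignored."""
    score = {}
    for label in ("spam", "ham"):
        p = Fraction(toy_messages[label], len(toy_training))  # prior
        for word in words:
            if word in toy_vocabulary:
                p *= toy_word_prob(word, label)
        score[label] = p
    return score["spam"] / (score["spam"] + score["ham"])

toy_result = toy_posterior_spam(["write", "password", "password", "box"])
print(toy_result, float(toy_result))  # 64/91, about 0.70
```

Using `Fraction` keeps the result exact, which makes it easy to compare against the hand calculation.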

In [25]:
import numpy as np
import pandas as pd
from collections import Counter

NLTK (Natural Language Toolkit) is a set of libraries for Natural Language Processing (NLP).

In [29]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sinataamoli/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Stop words are the most common words in a language; they carry little information, so we will filter them out before processing the text.

In [6]:
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


A word can have many variations with the same meaning, so we will use NLTK's stem package to reduce each word to a common stem.

In [7]:
from nltk.stem import PorterStemmer
Ps = PorterStemmer()
Ps.stem('cook'), Ps.stem('cooking'), Ps.stem('cooked')

('cook', 'cook', 'cook')

We also need to remove punctuation marks, since they are not informative for our classification.

In [27]:
import string
punctuations = string.punctuation
print(punctuations)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Let’s load the data:

In [28]:
data = pd.read_csv('spam.csv')
data.head()

Unnamed: 0,Class,Text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Convert the categorical labels into numbers that the code can process (ham = 0, spam = 1):

In [33]:
data['Class_code'] = pd.get_dummies(data.Class,drop_first=True)
data.head()

Unnamed: 0,Class,Text,Class_code
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [34]:
def train_test_split(dataframe, test_size = 0.3, rs = None):
    """
    Take a pandas dataframe and split it randomly into train and test samples.
    """ 
    dataframe_test = dataframe.sample(frac = test_size, random_state=rs)
    dataframe_train = dataframe.loc[dataframe.index.difference(dataframe_test.index)]
    return (dataframe_train.reset_index(drop=True), dataframe_test.reset_index(drop=True))
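The logic of this split can be illustrated without pandas. The sketch below is a hypothetical stand-in (the name `split_indices` is not part of the notebook): it draws a random subset of row indices for the test sample and keeps the remaining indices for training, which is the same idea as `sample` followed by `index.difference`.

```python
import random

def split_indices(n_rows, test_size=0.3, seed=None):
    """Return (train_indices, test_indices) as sorted, disjoint lists."""
    rng = random.Random(seed)               # seeded generator for reproducibility
    n_test = round(n_rows * test_size)      # number of rows held out for testing
    test = set(rng.sample(range(n_rows), n_test))
    train = [i for i in range(n_rows) if i not in test]
    return train, sorted(test)

train_idx, test_idx = split_indices(10, test_size=0.3, seed=4)
print(len(train_idx), len(test_idx))  # 7 3
```

Fixing the seed plays the same role as the `rs` argument above: the same rows end up in the test sample on every run.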

In [36]:
data_train, data_test = train_test_split(data, test_size = 0.3, rs = 4)

In [37]:
data_train.head()

Unnamed: 0,Class,Text,Class_code
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,ham,U dun say so early hor... U c already then say...,0
3,spam,FreeMsg Hey there darling it's been 3 week's n...,1
4,ham,Even my brother is not like to speak with me. ...,0


In [38]:
data_test.head()

Unnamed: 0,Class,Text,Class_code
0,ham,No problem. Talk to you later,0
1,ham,"No idea, I guess we'll work that out an hour a...",0
2,ham,"Em, its olowoyey@ usc.edu have a great time in...",0
3,ham,I'm in a movie... Collect car oredi...,0
4,ham,"Sorry man, accidentally left my phone on silen...",0


Cleaning up one of the text messages as an example:

In [39]:
message = data_train.Text[46]
print(message)

Your gonna have to pick up a $1 burger for yourself on your way home. I can't even move. Pain is killing me.


In [40]:
message = ''.join([x for x in message if x not in punctuations])
print(message)

Your gonna have to pick up a 1 burger for yourself on your way home I cant even move Pain is killing me
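As an aside, the same punctuation removal can be written with `str.translate`, which avoids the character-by-character comprehension. This is only an alternative sketch; the variable names are illustrative.

```python
import string

# Translation table that deletes every character in string.punctuation.
strip_punctuation = str.maketrans("", "", string.punctuation)

raw = ("Your gonna have to pick up a $1 burger for yourself on your way home. "
       "I can't even move. Pain is killing me.")
stripped = raw.translate(strip_punctuation)
print(stripped)
```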


In [41]:
message.split()

['Your',
 'gonna',
 'have',
 'to',
 'pick',
 'up',
 'a',
 '1',
 'burger',
 'for',
 'yourself',
 'on',
 'your',
 'way',
 'home',
 'I',
 'cant',
 'even',
 'move',
 'Pain',
 'is',
 'killing',
 'me']

In [42]:
message = [x for x in message.split() if x not in stopwords]
print(message)

['Your', 'gonna', 'pick', '1', 'burger', 'way', 'home', 'I', 'cant', 'even', 'move', 'Pain', 'killing']


In [43]:
message=[Ps.stem(x) for x in message]
print(message)

['your', 'gonna', 'pick', '1', 'burger', 'way', 'home', 'i', 'cant', 'even', 'move', 'pain', 'kill']


In [44]:
print(Counter(message))

Counter({'your': 1, 'gonna': 1, 'pick': 1, '1': 1, 'burger': 1, 'way': 1, 'home': 1, 'i': 1, 'cant': 1, 'even': 1, 'move': 1, 'pain': 1, 'kill': 1})


Now put these steps together in a function:

In [21]:
def clean_message(message):
    """
    Clean up a message and return a Counter mapping each stemmed word to its number of occurrences.
    """
    message = message.lower()  # lowercase first so the lowercase stopword list matches
    message = ''.join([x for x in message if x not in punctuations])  # strip punctuation
    message = [x for x in message.split() if x not in stopwords]  # drop stopwords
    message = [Ps.stem(x) for x in message]  # reduce each word to its stem
    return Counter(message)

In [22]:
print(data_train.Text[80])
print(clean_message(data_train.Text[80]))

What is the plural of the noun research?
Counter({'plural': 1, 'noun': 1, 'research': 1})
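The function lowercases the message before filtering stop words, and the order matters: in the step-by-step example, 'Your' and 'I' survived the stopword filter because NLTK's list is lowercase. The sketch below illustrates this with a deliberately tiny, hypothetical stopword set and no stemming; `toy_clean` and `toy_stopwords` are illustrative names only.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stopword list (NLTK's English list is much longer).
toy_stopwords = {"your", "i", "is", "the", "of", "what", "a", "to"}

def toy_clean(message, lowercase_first=True):
    """Strip punctuation, drop stopwords, and count tokens (no stemming in this sketch)."""
    if lowercase_first:
        message = message.lower()
    message = "".join(ch for ch in message if ch not in string.punctuation)
    kept = [tok for tok in message.split() if tok not in toy_stopwords]
    return Counter(tok.lower() for tok in kept)

text = "Your pain is killing me. I can't move."
with_lowercasing = toy_clean(text)
without_lowercasing = toy_clean(text, lowercase_first=False)
print(sorted(with_lowercasing))     # ['cant', 'killing', 'me', 'move', 'pain']
print(sorted(without_lowercasing))  # ['cant', 'i', 'killing', 'me', 'move', 'pain', 'your']
```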


Apply the function to the whole training set:

In [45]:
data_train['bag_of_words'] = data_train['Text'].apply(clean_message)
data_train.head()

Unnamed: 0,Class,Text,Class_code,bag_of_words
0,ham,"Go until jurong point, crazy.. Available only ...",0,"{'go': 1, 'jurong': 1, 'point': 1, 'crazi': 1,..."
1,ham,Ok lar... Joking wif u oni...,0,"{'ok': 1, 'lar': 1, 'joke': 1, 'wif': 1, 'u': ..."
2,ham,U dun say so early hor... U c already then say...,0,"{'u': 2, 'dun': 1, 'say': 2, 'earli': 1, 'hor'..."
3,spam,FreeMsg Hey there darling it's been 3 week's n...,1,"{'freemsg': 1, 'hey': 1, 'darl': 1, '3': 1, 'w..."
4,ham,Even my brother is not like to speak with me. ...,0,"{'even': 1, 'brother': 1, 'like': 2, 'speak': ..."


In [46]:
bows = data_train.bag_of_words
bows

0       {'go': 1, 'jurong': 1, 'point': 1, 'crazi': 1,...
1       {'ok': 1, 'lar': 1, 'joke': 1, 'wif': 1, 'u': ...
2       {'u': 2, 'dun': 1, 'say': 2, 'earli': 1, 'hor'...
3       {'freemsg': 1, 'hey': 1, 'darl': 1, '3': 1, 'w...
4       {'even': 1, 'brother': 1, 'like': 2, 'speak': ...
                              ...                        
3613    {'ard': 1, '6': 1, 'like': 1, 'dat': 1, 'lor': 1}
3614    {'remind': 1, 'o2': 1, 'get': 1, '250': 1, 'po...
3615    {'2nd': 1, 'time': 1, 'tri': 1, '2': 2, 'conta...
3616     {'piti': 1, 'mood': 1, 'soani': 1, 'suggest': 1}
3617                    {'rofl': 1, 'true': 1, 'name': 1}
Name: bag_of_words, Length: 3618, dtype: object

In [47]:
bows_ham = data_train[data_train.Class_code==0].bag_of_words
bows_spam = data_train[data_train.Class_code==1].bag_of_words

In [49]:
bows_spam

3       {'freemsg': 1, 'hey': 1, 'darl': 1, '3': 1, 'w...
6       {'mobil': 3, '11': 1, 'month': 1, 'u': 1, 'r':...
8       {'six': 1, 'chanc': 1, 'win': 1, 'cash': 1, '1...
9       {'urgent': 1, '1': 1, 'week': 1, 'free': 1, 'm...
14      {'england': 2, 'v': 1, 'macedonia': 1, 'dont':...
                              ...                        
3590    {'privat': 1, '2003': 1, 'account': 1, 'statem...
3597    {'want': 1, 'explicit': 1, 'sex': 1, '30': 1, ...
3602    {'contract': 1, 'mobil': 1, '11': 1, 'mnth': 1...
3614    {'remind': 1, 'o2': 1, 'get': 1, '250': 1, 'po...
3615    {'2nd': 1, 'time': 1, 'tri': 1, '2': 2, 'conta...
Name: bag_of_words, Length: 474, dtype: object

In [53]:
words = list(set().union(*bows))
words

['hotel',
 'veggi',
 'condit',
 'close',
 'spec',
 'curri',
 'pay',
 '08714714011',
 '12hr',
 'camcord',
 'newscast',
 'note',
 'hypotheticalhuagauahahuagahyuhagga',
 'cornwal',
 'biola',
 'ami',
 'futur',
 '08002986906',
 'texa',
 'ltdå£150mtmsgrcvd18',
 'tonight',
 'pleasur',
 'e',
 'whatsup',
 'lux',
 'xafter',
 'asapok',
 'm263uz',
 '5iåõm',
 'breaker',
 '2optout',
 'agocusoon',
 'avenu',
 'urself',
 'strongli',
 'freemsgfav',
 'nbme',
 'way',
 'holi',
 'vth',
 'eh',
 'go',
 'goodnight',
 '1225',
 'iz',
 'page',
 'ltgt',
 'in2',
 'mudyadhu',
 'vid',
 'gsoh',
 'tcrw1',
 'notif',
 '0776xxxxxxx',
 'selfish',
 'freesend',
 'bt',
 'himthen',
 'jackpot',
 'prayingwil',
 'shake',
 'leu',
 'it\x89û÷',
 'ie',
 'outif',
 '150ptext',
 'ipod',
 'tech',
 'tarpon',
 'rd',
 'fine',
 'diwali',
 'å£3350',
 'ahwhat',
 'con',
 'cream',
 'flirt',
 '30apr',
 '83222',
 'checkmat',
 'pressi',
 'pocay',
 'santha',
 'brah',
 'expiredso',
 'knee',
 'wound',
 'texd',
 'c',
 'å£10',
 'cashbincouk',
 'usualiam

In [54]:
len(words)

6535

In [55]:
number_of_occurence_ham = {key: 1 for key in words}  # initialize every word's count to 1 to avoid zero probabilities
for word in words:
    for bow in bows_ham:
        if word in bow:
            number_of_occurence_ham[word] += bow[word]

In [56]:
number_of_occurence_ham['soon']

42

In [57]:
number_of_occurence_spam = {key: 1 for key in words}  # same initialization for the spam class
for word in words:
    for bow in bows_spam:
        if word in bow:
            number_of_occurence_spam[word] += bow[word]

In [58]:
number_of_occurence_spam['free']

143
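The nested loops above visit every (word, message) pair. A single pass over the bags gives the same counts, because `Counter.update` adds per-word counts. The helper below is a hypothetical sketch (`smoothed_counts` is not defined elsewhere in the notebook), checked here on the ham messages of the toy example from the start of this section.

```python
from collections import Counter

def smoothed_counts(bags, vocabulary, initial=1):
    """Total occurrences of each vocabulary word across bags, starting from `initial`."""
    totals = Counter()
    for bag in bags:
        totals.update(bag)  # adds this message's word counts to the running totals
    return {word: totals[word] + initial for word in vocabulary}

toy_ham_bags = [Counter({"send": 1, "letter": 1}), Counter({"write": 1, "letter": 1})]
toy_vocab = ["send", "password", "letter", "write"]
toy_ham_counts = smoothed_counts(toy_ham_bags, toy_vocab)
print(toy_ham_counts)  # {'send': 2, 'password': 1, 'letter': 3, 'write': 2}
```

These are the numerators of the smoothed ham column in the toy table.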

Probability of a word given that the text is ham/spam

In [62]:
number_of_occurence_ham

{'hotel': 4,
 'veggi': 2,
 'condit': 2,
 'close': 14,
 'spec': 2,
 'curri': 3,
 'pay': 29,
 '08714714011': 1,
 '12hr': 1,
 'camcord': 1,
 'newscast': 2,
 'note': 3,
 'hypotheticalhuagauahahuagahyuhagga': 2,
 'cornwal': 2,
 'biola': 3,
 'ami': 1,
 'futur': 5,
 '08002986906': 1,
 'texa': 2,
 'ltdå£150mtmsgrcvd18': 1,
 'tonight': 42,
 'pleasur': 5,
 'e': 57,
 'whatsup': 2,
 'lux': 1,
 'xafter': 1,
 'asapok': 2,
 'm263uz': 1,
 '5iåõm': 2,
 'breaker': 1,
 '2optout': 1,
 'agocusoon': 2,
 'avenu': 2,
 'urself': 6,
 'strongli': 2,
 'freemsgfav': 1,
 'nbme': 2,
 'way': 64,
 'holi': 2,
 'vth': 3,
 'eh': 11,
 'go': 280,
 'goodnight': 7,
 '1225': 1,
 'iz': 3,
 'page': 6,
 'ltgt': 167,
 'in2': 2,
 'mudyadhu': 2,
 'vid': 1,
 'gsoh': 1,
 'tcrw1': 1,
 'notif': 1,
 '0776xxxxxxx': 1,
 'selfish': 2,
 'freesend': 1,
 'bt': 8,
 'himthen': 2,
 'jackpot': 1,
 'prayingwil': 2,
 'shake': 3,
 'leu': 2,
 'it\x89û÷': 2,
 'ie': 3,
 'outif': 2,
 '150ptext': 1,
 'ipod': 1,
 'tech': 3,
 'tarpon': 2,
 'rd': 4,
 'fine'

In [63]:
# Normalize the counts so that the word probabilities of each class sum to 1.
total_ham = sum(number_of_occurence_ham.values())
total_spam = sum(number_of_occurence_spam.values())
P_word_h = {key: count / total_ham for key, count in number_of_occurence_ham.items()}
P_word_s = {key: count / total_spam for key, count in number_of_occurence_spam.items()}

Finding the priors

In [65]:
P_h=bows_ham.size/bows.size
P_s=bows_spam.size/bows.size

In [66]:
print(P_s)
print(P_h)

0.1310116086235489
0.8689883913764511


In [68]:
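def classifier(document):
    """Return 1 (spam) or 0 (ham) for a single text message."""
    document_bag_of_words=clean_message(document)
    P_document_h=1
    P_document_s=1
    # Note: each distinct word contributes one factor here, however often it occurs in the message.
    for key in document_bag_of_words:
        if key in P_word_h:  # skip words that never appeared in the training sample
            P_document_h=P_document_h*P_word_h[key]
            P_document_s=P_document_s*P_word_s[key]
    # Multiply by the priors, then normalize to get P(ham | document).
    P_document_h=P_document_h*P_h
    P_document_s=P_document_s*P_s
    Pr_doc_h_normalized=P_document_h/(P_document_h+P_document_s)
    if Pr_doc_h_normalized>0.5:
        return 0
    else:
        return 1
classifier=np.vectorize(classifier)  # lets the classifier accept an array of messages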
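Multiplying many small probabilities can underflow to zero for long messages. A common remedy is to add logarithms instead. The sketch below is a hypothetical variant, not the classifier used for the results in this notebook: it also raises each word's probability to its number of occurrences, as in the toy example at the start of this section. The function and variable names are assumptions made for illustration.

```python
import math

def posterior_spam_log(bag, p_word_spam, p_word_ham, p_spam, p_ham):
    """P(spam | bag), accumulated as sums of logs instead of products."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word, count in bag.items():
        if word in p_word_spam:  # ignore words unseen in training
            log_spam += count * math.log(p_word_spam[word])
            log_ham += count * math.log(p_word_ham[word])
    # P(spam | bag) = 1 / (1 + exp(log_ham - log_spam)), evaluated without overflow.
    diff = log_ham - log_spam
    if diff > 0:
        e = math.exp(-diff)
        return e / (1 + e)
    return 1 / (1 + math.exp(diff))

# Smoothed probabilities and priors from the toy example.
toy_p_word_spam = {"send": 2/6, "password": 2/6, "letter": 1/6, "write": 1/6}
toy_p_word_ham = {"send": 2/8, "password": 1/8, "letter": 3/8, "write": 2/8}
toy_bag = {"write": 1, "password": 2, "box": 1}
toy_log_result = posterior_spam_log(toy_bag, toy_p_word_spam, toy_p_word_ham, 1/3, 2/3)
print(round(toy_log_result, 4))  # 0.7033
```

The result agrees with the roughly 70% obtained by hand for “write your password in the password box”.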

In [69]:
classifier('congratulations! you won $500')

array(1)

In [70]:
classifier("Let's apply this model to the test sample")

array(0)

In [71]:
prediction = classifier(data_test.Text.values)

In [74]:
prediction

array([0, 0, 0, ..., 0, 0, 0])

In [76]:
T = data_test.Class_code
T

0       0
1       0
2       0
3       0
4       0
       ..
1546    0
1547    0
1548    0
1549    0
1550    0
Name: Class_code, Length: 1551, dtype: uint8

In [77]:
TP, TN, FP, FN = 0, 0, 0, 0
for i in range(len(T)):
    if T[i]==1:
        if prediction[i]==1:
            TP+=1
        if prediction[i]==0:
            FN+=1
    if T[i]==0:
        if prediction[i]==1:
            FP+=1
        if prediction[i]==0:
            TN+=1
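The four counters can also be obtained by tallying (actual, predicted) pairs in one pass. The sketch below uses hypothetical toy labels rather than the test sample, with 1 = spam as the positive class; `confusion_counts` is an illustrative name.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) with 1 = spam (positive) and 0 = ham (negative)."""
    pairs = Counter(zip(actual, predicted))
    return pairs[(1, 1)], pairs[(0, 1)], pairs[(1, 0)], pairs[(0, 0)]

toy_actual = [1, 1, 0, 0, 0, 1]
toy_predicted = [1, 0, 0, 1, 0, 1]
print(confusion_counts(toy_actual, toy_predicted))  # (2, 1, 1, 2)
```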

Confusion matrix (rows: predicted spam, predicted ham; columns: actual spam, actual ham)

In [78]:
print(np.array([[TP,FP],[FN,TN]]))

[[ 158    9]
 [  21 1363]]


In [79]:
precision=TP/(TP+FP)
print("precision=",precision)

precision= 0.9461077844311377


In [80]:
recall=TP/(TP+FN)
print("recall=",recall)

recall= 0.88268156424581


In [81]:
F1_score=2*precision*recall/(precision+recall)
print("F1_score=",F1_score)

F1_score= 0.9132947976878613


In [82]:
accuracy=(TP+TN)/(TP+FP+FN+TN)
print("accuracy=",accuracy)

accuracy= 0.9806576402321083
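The four metrics can be collected in one helper, which also makes it easy to compare against a trivial baseline. The sketch below reuses the counts from the confusion matrix above; the function name is hypothetical, and it assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Counts taken from the confusion matrix printed above.
metrics = classification_metrics(tp=158, fp=9, fn=21, tn=1363)
print(metrics)

# Baseline: label every message as ham. All 9 + 1363 ham messages would be correct.
always_ham_accuracy = (9 + 1363) / (158 + 9 + 21 + 1363)
print(round(always_ham_accuracy, 3))  # 0.885
```

Because ham dominates the test sample, always predicting ham would already reach about 88% accuracy, so precision, recall and F1 say more about the spam class than accuracy alone.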
