---
# <center> **Introduction to Applied Data Science**

### <center> *Prof. Bahram Mobasher*
#### <center> Teaching Assistant: *[Sina Taamoli](https://sinataamoli.github.io/)* | email: *sina.taamoli@email.ucr.edu*
### <center> **Week 6: Naive Bayes**
---

## Naive Bayes spam filtering

Suppose you are given a data set of text messages, each labeled as ham or spam. We will use a training sample of ~4000 text messages, but first let’s consider a few examples to get familiar with the naive Bayes idea. <br>
<center>

| Class | Message | Bag of words |
| -------- | -------- | -------- |
| Spam   | Send us your password   | send, password   |
| Ham   | I will send you the letter   | send, letter   |
| Ham   | I wrote a letter   | write, letter   |
</center>

We want to compute P(Spam|Bag of words). From Bayes’ rule, which we covered last session: <br>

$P(Spam | \text{Bag of words}) = \frac{P(\text{Bag of words}|Spam)P(Spam)}{P(\text{Bag of words}|Spam)P(Spam)+P(\text{Bag of words}|Ham)P(Ham)}$ <br>

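P(word|spam) and P(word|ham) can be estimated from the training sample. To avoid zero probabilities, we start the count of every word at 1 in each class (add-one, or Laplace, smoothing). This adds the vocabulary size (4 words here) to each denominator: spam contains 2 words in total and ham contains 4. Note that the priors are P(ham)=2/3 and P(spam)=1/3. <br>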


<center>

| $P(\text{word} \mid Spam)$ | $P(\text{word} \mid Ham)$ | Word | $P(\text{word} \mid Spam)$, counts start at 1 | $P(\text{word} \mid Ham)$, counts start at 1 |
| -------- | -------- | -------- | -------- | -------- |
| $\frac{1}{2}$   | $\frac{1}{4}$   | send   | $\frac{1+1}{2+4}$ | $\frac{1+1}{4+4}$ |
| $\frac{1}{2}$   | $\frac{0}{4}$   | password   | $\frac{1+1}{2+4}$ | $\frac{0+1}{4+4}$ |
| $\frac{0}{2}$   | $\frac{2}{4}$   | letter   | $\frac{0+1}{2+4}$ | $\frac{2+1}{4+4}$ |
| $\frac{0}{2}$   | $\frac{1}{4}$   | write   | $\frac{0+1}{2+4}$ | $\frac{1+1}{4+4}$ |

</center> <br>
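The last two columns of the table (counts started at 1) can be reproduced with a few lines of Python. This is a minimal sketch over the three toy messages only; the names `toy_corpus` and `smoothed_likelihoods` are illustrative and are not used in the rest of the notebook.

```python
from collections import Counter
from fractions import Fraction

# Toy training set from the first table: (class, bag of words)
toy_corpus = [
    ("spam", ["send", "password"]),
    ("ham", ["send", "letter"]),
    ("ham", ["write", "letter"]),
]

# Vocabulary: every distinct word seen in training
vocabulary = sorted({word for _, bag in toy_corpus for word in bag})

def smoothed_likelihoods(label):
    """Return P(word | label) with every word count started at 1."""
    counts = Counter(word for cls, bag in toy_corpus if cls == label for word in bag)
    total = sum(counts.values()) + len(vocabulary)  # e.g. 2 + 4 for spam
    return {word: Fraction(counts[word] + 1, total) for word in vocabulary}

P_word_spam = smoothed_likelihoods("spam")
P_word_ham = smoothed_likelihoods("ham")

for word in vocabulary:
    print(word, P_word_spam[word], P_word_ham[word])
```

`Fraction` prints values in lowest terms, so $\frac{2}{6}$ appears as `1/3`. Within each class the smoothed probabilities still sum to 1.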

Now, consider a new text message: “write your password in the password box”. We don’t have the word “box” in our training sample, so the safe choice is to remove it from the bag of words and make the decision based on the other two words, “write” and “password”. Note that “password” occurs twice, so its probability enters the product twice. <br>

$P(spam|write,password,password) = \frac{P(write|spam)P(password|spam)P(password|spam)P(spam)}{P(write|spam)P(password|spam)P(password|spam)P(spam)+P(write|ham)P(password|ham)P(password|ham)P(ham)}$ <br>

$P(spam|write,password,password) = \frac{\frac{1}{6} \times \frac{2}{6} \times \frac{2}{6} \times \frac{1}{3}}{\frac{1}{6} \times \frac{2}{6} \times \frac{2}{6} \times \frac{1}{3} + \frac{2}{8} \times \frac{1}{8} \times \frac{1}{8} \times \frac{2}{3}} \approx 70\%$ <br>

and $P(ham|write,password,password) = 1 - P(spam|write,password,password) \approx 30\%$, so we classify this message as spam. This was just a demonstration of the naive Bayes method. Let’s use a larger data set to build a model and evaluate its performance.
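Before moving on to the real data, the arithmetic of the toy example can be checked with a short sketch. It hard-codes the likelihoods from the table (with counts started at 1), and the helper name `posterior_spam` is illustrative only.

```python
from fractions import Fraction

# Likelihoods copied from the table (every word count started at 1)
P_word_spam = {"send": Fraction(2, 6), "password": Fraction(2, 6),
               "letter": Fraction(1, 6), "write": Fraction(1, 6)}
P_word_ham = {"send": Fraction(2, 8), "password": Fraction(1, 8),
              "letter": Fraction(3, 8), "write": Fraction(2, 8)}
P_spam, P_ham = Fraction(1, 3), Fraction(2, 3)

def posterior_spam(bag):
    """P(spam | bag) from Bayes' rule, treating words as independent."""
    joint_spam, joint_ham = P_spam, P_ham
    for word in bag:
        if word not in P_word_spam:  # unseen word such as "box": drop it
            continue
        joint_spam *= P_word_spam[word]
        joint_ham *= P_word_ham[word]
    return joint_spam / (joint_spam + joint_ham)

p = posterior_spam(["write", "password", "password", "box"])
print(p, float(p))
```

The exact result is $\frac{64}{91} \approx 0.703$, in agreement with the 70% found by hand.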

In [1]:
import numpy as np
import pandas as pd
from collections import Counter

NLTK (Natural Language Toolkit) is a set of libraries for Natural Language Processing (NLP).

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sinataamoli/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Stop words are the most common words in a language, and they carry little information. We will filter them out before building the bag of words.

In [3]:
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [45]:
stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

A word can have many variations with the same meaning, so we will use NLTK’s stem package to reduce each word to a common stem.

In [4]:
from nltk.stem import PorterStemmer
Ps = PorterStemmer()
Ps.stem('cook'), Ps.stem('cooking'), Ps.stem('cooked')

('cook', 'cook', 'cook')

We also need to remove punctuation, since it is not informative for our classification.

In [5]:
import string
punctuations = string.punctuation
print(punctuations)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Let’s load the data:

In [6]:
data = pd.read_csv('spam.csv')
data.head()

Unnamed: 0,Class,Text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Convert the categorical class labels into numbers (ham = 0, spam = 1) that the code can process:

In [8]:
data.Class

0        ham
1        ham
2       spam
3        ham
4        ham
        ... 
5164    spam
5165     ham
5166     ham
5167     ham
5168     ham
Name: Class, Length: 5169, dtype: object

In [7]:
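# dtype=int keeps the 0/1 encoding on pandas versions where get_dummies returns booleans by default
data['Class_code'] = pd.get_dummies(data.Class, drop_first=True, dtype=int)['spam']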
data.head()

Unnamed: 0,Class,Text,Class_code
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [31]:
def train_test_split(dataframe, test_size = 0.3, rs = None):
    """
    Split a pandas DataFrame into train and test samples.
    test_size is the fraction of rows placed in the test sample; rs is the random seed.
    """
    dataframe_test = dataframe.sample(frac = test_size, random_state = rs)
    dataframe_train = dataframe.loc[dataframe.index.difference(dataframe_test.index)]
    return (dataframe_train.reset_index(drop=True), dataframe_test.reset_index(drop=True))

In [38]:
data_train, data_test = train_test_split(data, test_size = 0.3, rs = 3)

In [39]:
data_train.head()

Unnamed: 0,Class,Text,Class_code
0,ham,Ok lar... Joking wif u oni...,0
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
2,ham,U dun say so early hor... U c already then say...,0
3,ham,"Nah I don't think he goes to usf, he lives aro...",0
4,spam,FreeMsg Hey there darling it's been 3 week's n...,1


In [40]:
data_test.head()

Unnamed: 0,Class,Text,Class_code
0,ham,"Just looked it up and addie goes back Monday, ...",0
1,ham,You best watch what you say cause I get drunk ...,0
2,ham,Me i'm not workin. Once i get job...,0
3,ham,Yar lor... How u noe? U used dat route too?,0
4,ham,"Under the sea, there lays a rock. In the rock,...",0


Cleaning up one of the text messages as an example:

In [41]:
message = data_train.Text[46]
print(message)

Wah lucky man... Then can save money... Hee...


In [42]:
message = ''.join([x for x in message if x not in punctuations])
print(message)

Wah lucky man Then can save money Hee


In [44]:
type(message)

str

In [43]:
message.split()

['Wah', 'lucky', 'man', 'Then', 'can', 'save', 'money', 'Hee']

In [46]:
message = [x for x in message.split() if x not in stopwords]
print(message)

['Wah', 'lucky', 'man', 'Then', 'save', 'money', 'Hee']


In [47]:
message=[Ps.stem(x) for x in message]
print(message)

['wah', 'lucki', 'man', 'then', 'save', 'money', 'hee']


In [48]:
print(Counter(message))

Counter({'wah': 1, 'lucki': 1, 'man': 1, 'then': 1, 'save': 1, 'money': 1, 'hee': 1})


Now put these steps together in a function:

In [49]:
def clean_message(message):
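    """
    Clean up a message and return its bag of words: a Counter mapping
    each stemmed word to its number of occurrences in the message
    """
    message = message.lower()  # lowercase so that 'THE' matches the stop word 'the'
    message = ''.join([x for x in message if x not in punctuations])  # strip punctuation
    message = [x for x in message.split() if x not in stopwords]  # drop stop words
    message = [Ps.stem(x) for x in message]  # reduce each word to its stem

    return Counter(message)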

In [51]:
print(data_train.Text[10])
print(clean_message(data_train.Text[10]))

I HAVE A DATE ON SUNDAY WITH WILL!!
Counter({'date': 1, 'sunday': 1})


Apply the function to the whole training set:

In [52]:
data_train['bag_of_words'] = data_train['Text'].apply(clean_message)
data_train.head()

Unnamed: 0,Class,Text,Class_code,bag_of_words
0,ham,Ok lar... Joking wif u oni...,0,"{'ok': 1, 'lar': 1, 'joke': 1, 'wif': 1, 'u': ..."
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1,"{'free': 1, 'entri': 2, '2': 1, 'wkli': 1, 'co..."
2,ham,U dun say so early hor... U c already then say...,0,"{'u': 2, 'dun': 1, 'say': 2, 'earli': 1, 'hor'..."
3,ham,"Nah I don't think he goes to usf, he lives aro...",0,"{'nah': 1, 'dont': 1, 'think': 1, 'goe': 1, 'u..."
4,spam,FreeMsg Hey there darling it's been 3 week's n...,1,"{'freemsg': 1, 'hey': 1, 'darl': 1, '3': 1, 'w..."


In [None]:
bows = data_train.bag_of_words
bows

In [None]:
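# One possible completion: split the bags of words by class (0 = ham, 1 = spam)
bows_ham = bows[data_train.Class_code == 0]
bows_spam = bows[data_train.Class_code == 1]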

In [None]:
bows_spam

In [None]:
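# One possible completion: the vocabulary is every distinct word in the training set
words = {word for bow in bows for word in bow}
words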

In [None]:
len(words)

In [None]:
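number_of_occurence_ham = {key:1 for key in words} # Initialize every word with occurrence = 1
# One possible completion: add the counts from every ham message
for bow in bows_ham:
    for word, count in bow.items():
        number_of_occurence_ham[word] += count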

In [None]:
number_of_occurence_ham['soon']

In [None]:
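number_of_occurence_spam = {key:1 for key in words} # Initialize every word with occurrence = 1
# One possible completion: add the counts from every spam message
for bow in bows_spam:
    for word, count in bow.items():
        number_of_occurence_spam[word] += count
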

In [None]:
number_of_occurence_spam['free']

Probability of a word given that the text is ham/spam

In [None]:
number_of_occurence_ham

In [None]:
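P_word_h={}
P_word_s={}
# One possible completion. Each total equals (words in the class) + (vocabulary size),
# because every count was started at 1, as in the toy example
total_ham = sum(number_of_occurence_ham.values())
total_spam = sum(number_of_occurence_spam.values())
for key in number_of_occurence_ham:
    P_word_h[key] = number_of_occurence_ham[key] / total_ham
for key in number_of_occurence_spam:
    P_word_s[key] = number_of_occurence_spam[key] / total_spam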

Finding the priors

In [None]:
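# One possible completion: priors are the fractions of ham and spam messages in the training sample
P_h = (data_train.Class_code == 0).mean()
P_s = (data_train.Class_code == 1).mean()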

In [None]:
print(P_s)
print(P_h)

In [None]:
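def classifier(document):
    """
    One possible completion: return 1 (spam) or 0 (ham) for a raw text message.
    Log-probabilities are summed instead of multiplying probabilities,
    which gives the same decision but does not underflow for long messages.
    """
    bow = clean_message(document)
    log_s, log_h = np.log(P_s), np.log(P_h)
    for word, count in bow.items():
        if word in words:  # skip words never seen in the training sample
            log_s += count * np.log(P_word_s[word])
            log_h += count * np.log(P_word_h[word])
    return int(log_s > log_h)
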
classifier=np.vectorize(classifier)
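
One practical caveat when multiplying many word probabilities: for a long message the product can underflow to zero in floating point, which leaves both classes tied at 0. A common workaround is to compare sums of log-probabilities instead. The sketch below illustrates the effect with made-up numbers (the likelihoods, priors, and word count are assumptions chosen only to trigger underflow); it is independent of the notebook's variables.

```python
import math

# Hypothetical values: 100 words, each with a small likelihood under both classes
n_words = 100
p_word_spam = 2e-5
p_word_ham = 1e-5
prior_spam, prior_ham = 0.13, 0.87

# Direct product: both joint probabilities underflow to 0.0
joint_spam = prior_spam * p_word_spam ** n_words
joint_ham = prior_ham * p_word_ham ** n_words
print(joint_spam, joint_ham)

# Log space: the same quantities stay finite and comparable
log_spam = math.log(prior_spam) + n_words * math.log(p_word_spam)
log_ham = math.log(prior_ham) + n_words * math.log(p_word_ham)
predicted = "spam" if log_spam > log_ham else "ham"
print(log_spam, log_ham, predicted)
```

Because the logarithm is monotonic, comparing the two log sums gives the same decision as comparing the two products whenever the products are representable.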

In [None]:
classifier('congratulations! you won $500')

In [None]:
classifier("Let's apply this model to the test sample")

In [None]:
prediction = classifier(data_test.Text.values)

In [None]:
prediction

In [None]:
T = data_test.Class_code
T

In [None]:
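TP, TN, FP, FN = 0, 0, 0, 0
# One possible completion: spam (1) is taken as the positive class
for predicted, true in zip(prediction, T):
    if predicted == 1 and true == 1:
        TP += 1  # spam correctly flagged
    elif predicted == 0 and true == 0:
        TN += 1  # ham correctly passed
    elif predicted == 1 and true == 0:
        FP += 1  # ham wrongly flagged as spam
    else:
        FN += 1  # spam missed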

Confusion matrix

In [None]:
# Rows: predicted spam, predicted ham; columns: actually spam, actually ham
print(np.array([[TP,FP],[FN,TN]]))

In [None]:
precision=TP/(TP+FP)
print("precision=",precision)

In [None]:
recall=TP/(TP+FN)
print("recall=",recall)

In [None]:
F1_score=2*precision*recall/(precision+recall)
print("F1_score=",F1_score)

In [None]:
accuracy=(TP+TN)/(TP+FP+FN+TN)
print("accuracy=",accuracy)
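
To see how the four counts feed these metrics, here is a small hand-countable sketch. The label lists are made up for illustration, with 1 = spam taken as the positive class; they are not results from the data set.

```python
# Hypothetical true labels and predictions (1 = spam, 0 = ham)
true_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

pairs = list(zip(predicted, true_labels))
TP = sum(1 for p, t in pairs if p == 1 and t == 1)  # spam correctly flagged
TN = sum(1 for p, t in pairs if p == 0 and t == 0)  # ham correctly passed
FP = sum(1 for p, t in pairs if p == 1 and t == 0)  # ham wrongly flagged
FN = sum(1 for p, t in pairs if p == 0 and t == 1)  # spam missed

precision = TP / (TP + FP)  # of the messages flagged as spam, how many were spam
recall = TP / (TP + FN)     # of the true spam messages, how many were caught
F1_score = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FP + FN + TN)

print(TP, TN, FP, FN)
print(precision, recall, F1_score, accuracy)
```

In this example the accuracy (0.8) is higher than the precision and recall (about 0.67 each) because ham dominates the sample, which is why precision, recall, and the F1 score are reported alongside accuracy.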