<img src="https://images.unsplash.com/photo-1466096115517-bceecbfb6fde?ixlib=rb-0.3.5&ixid=eyJhcHBfaWQiOjEyMDd9&s=427bcc1d8e2505d31a239d0de6b13f75&auto=format&fit=crop&w=1950&q=80"  width="900" height="400">

**Problem statement:** classify SMS messages as *HAM* or *SPAM* using **naive Bayes** in a supervised machine learning setting.
For an overview of the supervised learning workflow, see [supervised learning workflow](http://www.allprogrammingtutorials.com/tutorials/introduction-to-machine-learning.php).

**Dataset:** We will use [SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) from UCI machine learning repository.

credit:

- some of the images are from https://cdn.pixabay.com
- https://unsplash.com

Running this notebook may require installing the gensim NLP (text processing) library.

In [1]:
import os

In [2]:
# output should be 0 after successful install
# run this only once. Comment later
os.system('pip install gensim')

0

In [2]:
#Must for inline plot
%matplotlib inline 
import requests
import numpy as np
import pprint # for pretty printing
import zipfile # for zip and unzip utilities
import pandas # for data analysis
import csv
import matplotlib.pyplot as plt # for plotting

import seaborn as sb
import gensim
from collections import Counter



In [3]:
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
r = requests.get(data_url)
r.content

b'PK\x03\x04\x14\x03\x00\x00\x08\x00\x81\xb4o>t\x90k\x96\xd2\x0e\x03\x00\xd3J\x07\x00\x11\x00\x00\x00SMSSpamCollection\xd4]\xc9v\x1bM\x15^\xc3S\x94\x81\x93\x10P:\x92,;\xb69`\xe4!\xb6\xc1C\xf0\x10\'l8-\xa9%u,u\x8b\x1e,\x8b\xc3\xe1\xf0\x0c\xacX\xb0\xe3-8\xac\xe0Mx\x12\xbe\xef\xde[\xd5r\x98\x87\x05\xe4\xff\x13[\xad\xea\xee\x1an\xdd\xe1\xbbCM\xe3\xf9WNrWgU:s\x9f\xeb"\xcf&n\x91\xa7Y\xd5r\xc3"\xfe\xd9*\x8a\\\xff1Ng\xf1`\x96\xb8<\x9b\xad\\\x9a\xb9A=IK\x97\xb9I\x91\xc4\x95[\xe6\xc5l\xe4f\xb1K\xf0\xc5x\x9cT\x11n:L\xb3\xc4U\xd3\xa4H\xdc$\xaf\\<\xcf\xf1\xdb2\xe6w_\xe5;\xaf\x1epG\xc1\x96?\xc8\x1fR\xbct\x99\x8e]\x8d7\xa4lQ.\xd0\xe4]\x91$.\xc9\xaaB\xde\xd9u\xb1[>\xe0\xfd\xc3|\xbepU\xee\x96\xb8\xf8\xae\xef\x0e\xeb\x85\x1b\xa7Y<s\xd5CU\xban\xa7\xac\xdcE\xbcr\xddv{+r\xb7\xc9S\x85Vl\xbf\xf3\xb6\xd3\xed\xf0\x97"\x19&\xe9\xa3\x7f\xf4O\xeb\xa4\xac\xd2<\xfbfY\x8d\\\x85\xd6E\\%\xafn_\x1c\xbe,]\xbcX\xe0\x85\xed\x9d\xdeVw\xa7\xd3n\xbf\xdd\xca\x1f\x93\xa2\xb3\xf3\xb2\x94!\xdc\xb9Q\x9d\xb9\x12\xef*s\x97\xc4\x05\

Let's download and save the zip file

In [4]:
sms_zip_file = 'smsspamcollection.zip'
with open(sms_zip_file, 'wb') as out_file:
    out_file.write(r.content)

# Let's verify it. 
**make sure output of following command contains smsspamcollection.zip file**

In [5]:
# Let's verify it.
dir_listing = os.listdir('.') # list content of current directory
print(dir_listing)

['.ipynb_checkpoints', 'data', 'Hands on Machine Learning with Scikit Learn and TensorFlow.pdf', 'HW1.ipynb', 'HW1.py', 'HW2f.ipynb', 'HW6.ipynb', 'Kernel PCA.ipynb', 'LDA_MNIST.ipynb', 'MNIST_data', 'Naive_bayes classifier.ipynb', 'PCA_MNIST.ipynb', 'ridge_regression_code.ipynb', 'smsspamcollection.zip', 'Sols.pdf', 'Text.pdf']


In [6]:
with zipfile.ZipFile(sms_zip_file,"r") as zip_ref:
    zip_ref.extractall("data")

# Let's list the content of the new data folder

In [7]:
print(os.listdir('./data'))

['readme', 'SMSSpamCollection']


The SMSSpamCollection file contains 5,574 SMS messages. Check the readme file for details.

**Let's open this file and store its lines in a Python list**

In [10]:
# read as UTF-8 so that characters such as the pound sign are decoded correctly
with open('./data/SMSSpamCollection', 'r', encoding='utf-8') as f:
    sms_messages = f.readlines()
print(sms_messages)



In [11]:
# The following code shows how to write a list comprehension. We could have done this with a for loop too.
# [<some_func>(x) for x in <something> if <some_condition_is_true>]
sms_messages = [m.rstrip() for m in sms_messages] # we are not using the if-condition part
#print('Number of sms messages is {}'.format(len(sms_messages)))
print(sms_messages)



# Let's check a couple of messages again

In [12]:
for idx, msg in enumerate(sms_messages[0:20]): # see how we can slice list using : operator
    print('message id {}  {}'.format(idx, msg))

message id 0  ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
message id 1  ham	Ok lar... Joking wif u oni...
message id 2  spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
message id 3  ham	U dun say so early hor... U c already then say...
message id 4  ham	Nah I don't think he goes to usf, he lives around here though
message id 5  spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
message id 6  ham	Even my brother is not like to speak with me. They treat me like aids patent.
message id 7  ham	As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
message id 8  spam	WINNER!! As a valued network customer

**This is our data set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ with $N = 5574$, where $x_i$ is an SMS message and $y_i$ is its label (ham or spam).** Using it, we will train a model (that is, learn the parameters $\theta$ of a model such as naive Bayes) and use the trained model to classify new messages as ham or spam.

In [13]:
# Loading the file into a pandas DataFrame simplifies a lot of tasks
messages = pandas.read_csv('./data/SMSSpamCollection', sep='\t', quoting=csv.QUOTE_NONE,
                           names=["label", "message"])
messages.head(6) 

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...


# Let's try to understand various attributes of the data

*How many messages in each group etc.*

In [14]:
messages.groupby('label').describe()

Unnamed: 0_level_0,message,message,message,message
Unnamed: 0_level_1,count,unique,top,freq
label,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
ham,4827,4518,"Sorry, I'll call later",30
spam,747,653,Please call our customer service representativ...,4


- As you can see, the spam class has far fewer examples than the ham class (747 vs. 4827). This is called the **class imbalance** issue.
- We need to be careful about class imbalance in machine learning applications.

This can be handled by:

Up-sampling the minority class:
This is done by randomly duplicating observations from the minority class in order to reinforce its signal.

Down-sampling the majority class:
This is done by randomly removing observations from the majority class to prevent its signal from dominating the learning algorithm.
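
As a rough illustration of both strategies, here is a minimal sketch on a toy DataFrame. The frame `toy` and its contents are made up for this example and only mimic the `label`/`message` layout of the real data; in practice resampling would normally be applied to the training portion only.

```python
import pandas

# Toy stand-in for the real data: 6 ham rows and 2 spam rows (assumed values).
toy = pandas.DataFrame({
    "label": ["ham"] * 6 + ["spam"] * 2,
    "message": ["ham text {}".format(i) for i in range(6)]
               + ["spam text {}".format(i) for i in range(2)],
})

majority = toy[toy.label == "ham"]
minority = toy[toy.label == "spam"]

# Up-sampling: draw minority rows with replacement until both classes match.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
upsampled = pandas.concat([majority, minority_upsampled])

# Down-sampling: keep only a random subset of the majority class.
majority_downsampled = majority.sample(n=len(minority), replace=False, random_state=0)
downsampled = pandas.concat([majority_downsampled, minority])

print(upsampled.label.value_counts().to_dict())
print(downsampled.label.value_counts().to_dict())
```

Up-sampling keeps every original message but repeats spam rows, while down-sampling throws ham rows away, so the choice depends on how much data one can afford to lose.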

# Feature engineering

We need to convert text to vectors (features).

We'll use the [Bag-of-words model](https://en.wikipedia.org/wiki/Bag-of-words_model) approach to create features
representing our SMS messages.

### Bag-of-words model for documents:

In the bag-of-words (BoW) model we treat a document as a collection of words without any order, as if they were lying in a bag.

Two models for representing an SMS/document in vector form are:

- **Bernoulli document model**: a message is represented by a binary feature vector marking the absence or presence of each word.
- **Multinomial document model**: a message is represented by an integer feature vector of word frequencies.

We will use the **Multinomial document model** in this exercise.
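
To make the difference between the two document models concrete, the following small sketch encodes one tokenized message both ways. The five-word `vocabulary` and the message `tokens` are toy values chosen for illustration only.

```python
from collections import Counter

vocabulary = ["call", "free", "now", "ok", "win"]  # toy vocabulary (assumed)
tokens = ["free", "call", "now", "free", "win"]    # one tokenized toy message

counts = Counter(tokens)

# Multinomial model: how often does each vocabulary word occur?
multinomial_vector = [counts[w] for w in vocabulary]
# Bernoulli model: does each vocabulary word occur at all?
bernoulli_vector = [1 if counts[w] > 0 else 0 for w in vocabulary]

print(multinomial_vector)  # [1, 2, 1, 0, 1]
print(bernoulli_vector)    # [1, 1, 1, 0, 1]
```

The two vectors differ only where a word is repeated ("free" here), which is exactly the information the multinomial model keeps and the Bernoulli model discards.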

To convert a message into a vector we need to:

1. Convert a sentence into word tokens
2. Normalize the words, i.e. decide whether we care about capitalization (Cow vs. cow) and inflected forms ("goes" vs. "go")
3. Build a dictionary of words and map the messages into vectors using this dictionary
4. Finally, train a naive Bayes model

**We will use the Python library [Gensim](https://radimrehurek.com/gensim/tutorial.html) to do the heavy lifting for us.**

In [15]:
preprocessed_messages = []
for c in messages.message:
    preprocessed_messages.append(gensim.utils.simple_preprocess(c))

In [16]:
# dtype=object keeps the ragged token lists as a 1-D array of Python lists
preprocessed_messages = np.array(preprocessed_messages, dtype=object)

In [17]:
len(preprocessed_messages)

5574

In [18]:
for m in messages.message[0:3]:
    print(m)

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
Ok lar... Joking wif u oni...
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's


In [19]:
preprocessed_messages[0:3]

array([list(['go', 'until', 'jurong', 'point', 'crazy', 'available', 'only', 'in', 'bugis', 'great', 'world', 'la', 'buffet', 'cine', 'there', 'got', 'amore', 'wat']),
       list(['ok', 'lar', 'joking', 'wif', 'oni']),
       list(['free', 'entry', 'in', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', 'st', 'may', 'text', 'fa', 'to', 'to', 'receive', 'entry', 'question', 'std', 'txt', 'rate', 'apply', 'over'])],
      dtype=object)

# Let's partition our data into training and test sets

In [20]:
messages_labels = messages.label

In [21]:
training_set_portion = .9  # keep 90% of the data for training
# Let's create a random permutation of the indices and partition the data
number_of_examples = len(preprocessed_messages)
print('Total examples are {}'.format(number_of_examples))
np.random.seed(0)  # to make sure multiple runs give the same result
random_index = np.random.permutation(range(number_of_examples))
training_set_size = int(number_of_examples*training_set_portion)
print('train set size is {} test set size is {}'.format(training_set_size,number_of_examples - training_set_size))

Total examples are 5574
train set size is 5016 test set size is 558


In [22]:
preprocessed_messages[1:4]

array([list(['ok', 'lar', 'joking', 'wif', 'oni']),
       list(['free', 'entry', 'in', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', 'st', 'may', 'text', 'fa', 'to', 'to', 'receive', 'entry', 'question', 'std', 'txt', 'rate', 'apply', 'over']),
       list(['dun', 'say', 'so', 'early', 'hor', 'already', 'then', 'say'])],
      dtype=object)

In [23]:
training_messages = preprocessed_messages[random_index[:training_set_size]]
training_labels = messages_labels[random_index[:training_set_size]]
test_messages = preprocessed_messages[random_index[training_set_size:]]
test_labels = messages_labels[random_index[training_set_size:]]

print('Shape of training X {} and train Y {}'.format(training_messages.shape, training_labels.shape))
print('Shape of test X {} and test Y {}'.format(test_messages.shape, test_labels.shape))

Shape of training X (5016,) and train Y (5016,)
Shape of test X (558,) and test Y (558,)


**We'll use only the training messages for building the model**

- We need to convert each message into a count vector, where a ham/spam message is mapped to a vector holding the frequency of each word in the message
    + To do this we need to build a dictionary of words first

In [24]:
# Collect the unique words of the training messages; they form our dictionary
unique_word = set()
for message in training_messages:
    unique_word.update(message)

In [25]:
# how many words in vocabulary(V)
len(unique_word) 

7353

We will encode each message as a len(unique_word)-dimensional vector. In machine learning we call an activity like this feature engineering.

In [26]:
# Let's use a default dictionary to assign each word a unique location in the feature vector
from collections import defaultdict, Counter
word_to_index_dict = defaultdict(int)
for index , word in enumerate(unique_word):
    word_to_index_dict[word] = index

In [27]:
# Let's create a reverse dictionary  for mapping index to word. It will help in debugging etc.
# See how we used dictionary comprehension
index_to_word_dict = { value:key  for key, value in word_to_index_dict.items()}

In [28]:
print(training_messages.shape)

(5016,)


## We will convert each message into a |V|-dimensional vector, where |V| is the size of our dictionary

## Let's create a numpy integer matrix of the right size, initialized with zeros

In [29]:
# each row in training_X is our x_i
training_X = np.zeros((len(training_messages), len(unique_word)), dtype=int)
print(training_X.shape)

(5016, 7353)


# Using [Counter](https://docs.python.org/3.5/library/collections.html#collections.Counter) from collections to count frequency of each word in a message.

In [30]:
# Let's go over each training message, count its words using Counter and set the counts in the feature vector of that SMS
for sms_no, sms in enumerate(training_messages):
    word_freq = Counter(sms)
    # set the word counts in row sms_no of training_X
    for word, freq in word_freq.items():
        index_of_word = word_to_index_dict[word]
        training_X[sms_no][index_of_word] = freq

## Writing code is easy :)
## But how do we check that it is correct?
## Let's do some primitive checking on an SMS message

In [31]:
sms_no =3
message_word_count = Counter(training_messages[sms_no])
print(message_word_count)

# Let's check the nonzero locations in training_X to see if the counts are set properly
print('##Encoding for sms no {} in feature vector is ##'.format(sms_no))
for i, count in enumerate(training_X[sms_no]):
    if count >0:
        print(index_to_word_dict[i], count)


Counter({'if': 2, 'you': 2, 'ok': 1, 'can': 1, 'be': 1, 'later': 1, 'showing': 1, 'around': 1, 'want': 1, 'cld': 1, 'have': 1, 'drink': 1, 'before': 1, 'wld': 1, 'prefer': 1, 'not': 1, 'to': 1, 'spend': 1, 'money': 1, 'on': 1, 'nosh': 1, 'don': 1, 'mind': 1, 'as': 1, 'doing': 1, 'that': 1, 'nxt': 1, 'wk': 1})
##Encoding for sms no 3 in feature vector is ##
to 1
showing 1
nosh 1
can 1
prefer 1
on 1
as 1
mind 1
before 1
you 2
if 2
later 1
around 1
doing 1
money 1
don 1
have 1
be 1
drink 1
ok 1
that 1
not 1
cld 1
wk 1
nxt 1
wld 1
want 1
spend 1


# We have successfully converted the SMS messages into feature vectors and
# collected them in a numpy matrix
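
For comparison, scikit-learn ships a `CountVectorizer` that builds the dictionary and the count matrix in one step. The sketch below is only an illustration on three made-up, already tokenized messages (`toy_messages`); passing a callable as `analyzer` tells the vectorizer to use the token lists as they are. It is not the method used in the rest of this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three already tokenized toy messages (assumed for illustration).
toy_messages = [["free", "entry", "free"], ["ok", "lar"], ["free", "ok"]]

# A callable analyzer receives each document and returns its tokens unchanged.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
toy_X = vectorizer.fit_transform(toy_messages)  # scipy sparse count matrix

print(sorted(vectorizer.vocabulary_.items()))   # word -> column index
print(toy_X.toarray())                          # dense view of the counts
```

The result is a sparse matrix, which saves memory when most entries are zero, as is the case for our 5016 x 7353 matrix.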

<img src="https://images.unsplash.com/photo-1522098543979-ffc7f79a56c4?ixlib=rb-0.3.5&ixid=eyJhcHBfaWQiOjEyMDd9&s=3deb7fa95bb0a7343a38b724cbee4b5a&auto=format&fit=crop&w=1868&q=80" alt="Well done" width="500" height="400">


# Let's convert the ham and spam labels to 1 and 0 respectively

In [32]:
training_labels.tail(7)# can check from head too

1201     ham
4260     ham
424     spam
4421     ham
3715     ham
664      ham
350      ham
Name: label, dtype: object

In [33]:
# This is our training_y value
training_y = (training_labels.values == 'ham').astype(int)
print(training_y.shape)

(5016,)


In [34]:
# Let's check some label values
training_y[-7:]

array([1, 1, 0, 1, 1, 1, 1])

# Training the model or estimating parameters $\theta$ of the model
Now we have a feature vector representation $x_i$ of each of our SMS samples.

Let's review some theory and see which parameters we need to estimate for the naive Bayes model.

We classify an SMS $x_i$ into the class $c$ = HAM or $c$ = SPAM that has the maximum value of $P(c|x_i)$. Using Bayes' rule we have $P(c|x_i) = \frac{P(x_i|c) P(c)}{P(x_i)} \propto P(x_i|c) P(c)$, as the normalization $P(x_i)$ doesn't depend on the class label.

Under the naive Bayes assumption for modelling class-conditional densities we have $P(x_i|c) = \prod_{j=1}^D P(x_{ij}|c)$, assuming $x_i \in \mathbb{R}^D$, i.e. each example has $D$-dimensional features.

**Note: $D$ is the size of our vocabulary ($|V|$) built from the SMS document corpus, i.e. $D = |V|$**

**Which probability distribution should we choose for $P(x_{ij}|c)$?**

Each value $x_{ij}$ is an integer (the count of word $w_j$ in message $i$) and there are $D$ different unique words in total. This suits a **$D$-sided die** situation. In our case the die has $D = |V|$ sides, the size of the feature vector.

**In fact, once we have learned $P(w_j|c)$, i.e. the probabilities of the different sides of the ham die and the spam die, generating a ham or spam SMS in the bag-of-words model is nothing but repeatedly rolling the ham or spam die and picking the word dictated by the side that comes up.**

Now we know that we can use the multinomial distribution for such a situation. Hence, with $n_i = \sum_{j=1}^{D} x_{ij}$ the number of words in message $i$,
<font size = 6> 
$P(c|x_i) \propto P(c)\, P(x_i|c) = P(c)\, \frac{n_i!}{\prod_{j=1}^D x_{ij}!} \prod_{j=1}^{D} P(w_j|c)^{x_{ij}} \propto P(c) \prod_{j=1}^{D} P(w_j|c)^{x_{ij}}$ 
</font>
as the multinomial coefficient doesn't depend on the class label.

Using the maximum likelihood estimate (MLE) we have
<font size = 8> 
$P(w_j|c) = \frac{\sum_{i=1}^N x_{ij}\mathbb{1}(y_i=c)}{\sum_{k=1}^{D} \sum_{i=1}^N x_{ik}\mathbb{1}(y_i=c)}.$ 
</font>
where $\mathbb{1}$ is the indicator function.


- Hence the parameters are nothing but the relative frequency of $w_j$ in documents of class c = SPAM or c = HAM
with respect to the total number of words in documents of that class.

- We can sum the rows of our numpy feature matrix that belong to a class along axis 0 to get the total frequency of each feature for the ham and spam classes
- Normalize the total frequency of each feature by the total frequency of all the features for that class.
- Prior class probabilities are estimated as $P(c) = \frac{N_c}{N}$, where $N_c$ is the number of documents in class $c$.
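
The recipe above can be tried on a tiny example before running it on the real matrix. In this sketch `toy_X` and `toy_y` are made-up values (three messages, a four-word vocabulary), not data from the SMS collection.

```python
import numpy as np

# Toy count matrix: 3 messages, vocabulary of 4 words (assumed values).
toy_X = np.array([[2, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 3, 0]])
toy_y = np.array([1, 0, 1])  # 1 = ham, 0 = spam

# Sum the ham rows along axis 0: total count of each word in ham messages.
ham_counts = toy_X[toy_y == 1].sum(axis=0)

# MLE: relative frequency of each word among all ham words.
p_w_given_ham = ham_counts / ham_counts.sum()

# Prior: fraction of messages that are ham.
p_ham = np.mean(toy_y == 1)

print(ham_counts)      # [3 0 4 0]
print(p_w_given_ham)
print(p_ham)
```

Words 1 and 3 never occur in a ham message here, so their estimated probability is exactly zero.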



# Let's learn the parameters for c= ham(1) and c= spam(0)



# For ham class

In [35]:
# First estimate for ham

# summing up per feature count
training_X_ham = training_X[training_y ==1]
print(training_X_ham.shape)
per_feature_count =np.sum(training_X_ham, axis = 0)
print(per_feature_count.shape)

#print(np.count_nonzero(per_feature_count))
print(np.sum(per_feature_count))
parameters_w_ham = per_feature_count/(np.sum(per_feature_count))

parameters_w_ham

(4346, 7353)
(7353,)
56547


array([1.76844041e-05, 1.60928078e-03, 1.76844041e-05, ...,
       0.00000000e+00, 3.53688082e-05, 1.76844041e-05])

# Let's estimate parameters for spam

In [36]:
training_X_spam = training_X[training_y ==0]
print(training_X_spam.shape)
per_feature_count =np.sum(training_X_spam, axis = 0)
per_feature_count.shape

np.count_nonzero(per_feature_count)
parameters_w_spam = per_feature_count/(np.sum(per_feature_count))
print(parameters_w_spam)

(670, 7353)
[0.         0.00021219 0.         ... 0.00014146 0.         0.        ]


# Zero probability issue
As we can see, some of the probabilities are zero. This creates a problem when we estimate the probability of a new document in the test set that contains a word which never occurred in the training messages of a class.

If any term in the product is zero, the whole product is zero. If the word is missing from both classes, the probability of the document is zero for every class, which is an ambiguous situation. The log trick for comparing products of probabilities does not help either, as the log of zero is not defined.

One way to handle this situation is to add a fake count of 1 for every word in each class. This is called Laplace's law of succession or add-one smoothing.

We estimate
<font size = 8> 
$P(w_j|c) = \frac{\sum_{i=1}^N x_{ij}\mathbb{1}(y_i=c) + 1}{\sum_{k=1}^{D} \sum_{i=1}^N x_{ik}\mathbb{1}(y_i=c) + |V|}.$ 
</font>
where $\mathbb{1}$ is the indicator function and $|V|$ is the size of our dictionary.

This can be done by adding a row of ones to training_X_ham and training_X_spam before summing the columns.
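
As a small sanity check of the formula, the sketch below applies add-one smoothing directly to a toy vector of per-word totals (`toy_counts` is made up for illustration). Note that adding a row of ones raises every numerator by 1 and, at the same time, raises the sum of all counts by $|V|$, so the smoothed values form a proper distribution.

```python
import numpy as np

toy_counts = np.array([3, 0, 4, 0])  # per-word totals for one class (toy values)
vocab_size = toy_counts.shape[0]     # |V|

# (count + 1) / (total + |V|), exactly as in the smoothed estimate above
smoothed = (toy_counts + 1) / (toy_counts.sum() + vocab_size)

print(smoothed)
print(smoothed.sum())
```

No entry is zero any more, and the probabilities still sum to one.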


# New parameters with Laplace's law of succession (add-one smoothing)

# Estimate new parameter for ham class

In [37]:
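training_X_ham = training_X[training_y ==1]
# append one row of ones: a fake count of 1 for every word in the vocabulary
training_X_ham = np.append(training_X_ham, np.ones(shape=(1, training_X_ham.shape[1])), axis=0)
print(training_X_ham.shape)
per_feature_count =np.sum(training_X_ham, axis = 0)
per_feature_count.shape

np.count_nonzero(per_feature_count)
# note: the appended row already adds |V| to the sum, so the extra "+ shape[1]" makes the
# denominator (total + 2|V|); drop it to match the add-one formula exactly
parameters_w_ham = per_feature_count/(np.sum(per_feature_count) + training_X_ham.shape[1])
print(parameters_w_ham)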

(4347, 7353)
[2.80689936e-05 1.29117370e-03 2.80689936e-05 ... 1.40344968e-05
 4.21034904e-05 2.80689936e-05]


# Estimate new parameter for spam class 

In [38]:
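training_X_spam = training_X[training_y ==0]
# append one row of ones: a fake count of 1 for every word in the vocabulary
training_X_spam = np.append(training_X_spam, np.ones(shape=(1, training_X_spam.shape[1])), axis=0)
print(training_X_spam.shape)
per_feature_count =np.sum(training_X_spam, axis = 0)
per_feature_count.shape

np.count_nonzero(per_feature_count)
# note: the appended row already adds |V| to the sum, so the extra "+ shape[1]" makes the
# denominator (total + 2|V|); drop it to match the add-one formula exactly
parameters_w_spam = per_feature_count/(np.sum(per_feature_count) + training_X_spam.shape[1])
print(parameters_w_spam)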

(671, 7353)
[3.46692553e-05 1.38677021e-04 3.46692553e-05 ... 1.04007766e-04
 3.46692553e-05 3.46692553e-05]


# class probabilities

# Estimate class probabilities P(c=ham) and P(c=spam)

In [39]:
ham = training_X_ham.shape[0]/(training_X_ham.shape[0] + training_X_spam.shape[0])
spam = 1- ham
ham,spam

(0.8662813870067756, 0.1337186129932244)

In [40]:
print(len(training_X_ham))

4347


# Now we have learned the model (i.e. its parameters: the probabilities of the different words occurring on the ham die and the spam die)

# How good is our model ?
- Let's take our test data and convert it to count feature vectors using the same dictionary
- Calculate the probability of each test message belonging to ham or spam, i.e. if the probability of ham is >= 0.5 predict ham, otherwise spam,
 or we can calculate the ratio
 <font size = 5>
 $\frac{P(c=ham|x_{test})}{P(c=spam|x_{test})} = \frac{ P(c=ham)  \prod^{D}_{j =1} P(w_j|c=ham)^{x_{test,j}}} { P(c= spam)\prod^{D}_{j=1} P(w_j|c=spam)^{x_{test, j}}}$ 
 </font>
 
 **Note: in general such a large product of probabilities underflows to zero because of the limited floating-point representation of real numbers.**
 
 Another option is to take the log of both sides; after some manipulation one can show that if
 <font size = 5>
  $\sum_{j =1}^{D} x_{test,j} \log P(w_j|c=ham) + \log P(c= ham) \ge \log P(c=spam) + \sum_{j =1}^{D} x_{test,j} \log P(w_j|c=spam)$
  
  </font>
  
 then it is ham, otherwise spam
 
- Compare our predicted labels with the test data labels and report the accuracy
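
The underflow problem mentioned in the note is easy to reproduce. The sketch below uses a made-up message of 200 word occurrences, each with probability 1e-5 (toy values), and compares the direct product with the sum of logs.

```python
import numpy as np

# 200 word occurrences, each with probability 1e-5 (toy values).
word_probs = np.full(200, 1e-5)

# Direct product: 1e-1000 is far below the smallest positive float64.
direct_product = np.prod(word_probs)

# Sum of logs: the same quantity on the log scale stays finite.
log_sum = np.sum(np.log(word_probs))

print(direct_product)  # 0.0
print(log_sum)
```

Since the logarithm is monotonic, comparing the log scores of ham and spam gives the same decision as comparing the products would, without losing the value to underflow.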

In [41]:
test_x = np.zeros((len(test_messages), len(unique_word)), dtype=int)
print(test_x.shape)
print(len(word_to_index_dict))
print(test_messages.shape)

(558, 7353)
7353
(558,)


In [42]:
def build_feature(sms, word_to_index_dict):
    feature = np.zeros((len(word_to_index_dict),), dtype=int)
    word_freq =  Counter(sms)
    # set the count of every word that is in the dictionary; unseen words are skipped
    for word, freq in word_freq.items():
        if word in word_to_index_dict:
            index_of_word = word_to_index_dict[word]
            feature[index_of_word] = freq
    return feature        
    
for sms_no, sms in enumerate(test_messages):
    test_x[sms_no] = build_feature(sms, word_to_index_dict)

print(test_x.shape)

(558, 7353)


# Again checking feature creation/encoding

In [43]:
sms_no =2
message_word_count = Counter(test_messages[sms_no])
print(message_word_count)

# Let's check the nonzero locations in test_x to see if the counts are set properly
print('##Encoding for sms no {} in feature vector is ##'.format(sms_no))
for i, count in enumerate(test_x[sms_no]):
    if count >0:
        print(index_to_word_dict[i], count)


Counter({'hello': 1, 'they': 1, 'are': 1, 'going': 1, 'to': 1, 'the': 1, 'village': 1, 'pub': 1, 'at': 1, 'so': 1, 'either': 1, 'come': 1, 'here': 1, 'or': 1, 'there': 1, 'accordingly': 1, 'ok': 1})
##Encoding for sms no 2 in feature vector is ##
to 1
pub 1
here 1
come 1
village 1
going 1
either 1
the 1
ok 1
or 1
at 1
so 1
they 1
accordingly 1
there 1
are 1
hello 1


# Convert ham and spam to 1 and 0 integer as done in training set

In [44]:
# This is our test_y value
test_y =(test_labels.values == 'ham').astype(int)
print(test_y.shape)

(558,)


# Finally let's calculate ham/spam probability for test messages

In [45]:
ham_score = np.zeros_like(test_y,dtype=float)
spam_score = np.zeros_like(test_y,dtype=float)
ham_score.shape, spam_score.shape# just printing to make sure shape is right

((558,), (558,))

In [46]:
def calculate_score(parameters,test_sms, class_prior):
    # class_prior is expected to be the log of the class prior, e.g. np.log(ham)
    return np.sum(np.log(np.power(parameters,test_sms))) + class_prior

for idx, test_sms in enumerate(test_x):# this will fetch row by row, encoded test messages
    ham_score[idx] = calculate_score(parameters_w_ham,test_sms, np.log(ham))
    spam_score[idx] = calculate_score(parameters_w_spam, test_sms, np.log(spam))

    

In [47]:
# Let's print some values for visual comparison/verification
ham_score[0:2], spam_score[0:2], test_y[0:2]

(array([ -64.13108259, -158.05199309]),
 array([ -80.92560929, -137.90110922]),
 array([1, 0]))
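
The per-message scoring loop can also be written as a single matrix-vector product, because $\log(p^{x}) = x \log p$. The following sketch checks on toy values that both forms agree; `toy_params`, `toy_test_x` and `toy_log_prior` are made up for illustration and stand in for the learned parameters, the encoded test messages and the log prior.

```python
import numpy as np

toy_params = np.array([0.5, 0.3, 0.2])  # P(w_j | c), toy values
toy_test_x = np.array([[2, 0, 1],
                       [0, 3, 0]])      # two encoded toy messages
toy_log_prior = np.log(0.8)             # log P(c), toy value

# Loop form: sum of log(p ** x) per message, plus the log prior.
loop_scores = np.array([
    np.sum(np.log(np.power(toy_params, row))) + toy_log_prior
    for row in toy_test_x
])

# Vectorized form: x . log(p) for all messages at once.
vector_scores = toy_test_x @ np.log(toy_params) + toy_log_prior

print(np.allclose(loop_scores, vector_scores))  # True
```

Besides being shorter, the vectorized form never raises a small probability to a large power, so it avoids one possible source of underflow for long messages.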

In [48]:
# predict the label ham(1) or spam(0)
ham_or_spam = (ham_score >= spam_score).astype(int)

In [49]:
ham_or_spam[0:5], test_y[0:5]

(array([1, 0, 1, 0, 1]), array([1, 0, 1, 0, 1]))

# Accuracy calculation

In [50]:
accuracy= np.sum(ham_or_spam == test_y) / len(ham_or_spam)
print('accuracy on test set is {}'.format(accuracy))

accuracy on test set is 0.9802867383512545
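
Because of the class imbalance noted earlier, accuracy alone can look better than the model is: a classifier that always answers ham would already be right for about 87% of this data set (4827 of 5574 messages are ham). A minimal sketch of two complementary measures for the spam class follows; `toy_true` and `toy_pred` are made-up label arrays, not results from this notebook.

```python
import numpy as np

toy_true = np.array([1, 1, 0, 1, 0, 1, 1, 0])  # 1 = ham, 0 = spam (toy labels)
toy_pred = np.array([1, 1, 0, 1, 1, 1, 1, 0])  # toy predictions

spam_caught = np.sum((toy_true == 0) & (toy_pred == 0))  # spam predicted as spam
spam_missed = np.sum((toy_true == 0) & (toy_pred == 1))  # spam predicted as ham
ham_flagged = np.sum((toy_true == 1) & (toy_pred == 0))  # ham predicted as spam

# Recall: share of real spam that was caught.
spam_recall = spam_caught / (spam_caught + spam_missed)
# Precision: share of spam predictions that were really spam.
spam_precision = spam_caught / (spam_caught + ham_flagged)
toy_accuracy = np.mean(toy_true == toy_pred)

print(spam_recall, spam_precision, toy_accuracy)
```

In this toy case the accuracy is 0.875 although one of the three spam messages slips through, which is the kind of detail accuracy hides.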


In [51]:
def predict_ham_or_spam(message):
    feature = build_feature(message, word_to_index_dict)
    ham_score = calculate_score(parameters_w_ham,feature, np.log(ham))
    spam_score = calculate_score(parameters_w_spam, feature, np.log(spam))
    
    return 'ham' if ham_score >= spam_score else 'spam'  # >= matches the rule used on the test set

# Let's see how it works on a new spam message, which is a modified training message (coming from the same distribution as the training messages)

In [52]:
predict_ham_or_spam(' your mailbox messaging sm alert call back 09056242159 to retrieve your message'.split())

'spam'

<font color = 'BlueViolet' size = 6> Comparing our from-scratch model with built-in sklearn classifiers </font>

In [53]:
from sklearn.naive_bayes import MultinomialNB

In [54]:
classifier = MultinomialNB()
classifier.fit(training_X, training_y)
ypred= classifier.predict(test_x)
classifier.score(test_x, test_y)

0.9802867383512545

In [55]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(training_X, training_y)
ypred= classifier.predict(test_x)
classifier.score(test_x, test_y)



0.9802867383512545

# Kudos, we have matched the accuracy of sklearn.