<a href="https://colab.research.google.com/github/sdhar2020/Simple-Naive-Bayes/blob/master/Simple_Naive_Bayes_model_for_classifying_SMS_messages_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Settings:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from urllib.request import urlretrieve
import os
from zipfile import ZipFile
#from google.colab import drive
#drive.mount('/content/gdrive')

# Introduction

The task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the The UCI Machine Learning Repository. 

In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/'
f = 'smsspamcollection.zip'

In [None]:
def download(url, file):
    if not os.path.isfile(file):
        print("Download file... " + file + " ...")
        urlretrieve(url,file)
        print("File downloaded")

download(url,f)
print("All the files are downloaded")

Download file... smsspamcollection.zip ...
File downloaded
All the files are downloaded


In [None]:
!pwd
!ls -d

!cd /content
!pwd
!cd /content/gdrive
!pwd

/content
.
/content
/bin/bash: line 0: cd: /content/gdrive: No such file or directory
/content


In [None]:
# def uncompress_data(f):
#     if(os.path.isfile('f')):
#         with ZipFile(f) as zipf:
#             zipf.extractall(f)
#         print('Data extracted')
#     else:
#         print('Zip file not found')

In [None]:
from google.colab import files
uploaded = files.upload()

Saving SMSSpamCollection.csv to SMSSpamCollection.csv


In [None]:
import io
df = pd.read_csv(io.BytesIO(uploaded['SMSSpamCollection.csv']), header = None, sep = '\t', names=['Label', 'SMS'])

In [None]:
df.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


# High Level Exploration

In [None]:
df.shape

(5572, 2)

In [None]:
df['Label'].value_counts(normalize = True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

Total 5572 rows with 2 columns. 86.6% Ham and 13.4% Spam

## Train/ Test Split

In [None]:
df_rand = df.sample(frac=1, random_state=1)
# Random;y suffle records

In [None]:
tt_index = round(len(df_rand)*.8)
train = df_rand[:tt_index].reset_index(drop = True)
test  = df[tt_index:].reset_index(drop = True)

In [None]:
train['Label'].value_counts(normalize = True)*100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [None]:
test['Label'].value_counts(normalize = True)*100

ham     86.983842
spam    13.016158
Name: Label, dtype: float64

The composition of spam versus ham in train and test appear to be very similar

In [None]:
import re

# Creating Vocab list

We shall now build a vocab list in the training data. This means we have to extract and somewhat standardized the sms messages and create a single list of words from the messages

In [None]:
train['clean_sms']= train['SMS'].str.replace("\W", " ").str.lower().str.split()

In [None]:
train['clean_sms'].head()

0                    [yep, by, the, pretty, sculpture]
1    [yes, princess, are, you, going, to, make, me,...
2                      [welp, apparently, he, retired]
3                                             [havent]
4    [i, forgot, 2, ask, ü, all, smth, there, s, a,...
Name: clean_sms, dtype: object

In [None]:
vocabulary= []

for i in range(len(train)):
  l = train['clean_sms'].iloc[i]
  for j in range(len(l)):
    vocabulary.append(l[j])

In [None]:
vocabulary = set(vocabulary)
vocabulary = list(vocabulary)

In [None]:
len(vocabulary)

7783

# Word Count

We are going to do word count in the good old way. 
- We start by initializing a dictionary where each key is a unique word (a string) from the vocabulary, and each value is a list of the length of training set, where each element in the list is a 0
- We loop over training sms list using at the same time the enumerate() function to get both the index and the SMS message (index and sms).



In [None]:
word_counts_per_sms = {unique_word: [0] * len(train['clean_sms']) for unique_word in vocabulary}

for index, sms in enumerate(train['clean_sms']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [None]:
len(word_counts_per_sms)

7783

In [None]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,bloomberg,09099726481,hol,bajarangabali,stuffs,frnt,roses,volcanoes,commercial,les,lodge,yeh,sambar,j,promise,callcost150ppmmobilesvary,sure,69988,intention,leh,happier,bang,fever,club,frontierville,puppy,08715205273,cos,prizes,plumbing,birthday,seeds,iriver,w1,mostly,itself,promptly,admirer,helens,doug,...,rael,2find,east,him,surely,mobilesvary,07821230901,ts,breathing,dessert,flirtparty,made,lyrics,virgil,hold,mood,outrageous,neighbor,stifled,dint,dinner,20,olage,din,fone,tor,ben,ela,somone,bids,drinkin,avenue,he,howda,certificate,by,86688,getting,papa,youuuuu
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
train_wdcnt = pd.concat([train, word_counts], axis=1)
train_wdcnt.head()

Unnamed: 0,Label,SMS,clean_sms,bloomberg,09099726481,hol,bajarangabali,stuffs,frnt,roses,volcanoes,commercial,les,lodge,yeh,sambar,j,promise,callcost150ppmmobilesvary,sure,69988,intention,leh,happier,bang,fever,club,frontierville,puppy,08715205273,cos,prizes,plumbing,birthday,seeds,iriver,w1,mostly,itself,promptly,...,rael,2find,east,him,surely,mobilesvary,07821230901,ts,breathing,dessert,flirtparty,made,lyrics,virgil,hold,mood,outrageous,neighbor,stifled,dint,dinner,20,olage,din,fone,tor,ben,ela,somone,bids,drinkin,avenue,he,howda,certificate,by,86688,getting,papa,youuuuu
0,ham,"Yep, by the pretty sculpture","[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
1,ham,"Yes, princess. Are you going to make me moan?","[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,ham,Welp apparently he retired,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3,ham,Havent.,[havent],0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,ham,I forgot 2 ask ü all smth.. There's a card on ...,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# P(Ham) vs. P(Spam)



In [None]:
p_ham = sum(train_wdcnt['Label']== 'ham')/ len(train_wdcnt['Label'])
p_spam = 1- p_ham

In [None]:
ham = train_wdcnt[train_wdcnt['Label']== 'ham']
spam = train_wdcnt[train_wdcnt['Label']== 'spam']

In [None]:
# N_Spam
n_words_per_spam_message = spam['clean_sms'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham['clean_sms'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

In [None]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

In [None]:
for word in vocabulary:
  n_words_given_ham = ham[word].sum()
  n_words_given_spam = spam[word].sum()
  n_word_given_ham = ham[word].sum()   # ham_messages already defined in a cell above
  n_word_given_spam = spam[word].sum()   # spam_messages already defined in a cell above
  p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
  p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
  parameters_spam[word] = p_word_given_spam
  parameters_ham[word] = p_word_given_ham

In [None]:
tst_msg = 'WINNER!! This is the secret code to unlock the money: C3421.'
#tst_msg = tst_msg.replace("\W", " ").lower().split()

In [None]:
import re
def classify(message):
  '''
  message: is a string
  '''
  message = message.replace('\W', ' ')
  message = message.lower().split()
  p_spam_given_message = p_spam
  p_ham_given_message = p_ham

  for word in message:
    if word in parameters_ham:
      p_ham_given_message *= parameters_ham[word]
    if word in parameters_spam:
      p_spam_given_message *= parameters_spam[word]
  if p_ham_given_message > p_spam_given_message:
    label = 'ham'
  elif p_ham_given_message < p_spam_given_message:
    label = 'spam'
  else:
      label = 'manual candiddate'
  return(label, p_ham_given_message, p_spam_given_message)

In [None]:
classify(tst_msg)[1]

1.8195638182330266e-19

In [None]:
test['prediction_result']= test['SMS'].apply(classify)

In [None]:
test['p_label']= test['SMS'].apply(lambda x: classify(x)[0])
test['p_ham']= test['SMS'].apply(lambda x: classify(x)[1])
test['p_spam']= test['SMS'].apply(lambda x: classify(x)[2])

In [None]:
acc =test['Label']== test['p_label']
False_pos = np.logical_and(test['Label']== 'ham', test['p_label']== 'spam')
ham = test['Label']== 'ham'
spam = test['p_label']== 'spam'
False_neg = np.logical_and(test['Label']== 'spam', test['p_label']== 'ham')
accuracy = acc.sum()/ len(test['Label'])
False_pos_rate = False_pos.sum()/ (False_pos.sum()+ ham.sum())
False_neg_rate = False_neg.sum()/ (False_neg.sum()+ spam.sum())
miss = test['Label']!= test['p_label']

In [None]:
print(accuracy)
print(False_pos_rate)
print(False_neg_rate)

0.9892280071813285
0.00513347022587269
0.04666666666666667


In [None]:
miss.sum()

12

In [None]:
test[miss]

Unnamed: 0,Label,SMS,prediction_result,p_label,p_ham,p_spam
56,spam,Money i have won wining number 946 wot do i do...,"(ham, 7.527669451157541e-37, 9.097839917029221...",ham,7.527669e-37,9.09784e-40
99,ham,Gettin rdy to ship comp,"(spam, 1.3662156424097608e-19, 2.0710112343792...",spam,1.366216e-19,2.0710109999999999e-19
142,ham,Have you laid your airtel line to rest?,"(spam, 1.5576950963877754e-21, 1.1926441515825...",spam,1.557695e-21,1.192644e-20
218,spam,"Hi babe its Chloe, how r u? I was smashed on s...","(ham, 1.9240005103235274e-65, 2.30484689434199...",ham,1.924001e-65,2.304847e-72
271,ham,I (Career Tel) have added u as a contact on IN...,"(spam, 3.322331576885379e-45, 1.21329729065954...",spam,3.322332e-45,1.213297e-43
296,spam,Cashbin.co.uk (Get lots of cash this weekend!)...,"(ham, 1.3128980459038297e-57, 1.56986076656587...",ham,1.312898e-57,1.569861e-58
404,ham,Nokia phone is lovly..,"(spam, 3.3416452792129453e-10, 3.2302276335356...",spam,3.341645e-10,3.230228e-09
491,spam,"Hi this is Amy, we will be sending you a free ...","(ham, 1.3270940387655318e-69, 4.16040674217410...",ham,1.3270940000000002e-69,4.160407e-73
579,spam,You won't believe it but it's true. It's Incre...,"(ham, 5.16693782827404e-64, 1.7469250352197834...",ham,5.1669379999999994e-64,1.746925e-64
588,ham,We have sent JD for Customer Service cum Accou...,"(spam, 6.011261440232347e-57, 1.69418236823749...",spam,6.011260999999999e-57,1.6941820000000002e-54


The accuracy is close to 98.74%, which is really good. Our spam filter looked at 1,114 messages that it hasn't seen in training, and classified 1,100 correctly

Next Steps
In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.74% on the test set we used, which is a pretty good result. Our initial goal was an accuracy of over 80%, and we managed to do way better than that.

# Next steps include:

- Analyze the 12 messages that were classified incorrectly and try to figure out why the algorithm classified them incorrectly
- Make the filtering process more robust:
   -  by making the algorithm sensitive to letter case
   -  n-grams
   -  word2vec
   -  tf/itdf