# Spam Filter using Naive Bayes Classifier

You are given a collection of SMS text messages in `sms.csv`, a tab-separated CSV file. The first column says whether the message is spam or not spam, and the second column gives the message text. Assume that this dataset is labelled correctly. We will use it as the training data to build a spam filter.

(a) Analyze the dataset and identify the top ten spam words and the top ten non-spam words, along with their frequency counts. Make sure that you first remove articles ("a", "an", "the") and prepositions of four or fewer letters ("for", "off", "in", "from" and so on).

In [12]:
import csv
import re
from collections import Counter
file  = open('sms.csv')         #opening the file
csvreader = csv.reader(file, delimiter='\t')   #using the delimiter as tab
rows = []            #storing the info of each row in this list
spamID = []          #storing whether any message is spam or not
spamInd = []         #indexes of spam messages
nspamInd = []        #indexes of not spam messages
words = []           #storing all the words
unique = []          #storing all the unique words
spamwords = []       #storing all the spam words
spamunique = []      #storing all the unique spam words
nspamwords = []      #storing all the non spam words
nspamunique = []     #storing all the non spam unique words
stopwords = ['', 'â', "you", "a", "and", "the", 'u','ur','4','1','2','i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'you', 'your', 'yours', 'he', 'him', 'his', 'she', 'her', 'hers', 'it', 'its', 'they', 'them', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'wouldn']
for row in csvreader:
    rows.append(row)
file.close()         #closing the file once all rows are read
for idx, row in enumerate(rows):   #enumerate gives the true row index, even for duplicate messages
    row[1] = row[1].lower()
    temp = re.split(r'\W+', row[1])
    kept = [word for word in temp if word not in stopwords]   #tokens left after stop word removal
    words.extend(kept)             #storing the words
    for word in kept:              #storing all the unique words
        if word not in unique:
            unique.append(word)
    if row[0] == "spam":           #0 indicates spam
        spamID.append(0)
        spamInd.append(idx)
        spamwords.extend(kept)
        for word in kept:
            if word not in spamunique:
                spamunique.append(word)
    else:                          #1 indicates not spam
        spamID.append(1)
        nspamInd.append(idx)
        nspamwords.extend(kept)
        for word in kept:
            if word not in nspamunique:
                nspamunique.append(word)

#print(spamwords)
topspam = Counter(spamwords).most_common(10)     #getting the top 10 spamwords
topnspam = Counter(nspamwords).most_common(10)   #getting the top 10 non spam words
print("Top 10 spam words are")
print(topspam)
print("Top 10 non spam words are")
print(topnspam)


Top 10 spam words are
[('call', 348), ('free', 219), ('txt', 146), ('mobile', 123), ('claim', 113), ('stop', 109), ('text', 105), ('reply', 96), ('www', 96), ('prize', 90)]
Top 10 non spam words are
[('lt', 315), ('gt', 315), ('get', 298), ('ok', 281), ('go', 244), ('call', 235), ('got', 230), ('know', 230), ('good', 229), ('come', 229)]
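
The counting step in (a) can also be sketched compactly on a few made-up messages. This is only an illustration: `STOPWORDS`, `toy_rows`, `top_words` and the `"not spam"` label are hypothetical names chosen here, not part of the notebook or of `sms.csv`. It keeps one `Counter` per label and uses a set for the stop words so that membership tests stay cheap.

```python
import re
from collections import Counter

# Hypothetical miniature stop word set; the notebook uses a much longer list.
STOPWORDS = {"", "a", "an", "and", "the", "for", "to", "you", "in", "is"}

# Made-up (label, message) pairs standing in for rows of sms.csv.
toy_rows = [
    ("spam", "Claim the free prize, call now"),
    ("spam", "Free entry: call to claim"),
    ("not spam", "Call me when you get home"),
    ("not spam", "Ok, see you in the evening"),
]

def top_words(rows, n=3):
    """Return {label: n most common (word, count) pairs} after stop word removal."""
    counts = {}
    for label, message in rows:
        tokens = re.split(r"\W+", message.lower())
        kept = [t for t in tokens if t not in STOPWORDS]
        counts.setdefault(label, Counter()).update(kept)
    return {label: c.most_common(n) for label, c in counts.items()}

result = top_words(toy_rows)
print(result["spam"])
print(result["not spam"])
```

Because each label gets its own `Counter`, the same pass over the data yields both rankings.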


(b) Let `W` be the random variable denoting a word and `T` be the random variable denoting a message's type (spam or non-spam). For each word `w` (spam or non-spam), estimate the likelihood probabilities (aka the conditional probabilities) `Pr(W = w | T=spam)` and `Pr(W=w | T=non-spam)` as two separate functions. Note that in order to compute these likelihoods, you need to compute how many times `w` appears in the corpus (spam or not spam) and the total number of words (including duplicates) in that corpus. If a word does not occur at all, then assign it a small, suitably fixed non-zero probability. Note that the likelihoods `Pr(w | spam)` and `Pr(w | non-spam)` have to be estimated after removing articles and prepositions as done in (a).

In [13]:
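spamcount = Counter(spamwords)      #occurrences of each word in the spam corpus
nspamcount = Counter(nspamwords)    #occurrences of each word in the non spam corpus
def Pspam(w):           #Pr(W = w | T = spam)
    if w in spamcount:
        return spamcount[w]/len(spamwords)     #relative frequency of w among all spam words
    else:
        return 1/len(words)                    #small non-zero probability for unseen words
def Pnotspam(w):           #Pr(W = w | T = non-spam)
    if w in nspamcount:
        return nspamcount[w]/len(nspamwords)   #relative frequency of w among all non spam words
    else:
        return 1/len(words)                    #small non-zero probability for unseen words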
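
One common way to give unseen words a small non-zero probability is additive (Laplace) smoothing. The sketch below is an alternative to the fixed fallback value, not the method used in this notebook; `make_likelihood`, `alpha` and the toy word lists are assumptions made for illustration. It adds `alpha` to every count so that the probabilities over the vocabulary still sum to one.

```python
from collections import Counter

def make_likelihood(corpus_words, vocabulary_size, alpha=1.0):
    """Build a function estimating Pr(W = w | T) with additive smoothing."""
    counts = Counter(corpus_words)   # occurrences of each word in this corpus
    total = len(corpus_words)        # total words, duplicates included

    def likelihood(w):
        # Unseen words have count 0, so they receive alpha / denominator.
        return (counts[w] + alpha) / (total + alpha * vocabulary_size)

    return likelihood

# Made-up corpora standing in for the spam and non-spam word lists.
toy_spam_words = ["free", "prize", "free", "call"]
toy_nspam_words = ["call", "home", "ok", "call", "later"]
vocab = set(toy_spam_words) | set(toy_nspam_words)

p_spam = make_likelihood(toy_spam_words, len(vocab))
p_nspam = make_likelihood(toy_nspam_words, len(vocab))
print(p_spam("free"), p_nspam("free"))
```

With `alpha=1.0`, a word that never appears in a corpus gets probability `1 / (total + vocabulary_size)` rather than zero, so a single unseen word cannot wipe out the whole product in (c).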

(c) Let `M` be the random variable denoting a message (consisting of multiple words). Using the likelihood probabilities calculated in (b), implement a classifier that takes in a new SMS message `m=w1 w2 ... wi` and checks if it is spam or not using the naive Bayes' assumption. That is, compute `P(T=spam | M=m)` and `P(T=non-spam | M=m)` assuming that `P(m | spam) = P(w1 | spam) x P(w2 | spam) x ... x P(wi | spam)` and use this computation to decide if `m` is spam or not. 

In [14]:
def classifier(M):
    #same preprocessing as the training data: lower case, split, drop stop words
    msgw = [w for w in re.split(r'\W+', M.lower()) if w not in stopwords]
    ps = 1                                 #P(m | spam) under the naive Bayes assumption
    pns = 1                                #P(m | non-spam)
    for w in msgw:
        ps *= Pspam(w)
        pns *= Pnotspam(w)
    pSpam = len(spamInd)/len(spamID)       #prior based on number of spam messages
    pNotSpam = 1 - pSpam
    evidence = (ps*pSpam)+(pns*pNotSpam)   #P(M = m), shared by both posteriors
    pSpGivenM = (ps*pSpam)/evidence        #bayes rule
    pNotSpGivenM = (pns*pNotSpam)/evidence
    if pSpGivenM > pNotSpGivenM:   #printing the result
        print("spam")
    else:
        print("not spam")
classifier("You are a winner U have been specially selected 2 receive £1000 or a 4* holiday (flights inc) speak to a live operator 2 claim 0871277810910p/min (18+) ")

spam
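
Multiplying many small likelihoods can underflow to zero for long messages, which makes both posteriors indistinguishable. A minimal sketch of the same naive Bayes computation in log space follows; the function names, the toy likelihood tables and the `floor` value are assumptions made for illustration, not part of the notebook.

```python
import math

def log_posterior_scores(tokens, prior_spam, p_word_spam, p_word_nspam):
    """Return log(P(m | spam) P(spam)) and log(P(m | non-spam) P(non-spam))."""
    log_spam = math.log(prior_spam)
    log_nspam = math.log(1.0 - prior_spam)
    for w in tokens:
        log_spam += math.log(p_word_spam(w))
        log_nspam += math.log(p_word_nspam(w))
    return log_spam, log_nspam

def posterior_spam(log_spam, log_nspam):
    """Normalise two log scores into P(T = spam | M = m)."""
    m = max(log_spam, log_nspam)       # subtract the maximum before exponentiating
    e_spam = math.exp(log_spam - m)
    e_nspam = math.exp(log_nspam - m)
    return e_spam / (e_spam + e_nspam)

# Made-up likelihoods; words missing from a table fall back to a small floor.
toy_spam = {"free": 0.3, "call": 0.2}
toy_nspam = {"free": 0.05, "call": 0.25}
floor = 0.01

def p_s(w):
    return toy_spam.get(w, floor)

def p_ns(w):
    return toy_nspam.get(w, floor)

scores = log_posterior_scores(["free", "call", "free"], 0.2, p_s, p_ns)
print(posterior_spam(*scores))
```

Comparing `log_spam` and `log_nspam` directly is enough to pick the label; the normalisation step is only needed when the posterior value itself is wanted.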


(d) Test your classifier against 4-5 SMS messages (spam as well as non-spam) that you have received on your mobile phone.

In [16]:
messages = []
messages.append("Dear customer, As an appreciation for being an HDFC Bank customer, you can take advantage of new Credit Card without documentation.Check here: hdfcbk.io/a/oBXL4GVw")
messages.append("Congrats ! Get annual savings worth Rs. 12000 on your pre-approved Bajaj Finserv RBL Bank Credit card. Avail now v.db1.in/a7XZ40 SRIBALAJI")
messages.append("Congratulation! You are eligible for IDFC Bank Credit Card LIFE TIME FREE With limit upto 5 Lakhs.T&C Applied. Apply Now v.db1.in/9XXMyP SRIBALJI")
messages.append("Dear User, You’ve earned RummyCircle 100% Welcome Bonus upto Rs 3000* on Specials. Click on https://t.jio/winnings-RUMCPS to claim your coupon! T&C Team JioCoupon")
messages.append("Register free for AAKASH ANTHE for upto 100% Scholarship & Win a trip to NASA* Eligibility: Class 7 - 12 Online Exam from Home! Apply@ bit.ly/3QHRhR2 MOTACH")
for message in messages:      #classify every collected message
    classifier(message)

not spam
not spam
not spam
spam
not spam
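
To tally results like these against hand-assigned labels, the classifier needs to return its label rather than print it. The sketch below assumes such a variant; `evaluate`, `keyword_classifier` and `toy_test_set` are hypothetical, and the keyword rule is only a stand-in so that the example runs on its own.

```python
def evaluate(classify, labelled_messages):
    """Return (accuracy, misclassified messages) for a label-returning classifier."""
    mistakes = []
    for message, expected in labelled_messages:
        if classify(message) != expected:
            mistakes.append(message)
    accuracy = 1 - len(mistakes) / len(labelled_messages)
    return accuracy, mistakes

# Hypothetical stand-in for a classifier that returns "spam" or "not spam".
def keyword_classifier(message):
    return "spam" if "free" in message.lower() else "not spam"

# Made-up messages with hand-assigned labels.
toy_test_set = [
    ("Register free for the exam", "spam"),
    ("Claim your coupon today", "spam"),
    ("See you at lunch", "not spam"),
    ("Are you free this evening?", "not spam"),
]

accuracy, mistakes = evaluate(keyword_classifier, toy_test_set)
print(accuracy)
print(mistakes)
```

Keeping the misclassified messages alongside the accuracy makes it easier to see which words drive the wrong decisions.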
