# An Evaluation of Naive Bayesian Anti-Spam Filtering
**Authors**: Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, George Paliouras and Constantine D. Spyropoulos

**Recreated by** : Amadora, Angelo and Choy, Seaver Mathew

---
# 1 Introduction

Unsolicited bulk e-mail which are sent to thousands  of recipients, is becoming common. Anyone with an e-mail will most likely have encountered or at least been sent these e-mails called "spam". Spam is annoying and sent by marketers in order to entice unsuspecting people into clicking links for many purposes unknown to them. Apart from wasting time, spam costs money to users
with dial-up connections, wastes bandwidth, and may expose under-aged recipients to unsuitable (e.g.
pornographic) content. 


There are many anti-spam filters available however, they rely mostly on manually constructed patternmatching. These systems need to be maintained in order to match a user's incoming message. This method requires both time and expertise. It would stand to reason that a system that would learn automatically to separate spam from legitimate massages would present significant changes.

# 2 Implementation
In order to classify the documents, the system must be able to understand what a "word" means. In the code segment below, we create a class "Word" which contains spamCount, hamCount, word, Mutual Information (MI) with its probability of being legitimate. We also create wordList class in order to keep track of the words in the document which we want to look for the probability that it is spam.

In [124]:
class Word:
    def __init__(self, word):
        self.spamCount = 0
        self.hamCount = 0
        self.word = word
        self.MI = 0
        self.probSpam = 0
        self.probLegit = 0
        
    def addSpamCount(self):
        self.spamCount+=1
        
    def addHamCount(self):
        self.hamCount+=1    
    
    def setMI(self, mi):
        self.MI = mi
    
class wordList:
    def __init__(self):
        self.wordList = []
        
    def updateWordList(self, word, doctype):
        for currWord in self.wordList:
            if word == currWord.word and doctype == "ham":
                currWord.addHamCount()
                return 0
            elif word == currWord.word and doctype == "spam":
                currWord.addSpamCount()
                return 0
                
        newWord = Word(word)
        if doctype == "ham":
            newWord.addHamCount()
        elif doctype == "spam":
            newWord.addSpamCount()
        self.wordList.append(newWord)
        return 1
            
    def printWordList(self):
        for word in self.wordList:
            cnt = 0
            print(word.word + " " + "spam" + " " + str(word.spamCount))
            print(word.word + " " + "ham" + " " + str(word.hamCount))
            cnt += word.spamCount + word.hamCount
            print(str(cnt))

# 2.1 Training

This portion talks about training the system using the stop-lemm corpus. The code segment below initializes the variables we would need by reading every text file from the corpus. Take note that the stop-lemm corpus is a set of documents that have been pre processed. This means that words that are commonly used have been stripped from the document and each word is simplified.

In [125]:
import glob
import errno

def readTrainDataset(begin, end, test):
    numSpam = 0
    numLegit = 0
    wList = wordList()

    for i in range (begin,end):
        if(i != test):
            path = "emails/emails/lemm_stop/part" + str(i) + "/*.txt"
            files = glob.glob(path)
            print(i)

            for name in files:
                checkedWordList = []
                if "spmsg" in name:
                    numSpam += 1
                else:
                    numLegit += 1
                with open(name, 'r') as f:
                        for line in f:
                            for word in line.split():
                                if "spmsg" in name and word not in checkedWordList:
                                    wList.updateWordList(word, "spam")
                                    checkedWordList.append(word)
                                elif word not in checkedWordList:
                                    wList.updateWordList(word, "ham")
                                    checkedWordList.append(word)
    return (numSpam, numLegit, wList)


#numSpam, numLegit, wordList = readTrainDataset(1,9,10)


In [127]:
import math

def computeMI(wList):
    for word in wList.wordList:
        try:    
            mInformation = (word.spamCount/numTotal * math.log(numTotal * word.spamCount/((word.spamCount + word.hamCount) * numSpam),2))
        except:
            mInformation = 0
        try:
            mInformation += ((numSpam - word.spamCount)/numTotal * math.log(numTotal * (numSpam - word.spamCount)/((numSpam - word.spamCount + word.hamCount) * numSpam),2))
        except:
            mInformation += 0
        try:    
            mInformation += (word.hamCount/numTotal * math.log(numTotal * word.hamCount/((word.spamCount + word.hamCount) * (word.hamCount + (numLegit))),2))
        except:
            mInformation += 0
        try:
            mInformation += ((numLegit - word.hamCount) / numTotal * math.log((numTotal * (numLegit - word.hamCount)) / ((word.hamCount + word.spamCount) * (numLegit)),2))
        except:
            mInformation += 0
        #add MI
        word.setMI(mInformation)

    wList.wordList.sort(key=lambda x: x.MI, reverse=True)
    return wList

#wordList = computeMI(wordList)

#print(len(wordList.wordList))

In [128]:
#Training
def trainClassifier(wList):
    for word in wList.wordList:
        print(word.word + " " + str(word.MI))
        word.probSpam = (numSpam + 1) / (word.spamCount + 2)
        word.probLegit = (numLegit + 1) / (word.hamCount + 2)
    return wList

#wordList = trainClassifier(wordList)

In [12]:
#classify
def classifyEmails(numSpam, numLegit, wList):
    classificationList = []
    path = "emails/emails/lemm_stop/part" + str(10) + "/*.txt"
    files = glob.glob(path)
    print(i)

    NLL = 0
    NSS = 0
    NSL = 0
    NLS = 0
    NL = 0
    NS = 0

    for name in files:
        checkedWordList = []
        if "spmsg" in name:
            NS += 1
        else:
            NL += 1
        with open(name, 'r') as f:
                for line in f:
                    for word in line.split():
                        if word not in checkedWordList:
                            checkedWordList.append(word)
                probSpam = math.log((numSpam/(numSpam+numLegit)),10) #the probability of getting a spam document
                probLegit = math.log(numLegit/(numSpam+numLegit),10) #the probability of getting a legit document
                accSpam = 1
                accLegit = 1
                for word in wList.wordList:
                    if word.word in checkedWordList:
                        accSpam *= math.log(word.probSpam)
                        accLegit *= math.log(word.probLegit)
                    #else:
                        #try:
                            #accSpam += math.log(1 - word.probSpam)
                            #accLegit += math.log(1 - word.probLegit)
                        #except:
                            #accLegit += 0
                total = (probSpam + accSpam) - (probLegit + accLegit)
                if total >= math.log(999):
                    print(name + " spam")
                    if "spmsg" in name:
                        NSS += 1
                    else:
                        NSL += 1
                else:
                    print(name + " legitimate")
                    if "spmsg" not in name:
                        NLL += 1
                    else:
                        NLS += 1

    WAccuracy = (999*(NLL + NSS))/(999*(NS + NL))
    SRecall = NSS/(NSS + NSL)
    SPrecision = NSS/(NSS + NLS)
    
    return WAccuracy, SRecall, SPrecision
    
#WAccuracy, SRecall, SPrecision = classifyEmails(numSpam, numLegit, wordList)
#print(WAccuracy)
#print(SRecall)
#print(SPrecision)
                            


In [None]:
#Run this
WAccuracyList = []
SRecallList = []
SPrecisionList = []

#10 fold cross validation
for i in range(1,11)
    numSpam, numLegit, wordList = readTrainDataset(1,11,i)
    wordList = computeMI(wordList)
    wordList = trainClassifier(wordList)

    WAccuracy, SRecall, SPrecision = classifyEmails(numSpam, numLegit, wordList)
    WAccuracyList.append(WAccuracy)
    SRecallList.append(SRecall)
    SPrecisionList.add(SPrecision)
    
    print(WAccuracy)
    print(SRecall)
    print(SPrecision)

1
2
3
4
5
6
7
8
9
owner-operate 10.936735205283506
compensaation 10.936735205283506
703-5390 10.936735205283506
b2998 10.936735205283506
pva 10.936735205283506
kenmore 10.936735205283506
90029 10.936735205283506
felton 10.936735205283506
cozy 10.936735205283506
muzzel 10.936735205283506
deer 10.936735205283506
archery 10.936735205283506
sesson 10.936735205283506
110734 10.936735205283506
2622 10.936735205283506
iq 10.936735205283506
zsazsa36 10.936735205283506
sparkle124 10.936735205283506
herehttp 10.936735205283506
bizman 10.936735205283506
amcap 10.936735205283506
htmlor 10.936735205283506
pay-no 10.936735205283506
stitution 10.936735205283506
tyranny 10.936735205283506
caplin 10.936735205283506
distraint 10.936735205283506
fined 10.936735205283506
bandits 10.936735205283506
withholding 10.936735205283506
beat-the 10.936735205283506
extortion 10.936735205283506
withold 10.936735205283506
complience 10.936735205283506
milus 10.936735205283506
tary 10.936735205283506
guam 10.936735205

In [9]:
import math

#function returns the prediction of the system whether spam 
def predict(numSpam, numLegit,spamWords,legitWords):
    #numSpam contains total number of spam
    #numLegit contains total number of legit
    #spamWords contains the probabilities of the words as spam
    #legitWords contains the probabilities of the words as legit
    
    total = probabilityOfSpam(numSpam,numLegit,spamWords) - probabilityOfLegit(numSpam,numLegit,legitWords)
    
    if total >= math.log(999,10):
        return "spam"
    else:
        return "legitimate"
    

#Probability of spam, words is the vector of SIGNIFICANT words in the list obtained in the MI phase
def probabilityOfSpam(numSpam, numLegit, probabilities):
    #numSpam contains total number of spam
    #numLegit contains total number of legit
    #*words contains all the values of the probability of each signifacant word
    
    probSpam = math.log((numSpam/(numSpam+numLegit)),10) #the probability of getting a spam document
    accumulatedProbabilityOfWords = 1
    for num in list(probabilities):
        accumulatedProbabilities *= num
    
    accumulatedProbabilities = math.log(accumulatedProbabilities)
    return probSpam + accumulatedProbabilities

#Probability that the document is legitimate
def probabilityOfLegit(numSpam, numLegit, probabilities):
    #numSpam contains total number of spam
    #numLegit contains total number of legit
    #*words contains all the values of the probability of each signifacant word
    
    probLegit = math.log(numLegit/(numSpam+numLegit),10) #the probability of getting a legit document
    accumulatedProbabilities = 1
    
    for num in list(probabilities):
        accumulatedProbabilities *= num
    
    accumulatedProbabilites = math.log(accumulatedProbabilities)
    return probLegit + accumulatedProbabilities


In [11]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline #for jupyter only... this will not work in pycharm
#Creating a Table

def table_row_generator(index,pd,label_type,l,a,avg_recall,avg_precision,avg_accuracy,avg_accuracy_base,avg_tcr):
    raw_data = {
        '#': index,
        'Filter Configuration': [label_type],
        'Lambda': [l], 
        'No. of attrib.': [a],
        'Spam Recall': [avg_recall],
        'Spam Precision': [avg_precision],
        'Weighted Accuracy': [avg_accuracy],
        'Baseline W. Acc': [avg_accuracy_base],
        'TCR': [avg_tcr]
    }
    return pd.DataFrame(raw_data, columns = ['#', 'Filter Configuration', 'Lambda', 'No. of attrib.', 'Spam Recall', 'Spam Precision', 'Weighted Accuracy', 'Baseline W. Acc', 'TCR'])
       
df_i = table_row_generator(1,pd,label_type[0],1,50,avg_recall,avg_precision,avg_accuracy,avg_accuracy_base,avg_tcr)
df_j = ...

df_table = pd.concat([df_i, df_j])
df_table.set_index('#', inplace=True) #para matangal yung index na paulit ulit
df_table = df_table.rename_axis(None) #para matangal si #
df_table

#Creating a Scatter Plot
#define colors if you want
colors = ['b', 'r', 'g', 'm']

#You can plot using 2 methods
#1.Direct 
plt.scatter(x_coord, y_coord, marker='o', color=colors[0])
#or
plt.scatter(x_coord_list, y_coord_list, marker='o', color=colors[0])
#2 Indirect with legend
scatter_1 = plt.scatter(x_coord, y_coord, marker='o', color=colors[0])
plt.legend((scatter_1,...),
           ('no lemmatizer, no stop-list', 'no lemmatizer, top-100 stop-list', 'with lemmatizer, no stop-list', 'with lemmatizer, top-100 stop-list'),
           scatterpoints=1,
           loc='lower right', #where the legend box will be located
           ncol=1, 
           fontsize=8 #font size of legend)

plt.xlabel('spam recall') #x field
plt.ylabel('spam precision') #y field

plt.grid() #if you want to show grid
plt.show() #display

#For Line Scatter Plot
#Use this to connect the lines of the same type
plt.plot(x_coord, y_coord, ls = "dotted", c=colors[3])
#or
plt.plot(x_coord_list, y_coord_list, ls = "dotted", c=colors[3])

#REMEMBER you can simply replace the x_coord or y_coord with a LIST, Python does the rest


SyntaxError: invalid syntax (<ipython-input-11-25665dccff0e>, line 46)