# Detecting Spam with Spark

This is the final submission for the IN432 Big Data mid-term coursework (2018).

## Load and prepare the data

We will use the lingspam dataset in this coursework (see [http://csmining.org/index.php/ling-spam-datasets.html](http://csmining.org/index.php/ling-spam-datasets.html) for more information).

In [4]:
# Version control
%cd ~/notebook/work/
!git clone https://github.com/tweyde/City-Data-Science.git
%cd ~/notebook/work/City-Data-Science/
!git pull
%cd ./datasets/ 

/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work
fatal: destination path 'City-Data-Science' already exists and is not an empty directory.
/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science
Already up-to-date.
/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/datasets


In [5]:
# Extracting the dataset
print(">>> Extracting the ling_spam dataset, this can take a moment.")
!tar -xf lingspam_public02.tar.gz
print(">>> Unzipping finished.")

# We now have a new dataset in directory 'bare'.
%cd lingspam_public/bare 
print(">>> pwd ")
!pwd
print(">>> ls ")
!ls
# the line before last of output should show "part1 part10 part2  part3  part4  part5  part6  part7 part8 part9"
%cd ..

>>> Extracting the ling_spam dataset, this can take a moment.
>>> Unzipping finished.
/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/datasets/lingspam_public/bare
>>> pwd 
/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/datasets/lingspam_public/bare
>>> ls 
part1  part10  part2  part3  part4  part5  part6  part7  part8	part9
/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/datasets/lingspam_public


### Troubleshooting

In [6]:
# Set print options
import numpy as np
np.set_printoptions(precision=3)

In [7]:
from pyspark import SparkContext
sc = spark.sparkContext

In [8]:
# Fix "error while instantiating when multiple notebooks are opened"
!rm -Rf /gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/jupyter-rt/kernel-8d2bd250-95e0-4bc0-be9f-99d2d60b63f0-20180311_223518/metastore_db

## Task 1) Read the dataset and create RDDs 
a) Start by reading the directory with text files from the file system (`~/notebook/work/City-Data-Science/datasets/lingspam_public/bare`). Load all text files per directory (part1, part2, ..., part10).

b) Split the data into a training set and a test set.

c) Remove the path and extension from the filename.

**Note**: If the filename starts with 'spmsg' it is spam; otherwise it is not.
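
The filename clean-up in the last step can be sketched without Spark. This is a minimal illustration, assuming keys shaped like the `file:/.../part10/name.txt` paths that `wholeTextFiles` returns; the example path and the helper name `short_name` are hypothetical.

```python
import re

def short_name(path):
    """Return the filename without its directory path or extension."""
    # Split on '/' and '.'; the last piece is the extension,
    # the second-to-last piece is the bare filename.
    return re.split(r'[/.]', path)[-2]

# Hypothetical path, shaped like the keys returned by sc.wholeTextFiles
example = 'file:/data/lingspam_public/bare/part10/spmsgc99.txt'
name = short_name(example)
print(name)
```

Note that this approach assumes the filename itself contains exactly one dot, the one before the extension.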

In [9]:
from pathlib import Path
import re

def makeTestTrainRDDs(pathString):
    
    p = Path(pathString) # path object for the given directory
    dirs = list(p.iterdir()) # get the directories part1 ... part10 (iterdir() returns them in arbitrary order)
    print(dirs) # print to check that the directory is correct
    
    rddList = [] # create a list for the RDDs
    
    # Create an RDD for each 'part' directory and add them to rddList
    for d in dirs: # iterate through the directories
        rdd = sc.wholeTextFiles(str(d.absolute())) #>>> # read the files in the directory 
        rddList.append(rdd) #>>> append the RDD to the rddList
        
    print('len(rddList)',len(rddList))  # we should now have 10 RDDs in the list # just for testing
    
    # print(rddList[1].take(1)) # just for testing, comment out if worked.
    
    testRDD1 = rddList[9] # use the last directory in the list as the test set
    trainRDD1 = rddList[0] # start the training set from the first directory in the list
    
    # Create a union of the remaining RDDs
    for i in range(1,9):
        trainRDD1 = trainRDD1.union(rddList[i]) 
            
    # Remove the paths and extensions from the filename. 
    testRDD2 = testRDD1.map(lambda fn_txt: (re.split(r'[/.]', fn_txt[0])[-2], fn_txt[1]))
    trainRDD2 = trainRDD1.map(lambda fn_txt: (re.split(r'[/.]', fn_txt[0])[-2], fn_txt[1]))
    
    return (trainRDD2,testRDD2)

# this makes sure we are in the right directory
%cd ~/notebook/work/City-Data-Science/datasets/lingspam_public/

# this should show "bare  lemm  lemm_stop  readme.txt  stop"
!ls

# Testing the function makeTestTrainRDDs
trainRDD_testRDD = makeTestTrainRDDs('bare') # read from the 'bare' directory - this takes a bit of time
(trainRDD,testRDD) = trainRDD_testRDD # unpack the returned tuple

print('created the RDDs') # notify the user, so that we can figure out where things went wrong if they do.
print('testRDD.count(): ',testRDD.count()) # should be ~291 

#print('trainRDD.count(): ',trainRDD.count()) # should be ~2602 - commented out to save time

print('testRDD.getNumPartitions()',testRDD.getNumPartitions()) # normally 2 on DSX
print('testRDD.getStorageLevel()',testRDD.getStorageLevel()) # Serialized 1x Replicated on DSX
print('testRDD.take(1): ',testRDD.take(1)) # should be (filename, text)

rdd1 = testRDD # use this for development in the next tasks

/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/datasets/lingspam_public
bare  lemm  lemm_stop  readme.txt  stop
[PosixPath('bare/part5'), PosixPath('bare/part8'), PosixPath('bare/part2'), PosixPath('bare/part9'), PosixPath('bare/part3'), PosixPath('bare/part6'), PosixPath('bare/part1'), PosixPath('bare/part4'), PosixPath('bare/part7'), PosixPath('bare/part10')]
len(rddList) 10
created the RDDs
testRDD.count():  291
testRDD.getNumPartitions() 2
testRDD.getStorageLevel() Serialized 1x Replicated
testRDD.take(1):  [('9-66msg1', "Subject: xth conference of nordic and general ling .\n\nthe tenth conference of nordic and general linguistics will be held in reykjavik , iceland , from saturday june 6 , to monday june 8 , 1998 . it is organized by the institute of linguistics , university of iceland . the deadline for pre-registration at a reduced price is january 31 , 1998 . pre - registration forms and further information can be 

## Task 2) Tokenize and remove punctuation

We will use the Python [Natural Language Toolkit](http://www.nltk.org) *NLTK* to do the tokenization (rather than splitting the text ourselves, as these specialist tools usually do that better than we can). We use the NLTK function `word_tokenize`.

Then we will remove punctuation. There is no specific function for this, so we use a regular expression in a list comprehension.

We separate the keys and values of the RDD using the RDD functions `keys()` and `values()`, each of which yields a new RDD. Then we process the values and *zip* them together with the keys again. We wrap the whole sequence into one function, `prepareTokenRDD`, for later use.
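
The punctuation step can be sketched on a plain token list. In this sketch `str.split` stands in for `nltk.word_tokenize`, which is an assumption made so that it runs without the NLTK tokenizer model; the sample sentence is made up in the style of the dataset.

```python
import re

def remove_punctuation(tokens):
    """Strip non-word characters from each token, then drop empty tokens."""
    stripped = [re.sub(r'\W', '', t) for t in tokens]
    return [t for t in stripped if t]

# str.split() is a stand-in for nltk.word_tokenize (assumption of this sketch)
text = "pre - registration forms , due january 31 , 1998 ."
tokens = remove_punctuation(text.split())
print(tokens)
```

Tokens that consist only of punctuation become empty strings and are dropped, while a token such as `pre-registration` is kept with the hyphen removed.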

In [6]:
import nltk
import re
from nltk.corpus import stopwords

def tokenize(text):
    nltk.download('punkt') # this loads the standard NLTK tokenizer model 
    return nltk.word_tokenize(text) # use the nltk function word_tokenize
    
def removePunctuation(tokens):
    tokens2 =  [re.sub(r'\W','',s) for s in tokens] # use a list comprehension to remove punctuation
    return tokens2
    
def prepareTokenRDD(fn_txt_RDD):
    rdd_vals2 = fn_txt_RDD.values() # It's convenient to process only the values. 
    rdd_vals3 = rdd_vals2.map(tokenize) # Create a tokenised version of the values by mapping
    rdd_vals4 = rdd_vals3.map(removePunctuation) # remove punctuation from the values
    rdd4 = fn_txt_RDD.keys().zip(rdd_vals4) # Zip the two RDDs together 
    
    # remove any empty value strings (i.e. length 0) that we may have created by removing punctuation.
    rdd5 = rdd4.map(lambda x: (x[0], [y for y in x[1] if len(y)>0])) # remove empty strings
    rdd6 =  rdd5.filter(lambda x: len(x[1]) > 0) # remove items without tokens.
    
    return rdd6 

rdd2 = prepareTokenRDD(rdd1) # Use the test set for now, because it is smaller
print(rdd2.take(1)) # For checking result of task 2. 

[('9-66msg1', ['Subject', 'xth', 'conference', 'of', 'nordic', 'and', 'general', 'ling', 'the', 'tenth', 'conference', 'of', 'nordic', 'and', 'general', 'linguistics', 'will', 'be', 'held', 'in', 'reykjavik', 'iceland', 'from', 'saturday', 'june', '6', 'to', 'monday', 'june', '8', '1998', 'it', 'is', 'organized', 'by', 'the', 'institute', 'of', 'linguistics', 'university', 'of', 'iceland', 'the', 'deadline', 'for', 'preregistration', 'at', 'a', 'reduced', 'price', 'is', 'january', '31', '1998', 'pre', 'registration', 'forms', 'and', 'further', 'information', 'can', 'be', 'found', 'on', 'our', 'web', 'site', 'http', 'www', 'rhi', 'hi', 'is', 'nordconf', 'and', 'can', 'also', 'be', 'mailed', 'or', 'emailed', 'upon', 'request', 'papers', 'on', 'any', 'linguistic', 'topic', 'are', 'invited', 'especially', 'papers', 'on', 'synchronic', 'and', 'diachronic', 'aspects', 'of', 'the', 'nordic', 'languages', 'invited', 'speakers', 'anders', 'holmberg', 'tromsoe', 'syntax', 'tomas', 'riad', 'stock

## <font color='green'>Question: why should this filtering be done after zipping the keys and values together?</font>
**Answer:** To keep keys and values aligned. If empty values were filtered out before zipping, they would be discarded while their corresponding keys remained. The two RDDs would then contain different numbers of items, and the remaining keys would no longer line up with their values.
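
The alignment problem can be seen with plain Python lists standing in for the key and value RDDs. This is only an analogy: Python's `zip` silently pairs the wrong items, whereas Spark's `RDD.zip` is documented as assuming the same number of elements in each partition of both RDDs. The message names are made up.

```python
keys = ['msg1', 'msg2', 'msg3']
values = [['hello', 'world'], [], ['buy', 'now']]  # msg2 has no tokens left

# Wrong order: filter the values first, then zip with the untouched keys.
filtered_first = list(zip(keys, [v for v in values if v]))

# Right order: zip first, then filter whole (key, value) pairs.
zipped_first = [(k, v) for k, v in zip(keys, values) if v]

print(filtered_first)  # 'msg2' ends up paired with the tokens of msg3
print(zipped_first)    # keys and tokens still match
```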

## Task 3) Creating normalised TF.IDF vectors of defined dimensionality, measure the effect of caching.

We use the hashing trick to create fixed size TF vectors directly from the word list now.
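
A minimal sketch of the hashing trick follows. It swaps Python's built-in `hash` for `zlib.crc32`, which is an assumption of this sketch rather than what the notebook uses: since Python 3.3 the built-in string hash is salted per process unless `PYTHONHASHSEED` is set, so a checksum keeps the bucket assignment reproducible across processes.

```python
import zlib

def hashing_vectorize(tokens, n):
    """Map a list of tokens to a length-n vector of bucket counts."""
    vec = [0] * n
    for tok in tokens:
        # crc32 returns the same value in every process, unlike the
        # salted built-in hash() of Python 3 strings
        bucket = zlib.crc32(tok.encode('utf-8')) % n
        vec[bucket] += 1
    return vec

tokens = ['conference', 'of', 'nordic', 'and', 'general', 'of']
vec = hashing_vectorize(tokens, 10)
print(vec, sum(vec))
```

Every token adds exactly one count, so the entries always sum to the number of tokens; distinct words that land in the same bucket simply share a count.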

Then we'll use the IDF and Normalizer functions provided by Spark.

We want control of the dimensionality in the `normTFIDF` function, so we introduce an argument that lets us vary the dimensionality later. This is also an opportunity to benefit from caching.
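
To make the IDF and normalisation steps concrete, here is a plain-Python sketch on three toy count vectors. It assumes the smoothed formula documented for Spark MLlib's `IDF`, `log((m + 1) / (df + 1))` for `m` documents and document frequency `df`, and it divides each vector by its largest absolute value, which is what normalising with p = infinity means. The function names and toy data are illustrative only.

```python
import math

def idf_weights(tf_vectors):
    """Smoothed IDF per dimension: log((m + 1) / (df + 1))."""
    m = len(tf_vectors)
    dims = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[j] > 0) for j in range(dims)]
    return [math.log((m + 1) / (d + 1)) for d in df]

def normalise_inf(vec):
    """Scale a vector so that its largest absolute entry becomes 1."""
    biggest = max(abs(x) for x in vec)
    return [x / biggest for x in vec] if biggest > 0 else list(vec)

tf = [[2, 1, 0],
      [1, 0, 3],
      [4, 0, 0]]
idf = idf_weights(tf)
tfidf = [[t * w for t, w in zip(vec, idf)] for vec in tf]
normed = [normalise_inf(v) for v in tfidf]
print(idf)
print(normed)
```

The first dimension occurs in every document, so its IDF is zero and it drops out of all vectors. With very small vector sizes most hash buckets are non-empty in every document, which may explain why such settings produce mostly-zero TF.IDF vectors.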

In [7]:
# use the hashing trick to create a fixed-size vector from a word list

def hashing_vectorize(text,N): # arguments: the list and the size of the output vector
    v = [0] * N  # create vector of 0s
    for word in text: # iterate through the words 
        h = hash(word) # get the hash value (note: Python 3 salts string hashes per process unless PYTHONHASHSEED is set)
        v[h % N] += 1 # add 1 at the hashed address
    return v # return hashed word vector

from pyspark import StorageLevel # needed for persist() below
from pyspark.mllib.feature import IDF, Normalizer

def normTFIDF(fn_tokens_RDD, vecDim, caching=True):
    keysRDD = fn_tokens_RDD.keys()
    tokensRDD = fn_tokens_RDD.values()
    tfVecRDD = tokensRDD.map(lambda tokens: hashing_vectorize(tokens,vecDim)) #>>> passing the vecDim value.
    
    if caching:
        tfVecRDD.persist(StorageLevel.MEMORY_ONLY) # tfVecRDD is read twice (by fit and transform), so keeping it in memory avoids recomputing it
    idf = IDF() # create IDF object
    idfModel = idf.fit(tfVecRDD) # calculate IDF values
    tfIdfRDD = idfModel.transform(tfVecRDD)
    
    norm = Normalizer(float('inf')) # create a Normalizer with p = infinity, which scales each vector by its largest absolute value
    normTfIdfRDD = norm.transform(tfIdfRDD) # and apply it to the tfIdfRDD 
    zippedRDD =  keysRDD.zip(normTfIdfRDD) # zip the keys and values together
    
    return zippedRDD

testDim = 10 # testing
rdd3 = normTFIDF(rdd2, testDim, True) # testing
print(rdd3.take(1)) # we should now have tuples with ('filename',[N-dim vector])

[('9-66msg1', DenseVector([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]))]


### Task 3a) Caching experiment

The `normTFIDF` function lets us switch caching on or off. Write a bit of code that measures the effect of caching by timing both options.

Add a short comment on the result (why is there an effect, and why is it of this size?)
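
One way to structure such a measurement, sketched here without Spark: a small helper times a zero-argument callable several times and returns the individual durations. The helper name `time_trials` and the toy workload are assumptions for illustration. With Spark, the callable would need to end in an action such as `count()`, because transformations are lazy and would otherwise not be executed inside the timed region.

```python
from time import perf_counter

def time_trials(action, trials=3):
    """Run `action` `trials` times and return the duration of each run in seconds."""
    durations = []
    for _ in range(trials):
        start = perf_counter()
        action()
        durations.append(perf_counter() - start)
    return durations

calls = []
def toy_workload():
    # stand-in for building an RDD and calling an action on it
    calls.append(sum(i * i for i in range(100000)))

durations = time_trials(toy_workload, trials=3)
mean_time = sum(durations) / len(durations)
print('mean over {} trials: {:.4f} s'.format(len(durations), mean_time))
```

Keeping the individual durations, rather than only the mean, makes it possible to see whether the first run is an outlier.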

In [8]:
#run a small experiment with caching set to True or False, 3 times each

from time import time

resCaching = [] # for storing results
resNoCache = [] # for storing results
for i in range(3): # 3 samples
    
    startTime = time()  # start timer
    testRDD1 = normTFIDF(rdd2, testDim, True) # build the pipeline with caching
    testRDD1.count() # call an action on the RDD to force execution
    endTime = time()  # end timer
    resCaching.append( endTime - startTime ) # calculate the difference
    
    startTime = time()  # start timer
    testRDD2 = normTFIDF(rdd2, testDim, False) # build the pipeline without caching
    testRDD2.count() # call an action to force execution   
    endTime = time()  # end timer
    resNoCache.append( endTime - startTime )
    
meanTimeCaching = np.mean(resCaching)# calculate average times
meanTimeNoCache = np.mean(resNoCache)# calculate average times

print('Creating TF.IDF vectors, 3 trials - mean time with caching: ', meanTimeCaching, ', mean time without caching: ', meanTimeNoCache)

Creating TF.IDF vectors, 3 trials - mean time with caching:  15.7701662381 , mean time without caching:  18.1433038712


## <font color='green'>Results </font>

Mean time with caching: 15.77 s; mean time without caching: 18.14 s. Processing without caching takes approximately 15% longer than with caching.

## <font color='green'>Comments</font>

**Why is there an effect**: Caching keeps the data in memory for later use, so the data do not need to be recomputed each time they are accessed. This speeds up operations in which a specific RDD is read multiple times. Here `tfVecRDD` is read twice, once by `idf.fit` and once by `idfModel.transform`.

**Why of the size that it is:** The effect is noticeable because the RDD holds 291 items, each a tuple of the filename and a dense vector of 10 elements, and without caching every one of them has to be tokenised and hashed again on the second pass.

## Task 4) Create LabeledPoints 

Determine whether the file is spam (i.e. the filename starts with 'spmsg') and replace the filename with a 1 (spam) or 0 (non-spam) accordingly.
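
The labelling rule can be sketched with plain tuples in place of `LabeledPoint` objects (an assumption made so that the sketch runs without Spark). The second filename and both vectors are made up for illustration.

```python
def label_for(filename):
    """Return 1 for spam (filename starts with 'spmsg'), otherwise 0."""
    return 1 if filename.startswith('spmsg') else 0

# Toy (filename, vector) pairs
items = [('9-66msg1', [0.0, 1.0]), ('spmsgc99', [1.0, 0.5])]
labelled = [(label_for(name), vec) for name, vec in items]
print(labelled)
```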

In [9]:
from pyspark.mllib.regression import LabeledPoint

# create labelled points out of an RDD with (filename, normalised TF.IDF vector) items
def makeLabeledPoints(fn_vec_RDD): # only the RDD is needed; the vector size is already fixed
    cls_vec_RDD = fn_vec_RDD.map(lambda x: (1 if x[0].startswith('spmsg') else 0, x[1])) # use a conditional expression to get the class label (1 = spam, 0 = non-spam)
    
    # now we can create the LabeledPoint objects with (class,vector) arguments
    lp_RDD = cls_vec_RDD.map(lambda cls_vec: LabeledPoint(cls_vec[0],cls_vec[1]) ) 
    return lp_RDD 

# for testing
testLpRDD = makeLabeledPoints(rdd3) 
print(testLpRDD.take(1)) 

[LabeledPoint(0.0, [0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0])]


## Task 5) Complete the preprocessing 

Create a single function to do the preprocessing

In [10]:
# N is for controlling the vector size
def preprocess(rawRDD,N):
    """ take a (filename,text) RDD and transform into LabeledPoint objects 
        with class labels and a TF.IDF vector with N dimensions. """
    tokenRDD = prepareTokenRDD(rawRDD) # task 2
    tfIdfRDD = normTFIDF(tokenRDD,N) # task 3
    lpRDD = makeLabeledPoints(tfIdfRDD) # task 4
    return lpRDD # return RDD with LabeledPoints

# and with this we can start the whole process from a directory, N is again the vector size
def loadAndPreprocess(directory,N):
    """ load lingspam data from a directory and create a training and test set of preprocessed data """
    trainRDD_testRDD = makeTestTrainRDDs(directory) # read from the directory using the function created in task 1
    (trainRDD,testRDD) = trainRDD_testRDD # unpack the returned tuple
    
    return (preprocess(trainRDD,N),preprocess(testRDD,N)) # apply the preprocessing function defined above

trainLpRDD = preprocess(trainRDD,testDim) # prepare the training data
print(testLpRDD.take(1)) #testing

train_test_LpRDD = loadAndPreprocess('lemm',100) # Re-run with another vector size
(trainLpRDD,testLpRDD) = train_test_LpRDD 

print(testLpRDD.take(1))
print(trainLpRDD.take(1))

[LabeledPoint(0.0, [0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0])]
[PosixPath('lemm/part5'), PosixPath('lemm/part8'), PosixPath('lemm/part2'), PosixPath('lemm/part9'), PosixPath('lemm/part3'), PosixPath('lemm/part6'), PosixPath('lemm/part1'), PosixPath('lemm/part4'), PosixPath('lemm/part7'), PosixPath('lemm/part10')]
len(rddList) 10
[('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/datasets/lingspam_public/lemm/part8/6-939msg2.txt', 'Subject: child language acquistion\n\ni be look for information on elicitation technique and grammaticality judgement for 2 - 4 year old child . i would be grateful for your help on thus s subject . cathy finlay . university of ulster .\n')]
[LabeledPoint(0.0, [0.0640664810574,0.156270149468,0.28272862321,0.158772994227,0.264182002173,0.575501960201,0.756375820848,0.360451806382,0.369625030955,0.575501960201,0.172630846172,0.405814869475,0.631754026472,0.0,0.379600845063,0.112351647251,0.1026

## Task 6) Train some classifiers 

Use the `LabeledPoint` objects to train three classifiers: *Logistic Regression*, *Naive Bayes*, and *Support Vector Machine*. Calculate the accuracy of each model on the training set.
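
The accuracy calculation itself is simple enough to sketch on a plain list of (prediction, label) pairs. The pairs below are made up; in the notebook they would come from mapping a trained model over the `LabeledPoint` RDD.

```python
def accuracy(prediction_label_pairs):
    """Fraction of pairs in which the prediction equals the label."""
    pairs = list(prediction_label_pairs)
    correct = sum(1 for pred, label in pairs if pred == label)
    return correct / len(pairs)

# Toy (prediction, label) pairs: three of the five predictions are correct
pairs = [(0, 0), (1, 1), (0, 1), (0, 0), (1, 0)]
print('Accuracy {:.1%}'.format(accuracy(pairs)))
```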

In [13]:
from pyspark.mllib.classification import (NaiveBayes, LogisticRegressionWithLBFGS, SVMWithSGD) 
import numpy

def trainModel(lpRDD):
    """ Train 3 classifier models on the given RDD with LabeledPoint objects. A list of trained models is returned. """
    lpRDD.persist(StorageLevel.MEMORY_ONLY)
    
    # Train a classifier model.
    print('Starting to train the model') 
    model1 = LogisticRegressionWithLBFGS.train(lpRDD) # logistic regression, the most accurate model in the experiments below
    print('Trained LR (model1)')
    #print('type(model1)')
    model2 = NaiveBayes.train(lpRDD) # Naive Bayes
    print('Trained NB (model2)')
    #print(type(model2))
    model3 = SVMWithSGD.train(lpRDD) # linear SVM trained with SGD
    print('Trained SVM (model3)')
    return [model1,model2,model3]

def testModel(model, lpRDD):
    """ Tests the classification accuracy of the given model on the given RDD with LabeledPoint objects. """
    lpRDD.persist(StorageLevel.MEMORY_ONLY)
    
    # Make predictions and evaluate accuracy on the given set.
    predictionAndLabel = lpRDD.map(lambda p: (model.predict(p.features), p.label)) # get the prediction and ground truth (label) for each item.
    correct = predictionAndLabel.filter(lambda xv: xv[0] == xv[1]).count() # count the correct predictions 
    
    total = lpRDD.count() # count the items once and reuse the value
    accuracy = correct/total # and calculate the accuracy 
    
    print('Accuracy {:.1%} (data items: {}, correct: {})'.format(accuracy, total, correct)) # report to console
    return accuracy # and return the value  

models = trainModel(trainLpRDD) # just for testing
testModel(models[2], trainLpRDD) # just for testing

Starting to train the model
Trained LR (model1)
Trained NB (model2)
Trained SVM (model3)
Accuracy 84.9% (data items: 2602, correct: 2209)


0.8489623366641046

## Task 7) Automate training and testing

We now automate the whole process, from reading the files through preprocessing and training to evaluating the models. In the end we have a single function that takes all the parameters we are interested in and produces trained models and an evaluation.
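
The shape of the result that this automation produces can be sketched with toy stand-ins. Here each "model" is just a Python function from a number to a label, which is an assumption for illustration and not how the MLlib models work; the point is the 2 x n matrix with one row for training accuracy and one for test accuracy.

```python
def evaluate(model, data):
    """Fraction of (features, label) items that the model predicts correctly."""
    return sum(1 for x, y in data if model(x) == y) / len(data)

def train_test_matrix(models, train, test):
    """Row 0: training accuracy per model; row 1: test accuracy per model."""
    return [[evaluate(m, train) for m in models],
            [evaluate(m, test) for m in models]]

# Toy stand-ins for trained models
def always_ham(x):
    return 0

def threshold(x):
    return 1 if x > 0.5 else 0

train = [(0.9, 1), (0.1, 0), (0.2, 0), (0.7, 1)]
test = [(0.8, 1), (0.3, 0)]
results = train_test_matrix([always_ham, threshold], train, test)
print(results)
```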

In [14]:
def trainTestModel(trainRDD,testRDD):
    """ Trains 3 models and tests them on training and test data. Returns a matrix of the training and testing (rows) accuracy values for all models (columns). """
    models = trainModel(trainRDD) # train models on the training set
    results = [[],[]] # matrix for 2 modes (training/test) vs n models (currently 3)
    
    for mdl in models:
        print('Training')
        results[0].append(testModel(mdl, trainRDD)) # test the model on the training set
        
        print('Testing')
        results[1].append(testModel(mdl, testRDD)) # test the model on the test set
        
    return results

def trainTestFolder(folder,N):
    """ Reads data from a folder, preprocesses the data, and trains and evaluates models on it. """
    print('Start loading and preprocessing')
    train_test_LpRDD = loadAndPreprocess(folder,N) # create the RDDs
    
    print('Finished loading and preprocessing')
    (trainLpRDD,testLpRDD) = train_test_LpRDD # unpack the RDDs 
    
    return trainTestModel(trainLpRDD,testLpRDD) # train and test

trainTestFolder('lemm',1000) 

Start loading and preprocessing
[PosixPath('lemm/part5'), PosixPath('lemm/part8'), PosixPath('lemm/part2'), PosixPath('lemm/part9'), PosixPath('lemm/part3'), PosixPath('lemm/part6'), PosixPath('lemm/part1'), PosixPath('lemm/part4'), PosixPath('lemm/part7'), PosixPath('lemm/part10')]
len(rddList) 10
[('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/datasets/lingspam_public/lemm/part8/6-939msg2.txt', 'Subject: child language acquistion\n\ni be look for information on elicitation technique and grammaticality judgement for 2 - 4 year old child . i would be grateful for your help on thus s subject . cathy finlay . university of ulster .\n')]
Finished loading and preprocessing
Starting to train the model
Trained LR (model1)
Trained NB (model2)
Trained SVM (model3)
Training
Accuracy 100.0% (data items: 2602, correct: 2602)
Testing
Accuracy 97.9% (data items: 291, correct: 285)
Training
Accuracy 96.5% (data items: 2602, corre

[[1.0, 0.9654112221368178, 0.8889315910837817],
 [0.979381443298969, 0.9484536082474226, 0.8900343642611683]]

## Task 8) Run experiments 

We now have a single function that allows us to vary the vector size easily. Test vector sizes 3, 30, 300, 3000, 30000 and examine the effect on the classification accuracy in Experiment 1.

Use the function from Task 7) to test different data types. The dataset has raw text in folder `bare`, lemmatised text in `lemm` (similar to stemming, reduces words to their basic forms), text with stopwords removed in `stop`, and text that is both lemmatised and has stopwords removed in `lemm_stop`. Test how the classification accuracy differs for these four data types in Experiment 2. Collect the results in a data structure that can be saved for later analysis.

Comment on the results in a few sentences, considering the differences in performance between the different conditions as well as between train and test values. 15%
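
One possible data structure for collecting the results is a dictionary keyed by experimental setting, which can be serialised as JSON. This is a sketch under assumptions: the helper name `record` and the key format are hypothetical, and the accuracy values are the ones reported in the Task 7 output for `lemm` with N = 1000 (order: LR, NB, SVM).

```python
import json

results = {}

def record(setting, train_acc, test_acc):
    """Store the per-model training and test accuracies for one setting."""
    results[setting] = {'train': list(train_acc), 'test': list(test_acc)}

record('lemm, N=1000',
       [1.0, 0.9654112221368178, 0.8889315910837817],
       [0.979381443298969, 0.9484536082474226, 0.8900343642611683])

# Serialise to a string; json.dump(results, f) would write to a file instead
serialised = json.dumps(results, indent=2)
restored = json.loads(serialised)
print(restored['lemm, N=1000']['test'])
```

Keying by setting keeps each row self-describing, so the two experiments can be stored in the same structure without relying on list order.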

In [15]:
from pyspark.sql import DataFrame

folder = 'bare'
N = numpy.array([3,30,300,3000,30000]) 
print('\nEXPERIMENT 1: Testing different vector sizes')
results = []
for n in N:
    print('N = {}'.format(n))
    results.append(trainTestFolder(folder,n))
    
n = 3000
typeFolders = ['bare','stop','lemm','lemm_stop']
print('EXPERIMENT 2: Testing different data types')
for folder in typeFolders:
    print('Path = {}'.format(folder))
    results.append(trainTestFolder(folder,n))


EXPERIMENT 1: Testing different vector sizes
N = 3
Start loading and preprocessing
[PosixPath('bare/part5'), PosixPath('bare/part8'), PosixPath('bare/part2'), PosixPath('bare/part9'), PosixPath('bare/part3'), PosixPath('bare/part6'), PosixPath('bare/part1'), PosixPath('bare/part4'), PosixPath('bare/part7'), PosixPath('bare/part10')]
len(rddList) 10
[('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/datasets/lingspam_public/bare/part8/6-939msg2.txt', 'Subject: child language acquistion\n\ni am looking for information on elicitation techniques and grammaticality judgements for 2 - 4 year old children . i would be grateful for your help on thi s subject . cathy finlay . university of ulster .\n')]
Finished loading and preprocessing
Starting to train the model
Trained LR (model1)
Trained NB (model2)
Trained SVM (model3)
Training
Accuracy 83.4% (data items: 2602, correct: 2170)
Testing
Accuracy 83.2% (data items: 291, corr

In [66]:
from tabulate import tabulate
vectorsize = np.array(results)*100
header = ['\nSettings', 'Training accuracy\nModels\nLR      NB      SVM', 'Testing accuracy\nModels\nLR      NB      SVM']
row = ['N = 3', 'N = 30', 'N = 300', 'N = 3000', 'N = 30000', 'Folder: bare', 'Folder: stop', 'Folder: lemm', 'Folder: lemm_stop']
table1 = tabulate(vectorsize, header, tablefmt = 'grid', showindex = row, stralign = 'center')
print('============================ SUMMARY OF RESULTS ================================')
print(table1)

+-------------------+------------------------------+---------------------------+
|                   |      Training accuracy       |     Testing accuracy      |
|     Settings      |            Models            |          Models           |
|                   |     LR      NB      SVM      |    LR      NB      SVM    |
+===================+==============================+===========================+
|       N = 3       |  [ 83.397  83.397  83.397]   | [ 83.162  83.162  83.162] |
+-------------------+------------------------------+---------------------------+
|      N = 30       |  [ 88.778  83.397  83.397]   | [ 86.254  83.162  83.162] |
+-------------------+------------------------------+---------------------------+
|      N = 300      | [ 100.      93.659   90.968] | [ 96.907  92.096  91.065] |
+-------------------+------------------------------+---------------------------+
|     N = 3000      | [ 100.      99.116   89.354] | [ 97.938  97.938  88.66 ] |
+-------------------+------------------------------+---------------------------+
|     N = 30000     | [ 100.

## <font color='green'>Comments</font>

**Regarding hash vector size:** The larger the hash vector, the better the prediction, since different words are less likely to be assigned to the same position. However, accuracy does not appear to improve once the vector size exceeds 3000, which may be because the vocabulary of a single email rarely exceeds 3000 words. It is worth noting that all classifiers achieved ~83% accuracy even with a hash vector size of 3. This is due to class imbalance: approximately 83% of the emails are non-spam, which suggests that the classifiers are labelling all emails as non-spam. This emphasises the necessity of using precision and recall as the main evaluation criteria.
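
The point about class imbalance can be made concrete with a small worked example. It assumes a classifier that labels every test message as non-spam, and it derives the class counts from the reported figures: 83.162% of 291 test items is 242, which leaves 49 spam messages. Treating precision as 0 when nothing is predicted positive is a convention chosen for this sketch.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive (spam) class; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

ham, spam = 242, 49  # derived from 83.162% accuracy on 291 test items

# A classifier that predicts "non-spam" for everything:
tp, fp, fn, tn = 0, 0, spam, ham
accuracy = (tp + tn) / (ham + spam)
precision, recall = precision_recall(tp, fp, fn)
print('accuracy {:.1%}, precision {:.1%}, recall {:.1%}'.format(accuracy, precision, recall))
```

Accuracy looks respectable while recall for spam is zero, which is exactly the situation that accuracy alone cannot reveal.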

**Regarding lemmatisation and stop word removal:** In general, lemmatisation is expected to improve the models' accuracy because it groups different forms of the same word together; this is consistent with the results obtained. Removing stop words does not significantly affect the results, as stop words are usually considered to carry very little information.

**Regarding classifier performance:** Logistic regression matched or outperformed the other two classifiers in all settings. Naive Bayes performed best with a hash vector size of 3000. Interestingly, normalising samples to unit L1 or L2 norm by setting the normalizer parameter to 1 or 2 limits the SVM's accuracy to ~84%, while setting this parameter to "inf" boosted the SVM's accuracy to about 90%. This can be regarded as an alternative to tuning the SVM's kernel function. The results appear to contradict published empirical results, in which SVMs are often reported to be the best classifier, followed by Naive Bayes and then Logistic Regression. This might be attributed to differences in datasets and model tuning.
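
The three normalisation settings mentioned above differ only in which norm the vector is divided by. A plain-Python sketch, with a made-up vector and a hypothetical helper name `normalise`:

```python
def normalise(vec, p):
    """Divide a vector by its p-norm (p = 1, 2, ... or float('inf'))."""
    if p == float('inf'):
        norm = max(abs(x) for x in vec)
    else:
        norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    return [x / norm for x in vec] if norm > 0 else list(vec)

vec = [3.0, 4.0, 0.0]
for p in (1, 2, float('inf')):
    print(p, normalise(vec, p))
```

With p = 1 or 2 the entries of a long vector all become small, whereas p = infinity keeps the largest entry at 1. That difference in feature scale may be part of why a gradient-based linear SVM behaves differently under the three settings, though this was not tested here.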