# Coursework Part 1: Detecting Spam with Spark

These are the tasks for IN432 Big Data coursework 2020, part 1.  

This coursework is about classifying e-mail messages as spam or non-spam in Spark. We go through the whole process, from loading and preprocessing to training and testing classifiers in a distributed way in Spark. We use the techniques shown in the lectures and labs, and a few additional elements are introduced here, such as the Natural Language ToolKit (NLTK) and some of the preprocessing and machine learning functions that come with Spark.

## Load and prepare the data

We will use the lingspam dataset in this coursework (see [http://csmining.org/index.php/ling-spam-datasets.html](http://csmining.org/index.php/ling-spam-datasets.html) for more information).

The next cells only prepare the machine, as usual.

In [None]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar -xvf spark-2.4.5-bin-hadoop2.7.tgz > /dev/null
!pip install -q findspark
import os 
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/root/spark-2.4.5-bin-hadoop2.7"
%cd /content
import findspark
findspark.init()

import pyspark
# get a spark context
sc = pyspark.SparkContext.getOrCreate()
print(sc)
# get the context
spark = pyspark.sql.SparkSession.builder.getOrCreate()
print(spark) 

/root
/content
<SparkContext master=local[*] appName=pyspark-shell>
<pyspark.sql.session.SparkSession object at 0x7f5504d2ca90>


In [None]:
# We have a new dataset in directory BigData2020/data/lingspam_public .
%cd /content/drive/My Drive/BigData2020/data/lingspam_public 
# the output of the line above should show "bare  lemm  lemm_stop  readme.txt  stop"
!cat readme.txt
# the line above shows the content of the readme file, which explains the structure of the dataset
# Lemmatisation is a process similar to stemming

/content/drive/My Drive/BigData2020/data/lingspam_public
This directory contains the Ling-Spam corpus, as described in the 
paper:

I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, George Paliouras, 
and C.D. Spyropoulos, "An Evaluation of Naive Bayesian Anti-Spam 
Filtering". In Potamias, G., Moustakis, V. and van Someren, M. (Eds.), 
Proceedings of the Workshop on Machine Learning in the New Information 
Age, 11th European Conference on Machine Learning (ECML 2000), 
Barcelona, Spain, pp. 9-17, 2000.

There are four subdirectories, corresponding to four versions of 
the corpus:

bare: Lemmatiser disabled, stop-list disabled.
lemm: Lemmatiser enabled, stop-list disabled.
lemm_stop: Lemmatiser enabled, stop-list enabled.
stop: Lemmatiser disabled, stop-list enabled.

Each one of these 4 directories contains 10 subdirectories (part1, 
..., part10). These correspond to the 10 partitions of the corpus 
that were used in the 10-fold experiments. In each repetition, one 
part was reserved 

## Task 1) Read the dataset and create RDDs 

In [None]:
from pathlib import Path
import re

def makeTestTrainRDDs(pathString):
    """ Takes one of the four subdirectories of the lingspam dataset and returns two RDDs one each for testing and training. """
    # We should see 10 parts that we can use for creating train and test sets.
    p = Path(pathString) # gets a path object representing the current directory path.
    dirs = list(p.iterdir()) # get the directories part1 ... part10. 
#    print(dirs) # Print to check that you have the right directory. You can comment this out when checked. 
    rddList = [] # create a list for the RDDs
    # now create an RDD for each 'part' directory and add them to rddList
    print('creating RDDs')
    for d in dirs: # iterate through the directories
        dir_path = str(d.resolve())
        print(dir_path)
        rdd = sc.wholeTextFiles(dir_path) #>>> # read the files in the directory 
        rddList.append(rdd) #>>> append the RDD to the rddList
    print('len(rddList)', len(rddList))  # we should now have 10 RDDs in the list # just for testing
#   print(rddList[1].take(1)) # just for testing, comment out when it works.

    testRDD1 = rddList[9] # set the test set
    trainRDD1 = rddList[0] # start the training set with the first RDD in the list
    # now loop over list indices 1 to 8 to create a union of the remaining RDDs
    print('creating RDD union')
    for i in range(1, 9):
        trainRDD1 = trainRDD1.union(rddList[i]) #>>> create a union of the current and the next 
            # RDD in the list, so that in the end we have a union of list entries 0-8. (9 is used as test set)
    # both RDDs should remove the paths and extensions from the filename. 
    # This is done once after the loop, not on every iteration.
    testRDD2 = testRDD1.map(lambda fn_txt: (re.split(r'[/.]', fn_txt[0])[-2], fn_txt[1]))
    trainRDD2 = trainRDD1.map(lambda fn_txt: (re.split(r'[/.]', fn_txt[0])[-2], fn_txt[1]))
    return (trainRDD2, testRDD2)

# this makes sure we are in the right directory
%cd /content/drive/My Drive/BigData2020/data/lingspam_public 
# this should show "bare  lemm  lemm_stop  readme.txt  stop"
!ls 
# the code below is for testing the function makeTestTrainRDDs
trainRDD_testRDD = makeTestTrainRDDs('bare') # read from the 'bare' directory - this takes a bit of time
(trainRDD, testRDD) = trainRDD_testRDD # unpack the returned tuple
print('created the RDDs') # notify the user, so that we can figure out where things went wrong if they do.
print('testRDD.count(): ', testRDD.count()) # should be ~289
#print('trainRDD.count(): ', trainRDD.count()) # should be ~2604 - commented out to save time as it takes some time to create RDD from all the files
print('testRDD.getNumPartitions():', testRDD.getNumPartitions()) # normally 2 on Colab (single machine)
print('testRDD.getStorageLevel():', testRDD.getStorageLevel()) # Serialized, 1x Replicated, expected to be (False, False, False, False, 1) 
print('testRDD.take(1): ', testRDD.take(1)) # should be (filename, text); tokenisation happens in Task 2 
rdd1 = testRDD # use this for development in the next tasks 

/content/drive/My Drive/BigData2020/data/lingspam_public
bare  lemm  lemm_stop  readme.txt  stop
creating RDDs
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part10
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part9
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part8
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part7
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part6
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part5
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part4
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part3
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part2
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part1
len(rddList) 10
creating RDD union
created the RDDs
testRDD.count():  289
testRDD.getNumPartitions(): 2
testRDD.getStorageLevel(): Serialized 1x Replicated
testRDD.take(1):  [('3-1msg1', 'Subject: re : 2 . 88
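
The key step at the end of `makeTestTrainRDDs` reduces each full path to a short file name. The following is a minimal plain-Python sketch of that step, without Spark. The helper name `short_name` and the example paths are hypothetical and only mimic the directory layout printed above; the sketch also assumes that each file name contains exactly one dot.

```python
import re

def short_name(path):
    """Return the file name without directories or extension."""
    # Split on '/' and '.', then take the piece just before the extension.
    return re.split(r'[/.]', path)[-2]

# Hypothetical paths in the style of the (path, text) keys of a whole-file RDD.
paths = [
    'file:/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part1/3-1msg1.txt',
    'file:/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part1/spmsga1.txt',
]
names = [short_name(p) for p in paths]
print(names)  # ['3-1msg1', 'spmsga1']
```

Using a raw string for the pattern avoids the invalid-escape warning that newer Python versions raise for `'\.'` in an ordinary string.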

## Task 2) Tokenize and remove punctuation

We use the Python [Natural Language Toolkit](http://www.nltk.org) (*NLTK*) to do the tokenization, specifically the NLTK function `word_tokenize`. See here for a code example: [http://www.nltk.org/book/ch03.html](http://www.nltk.org/book/ch03.html). 


In [None]:
import nltk
import re
from nltk.corpus import stopwords

def tokenize(text):
    """ Apply the nltk.word_tokenize() method to our text, return the token list. """
    nltk.download('punkt') #  loads the standard NLTK tokenizer model 
    # It is important that this is done here in the function, as it needs to be done on every worker.
    # If we did the download outside this function, it would only be executed on the driver.
    return nltk.word_tokenize(text)
    
def removePunctuation(tokens):
    """ Remove punctuation characters from all tokens in a provided list. """
    # this will remove all punctuation from string s: re.sub(r'[()\[\],.?!";_]', '', s)
    tokens2 = [re.sub(r'[()\[\],.?!";_]', '', s) for s in tokens]
    return tokens2
    
def prepareTokenRDD(fn_txt_RDD):
    """ Take an RDD with (filename, text) elements and transform it into a (filename, [token ...]) RDD without punctuation characters. """
    rdd_vals2 = fn_txt_RDD.values() # It's convenient to process only the values. 
    rdd_vals3 = rdd_vals2.map(tokenize) # Create a tokenised version of the values by mapping
    rdd_vals4 = rdd_vals3.map(removePunctuation) # remove punctuation from the values
    rdd_kv = fn_txt_RDD.keys().zip(rdd_vals4) # we zip the two RDDs together 
    # i.e. produce tuples with one item from each RDD.
    # This works because we have only applied mappings to the values, 
    # therefore the items in both RDDs are still aligned.
    # >>> now remove any empty strings (i.e. length 0) that we may have 
    # created by removing punctuation, and resulting entries without words left.
    rdd_kvr = rdd_kv.map(lambda x: (x[0], [token for token in x[1] if len(token)>0]))
    rdd_kvrf = rdd_kvr.filter(lambda x: len(x[1])>0)
    
    return rdd_kvrf 

rdd2 = prepareTokenRDD(rdd1) # Use a small RDD for testing.
print(rdd2.take(1)) # For checking result of task 2. 

[('3-1msg1', ['Subject', ':', 're', ':', '2', '882', 's', '-', '>', 'np', 'np', '>', 'date', ':', 'sun', '15', 'dec', '91', '02', ':', '25', ':', '02', 'est', '>', 'from', ':', 'michael', '<', 'mmorse', '@', 'vm1', 'yorku', 'ca', '>', '>', 'subject', ':', 're', ':', '2', '864', 'queries', '>', '>', 'wlodek', 'zadrozny', 'asks', 'if', 'there', 'is', '``', 'anything', 'interesting', '``', 'to', 'be', 'said', '>', 'about', 'the', 'construction', '``', 's', '>', 'np', 'np', '``', 'second', '>', 'and', 'very', 'much', 'related', ':', 'might', 'we', 'consider', 'the', 'construction', 'to', 'be', 'a', 'form', '>', 'of', 'what', 'has', 'been', 'discussed', 'on', 'this', 'list', 'of', 'late', 'as', 'reduplication', 'the', '>', 'logical', 'sense', 'of', '``', 'john', 'mcnamara', 'the', 'name', '``', 'is', 'tautologous', 'and', 'thus', 'at', '>', 'that', 'level', 'indistinguishable', 'from', '``', 'well', 'well', 'now', 'what', 'have', 'we', 'here', '``', 'to', 'say', 'that', "'", 'john', 'mcnama

**Question:** Why should this filtering be done after zipping the keys and values together?

**Answer:** Filtering removes elements, and `zip` only works when both RDDs have the same number of partitions and the same number of elements in each partition. The keys and values stay aligned only because the steps applied to the values so far are pure mappings, which produce exactly one output per input. If empty entries were filtered out of the values RDD before zipping, the two RDDs would no longer line up, and we would have to find and remove the matching keys separately, which is error-prone and time-consuming. Filtering the zipped (key, value) pairs removes each key together with its value in a single step.
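
The alignment argument can be illustrated with a small plain-Python sketch that uses lists in place of RDDs. The data and the helper name `remove_punctuation` are hypothetical. Note that Python's built-in `zip` silently truncates to the shorter input, whereas Spark's `RDD.zip` expects matching partition sizes, so the Spark failure mode may look different from what is shown here.

```python
import re

def remove_punctuation(tokens):
    """Strip the same punctuation characters as removePunctuation above."""
    return [re.sub(r'[()\[\],.?!";_]', '', t) for t in tokens]

# Hypothetical (filename, tokens) data; the second message is only punctuation.
keys = ['msg1', 'msg2', 'msg3']
values = [['hello', ',', 'world'], ['!', '?'], ['buy', 'now', '!']]

# Map-only steps keep one output per input, so keys and values stay aligned.
cleaned = [[t for t in remove_punctuation(v) if t] for v in values]
pairs = list(zip(keys, cleaned))

# Filtering the pairs drops the key together with its empty value.
kept = [(k, v) for k, v in pairs if v]
print(kept)

# Filtering the values first shifts every value after the dropped item.
misaligned = list(zip(keys, [v for v in cleaned if v]))
print(misaligned)
```

In the second printout the tokens of `msg3` end up attached to the key `msg2`, which is exactly the kind of mismatch that filtering after zipping avoids.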

## Task 3) Creating normalised TF.IDF vectors of defined dimensionality.

In [None]:
# use the hashing trick to create a fixed-size vector from a word list
def hashing_vectorize(text, N): # arguments: the list and the size of the output vector
    v = [0] * N  # create vector of 0s
    for word in text: # iterate through the words 
        # note: hash() of a str is only reproducible across Python processes if PYTHONHASHSEED is fixed
        hashValue = hash(word)
        v[hashValue % N] += 1
    return v # return hashed word vector

from pyspark.mllib.feature import IDF, Normalizer

def normTFIDF(fn_tokens_RDD, vecDim):
    keysRDD = fn_tokens_RDD.keys()
    tokensRDD = fn_tokens_RDD.values()
    tfVecRDD = tokensRDD.map(lambda tokens: hashing_vectorize(tokens, vecDim)) 
    idf = IDF() # create IDF object
    idfModel = idf.fit(tfVecRDD) # calculate IDF values
    tfIdfRDD = idfModel.transform(tfVecRDD) # 2nd pass needed (see lecture slides), transforms RDD
    norm = Normalizer()
    normTfIdfRDD = norm.transform(tfIdfRDD)
    zippedRDD = keysRDD.zip(normTfIdfRDD)
    return zippedRDD

testDim = 10 # too small for good accuracy, but OK for testing
rdd3 = normTFIDF(rdd2, testDim) # test our normTFIDF function
print(rdd3.take(1)) # we should now have tuples with ('filename', [N-dim vector])
# e.g. [('3-1msg1', DenseVector([0.0, 0.1629, 0.6826, 0.0, 0.0, 0.0, 0.4017, 0.3258, 0.3133, 0.3766]))]

[('3-1msg1', DenseVector([0.0, 0.1629, 0.6826, 0.0, 0.0, 0.0, 0.4017, 0.3258, 0.3133, 0.3766]))]
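
To make the three steps of `normTFIDF` concrete, here is a plain-Python sketch of hashed term counts, IDF weighting and L2 normalisation, without Spark. It is an illustration under assumptions, not the notebook's implementation: it swaps the built-in `hash()` for `zlib.crc32` so that bucket indices are the same in every Python process, it uses the smoothed form log((m + 1) / (df + 1)) for the IDF, which is how the MLlib documentation describes it, and the documents and function names are made up for the example.

```python
import math
import zlib

def stable_hashing_vectorize(tokens, n):
    """Count tokens into n buckets using CRC32, which is stable across processes."""
    v = [0] * n
    for tok in tokens:
        v[zlib.crc32(tok.encode('utf-8')) % n] += 1
    return v

def tfidf_l2(tf_vectors):
    """Weight by smoothed IDF, log((m + 1) / (df + 1)), then L2-normalise each vector."""
    m = len(tf_vectors)
    n = len(tf_vectors[0])
    # Document frequency: in how many documents is each bucket non-zero?
    df = [sum(1 for v in tf_vectors if v[i] > 0) for i in range(n)]
    idf = [math.log((m + 1) / (d + 1)) for d in df]
    out = []
    for v in tf_vectors:
        w = [tf * g for tf, g in zip(v, idf)]
        norm = math.sqrt(sum(x * x for x in w))
        # Leave all-zero vectors unchanged to avoid dividing by zero.
        out.append([x / norm for x in w] if norm > 0 else w)
    return out

# Hypothetical token lists standing in for three messages.
docs = [['free', 'money', 'now'], ['meeting', 'notes'], ['free', 'notes', 'now', 'now']]
tf = [stable_hashing_vectorize(d, 8) for d in docs]
normalised = tfidf_l2(tf)
for vec in normalised:
    print([round(x, 3) for x in vec])
```

A bucket that is non-zero in every document gets an IDF of zero, so very common words contribute nothing after weighting.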


## Task 4) Create LabeledPoints 

Determine whether the file is spam (i.e. the filename contains ’spmsg’) and replace the filename by a 1 (spam) or 0 (non-spam) accordingly. Use `RDD.map()` to create an RDD of LabeledPoint objects. See here [http://spark.apache.org/docs/2.4.5/mllib-linear-methods.html#logistic-regression](http://spark.apache.org/docs/2.4.5/mllib-linear-methods.html#logistic-regression) 

In [None]:
from pyspark.mllib.regression import LabeledPoint

# create labelled points out of an RDD with normalised (filename, tf.idf-vector) items
def makeLabeledPoints(fn_vec_RDD): # takes an RDD of (filename, vector) pairs
    # we determine the true class as encoded in the filename and represent as 1 (spam) or 0 (good) 
    cls_vec_RDD = fn_vec_RDD.map(lambda x: (1, x[1]) if x[0].startswith('spmsg') else (0, x[1]))
    # now we can create the LabeledPoint objects with (class, vector) arguments
    lp_RDD = cls_vec_RDD.map(lambda cls_vec: LabeledPoint(cls_vec[0], cls_vec[1]) ) 
    return lp_RDD 

# for testing
testLpRDD = makeLabeledPoints(rdd3)
print(testLpRDD.take(1))
# should look similar to this: [LabeledPoint(0.0, [0.0,0.16290896085571283,0.6826175329317583,0.0,0.0,0.0,0.40170165983309447,0.32581792171142565,0.3132864631840631,0.3765953060935261])]

[LabeledPoint(0.0, [0.0,0.16290896085571283,0.6826175329317583,0.0,0.0,0.0,0.40170165983309447,0.32581792171142565,0.3132864631840631,0.3765953060935261])]


## Task 5) Complete the preprocessing 

In [None]:
# now we can apply the preprocessing chain to the data loaded in task 1 
# N is for controlling the vector size
def preprocess(rawRDD, N):
    """ take a (filename,text) RDD and transform into LabeledPoint objects 
        with class labels and a TF.IDF vector with N dimensions. 
    """
    # tasks 2, 3 and 4 
    tokenRDD = prepareTokenRDD(rawRDD) 
    tfIdfRDD = normTFIDF(tokenRDD,N) 
    lpRDD = makeLabeledPoints(tfIdfRDD) 
    return lpRDD # return RDD with LabeledPoints

# and with this we can start the whole process from a directory, N is again the vector size
def loadAndPreprocess(directory, N):
    """ load lingspam data from a directory and create a training and test set of preprocessed data """
    # read from the directory using the function created in task 1
    # unpack the returned tuple
    trainRDD_testRDD = makeTestTrainRDDs(directory)
    trainRDD, testRDD = trainRDD_testRDD
    return (preprocess(trainRDD, N), preprocess(testRDD, N)) # apply the preprocessing function defined above

trainLpRDD = preprocess(trainRDD, testDim) # prepare the training data
print(trainLpRDD.take(1)) # should look similar to previous cell's output

train_test_LpRDD = loadAndPreprocess('lemm', 100) # let's re-run with another vector size
(trainLpRDD, testLpRDD) = train_test_LpRDD
print(testLpRDD.take(1))
print(trainLpRDD.take(1))

[LabeledPoint(0.0, [0.5431144491961283,0.39172111622608674,0.35540711046827655,0.0,0.0,0.19045658048349204,0.3625647437051716,0.3592848102947122,0.2847076534848611,0.2177065059127546])]
creating RDDs
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part10
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part9
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part8
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part7
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part6
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part5
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part4
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part3
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part2
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part1
len(rddList) 10
creating RDD union
[LabeledPoint(0.0, [0.04138897380669189,0.06905413571056576,0.09383464193692988,0.05

## Task 6) Train some classifiers 

Use the `LabeledPoint` objects to train a classifier, specifically *Logistic Regression*, *Naive Bayes*, and *Support Vector Machine*. Calculate the accuracy of the model on the training set by dividing the number of correctly recognised messages by the total number of messages, again following the example [http://spark.apache.org/docs/2.4.5/ml-classification-regression.html#logistic-regression](http://spark.apache.org/docs/2.4.5/ml-classification-regression.html#logistic-regression) and the documentation for the classifiers [LogisticRegressionWithLBFGS](http://spark.apache.org/docs/2.4.5/api/python/pyspark.mllib.html#pyspark.mllib.classification.LogisticRegressionWithLBFGS), [NaiveBayes](http://spark.apache.org/docs/2.4.5/api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayes), [SVMWithSGD](http://spark.apache.org/docs/2.4.5/api/python/pyspark.mllib.html#pyspark.mllib.classification.SVMWithSGD).

In [None]:
from pyspark.mllib.classification import NaiveBayes, LogisticRegressionWithLBFGS, SVMWithSGD
from pyspark import StorageLevel

# train the model with a LabeledPoint RDD.
def trainModel(lpRDD):
    """ Train 3 classifier models on the given RDD with LabeledPoint objects. A list of trained models is returned. """
    # Train a classifier model.
    print('Starting to train the model') # give some immediate feedback
    model1 = LogisticRegressionWithLBFGS.train(lpRDD) # this is the best model
    print('Trained LR (model1)')
    #print('type(model1)')
    model2 = NaiveBayes.train(lpRDD) # doesn't work well
    print('Trained NB (model2)')
    #print(type(model2))
    model3 = SVMWithSGD.train(lpRDD) # or this ...
    print('Trained SVM (model3)')
    return [model1, model2, model3]

def testModel(model, lpRDD):
    """ Tests the classification accuracy of the given model on the given RDD with LabeledPoint objects. """
    lpRDD.persist(StorageLevel.MEMORY_ONLY)
    # Make predictions and evaluate the accuracy on the given set (training or test)
    # Get the prediction and the ground truth label
    predictionAndLabel = lpRDD.map(lambda p: (model.predict(p.features), p.label)) # get the prediction and ground truth (label) for each item
    correct = predictionAndLabel.filter(lambda xv: xv[0] == xv[1]).count() # count the correct predictions 
    #calculate the accuracy 
    accuracy = correct/predictionAndLabel.count()
    print('Accuracy {:.1%} (data items: {}, correct: {})'.format(accuracy, lpRDD.count(), correct)) # report to console
    return accuracy # and return the value  

models = trainModel(trainLpRDD) # just for testing
testModel(models[2], trainLpRDD) # just for testing

Starting to train the model
Trained LR (model1)
Trained NB (model2)
Trained SVM (model3)
Accuracy 83.4% (data items: 2604, correct: 2171)


0.8337173579109063
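
The accuracy calculation in `testModel` is a filter followed by a count. As a minimal plain-Python sketch of the same arithmetic, with a list of hypothetical (prediction, label) pairs standing in for the RDD:

```python
def accuracy(prediction_and_label):
    """Fraction of (prediction, label) pairs where both values agree."""
    pairs = list(prediction_and_label)
    correct = sum(1 for pred, label in pairs if pred == label)
    return correct / len(pairs)

# Hypothetical predictions: 3 of the 5 pairs agree.
pairs = [(0, 0), (0, 1), (1, 1), (0, 0), (1, 0)]
print('Accuracy {:.1%}'.format(accuracy(pairs)))  # Accuracy 60.0%

# The value returned above corresponds to 2171 correct out of 2604 items.
print(2171 / 2604)
```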

## Task 7) Automate training and testing

Automate the whole process from reading the files, through preprocessing, and training up to evaluating the models. 

In [None]:
# this function combines the previous two functions
# this method should take RDDs with LabeledPoints
def trainTestModel(trainRDD, testRDD):
    """ Trains 3 models and tests them on training and test data. Returns a matrix of accuracy values: row 0 holds the training accuracies, row 1 the test accuracies, with one column per model. """
    # train models on the training set
    models = trainModel(trainRDD)
    results = [[], []] # matrix for 2 modes (training/test) vs n models (currently 3)
    for mdl in models:
        print('Training')
        # test the model on the training set
        results[0].append(testModel(mdl, trainRDD))
        print('Testing')
        # test the model on the test set
        results[1].append(testModel(mdl, testRDD))
    return results

def trainTestFolder(folder,N):
    """ Reads data from a folder, preprocesses the data, and trains and evaluates models on it. """
    print('Start loading and preprocessing') 
    train_test_LpRDD = loadAndPreprocess(folder,N) # create the RDDs
    print('Finished loading and preprocessing')
    (trainLpRDD, testLpRDD) = train_test_LpRDD # unpack the RDDs 
    return trainTestModel(trainLpRDD,testLpRDD) # train and test

trainTestFolder('lemm', 1000) 

Start loading and preprocessing
creating RDDs
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part10
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part9
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part8
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part7
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part6
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part5
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part4
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part3
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part2
/content/drive/My Drive/BigData2020/data/lingspam_public/lemm/part1
len(rddList) 10
creating RDD union
Finished loading and preprocessing
Starting to train the model
Trained LR (model1)
Trained NB (model2)
Trained SVM (model3)
Training
Accuracy 100.0% (data items: 2604, correct: 2604)
Testing
Accuracy 97.2% (data items: 289, correct: 281)


[[1.0, 0.9443164362519201, 0.8337173579109063],
 [0.972318339100346, 0.9273356401384083, 0.8339100346020761]]
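
The returned matrix is easier to read with row and column labels. A small sketch that formats the values shown above, assuming the column order LR, NB, SVM from `trainModel` and the row order training, test from `trainTestModel`:

```python
# Accuracy matrix copied from the output above.
results = [[1.0, 0.9443164362519201, 0.8337173579109063],
           [0.972318339100346, 0.9273356401384083, 0.8339100346020761]]
model_names = ['LR', 'NB', 'SVM']   # column order, as returned by trainModel
row_names = ['train', 'test']       # row order, as filled by trainTestModel

# Header line followed by one line per row, accuracies as percentages.
lines = ['{:<6}'.format('') + ''.join('{:>8}'.format(m) for m in model_names)]
for name, row in zip(row_names, results):
    lines.append('{:<6}'.format(name) + ''.join('{:>8.1%}'.format(a) for a in row))
print('\n'.join(lines))
```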

## Task 8) Run experiments 

We now have a single function that allows us to vary the vector size easily. Test vector sizes 3, 30, 300, 3000, 30000 and examine the effect on the classification accuracy in Experiment 1.

Use the function from Task 7) to test different data types. The dataset has raw text in folder `bare`, lemmatised text in `lemm` (similar to stemming, reduces to basic word forms), text with stopwords removed in `stop`, and text that is both lemmatised and has stopwords removed in `lemm_stop`. Test how the classification accuracy differs for these four data types in Experiment 2. Collect the results in a data structure for later analysis.



In [None]:
folder = 'bare'
N = [3, 30, 300, 3000, 30000]
print('\nEXPERIMENT 1: Testing different vector sizes')
results_vectorsizes = []
for n in N:
    print('N = {}'.format(n))
    result = {'n': n, 't': folder}
    result['acc'] = trainTestFolder(folder, n)
    results_vectorsizes.append(result)


EXPERIMENT 1: Testing different vector sizes
N = 3
Start loading and preprocessing
creating RDDs
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part10
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part9
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part8
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part7
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part6
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part5
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part4
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part3
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part2
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part1
len(rddList) 10
creating RDD union
Finished loading and preprocessing
Starting to train the model
Trained LR (model1)
Trained NB (model2)
Trained SVM (model3)
Training
Accuracy 83.4% (data items: 2604, correct: 2171)
Test

In [None]:
n = 3000
typeFolders = ['bare', 'stop', 'lemm', 'lemm_stop']
print('EXPERIMENT 2: Testing different data types')
results_preprocessing = []
for folder in typeFolders:
    print('Path = {}'.format(folder))
    result = {'n': n, 't': folder}
    result['acc'] = trainTestFolder(folder, n)
    results_preprocessing.append(result)

# Add comments on the performance in a cell below. 

EXPERIMENT 2: Testing different data types
Path = bare
Start loading and preprocessing
creating RDDs
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part10
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part9
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part8
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part7
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part6
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part5
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part4
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part3
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part2
/content/drive/My Drive/BigData2020/data/lingspam_public/bare/part1
len(rddList) 10
creating RDD union
Finished loading and preprocessing
Starting to train the model
Trained LR (model1)
Trained NB (model2)
Trained SVM (model3)
Training
Accuracy 100.0% (data items: 2604, correct: 2604)
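
For later analysis it can help to flatten the collected result dictionaries into one row per model. The sketch below assumes the structure built in the two experiment cells above (`'n'`, `'t'`, and `'acc'` holding the training row and the test row). The function name `flatten` and all accuracy values in the example are hypothetical and are not measured results.

```python
def flatten(results, model_names=('LR', 'NB', 'SVM')):
    """Turn [{'n', 't', 'acc': [train_row, test_row]}] into one dict per model."""
    rows = []
    for r in results:
        train_acc, test_acc = r['acc']
        for name, tr, te in zip(model_names, train_acc, test_acc):
            rows.append({'n': r['n'], 't': r['t'], 'model': name,
                         'train': tr, 'test': te})
    return rows

# Hypothetical accuracy values for illustration only.
example = [
    {'n': 3, 't': 'bare', 'acc': [[0.8, 0.8, 0.8], [0.8, 0.8, 0.8]]},
    {'n': 30, 't': 'bare', 'acc': [[0.9, 0.85, 0.8], [0.88, 0.84, 0.8]]},
]
rows = flatten(example)

# Pick the setting with the highest test accuracy for one model.
best = max((r for r in rows if r['model'] == 'LR'), key=lambda r: r['test'])
print(best['n'], best['test'])
```

The same call works on `results_vectorsizes` and `results_preprocessing`, since both lists use the same dictionary layout.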


**Experiment 1**


---



An initial observation is that training and testing accuracy differ, which is expected. With a vector size of 3, however, all models return an accuracy of 83.4% on both training and test data, so it is reasonable to assume that they predict every message as non-spam (the majority class); because the classes are highly imbalanced, this alone yields 83.4%. With a vector size of 30, which is still small, there is not much difference between training and testing accuracy. As the vector size increases further, the performance changes, as described for each model below.

*Logistic Regression*
- Performance improves as the vector size increases: testing accuracy rises from 83.4% (vector size 3) to 97.2% (vector size 30000). 

*Naive Bayes*
- Naive Bayes performance improves as the vector size increases from 3 to 3000, similar to Logistic Regression. However, at a vector size of 30000 the accuracy drops from 96.2% to 83.4%, which could suggest that beyond a certain point we are adding noise rather than informative features.

*SVM*
- SVM is very ineffective: changing the vector size did not contribute to its performance.
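
The majority-class reading of the 83.4% figure can be checked with a short sketch. The class counts below are an assumption inferred from the printed output (2171 correct out of 2604 training items, and a test accuracy of 0.8339 on 289 items), on the hypothesis that every message is predicted as non-spam; they are not read from the dataset itself.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Assumed class counts (0 = non-spam, 1 = spam), inferred from the output above.
train_labels = [0] * 2171 + [1] * 433
test_labels = [0] * 241 + [1] * 48

train_base = majority_baseline(train_labels)
test_base = majority_baseline(test_labels)
print('train baseline {:.1%}'.format(train_base))
print('test baseline {:.1%}'.format(test_base))
```

Any model whose accuracy equals this baseline on both sets is probably not using the features at all, which is why accuracy alone can be misleading on imbalanced data.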

**Experiment 2**

---



*Logistic Regression*
- Performed excellently on the training data, with 100% accuracy for all data types. This may suggest overfitting, but the lowest testing accuracy is 96.9% and the highest is 97.6%, which shows that this is not the case and that the model performs really well. 
- Stop-word removal and lemmatisation each improved the accuracy, but doing both reduced it, which may be due to lemmatisation producing more stop words that then end up being removed.

*Naive Bayes*
- Naive Bayes performs really well on all data types, with the best accuracy on "bare", which suggests that pre-processing does not improve the accuracy of this model.

*SVM*
- No change in performance even after pre-processing. 

Logistic Regression is the best performing model. 