## CS431/631 Data Intensive Distributed Computing
---

In [None]:
!apt-get update -qq > /dev/null
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext(appName="IrTest", master="local[*]")

I will be using Python and Spark to perform spam detection.   I will need to perform two tasks.   The first is to build spam prediction models, using training data sets and stochastic gradient descent (SGD).   The second is to use these models to predict whether the documents in a test data set are spam.
The stochastic gradient descent technique that I will be using is based on [a paper](http://arxiv.org/abs/1004.5168) by Cormack, Smucker and Clarke.

#### Training a Spam Classification Models
To build a spam classification model, I will start with a training data set.   Each instance in the training set represents a single document, and is labeled to indicate whether that document should be considered to be spam or ham.
An instance looks like this:
```
clueweb09-en0094-20-13546 spam 387908 697162 426572 161118 688171 ...
```
The first field, `clueweb09-en0094-20-13546`, is the (unique) document name.   The second field is the label, indicating whether the document should be considered spam (as in this example) or ham.   The remaining fields are integers representing *features* present in the document.   In this case, the features are hashed byte 4-grams, represented as integers.   Each training data set is stored as a text file, with one instance per line.   The training files  are:
* `spam.train.group_x.txt`   (25 MB)
* `spam.train.group_y.txt`   (20 MB)
* `spam.train.britney.txt`   (766 MB)

Now let's download the spamminess module and the training traces I will use in this assignment. This will take a few minutes. The ls command at the end shows the files I have in this directory.

In [None]:
!wget -q https://student.cs.uwaterloo.ca/~cs451/W20/content/cs431/spamminess.py
!wget -q https://www.student.cs.uwaterloo.ca/~cs451/spam/spam.train.group_x.txt.bz2
!wget -q https://www.student.cs.uwaterloo.ca/~cs451/spam/spam.train.group_y.txt.bz2
!wget -q https://www.student.cs.uwaterloo.ca/~cs451/spam/spam.train.britney.txt.bz2

!bunzip2 spam.train.group_x.txt.bz2
!bunzip2 spam.train.group_y.txt.bz2
!bunzip2 spam.train.britney.txt.bz2
!ls

sample_data		spam.train.group_y.txt
spamminess.py		spark-2.4.7-bin-hadoop2.7
spam.train.britney.txt	spark-2.4.7-bin-hadoop2.7.tgz
spam.train.group_x.txt


In [None]:
from spamminess import spamminess
from math import exp

def sequential_SGD(model, training_dataset='spam.train.group_x.txt', delta = 0.002):
    # open one of the training files - defaults to group_x
    with open(training_dataset) as f:
      for line in f:

        def parse_line(line):
          els = line.split(' ')
          doc = els[0]
          t = els[1]
          F = els[2:]
          return doc, t, F

        doc, t, F = parse_line(line)

        # find the spamminess of the current document using the current model:
        score = spamminess(F, model)

        # squash the logit (log odds) i.e the score on a probability scale using sigmoid function. Then, update the model.
        prob = 1.0/(1+exp(-score))

        for feature in F:

          if t == 'spam':
            if feature in model:
              #print("feature", feature, "in model. Update += ", (1.0-prob)*delta)
              model[feature] += (1.0-prob)*delta # looks like I are adding the error. true is 1 and error is 1-prob. but why add this error in prob.
            else:
              #print("feature", feature, "not in model. Set = ", (1.0-prob)*delta)
              model[feature] = (1.0-prob)*delta

          elif t == 'ham':
            if feature in model:
              #print("feature", feature, "in model. Update -= ", (prob)*delta)
              model[feature] -= (prob)*delta
            else:
              #print("feature", feature, "not in model. Set = ", (-prob)*delta)
              model[feature] = (-prob)*delta

    return model

In [None]:
model = sequential_SGD({}) # Providing an empty model
print(max(model.items(), key=lambda x:x[1]))
print(min(model.items(), key=lambda x:x[1]))

('288281', 0.022990809445890017)
('358032', -0.024400483573413394)


I will now be implementing a Spark version of the SGD model trainer.   The Spark implementation should read a training file, train the model, and then output the model to the `models` folder.  The model output file that I generate should list the weight associated with each feature, with one feature per line, like this:
```
(802123, 0.0009858585991850937)
(438450, 4.267897922108138e-05)
(271525, 0.0013133437007968654)
(92853, 0.0004300009932503611)
```

In [None]:
from spamminess import spamminess
from math import exp
import shutil, os

def spark_SGD(training_dataset='spam.train.group_x.txt', output_model='models/group_x_model', delta = 0.002):

    if os.path.isdir(output_model):
        shutil.rmtree(output_model) # Remove the previous model to create a new one
    training_data = sc.textFile(training_dataset)

    def parse_line(line):
      els = line.split(' ')
      doc = els[0]
      t = els[1]
      F = els[2:]
      print("done with parsing")
      return doc, t, F

    def trainer(x):

      model = {}

      i = 0
      for datapoint in x:
        i+=1

        doc = datapoint[0]
        t = datapoint[1]
        F = datapoint[2]

        # calculate spamminess score
        score = spamminess(F, model)

        # squash the logit (log odds) i.e the score on a probability scale using sigmoid function. Then, update the model.
        prob = 1.0/(1+exp(-score))

        for feature in F:

          if t == 'spam':
            if feature in model:
              model[feature] += (1.0-prob)*delta # looks like I are adding the error. true is 1 and error is 1-prob. but why add this error in prob.
            else:
              model[feature] = (1.0-prob)*delta

          elif t == 'ham':
            if feature in model:
              model[feature] -= (prob)*delta
            else:
              model[feature] = (-prob)*delta

      for feature, weight in model.items():
        yield (int(feature), weight)

    training_data = training_data.map(lambda x: parse_line(x)).coalesce(1)
    weights = training_data.mapPartitions(lambda x: trainer(x))

    # save file
    weights.saveAsTextFile(output_model)

    return weights.collectAsMap()

In [None]:
# Ir tests here
model_x = spark_SGD()
model_y = spark_SGD(training_dataset='spam.train.group_y.txt', output_model='models/group_y_model')
model_britney = spark_SGD(training_dataset='spam.train.britney.txt', output_model='models/britney_model')

print(max(model_x.items(), key=lambda x:x[1]))
print(min(model_x.items(), key=lambda x:x[1]))

(288281, 0.022996007768337472)
(358032, -0.0243996483857188)


When I train a model using SGD, the model I get depends on the order in which I handle the training instances.  To see this in action, try using the Spark SGD trainer I implemented earlier to train a model from the group_x training set, but with the instances processed in a different order. 

In [None]:
from spamminess import spamminess
from math import exp
import shutil, os, random
import numpy as np

def spark_shuffled_SGD(training_dataset='spam.train.group_x.txt', output_model='models/group_x_model', delta = 0.002):
    if os.path.isdir(output_model):
        shutil.rmtree(output_model) # Remove the previous model to create a new one
    training_data = sc.textFile(training_dataset)

    def parse_line(line):
      els = line.split(' ')
      doc = els[0]
      t = els[1]
      F = els[2:]
      return doc, t, F

    def add_random_sort_key():
      # Although this does not assign a unique random key
      # to each row, it does meet the requirement for shuffling purposes
      return np.random.randint(1, 785)

    def trainer(x):

      model = {}

      i = 0
      for datapoint in x:
        i+=1

        doc = datapoint[0]
        t = datapoint[1]
        F = datapoint[2]

        # calculate spamminess score
        score = spamminess(F, model)

        # squash the logit (log odds) i.e the score on a probability scale using sigmoid function. Then, update the model.
        prob = 1.0/(1+exp(-score))

        for feature in F:

          if t == 'spam':
            if feature in model:
              model[feature] += (1.0-prob)*delta # looks like I are adding the error. true is 1 and error is 1-prob. but why add this error in prob.
            else:
              model[feature] = (1.0-prob)*delta

          elif t == 'ham':
            if feature in model:
              model[feature] -= (prob)*delta
            else:
              model[feature] = (-prob)*delta

      for feature, weight in model.items():
        yield (int(feature), weight)

    shuffled_data = training_data.map(lambda x: parse_line(x)).map(lambda x: (add_random_sort_key(), x)).sortByKey().map(lambda x: (x[1][0], x[1][1], x[1][2])).coalesce(1)
    weights = shuffled_data.mapPartitions(lambda x: trainer(x))

    # save file
    weights.saveAsTextFile(output_model)

    return weights.collectAsMap()

In [None]:
# Ir tests here
shuffled_model_x = spark_shuffled_SGD()
shuffled_model_y = spark_shuffled_SGD(training_dataset='spam.train.group_y.txt', output_model='models/group_y_model')
shuffled_model_britney = spark_shuffled_SGD(training_dataset='spam.train.britney.txt', output_model='models/britney_model')

print(max(shuffled_model_x.items(), key=lambda x:x[1]))
print(min(shuffled_model_x.items(), key=lambda x:x[1]))

print(max(model_x.items(), key=lambda x:x[1]))
print(min(model_x.items(), key=lambda x:x[1]))

(658098, 0.02533142398599327)
(365972, -0.02753083626464579)
(288281, 0.022996007768337472)
(358032, -0.0243996483857188)


In [None]:
!wget -q https://www.student.cs.uwaterloo.ca/~cs451/spam/spam.test.qrels.txt.bz2
!bunzip2 spam.test.qrels.txt.bz2
!ls

models	       spam.test.qrels.txt     spark-2.4.7-bin-hadoop2.7
__pycache__    spam.train.britney.txt  spark-2.4.7-bin-hadoop2.7.tgz
sample_data    spam.train.group_x.txt
spamminess.py  spam.train.group_y.txt


In [None]:
from spamminess import spamminess
import shutil, os

def spark_classify(input_model='models/group_x_model', test_dataset='spam.test.qrels.txt', results_path='results/test_qrels'):
    if os.path.isdir(results_path):
        shutil.rmtree(results_path) # Remove the previous results
    test_data = sc.textFile(test_dataset)

    def parse_line(line):
      els = line.split(' ')
      doc = els[0]
      t = els[1]
      F = els[2:]
      return doc, t, F

    # read model
    def model_parser(str):
        import re
        array = re.split(',',str)
        return int(array[0].replace('(',"")), float(array[1].replace(')', ''))

    # reading and sending model as a broadcast to executors
    model = sc.textFile('models/group_x_model').map(lambda x: model_parser(x)).collectAsMap()
    modelBroadcast = sc.broadcast(model)

    def get_prediction(F, model):
      # calculate spamminess score
      F = list(map(int, F))
      score = spamminess(F, model) 

      # assuming score = 0 => ham  
      if score > 0:
        return score, 'spam'
      else:
        return score, 'ham'

    # parsing test data and getting predictions
    test_data = test_data.map(lambda x: parse_line(x)).map(lambda x: (x[0], x[1], get_prediction(x[2], modelBroadcast.value))).map(lambda x: (x[0], x[1], x[2][0], x[2][1]))
    
    # save file
    test_data.saveAsTextFile(results_path)

    return test_data

In [18]:
# Ir tests here
spark_classify().collect()
spark_classify('models/group_y_model', results_path='results/test_qrels_y').collect()
spark_classify('models/britney_model', results_path='results/test_qrels_britney').collect()

[('clueweb09-en0000-00-00142', 'spam', 2.601624279252943, 'spam'),
 ('clueweb09-en0000-00-01005', 'ham', 2.5654162439491004, 'spam'),
 ('clueweb09-en0000-00-01382', 'ham', 2.5893946346394188, 'spam'),
 ('clueweb09-en0000-00-01383', 'ham', 2.6190102258752614, 'spam'),
 ('clueweb09-en0000-00-03449', 'ham', 1.500142758578532, 'spam'),
 ('clueweb09-en0000-00-04105', 'ham', -0.3808710772971723, 'ham'),
 ('clueweb09-en0000-00-04111', 'ham', -0.3808250857026689, 'ham'),
 ('clueweb09-en0000-00-04550', 'ham', -0.45869917748396055, 'ham'),
 ('clueweb09-en0000-00-05874', 'ham', 0.4873796742511931, 'spam'),
 ('clueweb09-en0000-00-06261', 'ham', -0.09350898143646483, 'ham'),
 ('clueweb09-en0000-00-08228', 'ham', 0.7424945553355444, 'spam'),
 ('clueweb09-en0000-00-09582', 'ham', 0.2940940731798853, 'spam'),
 ('clueweb09-en0000-00-09583', 'ham', 0.15021141962008225, 'spam'),
 ('clueweb09-en0000-00-09585', 'ham', -0.3410974553358884, 'ham'),
 ('clueweb09-en0000-00-09605', 'ham', 0.21829007452821064, '

In [None]:
!wget -q https://student.cs.uwaterloo.ca/~cs451/content/cs431/compute_spam_metrics.c
!wget -q https://student.cs.uwaterloo.ca/~cs451/content/cs431/spam_eval.sh

Now compile this program.

In [None]:
!gcc -w -O2 -o compute_spam_metrics compute_spam_metrics.c -lm

In [19]:
!bash spam_eval.sh 'results/test_qrels'
!bash spam_eval.sh 'results/test_qrels_y'
!bash spam_eval.sh 'results/test_qrels_britney'

1-ROCA%: 17.26
1-ROCA%: 17.26
1-ROCA%: 17.26
