# Lab Sheet 5a: N-Gram Classification and more Google Cloud

These tasks are for working in the lab session and during the week.

We'll revisit some tasks from previous weeks and add classification with **n-grams**. Specifically, we'll use n-grams to represent the novels of lab 3 and the newsgroups of lab 4. In both cases, we then calculate hashed feature vectors from the n-grams, similar to what we've done before. These vectors are stored in a DataFrame, which we use to train a classifier as in lab 4. In essence, only the preprocessing changes, but the resulting document representations now take word context into account as well.

We'll also continue to use Google **Cloud**.

First we mount drive and install local Spark as usual.

In [None]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

# install spark
%cd
!apt-get update -qq
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!tar -xzf "/content/drive/My Drive/Big_Data/data/spark/spark-3.5.0-bin-hadoop3.tgz"
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/root/spark-3.5.0-bin-hadoop3"
import findspark
findspark.init()
%cd /content


Now let's run Spark and get a context.

In [None]:
import pyspark
# get a spark context
sc = pyspark.SparkContext.getOrCreate()
print(sc)
# and a spark session
spark = pyspark.sql.SparkSession.builder.getOrCreate()
print(spark)

## 1) Creating n-grams from novels

Remember that we tried to cluster novels by author in lab 3, based on their usage of stopwords. Now we will read the entire content of the novels and look at the words in the context of their preceding words. After that we again convert the n-grams to hashed feature vectors, now using the authors' word n-gram frequency instead of stopword frequency.
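As a reminder of how a hashed feature vector folds arbitrary tokens into a fixed number of buckets, here is a minimal sketch. It is not the lab's function: it swaps the built-in `hash()` for `zlib.crc32` so that the result is reproducible between runs, and the name `stable_hashing_vectorizer` and the example counts are made up for illustration.

```python
import zlib

def stable_hashing_vectorizer(token_counts, size):
    """Fold (token, count) pairs into a fixed-length count vector."""
    vec = [0] * size
    for token, count in token_counts:
        # Assumption of this sketch: crc32 replaces the built-in hash()
        # used in the lab code, so bucket indices are the same on every run.
        bucket = zlib.crc32(token.encode('utf-8')) % size
        vec[bucket] += count
    return vec

# hypothetical bigram counts for one document
counts = [('it wa', 3), ('wa the', 2), ('the best', 1)]
vec = stable_hashing_vectorizer(counts, 4)
print(vec)       # four buckets, some may hold several tokens (collisions)
print(sum(vec))  # total count is preserved: 6
```

Whatever the vector size, the entries always sum to the total number of counted tokens; a smaller size only means more tokens share a bucket.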

Here is **code from lab 3** that we run first, and then extend.

In [None]:
import re
from operator import add

def stripFinalS( word ):
    word = word.lower() # lower case
    if len(word) >0 and word[-1] == 's': # check for final letter
        return word[:-1]
    else:
        return word

def splitFileWords(filenameContent): # your splitting function
    f,c = filenameContent # split the input tuple
    fwLst = [] # the new list for (filename,word) tuples
    wLst = re.split(r'\W+',c) # <<< now create a word list wLst
    for w in wLst: # iterate through the list
        fwLst.append((f,stripFinalS(w))) # <<< and append (f,w) to the new list
    return fwLst # return a list of (f,w) tuples

def hashing_vectorizer(word_count_list, N):
    v = [0] * N  # create fixed size vector of 0s
    for word_count in word_count_list:
        word, count = word_count # unpack tuple
        h = hash(word) # get hash value
        v[h % N] = v[h % N] + count # add count
    return v # return hashed word vector

def reGrpLst(fw_c): # we get a nested tuple
    fw,c = fw_c
    f,w = fw
    return (f,[(w,c)]) # return (f,[(w,c)]) structure.

In [None]:
dirPath = '/content/drive/MyDrive/Big_Data/data/library/' # 'MyDrive' without a space, so that the file names match the author list used in 1b)
>>>ft_RDD = ... #<<< add code to create an RDD with wholeTextFiles

### a) Create n-grams and n-gram frequency vectors

Now modify the `splitFileWords` function to **extract n-grams** instead of words. More specifically, it extracts *unigrams, bigrams, ... , n-grams*, i.e. *k*-grams for each *k* ranging from 1 to the provided *n*. For this you need to have a variable for the n-gram's start and one for its end point. **The logic is given** in the first version with nested list comprehensions. Please **translate it into normal for-loops** for the second version, as an exercise in understanding Pythonic data processing.
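To make "k-grams for each k from 1 to n" concrete, here is a small sketch that groups the output by k. The helper `kgrams` is hypothetical and not part of the lab code, and it deliberately uses a different decomposition from the functions you are asked to write.

```python
def kgrams(words, k):
    """Return all contiguous k-word sequences, joined by spaces."""
    return [' '.join(words[i:i + k]) for i in range(len(words) - k + 1)]

words = 'to be or not'.split()
for k in range(1, 3):  # k = 1 and k = 2, i.e. n = 2
    print(k, kgrams(words, k))
# 1 ['to', 'be', 'or', 'not']
# 2 ['to be', 'be or', 'or not']
```

For a text of 4 words and n = 2 we therefore expect 4 unigrams plus 3 bigrams, which is a handy sanity check for your own implementation.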

Then we need a **function to manage the file names** to go with the n-grams.

Finally we can **use this in RDD transformations**. A nice trick is the so-called **currying** or **partial parametrisation** of the function with Python's `functools` module. The code is provided; the documentation explaining what is going on can be found at [https://docs.python.org/3/library/functools.html#functools.partial](https://docs.python.org/3/library/functools.html#functools.partial)
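A minimal, Spark-free sketch of partial parametrisation may help here. The function `power` is a made-up example; only `functools.partial` itself comes from the standard library.

```python
from functools import partial

def power(base, exponent=2):
    """Raise base to the given exponent."""
    return base ** exponent

# Fix the keyword argument; the result is a new one-argument function.
cube = partial(power, exponent=3)

print(cube(2))                      # 8
print(list(map(cube, [1, 2, 3])))   # [1, 8, 27]
```

This matters for RDD transformations because `map` and `flatMap` call their function with a single argument (the RDD element), so any extra parameter has to be fixed beforehand.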

In [None]:
# Using a nested list comprehension to create 1..n-grams given a string.
def split1NGrams(text, n): # function for splitting a text into words and creating n-grams
    wLst = re.split(r'\W+', text) # now create a word list wLst
    wLst = list(map(stripFinalS, wLst)) # remove final s from the word list (this is a local map, don't confuse with RDD or DF map)
    wNum = len(wLst) # get total length to avoid overrunning at the end.
    # i is the start and j the (exclusive) end index of each n-gram; j may reach wNum so that the last word is included
    nGramLst = [' '.join(wLst[i:j]) for i in range(wNum) for j in range(i+1, min(wNum, i+n)+1)]
    return nGramLst # return a list of n-gram strings

# Alternative version with separate function for converting a word-list to n-grams
def lst21ngram(wLst, n):
    wNum = len(wLst) # get total length to avoid overrunning at the end.
    nGramLst = [] # output list
    # <<< reprogram the nested list comprehension above with regular for loops.
    ...
    return nGramLst # done

# a wrapper around the separate function
def split1NGrams2(text, n): # your splitting function
    wLst = re.split(r'\W+', text) # split into words
    wLst = list(map(stripFinalS, wLst)) # lower-case and remove final s, as in split1NGrams
    nGramLst = lst21ngram(wLst,n) # create the n-grams
    return nGramLst # done

# This function manages the filenames around the 1..n-gram extraction
def splitFile1NGrams(filenameContent, n=2): # your splitting function
    f, c = filenameContent # split the input tuple
    ngLst = split1NGrams2(c,n) # split the file content into n-grams
    fngLst = [] # the new list for (filename, n-gram) tuples
    for ng in ngLst: # iterate through the list
        fngLst.append((f,ng)) # and append (f, ng) to the new list
    return fngLst # return a list of (f,ng) tuples

# just for testing
print(split1NGrams('a b c d e f g', 3)) # test the splitting function with a string (easier than with an RDD or DF)
print(split1NGrams2('a b c d e f g', 3))# test the 2nd version, should look like the first
print(splitFile1NGrams(('file','a b c d e f g'), 3)) # should add the file tag before the n-grams

Now let's **use the new functions** to create RDDs with n-gram vectors:


In [None]:
from functools import partial
# use a partial to define the max len of the n-grams
fng_RDD = ft_RDD.flatMap(partial(splitFile1NGrams, n=2)) # <<< read the documentation (link above) to figure out what happens here
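print(fng_RDD.take(5)) # inspect a few (f, ng) pairs; without print, only the last expression of a cell is displayed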
>>>fng_1_RDD = fng_RDD.map(...)  # <<< like in lab 3, as an exercise for the reader ;-) change (f, ng) to ((f, ng), 1)
>>>fng_c_RDD = fng_1_RDD.reduceByKey(...) # <<< like in lab 3, as an exercise for the reader ;-) add the ones
f_ngcL_RDD = fng_c_RDD.map(reGrpLst) # regroup to (f, [(ng, c)])
f_ngcL2_RDD = f_ngcL_RDD.reduceByKey(add) # concatenate ngram counts into one list per file
f_ngVec_RDD = f_ngcL2_RDD.map(lambda f_wc: (f_wc[0], hashing_vectorizer(f_wc[1], 10))) # we can apply the vectorizer as normal
print(f_ngVec_RDD.take(3))

### b) Convert n-gram RDD into DataFrame

The next task is to **create a DataFrame from the RDD**. This is similar to what was shown in lab 4 and also to the documentation: [http://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds](http://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds)  


In [None]:
austen_novels = ['senseandsensibility.txt', 'mansfield_park.txt', 'emma.txt', 'persuasion.txt', 'northanger_abbey.txt', 'lady_susan.txt', 'prideandpredjudice.txt']
austen = ['file:/content/drive/MyDrive/Big_Data/data/library/'+s for s in austen_novels]
print(austen)

av_RDD = f_ngVec_RDD.map(lambda f_wVec: ('Austen' if (f_wVec[0] in austen) else 'Shakespeare', f_wVec[1]))

from pyspark.sql import Row

>>>row_RDD = av_RDD.map(...Row(author ..., vector ...)) # <<< create Row objects (similar to LabelledPoints)
# Create a dataFrame from the RDD
>>>library_DF = spark. .... # create the data frame, like in lab 4, but without giving an explicit schema
library_DF.createOrReplaceTempView("library")
>>> ... # print the schema
>>> ... # show the first 5 elements

Remember from lab 4 that SQL can be used over DataFrames that have been registered as a table.

In [None]:
SQL1 = "SELECT author,vector FROM library WHERE author=='Austen'"
austen_vectors = spark.sql(SQL1)
print(SQL1)
austen_vectors.show()

# create an SQL query that returns only the rows whose author is not Austen
>>>SQL2 = ...
other_vectors = spark.sql(SQL2)
print(SQL2)
other_vectors.show()

## 2) Running PySpark on Google's Cloud Platform

We have started using Google Cloud in the last lab. Now let's port the solution for task 1a) above to the cloud.

Open the notebook 'Running Spark in the Google cloud' from week 4 as a reference. You can copy and paste code from there.

### a) cloud setup
The first step is to authenticate; the exact method varies per platform. For Colab, we use the `google.colab.auth` module, as in the last lab.

In [None]:
import sys
if 'google.colab' in sys.modules:
    from google.colab import auth
    auth.authenticate_user()

Then we create the project and region variables and set the values for this cloud session.

In [None]:
### this project NEEDS TO BE SET UP IN GOOGLE CLOUD FIRST
PROJECT = 'big-data-cw22' ### Append -xxxx, where xxxx is your City login to make project names unique ###
### it seems that the project name here has to be in lower case.
!gcloud config set project $PROJECT
REGION = 'us-central1' # this has worked most reliably with the free tier
!gcloud config set compute/region $REGION
!gcloud config set dataproc/region $REGION

Then we create a bucket, using the code from last time.

In [None]:
BUCKET = 'gs://{}-storage'.format(PROJECT)
!gsutil mb $BUCKET

### b) copy the data
With gsutil, we can use commands similar to the unix shell to copy data to the bucket. The 'cp' command copies data and we can use the glob pattern '*' to match all files in a directory.

One difference between cloud buckets and local file systems is that directories (like 'library') are not objects in a bucket; they are treated as part of the object name. We therefore cannot create the target directory in the bucket, but we can specify it as part of the target path, even if it doesn't exist yet.
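The flat naming scheme can be pictured with a toy model in plain Python. This is only an illustration, not the Cloud Storage API: the dictionary `bucket`, the helper `list_objects` and the object names are hypothetical.

```python
# A bucket modelled as a flat mapping from object name to content.
bucket = {
    'library/emma.txt': '...',
    'library/hamlet.txt': '...',
    'notes.txt': '...',
}

def list_objects(store, prefix=''):
    """Return all object names that start with the given prefix."""
    return sorted(name for name in store if name.startswith(prefix))

# "Listing a directory" is just filtering names by a common prefix.
print(list_objects(bucket, 'library/'))
# ['library/emma.txt', 'library/hamlet.txt']
```

In this picture there is nothing to create for 'library' itself: the prefix comes into existence as soon as one object name contains it.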

In [None]:
!ls '/content/drive/My Drive/Big_Data/data/library/'

!gsutil -m cp '/content/drive/My Drive/Big_Data/data/library/*' $BUCKET/library

### c) create the cluster

Like in the last lab, we create the cluster with a unique name using the `gcloud dataproc clusters create` command. See [here](https://cloud.google.com/sdk/gcloud/reference/dataproc/clusters/create) for documentation.

This time, let's create a cluster with a master and 3 worker machines. For that, copy the `gcloud dataproc clusters create ...` code from the last lab.

Remove the `--single-node` flag and use `--num-workers` instead to request 3 worker machines. Specifying the boot disk type and size for the workers is analogous to the master, i.e. just copy the flags and replace 'master' with 'worker'.

In [None]:
CLUSTER = '{}-cluster'.format(PROJECT)

#!gcloud dataproc clusters create $CLUSTER \
#    --image-version 1.5-ubuntu18 --single-node \
#    --master-machine-type n1-standard-2 \
#    --master-boot-disk-type pd-ssd --master-boot-disk-size 100 \
#    --max-idle 3600s
#>>> adapt the code above to create a cluster with 3 workers

Check the [console Dataproc page](https://console.cloud.google.com/dataproc/clusters/) to see that the cluster is running.


### d) create the script

For this, you need to combine the code cells under 1a) and the ones above (except for the code mounting drive and installing spark). All code needs to go into one code cell and you need to write the content into a file using the `%%writefile <filename>` magic at the beginning of the code cell.


In [None]:
%%writefile script.py
#>>> copy all the code from 1a) and before here (except drive mounting and spark installation).

Once we have the script file, we can submit it to our Spark cluster using `gcloud dataproc jobs submit pyspark`.

You will get a lot of output, but among all the log messages, you should find the same output as before.

In [None]:
#>>> use gcloud dataproc jobs submit pyspark to run your script in the cloud

Check the output and compare with 1a).

Have a look at the [dataproc page on the cloud console](https://console.cloud.google.com/dataproc/clusters) to see how the machines are used.

# Extra tasks (optional)


## 3) Run more tasks in the cloud

Try running other tasks from this and previous labs in the cloud.

## 4) Creating n-grams from the newsgroups dataset

To use the newsgroups dataset from lab 4, we need to parse the messages like we did before.
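The header-stripping pattern used in `parseMessage` can be tried on a toy message first. The message text below is made up for illustration; the regular expression and flags are the ones used in the lab code.

```python
import re

HEADER_END = re.compile(r'.+^(Lines:|NNTP-Posting-Host:) (.*)',
                        re.DOTALL | re.MULTILINE | re.IGNORECASE)

# hypothetical newsgroup message: three header lines, a blank line, the body
message = (
    "From: someone@example.com\n"
    "Subject: test\n"
    "Lines: 2\n"
    "\n"
    "Hello world\n"
    "Bye\n"
)

match = HEADER_END.search(message)
body = match.group(2) if match else ""
print(repr(body))  # '2\n\nHello world\nBye\n'
```

In this toy run the `From:` and `Subject:` headers are gone, while the value after `Lines:` stays at the start of the returned text. Because `.+` is greedy, the pattern anchors on the last matching line in the message, which is worth keeping in mind when you think about how to be more thorough.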

In [None]:
import re
import os.path

p = '/content/drive/MyDrive/Big_Data/data/20_newsgroups/'

#here we are setting the path to select 2 topics
dirPath1 = os.path.join(p, 'alt.atheism')
dirPath2 = os.path.join(p, 'comp.graphics')

# remove the headers, get the sender and the main text
def parseMessage(ft):
    fn, text = ft # unpack the filename and text content
    # now use a regular expression to match the text
    # When you check the data, you can see that the first line that starts with 'Lines:' normally ends the header.
    # Only the very first file is different, but we can tolerate one wrong sample for now.
    # (How could we be more thorough?)
    matchObj = re.search(r'.+^(Lines:|NNTP-Posting-Host:) (.*)', text, re.DOTALL|re.MULTILINE|re.IGNORECASE)
    if(matchObj): # only if the pattern has matched
        text = matchObj.group(2) # we can replace the text
    else:
        text = "" # otherwise we return an empty string, in order to avoid giving header information to the model, which would give away the class.
    return (fn, text)

# for testing the parseMessage function
#ft_RDD = sc.wholeTextFiles(dirPath1) # create an RDD with wholeTextFiles
#txts = ft_RDD.take(3) # take into a local list
#txts2 = list(map(parseMessage, txts))# and apply parseMessage (NOTE: this is different from an RDD map!)
#print(txts2)

We then add the function `splitFile1NGrams` created above to the preprocessing pipeline.

In [None]:
# we need to create our feature vectors using the pyspark.ml.linalg.DenseVector class,
# in order to use the CrossValidator later
from pyspark.ml.linalg import DenseVector

# Make a DataFrame with labels and N-gram vectors
def make_dataFrame(dirPath, argLabel, N, NG):
    print("make_dataFrame started")
    ft_RDD = sc.wholeTextFiles(dirPath) # create an RDD with wholeTextFiles
    ft2_RDD = ft_RDD.map(parseMessage) # parse the messages
    # print("ft2_RDD.take(2)", ft2_RDD.take(2))
    >>>fng_RDD = ft2_RDD.flatMap(...) # split the file with a 'partial' like in the task 1, fixing the n-gram parameter to "NG".
    # print("fng_RDD.take(2)", fng_RDD.take(2))
    print("fng_RDD.count()", fng_RDD.count())
    fng2_RDD = fng_RDD.filter(lambda x: x is not None) # filter files we couldn't parse
    print("fng2_RDD.count()", fng2_RDD.count())
    fng_1_RDD = fng2_RDD.map(lambda x: (x, 1))  # change (fs, ng) to ((fs, ng), 1) - we can ignore that (fs, ng) actually is a tuple here
    fng_c_RDD = fng_1_RDD.reduceByKey(add) # as above
    f_ngcL_RDD = fng_c_RDD.map(reGrpLst) # as above
    f_ngcL2_RDD = f_ngcL_RDD.reduceByKey(add) # create [(ng, c), ..., (ng, c)] lists per file
    f_ngVec_RDD = f_ngcL2_RDD.map(lambda f_wc: (f_wc[0], hashing_vectorizer(f_wc[1], N)))
    # <<< below create a Row with dense vectors and the argLabel called 'features' and 'label'
    # <<< convert your list of hashed features to a DenseVector in order to make them compatible with the pyspark.ml library
    # <<< you can just call DenseVector(list) to achieve this
    >>>rows_RDD = f_ngVec_RDD.map(... Row(label= ..., features= ...))
    rows_DF = spark.createDataFrame(rows_RDD)
    return rows_DF

N  = 10 # vector size
NG =  3 # max n-gram size

rows1_DF = make_dataFrame(dirPath1, 0, N, NG)
rows2_DF = make_dataFrame(dirPath2, 1, N, NG)
rows_DF = rows1_DF.union(rows2_DF)
rows_DF.createOrReplaceTempView("newsgroups")
rows_DF.cache()
print(rows_DF.count())
rows_DF.printSchema()
rows_DF.show(5)

## 5) Use the spark.ml cross-validator

Now we can use the **CrossValidator**, which comes with the `pyspark.ml` module, on the newsgroup DataFrame. We only need to set up the parameters. Have a look at the extra task of lab 4 for hints.
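Conceptually, cross-validation repeatedly holds out one part of the data for testing and trains on the rest. The sketch below shows that idea in plain Python with a deterministic split; it is not how Spark implements it (Spark assigns rows to folds randomly), and the helper `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_indices(6, 3):
    print('train', train, 'test', test)
# train [1, 4, 2, 5] test [0, 3]
# train [0, 3, 2, 5] test [1, 4]
# train [0, 3, 1, 4] test [2, 5]
```

With a parameter grid, this train/test loop is repeated for every parameter combination, so the number of model fits is the number of folds times the size of the grid.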

In [None]:
# We need to import the classifiers from the ML package now.
from pyspark.ml.classification import NaiveBayes

# The CrossValidator and ParamGridBuilder enable the automatic tuning
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.tuning import ParamGridBuilder

# The evaluator to test the model
from pyspark.ml.evaluation import BinaryClassificationEvaluator

nb = NaiveBayes()
# <<< build a parameter grid for the nb.smoothing value
evaluator = BinaryClassificationEvaluator()
print("starting cross-validation")
>>>cv = CrossValidator(estimator= ... , estimatorParamMaps= ... , evaluator= ... ) # <<< fill in the correct values
cvModel = cv.fit(rows_DF)
print("finished cross-validation")
# <<< add evaluation and parameter values