# Lab Sheet 4a: Text Classification with DataFrames and getting started with Google Cloud

These tasks are for working in the lab session and during the week.

We'll build on last week's code and add  
* Classification
* Execution Timing
* DataFrames
* Spark ML
* ML Pipeline
* Evaluation and HP tuning

We are going to use a **simple text classification problem** to address all these points.

In addition, we'll start using Google Cloud.

## 1) Preparation
As usual, we start by mounting Google Drive, then installing and starting Spark.

In [None]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

%cd
!tar -xzf "/content/drive/My Drive/Big_Data/data/spark/spark-3.5.0-bin-hadoop3.tgz" # unpacking
!apt-get install openjdk-8-jdk-headless -qq > /dev/null # installing java
import os # Python package for interaction with the operating system
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64" # tell the system where Java lives
os.environ["SPARK_HOME"] = "/root/spark-3.5.0-bin-hadoop3" # and where spark lives
!pip install -q findspark # install helper package
import findspark # use the helper package
findspark.init() # to set up spark
%cd "/content/drive/My Drive/Big_Data"

import pyspark
# get a spark context for RDDs
sc = pyspark.SparkContext.getOrCreate()
print(sc)
# and a spark session for DataFrames
spark = pyspark.sql.SparkSession.builder.getOrCreate()
print(spark)

## 2) Dataset: Newsgroups

Today, we use another, **larger dataset**, which consists of "Usenet" discussions from the early days of the Internet. This dataset contains messages from 20 different newsgroups on different topics with ~1000 messages each. More information and the data can be found here:
[https://archive.ics.uci.edu/dataset/113/twenty+newsgroups](https://archive.ics.uci.edu/dataset/113/twenty+newsgroups)

With the larger dataset you should get more meaningful time measurements. However, we also need to wait longer. Try several runs and try executing things in different order (here we are only training a Logistic Regression classifier, however you are encouraged to try other models).

To create a meaningful dataset for classification, you need to read in at least 2 topics and then use `RDD.randomSplit()` to create a train and test set. For this lab, we will use alt.atheism and comp.graphics. Try adding more topics to the dataset; there are 20 different directories (i.e. topics).
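
As a rough illustration of what a weighted random split does, the following plain-Python sketch assigns each item independently to one of the splits according to the normalised weights. The function name `random_split` and the toy data are made up for this example, and it is not Spark's implementation. Like Spark's `randomSplit()`, however, it produces split sizes that only approximately match the weights.

```python
import random

def random_split(items, weights, seed=None):
    """Assign each item independently to one split, chosen with the given weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # cumulative, normalised boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for item in items:
        r = rng.random()
        for i, b in enumerate(bounds):
            # the last split also catches any rounding error in the boundaries
            if r < b or i == len(bounds) - 1:
                splits[i].append(item)
                break
    return splits

docs = [f"doc{i}" for i in range(1000)]
train, test = random_split(docs, [0.8, 0.2], seed=123)
print(len(train), len(test))  # roughly 800 and 200, but usually not exactly
```

Fixing the seed makes the split reproducible, which is useful when comparing timings or models across runs.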

In [None]:
# show directory content
%ls "/content/drive/MyDrive/Big_Data/data/20_newsgroups"


## 3) Preprocessing: Extract directory names and remove headers from files with a RegEx

We can now read the **text files into RDDs** and create our smaller trial dataset. We use the [`wholeTextFiles()`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.wholeTextFiles.html)  method, which produces tuples in the RDD consisting of the filename and the text contained in the file.  Use the [`use_unicode=False`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.wholeTextFiles.html) option in `wholeTextFiles` to speed up reading.


This cell takes a bit of time to execute.

In [None]:
%%time
import os.path

# the dataset directory
p = '/content/drive/MyDrive/Big_Data/data/20_newsgroups/'

# here we are setting the path to select 2 topics
dirPath1 = os.path.join(p, 'alt.atheism')
dirPath2 = os.path.join(p, 'comp.graphics')

# Use wholeTextFiles to read both the files and make sure you set the
# keyword argument 'use_unicode' to False as it speeds up reading dramatically.
# Check the link in the cell above for a bit more information.
>>> alt_rdd = ...
print(alt_rdd.take(1))
>>> comp_rdd = ...
print(comp_rdd.take(1))

# Create a union of the 2 RDDs so we have a set with both classes
>>> newsGroup_RDD_b = alt_rdd ... comp_rdd

# The byte strings need to be 'decoded' into standard Python strings.
# (Strangely doing it after reading is much faster than reading unicode, not sure why that is ...)
# Apply .decode() to *each* item of the tuple containing (filepath, text) separately
>>> newsGroup_RDD = newsGroup_RDD_b.map( ... )
# print the total number of documents here:
print('Number of documents read is:', newsGroup_RDD.count())
print(newsGroup_RDD.take(3)) # print out 3 examples.
# The 2nd part of the tuple (at address 1) will contain the whole message.
# the print command is necessary here, because the %%time magic
# prevents the last element from being printed automatically

You'll see low CPU time (in the millisecond range), but longer wall time (around a minute). What could be the reason?

Next, use a Regular Expression with `re.split()` on the file path. We want only the last directory name (e.g. "sci.space", these are our class labels), but not the filename and the rest of the path. The approach is to split at the directory separator (`/` on Linux and Mac), and then use the element before the last.
See here: https://docs.python.org/3/library/re.html#re.split

This can be written in a lambda expression, that keeps the text (the file content), but processes the filepath at position 0 of the tuple.
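
To see how the split behaves, here is a small sketch on a single made-up path. The path is hypothetical and only shaped like the keys that `wholeTextFiles()` returns. It shows which element of the split result holds the directory name.

```python
import re

# Hypothetical file path, shaped like the keys returned by wholeTextFiles()
path = "file:/content/drive/MyDrive/Big_Data/data/20_newsgroups/sci.space/60151"

parts = re.split("/", path)   # split at every directory separator
print(parts[-1])   # the last element is the file name
print(parts[-2])   # the element before the last is the newsgroup directory
```

Negative indices count from the end of the list, so this works regardless of how deep the dataset directory is nested.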

In [None]:
import re

# Remove the file name and path before the last directory name (i.e. the newsgroup name)
>>> fnt_RDD = newsGroup_RDD.map(lambda ft: ...) # extract the newsgroup name from the filepath, keep the text

# check the output against the directory names listed in section 2.
fnt_RDD.take(3)

On closer inspection, we can see that the messages have headers, and one of them starts with `Newsgroups:` and actually lists the topic. This is clearly an **unreasonable** shortcut for the classifier, as we are interested in **predicting topics from the text**.

Thus, the dataset needs preprocessing to **remove these headers**. We can use a **regular expression** to retain only the actual content of the messages.

When you check the data, you can see that the first line that starts with `Lines:` normally ends the headers. Only the very first file is different, but we can tolerate one wrong sample for now. (How could we be more thorough?)

It's best to use `re.search()` here. Since we want to match **multiple lines of text**, we need to use `re.DOTALL` and `re.MULTILINE`. We also need to create **groups** in the expression with brackets `()`, that are then available from the **match object** that gets returned.

See here:  \
https://docs.python.org/3/library/re.html#re.search  \
https://docs.python.org/3/library/re.html#match-objects  \
https://docs.python.org/3/library/re.html#re.MULTILINE  \
https://docs.python.org/3/library/re.html#re.DOTALL  
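
The interplay of the two flags and of groups can be tried out on a toy string first. The sketch below uses a made-up message and deliberately a different header (`Subject:`) than the one needed for the task, so treat it as an illustration of the mechanics rather than as the solution.

```python
import re

# Made-up message: two header lines, then the body
msg = "From: alice@example.org\nSubject: test\nHello,\nthis is the body.\n"

# ^ matches at the start of any line only with re.MULTILINE;
# . matches newline characters only with re.DOTALL.
# [^\n]* consumes the rest of the 'Subject:' line, (.*) captures everything after it.
pattern = r"^Subject:[^\n]*\n(.*)"
m = re.search(pattern, msg, re.DOTALL | re.MULTILINE)
if m:
    print(repr(m.group(0)))  # the whole match, starting at 'Subject:'
    print(repr(m.group(1)))  # only the first group: the text after the header line

# Without re.MULTILINE, ^ only matches at the very start of the string
print(re.search(pattern, msg, re.DOTALL))
```

`group(0)` always holds the complete match, while `group(1)`, `group(2)`, ... hold the bracketed groups in order of their opening brackets.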

In [None]:
import re

# new function to remove the headers using regular expressions
def removeHeader(ft):
    fn, text = ft # unpack the filename and text content
    # now use a regular expression to match the text
>>>    matchObj = re.search(...)  # fill in the expression here and don't forget to use DOTALL and MULTILINE
    if(matchObj): # only if the pattern has matched
>>>        text = matchObj.group(...) # can we replace the text. Which element of the matchObj are we looking for?
    # otherwise we keep the old for now (what could be a better solution?)
    return (fn, text)

fnt_RDD2 = fnt_RDD.map(removeHeader)
fnt_RDD2.take(3)

## 4) DataFrames

In this section we will introduce **DataFrames**. To read more about DataFrames look here: \
https://spark.apache.org/docs/latest/sql-programming-guide.html

DataFrames can be created by reading directly **from files**, too, such as csv, parquet, or json, but our data had to be preprocessed first.  

In Spark we can also create DataFrames **from RDDs** and that is what we will implement in the next section.
A DataFrame represents a table structure. We define a schema that contains the names and types of the columns in the table.

From the official documentation:

A DataFrame can be created programmatically in 3 steps:

- Create an RDD of Rows from the original RDD;
- Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
- Apply the schema to the RDD of Rows via createDataFrame method provided by SparkSession.

More on the **pyspark.sql API** can be found here:

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/index.html


In [None]:
from pyspark.sql.types import * # import the type specifications for the schema

# The schema is encoded in a string.
# Here we are only interested in the topic and text
schemaString = "topic text"

# A StructField object comprises three fields, name (a string), dataType (a DataType)
# and a flag indicating whether its value can be 'null' (a bool).
# We create 2 fields of strings with names according to our schemaString
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
# these together define our schema
schema = StructType(fields)

# Apply the schema in createDataFrame, to create a DataFrame 'df' from the RDD
df = spark.createDataFrame(fnt_RDD2, schema)
df.show(3)

### Using SparkSQL

**SparkSQL** is a limited version of SQL that can be applied to Spark DataFrames. It will be **executed** in a **distributed way**, just like RDD operations.

We need to create a (temporary) **`view`** of the DataFrame, so that we can **use SparkSQL**.


In [None]:
df.createOrReplaceTempView("newsgroups")

# SQL can now be run on the DataFrame.
# Let's start by selecting only the topics elements of each row
results = spark.sql("SELECT DISTINCT topic FROM newsgroups")
results.show()

SparkSQL provides a number of **functions**, a **list** can be found here:
http://spark.apache.org/docs/latest/api/sql/index.html \
An **overview** of SparkSQL and DataFrames is provided here:
http://spark.apache.org/docs/latest/sql-programming-guide.html

In [None]:
# We can make more sophisticated queries in SQL, e.g. using topic names as a distinct feature and simply count number of files:
results_topic = spark.sql("SELECT DISTINCT topic, count(*) as count_ FROM newsgroups GROUP BY topic ORDER BY count_ DESC")
results_topic.show()

### Preparing the numeric class labels for classification
We need **numeric labels** for the **classifier**. For now, we go for binary labels.

In [None]:
# df.withColumn returns a new DataFrame by adding a column with a name and value for each row.
# The value is a 'column expression': like("comp%") tests whether the topic starts with the string "comp".
# .cast("int") will convert the resulting Boolean value into a number, 1 for 'comp', 0 for other strings.
news_Groups = df.withColumn("label", df.topic.like("comp%").cast("int"))

# you can use a syntax similar to an array to select some examples of either class
alt_topic_df = news_Groups[df.topic.like("alt%")]
alt_topic_df.show(3)
comp_topic_df = news_Groups[df.topic.like("comp%")]
comp_topic_df.show(3)

In [None]:
# Split the DataFrame into training and test set.
# RandomSplit - divides the DataFrame into train/test using the weights given as arguments.
# You can try other combinations of weights.
train_set, test_set = news_Groups.randomSplit([0.8, 0.2], 123)
#>>> now get the sizes of the sets:
>>> print("Total document count:", ...)
>>> print("Training-set count:", ...)
>>> print("Test-set count:", ...)

## 5) Using ML to classify messages

### Make an ML Pipeline

**ML** is the Spark **machine learning library** for **DataFrames**. We want to build an ML pipeline to predict the binary label.

A Spark ML **Pipeline** is a sequence of stages, and each stage is either a **Transformer** or an **Estimator**. These stages are run in a sequence, and the input DataFrame is transformed as it passes through each stage.
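
The difference between the two kinds of stages can be sketched without Spark. In the plain-Python toy below, all class and function names are invented for illustration: a Transformer only has `transform()`, while an Estimator has `fit()`, which returns a fitted Transformer (a "model"). Fitting the pipeline walks through the stages in order, fitting where needed and passing the transformed data on.

```python
class Lowercase:                      # a "Transformer": only transform()
    def transform(self, rows):
        return [r.lower() for r in rows]

class VocabularyEstimator:            # an "Estimator": fit() returns a Transformer
    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r.split()})
        return VocabularyModel(vocab)

class VocabularyModel:                # the fitted "model" produced by the Estimator
    def __init__(self, vocab):
        self.index = {w: i for i, w in enumerate(vocab)}
    def transform(self, rows):
        # replace known words by their index, drop unknown words
        return [[self.index[w] for w in r.split() if w in self.index] for r in rows]

def fit_pipeline(stages, rows):
    """Fit the stages in order and return the list of fitted transformers."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):     # Estimator: fit on the current data first
            stage = stage.fit(rows)
        rows = stage.transform(rows)  # pass the transformed data to the next stage
        fitted.append(stage)
    return fitted

def transform_pipeline(fitted, rows):
    for stage in fitted:
        rows = stage.transform(rows)
    return rows

train_rows = ["Spark is fast", "Spark ML pipelines"]
fitted_stages = fit_pipeline([Lowercase(), VocabularyEstimator()], train_rows)
print(transform_pipeline(fitted_stages, ["ML is new"]))  # [[2, 1]]
```

The fitted list plays the role of the pipeline model: it contains only Transformers and can be applied to new data, here a sentence with one word that was not seen during fitting.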

Some of the **functions** we implemented ourselves are now used in **ready-made** versions, such as the Hashing Vectorizer or the stopword remover.

A practical ML pipeline might consist of many stages like feature extraction, feature transformation, and model fitting. We create a pipeline that consists of the **following stages**:

1. **Tokenizer**, which splits each article into a sequence of lower-case words at whitespace,
2. **StopWordsRemover**, which removes common stop words,
3. **HashingTF**, which maps the filtered word sequences to fixed-size feature vectors using the hashing trick,
4. **IDF**, which rescales the term frequencies by their inverse document frequency,
5. **LogisticRegression**, which fits the feature vectors and the labels from the training data to a logistic regression model.
    
To read more on this: https://spark.apache.org/docs/latest/ml-features.html
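
To make the hashing trick and the IDF rescaling more concrete, here is a minimal plain-Python sketch. The function names and the toy documents are made up, and `zlib.crc32` is only a stand-in hash function (Spark's HashingTF uses MurmurHash3), so the indices will differ from Spark's. The IDF uses the smoothed formula given in the Spark documentation, log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term.

```python
import math
import zlib

def hashing_tf(words, num_features):
    """Fixed-size term-frequency vector: each word is counted at index hash(word) % num_features."""
    vec = [0] * num_features
    for w in words:
        # crc32 is a stand-in for illustration; Spark's HashingTF uses MurmurHash3
        vec[zlib.crc32(w.encode("utf-8")) % num_features] += 1
    return vec

def idf_weights(tf_vectors):
    """IDF weight per index: log((m + 1) / (df + 1))."""
    m = len(tf_vectors)
    n = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[i] > 0) for i in range(n)]
    return [math.log((m + 1) / (d + 1)) for d in df]

docs = [["graphics", "card", "driver"], ["graphics", "image"], ["god", "belief"]]
tfs = [hashing_tf(d, 8) for d in docs]
idf_w = idf_weights(tfs)
tfidf = [[tf * w for tf, w in zip(v, idf_w)] for v in tfs]
print(tfs[0])
print(tfidf[0])
```

With a small number of features, different words can collide on the same index. That is the price of not having to build a vocabulary, and one reason why the vector size is worth tuning.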

In [None]:
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression, NaiveBayes
from pyspark.ml.feature import HashingTF,StopWordsRemover,IDF,Tokenizer

# Constructing a pipeline
# We split each sentence into words using Tokenizer.
# Tokenizer only splits by white spaces
tokenizer = Tokenizer().setInputCol("text").setOutputCol("words") # creates a new column "words"

# Remove stopwords
remover= StopWordsRemover().setInputCol("words").setOutputCol("filtered").setCaseSensitive(False)

# For each sentence (bag of words), use HashingTF to hash the sentence into a feature vector.
hashingTF = HashingTF().setNumFeatures(100).setInputCol("filtered").setOutputCol("rawFreqs")

# We use IDF to rescale the feature vectors; this generally improves performance when using text as features.
idf = IDF().setInputCol("rawFreqs").setOutputCol("features").setMinDocFreq(10)
# The Spark ML algorithms expect their input in a column "features" by default.

# Our feature vectors can then be passed to a learning algorithm.
lr = LogisticRegression()
# nb = NaiveBayes() # you can try this as an alternative to LR (can plot the AUC)

# Then we connect all the steps above to create one pipeline
pipeline=Pipeline(stages=[tokenizer, remover, hashingTF, idf, lr])

In [None]:
# We can get information on each parameter of a stage using .explainParams()
print("Tokenizer:", tokenizer.explainParams(),"\n")
print("Remover:", remover.explainParams(),"\n")
print("HashingTF:", hashingTF.explainParams(),"\n")
print("IDF:", idf.explainParams(),"\n")
print("Pipeline:", pipeline.explainParams(),"\n")

In [None]:
# Use the pipeline option to fit the training set and create a model

# After we construct this ML pipeline,we can fit it to the training data
# and obtain a trained pipeline model that can be used for prediction.
%time model = pipeline.fit(train_set)
# %time is a simpler way to take the time than we used in the coursework
# Training takes about a minute normally.

## 6) Evaluate prediction results

In [None]:
# After we obtain a fitted pipeline model, we want to know how well it performs.
# Let us start with some manual checks by displaying the predicted labels.

# You can simply use the .transform() on the test set to make predictions on the test set
test_predictions = model.transform(test_set)
train_predictions = model.transform(train_set)

# Show the predicted labels along with true labels and raw texts.
test_predictions.select("topic","probability","prediction","label").show(5)
# and show some of the other class ...
test_predictions.select("topic","probability","prediction","label").filter(test_predictions.topic.like("comp%")).show(5)

### Training set evaluation

In [None]:
# The predicted labels look accurate.
# Let's evaluate the model quantitatively.
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluatorROC = BinaryClassificationEvaluator().setMetricName("areaUnderROC")
>>> evaluatorPR # create an evaluator like above for the 'areaUnderPR', i.e. PrecisionRecall curve
train_ROCAUC = evaluatorROC.evaluate(train_predictions)
>>> train_PRAUC = # get the PR value
print("Area under ROC curve - training:", train_ROCAUC)
print("Area under PR curve - training:", train_PRAUC)

### Test set evaluation


In [None]:
test_ROCAUC = evaluatorROC.evaluate(test_predictions)
test_PRAUC = evaluatorPR.evaluate(test_predictions)
print("Area under ROC curve - test:", test_ROCAUC)
print("Area under PR curve - test:", test_PRAUC)

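We can now plot a **ROC curve** with Matplotlib (or another package of your choice). We read the values from the **summary** that the trained LogisticRegression model, i.e. the **last stage** of the pipeline model, provides. Note that this summary is computed on the **training** data.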
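
As a reminder of what the points on such a curve mean, the sketch below computes them by hand in plain Python for a handful of made-up scores and labels. The function names are invented for this example, and it assumes that there are no tied scores. Each point is the (FPR, TPR) pair obtained at one decision threshold, and the area is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold step by step."""
    pos = sum(labels)
    neg = len(labels) - pos
    # sort by score, highest first
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # made-up predicted probabilities for class 1
labels = [1,   1,   0,   1,   0,   0]     # made-up true labels
pts = roc_points(scores, labels)
print(pts)
print(auc(pts))   # 8/9, about 0.889
```

The value 8/9 is also the fraction of (positive, negative) pairs in which the positive example gets the higher score, which is a common way to interpret the ROC AUC.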

In [None]:
import matplotlib.pyplot as plt

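plt.plot([0, 1], [0, 1], 'k--') # diagonal = chance level
# the summary of the last stage (the LogisticRegression model) is computed on the training data,
# so we label the curve with the training AUC
plt.plot(model.stages[-1].summary.roc.select('FPR').collect(),
         model.stages[-1].summary.roc.select('TPR').collect(),
         label='ROC curve (area = {:0.2f})'.format(train_ROCAUC))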
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

In [None]:
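plt.plot([0, 1], [1, 0], 'k--')
# as above, the summary curve is computed on the training data
plt.plot(model.stages[-1].summary.pr.select('recall').collect(),
         model.stages[-1].summary.pr.select('precision').collect(),
         label='PR curve (area = {:0.2f})'.format(train_PRAUC))
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('PR Curve')
plt.legend(loc="lower left") # without a legend the label is not shown
plt.show()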

The **training** result is already **very good**, and the **test result** is **not much worse**. So, this task is easy for LogisticRegression, even with a small feature vector size. With a grid search we can tune the hyper-parameters to find the optimal settings.

With **20 classes**, however, the **task** gets **harder**, which you can explore with multi-class classification.


## Extra Tasks

These tasks are a bit more involved in terms of programming.  So, you can try them at home.

### Tuning the Hyper-Parameters

The Spark ML package includes classes for hyper-parameter tuning, such as **CrossValidator** (for smaller datasets) and **TrainValidationSplit** (for larger datasets).

Here, we define a grid and the CrossValidator tests over all points in the grid.
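
The cost of a grid search grows quickly, because the grid is the Cartesian product of all value lists. The following plain-Python sketch enumerates a hypothetical grid (the values are made up and larger than in the cell below) and counts how many pipeline fits a 2-fold cross-validation would need.

```python
from itertools import product

# Hypothetical grid, larger than the one used in the notebook cell
grid = {
    "numFeatures": [100, 1000, 10000],
    "minDocFreq": [0, 10],
}

# every combination of values is one point in the grid
param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]
num_folds = 2

for pm in param_maps:
    print(pm)
# each grid point is trained once per fold during the search
print("pipeline fits during the search:", len(param_maps) * num_folds)
```

Here 3 x 2 = 6 grid points with 2 folds give 12 fits, plus one final fit of the best setting on the whole training set, so it pays to start with a small grid.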


In [None]:
# We use a ParamGridBuilder to construct a grid of parameters to search over.

# With 1 value for hashingTF.numFeatures and 1 value for idf,
# the grid below has only 1 parameter setting for CrossValidator to choose from.
#>>> Extend the grid once you are sure that it runs.

from pyspark.ml.tuning import ParamGridBuilder

paramGrid = ParamGridBuilder()\
    .addGrid(hashingTF.numFeatures, [100])\
    .addGrid(idf.minDocFreq, [10])\
    .build()

Fitting the cross-validator (`cv.fit`) takes several minutes. Make sure that everything so far has worked correctly, to avoid spending a long time waiting for an invalid result.

In [None]:
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
from pyspark.ml.tuning import CrossValidator

cv = CrossValidator().setEstimator(pipeline).setEvaluator(evaluatorROC).setEstimatorParamMaps(paramGrid).setNumFolds(2)
# Note: This takes a long time to run with proper sized grid
# Do this step only when you've done everything else first!
%time cvModel = cv.fit(train_set)

In [None]:
# Task:
# After you have run the cell above , print both results (with and without cross validation)
# Observe the results

print ("Area under ROC curve for non-tuned model:", evaluatorROC.evaluate(test_predictions))
print ("Area under PR curve for non-tuned model:", evaluatorPR.evaluate(test_predictions))
>>> print("Area under the ROC curve for best fitted model =", evaluatorROC.evaluate(...)) # apply the cvModel to the test_set (as done in section 6 for the pipeline model to create 'test_predictions')
>>> print("Area under the PR curve for best fitted model =", evaluatorPR.evaluate(...)) # as above


### Multi-class classification

Use **all the 20 topics** in the dataset as class labels. The reading of the data is straightforward. You will need to use a different mapping from newsgroup names to class labels, though. Then you will need to **reconsider evaluation**, as the ROC AUC is only defined for the binary case.

The **performance** will be **lower**, so it is worth trying to **tune** the **hyper-parameters**.
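
One possible way to think about the label mapping and the evaluation is sketched below in plain Python, on a few made-up topic names and predictions. It maps each distinct newsgroup name to an integer class index and uses accuracy, a metric that is defined for any number of classes.

```python
# A few of the 20 newsgroup names, for illustration
topics = ["comp.graphics", "alt.atheism", "sci.space", "alt.atheism", "sci.space"]

# map each distinct topic name to an integer class index (alphabetical order)
label_of = {name: i for i, name in enumerate(sorted(set(topics)))}
labels = [label_of[t] for t in topics]
print(label_of)
print(labels)

# accuracy is one metric that is defined for any number of classes
predictions = [1, 0, 2, 0, 1]            # made-up predictions
accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
print("accuracy:", accuracy)             # 4 of 5 correct: 0.8
```

In Spark ML, `StringIndexer` (in `pyspark.ml.feature`) and `MulticlassClassificationEvaluator` (in `pyspark.ml.evaluation`) are ready-made counterparts. Note that `StringIndexer` orders the labels by frequency by default rather than alphabetically.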

# Running PySpark on Google's Cloud Platform

**Up to now**, we have been running **PySpark** on Google **Colab**. This means our code is executed on a virtual Linux machine hosted by Google. Our code can already run in parallel, but only using the multiple cores on one machine (see CPU info below).

In [None]:
!lscpu | grep "socket\|Socket\|core\|CPU"

Next we'll move towards running **PySpark code on multiple machines**, which means that synchronisation between jobs will need to happen over the network instead of in the shared working memory. We'll be using [Cloud Dataproc](https://cloud.google.com/dataproc), a **managed Spark installation** hosted by Google (so it's a **platform-as-a-service**).

Open the _Google Cloud Introduction_ pdf file on Moodle (in section 4) and follow the instructions to get started with creating a free trial account and running some example code.

