# COM6012 Scalable Machine Learning 2020 - Haiping Lu
# Lab 2 - RDD, DataFrame, ML pipeline, & parallelization

## Objectives

* Task 1: To finish in the lab session. **Essential**
* Task 2: To finish in the lab session. **Essential**
* Task 3: To finish in the lab session. **Essential**
* Task 4: To explore by yourself before the next session. **Optional but recommended**

**Suggested reading**: 
* Chapters 5 and 6, and especially **Section 9.1** (of Chapter 9)  of [PySpark tutorial](https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf)
* [RDD Programming Guide](https://spark.apache.org/docs/2.3.2/rdd-programming-guide.html): Most are useful to know in this module.
* [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/2.3.2/sql-programming-guide.html): `Overview` and `Getting Started` recommended (skipping those without Python example).
* [Machine Learning Library (MLlib) Guide](https://spark.apache.org/docs/2.3.2/ml-guide.html)
* [ML Pipelines](https://spark.apache.org/docs/2.3.2/ml-pipeline.html)
* [Apache Spark Examples](https://spark.apache.org/examples.html)
* [Basic Statistics - DataFrame API](https://spark.apache.org/docs/2.3.2/ml-statistics.html)
* [Basic Statistics - RDD API](https://spark.apache.org/docs/2.3.2/mllib-statistics.html): much **richer**

**Something compact?**
* [Cheat sheet PySpark Python](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_Cheat_Sheet_Python.pdf): **Highly recommended. Very handy.** 
* [Cheat sheet PySpark SQL Python](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf): **Highly recommended. Very handy.** 
* [Cheat sheet for PySpark (2 page version)](https://github.com/runawayhorse001/CheatSheet/blob/master/cheatSheet_pyspark.pdf): **Highly recommended. Very handy.** 
* If you find any good stuff, please share with me and all. Thanks.

**Tips**
* Try to use as many DataFrame APIs as possible by referring to the [Pyspark API documentation](https://spark.apache.org/docs/2.3.2/api/python/index.html). Before you program something yourself, search the API to see whether a function for it already exists.
* Tired of typing those module load commands? There are [convenient ways to set up your environment for different projects](https://docs.hpc.shef.ac.uk/en/latest/hpc/modules.html#convenient-ways-to-set-up-your-environment-for-different-projects) (thanks to Chengkun for sharing).

If running this notebook on HPC via [Jupyter Hub](https://jupyter-sharc.shef.ac.uk/), we need to run the following cell. If we are running this notebook on our local machine, skip the following cell.

In [None]:
import os
import subprocess
def module(*args):        
    if isinstance(args[0], list):        
        args = args[0]        
    else:        
        args = list(args)        
    (output, error) = subprocess.Popen(['/usr/bin/modulecmd', 'python'] + args, stdout=subprocess.PIPE).communicate()
    exec(output)    
module('load', 'apps/java/jdk1.8.0_102/binary')    
os.environ['PYSPARK_PYTHON'] = os.environ['HOME'] + '/.conda/envs/jupyter-spark/bin/python'

Unless you are running in a pyspark shell, you need to first create `spark` and `sc`.

In [1]:
#import findspark
#findspark.init()
import pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("COM6012 Lab 2: RDD, DataFrame, ML pipeline, parallelization") \
    .getOrCreate()

sc = spark.sparkContext

You can also copy-paste the code into a shell, or download it as a `.py` file to run as a standalone program.

## 1. RDD and Shared Variables


Spark allows for parallel operations in a program to be executed on a cluster.
* **Main abstraction**: **resilient distributed dataset (RDD)** is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
* **Second abstraction**: **shared variables** that can be shared across tasks, or between tasks and the driver program. Two types:
    * *Broadcast variables*, which can be used to cache a value in memory on all nodes
    * *Accumulators*, which are variables that are only “added” to, such as counters and sums.

#### Parallelized collections

Parallelized collections are created by calling `SparkContext`’s `parallelize` method on an existing iterable or collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:



In [2]:
data = [1, 2, 3, 4, 5]
rddData = sc.parallelize(data)

In [3]:
rddData.collect()

[1, 2, 3, 4, 5]

The number of *partitions* can be set manually by passing it as a second argument to `parallelize`:

In [4]:
sc.parallelize(data, 10)

ParallelCollectionRDD[1] at parallelize at PythonRDD.scala:194

If no second argument is given, Spark tries to set the number of partitions automatically based on the cluster. Typically you want 2-4 partitions for every CPU in the cluster.

#### $\pi$ Estimation
Spark can also be used for compute-intensive tasks. This code estimates $\pi$ by "throwing darts" at a circle. We pick random points in the unit square ((0, 0) to (1,1)) and see how many fall in the unit circle. The fraction should be $\pi / 4$, so we use this to get our estimate.

In [5]:
import random
NUM_SAMPLES = 10000000
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, NUM_SAMPLES)).filter(inside).count()
pi = 4 * count / NUM_SAMPLES
print("Pi is roughly", pi)

Pi is roughly 3.1409452


You can change `NUM_SAMPLES` to see the difference in precision and time cost.
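
To get a feel for how precision depends on the number of samples, here is a minimal plain-Python sketch (no Spark needed). The helper names `estimate_pi` and `standard_error` are made up for this illustration. The standard error follows from treating each dart as a Bernoulli trial with success probability $\pi/4$.

```python
import math
import random

def estimate_pi(num_samples, seed=0):
    """Monte Carlo estimate of pi from points in the unit square."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(num_samples)
               if rng.random() ** 2 + rng.random() ** 2 < 1)
    return 4 * hits / num_samples

def standard_error(num_samples):
    """Theoretical standard error of the estimate: 4 * sqrt(p * (1 - p) / n)."""
    p = math.pi / 4
    return 4 * math.sqrt(p * (1 - p) / num_samples)

for n in (1000, 100000):
    print(n, estimate_pi(n), standard_error(n))
```

The error shrinks like $1/\sqrt{n}$, so 100 times more samples buys roughly one extra correct digit.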

### Understanding closures

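Code that modifies variables outside its own scope can behave differently in local and cluster mode. In cluster mode, the variables referenced by a function (its *closure*) are serialized and copied to each executor, so updates made inside tasks change only those copies and are not visible in the driver program. Do not rely on such updates. When tasks need to update a shared value, use an accumulator instead.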
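
The following is a plain-Python analogy, not Spark code. It uses `pickle` to mimic shipping a copy of a variable to each task, and the names `counter` and `run_task` are hypothetical. It shows why updates made inside tasks do not reach the driver's copy.

```python
import pickle

counter = {"total": 0}  # variable living in the "driver"

def run_task(serialized_state, x):
    # Each task deserializes its own private copy of the shipped variable.
    local = pickle.loads(serialized_state)
    local["total"] += x
    return local["total"]

shipped = pickle.dumps(counter)  # stand-in for what is sent to each executor
task_results = [run_task(shipped, x) for x in [1, 2, 3, 4]]

print(task_results)  # each task only ever saw its own copy
print(counter)       # the driver's copy is unchanged
```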

#### Broadcast variables

To avoid shipping a copy of a variable with every task, a read-only variable can be cached on each machine. This is useful for particularly large datasets that are needed by multiple tasks. Data broadcast this way is cached in serialized form and deserialized before running each task.

Broadcast variables are created from a variable $v$ by calling `SparkContext.broadcast(v)`. The broadcast variable is a wrapper around $v$, and its value can be accessed through the *value* attribute.

In [6]:
broadcastVar = sc.broadcast([1, 2, 3])
broadcastVar.value

[1, 2, 3]

#### Accumulators

[Accumulators](https://spark.apache.org/docs/2.3.2/api/java/org/apache/spark/Accumulator.html) are variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in <tt>MapReduce</tt>) or sums. 

In PySpark, you create an accumulator from an initial value `v` by calling *SparkContext.accumulator(v)*; the Scala and Java APIs offer *SparkContext.longAccumulator()* and *SparkContext.doubleAccumulator()* to accumulate Long or Double values. (Users can also create their own accumulators of different types.) Tasks running on the cluster can then add to the accumulator using <tt>add</tt>. However, they cannot read its value; only the driver program can read it, using <tt>value</tt>.

In [7]:
accum = sc.accumulator(0)
accum

Accumulator<id=0, value=0>

In [8]:
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
accum.value

10
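
Why does the operation have to be associative and commutative? A minimal plain-Python sketch (not Spark code) may help. The split of the data into three partitions is a made-up assumption for illustration. Each task adds up its own partition, and the partial results can be merged in any order.

```python
from functools import reduce
import operator

# Hypothetical split of [1, 2, 3, 4] into three partitions.
partitions = [[1, 2], [3], [4]]

# Each "task" adds up its own partition independently.
partial_sums = [sum(p) for p in partitions]

# The "driver" merges the partial results; the order does not matter.
total_in_order = reduce(operator.add, partial_sums, 0)
total_reversed = reduce(operator.add, reversed(partial_sums), 0)

print(partial_sums, total_in_order, total_reversed)
```

Both merge orders give the same total as the accumulator above, which is what lets the additions run in parallel without coordination.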

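Note that Spark guarantees that each task's update is applied exactly once only for accumulator updates performed inside *actions*. Inside lazy transformations (such as *map*), updates are not executed until an action triggers the computation, and they may be applied more than once if a task is re-executed: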

In [9]:

accum = sc.accumulator(5)
def g(x):
    accum.add(x)

rddData.map(g)
# Here, accum is still 5 (its initial value) because no action has caused the `map` to be computed.

PythonRDD[6] at RDD at PythonRDD.scala:52

In [10]:
g(5)

In [11]:
accum

Accumulator<id=1, value=10>

In [12]:
g(10)

In [13]:
accum

Accumulator<id=1, value=20>

## 2. Data Frame

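Along with the introduction of <tt>SparkSession</tt> in Spark 2.0, the <tt>resilient distributed dataset</tt> (RDD) was replaced as the main programming interface by the [dataset](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset). Python does not support the typed Dataset API, so in PySpark we work with DataFrames, i.e., datasets organized into named columns. Again, these are objects which can be worked on in parallel. The available operations are: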

- **transformations**: produce new datasets
- **actions**: computations which return results

We will start with creating dataframes and datasets, showing how we can print their contents. We create a dataframe in the cell below and print out some info (we can also modify the output before printing):

From RDD to DataFrame

In [14]:
rdd = sc.parallelize([(1,2,3),(4,5,6),(7,8,9)])
df = rdd.toDF(["a","b","c"])

In [15]:
rdd

ParallelCollectionRDD[7] at parallelize at PythonRDD.scala:194

In [16]:
df

DataFrame[a: bigint, b: bigint, c: bigint]

In [17]:
df.show()

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  7|  8|  9|
+---+---+---+



In [18]:
df.printSchema()

root
 |-- a: long (nullable = true)
 |-- b: long (nullable = true)
 |-- c: long (nullable = true)



Get RDD from DataFrame

In [19]:
rdd2=df.rdd
rdd2

MapPartitionsRDD[19] at javaToPython at <unknown>:0

In [20]:
rdd2.collect()

[Row(a=1, b=2, c=3), Row(a=4, b=5, c=6), Row(a=7, b=8, c=9)]

#### Load data from CSV file. This data is from a [classic book on statistical learning](http://www-bcf.usc.edu/~gareth/ISL/data.html). 

In [21]:
df = spark.read.load("Data/Advertising.csv",
                     format="csv", inferSchema="true", header="true")

In [22]:
df.show(5)

+---+-----+-----+---------+-----+
|_c0|   TV|radio|newspaper|sales|
+---+-----+-----+---------+-----+
|  1|230.1| 37.8|     69.2| 22.1|
|  2| 44.5| 39.3|     45.1| 10.4|
|  3| 17.2| 45.9|     69.3|  9.3|
|  4|151.5| 41.3|     58.5| 18.5|
|  5|180.8| 10.8|     58.4| 12.9|
+---+-----+-----+---------+-----+
only showing top 5 rows



In [23]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- TV: double (nullable = true)
 |-- radio: double (nullable = true)
 |-- newspaper: double (nullable = true)
 |-- sales: double (nullable = true)



Let us remove the first column

In [24]:
df2=df.drop('_c0')

In [25]:
df2.printSchema()

root
 |-- TV: double (nullable = true)
 |-- radio: double (nullable = true)
 |-- newspaper: double (nullable = true)
 |-- sales: double (nullable = true)



We can get **summary statistics** for numerical columns using **`.describe().show()`**, which is very handy for inspecting your (big) data when trying to understand or debug it.



In [26]:
df2.describe().show()

+-------+-----------------+------------------+------------------+------------------+
|summary|               TV|             radio|         newspaper|             sales|
+-------+-----------------+------------------+------------------+------------------+
|  count|              200|               200|               200|               200|
|   mean|         147.0425|23.264000000000024|30.553999999999995|14.022500000000003|
| stddev|85.85423631490805|14.846809176168728| 21.77862083852283| 5.217456565710477|
|    min|              0.7|               0.0|               0.3|               1.6|
|    max|            296.4|              49.6|             114.0|              27.0|
+-------+-----------------+------------------+------------------+------------------+
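
To make the rows of this summary concrete, here is a small plain-Python sketch that computes the same five statistics by hand for the `TV` values of the five rows displayed by `df.show(5)`. It assumes that `stddev` denotes the sample standard deviation (denominator $n-1$). Since only five rows are used, the numbers differ from the table above, which covers all 200 rows.

```python
import statistics

# TV column of the five rows displayed by df.show(5)
tv = [230.1, 44.5, 17.2, 151.5, 180.8]

summary = {
    "count": len(tv),
    "mean": statistics.mean(tv),
    "stddev": statistics.stdev(tv),  # sample standard deviation (n - 1 denominator)
    "min": min(tv),
    "max": max(tv),
}

for name, value in summary.items():
    print(name, value)
```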



## 3. Machine Learning Library and Pipelines

[MLlib](https://spark.apache.org/docs/2.3.2/ml-guide.html) is Spark’s machine learning (ML) library. It provides:

- *ML Algorithms*: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- *Featurization*: feature extraction, transformation, dimensionality reduction, and selection
- *Pipelines*: tools for constructing, evaluating, and tuning ML Pipelines
- *Persistence*: saving and loading algorithms, models, and Pipelines
- *Utilities*: linear algebra, statistics, data handling, etc.

<tt>MLlib</tt> allows easy combination of numerous algorithms into a single pipeline using standardized APIs for machine learning algorithms. The key concepts are:

- **DataFrame**. DataFrames can hold a variety of data types.
- **Transformer**. Transforms one DataFrame into another.
- **Estimator**. Algorithm which can be fit on a DataFrame to produce a Transformer.
- **Pipeline**. A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
- **Parameter**. Transformers and Estimators share a common API for specifying parameters.

A list of some of the available ML features is available [here](http://spark.apache.org/docs/2.3.2/ml-features.html).

**Clarification on whether Estimator is a transformer**. See [Estimators](https://spark.apache.org/docs/2.3.2/ml-pipeline.html#estimators)
> An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a **Transformer**.
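
The relationship can be mimicked in a few lines of plain Python. This is a toy analogy, not the Spark API: the classes `MeanCenterer` and `MeanCentererModel` are made up for illustration and operate on Python lists rather than DataFrames.

```python
class MeanCenterer:
    """Plays the role of an Estimator: it learns something from data."""

    def fit(self, values):
        mean = sum(values) / len(values)
        return MeanCentererModel(mean)  # fit() returns a Model


class MeanCentererModel:
    """Plays the role of a Model, which is a Transformer."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        # Apply what was learned in fit() to (possibly new) data.
        return [v - self.mean for v in values]


model = MeanCenterer().fit([1.0, 2.0, 3.0])
print(model.transform([2.0, 4.0]))
```

The Estimator itself has no `transform()`; only the fitted Model does. That is the point of the clarification above.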


### Linear Regression Example
The example below is based on **Section 8.1** of [PySpark tutorial](https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf). 

#### Convert the data to dense vector (features and label)

Let us convert the above data in CSV format to a typical (feature, label) pair for supervised learning

In [27]:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

In [28]:
def transData(data):
    return data.rdd.map(lambda r: [Vectors.dense(r[:-1]),r[-1]]).toDF(['features','label'])

In [29]:
transformed= transData(df2)
transformed.show(5)

+-----------------+-----+
|         features|label|
+-----------------+-----+
|[230.1,37.8,69.2]| 22.1|
| [44.5,39.3,45.1]| 10.4|
| [17.2,45.9,69.3]|  9.3|
|[151.5,41.3,58.5]| 18.5|
|[180.8,10.8,58.4]| 12.9|
+-----------------+-----+
only showing top 5 rows



The labels here are real numbers, so this is a **regression** problem. For a **classification** problem, you may need to transform labels (e.g., *disease*, *healthy*) to indices with a featureIndexer, as in Step 5 of **Section 8.1** of the [PySpark tutorial](https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf).

#### Split the data into training and test sets (40% held out for testing)

In [30]:
(trainingData, testData) = transformed.randomSplit([0.6, 0.4])
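
The split is random, so the two parts are only approximately 60% and 40% of the rows, and they change from run to run unless a seed is fixed (`randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates the idea only. It is not Spark's implementation, and the helper `random_split` is hypothetical.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row independently to one split, with probability
    proportional to the corresponding weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    bounds, acc = [], 0.0
    for w in weights:  # cumulative upper bounds, e.g. [0.6, 1.0]
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0   # guard against floating-point rounding
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            if u < bound:
                splits[i].append(row)
                break
    return splits

rows = list(range(200))
train, test = random_split(rows, [0.6, 0.4], seed=42)
print(len(train), len(test))
```

With a fixed seed the same split is reproduced on every run, which makes results such as the RMSE comparable across runs.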

Check your training and test data as follows. I agree with the author Wenqiang that **it is always good to keep tracking your data during the prototype phase**.

In [31]:
trainingData.show(5)

+---------------+-----+
|       features|label|
+---------------+-----+
| [0.7,39.6,8.7]|  1.6|
| [4.1,11.6,5.7]|  3.2|
| [5.4,29.9,9.4]|  5.3|
|[7.3,28.1,41.4]|  5.5|
|[7.8,38.9,50.6]|  6.6|
+---------------+-----+
only showing top 5 rows



In [32]:
testData.show(5)

+----------------+-----+
|        features|label|
+----------------+-----+
|   [8.6,2.1,1.0]|  4.8|
| [8.7,48.9,75.0]|  7.2|
|[13.2,15.9,49.6]|  5.6|
|[16.9,43.7,89.4]|  8.7|
|[17.2,45.9,69.3]|  9.3|
+----------------+-----+
only showing top 5 rows



####  Fit a Linear Regression Model and Perform prediction
More details on parameters can be found in the [Python API documentation](https://spark.apache.org/docs/2.3.2/api/python/pyspark.ml.html#pyspark.ml.regression.LinearRegression).

In [33]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression()

In [34]:
lrModel = lr.fit(trainingData)

In [35]:
predictions = lrModel.transform(testData)

In [36]:
predictions.show(5)

+----------------+-----+------------------+
|        features|label|        prediction|
+----------------+-----+------------------+
|   [8.6,2.1,1.0]|  4.8|3.4620887403519953|
| [8.7,48.9,75.0]|  7.2|12.962200605141033|
|[13.2,15.9,49.6]|  5.6|6.6332137956904855|
|[16.9,43.7,89.4]|  8.7| 12.41824981687026|
|[17.2,45.9,69.3]|  9.3|12.737043108987537|
+----------------+-----+------------------+
only showing top 5 rows



#### Evaluation

In [37]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="label",predictionCol="prediction",metricName="rmse")

In [38]:
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

Root Mean Squared Error (RMSE) on test data = 1.5899
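
For reference, RMSE is $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. The plain-Python sketch below applies this formula to the five (label, prediction) pairs displayed by `predictions.show(5)`. Because it uses only those five rows, the result differs from the value above, which is computed on the whole test set.

```python
import math

# (label, prediction) pairs of the five rows shown by predictions.show(5)
pairs = [
    (4.8, 3.4620887403519953),
    (7.2, 12.962200605141033),
    (5.6, 6.6332137956904855),
    (8.7, 12.41824981687026),
    (9.3, 12.737043108987537),
]

squared_errors = [(label - pred) ** 2 for label, pred in pairs]
rmse_first_five = math.sqrt(sum(squared_errors) / len(squared_errors))
print("RMSE on the five displayed rows = %g" % rmse_first_five)
```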


### Machine Learning Pipeline Example

This example is from the [ML Pipeline API](https://spark.apache.org/docs/2.3.2/ml-pipeline.html), with additional explanations. 

In [39]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

Directly create DataFrame (for illustration)

In [40]:
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])


In [41]:
training.printSchema()

root
 |-- id: long (nullable = true)
 |-- text: string (nullable = true)
 |-- label: double (nullable = true)



In [42]:
training.show()

+---+----------------+-----+
| id|            text|label|
+---+----------------+-----+
|  0| a b c d e spark|  1.0|
|  1|             b d|  0.0|
|  2|     spark f g h|  1.0|
|  3|hadoop mapreduce|  0.0|
+---+----------------+-----+



Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.

In [43]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
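
What do the first two stages do to a document? The plain-Python sketch below imitates them: split the lowercased text on whitespace, then map each word to an index by hashing and count occurrences. It is an illustration only. Spark's `HashingTF` uses a different hash function (MurmurHash3), so the indices produced here will not match Spark's. `zlib.crc32` is a stand-in, and the helper names are made up.

```python
import zlib
from collections import Counter

NUM_FEATURES = 262144  # 2**18, the default vector size of HashingTF

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def hashing_tf(words, num_features=NUM_FEATURES):
    """Map each word to an index via a hash and count term frequencies."""
    counts = Counter(zlib.crc32(w.encode("utf-8")) % num_features for w in words)
    return dict(sorted(counts.items()))  # sparse vector as {index: count}

vec = hashing_tf(tokenize("spark hadoop spark"))
print(vec)
```

No vocabulary needs to be built or stored, at the price of occasional collisions when two words hash to the same index.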

Model fitting 

In [44]:
model = pipeline.fit(training)

Construct test documents (data), which are unlabeled (id, text) tuples

In [45]:
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

In [46]:
test.show()

+---+------------------+
| id|              text|
+---+------------------+
|  4|       spark i j k|
|  5|             l m n|
|  6|spark hadoop spark|
|  7|     apache hadoop|
+---+------------------+



Make predictions on test documents and print columns of interest.


In [47]:
prediction = model.transform(test)
prediction.show()

+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
| id|              text|               words|            features|       rawPrediction|         probability|prediction|
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
|  4|       spark i j k|    [spark, i, j, k]|(262144,[20197,24...|[-1.6609033227472...|[0.15964077387874...|       1.0|
|  5|             l m n|           [l, m, n]|(262144,[18910,10...|[1.64218895265644...|[0.83783256854767...|       0.0|
|  6|spark hadoop spark|[spark, hadoop, s...|(262144,[155117,2...|[-2.5980142174393...|[0.06926633132976...|       1.0|
|  7|     apache hadoop|    [apache, hadoop]|(262144,[66695,15...|[4.00817033336812...|[0.98215753334442...|       0.0|
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+



In [48]:
selected = prediction.select("id", "text", "probability", "prediction")
selected.show()

+---+------------------+--------------------+----------+
| id|              text|         probability|prediction|
+---+------------------+--------------------+----------+
|  4|       spark i j k|[0.15964077387874...|       1.0|
|  5|             l m n|[0.83783256854767...|       0.0|
|  6|spark hadoop spark|[0.06926633132976...|       1.0|
|  7|     apache hadoop|[0.98215753334442...|       0.0|
+---+------------------+--------------------+----------+



In [49]:
for row in selected.collect():
    rid, text, prob, prediction = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))

(4, spark i j k) --> prob=[0.1596407738787475,0.8403592261212525], prediction=1.000000
(5, l m n) --> prob=[0.8378325685476744,0.16216743145232562], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.06926633132976037,0.9307336686702395], prediction=1.000000
(7, apache hadoop) --> prob=[0.9821575333444218,0.01784246665557808], prediction=0.000000
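
The `prediction` column is simply the class with the larger probability (equivalently, class 1 is predicted when its probability exceeds the default threshold of 0.5). A quick plain-Python check on the probability vectors printed above:

```python
# Probability vectors [P(label=0), P(label=1)] copied from the output above
probabilities = {
    4: [0.1596407738787475, 0.8403592261212525],
    5: [0.8378325685476744, 0.16216743145232562],
    6: [0.06926633132976037, 0.9307336686702395],
    7: [0.9821575333444218, 0.01784246665557808],
}

def predict(prob):
    """Return the index of the most probable class as a float label."""
    return float(max(range(len(prob)), key=lambda i: prob[i]))

predicted = {rid: predict(prob) for rid, prob in probabilities.items()}
print(predicted)
```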


## 4. Exercise (completing three or more questions is considered as completion of this exercise)


**Note**: Jupyter Notebook is primarily for interactive learning and is more suitable for small-scale data. For large-scale (big) data, stand-alone programs or HPC batch jobs are encouraged. 
* Load the NASA access log Aug95 data in Session 1 and create a DataFrame with FIVE columns by **specifying** the schema 
* Perform some more challenging mining tasks in Session 1 using *as many DataFrame functions as possible*
   * How many **unique** hosts on a particular day (e.g., 15th August)?
   * How many **unique** hosts in total (i.e., in August 1995)?
   * Which host is the most frequent visitor?
   * How many different types of return codes?
   * How many requests per day on average?
   * How many requests per host on average?
   * Any other question that you are interested in.
* Explore more CSV data of your interest via Google or at [Sample CSV data](https://support.spatialkey.com/spatialkey-sample-csv-data/), including insurance, real estate, and sales transactions.
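
As a possible starting point for the first exercise, here is a hedged sketch of splitting one access-log line into five fields with a regular expression, in plain Python. The sample line is made up (it only follows the usual web-server log layout), and the choice of the five fields (host, timestamp, request, status, size) is an assumption. Check it against the actual data before relying on it.

```python
import re

# host, two ignored fields, [timestamp], "request", status code, response size
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)$')

def parse_line(line):
    """Return (host, timestamp, request, status, size), or None if no match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    host, timestamp, request, status, size = match.groups()
    # A size of "-" means no content was returned; treat it as 0 bytes.
    return (host, timestamp, request, int(status), 0 if size == "-" else int(size))

# Made-up sample line for illustration only
sample = 'host.example.com - - [15/Aug/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1839'
print(parse_line(sample))
```

Tuples produced this way could then be passed to `spark.createDataFrame` together with an explicitly specified `StructType` schema. Lines that return `None` are worth inspecting rather than silently dropping.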