## Spark: Cloud Programming with Memory

Lecture derived from: Zaharia et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. USENIX NSDI, 2012.

* Map/Reduce style programming
  * Data-parallel, batch, restrictive model, functional
  * Abstractions to leverage distributed memory
* New interfaces to in-memory computations
  * Fault-tolerant
  * Lazy materialization (pipelined evaluation)
* Good support for iterative computations on in-memory data sets leads to good performance
  * Up to 20x faster than Hadoop Map/Reduce on iterative applications (as reported in the paper)
  * No writing to, or reloading from, the file system between iterations
  
### RDD: Resilient Distributed Dataset

This is the central abstraction of Spark.
* Read-only partitioned collection of records
* Created from:
  * Data in stable storage
  * Transformations on other RDDs
* Unit of parallelism in a data decomposition:
  * Automatic parallelization of operations such as map, filter, and reduce
* RDDs are not data:
  * Not necessarily materialized. They are an abstraction.
  * Defined by lineage: the set of transformations applied to an original data set.
  
A first example:

```scala
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field #3
// in a tab-separated format)
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
```

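* RDDs in this computation
  * `lines` is an RDD backed by a file in HDFS.
  * `errors` is derived from `lines` by `filter`.
  * The next two (the results of the second `filter` and of `map`) are implicit: they are not bound to named variables.
* `persist` asks Spark to keep an RDD in memory for reuse
* `collect` is an action: it triggers the computation and returns the results to the driver program as an array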

Associated with each Spark computation is a lineage of RDDs.

<img src="./images/spark_lineage.png" width=384 />
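
To make the idea of lineage concrete, here is a minimal single-process sketch in plain Python. It is not the Spark API: the `ToyRDD` class, its method names, and the sample log lines are all hypothetical and exist only for illustration. Each object stores its parent and the operation that derives it, and records are produced only when `collect` replays that chain.

```python
class ToyRDD:
    """Toy, single-process stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source=None, parent=None, op=None, label="source"):
        self.source = source    # set only on the root: data in "stable storage"
        self.parent = parent    # the ToyRDD this one was derived from
        self.op = op            # function: iterator -> iterator
        self.label = label

    def filter(self, pred):
        return ToyRDD(parent=self, label="filter",
                      op=lambda it: (r for r in it if pred(r)))

    def map(self, fn):
        return ToyRDD(parent=self, label="map",
                      op=lambda it: (fn(r) for r in it))

    def lineage(self):
        """Labels from the root data set down to this ToyRDD."""
        chain = [] if self.parent is None else self.parent.lineage()
        return chain + [self.label]

    def _iterate(self):
        if self.parent is None:
            return iter(self.source)
        return self.op(self.parent._iterate())

    def collect(self):
        """Action: replay the lineage and return the records to the caller."""
        return list(self._iterate())


# Hypothetical tab-separated log lines; the time is field #3.
log = [
    "INFO\tboot\tok\t09:00",
    "ERROR\tHDFS\tdisk full\t09:05",
    "ERROR\tYARN\ttimeout\t09:07",
    "ERROR\tHDFS\tlost block\t09:09",
]
lines = ToyRDD(source=log, label="lines")
errors = lines.filter(lambda s: s.startswith("ERROR"))
times = errors.filter(lambda s: "HDFS" in s).map(lambda s: s.split("\t")[3])

print(times.lineage())   # ['lines', 'filter', 'filter', 'map']
print(times.collect())   # ['09:05', '09:09']
```

Because every `ToyRDD` can be rebuilt from its parent and its operation, none of the intermediate results needs to be stored for the final answer to be recoverable.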




### A PySpark Example

Monte Carlo approximation of $\pi$ based on determining whether random points fall inside or outside a circle of radius $r = 1$ and area $\pi r^2$ inscribed in a square of area $4r^2$.
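
The estimator follows from a ratio of areas. As a short worked derivation (the symbols $n$, for the number of samples, and $c$, for the number of samples that land inside the circle, are introduced here for illustration), a point drawn uniformly from the square lands inside the circle with probability

$$
P(\text{inside}) = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4},
\qquad \text{so} \qquad
\pi \approx 4 \, \frac{c}{n}.
$$

Drawing $x$ and $y$ uniformly from $[0, 1)$ samples only one quadrant, but the ratio is unchanged: a quarter circle of area $\pi/4$ inside a unit square of area $1$. Since each sample is an independent trial with success probability $p = \pi/4$, the standard error of the estimate is $4\sqrt{p(1-p)/n}$, which is about $0.0016$ for $n = 10^6$.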


In [1]:
from pyspark import SparkContext
sc = SparkContext("local", "App Name")

import random
num_samples = 1000000
def inside(p):     
  x, y = random.random(), random.random()
  return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()

3.141332


### Logistic Regression in Spark

```scala
val points = spark.textFile(...)
                  .map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
    val gradient = points.map{ p =>
          p.x * (1/(1+exp(-p.y*(w dot p.x)))-1)*p.y
    }.reduce((a,b) => a+b)
    w -= gradient
}
```

* Features:
  * Scala closures, functions with free variables
  * `points` is a read-only RDD reused in each iteration
  * Only `w` (the weight vector, a local variable in the driver program) gets updated
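
The same loop can be traced on one machine with plain Python lists. This is a minimal sketch, not Spark code: the four data points, the fixed starting vector, and the helper names `dot`, `add`, and `point_gradient` are assumptions made for illustration. It keeps the structure of the Scala version: a map over a reused, read-only data set, a reduce by vector addition, and an update of `w`.

```python
import math
from functools import reduce

# Hypothetical toy data: (feature vector x, label y) with y in {-1, +1}.
points = [
    ([1.0, 2.0], 1.0),
    ([2.0, 1.5], 1.0),
    ([-1.0, -1.5], -1.0),
    ([-2.0, -1.0], -1.0),
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def point_gradient(w, x, y):
    """Per-point term: x * (1 / (1 + exp(-y * (w . x))) - 1) * y."""
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]

w = [0.0, 0.0]     # fixed start instead of a random vector, for repeatability
ITERATIONS = 10
for _ in range(ITERATIONS):
    # "map" over the reused data set, then "reduce" by vector addition
    gradient = reduce(add, (point_gradient(w, x, y) for x, y in points))
    w = [wi - gi for wi, gi in zip(w, gradient)]

# Every point should now fall on the side of the boundary given by its label.
print(all((dot(w, x) > 0) == (y > 0) for x, y in points))
```

Only `w` changes between iterations; `points` is read repeatedly and never modified, which is why keeping it in memory pays off.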



#### Managing Memory in Spark

* `persist()` indicates desire to reuse an RDD, encourages Spark to keep it in memory
* Spark breaks data up into: 
  * RDD: the representation of a logical data set
  * sequence: a physical, materialized data set
* In Spark-land, RDDs and sequences are differentiated by the concepts of 
  * Transformations: RDD->RDD
  * Actions: RDD->sequence/data
* RDDs define a pipeline of computations from data set (HDFS) to sequence/data
* RDDs evaluated lazily as needed to build a sequence
  * A sequence computation pulls data through the Spark pipeline
* Parallelized constructs in Spark
  * Transformations are lazy whereas actions launch computation
  
<img src="./images/spark_ops.png" width=768 />
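
Python generators give a rough single-machine analogy for lazy transformations and for actions that pull data through the pipeline. The sketch below is an illustration under that analogy, not Spark code; the `source` function, the `evaluated` list, and the sample records are hypothetical.

```python
evaluated = []   # records that were actually read from the source

def source():
    for record in ["a 1", "b 2", "a 3"]:
        evaluated.append(record)   # side effect shows when work really happens
        yield record

# "Transformations": build the pipeline, but compute nothing yet.
pairs = (line.split(" ") for line in source())
a_values = (int(v) for k, v in pairs if k == "a")

before = list(evaluated)   # still empty: no record has been pulled

# "Action": asking for a concrete result pulls records through the pipeline.
total = sum(a_values)

print(before)      # []
print(total)       # 4
print(evaluated)   # ['a 1', 'b 2', 'a 3']
```

Each record flows through both stages one at a time, so no intermediate collection is built between the split and the filter.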

### Map/Reduce in Spark

The following chain of operations expresses a Map/Reduce computation:
```
spark.textFile(...).flatMap(...).reduceByKey(...).save()
```
* Doesn’t use RDD pipelining.  
* `flatMap` emits a sequence of output records for each input record.
* Doesn’t use memory abstraction

#### Many Maps

* `map()` is one-to-one, consistent with Scala semantics
* `flatMap` is one-to-many, like Map in M/R
* `mapValues` does not transform key (important for partitioning)
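
The difference between the three is easiest to see on small lists. The functions below are plain-Python stand-ins written for illustration (the `rdd_` names are hypothetical, not Spark calls); they mirror only the record-level semantics, not distribution or partitioning.

```python
from itertools import chain

def rdd_map(fn, records):
    """One output record per input record."""
    return [fn(r) for r in records]

def rdd_flat_map(fn, records):
    """fn returns a sequence per input; the sequences are concatenated."""
    return list(chain.from_iterable(fn(r) for r in records))

def rdd_map_values(fn, pairs):
    """Apply fn to the value of each (key, value) pair; keys are untouched."""
    return [(k, fn(v)) for k, v in pairs]

lines = ["to be", "or not"]
print(rdd_map(str.split, lines))        # [['to', 'be'], ['or', 'not']]
print(rdd_flat_map(str.split, lines))   # ['to', 'be', 'or', 'not']
print(rdd_map_values(len, [("a", "xx"), ("b", "yyy")]))   # [('a', 2), ('b', 3)]
```

Because `rdd_map_values` never touches the keys, records stay in whatever key-based grouping they already had, which is the property that matters for partitioning.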

#### M/R equivalent in PySpark


In [2]:
from pyspark import SparkContext
sc = SparkContext("local", "App Name")

text_file = sc.textFile("/mnt/labbook/input/wordcount/*")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("/mnt/labbook/output/untracked/sparkoutput")

sc.stop()
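
The logic of the three stages can be checked without a cluster. The following plain-Python sketch assumes two hypothetical input lines and a hand-written `reduce_by_key` helper (not a Spark function) that combines all values sharing a key.

```python
def reduce_by_key(fn, pairs):
    """Combine all values that share a key using the binary function fn."""
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return acc

lines = ["the cat", "the hat"]   # hypothetical stand-in for the input files
words = [w for line in lines for w in line.split(" ")]   # flatMap
pairs = [(w, 1) for w in words]                            # map
counts = reduce_by_key(lambda a, b: a + b, pairs)          # reduceByKey

print(counts)   # {'the': 2, 'cat': 1, 'hat': 1}
```

In the distributed version the combining function must be associative and commutative, since partial results from different partitions are merged in no fixed order.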

### Lineage and Recovery

* Spark in the presence of failures:
  * Identify partitions of data (in an RDD) that have failed
  * Recompute the failed partitions using lineage
  * Parallelize recomputation (using Spark)
  * Easy because all RDDs are immutable
* Does not require checkpoint/restart or rollback
  * Checkpoint = save required memory (application state) to persistent storage sufficient to restart the computation
  * Restart = restart computation from a checkpoint
  * Rollback = Undo changes made to memory associated with computations that have failed or did not complete
  * All of these concepts exist to manage writable memory as a computation progresses in the presence of failures; immutable RDDs avoid the need for them.
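
Recovery by recomputation can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Spark's implementation: partitions are lists, the lineage is a list of per-partition functions, and a node failure is simulated by deleting one cached partition. All names are hypothetical.

```python
# Stable storage: the source data set, split into partitions.
source_partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Lineage: the ordered, per-partition steps that define the derived data set.
lineage = [
    lambda part: [x * 10 for x in part],             # map
    lambda part: [x for x in part if x % 20 == 0],   # filter
]

def compute_partition(i):
    """Rebuild partition i by replaying the lineage on its source partition."""
    part = source_partitions[i]
    for step in lineage:
        part = step(part)
    return part

# Materialize every partition in "memory".
cache = {i: compute_partition(i) for i in range(len(source_partitions))}
expected = dict(cache)

del cache[1]   # simulate losing the node that held partition 1
lost = [i for i in range(len(source_partitions)) if i not in cache]
for i in lost:
    cache[i] = compute_partition(i)   # only the lost partition is recomputed

print(lost, cache[1])   # [1] [40, 60]
```

Nothing has to be undone or rolled back: the source partitions and the lineage are never modified, so replaying them always yields the same result.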
  
#### Spark Checkpointing

* Spark supports checkpoints as a performance optimization
  * Make an RDD persistent to limit recovery work
* It is desirable to checkpoint when:
  * Lineages become long (in dependencies)
  * Dependencies become wide
* Spark’s default is to use the initial data load as the only checkpoint and restart from that checkpoint

**PageRank** Checkpoint Example. Scala code and the resulting lineage.
```scala
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
    // Build an RDD of (TargetURL, float) pairs
    // with contributions sent by each page
    val contribs = links.join(ranks).flatMap {
        case (url, (links, rank)) =>
            links.map(dest => (dest, rank/links.size))
    }
    // Sum contributions by URL and get new ranks
    ranks = contribs.reduceByKey((x,y) => x+y)
        .mapValues(sum => a/N + (1-a)*sum)
}
```
<img src="images/sparkpr.png" width=384 />

Recovering the computation from lineage alone repeats the entire computation history.  One would prefer to checkpoint `ranks` periodically so that the computation can restart from a recent point.
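
The effect of periodic checkpoints on recovery work can be illustrated with a single-machine PageRank in plain Python. This is a sketch under stated assumptions: the three-page graph, the damping value `a = 0.15`, the checkpoint interval, and the in-memory `checkpoints` dictionary are all hypothetical. In PySpark the corresponding calls are `SparkContext.setCheckpointDir` and `RDD.checkpoint`, which this sketch does not use.

```python
def step(links, ranks, a=0.15):
    """One PageRank iteration: a/N + (1 - a) * (sum of contributions)."""
    n = len(links)
    sums = {url: 0.0 for url in links}
    for url, dests in links.items():
        for dest in dests:
            sums[dest] += ranks[url] / len(dests)
    return {url: a / n + (1 - a) * s for url, s in sums.items()}

def run(links, ranks, start, stop):
    """Apply iterations start+1 .. stop to ranks."""
    for _ in range(start, stop):
        ranks = step(links, ranks)
    return ranks

# Hypothetical three-page graph; every page has at least one outgoing link.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
initial = {url: 1.0 / len(links) for url in links}
ITERATIONS, CHECKPOINT_EVERY = 10, 4

# Normal run, saving a copy of ranks every CHECKPOINT_EVERY iterations.
checkpoints = {0: initial}
ranks = initial
for i in range(1, ITERATIONS + 1):
    ranks = step(links, ranks)
    if i % CHECKPOINT_EVERY == 0:
        checkpoints[i] = dict(ranks)

# Recovery after losing ranks: replay only from the latest checkpoint.
last = max(checkpoints)
recovered = run(links, checkpoints[last], last, ITERATIONS)

print(last, ITERATIONS - last, recovered == ranks)   # 8 2 True
```

Without the checkpoints, recovery would start from iteration 0 and replay all ten iterations; with them, only the two iterations after the last checkpoint are repeated.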

#### Dependencies

RDDs are divided into partitions, and each partition is (potentially) stored on a different computer.  So when a node fails, the partitions it holds are lost.

<img src="images/sparkdep.png" width=512 />

Different operations have different dependency patterns.

* If a partition fails in an RDD with a wide dependency, the entire previous computation needs to be repeated, because the lost partition may depend on every partition of the parent RDD
  * Partition after wide dependencies.
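
The difference in recovery cost can be made concrete by asking which parent partitions are needed to rebuild one lost child partition. The plain-Python sketch below is a toy model, not Spark code: the function names, the integer keys, and the modulo-based placement that stands in for a hash shuffle are all assumptions made for illustration.

```python
NUM_PARTS = 3

def parents_needed_narrow(child):
    """map/filter: child partition i is built from parent partition i only."""
    return {child}

def parents_needed_wide(child, parent_partitions):
    """Shuffle: the child holds every key k with k % NUM_PARTS == child,
    so any parent partition containing such a key must be read again."""
    return {
        p for p, keys in enumerate(parent_partitions)
        if any(k % NUM_PARTS == child for k in keys)
    }

# Hypothetical integer keys spread over three parent partitions.
parent_partitions = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

print(parents_needed_narrow(1))                    # {1}
print(parents_needed_wide(1, parent_partitions))   # {0, 1, 2}
```

With the narrow dependency a single parent partition suffices; with the wide one, every parent partition contributes a key, so all of them must be available or recomputed.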



  
  


### Conclusions

Spark does everything M/R can do.  

* With memory abstractions.
  * Caching
  * Persistence
  * Checkpoints
* And support for iterative algorithms
* And an interactive programming interface
* And multi-language support