## Spark: Cloud Programming with Memory

Lecture derived from: Zaharia et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. USENIX NSDI, 2012.

* Map/Reduce style programming
  * Data-parallel, batch, restrictive model, functional
  * Abstractions to leverage distributed memory
* New interfaces to in-memory computations
  * Fault-tolerant
  * Lazy materialization (pipelined evaluation)
* Good support for iterative computations on in-memory data sets leads to good performance
  * Up to 20x faster than Hadoop Map/Reduce on iterative workloads (as reported in the paper)
  * No writing data to, or reloading data from, the file system between iterations
  
### RDD: Resilient Distributed Dataset

This is the central abstraction of Spark.
* Read-only partitioned collection of records
* Created from:
  * Data in stable storage
  * Transformations on other RDDs
* Unit of parallelism in a data decomposition:
  * Automatic parallelization of operations such as map, reduce, and filter
* RDDs are not data:
  * Not necessarily materialized. They are an abstraction.
  * Defined by lineage: the set of transformations applied to an original data set.
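
The idea that an RDD is a recipe rather than data can be sketched in plain Scala. This is a toy model, not Spark's implementation: the names `ToyRDD`, `Source`, `Mapped`, and `Filtered` are hypothetical, and partitioning is left out entirely. Each node only remembers its parent and the function to apply, so the records can always be recomputed from the source.

```scala
// Toy model of an RDD (NOT Spark's implementation): an RDD is a recipe
// (lineage) over stable data rather than the data itself.
object LineageSketch {
  sealed trait ToyRDD[A] {
    // Recompute the records from scratch by walking the lineage.
    def compute(): Vector[A]
    // Human-readable lineage, source first.
    def lineage: List[String]

    // Transformations build a new node; they do not run anything.
    def map[B](f: A => B): ToyRDD[B] = Mapped(this, f)
    def filter(p: A => Boolean): ToyRDD[A] = Filtered(this, p)
  }

  // Stands in for data in stable storage (e.g. a file).
  final case class Source[A](data: Vector[A]) extends ToyRDD[A] {
    def compute(): Vector[A] = data
    def lineage: List[String] = List("source")
  }

  final case class Mapped[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
    def compute(): Vector[B] = parent.compute().map(f)
    def lineage: List[String] = parent.lineage :+ "map"
  }

  final case class Filtered[A](parent: ToyRDD[A], p: A => Boolean) extends ToyRDD[A] {
    def compute(): Vector[A] = parent.compute().filter(p)
    def lineage: List[String] = parent.lineage :+ "filter"
  }

  // Building these values runs no filter or map; only compute() does.
  val evens: ToyRDD[Int] = Source(Vector(1, 2, 3, 4, 5, 6)).filter(_ % 2 == 0)
  val squares: ToyRDD[Int] = evens.map(n => n * n)
}
```

Calling `LineageSketch.squares.compute()` replays the filter and the map over the source. Replaying lineage in this way is what lets a lost partition be rebuilt without replicating the data.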
  
A first example:

```scala
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field #3
// in a tab-separated format)
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
```

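* RDDs in this computation
  * `lines` is an RDD backed by HDFS.
  * `errors` is derived from `lines` by a filter.
  * The next two (the second `filter` and the `map`) are implicit; they are not bound to named variables.
* `persist` asks Spark to keep an RDD in memory for reuse
* `collect` is an action: it triggers the computation and returns the results to the driver program as an array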
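
To see what the pipeline computes, here is the same chain of filters and a map run on a local Scala collection instead of an RDD. The log lines and their field layout are made up for illustration; only the position of the time field (index 3) follows the example above. `toArray` stands in for `collect()`.

```scala
object LogMiningSketch {
  // Hypothetical tab-separated log lines: level, component, host, time, message.
  val sampleLines: Seq[String] = Seq(
    "ERROR\tHDFS\tnode1\t10:01:07\tHDFS block missing",
    "INFO\tYARN\tnode2\t10:01:09\tcontainer started",
    "ERROR\tYARN\tnode3\t10:02:15\tcontainer lost",
    "ERROR\tHDFS\tnode4\t10:03:42\tHDFS replica timeout"
  )

  // Same steps as the Spark example, on an in-memory Seq.
  def hdfsErrorTimes(lines: Seq[String]): Array[String] = {
    val errors = lines.filter(_.startsWith("ERROR"))
    errors.filter(_.contains("HDFS"))   // keep errors that mention HDFS
          .map(_.split('\t')(3))        // extract the time field
          .toArray                      // local stand-in for collect()
  }
}
```

On the sample data, `LogMiningSketch.hdfsErrorTimes(LogMiningSketch.sampleLines)` yields the two timestamps of the HDFS errors, `10:01:07` and `10:03:42`.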

Associated with each Spark computation is a lineage of RDDs.

<img src="./images/spark_lineage.png" width=384 />


### Logistic Regression in Spark

```scala
val points = spark.textFile(...)
                  .map(parsePoint).persist()
var w = ...  // random initial vector
for (i <- 1 to ITERATIONS) {
    // Each point contributes one term of the gradient; reduce sums them
    val gradient = points.map{ p =>
          p.x * (1/(1+exp(-p.y*(w dot p.x)))-1)*p.y
    }.reduce((a,b) => a+b)
    w -= gradient
}
```

* Features:
  * Scala closures: functions with free variables (here `w`)
  * `points` is a read-only RDD reused in each iteration
  * Only `w` (the weight vector) gets updated
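
The same gradient loop can be run without a cluster. The following is a minimal sketch on a local `Seq`, assuming labels in {-1, +1}; the `Point` class, the vector helpers `dot`, `add`, and `scale`, the zero starting vector, and the four data points are all hypothetical stand-ins for what `parsePoint` and a vector library would provide.

```scala
object LogisticSketch {
  import scala.math.exp

  // A labeled point: features x and label y in {-1, +1}.
  final case class Point(x: Vector[Double], y: Double)

  def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def add(a: Vector[Double], b: Vector[Double]): Vector[Double] =
    a.zip(b).map { case (u, v) => u + v }

  def scale(a: Vector[Double], k: Double): Vector[Double] = a.map(_ * k)

  // Sum over points of x * (1/(1+exp(-y*(w dot x))) - 1) * y.
  def gradient(points: Seq[Point], w: Vector[Double]): Vector[Double] =
    points
      .map(p => scale(p.x, (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y))
      .reduce(add)

  def train(points: Seq[Point], w0: Vector[Double], iterations: Int): Vector[Double] = {
    var w = w0
    for (_ <- 1 to iterations) {
      w = add(w, scale(gradient(points, w), -1.0)) // w -= gradient
    }
    w
  }

  // Hypothetical toy data: the label is the sign of the first feature.
  val points: Seq[Point] = Seq(
    Point(Vector(1.0, 0.5), 1.0),
    Point(Vector(2.0, -0.3), 1.0),
    Point(Vector(-1.5, 0.2), -1.0),
    Point(Vector(-0.7, -0.4), -1.0)
  )

  // Start from the zero vector so the run is deterministic.
  val w: Vector[Double] = train(points, Vector(0.0, 0.0), 10)
}
```

The structure mirrors the Spark version: `points` is read on every iteration and never modified, while `w` is the only state that changes. In Spark the `map` and `reduce` run in parallel over the partitions of the persisted RDD.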



### Managing Memory in Spark

* `persist()` indicates desire to reuse an RDD, encourages Spark to keep it in memory
* Spark breaks data up into: 
  * RDD: the representation of a logical data set
  * sequence: a physical, materialized data set
* In Spark-land, RDDs and sequences are differentiated by the concepts of 
  * Transformations: RDD->RDD
  * Actions: RDD->sequence/data
* RDDs define a pipeline of computations from data set (HDFS) to sequence/data
* RDDs are evaluated lazily as needed to build a sequence
  * A sequence computation pulls data through the Spark pipeline
  
* Parallelized constructs in Spark
  * Transformations are lazy whereas actions launch computation
  
<img src="./images/spark_ops.png" width=512 />
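
The split between lazy transformations and eager actions can be imitated with a Scala collection view. This is an analogy built on the standard library, not Spark code: `view.map` plays the role of a transformation, `sum` and `max` play the role of actions, and `toVector` plays the role of `persist()`. The `evaluations` counter is a hypothetical probe added only to show when the map function runs.

```scala
object LazySketch {
  var evaluations = 0 // how many times the map function has run

  val data: Vector[Int] = Vector(1, 2, 3, 4)

  // "Transformation": the view only records the map; nothing runs yet.
  val doubled = data.view.map { n => evaluations += 1; n * 2 }
  val beforeAction: Int = evaluations // still 0

  // "Action": summing pulls every record through the pipeline.
  val total: Int = doubled.sum
  val afterFirstAction: Int = evaluations

  // Without persisting, a second action recomputes the map.
  val largest: Int = doubled.max
  val afterSecondAction: Int = evaluations

  // "persist": materialize once, then reuse the stored result.
  val cached: Vector[Int] = doubled.toVector
  val afterPersist: Int = evaluations
  val cachedTotal: Int = cached.sum + cached.max
  val afterCachedActions: Int = evaluations // unchanged since afterPersist
}
```

The counter stays at zero until the first action, grows again on the second action, and stops growing once the materialized copy is used. That is the cost `persist()` is meant to avoid when an RDD is reused across iterations.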