# Spark Intro
Like _Hadoop_, __Spark__ is a low-level system for distributed computation on a cluster.  It has two major advantages over MapReduce:
 
 - It can cache data in memory between stages, whereas Hadoop MapReduce writes intermediate results to disk.  This improves performance and makes Spark suitable for classes of algorithms that would otherwise be too slow (e.g. iterative algorithms, like the training step in many machine learning algorithms).
 
 - It has a more flexible execution model (i.e. not just MapReduce).
 
There are also minor advantages: The default API is much nicer than writing raw MR code. The favored interface is via a language called Scala, but there is also good support for using Python.

## How does Spark relate to Hadoop?  

__Answer 1__: It's a replacement for it.  You can manage and run your own Spark cluster, independent of Hadoop.  You'll have to figure out the filesystem layer from scratch though.   (In this context, _Spark SQL_ is the replacement for _Hive_, i.e. for SQL-like operation.)

__Answer 2__: It's a complement to it.  You can run Spark on top of a Hadoop cluster, and still leverage HDFS and YARN -- then Spark is just replacing MapReduce.  (In this context, _Spark SQL_ can be used as a drop-in replacement for _Hive_.)

## The Spark API
### Nouns
The two main abstractions in Spark are:
1. Resilient distributed datasets (RDDs):
  This is like a (smartly) distributed list, which is partitioned across the nodes of the cluster and can be operated on in parallel.  The operations take the form of the usual functional list operations as shown above.  There are two types of operations: Transformations and Actions.
1. Shared variables used in parallel operations: When Spark runs a function in parallel across nodes, it ships a copy of each variable needed by that function to each node.  There are two types of shared variables:
 - Broadcast variables: These are used to store a value in memory across all nodes (i.e. their values are "broadcast" from the driver node)
 - Accumulator variables: These are variables which are only added to across nodes, for implementing counters and sums.
 
### Verbs
- **Transformations:** These create a new RDD from an old one: for instance `map` or `flatMap`.  Transformations in Spark are __lazy__, which means that they do not compute the answer right away -- only when an action needs it.  Instead they represent the "idea" of the transformation applied to the base data set, and chained transformations simply build on that idea.  In particular, this means that intermediate results of computations do not necessarily get created and stored -- if the output of your map is going straight into a reduce, Spark should be clever enough to realize this and not actually store the output of the map.  Another consequence of this design is that the results of transformations are, by default, recomputed each time they are needed -- to keep them around one must explicitly call `cache`.
    
    [Transformations](https://spark.apache.org/docs/latest/programming-guide.html#transformations) typically *transform* the data and return another RDD.  Common examples include `map`, `filter`, `flatMap`, and also `join`, `cogroup`, `groupByKey`, `reduceByKey`, and `sample`.  (Despite its name, `countByKey` is an action: it returns a map of counts to the driver.)

- **Actions:** These actually return a value as a result of a computation.  For instance, `reduce` is an action that combines all elements of an RDD using some function and then returns the final result.  (Note that `reduceByKey` actually returns an RDD.)
    
    [Actions](https://spark.apache.org/docs/latest/programming-guide.html#actions) typically either generate *small* outputs (e.g. `reduce`, `count`, `first`, `take`, `takeSample`, `foreach`, `collect`) or persist to disk (e.g. `saveAsTextFile`, `saveAsSequenceFile`, etc.).
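
The distinction shows up directly in the return types. The following is a minimal sketch, assuming a throwaway local `SparkContext` named `sc` (an assumption made so the snippet is self-contained; inside this notebook you would reuse the existing `spark` rather than create a second context).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumption: no other SparkContext is running in this JVM.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("verbs-sketch"))

val nums: RDD[Int] = sc.parallelize(1 to 10)

// Transformation: returns another RDD; no job has run yet.
val squares: RDD[Int] = nums.map(x => x * x)

// Action: runs the job and returns a plain value to the driver.
val total: Int = squares.reduce(_ + _)

// reduceByKey is a transformation: one result per key, still distributed.
val byParity: RDD[(Int, Int)] = squares.map(x => (x % 2, x)).reduceByKey(_ + _)

// collect is the action that finally brings the per-key sums back.
val parityTotals: Map[Int, Int] = byParity.collect().toMap

println(s"total = $total, by parity = $parityTotals")
sc.stop()
```

Only `reduce` and `collect` trigger computation here; the `map` and `reduceByKey` calls merely describe it.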


**Questions:**
1. What's the difference between `map` and `foreach` in (non-Spark) Scala?  Why is `map` a transformation but `foreach` an action in Spark?
1. Why is `reduceByKey` a transformation but `reduce` an action?

In [1]:
def time[R](block: => R): R = {
    val start: Long = System.nanoTime()
    val result = block
    val end: Long = System.nanoTime()
    val duration: Double = (end - start) / 1000000000.0  // what happens if you leave off the decimal point?
    println("Elapsed time: " + duration + "s")
    result
}





## Creating a Spark Context

In [2]:
import org.apache.spark.SparkContext
classOf[SparkContext].getPackage.getImplementationVersion





1.3.0

## Spark examples

Now that you have some basic familiarity with Scala, let's try working with Spark.

In [3]:
// Basic wordcount. `val spark` is already introduced in the namespace.
time {
    val lines = spark.textFile("small_data/gutenberg/")
    val totalLines = lines.count()
    println(s"total lines: $totalLines")

    lines.flatMap(line => line.split(" "))
        .map(word => (word.toLowerCase(), 1))
        .reduceByKey(_ + _)
        .sortByKey()
        .saveAsTextFile("/tmp/output_gutenberg" + (System.currentTimeMillis / 1000).toString)
}

total lines: 77931
Elapsed time: 3.619031605s




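**Exercise:** There are a lot of special symbols in the output.  How would you restrict the word count to tokens made up purely of word characters (letters, digits, and underscores, which is what `\w` matches)?  Hint: in Scala and Java, you can test a string against a regular expression this way: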
```scala
"abc".matches("^\\w+$")
```

### Example: a simple simulation for $\pi$

In [11]:
val numSamples = 100000

val count = spark.parallelize(1 to numSamples).map{_ =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / numSamples)

Pi is roughly 3.1558
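
The factor of 4 comes from a standard area argument, sketched here for reference. Since `x` and `y` are uniform on $[0, 1)$, the point $(x, y)$ lands inside the quarter of the unit disk with probability equal to that region's area:

$$
P(x^2 + y^2 < 1) = \frac{\pi}{4}, \qquad \hat{\pi} = 4 \cdot \frac{\text{count}}{\text{numSamples}}
$$

Each sample is a Bernoulli trial with success probability $p = \pi/4$, so the standard error of the estimate is

$$
4 \sqrt{\frac{p\,(1 - p)}{\text{numSamples}}} \approx 0.005 \quad \text{for } \text{numSamples} = 100000 .
$$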




## ETL in Spark

#### Premise
The first steps of any data science project usually involve ETL'ing and performing exploratory analytics on a dataset.
Here we'll acquaint ourselves both with the Spark shell and some Spark functions that we can use to perform such tasks.

#### The dataset
From https://archive.ics.uci.edu/ml/datasets/EEG+Database:  
> "The first four lines are header information. Line 1 contains the subject identifier and indicates if the subject was an alcoholic (a) or control (c) subject by the fourth letter. Line 4 identifies the matching conditions: a single object shown (S1 obj), object 2 shown in a matching condition (S2 match), and object 2 shown in a non matching condition (S2 nomatch). Line 5 identifies the start of the data from sensor FP1. The four columns of data are: the trial number, sensor position, sample number (0-255), and sensor value (in micro volts)."

There are 16,452 rows in this file including the header. Here's a preview of the first 10 lines and last 10 lines:
```
# co2a0000364.rd
# 120 trials, 64 chans, 416 samples 368 post_stim samples
# 3.906000 msecs uV
# S1 obj , trial 0
# FP1 chan 0
0 FP1 0 -8.921
0 FP1 1 -8.433
0 FP1 2 -2.574
0 FP1 3 5.239
0 FP1 4 11.587
    ...
0 Y 246 24.150
0 Y 247 20.243
0 Y 248 11.454
0 Y 249 4.618
0 Y 250 3.153
0 Y 251 6.571
0 Y 252 12.431
0 Y 253 15.849
0 Y 254 16.337
0 Y 255 14.872
```


#### Summary stats on the dataset

1. There is an action for `RDD[Double]` called `stats` that will provide us with summary statistics about the values in the RDD. How can we get summary stats of the `reading` column across the entire dataset?

1. How can we do the same but for only the `posn` "FP1"?

1. Let's make sure there are 256 samples per `posn` in this dataset. How can we do this? 

In [12]:
def isHeader(line: String) : Boolean = {
    line.contains("# ")
}

case class Record(
    trial: Int,
    posn: String,
    sample: Int,
    reading: Double
)

def parse(line: String) = {
    val tokens = line.split("\\s+")
    val trial = tokens(0).toInt
    val posn = tokens(1)
    val sample = tokens(2).toInt
    val reading = tokens(3).toDouble
    Record(trial, posn, sample, reading)
}

val data = spark.textFile("small_data/eeg/*")
                .filter(x => !isHeader(x))
                .map(parse)





MapPartitionsRDD[14] at map at <console>:31

In [13]:
println(data.map(_.reading).stats())

(count: 491520, mean: -0.079346, stdev: 6.198320, max: 48.472000, min: -49.072000)




In [14]:
println(data.filter(_.posn == "FP1").map(_.reading).stats())

(count: 7680, mean: 0.817634, stdev: 8.418456, max: 27.649000, min: -25.513000)




In [15]:
val samples = data.map(x => ((x.trial, x.posn), 1)).reduceByKey(_ + _)
val posns = samples.map{case (x, y) => (x._2, y)}.distinct()

assert(posns.take(1)(0)._2 == 256) // _2 second element of tuple 
assert(posns.count() == posns.filter{case (x, y) => y == 256}.count())





In [16]:
samples.take(5)

Array(((65,FC4),256), ((3,FC1),256), ((1,FPZ),256), ((29,CP5),256), ((29,CPZ),256))





In [19]:
posns.take(5)





Array((FT8,256), (Y,256), (F3,256), (PO8,256), (FT7,256))

In [21]:
posns.take(5)(0)._1



FT8



## Joins in Spark

As in MapReduce, a join is accomplished by emitting key-value pairs `(k, v)` and matching records that share a key.  The mechanism is roughly the same (records with the same key are sent to the same *node*), although in Spark those key-value pairs are not necessarily persisted to disk.

In [22]:
case class Transaction(
    transactionId: String,
    productId: String,
    userId: String,
    amount: Int
)

case class User(
    userId: String,
    email: String,
    language: String,
    country: String
)

val users = (spark.textFile("small_data/employee/users.csv")
    .map({line => 
        val data = line.split(",")
        User(data(1), data(2), data(3), data(4))
    }))

val transactions = (spark.textFile("small_data/employee/sales.csv")
    .flatMap({ _.split(",") match {
        case Array(
            "sales",
            transactionId: String,
            productId: String,
            user: String,
            amount: String
        ) => Some(new Transaction(transactionId, productId, user, amount.toInt))
        case _ => None
    }}))

val totalSales = transactions.map(_.amount).sum
val usersByCountryCount = users.map(u => (u.country, u)).countByKey

println(s"total sales $totalSales")
println("Users by country")
usersByCountryCount foreach println

// The join yields (userId, (amount, country)) pairs
val salesByCountry = (transactions.map(t => (t.userId, t.amount))
                .join(users.map(u => (u.userId, u.country)))
                .map({ 
                  case (userId: String, (amount: Int, country: String)) 
                      => (country, amount)
                }) 
                .reduceByKey(_ + _)
                .collect)

println("Transactions by country")
salesByCountry foreach println

total sales 700.0
Users by country
(FR,2)
(US,1)
(GB,1)
Transactions by country
(FR,150)
(US,400)
(GB,150)




### A warning about traversables

The transformations `mapPartitions`, `mapPartitionsWithIndex`, `groupByKey`, and `cogroup` expose their data as iterators or iterables rather than fully materialized collections.  Much like a Python iterator, a Scala `Iterator` can only be traversed once.  Here's a common anti-pattern:

In [None]:
// Anti-example

def mean(iter: TraversableOnce[Int]) = {
    val size = iter.size
    val sum = iter.sum.toDouble
    sum / size
}

val iterMean = mean(Iterator(1, 2, 3, 4, 5))
val listMean = mean(List(1, 2, 3, 4, 5))

println(s"iterator mean: $iterMean\nlist mean: $listMean")
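
In the anti-example, `iter.size` exhausts the iterator, so the subsequent `sum` sees no elements. One way to avoid the double traversal is to compute the sum and the count in a single pass. This is a sketch of one possible fix rather than the only one; `meanOnePass` is a name introduced here for illustration.

```scala
// Single-pass mean: accumulate (sum, count) together, so the input is traversed only once.
def meanOnePass(iter: Iterator[Int]): Double = {
  val (sum, count) = iter.foldLeft((0L, 0L)) {
    case ((s, c), x) => (s + x, c + 1)
  }
  // An empty input has no mean; NaN is one (assumed) convention for signalling that.
  if (count == 0) Double.NaN else sum.toDouble / count
}

val iterMeanFixed = meanOnePass(Iterator(1, 2, 3, 4, 5))
val listMeanFixed = meanOnePass(List(1, 2, 3, 4, 5).iterator)

println(s"iterator mean: $iterMeanFixed\nlist mean: $listMeanFixed")
```

Both calls now agree, because neither relies on walking the data a second time.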

## Persisting RDDs
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster.

The argument passed to the persist method determines how the RDD will be stored. In general, you may choose
* Deserialized or serialized formats (serialized Java objects are more space-efficient but more CPU-intensive to read)
* In-memory only or allow spilling to disk
* *Fast* fault tolerance via redundancy
* [Full documentation](http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)


*Serialization*: the process of translating data structures or object states into a format (i.e. series of bits) that can be stored and reconstructed later.


1. The cache method is a shortcut for the default behavior: in memory and deserialized:
```scala
myRdd.cache()
```
2. Otherwise, specify the desired behavior:
```scala
import org.apache.spark.storage.StorageLevel
myRdd.persist(StorageLevel.MEMORY_ONLY_SER)
```
3. Cleaning up cached data is important for memory management. During shuffles, some intermediate data is automatically persisted, so it can be worthwhile to manually unpersist RDDs instead of waiting for garbage collection:
```scala
myRdd.unpersist()
```
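
Putting the three steps together, here is a minimal sketch of the persist / reuse / unpersist life cycle. It assumes a throwaway local `SparkContext` named `sc` (inside this notebook, reuse the existing `spark` instead), and the choice of `MEMORY_AND_DISK` is only an illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumption: no other SparkContext is running in this JVM.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("persist-sketch"))

val squares = sc.parallelize(1 to 100000).map(x => x.toLong * x)

// Keep partitions in memory, spilling to disk if they do not fit.
squares.persist(StorageLevel.MEMORY_AND_DISK)
val levelWhilePersisted = squares.getStorageLevel

val firstCount = squares.count()   // computes the partitions and stores them
val secondCount = squares.count()  // can be served from the stored partitions

// Release the storage explicitly once the RDD is no longer needed.
squares.unpersist()
val levelAfterUnpersist = squares.getStorageLevel

println(s"persisted as: $levelWhilePersisted, afterwards: $levelAfterUnpersist")
sc.stop()
```

Note that `persist` itself is lazy: nothing is stored until the first action computes the partitions.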

**Question:** Caching is a key tool for iterative algorithms and fast interactive use.  Why might you not always do it?

In [None]:
def randomStream = spark.parallelize(1 to 1000000).map{_ => Math.random()}

val uncachedLines = randomStream
val cachedLines = randomStream.cache()

time{uncachedLines.count()}
time{uncachedLines.count()}
time{cachedLines.count()}
time{cachedLines.count()} // this runs faster the 2nd time

## Accumulators

We'll want to implement counters (like we had in `mrjob` and Hadoop in general).  As usual, here is a first anti-example:

In [24]:
// Anti-example

var counter: Int = 0

spark.parallelize((1 to 10))
    .foreach(x => counter += 1)

println(counter)  // what's happening here?

0




Here's the correct way to implement this using [Accumulators](https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.Accumulator).  We are just using them for integer addition, but you can use them for any associative and commutative operation with an identity element (a commutative [monoid](https://en.wikipedia.org/wiki/Monoid)).  You'll hear about these a lot in Scala.

In [25]:
// Example

var accumCounter = spark.accumulator(0)

spark.parallelize((1 to 10))
    .foreach(x => accumCounter += 1)

println(accumCounter)

10
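
To make the monoid remark concrete, here is a plain-Scala sketch (no Spark involved) of why an identity element and an associative combine are what an accumulator needs: each partition can fold its own values, and the partial results can then be merged. The `Monoid` trait, the two instances, and `accumulate` are hypothetical names introduced for illustration, not part of Spark's API.

```scala
// A minimal monoid: an identity element plus an associative combine.
trait Monoid[A] {
  def zero: A
  def combine(x: A, y: A): A
}

// Hypothetical instances for illustration.
val intSum = new Monoid[Int] {
  val zero = 0
  def combine(x: Int, y: Int): Int = x + y
}
val intMax = new Monoid[Int] {
  val zero = Int.MinValue
  def combine(x: Int, y: Int): Int = math.max(x, y)
}

// Simulate an accumulator: every "partition" folds locally, then the "driver" merges the partials.
def accumulate[A](partitions: Seq[Seq[A]], m: Monoid[A]): A =
  partitions.map(_.foldLeft(m.zero)(m.combine)).foldLeft(m.zero)(m.combine)

val parts = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6, 7, 8, 9, 10))
val total = accumulate(parts, intSum)
val largest = accumulate(parts, intMax)

println(s"sum = $total, max = $largest")
```

In a real cluster the partial results arrive in no particular order, which is why the operation should also be commutative.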




In [26]:
// Lazy example

var accumCounter = spark.accumulator(0)

val myRdd = spark.parallelize((1 to 10))
    .map{x => accumCounter += 1
         x}

println("counter: " + accumCounter)

println(myRdd.reduce(_ + _))

println("counter: " + accumCounter)

myRdd.take(3).foreach(println)

println("counter: " + accumCounter)

counter: 0
55
counter: 10
1
2
3
counter: 15

Because transformations are lazy, the counter stays at 0 until `reduce` triggers the job.  The later `take(3)` recomputes part of `myRdd` (it was never cached), so the `map` runs again on the elements it touches and the counter grows past 10.  Accumulator updates made inside transformations can therefore be applied more than once; only updates made inside actions are guaranteed to be applied exactly once.




## Broadcast variables

Broadcast variables are the Spark equivalent of Hadoop's distributed cache: they keep a read-only value cached on each machine rather than shipping a copy of it with every task.

In [27]:
// Basic usage:

val iBV = spark.broadcast(10)
assert(iBV.value == 10)





Here's a more realistic example to show how you might implement this:

In [28]:
val states = Map(
    0 -> "AL",
    1 -> "KY",
    2 -> "VT" // etc.
)

case class Sale(state: Int, sales: Double)

val data = spark.parallelize(Array(
    Sale(0, 30.0),
    Sale(1, 40.0),
    Sale(2, 20.0),
    Sale(2, 10.0)
))

// This works, but `states` is serialized into the task closure every time the job runs
time{val results1 = data.map(s => (states(s.state), s.sales)).collect()}

// Broadcast to each node once; the value can then be used freely inside map
val statesBV = spark.broadcast(states)
time{val results2 = data.map(s => (statesBV.value(s.state), s.sales)).collect()}

Elapsed time: 0.087469965s
Elapsed time: 0.048489677s




In [29]:
val results1 = data.map(s => (states(s.state), s.sales)).collect()
val results2 = data.map(s => (statesBV.value(s.state), s.sales)).collect()
assert(results1.toList == results2.toList) // why do we need .toList?





**Some caveats:**
1. After the broadcast variable is created, it should be used instead of the value so that it is not shipped to the nodes more than once.
2. The broadcast variable should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable.
3. How does the memory footprint of broadcasting scale with the number of tasks? With the number of executors?

**Question:** The above example is called a map-side join.  
1. Why is it called this?
1. What are the strengths and weaknesses of this versus a full join (i.e. emitting key-value pairs to common keys)?

#### Exit Tickets
1. What are the main distinguishing factors of Spark versus Hadoop?
1. Go over the pros and cons of caching/broadcasting information for MapReduce jobs.
1. What are the inputs and outputs for actions and transformations, respectively?
1. `union` is a function that "merges" two RDDs "vertically".  How would you take the union of an `RDD` of type `S` and one of type `T` in a statically typed language?

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*