# Spark, Datasets, and Structured Streaming

# Spark Overview

[Spark](http://spark.apache.org/) is a platform for distributed computation that has several great features:

1. Transparently processes data on multiple nodes in the cloud while giving the programmer an API that's little more complex than Scala's `Seq` API.
1. Resiliently handles failures by restarting processes.
1. Spills data to disk as necessary but prefers to cache intelligently in RAM for faster processing.
1. Offers Java, Scala, Python, R, and SQL APIs (although we will primarily be using the Scala one).
1. Runs the same Spark code in standalone mode (a single-node cluster for development), on Hadoop, on Mesos, or in the cloud.

# Spark Wordcount Using RDD Example

In order to speak about Spark Structured Streaming, we need to first introduce the basics of batch jobs in Spark and highlight the main features of the batch API.  In addition, this will be useful when we look at Spark discretized streams later on.

The key abstraction of the batch API is the resilient distributed dataset (RDD).  RDDs share a similar API with Scala's `Seq` and transparently handle large datasets by abstracting away how the computation is distributed across nodes.

Below, we give a simple example of a Spark program that counts the words in Shakespeare's Othello.  Notice how much more concise the code is than the typical [wordcount in MapReduce](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v2.0), which is 133 lines of code.

In [34]:
// Load data
val lines = sc.textFile("data/othello.txt")

// Count words
val counts = (lines.flatMap(line => line.split("\\s+"))
    .map(word => (word.toLowerCase, 1))
    .reduceByKey(_ + _))
    
// Sort and print top 20
(counts.sortBy(_._2, ascending=false)
    .take(20)
    .foreach(println))

(i,803)
(and,769)
(the,756)
(to,572)
(of,470)
(a,441)
(my,427)
(that,337)
(you,335)
(in,319)
(iago,299)
(othello,292)
(not,278)
(is,276)
(it,244)
(with,221)
(for,219)
(be,208)
(your,207)
(he,205)


# Two Environments for Running Spark

There are two contexts for running Spark:

1. We will primarily be demonstrating the "REPL" method.  You can access the REPL by running any of the code cells in these notebooks or by running `spark-shell` from the bash command line.  Notebooks are great for didactic, exploratory, and presentation purposes.  The shell is also great for exploration.

2. You can also write Spark jobs as a program to be packaged up and run as a standalone jar application.  This requires using build tools (e.g. SBT) and a reasonable amount of overhead.  We'll explore how to do this in the final notebook.  This is great for production code.

For (2), you'll need to create your own `SparkContext` and `SparkSession`.  In (1), these are provided as global variables named `sc` and `spark`.  You'll access Spark functionality through these two objects.
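
For the standalone case, creating the two objects yourself might look roughly like the sketch below. The object name `StandaloneExample`, the application name, and the `local[*]` master are assumptions made for illustration; a packaged job would usually leave the master to `spark-submit`.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical entry point; the object and application names are illustrative.
object StandaloneExample {
  // Builds a session, or reuses one that already exists in this JVM.
  // `local[*]` runs Spark in-process using all available cores.
  def createSession(): SparkSession =
    SparkSession.builder()
      .appName("standalone-example")
      .master("local[*]")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = createSession()
    val sc: SparkContext = spark.sparkContext // the RDD entry point hangs off the session

    // Same style of call as in the notebook, but on objects we created ourselves.
    val total = sc.parallelize(1 to 100).reduce(_ + _)
    println(s"Sum of 1 to 100: $total")

    spark.stop()
  }
}
```

Because the `SparkContext` is obtained from the `SparkSession`, only one builder call is needed to get both objects.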

# Spark Wordcount Using Scala Example

Rather than following the complex MapReduce paradigm, Spark's API resembles the Scala `Seq` API.  For example, the code below uses native Scala methods to implement a (much less powerful) word count.  The amazing thing about Spark is that it gives us distributed computation at the same level of API complexity.

In [18]:
import scala.io.Source

// Load data
val lines = (Source.fromFile("data/othello.txt")
    .getLines())
 
// Count words
val counts = (lines.flatMap(line => line.split("\\s+"))
    .toSeq
    .groupBy(_.toLowerCase)
    .mapValues(_.length))
       
// Sort and print top 20
(counts.toSeq
    .sortBy(_._2)
    .reverse
    .take(20)
    .foreach(println))

(i,803)
(and,769)
(the,756)
(to,572)
(of,470)
(a,441)
(my,427)
(that,337)
(you,335)
(in,319)
(iago,299)
(othello,292)
(not,278)
(is,276)
(it,244)
(with,221)
(for,219)
(be,208)
(your,207)
(he,205)


**Exercise**: Use the `filter` method on RDDs to count only the lowercase words using Spark.  Verify your answer with the native Scala implementation using `Seq`.

# Spark and Datasets

While RDDs are great, they have several shortcomings that DataFrames and Datasets address.
- RDDs are slow because of Java serialization (this can be sped up by using Kryo).
- DataFrames provide named columns and rows that all share the same schema, like [data frames in R](http://www.r-tutor.com/r-introduction/data-frame) or DataFrames in [Python Pandas](http://pandas.pydata.org/).  They benefit from many advanced optimizations.  Unfortunately, we sacrifice type safety.
- Datasets provide an API with many of the same operations as DataFrames.  They have many of the same performance benefits without needing to sacrifice type safety.
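
To make the type-safety trade-off concrete, here is a small sketch. The `Person` case class and its two rows are made up for illustration, and the session is built with `getOrCreate()`, which should simply reuse the notebook's session if one exists.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import scala.util.Try

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Int)

val session = (SparkSession.builder()
    .master("local[*]")
    .appName("type-safety-sketch")
    .getOrCreate())
import session.implicits._

val people: Dataset[Person] = Seq(Person("Amy", 31), Person("Bob", 17)).toDS()
val peopleDf: DataFrame = people.toDF()

// Dataset: fields are checked by the Scala compiler.
// Writing `p.agee` here would be rejected at compile time.
val adultNames: Array[String] =
  people.filter(p => p.age >= 18).map(_.name).collect()

// DataFrame: columns are resolved by name when the query is analyzed,
// so a misspelled column name only fails once the code runs.
val misspelled = Try(peopleDf.select("agee").collect())

println(s"Adults: ${adultNames.mkString(", ")}")
println(s"Misspelled column failed at runtime: ${misspelled.isFailure}")
```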

# Spark Wordcount Using Datasets Example

Datasets make heavy use of implicits, which we will have to import.  The API also uses column objects, which are written as strings prefixed by `$` or obtained from a Dataset's `.col` method.
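
The column-reference styles can be compared side by side in a short sketch. The `Item` case class and its data are hypothetical; the standalone `col` function from `org.apache.spark.sql.functions` is not used elsewhere in this notebook and is included only for comparison.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical record type used only for this illustration.
case class Item(name: String, price: Double)

val session = (SparkSession.builder()
    .master("local[*]")
    .appName("column-sketch")
    .getOrCreate())
import session.implicits._ // brings the $"..." interpolator and encoders into scope

val items = Seq(Item("pen", 1.5), Item("book", 12.0), Item("lamp", 30.0)).toDS()

// Three ways to refer to the same column in a filter.
val viaDollar  = items.filter($"price" > 10.0)
val viaCol     = items.filter(col("price") > 10.0)
val viaDataset = items.filter(items.col("price") > 10.0)

// Columns compose into expressions, for example a descending sort key.
val expensiveNames: Array[String] =
  viaDollar.orderBy($"price".desc).map(_.name).collect()

println(expensiveNames.mkString(", "))
```

The `.col` form is the one that disambiguates columns when two Datasets are combined, which is why it appears in the join later on.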

In [2]:
// Normally, run `import spark.implicits._`
// There's a namespace collision in a Spark Notebook that requires this

val sparkDummy = spark
import sparkDummy.implicits._

In [3]:
// Load Data
val text = (spark.read
    .text("data/othello/part*")
    .as[String])

// Count words
val counts = (text.flatMap(line => line.split("\\s+"))
    .groupByKey(_.toLowerCase)
    .count)

// Display most common
counts.orderBy($"count(1)".desc).show

+-------+--------+
|  value|count(1)|
+-------+--------+
|      i|     803|
|    and|     769|
|    the|     756|
|     to|     572|
|     of|     470|
|      a|     441|
|     my|     427|
|   that|     337|
|    you|     335|
|     in|     319|
|   iago|     299|
|othello|     292|
|    not|     278|
|     is|     276|
|     it|     244|
|   with|     221|
|    for|     219|
|     be|     208|
|   your|     207|
|     he|     205|
+-------+--------+
only showing top 20 rows



# Joining Data Using Spark Datasets

So far, we have covered `map`, `flatMap`, and `filter`, which represent three major operations on RDDs (and which are also used in streaming).  The final operation we will cover is loading and joining structured data.

**Loading Data:** Datasets support many nice helper functions for loading data.
1. In the code below, we infer the data schema, read column names from the header row, and cast each row to a case class.
1. Datasets also support a SQL-like join syntax via the special `===` operator.

In [61]:
import sys.process._

"more data/users.csv" !

id,name,email,country
0,Amy,amy@example.com,EN
1,Bob,bob@example.com,EN
2,Charlie,charlie@example.com,FR
3,Denise,denise@example.com,FR
4,Edward,edward@example.com,DE


In [60]:
case class User(id: Int, name: String, email: String, country: String)
case class Transaction(userid: Int, product: String, cost: Double)

val users = (spark.read
    .option("inferSchema", "true")
    .option("header", "true")
    .csv("data/users.csv")
    .as[User])
    
val transactions = (spark.read
    .option("inferSchema", "true")
    .option("header", "true")
    .csv("data/transactions.csv")
    .as[Transaction])

(users.join(transactions, users.col("id") === transactions.col("userid"))
    .groupBy($"name")
    .sum("cost")
    .show)

+-------+---------+
|   name|sum(cost)|
+-------+---------+
| Denise|       10|
|    Amy|       70|
|Charlie|       40|
|    Bob|       20|
+-------+---------+



**Exercise:** Notice that there's a user ("Edward") who did not make a purchase.  He disappeared from the result above because `join` performs an inner join by default.  Instead, do a left outer join so we have a record that he spent nothing.

# Structured Streaming Overview

So far, we have only talked about batch operations in the form of RDDs and Datasets.  We will now introduce Spark Structured Streaming.  The easiest way to think about Spark Structured Streaming is as a dataset with an infinite number of rows.  As more data is added, more rows are appended to the dataset.

![](images/structured-streaming-stream-as-a-table.png)

This interface has a number of benefits.  Primary among them is that the API for streaming is almost exactly the same as the API for datasets.

With Spark Structured Streaming, we define a trigger which fires regularly (for example, every second).  When the trigger fires, more input (table rows) is appended to the "dataset".  We can then write queries (for example, wordcount) against the entire (ever-growing) "dataset".  Of course, the queries are computed incrementally so that we only have to process the new data at each time point.

![](images/structured-streaming-model.png)
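
The idea of incremental computation can be illustrated with a toy model in plain Scala (no Spark involved). This is only a sketch of the concept, not how Spark manages state internally; the batches and the helper names `countBatch` and `merge` are invented for the example.

```scala
// Each inner Seq stands in for the rows that arrive on one trigger.
val batches: Seq[Seq[String]] = Seq(
  Seq("the moor", "the general"),
  Seq("honest iago"),
  Seq("the handkerchief")
)

// Count words within a single batch only.
def countBatch(lines: Seq[String]): Map[String, Long] = (
  lines.flatMap(_.split("\\s+"))
    .groupBy(_.toLowerCase)
    .map { case (word, occurrences) => word -> occurrences.size.toLong }
)

// Merge one batch's counts into the running state without revisiting old rows.
def merge(state: Map[String, Long], update: Map[String, Long]): Map[String, Long] =
  update.foldLeft(state) { case (acc, (word, n)) =>
    acc.updated(word, acc.getOrElse(word, 0L) + n)
  }

// scanLeft keeps every intermediate result: one full table per trigger.
val resultsPerTrigger: Seq[Map[String, Long]] =
  batches.map(countBatch).scanLeft(Map.empty[String, Long])(merge).tail

resultsPerTrigger.zipWithIndex.foreach { case (counts, i) =>
  println(s"Batch $i: the -> ${counts.getOrElse("the", 0L)}")
}
```

Each trigger touches only the new rows, yet the result after every trigger is the count over everything seen so far: the running count for `the` is 2, 2, and then 3.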

Finally, it's worth keeping in mind that with streaming, Spark is still distributing the workload across multiple nodes.  Your data is essentially distributed along two dimensions: time (as the data streams in) and nodes.
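
The trigger mentioned above can also be set explicitly. The sketch below assumes Spark 2.2 or later, where `Trigger.ProcessingTime` and the built-in `rate` test source are available; the names `ticks` and `tickQuery`, the row rate, and the five-second wait are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val session = (SparkSession.builder()
    .master("local[*]")
    .appName("trigger-sketch")
    .getOrCreate())

// The "rate" source generates rows with `timestamp` and `value` columns,
// so no input files are needed.
val ticks = (session.readStream
    .format("rate")
    .option("rowsPerSecond", 5)
    .load())

// Fire once per second and print whatever rows arrived since the last trigger.
val tickQuery = (ticks.writeStream
    .format("console")
    .outputMode("append")
    .trigger(Trigger.ProcessingTime("1 second"))
    .start())

tickQuery.awaitTermination(5000) // wait up to 5 seconds while a few triggers fire
tickQuery.stop()
```

If no trigger is specified, Spark starts each micro-batch as soon as the previous one has finished.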

<!--- Images are from https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html --->

# Spark Structured Streaming Wordcount Example

Below, we have chunked Othello into multiple parts.  Structured Streaming provides a useful option for reading in each part on a separate trigger, allowing us to fake a stream from a set of files.

Note how similar the Structured Streaming API is to the dataset API.

In [6]:
// Load data
val text = (spark.readStream
    .option("maxFilesPerTrigger", 1)  // fake a stream: read one file per trigger
    .text("data/othello/part*")
    .as[String])
    
// Count words
val counts = (text.flatMap(line => line.split("\\s+"))
    .groupByKey(_.toLowerCase)
    .count)
    
// Print counts
val query = (counts
    .orderBy($"count(1)".desc)
    .writeStream
    .outputMode("complete")
    .format("console")
    .start)

-------------------------------------------
Batch: 0
-------------------------------------------
+-----+--------+
|value|count(1)|
+-----+--------+
|  the|     206|
|  and|     191|
|   of|     166|
|    i|     163|
|   to|     147|
|   my|     109|
|    a|     105|
|   in|      75|
| with|      70|
|   is|      68|
| that|      59|
| your|      57|
|  you|      56|
|  for|      53|
|   it|      52|
| have|      47|
|  not|      46|
|   be|      45|
|   do|      42|
|  but|      40|
+-----+--------+
only showing top 20 rows

-------------------------------------------
Batch: 1
-------------------------------------------
+-----+--------+
|value|count(1)|
+-----+--------+
|  the|     407|
|  and|     376|
|    i|     293|
|   to|     286|
|   of|     268|
|    a|     203|
|   my|     158|
|   in|     147|
| that|     134|
|   is|     129|
|  you|     112|
| with|     110|
|  for|     109|
| iago|      98|
| your|      97|
|  not|      94|
|   it|      90|
| this|      86|
| have|      83|
