# Spark, Datasets, and Structured Streaming

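- RDD (Resilient Distributed Dataset)
- Resilient: lost partitions can be recomputed from their lineage
- Distributed: the data is partitioned across the cluster
- In RAM: partitions can be cached in memory
- etc ...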
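
The properties listed above can be observed on a small RDD. This is only a minimal sketch, assuming Spark is available in local mode; the names `session` and `sparkContext`, the app name, and the toy data are illustrative and not part of the original notebook.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside a notebook, getOrCreate() reuses the existing one
val session = SparkSession.builder().master("local[2]").appName("rdd-sketch").getOrCreate()
val sparkContext = session.sparkContext

// Distributed: the data is split into partitions (4 requested here)
val numbers = sparkContext.parallelize(1 to 100, numSlices = 4)
val numPartitions = numbers.getNumPartitions

// In RAM: cache() asks Spark to keep the computed partitions in memory
val squares = numbers.map(n => n * n).cache()

// Resilient: the lineage describes how to recompute a lost partition
val lineage = squares.toDebugString

// Nothing runs until an action such as reduce is called
val total = squares.reduce(_ + _)
println(s"partitions=$numPartitions total=$total")
println(lineage)
```

Printing `lineage` shows the chain of transformations Spark would replay if a cached partition were lost.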

# Spark in Action

Below is a simple Spark program that counts the words in Shakespeare's Othello. Notice how much more concise the code is compared to

In [34]:
// Load Data
val lines = sc.textFile("data/othello.txt")

// Count words
val counts = (lines.flatMap(line => line.split("\\s+"))
    .map(word => (word.toLowerCase, 1))
    .reduceByKey(_ + _))
    
(counts.sortBy(_._2, ascending=false)
    .take(20)
    .foreach(println))

(i,803)
(and,769)
(the,756)
(to,572)
(of,470)
(a,441)
(my,427)
(that,337)
(you,335)
(in,319)
(iago,299)
(othello,292)
(not,278)
(is,276)
(it,244)
(with,221)
(for,219)
(be,208)
(your,207)
(he,205)


**Exercise**:

The Spark interface lets you express the operations you would otherwise write in SQL.
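
To make the comparison with SQL concrete, here is a hedged sketch that mirrors a single SQL query with RDD operations. The toy `sales` data and all names are hypothetical, and only a few SQL clauses are illustrated.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside a notebook, getOrCreate() reuses the existing one
val session = SparkSession.builder().master("local[2]").appName("sql-like-sketch").getOrCreate()
val sparkContext = session.sparkContext

// Hypothetical toy data: (country, amount)
val sales = sparkContext.parallelize(Seq(
  ("US", 10.0), ("UK", 4.0), ("US", 7.5), ("UK", 20.0), ("FR", 3.0)))

// SELECT country, SUM(amount) AS total FROM sales
// WHERE amount > 5 GROUP BY country ORDER BY total DESC
val totals = sales
  .filter { case (_, amount) => amount > 5 }  // WHERE
  .reduceByKey(_ + _)                         // GROUP BY with SUM
  .sortBy(_._2, ascending = false)            // ORDER BY ... DESC
  .collect()

totals.foreach(println)
```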

# A More Complex Spark Example

One common use case is joining two datasets on a shared key.

In [6]:
case class User(id: Int, name: String, email: String, country: String)
case class Transaction(userid: Int, product: String, cost: Double)

val users = (sc.textFile("data/users.csv")
    .map{ t =>
        val p = t.split(",")
        User(p(0).toInt, p(1), p(2), p(3))
    })
        
val transactions = (sc.textFile("data/transactions.csv")
    .map{ t =>
        val p = t.split(",")
        Transaction(p(0).toInt, p(1), p(2).toDouble)
    })

val userTransactions = (users.map(u => (u.id, u.name))
    .join(transactions.map(t => (t.userid, t.cost)))
).values

val transactionsByUsers = userTransactions.reduceByKey(_ + _).collect

transactionsByUsers.foreach(println)

(Denise,10.0)
(Charlie,40.0)
(Amy,70.0)
(Bob,20.0)


**Exercise:** Notice that there's a user ("Edward") who did not make a purchase. He disappeared from `transactionsByUsers` because `join` is an inner join. Instead, use `leftOuterJoin` so that we have a record showing he spent nothing. Notice that the right-hand side of each joined value is now an `Option`. You may find `flatMap` useful for this exercise.
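
As a hint for the exercise, the sketch below shows the shape of a `leftOuterJoin` result on hypothetical toy data (not the notebook's CSV files). Using `getOrElse` to supply a default is just one way to handle the `Option`; it is not presented as the intended solution.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside a notebook, getOrCreate() reuses the existing one
val session = SparkSession.builder().master("local[2]").appName("outer-join-sketch").getOrCreate()
val sparkContext = session.sparkContext

// Hypothetical toy data: key 2 has no match on the right-hand side
val left = sparkContext.parallelize(Seq((1, "a"), (2, "b")))
val right = sparkContext.parallelize(Seq((1, 10.0)))

// Every left key is kept; the right value is Some(v) on a match and None otherwise
val joined = left.leftOuterJoin(right) // RDD[(Int, (String, Option[Double]))]

// Replace a missing right-hand value with a default of 0.0
val filled = joined
  .mapValues { case (label, amount) => (label, amount.getOrElse(0.0)) }
  .collect()
  .sortBy(_._1)

filled.foreach(println)
```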

# Spark and Datasets

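- RDDs can be slow because they use Java serialization by default (Kryo can be used instead).
- DataFrames are optimized by Catalyst but don't provide compile-time type safety.
- Datasets are also optimized by Catalyst and keep compile-time type safety.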

http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/
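
The type-safety point can be illustrated with a small sketch. The `Purchase` case class, the sample rows, and the misspelled column name `cots` are all hypothetical and exist only for this example.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

// Hypothetical record type for the illustration
case class Purchase(product: String, cost: Double)

// Assumption: a local session; inside a notebook, getOrCreate() reuses the existing one
val session = SparkSession.builder().master("local[2]").appName("type-safety-sketch").getOrCreate()
import session.implicits._

val purchases = Seq(Purchase("book", 10.0), Purchase("pen", 2.5)).toDS()

// Dataset: field access is checked by the Scala compiler
val typedCosts = purchases.map(_.cost).collect()
// purchases.map(_.cots)  // would not compile: cots is not a member of Purchase

// DataFrame: columns are looked up by name, so a typo only fails when the code runs
val frame = purchases.toDF()
val misspelled = Try(frame.select("cots"))

println(misspelled.isFailure)
```

With the Dataset, the misspelling is rejected before the job is ever submitted; with the DataFrame, it surfaces as an exception at runtime.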

# An example with Datasets

In [3]:
// Normally you would simply run `import spark.implicits._`.
// A namespace collision in Spark Notebook requires importing through an alias instead.

val sparkDummy = spark
import sparkDummy.implicits._

In [4]:
// Load Data
val text = (spark.read
    .text("data/othello.txt").as[String])

// Count words
val counts = (text.flatMap(line => line.split("\\s+"))
    .groupByKey(_.toLowerCase)
    .count)

// Display most common
counts.orderBy($"count(1)".desc).show

+-------+--------+
|  value|count(1)|
+-------+--------+
|      i|     803|
|    and|     769|
|    the|     756|
|     to|     572|
|     of|     470|
|      a|     441|
|     my|     427|
|   that|     337|
|    you|     335|
|     in|     319|
|   iago|     299|
|othello|     292|
|    not|     278|
|     is|     276|
|     it|     244|
|   with|     221|
|    for|     219|
|     be|     208|
|   your|     207|
|     he|     205|
+-------+--------+
only showing top 20 rows



# Towards Structured Streaming

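Structured Streaming is built on Datasets, so a streaming job looks almost identical to its batch counterpart. The word count below differs from the Dataset version only in reading with `spark.readStream` instead of `spark.read` and in writing the result with `writeStream`.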

In [5]:
// Load Data
val text = (spark.readStream
    .text("data/othello.txt")
    .as[String])
    
// Count words
val counts = (text.flatMap(line => line.split("\\s+"))
    .groupByKey(_.toLowerCase)
    .count)
    
val query = (counts
    .orderBy($"count(1)".desc)
    .writeStream
    .outputMode("complete")
    .format("console")
    .start)

-------------------------------------------
Batch: 0
-------------------------------------------
+-------+--------+
|  value|count(1)|
+-------+--------+
|      i|     803|
|    and|     769|
|    the|     756|
|     to|     572|
|     of|     470|
|      a|     441|
|     my|     427|
|   that|     337|
|    you|     335|
|     in|     319|
|   iago|     299|
|othello|     292|
|    not|     278|
|     is|     276|
|     it|     244|
|   with|     221|
|    for|     219|
|     be|     208|
|   your|     207|
|     he|     205|
+-------+--------+
only showing top 20 rows
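
A started streaming query keeps running until it is stopped. The following sketch shows one way to run a streaming word count to completion on a fixed input and then shut it down. It assumes a local session and a hypothetical temporary input directory, and it writes to the in-memory sink under the made-up table name `word_counts` rather than to the console.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside a notebook, getOrCreate() reuses the existing one
val session = SparkSession.builder().master("local[2]").appName("streaming-sketch").getOrCreate()
import session.implicits._

// Hypothetical input: a temporary directory holding one small text file
val inputDir = Files.createTempDirectory("stream-input")
Files.write(inputDir.resolve("lines.txt"), "to be or not to be".getBytes("UTF-8"))

// Same word count as above, read as a stream
val wordCounts = session.readStream
  .text(inputDir.toString)
  .as[String]
  .flatMap(line => line.split("\\s+"))
  .groupByKey(_.toLowerCase)
  .count()

// Write the full result table to an in-memory table instead of the console
val streamingQuery = wordCounts.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("word_counts")
  .start()

// Block until the files currently in the directory are processed, then read the result
streamingQuery.processAllAvailable()
val result = session.table("word_counts").as[(String, Long)].collect().toMap

// Stop the query so it does not keep running in the background
streamingQuery.stop()
println(result)
```

The in-memory sink is meant for small results and debugging; it is used here only so that the counts can be inspected after the query stops.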

