# Spark Structured Streaming

- Structured Streaming
- How does Structured Streaming differ from batch? The input is treated as an unbounded table that keeps growing as new data arrives.

http://hortonworks.com/hadoop-tutorial/introduction-spark-streaming/
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#programming-model
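
The "unbounded table" idea can be illustrated without Spark. The sketch below is a conceptual simulation in plain Scala collections, not the Spark API: each micro-batch of lines is folded into a running word count, and the full result is reported after every batch, which is roughly what the `complete` output mode does. The names `UnboundedTableSketch`, `update`, and `runningCounts` are made up for this illustration.

```scala
// Conceptual simulation only: plain Scala collections, no Spark API.
object UnboundedTableSketch {
  type Counts = Map[String, Long]

  // Fold one micro-batch of lines into the running word counts.
  def update(counts: Counts, batch: Seq[String]): Counts =
    batch
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(_.toLowerCase)
      .foldLeft(counts)((acc, w) => acc.updated(w, acc.getOrElse(w, 0L) + 1L))

  // The full result after each batch, as "complete" output mode would report it.
  def runningCounts(batches: Seq[Seq[String]]): Seq[Counts] =
    batches.scanLeft(Map.empty[String, Long])(update).tail
}
```

Calling `UnboundedTableSketch.runningCounts(Seq(Seq("hi"), Seq("hi there"), Seq("there")))` yields one map per batch, with the counts growing from batch to batch instead of restarting.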

In [2]:
val sparkDummy = spark
import sparkDummy.implicits._

## A Network Streaming Example

### Streaming Shakespeare

In [4]:
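// Count the words arriving on a local socket and print the running totals.
// `duration` is the time to wait, in milliseconds, before returning.
def createStream(port: Int, duration: Int): Unit = {
    val lines = (spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", port)
        .load())

    val words = (lines
        .as[String]
        .flatMap(_.split("\\s+")))

    // Group case-insensitively; most frequent words first.
    val wordCounts = words.groupByKey(_.toLowerCase).count().orderBy($"count(1)".desc)

    val query = (wordCounts.writeStream
        .outputMode("complete")
        .format("console")
        .start())

    // Returns after the timeout; the query itself is not stopped here.
    query.awaitTermination(duration)
}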

In [19]:
import sys.process._

// Print the sonnet that will be streamed over the socket.
"more notebooks/data/summer.txt".!

Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate.
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date.
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed;
And every fair from fair sometime declines,
By chance, or nature’s changing course, untrimmed;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st,
Nor shall death brag thou wand’rest in his shade,
When in eternal lines to Time thou grow’st.
So long as men can breathe, or eyes can see,
So long lives this, and this gives life to thee.

In [20]:
val port = 9001

// Run the line server in a background thread so the notebook stays responsive.
(new Thread {
    override def run(): Unit = {
        s"scala notebooks/Broadcast.scala ${port} notebooks/data/summer.txt".!
    }
}).start()
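
The contents of `notebooks/Broadcast.scala` are not shown here. As a rough idea of what such a script might do, the following is a hypothetical stand-in, not the actual script: it accepts one connection on the given port and sends the file line by line, pausing between lines so that they land in different micro-batches. The names `LineServer`, `serve`, and `delayMs` are assumptions made for this sketch.

```scala
import java.io.{OutputStreamWriter, PrintWriter}
import java.net.ServerSocket
import java.nio.charset.StandardCharsets
import scala.io.Source

// Hypothetical stand-in for notebooks/Broadcast.scala (the real script is not shown).
object LineServer {
  // Accept one client on `port` and send it the lines of `path`,
  // sleeping `delayMs` after each line.
  def serve(port: Int, path: String, delayMs: Long = 1000L): Unit = {
    val server = new ServerSocket(port)
    try {
      val client = server.accept() // blocks until the socket source connects
      val out = new PrintWriter(
        new OutputStreamWriter(client.getOutputStream, StandardCharsets.UTF_8),
        true) // autoflush on println
      val source = Source.fromFile(path, "UTF-8")
      try {
        for (line <- source.getLines()) {
          out.println(line)
          Thread.sleep(delayMs)
        }
      } finally {
        source.close()
        out.close()
        client.close()
      }
    } finally {
      server.close()
    }
  }
}
```

With this sketch, `LineServer.serve(9001, "notebooks/data/summer.txt")` would play the same role as the external script: the server has to be listening before `createStream` connects to the port.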

In [21]:
createStream(port, 12000)

-------------------------------------------
Batch: 0
-------------------------------------------
+--------+--------+
|   value|count(1)|
+--------+--------+
|    thee|       1|
|summer’s|       1|
|       i|       1|
|    day?|       1|
|   shall|       1|
|       a|       1|
|      to|       1|
| compare|       1|
+--------+--------+

-------------------------------------------
Batch: 1
-------------------------------------------
+--------+--------+
|   value|count(1)|
+--------+--------+
|    more|       2|
|     too|       2|
|summer’s|       2|
|     the|       2|
|     and|       2|
|      of|       2|
|       a|       2|
|     art|       1|
|    thou|       1|
|  lovely|       1|
|     hot|       1|
|   winds|       1|
|    thee|       1|
|    buds|       1|
| shines,|       1|
|     eye|       1|
|   date.|       1|
|   lease|       1|
| darling|       1|
|    hath|       1|
+--------+--------+
only showing top 20 rows

-------------------------------------------
Batch: 2
------
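
The counts above treat `day?`, `shines,` and `date.` as words of their own, because the stream splits on whitespace only. One possible refinement is to strip leading and trailing ASCII punctuation before counting. The helper below is a hypothetical sketch (`Tokenizer.tokenize` is not part of the notebook); inside `createStream` it could replace `_.split("\\s+")` in the `flatMap`, and the later `toLowerCase` grouping would then be redundant.

```scala
// Hypothetical helper, not part of the original notebook.
object Tokenizer {
  // Lowercase, split on whitespace, and strip leading/trailing ASCII punctuation.
  // Typographic apostrophes inside a word (as in "summer’s") are left alone.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase
      .split("\\s+")
      .map(_.replaceAll("^\\p{Punct}+|\\p{Punct}+$", ""))
      .filter(_.nonEmpty)
      .toSeq
}
```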

### Streaming from Netcat

In [6]:
// run `nc -lk 9000` in bash and start typing!

createStream(9000, 10000)

-------------------------------------------
Batch: 0
-------------------------------------------
+-----+--------+
|value|count(1)|
+-----+--------+
|   hi|       1|
+-----+--------+

-------------------------------------------
Batch: 1
-------------------------------------------
+-----+--------+
|value|count(1)|
+-----+--------+
|   hi|       2|
|there|       1|
+-----+--------+

-------------------------------------------
Batch: 2
-------------------------------------------
+-----+--------+
|value|count(1)|
+-----+--------+
|there|       2|
|   hi|       2|
+-----+--------+



# Parsing Data Using Schemas

In [19]:
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.ScalaReflection

In [37]:
case class Person(name: String, city: String, country: String, age: Int)

val explicitSchema = StructType(
  StructField("name", StringType, false) ::
  StructField("city", StringType, true) ::
  StructField("country", StringType, true) ::
  StructField("age", IntegerType, true) :: Nil
)

val caseSchema = ScalaReflection.schemaFor[Person].dataType.asInstanceOf[StructType]

// Streaming file sources need a schema up front.
// ("header" is a CSV option and has no effect on JSON, so it is omitted.)
val in = (spark.readStream
  .schema(caseSchema)
  .option("maxFilesPerTrigger", 1)
  .json("data/people.json")
  .as[Person])

(in.writeStream
    .outputMode("append")
    .format("console")
    .start())


-------------------------------------------
Batch: 0
-------------------------------------------
+-------+----+-------+----+
|   name|city|country| age|
+-------+----+-------+----+
|Michael|null|   null|null|
|   Andy|null|   null|  30|
| Justin|null|   null|  19|
| Justin|null|   null|  26|
| Justin|null|   null|  15|
+-------+----+-------+----+
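
`ScalaReflection` lives in Spark's internal `catalyst` package. A similar schema can probably be obtained through the public `Encoders` API instead; the sketch below assumes the same `Person` case class (repeated here so the block stands alone), and the wrapper object `PersonSchema` is a name invented for this example.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Same shape as the Person case class used above.
case class Person(name: String, city: String, country: String, age: Int)

object PersonSchema {
  // Derive the schema from the case class via the public Encoders API.
  val schema: StructType = Encoders.product[Person].schema
}
```

`PersonSchema.schema` can then be passed to `.schema(...)` on `spark.readStream` in the same way as `caseSchema`.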



In [13]:


spark.readStream.load("data/people.csv").as[Person]

Name: Unknown
Message: Unable to retrieve error!
StackTrace: 
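
The notebook did not report the underlying error, so the cause is a guess: `load` without a `format` falls back to the default source (parquet), and streaming file sources normally require an explicit schema. A minimal sketch of a CSV read that supplies both follows. It assumes the CSV files have a header row and columns matching `Person`; `CsvStreamSketch` and `readPeople` are hypothetical names.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Same shape as the Person case class used above.
case class Person(name: String, city: String, country: String, age: Int)

object CsvStreamSketch {
  // Build a streaming Dataset[Person] from CSV files that appear in `dir`.
  def readPeople(spark: SparkSession, dir: String): Dataset[Person] = {
    import spark.implicits._
    spark.readStream
      .format("csv")                            // without this, the default format is parquet
      .schema(Encoders.product[Person].schema)  // streaming file sources need a schema
      .option("header", true)                   // assumes each file starts with a header row
      .load(dir)
      .as[Person]
  }
}
```

A call might look like `CsvStreamSketch.readPeople(spark, "data/people/")`, where the directory name is an assumption; the path has to exist when the stream is defined.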

In [4]:
// Batch read for comparison. JSON sources infer the schema by default;
// "header" and "inferSchema" are CSV options, so they are omitted.
val df = (spark.read
  .json("data/people.json"))

df.groupBy("age").count()

[age: bigint, count: bigint]

# Custom Reducer

# Joining with Datasets

- Broadcast variables