# Introduction to Spark and Spark Structured Streaming

## Introduction to Spark

In [3]:
// Outside a notebook, `import spark.implicits._` is all you need.
// In a Spark Notebook that import hits a namespace collision,
// so alias the session to a new val and import from the alias instead.

val sparkDummy = spark
import sparkDummy.implicits._

## Word Count in Spark

In [2]:
// Load Data
val text = (spark.read
    .text("notebooks/data/othello.txt").as[String])

// Count words
val counts = (text.flatMap(line => line.split("\\s+"))
    .groupByKey(_.toLowerCase)
    .count)

// Display most common
counts.orderBy($"count(1)".desc).show

+-------+--------+
|  value|count(1)|
+-------+--------+
|      i|     803|
|    and|     769|
|    the|     756|
|     to|     572|
|     of|     470|
|      a|     441|
|     my|     427|
|   that|     337|
|    you|     335|
|     in|     319|
|   iago|     299|
|othello|     292|
|    not|     278|
|     is|     276|
|     it|     244|
|   with|     221|
|    for|     219|
|     be|     208|
|   your|     207|
|     he|     205|
+-------+--------+
only showing top 20 rows
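
For intuition, the same split, lowercase, group, and count steps can be sketched on plain Scala collections, with no Spark session needed. This is a minimal illustration, not a description of how Spark executes the query. `sampleLines` is made-up input, not the contents of `othello.txt`.

```scala
// Plain-collections sketch of the word count above (no Spark required).
// `sampleLines` is made-up input for illustration only.
val sampleLines = Seq("To be or not to be", "that is the question")

val wordCounts: Map[String, Int] = sampleLines
  .flatMap(_.split("\\s+"))   // split each line on whitespace
  .groupBy(_.toLowerCase)     // group words case-insensitively
  .map { case (word, occurrences) => word -> occurrences.size }

// Sort by descending count, mirroring orderBy on the count column
val mostCommon = wordCounts.toSeq.sortBy { case (_, n) => -n }
mostCommon.foreach { case (word, n) => println(s"$word\t$n") }
```

The Spark version has the same shape, but `groupByKey(...).count` runs across partitions instead of building one in-memory `Map`.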



## Introduction to Spark Structured Streaming

## Streaming Word Count

In [11]:
// Load Data
val text = (spark.readStream
    .text("notebooks/data/othello.txt")
    .as[String])
    
// Count words
val counts = (text.flatMap(line => line.split("\\s+"))
    .groupByKey(_.toLowerCase)
    .count)
    
val query = (counts
    .orderBy($"count(1)".desc)
    .writeStream
    .outputMode("complete")
    .format("console")
    .start)

-------------------------------------------
Batch: 0
-------------------------------------------
+-------+--------+
|  value|count(1)|
+-------+--------+
|      i|     803|
|    and|     769|
|    the|     756|
|     to|     572|
|     of|     470|
|      a|     441|
|     my|     427|
|   that|     337|
|    you|     335|
|     in|     319|
|   iago|     299|
|othello|     292|
|    not|     278|
|     is|     276|
|     it|     244|
|   with|     221|
|    for|     219|
|     be|     208|
|   your|     207|
|     he|     205|
+-------+--------+
only showing top 20 rows



## A Network Streaming Example

### Streaming Shakespeare

In [4]:
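// Count words arriving on a local socket and print the running totals
// to the console, blocking for at most `duration` milliseconds.
def createStream(port: Int, duration: Int): Unit = {
    val lines = (spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", port)
        .load())

    val words = (lines
        .as[String]
        .flatMap(_.split("\\s+")))

    val wordCounts = words.groupByKey(_.toLowerCase).count().orderBy($"count(1)".desc)

    val query = (wordCounts.writeStream
        .outputMode("complete")
        .format("console")
        .start)

    // Returns after `duration` ms even if the query is still running
    query.awaitTermination(duration)
}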

In [19]:
import sys.process._

// Print the text that will be streamed over the socket
"more notebooks/data/summer.txt".!

Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate.
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date.
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed;
And every fair from fair sometime declines,
By chance, or nature’s changing course, untrimmed;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st,
Nor shall death brag thou wand’rest in his shade,
When in eternal lines to Time thou grow’st.
So long as men can breathe, or eyes can see,
So long lives this, and this gives life to thee.

In [20]:
val port = 9001

// Run the broadcaster in a background thread so this cell returns immediately
(new Thread {
    override def run(): Unit = {
        s"scala notebooks/Broadcast.scala ${port} notebooks/data/summer.txt".!
    }
}).start()
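
The source of `notebooks/Broadcast.scala` is not shown in this notebook. As a rough idea of what such a script might do, here is a hypothetical stand-in: it waits for one client (here, Spark's socket source) and sends lines one at a time with a pause between them, so they fall into different micro-batches. The function name `broadcast`, its parameters, and the delay are all assumptions for illustration.

```scala
import java.io.{OutputStreamWriter, PrintWriter}
import java.net.ServerSocket
import java.nio.charset.StandardCharsets

// Hypothetical stand-in for notebooks/Broadcast.scala (the real script is not shown).
// Waits for one client, then sends `lines` one at a time, pausing `delayMs` between them.
def broadcast(server: ServerSocket, lines: Seq[String], delayMs: Long): Unit = {
  val client = server.accept() // blocks until a client connects
  val out = new PrintWriter(
    new OutputStreamWriter(client.getOutputStream, StandardCharsets.UTF_8),
    true // autoflush on println so each line is sent immediately
  )
  try {
    lines.foreach { line =>
      out.println(line)
      Thread.sleep(delayMs)
    }
  } finally {
    out.close()
    client.close()
    server.close()
  }
}
```

Under these assumptions, a call such as `broadcast(new ServerSocket(port), lines, 1000L)`, with `lines` read from `notebooks/data/summer.txt`, would play the same role as the cell above.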

In [21]:
createStream(port, 12000)

-------------------------------------------
Batch: 0
-------------------------------------------
+--------+--------+
|   value|count(1)|
+--------+--------+
|    thee|       1|
|summer’s|       1|
|       i|       1|
|    day?|       1|
|   shall|       1|
|       a|       1|
|      to|       1|
| compare|       1|
+--------+--------+

-------------------------------------------
Batch: 1
-------------------------------------------
+--------+--------+
|   value|count(1)|
+--------+--------+
|    more|       2|
|     too|       2|
|summer’s|       2|
|     the|       2|
|     and|       2|
|      of|       2|
|       a|       2|
|     art|       1|
|    thou|       1|
|  lovely|       1|
|     hot|       1|
|   winds|       1|
|    thee|       1|
|    buds|       1|
| shines,|       1|
|     eye|       1|
|   date.|       1|
|   lease|       1|
| darling|       1|
|    hath|       1|
+--------+--------+
only showing top 20 rows

-------------------------------------------
Batch: 2
------

### Streaming Netcat

In [6]:
// Before running this cell, start `nc -lk 9000` in a shell so the port is listening,
// then type lines into that shell while the query runs.

createStream(9000, 10000)

-------------------------------------------
Batch: 0
-------------------------------------------
+-----+--------+
|value|count(1)|
+-----+--------+
|   hi|       1|
+-----+--------+

-------------------------------------------
Batch: 1
-------------------------------------------
+-----+--------+
|value|count(1)|
+-----+--------+
|   hi|       2|
|there|       1|
+-----+--------+

-------------------------------------------
Batch: 2
-------------------------------------------
+-----+--------+
|value|count(1)|
+-----+--------+
|there|       2|
|   hi|       2|
+-----+--------+
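
Each batch above reprints the whole table because the query uses the `complete` output mode, which emits the full aggregated result after every trigger rather than only the changed rows. The sketch below imitates that behaviour on plain collections. `batches` is a guess at what was typed into `nc`, inferred from the output, and the code is not how Spark maintains state internally.

```scala
// Sketch of "complete" output mode on plain collections (no Spark required).
// `batches` is assumed input, inferred from the console output above.
val batches = Seq(Seq("hi"), Seq("hi", "there"), Seq("there"))

// Fold each micro-batch into the running totals and keep every intermediate state.
val snapshots: Seq[Map[String, Long]] = batches
  .scanLeft(Map.empty[String, Long]) { (totals, batch) =>
    batch.foldLeft(totals) { (acc, word) =>
      val key = word.toLowerCase
      acc.updated(key, acc.getOrElse(key, 0L) + 1L)
    }
  }
  .tail // drop the empty initial state

// Print the full table after every batch, as complete mode does
snapshots.zipWithIndex.foreach { case (totals, i) =>
  println(s"Batch: $i")
  totals.toSeq.sortBy { case (_, n) => -n }.foreach { case (w, n) => println(s"$w\t$n") }
}
```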

