### Spark Streaming Demo

#### University of Virginia
#### DS 7200: Distributed Computing
#### Last updated: November 1, 2023

---

### Sources

- [netcat](https://netcat.sourceforge.net/)
- Spark structured streaming [documentation](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
- Spark streaming source code [example](https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py)
- Spark **structured streaming** source code [example](https://github.com/apache/spark/blob/master/examples/src/main/python/sql/streaming/structured_network_wordcount.py)

---

### About  

This demo runs word count programs on text received from a data server listening on a TCP socket.  
The *netcat* utility acts as the backend server.  
On the client side, Spark Streaming connects to the socket and computes the word counts.  

We review two examples that take different approaches. One task is to note the differences between them.

1. Streaming word count (following the SparkStreaming notebook), which uses the DStream API.
2. Structured streaming word count, which uses a Spark DataFrame.  
  The two important entry points are `readStream` and `writeStream`, which are accessed as attributes, without parentheses.

---

### Instructions

Review the results below and then clear out the results to start over.

Try setting up the demos and running them. For each example, you will:

1) Start *netcat* in a separate terminal and enter text  
2) Run the Spark code

Batches of results should print below.  
Entering more text in *netcat* will trigger additional output below.

When you are finished with each demo, stop the kernel.

---

### Background for Setup

#### Backend Server

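For the backend server, open a terminal and run the following at the command line:  

`nc -lk 9999`

Here `-l` puts *netcat* in listen mode and `-k` keeps it listening after a client disconnects (flag support varies between *netcat* variants).

Next, enter lines of text, pressing ENTER to complete each line.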
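If *netcat* is not available, a small Python stand-in can play the same role. The following is a minimal sketch, not part of the original demo: the helper names `open_listener` and `send_lines` are hypothetical, and unlike `nc -lk` it serves a single client connection and then stops.

```python
import socket


def open_listener(host="localhost", port=9999):
    """Open a TCP socket that listens on host:port (hypothetical helper)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts on the same port between experiments.
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    return srv


def send_lines(srv, lines):
    """Wait for one client (e.g. the Spark job) and send it each line."""
    conn, _ = srv.accept()  # blocks until a client connects
    with conn:
        for line in lines:
            # The Spark socket sources read newline-terminated UTF-8 text.
            conn.sendall((line.rstrip("\n") + "\n").encode("utf-8"))
```

From a script, a call such as `send_lines(open_listener(), sys.stdin)` would forward typed lines to the connected client, assuming `sys` is imported.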

#### Frontend Client

For the frontend client, you will run the Spark Streaming code below.

Each time you enter more text in *netcat*, this will trigger a new batch of results through Spark.

To quit an experiment and run another, you can restart the kernel.

### I. Streaming Word Count (DStream)

In [None]:
# import modules

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

In [None]:
# socket parameters
host = 'localhost'
port = 9999

# set up context; local[2] provides one thread for the socket receiver
# and one for processing the batches
spark = SparkSession.builder\
        .master("local[2]")\
        .appName("NetworkWordCount")\
        .getOrCreate()

sc = spark.sparkContext
# streaming context with a batch interval of 1 second
ssc = StreamingContext(sc, 1)

In [None]:
# create text stream object
lines = ssc.socketTextStream(host, port)

# calculate word counts
counts = lines.flatMap(lambda line: line.split(" "))\
              .map(lambda word: (word, 1))\
              .reduceByKey(lambda a, b: a + b)

# print the first ten counts of each batch
counts.pprint()

# start the computation and block until it is stopped
ssc.start()
ssc.awaitTermination()

When finished, you can shut down the kernel.
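To see what the `flatMap`, `map`, and `reduceByKey` chain computes for a single batch, the logic can be mirrored in plain Python without Spark. This is only an illustrative sketch: the function name `batch_word_count` is hypothetical, and it assumes each batch is handled on its own with no state carried over from earlier batches.

```python
def batch_word_count(batch_lines):
    """Count the words in one batch of lines, mirroring the DStream steps."""
    # flatMap: split every line on single spaces and flatten into one list
    words = [word for line in batch_lines for word in line.split(" ")]
    # map: pair each word with a count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: sum the counts that share the same word
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts


# Two batches are counted independently of each other.
first = batch_word_count(["hello world", "hello spark"])
second = batch_word_count(["hello again"])
```

Because the split is on a single space, an empty line or a run of consecutive spaces produces empty-string "words" that are counted as well.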

### II. Structured Streaming Word Count

In [None]:
# import modules

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

In [None]:
# socket parameters
host = 'localhost'
port = 9999

spark = SparkSession\
    .builder\
    .appName("StructuredNetworkWordCount")\
    .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to host:port
lines = spark\
    .readStream\
    .format('socket')\
    .option('host', host)\
    .option('port', port)\
    .load()

In [None]:
# Split the lines into words. `explode()` turns each array item into a separate row.
words = lines.select(
    explode(
        split(lines.value, ' ')
    ).alias('word')
)

# Generate running word count
wordCounts = words.groupBy('word').count()

# Start running the query that prints the running counts to the console
query = wordCounts\
    .writeStream\
    .outputMode('complete')\
    .format('console')\
    .start()

query.awaitTermination()

When finished, you can shut down the kernel.
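The query above keeps a running count, and the `complete` output mode prints the whole result table after every trigger. A plain-Python sketch of that behavior, using a hypothetical `RunningWordCount` class that is not part of Spark, might look like this:

```python
from collections import Counter


class RunningWordCount:
    """Keep word totals across batches, like a running aggregation."""

    def __init__(self):
        self.totals = Counter()

    def update(self, batch_lines):
        """Add one batch of lines and return the full table of totals."""
        for line in batch_lines:
            self.totals.update(line.split(" "))
        # Return every word seen so far, not only those in this batch.
        return dict(self.totals)


rc = RunningWordCount()
after_first = rc.update(["hello world"])
after_second = rc.update(["hello spark"])
```

Each call returns every word seen so far, so words from earlier batches keep appearing in later output.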

**Question**: What differences do you notice between these examples?

### III. Streaming Word Count with Filter

Provide a screening list of words to match against.  
Include a filter to calculate a running count of the words provided in the list.
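One possible starting point, offered as a sketch rather than a reference solution, is a predicate built from the screening list. The name `make_screen` is hypothetical, and the case-insensitive matching is an assumption, not a requirement of the exercise.

```python
def make_screen(screen_words):
    """Build a predicate that accepts only words on the screening list."""
    # Assumption: matching ignores case, so "Spark" matches "spark".
    allowed = {word.lower() for word in screen_words}

    def keep(word):
        return word.lower() in allowed

    return keep


keep = make_screen(["Spark", "data"])
matches = [w for w in "Spark streams data fast".split(" ") if keep(w)]
```

In the DStream example, a predicate like `keep` could be passed to `.filter(keep)` after the `flatMap` step. In the structured example, the analogous step would be a DataFrame filter on the `word` column, for instance with `isin`.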

### IV. Create Your Own Streaming Demo

Try a different streaming job below.  
For example, you might aggregate the data differently in combination with a filter.