<img src="./img/uva_seal.png">  

## Spark Streaming

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: November 13, 2024

---  


### SOURCES

https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html

https://spark.apache.org/docs/3.5.1/structured-streaming-programming-guide.html

Learning Spark, Chapter 10: Spark Streaming

### OBJECTIVES
- Provide an overview of `Spark Streaming`
- Discuss how `DStreams` work
- Discuss the concept of window computations
- Provide an introduction to `Structured Streaming`


### CONCEPTS

- StreamingContext
- DStream
- Window Operations
- Receivers

---

### Overview

*Spark Streaming* is a module for acting on data as soon as it arrives.   

It enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

Data can be ingested from many sources such as `Kafka`, `Flume`, `Kinesis`, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like `map`, `reduce`, `join`, and `window`.  

Finally, processed data can be pushed out to filesystems, databases, and live dashboards.  



<img src="./img/streaming-arch.png">  

Spark Streaming provides a high-level abstraction called a `discretized stream` or `DStream`, which represents a continuous stream of data.  

Internally, a `DStream` is a sequence of `RDDs`.  The time interval covered by each `RDD` is the same, and is called the `batch interval`.  

`DStreams` are created from input data streams like Kafka, or by applying operations on other `DStreams`.
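
To make the batch interval concrete, here is a minimal pure-Python sketch (no Spark required) of how timestamped records might be grouped into fixed-length batches. The function name `discretize`, the sample records, and the use of plain lists as stand-ins for `RDDs` are illustrative assumptions, not Spark's implementation.

```python
from collections import defaultdict


def discretize(records, batch_interval):
    """Group (timestamp, value) records into consecutive batches.

    Batch k covers [k * batch_interval, (k + 1) * batch_interval),
    mimicking how each RDD in a DStream covers one batch interval.
    """
    by_index = defaultdict(list)
    for timestamp, value in records:
        by_index[int(timestamp // batch_interval)].append(value)
    if not by_index:
        return []
    # Emit every interval up to the last one seen, including empty ones,
    # since one (possibly empty) batch is produced per interval.
    last = max(by_index)
    return [by_index.get(i, []) for i in range(last + 1)]


# Hypothetical records: (arrival time in seconds, line of text)
records = [(0.2, "hello world"), (0.7, "hello spark"), (2.1, "streaming")]
print(discretize(records, batch_interval=1))
```

With a batch interval of 1 second, the first two lines land in the same batch, the second interval is empty, and the last line forms its own batch.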

**Example: Word Count of Data Received From Server Listening on TCP Socket**

---

**NOTE: CODE IS FOR ILLUSTRATION PURPOSES ONLY. WE WILL RUN IN THE DEMO.**  

```
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Create a local SparkSession with two worker threads
# (at least two are needed: one for the receiver, one for processing)
spark = SparkSession.builder\
        .master("local[2]")\
        .appName("NetworkWordCount")\
        .getOrCreate()
sc = spark.sparkContext

# set up StreamingContext with batch interval of 1 second
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to socket = hostname:port, like localhost:8888
# on Windows machine, can find port from CMD by issuing: netstat -a -o
lines = ssc.socketTextStream("localhost", 8888)

type(lines)
# <class 'pyspark.streaming.dstream.DStream'>

# Split each line into words
# Each line is split into words, and this stream of words goes to DStream
words = lines.flatMap(lambda line: line.split(" "))

# MapReduce to count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print first ten elements of each RDD generated in DStream to console
wordCounts.pprint()

# Start the computation
# depending where you are listening, different results will follow
ssc.start()

# Wait for the computation to terminate
ssc.awaitTermination()  
```


<span style='color:red'> NOTE: </span> In practice, when running on a cluster, `master` won't be hardcoded in program.  

Instead, launch the application with `spark-submit` and supply the master there (e.g., with the `--master` option). For local testing and unit tests, `local[*]` makes sense. 

Note that no computation takes place until `start()` is called on the `StreamingContext`.

**Launching the Application**

To run the example, first install a utility called `netcat`, add it to your `PATH` variable, and issue the following at a terminal.  Text typed below the `nc` command is sent to the socket and should be printed by the Spark job.

**TERMINAL 1: Running Netcat on port 9999**
```
$ nc -lk 9999

hello world
```

In a second terminal, run the program (provided as a Spark code example).  It should print output once per second.

**TERMINAL 2: Running the Spark job**
```
$ ./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999
```

The results should look something like this:

```
-------------------------------------------
Time: 2014-10-14 15:25:21
-------------------------------------------
(hello,1)
(world,1)
```

### DStreams

The `DStream` is the basic abstraction for Spark Streaming.  Each `RDD` in a `DStream` covers a time interval.  Any operation on a `DStream` maps onto each underlying RDD.

The `RDD` transformations are computed by the Spark engine.  

**DStream as a Sequence of RDDs**

<img src='./img/streaming-dstream.png'>

### Receivers

For each input source, `Spark Streaming` launches *receivers*, which are tasks running within the application's executors that collect data from the input source and save it as `RDDs`. 

The receivers replicate the data to another executor for fault tolerance.  The data is stored in executor memory, in the same way as cached `RDDs`.

<img src='./img/exec_spark_streaming.jpg'>

### Transformations on DStreams

`DStreams` support many of the transformations available to `RDDs`.   Refer to documentation for details. https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html

Transformations on `DStreams` can be characterized as *stateless* or *stateful*.

- In stateless transformations, batches can be treated independently.  Previous batch data is not required.  Examples include `map` and `filter`.  


- In stateful transformations, data from previous batches is needed to compute results on the current batch.  Examples include window operations and `updateStateByKey()`, which can build up an object representing each user session.
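
The difference between the two kinds of transformation can be sketched in plain Python (no Spark required). The helper names and the sample batches below are hypothetical; `update_count` only mirrors the shape of the update function that `updateStateByKey()` takes (new values for a key, plus the previous state or `None`), and is not Spark's implementation.

```python
def stateless_counts(batch):
    """Word counts for a single batch; no earlier batch is consulted."""
    counts = {}
    for line in batch:
        for word in line.split(" "):
            counts[word] = counts.get(word, 0) + 1
    return counts


def update_count(new_values, running_count):
    """Combine this batch's values for a key with the previous state (or None)."""
    return sum(new_values) + (running_count or 0)


def stateful_counts(batches):
    """Running word counts across batches, carrying state forward."""
    state = {}
    history = []
    for batch in batches:
        for word, count in stateless_counts(batch).items():
            state[word] = update_count([count], state.get(word))
        history.append(dict(state))  # snapshot of the state after this batch
    return history


# Hypothetical input: two batches, one line each
batches = [["hello world"], ["hello spark"]]
print([stateless_counts(b) for b in batches])
print(stateful_counts(batches))
```

In the stateless version, `hello` is counted once in each batch; in the stateful version, its count grows to 2 after the second batch because the earlier state is carried forward.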

### Window Operations

`Windowed computations` allow for application of transformations over a sliding window of data.  In the example below, `window length=3` and `sliding interval=2`.  This means the operation is applied over the previous 3 `RDDs`, and it slides by 2 time units.

<span style='color:red'>NOTE: </span> the parameters must be multiples of the batch interval of the `DStream`.

**Windowed Computation Schematic**

<img src='./img/streaming-dstream-window.png'>

If we wish to compute word counts for the last 30 seconds of data, every 10 seconds, we can use `reduceByKeyAndWindow` on the `DStream` of `(word, 1)` pairs produced by the `map` step.

**Compute Word Count over Last 30sec Data, every 10sec**

```
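# add counts entering the window, subtract counts leaving it (this inverse-function form requires checkpointing)
windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)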
```
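
The role of the two functions passed to `reduceByKeyAndWindow` can be illustrated with a pure-Python sketch (no Spark required). It works in units of batches rather than seconds, and the function names and sample per-batch counts are hypothetical. The first function recomputes each window from scratch; the second updates the previous window by adding batches that enter and subtracting batches that leave, which is the idea behind supplying an inverse function. This is a simplification under those assumptions, not Spark's implementation.

```python
from collections import Counter


def windowed_counts(batch_counts, window_length, slide):
    """Recompute counts over the last `window_length` batches, every `slide` batches."""
    results = []
    for end in range(slide, len(batch_counts) + 1, slide):
        total = Counter()
        for counts in batch_counts[max(0, end - window_length):end]:
            total.update(counts)
        results.append(dict(total))
    return results


def windowed_counts_incremental(batch_counts, window_length, slide):
    """Same result, but each window is derived from the previous one."""
    results = []
    total = Counter()
    prev_end = 0
    for end in range(slide, len(batch_counts) + 1, slide):
        # Add the batches that entered the window since the last slide.
        for counts in batch_counts[prev_end:end]:
            total.update(counts)
        # Subtract the batches that fell out of the window (the "inverse" step).
        old_start = max(0, prev_end - window_length)
        new_start = max(0, end - window_length)
        for counts in batch_counts[old_start:new_start]:
            total.subtract(counts)
        prev_end = end
        results.append({word: n for word, n in total.items() if n > 0})
    return results


# Hypothetical per-batch word counts for six consecutive batches
batch_counts = [{"a": 1}, {"a": 2, "b": 1}, {"b": 1}, {"c": 1}, {"a": 1}, {"c": 2}]
direct = windowed_counts(batch_counts, window_length=3, slide=2)
incremental = windowed_counts_incremental(batch_counts, window_length=3, slide=2)
print(direct)
print(direct == incremental)
```

The incremental form touches only the batches that enter or leave the window, which matters when the window is long relative to the slide.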

There are several window operations.  For details, refer to: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html

- `window()`
- `countByWindow()`
- `reduceByWindow()`
- `reduceByKeyAndWindow()`
- `countByValueAndWindow()`

### MLlib Operations

`Spark Streaming` is compatible with `MLlib`.

`MLlib` includes streaming ML algorithms that can learn from streaming data while also scoring it.

Alternatively, a model can be learned offline and then used to score the streaming data.


### Caching

Calling `persist()` on a `DStream` will persist all of its underlying `RDDs`.  
This is useful when computing on the same data multiple times.  
Window operations (and stateful operations such as `updateStateByKey()`) always reuse data across batches, so the `DStreams` they generate are automatically persisted in memory.

For input streams that receive data over the network (e.g., Kafka), the default persistence level replicates the data to two nodes for fault tolerance.

### Checkpointing

Streaming applications must be resilient to failures external to the application (e.g., power outage, system maintenance).

Information is `checkpointed` to a fault-tolerant storage system (e.g., `HDFS`, `S3`) so the application can recover from failures.  

Checkpoints are saved at regular intervals, so the maximum information loss is whatever was generated since the last checkpoint.

Two types of information are checkpointed:
- metadata    
  - this backs up data on the driver node, including configs
- data        
  - generated RDDs   
  - especially important if stateful transformations are used.
  
Setting up checkpointing is a one-liner:

```
ssc.checkpoint(checkpointDirectory)
```
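
For the checkpoint to be useful after a driver failure, the application also has to restart from it. A common pattern is `StreamingContext.getOrCreate`, sketched below under some assumptions: the function names `create_context` and `run`, the app name, and the host and port defaults are hypothetical, and the script is assumed to be launched with `spark-submit` so that the master is supplied there. The imports sit inside the functions only so the sketch can be loaded without Spark installed.

```python
def create_context(checkpoint_dir, host="localhost", port=9999):
    """Build a fresh StreamingContext; only called when no checkpoint exists."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="CheckpointedWordCount")
    ssc = StreamingContext(sc, 1)  # 1-second batch interval
    ssc.checkpoint(checkpoint_dir)  # enable metadata and data checkpointing

    # Define the whole computation here so it is part of what gets recovered.
    lines = ssc.socketTextStream(host, port)
    running_counts = (
        lines.flatMap(lambda line: line.split(" "))
        .map(lambda word: (word, 1))
        .updateStateByKey(lambda new, total: sum(new) + (total or 0))
    )
    running_counts.pprint()
    return ssc


def run(checkpoint_dir):
    """Recover the context from the checkpoint if present, else create it."""
    from pyspark.streaming import StreamingContext

    ssc = StreamingContext.getOrCreate(
        checkpoint_dir, lambda: create_context(checkpoint_dir)
    )
    ssc.start()
    ssc.awaitTermination()
```

On the first run the checkpoint directory is empty, so `create_context` is called; on a restart the context and its state are rebuilt from the checkpoint data instead.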

### Output Operations

Output operations bring the stream data to rest (e.g., store it in a database table, print it to the screen).  They trigger the computation on each batch.

Like `RDDs`, `DStreams` are lazily evaluated.  
Output operations play the role that actions play for `RDDs`: without one, the transformations on a `DStream` are never executed.

### Streaming UI

The Spark UI (typically located at *http://[driver]:4040*) contains a Streaming tab.  Here is a snapshot:

<img src='./img/webui-streaming1.png'>

---

## Structured Streaming

Structured Streaming is a newer model that uses DataFrames for streaming.

It treats a live data stream as an unbounded table that is continuously appended to.




<img src='./img/streaming_table.png' width=900>

---

Consider a Word Count program using Structured Streaming.

The full code can be found [here](https://spark.apache.org/docs/3.5.1/structured-streaming-programming-guide.html). 

The important elements are `SparkSession.readStream` and `outputMode("complete")`.

In *Complete Mode*, the entire updated Result Table is written to external storage.

This diagram provides an illustration:

<img src='./img/streaming_example.png' width=800>

Spark checks each second for new data from the socket connection.

If there is new data, Spark combines the previous running counts with the new data to compute updated counts.

**Source data is discarded** once it has been processed; only the state needed to update the result is kept.
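
A condensed sketch of the word count, following the pattern in the programming guide linked above, is shown here. The wrapper function `build_word_count_query` and its host and port defaults are assumptions for illustration; the import sits inside the function only so the sketch can be loaded without Spark installed.

```python
def build_word_count_query(spark, host="localhost", port=9999):
    """Start a streaming word count that reads lines from a socket."""
    from pyspark.sql.functions import explode, split

    # Unbounded input table with a single string column named "value"
    lines = (
        spark.readStream.format("socket")
        .option("host", host)
        .option("port", port)
        .load()
    )

    # One row per word, then a running count per distinct word
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    word_counts = words.groupBy("word").count()

    # Complete mode: print the entire updated result table on each trigger
    return word_counts.writeStream.outputMode("complete").format("console").start()
```

With an existing `SparkSession` named `spark`, calling `build_word_count_query(spark).awaitTermination()` would keep the query running until it is stopped.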

---

### Handling Event Times

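The model can handle data by event time (when the data was generated), rather than processing time (when Spark receives it).

Event time can be a column in each data row.

Aggregations can be done on the event-time column, so data that arrives late is still counted in the window it belongs to.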
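
The idea can be sketched in plain Python (no Spark required): each event is assigned to a window by its own timestamp, so a late arrival still updates the window it belongs to. The function name, the tumbling-window simplification, and the sample events are hypothetical; in Structured Streaming the analogous grouping is expressed with the `window` function from `pyspark.sql.functions`.

```python
from collections import defaultdict


def count_by_event_window(events, window_seconds):
    """Count words per tumbling window keyed by event time, not arrival order."""
    counts = defaultdict(int)
    for event_time, word in events:
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, word)] += 1
    return dict(counts)


# Hypothetical events in arrival order: (event time in seconds, word).
# The third event is late: it was generated at t=7 but arrives after t=12.
events = [(3, "cat"), (12, "dog"), (7, "cat"), (14, "dog")]
print(count_by_event_window(events, window_seconds=10))
```

Both `cat` events fall in the window starting at 0 even though one of them arrived after a `dog` event from the next window.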

---

### Operations on Streaming DataFrames

Most common operations on DataFrames, including Spark SQL queries, are supported for streaming.

---