# Streaming Data Concepts

## What is streaming data?
- Traditional **batch-oriented** data processing is **one-off** and **bounded**
<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Batch_Processing.png" alt="Batch_Processing" style="border: 2px solid black; border-radius: 10px;">

- **Streaming** processing is **continuous** and **unbounded**
<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Streaming_Processing.png" alt="Streaming_Processing" style="border: 2px solid black; border-radius: 10px;">

## Bounded vs. Unbounded Dataset
**1. Bounded Data:**
- Has a **finite** and **unchanging** structure at the time of processing
- The order is **static**
- Analogy: Vehicles in a parking lot

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Anology_Bounded_Data.png" alt="Anology_Bounded_Data" style="border: 2px solid black; border-radius: 10px;">

**2. Unbounded Data:**
- Has an **infinite** and **continuously changing** structure at the time of processing
- The order is **not always sequential**
- Analogy: Vehicles on a highway

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Anology_Unbounded_Data.png" alt="Anology_Unbounded_Data" style="border: 2px solid black; border-radius: 10px;">
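
As a rough illustration in plain Python (not Spark), a bounded dataset behaves like a list whose size is known up front, while an unbounded dataset behaves like a generator that never finishes. The names `parking_lot` and `highway` are hypothetical and only mirror the analogies above.

```python
from itertools import count, islice

# Bounded: a finite collection; it can be counted and processed in one pass.
parking_lot = ["car-0", "car-1", "car-2"]
total_parked = len(parking_lot)


def highway():
    """Unbounded: an endless generator of vehicles; it has no length."""
    for i in count():
        yield f"car-{i}"


# Only a finite slice of an unbounded source can ever be inspected at once.
first_five = list(islice(highway(), 5))
```

Calling `len()` on `highway()` raises a `TypeError`, which is the practical meaning of "unbounded": any processing has to work on whatever has arrived so far.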

# Batch vs. Streaming Processing

### Batch Processing
- Refers to processing & analysis of **bounded** datasets
  - e.g. size is well known, we can count the number of elements
- **Loose** data latency requirement
  - e.g. day old, week old
- **Traditional ETL** from transactional systems into analytical systems

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Batch_Example.png" alt="Batch_Example" style="border: 2px solid black; border-radius: 10px;">

### Streaming Processing
- Datasets are **continuous** and **unbounded**
  - Data is constantly arriving and must be processed as long as there is new data --> **Micro-batch (or 1-by-1)**
- **Low-latency** use cases
  - e.g. real-time or near real-time
- Provide fast, actionable insights
  - e.g. Quality-of-Service, Device Monitoring, Recommendations

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Streaming_Example.png" alt="Streaming_Example" style="border: 2px solid black; border-radius: 10px;">  

### Similarities
- Both involve data transformation
- The output of a streaming job is often queried in batch jobs
- Stream processing often includes batch processing (micro-batch)

### Differences:
|How to process in one run | Batch | Streaming |
|-|-|-|
|**Bounded dataset** | Big batch | Row by row / micro-batch|
|**Unbounded dataset** | N/A (multiple runs) | Row by row / micro-batch|
|**Query computation** | Only once | Multiple times |

# Introduction to Structured Streaming

## What is Structured Streaming
- A scalable, fault-tolerant **stream processing framework** built on the Spark SQL engine
- Uses **existing structured APIs** (DataFrames, SQL Engine) and provides an API similar to the batch processing API
- Includes **stream-specific features** such as end-to-end exactly-once processing, fault tolerance, etc.

## How Structured Streaming Works

### Incremental Updates - Data stream as an unbounded table
- Streaming data is usually coming in very fast
- The magic behind Spark Structured Streaming: processing infinite data as incremental table updates

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Incremental_Updates.png" alt="Incremental_Updates" style="border: 2px solid black; border-radius: 10px;">
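
The idea of incremental table updates can be sketched in plain Python, assuming a simple per-key count as the query. `full_recompute` and `incremental_update` are hypothetical helper names for this sketch, not Spark functions: the first re-scans every row seen so far, the second folds only the newly arrived rows into existing state.

```python
from collections import Counter


def full_recompute(all_rows):
    """Batch style: re-scan every row seen so far."""
    return Counter(all_rows)


def incremental_update(state, new_rows):
    """Streaming style: fold only the newly arrived rows into the state."""
    state.update(new_rows)
    return state


micro_batches = [["a", "b"], ["a"], ["c", "a"]]
state = Counter()
seen = []
for batch in micro_batches:
    seen.extend(batch)
    state = incremental_update(state, batch)
    # The incremental result always matches a full recomputation.
    assert state == full_recompute(seen)
```

Both approaches give the same answer, but the incremental one touches each row only once, which is what makes processing an unbounded table feasible.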

### Continuous Execution vs. Micro-Batch
- **Continuous Execution (EXPERIMENTAL):** continuously **_listens_** for new data and processes records **individually** --> lower throughput + resource overhead
- **Micro-batch Execution:** **_accumulates_** **small batches** of data and processes each batch in **parallel** --> higher throughput + fewer nodes

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Micro_Batch_Processing.png" alt="Micro_Batch_Processing" style="border: 2px solid black; border-radius: 10px;">

### Execution mode
1. An **`input table`** is defined by configuring a **streaming read** against a **source**
2. A **`query`** is defined against the **`input table`**
3. This logical query on the input table generates the **`results table`**
4. The output of a streaming pipeline will persist updates to the **`results table`** by writing to an external **sink**
5. New rows are appended to the **`input table`** for each trigger interval

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Programming_Model_For_Structured_Streaming.png" alt="Programming_Model_For_Structured_Streaming" style="border: 2px solid black; border-radius: 10px;">
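
The five steps above can be mimicked with a toy model in plain Python. This is only a sketch under simplifying assumptions: `input_table`, `sink`, `query`, and `on_trigger` are hypothetical names, and the query re-scans the whole input table on each trigger, whereas a real engine keeps only the state it needs.

```python
from collections import Counter

input_table = []  # step 1: rows read from the source so far
sink = []         # step 4: external storage receiving result updates


def query(rows):
    """Step 2: the logical query, here a count of events per device."""
    return dict(Counter(row["device"] for row in rows))


def on_trigger(new_rows):
    """Step 5 appends new rows; steps 3 and 4 refresh and persist the results."""
    input_table.extend(new_rows)
    results_table = query(input_table)
    sink.append(results_table)
    return results_table


on_trigger([{"device": "d1"}, {"device": "d2"}])
latest = on_trigger([{"device": "d1"}])
```

After the second trigger, `latest` reflects all rows seen so far, and `sink` holds one snapshot of the results table per trigger.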

## Anatomy of a Streaming Query
- Core concepts:
  - Input sources
  - Sinks
  - Transformations & actions
  - Triggers

- Example:
  - Read JSON data from Kafka
  - Parse nested JSON
  - Store in structured Delta Lake table

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/JSON_Example.png" alt="JSON_Example" style="border: 2px solid black; border-radius: 10px;">

### Source
**Source:**
- Specify where to read data from
- Open-source (OSS) Spark supports Kafka and file sources
- Databricks runtimes include connector libraries supporting Delta, Event Hubs, and Kinesis

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Core_Concept_Source.png" alt="Core_Concept_Source" style="border: 2px solid black; border-radius: 10px;">

1. Use the Spark session to read data as a stream, specifying the format (Kafka in this example)
2. Set the `kafka.bootstrap.servers` option to point at the Kafka brokers
3. Set the `subscribe` option to subscribe to a topic
    - This can include multiple sources of different types
4. Call `load()` to return a Spark DataFrame

```python
# Read a Kafka topic as a streaming DataFrame
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", ...)
    .option("subscribe", "topic")
    .load()
)
```

### Transformation
**Transformation:** add transformation of data on the fly

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Core_Concept_Transformation.png" alt="Core_Concept_Transformation" style="border: 2px solid black; border-radius: 10px;">

1. Cast bytes from Kafka records to a string, parse it as JSON, and generate nested columns
    - There are hundreds of built-in, optimized SQL functions such as `from_json`, as well as UDFs, lambda functions, `map`, and `flatMap`

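```python
from pyspark.sql.functions import col, from_json

# `schema` is assumed to be a StructType describing the JSON payload
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", ...)
    .option("subscribe", "topic")
    .load()
    # Kafka delivers `value` as bytes: cast to string, then parse as JSON
    .selectExpr("cast (value as string) as json")
    .select(from_json(col("json"), schema).alias("data"))
)
```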

### Sink
**Sink:** specifies where to write the output data

Built-in Spark supports:
- Files and Kafka for production
- Console and memory for development and debugging
- `foreachBatch` to execute arbitrary code with the output data

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Core_Concept_Sink.png" alt="Core_Concept_Sink" style="border: 2px solid black; border-radius: 10px;">

1. Write transformed output to external storage systems

```python
from pyspark.sql.functions import col, from_json

writer = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", ...)
    .option("subscribe", "topic")
    .load()
    .selectExpr("cast (value as string) as json")
    .select(from_json(col("json"), schema).alias("data"))
    # Write the parsed records to a Delta table
    .writeStream
    .format("delta")
    .option("path", "/deltaTable/")
)
```

### Checkpoint
**Checkpoint location:** For tracking the progress of the query

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Core_Concept_Checkpoint.png" alt="Core_Concept_Checkpoint" style="border: 2px solid black; border-radius: 10px;">

```python
from pyspark.sql.functions import col, from_json

query = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", ...)
    .option("subscribe", "topic")
    .load()
    .selectExpr("cast (value as string) as json")
    .select(from_json(col("json"), schema).alias("data"))
    .writeStream
    .format("delta")
    .option("path", "/deltaTable/")
    .trigger(processingTime="1 minute")
    # Progress is recorded here so the query can resume after a restart
    .option("checkpointLocation", "…")
    .start()
)
```

### Trigger
**Trigger:** Defines how frequently the input table is checked for new data
- Each time a trigger fires, Spark checks for new data and updates the results

| Type | Script | Description |
|-|-|-|
| Fixed interval micro-batches | `.trigger(processingTime="2 minutes")` | Micro-batch processing kicked off at the **user-specified interval** |
| Triggered one-time micro-batch | `.trigger(once=True)` | Process all of the available data as a **single `micro-batch`** and then automatically stop the query |
| Triggered one-time micro-batches | `.trigger(availableNow=True)` | Process all of the available data as **multiple `micro-batches`** and then automatically stop the query |
| Continuous Processing | `.trigger(continuous="2 seconds")` | Long-running tasks that **continuously** read, process, and write data as soon as events are available, with checkpoints at the specified frequency |
| Default  |  | Databricks: 500 ms fixed interval. Open-source Apache Spark: process each micro-batch as soon as the previous one has been processed |
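
The difference between the two one-time triggers can be sketched in plain Python. This is an illustration only: `plan_batches` and its `max_per_batch` rate limit are hypothetical names for this sketch and are not part of the Spark API.

```python
def plan_batches(pending, mode, max_per_batch=2):
    """Split the currently pending records into micro-batches.

    mode="once": everything goes into a single micro-batch.
    mode="available_now": several micro-batches, each capped at
    max_per_batch (a hypothetical rate limit for this sketch).
    """
    if not pending:
        return []
    if mode == "once":
        return [list(pending)]
    if mode == "available_now":
        return [
            list(pending[i:i + max_per_batch])
            for i in range(0, len(pending), max_per_batch)
        ]
    raise ValueError(f"unknown mode: {mode}")


pending = ["r1", "r2", "r3", "r4", "r5"]
once_plan = plan_batches(pending, "once")
available_now_plan = plan_batches(pending, "available_now")
```

In both cases all pending records are processed and the run then stops; the modes differ only in how the work is sliced, which matters when a single large batch would not fit comfortably in memory.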

### Output Mode

**Output Mode:**
- Defines how the data is written to the sink
- Equivalent to **_"save"_** mode on static DataFrames

| Type | Description |
|-|-|
| Complete  | The entire updated Result Table is written to the sink. The individual sink implementation decides how to handle writing the entire table |
| Append  | Only the new rows appended to the Result Table since the last trigger are written to the sink |
| Update  | Only new rows and the rows in the Result Table that were updated since the last trigger are written to the sink |

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> The output modes supported depend on the type of transformations and sinks used by the streaming query. Refer to the [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) for details.

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Core_Concept_Output.png" alt="Core_Concept_Output" style="border: 2px solid black; border-radius: 10px;">

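```python
from pyspark.sql.functions import col, from_json

query = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", ...)
    .option("subscribe", "topic")
    .load()
    .selectExpr("cast (value as string) as json")
    .select(from_json(col("json"), schema).alias("data"))
    .writeStream
    .format("delta")
    .option("path", "/deltaTable/")
    .trigger(processingTime="1 minute")
    .option("checkpointLocation", "…")
    # "complete" is only accepted for queries with a streaming aggregation;
    # this query has none, so it uses "append"
    .outputMode("append")
    .start()
)
```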
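
The three output modes can be sketched in plain Python as a comparison of the result table before and after a trigger. `rows_to_write` is a hypothetical helper for this sketch, not a Spark API. It also simplifies append mode: Spark only appends rows that will never change again, which this sketch does not model.

```python
def rows_to_write(previous, current, mode):
    """Pick the result-table rows that go to the sink on this trigger.

    previous and current map a key to its value before and after the trigger.
    """
    if mode == "complete":
        # The whole result table is written every time.
        return dict(current)
    if mode == "update":
        # Rows that are new or whose value changed since the last trigger.
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # Only rows whose key was not in the result table before.
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError(f"unknown mode: {mode}")


previous = {"d1": 1, "d2": 1}
current = {"d1": 2, "d2": 1, "d3": 1}
complete_rows = rows_to_write(previous, current, "complete")
update_rows = rows_to_write(previous, current, "update")
append_rows = rows_to_write(previous, current, "append")
```

Here `d1` changed, `d2` stayed the same, and `d3` is new, so each mode sends a different subset to the sink.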

## Benefits of Structured Streaming

### Unification
- **Same API** is used for batch and stream processing.
- Supports Python, SQL or Spark’s other supported languages.
- Spark’s built-in libraries can be called in a streaming context, including ML libraries.

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> Most operations on a streaming DataFrame are identical to a static DataFrame. There are some exceptions to this, for example, **_sorting_** is not supported with streaming data.

### Fault Tolerance
- Structured Streaming ensures end-to-end exactly-once fault-tolerance guarantees through **`checkpointing`**.
- In case of failures, the streaming engine attempts to **_restart_** and/or **_reprocess_** the data.
- This approach requires:
  - A **replayable streaming** source such as cloud-based object storage and pub/sub services.
  - **Idempotent sinks** - multiple writes of the same data (as identified by the offset) **_do not result in duplicates_** being written to the sink.
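
How a replayable source, a checkpoint, and an idempotent sink combine can be sketched in plain Python. This is a simplified model, not Spark's actual recovery protocol: `IdempotentSink`, `run`, and the `fail_after_write_of` switch are hypothetical names, and progress is recorded only after each batch is written.

```python
class IdempotentSink:
    """Keeps one output per batch id, so a replayed batch overwrites itself."""

    def __init__(self):
        self.committed = {}

    def write(self, batch_id, rows):
        self.committed[batch_id] = list(rows)


def run(source, sink, checkpoint, batch_size=2, fail_after_write_of=None):
    """Process the source from the checkpointed offset onward.

    checkpoint["offset"] is the next unread position in the replayable source.
    If fail_after_write_of is set, a crash is simulated after writing that
    batch but before recording progress.
    """
    while checkpoint["offset"] < len(source):
        start = checkpoint["offset"]
        batch = source[start:start + batch_size]
        batch_id = start // batch_size
        sink.write(batch_id, batch)
        if batch_id == fail_after_write_of:
            raise RuntimeError("simulated crash before checkpoint commit")
        checkpoint["offset"] = start + len(batch)


source = ["e0", "e1", "e2", "e3", "e4"]
sink, checkpoint = IdempotentSink(), {"offset": 0}
try:
    run(source, sink, checkpoint, fail_after_write_of=1)
except RuntimeError:
    pass  # batch 1 was written but its progress was never recorded
run(source, sink, checkpoint)  # restart: batch 1 is replayed from offset 2
written = [row for bid in sorted(sink.committed) for row in sink.committed[bid]]
```

Batch 1 is written twice, once before the crash and once after the restart, yet the sink ends up with each event exactly once because the second write replaces the first.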

### Handle Out-of-Order Data
- Supports **event-time-window-based** aggregation queries
- Supports **watermarking**, which allows users to specify the threshold for late data

## Structured Streaming with Delta Lake

### Delta Lake Benefits

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Detal_Lake_Benefits.png" alt="Detal_Lake_Benefits" style="border: 2px solid black; border-radius: 10px;">

### Streaming from Delta Lake
- Each committed version represents **new data** to stream. Delta Lake **transaction logs** identify the version’s new data files
- Structured Streaming assumes **append-only** sources. Any non-append changes to a Delta table cause queries streaming from that table to **throw exceptions**.
  - Set `delta.appendOnly = true` to prevent non-append modifications to a table.
  - Use Delta Lake [change data feed](https://docs.databricks.com/en/delta/delta-change-data-feed.html) to propagate arbitrary change events to downstream consumers
- You can [limit the input rate](https://docs.databricks.com/en/structured-streaming/delta-lake.html#limit-input-rate) for micro-batches by setting `DataStreamReader` options:
  - `maxFilesPerTrigger`: Maximum files read per micro-batch (default 1,000)
  - `maxBytesPerTrigger`: Soft limit to amount of data read per micro-batch (no default)
  - **Note:** Delta Live Tables pipelines auto-tune options for rate limiting, so you **_`should avoid`_** setting these options explicitly for your pipelines.
- Each micro-batch written to the Delta table is committed as a **new version**.
- Delta Lake supports both **`append`** and **`complete`** output modes.
  - Append is most common.
  - Complete replaces the entire table with each micro-batch. It can be used for streaming queries that perform arbitrary aggregations on streaming data.

# Aggregations, Time Windows, Watermarks

## Types of Streaming Processing
|Stateless| Stateful |
|-|-|
| Typically **trivial transformations**. The way records are handled **_does not depend_** on previously seen records. | Previously seen records **_can influence_** new records |
| Example: Data Ingest (map-only), simple dimensional joins | Example: Aggregations over time, Fraud/Anomaly Detection |


## Stream Aggregations
- Continuous applications often require **near real-time** decisions on **real-time**, aggregated statistics
  - Examples: Aggregating errors from IoT devices, behavior analysis on instant messages via hashtags
- In the case of streams, you generally don't want to run aggregations **_over the entire dataset_**. Why:
  - Conceptually there is **no end** to the flow of data; data is **continuous**
  - The size of the dataset grows in **perpetuity**, so the job will eventually **_run out of resources_**
- **Solution:** Instead of aggregating over the entire dataset, we can aggregate over data **_grouped by_ `windows`** of time (say, every 5 minutes). This is referred to as **`windowing`**
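
A minimal sketch of windowed aggregation in plain Python, assuming event times are given as integer seconds and windows do not overlap. `tumbling_window_start` and `count_per_window` are hypothetical helper names; in Spark the grouping would typically be expressed with the built-in `window` function instead.

```python
from collections import Counter


def tumbling_window_start(event_time, width):
    """Start of the single non-overlapping window that contains event_time."""
    return event_time - (event_time % width)


def count_per_window(event_times, width):
    """Count events per window, keyed by window start."""
    return dict(Counter(tumbling_window_start(t, width) for t in event_times))


# Event times in seconds; a width of 300 s gives 5-minute windows.
events = [10, 200, 299, 300, 650]
counts = count_per_window(events, width=300)
```

Each window's count depends only on the events that fall inside it, so state for old windows can eventually be discarded instead of growing forever.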

## Time Based Windows
| Tumbling Window | Sliding Window |
|-|-|
|**No window overlap** | **Windows overlap** |
| Any given event gets aggregated into **only one** window group (e.g. 1:00–2:00 am, 2:00–3:00 am, 3:00–4:00 am, …) |Any given event gets aggregated into **multiple window** groups (e.g. 1:00–2:00 am, 1:30–2:30 am, 2:00–3:00 am, …) |

**Sliding Window Example:**

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Sliding_Window_Example.png" alt="Sliding_Window_Example" style="border: 2px solid black; border-radius: 10px;">
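
Window membership for a single event can be worked out with a few lines of plain Python. This sketch assumes times are minutes since midnight and that windows start at multiples of the slide interval; `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(event_time, width, slide):
    """Return every [start, end) window that contains event_time.

    Windows have the given width and start at multiples of slide.
    """
    start = event_time - (event_time % slide)  # latest window start <= event
    windows = []
    while start > event_time - width:
        windows.append((start, start + width))
        start -= slide
    return sorted(windows)


# An event at 1:45 am is minute 105.
sliding = sliding_windows(105, width=60, slide=30)
tumbling = sliding_windows(105, width=60, slide=60)
```

With a 60-minute width and a 30-minute slide, the 1:45 am event lands in both the 1:00–2:00 and 1:30–2:30 windows. Setting the slide equal to the width gives the tumbling case, where the event belongs to exactly one window.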

## Reasoning About Time

### Event Time vs. Processing Time

- **Event Time:** time at which the **event** (record in the data) actually occurred.
- **Processing time:** time at which a **record** is actually processed.
- The distinction matters in every use case that processes unbounded data arriving in arbitrary order; otherwise correctness cannot be guaranteed

### Time Domain Skew
**When batch processing:**
- **`Processing time`** is by definition much later (e.g. an hour or a day) than **_`event time`_**
- Data assumed to be complete (or settle for incompleteness)

**When stream processing:**
- **`Processing time`** >= **_`event time`_** but often close (e.g. seconds, minutes)
- Challenge when **`processing time`** >>> **_`event time`_** (**late data**): nothing can be concluded easily. How long should we wait for the data to be complete?

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Time_Domain_Skew.png" alt="Time_Domain_Skew" style="border: 2px solid black; border-radius: 10px;">

### Watermarking
**Watermark:** **_Handle_** **late** data and **_limit_** how long to **`remember`** old data
- Analogy: Highway **minimum speed** limit
<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Watermarking.png" alt="Watermarking" style="border: 2px solid black; border-radius: 10px;">
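
The watermark rule can be sketched in plain Python, assuming the watermark is the largest event time seen so far minus an allowed delay. `apply_watermark` is a hypothetical helper, and the sketch advances the watermark after every record for simplicity, whereas Spark (via `withWatermark`) advances it between micro-batches.

```python
def apply_watermark(event_times, delay):
    """Split event times, given in arrival order, into accepted and dropped.

    An event is dropped when it is older than the current watermark,
    where watermark = max event time seen so far - delay.
    """
    max_seen = None
    accepted, dropped = [], []
    for t in event_times:
        watermark = None if max_seen is None else max_seen - delay
        if watermark is not None and t < watermark:
            dropped.append(t)  # too late: state for this time is gone
        else:
            accepted.append(t)
            max_seen = t if max_seen is None else max(max_seen, t)
    return accepted, dropped


# Event times in arrival order; 98 and 108 arrive out of order.
arrivals = [100, 105, 98, 120, 108, 113]
accepted, dropped = apply_watermark(arrivals, delay=10)
```

The event at 98 is late but still within the threshold, so it is kept. The event at 108 arrives after 120 has pushed the watermark to 110, so it falls below the "minimum speed" and is dropped.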