# Streaming Data Concepts

## What is streaming data?
- Traditional **batch-oriented** data processing is **one-off** and **bounded**
<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Batch_Processing.png" alt="Batch_Processing" style="border: 2px solid black; border-radius: 10px;">

- **Streaming** processing is **continuous** and **unbounded**
<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Streaming_Processing.png" alt="Streaming_Processing" style="border: 2px solid black; border-radius: 10px;">

## Bounded vs. Unbounded Dataset
**1. Bounded Data:**
- Has a **finite** and **unchanging** structure at the time of processing
- The order is **static**
- Analogy: Vehicles in a parking lot

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Anology_Bounded_Data.png" alt="Anology_Bounded_Data" style="border: 2px solid black; border-radius: 10px;">

**2. Unbounded Data:**
- Has an **infinite** and **continuously changing** structure at the time of processing
- The order is **not always sequential**
- Analogy: Vehicles on a highway

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Anology_Unbounded_Data.png" alt="Anology_Unbounded_Data" style="border: 2px solid black; border-radius: 10px;">
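
The distinction can be sketched in plain Python, assuming a list stands in for a bounded dataset and a generator for an unbounded one. The names `parked_cars` and `highway_traffic` are hypothetical and only mirror the two analogies; this is not Spark code.

```python
import itertools

# Bounded: a finite collection whose size is known up front (the parking lot).
parked_cars = ["car-0", "car-1", "car-2"]


def highway_traffic():
    """Unbounded: yields vehicles forever; there is no last element."""
    for i in itertools.count():
        yield f"car-{i}"


# A bounded dataset can be counted in a single pass.
total_parked = len(parked_cars)

# An unbounded dataset can only be observed one finite slice at a time.
first_five = list(itertools.islice(highway_traffic(), 5))

print(total_parked)
print(first_five)
```

Calling `len()` on the generator would fail, which is the practical meaning of "unbounded": any computation has to work on the portion seen so far.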

# Batch vs. Streaming Processing

### Batch Processing
- Refers to processing & analysis of **bounded** datasets
  - e.g. size is well known, we can count the number of elements
- **Loose** data latency requirement
  - e.g. data that is a day or a week old
- **Traditional ETL** from transactional systems into analytical systems

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Batch_Example.png" alt="Batch_Example" style="border: 2px solid black; border-radius: 10px;">

### Streaming Processing
- Datasets are **continuous** and **unbounded**
  - Data arrives constantly and must be processed for as long as new data keeps coming, either in **micro-batches** or **one record at a time**
- **Low-latency** use cases
  - e.g. real-time or near real-time
- Provide fast, actionable insights
  - e.g. Quality-of-Service, device monitoring, recommendations

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Streaming_Example.png" alt="Streaming_Example" style="border: 2px solid black; border-radius: 10px;">  

### Similarities
- Both apply data transformations
- The output of a streaming job is often queried by batch jobs
- Stream processing often includes batch processing (micro-batches)

### Differences
|How to process in one run | Batch | Streaming |
|-|-|-|
|**Bounded dataset** | One big batch | Row by row / micro-batch|
|**Unbounded dataset** | N/A (requires multiple runs) | Row by row / micro-batch|
|**Query computation** | Only once | Multiple times |

# Introduction to Structured Streaming

## What is Structured Streaming?
- A scalable, fault-tolerant **stream processing framework** built on the Spark SQL engine
- Uses the **existing structured APIs** (DataFrames, SQL engine) and provides an API similar to the batch processing API
- Includes **stream-specific features** such as end-to-end exactly-once processing and fault tolerance

## How Structured Streaming Works

### Incremental Updates - Data stream as an unbounded table
- Streaming data usually arrives at a high rate
- The key idea behind Spark Structured Streaming: treat an infinite data stream as a table that receives incremental updates

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Incremental_Updates.png" alt="Incremental_Updates" style="border: 2px solid black; border-radius: 10px;">
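
A minimal plain-Python sketch of the unbounded-table idea, assuming a running count per key as the query. The `arrivals` data and the variable names are hypothetical; the point is only that updating the result with the newly arrived rows gives the same answer as a batch query over everything seen so far.

```python
from collections import Counter

# Hypothetical stream: each inner list is the set of rows that arrived together.
arrivals = [["a", "b"], ["a"], ["c", "a", "b"]]

input_table = []          # conceptually unbounded; it only grows by appends
result_table = Counter()  # running count per key, updated incrementally

for new_rows in arrivals:
    input_table.extend(new_rows)   # new rows are appended to the input table
    result_table.update(new_rows)  # only the new rows are processed
    # The incremental result equals a batch query over all rows seen so far.
    assert result_table == Counter(input_table)

print(dict(result_table))
```

The incremental path never rescans old rows, which is what makes the approach workable when the input never ends.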

### Micro-Batch Processing
- **Micro-batch Execution:** **_accumulate_** **small batches** of data and process each batch in **parallel**
- **Continuous Execution (EXPERIMENTAL):** continuously **_listen_** for new data and process each record **individually**

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Micro_Batch_Processing.png" alt="Micro_Batch_Processing" style="border: 2px solid black; border-radius: 10px;">
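
The two execution styles can be contrasted with a small plain-Python sketch. The helpers `micro_batches` and `one_by_one` are hypothetical, and a fixed `batch_size` is an assumption made for illustration; Spark forms micro-batches per trigger rather than by a fixed record count, and the parallel processing of each batch is not modelled here.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Accumulate records into small batches and yield each batch as a list."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def one_by_one(records):
    """Hand over each record individually, as soon as it is seen."""
    for record in records:
        yield record


events = list(range(7))  # hypothetical incoming records

batches = list(micro_batches(events, batch_size=3))
singles = list(one_by_one(events))

print(batches)
print(singles)
```

Both styles see the same records; they differ in how many records are handed to the processing step at once, which trades latency against per-record overhead.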

### Execution Model
1. An **`input table`** is defined by configuring a **streaming read** against a **source**
2. A **`query`** is defined against the **`input table`**
3. This logical query on the input table generates the **`results table`**
4. The streaming pipeline persists updates to the **`results table`** by writing them to an external **sink**
5. New rows are appended to the **`input table`** at each trigger interval

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Programming_Model_For_Structured_Streaming.png" alt="Programming_Model_For_Structured_Streaming" style="border: 2px solid black; border-radius: 10px;">
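
The five steps above can be simulated in plain Python. Everything here is a hypothetical stand-in: `source()` plays the streaming source, `query()` is a simple stateless filter, and `sink` is an ordinary list rather than external storage. It is a sketch of the model, not of the Spark API.

```python
def source():
    """Hypothetical source: yields the rows that arrive at each trigger."""
    yield [{"device": "a", "temp": 20}, {"device": "b", "temp": 31}]
    yield [{"device": "a", "temp": 35}]
    yield []  # a trigger interval with no new data


def query(rows):
    """The logical query: keep readings above 30 degrees."""
    return [row for row in rows if row["temp"] > 30]


input_table = []    # step 1: defined by the streaming read
results_table = []  # step 3: generated by the query
sink = []           # step 4: stands in for external storage

for new_rows in source():            # each iteration is one trigger
    input_table.extend(new_rows)     # step 5: new rows are appended
    new_results = query(new_rows)    # step 2: the query runs on the new rows
    results_table.extend(new_results)
    sink.append(list(new_results))   # persist this trigger's updates

print(results_table)
print(sink)
```

Because the filter is stateless, writing only the newly produced result rows at each trigger is enough; a query with aggregations would also need to keep state between triggers.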

## Anatomy of a Streaming Query
- Core concepts:
  - Input sources
  - Sinks
  - Transformations & actions
  - Triggers

- Example:
  - Read JSON data from Kafka
  - Parse nested JSON
  - Store in structured Delta Lake table

<img src="./images/01_Incremental_Processing_With_Spark_Streaming/JSON_Example.png" alt="JSON_Example" style="border: 2px solid black; border-radius: 10px;">

### Source
<img src="./images/01_Incremental_Processing_With_Spark_Streaming/Core_Concept_Source.png" alt="Core_Concept_Source" style="border: 2px solid black; border-radius: 10px;">

```python
# Define a streaming read against a Kafka topic
(spark.readStream.format("kafka")
 .option("kafka.bootstrap.servers", ...)  # broker list left as a placeholder
 .option("subscribe", "topic")
 .load())
```
- Specifies where data is read from
- Open-source Spark supports Kafka and file sources
- Databricks runtimes include connector libraries supporting Delta, Event Hubs, and Kinesis

### Transformation
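```python
from pyspark.sql.functions import from_json

# `schema` is assumed to describe the JSON payload (StructType or DDL string)
(spark.readStream.format("kafka")
 .option("kafka.bootstrap.servers", ...)  # broker list left as a placeholder
 .option("subscribe", "topic")
 .load()
 .selectExpr("CAST(value AS STRING) AS json")         # Kafka bytes -> string
 .select(from_json("json", schema).alias("data")))    # string -> nested columns
```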
- Hundreds of built-in, optimized SQL functions such as `from_json`
- In this example, the bytes of each Kafka record are cast to a string, parsed as JSON, and expanded into nested columns
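
To make the per-record effect of that transformation concrete, here is a plain-Python sketch using only the standard library. The payload in `raw_value` is a made-up example, and `json.loads` is only an analogy for `from_json`, which additionally applies a schema.

```python
import json

# Hypothetical Kafka record value: raw bytes holding a JSON document.
raw_value = b'{"device": {"id": "a1", "site": "north"}, "temp": 21.5}'

as_string = raw_value.decode("utf-8")  # analogous to CAST(value AS STRING)
data = json.loads(as_string)           # analogous to parsing with from_json

# Nested fields are now addressable, much like data.device.id in Spark.
device_id = data["device"]["id"]
temp = data["temp"]

print(device_id, temp)
```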

# Aggregations, Time Windows, Watermarks