# 5. Structured Streaming

## Streaming Query

### Sources
- Kafka, Files, Event Hubs, Kinesis
- DataFrame
  - ```
    # Read a Delta table as a stream, one file per micro-batch
    df = (spark.readStream
              .format("delta")
              .option("maxFilesPerTrigger", 1)
              .load(DA.paths.events))
    df.isStreaming  # True for a streaming DataFrame
    ```
- SQL Views & Tables
  - ```
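    # Register the streaming DataFrame as a temporary view
    df.createOrReplaceTempView("v_event")
    # Read the view back as a streaming DataFrame
    spark.readStream.format("delta").table("v_event")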
    ```

### Sinks
- **Where** to write data: Kafka, Files, Event Hubs/Event Grid, or `foreach`/`foreachBatch` for custom write logic
- **What** data to write (Output Modes)
  - APPEND: **add new** records only
  - UPDATE: **update changed** records in place
    - Only rows updated since the last trigger are written
    - Differs from **Complete** mode, which rewrites every row; **Update** mode outputs only the rows that changed since the last trigger
    - If the query contains no aggregations, **Update** behaves the same as **Append** mode
  - COMPLETE: **rewrite** the full output on every trigger
  - Example:
  
    ![Output Modes](./images/Output_Modes.png)
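
To make the three output modes concrete, here is a toy model in plain Python (not the Spark API) that feeds two micro-batches through a running count and records what each mode would emit per trigger. The function name `run_modes` and the sample batches are illustrative assumptions. The append line is shown on the un-aggregated records, since append is about emitting only newly arrived rows.

```python
from collections import Counter

def run_modes(batches):
    """Return, per trigger, what each output mode would emit for a running count."""
    counts = Counter()              # aggregation state kept across triggers
    results = []
    for batch in batches:
        changed = set(batch)        # keys touched by this micro-batch
        counts.update(batch)
        results.append({
            # append: only the rows that arrived in this trigger
            "append": list(batch),
            # update: only the aggregate rows that changed in this trigger
            "update": {k: counts[k] for k in sorted(changed)},
            # complete: the entire result table, rewritten every trigger
            "complete": dict(sorted(counts.items())),
        })
    return results

out = run_modes([["a", "b", "a"], ["b", "c"]])
for trigger, emitted in enumerate(out):
    print(trigger, emitted)
```

In the second trigger, update mode emits only the rows for `b` and `c`, while complete mode also re-emits the unchanged row for `a`.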

### Trigger Types
- Default: Process each micro-batch as soon as the previous one has finished (or every 500 ms). *No code is required for this*. **Not recommended**
- Fixed interval: Micro-batch processing is kicked off at the **user-specified interval**
  - `.trigger(processingTime="1 second")` = check for new data every second
- One-time: Process **all** available data as **a single micro-batch**, then automatically stop the query
  - `.trigger(once=True)` for manually triggered runs
- AvailableNow: Like One-time, all available data is processed before the query stops, but in **multiple batches** instead of one
  - `.trigger(availableNow=True)` **--> Recommended**
- Continuous processing: Long-running tasks that **continuously read, process and write** data as soon as events are available
  - `.trigger(continuous="1 second")`
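
The trigger types above map onto keyword arguments of `DataStreamWriter.trigger()`. The sketch below collects that mapping in one small helper; the helper name `trigger_kwargs` and its `kind` labels are hypothetical conveniences, not part of Spark.

```python
def trigger_kwargs(kind, interval=None):
    """Map a trigger type name to keyword arguments for DataStreamWriter.trigger()."""
    if kind == "default":
        return {}                              # empty dict: skip the .trigger() call
    if kind in ("fixed", "continuous"):
        if interval is None:
            raise ValueError(f"{kind!r} trigger needs an interval such as '1 second'")
        key = "processingTime" if kind == "fixed" else "continuous"
        return {key: interval}
    if kind == "once":
        return {"once": True}
    if kind == "available_now":
        return {"availableNow": True}
    raise ValueError(f"unknown trigger type: {kind!r}")

# Usage against a real writer (not executed here):
#   kwargs = trigger_kwargs("fixed", "1 second")
#   writer = df.writeStream.trigger(**kwargs) if kwargs else df.writeStream
print(trigger_kwargs("available_now"))
```

Returning an empty dict for the default case lets the caller skip `.trigger()` entirely, since the default behaviour needs no code.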

### End-to-end fault tolerance
Guaranteed in Structured Streaming by:
1. Checkpointing: The streaming query's metadata (the DAG of its transformations) is stored in reliable storage, along with optional state
    - `dbutils.fs.ls(checkpointPath)`
2. Write-ahead logs: Used to commit offsets
    - Before a micro-batch is processed, the offset range it will read is written to disk; once the batch has been processed, the offsets are committed. After an interruption, the restarted stream therefore resumes from where it left off.
3. Idempotent sinks: A given row is written only once, even if it is sent multiple times
4. Replayable data sources: The job can poll the same data again after a failure

![Checkpointing](./images/Checkpointing.png)
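
How these four pieces combine can be sketched with a toy simulation in plain Python. It is not Spark code: the classes `IdempotentSink` and `Checkpoint`, the function `run`, and the simulated crash are all illustrative assumptions. A Python list plays the replayable source, a dict plays the write-ahead log, and the sink keys its writes by batch id so that a replayed batch overwrites itself.

```python
class IdempotentSink:
    """Stores each batch under its batch id, so a re-sent batch overwrites itself."""
    def __init__(self):
        self.batches = {}

    def write(self, batch_id, rows):
        self.batches[batch_id] = list(rows)

    def rows(self):
        return [r for b in sorted(self.batches) for r in self.batches[b]]


class Checkpoint:
    """Stand-in for durable storage: planned offset ranges and committed batch ids."""
    def __init__(self):
        self.offsets = {}      # write-ahead log: batch_id -> (start, end)
        self.commits = set()   # batch ids that finished successfully


def run(source, sink, ckpt, batch_size, crash_before_commit=None):
    """Process the source in micro-batches; optionally fail before one commit."""
    while True:
        pending = [b for b in ckpt.offsets if b not in ckpt.commits]
        if pending:
            batch_id = pending[0]                 # replay the batch that never committed
        else:
            start = max((end for _, end in ckpt.offsets.values()), default=0)
            if start >= len(source):
                return                            # nothing new to read
            batch_id = len(ckpt.offsets)
            # 1) log the planned offset range BEFORE touching the data
            ckpt.offsets[batch_id] = (start, min(start + batch_size, len(source)))
        start, end = ckpt.offsets[batch_id]
        sink.write(batch_id, source[start:end])   # 2) replayable read, idempotent write
        if batch_id == crash_before_commit:
            raise RuntimeError("simulated failure")
        ckpt.commits.add(batch_id)                # 3) commit only after the write succeeded


source = list(range(10))
sink, ckpt = IdempotentSink(), Checkpoint()
try:
    run(source, sink, ckpt, batch_size=4, crash_before_commit=1)
except RuntimeError:
    pass                                          # the stream went down mid-batch
run(source, sink, ckpt, batch_size=4)             # restart with the same checkpoint
print(sink.rows())
```

Batch 1 is written twice (once before the failure, once on restart), yet every row appears exactly once in the sink because the second write replaces the first.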


### Example: Complete Stream Query
![Complete Streaming Query](./images/Complete_Streaming_Query.png)

**Step 1:** Read Stream (lazy)
  - `.option("maxFilesPerTrigger", 1)` - how much data to read per micro-batch, in this case 1 file at a time
    - This helps avoid overrunning cluster resources
    - We can also use `.option("maxBytesPerTrigger", 1000)`
  - The data read in each micro-batch is held in RAM

**Step 2:** Transformation

**Step 3:** Write Stream (lazy)
  - `emailTrafficDF.writeStream` writes the DataFrame from Step 2 to disk
  - `queryName("email_traffic")` sets the name of this query. It is good practice to name every query.
  - `option("checkpointLocation", checkpointPath)` is **mandatory**
  - `start(outputPath)` writes into the given directory. `start` is an **Action**
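
Step 3 can be sketched as a small function that takes the streaming DataFrame from Step 2 and starts the write. The function name `start_email_traffic_query` is hypothetical, and `outputMode("append")` and `format("delta")` are assumptions, since the notes do not state the mode or sink format. The function only relies on the DataFrame's `writeStream` attribute, so the snippet itself needs no imports.

```python
def start_email_traffic_query(email_traffic_df, checkpoint_path, output_path):
    """Start the streaming write from Step 3 and return the query handle."""
    return (
        email_traffic_df.writeStream
        .queryName("email_traffic")                     # name used to identify the query
        .option("checkpointLocation", checkpoint_path)  # mandatory: where progress is tracked
        .outputMode("append")                           # assumption: add new records only
        .format("delta")                                # assumption: Delta sink
        .start(output_path)                             # the action that launches the query
    )

# Usage on a cluster (not executed here):
#   query = start_email_traffic_query(emailTrafficDF, checkpointPath, outputPath)
#   query.awaitTermination()
```

Nothing runs until `start` is called; every call before it only configures the writer.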

### Streaming Best Practices
1. Set an explicit trigger interval rather than relying on the default, to avoid unintended cost
2. On Azure, prefer ADLS Gen2 over Blob storage
3. Name the streaming job so it is easily identifiable
4. Don't run multiple streams on the same driver. Multiplexing on the same cluster is generally not recommended
5. Adjust `maxFilesPerTrigger` or `maxBytesPerTrigger` to achieve partition sizes of around **128 MB - 200 MB** (for best latency and throughput)
6. Where possible, convert `SortMergeJoin` to `BroadcastHashJoin`. This may require raising the auto-broadcast join threshold (`spark.sql.autoBroadcastJoinThreshold`)
7. If the query shuffles, consider setting the number of shuffle partitions manually (since AQE is disabled in Streaming) to match the number of cores, or 2x the number of cores
8. Turn off stats collection on the initial stream to decrease latency
9. If possible, Auto-Optimize the initial stream to coalesce tiny files
10. Use compute-optimized workers and the RocksDB state store
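
Items 5 and 7 come down to simple arithmetic, sketched below in plain Python. The helper names are hypothetical, and the sizing rule in `max_bytes_per_trigger` (one target-sized partition per core per micro-batch) is an assumption made for illustration, not a rule stated in these notes.

```python
MB = 1024 * 1024

def shuffle_partitions(total_cores, factor=1):
    """Shuffle partition count equal to the cores (factor=1) or twice the cores (factor=2)."""
    if factor not in (1, 2):
        raise ValueError("factor must be 1 or 2")
    return total_cores * factor

def max_bytes_per_trigger(total_cores, target_partition_mb=128):
    """Bytes per micro-batch, assuming one target-sized partition per core."""
    if not 128 <= target_partition_mb <= 200:
        raise ValueError("target partition size should be between 128 and 200 MB")
    return total_cores * target_partition_mb * MB

# Usage on a cluster with 16 cores (the Spark calls are not executed here):
#   spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions(16, factor=2))
#   reader = spark.readStream.option("maxBytesPerTrigger", max_bytes_per_trigger(16))
print(shuffle_partitions(16, factor=2), max_bytes_per_trigger(16))
```

Treat the resulting numbers as starting points to tune against observed batch durations.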

## Streaming Aggregates
