# Data Ingestion Patterns
**Limitations at Data Ingestion Stage:**
- Streaming sources such as Kinesis, Kafka, and Event Hubs retain data for only a **limited** amount of time
- **Need for retention** - full history of data
  - Reprocess raw data
  - Perform GDPR and compliance tasks
  - Recover data
- Need for a simple, **maintainable and scalable** architecture
- Keeping full history in the streaming source is **expensive**

## Pattern 1: Use Delta for **Infinite Retention**
**Delta:**
- Provides **cheap**, **elastic** and **governable** storage for transient sources
- Lets you **_stream_**, **_store_**, and **_track_** data in the data lake up front

<img src="./images/02_Streamline_ETL_With_DLT/Pattern_1.png" alt="Pattern_1" style="border: 2px solid black; border-radius: 10px;">

**1. Input:** any of these input types can be connected to the streaming system
- `cloud_files`: configure Auto Loader to detect files arriving in object storage

**2. Transformation:**
- Keep transformations to a **minimum** so the data stays in its **raw/original** form

**3. Store Data:**
- DLT lets you **_recompute_** results, **_debug_** code, and **_delete_** and **_propagate_** changes through the pipeline
- In the Bronze layer, `pipelines.reset.allowed=false` prevents a full refresh from clearing the table, so downstream computations can be fully refreshed without losing the raw data
- Best practice:
    - **`streaming tables`** for **Bronze**
    - **_live tables_** for **Silver** and **Gold**
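
Why an append-only Bronze copy matters can be shown with a minimal plain-Python sketch. It does not use Kafka or DLT: the `deque` with `maxlen` is a hypothetical stand-in for a source with limited retention, and `run`, `retention`, and the event names are invented for illustration.

```python
from collections import deque

def run(batches, retention):
    source = deque(maxlen=retention)  # stand-in for a source with limited retention
    bronze = []                       # append-only raw history
    copied = 0                        # offset of the next event to copy into bronze
    offset = 0                        # offset assigned to the next published event
    for batch in batches:
        for event in batch:
            source.append((offset, event))  # oldest events silently fall off
            offset += 1
        # Ingest promptly and without transformation, skipping what was already copied.
        for event_offset, event in source:
            if event_offset >= copied:
                bronze.append(event)
                copied = event_offset + 1
    return [event for _, event in source], bronze

still_in_source, bronze = run([["e1", "e2"], ["e3", "e4"], ["e5"]], retention=3)
print(still_in_source)  # ['e3', 'e4', 'e5']
print(bronze)           # ['e1', 'e2', 'e3', 'e4', 'e5']
```

The source has already forgotten `e1` and `e2`, but Bronze can still replay them for downstream recomputation.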

## Pattern 2: Up-to-date **Replica with CDC**
Maintain an **up-to-date replica** of a table stored elsewhere
- Use **Change Data Capture (CDC)** from an RDBMS and create the replica as a Delta table
- A variety of 3rd party tools can provide a streaming change feed
- DLT simplifies with `APPLY CHANGES INTO`
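
To make the idea of applying a change feed concrete, here is a rough plain-Python sketch of the semantics (keys plus a sequencing column). It is not the DLT implementation: `apply_changes`, the `op` field, and its `UPSERT`/`DELETE` values are assumptions, since real change feed formats vary by tool, and it only orders events within a single batch.

```python
def apply_changes(target, changes, key, sequence_by, op_field="op"):
    """Apply a batch of change events to `target`, a dict keyed by the primary key."""
    # Sorting by the sequence column makes late-arriving events harmless.
    for change in sorted(changes, key=lambda c: c[sequence_by]):
        if change[op_field] == "DELETE":
            target.pop(change[key], None)
        else:
            # Upsert: keep only the latest values, no history.
            target[change[key]] = {f: v for f, v in change.items() if f != op_field}
    return target

feed = [
    {"id": 1, "name": "Berlin", "ts": 2, "op": "UPSERT"},
    {"id": 1, "name": "Berlim", "ts": 1, "op": "UPSERT"},  # older event, arrives late
    {"id": 2, "name": "Paris", "ts": 1, "op": "UPSERT"},
    {"id": 2, "name": None, "ts": 3, "op": "DELETE"},
]
replica = apply_changes({}, feed, key="id", sequence_by="ts")
print(replica)  # {1: {'id': 1, 'name': 'Berlin', 'ts': 2}}
```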

<img src="./images/02_Streamline_ETL_With_DLT/Pattern_2.png" alt="Pattern_2" style="border: 2px solid black; border-radius: 10px;">

## Pattern 3: Multiplex Ingestion
**1. Simplex vs. Multiplex:**

<img src="./images/02_Streamline_ETL_With_DLT/Simplex_Vs_Multiplex.png" alt="Simplex_Vs_Multiplex" style="border: 2px solid black; border-radius: 10px;">

- **Simplex:** **every single** source is being streamed into **its own** Delta Lake table
- **Multiplex:** 
  - **all** of sources are streamed into the **same** table and 
  - each source can be differentiated by **_adding_ a column** that specifies the source of the dataset

**2. Anti-Pattern:** **_`Don't use`_** **Kafka** as the **Bronze Table**:
- Data retention **limited** by Kafka; **expensive** to keep full history
- All processing happens on ingest
- If stream gets too far behind, data is **lost**
- Cannot recover data (no history to replay)

<img src="./images/02_Streamline_ETL_With_DLT/Anti_Pattern_3.png" alt="Anti_Pattern_3" style="border: 2px solid black; border-radius: 10px;">

**3. Bronze + Silver approach:**

<img src="./images/02_Streamline_ETL_With_DLT/Pattern_3.png" alt="Pattern_3" style="border: 2px solid black; border-radius: 10px;">

1. Ingest data into **Bronze table**
2. **_Read_** Bronze table -> **_separate_** data by category -> **_apply_** a schema and some transformations -> **_write_** to **Silver Delta table** = multiplex ingestion
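
The read, separate, apply-schema, write flow can be sketched in plain Python as follows. This is an illustration under assumptions, not Spark or DLT code: the `topic` and `value` column names, the two categories, and the `parsers` mapping are all hypothetical.

```python
import json
from collections import defaultdict

# Multiplexed Bronze table: every source lands in one table, tagged by topic.
bronze_rows = [
    {"topic": "orders", "value": '{"order_id": 1, "amount": "19.90"}'},
    {"topic": "clicks", "value": '{"user": "a", "page": "/home"}'},
    {"topic": "orders", "value": '{"order_id": 2, "amount": "5.00"}'},
]

# One parser per category applies that category's schema.
parsers = {
    "orders": lambda d: {"order_id": int(d["order_id"]), "amount": float(d["amount"])},
    "clicks": lambda d: {"user": str(d["user"]), "page": str(d["page"])},
}

silver = defaultdict(list)  # one Silver table per category
for record in bronze_rows:
    parse = parsers.get(record["topic"])
    if parse is None:
        continue  # unknown categories stay in Bronze only
    silver[record["topic"]].append(parse(json.loads(record["value"])))

print(silver["orders"])  # [{'order_id': 1, 'amount': 19.9}, {'order_id': 2, 'amount': 5.0}]
```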

# Data Quality Enforcement Patterns

## Silver Layer for Quality Enforcement
**Silver Layer Objectives:**
- **_Validate_** data quality and schema
- **_Enrich_** and **_transform_** data
- **_Optimize_** data layout and storage for downstream queries
- **_Provide_** single source of truth for analytics

## Schema Enforcement & Evolution
**1. Enforcement:** **_prevents_ bad records** from entering table
- E.g. Mismatch in type or field name

**2. Evolution:** **_allows_ new fields** to be added
- Useful when schema changes in production/new fields added to nested data
- **_`Cannot use`_** evolution to **_remove fields_**
- All previous records will show newly added field as **_`Null`_**
  - For previously written records, the underlying file isn’t modified.
  - The additional field is simply defined in the metadata and dynamically read as null
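
A toy model may help show how enforcement and evolution interact. The sketch below is plain Python, not Delta Lake: the `Table` class and its `merge_schema` flag are invented for illustration and only mimic the behavior described above.

```python
class Table:
    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> Python type
        self.files = []             # rows as written; never rewritten later

    def append(self, row, merge_schema=False):
        new_fields = [name for name in row if name not in self.schema]
        if new_fields and not merge_schema:
            raise ValueError(f"unknown fields: {new_fields}")  # enforcement
        for name, value in row.items():
            if name in self.schema and not isinstance(value, self.schema[name]):
                raise TypeError(f"bad type for {name}")        # enforcement
        for name in new_fields:
            self.schema[name] = type(row[name])                # evolution
        self.files.append(dict(row))

    def read(self):
        # Missing fields are filled in at read time; stored rows are untouched.
        return [{col: row.get(col) for col in self.schema} for row in self.files]

cities = Table({"id": int, "city": str})
cities.append({"id": 1, "city": "Oslo"})
try:
    cities.append({"id": "2", "city": "Rome"})  # wrong type for id
    rejected = False
except TypeError:
    rejected = True
cities.append({"id": 3, "city": "Lima", "country": "PE"}, merge_schema=True)

print(cities.read()[0])  # {'id': 1, 'city': 'Oslo', 'country': None}
print(cities.files[0])   # {'id': 1, 'city': 'Oslo'}
```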

## Alternative Quality Check Approaches
- Add a **"validation" field** that captures any **validation errors**; a **null** value means validation passed.
  - E.g. an outlier or unexpected value is captured in the validation field and the record is passed along for later review
- Quarantine data by filtering **non-compliant data** to an alternate location
  - E.g. add a quarantine table: instead of dropping the record and omitting it from the final production table, pass it into a separate table that can be processed later
- Warn without failing by **_writing_ additional fields** with constraint check results to Delta tables
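
The "validation field" approach might look like the following plain-Python sketch. The rules, the allowed country codes, and the `validate` helper are hypothetical; the point is only that every record is kept and a null (`None`) value marks a pass.

```python
ALLOWED_COUNTRIES = {"US", "DE", "VN"}  # hypothetical reference values

def validate(row):
    """Return the row with a `validation` field; None means every check passed."""
    errors = []
    if row.get("amount") is None or row["amount"] < 0:
        errors.append("amount must be non-negative")
    if row.get("country") not in ALLOWED_COUNTRIES:
        errors.append("unexpected country")
    return {**row, "validation": "; ".join(errors) or None}

checked = [validate(r) for r in [
    {"amount": 10, "country": "US"},
    {"amount": -5, "country": "XX"},
]]
print(checked[0]["validation"])  # None
print(checked[1]["validation"])  # amount must be non-negative; unexpected country
```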

## Pattern: Quarantine Invalid Records
What if we want to save the records that violate data quality constraints for analysis?

### Method 1: Create Inverse Expectation Rules

<img src="./images/02_Streamline_ETL_With_DLT/Method_1.png" alt="Method_1" style="border: 2px solid black; border-radius: 10px;">

**Limitations:**
- Processes the data twice
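
As a rough illustration of inverse rules, the plain-Python sketch below filters the same input twice: once with the rules and once with their negation. The rule names and sample rows are made up, and the lambdas only stand in for DLT expectations.

```python
rules = {
    "valid_id": lambda r: r["id"] is not None,
    "valid_amount": lambda r: r["amount"] >= 0,
}

def passes_all(row):
    return all(rule(row) for rule in rules.values())

raw_rows = [
    {"id": 1, "amount": 20},
    {"id": None, "amount": 5},
    {"id": 3, "amount": -1},
]

valid_rows = [r for r in raw_rows if passes_all(r)]           # first pass: the rules
quarantine_rows = [r for r in raw_rows if not passes_all(r)]  # second pass: NOT(rules)

print(valid_rows)            # [{'id': 1, 'amount': 20}]
print(len(quarantine_rows))  # 2
```

Both comprehensions scan `raw_rows`, which is the "processes the data twice" cost noted above.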

### Method 2: Add a validation status column and use for partitioning

<img src="./images/02_Streamline_ETL_With_DLT/Method_2.png" alt="Method_2" style="border: 2px solid black; border-radius: 10px;">

**Limitations:**
- **_Doesn't use_ expectations**; **data quality metrics** are **_`not available`_** in the event logs or the pipelines UI.
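
By contrast, the status-column method can be sketched as a single pass that tags each row and routes it to a partition. Again this is plain Python with invented names (`tag`, `is_quarantined`, the sample rows), not DLT code.

```python
def tag(row):
    is_valid = row["id"] is not None and row["amount"] >= 0
    return {**row, "is_quarantined": not is_valid}

incoming = [
    {"id": 1, "amount": 20},
    {"id": None, "amount": 5},
    {"id": 3, "amount": -1},
]

partitions = {False: [], True: []}  # keyed by the validation status column
for row in incoming:                # single pass over the data
    tagged = tag(row)
    partitions[tagged["is_quarantined"]].append(tagged)

print(len(partitions[False]), len(partitions[True]))  # 1 2
```

Because the check is ordinary code rather than an expectation, nothing here would feed the quality metrics mentioned above.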

## Pattern: Verify Data with Row Comparison
Validate **row counts** across tables to verify that data was processed successfully without dropping rows.

**Solution:**
- Add an **additional table** to your pipeline that defines an expectation to perform the comparison.
- The results of this expectation appear in the **event log** and the **Delta Live Tables UI**.
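
The comparison itself is simple; a minimal plain-Python sketch of what the extra table computes is shown below. `count_check` and the sample rows are hypothetical, and in DLT the `passed` condition would be expressed as an expectation on the comparison table.

```python
def count_check(source_rows, target_rows):
    """One-row summary comparing the row counts of two tables."""
    result = {"source_count": len(source_rows), "target_count": len(target_rows)}
    result["passed"] = result["source_count"] == result["target_count"]
    return result

bronze_table = [{"id": 1}, {"id": 2}, {"id": 3}]
silver_table = [{"id": 1}, {"id": 3}]  # one row went missing

report = count_check(bronze_table, silver_table)
print(report)  # {'source_count': 3, 'target_count': 2, 'passed': False}
```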

<img src="./images/02_Streamline_ETL_With_DLT/Pattern_Verify_Data_With_Row_Comparison.png" alt="Pattern_Verify_Data_With_Row_Comparison" style="border: 2px solid black; border-radius: 10px;">

## Pattern: Define Tables for Adv. Validation
Perform advanced data validation with DLT expectations
- Examples of complex data quality checks
  - A derived table contains **`all records`** from the source table
  - Guaranteeing the **equality** of a **`numeric column`** across tables
- Solution:
  - Define DLT tables using **aggregate** and **join queries** and use the results of those queries as part of your expectation checking.
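
The two example checks can be sketched in plain Python: an anti-join style lookup for missing records and an aggregate comparison for a numeric column. The helper names and data are assumptions for illustration only; in a pipeline these would be join and aggregate queries whose results an expectation inspects.

```python
def missing_keys(source, derived, key):
    """Keys present in `source` but absent from `derived` (like a left anti-join)."""
    derived_keys = {row[key] for row in derived}
    return [row[key] for row in source if row[key] not in derived_keys]

def totals_match(source, derived, column):
    """Compare the sum of a numeric column across two tables."""
    return sum(r[column] for r in source) == sum(r[column] for r in derived)

source_table = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}, {"id": 3, "amount": 7}]
derived_table = [{"id": 1, "amount": 10}, {"id": 3, "amount": 7}]

print(missing_keys(source_table, derived_table, "id"))      # [2]
print(totals_match(source_table, derived_table, "amount"))  # False
```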

<img src="./images/02_Streamline_ETL_With_DLT/Pattern_Define_Tables_For_Adv_Validation.png" alt="Pattern_Define_Tables_For_Adv_Validation" style="border: 2px solid black; border-radius: 10px;">

# Data Modeling

## Dimensional Modeling
**1. Fact Tables vs. Dimension Tables:**
- **Fact Tables:** Often contain a **granular record** of activities
- **Dimension Tables:** Often contain data that may be **_updated_** or **_modified_** over time.

<img src="./images/02_Streamline_ETL_With_DLT/Combined_Table.png" alt="Combined_Table" style="border: 2px solid black; border-radius: 10px;">

**2. Modeling Guidelines:**
- **_Denormalize_** dimension and fact tables
- Use **conformed dimensions**
- Use **slowly changing dimensions** as necessary

**3. Fact Tables as Incremental Data:**
- Often a **time series**
- No intermediate aggregations
- No overwrite/update/delete operations
- Often **`append-only`** operations

## Slowly Changing Dimensions (SCD)

**Type 0:**
- **`No changes`** allowed
- Tables are either **static** or **append** only
- E.g. static lookup tables, append-only fact tables

**Type 1:**
- **`Overwrite`** but **`no history`** is maintained
- May contain recording of when record was entered, but **`not` previous values**
- E.g. valid customer mailing address

**Type 2:**
- Add a **`new row`**; mark old row as **obsolete**
- **`Strong history`** is maintained
- E.g. tracking product price changes over time

<img src="./images/02_Streamline_ETL_With_DLT/SCD_123_Example.png" alt="SCD_123_Example" style="border: 2px solid black; border-radius: 10px;">

## SCD Type 2 with DLT
**Keep a record** of how values changed over time
```sql
APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts
STORED AS SCD TYPE 2
```
- `__START_AT` and `__END_AT` will take on the type of the `SEQUENCE BY` field `ts`.
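
What a Type 2 target ends up holding can be illustrated with a small plain-Python sketch. It is not how DLT implements the statement above: `apply_scd2` is hypothetical, and `start_at`/`end_at` are stand-ins for the validity columns DLT adds. Each change closes the currently open version of the key and appends a new one.

```python
def apply_scd2(history, changes, key, sequence_by):
    for change in sorted(changes, key=lambda c: c[sequence_by]):
        seq = change[sequence_by]
        for row in history:
            if row[key] == change[key] and row["end_at"] is None:
                row["end_at"] = seq  # mark the old version as obsolete
        attrs = {k: v for k, v in change.items() if k != sequence_by}
        history.append({**attrs, "start_at": seq, "end_at": None})
    return history

city_updates = [
    {"id": 1, "city": "Bekerly", "ts": 1},
    {"id": 1, "city": "Berkeley", "ts": 2},  # correction for the same key
    {"id": 2, "city": "Oslo", "ts": 1},
]
cities_history = apply_scd2([], city_updates, key="id", sequence_by="ts")
for row in cities_history:
    print(row)
```

The validity columns inherit their values, and therefore their type, from the sequencing field `ts`.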

<img src="./images/02_Streamline_ETL_With_DLT/SCD2_With_DLT.png" alt="SCD2_With_DLT" style="border: 2px solid black; border-radius: 10px;">

## Applying SCD Principles to Facts
- Fact table usually **`append-only`** (Type 0)
- Can leverage event and processing times for append-only history

<img src="./images/02_Streamline_ETL_With_DLT/SCD_Principles_To_Facts.png" alt="SCD_Principles_To_Facts" style="border: 2px solid black; border-radius: 10px;">

# Streaming Joins and Statefulness

## The Components of a Stateful Stream

<img src="./images/02_Streamline_ETL_With_DLT/Components_Stateful_Stream.png" alt="Components_Stateful_Stream" style="border: 2px solid black; border-radius: 10px;">

## Statefulness vs. Query Progress
- Some operations are specifically **`stateful`** in that the results of processing **earlier records** from the stream **_affect_** the processing of **later records**.
  - Examples include deduplication, aggregation, and stream-stream joins
- Other transformations just need to **_store incremental_** query progress and are **`not stateful`**.
  - Examples include simple transformations and stream-static joins
- **Progress** and **state** are **_stored_ in `checkpoints`** and managed by the driver during query processing.
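
The difference can be seen in a tiny plain-Python sketch that treats lists as micro-batches. `to_upper` and `Deduplicator` are invented names; the `seen` set plays the role of the state a checkpoint would persist.

```python
def to_upper(batch):
    """Stateless: the output depends only on the current micro-batch."""
    return [value.upper() for value in batch]

class Deduplicator:
    """Stateful: earlier records decide whether later ones are emitted."""
    def __init__(self):
        self.seen = set()  # state carried across micro-batches

    def process(self, batch):
        out = []
        for value in batch:
            if value not in self.seen:
                self.seen.add(value)
                out.append(value)
        return out

micro_batches = [["a", "b"], ["b", "c"]]
dedup = Deduplicator()
print([to_upper(b) for b in micro_batches])       # [['A', 'B'], ['B', 'C']]
print([dedup.process(b) for b in micro_batches])  # [['a', 'b'], ['c']]
```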

## Stream-Static Joins
Using **Dimension Tables** in Incremental Updates
- Delta Lake enables **dynamic stream-static joins**
- Each **`micro-batch`** captures the **most recent state** of the Delta table that is the static side of the join
  - This does **`not occur`** if the static side of the join is another format such as **Parquet**
- Allows modification of the dimension while maintaining downstream composability
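
A minimal plain-Python sketch of this behavior follows, assuming an inner join and using hypothetical names (`join_batch`, `items`, `item_id`). Each call builds its lookup from the dimension as it exists at that moment, which mimics a micro-batch reading the latest version of the static side.

```python
def join_batch(batch, dimension, key):
    lookup = {row[key]: row for row in dimension}  # snapshot taken for this batch
    # Inner join: events without a matching dimension row are dropped.
    return [{**event, **lookup[event[key]]} for event in batch if event[key] in lookup]

items = [{"item_id": 1, "category": "Food"}]

out1 = join_batch([{"item_id": 1, "qty": 2}, {"item_id": 2, "qty": 1}], items, "item_id")
items.append({"item_id": 2, "category": "Toys"})  # dimension changes between batches
out2 = join_batch([{"item_id": 2, "qty": 5}], items, "item_id")

print(out1)  # [{'item_id': 1, 'qty': 2, 'category': 'Food'}]
print(out2)  # [{'item_id': 2, 'qty': 5, 'category': 'Toys'}]
```

The event for `item_id` 2 in the first batch went unmatched and is not revisited once the dimension row appears.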

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> Because Delta Lake does not enforce foreign key constraints, it is possible that joined data will go unmatched.

## Streaming Queries are Not Stateful
**1. Each input row is processed only once**
- **A change** to a **`streaming live`** table's definition **_does not recompute_** results by default:

<img src="./images/02_Streamline_ETL_With_DLT/Not_Stateful_1.png" alt="Not_Stateful_1" style="border: 2px solid black; border-radius: 10px;">

**2. Enrich data by joining with an up-to-date snapshot stored in Delta**
- **A change** to **`joined table snapshot`** **_does not recompute_** results by default:

<img src="./images/02_Streamline_ETL_With_DLT/Not_Stateful_2.png" alt="Not_Stateful_2" style="border: 2px solid black; border-radius: 10px;">

## Clear State in DLT
Perform **`backfills`** after critical changes using **full refresh**
- **Full refresh** clears the table’s data and the query’s state, **_reprocessing_ `all the data`**.
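
Why a backfill needs a full refresh can be shown with a toy plain-Python model. The `StreamingTable` class, its `offset` progress marker, and the multiply-by-ten transform are assumptions made for illustration, not DLT internals.

```python
class StreamingTable:
    def __init__(self, transform):
        self.transform = transform
        self.rows = []
        self.offset = 0  # query progress: how much of the source has been read

    def update(self, source):
        # Incremental update: each input row is processed only once.
        self.rows.extend(self.transform(x) for x in source[self.offset:])
        self.offset = len(source)

    def full_refresh(self, source):
        self.rows, self.offset = [], 0  # clear the data and the query progress
        self.update(source)

events = [1, 2]
table = StreamingTable(lambda x: x * 10)
table.update(events)

table.transform = lambda x: x * 100  # the table definition changes
events.append(3)
table.update(events)
after_update = list(table.rows)      # old rows keep the old logic

table.full_refresh(events)
after_refresh = list(table.rows)     # everything recomputed with the new logic

print(after_update)   # [10, 20, 300]
print(after_refresh)  # [100, 200, 300]
```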

<img src="./images/02_Streamline_ETL_With_DLT/Clear_State_DLT.png" alt="Clear_State_DLT" style="border: 2px solid black; border-radius: 10px;">

## Stream-Static Join & Merge
Join driven by streaming data
- Join triggers shuffle
- Join itself is stateless
- Control state information with predicate
- Goal is to broadcast static table to streaming data
- Broadcasting puts all data on each node

```python
# 1. Main input stream
# `spark` is the SparkSession that Databricks notebooks provide
salesSDF = (
    spark
    .readStream
    .format("delta")
    .table("sales")
)

# 2. Join item category lookup
itemSalesSDF = (
    salesSDF
    .join(
        spark.table("items").filter("category='Food'"),  # Predicate
        on=["item_id"],
    )
)
```