
# Full demo: Change Data Capture on multiple tables
## Use-case: Synchronize all your ELT tables with your Lakehouse

Real-world use cases typically involve multiple tables that we need to ingest and keep in sync.

These tables are stored in separate folders, with the following layout:

<img width="1000px" src="https://github.com/databricks-demos/dbdemos-resources/raw/main/images/product/Delta-Lake-CDC-CDF/cdc-full.png">

In [0]:
%run ./_resources/00-setup $reset_all_data=false

## Running the streams in parallel

Each source table will be saved as a distinct table, using its own Spark Structured Streaming stream.

To implement an efficient pipeline, we should process multiple streams at the same time. To do that, we'll use a `ThreadPoolExecutor` and start multiple threads, each of them running one stream and waiting for it to finish.

We're using a run-once trigger (`availableNow=True` in the code below) to refresh all the tables once and then shut down the cluster, typically every hour. For lower latencies we can keep the streams running (depending on the number of tables and the cluster size), or keep the run-once trigger and loop forever.
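
The "loop forever" variant can be sketched as a small driver that re-runs a full refresh and pauses between passes. This is only an illustration: `run_refresh_loop`, `refresh_all`, `pause_seconds` and `max_runs` are hypothetical names that are not part of this demo, and `max_runs` exists mainly so the loop can be bounded while testing.

```python
import time
from typing import Callable, Optional

def run_refresh_loop(refresh_all: Callable[[], None],
                     pause_seconds: float = 60.0,
                     max_runs: Optional[int] = None) -> int:
    """Call refresh_all repeatedly, pausing between runs; return the number of runs."""
    runs = 0
    while max_runs is None or runs < max_runs:
        refresh_all()  # e.g. one run-once refresh of every table
        runs += 1
        # Only pause if another run is going to follow
        if max_runs is None or runs < max_runs:
            time.sleep(pause_seconds)
    return runs
```

With `max_runs=None` the loop never returns, which matches the "loop forever" option; the pause controls how stale the tables can get between two passes.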

## Schema evolution

By organizing the raw incoming CDC files with one folder per table, we can easily iterate over the folders and pick up any new table without modifying the code.
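
As a rough illustration of the folder-per-table convention, the sketch below derives table names from a directory listing. It assumes, as `dbutils.fs.ls` does, that directory entries have a `name` ending with `/`. The `FileInfo` tuple is a simplified stand-in, and the paths and the `table_names` helper are made up for the example.

```python
from collections import namedtuple

# Simplified stand-in for the entries returned by dbutils.fs.ls (assumption)
FileInfo = namedtuple("FileInfo", ["path", "name"])

def table_names(entries):
    """Return one table name per sub-folder, ignoring plain files."""
    return sorted(e.name.rstrip("/") for e in entries if e.name.endswith("/"))

# Hypothetical listing of the raw CDC folder
listing = [
    FileInfo("dbfs:/demo/cdc/users/", "users/"),
    FileInfo("dbfs:/demo/cdc/transactions/", "transactions/"),
    FileInfo("dbfs:/demo/cdc/_SUCCESS", "_SUCCESS"),
]
print(table_names(listing))  # ['transactions', 'users']
```

Dropping a new folder into the raw location is then enough for the next run to pick up the new table.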

Schema evolution is handled by Auto Loader and the Delta `mergeSchema` option at the bronze layer. Schema evolution for MERGE (silver layer) is supported through the `spark.databricks.delta.schema.autoMerge.enabled` setting.

*Note: Auto Loader will trigger an error in a stream when a schema change happens, and will automatically recover during the next run. See the Auto Loader demo for a complete example.*

*Note: another common pattern is to redirect all the CDC events to a single message queue (the table name being a message attribute), and then dispatch the messages to the different silver tables.*
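
The single-queue pattern is not implemented in this notebook. As a minimal sketch of the dispatch step only, the snippet below groups events by a table attribute using plain Python data. The message layout (`table` and `payload` keys), the sample events and the `dispatch_by_table` helper are assumptions made for illustration.

```python
from collections import defaultdict

def dispatch_by_table(messages):
    """Group CDC payloads by target table, keeping arrival order within each table."""
    per_table = defaultdict(list)
    for message in messages:
        per_table[message["table"]].append(message["payload"])
    return dict(per_table)

# Hypothetical events read from a shared queue
queue = [
    {"table": "users", "payload": {"id": 1, "operation": "UPDATE"}},
    {"table": "transactions", "payload": {"id": 10, "operation": "UPDATE"}},
    {"table": "users", "payload": {"id": 2, "operation": "DELETE"}},
]
batches = dispatch_by_table(queue)
print(sorted(batches))  # ['transactions', 'users']
```

Each group could then be merged into its own silver table, in the same way as the per-folder streams in this notebook.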

In [0]:
# Explore our raw cdc data:
base_folder = f"{raw_data_location}/cdc"
display(dbutils.fs.ls(base_folder))

In [0]:
# Reset all checkpoints and inferred schemas (both are stored under cdc_full):
dbutils.fs.rm(f"{raw_data_location}/cdc_full", True)

## Bronze ingestion with Auto Loader

In [0]:
from pyspark.sql import functions as F  # may already be imported by the setup notebook

def update_bronze_layer(path, bronze_table):
    print(f"Ingesting RAW cdc data for {bronze_table} and building bronze layer...")

    (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("cloudFiles.schemaLocation", f"{raw_data_location}/cdc_full/schemas/{bronze_table}")
            .option("cloudFiles.schemaHints", "id BIGINT, operation_date TIMESTAMP")
            .option("cloudFiles.inferColumnTypes", "true")
            .load(path)
          .withColumn("file_name", F.col("_metadata.file_path"))
          .writeStream
            .option("checkpointLocation", f"{raw_data_location}/cdc_full/checkpoints/{bronze_table}")
            .option("mergeSchema", "true")
            .trigger(availableNow=True)
            .table(bronze_table).awaitTermination())

## Materializing silver tables with MERGE based on CDC events

In [0]:
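from pyspark.sql.window import Window  # may already be imported by the setup notebook
from delta.tables import DeltaTable

def update_silver_layer(bronze_table, silver_table):
  print(f"Ingesting {bronze_table} updates and materializing silver layer using a MERGE statement...")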

  # First, create the silver table if it doesn't exist:
  if not spark.catalog.tableExists(silver_table):
    print(f"Table {silver_table} doesn't exist, creating it using the same schema as the bronze one...")
    (spark.read.table(bronze_table)
            .drop("operation", "operation_date", "_rescued_data", "file_name")
          .write.saveAsTable(silver_table))
  
  # For each batch/incremental update from the raw cdc table, we'll run a MERGE on the silver table:
  def merge_stream(updates, i):
    # Deduplicate based on the id and take the most recent update:
    windowSpec = Window.partitionBy("id").orderBy(F.col("operation_date").desc())

    updates_deduplicated = (updates.withColumn("rnk", F.row_number().over(windowSpec))
                                   .where("rnk = 1")
                                   .drop("rnk", "operation_date", "_rescued_data", "file_name"))
    
    # Map each silver column to its value in the updates, skipping the "operation" field:
    columns_silver = {c: f"s.{c}" for c in spark.read.table(silver_table).columns if c != "operation"}

    # Run the merge in the silver table directly:
    (DeltaTable.forName(spark, silver_table).alias("t")
        .merge(updates_deduplicated.alias("s"), "s.id = t.id")
        .whenMatchedDelete("s.operation = 'DELETE'")
        .whenMatchedUpdate("s.operation != 'DELETE'", set=columns_silver)
        .whenNotMatchedInsert("s.operation != 'DELETE'", values=columns_silver)
        .execute())
  
  (spark.readStream
          .table(bronze_table)
        .writeStream
          .foreachBatch(merge_stream)
          .option("checkpointLocation", f"{raw_data_location}/cdc_full/checkpoints/{silver_table}")
          .trigger(availableNow=True)
          .start().awaitTermination())
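
To make the deduplicate-then-merge semantics easier to follow, here is a plain-Python sketch that applies the same rules to a dictionary keyed by `id`. It is only an illustration of the logic, not the Delta implementation: `apply_cdc_batch` and the sample rows are made up, and the `'UPDATE'` label is an assumption (the merge above treats every value other than `'DELETE'` as an upsert).

```python
def apply_cdc_batch(silver, updates):
    """Apply a batch of CDC events to `silver`, a dict of rows keyed by id."""
    # Keep only the most recent event per id (mirrors row_number() ... rnk = 1)
    latest = {}
    for event in updates:
        current = latest.get(event["id"])
        if current is None or event["operation_date"] > current["operation_date"]:
            latest[event["id"]] = event

    for key, event in latest.items():
        if event["operation"] == "DELETE":
            # Matched rows are deleted; a DELETE for an unknown id is ignored
            silver.pop(key, None)
        else:
            # Matched rows are updated, unknown ids are inserted;
            # the CDC metadata columns are not stored in the silver row
            silver[key] = {k: v for k, v in event.items()
                           if k not in ("operation", "operation_date")}
    return silver

silver = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}}
updates = [
    {"id": 1, "name": "a2", "operation": "UPDATE", "operation_date": "2024-01-02"},
    {"id": 1, "name": "a1", "operation": "UPDATE", "operation_date": "2024-01-01"},
    {"id": 2, "name": "b", "operation": "DELETE", "operation_date": "2024-01-03"},
    {"id": 3, "name": "c", "operation": "UPDATE", "operation_date": "2024-01-01"},
    {"id": 4, "name": "d", "operation": "DELETE", "operation_date": "2024-01-01"},
]
print(apply_cdc_batch(silver, updates))
# {1: {'id': 1, 'name': 'a2'}, 3: {'id': 3, 'name': 'c'}}
```

Deduplicating first matters because a MERGE cannot resolve several source rows that match the same target row, so each micro-batch must carry at most one event per `id`.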

## Starting all the streams

We can now iterate over the folders to start the bronze & silver streams for each table.

In [0]:
from concurrent.futures import ThreadPoolExecutor
from collections import deque

def refresh_cdc_table(table):
    try:
        # Update the bronze table:
        bronze_table = f"bronze_{table}"
        update_bronze_layer(f"{base_folder}/{table}", bronze_table)

        # Update the silver table:
        silver_table = f"silver_{table}"
        update_silver_layer(bronze_table, silver_table)
    except Exception:
        # Use `table` here: it is always defined, even if the failure happens early
        print(f"Couldn't properly process {table}")
        raise

# Enable Schema evolution during merges (to capture new columns):
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# List all the table folders (folder names end with "/", hence the [:-1]):
tables = [table_path.name[:-1] for table_path in dbutils.fs.ls(base_folder)]

# Start 3 CDC flows at the same time in 3 different threads to speed up ingestion:
with ThreadPoolExecutor(max_workers=3) as executor:
    deque(executor.map(refresh_cdc_table, tables))
    print("Database refreshed!")
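
`executor.map` re-raises the first exception when its results are consumed, so with the cell above a single failing table aborts the refresh and tables still waiting in the queue may never be processed. If per-table isolation is preferred, one possible alternative is to submit each table separately and collect failures. The sketch below is a hedged variant, not the demo's method: `refresh_all` and its `refresh_one` parameter are hypothetical names, with `refresh_one` standing in for `refresh_cdc_table`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def refresh_all(tables, refresh_one, max_workers=3):
    """Run refresh_one(table) for every table; return (results, failures) dicts."""
    done, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which table each future belongs to
        futures = {executor.submit(refresh_one, t): t for t in tables}
        for future in as_completed(futures):
            table = futures[future]
            try:
                done[table] = future.result()
            except Exception as exc:
                # Record the failure and keep processing the other tables
                failed[table] = exc
    return done, failed
```

A caller can then log or retry the entries in `failed` while the tables in `done` are already up to date.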

In [0]:
%sql
SELECT * FROM bronze_users;

In [0]:
%sql
SELECT * FROM silver_users;

In [0]:
%sql
SELECT * FROM bronze_transactions;

In [0]:
%sql
SELECT * FROM silver_transactions;

In [0]:
DBDemos.stop_all_streams()