# 6 - Change Data Capture with AUTO CDC and Slowly Changing Dimensions (SCD) Type 1

##### NOTE: The AUTO CDC APIs replace the APPLY CHANGES APIs, and have the same syntax. The APPLY CHANGES APIs are still available, but Databricks recommends using the AUTO CDC APIs in their place.

In this demonstration, we continue building our pipeline by ingesting **customer** data. The customer data includes new customers, customers who have deleted their accounts, and customers who have updated their information (such as address or email). We will build the customer pipeline by implementing change data capture (CDC) using SCD Type 1 (Type 2 is outside the scope of this course).

The customer pipeline works as follows:

- The bronze table uses **Auto Loader** to ingest JSON data from cloud object storage with SQL (`FROM STREAM`).
- A table is defined to enforce constraints before passing records to the silver layer.
- `AUTO CDC` is used to automatically process CDC data into the silver layer as an SCD Type 1 table.
- A gold table is defined as a materialized view of the current customers, reflecting dropped customers, new customers, and updated customer information.
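
To make the SCD Type 1 behavior concrete, here is a minimal plain-Python sketch of what "process CDC data as Type 1" means: events are applied in sequence order, inserts and updates overwrite the row for a key, and deletes remove it, so no history is kept. This is only a conceptual model, not how `AUTO CDC` is implemented. The `customer_id` key, the `NEW`/`UPDATE`/`DELETE` labels, the integer timestamps, and the sample rows are all assumptions made for illustration; check the actual values in your source files.

```python
from typing import Dict, Iterable

# Hypothetical operation label; verify against the `operation` column in your data.
DELETE_OP = "DELETE"

def apply_scd_type_1(target: Dict[str, dict], events: Iterable[dict]) -> Dict[str, dict]:
    """Apply CDC events to `target` in timestamp order, keeping one row per key."""
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = event["customer_id"]  # assumed key column
        if event["operation"] == DELETE_OP:
            # Type 1: a delete removes the current row; no history is retained.
            target.pop(key, None)
        else:
            # Insert or update: overwrite the row, dropping the CDC metadata columns.
            target[key] = {
                k: v for k, v in event.items() if k not in ("operation", "timestamp")
            }
    return target

# Made-up sample events, deliberately listed out of order.
events = [
    {"customer_id": "c1", "name": "Ana", "email": "ana@new.example", "operation": "UPDATE", "timestamp": 3},
    {"customer_id": "c1", "name": "Ana", "email": "ana@old.example", "operation": "NEW", "timestamp": 1},
    {"customer_id": "c2", "name": "Bo", "email": "bo@example.com", "operation": "NEW", "timestamp": 2},
    {"customer_id": "c2", "name": None, "email": None, "operation": "DELETE", "timestamp": 4},
]

silver = apply_scd_type_1({}, events)
print(silver)
```

Only customer `c1` remains, carrying the most recent email. The earlier email and the deleted customer `c2` are gone, which is the defining trade-off of Type 1: the table always reflects the current state and nothing else.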



### Learning Objectives

By the end of this lesson, students should be able to:
- Apply the `AUTO CDC` operation in Lakeflow Spark Declarative Pipelines to process change data capture (CDC) by integrating and updating incoming data from a source stream into an existing Delta table, ensuring data accuracy and consistency.
- Analyze Slowly Changing Dimensions (SCD Type 1) tables within Lakeflow Spark Declarative Pipelines to effectively update, insert and drop customers in dimensional data, managing the state of records over time using appropriate keys, versioning, and timestamps.


## A. Environment Configuration

Set up the environment variables for this course.


In [0]:
# Define catalog, schema, and volume paths
CATALOG_NAME = 'ldp_demo'
SCHEMA_NAME = 'ldp_schema'
VOLUME_NAME = 'raw'
WORKING_DIR = f'/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/{VOLUME_NAME}'

print(f'Catalog: {CATALOG_NAME}')
print(f'Schema: {SCHEMA_NAME}')
print(f'Volume Path: {WORKING_DIR}')


## B. Explore the Customer Data Source Files

1. Run the cell below to programmatically view the files in your `/Volumes/ldp_demo/ldp_schema/raw/customers` volume. Confirm you only see one **00.json** file for customers.


In [0]:
spark.sql(f'LIST "{WORKING_DIR}/customers"').display()


2. Run the query below to explore the customers **00.json** file. Note the following:

   a. The file contains customer information.

   b. It includes general customer information such as **email**, **name**, and **address**.

   c. The **timestamp** column specifies the logical order of customer events in the source data.

   d. The **operation** column indicates whether the entry is for a new customer, a deletion, or an update.


In [0]:
%sql
SELECT *
FROM read_files(
  '/Volumes/ldp_demo/ldp_schema/raw/customers/00.json',
  format => "JSON"
)
ORDER BY operation;


## C. Change Data Capture with AUTO CDC APIs in Lakeflow Spark Declarative Pipelines

1. Create your starter Spark Declarative Pipeline for this demonstration. The pipeline should be configured with:
    - Your default catalog: `ldp_demo`
    - Your default schema: `ldp_schema`
    - Your configuration parameter: `source` = `/Volumes/ldp_demo/ldp_schema/raw`
    - Source folders: `orders`, `status`, `customers`

2. Complete the following steps to open the starter Spark Declarative Pipeline project:

   a. In the main navigation bar, right-click on **Jobs & Pipelines** and select **Open Link in New Tab**.

   b. In **Jobs & Pipelines** select your pipeline (or create a new one).

   c. **REQUIRED:** At the top near your pipeline name, turn on **New pipeline monitoring**.

   d. In the **Pipeline details** pane on the far right, select **Open in Editor** to open the pipeline in the **Lakeflow Pipeline Editor**.

   e. In the new tab you should see folders: **orders**, **status**, and **customers**.

3. Explore the code in the `customers/customers_pipeline.sql` file. Follow the instructional comments in the file to proceed.


## D. Land New Data to Your Data Source Volume

1. Run the cell below to land a new JSON file to each volume (**customers**, **status** and **orders**) to simulate new files being added to your cloud storage locations.


In [0]:
def copy_files(copy_from: str, copy_to: str, n: int, sleep=2):
    import os
    import time

    print(f"\n----------------Loading files to volume: '{copy_to}'----------------")

    if os.path.exists(copy_from):
        list_of_files_to_copy = sorted(os.listdir(copy_from))
        total_files = len(list_of_files_to_copy)
    else:
        print(f'Source directory {copy_from} does not exist.')
        return

    if os.path.exists(copy_to):
        list_of_files_in_dest = os.listdir(copy_to)
    else:
        list_of_files_in_dest = []

    assert total_files >= n, f"Source location contains only {total_files} files, but you specified {n} files to copy."

    counter = 1
    for file in list_of_files_to_copy:
        if file in list_of_files_in_dest:
            print(f'File number {counter} - {file} already exists. Skipping.')
        else:
            file_to_copy = f'{copy_from}/{file}'
            copy_file_to = f'{copy_to}/{file}'
            print(f'File number {counter} - Copying {file_to_copy} --> {copy_file_to}')
            dbutils.fs.cp(file_to_copy, copy_file_to, recurse=True)
            time.sleep(sleep)

        if counter == n:
            break
        counter += 1

# Copy files for multiple sources
try:
    copy_files(
        copy_from='/Volumes/dbacademy_retail/v01/retail-pipeline/customers/stream_json',
        copy_to=f'{WORKING_DIR}/customers',
        n=2,
        sleep=1
    )
    copy_files(
        copy_from='/Volumes/dbacademy_retail/v01/retail-pipeline/orders/stream_json',
        copy_to=f'{WORKING_DIR}/orders',
        n=2,
        sleep=1
    )
    copy_files(
        copy_from='/Volumes/dbacademy_retail/v01/retail-pipeline/status/stream_json',
        copy_to=f'{WORKING_DIR}/status',
        n=2,
        sleep=1
    )
except Exception as e:
    print(f'Note: Could not copy from dbacademy_retail. Error: {e}')
    print('Please manually add JSON files to the volumes.')


2. Run the cell below to programmatically view the files in your `/Volumes/ldp_demo/ldp_schema/raw/customers` volume. Confirm your volume now contains the original **00.json** file and the new **01.json** file.


In [0]:
spark.sql(f'LIST "{WORKING_DIR}/customers"').display()


3. Go back to your pipeline and click the **Run pipeline** button to ingest the new JSON file (**01.json**) incrementally and perform CDC SCD Type 1 on the **scd_type_1_customers_silver_demo6** table.
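
The idea of ingesting "incrementally" can be sketched in plain Python: the pipeline remembers which files it has already processed and picks up only the ones it has not seen. The snippet below is a simplified conceptual model, not Auto Loader's actual implementation; the `discover_new_files` helper and the in-memory `processed` set are hypothetical stand-ins for the state the pipeline tracks for you.

```python
from typing import Iterable, List, Set

def discover_new_files(listing: Iterable[str], processed: Set[str]) -> List[str]:
    """Return files that have not been processed yet and record them as processed."""
    new_files = sorted(f for f in listing if f not in processed)
    processed.update(new_files)
    return new_files

processed: Set[str] = set()  # hypothetical stand-in for the pipeline's ingestion state

first_run = discover_new_files(["00.json"], processed)              # initial run
second_run = discover_new_files(["00.json", "01.json"], processed)  # after landing 01.json
third_run = discover_new_files(["00.json", "01.json"], processed)   # nothing new landed

print(first_run, second_run, third_run)
```

In this model the second run touches only **01.json**, and a third run with no new files does no ingestion work.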


## E. Explore the Customers Pipeline Tables

1. Run the query below to view the **scd_type_1_customers_silver_demo6** streaming table (the table with SCD Type 1 updates, inserts and deletes).


In [0]:
%sql
SELECT *
FROM ldp_demo.ldp_schema.scd_type_1_customers_silver_demo6;
