

# 2.4 DEMO: Implementing Change Data Feed (CDF) with Delta Sharing \[Recipient]

## Overview
In this demo, you will process incremental changes from a shared table using Change Data Feed (CDF). You'll learn how to efficiently track and apply only the changes (inserts, updates, deletes) using two different Slowly Changing Dimension (SCD) patterns.

**Your Role:** Regional Division (West Division) of Acme Corp

**Scenario:**
You work in the West Division of Acme Corp, operating in a different region with a separate Databricks workspace and Unity Catalog metastore. The central corporate division (East Division) is sharing customer and sales data with you via Delta Sharing. Instead of receiving full table snapshots, you'll use Change Data Feed to incrementally process only the data that has changed.

You'll implement two different patterns:
1. **Type 1 SCD**: Maintain current state only (overwrite changes)
2. **Type 2 SCD**: Maintain full history with effective dates

**Learning Objectives:**
By the end of this demo, you will:
1. Mount a Databricks-to-Databricks (D2D) share containing CDF-enabled tables
2. Read and understand CDF metadata columns
3. Implement Type 1 SCD using MERGE INTO (current state)
4. Implement Type 2 SCD with historical tracking
5. Track processed versions for incremental updates
6. Understand use cases for each SCD pattern

## Background: SCD Patterns with Change Data Feed

### What You'll Learn

Change Data Feed (CDF) enables you to:
- **Track row-level changes** between table versions
- **Process only changes**, not entire tables
- **Maintain synchronized copies** efficiently
- **Build incremental ETL pipelines** within your organization

### CDF Metadata Columns

When you read a table's change data feed, each row carries these additional metadata columns:

| Column | Type | Description |
|--------|------|-------------|
| `_change_type` | String | `insert`, `update_preimage`, `update_postimage`, `delete` |
| `_commit_version` | Long | Table version where change occurred |
| `_commit_timestamp` | Timestamp | When the change was committed |
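
To make these columns concrete, here is a minimal Python sketch of what the feed might contain for a single customer who is inserted, updated, and then deleted. The rows and values are hypothetical (they are not taken from the shared table), and `_commit_timestamp` is left out for brevity. The point to notice is that one update produces two rows, `update_preimage` and `update_postimage`, that share a `_commit_version`.

```python
# Hypothetical CDF rows for one customer (illustrative values only).
cdf_rows = [
    {"customer_id": 1, "customer_segment": "Standard",
     "_change_type": "insert", "_commit_version": 1},
    {"customer_id": 1, "customer_segment": "Standard",
     "_change_type": "update_preimage", "_commit_version": 2},
    {"customer_id": 1, "customer_segment": "Premium",
     "_change_type": "update_postimage", "_commit_version": 2},
    {"customer_id": 1, "customer_segment": "Premium",
     "_change_type": "delete", "_commit_version": 3},
]

# Change types that describe the new state of a record.
# update_preimage only carries the values from before the update.
APPLICABLE = {"insert", "update_postimage", "delete"}
to_apply = [row for row in cdf_rows if row["_change_type"] in APPLICABLE]

# Group change types by commit version to see how an update appears in the feed.
by_version = {}
for row in cdf_rows:
    by_version.setdefault(row["_commit_version"], []).append(row["_change_type"])

print(by_version)
print([row["_change_type"] for row in to_apply])
```

Filtering out `update_preimage` is the same choice the SQL cells in this demo make when they select the changes to apply.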

### SCD Type 1 vs Type 2

**Type 1 SCD (Current State Only)**
- Maintains only the latest version of each record
- Updates overwrite previous values
- No historical tracking
- Use case: When you only need current state (e.g., current customer segment, current contact info)
- Simple, efficient, smaller storage footprint

**Type 2 SCD (Historical Tracking)**
- Maintains full history of changes
- New row for each change with effective dates
- Historical analysis capability
- Use case: When you need to track changes over time (e.g., customer segment changes for analysis, price history)
- More complex, larger storage, enables point-in-time ("as of") analysis

In [0]:
%run ./Includes/Demo-Setup-2

## Prerequisites

- Central division (East) has created the share `cdf_retail` and granted you access
- Change Data Feed is enabled on the source tables
- You have your metastore ID ready (for D2D sharing)
- You have permissions to create catalogs and tables in your workspace

## Step 1: Mount the Share (D2D)

First, let's mount the CDF-enabled share from the central division. Since this is Databricks-to-Databricks sharing within the same organization, the provider (East Division) will have used your metastore ID to configure the recipient.

In [0]:
-- View your metastore ID (this was shared with East Division)
SELECT global_metastore_id FROM system.information_schema.metastores;

In [0]:
-- View available providers (should see acme_corp from East Division)
SHOW PROVIDERS;

In [0]:
-- View available shares from the East Division provider
SHOW SHARES IN PROVIDER acme_corp;

In [0]:
-- Mount the share to a catalog in West Division
CREATE CATALOG IF NOT EXISTS east_division_shared
USING SHARE acme_corp.cdf_retail
COMMENT 'Data shared from East Division via Delta Sharing';

In [0]:
-- Verify the mounted catalog (should show type DELTASHARING)
DESCRIBE CATALOG east_division_shared;

In [0]:
-- List schemas and tables in the shared catalog
SHOW SCHEMAS IN east_division_shared;

In [0]:
-- List tables in the retail schema
SHOW TABLES IN east_division_shared.retail;

## Step 2: Explore the Shared Data

Let's examine the shared tables and their change data feed.

In [0]:
-- View current customer data (current snapshot from East Division)
SELECT * FROM east_division_shared.retail.customers 
ORDER BY customer_id
LIMIT 10;

In [0]:
-- Query the change data feed to see all changes
-- This shows inserts, updates (pre and post), and deletes
SELECT 
  _change_type,
  _commit_version,
  _commit_timestamp,
  customer_id,
  customer_name,
  customer_segment
FROM table_changes('east_division_shared.retail.customers', 0)
ORDER BY _commit_version, customer_id, _change_type
LIMIT 20;

## Step 3: Create Target Catalog and Schema (West Division)

Create a local catalog and schema where we'll maintain our copies of the data.

In [0]:
-- Create a catalog for West Division's replicated data
CREATE CATALOG IF NOT EXISTS west_division
COMMENT 'West Division catalog for incrementally replicated data from East';

In [0]:
-- Create a schema
CREATE SCHEMA IF NOT EXISTS west_division.retail_data
COMMENT 'Schema for retail data synchronized via CDF from East Division';

USE west_division.retail_data;

## Part 1: Type 1 SCD - Current State Only

In this section, we'll implement a **Type 1 Slowly Changing Dimension** pattern. This maintains only the current state of each record, with updates overwriting previous values.

### Type 1 SCD Characteristics:
- ✅ Simple to implement and understand
- ✅ Efficient - one row per customer
- ✅ Fast queries - no date filtering needed
- ❌ No historical tracking
- ❌ Can't answer "what was the value on date X?"

### Use Cases:
- Current customer contact information
- Latest account status
- Current product prices (when history not needed)
- Reference data that changes infrequently

## Step 4: Create Type 1 SCD Target Table

In [0]:
-- Drop table if it exists (for demo purposes)
DROP TABLE IF EXISTS west_division.retail_data.customers_type1;

In [0]:
-- Create Type 1 SCD table (current state only)
CREATE TABLE west_division.retail_data.customers_type1 (
  customer_id INT,
  customer_name STRING,
  email STRING,
  country STRING,
  signup_date DATE,
  customer_segment STRING
)
USING DELTA
COMMENT 'Type 1 SCD: Current state only - updates overwrite previous values';

## Step 5: Create the Watermark Table

In [0]:
-- Drop watermark table if exists (for demo)
DROP TABLE IF EXISTS west_division.retail_data.cdf_watermark;

In [0]:
-- Create watermark table to track processed versions
CREATE TABLE west_division.retail_data.cdf_watermark (
  table_name STRING,
  last_processed_version BIGINT,
  last_processed_timestamp TIMESTAMP,
  updated_at TIMESTAMP
)
USING DELTA;

In [0]:
-- Initialize watermark with version -1 (to process from version 0)
INSERT INTO west_division.retail_data.cdf_watermark VALUES
  ('customers_type1', -1, NULL, current_timestamp());
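
The watermark only pays off when a later run reads it back. The sketch below models that bookkeeping in plain Python, under the assumption that each run reads the feed starting at `last_processed_version + 1` and then stores the highest version it saw. The helper names `next_starting_version` and `advance_watermark` are hypothetical and are not part of this demo's setup.

```python
def next_starting_version(last_processed_version):
    """Return the first table version that has not been processed yet."""
    return last_processed_version + 1


def advance_watermark(watermark, processed_versions):
    """Return an updated watermark; leave it unchanged if nothing was read."""
    if not processed_versions:
        return watermark
    return {**watermark, "last_processed_version": max(processed_versions)}


# Same starting point as the INSERT above: nothing processed yet.
watermark = {"table_name": "customers_type1", "last_processed_version": -1}

start = next_starting_version(watermark["last_processed_version"])
# Pretend this run read commits 0 through 3 (hypothetical versions).
watermark = advance_watermark(watermark, [0, 1, 2, 3])
next_start = next_starting_version(watermark["last_processed_version"])

print(start, next_start)
```

Starting at -1 means the first run begins at version 0 without a special case. Leaving the watermark untouched on an empty batch avoids overwriting it with a missing value.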

## Step 6: Process Changes for Type 1 SCD

Now let's process all changes from version 0 using MERGE INTO. For Type 1, we only care about the final state of each record.

In [0]:
-- Create a temporary view with the changes to apply for Type 1
-- For Type 1, we only need: inserts, update_postimage (final values), and deletes
-- MERGE fails when several source rows match one target row, so keep only the
-- most recent change per customer_id
CREATE OR REPLACE TEMP VIEW type1_changes AS
SELECT
  _change_type,
  _commit_version,
  _commit_timestamp,
  customer_id,
  customer_name,
  email,
  country,
  signup_date,
  customer_segment
FROM table_changes('east_division_shared.retail.customers', 0)
WHERE _change_type IN ('insert', 'update_postimage', 'delete')
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY _commit_version DESC) = 1;

In [0]:
-- Preview the changes
SELECT * FROM type1_changes 
ORDER BY _commit_version, customer_id, _change_type;

In [0]:
-- Apply changes using MERGE INTO (Type 1 SCD pattern)
MERGE INTO west_division.retail_data.customers_type1 AS target
USING type1_changes AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source._change_type = 'delete' THEN
  DELETE
WHEN MATCHED AND source._change_type != 'delete' THEN
  UPDATE SET
    target.customer_name = source.customer_name,
    target.email = source.email,
    target.country = source.country,
    target.signup_date = source.signup_date,
    target.customer_segment = source.customer_segment
WHEN NOT MATCHED AND source._change_type != 'delete' THEN
  INSERT (
    customer_id,
    customer_name,
    email,
    country,
    signup_date,
    customer_segment
  )
  VALUES (
    source.customer_id,
    source.customer_name,
    source.email,
    source.country,
    source.signup_date,
    source.customer_segment
  );
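
The Type 1 logic can also be modeled in a few lines of Python, which makes one requirement easy to see: a MERGE needs at most one source row per target row, so a batch that spans several versions has to be reduced to the latest change per key before it is applied. This is a sketch with hypothetical data and helper names (`latest_change_per_key`, `apply_type1`), using a dict keyed by `customer_id` in place of the target table.

```python
def latest_change_per_key(changes):
    """Keep only the most recent change for each customer_id."""
    latest = {}
    for change in changes:
        key = change["customer_id"]
        if key not in latest or change["_commit_version"] > latest[key]["_commit_version"]:
            latest[key] = change
    return list(latest.values())


def apply_type1(target, changes):
    """Upsert or delete rows in `target` (a dict keyed by customer_id)."""
    for change in latest_change_per_key(changes):
        key = change["customer_id"]
        if change["_change_type"] == "delete":
            target.pop(key, None)  # deleting a missing key is a no-op
        else:
            target[key] = {"customer_segment": change["customer_segment"]}
    return target


# Hypothetical batch: customer 1 is updated, customer 2 is deleted, customer 3 is new.
changes = [
    {"customer_id": 1, "customer_segment": "Standard", "_change_type": "insert", "_commit_version": 1},
    {"customer_id": 2, "customer_segment": "Standard", "_change_type": "insert", "_commit_version": 1},
    {"customer_id": 1, "customer_segment": "Premium", "_change_type": "update_postimage", "_commit_version": 2},
    {"customer_id": 2, "customer_segment": "Standard", "_change_type": "delete", "_commit_version": 3},
    {"customer_id": 3, "customer_segment": "Basic", "_change_type": "insert", "_commit_version": 3},
]

current = apply_type1({}, changes)
print(current)
```

Customer 2 never appears in the result because its latest change is a delete, which mirrors the `WHEN NOT MATCHED AND source._change_type != 'delete'` condition in the MERGE.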

In [0]:
-- Verify Type 1 data (current state only)
SELECT * FROM west_division.retail_data.customers_type1 
ORDER BY customer_id;

In [0]:
-- Update watermark for Type 1
MERGE INTO west_division.retail_data.cdf_watermark AS target
USING (
  SELECT 
    MAX(_commit_version) as max_version,
    MAX(_commit_timestamp) as max_timestamp
  FROM type1_changes
) AS source
ON target.table_name = 'customers_type1'
WHEN MATCHED THEN
  UPDATE SET
    target.last_processed_version = source.max_version,
    target.last_processed_timestamp = source.max_timestamp,
    target.updated_at = current_timestamp();

## Part 2: Type 2 SCD - Historical Tracking

Now let's implement a **Type 2 Slowly Changing Dimension** pattern. This maintains full history by creating new rows for each change with effective dates.

### Type 2 SCD Characteristics:
- ✅ Full historical tracking
- ✅ Can answer "what was the value on date X?"
- ✅ Point-in-time analytics enabled
- ✅ Audit trail of all changes
- ❌ More complex logic
- ❌ Larger storage footprint
- ❌ Queries need date filtering

### Use Cases:
- Customer segment changes over time (analyze customer lifecycle)
- Product price history (trend analysis)
- Employee role/department changes
- Compliance and audit requirements
- Historical reporting ("as of" queries)

### Type 2 Table Design:
We'll add these columns to track history:
- `effective_from`: When this version became active
- `effective_to`: When this version was superseded (NULL = current)
- `is_current`: Boolean flag for active record

## Step 7: Create Type 2 SCD Target Table

In [0]:
-- Drop table if exists (for demo)
DROP TABLE IF EXISTS west_division.retail_data.customers_type2;

In [0]:
-- Create Type 2 SCD table with history tracking columns
CREATE TABLE west_division.retail_data.customers_type2 (
  customer_id INT,
  customer_name STRING,
  email STRING,
  country STRING,
  signup_date DATE,
  customer_segment STRING,
  -- Type 2 SCD columns
  effective_from TIMESTAMP,
  effective_to TIMESTAMP,
  is_current BOOLEAN
)
USING DELTA
COMMENT 'Type 2 SCD: Full history with effective dates';

In [0]:
-- Initialize watermark for Type 2
INSERT INTO west_division.retail_data.cdf_watermark VALUES
  ('customers_type2', -1, NULL, current_timestamp());

## Step 8: Process Changes for Type 2 SCD

Type 2 SCD processing is more complex because we need to:
1. **Close out current records** (set effective_to, is_current = false) for updates and deletes
2. **Insert new versions** for inserts and updates
3. **NOT insert for deletes** (just close the current record)

We'll use a two-step approach: a MERGE that closes out the existing current rows, followed by an INSERT that adds the new versions.

In [0]:
-- Create view of changes for Type 2 processing
CREATE OR REPLACE TEMP VIEW type2_changes AS
SELECT 
  _change_type,
  _commit_version,
  _commit_timestamp,
  customer_id,
  customer_name,
  email,
  country,
  signup_date,
  customer_segment
FROM table_changes('east_division_shared.retail.customers', 0)
WHERE _change_type IN ('insert', 'update_postimage', 'delete');

In [0]:
-- Step 1: Close out current records for updates and deletes
-- Use the earliest update/delete per customer so that each target row matches
-- at most one source row (MERGE fails on multiple matches)
MERGE INTO west_division.retail_data.customers_type2 AS target
USING (
  SELECT
    customer_id,
    MIN(_commit_timestamp) AS first_change_timestamp
  FROM type2_changes
  WHERE _change_type IN ('update_postimage', 'delete')
  GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id AND target.is_current = true
WHEN MATCHED THEN
  UPDATE SET
    target.effective_to = source.first_change_timestamp,
    target.is_current = false;

In [0]:
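-- Step 2: Insert new versions for inserts and updates (NOT for deletes)
-- When a customer has several changes in this batch, each version is closed by
-- the next change (including a delete), so only the latest version stays current
INSERT INTO west_division.retail_data.customers_type2
SELECT
  customer_id,
  customer_name,
  email,
  country,
  signup_date,
  customer_segment,
  _commit_timestamp AS effective_from,
  next_change_timestamp AS effective_to,
  next_change_timestamp IS NULL AS is_current
FROM (
  SELECT
    *,
    LEAD(_commit_timestamp) OVER (
      PARTITION BY customer_id ORDER BY _commit_version
    ) AS next_change_timestamp
  FROM type2_changes
) AS ordered_changes
WHERE _change_type IN ('insert', 'update_postimage');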
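
The close-out and insert rules can be checked against a small in-memory model. The Python sketch below replays a hypothetical batch of changes in commit order; the function name `apply_type2`, the sample rows, and the string timestamps are all illustrative assumptions. It processes one change at a time rather than in two set-based statements, which is not how the SQL runs but makes the expected end state easy to verify.

```python
def apply_type2(history, changes):
    """Replay changes in commit order and maintain versioned rows in `history`."""
    for change in sorted(changes, key=lambda c: c["_commit_version"]):
        key = change["customer_id"]
        ts = change["_commit_timestamp"]
        # Updates and deletes close out the row that is current for this customer.
        if change["_change_type"] in ("update_postimage", "delete"):
            for row in history:
                if row["customer_id"] == key and row["is_current"]:
                    row["effective_to"] = ts
                    row["is_current"] = False
        # Inserts and updates add a new current row; deletes add nothing.
        if change["_change_type"] in ("insert", "update_postimage"):
            history.append({
                "customer_id": key,
                "customer_segment": change["customer_segment"],
                "effective_from": ts,
                "effective_to": None,
                "is_current": True,
            })
    return history


# Hypothetical batch: customer 1 is updated once, customer 2 is deleted.
changes = [
    {"customer_id": 1, "customer_segment": "Standard", "_change_type": "insert",
     "_commit_version": 1, "_commit_timestamp": "2025-01-01T00:00:00"},
    {"customer_id": 2, "customer_segment": "Standard", "_change_type": "insert",
     "_commit_version": 1, "_commit_timestamp": "2025-01-01T00:00:00"},
    {"customer_id": 1, "customer_segment": "Premium", "_change_type": "update_postimage",
     "_commit_version": 2, "_commit_timestamp": "2025-02-01T00:00:00"},
    {"customer_id": 2, "customer_segment": "Standard", "_change_type": "delete",
     "_commit_version": 3, "_commit_timestamp": "2025-03-01T00:00:00"},
]

history = apply_type2([], changes)
current_rows = [row for row in history if row["is_current"]]
print(len(history), len(current_rows))
```

A useful invariant to check after any Type 2 load is that each `customer_id` has at most one row with `is_current = true`, and that a deleted customer has none.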

In [0]:
-- Verify Type 2 data (should see multiple rows per customer if updated)
SELECT 
  customer_id,
  customer_name,
  customer_segment,
  effective_from,
  effective_to,
  is_current
FROM west_division.retail_data.customers_type2 
ORDER BY customer_id, effective_from;

In [0]:
-- Update watermark for Type 2
MERGE INTO west_division.retail_data.cdf_watermark AS target
USING (
  SELECT 
    MAX(_commit_version) as max_version,
    MAX(_commit_timestamp) as max_timestamp
  FROM type2_changes
) AS source
ON target.table_name = 'customers_type2'
WHEN MATCHED THEN
  UPDATE SET
    target.last_processed_version = source.max_version,
    target.last_processed_timestamp = source.max_timestamp,
    target.updated_at = current_timestamp();

## Step 9: Query Type 2 SCD Data

Let's see how to query Type 2 data for different use cases.

In [0]:
-- Query 1: Get current state only (equivalent to Type 1)
SELECT 
  customer_id,
  customer_name,
  customer_segment,
  effective_from as current_since
FROM west_division.retail_data.customers_type2
WHERE is_current = true
ORDER BY customer_id;

In [0]:
-- Query 2: Show full history for specific customers (here customer_id 1, 2, and 6)
SELECT 
  customer_id,
  customer_name,
  email,
  customer_segment,
  effective_from,
  effective_to,
  is_current,
  CASE 
    WHEN is_current THEN 'Current'
    ELSE 'Historical'
  END as record_status
FROM west_division.retail_data.customers_type2
WHERE customer_id IN (1, 2, 6)
ORDER BY customer_id, effective_from;

In [0]:
-- Query 3: Count current customers by segment, with the earliest and latest
-- effective_from among the current rows in each segment
SELECT 
  customer_segment,
  COUNT(*) as customer_count,
  MIN(effective_from) as first_seen,
  MAX(effective_from) as last_changed
FROM west_division.retail_data.customers_type2
WHERE is_current = true
GROUP BY customer_segment
ORDER BY customer_count DESC;

In [0]:
-- Query 4: Find customers whose segment changed (segment migration analysis)
WITH customer_history AS (
  SELECT
    customer_id,
    -- Take the most recent name so that a renamed customer stays in one group
    MAX_BY(customer_name, effective_from) as customer_name,
    COUNT(*) as version_count,
    COUNT(DISTINCT customer_segment) as segment_count
  FROM west_division.retail_data.customers_type2
  GROUP BY customer_id
)
SELECT 
  ch.customer_id,
  ch.customer_name,
  ch.version_count,
  ch.segment_count
FROM customer_history ch
WHERE ch.segment_count > 1
ORDER BY ch.version_count DESC;

## Step 10: Compare Type 1 vs Type 2

Let's compare the two approaches side by side.

In [0]:
-- Compare row counts
SELECT 'Type 1 SCD' as table_type, COUNT(*) as row_count 
FROM west_division.retail_data.customers_type1
UNION ALL
SELECT 'Type 2 SCD (all)' as table_type, COUNT(*) as row_count 
FROM west_division.retail_data.customers_type2
UNION ALL
SELECT 'Type 2 SCD (current only)' as table_type, COUNT(*) as row_count 
FROM west_division.retail_data.customers_type2
WHERE is_current = true;

In [0]:
-- Compare a specific customer in both tables
SELECT 
  'Type 1' as scd_type,
  customer_id,
  customer_name,
  customer_segment,
  NULL as effective_from,
  NULL as effective_to,
  true as is_current
FROM west_division.retail_data.customers_type1
WHERE customer_id = 2
UNION ALL
SELECT 
  'Type 2' as scd_type,
  customer_id,
  customer_name,
  customer_segment,
  effective_from,
  effective_to,
  is_current
FROM west_division.retail_data.customers_type2
WHERE customer_id = 2
ORDER BY scd_type, effective_from;

## Summary

Congratulations! You've successfully implemented both Type 1 and Type 2 SCD patterns using Change Data Feed with Delta Sharing in a D2D scenario.

✅ **What we accomplished:**

1. Mounted a D2D share with CDF-enabled tables from East Division
2. **Type 1 SCD (Current State)**:
   - Created simple target table
   - Applied changes with MERGE INTO
   - Maintained only current values
3. **Type 2 SCD (Historical Tracking)**:
   - Created table with effective dates
   - Closed out old records
   - Inserted new versions
   - Maintained full audit trail
4. Compared both approaches with example queries

### Decision Guide: Type 1 vs Type 2

**Use Type 1 when:**
- ✅ Current state is sufficient
- ✅ Storage efficiency is critical
- ✅ Query simplicity is preferred
- ✅ Historical analysis not needed
- Examples: current contact info, latest prices, reference data

**Use Type 2 when:**
- ✅ Historical analysis required
- ✅ Compliance/audit needs
- ✅ "Point-in-time" queries needed
- ✅ Trend analysis valuable
- Examples: customer segments, product categories, organizational changes

**Hybrid Approach (Both!):**
- Type 1 for operational dashboards (current state)
- Type 2 for analytical/historical reporting
- Both fed from same CDF source

### Key Benefits Demonstrated:

- **Efficient Processing**: Read only the rows that changed instead of full table snapshots
- **Flexible Patterns**: Choose SCD type based on use case
- **Version Tracking**: CDF provides reliable change tracking
- **Watermark Pattern**: Record the last processed version for each target table so later runs can resume from the next version
- **D2D Simplicity**: Seamless sharing between Databricks workspaces
- **Enterprise Ready**: Scales for large organizations

### Next Steps:
- Implement similar patterns for the `sales_transactions` table
- Schedule the notebook as a recurring Databricks job
- Add monitoring and alerting
- Consider Delta Live Tables (DLT) for production pipelines
- Explore streaming with Structured Streaming for real-time processing

---
&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>