
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# 2.4 DEMO: Implementing Change Data Feed (CDF) with Delta Sharing \[Provider]

## Overview
In this demo, we will implement **Change Data Feed (CDF)** with Delta Sharing to enable incremental processing of data changes. This allows recipients to efficiently track and process only the rows that have been inserted, updated, or deleted in shared tables.

**Provider Notebook (This Notebook):** We will enable Change Data Feed on an existing table, create a new share containing the CDF-enabled table, configure a Databricks-to-Databricks (D2D) recipient, and generate various data changes (inserts, updates, deletes) for the recipient to process.

**Recipient Notebook:** The receiving organization will mount the share, read the change data feed, and incrementally process changes using MERGE INTO operations to maintain a synchronized target table.

This demo showcases how Change Data Feed enables efficient incremental data processing patterns with Delta Sharing.

### Learning Objectives
By the end of this demo, you will be able to:
1. Enable Change Data Feed on Delta tables
2. Share CDF-enabled tables via Delta Sharing
3. Generate and track data changes (inserts, updates, deletes)
4. Understand CDF metadata columns and change types
5. Query historical changes using table versions
6. Understand best practices for incremental data processing

## Background

**Scenario:**
You are a data provider company (*"Acme Corp"*) that wants to share customer and sales data with a partner organization. Instead of sharing full snapshots, you want to enable incremental processing so the recipient can efficiently track and process only the changes.

### What is Change Data Feed?

Change Data Feed (CDF) allows Delta Lake to track row-level changes between versions of a Delta table. When enabled, the runtime records change events for all data written to the table, including:
- The row data
- Metadata indicating whether the row was **inserted**, **deleted**, or **updated**
- The table version where the change occurred
- The timestamp of the change

### CDF Metadata Columns

When reading from CDF, additional metadata columns are included:

| Column | Type | Values | Description |
|--------|------|--------|-------------|
| `_change_type` | String | `insert`, `update_preimage`, `update_postimage`, `delete` | Type of change |
| `_commit_version` | Long | Version number | The Delta log version containing the change |
| `_commit_timestamp` | Timestamp | Timestamp | When the commit was created |

**Note:** For updates, you get both the `update_preimage` (before) and `update_postimage` (after) values.
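
To make the change types concrete, here is a minimal plain-Python sketch (not a Databricks API) of how a consumer might apply CDF rows to a keyed target. The helper name `apply_changes`, the in-memory dict target, and the sample rows are assumptions made for illustration; a real recipient would typically use `MERGE INTO` instead.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
METADATA_COLUMNS = {"_change_type", "_commit_version", "_commit_timestamp"}


def apply_changes(target: Dict[int, Row], changes: List[Row],
                  key: str = "customer_id") -> Dict[int, Row]:
    """Apply CDF-style change rows to an in-memory target (hypothetical helper)."""
    # Apply commits in version order so that later changes win.
    for row in sorted(changes, key=lambda r: r["_commit_version"]):
        change_type = row["_change_type"]
        # Strip the CDF metadata columns to recover the plain row data.
        data = {k: v for k, v in row.items() if k not in METADATA_COLUMNS}
        if change_type in ("insert", "update_postimage"):
            target[data[key]] = data  # upsert the new row state
        elif change_type == "delete":
            target.pop(data[key], None)  # remove the row if present
        elif change_type != "update_preimage":
            raise ValueError(f"unexpected change type: {change_type}")
        # update_preimage rows carry the old values and need no action here.
    return target


# Illustrative rows only; the values are made up for this sketch.
target = {1: {"customer_id": 1, "email": "old@acme.com"}}
changes = [
    {"_change_type": "update_preimage", "_commit_version": 5,
     "customer_id": 1, "email": "old@acme.com"},
    {"_change_type": "update_postimage", "_commit_version": 5,
     "customer_id": 1, "email": "newemail@acme.com"},
    {"_change_type": "insert", "_commit_version": 3,
     "customer_id": 41, "email": "sales@futuretech.com"},
]
print(apply_changes(target, changes))
```

The sketch ignores `update_preimage` rows because the post-image alone is enough to bring the target up to date; the pre-image matters mainly when you need to audit what a value used to be.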

In [0]:
%run ./Includes/Demo-Setup-2

## Step 1: Enable Change Data Feed on an Existing Table

First, we need to enable CDF on our `customers` table. This is done by setting the table property `delta.enableChangeDataFeed = true`.

**Important:** Only changes made **after** enabling CDF are recorded. Past changes are not captured.

In [0]:
-- Verify current context
SELECT current_catalog(), current_schema();

In [0]:
-- Enable Change Data Feed on customers table
ALTER TABLE acme_corp.retail.customers 
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

In [0]:
-- Verify CDF is enabled by checking table properties
SHOW TBLPROPERTIES acme_corp.retail.customers;

## Step 2: Check Current Table Versions

Let's check the current version of our `customers` table. This will be our baseline before we start making changes.

In [0]:
-- Check table history for customers
DESCRIBE HISTORY acme_corp.retail.customers LIMIT 5;

In [0]:
-- View current customers data
SELECT * FROM acme_corp.retail.customers ORDER BY customer_id;

## Step 3: Generate Data Changes

Now let's generate various types of changes to demonstrate CDF functionality:
1. **Inserts** - Add new customers
2. **Updates** - Modify existing customer data
3. **Deletes** - Remove customers

### Change 1: Insert New Customers

In [0]:
-- Insert 3 new customers
INSERT INTO acme_corp.retail.customers VALUES
  (41, 'FutureTech Systems', 'sales@futuretech.com', 'Japan', '2024-09-15', 'Mid-Market'),
  (42, 'Quantum Analytics', 'info@quantum.com', 'Netherlands', '2024-09-20', 'Enterprise'),
  (43, 'Smart Solutions Co', 'contact@smartsol.com', 'Brazil', '2024-09-25', 'SMB');

### Change 2: Update Existing Customers

In [0]:
-- Update customer segments (upgrade SMB customers to Mid-Market)
UPDATE acme_corp.retail.customers 
SET customer_segment = 'Mid-Market'
WHERE customer_segment = 'SMB';

In [0]:
-- Update email for a specific customer
UPDATE acme_corp.retail.customers 
SET email = 'newemail@acme.com'
WHERE customer_id = 1;

### Change 3: Delete Customers

In [0]:
-- Delete a customer (e.g., customer closed account)
DELETE FROM acme_corp.retail.customers WHERE customer_id = 11;

## Step 4: View Table History

Let's examine the table history to see all the changes we've made.

In [0]:
-- View detailed table history
DESCRIBE HISTORY acme_corp.retail.customers;

## Step 5: Query Change Data Feed (Batch Mode)

Let's query the CDF to see all the changes we've made. We'll use the `table_changes()` function to read changes in batch mode.
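
Because `table_changes()` needs a starting version that is not earlier than the commit that enabled CDF, it helps to pick that version out of the table history first. The following plain-Python sketch shows one way to do that, assuming the `DESCRIBE HISTORY` rows have been collected as dicts with `version`, `operation`, and `operationParameters` keys, and that the `SET TBLPROPERTIES` entry stores its properties as a JSON string. The helper name `cdf_enabled_version` and the sample history are hypothetical.

```python
import json
from typing import Any, Dict, List, Optional


def cdf_enabled_version(history: List[Dict[str, Any]],
                        prop: str = "delta.enableChangeDataFeed") -> Optional[int]:
    """Return the most recent version whose commit set `prop` to true, if any."""
    enabled_versions = []
    for entry in history:
        if entry.get("operation") != "SET TBLPROPERTIES":
            continue
        # Assumption: the properties are stored as a JSON-encoded string.
        raw = entry.get("operationParameters", {}).get("properties", "{}")
        properties = json.loads(raw)
        if str(properties.get(prop, "")).lower() == "true":
            enabled_versions.append(entry["version"])
    return max(enabled_versions) if enabled_versions else None


# Hypothetical history rows, newest first, as DESCRIBE HISTORY lists them.
sample_history = [
    {"version": 3, "operation": "WRITE", "operationParameters": {"mode": "Append"}},
    {"version": 2, "operation": "SET TBLPROPERTIES",
     "operationParameters": {"properties": '{"delta.enableChangeDataFeed":"true"}'}},
    {"version": 1, "operation": "WRITE", "operationParameters": {"mode": "Append"}},
    {"version": 0, "operation": "CREATE TABLE", "operationParameters": {}},
]
print(cdf_enabled_version(sample_history))
```

This only covers tables where CDF was turned on with `ALTER TABLE ... SET TBLPROPERTIES`; a table created with the property already set would not have such an entry, so the function returns `None` in that case.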

In [0]:
-- Query all changes recorded since CDF was enabled.
-- The starting version (3 here) is an example: replace it with the version from your
-- DESCRIBE HISTORY output, typically the one right after the 'SET TBLPROPERTIES' operation.
-- A starting version earlier than the commit that enabled CDF causes an error.
SELECT 
  _change_type,
  _commit_version,
  _commit_timestamp,
  customer_id,
  customer_name,
  email,
  customer_segment
FROM table_changes('acme_corp.retail.customers', 3)
ORDER BY _commit_version, customer_id, _change_type;

### Understanding the Output

In the results above, you should see:

1. **Insert Operations**: `_change_type = 'insert'` for new customers (IDs 41, 42, 43)
2. **Update Operations**: For each updated row, you'll see:
   - `update_preimage`: The values **before** the update
   - `update_postimage`: The values **after** the update
3. **Delete Operations**: `_change_type = 'delete'` for removed customer (ID 11)

Each change includes the `_commit_version` and `_commit_timestamp` for tracking when the change occurred.
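
A single customer can appear several times in this output. Customer 43, for example, is inserted as `SMB` and then picked up by the segment upgrade. Before merging into a target, a consumer usually wants only the final state per key. The plain-Python sketch below illustrates that reduction; the helper name `latest_change_per_key` and the version numbers in the sample rows are assumptions for illustration, not values taken from the actual table.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def latest_change_per_key(changes: List[Row], key: str = "customer_id") -> Dict[Any, Row]:
    """Keep only the most recent non-preimage change row for each key."""
    latest: Dict[Any, Row] = {}
    for row in changes:
        # Pre-images describe the old state, so they never represent the final state.
        if row["_change_type"] == "update_preimage":
            continue
        current = latest.get(row[key])
        if current is None or row["_commit_version"] > current["_commit_version"]:
            latest[row[key]] = row
    return latest


# Illustrative change rows; commit versions are made up for this sketch.
sample_changes = [
    {"_change_type": "insert", "_commit_version": 3,
     "customer_id": 43, "customer_segment": "SMB"},
    {"_change_type": "update_preimage", "_commit_version": 4,
     "customer_id": 43, "customer_segment": "SMB"},
    {"_change_type": "update_postimage", "_commit_version": 4,
     "customer_id": 43, "customer_segment": "Mid-Market"},
    {"_change_type": "delete", "_commit_version": 6,
     "customer_id": 11, "customer_segment": "Enterprise"},
]
for customer_id, row in latest_change_per_key(sample_changes).items():
    print(customer_id, row["_change_type"], row["customer_segment"])
```

In SQL, the same idea is commonly expressed with a window function ranked by `_commit_version` per key; the Python version is shown here only because it can be run without a workspace.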

In [0]:
-- Query changes for a specific version range
-- This shows changes between versions 3 and 4 (both inclusive); adjust the range to match your table history
SELECT 
  _change_type,
  _commit_version,
  customer_id,
  customer_name,
  customer_segment
FROM table_changes('acme_corp.retail.customers', 3, 4)
ORDER BY _commit_version, customer_id;

In [0]:
-- Summarize changes by type
SELECT 
  CASE 
    WHEN _change_type LIKE 'update%' THEN 'update'
    ELSE _change_type 
  END as change_type,
  COUNT(*) as change_count
FROM table_changes('acme_corp.retail.customers', 3)
GROUP BY change_type
ORDER BY change_type;

-- Generate a bar chart visualization

## Step 6: Create a Share with CDF-Enabled Table

Now let's create a new share and add our CDF-enabled table to it.

In [0]:
-- Create a share for CDF demonstration
CREATE SHARE IF NOT EXISTS cdf_retail
COMMENT 'Share with Change Data Feed enabled for incremental processing';

In [0]:
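-- Add CDF-enabled table to the share
-- WITH HISTORY is stated explicitly because recipients need the table's history to read its change data feed
ALTER SHARE cdf_retail 
ADD TABLE acme_corp.retail.customers
COMMENT 'Customer data with Change Data Feed enabled'
WITH HISTORY;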

In [0]:
-- Verify share contents
SHOW ALL IN SHARE cdf_retail;

## Step 7: Create Recipient and Grant Access

Create a D2D recipient and grant them access to the CDF-enabled share.

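**Note:** Get the recipient's sharing identifier (a string of the form `<cloud>:<region>:<uuid>` that identifies their metastore) before running this step. The recipient can retrieve it by running `SELECT CURRENT_METASTORE();` in their workspace.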
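
A mistyped identifier is an easy way for recipient creation to fail, so it can be worth sanity-checking the string before pasting it into `CREATE RECIPIENT`. The sketch below is a plain-Python check, assuming the identifier has three colon-separated parts ending in a UUID; `parse_sharing_identifier` and `SharingIdentifier` are hypothetical names, and the check says nothing about whether the metastore actually exists.

```python
import uuid
from typing import NamedTuple


class SharingIdentifier(NamedTuple):
    cloud: str
    region: str
    metastore_uuid: uuid.UUID


def parse_sharing_identifier(value: str) -> SharingIdentifier:
    """Split '<cloud>:<region>:<uuid>' into its parts, raising ValueError if malformed."""
    parts = value.strip().split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected '<cloud>:<region>:<uuid>', got: {value!r}")
    cloud, region, raw_uuid = parts
    # uuid.UUID raises ValueError when the last part is not a well-formed UUID.
    return SharingIdentifier(cloud, region, uuid.UUID(raw_uuid))


# The identifier used in this demo's CREATE RECIPIENT statement.
parsed = parse_sharing_identifier("aws:us-west-2:adc48801-0d35-4e94-9cbc-0ff7f29ce611")
print(parsed.cloud, parsed.region, parsed.metastore_uuid)
```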

In [0]:
-- Create recipient (replace with the actual sharing identifier of the recipient's metastore)
CREATE RECIPIENT IF NOT EXISTS cdf_recipient
USING ID 'aws:us-west-2:adc48801-0d35-4e94-9cbc-0ff7f29ce611'
COMMENT 'Recipient for Change Data Feed demonstration';

In [0]:
-- Grant access to the share
GRANT SELECT ON SHARE cdf_retail TO RECIPIENT cdf_recipient;

In [0]:
-- Verify grants
SHOW GRANTS ON SHARE cdf_retail;

## Next Steps
The recipient will now mount this share and use the Change Data Feed to incrementally process changes using `MERGE INTO` operations!

## Best Practices and Considerations

### CDF Retention and History

- **Not a permanent record**: CDF is not designed to store all historical changes forever
- **Retention policy**: Change data files follow the table's retention policy
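- **`VACUUM` impact**: Running `VACUUM` removes change data files older than the retention threshold
- **Checkpoint policy**: Changes that are reconstructed from the transaction log rather than from change data files (for example, insert-only commits) follow the log's checkpoint retention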

### When to Use CDF

✅ **Good use cases:**
- Incremental ETL/ELT pipelines
- Downstream system synchronization
- Audit and compliance tracking
- Real-time data warehousing
- Change data capture (CDC) patterns

❌ **Not ideal for:**
- Complete historical replay (use table snapshots instead)
- Long-term audit archives (copy CDF to separate archive table)
- Tables with very frequent small updates

### Streaming vs Batch Reads

**Streaming (Recommended):**
- Automatically tracks processed versions
- Checkpoint-based progress tracking
- Ideal for continuous incremental processing
- By default, the first run returns the table's current snapshot as `insert` rows, followed by subsequent changes

**Batch:**
- Must specify starting version or timestamp
- Useful for ad-hoc analysis
- Good for backfilling specific version ranges
- Requires manual version tracking
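
The manual version tracking that batch reads require can be sketched as a small bookkeeping function. The plain-Python example below computes the next inclusive version range to pass to `table_changes()`, assuming the consumer stores the last version it processed somewhere durable. The helper name `next_version_range` and its parameters are hypothetical.

```python
from typing import Optional, Tuple


def next_version_range(last_processed: Optional[int], latest_version: int,
                       cdf_enabled_at: int) -> Optional[Tuple[int, int]]:
    """Return the inclusive (start, end) versions still to process, or None if caught up."""
    # On the first run there is no stored progress, so start where CDF was enabled.
    start = cdf_enabled_at if last_processed is None else last_processed + 1
    # Never ask for versions from before CDF was enabled.
    start = max(start, cdf_enabled_at)
    if start > latest_version:
        return None  # nothing new since the last run
    return (start, latest_version)


print(next_version_range(None, 6, 2))  # first run
print(next_version_range(4, 6, 2))     # resume after version 4
print(next_version_range(6, 6, 2))     # already caught up
```

After a batch succeeds, the consumer would persist the end of the returned range as its new `last_processed` value. Streaming reads make this bookkeeping unnecessary because the checkpoint records progress.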

### Performance Tips

- Enable CDF before making changes (can't capture historical changes)
- Use appropriate starting versions to avoid reprocessing
- Consider rate limiting for streaming reads (for example, with `maxFilesPerTrigger` or `maxBytesPerTrigger`)
- Monitor storage costs (CDF can add storage overhead)
- Archive old changes if long-term history needed

## Summary

✅ **What we accomplished:**

1. Enabled Change Data Feed on Delta tables
2. Generated various data changes (inserts, updates, deletes)
3. Queried change data using the `table_changes()` function
4. Created a share with CDF-enabled tables
5. Configured a recipient for incremental processing
6. Understood CDF metadata and change types

**Key Takeaways:**

- **CDF tracks row-level changes** with metadata indicating operation type
- **Enables efficient incremental processing** for Delta Sharing recipients
- **Supports both streaming and batch** processing patterns
- **Provides audit trail** with version and timestamp information
- **Updates include both pre and post images** for complete change tracking

---
&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>