# 2.4 DEMO: Consuming Change Data Feed (CDF) with Delta Sharing

## Recipient Workspace - Tracking Changes

**Learning Objectives:**
- Access CDF-enabled shared tables
- Query change data using `table_changes()`
- Build incremental data pipelines
- Track INSERT, UPDATE, and DELETE operations
- Understand versioning and timestamps

**Scenario:**
You are a data consumer who receives shared customer data. Instead of processing the entire dataset every time, you want to track only the changes (deltas) to efficiently update your systems.

**Benefits of Using CDF:**
- Process only changed records, not full datasets
- Reduce processing time and costs
- Enable real-time or near-real-time synchronization
- Build event-driven architectures
- Maintain complete audit trails

In [None]:
%run ../../setup/00-recipient-setup

## Step 1: Activate the Share

**Instructions:**
1. Obtain the activation link from the provider
2. Open the link in a web browser to activate access
3. Return here to create the catalog

In [None]:
-- Create catalog from the CDF share
CREATE CATALOG IF NOT EXISTS cdf_recipient_catalog
USING SHARE `<provider_name>`.${c.cdf_share_name}
COMMENT 'Catalog for CDF-enabled shared data';

-- View the shared schema and tables
SHOW SCHEMAS IN cdf_recipient_catalog;
SHOW TABLES IN cdf_recipient_catalog.${c.schema_name};

## Step 2: View the Current Shared Data

Let's query the current state of the shared table.

In [None]:
-- Query the shared customer accounts table
SELECT * 
FROM cdf_recipient_catalog.${c.schema_name}.customer_accounts
ORDER BY customer_id;

## Step 3: Query Change Data Feed

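**Key Concept:** The `table_changes()` function returns the row-level changes made to a table from a starting version or timestamp, optionally up to an ending one.

Let's query all changes since the beginning (version 0). This assumes the provider enabled CDF when the table was created and shared the table with its history; if CDF was enabled later, start from the version at which it was enabled.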

In [None]:
-- Query all changes from version 0 to current
SELECT 
  _change_type,
  _commit_version,
  _commit_timestamp,
  customer_id,
  account_number,
  customer_name,
  email,
  account_status,
  account_balance,
  last_activity_date
FROM table_changes('cdf_recipient_catalog.${c.schema_name}.customer_accounts', 0)
ORDER BY _commit_version, customer_id;

### Understanding Change Types

The `_change_type` column shows:
- **insert** - New rows added to the table
- **update_preimage** - Old values before an update
- **update_postimage** - New values after an update  
- **delete** - Rows that were removed

For incremental processing, we typically care about:
- **insert** and **update_postimage** - These are the "current" values
- **delete** - Records to remove from our local copy
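
To make the change-type semantics concrete, here is a minimal Python sketch that applies change rows to an in-memory keyed copy. It is an illustration only, not how Delta applies changes internally: `apply_changes` and the sample rows are hypothetical, and the rows are plain dictionaries shaped like `table_changes()` output.

```python
from typing import Any

Row = dict[str, Any]


def apply_changes(state: dict[int, Row], changes: list[Row]) -> dict[int, Row]:
    """Return a new keyed copy with the change rows applied in commit order."""
    result = dict(state)
    # sorted() is stable, so rows within one commit keep their input order.
    for row in sorted(changes, key=lambda r: r["_commit_version"]):
        key = row["customer_id"]
        change_type = row["_change_type"]
        if change_type in ("insert", "update_postimage"):
            # Store the data columns only; drop the _change_type/_commit_* metadata.
            result[key] = {k: v for k, v in row.items() if not k.startswith("_")}
        elif change_type == "delete":
            result.pop(key, None)
        # update_preimage holds the old values and needs no action here.
    return result


# Hypothetical change rows, shaped like table_changes() output.
changes = [
    {"_change_type": "insert", "_commit_version": 1, "customer_id": 1001, "account_balance": 100},
    {"_change_type": "insert", "_commit_version": 1, "customer_id": 1004, "account_balance": 50},
    {"_change_type": "update_preimage", "_commit_version": 2, "customer_id": 1001, "account_balance": 100},
    {"_change_type": "update_postimage", "_commit_version": 2, "customer_id": 1001, "account_balance": 250},
    {"_change_type": "delete", "_commit_version": 3, "customer_id": 1004, "account_balance": 50},
]
local_copy = apply_changes({}, changes)
print(local_copy)
```

Only customer 1001 remains, with the post-update balance: the pre-image row is ignored and the delete removes customer 1004.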

## Step 4: Create a Local Target Table

Now let's create a local table in our own catalog where we'll maintain a synchronized copy of the shared data.

In [None]:
-- Create our local catalog and schema
CREATE CATALOG IF NOT EXISTS ${c.recipient_catalog};
USE CATALOG ${c.recipient_catalog};

CREATE SCHEMA IF NOT EXISTS analytics;
USE SCHEMA analytics;

-- Create a local table to store synchronized customer data
CREATE OR REPLACE TABLE ${c.recipient_catalog}.analytics.customer_accounts_local (
  customer_id INT,
  account_number STRING,
  customer_name STRING,
  email STRING,
  account_status STRING,
  account_balance DECIMAL(10,2),
  last_activity_date DATE,
  created_date DATE,
  synced_at TIMESTAMP,
  source_version BIGINT
)
COMMENT 'Local copy of shared customer accounts, synchronized using CDF';

-- Verify empty table
SELECT * FROM ${c.recipient_catalog}.analytics.customer_accounts_local;

## Step 5: Initial Load Using MERGE INTO

**Key Pattern:** Use `MERGE INTO` with CDF to perform incremental updates. This is more efficient than full reloads.

For the initial load, we'll process all changes from version 0.

In [None]:
-- MERGE changes from the shared table into our local table
MERGE INTO ${c.recipient_catalog}.analytics.customer_accounts_local AS target
USING (
  SELECT 
    customer_id,
    account_number,
    customer_name,
    email,
    account_status,
    account_balance,
    last_activity_date,
    created_date,
    _change_type,
    _commit_version
  FROM table_changes('cdf_recipient_catalog.${c.schema_name}.customer_accounts', 0)
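  WHERE _change_type IN ('insert', 'update_postimage')  -- Only get current values
  -- Keep only the latest change per customer; otherwise a customer that was inserted
  -- and later updated would produce two source rows (and duplicate local rows)
  QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY _commit_version DESC) = 1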
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN 
  UPDATE SET
    target.account_number = source.account_number,
    target.customer_name = source.customer_name,
    target.email = source.email,
    target.account_status = source.account_status,
    target.account_balance = source.account_balance,
    target.last_activity_date = source.last_activity_date,
    target.created_date = source.created_date,
    target.synced_at = current_timestamp(),
    target.source_version = source._commit_version
WHEN NOT MATCHED THEN
  INSERT (
    customer_id,
    account_number,
    customer_name,
    email,
    account_status,
    account_balance,
    last_activity_date,
    created_date,
    synced_at,
    source_version
  )
  VALUES (
    source.customer_id,
    source.account_number,
    source.customer_name,
    source.email,
    source.account_status,
    source.account_balance,
    source.last_activity_date,
    source.created_date,
    current_timestamp(),
    source._commit_version
  );

In [None]:
-- Verify the data was loaded
SELECT * FROM ${c.recipient_catalog}.analytics.customer_accounts_local
ORDER BY customer_id;
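
One thing to watch when merging a change feed that spans several versions: the same key can appear more than once (for example an `insert` followed by an `update_postimage`). `MERGE` raises an error when several source rows match one target row, and unmatched duplicates are each inserted. The sketch below shows the usual remedy, keeping only the latest change per key. It is a plain-Python illustration with hypothetical names and sample rows; in SQL the same idea is typically written with `ROW_NUMBER()` partitioned by the key and ordered by `_commit_version` descending.

```python
def latest_change_per_key(changes, key="customer_id"):
    """Keep one row per key: the one with the highest _commit_version."""
    latest = {}
    for row in changes:
        seen = latest.get(row[key])
        # ">=" lets a later row win a tie within the same commit.
        if seen is None or row["_commit_version"] >= seen["_commit_version"]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical insert/update_postimage rows for two customers.
upserts = [
    {"customer_id": 1001, "_change_type": "insert", "_commit_version": 1, "account_status": "ACTIVE"},
    {"customer_id": 1002, "_change_type": "insert", "_commit_version": 1, "account_status": "ACTIVE"},
    {"customer_id": 1001, "_change_type": "update_postimage", "_commit_version": 3, "account_status": "SUSPENDED"},
]
deduplicated = latest_change_per_key(upserts)
for row in deduplicated:
    print(row["customer_id"], row["_commit_version"], row["account_status"])
```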

## Step 6: Handle Deletes with MERGE INTO

The initial load skipped `delete` rows, so rows that the provider deleted are still present in our local copy. Let's remove them with a second MERGE.

In [None]:
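-- Process deletes separately, using the latest delete version per customer
MERGE INTO ${c.recipient_catalog}.analytics.customer_accounts_local AS target
USING (
  SELECT 
    customer_id,
    MAX(_commit_version) AS delete_version
  FROM table_changes('cdf_recipient_catalog.${c.schema_name}.customer_accounts', 0)
  WHERE _change_type = 'delete'
  GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
-- Skip rows that were re-inserted after the delete
WHEN MATCHED AND source.delete_version > target.source_version THEN DELETE;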

-- Verify the delete was processed (customer_id 1004 should be gone)
SELECT * FROM ${c.recipient_catalog}.analytics.customer_accounts_local
ORDER BY customer_id;
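
Applying deletes in a separate pass has an ordering pitfall: if a customer was deleted and later re-inserted, a delete pass that looks only at the key removes the re-inserted row. The following sketch shows one way to guard against that, assuming each local row records the commit version it was last synced from (as `source_version` does in our local table). The function name and sample data are hypothetical.

```python
def apply_deletes(local_rows, delete_changes):
    """Drop local rows whose latest delete is newer than the version they were synced from."""
    last_delete = {}
    for row in delete_changes:
        key = row["customer_id"]
        last_delete[key] = max(last_delete.get(key, -1), row["_commit_version"])
    return {
        key: row
        for key, row in local_rows.items()
        # Keep the row if it was never deleted, or was (re)written at or after the delete.
        if last_delete.get(key, -1) <= row["source_version"]
    }


# Hypothetical local copy: 1004 was synced at version 1, 1005 was re-inserted at version 5.
local_rows = {
    1004: {"account_status": "ACTIVE", "source_version": 1},
    1005: {"account_status": "ACTIVE", "source_version": 5},
}
deletes = [
    {"customer_id": 1004, "_commit_version": 3},
    {"customer_id": 1005, "_commit_version": 3},
]
remaining = apply_deletes(local_rows, deletes)
print(sorted(remaining))
```

Customer 1004 is removed because its delete (version 3) is newer than its local row (version 1), while customer 1005 survives because it was re-inserted at version 5.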

## Step 7: Incremental Sync - Track Last Processed Version

**Best Practice:** Store the last processed version so you can query only new changes on subsequent syncs.

Let's create a tracking table and then process only changes since our last sync.

In [None]:
-- Create a sync tracking table
-- (primary key columns must be NOT NULL; the constraint is informational and not enforced)
CREATE TABLE IF NOT EXISTS ${c.recipient_catalog}.analytics.sync_metadata (
  source_table STRING NOT NULL,
  last_synced_version BIGINT,
  last_synced_timestamp TIMESTAMP,
  PRIMARY KEY(source_table)
);

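-- Record our current sync position: the highest commit version in the change feed.
-- (Taking MAX(source_version) from the local table would miss delete-only commits
--  and return NULL if the local table were empty.)
MERGE INTO ${c.recipient_catalog}.analytics.sync_metadata AS target
USING (
  SELECT 
    'customer_accounts' as source_table,
    MAX(_commit_version) as last_version,
    current_timestamp() as sync_time
  FROM table_changes('cdf_recipient_catalog.${c.schema_name}.customer_accounts', 0)
) AS source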
ON target.source_table = source.source_table
WHEN MATCHED THEN UPDATE SET
  target.last_synced_version = source.last_version,
  target.last_synced_timestamp = source.sync_time
WHEN NOT MATCHED THEN INSERT (
  source_table,
  last_synced_version,
  last_synced_timestamp
) VALUES (
  source.source_table,
  source.last_version,
  source.sync_time
);

-- View tracking table
SELECT * FROM ${c.recipient_catalog}.analytics.sync_metadata;

## Step 8: Process Only New Changes (Incremental Pattern)

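**The Magic:** Now we query only changes AFTER our last synced version. This is the power of CDF!

```sql
-- Get last synced version
DECLARE VARIABLE last_version BIGINT;
SET VAR last_version = (SELECT last_synced_version FROM sync_metadata WHERE source_table = 'customer_accounts');

-- Query only NEW changes
SELECT * FROM table_changes('shared_table', last_version + 1);
```

Two caveats: the Databricks SQL reference describes the `table_changes()` start and end arguments as literals, so if your runtime rejects a variable or scalar subquery there, read the version in a separate step and substitute the value. Also, a start version beyond the table's latest commit may raise an error by default, so a sync with nothing new to process may fail rather than return an empty result.

Let's simulate this pattern: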

In [None]:
-- Check what NEW changes exist since our last sync
-- (Provider made additional changes in Step 8 of their notebook)
SELECT 
  _change_type,
  _commit_version,
  _commit_timestamp,
  customer_id,
  customer_name,
  account_status,
  account_balance
FROM table_changes('cdf_recipient_catalog.${c.schema_name}.customer_accounts', 
  (SELECT last_synced_version + 1 FROM ${c.recipient_catalog}.analytics.sync_metadata 
   WHERE source_table = 'customer_accounts'))
ORDER BY _commit_version, customer_id;
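
The checkpoint logic behind this pattern can be sketched in a few lines of plain Python. This is a simplified model, not a Databricks API: `next_batch` and the sample feed are hypothetical. It selects the rows after the last synced version and derives the new checkpoint from the highest commit version seen in the feed, so a commit that contains only deletes still advances the checkpoint.

```python
def next_batch(feed, last_synced_version):
    """Return (rows newer than the checkpoint, new checkpoint)."""
    start_version = last_synced_version + 1
    batch = [row for row in feed if row["_commit_version"] >= start_version]
    if not batch:
        # Nothing new: leave the checkpoint where it is.
        return [], last_synced_version
    new_checkpoint = max(row["_commit_version"] for row in batch)
    return batch, new_checkpoint


# Hypothetical change feed; version 3 is a delete-only commit.
feed = [
    {"_commit_version": 1, "_change_type": "insert", "customer_id": 1001},
    {"_commit_version": 2, "_change_type": "update_postimage", "customer_id": 1001},
    {"_commit_version": 3, "_change_type": "delete", "customer_id": 1001},
]

batch, checkpoint = next_batch(feed, last_synced_version=1)
print(len(batch), checkpoint)

# Running again from the new checkpoint finds nothing to process.
empty_batch, same_checkpoint = next_batch(feed, checkpoint)
print(len(empty_batch), same_checkpoint)
```

Deriving the checkpoint from the rows that remain in the local copy would leave it at version 2 here, and the delete at version 3 would be reprocessed on every run.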

## Step 9: Complete Incremental Sync with MERGE INTO

Now let's apply these new changes to our local table using the complete pattern.

In [None]:
-- Step 9a: Apply inserts and updates
MERGE INTO ${c.recipient_catalog}.analytics.customer_accounts_local AS target
USING (
  SELECT 
    customer_id,
    account_number,
    customer_name,
    email,
    account_status,
    account_balance,
    last_activity_date,
    created_date,
    _change_type,
    _commit_version
  FROM table_changes('cdf_recipient_catalog.${c.schema_name}.customer_accounts',
    (SELECT last_synced_version + 1 FROM ${c.recipient_catalog}.analytics.sync_metadata 
     WHERE source_table = 'customer_accounts'))
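  WHERE _change_type IN ('insert', 'update_postimage')
  -- Keep only the latest change per customer so each target row matches one source row
  QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY _commit_version DESC) = 1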
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN 
  UPDATE SET
    target.account_number = source.account_number,
    target.customer_name = source.customer_name,
    target.email = source.email,
    target.account_status = source.account_status,
    target.account_balance = source.account_balance,
    target.last_activity_date = source.last_activity_date,
    target.created_date = source.created_date,
    target.synced_at = current_timestamp(),
    target.source_version = source._commit_version
WHEN NOT MATCHED THEN
  INSERT (
    customer_id,
    account_number,
    customer_name,
    email,
    account_status,
    account_balance,
    last_activity_date,
    created_date,
    synced_at,
    source_version
  )
  VALUES (
    source.customer_id,
    source.account_number,
    source.customer_name,
    source.email,
    source.account_status,
    source.account_balance,
    source.last_activity_date,
    source.created_date,
    current_timestamp(),
    source._commit_version
  );

In [None]:
-- Step 9b: Process deletes
MERGE INTO ${c.recipient_catalog}.analytics.customer_accounts_local AS target
USING (
  SELECT 
    customer_id,
    MAX(_commit_version) AS delete_version
  FROM table_changes('cdf_recipient_catalog.${c.schema_name}.customer_accounts',
    (SELECT last_synced_version + 1 FROM ${c.recipient_catalog}.analytics.sync_metadata 
     WHERE source_table = 'customer_accounts'))
  WHERE _change_type = 'delete'
  GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
-- Skip rows that were re-inserted after the delete
WHEN MATCHED AND source.delete_version > target.source_version THEN DELETE;

In [None]:
-- Step 9c: Update sync metadata
-- (use the highest commit version in the change feed so that delete-only commits
--  also advance the checkpoint)
MERGE INTO ${c.recipient_catalog}.analytics.sync_metadata AS target
USING (
  SELECT 
    'customer_accounts' as source_table,
    MAX(_commit_version) as last_version,
    current_timestamp() as sync_time
  FROM table_changes('cdf_recipient_catalog.${c.schema_name}.customer_accounts', 0)
) AS source
ON target.source_table = source.source_table
WHEN MATCHED THEN UPDATE SET
  target.last_synced_version = source.last_version,
  target.last_synced_timestamp = source.sync_time;

-- View updated metadata
SELECT * FROM ${c.recipient_catalog}.analytics.sync_metadata;

In [None]:
-- View final synchronized data
SELECT * FROM ${c.recipient_catalog}.analytics.customer_accounts_local
ORDER BY customer_id;

## Step 10: Compare Source vs Local

Let's verify our local copy matches the source.

In [None]:
-- Compare shared data vs local copy
SELECT 
  'Shared Source' as source,
  COUNT(*) as record_count,
  SUM(account_balance) as total_balance
FROM cdf_recipient_catalog.${c.schema_name}.customer_accounts

UNION ALL

SELECT 
  'Local Copy' as source,
  COUNT(*) as record_count,
  SUM(account_balance) as total_balance
FROM ${c.recipient_catalog}.analytics.customer_accounts_local;
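
Matching counts and totals are a quick sanity check, but offsetting differences can hide behind them. As a sketch of a stricter check, the following plain-Python function compares two row sets key by key. `diff_by_key`, the compared columns, and the sample rows are hypothetical choices for illustration.

```python
def diff_by_key(source_rows, local_rows, key="customer_id",
                columns=("account_status", "account_balance")):
    """Report keys missing locally, extra locally, or differing in the compared columns."""
    source = {row[key]: row for row in source_rows}
    local = {row[key]: row for row in local_rows}
    return {
        "missing_locally": sorted(source.keys() - local.keys()),
        "extra_locally": sorted(local.keys() - source.keys()),
        "mismatched": sorted(
            k for k in source.keys() & local.keys()
            if any(source[k][c] != local[k][c] for c in columns)
        ),
    }


# Hypothetical data: counts (2) and total balance (300) agree, yet both rows differ.
shared = [
    {"customer_id": 1001, "account_status": "ACTIVE", "account_balance": 100},
    {"customer_id": 1002, "account_status": "ACTIVE", "account_balance": 200},
]
local = [
    {"customer_id": 1001, "account_status": "ACTIVE", "account_balance": 200},
    {"customer_id": 1002, "account_status": "ACTIVE", "account_balance": 100},
]
report = diff_by_key(shared, local)
print(report)
```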

## Summary: CDF with MERGE INTO Pattern

### What We Accomplished:
✅ Activated and accessed CDF-enabled shared data  
✅ Queried change data using `table_changes()`  
✅ Created local table for synchronized copy  
✅ Performed initial load with MERGE INTO  
✅ Handled INSERT, UPDATE, and DELETE operations  
✅ Implemented incremental sync tracking  
✅ Processed only new changes efficiently  

### The Complete Incremental Pattern:

```sql
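-- 1. Get last synced version
DECLARE VARIABLE last_version BIGINT;
SET VAR last_version = (
  SELECT last_synced_version FROM sync_metadata WHERE source_table = 'shared_table'
);

-- 2. Apply inserts/updates (latest change per key only)
MERGE INTO local_table USING (
  SELECT * FROM table_changes('shared_table', last_version + 1)
  WHERE _change_type IN ('insert', 'update_postimage')
  QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _commit_version DESC) = 1
) AS source ON local_table.id = source.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;

-- 3. Apply deletes
MERGE INTO local_table USING (
  SELECT id, MAX(_commit_version) AS delete_version
  FROM table_changes('shared_table', last_version + 1)
  WHERE _change_type = 'delete'
  GROUP BY id
) AS source ON local_table.id = source.id
WHEN MATCHED AND source.delete_version > local_table.source_version THEN DELETE;

-- 4. Update tracking metadata
-- (current_version stands for the highest _commit_version processed in this run)
UPDATE sync_metadata SET last_synced_version = current_version
WHERE source_table = 'shared_table';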
```

### Key Benefits:
- **Efficiency** - Process only changed data, not full table
- **Performance** - After the initial load, data transfer scales with the volume of changes rather than the size of the table
- **Freshness** - Sync as frequently as needed, down to near-real-time
- **Cost Savings** - Spend compute only on the rows that changed
- **Flexibility** - Control sync frequency and scheduling

### Production Considerations:
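1. **Error Handling** - Wrap each sync run in error handling with retry logic, and keep every step idempotent so retries are safe
2. **Transaction Management** - Each `MERGE` is atomic on its own, but the upsert, delete, and metadata steps are separate commits; update the checkpoint last so a failed run is simply reprocessed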
3. **Monitoring** - Alert on sync failures or delays
4. **Scheduling** - Use Delta Live Tables or scheduled jobs
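5. **Performance** - Delta tables have no secondary indexes; cluster the local table on the join key instead (liquid clustering or `OPTIMIZE ... ZORDER BY`)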
6. **Data Quality** - Validate data before applying changes
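
As an illustration of the error-handling point, here is a minimal retry wrapper with exponential backoff, written as a plain-Python sketch. `run_with_retry` and `run_sync` are hypothetical names; in practice the step would submit the MERGE statements, and the job scheduler's own retry policy may be a simpler option. Retrying is only safe because the sync steps are keyed on `customer_id` and can be re-applied without changing the result.

```python
import time


def run_with_retry(step, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call step(); on failure wait base_delay * 2**(attempt - 1) seconds and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            sleep(base_delay * 2 ** (attempt - 1))


def run_sync():
    # Hypothetical placeholder for one sync run (upserts, deletes, checkpoint update).
    return "synced"


print(run_with_retry(run_sync))
```

The `sleep` parameter is injectable so the backoff can be exercised in tests without actually waiting.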

### Use Cases:
- **Data Warehousing** - Efficient ETL from operational systems
- **Data Lakes** - Keep multiple lakes synchronized
- **Analytics** - Real-time dashboards with fresh data
- **Replication** - Disaster recovery and geo-distribution
- **Compliance** - Audit trail of all changes