


# 3.1 DEMO: Cross Cloud Replication with Cloudflare R2 \[Provider]

## Overview
This demo showcases a hybrid data replication pattern using Change Data Feed (CDF) and Cloudflare R2 for cross-cloud/cross-region collaboration. This pattern is particularly useful when:

- Recipients need low-latency access to data in their region
- Organizations want to minimize egress costs
- Data needs to be available even when the provider's workspace is offline
- Recipients prefer direct storage access over the Delta Sharing protocol

**Provider Notebook (This Notebook):** Set up source data with CDF enabled, create an external location using Cloudflare R2, and configure automated replication using MERGE INTO operations based on change data.

**Recipient Notebook:** Access the replicated data directly from R2 storage, demonstrate query performance, and compare with traditional Delta Sharing approaches.

### Learning Objectives
By the end of this demo, you will be able to:
1. Enable and use Change Data Feed (CDF) to track table changes
2. Configure Cloudflare R2 as an external storage location in Databricks
3. Implement incremental replication using CDF and MERGE INTO
4. Schedule automated replication jobs for continuous sync
5. Monitor replication status and troubleshoot issues
6. Understand trade-offs between Delta Sharing and storage-based replication

## Background

**Scenario:**
You are a data provider ("Global Analytics Corp") with customer transaction data in AWS US-East. You need to share this data with regional partners in Europe and Asia-Pacific, but they require:
- Low-latency access in their regions
- Ability to access data even when your Databricks workspace is down
- Cost-effective replication without high egress fees

**Solution:**
Instead of direct Delta Sharing, you'll use Cloudflare R2 (S3-compatible object storage with no egress fees) as a replication target. Your source tables use Change Data Feed to track all modifications, and a scheduled job incrementally replicates those changes to R2, where recipients can read the data directly.

In [0]:
%run ./Includes/Demo-Setup-3.1

## Step 1: Create Source Table with Change Data Feed

Change Data Feed (CDF) tracks all changes (inserts, updates, deletes) to a Delta table. We'll create a transactions table with CDF enabled to capture all modifications for replication.

In [0]:
-- Create source transactions table with CDF enabled
CREATE TABLE IF NOT EXISTS global_analytics.transactions (
  transaction_id STRING,
  customer_id STRING,
  product_id STRING,
  amount DECIMAL(10,2),
  transaction_date TIMESTAMP,
  region STRING,
  status STRING
)
USING DELTA
TBLPROPERTIES (
  'delta.enableChangeDataFeed' = 'true'
)
COMMENT 'Source transactions table with CDF for replication';

In [0]:
-- Insert initial transaction data
INSERT INTO global_analytics.transactions VALUES
  ('TXN001', 'CUST001', 'PROD100', 150.00, '2025-01-15 10:30:00', 'US', 'completed'),
  ('TXN002', 'CUST002', 'PROD101', 275.50, '2025-01-15 11:15:00', 'EU', 'completed'),
  ('TXN003', 'CUST003', 'PROD102', 89.99, '2025-01-15 12:00:00', 'APAC', 'completed'),
  ('TXN004', 'CUST004', 'PROD100', 150.00, '2025-01-15 13:45:00', 'US', 'pending'),
  ('TXN005', 'CUST005', 'PROD103', 420.00, '2025-01-15 14:20:00', 'EU', 'completed');

SELECT * FROM global_analytics.transactions ORDER BY transaction_date;

In [0]:
-- Verify CDF is enabled
DESCRIBE DETAIL global_analytics.transactions;
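
To make the shape of change records concrete, here is a small pure-Python sketch that runs without a cluster. It derives records by comparing two snapshots, which is only an illustration: Delta records the feed as writes happen, not by diffing. The helper name `diff_to_change_records` is hypothetical, and a single commit version is used for all changes to keep the example short.

```python
from typing import Dict, List

Row = Dict[str, object]


def diff_to_change_records(before: Dict[str, Row], after: Dict[str, Row],
                           commit_version: int) -> List[Row]:
    """Emit CDF-shaped records describing how `before` became `after`."""
    records: List[Row] = []
    meta = {"_commit_version": commit_version}
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old is None:
            # Row exists only in the new snapshot
            records.append({**new, "_change_type": "insert", **meta})
        elif new is None:
            # Row exists only in the old snapshot
            records.append({**old, "_change_type": "delete", **meta})
        elif old != new:
            # An update yields two records: the row before and after the change
            records.append({**old, "_change_type": "update_preimage", **meta})
            records.append({**new, "_change_type": "update_postimage", **meta})
    return records


before = {
    "TXN001": {"transaction_id": "TXN001", "status": "completed"},
    "TXN004": {"transaction_id": "TXN004", "status": "pending"},
}
after = {
    "TXN004": {"transaction_id": "TXN004", "status": "completed"},
    "TXN006": {"transaction_id": "TXN006", "status": "completed"},
}
for rec in diff_to_change_records(before, after, commit_version=2):
    print(rec["transaction_id"], rec["_change_type"])
```

Note that an update produces both a preimage and a postimage record in the same commit, which matters when picking the latest change per key.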

## Step 2: Configure Cloudflare R2 External Location

Cloudflare R2 provides S3-compatible object storage with zero egress fees. We'll configure it as an external location in Unity Catalog. You need:
- R2 bucket name
- R2 access key ID
- R2 secret access key
- R2 endpoint URL (account-specific)

**Note:** In production, credentials should be stored in Databricks secrets.
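
As a rough sketch of how the account ID and bucket name combine, the helpers below build the two URL forms you are likely to need. The function names are hypothetical, and the `r2://<bucket>@<account-id>.r2.cloudflarestorage.com` form is an assumption about the Unity Catalog URL format that should be checked against the current Databricks documentation.

```python
def r2_endpoint_url(account_id: str) -> str:
    """S3-compatible endpoint for an R2 account (used by S3 clients and s3a)."""
    return f"https://{account_id}.r2.cloudflarestorage.com"


def r2_unity_catalog_url(bucket: str, account_id: str, prefix: str = "") -> str:
    """Assumed r2:// URL form for Unity Catalog external locations and tables."""
    base = f"r2://{bucket}@{account_id}.r2.cloudflarestorage.com"
    prefix = prefix.strip("/")
    return f"{base}/{prefix}/" if prefix else base


# "abc123" is a placeholder, not a real Cloudflare account ID
print(r2_endpoint_url("abc123"))
print(r2_unity_catalog_url("global-analytics-replication", "abc123", "transactions"))
```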

In [0]:
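-- Create storage credential for R2 (metastore admin required)
-- NOTE: R2 authenticates with an access key ID, secret access key, and account ID,
-- not an AWS IAM role. Treat the statement below as a placeholder: create the R2
-- credential through whichever interface your workspace supports (for example,
-- Catalog Explorer) and substitute your actual R2 credentials.
CREATE STORAGE CREDENTIAL IF NOT EXISTS r2_storage_credential
WITH (AWS_IAM_ROLE=<your-r2-credentials>)
COMMENT 'Cloudflare R2 storage credentials for cross-cloud replication';

-- Create external location pointing to R2
-- Replace <account-id> with your Cloudflare account ID
-- R2 URLs take the form r2://<bucket>@<account-id>.r2.cloudflarestorage.com
CREATE EXTERNAL LOCATION IF NOT EXISTS r2_replication_location
URL 'r2://global-analytics-replication@<account-id>.r2.cloudflarestorage.com'
WITH (STORAGE CREDENTIAL r2_storage_credential)
COMMENT 'Cloudflare R2 location for replicated transaction data';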

**Alternative Setup Using Python (with Secrets):**

In [0]:
# Configure R2 credentials from secrets
r2_access_key = dbutils.secrets.get(scope="r2-secrets", key="access-key-id")
r2_secret_key = dbutils.secrets.get(scope="r2-secrets", key="secret-access-key")
r2_endpoint = dbutils.secrets.get(scope="r2-secrets", key="endpoint-url")

# Configure Spark to use R2
spark.conf.set("fs.s3a.access.key", r2_access_key)
spark.conf.set("fs.s3a.secret.key", r2_secret_key)
spark.conf.set("fs.s3a.endpoint", r2_endpoint)
spark.conf.set("fs.s3a.path.style.access", "true")

# Target path for the replica (printing it does not verify connectivity)
r2_path = "s3a://global-analytics-replication/transactions/"
print(f"R2 storage configured at: {r2_path}")

## Step 3: Create Target External Table on R2

The target table will be an external Delta table stored in Cloudflare R2. This allows recipients to access it directly from their cloud/region.

In [0]:
-- Create external table in R2 using external location
CREATE TABLE IF NOT EXISTS global_analytics.transactions_r2_replica
(
  transaction_id STRING,
  customer_id STRING,
  product_id STRING,
  amount DECIMAL(10,2),
  transaction_date TIMESTAMP,
  region STRING,
  status STRING,
  _replicated_at TIMESTAMP
)
USING DELTA
LOCATION 'r2://global-analytics-replication@<account-id>.r2.cloudflarestorage.com/transactions/'
COMMENT 'Replicated transactions in Cloudflare R2 for cross-cloud access';

## Step 4: Initial Data Load to R2

Perform the initial full copy of data to R2 storage.

In [0]:
-- Initial data load with replication timestamp
INSERT INTO global_analytics.transactions_r2_replica
SELECT 
  transaction_id,
  customer_id,
  product_id,
  amount,
  transaction_date,
  region,
  status,
  current_timestamp() as _replicated_at
FROM global_analytics.transactions;

-- Verify initial load
SELECT COUNT(*) as replica_count FROM global_analytics.transactions_r2_replica;

## Step 5: Implement Incremental Replication Using CDF

Now we'll create a replication process that:
1. Reads changes from the source table using CDF
2. Applies those changes to the R2 replica using MERGE INTO
3. Tracks the last replicated version for incremental processing

In [0]:
# Create a metadata table to track replication progress
spark.sql("""
CREATE TABLE IF NOT EXISTS global_analytics.replication_metadata (
  source_table STRING,
  target_table STRING,
  last_replicated_version LONG,
  last_replicated_timestamp TIMESTAMP,
  replication_status STRING
)
USING DELTA
""")

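# Initialize metadata. Version 0 is normally the CREATE TABLE commit, so the first
# incremental run re-reads the initial INSERT that Step 4 already copied.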
spark.sql("""
MERGE INTO global_analytics.replication_metadata AS target
USING (
  SELECT 
    'global_analytics.transactions' as source_table,
    'global_analytics.transactions_r2_replica' as target_table,
    0 as last_replicated_version,
    current_timestamp() as last_replicated_timestamp,
    'initialized' as replication_status
) AS source
ON target.source_table = source.source_table
WHEN NOT MATCHED THEN INSERT *
""")

In [0]:
def replicate_changes():
    """
    Incremental replication function using Change Data Feed.
    Reads changes since last replication and applies to R2 replica.
    """
    from pyspark.sql.functions import col, current_timestamp, lit
    
    # Get last replicated version
    last_version_df = spark.sql("""
        SELECT last_replicated_version 
        FROM global_analytics.replication_metadata
        WHERE source_table = 'global_analytics.transactions'
    """)
    
    last_version = last_version_df.collect()[0][0]
    
    # Get current version of source table
    current_version = spark.sql(
        "DESCRIBE HISTORY global_analytics.transactions LIMIT 1"
    ).select("version").collect()[0][0]
    
    if current_version <= last_version:
        print("No new changes to replicate")
        return
    
    print(f"Replicating changes from version {last_version + 1} to {current_version}")
    
    # Read changes using CDF
    changes_df = spark.read \
        .format("delta") \
        .option("readChangeFeed", "true") \
        .option("startingVersion", last_version + 1) \
        .option("endingVersion", current_version) \
        .table("global_analytics.transactions")
    
    changes_count = changes_df.count()
    print(f"Found {changes_count} change records to process")
    
    if changes_count == 0:
        return
    
    # Create temp view for merge
    changes_df.createOrReplaceTempView("changes_temp")
    
    # Apply changes to R2 replica using MERGE
    spark.sql("""
        MERGE INTO global_analytics.transactions_r2_replica AS target
        USING (
            SELECT 
                transaction_id,
                customer_id,
                product_id,
                amount,
                transaction_date,
                region,
                status,
                _change_type,
                _commit_version
            FROM (
                SELECT *,
                    ROW_NUMBER() OVER (
                        PARTITION BY transaction_id 
                        ORDER BY _commit_version DESC, _commit_timestamp DESC
                    ) as rn
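                FROM changes_temp
                -- Drop preimages: they share a commit version with their postimage,
                -- so the ranking above could otherwise pick the stale row
                WHERE _change_type != 'update_preimage'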
            )
            WHERE rn = 1
        ) AS source
        ON target.transaction_id = source.transaction_id
        -- 'insert' covers a key that was deleted and re-inserted between runs
        WHEN MATCHED AND source._change_type IN ('insert', 'update_postimage') THEN
            UPDATE SET
                customer_id = source.customer_id,
                product_id = source.product_id,
                amount = source.amount,
                transaction_date = source.transaction_date,
                region = source.region,
                status = source.status,
                _replicated_at = current_timestamp()
        WHEN MATCHED AND source._change_type = 'delete' THEN DELETE
        WHEN NOT MATCHED AND source._change_type IN ('insert', 'update_postimage') THEN
            INSERT (
                transaction_id, customer_id, product_id, amount,
                transaction_date, region, status, _replicated_at
            )
            VALUES (
                source.transaction_id, source.customer_id, source.product_id,
                source.amount, source.transaction_date, source.region,
                source.status, current_timestamp()
            )
    """)
    
    # Update metadata
    spark.sql(f"""
        UPDATE global_analytics.replication_metadata
        SET 
            last_replicated_version = {current_version},
            last_replicated_timestamp = current_timestamp(),
            replication_status = 'success'
        WHERE source_table = 'global_analytics.transactions'
    """)
    
    print(f"Successfully replicated {changes_count} changes to R2")

# Test the replication function
replicate_changes()
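
The intended merge semantics can be checked without a cluster. The sketch below is a pure-Python model, not the Spark implementation: it skips `update_preimage` records, keeps the latest change per `transaction_id`, then upserts or deletes against a dictionary standing in for the replica. The function names are hypothetical.

```python
from typing import Dict, Iterable

Row = Dict[str, object]


def latest_change_per_key(changes: Iterable[Row], key: str = "transaction_id") -> Dict[str, Row]:
    """Keep only the most recent non-preimage change for each key."""
    latest: Dict[str, Row] = {}
    for rec in changes:
        if rec["_change_type"] == "update_preimage":
            continue  # preimages describe the old row and must never win
        k = rec[key]
        if k not in latest or rec["_commit_version"] >= latest[k]["_commit_version"]:
            latest[k] = rec
    return latest


def apply_changes(replica: Dict[str, Row], changes: Iterable[Row],
                  key: str = "transaction_id") -> Dict[str, Row]:
    """Return a new replica with deletes removed and inserts/updates upserted."""
    result = dict(replica)
    for k, rec in latest_change_per_key(changes, key).items():
        if rec["_change_type"] == "delete":
            result.pop(k, None)
        else:
            # Strip CDF metadata columns (those starting with "_") before storing
            result[k] = {c: v for c, v in rec.items() if not c.startswith("_")}
    return result


replica = {
    "TXN001": {"transaction_id": "TXN001", "status": "completed"},
    "TXN004": {"transaction_id": "TXN004", "status": "pending"},
}
changes = [
    {"transaction_id": "TXN006", "status": "completed", "_change_type": "insert", "_commit_version": 2},
    {"transaction_id": "TXN004", "status": "completed", "_change_type": "update_postimage", "_commit_version": 3},
    {"transaction_id": "TXN004", "status": "pending", "_change_type": "update_preimage", "_commit_version": 3},
    {"transaction_id": "TXN001", "status": "completed", "_change_type": "delete", "_commit_version": 4},
]
print(apply_changes(replica, changes))
```

Because the result depends only on the latest change per key, replaying the same batch a second time leaves the replica unchanged, which is a useful property when a job is retried.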

## Step 6: Test Replication with Data Changes

Let's make some changes to the source table and verify they replicate correctly to R2.

In [0]:
-- Insert new transactions
INSERT INTO global_analytics.transactions VALUES
  ('TXN006', 'CUST006', 'PROD104', 325.00, '2025-01-16 09:00:00', 'APAC', 'completed'),
  ('TXN007', 'CUST007', 'PROD100', 150.00, '2025-01-16 10:30:00', 'US', 'completed');

-- Update existing transaction
UPDATE global_analytics.transactions
SET status = 'completed'
WHERE transaction_id = 'TXN004';

-- Delete a transaction
DELETE FROM global_analytics.transactions
WHERE transaction_id = 'TXN001';

SELECT * FROM global_analytics.transactions ORDER BY transaction_date;

In [0]:
# Run replication to apply changes
replicate_changes()

In [0]:
-- Verify changes in R2 replica
SELECT 'Source' as table_name, COUNT(*) as record_count 
FROM global_analytics.transactions
UNION ALL
SELECT 'R2 Replica' as table_name, COUNT(*) as record_count 
FROM global_analytics.transactions_r2_replica;

-- Check specific records
SELECT * FROM global_analytics.transactions_r2_replica 
ORDER BY transaction_date;

## Step 7: Schedule Automated Replication Job

In production, schedule the replication to run automatically. The cell below only defines a job configuration; deploying it requires the Databricks CLI or REST API.

In [0]:
# This code creates a job definition that can be deployed
# In practice, you would use Databricks CLI or REST API to create the job

job_config = {
    "name": "R2 Replication - Transactions",
    "tasks": [
        {
            "task_key": "replicate_transactions",
            "notebook_task": {
                "notebook_path": "/path/to/this/notebook",
                "source": "WORKSPACE"
            },
            "job_cluster_key": "replication_cluster",
            "timeout_seconds": 3600,
            "max_retries": 2
        }
    ],
    "job_clusters": [
        {
            "job_cluster_key": "replication_cluster",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2
            }
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 */15 * * * ?",  # Every 15 minutes
        "timezone_id": "UTC"
    }
}

print("Job configuration for R2 replication:")
print(job_config)
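
For the REST API route mentioned in the cell's comments, a minimal sketch follows. It assumes the Jobs API 2.1 `jobs/create` endpoint and bearer-token authentication; the helper name, host, and token are hypothetical placeholders. The function only builds the request and does not send it.

```python
import json
import urllib.request


def build_create_job_request(host: str, token: str, job_config: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that would create a job."""
    url = host.rstrip("/") + "/api/2.1/jobs/create"
    return urllib.request.Request(
        url,
        data=json.dumps(job_config).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder host and token; in a notebook, read the token from a secret scope
request = build_create_job_request(
    "https://example.cloud.databricks.com",
    "<personal-access-token>",
    {"name": "R2 Replication - Transactions"},
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would submit it. Keep the token in a secret scope rather than in the notebook source.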

## Step 8: Monitor Replication Status

Create monitoring queries to track replication health and performance.

In [0]:
-- Check replication metadata
SELECT 
    source_table,
    target_table,
    last_replicated_version,
    last_replicated_timestamp,
    replication_status,
    TIMESTAMPDIFF(MINUTE, last_replicated_timestamp, current_timestamp()) as minutes_since_last_replication
FROM global_analytics.replication_metadata;
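
The `minutes_since_last_replication` value is only useful with a threshold. Here is a minimal sketch of a staleness check, assuming the 15-minute schedule from Step 7 and an arbitrary tolerance of two missed intervals; the function name and both defaults are illustrative choices, not part of the demo.

```python
from datetime import datetime, timedelta, timezone


def replication_is_stale(last_replicated: datetime, now: datetime,
                         interval_minutes: int = 15, grace_intervals: int = 2) -> bool:
    """True when more than `grace_intervals` schedule intervals have elapsed."""
    allowed = timedelta(minutes=interval_minutes * grace_intervals)
    return now - last_replicated > allowed


now = datetime(2025, 1, 16, 12, 0, tzinfo=timezone.utc)
print(replication_is_stale(now - timedelta(minutes=20), now))  # within tolerance
print(replication_is_stale(now - timedelta(minutes=45), now))  # overdue
```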

In [0]:
-- Compare record counts between source and replica
WITH source_stats AS (
    SELECT 
        COUNT(*) as record_count,
        MAX(transaction_date) as latest_transaction,
        SUM(amount) as total_amount
    FROM global_analytics.transactions
),
replica_stats AS (
    SELECT 
        COUNT(*) as record_count,
        MAX(transaction_date) as latest_transaction,
        SUM(amount) as total_amount,
        MAX(_replicated_at) as last_replication
    FROM global_analytics.transactions_r2_replica
)
SELECT 
    'Source' as table_type,
    source_stats.*,
    NULL as last_replication
FROM source_stats
UNION ALL
SELECT 
    'R2 Replica' as table_type,
    replica_stats.*
FROM replica_stats;

In [0]:
-- Check for source rows that have not yet reached the replica
SELECT 
    t.transaction_id,
    t.transaction_date as source_date,
    r._replicated_at as replica_date,
    CASE 
        WHEN r.transaction_id IS NULL THEN 'Missing in replica'
        ELSE 'Replicated'
    END as replication_status
FROM global_analytics.transactions t
LEFT JOIN global_analytics.transactions_r2_replica r
    ON t.transaction_id = r.transaction_id
ORDER BY t.transaction_date DESC
LIMIT 20;
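
The LEFT JOIN above finds rows missing from the replica but cannot see rows that were deleted in the source and still linger in the replica. A small sketch of a two-way key comparison, using a hypothetical helper that takes the `transaction_id` values collected from each table:

```python
from typing import Dict, Iterable, List


def reconcile_keys(source_keys: Iterable[str], replica_keys: Iterable[str]) -> Dict[str, List[str]]:
    """Report keys present on one side only."""
    source, replica = set(source_keys), set(replica_keys)
    return {
        "missing_in_replica": sorted(source - replica),   # not yet replicated
        "orphaned_in_replica": sorted(replica - source),  # delete not yet applied
    }


# Sample keys only; in practice these come from the two tables
report = reconcile_keys(
    ["TXN002", "TXN004", "TXN006"],
    ["TXN001", "TXN002", "TXN004"],
)
print(report)
```

Collecting all keys to the driver is only reasonable for small tables; for large ones the same comparison is better expressed as a FULL OUTER JOIN in SQL.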

## Step 9: Performance Optimization Tips

Here are some optimizations for production deployments:

In [0]:
-- Optimize R2 replica table for query performance
OPTIMIZE global_analytics.transactions_r2_replica
ZORDER BY (region, transaction_date);

-- Vacuum old versions (keep 7 days for time travel)
VACUUM global_analytics.transactions_r2_replica RETAIN 168 HOURS;

In [0]:
# Configure partition strategy for large tables
# This example partitions by date for efficient time-based queries

spark.sql("""
CREATE TABLE IF NOT EXISTS global_analytics.transactions_r2_replica_partitioned
(
  transaction_id STRING,
  customer_id STRING,
  product_id STRING,
  amount DECIMAL(10,2),
  transaction_date TIMESTAMP,
  region STRING,
  status STRING,
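  _replicated_at TIMESTAMP,
  -- Delta cannot partition by an expression, so derive a DATE column instead
  transaction_day DATE GENERATED ALWAYS AS (CAST(transaction_date AS DATE))
)
USING DELTA
PARTITIONED BY (transaction_day)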
LOCATION 'r2://global-analytics-replication@<account-id>.r2.cloudflarestorage.com/transactions_partitioned/'
COMMENT 'Partitioned replica for better query performance'
""")

## Step 10: Sharing R2 Location with Recipients

To enable recipients to access the R2 data, share the following information:

In [0]:
# Generate recipient access instructions
recipient_instructions = f"""
=== R2 Data Access Information ===

Storage Location: s3://global-analytics-replication/transactions/
Endpoint: https://<your-account-id>.r2.cloudflarestorage.com
Region: auto (the region value R2's S3-compatible API expects)

Required Credentials:
- Access Key ID: [Provide via secure channel]
- Secret Access Key: [Provide via secure channel]

Data Format: Delta Lake
Update Frequency: Every 15 minutes
Replication Lag: Up to about 15 minutes (one schedule interval) plus job run time

Table Schema:
  - transaction_id: STRING (unique key used for MERGE; not an enforced constraint)
  - customer_id: STRING
  - product_id: STRING
  - amount: DECIMAL(10,2)
  - transaction_date: TIMESTAMP
  - region: STRING
  - status: STRING
  - _replicated_at: TIMESTAMP (Replication metadata)

Usage Notes:
- Data is read-only for recipients
- Changes are replicated incrementally
- Time travel available for 7 days
- No egress charges from Cloudflare R2
"""

print(recipient_instructions)

## Summary

✅ **What we accomplished:**

1. Created source table with Change Data Feed enabled
2. Configured Cloudflare R2 as external storage location
3. Implemented incremental replication using CDF and MERGE INTO
4. Built monitoring and metadata tracking
5. Optimized replica for query performance
6. Prepared recipient access instructions

**Key Benefits of This Pattern:**
- 🌍 **Global Access**: Recipients in any region get low-latency access
- 💰 **Cost Effective**: Zero egress fees with Cloudflare R2
- ⚡ **Performance**: Direct storage access without protocol overhead
- 🔄 **Incremental**: CDF enables efficient change tracking
- 🛡️ **Resilient**: Recipients can access data even if provider workspace is offline
- 📊 **Flexible**: Recipients can use any Delta Lake compatible tool

**Trade-offs vs. Direct Delta Sharing:**
- ➕ Lower latency for recipients
- ➕ Lower costs (no egress fees)
- ➕ Works across any cloud/region
- ➖ Replication lag (not real-time)
- ➖ Additional storage costs
- ➖ More complex setup and maintenance

**Next Steps:**
Continue to the recipient notebook to see how partners access and query this replicated data.

&copy; 2025 Databricks, Inc. All rights reserved.