
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>

# 3.1 DEMO: Cross Cloud Replication with Cloudflare R2 [Provider]

## Overview
This demo showcases how to set up cross-cloud replication using Cloudflare R2 as an intermediary storage layer. The provider creates a managed table with Change Data Feed (CDF) enabled, then replicates changes to an external table on R2 for global, cost-effective data sharing.

## Learning Objectives
By the end of this demo, you will understand:
1. How to enable Change Data Feed on managed tables
2. How to configure Cloudflare R2 external storage
3. How to replicate table changes to R2 using CDF
4. How to automate replication using Databricks Jobs

## Architecture
A Cloudflare R2 bucket serves as the storage location for an external table that sits between the provider and its recipients. The provider writes changes from a managed source table to this external table, and the changes are then replicated asynchronously to another managed table in each recipient's region or cloud provider.
<br />
<br />
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://github.com/stackql/databricks-data-sharing-and-collaboration/blob/main/images/cloudflare-r2-replication.png?raw=true"
    alt="Cloudflare R2 Replication"
  >
</div>
<br />
<br />

**Benefits:**
- Zero egress costs with Cloudflare R2
- Global data distribution without provider dependencies
- Automated change propagation using CDF
- Cost-effective sharing with unlimited recipients

## Setup

Run the common setup and demo configuration scripts.

In [0]:
%run ./Includes/_common

In [0]:
%run ./Includes/Demo-Setup-3.1

## Step 1: Create Storage Credential for Cloudflare R2

Create a storage credential to access Cloudflare R2. Replace the access key and secret key with your actual R2 credentials.

In [0]:
CREATE STORAGE CREDENTIAL r2_credential
WITH (
  AWS_ACCESS_KEY_ID 'your-r2-access-key',
  AWS_SECRET_ACCESS_KEY 'your-r2-secret-key'
)
COMMENT 'Cloudflare R2 credentials for cross-cloud replication'

## Step 2: Create External Location for R2 Bucket

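Create an external location pointing to your Cloudflare R2 bucket. The URL below is a placeholder; Databricks documents R2 locations in the form `r2://<bucket>@<account-id>.r2.cloudflarestorage.com`, so substitute your own bucket and account ID.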

In [0]:
CREATE EXTERNAL LOCATION r2_location
URL 's3://databricks-demo/'
WITH (
  STORAGE CREDENTIAL r2_credential
)
COMMENT 'Cloudflare R2 bucket for cross-cloud data replication'

## Step 3: Create Source Managed Table with CDF Enabled

Create a managed table with Change Data Feed enabled to serve as our data source.

In [0]:
CREATE OR REPLACE TABLE IDENTIFIER(CONCAT(DA.catalog, '.', DA.schema, '.', DA.source_table)) (
  transaction_id STRING,
  customer_id STRING,
  product_category STRING,
  amount DECIMAL(10,2),
  transaction_date DATE,
  region STRING,
  created_at TIMESTAMP
)
PARTITIONED BY (transaction_date)
TBLPROPERTIES (
  'delta.enableChangeDataFeed' = 'true',
  'delta.autoOptimize.optimizeWrite' = 'true'
)

## Step 4: Insert Sample Data

Insert some sample transaction data into our source table.

In [0]:
INSERT INTO IDENTIFIER(CONCAT(DA.catalog, '.', DA.schema, '.', DA.source_table))
VALUES 
  ('txn_001', 'cust_001', 'Electronics', 299.99, '2024-10-26', 'North America', current_timestamp()),
  ('txn_002', 'cust_002', 'Clothing', 79.50, '2024-10-26', 'Europe', current_timestamp()),
  ('txn_003', 'cust_003', 'Books', 24.99, '2024-10-26', 'Asia Pacific', current_timestamp()),
  ('txn_004', 'cust_004', 'Home & Garden', 149.00, '2024-10-26', 'Latin America', current_timestamp()),
  ('txn_005', 'cust_005', 'Sports', 89.99, '2024-10-26', 'North America', current_timestamp())

## Step 5: Verify CDF is Working

Check that Change Data Feed is properly enabled and working.

In [0]:
-- Check table properties
DESCRIBE DETAIL IDENTIFIER(CONCAT(DA.catalog, '.', DA.schema, '.', DA.source_table))

In [0]:
-- View all changes recorded since version 0 (table_changes takes the table name as a string)
SELECT * FROM table_changes(CONCAT(DA.catalog, '.', DA.schema, '.', DA.source_table), 0)

In [0]:
-- View current data
SELECT * FROM IDENTIFIER(CONCAT(DA.catalog, '.', DA.schema, '.', DA.source_table))

## Step 6: Create External Table on Cloudflare R2

Create an external table that uses Cloudflare R2 storage for replication.

In [0]:
CREATE OR REPLACE TABLE IDENTIFIER(CONCAT(DA.catalog, '.', DA.schema, '.', DA.replica_table)) (
  transaction_id STRING,
  customer_id STRING,
  product_category STRING,
  amount DECIMAL(10,2),
  transaction_date DATE,
  region STRING,
  created_at TIMESTAMP
)
LOCATION CONCAT('s3://databricks-demo/', DA.r2_path)
PARTITIONED BY (transaction_date)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true'
)

## Step 7: Initial Data Replication

Perform an initial full copy of the data to the R2 external table.

In [0]:
INSERT OVERWRITE IDENTIFIER(CONCAT(DA.catalog, '.', DA.schema, '.', DA.replica_table))
SELECT * FROM IDENTIFIER(CONCAT(DA.catalog, '.', DA.schema, '.', DA.source_table))

In [0]:
-- Verify replication
SELECT 'Source' as table_type, COUNT(*) as record_count FROM IDENTIFIER(CONCAT(DA.catalog, '.', DA.schema, '.', DA.source_table))
UNION ALL
SELECT 'R2 Replica' as table_type, COUNT(*) as record_count FROM IDENTIFIER(CONCAT(DA.catalog, '.', DA.schema, '.', DA.replica_table))

## Step 8: Add More Data to Test CDF

Add more data to the source table to generate changes.

In [0]:
INSERT INTO IDENTIFIER(CONCAT(DA.catalog, '.', DA.schema, '.', DA.source_table))
VALUES 
  ('txn_006', 'cust_006', 'Electronics', 499.99, '2024-10-27', 'Europe', current_timestamp()),
  ('txn_007', 'cust_007', 'Clothing', 129.50, '2024-10-27', 'Asia Pacific', current_timestamp()),
  ('txn_008', 'cust_008', 'Books', 34.99, '2024-10-27', 'North America', current_timestamp())

## Step 9: Read Changes Using CDF

Read the changes from the source table using Change Data Feed.

In [0]:
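-- Read changes from version 2 onwards. On a freshly created table, version 0 is
-- the CREATE, version 1 is the initial load (already copied in Step 7), and
-- version 2 is the Step 8 insert. Adjust the start version if your table history differs.
SELECT 
  transaction_id,
  customer_id,
  product_category,
  amount,
  transaction_date,
  region,
  created_at,
  _change_type,
  _commit_version
FROM table_changes(CONCAT(DA.catalog, '.', DA.schema, '.', DA.source_table), 2)
WHERE _change_type = 'insert'
ORDER BY _commit_version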
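
The query above keeps only `insert` rows, which is enough for this append-only demo. Change Data Feed also emits `update_preimage`, `update_postimage`, and `delete` rows. The pure-Python sketch below simulates how a replication step might fold such rows into a replica keyed by `transaction_id`. The `apply_changes` helper and the sample rows are illustrative assumptions, not part of the demo notebook.

```python
from typing import Dict, Iterable

Row = Dict[str, object]


def apply_changes(replica: Dict[str, Row], changes: Iterable[Row]) -> Dict[str, Row]:
    """Fold CDF-style change rows into a replica keyed by transaction_id."""
    result = dict(replica)  # shallow copy so the input mapping is not mutated
    # Apply commits in version order so that later versions win.
    for change in sorted(changes, key=lambda c: c["_commit_version"]):
        kind = change["_change_type"]
        key = change["transaction_id"]
        if kind in ("insert", "update_postimage"):
            # Keep data columns only; drop the CDF metadata columns.
            result[key] = {k: v for k, v in change.items() if not k.startswith("_")}
        elif kind == "delete":
            result.pop(key, None)
        # update_preimage rows carry the old values and need no action here.
    return result


# Hypothetical sample data: one replicated row plus four change rows.
replica = {"txn_001": {"transaction_id": "txn_001", "amount": "299.99"}}
changes = [
    {"transaction_id": "txn_006", "amount": "499.99", "_change_type": "insert", "_commit_version": 2},
    {"transaction_id": "txn_001", "amount": "299.99", "_change_type": "update_preimage", "_commit_version": 3},
    {"transaction_id": "txn_001", "amount": "279.99", "_change_type": "update_postimage", "_commit_version": 3},
    {"transaction_id": "txn_006", "amount": "499.99", "_change_type": "delete", "_commit_version": 4},
]

synced = apply_changes(replica, changes)
print(synced)
```

On Databricks itself, this kind of upsert-and-delete logic is usually expressed as a `MERGE INTO` statement against the replica table rather than in Python.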

## Step 10: Replicate Changes to R2

Apply the detected changes to our R2 external table.

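**Note**: In production, store the last processed version as a checkpoint so that each run reads only the commits it has not yet replicated. Re-reading a version that was already replicated appends duplicate rows to the replica.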

In [0]:
-- Insert only the new changes (excluding CDF metadata columns).
-- Start at version 2: version 1 (the initial load) was already copied in Step 7,
-- so reading from version 1 would duplicate those rows in the replica.
INSERT INTO IDENTIFIER(CONCAT(DA.catalog, '.', DA.schema, '.', DA.replica_table))
SELECT 
  transaction_id,
  customer_id,
  product_category,
  amount,
  transaction_date,
  region,
  created_at
FROM table_changes(CONCAT(DA.catalog, '.', DA.schema, '.', DA.source_table), 2)
WHERE _change_type = 'insert'
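
The checkpoint mentioned in the note can be sketched in a few lines of Python. This is a minimal illustration that assumes the checkpoint lives in a small JSON file; the file name and the three helper functions are hypothetical, and a real job might keep this state in a Delta table instead.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_last_version(path: Path) -> Optional[int]:
    """Return the last replicated commit version, or None on the first run."""
    if not path.exists():
        return None
    return json.loads(path.read_text())["last_version"]


def save_last_version(path: Path, version: int) -> None:
    """Record the highest commit version that has been replicated."""
    path.write_text(json.dumps({"last_version": version}))


def next_start_version(last_version: Optional[int], first_version: int = 0) -> int:
    """The table_changes start version is inclusive, so resume one past the checkpoint."""
    return first_version if last_version is None else last_version + 1


# Hypothetical checkpoint file; a temporary directory keeps the example self-contained.
checkpoint = Path(tempfile.mkdtemp()) / "replication_checkpoint.json"

start = next_start_version(load_last_version(checkpoint))
print(start)  # no checkpoint yet, so start from the first version

save_last_version(checkpoint, 1)  # pretend versions 0 and 1 have been replicated
resumed = next_start_version(load_last_version(checkpoint))
print(resumed)  # resume from the first version not yet replicated
```

A run should also compare the start version with the table's latest version before querying, since asking for a version that does not exist yet can fail.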

## Step 11: Verify Final Replication Status

Confirm that both tables are now in sync.

In [0]:
-- Final verification
SELECT 
  'Source Table' as table_name,
  COUNT(*) as total_records,
  SUM(amount) as total_amount,
  MAX(transaction_date) as latest_date
FROM IDENTIFIER(CONCAT(DA.catalog, '.', DA.schema, '.', DA.source_table))

UNION ALL

SELECT 
  'R2 Replica' as table_name,
  COUNT(*) as total_records,
  SUM(amount) as total_amount,
  MAX(transaction_date) as latest_date
FROM IDENTIFIER(CONCAT(DA.catalog, '.', DA.schema, '.', DA.replica_table))

In [0]:
-- Show sample data from R2 replica
SELECT * FROM IDENTIFIER(CONCAT(DA.catalog, '.', DA.schema, '.', DA.replica_table))
ORDER BY created_at DESC
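
The comparison that the final verification query supports can also be automated. The sketch below checks the three aggregates from that query for two tables and reports which ones differ. `TableSummary` and `find_mismatches` are hypothetical names, and the figures are computed from the demo's eight sample rows. The `duplicated` summary shows what a replica would report if the five initial rows were appended twice.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import List


@dataclass(frozen=True)
class TableSummary:
    total_records: int
    total_amount: Decimal
    latest_date: date


def find_mismatches(source: TableSummary, replica: TableSummary) -> List[str]:
    """Return the names of the summary fields that differ between two tables."""
    fields = ("total_records", "total_amount", "latest_date")
    return [name for name in fields if getattr(source, name) != getattr(replica, name)]


source = TableSummary(8, Decimal("1307.95"), date(2024, 10, 27))
in_sync = TableSummary(8, Decimal("1307.95"), date(2024, 10, 27))
duplicated = TableSummary(13, Decimal("1951.42"), date(2024, 10, 27))

ok = find_mismatches(source, in_sync)
bad = find_mismatches(source, duplicated)
print(ok)
print(bad)
```

Matching aggregates are a quick health check, not proof of row-level equality; a stricter check would compare the rows themselves.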

## Summary

### What We Accomplished:

✅ **Storage Setup**: Created storage credential and external location for Cloudflare R2  
✅ **Source Table**: Created managed table with Change Data Feed enabled  
✅ **External Table**: Created Delta table on R2 for cross-cloud sharing  
✅ **Initial Load**: Performed full data replication to R2  
✅ **Change Detection**: Used CDF to capture incremental changes  
✅ **Incremental Sync**: Replicated new changes to R2 external table  

### Key Benefits:

**Global Distribution**: Data available worldwide via Cloudflare's network  
**Zero Egress Costs**: Eliminate expensive data transfer fees  
**High Performance**: Fast access through global CDN  
**Change Data Feed**: Efficient incremental replication  

### Next Steps:

1. **Production Setup**: Use Databricks Secrets for R2 credentials
2. **Automation**: Schedule this as a Databricks Job for continuous replication
3. **State Management**: Store last processed version for efficient incremental processing
4. **Monitoring**: Add alerts and monitoring for replication health
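
As a starting point for the automation step, the sketch below builds a job definition that runs a replication notebook every hour. The field names follow the Databricks Jobs API 2.1 payload format. The job name, task key, and notebook path are hypothetical, and compute settings are left out, so treat it as an outline, not a complete job specification.

```python
import json


def build_job_settings(notebook_path: str, cron: str, timezone: str = "UTC") -> dict:
    """Build a Jobs API 2.1 style payload that runs one notebook on a schedule."""
    return {
        "name": "r2-cdf-replication",  # hypothetical job name
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
            "pause_status": "UNPAUSED",
        },
        # Prevent overlapping runs from replicating the same versions twice.
        "max_concurrent_runs": 1,
        "tasks": [
            {
                "task_key": "replicate_changes",  # hypothetical task key
                "notebook_task": {"notebook_path": notebook_path},
            }
        ],
    }


# Quartz cron fields: second, minute, hour, day of month, month, day of week.
settings = build_job_settings("/Workspace/Shared/r2_replication", "0 0 * * * ?")
print(json.dumps(settings, indent=2))
```

Limiting the job to one concurrent run pairs naturally with the stored checkpoint from the state management step, because two overlapping runs could otherwise read the same start version.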

The data is now available on R2 for recipients to access with zero egress costs!

---
&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>