
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>

# 3.1 DEMO: Cross Cloud Replication with Cloudflare R2 [Provider]

## Overview
This demo showcases how to set up cross-cloud replication using Cloudflare R2 as an intermediary storage layer. The provider creates a managed table, then replicates data to an external table on R2 for global, cost-effective data sharing.

## Learning Objectives
By the end of this demo, you will understand:
1. How to configure Cloudflare R2 external storage
2. How to replicate table data to R2 using insert-only MERGE operations
3. How to use R2 as a bridge for cross-cloud data sharing
4. How to automate replication using Databricks Jobs

## Architecture
We use a Cloudflare R2 bucket as the storage location for an external table, which serves as an intermediary between the provider and recipients. Changes from a managed source table are replicated to R2, and recipients can then pull the data asynchronously into their own managed table in their own region and cloud.

> **Note:** For more sophisticated use cases, you could leverage Change Data Feed (CDF) to capture and replicate only incremental changes. In this demo, we'll use a simpler insert-only MERGE approach for clarity.

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://github.com/stackql/databricks-data-sharing-and-collaboration/blob/main/images/cloudflare-r2-replication-arch.png?raw=true"
    alt="Cloudflare R2 replication"
  >
</div>

**Benefits:**
- Zero egress costs with Cloudflare R2
- Global data distribution without cloud-provider dependencies
- Simple replication strategy using standard SQL
- Cost-effective sharing with unlimited recipients

## Prerequisites

Before starting this demo, you'll need to set up Cloudflare R2 and gather the required connection details. From your Cloudflare dashboard, create an R2 bucket and generate API credentials with read/write permissions. The key information you'll need includes your R2 Access Key ID and Secret Access Key, which authenticate your Databricks workspace to the R2 storage service. You'll also need the S3-compatible API endpoint URL for your R2 bucket and the specific bucket name where the replicated data will be stored.

These credentials will be configured in Databricks as a storage credential and external location, enabling secure access to your R2 bucket. The receiving party will also need access to these same R2 credentials to read the replicated data from their own Databricks environment.

<br />
<br />
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://github.com/stackql/databricks-data-sharing-and-collaboration/blob/main/images/cloudflare-r2-setup.png?raw=true"
    alt="Cloudflare R2 Setup"
  >
</div>
<br />
<br />

## Setup

Run the common setup and demo configuration scripts.

In [0]:
%run ./Includes/Demo-Setup-3.1

## Step 1: Create Storage Credential for Cloudflare R2

A storage credential in Databricks provides the authentication mechanism to access external cloud storage. For Cloudflare R2, we'll create a credential using the R2 API tokens (Access Key ID and Secret Access Key).

### Creating the Credential in the Databricks UI

1. Navigate to **Catalog** in the left sidebar, then select **External Data** → **Credentials**

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://github.com/stackql/databricks-data-sharing-and-collaboration/blob/main/images/create-cloudflare-r2-storage-cred-1.png?raw=true"
    alt="Navigate to Credentials"
  >
</div>

2. Click the **Create credential** button in the top right corner

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://github.com/stackql/databricks-data-sharing-and-collaboration/blob/main/images/create-cloudflare-r2-storage-cred-2.png?raw=true"
    alt="Create credential button"
  >
</div>

3. In the **Create a new credential** dialog, configure the following:
   - **Storage Credential**: Select this radio button (not Service Credential)
   - **Credential Type**: Select **Cloudflare API Token** from the dropdown
   - **Credential name**: Enter `r2_credential` (or your preferred name)
   - **Account ID**: Enter your Cloudflare Account ID (from R2 dashboard)
   - **Access key ID**: Enter your R2 Access Key ID
   - **Secret access key**: Enter your R2 Secret Access Key
   - **Comment**: Enter a description like "Cloudflare R2 credentials for replicated data"

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://github.com/stackql/databricks-data-sharing-and-collaboration/blob/main/images/create-cloudflare-r2-storage-cred-3.png?raw=true"
    alt="Create a new credential dialog"
  >
</div>
<br/>
<br/>
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://github.com/stackql/databricks-data-sharing-and-collaboration/blob/main/images/create-cloudflare-r2-storage-cred-4.png?raw=true"
    alt="Create a new credential dialog (continued)"
  >
</div>

4. Click **Create** to save the credential

5. Once created, you'll see your credential in the list with its configuration details. The credential is now ready to use for accessing your R2 bucket.

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://github.com/stackql/databricks-data-sharing-and-collaboration/blob/main/images/create-cloudflare-r2-storage-cred-5.png?raw=true"
    alt="Created credential details"
  >
</div>

### Verify the Credential

You can verify your storage credential was created successfully by running the following SQL command:

In [0]:
-- show available storage credentials
SHOW STORAGE CREDENTIALS;

In [0]:
-- see details of a storage credential
DESCRIBE STORAGE CREDENTIAL r2_credential;

## Step 2: Create External Location for R2 Bucket

Create an external location pointing to your Cloudflare R2 bucket. Replace the bucket name and account ID in the statement below with your own. The URL takes the form:

`r2://{r2_bucket_name}@{cloudflare_account_id}.r2.cloudflarestorage.com`

In [0]:
-- create an external location for your cloudflare r2 bucket
CREATE EXTERNAL LOCATION r2_location
URL 'r2://databricks-demo@4132d7d5587ee99b9d482ecfc2c1853c.r2.cloudflarestorage.com'
WITH (
  STORAGE CREDENTIAL r2_credential
  )
COMMENT 'Cloudflare R2 bucket for cross-cloud data replication';
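
As with the storage credential, it can help to confirm the external location before building on it. A minimal sketch, assuming the location was created with the name `r2_location` used above:

```sql
-- List the external locations visible to you; r2_location should appear
SHOW EXTERNAL LOCATIONS;

-- Inspect the URL and storage credential bound to the location
DESCRIBE EXTERNAL LOCATION r2_location;
```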

## Step 3: Create Source Managed Table

Create a managed table to serve as our data source.

In [0]:
DROP TABLE IF EXISTS global_analytics.transactions.transactions;

CREATE OR REPLACE TABLE global_analytics.transactions.transactions (
  transaction_id STRING,
  customer_id STRING,
  product_category STRING,
  amount DECIMAL(10,2),
  transaction_date DATE,
  region STRING,
  created_at TIMESTAMP
)
PARTITIONED BY (transaction_date)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true'
);

## Step 4: Insert Sample Data

Insert some sample transaction data into our source table.

In [0]:
INSERT INTO global_analytics.transactions.transactions
VALUES 
  ('txn_001', 'cust_001', 'Electronics', 299.99, '2024-10-26', 'North America', current_timestamp()),
  ('txn_002', 'cust_002', 'Clothing', 79.50, '2024-10-26', 'Europe', current_timestamp()),
  ('txn_003', 'cust_003', 'Books', 24.99, '2024-10-26', 'Asia Pacific', current_timestamp()),
  ('txn_004', 'cust_004', 'Home & Garden', 149.00, '2024-10-26', 'Latin America', current_timestamp()),
  ('txn_005', 'cust_005', 'Sports', 89.99, '2024-10-26', 'North America', current_timestamp());

## Step 5: Verify Source Table

Verify the source table was created successfully and contains data.

In [0]:
-- Check table properties
DESCRIBE DETAIL global_analytics.transactions.transactions;

In [0]:
-- View current data
SELECT * FROM global_analytics.transactions.transactions
ORDER BY transaction_date, transaction_id;

## Step 6: Create External Table on Cloudflare R2

Create an external table that uses Cloudflare R2 storage for replication.

In [0]:
DROP TABLE IF EXISTS global_analytics.transactions.transactions_r2_replica;

CREATE TABLE global_analytics.transactions.transactions_r2_replica (
  transaction_id STRING,
  customer_id STRING,
  product_category STRING,
  amount DECIMAL(10,2),
  transaction_date DATE,
  region STRING,
  created_at TIMESTAMP
)
LOCATION 'r2://databricks-demo@4132d7d5587ee99b9d482ecfc2c1853c.r2.cloudflarestorage.com/transactions_r2_replica'
PARTITIONED BY (transaction_date)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true'
);
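
To confirm that the replica is registered against R2 rather than managed storage, one option is to inspect its metadata. This is a small sketch, not part of the original demo. The `location` column of `DESCRIBE DETAIL` should show the `r2://` path from the `LOCATION` clause, and `DESCRIBE EXTENDED` should report the table type as `EXTERNAL`:

```sql
-- Same command used for the source table in Step 5; check the location column
DESCRIBE DETAIL global_analytics.transactions.transactions_r2_replica;

-- Shows table type (EXTERNAL vs MANAGED) alongside the column list
DESCRIBE EXTENDED global_analytics.transactions.transactions_r2_replica;
```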

## Step 7: Initial Data Replication

Perform an initial full copy of the data to the R2 external table.

In [0]:
INSERT OVERWRITE global_analytics.transactions.transactions_r2_replica
SELECT * FROM global_analytics.transactions.transactions;

In [0]:
-- Verify replication
SELECT 'Source' as table_type, COUNT(*) as record_count FROM global_analytics.transactions.transactions
UNION ALL
SELECT 'R2 Replica' as table_type, COUNT(*) as record_count FROM global_analytics.transactions.transactions_r2_replica;

## Step 8: Add More Data to Test Incremental Replication

Add more data to the source table to generate changes.

In [0]:
INSERT INTO global_analytics.transactions.transactions
VALUES 
  ('txn_006', 'cust_006', 'Electronics', 499.99, '2024-10-27', 'Europe', current_timestamp()),
  ('txn_007', 'cust_007', 'Clothing', 129.50, '2024-10-27', 'Asia Pacific', current_timestamp()),
  ('txn_008', 'cust_008', 'Books', 34.99, '2024-10-27', 'North America', current_timestamp());

## Step 9: Replicate Changes to R2 Using MERGE

Apply changes to the R2 external table using an insert-only MERGE operation. This approach handles new records efficiently by only inserting rows that don't already exist in the replica table.

> **Note:** For more advanced scenarios with updates and deletes, consider using Change Data Feed (CDF) to track all types of changes.

In [0]:
-- Use MERGE to insert only new records
MERGE INTO global_analytics.transactions.transactions_r2_replica AS target
USING global_analytics.transactions.transactions AS source
ON target.transaction_id = source.transaction_id
WHEN NOT MATCHED THEN
  INSERT (
    transaction_id,
    customer_id,
    product_category,
    amount,
    transaction_date,
    region,
    created_at
  )
  VALUES (
    source.transaction_id,
    source.customer_id,
    source.product_category,
    source.amount,
    source.transaction_date,
    source.region,
    source.created_at
  );
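
One way to see what the MERGE actually did is to read the latest entry in the replica's Delta history. A minimal sketch: assuming the earlier steps ran as written, the `operationMetrics` map for this MERGE would be expected to report three inserted rows (`txn_006` through `txn_008`) under `numTargetRowsInserted`.

```sql
-- Show only the most recent operation on the replica table
DESCRIBE HISTORY global_analytics.transactions.transactions_r2_replica LIMIT 1;
```

Running the MERGE a second time without new source data should report zero inserted rows, which is what makes the operation safe to repeat.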

## Step 10: Verify Final Replication Status

Confirm that both tables are now in sync.

In [0]:
-- Final verification
SELECT 
  'Source Table' as table_name,
  COUNT(*) as total_records,
  SUM(amount) as total_amount,
  MAX(transaction_date) as latest_date
FROM global_analytics.transactions.transactions

UNION ALL

SELECT 
  'R2 Replica' as table_name,
  COUNT(*) as total_records,
  SUM(amount) as total_amount,
  MAX(transaction_date) as latest_date
FROM global_analytics.transactions.transactions_r2_replica;

In [0]:
-- Show sample data from R2 replica
SELECT * FROM global_analytics.transactions.transactions_r2_replica
ORDER BY created_at DESC;
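
Matching counts and totals do not prove that the same rows are present on both sides. A key-level comparison is a stricter check. The sketch below lists source keys that are missing from the replica; an empty result means every source `transaction_id` has been replicated.

```sql
-- Source rows with no matching transaction_id in the replica
SELECT s.transaction_id
FROM global_analytics.transactions.transactions AS s
LEFT ANTI JOIN global_analytics.transactions.transactions_r2_replica AS r
  ON s.transaction_id = r.transaction_id
ORDER BY s.transaction_id;
```

Swapping the two tables shows rows that exist only in the replica, for example after a delete on the source that the insert-only MERGE did not propagate.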

## Summary

### What We Accomplished:

✅ **Storage Setup**: Created storage credential and external location for Cloudflare R2  
✅ **Source Table**: Created managed table with sample transaction data  
✅ **External Table**: Created Delta table on R2 for cross-cloud sharing  
✅ **Initial Load**: Performed full data replication to R2  
✅ **Incremental Sync**: Used insert-only MERGE to replicate new data to R2  

### Key Benefits:

**Global Distribution**: Data available worldwide via Cloudflare's network  
**Zero Egress Costs**: Eliminate expensive data transfer fees  
**High Performance**: Low-latency access through Cloudflare's global network  
**Simple Replication**: Straightforward MERGE-based approach for insert-only scenarios  

### Replication Strategy:

This demo uses an **insert-only MERGE** approach where:
- New records from the source table are identified by `transaction_id`
- Only non-existent records are inserted into the replica table
- Updates and deletes are not tracked in this simplified approach
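
If updates and deletes do need to reach the replica without CDF, one alternative is a full-sync MERGE that compares the whole source on every run. This is a sketch of a different technique from the insert-only MERGE in Step 9, not part of the demo. Comparing `amount` and `region` is an assumption for illustration, and the `WHEN NOT MATCHED BY SOURCE` clause requires a runtime that supports it.

```sql
MERGE INTO global_analytics.transactions.transactions_r2_replica AS target
USING global_analytics.transactions.transactions AS source
ON target.transaction_id = source.transaction_id
-- Update rows whose compared columns changed (<=> is null-safe equality)
WHEN MATCHED AND NOT (
       target.amount <=> source.amount
   AND target.region <=> source.region
) THEN
  UPDATE SET *
-- Insert rows that exist only in the source
WHEN NOT MATCHED THEN
  INSERT *
-- Remove rows that no longer exist in the source
WHEN NOT MATCHED BY SOURCE THEN
  DELETE;
```

Because it reads the full source on each run, this approach costs more than the insert-only MERGE as the table grows.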

**For more sophisticated scenarios**, consider enabling **Change Data Feed (CDF)** on the source table. CDF records row-level inserts, updates, and deletes, so the replica can mirror every change rather than only new rows.
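
A minimal sketch of what enabling and reading CDF could look like on the source table. The starting version `2` is a placeholder, not a value from this demo. Use `DESCRIBE HISTORY` to find the version at which CDF was enabled, since changes committed before that point are not recorded.

```sql
-- Enable Change Data Feed on the source table
ALTER TABLE global_analytics.transactions.transactions
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Read row-level changes from a starting table version onward.
-- The version number 2 is a placeholder; substitute your own.
SELECT
  transaction_id,
  amount,
  _change_type,      -- insert, update_preimage, update_postimage, or delete
  _commit_version,
  _commit_timestamp
FROM table_changes('global_analytics.transactions.transactions', 2)
ORDER BY _commit_version;
```

A replication job could then apply these changes to the R2 replica with a MERGE keyed on `transaction_id`, using `_change_type` to decide whether to insert, update, or delete.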

### Next Steps:

1. **Production Setup**: Use Databricks Secrets for R2 credentials
2. **Automation**: Schedule this as a Databricks Job for continuous replication
3. **Enhanced Change Tracking**: Consider implementing CDF for comprehensive change management
4. **Monitoring**: Add alerts and monitoring for replication health

The data is now available on R2 for recipients to access with zero egress costs!

---
&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>