
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>

# 3.1 DEMO: Cross Cloud Replication with Cloudflare R2 [Recipient]

## Overview
This demo showcases how recipients can access replicated data from Cloudflare R2 and maintain synchronized local copies using `MERGE` operations. Recipients read changes from the R2-hosted Delta table location (using a view) and apply them to their local tables.

## Learning Objectives
By the end of this demo, you will understand:
1. How to access data hosted on Cloudflare R2
2. How to implement replication from a provider's external table hosted on Cloudflare R2 to a local managed table

We use a Cloudflare R2 bucket as the storage location shared between the provider and recipients. The provider writes changes from a managed source table to the bucket, recipients read them through a view, and the changes are then replicated asynchronously to another managed table in the recipient's region or cloud provider.

**Benefits:**
- Zero egress costs with Cloudflare R2
- Global data distribution without provider dependencies
- Automated change propagation (e.g. using CDF)
- Cost-effective sharing with unlimited recipients

## Prerequisites

The provider should have completed the setup described in the corresponding provider notebook and provided you with:
- The Cloudflare R2 bucket path (S3-compatible endpoint)
- Cloudflare R2 Access Key ID
- Cloudflare R2 Secret Access Key

**For instructions on setting up a storage credential for Cloudflare R2**, see the corresponding demo notebook for the provider.  

> **Note:** The storage credential and external location setup is identical for both provider and recipient. You'll need read access to the R2 bucket where the provider stores replicated data.

## Setup

Run the common setup and demo configuration scripts.

In [0]:
%run ./Includes/Demo-Setup-3.1

## Step 1: Create Storage Credential for R2 Access

Create a storage credential to access the Cloudflare R2 bucket (read-only access for recipients).

**For detailed step-by-step instructions with screenshots**, see **Step 1** in the provider notebook. The process is identical for recipients, except you only need read access to the R2 bucket.

In [0]:
-- show details of r2 storage credential
DESCRIBE STORAGE CREDENTIAL r2_credential;

## Step 2: Create External Location for R2

Create an external location pointing to the R2 bucket where the provider stores data.

In [0]:
CREATE EXTERNAL LOCATION IF NOT EXISTS r2_location
URL 'r2://databricks-demo@4132d7d5587ee99b9d482ecfc2c1853c.r2.cloudflarestorage.com'
WITH (
  STORAGE CREDENTIAL r2_credential
)
COMMENT 'Cloudflare R2 bucket for accessing replicated data'
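
Before building anything on top of the location, it can help to confirm that it resolves and that the credential can actually read the bucket. The following is a minimal sketch, assuming you hold `READ FILES` on `r2_location`. The group name `recipient_analysts` is hypothetical and should be replaced with a principal that exists in your workspace.

```sql
-- Show the URL and storage credential bound to the external location
DESCRIBE EXTERNAL LOCATION r2_location;

-- List objects at the bucket root; this fails if the credential cannot read the bucket
LIST 'r2://databricks-demo@4132d7d5587ee99b9d482ecfc2c1853c.r2.cloudflarestorage.com';

-- Optional: allow another group to read through this location
-- (`recipient_analysts` is a hypothetical group name)
GRANT READ FILES ON EXTERNAL LOCATION r2_location TO `recipient_analysts`;
```

If the `LIST` statement returns the `transactions_r2_replica` directory, the path used by the view in the next step should be reachable.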

## Step 3: Create a View Referencing R2 Replica Location

Create a view that points to the provider's data on R2.

⚠️ **Do not create an external table on the recipient side, as this could corrupt the table's metadata.**


In [0]:
CREATE OR REPLACE VIEW recipient_analytics.transactions.vw_r2_replicated_data AS
SELECT * FROM delta.`r2://databricks-demo@4132d7d5587ee99b9d482ecfc2c1853c.r2.cloudflarestorage.com/transactions_r2_replica`;

## Step 4: Verify R2 Data Access

Check that we can successfully access the data from R2.

In [0]:
-- test access to the view
SELECT * FROM recipient_analytics.transactions.vw_r2_replicated_data LIMIT 5;
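
Beyond sampling rows, you may want to see which version of the provider's Delta table you are reading, for example to record what was last synchronized. A possible sketch, assuming your read access to the R2 path also covers the table's `_delta_log` directory:

```sql
-- Table-level metadata for the replica: format, file count, size, partition columns
DESCRIBE DETAIL delta.`r2://databricks-demo@4132d7d5587ee99b9d482ecfc2c1853c.r2.cloudflarestorage.com/transactions_r2_replica`;

-- The five most recent commits written by the provider (version, timestamp, operation)
DESCRIBE HISTORY delta.`r2://databricks-demo@4132d7d5587ee99b9d482ecfc2c1853c.r2.cloudflarestorage.com/transactions_r2_replica` LIMIT 5;
```

Both statements only read metadata, so they are consistent with the warning above about not registering an external table on the recipient side.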

## Step 5: Create Local Managed Table

Create a local managed table to store synchronized data.

In [0]:
DROP TABLE IF EXISTS recipient_analytics.transactions.local_transactions;

CREATE TABLE recipient_analytics.transactions.local_transactions (
  transaction_id STRING,
  customer_id STRING,
  product_category STRING,
  amount DECIMAL(10,2),
  transaction_date DATE,
  region STRING,
  created_at TIMESTAMP
)
PARTITIONED BY (transaction_date)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true'
)
COMMENT 'Local synchronized copy of the provider transactions replicated via R2'

## Step 6: Synchronize Data Using `MERGE`

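Synchronize data from R2 to the local table using a `MERGE` operation. For simplicity, this demo uses an insert-only `MERGE`: rows whose `transaction_id` is not yet in the local table are inserted, and existing rows are left untouched.

> You could optionally maintain history on the target table (e.g. Type 1 or Type 2 SCDs)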

In [0]:
-- Use MERGE to insert only new records
MERGE INTO recipient_analytics.transactions.local_transactions AS target
USING recipient_analytics.transactions.vw_r2_replicated_data AS source
ON target.transaction_id = source.transaction_id
WHEN NOT MATCHED THEN
  INSERT (
    transaction_id,
    customer_id,
    product_category,
    amount,
    transaction_date,
    region,
    created_at
  )
  VALUES (
    source.transaction_id,
    source.customer_id,
    source.product_category,
    source.amount,
    source.transaction_date,
    source.region,
    source.created_at
  );
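
The insert-only `MERGE` above never changes or removes rows that already exist locally. If the provider can also update or delete transactions, a Type 1 style sync might look like the sketch below. It rests on several assumptions that the demo does not confirm: the replica holds a full current snapshot rather than a change feed, `transaction_id` is unique in the source, and the runtime supports `WHEN NOT MATCHED BY SOURCE` (Databricks Runtime 12.2 LTS or later).

```sql
-- Sketch of a Type 1 sync: overwrite changed rows, insert new ones, drop removed ones
MERGE INTO recipient_analytics.transactions.local_transactions AS target
USING recipient_analytics.transactions.vw_r2_replicated_data AS source
ON target.transaction_id = source.transaction_id
WHEN MATCHED THEN
  UPDATE SET
    target.customer_id      = source.customer_id,
    target.product_category = source.product_category,
    target.amount           = source.amount,
    target.transaction_date = source.transaction_date,
    target.region           = source.region,
    target.created_at       = source.created_at
WHEN NOT MATCHED THEN
  INSERT (transaction_id, customer_id, product_category, amount,
          transaction_date, region, created_at)
  VALUES (source.transaction_id, source.customer_id, source.product_category,
          source.amount, source.transaction_date, source.region, source.created_at)
-- Rows present locally but absent from the R2 snapshot are removed
WHEN NOT MATCHED BY SOURCE THEN
  DELETE;
```

Only use the `DELETE` clause when the source really is a complete snapshot. If the replica carries incremental changes, that clause would remove every local row missing from the latest batch.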

## Step 7: Verify Synchronization

Check that the sync was successful.

In [0]:
-- Compare record counts
SELECT 
  'R2 Source' as source,
  COUNT(*) as record_count,
  SUM(amount) as total_amount
FROM recipient_analytics.transactions.vw_r2_replicated_data

UNION ALL

SELECT 
  'Local Sync' as source,
  COUNT(*) as record_count,
  SUM(amount) as total_amount
FROM recipient_analytics.transactions.local_transactions
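
Matching counts and totals are a useful signal, but they do not show which rows are missing when the numbers differ. As a complementary check, an anti-join on the merge key can list the source records that have not yet reached the local table:

```sql
-- Transaction IDs present in the R2 replica but missing from the local table
SELECT source.transaction_id
FROM recipient_analytics.transactions.vw_r2_replicated_data AS source
LEFT ANTI JOIN recipient_analytics.transactions.local_transactions AS target
  ON source.transaction_id = target.transaction_id
LIMIT 20;
```

An empty result means every `transaction_id` in the source also exists locally.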

## Summary

### What We Accomplished:

✅ **R2 Access**: Connected to Cloudflare R2 external table with zero egress costs  
✅ **Synchronized Changes**: Identified new records from R2 source and synchronized a local managed table  

### Key Benefits:

**Global Access**: Fast data access from Cloudflare's global network  
**Cost Efficient**: Zero egress fees for reading from R2  
**Local Performance**: Optimized local queries on synchronized data  
**Automatic Updates**: Re-running the `MERGE` picks up new provider records and keeps the local table current  

### Next Steps for Production:

1. **Automation**: Schedule the `MERGE` operation as a Databricks Job (hourly/daily)
2. **Secrets**: Use Databricks Secrets for R2 credentials
3. **Monitoring**: Set up alerts for sync failures or data quality issues
4. **Optimization**: Add partition pruning and table optimization
5. **Advanced SCD**: Consider Type 2 SCD if historical tracking is needed
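
As a rough illustration of the optimization item above, the sketch below compacts recent partitions and restricts the `MERGE` to a recent date window so that only those partitions are scanned. The 7-day lookback and the choice of `customer_id` for Z-ordering are assumptions made for illustration, not values from this demo. The windowed `MERGE` also assumes that a transaction's `transaction_date` never changes and that late-arriving rows never fall outside the window.

```sql
-- Compact small files in recent partitions and co-locate rows by customer_id
OPTIMIZE recipient_analytics.transactions.local_transactions
WHERE transaction_date >= current_date() - INTERVAL 7 DAYS
ZORDER BY (customer_id);

-- Insert-only MERGE limited to the same window on both sides,
-- which lets Delta prune older partitions of the target table
MERGE INTO recipient_analytics.transactions.local_transactions AS target
USING (
  SELECT *
  FROM recipient_analytics.transactions.vw_r2_replicated_data
  WHERE transaction_date >= current_date() - INTERVAL 7 DAYS
) AS source
ON target.transaction_id = source.transaction_id
   AND target.transaction_date >= current_date() - INTERVAL 7 DAYS
WHEN NOT MATCHED THEN
  INSERT (transaction_id, customer_id, product_category, amount,
          transaction_date, region, created_at)
  VALUES (source.transaction_id, source.customer_id, source.product_category,
          source.amount, source.transaction_date, source.region, source.created_at);
```

The source filter and the target predicate must use the same window. If only the target side were restricted, older source rows would no longer match and would be inserted again as duplicates.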

The recipient now has a robust, cost-effective way to stay synchronized with provider data!

---
&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>