
# 3.1 DEMO: Cross Cloud Replication with Cloudflare R2 \[Recipient]

## Overview
This demo showcases how recipients access replicated data from Cloudflare R2 storage, providing low-latency, cost-effective access without dependency on the provider's workspace availability.

## Learning Objectives
By the end of this demo, you will understand:
1. How cross-cloud replication works with Cloudflare R2
2. Benefits of using R2 for Delta Sharing
3. How to access replicated data as a recipient
4. Performance characteristics of R2-based shares

## What is Cloudflare R2?
Cloudflare R2 is an object storage service that provides:
- **Zero egress fees**: No charges for data transfer out
- **S3-compatible API**: Works seamlessly with Delta Sharing
- **Global distribution**: Low-latency access worldwide
- **Cost-effective**: Reduced storage and bandwidth costs

## Architecture Overview

In this cross-cloud replication scenario:

```
Provider (AWS/Azure/GCP)
    ↓
    ↓ Replicates Delta tables
    ↓
Cloudflare R2 Storage
    ↓
    ↓ Delta Sharing Protocol
    ↓
Recipients (Any Cloud/On-Premises)
```

**Key Benefits:**
- Recipients don't depend on provider's workspace
- Lower egress costs with R2's zero-egress pricing
- Potentially lower latency through Cloudflare's globally distributed network
- Provider maintains full control over shared data
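
For a recipient outside Databricks, the last hop in the diagram is typically made with the open-source `delta-sharing` Python connector, which addresses a table as `<profile-file>#<share>.<schema>.<table>`. The sketch below only builds that address. It assumes the provider has issued a credential file (the `config.share` path is hypothetical), `sharing_table_url` is an illustrative helper rather than part of the connector, and the share, schema, and table names are the ones used later in this demo.

```python
def sharing_table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build a Delta Sharing table address: '<profile>#<share>.<schema>.<table>'."""
    for label, value in (("share", share), ("schema", schema), ("table", table)):
        if not value:
            raise ValueError(f"{label} must not be empty")
    return f"{profile_path}#{share}.{schema}.{table}"


# 'config.share' is a hypothetical path to the credential file from the provider.
url = sharing_table_url("config.share", "global_sales_share", "sales", "transactions")
print(url)  # config.share#global_sales_share.sales.transactions

# With the delta-sharing package installed and a real credential file,
# this address could be passed to delta_sharing.load_as_pandas(url).
```

Recipients on Databricks do not need this step; they mount the share as a catalog, as shown in Step 3.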

## Setup

Run the setup script to configure the demo environment.

In [None]:
%run ./Includes/Setup-R2-Demo

## Step 1: Verify Provider Configuration

First, let's verify that the provider has been configured with R2 storage replication.

In [None]:
-- List all available providers
SHOW PROVIDERS

In [None]:
-- Describe the R2-enabled provider
-- Replace 'r2_provider' with your actual provider name
DESCRIBE PROVIDER r2_provider

**Expected Output:**
You should see information about the provider including:
- Provider name
- Storage configuration (R2)
- Region information
- Authentication method

## Step 2: Discover Available Shares

Let's see what shares are available from the R2-backed provider.

In [None]:
-- List shares from the R2 provider
SHOW SHARES IN PROVIDER r2_provider

## Step 3: Mount the R2-Backed Share

Now we'll mount the share to access the replicated data from Cloudflare R2.

In [None]:
-- Create a catalog from the R2-backed share
CREATE CATALOG IF NOT EXISTS r2_replicated_data
USING SHARE r2_provider.global_sales_share

In [None]:
-- Verify the catalog was created
DESCRIBE CATALOG r2_replicated_data

## Step 4: Explore the Replicated Data

Let's explore the structure of the data replicated to R2.

In [None]:
-- List schemas in the replicated catalog
SHOW SCHEMAS IN r2_replicated_data

In [None]:
-- List tables in the sales schema
SHOW TABLES IN r2_replicated_data.sales

In [None]:
-- Examine the structure of the transactions table
DESCRIBE TABLE r2_replicated_data.sales.transactions

## Step 5: Query Replicated Data

Now let's query the data served from Cloudflare R2. Notice that the query experience is identical to querying data from any other Delta Sharing source.

In [None]:
-- Query sample data from R2
SELECT *
FROM r2_replicated_data.sales.transactions
LIMIT 10

In [None]:
-- Perform aggregation on replicated data
SELECT 
  region,
  COUNT(*) AS transaction_count,
  ROUND(SUM(amount), 2) AS total_sales,
  ROUND(AVG(amount), 2) AS avg_transaction_value
FROM r2_replicated_data.sales.transactions
GROUP BY region
ORDER BY total_sales DESC

In [None]:
-- Time-series analysis
SELECT 
  DATE_TRUNC('month', transaction_date) AS month,
  COUNT(*) AS transactions,
  ROUND(SUM(amount), 2) AS revenue
FROM r2_replicated_data.sales.transactions
WHERE transaction_date >= '2024-01-01'
GROUP BY DATE_TRUNC('month', transaction_date)
ORDER BY month

## Step 6: Performance Comparison

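Run the analytical query below and note its execution time. To compare R2 against traditional cloud storage, run the same query on a share served directly from the provider's storage, if you have access to one.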

In [None]:
-- Run a complex analytical query
-- Note the execution time
SELECT 
  region,
  product_category,
  COUNT(DISTINCT customer_id) AS unique_customers,
  COUNT(*) AS total_transactions,
  ROUND(SUM(amount), 2) AS total_revenue,
  ROUND(AVG(amount), 2) AS avg_order_value,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_order_value
FROM r2_replicated_data.sales.transactions
GROUP BY region, product_category
ORDER BY total_revenue DESC

**Performance Benefits with R2:**
- Potentially lower latency through Cloudflare's global network
- No egress charges for data transfer
- Independent of provider's compute resources
- Consistent performance regardless of provider's load
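
A single run can be skewed by caching or cluster warm-up, so one way to make execution times comparable is to repeat the query and look at the median. The sketch below is a minimal, hypothetical helper (`time_query` is not a Databricks API). It uses a stand-in workload so it runs anywhere; in a notebook you might pass something like `lambda: spark.sql(query_text).collect()` instead, assuming a `spark` session is available.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[], object], repeats: int = 3) -> Dict[str, float]:
    """Run a callable several times and summarize wall-clock durations in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return {
        "runs": repeats,
        "min_s": min(durations),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }


# Stand-in workload for illustration; replace with the real query call.
result = time_query(lambda: sum(range(100_000)), repeats=5)
print(result)
```

Comparing the median for the R2-backed share with the median for a directly shared copy of the same table gives a fairer picture than a single timing.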

## Step 7: Join with Local Data

One powerful feature is the ability to join R2-replicated data with your local data.

In [None]:
-- Example: join replicated transactions with local customer data.
-- This cell creates a sample local customers table for the demo.
-- It assumes you can create tables in main.default; adjust the catalog and schema if not.
CREATE OR REPLACE TABLE main.default.local_customers AS
SELECT 
  customer_id,
  'Premium' AS customer_tier,
  CASE 
    WHEN customer_id % 3 = 0 THEN 'Email'
    WHEN customer_id % 3 = 1 THEN 'Phone'
    ELSE 'SMS'
  END AS preferred_contact
FROM r2_replicated_data.sales.transactions
GROUP BY customer_id

In [None]:
-- Join replicated data with local data
SELECT 
  t.customer_id,
  c.customer_tier,
  c.preferred_contact,
  COUNT(*) AS purchase_count,
  ROUND(SUM(t.amount), 2) AS total_spend
FROM r2_replicated_data.sales.transactions t
JOIN main.default.local_customers c
  ON t.customer_id = c.customer_id
GROUP BY t.customer_id, c.customer_tier, c.preferred_contact
ORDER BY total_spend DESC
LIMIT 20

## Step 8: Monitor Data Freshness

With R2 replication, it's important to understand data freshness. Let's check when the data was last updated.

In [None]:
-- Check table metadata
DESCRIBE DETAIL r2_replicated_data.sales.transactions

**Understanding Replication Lag:**
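- Replication lag depends on how the provider schedules replication; it is often near real-time but not guaranteed
- The provider controls replication frequency
- Check the `lastModified` column in the `DESCRIBE DETAIL` output
- Any freshness SLA depends on the provider's configuration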
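
The freshness check can be turned into a simple rule: compare the `lastModified` timestamp with the current time and flag the table when the gap exceeds what you can tolerate. A minimal sketch, assuming you have copied the timestamp out of the `DESCRIBE DETAIL` result; the helper names, the timestamps, and the one-hour threshold are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(last_modified: datetime, now: datetime) -> timedelta:
    """Return how long ago the table was last modified."""
    if last_modified.tzinfo is None or now.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")
    return now - last_modified


def is_stale(last_modified: datetime, now: datetime, max_lag: timedelta) -> bool:
    """True when the last modification is older than the tolerated lag."""
    return replication_lag(last_modified, now) > max_lag


# Hypothetical values for illustration only.
last_modified = datetime(2024, 10, 21, 8, 0, tzinfo=timezone.utc)
checked_at = datetime(2024, 10, 21, 9, 30, tzinfo=timezone.utc)

lag = replication_lag(last_modified, checked_at)
print(lag)  # 1:30:00
print(is_stale(last_modified, checked_at, timedelta(hours=1)))  # True
```

Note that `lastModified` reflects the last change to the table, so a table that simply has no new data will also look old; confirm the replication schedule with the provider before treating a large gap as a problem.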

## Step 9: Cost Analysis

Let's understand the cost benefits of R2-based Delta Sharing.

In [None]:
-- Estimate data volume, assuming an average row size of 1,024 bytes
SELECT 
  COUNT(*) AS total_records,
  ROUND(COUNT(*) * 1024 / (1024 * 1024 * 1024), 2) AS estimated_gb
FROM r2_replicated_data.sales.transactions

**R2 Cost Advantages:**

| Feature | Traditional Cloud Storage | Cloudflare R2 |
|---------|---------------------------|---------------|
| Storage Cost | $0.023/GB/month | $0.015/GB/month |
| Egress Cost | $0.09/GB | **$0.00/GB** |
| API Calls | Varies | Very low cost |

**Example Savings:**
- 1 TB of shared data (1,000 GB)
- 10 TB monthly egress (10,000 GB)
- Traditional: ~$923/month ($23 storage + $900 egress)
- R2: ~$15/month ($15 storage + $0 egress)
- **Savings: ~$908/month, or about 98%**
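
The arithmetic behind this example can be reproduced with a few lines of Python, which also makes it easy to plug in your own volumes. This is a sketch using the per-GB rates from the table above and decimal terabytes (1 TB = 1,000 GB); it ignores API request charges, and `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(storage_gb: float, egress_gb: float,
                 storage_rate: float, egress_rate: float) -> float:
    """Monthly cost in USD: storage plus egress, both priced per GB."""
    return storage_gb * storage_rate + egress_gb * egress_rate


STORAGE_GB = 1_000    # 1 TB of shared data
EGRESS_GB = 10_000    # 10 TB of monthly egress

traditional = monthly_cost(STORAGE_GB, EGRESS_GB, storage_rate=0.023, egress_rate=0.09)
r2 = monthly_cost(STORAGE_GB, EGRESS_GB, storage_rate=0.015, egress_rate=0.0)
savings = traditional - r2
savings_pct = 100 * savings / traditional

print(f"Traditional: ${traditional:,.2f}")                 # Traditional: $923.00
print(f"R2:          ${r2:,.2f}")                          # R2:          $15.00
print(f"Savings:     ${savings:,.2f} ({savings_pct:.1f}%)")  # Savings:     $908.00 (98.4%)
```

Because egress dominates the traditional bill, the savings grow with the number of recipients reading the data, while the R2 figure stays tied to storage volume.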

## Step 10: Best Practices

Here are best practices for working with R2-replicated shares:

In [None]:
-- 1. Cache frequently accessed data locally
CREATE OR REPLACE TABLE main.default.cached_top_products AS
SELECT 
  product_category,
  COUNT(*) AS sales_count,
  ROUND(SUM(amount), 2) AS total_revenue
FROM r2_replicated_data.sales.transactions
WHERE transaction_date >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY product_category

In [None]:
-- 2. Use partition pruning for efficiency
-- (assumes the provider partitioned the table by transaction_date)
SELECT *
FROM r2_replicated_data.sales.transactions
WHERE transaction_date = '2024-10-21'  -- Specific date for partition pruning
LIMIT 100

In [None]:
-- 3. Create a view to encapsulate complex queries
-- (this is a standard view, recomputed on each query, not a materialized view)
CREATE OR REPLACE VIEW main.default.regional_sales_summary AS
SELECT 
  region,
  DATE_TRUNC('day', transaction_date) AS date,
  COUNT(*) AS transactions,
  ROUND(SUM(amount), 2) AS revenue
FROM r2_replicated_data.sales.transactions
GROUP BY region, DATE_TRUNC('day', transaction_date)

## Step 11: Troubleshooting

Common issues and solutions when working with R2-backed shares:

In [None]:
-- Check if the share is accessible
-- COUNT(*) always returns a single row, so no LIMIT clause is needed
SELECT COUNT(*) AS row_count
FROM r2_replicated_data.sales.transactions

**Common Issues:**

1. **Connection Timeout**: R2 is globally distributed, but some regions may have higher latency
   - Solution: Contact provider about replication to closer regions

2. **Stale Data**: Data seems outdated
   - Solution: Check replication schedule with provider
   - Run `DESCRIBE DETAIL` to see last modification time

3. **Permission Denied**: Can't access certain tables
   - Solution: Verify with provider that share includes desired tables
   - Run `SHOW TABLES` to see what's available

4. **Slow Query Performance**: Queries taking longer than expected
   - Solution: Use partition pruning
   - Cache frequently accessed data locally
   - Check query execution plan
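
For the permission issue above, a quick way to see which tables are absent from the share is to diff the names you expected against the names returned by `SHOW TABLES`. A minimal sketch in Python; `missing_tables` is a hypothetical helper, and the `customers` and `products` names are made up for illustration (only `transactions` appears in this demo).

```python
from typing import Iterable, List


def missing_tables(expected: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return expected table names that are not in the available list (case-insensitive)."""
    available_lower = {name.lower() for name in available}
    return sorted(name for name in set(expected) if name.lower() not in available_lower)


# In a notebook, `available` could be collected from the tableName column of
# spark.sql("SHOW TABLES IN r2_replicated_data.sales"), assuming a `spark` session.
expected = ["transactions", "customers"]   # tables you expected in the share
available = ["transactions", "products"]   # hypothetical SHOW TABLES result

missing = missing_tables(expected, available)
print(missing)  # ['customers']
```

Any name in the result is a table to raise with the provider, since recipients cannot add tables to a share themselves.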

## Cleanup

When you're done with the demo, you can clean up the resources.

In [None]:
-- Uncomment to drop the catalog when finished
-- DROP CATALOG IF EXISTS r2_replicated_data;

-- Uncomment to drop local test tables
-- DROP TABLE IF EXISTS main.default.local_customers;
-- DROP TABLE IF EXISTS main.default.cached_top_products;
-- DROP VIEW IF EXISTS main.default.regional_sales_summary;

## Summary

In this demo, you learned:

- ✅ **Architecture**: How Cloudflare R2 fits into Delta Sharing architecture
- ✅ **Setup**: How to mount and access R2-backed shares
- ✅ **Querying**: Standard SQL queries work seamlessly with R2 data
- ✅ **Performance**: R2 provides low-latency, globally distributed access
- ✅ **Cost**: Significant savings with zero egress fees
- ✅ **Integration**: Easy to join with local data
- ✅ **Best Practices**: Caching, partition pruning, and monitoring

## Key Takeaways

1. **Zero Egress Costs**: R2 eliminates expensive data transfer fees
2. **Global Performance**: Cloudflare's global network can lower latency for distributed recipients
3. **Provider Independence**: Recipients aren't affected by downtime of the provider's workspace
4. **Seamless Integration**: Works with standard Delta Sharing protocol
5. **Cost-Effective Scaling**: Share data with unlimited recipients economically

## Use Cases

**Ideal for:**
- High-volume data sharing with many recipients
- Global data distribution
- Cost-sensitive sharing scenarios
- Public data sets
- SaaS platforms sharing data with customers

**Not ideal for:**
- Real-time streaming requirements (slight replication lag)
- Scenarios requiring write access (Delta Sharing is read-only)
- Very frequently updated data (consider direct sharing)

## Next Steps

1. Explore fine-grained access control with Dynamic Views
2. Learn about Change Data Feed for tracking updates
3. Implement monitoring and governance best practices
4. Consider hybrid approaches combining direct and replicated sharing

---
&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>