<div style="display: flex; justify-content: space-between; align-items: center; padding: 8px 16px; background: #F8F9FA; border-bottom: 2px solid #E0E0E0; margin: 0; line-height: 1;">
    <div style="font-size: 14px; color: #666;">
        <span style="font-weight: bold; color: #333;">{SOURCE_PLATFORM} → Databricks Migration</span>
        <span style="margin-left: 8px; color: #999;">|</span>
        <span style="margin-left: 8px;">03 - Execute</span>
    </div>
    <div style="display: flex; align-items: center; gap: 8px;">
        <img src="https://cdn.simpleicons.org/snowflake/29B5E8" width="24" height="24"/>
        <span style="color: #999; font-size: 16px;">→</span>
        <img src="https://cdn.simpleicons.org/databricks/FF3621" width="24" height="24"/>
    </div>
</div>


<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>

# Data Migration and Ingestion

## Overview

Data replication is the core of migration execution. This module covers the available patterns for moving data from **{SOURCE_PLATFORM}** to **Databricks**, helping you select the right approach for each workload. We cover everything from one-time snapshots to real-time streaming, including cross-cloud replication scenarios.

## Learning Objectives

By the end of this lesson, you will be able to:
- Understand the four replication patterns (three primary patterns plus one optional overlay) and when to use each
- Implement snapshot migration for historical data
- Configure incremental sync for batch updates
- Enable Change Data Capture (CDC) for real-time synchronization
- Use shared external tables for cross-platform access
- Select appropriate tools and accelerators for data movement

## Replication Patterns Overview

Data replication patterns range from simple one-time loads to continuous real-time synchronization. The right choice depends on data freshness requirements, volume, cost constraints, and operational complexity.

<br />
<div class="mermaid">
block-beta
    columns 3
    block:primary:3
        columns 3
        snap["<b>Snapshot Migration</b><br/><br/><i>One-time bulk load</i>"]
        sync["<b>Incremental Sync</b><br/><br/><i>Periodic batch updates</i>"]
        cdc["<b>Change Data Capture</b><br/><br/><i>Real-time continuous<br/>synchronization</i>"]
    end
    block:shared:3
        columns 3
        ext["<b>Shared External Tables</b><br/><br/><i>Optional: both platforms read same files (any pattern above)</i>"]
    end
    style snap fill:#e3f2fd,stroke:#1976d2
    style sync fill:#e3f2fd,stroke:#1976d2
    style cdc fill:#fff3e0,stroke:#ff9800
    style ext fill:#e8f5e9,stroke:#4caf50
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>

### Pattern Comparison

| Pattern | Latency | Complexity | Best For |
|---------|---------|------------|----------|
| **Snapshot** | N/A (one-time) | Low | Historical data, static tables |
| **Incremental Sync** | Hours to days | Medium | Daily/hourly batch updates |
| **CDC** | Seconds-Minutes | High | Real-time requirements |
| **Shared External** | Near real-time | Low | Same cloud, both platforms need access |

### Pattern Selection Decision Tree

<br />
<div class="mermaid">
flowchart TD
    Q1["<b>Is data actively changing?</b>"]
    Q1 --> NO["<b>No</b><br/><i>Static</i>"]
    Q1 --> SLOW["<b>Periodically</b><br/><i>Hourly/Daily</i>"]
    Q1 --> FAST["<b>Fast</b><br/><i>Real-time</i>"]
    NO --> SNAPSHOT["<b>Snapshot Migration</b>"]
    SLOW --> SYNC["<b>Incremental Sync</b>"]
    FAST --> CDC["<b>Change Data Capture</b>"]
    SNAPSHOT -.-> SHARED["<b>Shared External Tables</b><br/><i>Optional: both platforms<br/>read same files</i>"]
    SYNC -.-> SHARED
    CDC -.-> SHARED
    style Q1 fill:#f5f5f5,stroke:#333
    style NO fill:#fff,stroke:#666
    style SLOW fill:#fff,stroke:#666
    style FAST fill:#fff,stroke:#666
    style SNAPSHOT fill:#e3f2fd,stroke:#1976d2
    style SYNC fill:#e3f2fd,stroke:#1976d2
    style CDC fill:#fff3e0,stroke:#ff9800
    style SHARED fill:#e8f5e9,stroke:#4caf50
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>
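
The decision tree above can be expressed as a small lookup. This is a minimal sketch for illustration only: `ChangeRate` and `select_pattern` are hypothetical names, and a real decision would also weigh volume, cost, and operational complexity as described earlier.

```python
from enum import Enum


class ChangeRate(Enum):
    STATIC = "static"        # data no longer changes
    BATCH = "batch"          # changes arrive on a schedule (hourly, daily)
    REAL_TIME = "real_time"  # changes must be visible within seconds or minutes


def select_pattern(change_rate: ChangeRate, shared_storage: bool = False) -> list[str]:
    """Return the primary pattern, plus the optional shared-table overlay."""
    primary = {
        ChangeRate.STATIC: "Snapshot Migration",
        ChangeRate.BATCH: "Incremental Sync",
        ChangeRate.REAL_TIME: "Change Data Capture",
    }[change_rate]
    patterns = [primary]
    if shared_storage:
        # The overlay is added on top of the primary pattern, never instead of it.
        patterns.append("Shared External Tables")
    return patterns
```

Returning a list keeps the overlay separate from the primary choice, mirroring the dotted edges in the diagram.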

## Pattern 1: Snapshot Migration

Snapshot migration performs a **one-time bulk load** of data from {SOURCE_PLATFORM} to Databricks. This is the simplest replication pattern, ideal for historical data, static reference tables, and initial data loads.

<br />
<div class="mermaid">
flowchart LR
    subgraph SOURCE["{SOURCE_PLATFORM}"]
        ST["<b>Source Table</b><br/><i>Full data extract</i>"]
    end
    subgraph STAGING["Cloud Storage (Staging)"]
        FILES["<b>Parquet / CSV Files</b><br/><i>Intermediate landing zone</i>"]
    end
    subgraph DATABRICKS["Databricks"]
        DELTA["<b>Delta Table</b><br/><i>Full copy, managed</i>"]
    end
    ST -->|"Extract<br/>(one-time)"| FILES
    FILES -->|"Load<br/>(COPY INTO / Auto Loader)"| DELTA
    style SOURCE fill:#e3f2fd,stroke:#1976d2
    style STAGING fill:#fff3e0,stroke:#ff9800
    style DATABRICKS fill:#e8f5e9,stroke:#4caf50
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>

### When to Use Snapshot Migration

| Table Type | Snapshot Suitable? | Notes |
|------------|-------------------|-------|
| Reference tables | Yes | Country codes, status codes |
| Historical facts | Yes | Closed periods, archived orders |
| Dimension tables | Maybe | Check update frequency |
| Transaction tables | No | Use Incremental Sync or CDC instead |
| Real-time tables | No | Use CDC or streaming |

### Snapshot Implementation

<div class="code-block" data-language="sql">
-- Step 1: Create target table in Databricks
CREATE TABLE IF NOT EXISTS catalog.schema.customers (
    customer_id BIGINT,
    first_name STRING,
    last_name STRING,
    email STRING,
    created_at TIMESTAMP
)
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

-- Step 2: Load data from staged files using COPY INTO
COPY INTO catalog.schema.customers
FROM 's3://bucket/staging/customers/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');

-- Alternative: Use Auto Loader for incremental file detection
-- (Recommended for ongoing ingestion)
</div>

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'sql';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>
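
`COPY INTO` skips files it has already loaded, so rerunning a failed or partial snapshot load does not duplicate rows. The sketch below imitates that behavior in plain Python to show why reruns are safe. It is a conceptual model, not how Databricks implements the command, and `copy_into`, `loaded_files`, and the file paths are hypothetical.

```python
def copy_into(target_rows: list[dict], loaded_files: set[str],
              staged: dict[str, list[dict]]) -> int:
    """Load every staged file not loaded before; return the number of rows added."""
    added = 0
    for path in sorted(staged):
        if path in loaded_files:
            continue  # already ingested by an earlier run
        target_rows.extend(staged[path])
        loaded_files.add(path)  # remember the file so a rerun skips it
        added += len(staged[path])
    return added
```

Calling the function twice with the same staged files adds rows only on the first call; a file staged later is picked up on the next call.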

## Pattern 2: Incremental Sync

Incremental sync performs **periodic batch updates** from {SOURCE_PLATFORM} to Databricks on a defined schedule (hourly, daily, etc.). This pattern balances data freshness with operational simplicity.

<br />
<div class="mermaid">
flowchart LR
    subgraph SOURCE["{SOURCE_PLATFORM}"]
        ST["<b>Source Table</b><br/><i>Changed records<br/>(since last sync)</i>"]
    end
    subgraph STAGING["Cloud Storage (Landing Zone)"]
        FILES["<b>Incremental Files</b><br/><i>Parquet / CSV<br/>timestamped batches</i>"]
    end
    subgraph DATABRICKS["Databricks"]
        AL["<b>Auto Loader</b><br/><i>Incremental ingestion</i>"]
        DELTA["<b>Delta Table</b><br/><i>MERGE / Upsert</i>"]
    end
    ST -->|"Scheduled Export<br/>(hourly/daily)"| FILES
    FILES -->|"Detect new files"| AL
    AL -->|"MERGE INTO"| DELTA
    style SOURCE fill:#e3f2fd,stroke:#1976d2
    style STAGING fill:#fff3e0,stroke:#ff9800
    style DATABRICKS fill:#e8f5e9,stroke:#4caf50
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>

### Sync Strategies

| Strategy | Description | Use Case | Complexity |
|----------|-------------|----------|------------|
| **Full Refresh** | Replace entire table each sync | Small tables, no history needed | Low |
| **Incremental Append** | Add new records only | Append-only logs, events | Low |
| **Incremental Merge** | Upsert based on key + watermark | Tables with updates | Medium |
| **Watermark-based** | Sync records since last timestamp | Time-series data | Medium |

### Incremental Sync Implementation

<div class="code-block" data-language="sql">
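-- MERGE for incremental updates (SCD Type 1)
-- read_files scans every file under the path on each run, so point it at
-- the latest batch folder (or ingest with Auto Loader) to avoid reprocessing.
-- The source must hold at most one row per customer_id: MERGE fails when
-- several source rows match the same target row.
MERGE INTO catalog.schema.customers AS target
USING (
    SELECT * FROM read_files(
        's3://bucket/incremental/customers/',
        format => 'parquet'
    )
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
    UPDATE SET *
WHEN NOT MATCHED THEN
    INSERT *;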
</div>

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'sql';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>
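
The "key + watermark" merge strategy can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Delta `MERGE` implementation: `incremental_merge` is a hypothetical helper, and the `customer_id` and `updated_at` fields are assumed to exist in every row.

```python
from datetime import datetime


def incremental_merge(target: dict[int, dict], batch: list[dict],
                      watermark: datetime) -> datetime:
    """Upsert rows newer than `watermark` into `target`; return the new watermark."""
    latest: dict[int, dict] = {}
    for row in batch:
        if row["updated_at"] <= watermark:
            continue  # already synced in an earlier run
        current = latest.get(row["customer_id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["customer_id"]] = row  # keep only the newest row per key
    target.update(latest)  # matched keys are overwritten, new keys are inserted
    # Advance the watermark only as far as the data that was actually applied.
    return max((r["updated_at"] for r in latest.values()), default=watermark)
```

Deduplicating the batch before the upsert mirrors the requirement that a merge source contain one row per key. Persisting the returned watermark between runs is what makes the next sync incremental.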

## Pattern 3: Change Data Capture (CDC)

CDC enables **real-time or near real-time synchronization** by capturing and applying row-level changes (inserts, updates, deletes) as they occur. This pattern provides the lowest-latency replication, at the cost of the highest operational complexity.

<br />
<div class="mermaid">
flowchart LR
    subgraph SOURCE["{SOURCE_PLATFORM}"]
        ST["<b>Source Table</b><br/><i>Transactional changes</i>"]
        LOG["<b>Change Log</b><br/><i>INSERT / UPDATE / DELETE</i>"]
    end
    subgraph STREAM["Streaming Infrastructure"]
        KAFKA["<b>Kafka / Event Hub</b><br/><i>Change events</i>"]
    end
    subgraph DATABRICKS["Databricks"]
        DLT["<b>Spark Declarative Pipelines</b><br/><i>APPLY CHANGES INTO</i>"]
        DELTA["<b>Delta Table</b><br/><i>Current state</i>"]
        CDF["<b>Change Data Feed</b><br/><i>Downstream consumers</i>"]
    end
    ST --> LOG
    LOG -->|"Capture"| KAFKA
    KAFKA -->|"Stream"| DLT
    DLT -->|"Apply"| DELTA
    DELTA -.->|"Enable CDF"| CDF
    style SOURCE fill:#e3f2fd,stroke:#1976d2
    style STREAM fill:#fff3e0,stroke:#ff9800
    style DATABRICKS fill:#e8f5e9,stroke:#4caf50
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>
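
The "Apply" step in the flow above reduces to replaying change events in order against the current state. The following is a minimal sketch of that idea, assuming each event carries a hypothetical `sequence` number, an `op`, a `key`, and the new `row`. It illustrates the semantics only and is not the pipeline engine's implementation.

```python
def apply_changes(state: dict[int, dict], events: list[dict]) -> dict[int, dict]:
    """Apply CDC events to `state` in sequence order (current-state semantics)."""
    # Events can arrive out of order, so replay them by sequence number.
    for event in sorted(events, key=lambda e: e["sequence"]):
        key = event["key"]
        if event["op"] == "DELETE":
            state.pop(key, None)  # tolerate deletes for keys never seen
        else:
            state[key] = event["row"]  # INSERT and UPDATE both become an upsert
    return state
```

Ordering by an explicit sequence column, rather than by arrival time, is what keeps a late-arriving update from overwriting a newer one.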

### CDC Extraction Methods

| Method | Latency | Source Impact | Deletes Captured | Complexity |
|--------|---------|---------------|------------------|------------|
| **Native Streams** | Low | Low | Yes | Low (if available) |
| **Log-based CDC** | Very Low | Minimal | Yes | Medium |
| **Query-based CDC** | Medium | Higher | Requires soft delete | Low |
| **Partner Tools** | Low | Low | Yes | Low (managed) |

### Slowly Changing Dimensions (SCD)

CDC enables implementation of SCD patterns for tracking dimensional changes over time.

<br />
<div class="mermaid">
flowchart LR
    subgraph TYPE1["<b>Type 1 SCD</b><br/><i>Overwrite</i>"]
        T1A["Current state only"]
        T1B["No history preserved"]
        T1C["Simple, space-efficient"]
    end
    subgraph TYPE2["<b>Type 2 SCD</b><br/><i>Historical</i>"]
        T2A["Full change history"]
        T2B["Effective date ranges"]
        T2C["Point-in-time queries"]
    end
    CDC["<b>CDC Events</b>"] --> TYPE1
    CDC --> TYPE2
    style CDC fill:#fff3e0,stroke:#ff9800
    style TYPE1 fill:#e3f2fd,stroke:#1976d2
    style TYPE2 fill:#e8f5e9,stroke:#4caf50
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>
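
To make the Type 2 behavior concrete, the sketch below keeps one row per version with an effective date range and answers a point-in-time query. It is an illustration only: `apply_scd2`, `as_of`, and the `start_date` / `end_date` field names are hypothetical and do not reflect the column names a pipeline generates.

```python
from datetime import date
from typing import Optional


def apply_scd2(history: list[dict], key: int, attrs: dict, effective: date) -> None:
    """Close the open version of `key` if it changed, then append the new version."""
    current = next(
        (v for v in history if v["key"] == key and v["end_date"] is None), None
    )
    if current is not None:
        if current["attrs"] == attrs:
            return  # nothing changed, keep the open version
        current["end_date"] = effective  # close the previous version
    history.append(
        {"key": key, "attrs": attrs, "start_date": effective, "end_date": None}
    )


def as_of(history: list[dict], key: int, when: date) -> Optional[dict]:
    """Point-in-time lookup: the attributes valid for `key` on `when`."""
    for v in history:
        started = v["start_date"] <= when
        still_open = v["end_date"] is None or when < v["end_date"]
        if v["key"] == key and started and still_open:
            return v["attrs"]
    return None
```

A Type 1 table would simply overwrite the attributes in place, which is why it cannot answer the `as_of` question for past dates.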

### SCD Comparison

| Aspect | Type 1 | Type 2 |
|--------|--------|--------|
| **History** | None (overwrite) | Full history preserved |
| **Storage** | Minimal | Grows with changes |
| **Query complexity** | Simple | Requires date filtering |
| **Use case** | Current state analytics | Audit, trend analysis |
| **Pipeline syntax** | `APPLY CHANGES INTO ... STORED AS SCD TYPE 1` (default) | `APPLY CHANGES INTO ... STORED AS SCD TYPE 2` |

### Delta Change Data Feed

Change Data Feed (CDF) enables Delta tables to emit their own CDC stream, allowing downstream consumers to process only changed rows.

<div class="code-block" data-language="sql">
-- Enable Change Data Feed on existing table
ALTER TABLE catalog.schema.my_table 
SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

-- Read changes starting at version 5 (inclusive)
SELECT * FROM table_changes('catalog.schema.my_table', 5);

-- Read changes starting at a timestamp (inclusive)
SELECT * FROM table_changes('catalog.schema.my_table', '2024-01-01T00:00:00');

-- CDF columns included automatically:
--   _change_type: insert, update_preimage, update_postimage, delete
--   _commit_version: Delta version number
--   _commit_timestamp: Time of commit
</div>

<link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-sql.min.js"></script>

<script>
(function() {
    document.querySelectorAll('.code-block').forEach(function(block) {
        var lang = block.getAttribute('data-language') || 'sql';
        var code = block.textContent.trim();
        var id = 'code-' + Math.random().toString(36).substr(2, 9);
        
        block.innerHTML = 
            '<div style="position:relative;margin:16px 0;">' +
                '<button class="copy-btn" style="position:absolute;top:8px;right:8px;padding:4px 12px;font-size:12px;background:#ddd;color:#333;border:1px solid #ccc;border-radius:4px;cursor:pointer;z-index:10;">Copy</button>' +
                '<pre style="background:#f8f8f8;border-radius:8px;padding:16px;padding-top:40px;overflow-x:auto;margin:0;border:1px solid #e0e0e0;"><code id="' + id + '" class="language-' + lang + '" style="font-family:Consolas,Monaco,monospace;font-size:14px;"></code></pre>' +
            '</div>';
        
        var codeEl = document.getElementById(id);
        codeEl.textContent = code;
        Prism.highlightElement(codeEl);
        
        block.querySelector('.copy-btn').onclick = function() {
            var t = document.createElement('textarea');
            t.value = code;
            document.body.appendChild(t);
            t.select();
            document.execCommand('copy');
            document.body.removeChild(t);
            this.textContent = '✓ Copied!';
            setTimeout(() => this.textContent = 'Copy', 2000);
        };
    });
})();
</script>
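
A downstream consumer usually needs only the net effect of the feed per key. The sketch below collapses change rows, represented here as plain dictionaries carrying the `_change_type` and `_commit_version` columns listed above, into one row per key. `latest_changes` and the `customer_id` key are hypothetical, and the sketch assumes a key changes at most once per commit.

```python
def latest_changes(changes: list[dict], key: str = "customer_id") -> dict:
    """Collapse a change feed to the most recent change row per key."""
    net: dict = {}
    for row in sorted(changes, key=lambda r: r["_commit_version"]):
        if row["_change_type"] == "update_preimage":
            continue  # the old image is not needed to replay the change
        net[row[key]] = row  # later commits win
    return net
```

The consumer can then delete keys whose final `_change_type` is `delete` and upsert the rest.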

## Pattern 4: Shared External Tables

Shared External Tables enable both {SOURCE_PLATFORM} and Databricks to access the **same data files** in cloud storage, eliminating data movement during coexistence.

<br />
<div class="mermaid">
flowchart TB
    subgraph STORAGE["Cloud Storage (S3 / ADLS / GCS / R2)"]
        FILES["<b>Shared Data Files</b><br/><i>Iceberg, Parquet, or Delta</i>"]
    end
    subgraph SOURCE["{SOURCE_PLATFORM}"]
        SRC_EXT["<b>External Table</b><br/><i>Read / Write</i>"]
    end
    subgraph DATABRICKS["Databricks"]
        DBX_EXT["<b>External Table</b><br/><i>Read / Write</i>"]
        DBX_MGD["<b>Managed Table</b><br/><i>Final target (optional)</i>"]
    end
    SOURCE <-->|"Direct access"| STORAGE
    DATABRICKS <-->|"Direct access"| STORAGE
    DBX_EXT -.->|"MERGE sync"| DBX_MGD
    style STORAGE fill:#fff3e0,stroke:#ff9800
    style SOURCE fill:#e3f2fd,stroke:#1976d2
    style DATABRICKS fill:#e8f5e9,stroke:#4caf50
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>

### Format Compatibility

| Format | {SOURCE_PLATFORM} Read | {SOURCE_PLATFORM} Write | Databricks Read | Databricks Write | ACID |
|--------|------------------------|-------------------------|-----------------|------------------|------|
| **Iceberg** | Yes | Yes | Yes | Yes | Yes |
| **Parquet** | Yes | Yes | Yes | Yes | No |
| **Delta** | Limited | No | Yes | Yes | Yes |
| **Delta + UniForm** | Via Iceberg | No | Yes | Yes | Yes |

### Cross-Cloud Replication with Cloudflare R2

For cross-cloud scenarios, use Cloudflare R2 as a **zero-egress bridge** between cloud providers.

<br />
<div class="mermaid">
flowchart LR
    subgraph PROVIDER["Provider (Cloud A)"]
        SRC["<b>Managed Delta Table</b><br/><i>Source of truth</i>"]
        REPLICA["<b>External Table</b><br/><i>R2 replica</i>"]
    end
    subgraph R2["Cloudflare R2"]
        BUCKET["<b>R2 Bucket</b><br/><i>Delta / Iceberg files</i><br/><i>$0 egress</i>"]
    end
    subgraph RECIPIENT["Recipient (Cloud B)"]
        VIEW["<b>View</b><br/><i>Points to R2</i>"]
        TARGET["<b>Managed Delta Table</b><br/><i>Local copy</i>"]
    end
    SRC -->|"MERGE"| REPLICA
    REPLICA -->|"Write"| BUCKET
    BUCKET -->|"Read"| VIEW
    VIEW -->|"MERGE"| TARGET
    style PROVIDER fill:#e3f2fd,stroke:#1976d2
    style R2 fill:#fff3e0,stroke:#ff9800
    style RECIPIENT fill:#e8f5e9,stroke:#4caf50
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>

### Benefits of R2 Bridge

| Benefit | Description |
|---------|-------------|
| **Zero egress costs** | Cloudflare R2 charges $0 for data egress |
| **Global distribution** | Leverages Cloudflare's global CDN |
| **Cloud independence** | Durable replica independent of hyperscaler |
| **Unlimited recipients** | Add recipients without increasing costs |

## Tools and Accelerators

Replication can be executed using Databricks-native tools or third-party solutions.

### Databricks Native Tools

| Tool | Pattern Fit | Description |
|------|-------------|-------------|
| **`COPY INTO`** | Snapshot | Bulk load from cloud storage, simple one-time migrations |
| **Auto Loader** | Incremental, CDC | Incrementally processes new files as they arrive |
| **Spark Declarative Pipelines** | Incremental, CDC | Declarative pipelines with built-in quality controls |
| **Lakeflow Connect** | All patterns | Managed ingestion with pre-built connectors |
| **Lakeflow Jobs** | All patterns | Orchestration and scheduling for workflows |
| **Delta Sharing** | Shared External | Secure cross-organization data sharing |

### Partner Tools

| Tool | Best For | Description |
|------|----------|-------------|
| **Fivetran** | Managed ELT | 300+ pre-built connectors, managed CDC |
| **Airbyte** | Open source ELT | Self-hosted option, 400+ connectors |
| **Kafka Connect** | Real-time streaming | Debezium CDC, high-throughput |
| **Lakebridge** | Migration automation | Assessment, conversion, reconciliation |

## Validation and Reconciliation

After each replication run, verify data integrity by comparing the source and target tables.

### Validation Checks

| Check | Method | Expected Result |
|-------|--------|------------------|
| Row count | `SELECT COUNT(*)` | Exact match |
| Column sums | `SELECT SUM(amount)` | Match (within tolerance) |
| Distinct counts | `SELECT COUNT(DISTINCT id)` | Exact match |
| Min/Max values | `SELECT MIN(date), MAX(date)` | Match |

<br />
<div class="mermaid">
flowchart TB
    subgraph LAKEBRIDGE["Lakebridge Reconciler"]
        direction TB
        SOURCE["<b>Source Table</b><br/><i>{SOURCE_PLATFORM}</i>"]
        TARGET["<b>Target Table</b><br/><i>Delta Lake</i>"]
        COMPARE["<b>Compare</b>"]
        REPORT["<b>Reconciliation Report</b><br/><i>• Row counts<br/>• Checksums<br/>• Differences</i>"]
        SOURCE --> COMPARE
        TARGET --> COMPARE
        COMPARE --> REPORT
    end
    style SOURCE fill:#e3f2fd,stroke:#1976d2
    style TARGET fill:#e8f5e9,stroke:#4caf50
    style COMPARE fill:#fff3e0,stroke:#ff9800
    style REPORT fill:#f5f5f5,stroke:#666
    style LAKEBRIDGE fill:#fff,stroke:#FF3621,stroke-width:2px
</div>
<script type="module"> import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs"; mermaid.initialize({ startOnLoad: true, theme: "default" }); </script>
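
The four validation checks can be combined into one comparison routine. The following is a minimal sketch over in-memory rows, assuming hypothetical `id`, `amount`, and `date` fields. It shows the shape of the comparison only and is not how the Lakebridge reconciler works internally.

```python
import math


def profile(rows: list[dict]) -> dict:
    """Compute the summary metrics used to compare source and target."""
    return {
        "row_count": len(rows),
        "amount_sum": math.fsum(r["amount"] for r in rows),
        "distinct_ids": len({r["id"] for r in rows}),
        "min_date": min((r["date"] for r in rows), default=None),
        "max_date": max((r["date"] for r in rows), default=None),
    }


def reconcile(source: list[dict], target: list[dict],
              tolerance: float = 0.01) -> list[str]:
    """Return the names of the checks that failed (an empty list means a match)."""
    src, tgt = profile(source), profile(target)
    failures = []
    for name in src:
        if name == "amount_sum":
            # Sums may differ slightly across engines, so compare within a tolerance.
            ok = math.isclose(src[name], tgt[name], abs_tol=tolerance)
        else:
            ok = src[name] == tgt[name]
        if not ok:
            failures.append(name)
    return failures
```

In practice each metric would be computed by a query on its own platform and only the small result rows compared, so no table data has to move for validation.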

## Summary

### Pattern Selection Quick Reference

| Your Situation | Recommended Pattern |
|----------------|---------------------|
| Historical or static data | **Snapshot Migration** |
| Daily/hourly batch updates | **Incremental Sync** |
| Real-time requirements | **Change Data Capture** |
| Same cloud, both platforms need access | **Shared External Tables** (as overlay) |

### Key Takeaways

1. **Match pattern to requirements** - Don't over-engineer; CDC isn't needed for static tables
2. **Shared External Tables is an overlay** - Can be combined with any primary pattern
3. **Tools accelerate execution** - Lakebridge, Fivetran, and Kafka Connect reduce manual effort
4. **Schema first, data second** - Target schemas must be in place before replication
5. **Validate continuously** - Row counts, checksums, and reconciliation are essential

### Next Steps

With data replicated, proceed to:

- [**3.4 - SQL and Code Conversion**]($./3.4 - SQL and Code Conversion) - Port queries and logic to Databricks
- [**3.5 - Pipeline and Orchestration**]($./3.5 - Pipeline and Orchestration) - Rebuild jobs and scheduling

<div style="color: #FF3621; font-weight: bold; font-size: 2em; margin-bottom: 12px;">COURSE DEVELOPER (remove before publishing)</div>

### Template Customization

**Placeholders to replace:**
- `{SOURCE_PLATFORM}` - Source platform name (Snowflake, BigQuery, Redshift, Teradata)

**Platform-specific additions required:**
- Add platform-specific CDC features (e.g., Snowflake Streams, BigQuery CDC)
- Include platform-specific export syntax for snapshots
- Add platform-specific Fivetran/Airbyte connector details
- Document platform-specific change tracking patterns

**Detailed reference materials:**
- See `ref/3.3 - Snapshot Migration Replication Pattern.ipynb` for detailed snapshot content
- See `ref/3.4 - Incremental Sync Replication Pattern.ipynb` for detailed sync content
- See `ref/3.5 - Change Data Capture Replication Pattern.ipynb` for detailed CDC content
- See `ref/3.6 - Shared External Tables Replication Pattern.ipynb` for cross-platform details

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>
