# Stream CockroachDB CDC to Databricks (Azure)

This notebook demonstrates how to stream CockroachDB changefeeds to Databricks using Azure Blob Storage.

## Prerequisites

- CockroachDB cluster (Cloud or self-hosted)
- Azure Storage Account with hierarchical namespace enabled
- Databricks workspace with Unity Catalog
- Unity Catalog External Location configured for your storage account

**Note:** This notebook uses the **YCSB (Yahoo! Cloud Serving Benchmark)** schema as the default table structure, with `ycsb_key` as the primary key and columns `field0` through `field9`. The default schema name is `public`.

## CDC Mode Selection

This notebook supports **4 CDC ingestion modes** by combining two independent settings:

### 1. CDC Processing Mode (`cdc_mode`)
How CDC events are processed in the target table:

- **`append_only`**: Store all CDC events as rows (audit log)
  - **Behavior**: All events (INSERT/UPDATE/DELETE) are appended as new rows
  - **Use case**: History tracking, time-series analysis, audit logs
  - **Storage**: Higher (keeps all historical events)

- **`update_delete`**: Apply MERGE logic (current state replication)
  - **Behavior**: DELETE removes rows, UPDATE modifies rows in-place
  - **Use case**: Current state synchronization, production replication
  - **Storage**: Lower (only latest state per key)
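
To make the difference between the two processing modes concrete, here is a minimal plain-Python sketch that replays a short list of change events both ways. It is not the notebook's ingestion code: the event shape `(key, op, value)` and the helper names `apply_append_only` and `apply_update_delete` are hypothetical and exist only for illustration.

```python
# Hypothetical CDC events in commit order: (key, operation, value).
events = [
    ("user1", "INSERT", "a"),
    ("user2", "INSERT", "b"),
    ("user1", "UPDATE", "c"),
    ("user2", "DELETE", None),
]


def apply_append_only(events):
    """append_only: keep every event as its own row (audit log)."""
    return [{"key": k, "op": op, "value": v} for k, op, v in events]


def apply_update_delete(events):
    """update_delete: fold events into the current state per key."""
    state = {}
    for key, op, value in events:
        if op == "DELETE":
            state.pop(key, None)  # DELETE removes the row
        else:
            state[key] = value    # INSERT/UPDATE upserts the row
    return state


audit_rows = apply_append_only(events)
current_state = apply_update_delete(events)
print(len(audit_rows))  # 4 rows, one per event
print(current_state)    # {'user1': 'c'}
```

The append-only result grows with every event, while the merged result holds only the surviving keys, which is the storage trade-off described above.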

### 2. Column Family Mode (`column_family_mode`)
Table structure and changefeed configuration:

- **`single_cf`**: Standard table (1 column family, default)
  - **Changefeed**: created without the `split_column_families` option
  - **Files**: Each row change is written as one complete event (a single Parquet file can hold many events)
  - **Use case**: Most tables, simpler configuration, better performance

- **`multi_cf`**: Multiple column families (for wide tables)
  - **Changefeed**: created with the `split_column_families` option
  - **Files**: Each row change is written as one event per affected column family (fragments need merging)
  - **Use case**: Wide tables (50+ columns), selective column access patterns
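
The fragment merging mentioned for `multi_cf` can be pictured with a small plain-Python sketch. It assumes that fragments belonging to the same row change share the primary key and the same `updated` timestamp; the fragment layout and the `merge_fragments` helper are hypothetical and are not the implementation used by the notebook's ingestion functions.

```python
from collections import defaultdict

# Hypothetical fragments from a table with two column families.
# Each fragment carries the key, the event timestamp, and only the
# columns of its own family.
fragments = [
    {"ycsb_key": "user1", "updated": "100.0", "field0": "a"},
    {"ycsb_key": "user1", "updated": "100.0", "field1": "b"},
    {"ycsb_key": "user2", "updated": "101.0", "field0": "c"},
]


def merge_fragments(fragments, key_cols=("ycsb_key",), ts_col="updated"):
    """Combine fragments that share the same key and timestamp into one row."""
    merged = defaultdict(dict)
    for fragment in fragments:
        group = tuple(fragment[c] for c in key_cols) + (fragment[ts_col],)
        merged[group].update(fragment)
    return list(merged.values())


rows = merge_fragments(fragments)
print(rows[0])
```

Note that `user2` ends up without `field1` because no fragment for that family was present. A real pipeline has to decide how to fill such gaps, for example by carrying forward the previous value for the key.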

### Function Selection Matrix

The notebook automatically selects the appropriate ingestion function based on your configuration:

| CDC Mode | Column Family Mode | Function Called |
|----------|-------------------|-----------------|
| `append_only` | `single_cf` | `ingest_cdc_append_only_single_family()` |
| `append_only` | `multi_cf` | `ingest_cdc_append_only_multi_family()` |
| `update_delete` | `single_cf` | `ingest_cdc_with_merge_single_family()` |
| `update_delete` | `multi_cf` | `ingest_cdc_with_merge_multi_family()` |
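
The matrix can also be expressed as a lookup table instead of an `if`/`elif` chain. The sketch below maps each mode pair to the function name from the table; it returns names as strings so that it runs without the `cockroachdb_autoload` module, and the `select_ingest_function` helper is hypothetical.

```python
# Mode pair -> name of the ingestion function listed in the matrix.
INGEST_FUNCTIONS = {
    ("append_only", "single_cf"): "ingest_cdc_append_only_single_family",
    ("append_only", "multi_cf"): "ingest_cdc_append_only_multi_family",
    ("update_delete", "single_cf"): "ingest_cdc_with_merge_single_family",
    ("update_delete", "multi_cf"): "ingest_cdc_with_merge_multi_family",
}


def select_ingest_function(cdc_mode, column_family_mode):
    """Return the function name for a mode pair, or raise on an invalid pair."""
    try:
        return INGEST_FUNCTIONS[(cdc_mode, column_family_mode)]
    except KeyError:
        raise ValueError(
            f"Invalid mode combination: cdc_mode={cdc_mode!r}, "
            f"column_family_mode={column_family_mode!r}"
        ) from None


selected = select_ingest_function("update_delete", "multi_cf")
print(selected)
```

In a notebook the dictionary values could be the imported functions themselves, which keeps the dispatch logic and the documentation table in one place.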

---

In [None]:
import json
import os
from urllib.parse import quote

# Configuration file path (adjust as needed)
config_file = "cockroachdb_cdc_tutorial_config_append_single_cf.json"

#config_file = "cockroachdb_cdc_tutorial_config_append_multi_cf.json"

#config_file = "cockroachdb_cdc_tutorial_config_update_delete_multi_cf.json"

#config_file = "cockroachdb_cdc_tutorial_config_update_delete_single_cf.json"


# Try to load from file, fallback to embedded config
try:
    with open(config_file, 'r') as f:
        config = json.load(f)
    print(f"✅ Configuration loaded from: {config_file}")
except Exception as e:
    print(f"ℹ️  Using embedded configuration (config file error: {e})")
    config = None

# Embedded configuration (fallback)
if config is None:
    config = {
      "cockroachdb": {
        "host": "replace_me",
        "port": 26257,
        "user": "replace_me",
        "password": "replace_me",
        "database": "defaultdb"
      },
      "cockroachdb_source": {
        "catalog": "defaultdb",
        "schema": "public",
        "table_name": "usertable",
        "_schema_note": "Default schema is 'public'. Table uses YCSB structure (ycsb_key, field0-9)",
      },
      "azure_storage": {
        "account_name": "replace_me",
        "account_key": "replace_me",
        "container_name": "changefeed-events"
      },
      "databricks_target": {
        "catalog": "main",
        "schema": "replace_me",
        "table_name": "usertable",
      },
      "cdc_config": {
        "mode": "append_only",
        "column_family_mode": "multi_cf",
        "primary_key_columns": ["ycsb_key"],
        "auto_suffix_mode_family": True,
      },
      "workload_config": {
        "snapshot_count": 10,
        "insert_count": 10,
        "update_count": 9,
        "delete_count": 8,
      }
    }
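
Because the embedded fallback ships with `replace_me` placeholders, it can help to check for leftovers before connecting to anything. The following is a small sketch of such a check; the `find_placeholders` helper and the sample dictionary are hypothetical and not part of the notebook's modules.

```python
def find_placeholders(node, placeholder="replace_me", prefix=""):
    """Return dotted paths of every value still equal to the placeholder."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found.extend(find_placeholders(value, placeholder, f"{prefix}{key}."))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(find_placeholders(value, placeholder, f"{prefix}{index}."))
    elif node == placeholder:
        found.append(prefix.rstrip("."))
    return found


# Illustrative subset of the configuration structure used above.
sample_config = {
    "cockroachdb": {"host": "replace_me", "port": 26257},
    "azure_storage": {"account_name": "demo", "account_key": "replace_me"},
}
missing = find_placeholders(sample_config)
print(missing)  # ['cockroachdb.host', 'azure_storage.account_key']
```

Running the same function on the real `config` and raising an error when the list is non-empty would fail fast instead of producing a confusing connection error later.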


In [None]:
from urllib.parse import quote

# Extract configuration values
cockroachdb_host = config["cockroachdb"]["host"]
cockroachdb_port = config["cockroachdb"]["port"]
cockroachdb_user = config["cockroachdb"]["user"]
cockroachdb_password = config["cockroachdb"]["password"]
cockroachdb_database = config["cockroachdb"]["database"]

source_catalog = config["cockroachdb_source"]["catalog"]
source_schema = config["cockroachdb_source"]["schema"]
source_table = config["cockroachdb_source"]["table_name"]

storage_account_name = config["azure_storage"]["account_name"]
storage_account_key = config["azure_storage"]["account_key"]
storage_account_key_encoded = quote(storage_account_key, safe='')
container_name = config["azure_storage"]["container_name"]

target_catalog = config["databricks_target"]["catalog"]
target_schema = config["databricks_target"]["schema"]
target_table = config["databricks_target"]["table_name"]

cdc_mode = config["cdc_config"]["mode"]
column_family_mode = config["cdc_config"]["column_family_mode"]
primary_key_columns = config["cdc_config"]["primary_key_columns"]

snapshot_count = config["workload_config"]["snapshot_count"]
insert_count = config["workload_config"]["insert_count"]
update_count = config["workload_config"]["update_count"]
delete_count = config["workload_config"]["delete_count"]

# Auto-suffix table names with mode and column family if enabled
auto_suffix = config["cdc_config"].get("auto_suffix_mode_family", False)
if auto_suffix:
    suffix = f"_{cdc_mode}_{column_family_mode}"
    
    # Add suffix to source_table if not already present
    if not source_table.endswith(suffix):
        source_table = f"{source_table}{suffix}"
    
    # Add suffix to target_table if not already present
    if not target_table.endswith(suffix):
        target_table = f"{target_table}{suffix}"

    # Update config dict with suffixed table names
    config["cockroachdb_source"]["table_name"] = source_table
    config["databricks_target"]["table_name"] = target_table

# Extract format for reuse (default: parquet)
cdc_format = config["cdc_config"].get("format", "parquet")

# set the path in azure
path = f"{cdc_format}/{source_catalog}/{source_schema}/{source_table}/{target_table}"
config["cdc_config"]["path"] = path

print("✅ Configuration loaded")
print(f"   CDC Processing Mode: {cdc_mode}")
print(f"   Column Family Mode: {column_family_mode}")
print(f"   Primary Keys: {primary_key_columns}")
print(f"   Target Table: {target_table}")
print(f"   CDC Workload: {snapshot_count} snapshot → +{insert_count} INSERTs, ~{update_count} UPDATEs, -{delete_count} DELETEs")
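
The auto-suffix and path logic in the cell above can be isolated into two small functions, which makes the naming rules easy to test. This is a sketch that mirrors the cell's behavior; the function names `with_suffix` and `build_sink_path` are hypothetical.

```python
def with_suffix(table_name, cdc_mode, column_family_mode):
    """Append _<cdc_mode>_<column_family_mode> unless it is already present."""
    suffix = f"_{cdc_mode}_{column_family_mode}"
    return table_name if table_name.endswith(suffix) else f"{table_name}{suffix}"


def build_sink_path(cdc_format, catalog, schema, source_table, target_table):
    """Build the storage path: format/catalog/schema/source_table/target_table."""
    return f"{cdc_format}/{catalog}/{schema}/{source_table}/{target_table}"


once = with_suffix("usertable", "append_only", "multi_cf")
twice = with_suffix(once, "append_only", "multi_cf")  # re-running is a no-op
example_path = build_sink_path("parquet", "defaultdb", "public", once, once)
print(once)
print(example_path)
```

The `endswith` check is what makes the cell safe to re-run: a second execution does not stack another suffix onto the table names.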


In [None]:
%pip install pg8000 azure-storage-blob --quiet
print("✅ Dependencies installed")

In [None]:
# Import CockroachDB connection utilities
import importlib
import cockroachdb_conn
importlib.reload(cockroachdb_conn)
from cockroachdb_conn import get_cockroachdb_connection as _get_connection

# Wrapper function that uses config variables from Cell 3
def get_cockroachdb_connection():
    """Create connection to CockroachDB using config from Cell 3"""
    return _get_connection(
        cockroachdb_host=cockroachdb_host,
        cockroachdb_port=cockroachdb_port,
        cockroachdb_user=cockroachdb_user,
        cockroachdb_password=cockroachdb_password,
        cockroachdb_database=cockroachdb_database
    )

# Test connection
try:
    conn = get_cockroachdb_connection()
    with conn.cursor() as cur:
        cur.execute("SELECT version()")
        version = cur.fetchone()[0]
    conn.close()
    
    print("✅ Connected to CockroachDB")
    print(f"   Version: {version[:50]}...")
except Exception as e:
    print(f"❌ Connection failed: {e}")
    raise

In [None]:
# Import Azure utilities
import importlib
import cockroachdb_azure
importlib.reload(cockroachdb_azure)
from cockroachdb_azure import check_azure_files, wait_for_changefeed_files

# Import YCSB utility functions
import cockroachdb_ycsb
importlib.reload(cockroachdb_ycsb)
from cockroachdb_ycsb import (
    get_table_stats,
    get_table_stats_spark,
    get_column_sum,
    get_column_sum_spark,
    deduplicate_to_latest,
    get_column_sum_spark_deduplicated
)

print("✅ Helper functions loaded (CockroachDB & Azure)")
print("✅ YCSB utility functions imported from cockroachdb_ycsb.py")

In [None]:
# Create table using cockroachdb_ycsb.py
# Import YCSB functions
import importlib, cockroachdb_ycsb
importlib.reload(cockroachdb_ycsb)
from cockroachdb_ycsb import create_ycsb_table

# Create table
conn = get_cockroachdb_connection()
try:
    create_ycsb_table(
        conn=conn,
        table_name=source_table,
        column_family_mode=column_family_mode
    )
finally:
    conn.close()

In [None]:
# Insert snapshot data with NULL testing using cockroachdb_ycsb.py
from cockroachdb_ycsb import insert_ycsb_snapshot_with_random_nulls

conn = get_cockroachdb_connection()
try:
    insert_ycsb_snapshot_with_random_nulls(
        conn=conn,
        table_name=source_table,
        snapshot_count=snapshot_count,
        null_probability=0.3,  # 30% chance of NULL in snapshot
        columns_to_randomize=['field0', 'field1', 'field2', 'field3', 'field4', 'field5', 'field6', 'field7', 'field8', 'field9'],  # ALL fields
        seed=42,  # Reproducible random NULLs
        force_all_null_row=True  # Row 0 will have all randomized columns as NULL (edge case testing)
    )
finally:
    conn.close()

In [None]:
# Build Azure Blob Storage URI with table-specific path
# Note: For Azure, path goes in URI (not as path_prefix query parameter like S3)
changefeed_path = f"azure://{container_name}/{path}?AZURE_ACCOUNT_NAME={storage_account_name}&AZURE_ACCOUNT_KEY={storage_account_key_encoded}"

# Build changefeed options based on column_family_mode.
# The format comes from cdc_format so the changefeed matches the storage path.
changefeed_option_list = [
    f"format='{cdc_format}'",
    "updated",
    "resolved='10s'",
]
if column_family_mode == "multi_cf":
    # Emit one message per column family for multi-family tables
    changefeed_option_list.append("split_column_families")
changefeed_options = ",\n    ".join(changefeed_option_list)

# Create changefeed SQL
create_changefeed_sql = f"""
CREATE CHANGEFEED FOR TABLE {source_table}
INTO '{changefeed_path}'
WITH {changefeed_options}
"""

conn = get_cockroachdb_connection()
try:
    with conn.cursor() as cur:
        # Check for existing changefeeds by matching the sink_uri
        # This is more exact than parsing the description field
        # Match pattern: azure://{container}/{path}?...
        # Note: We check for ALL matches (no LIMIT) to detect duplicates
        sink_uri_pattern = f"%{container_name}/{path}%"
        
        cur.execute("""
            SELECT job_id, status, sink_uri
            FROM [SHOW CHANGEFEED JOBS] 
            WHERE sink_uri LIKE %s
            AND status IN ('running', 'paused')
        """, (sink_uri_pattern,))
        
        existing_changefeeds = cur.fetchall()
        
        if existing_changefeeds:
            print(f"✅ Changefeed(s) already exist for this source → target mapping")
            print(f"   Found {len(existing_changefeeds)} changefeed(s):")
            for job_id, status, sink_uri in existing_changefeeds:
                print(f"   • Job ID: {job_id}, Status: {status}")
                print(f"     Sink URI: {sink_uri[:80]}...")  # Show first 80 chars (redacted credentials)
            if len(existing_changefeeds) > 1:
                print(f"\n⚠️  WARNING: Multiple changefeeds detected for same destination!")
                print(f"   This may cause duplicate data. Consider running Cell 16 (cancel changefeed) to clean up.")
            if column_family_mode == "multi_cf":
                print(f"\n   Expected: Column family fragments")
            print(f"\n💡 Tip: Run Cell 10 to generate UPDATE/DELETE events")
            print(f"   Then check Cell 11 to verify new files appear")
        else:
            # Create new changefeed
            cur.execute(create_changefeed_sql)
            result = cur.fetchone()
            job_id = result[0]
            
            print(f"✅ Changefeed created")
            print(f"   Job ID: {job_id}")
            print(f"   Source: {source_catalog}.{source_schema}.{source_table}")
            print(f"   Target path: .../{source_table}/{target_table}/")
            print(f"   Format: Parquet")
            if column_family_mode == "multi_cf":
                print(f"   Split column families: TRUE (fragments will be generated)")
            else:
                print(f"   Split column families: FALSE (single file per event)")
            print(f"   Destination: Azure Blob Storage")
            print(f"")
            
            # Wait for files to appear using helper function
            wait_for_changefeed_files(
                storage_account_name, storage_account_key, container_name,
                source_catalog, source_schema, source_table, target_table,
                max_wait=300, check_interval=5,
                format=cdc_format
            )
finally:
    conn.close()
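
The sink URI construction is worth isolating because storage account keys usually contain `+`, `/`, and `=`, which must be percent-encoded inside a query string. The sketch below rebuilds the URI the same way as the cell above and adds a redaction helper for logging. Both function names are hypothetical, and the key shown is a fake value.

```python
import re
from urllib.parse import quote


def build_azure_sink_uri(container, path, account_name, account_key):
    """Build the changefeed sink URI with a percent-encoded account key."""
    encoded_key = quote(account_key, safe="")
    return (
        f"azure://{container}/{path}"
        f"?AZURE_ACCOUNT_NAME={account_name}&AZURE_ACCOUNT_KEY={encoded_key}"
    )


def redact_sink_uri(uri):
    """Hide the account key before printing or logging a sink URI."""
    return re.sub(r"(AZURE_ACCOUNT_KEY=)[^&]*", r"\1redacted", uri)


uri = build_azure_sink_uri(
    "changefeed-events", "parquet/defaultdb/public/t/t", "demoaccount", "ab+/c=="
)
safe_uri = redact_sink_uri(uri)
print(safe_uri)
```

Printing only the redacted form avoids leaking the key into notebook output, which is easy to do by accident when debugging the `CREATE CHANGEFEED` statement.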

In [None]:
import time

# Capture baseline file count BEFORE generating CDC events
print("📊 Capturing baseline file count...")
result_before = check_azure_files(
    storage_account_name, storage_account_key, container_name,
    source_catalog, source_schema, source_table, target_table,
    verbose=False,
    format=cdc_format
)
files_before = len(result_before['data_files'])
print(f"   Current files: {files_before}")
print()

# Run workload with NULL testing using cockroachdb_ycsb.py
from cockroachdb_ycsb import run_ycsb_workload_with_random_nulls

conn = get_cockroachdb_connection()
try:
    run_ycsb_workload_with_random_nulls(
        conn=conn,
        table_name=source_table,
        insert_count=insert_count,
        update_count=update_count,
        delete_count=delete_count,
        null_probability=0.5,  # 50% chance of NULL in UPDATEs
        columns_to_randomize=['field0', 'field1', 'field2', 'field3', 'field4', 'field5', 'field6', 'field7', 'field8', 'field9'],  # ALL fields
        seed=42,  # Reproducible random NULLs
        force_all_null_update=True  # First UPDATE will have all NULLs (edge case testing)
    )
finally:
    conn.close()

# Wait for new CDC files to appear in Azure (positive confirmation)
print(f"")
print(f"⏳ Waiting for new CDC files to appear in Azure...")
print(f"   Baseline: {files_before} files")
print()

# Poll for new files (max 90 seconds)
max_wait = 90
check_interval = 10
elapsed = 0

while elapsed < max_wait:
    result = check_azure_files(
        storage_account_name, storage_account_key, container_name,
        source_catalog, source_schema, source_table, target_table,
        verbose=False,
        format=cdc_format
    )
    files_now = len(result['data_files'])
    
    if files_now > files_before:
        print(f"✅ New CDC files appeared after {elapsed} seconds!")
        print(f"   Baseline (before workload): {files_before} files")
        print(f"   Current (after workload): {files_now} files")
        print(f"   New files generated: {files_now - files_before}")
        break
    
    print(f"   Checking... ({elapsed}s elapsed, baseline: {files_before} files)", end='\r')
    time.sleep(check_interval)
    elapsed += check_interval
else:
    print(f"\n⚠️  Timeout after {max_wait}s - files may still be flushing")
    print(f"   Run Cell 11 to check manually")
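
The polling loop in the cell above follows a common pattern: check a condition, sleep, and give up after a deadline. A reusable version might look like the sketch below. The `wait_for` helper is hypothetical, and the injectable `sleep` argument exists only so that the example runs instantly with fake file counts.

```python
import time


def wait_for(condition, max_wait=90, check_interval=10, sleep=time.sleep):
    """Poll condition() until it is true; return elapsed seconds or None on timeout."""
    elapsed = 0
    while elapsed < max_wait:
        if condition():
            return elapsed
        sleep(check_interval)
        elapsed += check_interval
    return None


# Simulated file counts returned by successive checks; the baseline is 3.
file_counts = iter([3, 3, 5])
baseline = 3
waited = wait_for(lambda: next(file_counts) > baseline, sleep=lambda seconds: None)
print(waited)  # 20
```

In the notebook, the condition would wrap `check_azure_files(...)` and compare `len(result['data_files'])` against `files_before`.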

In [None]:
# Use the helper function from Cell 4 to check for files
result = check_azure_files(
    storage_account_name, storage_account_key, container_name,
    source_catalog, source_schema, source_table, target_table,
    verbose=True,
    format=cdc_format
)

# Provide guidance
if len(result['data_files']) == 0:
    print(f"\n⚠️  No data files found yet.")
    print(f"   💡 Possible reasons:")
    print(f"   - Changefeed not created yet (run Cell 9)")
    print(f"   - Path configuration mismatch (check Cell 1 variables)")
    print(f"   - Azure credentials issue (check External Location)")
else:
    print(f"\n✅ Files are ready! Proceed to Cell 12 to read with Databricks.")

In [None]:
# Import CDC ingestion functions from cockroachdb_autoload.py
import importlib, cockroachdb_autoload
importlib.reload(cockroachdb_autoload)
from cockroachdb_autoload import (
    ingest_cdc_append_only_single_family,
    ingest_cdc_append_only_multi_family,
    ingest_cdc_with_merge_single_family,
    ingest_cdc_with_merge_multi_family
)

print(f"🔷 CDC Configuration:")
print(f"   Processing Mode: {cdc_mode}")
print(f"   Column Family Mode: {column_family_mode}")
print()

# Select function based on BOTH cdc_mode and column_family_mode
if cdc_mode == "append_only" and column_family_mode == "single_cf":
    print(f"📘 Running: ingest_cdc_append_only_single_family()")
    print(f"   - All CDC events will be stored as rows")
    print(f"   - No column family merging needed\n")
    
    query = ingest_cdc_append_only_single_family(
        storage_account_name=storage_account_name,
        container_name=container_name,
        source_catalog=source_catalog,
        source_schema=source_schema,
        source_table=source_table,
        target_catalog=target_catalog,
        target_schema=target_schema,
        target_table=target_table,
        spark=spark
    )

elif cdc_mode == "append_only" and column_family_mode == "multi_cf":
    print(f"📙 Running: ingest_cdc_append_only_multi_family()")
    print(f"   - All CDC events will be stored as rows")
    print(f"   - Column family fragments will be merged\n")
    
    if not primary_key_columns:
        raise ValueError("primary_key_columns required for multi_cf mode")
    
    query = ingest_cdc_append_only_multi_family(
        storage_account_name=storage_account_name,
        container_name=container_name,
        source_catalog=source_catalog,
        source_schema=source_schema,
        source_table=source_table,
        target_catalog=target_catalog,
        target_schema=target_schema,
        target_table=target_table,
        primary_key_columns=primary_key_columns,
        spark=spark
    )

elif cdc_mode == "update_delete" and column_family_mode == "single_cf":
    print(f"📗 Running: ingest_cdc_with_merge_single_family()")
    print(f"   - MERGE logic applied (UPDATE/DELETE processed)")
    print(f"   - No column family merging needed\n")
    
    if not primary_key_columns:
        raise ValueError("primary_key_columns required for update_delete mode")
    
    result = ingest_cdc_with_merge_single_family(
        storage_account_name=storage_account_name,
        container_name=container_name,
        source_catalog=source_catalog,
        source_schema=source_schema,
        source_table=source_table,
        target_catalog=target_catalog,
        target_schema=target_schema,
        target_table=target_table,
        primary_key_columns=primary_key_columns,
        spark=spark
    )
    
    query = result["query"]

elif cdc_mode == "update_delete" and column_family_mode == "multi_cf":
    print(f"📕 Running: ingest_cdc_with_merge_multi_family()")
    print(f"   - MERGE logic applied (UPDATE/DELETE processed)")
    print(f"   - Column family fragments will be merged\n")
    
    if not primary_key_columns:
        raise ValueError("primary_key_columns required for update_delete + multi_cf mode")
    
    result = ingest_cdc_with_merge_multi_family(
        storage_account_name=storage_account_name,
        container_name=container_name,
        source_catalog=source_catalog,
        source_schema=source_schema,
        source_table=source_table,
        target_catalog=target_catalog,
        target_schema=target_schema,
        target_table=target_table,
        primary_key_columns=primary_key_columns,
        spark=spark
    )
    
    query = result["query"]

else:
    raise ValueError(
        f"Invalid mode combination:\n"
        f"  cdc_mode='{cdc_mode}' (valid: 'append_only', 'update_delete')\n"
        f"  column_family_mode='{column_family_mode}' (valid: 'single_cf', 'multi_cf')\n"
        f"Change modes in Cell 1."
    )

# Wait for completion (if not already complete)
if cdc_mode == "append_only":
    query.awaitTermination()
    print("\n" + "=" * 80)
    print(f"✅ CDC INGESTION COMPLETE")
    print("=" * 80)
    print(f"   Mode: {cdc_mode} + {column_family_mode}")
    print(f"   Target: {target_catalog}.{target_schema}.{target_table}")
    print()
    print(f"📊 Query your data: SELECT * FROM {target_catalog}.{target_schema}.{target_table}")
else:
    # update_delete mode already completed inside the function
    print(f"📊 Query your data: SELECT * FROM {target_catalog}.{target_schema}.{target_table}")

In [None]:
# ALL-IN-ONE CDC DIAGNOSIS

# What this does:
#   1. CDC Event Summary (replaces Cell 13)
#      - Shows total rows, operation breakdown, sample data
#   
#   2. Source vs Target Verification (replaces Cell 14)
#      - Connects to CockroachDB source
#      - Auto-deduplicates target for append_only mode
#      - Compares column sums
#      - Detects mismatches
#   
#   3. Detailed Diagnosis (automatic if issues found)
#      - Column family sync analysis
#      - CDC event distribution
#      - Row-by-row comparison
#      - Troubleshooting recommendations
#
# Smart behavior:
#   ✅ If everything matches → Shows "Perfect sync!" and exits
#   ⚠️  If mismatches found → Automatically runs detailed diagnosis
#
# No external dependencies - just run this!
# ============================================================================

import importlib, cockroachdb_ycsb, cockroachdb_debug, cockroachdb_conn
importlib.reload(cockroachdb_conn)  # Reload first (cockroachdb_debug depends on it)
importlib.reload(cockroachdb_ycsb)  # Reload first (cockroachdb_debug depends on it)
importlib.reload(cockroachdb_debug)
from cockroachdb_debug import run_full_diagnosis_from_config

run_full_diagnosis_from_config(spark=spark, config=config)
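
The verification step deduplicates the target in `append_only` mode before comparing column sums, because an audit-log table holds several rows per key. The idea can be sketched in plain Python as follows. This is not the implementation inside `cockroachdb_debug` or `cockroachdb_ycsb`: the `latest_per_key` helper, the `op` column name, and the sample events are assumptions made for illustration.

```python
from decimal import Decimal


def latest_per_key(events, key_col="ycsb_key", ts_col="updated", op_col="op"):
    """Keep the newest event per key, then drop keys whose newest event is a DELETE."""
    latest = {}
    for event in events:
        key = event[key_col]
        # Timestamps are compared as decimals, not as strings.
        if key not in latest or Decimal(event[ts_col]) >= Decimal(latest[key][ts_col]):
            latest[key] = event
    return [e for e in latest.values() if e[op_col] != "DELETE"]


# Hypothetical append_only target rows.
events = [
    {"ycsb_key": "user1", "updated": "100.0", "op": "INSERT", "field0": 10},
    {"ycsb_key": "user1", "updated": "200.0", "op": "UPDATE", "field0": 15},
    {"ycsb_key": "user2", "updated": "100.0", "op": "INSERT", "field0": 7},
    {"ycsb_key": "user2", "updated": "300.0", "op": "DELETE", "field0": None},
]

current_rows = latest_per_key(events)
target_sum = sum(row["field0"] for row in current_rows)
print(target_sum)  # 15, comparable to SUM(field0) on the source table
```

Summing the raw audit rows would give 32 here, which is why a naive comparison against the source reports a mismatch even when replication is correct.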

## Optional: Cleanup

Run the cells below if you want to clean up the test resources.

In [None]:
# ⚠️  SAFETY STOP: Cleanup Section
# This cell prevents accidental cleanup when running "Run All"
# 
# To cleanup resources, manually run each cell below INDIVIDUALLY:
#   - Cell 16: Cancel changefeed
#   - Cell 17: Drop CockroachDB source table  
#   - Cell 18: Drop Databricks target table & checkpoint
#   - Cell 19: Clear Azure changefeed data (optional - use for complete reset)

raise RuntimeError(
    "\n"
    "⚠️  CLEANUP SAFETY STOP\n"
    "\n"
    "The cells below will DELETE your resources.\n"
    "Do NOT run all cells - run each cleanup cell individually.\n"
    "\n"
    "💡 TIP: If Cell 13 shows sync issues due to old data,\n"
    "   run Cell 19 to clear Azure changefeed data completely.\n"
)

In [None]:
# CLEANUP CELL 1: CANCEL CHANGEFEED(S)
conn = get_cockroachdb_connection()
try:
    with conn.cursor() as cur:
        # Find ALL changefeed jobs by matching the sink_uri
        # (matches the same pattern used in Cell 9)
        # Note: We cancel ALL matches to handle duplicate scenarios
        sink_uri_pattern = f"%{container_name}/{path}%"
        
        cur.execute("""
            SELECT job_id, sink_uri
            FROM [SHOW CHANGEFEED JOBS] 
            WHERE sink_uri LIKE %s
            AND status IN ('running', 'paused')
        """, (sink_uri_pattern,))
        
        changefeeds = cur.fetchall()
        if changefeeds:
            print(f"🗑️  Cancelling {len(changefeeds)} changefeed(s)...")
            for job_id, sink_uri in changefeeds:
                cur.execute(f"CANCEL JOB {job_id}")
                print(f"   ✅ Cancelled Job ID: {job_id}")
                print(f"      Sink URI: {sink_uri[:80]}...")
            if len(changefeeds) > 1:
                print(f"\n⚠️  Cancelled {len(changefeeds)} changefeeds (duplicates detected!)")
        else:
            print("ℹ️  No active changefeeds found for this source → target mapping")
finally:
    conn.close()

In [None]:
# CLEANUP CELL 2: DROP SOURCE TABLE (CockroachDB)
conn = get_cockroachdb_connection()
try:
    with conn.cursor() as cur:
        cur.execute(f"DROP TABLE IF EXISTS {source_table} CASCADE")
        conn.commit()
    print(f"✅ Table '{source_table}' dropped from CockroachDB")
finally:
    conn.close()

In [None]:
# CLEANUP CELL 19: CLEAR AZURE CHANGEFEED DATA (Optional)
# ⚠️  WARNING: This will DELETE all changefeed data in Azure for this table!
#
# Use this when:
# - You want to start completely fresh
# - Old data from previous runs is causing sync issues
# - You changed the table schema (e.g., VARCHAR → INT)
#
# Uses Azure SDK (same as Cell 11 for checking files)

from azure.storage.blob import BlobServiceClient

# Use path from config (must match Cell 9 changefeed path)
changefeed_path = f"{path}/"

print(f"🗑️  Deleting Azure changefeed data...")
print(f"=" * 80)
print(f"Container: {container_name}")
print(f"Path: {changefeed_path}")
print()

# Connect to Azure (same as Cell 9)
connection_string = f"DefaultEndpointsProtocol=https;AccountName={storage_account_name};AccountKey={storage_account_key};EndpointSuffix=core.windows.net"
blob_service = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service.get_container_client(container_name)

# List all blobs with this prefix
print(f"🔍 Scanning for files...")
blobs = list(container_client.list_blobs(name_starts_with=changefeed_path))

if not blobs:
    print(f"ℹ️  No files found at: {changefeed_path}")
    print(f"   Files may have already been deleted, or path is incorrect")
    print()
    print(f"💡 To check what's in the container, run Cell 11")
else:
    print(f"✅ Found {len(blobs)} items to delete")
    
    # Show sample items
    data_files = [b for b in blobs if b.size > 0 and '.parquet' in b.name]
    resolved_files = [b for b in blobs if '.RESOLVED' in b.name]
    directories = [b for b in blobs if b.size == 0]
    
    print(f"   📄 Data files: {len(data_files)}")
    print(f"   🕐 Resolved files: {len(resolved_files)}")
    print(f"   📁 Directories: {len(directories)}")
    print()
    
    # Delete all blobs with this prefix
    # Note: Azure SDK doesn't have recursive delete - we list all blobs and delete each one
    print(f"🔄 Deleting {len(blobs)} items...")
    deleted = 0
    failed = 0
    
    for blob in blobs:
        try:
            container_client.delete_blob(blob.name)
            deleted += 1
            if deleted % 50 == 0:
                print(f"   Deleted {deleted}/{len(blobs)} items...", end='\r')
        except Exception as e:
            # Some errors are expected (e.g., directories already removed)
            error_str = str(e)
            if "DirectoryIsNotEmpty" not in error_str and "BlobNotFound" not in error_str:
                failed += 1
                print(f"\n   ⚠️  Failed: {blob.name[:60]}... - {e}")
    
    print(f"✅ Deleted {deleted} items from Azure                    ")
    if failed > 0:
        print(f"   ⚠️  Failed to delete {failed} items")
    
    print()
    print(f"=" * 80)
    print(f"✅ Cleanup complete!")
    print()
    print(f"💡 Next steps:")
    print(f"   1. Drop the Databricks target table (Cell 18)")
    print(f"   2. Re-run from Cell 6 (Snapshot) to start fresh")

In [None]:
# CLEANUP CELL 3: DROP TARGET TABLE & CHECKPOINT (Databricks)
target_table_fqn = f"{target_catalog}.{target_schema}.{target_table}"
checkpoint_path = f"/checkpoints/{target_schema}_{target_table}"  # Must match the checkpoint used by the ingestion cell (Cell 12)

# Drop Delta table
spark.sql(f"DROP TABLE IF EXISTS {target_table_fqn}")
print(f"✅ Delta table '{target_table_fqn}' dropped")

# Remove checkpoint
try:
    dbutils.fs.rm(checkpoint_path, True)
    print(f"✅ Checkpoint '{checkpoint_path}' removed")
except Exception:
    print(f"ℹ️  Checkpoint not found (may have been already removed)")

print("\n✅ Cleanup complete!")

In [None]:
# CLEANUP CELL 4: Complete cleanup for fresh start

# 1. Drop staging table
staging_table_fqn = f"{target_catalog}.{target_schema}.{target_table}_staging_cf"
print(f"🗑️  Dropping staging table: {staging_table_fqn}")
spark.sql(f"DROP TABLE IF EXISTS {staging_table_fqn}")

# 2. Drop target table (if not already done)
target_table_fqn = f"{target_catalog}.{target_schema}.{target_table}"
print(f"🗑️  Dropping target table: {target_table_fqn}")
spark.sql(f"DROP TABLE IF EXISTS {target_table_fqn}")

# 3. Clear checkpoint location
checkpoint_path = f"/checkpoints/{target_schema}_{target_table}_merge_cf"
print(f"🗑️  Clearing checkpoint: {checkpoint_path}")
try:
    dbutils.fs.rm(checkpoint_path, recurse=True)
    print(f"   ✅ Checkpoint cleared")
except Exception as e:
    print(f"   ℹ️  Checkpoint may not exist: {e}")

# 4. Verify cleanup
print(f"\n✅ Cleanup complete! Ready for fresh start.")
print(f"   Next: Re-run Cell 12 (ingestion)")

In [None]:
# Recreate the schema
print(f"📁 Creating schema: {target_catalog}.{target_schema}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {target_catalog}.{target_schema}")
print(f"✅ Schema created")

# Verify schema exists
schemas = spark.sql(f"SHOW SCHEMAS IN {target_catalog}").collect()
schema_names = [row['databaseName'] for row in schemas]
if target_schema in schema_names:
    print(f"✅ Verified: Schema {target_schema} exists")
else:
    print(f"❌ Schema {target_schema} not found. Available schemas: {schema_names}")

# Debug Codes