# Stream CockroachDB CDC to Databricks (Azure)

This notebook demonstrates how to stream CockroachDB changefeeds to Databricks using Azure Blob Storage.

## Prerequisites

- CockroachDB cluster (Cloud or self-hosted)
- Azure Storage Account with hierarchical namespace enabled
- Databricks workspace with Unity Catalog
- Unity Catalog External Location configured for your storage account

**Note:** This notebook uses the **YCSB (Yahoo! Cloud Serving Benchmark)** schema as the default table structure, with `ycsb_key` as the primary key and `field0-9` columns. The default schema name is `public`.

## CDC Mode Selection

This notebook supports **4 CDC ingestion modes** by combining two independent settings:

### 1. CDC Processing Mode (`cdc_mode`)
How CDC events are processed in the target table:

- **`append_only`**: Store all CDC events as rows (audit log)
  - **Behavior**: All events (INSERT/UPDATE/DELETE) are appended as new rows
  - **Use case**: History tracking, time-series analysis, audit logs
  - **Storage**: Higher (keeps all historical events)

- **`update_delete`**: Apply MERGE logic (current state replication)
  - **Behavior**: DELETE removes rows, UPDATE modifies rows in-place
  - **Use case**: Current state synchronization, production replication
  - **Storage**: Lower (only latest state per key)
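
The difference between the two processing modes can be illustrated with a minimal pure-Python sketch. This is not the notebook's implementation: the event shape (`key`, `op`, `ts`, `value`) and the helper names `apply_append_only` and `apply_update_delete` are assumptions made up for illustration.

```python
# Hypothetical CDC events. The field names are illustrative assumptions,
# not the actual changefeed schema.
events = [
    {"key": "user1", "op": "INSERT", "ts": 1, "value": "a"},
    {"key": "user2", "op": "INSERT", "ts": 2, "value": "b"},
    {"key": "user1", "op": "UPDATE", "ts": 3, "value": "c"},
    {"key": "user2", "op": "DELETE", "ts": 4, "value": None},
]


def apply_append_only(events):
    """append_only: every event is kept as its own row (audit log)."""
    return [dict(e) for e in events]


def apply_update_delete(events):
    """update_delete: replay events in timestamp order, keeping current state."""
    state = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        if e["op"] == "DELETE":
            state.pop(e["key"], None)      # DELETE removes the row
        else:
            state[e["key"]] = e["value"]   # INSERT/UPDATE upserts the row
    return state


audit_log = apply_append_only(events)
current_state = apply_update_delete(events)
print(len(audit_log))   # 4
print(current_state)    # {'user1': 'c'}
```

The append-only result keeps all four events, while the merged result holds only the surviving key, which is why the storage footprint differs between the two modes.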

### 2. Column Family Mode (`column_family_mode`)
Table structure and changefeed configuration:

- **`single_cf`**: Standard table (1 column family, default)
  - **Changefeed**: `split_column_families=false`
  - **Files**: 1 Parquet file per CDC event
  - **Use case**: Most tables, simpler configuration, better performance

- **`multi_cf`**: Multiple column families (for wide tables)
  - **Changefeed**: `split_column_families=true`
  - **Files**: Multiple Parquet files per CDC event (fragments need merging)
  - **Use case**: Wide tables (50+ columns), selective column access patterns
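
What "fragments need merging" means can be sketched as follows, assuming each fragment carries the primary key, an event timestamp, and only the columns of one column family. The fragment layout, the `ts` column name, and the `merge_fragments` helper are hypothetical and are not taken from the notebook's modules.

```python
from collections import defaultdict

# Hypothetical fragments: one per column family per change.
fragments = [
    {"ycsb_key": "user1", "ts": 10, "field0": "x"},
    {"ycsb_key": "user1", "ts": 10, "field1": "y"},
    {"ycsb_key": "user2", "ts": 11, "field0": "z"},
]


def merge_fragments(fragments, key_cols=("ycsb_key",), ts_col="ts"):
    """Combine fragments that share the same primary key and timestamp."""
    merged = defaultdict(dict)
    for frag in fragments:
        group = tuple(frag[c] for c in key_cols) + (frag[ts_col],)
        merged[group].update(frag)  # later fragments add their own columns
    # Return one row per (key, timestamp) group, in sorted group order.
    return [merged[g] for g in sorted(merged)]


rows = merge_fragments(fragments)
for row in rows:
    print(row)
```

In this sketch, a column whose family emitted no fragment is simply absent from the merged row; a real pipeline has to decide how to fill such columns.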

### Function Selection Matrix

The notebook automatically selects the appropriate ingestion function based on your configuration:

| CDC Mode | Column Family Mode | Function Called |
|----------|-------------------|-----------------|
| `append_only` | `single_cf` | `ingest_cdc_append_only_single_family()` |
| `append_only` | `multi_cf` | `ingest_cdc_append_only_multi_family()` |
| `update_delete` | `single_cf` | `ingest_cdc_with_merge_single_family()` |
| `update_delete` | `multi_cf` | `ingest_cdc_with_merge_multi_family()` |
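
The matrix can also be read as a lookup table. The sketch below maps each mode pair to the function name from the table; the `select_ingest_function` helper is hypothetical and returns names rather than calling the real functions, so it runs without the notebook's modules.

```python
# Mode pair -> function name, copied from the matrix above.
INGEST_FUNCTIONS = {
    ("append_only", "single_cf"): "ingest_cdc_append_only_single_family",
    ("append_only", "multi_cf"): "ingest_cdc_append_only_multi_family",
    ("update_delete", "single_cf"): "ingest_cdc_with_merge_single_family",
    ("update_delete", "multi_cf"): "ingest_cdc_with_merge_multi_family",
}


def select_ingest_function(cdc_mode, column_family_mode):
    """Return the ingestion function name for a mode pair (hypothetical helper)."""
    try:
        return INGEST_FUNCTIONS[(cdc_mode, column_family_mode)]
    except KeyError:
        raise ValueError(
            f"Invalid mode combination: cdc_mode={cdc_mode!r}, "
            f"column_family_mode={column_family_mode!r}"
        ) from None


print(select_ingest_function("update_delete", "single_cf"))
# ingest_cdc_with_merge_single_family
```

A dictionary keeps the valid combinations in one place, so an unsupported pair fails with a single clear error.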

---

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import json
import os
from urllib.parse import quote

# Configuration file path (adjust as needed)
#config_file = "../.env/cockroachdb_cdc_tutorial_config_append_single_cf.json"

#config_file = "../.env/cockroachdb_cdc_tutorial_config_append_multi_cf.json"

#config_file = "../.env/cockroachdb_cdc_tutorial_config_update_delete_multi_cf.json"

config_file = "../.env/cockroachdb_cdc_tutorial_config_update_delete_single_cf.json"


import importlib
import cockroachdb_config
importlib.reload(cockroachdb_config)
from cockroachdb_config import load_config, process_config

# Try to load from file, fallback to embedded config
config = load_config(config_file)

# Embedded configuration (fallback)
if config is None:
    config = {
      "cockroachdb": {
        "host": "replace_me",
        "port": 26257,
        "user": "replace_me",
        "password": "replace_me",
        "database": "defaultdb"
      },
      "cockroachdb_source": {
        "catalog": "defaultdb",
        "schema": "public",
        "table_name": "usertable",
        "_schema_note": "Default schema is 'public'. Table uses YCSB structure (ycsb_key, field0-9)",
      },
      "azure_storage": {
        "account_name": "replace_me",
        "account_key": "replace_me",
        "container_name": "changefeed-events"
      },
      "databricks_target": {
        "catalog": "main",
        "schema": "replace_me",
        "table_name": "usertable",
      },
      "cdc_config": {
        "mode": "append_only",
        "column_family_mode": "multi_cf",
        "primary_key_columns": ["ycsb_key"],
        "auto_suffix_mode_family": True,
      },
      "uc_external_volume": {
        "volume_catalog": "robert_lee",
        "volume_full_path": "robert_lee.robert_lee_cockroachdb.cockroachdb_cdc_1768934658",
        "volume_id": "de84b515-ec65-4dbc-8a76-460328c6f1b1",
        "volume_name": "cockroachdb_cdc_1768934658",
        "volume_schema": "robert_lee_cockroachdb"
      },
      "workload_config": {
        "snapshot_count": 10,
        "insert_count": 10,
        "update_count": 9,
        "delete_count": 8,
      }
    }
config = process_config(config)


In [None]:
%pip install pg8000 azure-storage-blob --quiet
print("✅ Dependencies installed")

In [None]:
# Import CockroachDB connection utilities
import importlib
import cockroachdb_conn
importlib.reload(cockroachdb_conn)
from cockroachdb_conn import get_cockroachdb_connection

# Test connection to CockroachDB
# The function automatically tests the connection (test=True by default)
conn = get_cockroachdb_connection(
    cockroachdb_host=config.cockroachdb.host,
    cockroachdb_port=config.cockroachdb.port,
    cockroachdb_user=config.cockroachdb.user,
    cockroachdb_password=config.cockroachdb.password,
    cockroachdb_database=config.cockroachdb.database,
    test=True  # Automatically tests connection and prints version (default)
)

print("✅ Connection function ready for use")


In [None]:
# Import storage utilities (works with both Azure and UC Volume)
import importlib
import cockroachdb_storage
importlib.reload(cockroachdb_storage)
from cockroachdb_storage import check_files, wait_for_files

# Import YCSB utility functions
import cockroachdb_ycsb
importlib.reload(cockroachdb_ycsb)
from cockroachdb_ycsb import (
    get_table_stats,
    get_table_stats_spark,
    get_column_sum,
    get_column_sum_spark,
    deduplicate_to_latest,
    get_column_sum_spark_deduplicated
)

print("✅ Helper functions loaded (CockroachDB & Azure)")
print("✅ YCSB utility functions imported from cockroachdb_ycsb.py")

In [None]:
# Create table using cockroachdb_ycsb.py
# Import YCSB functions
import importlib, cockroachdb_ycsb
importlib.reload(cockroachdb_ycsb)
from cockroachdb_ycsb import create_ycsb_table

# Create table
try:
    create_ycsb_table(
        conn=conn,
        table_name=config.tables.source_table_name,
        column_family_mode=config.cdc_config.column_family_mode
    )
except Exception:
    # Close the connection, then re-raise so the failure is not silently swallowed
    conn.close()
    raise

In [None]:
from cockroachdb_ycsb import insert_ycsb_snapshot_with_random_nulls

try:
    insert_ycsb_snapshot_with_random_nulls(
        conn=conn,
        table_name=config.tables.source_table_name,
        snapshot_count=config.workload_config.snapshot_count,
        null_probability=0.3,  # 30% chance of NULL in snapshot
        columns_to_randomize=['field0', 'field1', 'field2', 'field3', 'field4', 'field5', 'field6', 'field7', 'field8', 'field9'],  # ALL fields
        seed=42,  # Reproducible random NULLs
        force_all_null_row=True  # Row 0 will have all randomized columns as NULL (edge case testing)
    )
except Exception:
    # Close the connection, then re-raise so the failure is not silently swallowed
    conn.close()
    raise

In [None]:
from cockroachdb_sql import create_changefeed_from_config

try:
    result = create_changefeed_from_config(conn, config, spark)
    
    if result['created']:
        print(f"New changefeed: Job {result['job_id']}")
    else:
        print(f"Using existing: {result['existing_count']} found")
except Exception as e:
    print(f"Changefeed setup failed: {e}")
    conn.close()
    raise

In [None]:
import time

# Capture baseline file count BEFORE generating CDC events
print("üìä Capturing baseline file count...")
result_before = check_files(
    config=config,
    spark=spark,
    verbose=False
)
files_before = len(result_before['data_files'])
resolved_before = len(result_before['resolved_files'])
print(f"   Data files: {files_before}")
print(f"   Resolved files: {resolved_before}")
print()

# Run workload with NULL testing using cockroachdb_ycsb.py
from cockroachdb_ycsb import run_ycsb_workload_with_random_nulls

# Run workload - connection is managed by notebook, not closed here
run_ycsb_workload_with_random_nulls(
    conn=conn,
    table_name=config.tables.source_table_name,
    insert_count=config.workload_config.insert_count,
    update_count=config.workload_config.update_count,
    delete_count=config.workload_config.delete_count,
    null_probability=0.5,  # 50% chance of NULL in UPDATEs
    columns_to_randomize=['field0', 'field1', 'field2', 'field3', 'field4', 'field5', 'field6', 'field7', 'field8', 'field9'],  # ALL fields
    seed=42,  # Reproducible random NULLs
    force_all_null_update=True  # First UPDATE will have all NULLs (edge case testing)
)

# Wait for new CDC files to appear in storage (positive confirmation)
storage_label = "Unity Catalog Volume" if config.data_source == "uc_external_volume" else "Azure"
print()
print(f"⏳ Waiting for new CDC files to appear in {storage_label}...")
print(f"   Baseline: {files_before} data files, {resolved_before} resolved files")
print()

# Poll for new files (max 90 seconds)
max_wait = 90
check_interval = 10
elapsed = 0

while elapsed < max_wait:
    result = check_files(
        config=config,
        spark=spark,
        verbose=False
    )
    files_now = len(result['data_files'])
    resolved_now = len(result['resolved_files'])
    
    if files_now > files_before or resolved_now > resolved_before:
        print(f"✅ New CDC files appeared after {elapsed} seconds!")
        print(f"   Data files: {files_before} → {files_now} (+{files_now - files_before})")
        print(f"   Resolved files: {resolved_before} → {resolved_now} (+{resolved_now - resolved_before})")
        break
    
    print(f"   Checking... ({elapsed}s elapsed, baseline: {files_before} data, {resolved_before} resolved)", end='\r')
    time.sleep(check_interval)
    elapsed += check_interval
else:
    print(f"\n⚠️  Timeout after {max_wait}s - files may still be flushing")
    print(f"   Run the next cell to check manually")

In [None]:
# Use the unified storage function to check for files (works with both Azure and UC Volume)
result = check_files(
    config=config,
    spark=spark,
    verbose=True
)

# Provide guidance
if len(result['data_files']) == 0:
    print(f"\n⚠️  No data files found yet.")
    print(f"   💡 Possible reasons:")
    print(f"   - Changefeed not created yet (run the changefeed creation cell)")
    print(f"   - Path configuration mismatch (check the configuration cell)")
    print(f"   - Azure credentials issue (check External Location)")
else:
    print(f"\n✅ Files are ready! Proceed to the ingestion cell to read with Databricks.")

In [None]:
# Import CDC ingestion functions from cockroachdb_autoload.py
import importlib, cockroachdb_autoload
importlib.reload(cockroachdb_autoload)
from cockroachdb_autoload import (
    ingest_cdc_append_only_single_family,
    ingest_cdc_append_only_multi_family,
    ingest_cdc_with_merge_single_family,
    ingest_cdc_with_merge_multi_family
)

print(f"üî∑ CDC Configuration:")
print(f"   Processing Mode: {config.cdc_config.mode}")
print(f"   Column Family Mode: {config.cdc_config.column_family_mode}")
print(f"   Data Source: {config.data_source}")
print()

# Select function based on BOTH config.cdc_config.mode and config.cdc_config.column_family_mode
if config.cdc_config.mode == "append_only" and config.cdc_config.column_family_mode == "single_cf":
    print(f"üìò Running: ingest_cdc_append_only_single_family()")
    print(f"   - All CDC events will be stored as rows")
    print(f"   - No column family merging needed\n")
    
    query = ingest_cdc_append_only_single_family(
        config=config,
        spark=spark
    )

elif config.cdc_config.mode == "append_only" and config.cdc_config.column_family_mode == "multi_cf":
    print(f"üìô Running: ingest_cdc_append_only_multi_family()")
    print(f"   - All CDC events will be stored as rows")
    print(f"   - Column family fragments will be merged\n")
    
    if not config.cdc_config.primary_key_columns:
        raise ValueError("config.cdc_config.primary_key_columns required for multi_cf mode")
    
    query = ingest_cdc_append_only_multi_family(
        config=config,
        spark=spark
    )

elif config.cdc_config.mode == "update_delete" and config.cdc_config.column_family_mode == "single_cf":
    print(f"üìó Running: ingest_cdc_with_merge_single_family()")
    print(f"   - MERGE logic applied (UPDATE/DELETE processed)")
    print(f"   - No column family merging needed\n")
    
    if not config.cdc_config.primary_key_columns:
        raise ValueError("config.cdc_config.primary_key_columns required for update_delete mode")
    
    result = ingest_cdc_with_merge_single_family(
        config=config,
        spark=spark
    )
    
    query = result["query"]

elif config.cdc_config.mode == "update_delete" and config.cdc_config.column_family_mode == "multi_cf":
    print(f"üìï Running: ingest_cdc_with_merge_multi_family()")
    print(f"   - MERGE logic applied (UPDATE/DELETE processed)")
    print(f"   - Column family fragments will be merged\n")
    
    if not config.cdc_config.primary_key_columns:
        raise ValueError("config.cdc_config.primary_key_columns required for update_delete + multi_cf mode")
    
    result = ingest_cdc_with_merge_multi_family(
        config=config,
        spark=spark
    )
    
    query = result["query"]

else:
    raise ValueError(
        f"Invalid mode combination:\n"
        f"  cdc_mode='{config.cdc_config.mode}' (valid: 'append_only', 'update_delete')\n"
        f"  column_family_mode='{config.cdc_config.column_family_mode}' (valid: 'single_cf', 'multi_cf')\n"
        f"Change modes in Cell 1."
    )

# Wait for completion (if not already complete)
if config.cdc_config.mode == "append_only":
    query.awaitTermination()
    print("\n" + "=" * 80)
    print(f"✅ CDC INGESTION COMPLETE")
    print("=" * 80)
    print(f"   Mode: {config.cdc_config.mode} + {config.cdc_config.column_family_mode}")
    print(f"   Target: {config.tables.destination_catalog}.{config.tables.destination_schema}.{config.tables.destination_table_name}")
    print()
    print(f"📊 Query your data: SELECT * FROM {config.tables.destination_catalog}.{config.tables.destination_schema}.{config.tables.destination_table_name}")
else:
    # update_delete mode already completed inside the function
    print(f"📊 Query your data: SELECT * FROM {config.tables.destination_catalog}.{config.tables.destination_schema}.{config.tables.destination_table_name}")

In [None]:
# ALL-IN-ONE CDC DIAGNOSIS

# What this does:
#   1. CDC Event Summary (replaces Cell 13)
#      - Shows total rows, operation breakdown, sample data
#   
#   2. Source vs Target Verification (replaces Cell 14)
#      - Connects to CockroachDB source
#      - Auto-deduplicates target for append_only mode
#      - Compares column sums
#      - Detects mismatches
#   
#   3. Detailed Diagnosis (automatic if issues found)
#      - Column family sync analysis
#      - CDC event distribution
#      - Row-by-row comparison
#      - Troubleshooting recommendations
#
# Smart behavior:
#   ✅ If everything matches → Shows "Perfect sync!" and exits
#   ⚠️  If mismatches found → Automatically runs detailed diagnosis
#
# No external dependencies - just run this!
# ============================================================================

import importlib, cockroachdb_ycsb, cockroachdb_debug, cockroachdb_conn
importlib.reload(cockroachdb_conn)  # Reload first (cockroachdb_debug depends on it)
importlib.reload(cockroachdb_ycsb)  # Reload first (cockroachdb_debug depends on it)
importlib.reload(cockroachdb_debug)
from cockroachdb_debug import run_full_diagnosis_from_config

run_full_diagnosis_from_config(conn=conn, spark=spark, config=config)

## Optional: Cleanup

Run the cells below if you want to clean up the test resources.

In [None]:
# ⚠️  SAFETY STOP: Cleanup Section
# This cell prevents accidental cleanup when running "Run All"
#
# To clean up resources, manually run each cell below INDIVIDUALLY:
#   - CLEANUP CELL 1: Cancel changefeed
#   - CLEANUP CELL 2: Drop CockroachDB source table
#   - CLEANUP CELL 3: Drop Databricks target table & checkpoint
#   - The cell that calls delete_changefeed_files: Clear Azure changefeed data
#     (optional - use for complete reset)

raise RuntimeError(
    "\n"
    "⚠️  CLEANUP SAFETY STOP\n"
    "\n"
    "The cells below will DELETE your resources.\n"
    "Do NOT run all cells - run each cleanup cell individually.\n"
    "\n"
    "💡 TIP: If the diagnosis cell shows sync issues due to old data,\n"
    "   run the delete_changefeed_files cell to clear Azure changefeed data completely.\n"
)

In [None]:
if conn is None:
    conn = get_cockroachdb_connection(
        cockroachdb_host=config.cockroachdb.host,
        cockroachdb_port=config.cockroachdb.port,
        cockroachdb_user=config.cockroachdb.user,
        cockroachdb_password=config.cockroachdb.password,
        cockroachdb_database=config.cockroachdb.database,
        test=False  # Skip test, connection already validated
    )

In [None]:
# CLEANUP CELL 1: CANCEL CHANGEFEED(S)
from cockroachdb_sql import cancel_changefeeds

try:
    result = cancel_changefeeds(conn, config)
except Exception:
    # Close the connection, then re-raise so the failure is not silently swallowed
    conn.close()
    raise

In [None]:
# CLEANUP CELL 2: DROP SOURCE TABLE (CockroachDB)
from cockroachdb_sql import drop_table

conn = get_cockroachdb_connection(
    cockroachdb_host=config.cockroachdb.host,
    cockroachdb_port=config.cockroachdb.port,
    cockroachdb_user=config.cockroachdb.user,
    cockroachdb_password=config.cockroachdb.password,
    cockroachdb_database=config.cockroachdb.database,
    test=False  # Skip test, connection already validated
)
try:
    drop_table(conn, config.tables.source_table_name)
except Exception:
    # Close the connection, then re-raise so the failure is not silently swallowed
    conn.close()
    raise

In [None]:
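# Clear Azure changefeed data (optional - use for complete reset)
import importlib, cockroachdb_azure  # Import the module before reloading it
importlib.reload(cockroachdb_azure)
from cockroachdb_azure import delete_changefeed_files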

result = delete_changefeed_files(
    storage_account_name=config.azure_storage.account_name,
    storage_account_key=config.azure_storage.account_key,
    container_name=config.azure_storage.container_name,
    changefeed_path=config.cdc_config.path  # Uses path from config
)


In [None]:
# CLEANUP CELL 3: DROP TARGET TABLE & CHECKPOINT (Databricks)
# Checkpoint lives on target schema; directory name = table name (same as ingestion Cell 10).
from cockroachdb_autoload import _build_paths
_, checkpoint_path, target_table_fqn = _build_paths(config, spark=spark)

# Drop Delta table
spark.sql(f"DROP TABLE IF EXISTS {target_table_fqn}")
print(f"✅ Delta table '{target_table_fqn}' dropped")

# Remove checkpoint
try:
    dbutils.fs.rm(checkpoint_path, True)
    print(f"✅ Checkpoint '{checkpoint_path}' removed")
except Exception:
    print(f"ℹ️  Checkpoint not found (may have been already removed)")

print("\n✅ Cleanup complete!")

In [None]:
# CLEANUP CELL 4: Complete cleanup for fresh start

# 1. Drop staging table
staging_table_fqn = f"{config.tables.destination_catalog}.{config.tables.destination_schema}.{config.tables.destination_table_name}_staging_cf"
print(f"üóëÔ∏è  Dropping staging table: {staging_table_fqn}")
spark.sql(f"DROP TABLE IF EXISTS {staging_table_fqn}")

# 2. Drop target table (if not already done)
target_table_fqn = f"{config.tables.destination_catalog}.{config.tables.destination_schema}.{config.tables.destination_table_name}"
print(f"üóëÔ∏è  Dropping target table: {target_table_fqn}")
spark.sql(f"DROP TABLE IF EXISTS {target_table_fqn}")

# 3. Clear checkpoint location (target schema, directory = table name + _merge_cf)
from cockroachdb_autoload import _build_paths
_, checkpoint_path, _ = _build_paths(config, mode_suffix="_merge_cf", spark=spark)
print(f"üóëÔ∏è  Clearing checkpoint: {checkpoint_path}")
try:
    dbutils.fs.rm(checkpoint_path, recurse=True)
    print(f"   ‚úÖ Checkpoint cleared")
except Exception as e:
    print(f"   ‚ÑπÔ∏è  Checkpoint may not exist: {e}")

# 4. Verify cleanup
print(f"\n‚úÖ Cleanup complete! Ready for fresh start.")
print(f"   Next: Re-run Cell 12 (ingestion)")

In [None]:
# Recreate the schema
print(f"üìÅ Creating schema: {config.tables.destination_catalog}.{config.tables.destination_schema}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {config.tables.destination_catalog}.{config.tables.destination_schema}")
print(f"‚úÖ Schema created")

# Verify schema exists
schemas = spark.sql(f"SHOW SCHEMAS IN {config.tables.destination_catalog}").collect()
schema_names = [row['databaseName'] for row in schemas]
if config.tables.destination_schema in schema_names:
    print(f"✅ Verified: Schema {config.tables.destination_schema} exists")
else:
    print(f"❌ Schema {config.tables.destination_schema} not found. Available schemas: {schema_names}")

# Debug Codes