# Automated CDC Scenario Testing

This notebook uses the automated `load_and_merge_cdc_to_delta()` function to test CDC scenarios.

## Features
- ✅ Auto-detects primary keys from CockroachDB
- ✅ Auto-detects column families
- ✅ Loads, transforms, merges, and writes in one call
- ✅ Verifies results automatically
- ✅ Compares with source files

## Prerequisites
- Unity Catalog Volume with synced test data
- Configuration files: `.env/cockroachdb_credentials.json` and `.env/cockroachdb_pipelines.json`
- Data synced via `test_cdc_matrix.sh` (auto-syncs to volume subdirectories)
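
Before running the notebook, it can help to confirm that the pipeline config carries the keys the later cells read. The sketch below is a minimal check, not part of the `cockroachdb` module: the helper name `missing_pipeline_keys` and the example values are hypothetical, while the key names are the ones this notebook looks up in `pipeline_config`.

```python
import json

# Keys this notebook reads from the pipeline config (value types are assumed).
REQUIRED_PIPELINE_KEYS = ("catalog", "schema", "volume_name")
# Only looked up when ConnectorMode.AZURE_PARQUET is selected.
AZURE_PIPELINE_KEYS = ("azure_account", "azure_key", "container_name")


def missing_pipeline_keys(config, azure=False):
    """Return the expected keys that are absent or empty in `config` (hypothetical helper)."""
    expected = REQUIRED_PIPELINE_KEYS + (AZURE_PIPELINE_KEYS if azure else ())
    return [key for key in expected if not config.get(key)]


# Hypothetical content; real values live in .env/cockroachdb_pipelines.json.
example = json.loads(
    '{"catalog": "main", "schema": "my_schema", "volume_name": "parquet_files"}'
)
print(missing_pipeline_keys(example))              # []
print(missing_pipeline_keys(example, azure=True))  # ['azure_account', 'azure_key', 'container_name']
```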


## 📝 Setup Notes

**If you see a "No timestamped directories found" error:**

The code automatically handles the `dbutils.fs.ls()` quirk where `item.name` can be empty.
If you still see this error:
- Run `test_cdc_matrix.sh` to generate test data if it is not already present
- Verify that the `TEST_FORMAT` and `TEST_NAME` variables match your test data
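
The empty-name quirk and the version lookup can be illustrated with a small sketch. This is not the module's actual implementation: `FileInfo` here is a stand-in for the entries returned by `dbutils.fs.ls()`, and `entry_name` and `resolve_timestamp_dir` are hypothetical helper names. The version convention (0 = latest, 1 = second newest, -1 = oldest) follows the `TEST_VERSION` comment in Step 2.

```python
from collections import namedtuple

# Stand-in for dbutils.fs.ls() entries; only the two fields used here.
FileInfo = namedtuple("FileInfo", ["path", "name"])


def entry_name(item):
    """Return the entry name, falling back to the last path segment when name is empty."""
    name = (item.name or "").strip("/")
    if name:
        return name
    return item.path.rstrip("/").rsplit("/", 1)[-1]


def resolve_timestamp_dir(items, version=0):
    """Pick a timestamped directory: 0 = latest, 1 = second newest, -1 = oldest."""
    names = (entry_name(item) for item in items)
    stamps = sorted((n for n in names if n.isdigit()), key=int, reverse=True)
    if not stamps:
        raise FileNotFoundError("No timestamped directories found")
    return stamps[version]


# Hypothetical listing: one entry has an empty name, one is not a timestamp.
listing = [
    FileInfo("/Volumes/c/s/v/run/1769034791/", ""),
    FileInfo("/Volumes/c/s/v/run/1769030000/", "1769030000/"),
    FileInfo("/Volumes/c/s/v/run/1769032000/", "1769032000/"),
    FileInfo("/Volumes/c/s/v/run/_schema/", "_schema/"),
]
print(resolve_timestamp_dir(listing, 0))   # 1769034791
print(resolve_timestamp_dir(listing, -1))  # 1769030000
```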

## Step 1: Setup and Configuration


In [1]:
if "dbutils" not in vars():
    raise RuntimeError("This notebook must be run in Databricks Connect or workspace with dbutils available")
if "spark" not in vars():
    raise RuntimeError("This notebook must be run in Databricks Connect or workspace with Spark available")

In [2]:
# Import ConnectorMode enum for type-safe mode selection
from cockroachdb import ConnectorMode

# Available modes:
# - ConnectorMode.VOLUME: Read JSON/Parquet from Unity Catalog Volumes
# - ConnectorMode.AZURE_PARQUET: Read Parquet from Azure Blob Storage
# - ConnectorMode.AZURE_JSON: Read JSON from Azure Blob Storage
# - ConnectorMode.AZURE_DUAL: Also defined by the enum; not exercised in this notebook
# - ConnectorMode.DIRECT: Instream changefeed (live CDC)

print(f"✅ ConnectorMode enum imported")
print(f"   Available modes: {[mode.value for mode in ConnectorMode]}")

✅ ConnectorMode enum imported
   Available modes: ['volume', 'azure_parquet', 'azure_json', 'azure_dual', 'direct']


In [3]:
import json
import os
import sys
import importlib

# Add parent directory to path
parent_dir = os.path.abspath("../..")
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)

import cockroachdb
importlib.reload(cockroachdb)
from cockroachdb import load_crdb_config, load_and_merge_cdc_to_delta, cleanup_test_checkpoint

print("="*80)
print("CONFIGURATION SETUP")
print("="*80)

# Load configuration files
git_root = os.path.abspath("../../..")
cockroach_dir = f"{git_root}/sources/cockroachdb"
crdb_json_path = f"{cockroach_dir}/.env/cockroachdb_credentials.json"
pipeline_json_path = f"{cockroach_dir}/.env/cockroachdb_pipelines.json"

crdb_config = load_crdb_config(crdb_json_path)

with open(pipeline_json_path, 'r') as f:
    pipeline_config = json.load(f)

print("\n✅ Configuration loaded!")
print("="*80)


CONFIGURATION SETUP

✅ Configuration loaded!


## Step 2: Configure Test Scenario

Update these variables to test different scenarios from `test_cdc_matrix.sh`:


In [4]:
# ============================================================================
# 🔧 CHANGE THIS TO TEST DIFFERENT SCENARIOS
# ============================================================================

# ============================================================================
# Available test scenarios (JSON first, then Parquet):
#   - "test-json_usertable_with_split"        ⭐ Tests merge with column families
#   - "test-json_usertable_no_split"
#   - "test-json_simple_test_with_split"
#   - "test-json_simple_test_no_split"
#   - "test-parquet_usertable_with_split"
#   - "test-parquet_usertable_no_split"
#   - "test-parquet_simple_test_with_split"
#   - "test-parquet_simple_test_no_split"
# ============================================================================


# Test scenario (subdirectory name from test_cdc_matrix.sh)
TEST_FORMAT = "parquet"   # json | parquet
TEST_NAME = "usertable_with_split"

# Test version (which test run to analyze)
TEST_VERSION = 0  # 0=latest, 1=second newest, -1=oldest

TEST_SCENARIO = f"test-{TEST_FORMAT}_{TEST_NAME}"  # ⭐ Change this!


In [5]:
# Parse test scenario to extract components
# Uses centralized parse_test_scenario() from cockroachdb.py
# Pattern: test-{format}_{table_name}_{split_info}
# Examples:
#   "test-parquet_simple_test_no_split" → format='parquet', table_name='simple_test', has_split=False
#   "test-json_usertable_with_split" → format='json', table_name='usertable', has_split=True
from cockroachdb import parse_test_scenario

scenario = parse_test_scenario(TEST_SCENARIO)

# Extract table name from scenario
SOURCE_TABLE = scenario.table_name

# CockroachDB connection defaults (used for schema auto-detection)
# These are NOT part of the scenario name - they're configuration
CRDB_CATALOG = "defaultdb"  # CockroachDB database/catalog
CRDB_SCHEMA = "public"      # CockroachDB schema

print(f"✅ Parsed scenario: {scenario.scenario_name}")
print(f"   Format: {scenario.format}")
print(f"   Table: {SOURCE_TABLE}")
print(f"   Split: {scenario.split_info}")
print(f"   Using catalog: {CRDB_CATALOG}")
print(f"   Using schema: {CRDB_SCHEMA}")
print()

# Derived configuration
CATALOG = pipeline_config["catalog"]
SCHEMA = pipeline_config["schema"]
VOLUME_NAME = pipeline_config["volume_name"]

# Volume path prefix (timestamp will be resolved based on TEST_VERSION)
# This allows testing different test runs without changing paths
VOLUME_PATH = f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME_NAME}/{TEST_FORMAT}/{CRDB_CATALOG}/{CRDB_SCHEMA}/{TEST_SCENARIO}"

# Target Delta table
TARGET_TABLE = f"{SOURCE_TABLE}_{TEST_SCENARIO.replace('-', '_')}_delta"
TARGET_TABLE_PATH = f"{CATALOG}.{SCHEMA}.{TARGET_TABLE}"

print("="*80)
print("TEST CONFIGURATION")
print("="*80)
print(f"Test scenario: {TEST_SCENARIO}")
print(f"Test version: {TEST_VERSION} (0=latest, -1=oldest)")
print(f"Source table: {SOURCE_TABLE}")
print(f"Volume path prefix: {VOLUME_PATH}")
print(f"  (Timestamp will be resolved automatically)")
print(f"Target table: {TARGET_TABLE_PATH}")
print("="*80)

✅ Parsed scenario: test-parquet_usertable_with_split
   Format: parquet
   Table: usertable
   Split: with_split
   Using catalog: defaultdb
   Using schema: public

TEST CONFIGURATION
Test scenario: test-parquet_usertable_with_split
Test version: 0 (0=latest, -1=oldest)
Source table: usertable
Volume path prefix: /Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split
  (Timestamp will be resolved automatically)
Target table: main.robert_lee_cockroachdb.usertable_test_parquet_usertable_with_split_delta
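
The scenario names parsed above follow the pattern `test-{format}_{table_name}_{split_info}`. The sketch below shows one way such a name could be split with a regular expression. It is not the source of `parse_test_scenario()`; the names `Scenario` and `parse_scenario_name` are hypothetical, and the sketch assumes the format is always `json` or `parquet` and the split suffix is always `with_split` or `no_split`, as in the scenario list in Step 2.

```python
import re
from dataclasses import dataclass

# Assumed naming pattern: test-{format}_{table_name}_{split_info}
_SCENARIO_RE = re.compile(
    r"^test-(?P<format>json|parquet)_(?P<table>.+)_(?P<split>with_split|no_split)$"
)


@dataclass(frozen=True)
class Scenario:
    scenario_name: str
    format: str
    table_name: str
    split_info: str

    @property
    def has_split(self):
        return self.split_info == "with_split"


def parse_scenario_name(name):
    """Split a scenario name into its parts (hypothetical helper)."""
    match = _SCENARIO_RE.match(name)
    if match is None:
        raise ValueError(f"Unrecognized scenario name: {name!r}")
    return Scenario(name, match["format"], match["table"], match["split"])


parsed = parse_scenario_name("test-parquet_simple_test_no_split")
print(parsed.format, parsed.table_name, parsed.has_split)  # parquet simple_test False
```

Table names that contain underscores, such as `simple_test`, are handled because the split suffix is matched from a fixed set at the end of the name.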


## Step 3: Run Automated Test

This single function call:
- Auto-detects primary keys and column families
- Loads data with Autoloader
- Applies CDC transformations
- Merges column family fragments
- Writes to Delta table
- Verifies results
- Compares with source files
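
To make the fragment merge step in the list above concrete: when a table uses several column families, one row change can arrive as several partial records that share a primary key. The sketch below merges such fragments in plain Python. It is a simplified illustration, not the logic inside `load_and_merge_cdc_to_delta()`: the helper name `merge_fragments` is hypothetical, and it ignores event timestamps and treats `None` as "column not present in this fragment".

```python
def merge_fragments(fragments, primary_keys):
    """Merge per-column-family fragments into one row per primary key.

    A column value from a later fragment replaces an earlier one only when it
    is not None, so empty placeholders never overwrite real data.
    """
    merged = {}
    for fragment in fragments:
        key = tuple(fragment[pk] for pk in primary_keys)
        row = merged.setdefault(key, {})
        for column, value in fragment.items():
            if value is not None or column not in row:
                row[column] = value
    return list(merged.values())


# Hypothetical fragments: two for user1 (one per column family), one for user2.
fragments = [
    {"ycsb_key": "user1", "field0": "a", "field1": None},
    {"ycsb_key": "user1", "field0": None, "field1": "b"},
    {"ycsb_key": "user2", "field0": "c", "field1": "d"},
]
rows = merge_fragments(fragments, ["ycsb_key"])
print(len(fragments), "fragments ->", len(rows), "rows")  # 3 fragments -> 2 rows
```

Without this step, the row count in the target table would reflect fragments rather than rows.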


In [6]:
# Run automated test
result = load_and_merge_cdc_to_delta(
    source_table=SOURCE_TABLE,
    volume_path=VOLUME_PATH,
    target_table_path=TARGET_TABLE_PATH,
    crdb_config=crdb_config,
    catalog=CRDB_CATALOG,
    schema=CRDB_SCHEMA,
    spark=spark, 
    dbutils=dbutils,
    clear_checkpoint=True,   # Set to False if appending data
    verify=True,             # Verify Delta table
    compare_source=True,     # Compare with source files
    debug=True,              # Show detailed progress
    version=TEST_VERSION     # Which test run to use (0=latest, -1=oldest)
)


📂 Parsed volume path:
   Volume base: /Volumes/main/robert_lee_cockroachdb/parquet_files
   Path prefix: parquet/defaultdb/public/test-parquet_usertable_with_split
   Timestamp in path: None
   Version parameter: 0
🔍 Auto-resolving timestamped path (version=0)...
   ✅ Resolved to timestamp: 1769034791
   Full path: /Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791
🔍 Detected test table name from path: test_parquet_usertable_with_split
AUTOMATED CDC TESTING
📌 Code Version: cockroachdb@f603af7
Source table: usertable
Volume path: /Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791
Target table: main.robert_lee_cockroachdb.usertable_test_parquet_usertable_with_split_delta
CockroachDB catalog: defaultdb
CockroachDB schema: public

🔍 Auto-detected format: parquet
🔍 Validating volume path...
   File format: parquet
   File extension: 



   ✅ Schema file found
   Primary keys: ['ycsb_key']
   Has column families: True

FAST CHECKPOINT CLEARING
PARALLEL CHECKPOINT DELETION
Path: /Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791/_checkpoints
Workers: 20

ℹ️  Checkpoint doesn't exist: /Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791/_checkpoints
   (This is OK if it's the first run)

✅ Dropped table: main.robert_lee_cockroachdb.usertable_test_parquet_usertable_with_split_delta

CLEAR COMPLETE
⏱️  Total time: 1.7s
📊 Items deleted: 0

📥 Loading data with Autoloader (parquet format)...
   ✅ Autoloader configured

🔧 Applying CDC transformations...
   ✅ CDC metadata added

💾 Writing CDC events to temp table...
   (Column family merge will happen in batch mode)
   🚀 Step 1: Writing raw CDC events to temp table...
   ✅ Raw CDC events written to temp table


## Optional: Clean Up the Test Checkpoint

In [7]:
# Quick cleanup after inspecting results
# This is faster than clear_checkpoint=True because:
# - You can inspect the results first
# - Only deletes what's needed
# - Can keep table for further inspection if desired

# Option 1: Clean checkpoint only (keep table for inspection)
cleanup_test_checkpoint(
    volume_path=VOLUME_PATH,
    dbutils=dbutils,
    version=TEST_VERSION,
    drop_table=False  # Keep table
)

# Option 2: Full cleanup (checkpoint + table)
# cleanup_test_checkpoint(
#     volume_path=VOLUME_PATH,
#     target_table_path=TARGET_TABLE_PATH,
#     dbutils=dbutils,
#     spark=spark,
#     version=TEST_VERSION,
#     drop_table=True  # Drop table too
# )

print("\n✅ Ready for next test iteration!")


CLEANUP TEST CHECKPOINT
Volume path: /Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791
Checkpoint: /Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791/_checkpoints

🧹 Deleting checkpoint files...
PARALLEL CHECKPOINT DELETION
Path: /Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791/_checkpoints
Workers: 20

📊 Found:
   Subdirectories: 2
   Files: 0

🔍 Recursively scanning directories (max 5 levels deep for hierarchical structures)...
   delta/: 11 leaf directories
   schema/: 2 leaf directories

   Total: 13 directories to delete

🔥 Deleting 13 directories in parallel...
   ✅ delta/__tmp_path_dir
   ✅ 0/metadata
   ✅ offsets/0
   ✅ _schemas/0
   ✅ delta/metadata
   ✅ commits/0
   ✅ _schemas/__tmp_path_dir
   ✅ 0/__tmp_path_dir
   ✅ rocksdb/__tmp_path_

## Step 4: Review Results


In [8]:
print("="*80)
print("TEST RESULTS")
print("="*80)
print(f"Success: {result['success']}")
print(f"Primary keys: {result['primary_keys']}")
print(f"Has column families: {result['has_column_families']}")
print(f"Delta table rows: {result['delta_count']:,}")
print(f"Source file rows: {result['source_count']:,}")
print(f"Match: {result['match']} {'✅' if result['match'] else '⚠️'}")
print("="*80)

if result['match']:
    print("\n✅✅✅ TEST PASSED! ✅✅✅")
    print("Column family merge worked correctly!")
else:
    diff = result['delta_count'] - result['source_count']
    print(f"\n⚠️  TEST FAILED: {diff:+,} row difference")
    print("Review the logs above for details.")


TEST RESULTS
Success: True
Primary keys: ['ycsb_key']
Has column families: True
Delta table rows: 9,950
Source file rows: 9,950
Match: True ✅

✅✅✅ TEST PASSED! ✅✅✅
Column family merge worked correctly!


## Step 5: Query Delta Table (Optional)


In [9]:
# Display sample data
display(spark.table(TARGET_TABLE_PATH).limit(10))


Unnamed: 0,ycsb_key,_cdc_timestamp,_cdc_operation,field0,field1,field2,field3,field4,field5,field6,field7,field8,field9,__crdb__event_type,__crdb__updated,_rescued_data,_source_file,_processing_time
0,newuser0000000001,1.769035360730318e+18,UPSERT,field0_new_1,field1_new_1,field2_new_c4ca4238a0b923820dcc509a6f75849b,field3_new_yy,field4_new_1,field5_new_1,field6_new_1,field7_new_1,field8_new_1,field9_new_1,c,1.769035360730318e+18,,/Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791/2026-01-21/202601212242310000000000000000001-b409b8db75c1cc45-1-30-00000002-test_parquet_usertable_with_split+data-1.parquet,2026-01-22 00:17:51.542
1,newuser0000000002,1.769035360730318e+18,UPSERT,field0_new_2,field1_new_2,field2_new_c81e728d9d4c2f636f067f89cc14862c,field3_new_yyy,field4_new_2,field5_new_2,field6_new_2,field7_new_2,field8_new_2,field9_new_2,c,1.769035360730318e+18,,/Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791/2026-01-21/202601212242310000000000000000001-b409b8db75c1cc45-1-30-00000002-test_parquet_usertable_with_split+data-1.parquet,2026-01-22 00:17:51.542
2,newuser0000000003,1.769035360730318e+18,UPSERT,field0_new_3,field1_new_3,field2_new_eccbc87e4b5ce2fe28308fd9f2a7baf3,field3_new_yyyy,field4_new_3,field5_new_3,field6_new_3,field7_new_3,field8_new_3,field9_new_3,c,1.769035360730318e+18,,/Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791/2026-01-21/202601212242310000000000000000001-b409b8db75c1cc45-1-30-00000002-test_parquet_usertable_with_split+data-1.parquet,2026-01-22 00:17:51.542
3,newuser0000000004,1.769035360730318e+18,UPSERT,field0_new_4,field1_new_4,field2_new_a87ff679a2f3e71d9181a67b7542122c,field3_new_yyyyy,field4_new_4,field5_new_4,field6_new_4,field7_new_4,field8_new_4,field9_new_4,c,1.769035360730318e+18,,/Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791/2026-01-21/202601212242310000000000000000001-b409b8db75c1cc45-1-30-00000002-test_parquet_usertable_with_split+data-1.parquet,2026-01-22 00:17:51.542
4,newuser0000000005,1.769035360730318e+18,UPSERT,field0_new_5,field1_new_5,field2_new_e4da3b7fbbce2345d7772b0674a318d5,field3_new_yyyyyy,field4_new_5,field5_new_5,field6_new_5,field7_new_5,field8_new_5,field9_new_5,c,1.769035360730318e+18,,/Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791/2026-01-21/202601212242310000000000000000001-b409b8db75c1cc45-1-30-00000002-test_parquet_usertable_with_split+data-1.parquet,2026-01-22 00:17:51.542
5,newuser0000000006,1.769035360730318e+18,UPSERT,field0_new_6,field1_new_6,field2_new_1679091c5a880faf6fb5e6087eb1b2dc,field3_new_yyyyyyy,field4_new_6,field5_new_6,field6_new_6,field7_new_6,field8_new_6,field9_new_6,c,1.769035360730318e+18,,/Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791/2026-01-21/202601212242310000000000000000001-b409b8db75c1cc45-1-30-00000002-test_parquet_usertable_with_split+data-1.parquet,2026-01-22 00:17:51.542
6,newuser0000000007,1.769035360730318e+18,UPSERT,field0_new_7,field1_new_7,field2_new_8f14e45fceea167a5a36dedd4bea2543,field3_new_yyyyyyyy,field4_new_7,field5_new_7,field6_new_7,field7_new_7,field8_new_7,field9_new_7,c,1.769035360730318e+18,,/Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791/2026-01-21/202601212242310000000000000000001-b409b8db75c1cc45-1-30-00000002-test_parquet_usertable_with_split+data-1.parquet,2026-01-22 00:17:51.542
7,newuser0000000008,1.769035360730318e+18,UPSERT,field0_new_8,field1_new_8,field2_new_c9f0f895fb98ab9159f51fd0297e236d,field3_new_yyyyyyyyy,field4_new_8,field5_new_8,field6_new_8,field7_new_8,field8_new_8,field9_new_8,c,1.769035360730318e+18,,/Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791/2026-01-21/202601212242310000000000000000001-b409b8db75c1cc45-1-30-00000002-test_parquet_usertable_with_split+data-1.parquet,2026-01-22 00:17:51.542
8,newuser0000000009,1.769035360730318e+18,UPSERT,field0_new_9,field1_new_9,field2_new_45c48cce2e2d7fbdea1afc51c7c6ad26,field3_new_yyyyyyyyyy,field4_new_9,field5_new_9,field6_new_9,field7_new_9,field8_new_9,field9_new_9,c,1.769035360730318e+18,,/Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791/2026-01-21/202601212242310000000000000000001-b409b8db75c1cc45-1-30-00000002-test_parquet_usertable_with_split+data-1.parquet,2026-01-22 00:17:51.542
9,newuser0000000010,1.769035360730318e+18,UPSERT,field0_new_10,field1_new_10,field2_new_d3d9446802a44259755d38e6d163e820,field3_new_yyyyyyyyyyy,field4_new_10,field5_new_10,field6_new_10,field7_new_10,field8_new_10,field9_new_10,c,1.769035360730318e+18,,/Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791/2026-01-21/202601212242310000000000000000001-b409b8db75c1cc45-1-30-00000002-test_parquet_usertable_with_split+data-1.parquet,2026-01-22 00:17:51.542


In [10]:
# Display operation breakdown
display(spark.table(TARGET_TABLE_PATH).groupBy("_cdc_operation").count())


Unnamed: 0,_cdc_operation,count
0,UPSERT,9950



## Summary

### What Just Happened

1. **Auto-detection**: Primary keys and column families detected from CockroachDB
2. **Loading**: Parquet files loaded from Unity Catalog Volume with Autoloader
3. **Transformation**: CDC metadata added (operation type, timestamp, source file)
4. **Merging**: Column family fragments merged into complete rows
5. **Writing**: Data written to Delta table with streaming aggregation
6. **Verification**: Row counts compared between Delta table and source files
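
The verification in step 6 compares row counts. Equal counts do not prove that both sides hold the same rows, so a stricter check could compare primary key sets. The sketch below is a plain Python illustration under that assumption; `key_set` and `compare_key_sets` are hypothetical helpers and are not part of the `cockroachdb` module.

```python
def key_set(rows, primary_keys):
    """Collect the primary key tuples present in a list of row dicts."""
    return {tuple(row[pk] for pk in primary_keys) for row in rows}


def compare_key_sets(source_rows, target_rows, primary_keys):
    """Report keys found on only one side (hypothetical helper)."""
    source = key_set(source_rows, primary_keys)
    target = key_set(target_rows, primary_keys)
    return {
        "missing_in_target": sorted(source - target),
        "unexpected_in_target": sorted(target - source),
    }


# Hypothetical data: same row count on both sides, different keys.
source_rows = [{"ycsb_key": "u1"}, {"ycsb_key": "u2"}]
target_rows = [{"ycsb_key": "u2"}, {"ycsb_key": "u3"}]
report = compare_key_sets(source_rows, target_rows, ["ycsb_key"])
print(report)  # {'missing_in_target': [('u1',)], 'unexpected_in_target': [('u3',)]}
```

For tables of the size used in this notebook, the row dicts could be collected from Spark with `Row.asDict()`; for large tables a DataFrame-level comparison would be more appropriate.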

### Next Steps

- Test more scenarios by changing `TEST_SCENARIO` in Step 2
- Compare results across different format/split combinations
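- Validate that `test-parquet_usertable_with_split` produces the correct row count (9,950 in the run above) rather than an 11x inflated count, which would indicate that column family fragments were not merged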

### Key Learnings

- **Automation**: One function call replaces 8 manual steps
- **Auto-detection**: No need to manually specify primary keys or check for column families
- **Verification**: Built-in validation ensures data integrity
- **Flexibility**: All steps can be controlled with optional parameters


## Alternative: Test Iterator Pattern (Community Connector)

This section demonstrates using the **Iterator Pattern** (Pattern 1) instead of Autoloader.
The iterator pattern is useful for:
- Testing and prototyping
- Low-volume workloads
- Batch processing with manual control

### Supported Modes:
1. **`volume`** - Read JSON/Parquet files from Unity Catalog Volumes (file-based)
2. **`azure_parquet`** - Read Parquet files from Azure Blob Storage (file-based)
3. **`direct`** - Instream changefeed connection (live CDC)

### How to Test Different Modes

To test different data sources, simply change the `ITERATOR_MODE` variable in the configuration cell:

#### Mode 1: Volume (File-based - JSON or Parquet) ⭐ RECOMMENDED
```python
ITERATOR_MODE = ConnectorMode.VOLUME
```
- **Use case:** Testing with files already synced to Unity Catalog
- **Data source:** JSON or Parquet files from `test_cdc_matrix.sh`
- **Format:** Auto-detected from `TEST_FORMAT` variable (set in Step 2)
- **Pros:** Fast, reliable, reproducible, supports both JSON and Parquet
- **Cons:** Requires files to be synced first

**To switch between JSON and Parquet in Volume mode:**
- Change `TEST_FORMAT = "json"` or `TEST_FORMAT = "parquet"` in **Step 2**
- The volume path and format will be auto-configured

#### Mode 2: Azure Parquet (File-based - Parquet only)
```python
ITERATOR_MODE = ConnectorMode.AZURE_PARQUET
```
- **Use case:** Reading directly from Azure Blob Storage
- **Data source:** Parquet files written by CockroachDB changefeeds
- **Format:** Parquet only
- **Pros:** No volume sync needed, direct access to Azure
- **Cons:** Requires Azure credentials, Parquet only (no JSON support)

#### Mode 3: Direct Instream (Live CDC) ⚠️ USE WITH CAUTION
```python
ITERATOR_MODE = ConnectorMode.DIRECT
```
- **Use case:** Live CDC testing, development, real-time data
- **Data source:** Direct changefeed connection to CockroachDB
- **Format:** JSON (sinkless changefeed)
- **Pros:** Real-time data, no file storage needed, tests live connection
- **Cons:** Requires active CockroachDB connection, may run indefinitely
- **⚠️ Note:** This creates a live changefeed connection - use safety limits!
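
One way to apply such safety limits is to wrap the batch source in a generator that stops after a batch count or a time budget. The sketch below is a generic illustration, not connector code: `bounded` and its parameters are hypothetical. Note that the time budget is only checked between batches, so it cannot interrupt a read that blocks while waiting for the next batch.

```python
import time
from itertools import count, islice


def bounded(batches, max_batches=100, max_seconds=60.0):
    """Yield batches until the source ends or either limit is reached."""
    deadline = time.monotonic() + max_seconds
    for batch in islice(batches, max_batches):
        yield batch
        if time.monotonic() > deadline:
            return


# Hypothetical endless source standing in for a live changefeed.
endless = ([i] for i in count())
print(list(bounded(endless, max_batches=3)))  # [[0], [1], [2]]
```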

### Quick Reference: Testing Combinations

| Scenario | Step 2: TEST_FORMAT | Configuration Cell: ITERATOR_MODE | Result |
|----------|---------------------|-----------------------------------|---------|
| **JSON files from Volume** | `"json"` | `ConnectorMode.VOLUME` | Reads JSON files from Unity Catalog |
| **Parquet files from Volume** | `"parquet"` | `ConnectorMode.VOLUME` | Reads Parquet files from Unity Catalog |
| **Parquet from Azure Blob** | `"parquet"` | `ConnectorMode.AZURE_PARQUET` | Reads Parquet from Azure Storage |
| **Live CDC (Instream)** | (any) | `ConnectorMode.DIRECT` | Creates live changefeed connection |

**Example Workflow:**

1. **Test JSON files:**
   - Step 2: Set `TEST_FORMAT = "json"`
   - Configuration cell: Set `ITERATOR_MODE = ConnectorMode.VOLUME`
   - Run iterator test → Reads JSON files

2. **Test Parquet files:**
   - Step 2: Set `TEST_FORMAT = "parquet"`
   - Configuration cell: Keep `ITERATOR_MODE = ConnectorMode.VOLUME`
   - Run iterator test → Reads Parquet files

3. **Test live CDC:**
   - Configuration cell: Set `ITERATOR_MODE = ConnectorMode.DIRECT`
   - Run iterator test → Connects to CockroachDB changefeed
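
The iterator test in each of these workflows follows the same read loop: call `read_table()` with the current offset, collect the batch, and stop when a batch is empty or the cursor no longer advances. The sketch below reproduces that loop against a stand-in connector so it can run without Databricks. `FakeConnector` and `read_all` are hypothetical; only the `read_table(table_name, start_offset, table_options)` call shape and the `{"cursor": ...}` offset are taken from this notebook.

```python
class FakeConnector:
    """Hypothetical stand-in that serves fixed batches keyed by a string cursor."""

    def __init__(self, batches):
        self._batches = batches

    def read_table(self, table_name, start_offset, table_options):
        position = int(start_offset.get("cursor") or 0)
        if position >= len(self._batches):
            return iter([]), start_offset  # nothing new: cursor unchanged
        return iter(self._batches[position]), {"cursor": str(position + 1)}


def read_all(connector, table_name, max_batches=100):
    """Drain a connector batch by batch, with a hard cap on the batch count."""
    offset = {"cursor": ""}
    records = []
    for _ in range(max_batches):
        batch_iter, next_offset = connector.read_table(table_name, offset, {})
        batch = list(batch_iter)
        if not batch:
            break  # no more records
        records.extend(batch)
        if next_offset == offset:
            break  # cursor did not advance
        offset = next_offset
    return records, offset


connector = FakeConnector([[{"id": 1}, {"id": 2}], [{"id": 3}]])
records, final_offset = read_all(connector, "usertable")
print(len(records), final_offset)  # 3 {'cursor': '2'}
```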

In [11]:
import importlib
import cockroachdb
importlib.reload(cockroachdb)
from cockroachdb import LakeflowConnect, ConnectorMode

In [None]:
# ============================================================================
# 🔧 CONFIGURE ITERATOR PATTERN MODE (Using ConnectorMode Enum)
# ============================================================================

# Select which mode to test using the ConnectorMode enum:
#   - ConnectorMode.VOLUME        : Read JSON/Parquet from Unity Catalog Volumes (recommended)
#   - ConnectorMode.AZURE_PARQUET : Read Parquet from Azure Blob Storage
#   - ConnectorMode.AZURE_JSON    : Read JSON from Azure Blob Storage
#   - ConnectorMode.DIRECT        : Instream changefeed (live CDC connection)

ITERATOR_MODE = ConnectorMode.VOLUME  # ⭐ Change this to test different modes

print("="*80)
print("ITERATOR PATTERN CONFIGURATION")
print("="*80)
print(f"Mode: {ITERATOR_MODE.value} (enum: {ITERATOR_MODE.name})")

# Build connector options based on selected mode
connector_options = {}

if ITERATOR_MODE == ConnectorMode.VOLUME:
    # Mode 1: Read from Unity Catalog Volume (JSON or Parquet files)
    print(f"📁 Data source: Unity Catalog Volume")
    print(f"   Path: {VOLUME_PATH}")
    print(f"   Format: Auto-detected from path ({TEST_FORMAT})")
    
    connector_options = {
        'mode': ConnectorMode.VOLUME.value,  # Convert enum to string
        'volume_path': VOLUME_PATH,
        'spark': spark,
        'dbutils': dbutils
    }

elif ITERATOR_MODE == ConnectorMode.AZURE_PARQUET:
    # Mode 2: Read from Azure Blob Storage (Parquet files)
    print(f"☁️  Data source: Azure Blob Storage (Parquet)")
    print(f"   Account: {pipeline_config.get('azure_account', 'N/A')}")
    print(f"   Container: {pipeline_config.get('container_name', 'N/A')}")
    
    # Construct Azure path prefix
    azure_path_prefix = f"{TEST_FORMAT}/{CRDB_CATALOG}/{CRDB_SCHEMA}/{TEST_SCENARIO}"
    print(f"   Path prefix: {azure_path_prefix}")
    
    connector_options = {
        'mode': ConnectorMode.AZURE_PARQUET.value,
        'azure_account': pipeline_config.get('azure_account'),
        'azure_key': pipeline_config.get('azure_key'),
        'container_name': pipeline_config.get('container_name'),
        'azure_path_prefix': azure_path_prefix,
        'format': 'parquet'  # Azure mode only supports Parquet
    }

elif ITERATOR_MODE == ConnectorMode.DIRECT:
    # Mode 3: Direct instream changefeed connection
    print(f"🔴 Data source: Instream changefeed (Live CDC)")
    print(f"   Host: {crdb_config.get('host', 'N/A')}")
    print(f"   Database: {crdb_config.get('database', 'N/A')}")
    print(f"   ⚠️  Note: This will create a live changefeed connection")
    
    connector_options = {
        'mode': ConnectorMode.DIRECT.value,
        'connection_url': crdb_config.get('connection_url'),
        'catalog': CRDB_CATALOG,
        'schema': CRDB_SCHEMA,
        'format': 'json'  # Direct mode uses JSON format
    }

else:
    valid_modes = [mode.name for mode in ConnectorMode]
    raise ValueError(f"Unknown mode: {ITERATOR_MODE}. Valid modes: {valid_modes}")

print("="*80)
print()

ITERATOR PATTERN CONFIGURATION
Mode: volume (enum: VOLUME)
📁 Data source: Unity Catalog Volume
   Path prefix: /Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split
   Resolved path: /Volumes/main/robert_lee_cockroachdb/parquet_files/parquet/defaultdb/public/test-parquet_usertable_with_split/1769034791
   Format: Auto-detected from path (parquet)



In [None]:
# ============================================================================
# Test Community Connector Iterator Pattern
# ============================================================================

from cockroachdb import LakeflowConnect

# Initialize connector with selected mode (configured in previous cell)
# Use .value so the message shows the mode string regardless of how the enum formats itself
print(f"🔌 Initializing connector in '{ITERATOR_MODE.value}' mode...")
connector = LakeflowConnect(connector_options)
print(f"✅ Connector initialized!")

# Get table schema
print(f"\n📋 Fetching table schema for '{SOURCE_TABLE}'...")
schema = connector.get_table_schema(SOURCE_TABLE, {})
print(f"✅ Table schema: {len(schema.fields)} fields")

# Get table metadata (primary keys, etc.)
print(f"\n📋 Fetching table metadata...")
metadata = connector.read_table_metadata(SOURCE_TABLE, {})
print(f"✅ Primary keys: {metadata['primary_keys']}")
print(f"✅ Ingestion type: {metadata['ingestion_type']}")

# Initialize offset for reading
start_offset = {"cursor": ""}

# Read data using iterator pattern
print(f"\n📖 Reading data from {SOURCE_TABLE} using iterator pattern...")
print(f"   Mode: {ITERATOR_MODE.value}")
all_records = []
batch_count = 0

while True:
    # Read one batch
    record_iterator, end_offset = connector.read_table(
        table_name=SOURCE_TABLE,
        start_offset=start_offset,
        table_options={}
    )
    
    # Collect records from iterator
    batch_records = list(record_iterator)
    
    if not batch_records:
        print(f"✅ No more records. Finished reading.")
        break
    
    batch_count += 1
    all_records.extend(batch_records)
    print(f"   Batch {batch_count}: {len(batch_records):,} records (cursor: {end_offset.get('cursor', 'N/A')[:20]}...)")
    
    # Check if we're done (offset didn't change)
    if end_offset.get('cursor') == start_offset.get('cursor'):
        print(f"✅ Cursor unchanged. Finished reading.")
        break
    
    # Update offset for next iteration
    start_offset = end_offset
    
    # Safety limit for instream mode (prevent infinite loop)
    if ITERATOR_MODE == ConnectorMode.DIRECT and batch_count >= 100:
        print(f"⚠️  Safety limit reached (100 batches). Stopping.")
        break

print(f"\n📊 Total records read: {len(all_records):,}")
print(f"📊 Total batches: {batch_count}")

# Show sample records
if all_records:
    print(f"\n📋 Sample record (first):")
    sample = all_records[0]
    for key, value in list(sample.items())[:5]:
        print(f"   {key}: {value}")
    
    # Check CDC operations
    operations = {}
    for record in all_records:
        op = record.get('_cdc_operation', 'UNKNOWN')
        operations[op] = operations.get(op, 0) + 1
    
    print(f"\n📊 CDC Operations:")
    for op, count in sorted(operations.items()):
        print(f"   {op}: {count:,}")

🔌 Initializing connector in 'volume' mode...
✅ Connector initialized!

📖 Reading 'usertable' using iterator pattern...
   Mode: volume
   Primary keys: ['ycsb_key']

Reading table: usertable (mode=volume, cursor=)
   Batch 1: 9,950 records (cursor: 20260121224231000000...)
Reading table: usertable (mode=volume, cursor=202601212242310000000000000000001-b409b8db75c1cc45-1-30-00000003-test_parquet_usertable_with_split+pk-1.parquet)
✅ No more records. Finished reading.

📊 Summary:
   Total records: 9,950
   Total batches: 1
   CDC Operations:
      SNAPSHOT: 9,950

📋 Sample record (first):
   ycsb_key: newuser0000000002
   _cdc_updated: 1769035360730318091.0000000000
   _cdc_operation: SNAPSHOT
   field0: field0_new_2
   field1: field1_new_2



### Optional: Write Iterator Data to Delta Table

Now that we have the records from the iterator, we can write them to a Delta table manually:

In [14]:
# Convert iterator records to DataFrame and write to Delta
# NOTE: Deduplication is now handled by the connector's _read_table_from_volume() method!
# The connector already applied:
#   ✅ Column family fragment merging (if needed)
#   ✅ Deduplication by primary key (keep latest by timestamp)
#   ✅ DELETE operation filtering
# So we just need to write the final results to Delta!


# Create target table name for iterator pattern
iterator_target_table = f"{TARGET_TABLE}_iterator"
iterator_table_path = f"{CATALOG}.{SCHEMA}.{iterator_target_table}"

# Convert records to DataFrame
if all_records:
    print(f"\n📊 Iterator Pattern Results:")
    print(f"   Total records: {len(all_records):,}")
    print(f"   (Deduplication already applied by connector)")
    
    # Convert to pandas first (handles variable columns better)
    import pandas as pd
    pdf = pd.DataFrame(all_records)
    
    print(f"   Pandas DataFrame: {len(pdf)} rows, {len(pdf.columns)} columns")
    print(f"   Columns: {list(pdf.columns)[:10]}...")  # Show first 10
    
    # Convert to Spark DataFrame
    df_iterator = spark.createDataFrame(pdf)
    
    # Count once and reuse the result (each count() call triggers a Spark job)
    row_count = df_iterator.count()
    print(f"\n📊 Final row count: {row_count:,}")
    
    # Write to Delta table
    print(f"\n💾 Writing iterator data to Delta table...")
    print(f"   Target: {iterator_table_path}")
    
    df_iterator.write \
        .format("delta") \
        .mode("overwrite") \
        .option("overwriteSchema", "true") \
        .saveAsTable(iterator_table_path)
    
    print(f"✅ Successfully wrote {row_count:,} rows to {iterator_table_path}")
else:
    print("⚠️ No records to write")


📊 Iterator Pattern Results:
   Total records: 9,950
   (Deduplication already applied by connector)
   Pandas DataFrame: 9950 rows, 16 columns
   Columns: ['ycsb_key', '_cdc_updated', '_cdc_operation', 'field0', 'field1', 'field2', 'field3', 'field4', 'field5', 'field6']...

📊 Final row count: 9,950

💾 Writing iterator data to Delta table...
   Target: main.robert_lee_cockroachdb.usertable_test_parquet_usertable_with_split_delta_iterator
✅ Successfully wrote 9,950 rows to main.robert_lee_cockroachdb.usertable_test_parquet_usertable_with_split_delta_iterator
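
The write above relies on the connector having already deduplicated by primary key and filtered out deletes. As a rough illustration of what that step involves, the sketch below keeps the latest event per key and then drops keys whose latest event is a delete. It is not the connector's `_read_table_from_volume()` logic: `latest_per_key` is a hypothetical helper, and the field names `_cdc_updated` and `_cdc_operation` are taken from the sample record printed earlier.

```python
def latest_per_key(records, primary_keys,
                   ts_field="_cdc_updated", op_field="_cdc_operation"):
    """Keep the newest event per primary key, then drop keys that end in DELETE."""
    latest = {}
    for record in records:
        key = tuple(record[pk] for pk in primary_keys)
        current = latest.get(key)
        if current is None or record[ts_field] >= current[ts_field]:
            latest[key] = record
    # Filter deletes only after picking the latest event, so an older
    # UPSERT cannot bring a deleted row back.
    return [r for r in latest.values() if r[op_field] != "DELETE"]


# Hypothetical events: k1 is updated twice, k2 is inserted and then deleted.
events = [
    {"ycsb_key": "k1", "_cdc_updated": 1, "_cdc_operation": "UPSERT", "field0": "old"},
    {"ycsb_key": "k2", "_cdc_updated": 1, "_cdc_operation": "UPSERT", "field0": "x"},
    {"ycsb_key": "k1", "_cdc_updated": 2, "_cdc_operation": "UPSERT", "field0": "new"},
    {"ycsb_key": "k2", "_cdc_updated": 3, "_cdc_operation": "DELETE", "field0": None},
]
survivors = latest_per_key(events, ["ycsb_key"])
print([(r["ycsb_key"], r["field0"]) for r in survivors])  # [('k1', 'new')]
```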


### Comparison: Autoloader vs Iterator Pattern

Compare results from both patterns:

In [15]:
# Compare results from both patterns
print("="*80)
print("PATTERN COMPARISON")
print("="*80)

# Autoloader Pattern (Pattern 2)
try:
    autoloader_count = spark.table(TARGET_TABLE_PATH).count()
    print(f"✅ Autoloader Pattern (Pattern 2): {autoloader_count:,} rows")
    print(f"   Table: {TARGET_TABLE_PATH}")
    print(f"   Features: Streaming, checkpointing, file tracking")
except Exception as e:
    # The table may not exist yet (pattern not run) or may be unreadable; show the reason
    print(f"⚠️ Autoloader Pattern: table not available ({type(e).__name__})")
    autoloader_count = None

# Iterator Pattern (Pattern 1)
try:
    iterator_count = spark.table(iterator_table_path).count()
    print(f"\n✅ Iterator Pattern (Pattern 1): {iterator_count:,} rows")
    print(f"   Table: {iterator_table_path}")
    print(f"   Features: Manual batching, cursor tracking, memory efficient")
except Exception as e:
    # The table may not exist yet (pattern not run) or may be unreadable; show the reason
    print(f"\n⚠️ Iterator Pattern: table not available ({type(e).__name__})")
    iterator_count = None

# Compare
if autoloader_count is not None and iterator_count is not None:
    if autoloader_count == iterator_count:
        print(f"\n✅✅✅ MATCH! Both patterns produced {autoloader_count:,} rows")
        print("Both patterns are working correctly!")
    else:
        diff = abs(autoloader_count - iterator_count)
        print(f"\n⚠️ MISMATCH: {diff:,} row difference")
        print("Review the data transformations in each pattern")

print("\n" + "="*80)
print("PATTERN CHARACTERISTICS")
print("="*80)
print("""
Pattern 1 - Iterator (Community Connector):
  ✅ Simple, explicit control
  ✅ Works with batch processing
  ✅ Good for testing/prototyping
  ⚠️ Manual cursor management
  ⚠️ No built-in streaming
  
Pattern 2 - Autoloader (Standalone):
  ✅ Automated file tracking
  ✅ Built-in checkpointing
  ✅ Streaming capable
  ✅ Production-ready
  ⚠️ More complex setup

Recommendation:
  - Use Pattern 1 for: Testing, development, low-volume workloads
  - Use Pattern 2 for: Production, high-volume, continuous CDC
""")

PATTERN COMPARISON
✅ Autoloader Pattern (Pattern 2): 9,950 rows
   Table: main.robert_lee_cockroachdb.usertable_test_parquet_usertable_with_split_delta
   Features: Streaming, checkpointing, file tracking

✅ Iterator Pattern (Pattern 1): 9,950 rows
   Table: main.robert_lee_cockroachdb.usertable_test_parquet_usertable_with_split_delta_iterator
   Features: Manual batching, cursor tracking, memory efficient

✅✅✅ MATCH! Both patterns produced 9,950 rows
Both patterns are working correctly!

PATTERN CHARACTERISTICS

Pattern 1 - Iterator (Community Connector):
  ✅ Simple, explicit control
  ✅ Works with batch processing
  ✅ Good for testing/prototyping
  ⚠️ Manual cursor management
  ⚠️ No built-in streaming

Pattern 2 - Autoloader (Standalone):
  ✅ Automated file tracking
  ✅ Built-in checkpointing
  ✅ Streaming capable
  ✅ Production-ready
  ⚠️ More complex setup

Recommendation:
  - Use Pattern 1 for: Testing, development, low-volume workloads
  - Use