# Snapshot Sync Demonstration

This notebook demonstrates the `snapshot_sync` method of `AdapterStore`, which synchronizes the table to match a complete snapshot of data.

## Key Characteristics of snapshot_sync

- **Full Synchronization**: Updates, inserts, AND deletes to match the snapshot exactly
- **Soft Deletes**: Records missing from snapshot are marked with `deleted=True` and **preserve their content** (not physically removed)
- **No Timestamp Checking**: Unlike `incremental_update`, `snapshot_sync` neither requires timestamps nor uses them to gate updates
- **Timestamp Writing**: It still writes a `last_modified` timestamp to every changed record (auto-generated if not provided)
- **Idempotent**: Applying the same snapshot twice produces no changes

**Use case**: Full harvests where you receive a complete dataset and want the table to match exactly (e.g., nightly full dumps).
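
The characteristics listed above can be modelled in a few lines of plain Python. The following is a minimal in-memory sketch, not `AdapterStore`'s implementation: the function `sketch_snapshot_sync`, the dict-based stand-in for the table, and the use of `uuid4` for changeset ids are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional
from uuid import uuid4


def sketch_snapshot_sync(rows: dict[str, dict], snapshot: dict[str, str]) -> Optional[str]:
    """Illustrative model only: sync `rows` (id -> row) to `snapshot` (id -> content).

    Returns a new changeset id, or None when nothing changed.
    """
    changed_ids = []
    for rec_id, content in snapshot.items():
        row = rows.get(rec_id)
        # Insert, update, or restore: the row is new, differs, or was soft-deleted.
        if row is None or row["content"] != content or row["deleted"]:
            rows[rec_id] = {"content": content, "deleted": False}
            changed_ids.append(rec_id)
    for rec_id, row in rows.items():
        # Soft delete: flag rows absent from the snapshot but keep their content.
        if rec_id not in snapshot and not row["deleted"]:
            row["deleted"] = True
            changed_ids.append(rec_id)
    if not changed_ids:
        return None  # idempotent: an identical snapshot creates no changeset
    changeset_id = str(uuid4())
    now = datetime.now(timezone.utc)
    for rec_id in changed_ids:
        # Only changed rows receive the new changeset id and timestamp.
        rows[rec_id]["changeset"] = changeset_id
        rows[rec_id]["last_modified"] = now
    return changeset_id


model: dict[str, dict] = {}
first = sketch_snapshot_sync(model, {"rec001": "First record", "rec002": "Second record"})
again = sketch_snapshot_sync(model, {"rec001": "First record", "rec002": "Second record"})
third = sketch_snapshot_sync(model, {"rec001": "First record"})

print(again)                       # None
print(model["rec002"]["deleted"])  # True
print(model["rec002"]["content"])  # Second record
```

The scenarios below exercise the same behaviors against a real Iceberg table through `store.snapshot_sync`.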

In [None]:
from uuid import uuid1
import pyarrow as pa
from adapters.utils.adapter_store import AdapterStore
from adapters.utils.iceberg import get_local_table

# Create a local Iceberg table for testing
table_name = f"demo_{str(uuid1())[:8]}"
table = get_local_table(
    table_name=table_name,
    namespace="demo",
    db_name="demo_catalog",
)
store = AdapterStore(table, default_namespace="snapshot_namespace")
print(f"Created table: {table_name}")

In [None]:
from adapters.utils.schemata import ARROW_SCHEMA

def create_record_table(records: list[dict], namespace: str = "snapshot_namespace") -> pa.Table:
    """Create a PyArrow table from a list of records.
    
    Note: snapshot_sync does NOT require timestamps, but will auto-generate them if missing.
    You can optionally provide last_modified values in records.
    """
    for record in records:
        record["namespace"] = namespace
        # Ensure optional fields have defaults
        record.setdefault("last_modified", None)
        record.setdefault("deleted", None)
    
    return pa.Table.from_pylist(records, schema=ARROW_SCHEMA)

def display_table(title: str):
    """Display current table contents."""
    print(f"\n{title}")
    print("=" * 100)
    current = table.scan().to_arrow().sort_by("id")
    if current.num_rows == 0:
        print("(empty table)")
    else:
        df = current.to_pandas()
        # Format for better readability
        print(df.to_string(index=False))
    return current

def display_changeset_summary():
    """Display a summary of changesets in the table."""
    print("\n📊 Changeset Summary")
    print("=" * 100)
    current = table.scan().to_arrow()
    if current.num_rows == 0:
        print("(empty table)")
        return
    
    df = current.to_pandas()
    # Group by changeset
    changeset_counts = df.groupby('changeset', dropna=False).size()
    print(f"\nTotal records: {len(df)}")
    print(f"Changesets: {changeset_counts.to_dict()}")
    
    # Show deleted records count (based on deleted flag, not content)
    deleted_count = df['deleted'].fillna(False).sum()
    if deleted_count > 0:
        print(f"🗑️  Deleted records (deleted=True): {int(deleted_count)}")

## Scenario 1: Initial Load into Empty Table

Starting with an empty table, load initial records.

In [None]:
initial_snapshot = create_record_table([
    {"id": "rec001", "content": "First record"},
    {"id": "rec002", "content": "Second record"},
    {"id": "rec003", "content": "Third record"},
])

result = store.snapshot_sync(initial_snapshot)
changeset_id = result.changeset_id
print(f"✅ Initial load complete. Changeset ID: {changeset_id}")

current = display_table("Initial State")
display_changeset_summary()

# Verify all records have the same changeset
assert current.column("changeset").to_pylist().count(changeset_id) == 3
print("\n✓ All records share the same changeset_id")

## Scenario 2: Idempotent Sync (No Changes)

Apply the same snapshot again. The call should return `None` and leave the table unchanged.

In [None]:
same_snapshot = create_record_table([
    {"id": "rec001", "content": "First record"},
    {"id": "rec002", "content": "Second record"},
    {"id": "rec003", "content": "Third record"},
])

result = store.snapshot_sync(same_snapshot)

if result is None:
    print("✅ No changes detected (idempotent)")
else:
    print(f"❌ Unexpected: changeset created: {result.changeset_id}")

# Re-read the table so the check below sees its current state
current = display_table("After Idempotent Sync (should be unchanged)")
display_changeset_summary()

# Verify no new records created
assert current.num_rows == 3
print("\n✓ Record count unchanged")

## Scenario 3: Content Updates

Update some existing records with different content.

In [None]:
updated_snapshot = create_record_table([
    {"id": "rec001", "content": "UPDATED first record"},
    {"id": "rec002", "content": "Second record"},  # unchanged
    {"id": "rec003", "content": "UPDATED third record"},
])

result_2 = store.snapshot_sync(updated_snapshot)
changeset_id_2 = result_2.changeset_id
print(f"✅ Updates applied. Changeset ID: {changeset_id_2}")

current = display_table("After Content Updates")
display_changeset_summary()

# Verify only changed records have new changeset
records = current.to_pylist()
rec001 = [r for r in records if r["id"] == "rec001"][0]
rec002 = [r for r in records if r["id"] == "rec002"][0]
rec003 = [r for r in records if r["id"] == "rec003"][0]

assert rec001["changeset"] == changeset_id_2, "rec001 should have new changeset"
assert rec002["changeset"] == changeset_id, "rec002 should retain old changeset (unchanged)"
assert rec003["changeset"] == changeset_id_2, "rec003 should have new changeset"

print("\n✓ Only changed records received new changeset_id")

## Scenario 4: Insertions

Add new records to the snapshot.

In [None]:
snapshot_with_new = create_record_table([
    {"id": "rec001", "content": "UPDATED first record"},
    {"id": "rec002", "content": "Second record"},
    {"id": "rec003", "content": "UPDATED third record"},
    {"id": "rec004", "content": "NEW fourth record"},
    {"id": "rec005", "content": "NEW fifth record"},
])

result_3 = store.snapshot_sync(snapshot_with_new)
changeset_id_3 = result_3.changeset_id
print(f"✅ Insertions applied. Changeset ID: {changeset_id_3}")

current = display_table("After Insertions")
display_changeset_summary()

# Verify new records exist and have the new changeset
records = current.to_pylist()
rec004 = [r for r in records if r["id"] == "rec004"][0]
rec005 = [r for r in records if r["id"] == "rec005"][0]

assert rec004["changeset"] == changeset_id_3
assert rec005["changeset"] == changeset_id_3
assert current.num_rows == 5

print("\n✓ New records inserted with correct changeset_id")

## Scenario 5: Soft Deletes

Remove records from the snapshot. They should be marked as deleted (`deleted=True`) while **preserving their content** rather than being physically removed.
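
Because soft-deleted rows stay in the table, a downstream reader has to filter on the `deleted` flag itself. A minimal sketch, assuming rows are plain dicts such as those returned by `to_pylist()`: the helper name `split_live_and_deleted` is hypothetical, and treating a missing or `None` flag as "not deleted" is an assumption that mirrors the `deleted=None` default set in `create_record_table`.

```python
def split_live_and_deleted(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition rows on the `deleted` flag; None or missing counts as live (assumption)."""
    live, deleted = [], []
    for row in rows:
        (deleted if row.get("deleted") is True else live).append(row)
    return live, deleted


# Hand-written sample rows, not output from a real table
rows = [
    {"id": "rec001", "content": "UPDATED first record", "deleted": None},
    {"id": "rec003", "content": "UPDATED third record", "deleted": True},
    {"id": "rec002", "content": "Second record", "deleted": False},
]
live, deleted = split_live_and_deleted(rows)
print([r["id"] for r in live])     # ['rec001', 'rec002']
print([r["id"] for r in deleted])  # ['rec003']
```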

In [None]:
snapshot_with_deletions = create_record_table([
    {"id": "rec001", "content": "UPDATED first record"},
    {"id": "rec002", "content": "Second record"},
    # rec003, rec004, rec005 are missing - should be deleted
])

result_4 = store.snapshot_sync(snapshot_with_deletions)
changeset_id_4 = result_4.changeset_id
print(f"✅ Deletions applied. Changeset ID: {changeset_id_4}")

current = display_table("After Deletions")
display_changeset_summary()

# Verify deleted records have deleted=True and content is PRESERVED
records = current.to_pylist()
rec003 = [r for r in records if r["id"] == "rec003"][0]
rec004 = [r for r in records if r["id"] == "rec004"][0]
rec005 = [r for r in records if r["id"] == "rec005"][0]

assert rec003["deleted"] is True, "rec003 should be marked as deleted"
assert rec004["deleted"] is True, "rec004 should be marked as deleted"
assert rec005["deleted"] is True, "rec005 should be marked as deleted"

# Content is preserved on deleted records!
assert rec003["content"] == "UPDATED third record", "rec003 content should be preserved"
assert rec004["content"] == "NEW fourth record", "rec004 content should be preserved"
assert rec005["content"] == "NEW fifth record", "rec005 content should be preserved"

assert rec003["changeset"] == changeset_id_4
assert rec004["changeset"] == changeset_id_4
assert rec005["changeset"] == changeset_id_4

print("\n✓ Deleted records marked with deleted=True")
print("✓ Content is PRESERVED on deleted records")
print("✓ All deleted records have the same changeset_id")

# Show deleted records separately
print("\n🗑️  Deleted Records (with preserved content):")
deleted_records = [r for r in records if r.get("deleted") is True]
for rec in deleted_records:
    print(f"  - {rec['id']}: '{rec['content']}' (changeset: {rec['changeset'][:8]}...)")

## Scenario 6: Undelete / Restore

Restore a previously deleted record by including it in the snapshot again. This should update the existing row in place and clear its `deleted` flag, without creating a duplicate.

In [None]:
snapshot_with_restore = create_record_table([
    {"id": "rec001", "content": "UPDATED first record"},
    {"id": "rec002", "content": "Second record"},
    {"id": "rec003", "content": "RESTORED"},  # Restoring this one
])

result_5 = store.snapshot_sync(snapshot_with_restore)
changeset_id_5 = result_5.changeset_id
current = display_table("After Restore")
display_changeset_summary()
records = current.to_pylist()
rec003 = [r for r in records if r["id"] == "rec003"][0]

assert rec003["content"] == "RESTORED"
assert rec003["changeset"] == changeset_id_5
assert rec003.get("deleted") is not True, "rec003 should no longer be deleted"
assert current.num_rows == 5, "Total records should still be 5"

print("\n✓ rec003 restored successfully")
print(f"✓ Total records: {current.num_rows}")

## Scenario 7: Mixed Operations (Insert + Update + Delete)

Apply a snapshot with all three operations simultaneously.
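
To see which operations a single snapshot implies before applying it, the snapshot can be compared with the current rows by id. This is a sketch under assumptions: `classify_changes` is a hypothetical helper working on plain dicts, and it is not how `snapshot_sync` computes its diff internally.

```python
def classify_changes(current: dict[str, dict], snapshot: dict[str, str]) -> dict[str, list[str]]:
    """Group record ids by the operation a snapshot would imply (illustrative only)."""
    plan = {"insert": [], "update": [], "restore": [], "delete": []}
    for rec_id, content in snapshot.items():
        row = current.get(rec_id)
        if row is None:
            plan["insert"].append(rec_id)
        elif row.get("deleted") is True:
            plan["restore"].append(rec_id)
        elif row["content"] != content:
            plan["update"].append(rec_id)
    for rec_id, row in current.items():
        # Rows that are already soft-deleted need no further action.
        if rec_id not in snapshot and row.get("deleted") is not True:
            plan["delete"].append(rec_id)
    return plan


# Hand-written sample state, not read from a real table
current_rows = {
    "rec001": {"content": "old", "deleted": False},
    "rec002": {"content": "same", "deleted": False},
    "rec003": {"content": "gone", "deleted": True},
    "rec004": {"content": "stale", "deleted": False},
}
plan = classify_changes(
    current_rows,
    {"rec001": "new", "rec002": "same", "rec003": "back", "rec005": "fresh"},
)
print(plan)
# {'insert': ['rec005'], 'update': ['rec001'], 'restore': ['rec003'], 'delete': ['rec004']}
```

An empty plan (all four lists empty) corresponds to the idempotent case from Scenario 2.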

In [None]:
mixed_snapshot = create_record_table([
    {"id": "rec001", "content": "UPDATED first record"},  # unchanged
    {"id": "rec002", "content": "UPDATED second record"},  # update
    {"id": "rec006", "content": "NEW sixth record"},  # insert
    # rec003 is missing - should be deleted again
])

result_6 = store.snapshot_sync(mixed_snapshot)
changeset_id_6 = result_6.changeset_id

current = display_table("After Mixed Operations")
display_changeset_summary()
records = current.to_pylist()
rec002 = [r for r in records if r["id"] == "rec002"][0]
rec003 = [r for r in records if r["id"] == "rec003"][0]
rec006 = [r for r in records if r["id"] == "rec006"][0]

assert rec002["content"] == "UPDATED second record", "rec002 should be updated"
assert rec006["changeset"] == changeset_id_6, "rec006 should be inserted in this changeset"
assert rec003.get("deleted") is True, "rec003 should be deleted"
assert rec003["content"] == "RESTORED", "rec003 content should be preserved from before deletion"
assert current.num_rows == 6, "Total records should now be 6 (5 existing + 1 inserted)"

print("\n✓ Mixed operations successful")
print(f"✓ rec003 deleted again (content preserved: '{rec003['content']}')")
print(f"✓ Total records: {current.num_rows}")

## Summary

This notebook demonstrated the complete behavior of `snapshot_sync`:

### ✅ What We Showed

1. **Initial Load**: Populating an empty table
2. **Idempotency**: Repeated syncs with identical data produce no changes
3. **Updates**: Content changes are detected and applied
4. **Insertions**: New records are added
5. **Soft Deletes**: Missing records marked with `deleted=True` while **preserving content** (not physically deleted)
6. **Undelete/Restore**: Deleted records can be restored without creating duplicates
7. **Mixed Operations**: Insert, update, and delete in a single sync

### 🔑 Key Insight: Content Preservation

Unlike previous implementations, deleted records **preserve their content**. This is important because:
- The pipeline downstream needs to know what was deleted to take appropriate action
- It allows replaying deletions if something fails downstream
- The `deleted` flag clearly indicates the record's status while retaining the data
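
As one illustration of the replay point, the deletions written by a given changeset can be re-derived from the table later, because the deleted rows and their content are still present. A minimal sketch, assuming rows are plain dicts with the `id`, `content`, `deleted`, and `changeset` fields used in this notebook; `deletions_in_changeset` is a hypothetical helper, not part of `AdapterStore`, and the changeset ids `cs-2` and `cs-4` are made-up placeholders.

```python
def deletions_in_changeset(rows: list[dict], changeset_id: str) -> list[dict]:
    """Return the soft-deleted rows written by one changeset (illustrative only)."""
    return [
        row for row in rows
        if row.get("changeset") == changeset_id and row.get("deleted") is True
    ]


# Hand-written sample rows with placeholder changeset ids
rows = [
    {"id": "rec001", "content": "UPDATED first record", "deleted": None, "changeset": "cs-2"},
    {"id": "rec004", "content": "NEW fourth record", "deleted": True, "changeset": "cs-4"},
    {"id": "rec005", "content": "NEW fifth record", "deleted": True, "changeset": "cs-4"},
]
replayable = deletions_in_changeset(rows, "cs-4")
for row in replayable:
    print(f"replay delete of {row['id']}: {row['content']!r}")
```

This only finds rows whose latest state is the deletion: a row that was later restored carries the changeset of the restore instead.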

## Cleanup

In [None]:
# Drop the demo table
table.catalog.drop_table(f"demo.{table_name}")
print(f"✅ Dropped table: {table_name}")