# Time Travel and Rollback with Apache Iceberg

This notebook demonstrates two of Iceberg's most useful features, **time travel** and **rollback**, and what they make possible.

## Key Capabilities:
- **Query Historical Data**: Access any previous version of your table
- **Data Recovery**: Roll back to a previous state after a mistake
- **Audit and Compliance**: Track data changes over time
- **A/B Testing**: Compare different versions of data
- **Debugging**: Investigate when and how data changed

## What We'll Cover:
1. **Snapshot Management**: Understanding table versions
2. **Time Travel Queries**: Querying data at specific points in time
3. **Rollback Operations**: Reverting to previous table states
4. **Practical Examples**: Real-world use cases
5. **Best Practices**: Performance and retention considerations

## Prerequisites:
- Run notebooks 1-6 to have a table with history
- Understanding of Iceberg snapshots and metadata

## Environment Setup

Initialize Spark session and verify table access.

In [1]:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("Iceberg Time Travel and Rollback") \
    .getOrCreate()

print(f"Spark version: {spark.version}")
print("Session initialized for time travel operations")

# Set default database
spark.sql("USE rest.`play_iceberg`")
print("Using database: rest.play_iceberg")

25/07/01 13:50:30 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


Spark version: 3.5.5
Session initialized for time travel operations
Using database: rest.play_iceberg


## 1. Current Table State Analysis

Let's first examine the current state of our table and understand its history.

In [2]:
# Check current table data
current_data = spark.sql("""
SELECT COUNT(*) as total_records,
       COUNT(DISTINCT user_id) as unique_users,
       MIN(user_id) as min_user_id,
       MAX(user_id) as max_user_id,
       SUM(CASE WHEN is_active THEN 1 ELSE 0 END) as active_users
FROM users
""")

print("Current Table Statistics:")
print("=" * 40)
current_data.show()

# Show sample of current data
print("\nCurrent Table Sample (Latest 5 users):")
print("=" * 50)
spark.sql("""
SELECT user_id, username, email, is_active, country, updated_at
FROM users 
ORDER BY user_id DESC 
LIMIT 5
""").show(truncate=False)

Current Table Statistics:

+-------------+------------+-----------+-----------+------------+
|total_records|unique_users|min_user_id|max_user_id|active_users|
+-------------+------------+-----------+-----------+------------+
|         1012|        1012|          1|       3003|         510|
+-------------+------------+-----------+-----------+------------+


Current Table Sample (Latest 5 users):
+-------+-----------------------+-------------------+---------+-------+--------------------------+
|user_id|username               |email              |is_active|country|updated_at                |
+-------+-----------------------+-------------------+---------+-------+--------------------------+
|3003   |partition_test_active2 |active2@test.com   |true     |France |2025-07-01 13:48:37.205836|
|3002   |partition_test_inactive|inactive@test.com  |false    |UK     |2025-07-01 13:48:37.205836|
|3001   |partition_test_active  |active@test.com    |true     |USA    |2025-07-01 13:48:37.205836|
|2003   |global_user_3          |gl

## 2. Table History and Snapshots

Every commit to an Iceberg table creates a new snapshot with a unique ID, and the table keeps that history until snapshots are expired.

In [3]:
# Get complete table history
history_df = spark.sql("""
SELECT 
    made_current_at,
    snapshot_id,
    parent_id,
    is_current_ancestor,
    CASE 
        WHEN parent_id IS NULL THEN 'INITIAL_LOAD'
        ELSE 'UPDATE'
    END as operation_type
FROM rest.`play_iceberg`.users.history
ORDER BY made_current_at
""")

print("Complete Table History:")
print("=" * 50)
history_df.show(truncate=False)

# Store snapshot IDs for later use
snapshots = history_df.collect()
print(f"\nTotal snapshots found: {len(snapshots)}")

if len(snapshots) >= 2:
    first_snapshot = snapshots[0]['snapshot_id']
    latest_snapshot = snapshots[-1]['snapshot_id']
    print(f"First snapshot: {first_snapshot}")
    print(f"Latest snapshot: {latest_snapshot}")
else:
    print("Insufficient history for time travel demo - run previous notebooks first")

Complete Table History:
+-----------------------+-------------------+-------------------+-------------------+--------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|operation_type|
+-----------------------+-------------------+-------------------+-------------------+--------------+
|2025-07-01 13:41:23.087|3824926247776142618|NULL               |true               |INITIAL_LOAD  |
|2025-07-01 13:46:22.566|3682366184055111348|3824926247776142618|true               |UPDATE        |
|2025-07-01 13:46:29.92 |5474445044599496672|3682366184055111348|true               |UPDATE        |
|2025-07-01 13:46:43.276|703749108260326071 |5474445044599496672|true               |UPDATE        |
|2025-07-01 13:46:51.537|1680807781020980082|703749108260326071 |true               |UPDATE        |
|2025-07-01 13:47:01.334|8866811446601788626|1680807781020980082|true               |UPDATE        |
|2025-07-01 13:48:29.944|5929459840228022626|8866811446601788626|tr
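
The `operation_type` column in the query above is a heuristic: any snapshot with a parent is labelled `UPDATE`, whether it was an append, a delete, or an overwrite. Iceberg's `snapshots` metadata table records the actual operation for each snapshot, so joining it to `history` gives a more faithful picture. The following is a minimal sketch, assuming the same catalog and table names as this notebook; the helper name `history_with_operations_sql` is hypothetical.

```python
def history_with_operations_sql(table: str) -> str:
    """Build a query joining the history log to the snapshots metadata table.

    `table` is the fully qualified table name, for example
    "rest.`play_iceberg`.users".
    """
    return f"""
SELECT h.made_current_at,
       h.snapshot_id,
       h.parent_id,
       h.is_current_ancestor,
       s.operation
FROM {table}.history h
JOIN {table}.snapshots s
  ON h.snapshot_id = s.snapshot_id
ORDER BY h.made_current_at
""".strip()


query = history_with_operations_sql("rest.`play_iceberg`.users")
print(query)

# In the notebook (requires the active Spark session):
# spark.sql(query).show(truncate=False)
```

The function only builds the SQL text, so it can be inspected before anything runs against the catalog.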

## 3. Time Travel Queries

Now let's demonstrate time travel by querying the table at different points in its history.

### 3.1 Query by Snapshot ID
The most precise way to time travel is using specific snapshot IDs.

In [4]:
if len(snapshots) >= 2:
    # Query the first snapshot (original table state)
    print(f"Data at First Snapshot ({first_snapshot}):")
    print("=" * 50)
    
    first_snapshot_data = spark.sql(f"""
    SELECT user_id, username, email, is_active, updated_at
    FROM users FOR SYSTEM_VERSION AS OF {first_snapshot}
    ORDER BY user_id
    """)
    
    first_snapshot_data.show(truncate=False)
    
    # Get record count at first snapshot
    first_count = spark.sql(f"""
    SELECT COUNT(*) as record_count 
    FROM users FOR SYSTEM_VERSION AS OF {first_snapshot}
    """).collect()[0]['record_count']
    
    print(f"Records in first snapshot: {first_count}")
    
    # Compare with current state
    current_count = spark.sql("SELECT COUNT(*) as record_count FROM users").collect()[0]['record_count']
    print(f"Records in current state: {current_count}")
    print(f"Records added since first snapshot: {current_count - first_count}")
else:
    print("Need more table history to demonstrate time travel")

Data at First Snapshot (3824926247776142618):
+-------+-------------+-------------------------+---------+--------------------------+
|user_id|username     |email                    |is_active|updated_at                |
+-------+-------------+-------------------------+---------+--------------------------+
|1      |john_doe     |john.doe@example.com     |true     |2025-07-01 13:41:16.173663|
|2      |jane_smith   |jane.smith@example.com   |true     |2025-07-01 13:41:16.173663|
|3      |alice_wonder |alice.wonder@example.com |false    |2025-07-01 13:41:16.173663|
|4      |bob_builder  |bob.builder@example.com  |true     |2025-07-01 13:41:16.173663|
|5      |charlie_brown|charlie.brown@example.com|true     |2025-07-01 13:41:16.173663|
+-------+-------------+-------------------------+---------+--------------------------+

Records in first snapshot: 5
Records in current state: 1012
Records added since first snapshot: 1007
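
The same snapshot can also be read through the DataFrame reader instead of SQL, using Iceberg's `snapshot-id` read option. This is a minimal sketch rather than part of the original notebook: `read_snapshot` is a hypothetical helper, and it assumes the table identifier resolves through the configured `rest` catalog.

```python
def read_snapshot(spark, table: str, snapshot_id: int):
    """Return a DataFrame pinned to one Iceberg snapshot.

    Uses the Iceberg `snapshot-id` read option. `int()` rejects anything
    that is not a plain integer ID before it reaches Spark.
    """
    return (
        spark.read
        .format("iceberg")
        .option("snapshot-id", str(int(snapshot_id)))
        .load(table)
    )


# In the notebook (requires the active Spark session):
# df = read_snapshot(spark, "rest.play_iceberg.users", first_snapshot)
# df.orderBy("user_id").show(truncate=False)
```

Returning a DataFrame makes it easy to reuse the pinned snapshot in joins or aggregations without repeating the `FOR SYSTEM_VERSION AS OF` clause.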


### 3.2 Query by Timestamp
You can also time travel by timestamp, which is useful when you know roughly when a change occurred but not the snapshot ID. The query reads the snapshot that was current at the given time.

In [5]:
if len(snapshots) >= 2:
    # Get timestamp of first snapshot
    first_timestamp = snapshots[0]['made_current_at']
    
    print(f"Time Travel to: {first_timestamp}")
    print("=" * 50)
    
    # Query using timestamp
    timestamp_data = spark.sql(f"""
    SELECT user_id, username, email, is_active
    FROM users FOR SYSTEM_TIME AS OF '{first_timestamp}'
    ORDER BY user_id
    """)
    
    timestamp_data.show(truncate=False)
    
    # Verify this matches the snapshot query
    timestamp_count = spark.sql(f"""
    SELECT COUNT(*) as record_count 
    FROM users FOR SYSTEM_TIME AS OF '{first_timestamp}'
    """).collect()[0]['record_count']
    
    print(f"Records at timestamp {first_timestamp}: {timestamp_count}")

Time Travel to: 2025-07-01 13:41:23.087000
+-------+-------------+-------------------------+---------+
|user_id|username     |email                    |is_active|
+-------+-------------+-------------------------+---------+
|1      |john_doe     |john.doe@example.com     |true     |
|2      |jane_smith   |jane.smith@example.com   |true     |
|3      |alice_wonder |alice.wonder@example.com |false    |
|4      |bob_builder  |bob.builder@example.com  |true     |
|5      |charlie_brown|charlie.brown@example.com|true     |
+-------+-------------+-------------------------+---------+

Records at timestamp 2025-07-01 13:41:23.087000: 5
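
A timestamp query does not need to match a commit time exactly: it reads whichever snapshot was current at that moment. The sketch below models that lookup in plain Python over the first few `history` rows shown in section 2. It is a simplified illustration of the behaviour, not Iceberg's implementation, and `snapshot_as_of` is a hypothetical name.

```python
from bisect import bisect_right
from datetime import datetime

# (made_current_at, snapshot_id) pairs copied from the history output in section 2
HISTORY = [
    (datetime(2025, 7, 1, 13, 41, 23, 87000), 3824926247776142618),
    (datetime(2025, 7, 1, 13, 46, 22, 566000), 3682366184055111348),
    (datetime(2025, 7, 1, 13, 46, 29, 920000), 5474445044599496672),
    (datetime(2025, 7, 1, 13, 46, 43, 276000), 703749108260326071),
]


def snapshot_as_of(history, ts):
    """Return the ID of the snapshot that was current at `ts`.

    Picks the last history entry whose made_current_at is at or before
    `ts`, and fails if `ts` predates the first snapshot.
    """
    times = [made_current_at for made_current_at, _ in history]
    idx = bisect_right(times, ts)
    if idx == 0:
        raise ValueError(f"no snapshot exists at or before {ts}")
    return history[idx - 1][1]


# Exactly the first commit time: resolves to the first snapshot
print(snapshot_as_of(HISTORY, datetime(2025, 7, 1, 13, 41, 23, 87000)))
# Between the second and third commits: resolves to the second snapshot
print(snapshot_as_of(HISTORY, datetime(2025, 7, 1, 13, 46, 25)))
```

This is why timestamps suit "roughly when" questions, while snapshot IDs remain the unambiguous choice when two commits land within the same second.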


## 4. Making Changes for Rollback Demo

Let's make some intentional changes that we can later roll back.

In [6]:
# Record the current snapshot before making changes
pre_change_history = spark.sql("""
SELECT snapshot_id, made_current_at 
FROM rest.`play_iceberg`.users.history 
ORDER BY made_current_at DESC 
LIMIT 1
""").collect()[0]

pre_change_snapshot = pre_change_history['snapshot_id']
pre_change_time = pre_change_history['made_current_at']

print(f"Current snapshot before changes: {pre_change_snapshot}")
print(f"Timestamp: {pre_change_time}")

# Count records before changes
before_count = spark.sql("SELECT COUNT(*) as count FROM users").collect()[0]['count']
print(f"Record count before changes: {before_count}")

Current snapshot before changes: 1926503896228031236
Timestamp: 2025-07-01 13:48:37.868000
Record count before changes: 1012


In [7]:
# Insert test rows, supplying every column in the evolved schema
print("Inserting test data for rollback demonstration...")

spark.sql("""
INSERT INTO users 
(user_id, username, email, is_active, created_year, created_month, created_day, updated_at, 
 country, registration_source, engagement_score, last_login_at) 
VALUES 
    (4001, 'test_user_1', 'test1@rollback.com', true, 2025, 7, 1, current_timestamp(), 
     'TEST', 'demo', 85.0, current_timestamp()),
    (4002, 'test_user_2', 'test2@rollback.com', false, 2025, 7, 1, current_timestamp(), 
     'TEST', 'demo', 65.5, null),
    (4003, 'temp_user', 'temp@rollback.com', true, 2025, 7, 1, current_timestamp(), 
     'TEMP', 'rollback_test', 90.0, current_timestamp())
""")

print("Test data inserted successfully!")


Inserting test data for rollback demonstration...
Test data inserted successfully!


In [8]:
# Make some updates that we'll want to rollback
print("Making updates for rollback demonstration...")

spark.sql("""
UPDATE users 
SET is_active = false, 
    updated_at = current_timestamp(),
    country = 'DEACTIVATED'
WHERE country = 'USA'
""")

print("Updates completed!")

# Show current state after changes
after_count = spark.sql("SELECT COUNT(*) as count FROM users").collect()[0]['count']
print(f"Record count after changes: {after_count}")
print(f"Records added: {after_count - before_count}")

# Show some of the changes
print("\nCurrent state after changes:")
spark.sql("""
SELECT user_id, username, email, is_active, country
FROM users 
WHERE user_id >= 101 OR country IN ('DEACTIVATED', 'TEST', 'TEMP')
ORDER BY user_id
""").show(truncate=False)

Making updates for rollback demonstration...
Updates completed!
Record count after changes: 1015
Records added: 3

Current state after changes:
+-------+-------------+-------------------------+---------+-------+
|user_id|username     |email                    |is_active|country|
+-------+-------------+-------------------------+---------+-------+
|101    |bulk_user_101|bulk.user.101@example.com|true     |NULL   |
|102    |bulk_user_102|bulk.user.102@example.com|false    |NULL   |
|103    |bulk_user_103|bulk.user.103@example.com|true     |NULL   |
|104    |bulk_user_104|bulk.user.104@example.com|false    |NULL   |
|106    |bulk_user_106|bulk.user.106@example.com|false    |NULL   |
|107    |bulk_user_107|bulk.user.107@example.com|false    |NULL   |
|108    |bulk_user_108|bulk.user.108@example.com|false    |NULL   |
|109    |bulk_user_109|bulk.user.109@example.com|false    |NULL   |
|111    |bulk_user_111|bulk.user.111@example.com|true     |NULL   |
|112    |bulk_user_112|bulk.user.112@exa

## 5. Review Updated History

Let's see how our changes affected the table history.

In [9]:
# Check updated history
updated_history = spark.sql("""
SELECT 
    made_current_at,
    snapshot_id,
    parent_id,
    is_current_ancestor
FROM rest.`play_iceberg`.users.history
ORDER BY made_current_at DESC
""")

print("Updated Table History (most recent first):")
print("=" * 55)
updated_history.show(truncate=False)

# Get the latest snapshot ID
latest_snapshots = updated_history.collect()
current_snapshot = latest_snapshots[0]['snapshot_id']
print(f"\nCurrent (latest) snapshot: {current_snapshot}")
print(f"Snapshot we want to rollback to: {pre_change_snapshot}")

Updated Table History (most recent first):
+-----------------------+-------------------+-------------------+-------------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|
+-----------------------+-------------------+-------------------+-------------------+
|2025-07-01 13:50:46.609|841024702459819741 |7274330971587238251|true               |
|2025-07-01 13:50:45.161|7274330971587238251|1926503896228031236|true               |
|2025-07-01 13:48:37.868|1926503896228031236|5929459840228022626|true               |
|2025-07-01 13:48:29.944|5929459840228022626|8866811446601788626|true               |
|2025-07-01 13:47:01.334|8866811446601788626|1680807781020980082|true               |
|2025-07-01 13:46:51.537|1680807781020980082|703749108260326071 |true               |
|2025-07-01 13:46:43.276|703749108260326071 |5474445044599496672|true               |
|2025-07-01 13:46:29.92 |5474445044599496672|3682366184055111348|true               |
|2025-07-01

## 6. Rollback Operations

Now let's demonstrate the rollback functionality to undo our changes.

### 6.1 Preview Rollback Target
Before rolling back, let's verify what the data looked like at our target snapshot.

In [10]:
# Preview what the table looked like before our changes
print(f"Preview: Data at snapshot {pre_change_snapshot} (before our changes):")
print("=" * 70)

preview_data = spark.sql(f"""
SELECT COUNT(*) as total_records,
       SUM(CASE WHEN is_active THEN 1 ELSE 0 END) as active_users,
       SUM(CASE WHEN country = 'USA' THEN 1 ELSE 0 END) as usa_users,
       MAX(user_id) as max_user_id
FROM users FOR SYSTEM_VERSION AS OF {pre_change_snapshot}
""")

preview_data.show()

# Show sample of original data
print("Sample of original data (users with country info):")
spark.sql(f"""
SELECT user_id, username, is_active, country
FROM users FOR SYSTEM_VERSION AS OF {pre_change_snapshot}
WHERE country IS NOT NULL
ORDER BY user_id
""").show(truncate=False)

Preview: Data at snapshot 1926503896228031236 (before our changes):
+-------------+------------+---------+-----------+
|total_records|active_users|usa_users|max_user_id|
+-------------+------------+---------+-----------+
|         1012|         510|        1|       3003|
+-------------+------------+---------+-----------+

Sample of original data (users with country info):
+-------+-----------------------+---------+-------+
|user_id|username               |is_active|country|
+-------+-----------------------+---------+-------+
|2001   |global_user_1          |true     |Canada |
|2002   |global_user_2          |true     |Germany|
|2003   |global_user_3          |false    |Japan  |
|3001   |partition_test_active  |true     |USA    |
|3002   |partition_test_inactive|false    |UK     |
|3003   |partition_test_active2 |true     |France |
+-------+-----------------------+---------+-------+
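
Aggregate counts confirm the target looks right, but they do not show which rows a rollback would discard. One way to preview that is to diff the current table against the target snapshot with `EXCEPT`. The sketch below only builds the two queries; `rollback_diff_sql` is a hypothetical helper, and it assumes the listed columns exist in both the current table and the target snapshot.

```python
def rollback_diff_sql(table, snapshot_id, columns):
    """Build two queries previewing the row-level effect of a rollback.

    lost_by_rollback:     rows in the current table but not in the target
    restored_by_rollback: rows in the target but not in the current table
    """
    cols = ", ".join(columns)
    current = f"SELECT {cols} FROM {table}"
    target = (
        f"SELECT {cols} FROM {table} "
        f"FOR SYSTEM_VERSION AS OF {int(snapshot_id)}"
    )
    return {
        "lost_by_rollback": f"{current} EXCEPT {target}",
        "restored_by_rollback": f"{target} EXCEPT {current}",
    }


queries = rollback_diff_sql(
    "users",
    1926503896228031236,  # pre_change_snapshot from section 4
    ["user_id", "username", "is_active", "country"],
)
for name, sql in queries.items():
    print(f"{name}:\n  {sql}")

# In the notebook (requires the active Spark session):
# spark.sql(queries["lost_by_rollback"]).show(truncate=False)
```

An updated row appears in both result sets: its new version under `lost_by_rollback` and its old version under `restored_by_rollback`.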



### 6.2 Perform the Rollback
Now let's roll back to the pre-change snapshot.

In [11]:
# Perform the rollback
print(f"Rolling back to snapshot: {pre_change_snapshot}")
print(f"Rolling back to time: {pre_change_time}")
print("=" * 50)

rollback_result = spark.sql(f"""
CALL rest.system.rollback_to_snapshot('rest.`play_iceberg`.users', {pre_change_snapshot})
""")

print("Rollback operation result:")
rollback_result.show()

print("Rollback completed successfully!")

Rolling back to snapshot: 1926503896228031236
Rolling back to time: 2025-07-01 13:48:37.868000
Rollback operation result:
+--------------------+-------------------+
|previous_snapshot_id|current_snapshot_id|
+--------------------+-------------------+
|  841024702459819741|1926503896228031236|
+--------------------+-------------------+

Rollback completed successfully!
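
Iceberg also provides a `rollback_to_timestamp` procedure for cases where only the time of the last good state is known. The sketch below builds either call using named arguments. `rollback_call` is a hypothetical helper, and the catalog name `rest` is the one assumed throughout this notebook.

```python
from datetime import datetime


def rollback_call(catalog, table, snapshot_id=None, timestamp=None):
    """Build a CALL statement for one of Iceberg's rollback procedures.

    Exactly one of `snapshot_id` (int) or `timestamp` (datetime) is required.
    """
    if (snapshot_id is None) == (timestamp is None):
        raise ValueError("pass exactly one of snapshot_id or timestamp")
    if snapshot_id is not None:
        return (
            f"CALL {catalog}.system.rollback_to_snapshot("
            f"table => '{table}', snapshot_id => {int(snapshot_id)})"
        )
    ts = timestamp.isoformat(sep=" ", timespec="milliseconds")
    return (
        f"CALL {catalog}.system.rollback_to_timestamp("
        f"table => '{table}', timestamp => TIMESTAMP '{ts}')"
    )


table_name = "rest.`play_iceberg`.users"
print(rollback_call("rest", table_name, snapshot_id=1926503896228031236))
print(rollback_call("rest", table_name,
                    timestamp=datetime(2025, 7, 1, 13, 48, 37, 868000)))

# In the notebook (requires the active Spark session):
# spark.sql(rollback_call("rest", table_name, snapshot_id=pre_change_snapshot)).show()
```

Rolling back by snapshot ID is the safer default, since a timestamp depends on the session time zone and on knowing the commit time precisely.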


### 6.3 Verify Rollback Results
Let's confirm that the rollback worked as expected.

In [12]:
# Verify the rollback worked
print("Verification: Current table state after rollback:")
print("=" * 50)

# Check record counts
post_rollback_stats = spark.sql("""
SELECT COUNT(*) as total_records,
       SUM(CASE WHEN is_active THEN 1 ELSE 0 END) as active_users,
       SUM(CASE WHEN country = 'USA' THEN 1 ELSE 0 END) as usa_users,
       MAX(user_id) as max_user_id
FROM users
""")

post_rollback_stats.show()

# Verify specific changes were rolled back
print("Checking if test users were removed:")
test_users_check = spark.sql("""
SELECT COUNT(*) as test_user_count
FROM users 
WHERE user_id >= 4001
""").collect()[0]['test_user_count']

print(f"Test users remaining: {test_users_check} (should be 0)")

# Check if USA users are active again
print("\nChecking USA users status:")
spark.sql("""
SELECT user_id, username, is_active, country
FROM users 
WHERE country = 'USA'
ORDER BY user_id
""").show(truncate=False)

# Final verification
final_count = spark.sql("SELECT COUNT(*) as count FROM users").collect()[0]['count']
print(f"\nFinal record count: {final_count}")
print(f"Original count before changes: {before_count}")
print(f"Rollback successful: {final_count == before_count}")

Verification: Current table state after rollback:
+-------------+------------+---------+-----------+
|total_records|active_users|usa_users|max_user_id|
+-------------+------------+---------+-----------+
|         1012|         510|        1|       3003|
+-------------+------------+---------+-----------+

Checking if test users were removed:
Test users remaining: 0 (should be 0)

Checking USA users status:
+-------+---------------------+---------+-------+
|user_id|username             |is_active|country|
+-------+---------------------+---------+-------+
|3001   |partition_test_active|true     |USA    |
+-------+---------------------+---------+-------+


Final record count: 1012
Original count before changes: 1012
Rollback successful: True


### 6.4 History After Rollback
Let's see how the rollback affected our table history.

In [13]:
# Check history after rollback
final_history = spark.sql(f"""
SELECT 
    made_current_at,
    snapshot_id,
    parent_id,
    is_current_ancestor,
    CASE 
        WHEN snapshot_id = {pre_change_snapshot} THEN 'ROLLBACK_TARGET'
        WHEN is_current_ancestor = false THEN 'ORPHANED'
        ELSE 'ACTIVE'
    END as status
FROM rest.`play_iceberg`.users.history
ORDER BY made_current_at
""")

print("Table History After Rollback:")
print("=" * 50)
final_history.show(truncate=False)

# Count different types of snapshots
history_summary = spark.sql("""
SELECT 
    COUNT(*) as total_snapshots,
    SUM(CASE WHEN is_current_ancestor THEN 1 ELSE 0 END) as current_lineage,
    SUM(CASE WHEN is_current_ancestor = false THEN 1 ELSE 0 END) as orphaned_snapshots
FROM rest.`play_iceberg`.users.history
""")

print("\nHistory Summary:")
history_summary.show()

print("Note: Orphaned snapshots are the ones we rolled back from.")
print("They're still available for time travel but not in the current lineage.")

Table History After Rollback:
+-----------------------+-------------------+-------------------+-------------------+---------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|status         |
+-----------------------+-------------------+-------------------+-------------------+---------------+
|2025-07-01 13:41:23.087|3824926247776142618|NULL               |true               |ACTIVE         |
|2025-07-01 13:46:22.566|3682366184055111348|3824926247776142618|true               |ACTIVE         |
|2025-07-01 13:46:29.92 |5474445044599496672|3682366184055111348|true               |ACTIVE         |
|2025-07-01 13:46:43.276|703749108260326071 |5474445044599496672|true               |ACTIVE         |
|2025-07-01 13:46:51.537|1680807781020980082|703749108260326071 |true               |ACTIVE         |
|2025-07-01 13:47:01.334|8866811446601788626|1680807781020980082|true               |ACTIVE         |
|2025-07-01 13:48:29.944|5929459840228022626|8866811

## 7. Advanced Time Travel Examples

Let's explore some advanced time travel scenarios.

### 7.1 Data Comparison Across Time
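Compare data between different points in time. Time-travel queries by snapshot ID resolve columns against that snapshot's schema, so a column added later (here, `country`) cannot be referenced in the earliest snapshots, as the error below shows.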

In [14]:
if len(snapshots) >= 2:
    # Compare user counts across snapshots
    print("Data Evolution Analysis:")
    print("=" * 40)
    
    for i, snapshot in enumerate(snapshots[:3]):  # Show first 3 snapshots
        snapshot_id = snapshot['snapshot_id']
        timestamp = snapshot['made_current_at']
        
        count_data = spark.sql(f"""
        SELECT 
            '{timestamp}' as snapshot_time,
            {snapshot_id} as snapshot_id,
            COUNT(*) as total_users,
            SUM(CASE WHEN is_active THEN 1 ELSE 0 END) as active_users,
            COUNT(DISTINCT COALESCE(country, 'NULL')) as countries
        FROM users FOR SYSTEM_VERSION AS OF {snapshot_id}
        """)
        
        print(f"\nSnapshot {i+1} ({timestamp}):")
        count_data.show(truncate=False)

Data Evolution Analysis:


AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `country` cannot be resolved. Did you mean one of the following? [`email`, `user_id`, `username`, `created_day`, `is_active`].; line 7 pos 36;
'Aggregate [2025-07-01 13:41:23.087000 AS snapshot_time#956, 3824926247776142618 AS snapshot_id#957L, count(1) AS total_users#958L, sum(CASE WHEN is_active#964 THEN 1 ELSE 0 END) AS active_users#959L, 'COUNT(distinct 'COALESCE('country, NULL)) AS countries#960]
+- SubqueryAlias rest.play_iceberg.users
   +- RelationV2[user_id#961L, username#962, email#963, is_active#964, created_year#965, created_month#966, created_day#967, updated_at#968] rest.play_iceberg.users rest.play_iceberg.users
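
The `AnalysisException` above occurs because the first snapshot predates the `country` column: the relation in the plan lists only the original eight columns. One way around this is to build the query from the columns that the snapshot actually has. The following is a minimal sketch under that assumption; `snapshot_stats_sql` is a hypothetical helper, and it returns a NULL country count for snapshots without the column.

```python
def snapshot_stats_sql(snapshot_id, available_columns, table="users"):
    """Build the per-snapshot statistics query used in section 7.1.

    `available_columns` is the column list of the snapshot being queried.
    The country count becomes NULL when the snapshot predates `country`.
    """
    if "country" in available_columns:
        countries = "COUNT(DISTINCT COALESCE(country, 'NULL'))"
    else:
        countries = "CAST(NULL AS BIGINT)"
    return f"""
SELECT {int(snapshot_id)} AS snapshot_id,
       COUNT(*) AS total_users,
       SUM(CASE WHEN is_active THEN 1 ELSE 0 END) AS active_users,
       {countries} AS countries
FROM {table} FOR SYSTEM_VERSION AS OF {int(snapshot_id)}
""".strip()


# Column list of the first snapshot, taken from the error message above
old_columns = ["user_id", "username", "email", "is_active",
               "created_year", "created_month", "created_day", "updated_at"]
print(snapshot_stats_sql(3824926247776142618, old_columns))

# In the notebook, the column list can be read from the snapshot itself:
# cols = spark.sql(
#     f"SELECT * FROM users FOR SYSTEM_VERSION AS OF {snapshot_id}"
# ).columns
# spark.sql(snapshot_stats_sql(snapshot_id, cols)).show(truncate=False)
```

Reading `.columns` from a `SELECT *` on the snapshot only analyses the query, so the schema check does not scan any data.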


### 7.2 Audit Trail Example
Track changes to specific records over time.

In [15]:
# Create an audit trail for a specific user
user_to_audit = 1  # Track changes to user_id = 1

print(f"Audit Trail for User ID {user_to_audit}:")
print("=" * 50)

audit_results = []

for snapshot in snapshots:
    snapshot_id = snapshot['snapshot_id']
    timestamp = snapshot['made_current_at']
    
    try:
        user_data = spark.sql(f"""
        SELECT username, email, is_active, country, updated_at
        FROM users FOR SYSTEM_VERSION AS OF {snapshot_id}
        WHERE user_id = {user_to_audit}
        """).collect()
        
        if user_data:
            user = user_data[0]
            audit_results.append({
                'snapshot_time': timestamp,
                'username': user['username'],
                'email': user['email'],
                'is_active': user['is_active'],
                'country': user['country']
            })
            
            print(f"At {timestamp}:")
            print(f"  Username: {user['username']}")
            print(f"  Email: {user['email']}")
            print(f"  Active: {user['is_active']}")
            print(f"  Country: {user['country']}")
            print()
        else:
            print(f"At {timestamp}: User not found (may not exist yet)")
            print()
    except Exception as e:
        print(f"At {timestamp}: Error querying - {str(e)[:50]}...")
        print()

print(f"Total audit records found: {len(audit_results)}")

Audit Trail for User ID 1:
At 2025-07-01 13:41:23.087000: Error querying - [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or fu...

At 2025-07-01 13:46:22.566000: Error querying - [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or fu...

At 2025-07-01 13:46:29.920000: Error querying - [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or fu...

At 2025-07-01 13:46:43.276000: Error querying - [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or fu...

At 2025-07-01 13:46:51.537000: Error querying - [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or fu...

At 2025-07-01 13:47:01.334000: Error querying - [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or fu...

At 2025-07-01 13:48:29.944000:
  Username: updated_user_1
  Email: updated.user.1@example.com
  Active: True
  Country: None

At 2025-07-01 13:48:37.868000:
  Username: updated_user_1
  Email: updated.user.1@example.com
  Active: True
  Country: None

Total audit records found: 2
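
An audit trail is usually easier to read when it lists only the snapshots where the record actually changed. The sketch below collapses consecutive identical states in a list shaped like `audit_results`. `changes_only` is a hypothetical helper. The first sample record is user 1 as shown in the first-snapshot output of section 3.1, with `country` set to `None` because the column did not exist yet; the other two are the records the loop above collected.

```python
def changes_only(audit_results,
                 keys=("username", "email", "is_active", "country")):
    """Keep only the records whose tracked fields differ from the previous one."""
    changes = []
    previous = None
    for record in audit_results:
        state = tuple(record.get(key) for key in keys)
        if state != previous:
            changes.append(record)
            previous = state
    return changes


sample = [
    {"snapshot_time": "2025-07-01 13:41:23.087", "username": "john_doe",
     "email": "john.doe@example.com", "is_active": True, "country": None},
    {"snapshot_time": "2025-07-01 13:48:29.944", "username": "updated_user_1",
     "email": "updated.user.1@example.com", "is_active": True, "country": None},
    {"snapshot_time": "2025-07-01 13:48:37.868", "username": "updated_user_1",
     "email": "updated.user.1@example.com", "is_active": True, "country": None},
]

for record in changes_only(sample):
    print(record["snapshot_time"], record["username"])
```

Note that the loop above reports the six earliest snapshots as errors only because its query selects `country`; selecting columns that exist in each snapshot would recover those states as well.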


## 8. Best Practices and Considerations

### 8.1 Performance Considerations

In [16]:
# Check table metadata for performance insights
print("Performance Considerations:")
print("=" * 40)

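# Number of snapshots
# Note: `snapshots` was collected in section 2, so this count excludes
# the commits and the rollback made later in this notebook
total_snapshots = len(snapshots)
print(f"Total snapshots: {total_snapshots}")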

if total_snapshots > 10:
    print("RECOMMENDATION: Consider snapshot cleanup for large numbers of snapshots")

# Check file count
file_count = spark.sql("""
SELECT COUNT(*) as file_count 
FROM rest.`play_iceberg`.users.files
""").collect()[0]['file_count']

print(f"Current data files: {file_count}")

if file_count > 100:
    print("RECOMMENDATION: Consider compaction for large numbers of files")

# Best practices summary
print("\nBest Practices:")
print("1. Use snapshot IDs for precise time travel")
print("2. Use timestamps for approximate time travel")
print("3. Test rollbacks in non-production environments first")
print("4. Monitor snapshot retention policies")
print("5. Consider performance impact of many snapshots")
print("6. Document rollback procedures for your team")

Performance Considerations:
Total snapshots: 8
Current data files: 6

Best Practices:
1. Use snapshot IDs for precise time travel
2. Use timestamps for approximate time travel
3. Test rollbacks in non-production environments first
4. Monitor snapshot retention policies
5. Consider performance impact of many snapshots
6. Document rollback procedures for your team


### 8.2 Snapshot Cleanup (Optional)
Demonstrate how to clean up old snapshots if needed.

In [17]:
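# Show current snapshot count
# Note: `history` has one row per change of the current snapshot, so the
# rollback adds a row of its own; query the `snapshots` metadata table
# to count distinct snapshots instead
current_snapshot_count = spark.sql("""
SELECT COUNT(*) as snapshot_count 
FROM rest.`play_iceberg`.users.history
""").collect()[0]['snapshot_count']

print(f"Current snapshots: {current_snapshot_count}")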

# Examples of how to expire old snapshots (printed only, not executed).
# Procedure arguments must be all positional or all named, not mixed.
print("\nSnapshot cleanup options:")
print("1. Expire snapshots older than timestamp:")
print("   CALL rest.system.expire_snapshots(table => 'rest.`play_iceberg`.users', older_than => TIMESTAMP '2025-01-01 00:00:00')")
print("\n2. Keep at least the last N snapshots:")
print("   CALL rest.system.expire_snapshots(table => 'rest.`play_iceberg`.users', retain_last => 5)")
print("\nNote: Snapshot cleanup is irreversible - expired snapshots can no longer be queried or rolled back to!")

# For demo purposes, we'll not actually clean up snapshots
print("\nSkipping actual cleanup for demo safety")

Current snapshots: 11

Snapshot cleanup options:
1. Expire snapshots older than timestamp:
   CALL rest.system.expire_snapshots(table => 'rest.`play_iceberg`.users', older_than => TIMESTAMP '2025-01-01 00:00:00')

2. Keep at least the last N snapshots:
   CALL rest.system.expire_snapshots(table => 'rest.`play_iceberg`.users', retain_last => 5)

Note: Snapshot cleanup is irreversible - expired snapshots can no longer be queried or rolled back to!

Skipping actual cleanup for demo safety
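
When cleanup is scripted rather than typed by hand, building the `expire_snapshots` call from validated arguments avoids malformed statements. This is a minimal sketch using the procedure's named arguments `table`, `older_than`, and `retain_last`; `expire_snapshots_call` is a hypothetical helper, and nothing here executes against the catalog.

```python
from datetime import datetime


def expire_snapshots_call(catalog, table, older_than=None, retain_last=None):
    """Build a CALL statement for Iceberg's expire_snapshots procedure.

    older_than:  datetime; snapshots older than this are eligible to expire
    retain_last: int >= 1; minimum number of recent snapshots to keep
    """
    args = [f"table => '{table}'"]
    if older_than is not None:
        ts = older_than.isoformat(sep=" ", timespec="milliseconds")
        args.append(f"older_than => TIMESTAMP '{ts}'")
    if retain_last is not None:
        if int(retain_last) < 1:
            raise ValueError("retain_last must be at least 1")
        args.append(f"retain_last => {int(retain_last)}")
    return f"CALL {catalog}.system.expire_snapshots({', '.join(args)})"


statement = expire_snapshots_call(
    "rest",
    "rest.`play_iceberg`.users",
    older_than=datetime(2025, 1, 1),
    retain_last=5,
)
print(statement)

# In the notebook (irreversible, so run only when the snapshots are no longer needed):
# spark.sql(statement).show()
```

Combining both arguments expires snapshots older than the cutoff while still keeping the most recent five, which preserves a rollback window even on tables that change rarely.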


## Summary

This notebook demonstrated the powerful time travel and rollback capabilities of Apache Iceberg:

### What We Accomplished:
1. **Explored Table History**: Understood how Iceberg tracks all changes
2. **Time Travel Queries**: Accessed historical data using snapshots and timestamps
3. **Rollback Operations**: Successfully reverted unwanted changes
4. **Data Auditing**: Tracked changes to specific records over time
5. **Best Practices**: Learned performance and operational considerations

### Key Benefits:
- **Data Recovery**: Quick recovery from accidental changes
- **Debugging**: Investigate when and how data changed
- **Compliance**: Audit trail of changes for as long as the snapshots are retained
- **Testing**: Safe experimentation with rollback capability
- **Analysis**: Compare data across different time periods

### Use Cases:
- **ETL Error Recovery**: Rollback failed batch jobs
- **Data Quality Issues**: Revert to clean state and re-process
- **Regulatory Compliance**: Maintain complete change history
- **A/B Testing**: Compare results across different data versions
- **Incident Response**: Quickly restore service after data corruption

Time travel and rollback make Iceberg an excellent choice for production data lakes where data reliability and recoverability are critical.

In [None]:
# Clean up
print("Time travel and rollback demonstration completed!")
print("Spark session cleanup...")
spark.stop()
print("Session closed.")