## Future Steps

### Immediate Actions
- Implement automated upsert scheduling for regular data synchronization
- Add comprehensive logging and monitoring for production deployments
- Create performance benchmarks for different data volumes
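
A benchmark across data volumes can start as a plain timing loop. The sketch below is illustrative only: `benchmark_volumes` and the `run_upsert` callback are hypothetical names, and `run_upsert(n)` is assumed to generate `n` records and execute the merge.

```python
import time

def benchmark_volumes(run_upsert, volumes, repeats=3):
    """Time run_upsert(n) for each volume; return the best wall-clock seconds per volume."""
    results = {}
    for n in volumes:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_upsert(n)  # hypothetical callback: build n records and merge them
            timings.append(time.perf_counter() - start)
        # The minimum is less sensitive to one-off noise than the mean.
        results[n] = min(timings)
    return results
```

Note that repeated runs against the same table exercise the update path after the first run, since the keys already exist by then.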

### Production Enhancements
- Implement retry logic for failed merge operations
- Add data quality monitoring and alerting
- Optimize partition strategy based on merge patterns
- Set up automated testing for merge operation validation
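
Retry logic for a failed merge could look like the sketch below. `run_with_retry` is a hypothetical helper, not part of Spark or Iceberg; which exception types are worth retrying depends on the deployment, so `retry_on` is left as a parameter.

```python
import time

def run_with_retry(operation, max_attempts=3, base_delay=1.0,
                   retry_on=(Exception,), sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            # Re-raise once the last attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

A merge would then be wrapped as `run_with_retry(lambda: spark.sql(merge_sql))`, where `merge_sql` stands for any MERGE statement in this notebook. Retrying is only safe when re-running the merge produces the same table state, as a keyed upsert normally does.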

### Advanced Features
- Implement Change Data Capture (CDC) integration with merge operations
- Add support for streaming upserts using Structured Streaming
- Create custom merge strategies for domain-specific use cases
- Integrate with data lineage tracking systems

# Advanced Upsert Operations with Apache Iceberg and Spark

## Objectives
- Master MERGE INTO capabilities for efficient data synchronization
- Implement conditional merge logic with complex business rules
- Execute bulk upsert operations with performance optimization
- Handle DELETE operations within MERGE statements for data cleanup

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType, BooleanType, IntegerType, TimestampType
from pyspark.sql.functions import col, current_timestamp
from datetime import datetime
import time

In [2]:
# Create Spark session with optimized configuration for merge operations
spark = SparkSession.builder \
    .appName("Advanced Iceberg Upsert Tutorial") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

print(f"Spark Version: {spark.version}")
print(f"Default Catalog: {spark.conf.get('spark.sql.defaultCatalog')}")
print(f"Adaptive Query Execution: {spark.conf.get('spark.sql.adaptive.enabled')}")

# Set catalog context
spark.sql("USE rest.`play_iceberg`")
print("\nCatalog context set to: rest.play_iceberg")

25/07/01 13:45:51 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


Spark Version: 3.5.5
Default Catalog: rest
Adaptive Query Execution: true

Catalog context set to: rest.play_iceberg


In [6]:
# Verify the merge results
print("Table state after basic merge:")
result_df = spark.sql("SELECT * FROM users ORDER BY user_id")
result_df.show(truncate=False)

# Show statistics after merge
print("\nPost-merge statistics:")
spark.sql("""
    SELECT 
        COUNT(*) as total_records,
        COUNT(CASE WHEN is_active = true THEN 1 END) as active_users,
        COUNT(CASE WHEN is_active = false THEN 1 END) as inactive_users
    FROM users
""").show()

Table state after basic merge:
+-------+-------------------+-------------------------------+---------+------------+-------------+-----------+--------------------------+
|user_id|username           |email                          |is_active|created_year|created_month|created_day|updated_at                |
+-------+-------------------+-------------------------------+---------+------------+-------------+-----------+--------------------------+
|1      |john_doe_updated   |john.doe.updated@example.com   |false    |2025        |7            |1          |2025-07-01 13:46:10.39113 |
|2      |jane_smith         |jane.smith@example.com         |true     |2025        |7            |1          |2025-07-01 13:41:16.173663|
|3      |alice_wonder_active|alice.wonder.active@example.com|true     |2025        |7            |1          |2025-07-01 13:46:10.39113 |
|4      |bob_builder        |bob.builder@example.com        |true     |2025        |7            |1          |2025-07-01 13:41:16.173663|
|5 

## 4. Conditional Merge Operations

Now let's explore more sophisticated merge patterns with conditional logic.
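
A conditional merge adds a predicate to individual `WHEN` clauses. The sketch below assumes a hypothetical temp view `user_updates` with the same columns as `users`; it updates a matched row only when the source is newer and inserts only active users. The rules are examples, not the notebook's original logic.

```python
CONDITIONAL_MERGE_SQL = """
    MERGE INTO users AS target
    USING user_updates AS source
    ON target.user_id = source.user_id
    WHEN MATCHED AND source.updated_at > target.updated_at THEN
        UPDATE SET
            target.username = source.username,
            target.email = source.email,
            target.is_active = source.is_active,
            target.updated_at = source.updated_at
    WHEN NOT MATCHED AND source.is_active = true THEN
        INSERT (user_id, username, email, is_active,
                created_year, created_month, created_day, updated_at)
        VALUES (source.user_id, source.username, source.email, source.is_active,
                source.created_year, source.created_month, source.created_day,
                source.updated_at)
"""

def run_conditional_merge(spark):
    """Execute the conditional merge on an existing SparkSession."""
    return spark.sql(CONDITIONAL_MERGE_SQL)
```

Rows that satisfy neither clause (stale updates, inactive new users) are left out of the table without an error.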

## 5. Bulk Upsert Performance Optimization

Let's demonstrate handling larger datasets and optimization techniques.

In [10]:
# Generate bulk data for performance testing
import random
from datetime import datetime, timedelta

def generate_bulk_users(start_id, count):
    """Generate bulk user data for testing."""
    bulk_data = []
    base_time = datetime.now()
    
    for i in range(count):
        user_id = start_id + i
        bulk_data.append({
            'user_id': user_id,
            'username': f'bulk_user_{user_id}',
            'email': f'bulk.user.{user_id}@example.com',
            'is_active': random.choice([True, False]),
            'created_year': 2025,
            'created_month': 6,
            'created_day': 27,
            'updated_at': base_time + timedelta(seconds=i)
        })
    
    return bulk_data

# Generate 1000 new users and 7 updates to existing users
new_users = generate_bulk_users(100, 1000)
existing_updates = []

# Update some existing users (1-7)
for user_id in range(1, 8):
    existing_updates.append({
        'user_id': user_id,
        'username': f'updated_user_{user_id}',
        'email': f'updated.user.{user_id}@example.com',
        'is_active': True,
        'created_year': 2025,
        'created_month': 6,
        'created_day': 27,
        'updated_at': datetime.now()
    })

# Combine all data
bulk_upsert_data = new_users + existing_updates
print(f"Generated {len(bulk_upsert_data)} records for bulk upsert")
print(f"- New users: {len(new_users)}")
print(f"- Updated existing users: {len(existing_updates)}")

Generated 1007 records for bulk upsert
- New users: 1000
- Updated existing users: 7
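
In this dataset the two ID ranges (100 to 1099 and 1 to 7) do not overlap, so each `user_id` appears once. Real feeds often repeat keys, and Iceberg raises an error when more than one source row matches the same target row. A minimal sketch of a pre-merge deduplication step on the list of dicts, with `dedupe_latest` as a hypothetical helper name:

```python
def dedupe_latest(records, key="user_id", order_by="updated_at"):
    """Keep one record per key: the one with the greatest order_by value."""
    latest = {}
    for record in records:
        current = latest.get(record[key])
        # The first record for a key wins ties; a strictly newer one replaces it.
        if current is None or record[order_by] > current[order_by]:
            latest[record[key]] = record
    return list(latest.values())
```

Calling `dedupe_latest(bulk_upsert_data)` before building the source DataFrame would leave this particular dataset unchanged at 1007 records.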


## 6. Delete Operations with MERGE

Iceberg also supports DELETE operations within MERGE statements for data cleanup.

In [14]:
# Create a cleanup dataset - users to be deleted or deactivated
cleanup_data = [
    {'user_id': 105, 'action': 'delete'},
    {'user_id': 110, 'action': 'delete'},
    {'user_id': 115, 'action': 'deactivate'},
    {'user_id': 120, 'action': 'deactivate'}
]

cleanup_schema = StructType([
    StructField("user_id", LongType(), False),
    StructField("action", StringType(), False)
])

cleanup_df = spark.createDataFrame(cleanup_data, cleanup_schema)
cleanup_df.createOrReplaceTempView("cleanup_actions")

print("Cleanup actions to be performed:")
cleanup_df.show()

# Check current state of these users
print("\nCurrent state of affected users:")
spark.sql("""
    SELECT user_id, username, email, is_active 
    FROM users 
    WHERE user_id IN (105, 110, 115, 120)
    ORDER BY user_id
""").show()

Cleanup actions to be performed:
+-------+----------+
|user_id|    action|
+-------+----------+
|    105|    delete|
|    110|    delete|
|    115|deactivate|
|    120|deactivate|
+-------+----------+


Current state of affected users:
+-------+-------------+--------------------+---------+
|user_id|     username|               email|is_active|
+-------+-------------+--------------------+---------+
|    105|bulk_user_105|bulk.user.105@exa...|     true|
|    110|bulk_user_110|bulk.user.110@exa...|     true|
|    115|bulk_user_115|bulk.user.115@exa...|    false|
|    120|bulk_user_120|bulk.user.120@exa...|    false|
+-------+-------------+--------------------+---------+



In [15]:
# Execute conditional delete/update merge
spark.sql("""
    MERGE INTO users AS target
    USING cleanup_actions AS source
    ON target.user_id = source.user_id
    WHEN MATCHED AND source.action = 'delete' THEN
        DELETE
    WHEN MATCHED AND source.action = 'deactivate' THEN
        UPDATE SET
            target.is_active = false,
            target.updated_at = current_timestamp()
""")

print("Cleanup merge operation completed")

Cleanup merge operation completed
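
The merge can be checked by re-querying the four affected IDs. Users 115 and 120 were already inactive before the merge, so `is_active` alone cannot confirm that the update ran; the sketch therefore also selects `updated_at`. `cleanup_check_sql` and `verify_cleanup` are hypothetical helper names.

```python
def cleanup_check_sql(deleted_ids, deactivated_ids):
    """Build a query returning whatever rows remain for the affected user_ids."""
    ids = ", ".join(str(int(i)) for i in sorted(set(deleted_ids) | set(deactivated_ids)))
    return f"""
        SELECT user_id, username, is_active, updated_at
        FROM users
        WHERE user_id IN ({ids})
        ORDER BY user_id
    """

def verify_cleanup(spark, deleted_ids, deactivated_ids):
    """Return (ids that should be gone but remain, ids that are not deactivated)."""
    rows = spark.sql(cleanup_check_sql(deleted_ids, deactivated_ids)).collect()
    remaining = {row["user_id"]: row["is_active"] for row in rows}
    still_present = [i for i in deleted_ids if i in remaining]
    # A missing row counts as a problem too: deactivated users should still exist.
    not_deactivated = [i for i in deactivated_ids if remaining.get(i) is not False]
    return still_present, not_deactivated
```

`verify_cleanup(spark, [105, 110], [115, 120])` returning two empty lists would indicate the cleanup behaved as intended.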


## 8. Advanced Merge Patterns

Explore complex scenarios and edge cases in merge operations.

In [18]:
# Demonstrate merge with data validation
# Create source data with some invalid records
validation_data = [
    {
        'user_id': 1001,
        'username': 'valid_user',
        'email': 'valid@example.com',
        'is_active': True,
        'created_year': 2025,
        'created_month': 6,
        'created_day': 27,
        'updated_at': datetime.now()
    },
    {
        'user_id': 1002,
        'username': '',  # Invalid: empty username
        'email': 'invalid@example.com',
        'is_active': True,
        'created_year': 2025,
        'created_month': 6,
        'created_day': 27,
        'updated_at': datetime.now()
    },
    {
        'user_id': 1003,
        'username': 'another_valid_user',
        'email': 'not-an-email',  # Invalid: bad email format
        'is_active': True,
        'created_year': 2025,
        'created_month': 6,
        'created_day': 27,
        'updated_at': datetime.now()
    }
]

validation_df = spark.createDataFrame(validation_data, spark.table("users").schema)

# Add validation logic
validated_df = validation_df.filter(
    (col("username") != "") & 
    (col("username").isNotNull()) &
    (col("email").contains("@")) &
    (col("email").contains("."))
)

validated_df.createOrReplaceTempView("validated_updates")

print(f"Original records: {validation_df.count()}")
print(f"Valid records: {validated_df.count()}")
print(f"Invalid records filtered out: {validation_df.count() - validated_df.count()}")

print("\nValid records to be merged:")
validated_df.show()

Original records: 3
Valid records: 1
Invalid records filtered out: 2

Valid records to be merged:
+-------+----------+-----------------+---------+------------+-------------+-----------+--------------------+
|user_id|  username|            email|is_active|created_year|created_month|created_day|          updated_at|
+-------+----------+-----------------+---------+------------+-------------+-----------+--------------------+
|   1001|valid_user|valid@example.com|     true|        2025|            6|         27|2025-07-01 13:46:...|
+-------+----------+-----------------+---------+------------+-------------+-----------+--------------------+
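
The filter above drops invalid rows without recording why. For data quality monitoring it can help to keep the rejected rows together with a reason. The sketch below applies the same two rules to the plain list of dicts; `split_valid_invalid` is a hypothetical helper.

```python
def split_valid_invalid(records):
    """Return (valid, rejected); rejected holds (record, reason) pairs."""
    valid, rejected = [], []
    for record in records:
        username = record.get("username")
        email = record.get("email") or ""
        if not username:
            # Covers both None and the empty string.
            rejected.append((record, "empty username"))
        elif "@" not in email or "." not in email:
            rejected.append((record, "malformed email"))
        else:
            valid.append(record)
    return valid, rejected
```

Applied to `validation_data`, this keeps user 1001 and rejects 1002 (empty username) and 1003 (malformed email), matching the counts printed above.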



In [19]:
# Execute merge with validated data
spark.sql("""
    MERGE INTO users AS target
    USING validated_updates AS source
    ON target.user_id = source.user_id
    WHEN NOT MATCHED THEN
        INSERT (
            user_id, username, email, is_active,
            created_year, created_month, created_day, updated_at
        )
        VALUES (
            source.user_id, source.username, source.email, source.is_active,
            source.created_year, source.created_month, source.created_day, source.updated_at
        )
""")

print("Validated merge completed")

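# Check user_ids 1001-1003 after the merge.
# Note: the bulk data generated earlier covers user_ids 100-1099, so these IDs
# already exist and WHEN NOT MATCHED inserts nothing for them.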
spark.sql("""
    SELECT user_id, username, email 
    FROM users 
    WHERE user_id IN (1001, 1002, 1003)
    ORDER BY user_id
""").show()

Validated merge completed
+-------+--------------+--------------------+
|user_id|      username|               email|
+-------+--------------+--------------------+
|   1001|bulk_user_1001|bulk.user.1001@ex...|
|   1002|bulk_user_1002|bulk.user.1002@ex...|
|   1003|bulk_user_1003|bulk.user.1003@ex...|
+-------+--------------+--------------------+
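
Because this merge has only a `WHEN NOT MATCHED` clause, any source key that already exists in `users` is skipped silently, which is what the `bulk_user_*` rows in the output indicate. A pre-merge check can surface such collisions. This is a sketch with a hypothetical helper name, `find_existing_ids`:

```python
def find_existing_ids(spark, source_view, target_table="users", key="user_id"):
    """Return the key values in source_view that already exist in target_table."""
    query = f"""
        SELECT s.{key} AS {key}
        FROM {source_view} AS s
        JOIN {target_table} AS t ON s.{key} = t.{key}
        ORDER BY s.{key}
    """
    return [row[key] for row in spark.sql(query).collect()]
```

Running `find_existing_ids(spark, "validated_updates")` before the merge would list every ID that an insert-only merge is going to skip.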



## 9. Summary and Cleanup

Let's summarize what we've learned and clean up our session.
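
A minimal cleanup sketch, assuming the only session-level objects to remove are the temp views created in the cells above (`cleanup_actions` and `validated_updates`). `cleanup_session` is a hypothetical helper; it leaves the `users` table in place.

```python
def cleanup_session(spark, temp_views=("cleanup_actions", "validated_updates"), stop=False):
    """Drop the tutorial's temp views and optionally stop the session."""
    # dropTempView reports whether a view with that name was actually dropped.
    dropped = [name for name in temp_views if spark.catalog.dropTempView(name)]
    if stop:
        spark.stop()
    return dropped
```

Passing `stop=True` ends the Spark session, so it belongs in the very last cell.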