# Insert Data with Spark - Iceberg

## Objectives

- Insert data into Iceberg tables using the Spark DataFrame API
- Demonstrate schema-aware insertion by building the DataFrame from the target table's schema
- Show partition-aware writes for optimal performance
- Implement data verification and quality checks
- Understand Spark-Iceberg integration patterns

In [11]:
from pyspark.sql import SparkSession
from datetime import datetime, timezone

# Create Spark session
spark = SparkSession.builder.appName("IcebergInsert").getOrCreate()

# Sample user data
current_time = datetime.now(timezone.utc)
sample_users_data = [
    (1, "john_doe", "john.doe@example.com", True, current_time.year, current_time.month, current_time.day, current_time),
    (2, "jane_smith", "jane.smith@example.com", True, current_time.year, current_time.month, current_time.day, current_time),
    (3, "alice_wonder", "alice.wonder@example.com", False, current_time.year, current_time.month, current_time.day, current_time),
    (4, "bob_builder", "bob.builder@example.com", True, current_time.year, current_time.month, current_time.day, current_time),
    (5, "charlie_brown", "charlie.brown@example.com", True, current_time.year, current_time.month, current_time.day, current_time)
]

# Create DataFrame with schema derived from existing table
table_schema = spark.table("rest.play_iceberg.users").schema
users_df = spark.createDataFrame(sample_users_data, schema=table_schema)

print(f"Created DataFrame with {users_df.count()} records")
users_df.show()

25/07/02 08:11:32 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


Created DataFrame with 5 records
+-------+-------------+--------------------+---------+------------+-------------+-----------+--------------------+
|user_id|     username|               email|is_active|created_year|created_month|created_day|          updated_at|
+-------+-------------+--------------------+---------+------------+-------------+-----------+--------------------+
|      1|     john_doe|john.doe@example.com|     true|        2025|            7|          2|2025-07-02 08:11:...|
|      2|   jane_smith|jane.smith@exampl...|     true|        2025|            7|          2|2025-07-02 08:11:...|
|      3| alice_wonder|alice.wonder@exam...|    false|        2025|            7|          2|2025-07-02 08:11:...|
|      4|  bob_builder|bob.builder@examp...|     true|        2025|            7|          2|2025-07-02 08:11:...|
|      5|charlie_brown|charlie.brown@exa...|     true|        2025|            7|          2|2025-07-02 08:11:...|
+-------+-------------+--------------------+---------+------------+-------------+-----------+--------------------+
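
The cell above relies on the table schema to shape the DataFrame. As a complement, here is a minimal sketch of a pure-Python pre-check that reports malformed tuples with a readable message before they reach `createDataFrame`. The column names are taken from the output above; the Python types, `EXPECTED_COLUMNS`, and `validate_rows` are assumptions made for illustration, since the table's actual schema is not shown in this notebook.

```python
from datetime import datetime

# Column names come from the output above; the Python types are assumptions,
# because the real schema of rest.play_iceberg.users is not shown here.
EXPECTED_COLUMNS = [
    ("user_id", int),
    ("username", str),
    ("email", str),
    ("is_active", bool),
    ("created_year", int),
    ("created_month", int),
    ("created_day", int),
    ("updated_at", datetime),
]


def validate_rows(rows, columns=EXPECTED_COLUMNS):
    """Return a list of readable problems; an empty list means the rows look OK.

    None is reported as a problem here; relax that for nullable columns.
    """
    problems = []
    for i, row in enumerate(rows):
        if len(row) != len(columns):
            problems.append(f"row {i}: expected {len(columns)} values, got {len(row)}")
            continue
        for value, (name, expected_type) in zip(row, columns):
            # bool is a subclass of int in Python, so reject it for int columns.
            if expected_type is int and isinstance(value, bool):
                problems.append(f"row {i}: {name} should be int, got bool")
            elif not isinstance(value, expected_type):
                problems.append(
                    f"row {i}: {name} should be {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )
    return problems
```

In this notebook it could be called as `validate_rows(sample_users_data)` before building `users_df`, raising a `ValueError` if the returned list is not empty.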

In [12]:
# Insert data into Iceberg table
before_count = spark.sql("SELECT COUNT(*) as count FROM rest.play_iceberg.users").collect()[0]['count']
print(f"Records before insertion: {before_count}")

# Append the rows using the DataFrameWriterV2 API: writeTo(...).append()
users_df.writeTo("rest.play_iceberg.users").append()

after_count = spark.sql("SELECT COUNT(*) as count FROM rest.play_iceberg.users").collect()[0]['count']
print(f"Records after insertion: {after_count}")
print(f"Records inserted: {after_count - before_count}")

# Verify insertion
print("\nAll records in table:")
spark.sql("SELECT * FROM rest.play_iceberg.users ORDER BY user_id").show(truncate=False)

Records before insertion: 0
Records after insertion: 5
Records inserted: 5

All records in table:
+-------+-------------+-------------------------+---------+------------+-------------+-----------+--------------------------+
|user_id|username     |email                    |is_active|created_year|created_month|created_day|updated_at                |
+-------+-------------+-------------------------+---------+------------+-------------+-----------+--------------------------+
|1      |john_doe     |john.doe@example.com     |true     |2025        |7            |2          |2025-07-02 08:11:32.958859|
|2      |jane_smith   |jane.smith@example.com   |true     |2025        |7            |2          |2025-07-02 08:11:32.958859|
|3      |alice_wonder |alice.wonder@example.com |false    |2025        |7            |2          |2025-07-02 08:11:32.958859|
|4      |bob_builder  |bob.builder@example.com  |true     |2025        |7            |2          |2025-07-02 08:11:32.958859|
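|5      |charlie_brown|charlie.brown@example.com|true     |2025        |7            |2          |2025-07-02 08:11:32.958859|
+-------+-------------+-------------------------+---------+------------+-------------+-----------+--------------------------+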
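
Besides comparing row counts, the insert can be cross-checked against Iceberg's own commit metadata. The sketch below reads the table's `snapshots` metadata table and pulls the `added-records` counter from the latest snapshot's summary. The helper names `records_added` and `latest_snapshot_added_records` are hypothetical, and the code assumes the `spark` session and catalog configured earlier in this notebook.

```python
def records_added(summary):
    """Extract the 'added-records' counter from an Iceberg snapshot summary map."""
    # Summary values are stored as strings; a missing key is treated as 0 here.
    return int(summary.get("added-records", 0))


def latest_snapshot_added_records(spark, table="rest.play_iceberg.users"):
    """Return (operation, records added) for the newest snapshot of `table`."""
    row = spark.sql(
        f"SELECT operation, summary FROM {table}.snapshots "
        "ORDER BY committed_at DESC LIMIT 1"
    ).first()
    if row is None:
        return None  # the table has no snapshots yet
    # PySpark returns a MapType column as a Python dict.
    return row["operation"], records_added(row["summary"])
```

Right after the append above, one would expect the reported count to match `after_count - before_count`; a mismatch suggests that another writer committed in between.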

## Future Steps

### Immediate Next Actions:
1. **Query Operations**: Read and analyze inserted data (→ Notebook 3)
2. **Update Operations**: Modify existing records with upserts (→ Notebook 4)
3. **Schema Evolution**: Add new columns to existing data (→ Notebook 5)
4. **Time Travel**: Query historical versions of data (→ Notebook 6)

### Production Enhancements:
- **Error Handling**: Implement robust error handling and retry logic
- **Data Validation**: Add business rule validation before insertion
- **Monitoring**: Track insertion metrics and performance
- **Batch Processing**: Handle large-scale data ingestion efficiently
- **Idempotency**: Ensure safe re-execution of insertion operations
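
As one possible starting point for the error-handling item, here is a minimal sketch of a retry wrapper with exponential backoff. `with_retries` and its parameters are hypothetical names, not part of Spark or Iceberg, and catching every `Exception` is a simplification that should be narrowed to errors known to be transient.

```python
import time


def with_retries(action, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds, waiting base_delay * 2**n between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except Exception as exc:  # narrow this to errors you consider transient
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

With the objects from this notebook, a call could look like `with_retries(lambda: users_df.writeTo("rest.play_iceberg.users").append())`. A blind retry can insert rows twice if the first attempt actually committed before the error was reported, so this pattern is best paired with the idempotency item in the list.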

### Advanced Features:
- **Streaming Ingestion**: Real-time data insertion with Spark Structured Streaming
- **Dynamic Partitioning**: Automatic partition creation for new date ranges
- **Merge Operations**: Complex upsert patterns with conflict resolution
- **Data Compaction**: Optimize storage with background compaction jobs
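
To make the merge item more concrete, here is a rough sketch of a helper that builds a basic upsert statement. `build_merge_sql` and the view name `users_updates` are hypothetical, and the sketch assumes the session has Iceberg's Spark SQL extensions enabled so that `MERGE INTO` is available; it does not cover conflict resolution beyond "update on match, insert otherwise".

```python
def build_merge_sql(target, source_view, key_columns):
    """Build a MERGE INTO statement that updates matching rows and inserts new ones."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    # Rows match when every key column is equal in target (t) and source (s).
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_columns)
    return (
        f"MERGE INTO {target} t\n"
        f"USING {source_view} s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )
```

A possible use with this notebook's objects: register the incoming rows with `users_df.createOrReplaceTempView("users_updates")`, then run `spark.sql(build_merge_sql("rest.play_iceberg.users", "users_updates", ["user_id"]))`. Keying on `user_id` is an assumption about what identifies a user in this table.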