
This notebook covers the basic concepts of Delta Lake: how data is stored in
Delta format, how Delta tables are created, how schema enforcement works, and
how MERGE operations handle duplicate data.

## 1. Introduction to Delta Lake

Delta Lake is an open-source storage layer that brings reliability to data lakes.
It stores table data as Parquet files alongside a transaction log, which lets
engines such as Apache Spark run ACID transactions, enforce schemas, and apply
updates and deletes to the table.

## 2. Converting CSV Data to Delta Format

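Raw CSV files do not support transactions, so a failed or concurrent write can
leave readers with partial or inconsistent data. Delta format records each write
as an atomic commit in a transaction log, so readers see either the whole write
or none of it.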

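In [None]:
# Write the DataFrame in Delta format.
# `events` is assumed to be a DataFrame already loaded from the raw CSV
# (for example with spark.read.csv); mode("overwrite") replaces any
# existing data at the target path.
events.write.format("delta") \
    .mode("overwrite") \
    .save("/delta/events")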
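
To make the transactional storage concrete, the sketch below lists the commits recorded for a Delta table. It relies on the fact that a Delta table directory contains a `_delta_log` folder whose commit files are named with a 20-digit, zero-padded version number and a `.json` extension. The helper name `list_commit_versions` is hypothetical, and the function assumes the table directory is reachable as a local filesystem path, which may not hold on cloud or DBFS storage.

```python
import re
from pathlib import Path

# Commit files are named with a 20-digit, zero-padded version number.
COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def list_commit_versions(table_path):
    """Return the sorted commit versions found in <table_path>/_delta_log."""
    log_dir = Path(table_path) / "_delta_log"
    if not log_dir.is_dir():
        # Not a Delta table, or not reachable as a local path.
        return []
    versions = []
    for entry in log_dir.iterdir():
        match = COMMIT_FILE.match(entry.name)
        if match:  # skips checkpoint and checksum files
            versions.append(int(match.group(1)))
    return sorted(versions)
```

A freshly created table shows version 0 after its first write and gains one version for each later commit, including every overwrite.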

## 3. Creating Delta Tables (PySpark and SQL)

Delta tables can be created with either PySpark or SQL, so that different users,
such as data engineers and analysts, can work with the same data.

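In [None]:
# Create a managed Delta table named events_table using PySpark.
# With the default save mode, this raises an error if the table already exists.
events.write.format("delta") \
    .saveAsTable("events_table")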

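In [None]:
# Create Delta table using SQL, run through spark.sql so it works in a
# Python cell (assumes the notebook's SparkSession is available as `spark`)
spark.sql("""
    CREATE TABLE events_delta
    USING DELTA
    AS SELECT * FROM events_table
""")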

## 4. Schema Enforcement in Delta Lake

Delta Lake validates the schema of every write against the table's schema and
rejects writes whose columns or data types are incompatible, which keeps
malformed data out of existing tables.

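In [None]:
# Attempt to append data with an incompatible schema.
# `wrong_schema_df` is assumed to be a DataFrame whose columns or types
# differ from the table stored at /delta/events.
try:
    wrong_schema_df.write.format("delta") \
        .mode("append") \
        .save("/delta/events")
except Exception as e:
    # The append is rejected and the table is left unchanged
    print("Schema enforcement error:", e)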
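
The kind of check behind schema enforcement can be illustrated with a small plain-Python model. This is a simplified sketch and not Delta Lake's implementation: the function `check_schema` and the example column names are hypothetical, schemas are modeled as dictionaries of column name to type name, and the real rules have more nuance than shown here.

```python
def check_schema(table_schema, incoming_schema):
    """Return the problems that would block an append; empty means compatible.

    Both arguments map column name -> type name, e.g. {"id": "long"}.
    """
    problems = []
    for column, dtype in incoming_schema.items():
        if column not in table_schema:
            # Incoming data has a column the table does not know about.
            problems.append(f"unexpected column: {column}")
        elif table_schema[column] != dtype:
            problems.append(
                f"type mismatch for {column}: table has {table_schema[column]}, "
                f"incoming has {dtype}"
            )
    # Columns missing from the incoming data are not flagged in this model.
    return problems


# Hypothetical schemas for illustration only.
table = {"id": "long", "event_type": "string"}
print(check_schema(table, {"id": "long", "event_type": "string"}))
print(check_schema(table, {"id": "string", "source": "string"}))
```

The first call reports no problems. The second reports both a type mismatch on `id` and an unexpected `source` column, which is the situation the failing append above is meant to demonstrate.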

## 5. Handling Duplicate Inserts using MERGE

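Delta Lake supports MERGE, which matches incoming rows against existing rows on
a key. Rows that match are updated and rows that do not match are inserted, so
re-running a pipeline or reprocessing data does not create duplicate records.
Here `new_events` is assumed to be a table or temporary view holding the
incoming rows.

In [None]:
# Upsert: update rows whose id already exists, insert the rest
spark.sql("""
    MERGE INTO events_delta t
    USING new_events s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")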
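
The following plain-Python sketch models why a keyed upsert is safe to re-run. It is a toy model rather than Delta Lake code: the function `merge_by_id` and the sample rows are hypothetical, the target table is modeled as a dictionary keyed by `id`, and the source batch as a list of rows.

```python
def merge_by_id(target, source):
    """Upsert source rows into target, matching on the "id" field.

    Returns a new dict keyed by id; the inputs are not modified.
    """
    merged = dict(target)
    for row in source:
        # Update the row if the id exists, insert it otherwise.
        # Note: this model keeps the last row per id, whereas Delta's MERGE
        # reports an error when several source rows match one target row.
        merged[row["id"]] = row
    return merged


# Hypothetical data for illustration only.
existing = {1: {"id": 1, "event": "click"}}
batch = [{"id": 1, "event": "view"}, {"id": 2, "event": "click"}]

once = merge_by_id(existing, batch)
twice = merge_by_id(once, batch)  # simulate re-running the pipeline
print(len(once), once == twice)  # prints: 2 True
```

Applying the same batch a second time leaves the result unchanged, which is the property that prevents duplicates when a pipeline is re-run.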

## 6. Key Takeaways

Delta Lake enhances data lakes by adding reliability, schema enforcement, and
safe mechanisms for handling updates and duplicate data, making it suitable for
production data pipelines.