# Introduction

##### Delta Lake is an open-source storage framework that enables building a Lakehouse architecture. It provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. It runs on top of your existing data lakes, is compatible with processing engines such as Apache Spark, PrestoDB, Flink, Trino, and Hive, and offers APIs for Scala, Java, Rust, Ruby, and Python. Specifically, it provides the following features:
> __ACID guarantees__ <br>
    Delta Lake ensures that all data changes written to storage are committed for durability and made visible to readers atomically. In other words, no more partial or corrupted files! We discuss the ACID guarantees further, as part of the transaction log, later in this chapter.

> __Scalable data and metadata handling__
Since Delta Lake is built on data lakes, all reads and writes using Spark or other distributed processing engines inherently scale to petabytes. However, unlike most other storage formats and query engines, Delta Lake leverages Spark to scale out all metadata processing, so it can efficiently handle the metadata of billions of files for petabyte-scale tables. We discuss the transaction log further later in this chapter.

> __Audit history and time travel__
The Delta Lake transaction log records details about every change made to the data, providing a full audit trail. The resulting data snapshots let developers access and revert to earlier versions of the data for audits, rollbacks, or reproducing experiments. We dive further into this topic in Chapter 3: Time Travel with Delta.

> __Schema enforcement and schema evolution__
Delta Lake automatically prevents the insertion of data whose schema does not match the table schema. When needed, it allows the table schema to be explicitly and safely evolved to accommodate ever-changing data. We dive further into this topic in Chapter 4, which focuses on schema enforcement and evolution.

> __Support for deletes, updates, and merge__
Most distributed processing frameworks do not support atomic data modification operations on data lakes. Delta Lake supports merge, update, and delete operations to enable complex use cases including, but not limited to, change data capture (CDC), slowly changing dimension (SCD) operations, and streaming upserts. We dive further into this topic in Chapter 5: Data modifications in Delta.

> __Streaming and batch unification__
A Delta Lake table can be used in batch jobs and as both a streaming source and a streaming sink. Workloads across a wide range of latencies, from streaming data ingest to batch historical backfill to interactive queries, work out of the box. We dive further into this topic in Chapter 6: Streaming Applications with Delta.
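
To make the schema enforcement and schema evolution feature described above more concrete, here is a minimal pure-Python sketch. It is a toy model under stated assumptions, not Delta Lake's implementation: `table_schema`, `validate`, and the `cust_email` column are hypothetical names, and the only rules modeled are "reject columns the table does not know" and "reject values of the wrong type".

```python
# Toy table schema: column name -> expected Python type.
# Hypothetical stand-in for a Delta table schema; not a Delta Lake API.
table_schema = {"cust_id": int, "cust_name": str, "cust_age": int}


def validate(row, schema):
    """Return True if the row fits the schema, otherwise raise ValueError."""
    unknown = sorted(set(row) - set(schema))
    if unknown:
        raise ValueError(f"columns not in table schema: {unknown}")
    for column, value in row.items():
        expected = schema[column]
        # None stands in for a SQL NULL and is accepted for any column
        if value is not None and not isinstance(value, expected):
            raise ValueError(
                f"column {column!r}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return True


good = {"cust_id": 8, "cust_name": "Ada", "cust_age": 41}
wrong_type = {"cust_id": "9", "cust_name": "Bob", "cust_age": 30}
extra_column = {"cust_id": 9, "cust_name": "Bob", "cust_age": 30,
                "cust_email": "bob@example.com"}

validate(good, table_schema)  # accepted

for candidate in (wrong_type, extra_column):
    try:
        validate(candidate, table_schema)
    except ValueError as err:
        print("rejected:", err)

# Schema evolution in this toy model is an explicit, opt-in step:
# the new column is added to the schema before the write is retried.
evolved_schema = {**table_schema, "cust_email": str}
validate(extra_column, evolved_schema)  # accepted after evolution
```

The point of the sketch is the ordering: a write that does not fit is rejected first, and it only succeeds after the schema has been changed on purpose.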

### For additional details, please refer to the Delta Lake documentation:
- __[Delta Lake Documentation](https://docs.delta.io/latest/index.html)__

## Prerequisites

To execute the code in this notebook, you will need:
- An AWS account

The following services should be created and configured:
- EMR Studio
- EMR Studio Workspace
- EMR on EKS Virtual Cluster
- EKS Cluster (EC2 based)
- Managed Endpoint
- IAM Policy
- Application Load Balancer
- VPC and Subnet

In [1]:
%%configure -f

{
  "driverMemory": "1G",
  "driverCores" : 1,
  "executorMemory" : "1G",
  "executorCores": 1,
  "conf": {
     "spark.dynamicAllocation.maxExecutors" : 10,
     "spark.dynamicAllocation.minExecutors": 1,
     "spark.sql.extensions" : "io.delta.sql.DeltaSparkSessionExtension",
     "spark.serializer":"org.apache.spark.serializer.KryoSerializer",
     "spark.sql.catalog.spark_catalog":"org.apache.spark.sql.delta.catalog.DeltaCatalog",
     "spark.databricks.hive.metastore.glueCatalog.enabled":"true",
     "spark.jars.packages":"io.delta:delta-core_2.12:2.1.0",
     "spark.databricks.delta.schema.autoMerge.enabled" : "true"
  }
}

[I 2022-12-06 16:59:00,465.465 configure_magic] Magic cell payload received: {"driverMemory": "1G", "driverCores": 1, "executorMemory": "1G", "executorCores": 1, "conf": {"spark.dynamicAllocation.maxExecutors": 10, "spark.dynamicAllocation.minExecutors": 1, "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension", "spark.serializer": "org.apache.spark.serializer.KryoSerializer", "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog", "spark.databricks.hive.metastore.glueCatalog.enabled": "true", "spark.jars.packages": "io.delta:delta-core_2.12:2.1.0"}, "proxyUser": "assumed-role_cf-emr-studio-1-StudioUserRole-1UMGXJ16SJMRN_emrstudio"}

[I 2022-12-06 16:59:00,467.467 configure_magic] Sending request to update kernel. Please wait while the kernel will be refreshed.


The kernel is successfully refreshed.
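
Outside EMR Studio there is no `%%configure` magic, but the Delta-related settings from the cell above can be applied when a session is built by hand. The sketch below is an illustration under assumptions, not part of the original notebook: `DELTA_CONF` and `build_delta_session` are hypothetical names, and calling the function assumes that `pyspark` is installed and that the `delta-core` package can be resolved at startup.

```python
# Delta-related settings copied from the %%configure cell above.
DELTA_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    "spark.jars.packages": "io.delta:delta-core_2.12:2.1.0",
}


def build_delta_session(app_name="delta-intro"):
    """Create (or reuse) a SparkSession with the Delta settings applied.

    Hypothetical helper; requires pyspark to be installed.
    """
    # Imported lazily so that DELTA_CONF can be inspected without Spark
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in DELTA_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Show which settings would be applied; no Spark session is started here
for key, value in DELTA_CONF.items():
    print(f"{key} = {value}")
```

In a local environment, `spark = build_delta_session()` would then play the role of the `spark` object that EMR Studio provides automatically.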

In [1]:
from datetime import datetime

from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType, TimestampType,
)
from delta.tables import DeltaTable



In [2]:
# Current time truncated to whole seconds; shared by every row in this batch
now = datetime.now().replace(microsecond=0)

InputData = [
    (1,'Prasad Nadig', 25, 'NJ','2022-01-01', now),
    (2,'Ethereum', 80, 'NY', '2022-01-02', now),
    (3,'Cosmos', 25, 'PA', '2022-01-03', now),
    (4,'Solana', 55, 'MD', '2022-01-04', now),
    (5,'Carnado', 15, 'TX', '2022-01-05', now),
    (6,'Link', 45, 'NJ', '2022-01-06', now)
]

#Define schema for the source data
schema = StructType([ \
    StructField("cust_id",IntegerType(),True), \
    StructField("cust_name",StringType(),True), \
    StructField("cust_age",IntegerType(),True), \
    StructField("cust_loc",StringType(),True), \
    StructField("create_date", StringType(), True), \
    StructField("last_updated_time", TimestampType(), True)
  ])

#Create dataframe from the input data and the corresponding schema
inputDF = spark.createDataFrame(data=InputData,schema=schema)

In [None]:
inputDF.show()

In [3]:
# Write a DataFrame as a Delta dataset
inputDF.write.format("delta") \
       .mode("overwrite") \
       .option("overwriteSchema", "true") \
       .partitionBy("create_date") \
       .save("s3://emr-studio-emr-on-eks/tmp/delta/")
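
`partitionBy("create_date")` in the write above makes Spark lay the data files out in Hive-style directories, one per distinct partition value. The following pure-Python sketch only mimics that naming scheme on three of the sample rows. `partition_dirs` is a hypothetical helper, not a Spark or Delta API, and a real table directory also holds Parquet files plus the `_delta_log` folder at the table root.

```python
from collections import defaultdict

# Three of the sample rows, reduced to the columns needed here
rows = [
    {"cust_id": 1, "create_date": "2022-01-01"},
    {"cust_id": 2, "create_date": "2022-01-02"},
    {"cust_id": 6, "create_date": "2022-01-06"},
]


def partition_dirs(rows, column, base):
    """Map each Hive-style partition directory to the cust_ids stored in it."""
    groups = defaultdict(list)
    for row in rows:
        directory = f"{base.rstrip('/')}/{column}={row[column]}/"
        groups[directory].append(row["cust_id"])
    return dict(groups)


layout = partition_dirs(rows, "create_date", "s3://emr-studio-emr-on-eks/tmp/delta")
for directory, ids in layout.items():
    print(directory, ids)
```

Because every sample row has its own `create_date`, each row ends up in its own directory. That is fine for a demo, but a column with this many distinct values relative to the row count would produce many tiny files on a real table.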

In [4]:
deltaPath = 's3://emr-studio-emr-on-eks/tmp/delta/'

In [5]:
df_delta = spark.read.format("delta").load(deltaPath)

In [6]:
df_delta.show()

+-------+------------+--------+--------+-----------+-------------------+
|cust_id|   cust_name|cust_age|cust_loc|create_date|  last_updated_time|
+-------+------------+--------+--------+-----------+-------------------+
|      1|Prasad Nadig|      25|      NJ| 2022-01-01|2022-12-06 16:59:54|
|      2|    Ethereum|      80|      NY| 2022-01-02|2022-12-06 16:59:54|
|      5|     Carnado|      15|      TX| 2022-01-05|2022-12-06 16:59:54|
|      4|      Solana|      55|      MD| 2022-01-04|2022-12-06 16:59:54|
|      3|      Cosmos|      25|      PA| 2022-01-03|2022-12-06 16:59:54|
|      6|        Link|      45|      NJ| 2022-01-06|2022-12-06 16:59:54|
+-------+------------+--------+--------+-----------+-------------------+



In [8]:
#Upsert batch: update cust_age from 25 to 35 for cust_id 1, and re-send cust_id 4 with unchanged
#values (only its last_updated_time changes)
#Insert a new record with cust_id = 7
now = datetime.now().replace(microsecond=0)  # current time truncated to whole seconds
UpdateData = [
    (1,'Prasad Nadig', 35, 'NJ','2022-01-01', now),
    (4,'Solana', 55, 'MD', '2022-01-04', now),
    (7,'Algorand', 55, 'NC', '2022-01-07', now)
]

#Define schema for the source data
schema = StructType([ \
    StructField("cust_id",IntegerType(),True), \
    StructField("cust_name",StringType(),True), \
    StructField("cust_age",IntegerType(),True), \
    StructField("cust_loc",StringType(),True), \
    StructField("create_date", StringType(), True), \
    StructField("last_updated_time", TimestampType(), True)
  ])

#Create dataframe from the input data and the corresponding schema
updateDF = spark.createDataFrame(data=UpdateData,schema=schema)


In [None]:
# Explicit form of the upsert: every column is mapped by hand.
# The next cell expresses the same merge with whenMatchedUpdateAll() and
# whenNotMatchedInsertAll(); running both applies the same upsert twice
# and adds an extra table version.
deltaTable = DeltaTable.forPath(spark, deltaPath)

# Column-to-expression mapping shared by the update and insert clauses
columnMapping = {
    "cust_id": "updatedData.cust_id",
    "cust_name": "updatedData.cust_name",
    "cust_age": "updatedData.cust_age",
    "cust_loc": "updatedData.cust_loc",
    "create_date": "updatedData.create_date",
    "last_updated_time": "updatedData.last_updated_time"
}

deltaTable.alias('targetData') \
    .merge(
        updateDF.alias('updatedData'),
        'targetData.cust_id = updatedData.cust_id') \
    .whenMatchedUpdate(set=columnMapping) \
    .whenNotMatchedInsert(values=columnMapping) \
    .execute()
        
            

In [9]:
deltaTable = DeltaTable.forPath(spark, deltaPath)

deltaTable.alias('targetData') \
    .merge(
        updateDF.alias('updatedData'),
        'targetData.cust_id = updatedData.cust_id') \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()


In [10]:
df_delta_updates = spark.read.format("delta").load(deltaPath)
df_delta_updates.sort('cust_id').show()

+-------+------------+--------+--------+-----------+-------------------+
|cust_id|   cust_name|cust_age|cust_loc|create_date|  last_updated_time|
+-------+------------+--------+--------+-----------+-------------------+
|      1|Prasad Nadig|      35|      NJ| 2022-01-01|2022-12-06 17:01:43|
|      2|    Ethereum|      80|      NY| 2022-01-02|2022-12-06 16:59:54|
|      3|      Cosmos|      25|      PA| 2022-01-03|2022-12-06 16:59:54|
|      4|      Solana|      55|      MD| 2022-01-04|2022-12-06 17:01:43|
|      5|     Carnado|      15|      TX| 2022-01-05|2022-12-06 16:59:54|
|      6|        Link|      45|      NJ| 2022-01-06|2022-12-06 16:59:54|
|      7|    Algorand|      55|      NC| 2022-01-07|2022-12-06 17:01:43|
+-------+------------+--------+--------+-----------+-------------------+
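
The result above follows the usual upsert rule: source rows whose `cust_id` matches a target row replace it, and source rows without a match are inserted. A minimal sketch of that rule with plain dictionaries, using the sample data without the timestamp column, looks like this. It models only the row-level outcome; `upsert` is a hypothetical helper, and nothing here reflects how Delta Lake actually rewrites files.

```python
# Rows keyed by cust_id: (cust_name, cust_age, cust_loc), taken from the sample data
target = {
    1: ("Prasad Nadig", 25, "NJ"),
    2: ("Ethereum", 80, "NY"),
    3: ("Cosmos", 25, "PA"),
    4: ("Solana", 55, "MD"),
    5: ("Carnado", 15, "TX"),
    6: ("Link", 45, "NJ"),
}
source = {
    1: ("Prasad Nadig", 35, "NJ"),  # key exists in target -> update
    4: ("Solana", 55, "MD"),        # key exists in target -> update (same values)
    7: ("Algorand", 55, "NC"),      # key missing in target -> insert
}


def upsert(target, source):
    """Return (merged rows, updated keys, inserted keys) for a key-equality merge."""
    updated = sorted(key for key in source if key in target)
    inserted = sorted(key for key in source if key not in target)
    merged = {**target, **source}  # source rows win on matching keys
    return merged, updated, inserted


merged, updated, inserted = upsert(target, source)
print("updated keys:", updated)
print("inserted keys:", inserted)
print("row count:", len(merged))
```

Rows 2, 3, 5, and 6 are untouched by the merge, which is why they keep their original `last_updated_time` in the table above.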



In [11]:
#Delete batch: the record with cust_id 6 will be removed from the table
now = datetime.now().replace(microsecond=0)  # current time truncated to whole seconds
DeleteData = [
    (6,'Link', 45, 'NJ', '2022-01-06', now),
]

#Define schema for the source data
schema = StructType([ \
    StructField("cust_id",IntegerType(),True), \
    StructField("cust_name",StringType(),True), \
    StructField("cust_age",IntegerType(),True), \
    StructField("cust_loc",StringType(),True), \
    StructField("create_date", StringType(), True), \
    StructField("last_updated_time", TimestampType(), True)
  ])

#Create dataframe from the input data and the corresponding schema
deleteDF = spark.createDataFrame(data=DeleteData,schema=schema)

In [12]:
deltaTable = DeltaTable.forPath(spark, deltaPath)

deltaTable.alias('targetData') \
    .merge(
        deleteDF.alias('deletedData'),
        'targetData.cust_id = deletedData.cust_id') \
    .whenMatchedDelete()\
    .execute()

In [13]:
df_delta_deletes = spark.read.format("delta").load(deltaPath)
df_delta_deletes.sort('cust_id').show()

+-------+------------+--------+--------+-----------+-------------------+
|cust_id|   cust_name|cust_age|cust_loc|create_date|  last_updated_time|
+-------+------------+--------+--------+-----------+-------------------+
|      1|Prasad Nadig|      35|      NJ| 2022-01-01|2022-12-06 17:01:43|
|      2|    Ethereum|      80|      NY| 2022-01-02|2022-12-06 16:59:54|
|      3|      Cosmos|      25|      PA| 2022-01-03|2022-12-06 16:59:54|
|      4|      Solana|      55|      MD| 2022-01-04|2022-12-06 17:01:43|
|      5|     Carnado|      15|      TX| 2022-01-05|2022-12-06 16:59:54|
|      7|    Algorand|      55|      NC| 2022-01-07|2022-12-06 17:01:43|
+-------+------------+--------+--------+-----------+-------------------+



In [14]:
deltaTable.history().show()

+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+------------+--------------------+
|version|          timestamp|userId|userName|operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|    operationMetrics|userMetadata|          engineInfo|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+--------------------+------------+--------------------+
|      2|2022-12-06 17:02:34|  null|    null|    MERGE|{predicate -> (ta...|null|    null|     null|          1|  Serializable|        false|{numTargetRowsCop...|        null|Apache-Spark/3.3....|
|      1|2022-12-06 17:02:02|  null|    null|    MERGE|{predicate -> (ta...|null|    null|     null|          0|  Serializable|        false|{numTargetRowsCop...|        null|Apache-Spark/3.3....|
|      0|2022-1

In [15]:
beforeUpsert = spark.read.format("delta").option("versionAsOf", 0).load(deltaPath)
beforeUpsert.sort('cust_id').show()

+-------+------------+--------+--------+-----------+-------------------+
|cust_id|   cust_name|cust_age|cust_loc|create_date|  last_updated_time|
+-------+------------+--------+--------+-----------+-------------------+
|      1|Prasad Nadig|      25|      NJ| 2022-01-01|2022-12-06 16:59:54|
|      2|    Ethereum|      80|      NY| 2022-01-02|2022-12-06 16:59:54|
|      3|      Cosmos|      25|      PA| 2022-01-03|2022-12-06 16:59:54|
|      4|      Solana|      55|      MD| 2022-01-04|2022-12-06 16:59:54|
|      5|     Carnado|      15|      TX| 2022-01-05|2022-12-06 16:59:54|
|      6|        Link|      45|      NJ| 2022-01-06|2022-12-06 16:59:54|
+-------+------------+--------+--------+-----------+-------------------+



In [17]:
beforeDelete = spark.read.format("delta").option("versionAsOf", 1).load(deltaPath)
beforeDelete.sort('cust_id').show()

+-------+------------+--------+--------+-----------+-------------------+
|cust_id|   cust_name|cust_age|cust_loc|create_date|  last_updated_time|
+-------+------------+--------+--------+-----------+-------------------+
|      1|Prasad Nadig|      35|      NJ| 2022-01-01|2022-12-06 17:01:43|
|      2|    Ethereum|      80|      NY| 2022-01-02|2022-12-06 16:59:54|
|      3|      Cosmos|      25|      PA| 2022-01-03|2022-12-06 16:59:54|
|      4|      Solana|      55|      MD| 2022-01-04|2022-12-06 17:01:43|
|      5|     Carnado|      15|      TX| 2022-01-05|2022-12-06 16:59:54|
|      6|        Link|      45|      NJ| 2022-01-06|2022-12-06 16:59:54|
|      7|    Algorand|      55|      NC| 2022-01-07|2022-12-06 17:01:43|
+-------+------------+--------+--------+-----------+-------------------+
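
Reads with `versionAsOf` work because the transaction log is an ordered list of commits, each recording which data files were added to or removed from the table. Replaying the commits up to a given version yields the set of files that make up that snapshot. The sketch below is a toy model of that replay for the three versions in this notebook. The file names, the one-file-per-partition layout, and the `files_as_of` helper are all assumptions made for illustration; the real log stores richer JSON actions.

```python
# Hypothetical commit log; the list index is the table version.
# Each commit is a list of (action, file) pairs with invented file names.
commits = [
    # version 0: initial WRITE, one file per create_date partition
    [("add", f"create_date=2022-01-0{day}/v0.parquet") for day in range(1, 7)],
    # version 1: MERGE (upsert) replaces the files holding cust_id 1 and 4
    # and adds a file for the new cust_id 7
    [
        ("remove", "create_date=2022-01-01/v0.parquet"),
        ("add", "create_date=2022-01-01/v1.parquet"),
        ("remove", "create_date=2022-01-04/v0.parquet"),
        ("add", "create_date=2022-01-04/v1.parquet"),
        ("add", "create_date=2022-01-07/v1.parquet"),
    ],
    # version 2: MERGE (delete) drops the file that held cust_id 6
    [("remove", "create_date=2022-01-06/v0.parquet")],
]


def files_as_of(log, version):
    """Replay commits 0..version and return the set of live data files."""
    live = set()
    for commit in log[: version + 1]:
        for action, path in commit:
            if action == "add":
                live.add(path)
            else:
                live.discard(path)
    return live


for version in range(len(commits)):
    print(version, sorted(files_as_of(commits, version)))
```

In this model a removed file is only dropped from the live set, not erased from the log, which is what allows an older version to be reconstructed later.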

