# Delta Lake: A Brief Introduction

All code and descriptions below are written by Zoya Shafique, unless otherwise noted.

## <img src = 'https://www.svgrepo.com/show/176852/pin-signs.svg' style="height: 50px; margin: 5px; padding: 5px"/> Overview
---

In this tutorial, you'll learn more about Delta Lake tables, specifically how metadata is handled and how the Delta Lake platform performs ACID transactions. This metadata handling and ACID compliance is what marks Delta Lake as a lakehouse infrastructure rather than a data lake (even though the name Delta Lake is somewhat misleading).

By the end of this tutorial, you'll be able to:

* Understand how to read the transaction log of a Delta table. 
* Understand what separates a Data Lakehouse from a Data Lake. 

## <img src = 'https://www.svgrepo.com/show/176852/pin-signs.svg' style="height: 50px; margin: 5px; padding: 5px"/> Before you begin
---
Before you start the tutorial, you should:

* Create a compute resource. For information on how to initialize your compute, please <a href="https://github.com/zoyashaf/DataLakehouses101/blob/21011a4ffd7e4eb7f045f393720a428b940a3b3b/docs/create_compute.pdf" target="_blank">check here</a>.

#### Additional Resources: 

* Go through the <a href="https://github.com/zoyashaf/DataLakehouses101/blob/4c4224e3f553ba00aebbde3004808f40bafb45ad/Data%20Processing%20with%20Delta%20Lakes/Data%20Ingestion%20Cleaning%20and%20Exploration.ipynb" target="_blank">Data Cleaning notebook</a> to see how Delta Tables can be used in a data processing pipeline.

* Check out the <a href="https://github.com/zoyashaf/DataLakehouses101/blob/4c4224e3f553ba00aebbde3004808f40bafb45ad/docs/Delta%20Lake%20Tables.pdf" target="_blank">Delta Lake Tables docs</a> for a snapshot of a Delta Table. 
 


## <img src='https://www.svgrepo.com/show/229520/lake.svg' style="height: 80px; margin: 5px; padding: 5px"/>  The Delta Lake

A traditional data lake is a centralized repository used for storing large amounts of data in a variety of formats. Data lakes can store structured and unstructured data in a very cost-effective manner. However, data lakes lack the organization and structure needed to maintain consistent, high-quality data while allowing multiple users to interact with the data simultaneously. 

This is where Delta Lake comes in. Delta Lake is a data lakehouse, that is, a data management system that adds structure and organization on top of a traditional data lake. Delta Lake adds two main components on top of a data lake, which make it much more efficient and effective in practice:  

* Metadata storage and tracking 
* ACID transactions. 

In this notebook, we will walk through both aspects of a Delta Lake and discuss how they are implemented. 



In [0]:
## We first need to create our Spark session 
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate() 

### <img src='https://www.svgrepo.com/show/122877/presentation.svg' style="height: 65px; margin: 5px; padding: 5px"/> Loading Data
---
Delta tables can be created in multiple ways in Databricks. Here we walk through two methods in PySpark.

In [0]:
## Loading data from source file 
file_path = "dbfs:/FileStore/shared_uploads/zshafiq001@citymail.cuny.edu/car_details_v4_edited-1.csv" 
car_data = spark.read.format("csv").option("header", "true").load(file_path)


In [0]:
# We can save our DataFrame as a Delta Table by specifying the format as "delta". The data will be saved in our tmp directory with its current schema

save_path = 'dbfs:/tmp/car_data/'
car_data.write.format("delta").mode("overwrite").save(save_path)



The above command creates a directory called car_data in our dbfs:/tmp folder. This directory contains two items:

* A folder called **`_delta_log`**
* A **`part-XXXXX.snappy.parquet`** file 

The **`_delta_log`** directory contains the transaction log files for our table. The **`part-XXXXX.snappy.parquet`** file stores our data in Parquet format. Note that we can save our data in partitions. If we had specified more than one partition, we would have had multiple **`part-XXXXX.snappy.parquet`** files in our folder, one representing each partition. 

<img src='https://github.com/zoyashaf/DataLakehouses101/blob/main/figures/tmp_car_data.PNG?raw=true' style="height: 200px; margin: 5px; padding: 5px"/> 


Another way to create Delta Tables is to use the **`.saveAsTable()`** command directly (as done in the Data Cleaning notebook). Unlike the **`.save()`** command used above, **`.saveAsTable()`** registers the table in the metastore. Put simply, instead of existing only as files at a path in our file storage, the table appears in our "Database Tables", where we can access it and view related metadata. 

Explicitly specifying the delta format and saving the table to a path in DBFS is useful when we want to reuse the data later. The **`.saveAsTable()`** command allows us not only to reuse the data later but also to view the related metadata in a user-friendly interface. More information about this interface can be found in our  <a href="https://github.com/zoyashaf/DataLakehouses101/blob/4c4224e3f553ba00aebbde3004808f40bafb45ad/docs/Delta%20Lake%20Tables.pdf" target="_blank">Delta Tables docs.</a>

<img src='https://github.com/zoyashaf/DataLakehouses101/blob/main/figures/database_car_data.PNG?raw=true' style="height: 200px; margin: 5px; padding: 5px"/> 

In [0]:
table_name = "car_data_table"
car_data.write.saveAsTable(table_name)

### <img src='https://www.svgrepo.com/show/373282/log-opened.svg' style="height: 65px; margin: 5px; padding: 5px"/>   Transaction logs and ACID Compliance
---
Now that we understand how to create Delta Tables, let's walk through what sets Delta Tables apart from traditional data lakes. It is important to remember that although Databricks has termed its data engine Delta Lake, it is a data lakehouse rather than a data lake. 

It is also important to note that as data management infrastructures evolve, the lines between different structures blur. In this case, when we refer to 'traditional data lakes', we are referring to the management system which stores vast amounts of data, structured or unstructured, but does not have ACID-compliant transactions. 

In the case of the Delta Lake engine, the transaction logs created with every table are used to ensure ACID compliance and that the version of data seen by the user is the <b>_single source of truth_</b>. The transaction logs keep a record of all the changes made to the Delta table. When a user performs an operation on the Delta table, Spark uses the transaction log to see what new transactions have occurred since the last operation and updates the user's table. Protocols are in place so that multiple users can work on the data without conflicts, which keeps updates to the data seamless and consistent. 


In [0]:
## We can view the files in our directory with the following commands and confirm that our table has saved as expected 
display(dbutils.fs.ls(save_path))

path,name,size,modificationTime
dbfs:/tmp/car_data/_delta_log/,_delta_log/,0,0
dbfs:/tmp/car_data/part-00000-f3095cbe-3ac3-45e7-99a0-9eadeec303cd-c000.snappy.parquet,part-00000-f3095cbe-3ac3-45e7-99a0-9eadeec303cd-c000.snappy.parquet,69682,1716240669000


In [0]:
## _delta_log contains our transaction files. Let's take a look at the current status. 
display(dbutils.fs.ls(f"{save_path}/_delta_log/"))

path,name,size,modificationTime
dbfs:/tmp/car_data/_delta_log/.s3-optimization-0,.s3-optimization-0,0,1716240675000
dbfs:/tmp/car_data/_delta_log/.s3-optimization-1,.s3-optimization-1,0,1716240675000
dbfs:/tmp/car_data/_delta_log/.s3-optimization-2,.s3-optimization-2,0,1716240675000
dbfs:/tmp/car_data/_delta_log/00000000000000000000.crc,00000000000000000000.crc,5127,1716240689000
dbfs:/tmp/car_data/_delta_log/00000000000000000000.json,00000000000000000000.json,4127,1716240675000


In [0]:
## Let's display the first transaction log. 
display(spark.read.json(f"{save_path}/_delta_log/00000000000000000000.json"))

add,commitInfo,metaData,protocol
,"List(0520-211800-eqpq177b, Databricks-Runtime/12.2.x-cpu-ml-scala2.12, false, WriteSerializable, List(3488725691099835), WRITE, List(1, 69682, 2059), List(Overwrite, []), 1716240674522, 92a788e9-f2e8-4428-b1fc-38595c43c01e, 2625471505770983, zshafiq001@citymail.cuny.edu)",,
,,,"List(1, 2)"
,,"List(1716240659290, List(parquet), 21933cfd-9f19-44b8-a2b7-c12038586f14, List(), {""type"":""struct"",""fields"":[{""name"":""make"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""model"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""price"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""year"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""kilometer"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""fuel_type"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""transmission"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""location"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""color"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""owner"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""seller_type"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""engine"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""max_power"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""max_torque"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""drivetrain"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""length"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""width"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""height"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""seating_capacity"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""fuel_tank_capacity"",""type"":""string"",""nullable"":true,""metadata"":{}}]})",
"List(true, 1716240669000, part-00000-f3095cbe-3ac3-45e7-99a0-9eadeec303cd-c000.snappy.parquet, 69682, {""numRecords"":2059,""minValues"":{""make"":""Audi"",""model"":""2 Series Gran Coupe 220d M Sport"",""price"":""100000"",""year"":""1988"",""kilometer"":""0"",""fuel_type"":""CNG"",""transmission"":""Automatic"",""location"":""Agra"",""color"":""Beige"",""owner"":""4 or More"",""seller_type"":""Commercial Registration"",""engine"":""1047 cc"",""max_power"":""100 bhp @ 3600 rpm"",""max_torque"":""101 Nm @ 3000 rpm"",""drivetrain"":""AWD"",""length"":""3099"",""width"":""1475"",""height"":""1165"",""seating_capacity"":""2"",""fuel_tank_capacity"":""100""},""maxValues"":{""make"":""Volvo"",""model"":""i20 Sportz 1.4 CRDI"",""price"":""999000"",""year"":""2022"",""kilometer"":""99000"",""fuel_type"":""Petrol + LPG"",""transmission"":""Manual"",""location"":""Zirakpur"",""color"":""Yellow"",""owner"":""UnRegistered Car"",""seller_type"":""Individual"",""engine"":""999 cc"",""max_power"":""99 bhp @ 5000 rpm"",""max_torque"":""99@2800"",""drivetrain"":""RWD"",""length"":""5569"",""width"":""2220"",""height"":""1995"",""seating_capacity"":""8"",""fuel_tank_capacity"":""95""},""nullCount"":{""make"":0,""model"":0,""price"":0,""year"":0,""kilometer"":10,""fuel_type"":0,""transmission"":0,""location"":0,""color"":0,""owner"":0,""seller_type"":0,""engine"":80,""max_power"":80,""max_torque"":80,""drivetrain"":136,""length"":65,""width"":65,""height"":65,""seating_capacity"":69,""fuel_tank_capacity"":117}}, List(1716240669000000, 1716240669000000, 1716240669000000, 268435456))",,,


#### <img src='https://www.svgrepo.com/show/170412/notebook.svg' style="height: 65px; margin: 5px; padding: 5px"/> Task 1: 

Look through the transaction log displayed above and answer the following questions: 

1. What are the column names in the json file?
2. What do the rows represent? 
3. Which transaction does this log detail? 


-------------------------------------------------------------------------------------

* The columns are **`add`**, **`commitInfo`**, **`metaData`**, and **`protocol`**
   * **`add`** describes the data file added to the table, including its path, size, and column statistics (record count, minimum and maximum values, and null counts) 
   * **`commitInfo`** details who made what change to the data. It contains the username and the operation committed, in addition to the cluster information, the runtime used, and more. 
   * **`metaData`** presents information about the table, including its schema 
   * **`protocol`** shows the minimum reader and writer protocol versions (here 1 and 2) that a client must support to work with the Delta Table

* The rows represent each atomic action that the operation is broken down into. More on this in the next section. 

* The creation of the Delta Table.
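
To make the row-per-action structure concrete, here is a minimal sketch that parses a commit file by hand, without Spark. Each line of a commit file is a separate JSON object whose single top-level key names the action. The sample below is heavily abbreviated from the log displayed above, and the helper name `list_actions` is our own, so treat it as an illustration rather than a complete reader.

```python
import json

# Abbreviated stand-in for 00000000000000000000.json: one JSON object per line.
# Most fields are omitted; the values are copied from the log displayed above.
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"timestamp": 1716240674522, "operation": "WRITE"}}),
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    json.dumps({"metaData": {"id": "21933cfd-9f19-44b8-a2b7-c12038586f14",
                             "format": {"provider": "parquet"}}}),
    json.dumps({"add": {"path": "part-00000-f3095cbe-3ac3-45e7-99a0-9eadeec303cd-c000.snappy.parquet",
                        "size": 69682, "dataChange": True}}),
])

def list_actions(commit_text):
    """Return (action_name, payload) pairs, one per non-empty line."""
    actions = []
    for line in commit_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Each line holds exactly one action, keyed by its type.
        (name, payload), = record.items()
        actions.append((name, payload))
    return actions

actions = list_actions(sample_commit)
for name, payload in actions:
    print(name, sorted(payload))
```

This also suggests why the Spark output above shows four columns with mostly empty cells: the keys from all lines are merged into one schema, and each row fills only the column for its own action.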

#### <img src = 'https://www.svgrepo.com/show/499853/idea.svg' style="height: 60px; margin: 5px; padding: 5px"/>  Atomicity

Atomicity makes up the 'A' in ACID. Atomicity means treating an operation as a single unit and guarantees that an operation either completes in full or does not happen at all. Adding atomicity to the data lake ensures that failed jobs cannot go unnoticed and corrupt the data. The transaction log enforces atomicity: if an operation is not recorded in the transaction log, it did not happen. This removes the possibility of partially completed actions, as the transaction log only records actions that execute fully. When this log is then used to update users' tables, the Delta Engine ensures that only completed operations are applied. 
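
The "not in the log, did not happen" rule can be sketched with a toy in-memory table. This is a simplified model under our own assumptions, not Delta Lake's implementation: the class `TinyTable`, its methods, and the file names are all hypothetical.

```python
class TinyTable:
    """Toy model: storage holds files, the log decides which ones count."""

    def __init__(self):
        self.storage = {}   # file name -> rows (stand-in for Parquet files)
        self.log = []       # committed entries, each a list of file names

    def write(self, file_name, rows, fail_before_commit=False):
        # Step 1: write the data file. It is not yet visible to readers.
        self.storage[file_name] = rows
        if fail_before_commit:
            raise RuntimeError("job failed before the commit was recorded")
        # Step 2: record the commit. Only now does the write count.
        self.log.append([file_name])

    def read(self):
        # Readers consult the log only, so orphaned files are ignored.
        rows = []
        for entry in self.log:
            for file_name in entry:
                rows.extend(self.storage[file_name])
        return rows

table = TinyTable()
table.write("part-0.parquet", ["row 1", "row 2"])
try:
    table.write("part-1.parquet", ["row 3"], fail_before_commit=True)
except RuntimeError:
    pass  # the failed job left a file behind but no log entry

print(table.read())           # ['row 1', 'row 2']
print(sorted(table.storage))  # ['part-0.parquet', 'part-1.parquet']
```

The failed write leaves an orphaned file in storage, but because readers go through the log, the table contents are unchanged.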


In [0]:
## Let's update the table by adding a new column 
from pyspark.sql.functions import col, when, translate

## As in the data cleaning notebook, we first need to convert the kilometer column to an integer 
car_data = car_data.withColumn('kilometer', translate(col('kilometer'), ' ', '').cast('int'))
car_data = car_data.withColumn("100k_miles", when(col("kilometer") > 100000, True).otherwise(False))

display(car_data.limit(5))


make,model,price,year,kilometer,fuel_type,transmission,location,color,owner,seller_type,engine,max_power,max_torque,drivetrain,length,width,height,seating_capacity,fuel_tank_capacity,100k_miles
Honda,Amaze 1.2 VX i-VTEC,505000,2017,87150,Petrol,Manual,Pune,Grey,First,Corporate,1198 cc,87 bhp @ 6000 rpm,109 Nm @ 4500 rpm,FWD,3990,1680,1505,5,35,False
Maruti Suzuki,Swift DZire VDI,450000,2014,75000,Diesel,Manual,Ludhiana,White,Second,Individual,1248 cc,74 bhp @ 4000 rpm,190 Nm @ 2000 rpm,FWD,3995,1695,1555,5,42,False
Hyundai,i10 Magna 1.2 Kappa2,220000,2011,67000,Petrol,Manual,Lucknow,Maroon,First,Individual,1197 cc,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,FWD,3585,1595,1550,5,35,False
Toyota,Glanza G,799000,2019,37500,Petrol,Manual,Mangalore,Red,First,Individual,1197 cc,82 bhp @ 6000 rpm,113 Nm @ 4200 rpm,FWD,3995,1745,1510,5,37,False
Toyota,Innova 2.4 VX 7 STR [2016-2020],1950000,2018,69000,Diesel,Manual,Mumbai,Grey,First,Individual,2393 cc,148 bhp @ 3400 rpm,343 Nm @ 1400 rpm,RWD,4735,1830,1795,7,55,False


In [0]:
''' 
Let's save our updated table. 
.option("overwriteSchema", "true") replaces the existing schema with the updated one. We need this because we changed the datatype of the kilometer column and added a new column, both of which are recorded as part of the schema. 
'''

car_data.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(save_path)

In [0]:
# Displaying our transaction logs again, we see that there is a new log added to our folder. 
display(dbutils.fs.ls(f"{save_path}/_delta_log/"))

path,name,size,modificationTime
dbfs:/tmp/car_data/_delta_log/.s3-optimization-0,.s3-optimization-0,0,1716240675000
dbfs:/tmp/car_data/_delta_log/.s3-optimization-1,.s3-optimization-1,0,1716240675000
dbfs:/tmp/car_data/_delta_log/.s3-optimization-2,.s3-optimization-2,0,1716240675000
dbfs:/tmp/car_data/_delta_log/00000000000000000000.crc,00000000000000000000.crc,5127,1716240689000
dbfs:/tmp/car_data/_delta_log/00000000000000000000.json,00000000000000000000.json,4127,1716240675000
dbfs:/tmp/car_data/_delta_log/00000000000000000001.crc,00000000000000000001.crc,3444,1716247166000
dbfs:/tmp/car_data/_delta_log/00000000000000000001.json,00000000000000000001.json,4543,1716247162000


In [0]:
display(spark.read.json(f"{save_path}/_delta_log/00000000000000000001.json"))

add,commitInfo,metaData,remove
,"List(0520-211800-eqpq177b, Databricks-Runtime/12.2.x-cpu-ml-scala2.12, false, WriteSerializable, List(3488725691099835), WRITE, List(1, 68746, 2059), List(Overwrite, []), 0, 1716247161713, e2b5a0d8-1aaa-4ef0-8c1c-967be4f6cbb0, 2625471505770983, zshafiq001@citymail.cuny.edu)",,
,,"List(1716240659290, List(parquet), 21933cfd-9f19-44b8-a2b7-c12038586f14, List(), {""type"":""struct"",""fields"":[{""name"":""make"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""model"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""price"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""year"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""kilometer"",""type"":""integer"",""nullable"":true,""metadata"":{}},{""name"":""fuel_type"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""transmission"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""location"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""color"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""owner"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""seller_type"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""engine"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""max_power"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""max_torque"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""drivetrain"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""length"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""width"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""height"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""seating_capacity"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""fuel_tank_capacity"",""type"":""string"",""nullable"":true,""metadata"":{}},{""name"":""100k_miles"",""type"":""boolean"",""nullable"":true,""metadata"":{}}]})",
"List(true, 1716247161000, part-00000-26e6a466-1c12-405c-80ed-62a0c5ebdd2d-c000.snappy.parquet, 68746, {""numRecords"":2059,""minValues"":{""make"":""Audi"",""model"":""2 Series Gran Coupe 220d M Sport"",""price"":""100000"",""year"":""1988"",""kilometer"":0,""fuel_type"":""CNG"",""transmission"":""Automatic"",""location"":""Agra"",""color"":""Beige"",""owner"":""4 or More"",""seller_type"":""Commercial Registration"",""engine"":""1047 cc"",""max_power"":""100 bhp @ 3600 rpm"",""max_torque"":""101 Nm @ 3000 rpm"",""drivetrain"":""AWD"",""length"":""3099"",""width"":""1475"",""height"":""1165"",""seating_capacity"":""2"",""fuel_tank_capacity"":""100""},""maxValues"":{""make"":""Volvo"",""model"":""i20 Sportz 1.4 CRDI"",""price"":""999000"",""year"":""2022"",""kilometer"":2000000,""fuel_type"":""Petrol + LPG"",""transmission"":""Manual"",""location"":""Zirakpur"",""color"":""Yellow"",""owner"":""UnRegistered Car"",""seller_type"":""Individual"",""engine"":""999 cc"",""max_power"":""99 bhp @ 5000 rpm"",""max_torque"":""99@2800"",""drivetrain"":""RWD"",""length"":""5569"",""width"":""2220"",""height"":""1995"",""seating_capacity"":""8"",""fuel_tank_capacity"":""95""},""nullCount"":{""make"":0,""model"":0,""price"":0,""year"":0,""kilometer"":10,""fuel_type"":0,""transmission"":0,""location"":0,""color"":0,""owner"":0,""seller_type"":0,""engine"":80,""max_power"":80,""max_torque"":80,""drivetrain"":136,""length"":65,""width"":65,""height"":65,""seating_capacity"":69,""fuel_tank_capacity"":117,""100k_miles"":0}}, List(1716247161000000, 1716247161000000, 1716247161000000, 268435456))",,,
,,,"List(true, 1716247161704, true, part-00000-f3095cbe-3ac3-45e7-99a0-9eadeec303cd-c000.snappy.parquet, 69682, List(1716240669000000, 1716240669000000, 1716240669000000, 268435456))"


Inspecting the transaction log above, we can see in the 'metaData' column that the datatype of the kilometer column has changed from string to integer, and that the '100k_miles' column has been added to the schema. The new 'remove' column marks the original Parquet file as no longer part of the table, since the overwrite replaced it with a new file. If the write had not executed completely, the transaction log would not have been updated. In this way, the transaction log keeps a record of completed actions and ensures atomicity.
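
Rather than comparing the two long schema strings by eye, we can diff them with a few lines of Python. This is a minimal sketch: the schema strings below keep only a few of the columns from the logs above, and the helper names `column_types` and `diff_schemas` are our own.

```python
import json

def column_types(schema_string):
    """Map column name -> type from a schema JSON string."""
    schema = json.loads(schema_string)
    return {field["name"]: field["type"] for field in schema["fields"]}

def diff_schemas(old, new):
    """Return (added, changed) between two name -> type mappings."""
    added = {name: t for name, t in new.items() if name not in old}
    changed = {name: (old[name], new[name])
               for name in old if name in new and old[name] != new[name]}
    return added, changed

# Abbreviated schema strings: only a few of the columns from the logs above.
schema_v0 = json.dumps({"type": "struct", "fields": [
    {"name": "make", "type": "string", "nullable": True, "metadata": {}},
    {"name": "kilometer", "type": "string", "nullable": True, "metadata": {}},
]})
schema_v1 = json.dumps({"type": "struct", "fields": [
    {"name": "make", "type": "string", "nullable": True, "metadata": {}},
    {"name": "kilometer", "type": "integer", "nullable": True, "metadata": {}},
    {"name": "100k_miles", "type": "boolean", "nullable": True, "metadata": {}},
]})

added, changed = diff_schemas(column_types(schema_v0), column_types(schema_v1))
print(added)    # {'100k_miles': 'boolean'}
print(changed)  # {'kilometer': ('string', 'integer')}
```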

#### <img src = 'https://www.svgrepo.com/show/499853/idea.svg' style="height: 60px; margin: 5px; padding: 5px"/> Consistency

Consistency, or the 'C' in ACID, is extremely important when working with large datasets and collaborating with multiple users. This becomes especially important in practical settings, when one dataset is being used across teams and departments. We need to ensure everyone is working off of the same data so that the work being done is accurate and up to date. 

Based on our earlier discussions, we know that the transaction log is what is used to update Delta tables. Since the transaction log is the single source of truth, we know that any updates made to the table will be the same across the board. However, if there are multiple users writing to the table, how does the transaction log know which changes to commit and in which order? 

It is safe to assume that with petabytes of data, two users will usually be working on separate parts at any given time. This is, of course, completely okay. My work and your work will not interfere if we are looking at completely different parts of the data. The problem arises only when we need the same part of the data at the same time.

Let's say you and I are both out grocery shopping and we both really want to make pasta. If there are a lot of boxes of pasta on the shelf, we will each happily grab one and be on our way. However, if there is only one box left and we both reach for it at the same time ... well, that would be awkward. 

Similarly, when handling transactions on a Delta Table, Delta Lake assumes that most of the time there is no conflict and lets each user proceed without locking the table. This method of control is known as <b>_optimistic concurrency_</b>. Of course, conflicts can still arise. How Delta Lake handles these conflicts brings us to the concept of isolation, the next letter in our ACID transaction. 
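
The idea that work on different parts of the data does not interfere can be sketched as a simple set-overlap check. This is a deliberately simplified model, not Delta Lake's actual conflict detection, and the function and file names are hypothetical.

```python
def conflicts(files_i_read, files_they_changed):
    """Two transactions clash only if one changed a file the other read."""
    return bool(set(files_i_read) & set(files_they_changed))

# Plenty of pasta on the shelf: we touch different files.
print(conflicts({"part-0001.parquet"}, {"part-0002.parquet"}))  # False

# One box left: we both need the same file.
print(conflicts({"part-0001.parquet"}, {"part-0001.parquet"}))  # True
```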


####  <img src = 'https://www.svgrepo.com/show/499853/idea.svg' style="height: 60px; margin: 5px; padding: 5px"/> Isolation

Unfortunately, in the case of our grocery store example, one of us will have a pasta-less dinner. The same is not true, however, for ACID-compliant transactions. Isolation ensures that simultaneous actions by different users do not impact each other. For isolation to hold, we need a way to determine how to commit transactions to our transaction log. 

Delta Lake handles simultaneous operations by treating commits as mutually exclusive. If two users write to the same section of data at the same time, only one of the two operations (say, User 1's) is committed as the next transaction, let's say 000001.json. 

However, User 2 doesn't have to worry. Rather than throwing an error for User 2, the Delta Engine checks whether there are any new commits since User 2's last operation and updates the user's table to reflect those changes. Then, it commits User 2's changes on the updated table. This commit, which was originally attempted at the same time as the commit recorded in 000001.json, is saved as 000002.json. 

As stated in <a href="https://www.databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html" target="_blank">this Databricks post</a>*, the process proceeds like this:
- Record the starting table version.
- Record reads/writes.
- Attempt a commit.
- If someone else wins, check whether anything you read has changed.
- Repeat.

By implementing optimistic concurrency and treating all commits as mutually exclusive, the Delta Engine ensures isolation of operations. 

*Taken from the linked Databricks post
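
The sketch below is our own illustration, not code from the Databricks post. It shows how the listed steps might fit together using a toy in-memory log. The names `TinyLog`, `commit_with_retry`, and `CommitConflict` are hypothetical, and a real engine works against files in cloud storage rather than a Python list.

```python
class CommitConflict(Exception):
    """Raised when a transaction read data that a newer commit changed."""

class TinyLog:
    def __init__(self):
        self.commits = []  # index = version; each entry = set of files changed

    def latest_version(self):
        return len(self.commits) - 1

    def try_commit(self, version, changed_files):
        # Mutual exclusion: only one writer can claim a given version number.
        if version != len(self.commits):
            return False
        self.commits.append(set(changed_files))
        return True

def commit_with_retry(log, start_version, read_files, changed_files, max_attempts=5):
    """Follow the listed steps; return the version that was committed."""
    seen = start_version                        # record the starting version
    # read_files / changed_files are the recorded reads and writes
    for _ in range(max_attempts):
        if log.try_commit(seen + 1, changed_files):   # attempt a commit
            return seen + 1
        # Someone else won: check whether anything we read has changed.
        latest = log.latest_version()
        for version in range(seen + 1, latest + 1):
            if log.commits[version] & set(read_files):
                raise CommitConflict(f"version {version} changed files we read")
        seen = latest                           # repeat on top of the new version
    raise RuntimeError("gave up after too many attempts")

log = TinyLog()
log.try_commit(0, {"part-a"})                   # table created as version 0

# Both users start from version 0 and change different files.
user1 = commit_with_retry(log, 0, read_files={"part-a"}, changed_files={"part-b"})
user2 = commit_with_retry(log, 0, read_files={"part-a"}, changed_files={"part-c"})
print(user1, user2)  # 1 2
```

User 2 loses the race for version 1, finds that the winning commit did not touch anything it read, and lands as version 2. In this toy model a genuine overlap raises `CommitConflict`; how a real engine surfaces such conflicts is outside the scope of the sketch.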

####  <img src = 'https://www.svgrepo.com/show/499853/idea.svg' style="height: 60px; margin: 5px; padding: 5px"/>  Durability

Durability! The last part of our ACID-compliant transaction. Durability means that once a change has been committed to the table, it persists and is not lost or silently overwritten. The Delta Lake Engine ensures durable transactions by keeping a record of the data over time. In other words, Delta Tables are automatically assigned versions, and any of the previous versions can be easily accessed. This ensures not only durability but also reproducibility in experiments and analysis. It is especially important for machine learning applications, where data may change over time. 
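
One way to picture why earlier versions stay reachable is to replay the log up to a chosen version. The sketch below is a simplified model using the `add` and `remove` actions from the two commits shown earlier in this notebook. The function name `snapshot` is our own, and real commits carry more information than a file path.

```python
# Abbreviated actions from the two commits shown earlier in this notebook.
FILE_V0 = "part-00000-f3095cbe-3ac3-45e7-99a0-9eadeec303cd-c000.snappy.parquet"
FILE_V1 = "part-00000-26e6a466-1c12-405c-80ed-62a0c5ebdd2d-c000.snappy.parquet"

commits = [
    [("add", FILE_V0)],                        # 00000000000000000000.json
    [("add", FILE_V1), ("remove", FILE_V0)],   # 00000000000000000001.json
]

def snapshot(commits, version):
    """Replay commits 0..version and return the set of live data files."""
    live = set()
    for actions in commits[: version + 1]:
        for kind, path in actions:
            if kind == "add":
                live.add(path)
            elif kind == "remove":
                live.discard(path)
    return live

print(snapshot(commits, 0) == {FILE_V0})  # True
print(snapshot(commits, 1) == {FILE_V1})  # True
```

In this model the overwrite does not erase the original file. The log only marks it as removed, so replaying up to version 0 still points at it.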

Two ways to access previous records of the data are:
* Using a time stamp
* Using the version number

Unfortunately, we do not have multiple days of data and transactions to display, but we have included the time stamp option here for completeness.

In [0]:
## time stamp
table1 = spark.read.format("delta").option("timestampAsOf", "2024-05-20").load(save_path)

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-3488725691099848>:2
      1 ## time stamp
----> 2 table1 = spark.read.format("delta").option("timestampAsOf", "2024-20-05").load(save_path)

File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     46 start [

In [0]:
## version number 
table1 = spark.read.format("delta").option("versionAsOf", "0").load(save_path)
display(table1.limit(5))

make,model,price,year,kilometer,fuel_type,transmission,location,color,owner,seller_type,engine,max_power,max_torque,drivetrain,length,width,height,seating_capacity,fuel_tank_capacity
Honda,Amaze 1.2 VX i-VTEC,505000,2017,87150,Petrol,Manual,Pune,Grey,First,Corporate,1198 cc,87 bhp @ 6000 rpm,109 Nm @ 4500 rpm,FWD,3990,1680,1505,5,35
Maruti Suzuki,Swift DZire VDI,450000,2014,75000,Diesel,Manual,Ludhiana,White,Second,Individual,1248 cc,74 bhp @ 4000 rpm,190 Nm @ 2000 rpm,FWD,3995,1695,1555,5,42
Hyundai,i10 Magna 1.2 Kappa2,220000,2011,67000,Petrol,Manual,Lucknow,Maroon,First,Individual,1197 cc,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,FWD,3585,1595,1550,5,35
Toyota,Glanza G,799000,2019,37500,Petrol,Manual,Mangalore,Red,First,Individual,1197 cc,82 bhp @ 6000 rpm,113 Nm @ 4200 rpm,FWD,3995,1745,1510,5,37
Toyota,Innova 2.4 VX 7 STR [2016-2020],1950000,2018,69000,Diesel,Manual,Mumbai,Grey,First,Individual,2393 cc,148 bhp @ 3400 rpm,343 Nm @ 1400 rpm,RWD,4735,1830,1795,7,55


In [0]:
## version number 
table2 = spark.read.format("delta").option("versionAsOf", "1").load(save_path)
display(table2.limit(5))

make,model,price,year,kilometer,fuel_type,transmission,location,color,owner,seller_type,engine,max_power,max_torque,drivetrain,length,width,height,seating_capacity,fuel_tank_capacity,100k_miles
Honda,Amaze 1.2 VX i-VTEC,505000,2017,87150,Petrol,Manual,Pune,Grey,First,Corporate,1198 cc,87 bhp @ 6000 rpm,109 Nm @ 4500 rpm,FWD,3990,1680,1505,5,35,False
Maruti Suzuki,Swift DZire VDI,450000,2014,75000,Diesel,Manual,Ludhiana,White,Second,Individual,1248 cc,74 bhp @ 4000 rpm,190 Nm @ 2000 rpm,FWD,3995,1695,1555,5,42,False
Hyundai,i10 Magna 1.2 Kappa2,220000,2011,67000,Petrol,Manual,Lucknow,Maroon,First,Individual,1197 cc,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,FWD,3585,1595,1550,5,35,False
Toyota,Glanza G,799000,2019,37500,Petrol,Manual,Mangalore,Red,First,Individual,1197 cc,82 bhp @ 6000 rpm,113 Nm @ 4200 rpm,FWD,3995,1745,1510,5,37,False
Toyota,Innova 2.4 VX 7 STR [2016-2020],1950000,2018,69000,Diesel,Manual,Mumbai,Grey,First,Individual,2393 cc,148 bhp @ 3400 rpm,343 Nm @ 1400 rpm,RWD,4735,1830,1795,7,55,False


You can see from the above two cells that version 0 represents our original data and version 1 includes the column we added. 
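
The time stamp option can be thought of as first resolving to a version number. The sketch below assumes a time stamp maps to the latest commit made at or before it, and uses the two `commitInfo` timestamps from the logs above. It is a simplified model with a hypothetical helper name, `version_as_of`; the real engine may rely on different timestamps and report errors differently.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Commit timestamps (milliseconds since epoch) from the two commitInfo rows above.
commit_times_ms = [1716240674522, 1716247161713]   # version 0, version 1

def version_as_of(commit_times_ms, timestamp):
    """Return the latest version committed at or before `timestamp`."""
    target_ms = int(timestamp.timestamp() * 1000)
    index = bisect_right(commit_times_ms, target_ms)
    if index == 0:
        raise ValueError("timestamp is before the first commit")
    return index - 1

# A moment between the two commits resolves to version 0.
between = datetime(2024, 5, 20, 22, 0, tzinfo=timezone.utc)
print(version_as_of(commit_times_ms, between))  # 0
```

Under this model, a time stamp earlier than the first commit has no version to resolve to, so the lookup is rejected rather than returning an empty table.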


## <img src = 'https://www.svgrepo.com/show/176852/pin-signs.svg' style="height: 50px; margin: 5px; padding: 5px"/> Summary
---
In summary, the transaction log and accompanying protocols add structure and organization on top of the data lake storage in Delta Lake, making Delta Lake a lakehouse. ACID compliance and metadata handling add reliability on top of the data storage, ensure data quality and consistency, guard against data corruption, and even aid reproducibility and auditing. 

Specifically, in this tutorial, you learned:

* How to create Delta tables
* How Delta Tables ensure ACID compliant transactions
* How to view and understand transaction logs
* How to view previous data versions 


## <img src = 'https://www.svgrepo.com/show/199671/next.svg' style="height: 50px; margin: 5px; padding: 5px"/> Next steps
---
View the following Databricks resources to learn more about the Delta Lake Engine and ACID compliance:

* <a href="https://docs.databricks.com/en/delta/index.html" target="_blank">What is Delta Lake?</a>
* <a href="https://www.databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html" target="_blank">Diving Into Delta Lake: Unpacking The Transaction Log</a>
* <a href="https://docs.databricks.com/en/lakehouse/acid.html" target="_blank">What are ACID guarantees on Databricks?</a>
