### Data Lake (Delta Lake) + Lakehouse (Delta tables) - using the Delta format (Parquet + Snappy + Delta log)

Delta Lake is an open-source storage framework that brings reliability, ACID transactions, and performance to data lakes. It sits on top of Parquet files and is most commonly used with Apache Spark and Databricks.<br>
Delta Lake is the core storage layer behind Bronze–Silver–Gold (medallion) architectures.
<img src="https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-logo-whitebackground.png" style="width:300px; float: right"/>

## ![](https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-tiny-logo.png) Creating our first Delta Lake table

Delta is the default file and table format on Databricks.


1. Delta Lake | Iceberg | Parquet<br>
This is the physical data layer where the data actually lives (usually on cloud storage such as S3, ADLS, or GCS).<br>
Delta Lake stores the data as Parquet files and adds:<br>
ACID transactions: a write either commits completely or not at all.<br>
Time travel: you can go back in time and query older versions of the data.<br>
Schema enforcement: Delta Lake rejects writes that do not match the table schema.<br>
Incremental processing: process only new or changed data instead of reprocessing the entire dataset.<br>
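
To make schema enforcement and all-or-nothing writes concrete, here is a minimal pure-Python sketch. It is a toy model, not Delta Lake's implementation: `TABLE_SCHEMA`, `validate_row`, and `append_rows` are hypothetical names, and the rules are simplified (extra columns and wrong types are rejected, missing columns are filled with `None`).

```python
# Toy model of schema enforcement; NOT how Delta Lake is implemented.
# The schema below is a hypothetical example.
TABLE_SCHEMA = {"uniqueid": int, "drugname": str, "rating": float}

def validate_row(row, schema=TABLE_SCHEMA):
    """Return a row that conforms to the schema, or raise ValueError."""
    extra = set(row) - set(schema)
    if extra:
        raise ValueError(f"unknown columns: {sorted(extra)}")
    clean = {}
    for name, expected in schema.items():
        value = row.get(name)  # missing columns become None (null)
        if value is not None and not isinstance(value, expected):
            raise ValueError(
                f"column {name!r} expects {expected.__name__}, "
                f"got {type(value).__name__}"
            )
        clean[name] = value
    return clean

def append_rows(table, rows):
    """Validate the whole batch first, so a bad batch leaves the table unchanged."""
    validated = [validate_row(r) for r in rows]
    table.extend(validated)
    return table
```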
----
2. Unity Catalog<br>
This is the central governance and metadata layer: it controls who can access what data.<br>
---
3. Lakehouse Core<br>
The lakehouse is the unifying concept:
one system for batch, streaming, BI, ML, and AI.<br>

Instead of:<br>
Data Lake (raw)<br>
Data Warehouse (analytics)<br>
Feature store (ML)<br>
Separate BI store<br>
You have one platform.<br>
![](https://docs.databricks.com/aws/en/assets/images/well-architected-lakehouse-7d7b521addc268ac8b3d597bafa8cae9.png)
---
Lakehouse-Specific Pillars<br>
These make Lakehouse unique:<br>
Data Governance<br>
- Unity Catalog<br>
- Fine-grained permissions<br>
- Lineage<br>

Interoperability & Usability<br>
- Works with open formats<br>
- SQL, Python, Scala<br>
- No vendor lock-in<br>


In [0]:
spark.sql("CREATE CATALOG IF NOT EXISTS lakehousecat")
spark.sql("CREATE SCHEMA IF NOT EXISTS lakehousecat.deltadb")
spark.sql("CREATE VOLUME IF NOT EXISTS lakehousecat.deltadb.datalake")
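
The three statements above create a catalog, a schema, and a volume, which together form a three-level namespace. The small sketch below shows how those names map to the `/Volumes/...` file paths and the `catalog.schema.table` names used in the rest of this notebook. The helper names `volume_path` and `table_name` are hypothetical and exist only for illustration.

```python
def volume_path(catalog, schema, volume, *parts):
    """Build a /Volumes/<catalog>/<schema>/<volume>/... path for files in a volume."""
    return "/".join(["/Volumes", catalog, schema, volume, *parts])

def table_name(catalog, schema, table):
    """Build a fully qualified three-level table name."""
    return f"{catalog}.{schema}.{table}"

# Names taken from this notebook
csv_path = volume_path("lakehousecat", "deltadb", "datalake", "druginfo.csv")
drugs_table = table_name("lakehousecat", "deltadb", "drugstb1")
```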

####1. Write data into delta file (Datalake) and table (Lakehouse)

In [0]:
# Read the raw CSV data from the data lake volume
df=spark.read.csv("/Volumes/lakehousecat/deltadb/datalake/druginfo.csv",header=True,inferSchema=True)
# Write the data to the data lake in Delta format
df.write.format("delta").mode("overwrite").save("/Volumes/lakehousecat/deltadb/datalake/deltalakedir")
# Write the same data to the data lake in plain Parquet format
df.write.format("parquet").mode("overwrite").save("/Volumes/lakehousecat/deltadb/datalake/parquestlakedir")

# Write the data to a Delta table (lakehouse)
df.write.option("mergeSchema",True).saveAsTable("lakehousecat.deltadb.drugstb1",mode='overwrite')

# Behind the scenes the table data is stored in Delta format in an S3 bucket (the location is hidden from us in Databricks Free Edition)
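
What distinguishes the Delta directory from the plain Parquet directory written above is the `_delta_log` subfolder, which holds the JSON commit files. The sketch below checks for it with standard Python file APIs. It assumes the directory is reachable as a local path (which is how `/Volumes/...` paths are normally exposed on Databricks, but treat that as an assumption for your workspace); `is_delta_dir` and `list_commit_files` are hypothetical helper names.

```python
import os

def is_delta_dir(path):
    """Return True if `path` contains a `_delta_log` subdirectory."""
    return os.path.isdir(os.path.join(path, "_delta_log"))

def list_commit_files(path):
    """Return the sorted JSON commit file names found in `_delta_log`."""
    log_dir = os.path.join(path, "_delta_log")
    if not os.path.isdir(log_dir):
        return []
    return sorted(f for f in os.listdir(log_dir) if f.endswith(".json"))
```

For example, `is_delta_dir("/Volumes/lakehousecat/deltadb/datalake/deltalakedir")` would be expected to return `True`, while the same call on the Parquet directory would return `False`.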

####2. DML Operations in Delta Tables & Files
Support for DELETE/UPDATE/MERGE

In [0]:
%sql
USE lakehousecat.deltadb

In [0]:
%sql
DESC HISTORY drugstb1

In [0]:
%sql
SELECT * FROM drugstb1

In [0]:
%sql
SELECT * FROM drugstb1
WHERE uniqueid=163740;

#####a. Table Update

In [0]:
%sql
UPDATE drugstb1
SET rating=rating-1
WHERE uniqueid=163740;

In [0]:
%sql
SELECT * FROM drugstb1
where uniqueid=163740;

#####b. Table Delete

In [0]:
%sql
DELETE FROM drugstb1
WHERE uniqueid=163740;

In [0]:
%sql
SELECT * FROM drugstb1
where uniqueid in (163740,206473);

In [0]:
%sql
DELETE FROM drugstb1
WHERE uniqueid in (66736,4907,97013)

In [0]:
%sql
DESC HISTORY drugstb1
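
Each row returned by `DESC HISTORY` is one committed version of the table, and Delta SQL can read an older one with `SELECT * FROM drugstb1 VERSION AS OF <n>`. The pure-Python sketch below illustrates the idea behind this: a snapshot is rebuilt by replaying the log up to the requested version. It is a toy model, not the real on-disk format; the `LOG` entries and file names are hypothetical, and features such as deletion vectors may record changes differently.

```python
# Toy model of a Delta-style transaction log; file names are hypothetical.
LOG = [
    {"version": 0, "operation": "WRITE",  "add": ["part-0.parquet"], "remove": []},
    {"version": 1, "operation": "UPDATE", "add": ["part-1.parquet"], "remove": ["part-0.parquet"]},
    {"version": 2, "operation": "DELETE", "add": ["part-2.parquet"], "remove": ["part-1.parquet"]},
]

def snapshot(log, version):
    """Replay commits 0..version and return the set of data files that are live."""
    live = set()
    for commit in sorted(log, key=lambda c: c["version"]):
        if commit["version"] > version:
            break
        live -= set(commit["remove"])  # files logically deleted by this commit
        live |= set(commit["add"])     # files written by this commit
    return live

# Time travel: the old file is still referenced by version 0 of the log
old_files = snapshot(LOG, 0)
current_files = snapshot(LOG, 2)
```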

#####c. File DML (Update/Delete)
We don't usually run DML directly against files; we do it here only to learn that:
- Delta files also support a limited set of DML operations
- running DML shows how the Delta operations work in the background
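
To see what happens in the background, it helps to look at a commit file in `_delta_log`: each one is newline-delimited JSON in which every line is a single action such as `commitInfo`, `add`, or `remove`. The sketch below counts the action types in one commit. `SAMPLE_COMMIT` is a hypothetical, heavily trimmed example (real commits carry many more fields), and `summarize_commit` is an illustrative helper, not part of the Delta API.

```python
import json
from collections import Counter

# Hypothetical, heavily trimmed commit content for an UPDATE
SAMPLE_COMMIT = "\n".join([
    json.dumps({"commitInfo": {"operation": "UPDATE"}}),
    json.dumps({"remove": {"path": "part-0.parquet"}}),
    json.dumps({"add": {"path": "part-1.parquet"}}),
])

def summarize_commit(text):
    """Count the action types (add, remove, commitInfo, ...) in one commit file."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        action = json.loads(line)
        counts.update(action.keys())
    return dict(counts)

summary = summarize_commit(SAMPLE_COMMIT)
```

In this toy commit the update shows up as one file removed and one file added, which is the copy-on-write pattern: the affected data file is rewritten rather than edited in place.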

In [0]:
spark.read.format("delta").load("/Volumes/lakehousecat/deltadb/datalake/deltalakedir").where("uniqueid==163740").show()

In [0]:
# DML on files: how to update Delta files
from delta.tables import DeltaTable  # Delta Lake API for programmatic DML (update, delete, merge) from PySpark
# A path-based Delta table: it is not registered in Unity Catalog or the Hive metastore
deltafile = DeltaTable.forPath(spark,"/Volumes/lakehousecat/deltadb/datalake/deltalakedir")
deltafile.update("uniqueid=163740", { "rating": "rating - 1" } )

In [0]:
spark.read.format("delta").load("/Volumes/lakehousecat/deltadb/datalake/deltalakedir").where("uniqueid==163740").show()

#####d. File Delete


In [0]:
df=spark.read.format("delta").load('/Volumes/lakehousecat/deltadb/datalake/deltalakedir')
df.where('uniqueid=206473').show()

In [0]:
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/Volumes/lakehousecat/deltadb/datalake/deltalakedir")
deltaTable.delete("uniqueid=206473")

In [0]:
df=spark.read.format("delta").load('/Volumes/lakehousecat/deltadb/datalake/deltalakedir')
df.where('uniqueid=206473').show()

#####e. Merge Operation

In [0]:
%sql
select count(*) from drugstb1

In [0]:
%sql
CREATE OR REPLACE TABLE drugstbl_merge AS SELECT * FROM drugstb1 WHERE rating<=8;

In [0]:
%sql
select count(*) from drugstbl_merge;

In [0]:
%sql
-- Delta tables support the MERGE operation (insert/update/delete in one statement)
--2899 updated
--2801 inserted
MERGE INTO drugstbl_merge tgt
USING drugstb1 src
ON tgt.uniqueid = src.uniqueid
WHEN MATCHED THEN
  UPDATE SET tgt.usefulcount= src.usefulcount,
             tgt.drugname = src.drugname,
             tgt.condition = src.condition
WHEN NOT MATCHED
  THEN INSERT (uniqueid,rating,date,usefulcount, drugname, condition ) VALUES (uniqueid,rating,date,usefulcount, drugname, condition);
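
The semantics of the `MERGE` above can be sketched in plain Python: rows whose key already exists in the target get the listed columns updated, and all other source rows are inserted. This is only an illustration of the matching logic under the assumption that keys are unique in the source; the `merge` function is hypothetical and says nothing about how Delta executes the statement.

```python
def merge(target, source, key, update_cols):
    """Upsert `source` rows into `target` (lists of dicts), matching on `key`.

    Returns (updated, inserted) counts. Assumes keys are unique in `source`.
    """
    by_key = {row[key]: row for row in target}
    updated = inserted = 0
    for src in source:
        tgt = by_key.get(src[key])
        if tgt is not None:
            # WHEN MATCHED THEN UPDATE SET <update_cols>
            for col in update_cols:
                tgt[col] = src[col]
            updated += 1
        else:
            # WHEN NOT MATCHED THEN INSERT (all columns)
            target.append(dict(src))
            inserted += 1
    return updated, inserted

# Tiny made-up rows, mirroring the column names used in the SQL above
tgt_rows = [{"uniqueid": 1, "rating": 5, "usefulcount": 0}]
src_rows = [
    {"uniqueid": 1, "rating": 9, "usefulcount": 7},
    {"uniqueid": 2, "rating": 10, "usefulcount": 3},
]
counts = merge(tgt_rows, src_rows, "uniqueid", ["usefulcount"])
```

Note that, as in the SQL statement, `rating` is not in the update list, so a matched row keeps its existing rating.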