# Understand Delta Lake
**Delta Lake** is an open-source **storage layer** that adds relational database semantics to Spark-based data lake processing.
- Tables in Microsoft Fabric lakehouses are **`Delta tables`**, which is signified by the triangular Delta (▴) icon on tables in the lakehouse user interface.

<img src="../images/01_Get started with Microsoft Fabric/04/delta-table.png" alt="Delta Table" style="border: 2px solid black; border-radius: 10px;">

Delta tables are schema abstractions over data files that are stored in **Delta format**. 
- For each table, the lakehouse stores a folder containing **Parquet data files** and a **`_delta_log`** folder in which transaction details are logged in **`JSON format`**.

<img src="../images/01_Get started with Microsoft Fabric/04/delta-files.png" alt="Parquet files" style="border: 2px solid black; border-radius: 10px;">
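
To make the folder layout more concrete, here is a minimal plain-Python sketch of how a reader could replay the JSON commit files in `_delta_log` to work out which Parquet files currently make up a table. It assumes a simplified case in which each commit file holds one JSON action per line and only `add` and `remove` actions matter. Real Delta logs also contain checkpoints and other action types, which this sketch ignores. The helper names and file paths are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def active_files(log_dir, version=None):
    """Replay commit files in order and return the data files that are live."""
    live = set()
    # Commit files are named by zero-padded version, so sorting by name orders them.
    for commit in sorted(Path(log_dir).glob("*.json")):
        if version is not None and int(commit.stem) > version:
            break
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


def write_commit(log_dir, version, actions):
    """Demo helper only: write one commit file of newline-delimited JSON."""
    path = Path(log_dir) / f"{version:020d}.json"
    path.write_text("\n".join(json.dumps(a) for a in actions))


# Build a tiny, made-up log: version 0 adds one file, version 1 replaces it.
demo_log = Path(tempfile.mkdtemp()) / "_delta_log"
demo_log.mkdir()
write_commit(demo_log, 0, [{"add": {"path": "part-0000.parquet"}}])
write_commit(demo_log, 1, [
    {"remove": {"path": "part-0000.parquet"}},
    {"add": {"path": "part-0001.parquet"}},
])

print(active_files(demo_log))             # files in the latest version
print(active_files(demo_log, version=0))  # files as of version 0
```

Passing an older `version` stops the replay early, which is the basic idea behind reading a previous state of the table.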

## Benefits
The benefits of using Delta tables include:

- **Relational tables that support `querying` and data `modification`.** 
  - With Apache Spark, you can store data in Delta tables that support **CRUD (create, read, update, and delete)** operations. 
- **Support for `ACID` transactions.** 
  - Relational databases are designed to support transactional data modifications that provide:
    - **`A`** tomicity (transactions complete as a single unit of work)
    - **`C`** onsistency (transactions leave the database in a consistent state)
    - **`I`** solation (in-process transactions can't interfere with one another)
    - **`D`** urability (when a transaction completes, the changes it made are persisted). 
  - Delta Lake brings this same transactional support to Spark by implementing a transaction log and enforcing serializable isolation for concurrent operations.
- **`Data versioning` and `time travel`.** 
  - Because all transactions are logged in the **`transaction log`**, you can **track multiple versions** of each table row and even use the time travel feature to **retrieve a previous version** of a row in a query.
- **Support for `batch` and `streaming` data.** 
  - While most relational databases include tables that store static data, Spark includes native support for streaming data through the **Spark Structured Streaming API**. Delta Lake tables can be used as both sinks (destinations) and sources for streaming data.
- **Standard formats and interoperability.** 
  - The underlying data for Delta tables is stored in Parquet format, which is commonly used in data lake ingestion pipelines. Additionally, you can use the **`SQL analytics endpoint`** for the Microsoft Fabric lakehouse to query Delta tables in SQL.
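
One way to picture how a transaction log can provide atomic, isolated commits is optimistic concurrency: each writer tries to claim the next version number by creating its commit file, and the attempt fails if another writer got there first. The following is an illustrative plain-Python sketch of that idea on a local file system, not Delta Lake's actual implementation. The function name and the `writer` field are hypothetical.

```python
import os
import tempfile


def try_commit(log_dir, version, payload):
    """Attempt to claim a version number by creating its commit file exclusively."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if another writer already claimed this version.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True


log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, '{"commitInfo": {"writer": "A"}}')
second = try_commit(log_dir, 0, '{"commitInfo": {"writer": "B"}}')  # same version: rejected
retry = try_commit(log_dir, 1, '{"commitInfo": {"writer": "B"}}')   # retried as the next version
print(first, second, retry)
```

A writer whose commit is rejected would normally check whether the winning commit conflicts with its own changes before retrying. That check is omitted here.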

# Create delta tables

## Create managed vs external tables
Saving a dataframe as a table creates both the **`table schema definition`** in the metastore and the **`data files`** in delta format.

### Managed tables
- The table definition in the metastore and the underlying data files are **`both`** managed by the Spark runtime for the Fabric lakehouse. 
- Deleting the table will also **`delete` the underlying files** from the Tables storage location for the lakehouse.

In [0]:
# Load a file into a dataframe
df = spark.read.load('Files/mydata.csv', format='csv', header=True)

# Save the dataframe as a delta table
df.write.format("delta").saveAsTable("mytable")

The code specifies that:
- The table should be saved in **`delta format`** with a specified table name. 
- The data for the table is saved in **`Parquet files`** (regardless of the format of the source file you loaded into the dataframe) in the Tables storage area in the lakehouse, along with a **`_delta_log`** folder containing the transaction logs for the table. 
- The table will be listed in the Tables folder for the lakehouse in the Data explorer pane.

### External tables
- The table definition in the metastore is **`mapped`** to an alternative file storage location.
- Deleting an external table from the lakehouse metastore **`does not delete` the associated data files**.

In [0]:
df.write.format("delta").saveAsTable("myexternaltable", path="Files/myexternaltable")

In [0]:
# You can also specify a fully qualified path for a storage location, like this:
df.write.format("delta").saveAsTable("myexternaltable", path="abfss://my_store_url..../myexternaltable")

- The table definition is created in the **`metastore`** (so the table is listed in the **Tables** user interface for the lakehouse), but the Parquet data files and JSON log files for the table are stored in the **`Files storage location`** (and will be shown in the **Files node** in the **Lakehouse explorer pane**).
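
To check which kind of table you ended up with, you can inspect the catalog. The sketch below assumes a `spark` session like the one available in a notebook and uses `spark.catalog.listTables()`, whose entries expose a `tableType` such as `MANAGED` or `EXTERNAL`. The helper names are hypothetical.

```python
def table_types(spark):
    """Map each table name in the current database to its catalog table type."""
    return {t.name: t.tableType for t in spark.catalog.listTables()}


def is_external(spark, table_name):
    """Return True if the catalog reports the table as EXTERNAL."""
    return table_types(spark).get(table_name) == "EXTERNAL"
```

For example, after running the cells above, `is_external(spark, "myexternaltable")` would be expected to return `True`, and `is_external(spark, "mytable")` to return `False`.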

## Creating table metadata
You can also create the **`table schema`** in the metastore **`without saving any data files`**.

### Use the DeltaTableBuilder API
The **`DeltaTableBuilder API`** enables you to write Spark code to create a table based on your specifications. 

For example, the following code creates a table with a specified name and columns.

In [0]:
from delta.tables import *

DeltaTable.create(spark) \
  .tableName("products") \
  .addColumn("Productid", "INT") \
  .addColumn("ProductName", "STRING") \
  .addColumn("Category", "STRING") \
  .addColumn("Price", "FLOAT") \
  .execute()

### Use Spark SQL

In [0]:
%sql
-- Managed table
CREATE TABLE salesorders
(
    Orderid INT NOT NULL,
    OrderDate TIMESTAMP NOT NULL,
    CustomerName STRING,
    SalesTotal FLOAT NOT NULL
)
USING DELTA

In [0]:
%sql
-- External table: Specifying a LOCATION parameter
CREATE TABLE MyExternalTable
USING DELTA
LOCATION 'Files/mydata'

## Saving data in delta format
Save data in **`delta format without creating a table definition`** in the metastore

- After saving the delta file, the path location you specified includes Parquet files containing the data and a **`_delta_log`** folder containing the transaction logs for the data. 
  - Any modifications made to the data through the Delta Lake API or in an external table that is subsequently created on the folder will be **recorded in the transaction logs**.

In [0]:
# PySpark code saves a dataframe to a new folder location in delta format:
delta_path = "Files/mydatatable"
df.write.format("delta").save(delta_path)

In [0]:
# using the overwrite mode
new_df.write.format("delta").mode("overwrite").save(delta_path)

# using the append mode
new_rows_df.write.format("delta").mode("append").save(delta_path)
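
Because the save mode decides whether existing data is replaced, kept, or extended, it can help to make the mode explicit and validate it. The sketch below wraps the same `write.format("delta").mode(...).save(...)` chain. `save_delta` is a hypothetical helper, and the accepted values are Spark's standard save modes.

```python
VALID_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def save_delta(df, path, mode="errorifexists"):
    """Write a dataframe to a folder in delta format using an explicit save mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported save mode: {mode!r}")
    df.write.format("delta").mode(mode).save(path)
```

For example, `save_delta(new_rows_df, delta_path, "append")` adds rows, while the default mode raises an error if the folder already contains data.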

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> If you use the technique described here to save a dataframe to the Tables location in the lakehouse, Microsoft Fabric uses an automatic table discovery capability to create the corresponding table metadata in the metastore.

# Work with delta tables in Spark

## Using Spark SQL
The most common way to work with data in delta tables in Spark is to use **Spark SQL**. 
- You can embed SQL statements in other languages (such as PySpark or Scala) by using the **`spark.sql`** method.

For example, the following code inserts a row into the products table.

In [0]:
spark.sql("INSERT INTO products VALUES (1, 'Widget', 'Accessories', 2.99)")

In [0]:
%sql

UPDATE products
SET Price = 2.49 WHERE ProductId = 1;

## Use the Delta API
When you want to work with **delta files** rather than **catalog tables**, it may be simpler to use the **`Delta Lake API`**. 
- You can create an **instance of a DeltaTable** from a folder location containing files in delta format, and then use the API to modify the data in the table.

In [0]:
from delta.tables import *
from pyspark.sql.functions import *

# Create a DeltaTable object
delta_path = "Files/mytable"
deltaTable = DeltaTable.forPath(spark, delta_path)

# Update the table (reduce price of accessories by 10%)
deltaTable.update(
    condition = "Category == 'Accessories'",
    set = { "Price": "Price * 0.9" })
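
The same pattern can be wrapped in small helpers that take an existing `DeltaTable` object. This is a sketch under the assumption that the table has `Category` and `Price` columns, as in the example above. The helper names are hypothetical, and the calls rely on the `update(condition, set)` and `delete(condition)` methods of `DeltaTable`.

```python
def discount_category(delta_table, category, factor):
    """Multiply Price by `factor` for every row in the given category."""
    if not 0 < factor <= 1:
        raise ValueError("factor should be in the range (0, 1]")
    if "'" in category:
        # Keep the sketch simple: refuse values that would break the string literal.
        raise ValueError("category must not contain single quotes")
    delta_table.update(
        condition=f"Category == '{category}'",
        set={"Price": f"Price * {factor}"},
    )


def delete_category(delta_table, category):
    """Delete every row in the given category."""
    if "'" in category:
        raise ValueError("category must not contain single quotes")
    delta_table.delete(condition=f"Category == '{category}'")
```

Calling `discount_category(deltaTable, "Accessories", 0.9)` issues the same update as the cell above.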

## Use time travel to work with table versioning
Modifications made to delta tables are logged in the transaction log for the table. 
- You can use the logged transactions to view the **`history of changes`** made to the table and to retrieve older versions of the data (known as time travel).

To see the history of a table, you can use the **`DESCRIBE HISTORY`** SQL command as shown here.

In [0]:
%sql
-- Managed table
DESCRIBE HISTORY products

In [0]:
%sql
-- External table
DESCRIBE HISTORY 'Files/mytable'

<img src="../images/01_Get started with Microsoft Fabric/04/time_travel.png" alt="Time Travel" style="border: 2px solid black; border-radius: 10px;">
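
The history can also be read programmatically. The following sketch assumes a `spark` session and relies on the `version` and `operation` columns that `DESCRIBE HISTORY` returns. The helper names are hypothetical.

```python
def latest_version(spark, table_name):
    """Return the highest version number recorded in the table's history."""
    rows = spark.sql(f"DESCRIBE HISTORY {table_name}").collect()
    return max(row["version"] for row in rows)


def operations_by_version(spark, table_name):
    """Return {version: operation} for each logged transaction."""
    rows = spark.sql(f"DESCRIBE HISTORY {table_name}").collect()
    return {row["version"]: row["operation"] for row in rows}
```

For example, `latest_version(spark, "products")` gives the version number you would get when reading the table without any time travel option.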

## Retrieve historical information
You can retrieve data from a **specific version** of the data by reading the delta file location into a dataframe, specifying the version required as a **`versionAsOf`** option:

In [0]:
df = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)

Or specify a timestamp by using the **`timestampAsOf`** option:

In [0]:
df = spark.read.format("delta").option("timestampAsOf", '2022-01-01').load(delta_path)
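
The two options can be combined into one helper that accepts either a version or a timestamp. This is a minimal sketch assuming a `spark` session. `read_delta_as_of` is a hypothetical name, and it only wraps the `versionAsOf` and `timestampAsOf` options shown above.

```python
def read_delta_as_of(spark, path, version=None, timestamp=None):
    """Load a delta folder as of a version number or a timestamp (exactly one)."""
    if (version is None) == (timestamp is None):
        raise ValueError("Specify exactly one of version or timestamp")
    reader = spark.read.format("delta")
    if version is not None:
        reader = reader.option("versionAsOf", version)
    else:
        reader = reader.option("timestampAsOf", timestamp)
    return reader.load(path)
```

For example, `read_delta_as_of(spark, delta_path, version=0)` returns the table as it was first written.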

# Use delta tables with streaming data

## Spark Structured Streaming
Spark includes **`native support`** for streaming data through **Spark Structured Streaming**, an API that is based on a **boundless dataframe** in which streaming data is captured for processing. 
- A **Spark Structured Streaming** dataframe can read data from many different kinds of streaming sources, including network ports, real-time message brokering services such as Azure Event Hubs or Kafka, or file system locations.

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> For more information about Spark Structured Streaming, see [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) in the Spark documentation.

## Streaming with delta tables

### Using a delta table as a streaming source
In the following PySpark example, a delta table is used to store details of Internet sales orders. A stream is created that reads data from the table folder as new data is appended.

In [0]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Load a streaming dataframe from the Delta Table
stream_df = spark.readStream.format("delta") \
    .option("ignoreChanges", "true") \
    .load("Files/delta/internetorders")

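# A streaming dataframe can't be displayed with show(), so start a streaming query instead.
# For example, write the stream to an in-memory table and query that table:
query = stream_df.writeStream.format("memory").queryName("internetorders_preview").start()
query.processAllAvailable()  # wait until the data currently in the source has been processed
spark.sql("SELECT * FROM internetorders_preview").show()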

After reading the data from the delta table into a streaming dataframe, you can use the Spark Structured Streaming API to **process** it. 
- In the example above, the dataframe is simply displayed; but you could use Spark Structured Streaming to **aggregate** the data over temporal windows (for example to count the number of orders placed every minute) and send the aggregated results to a downstream process for near-real-time visualization.

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> When using a delta table as a streaming source, only append operations can be included in the stream. Data modifications will cause an error unless you specify the **`ignoreChanges`** or **`ignoreDeletes`** option.
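
As a sketch of the per-minute aggregation mentioned above, the stream can be registered as a temporary view and grouped with the SQL `window` function. The column name `OrderDate` is an assumption, because the schema of the orders table isn't shown here. The view name and helper name are also hypothetical.

```python
def orders_per_minute(spark, stream_df, time_column="OrderDate"):
    """Return a streaming dataframe that counts orders in one-minute windows."""
    # Register the stream so it can be queried with SQL; the view name is arbitrary.
    stream_df.createOrReplaceTempView("orders_stream")
    return spark.sql(f"""
        SELECT window({time_column}, '1 minute') AS order_window,
               COUNT(*) AS order_count
        FROM orders_stream
        GROUP BY window({time_column}, '1 minute')
    """)
```

Because the result is an aggregation, a sink for it would typically be started with `outputMode("complete")` or `outputMode("update")` rather than the default append mode.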

### Using a delta table as a streaming sink
In the following PySpark example, a stream of data is read from JSON files in a folder. The JSON data in each file contains the status for an IoT device in the format **`{"device":"Dev1","status":"ok"}`**.
- New data is added to the stream whenever a file is added to the folder. 
- The input stream is a boundless dataframe, which is then written in **delta format** to a folder location for a delta table.

In [0]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Create a stream that reads JSON data from a folder
inputPath = 'Files/streamingdata/'
jsonSchema = StructType([
    StructField("device", StringType(), False),
    StructField("status", StringType(), False)
])
stream_df = spark.readStream.schema(jsonSchema).option("maxFilesPerTrigger", 1).json(inputPath)

# Write the stream to a delta table
table_path = 'Files/delta/devicetable'
checkpoint_path = 'Files/delta/checkpoint'
delta_stream = stream_df.writeStream.format("delta").option("checkpointLocation", checkpoint_path).start(table_path)

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> The **`checkpointLocation`** option is used to write a checkpoint file that **tracks the state** of the stream processing. This file enables you to recover from failure at the point where stream processing left off.

After the streaming process has started, you can query the Delta Lake table to which the streaming output is being written to see the latest data. 

For example, the following code creates a catalog table for the Delta Lake table folder and queries it:

In [0]:
%%sql

CREATE TABLE DeviceTable
USING DELTA
LOCATION 'Files/delta/devicetable';

SELECT device, status
FROM DeviceTable;

To stop the stream of data being written to the Delta Lake table, you can use the stop method of the streaming query:

In [0]:
delta_stream.stop()
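
Before stopping a stream, it can be useful to check whether it is still running and how much data the last micro-batch handled. The sketch below assumes a `StreamingQuery` object such as `delta_stream` and uses its `isActive`, `lastProgress`, and `stop` members. The helper names are hypothetical, and the exact shape of `lastProgress` may vary between Spark versions.

```python
def stream_summary(query):
    """Summarize a streaming query: whether it is running and its last batch size."""
    progress = query.lastProgress  # None until the first micro-batch completes
    return {
        "active": query.isActive,
        "last_batch_rows": progress["numInputRows"] if progress else None,
    }


def stop_if_active(query):
    """Stop the query only if it is still running; return True if it was stopped."""
    if query.isActive:
        query.stop()
        return True
    return False
```

For example, `stream_summary(delta_stream)` could be called periodically while files are being added to the input folder.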

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> For more information about using delta tables for streaming data, see [Table streaming reads and writes](https://docs.delta.io/latest/delta-streaming.html) in the Delta Lake documentation.