# Delta Lake Quick Start
This notebook provides a quick start guide to using Delta Lake, as outlined in the [Delta Lake Documentation](https://docs.delta.io/latest/quick-start.html).


## Installing Dependencies

This code installs the necessary dependencies for working with Apache Spark and Delta Lake. Specifically, it installs:

- `pyspark==3.5.1`: The Python interface for Apache Spark, which is a unified analytics engine for large-scale data processing.
- `delta-spark`: The Delta Lake library, which is an open-source storage layer that brings reliability to data lakes.

These dependencies are installed using the `pip` package manager. The `!` prefix is used to execute shell commands within the notebook environment.

In [152]:
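# Install necessary dependencies
!pip install pyspark==3.5.1
# Pin delta-spark to a release built for Spark 3.5.x; an unpinned install
# may resolve to a version that requires a different pyspark.
!pip install delta-spark==3.2.0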
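
After installation, it can help to confirm which versions were actually resolved, since each `delta-spark` release is built against a specific Spark version. The following is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of either package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Print the versions that pip resolved for the two dependencies.
for name in ("pyspark", "delta-spark"):
    print(name, installed_version(name))
```

A value of `None` for either package suggests the install step did not complete in the current environment.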




## Importing Libraries

This code imports the required libraries for working with Apache Spark and Delta Lake.

### Imports

- `from pyspark.sql import SparkSession`
  - This line imports the `SparkSession` class from the `pyspark.sql` module. `SparkSession` is the entry point for creating a Spark application and interacting with Spark's functionalities.

- `from delta import *`
  - This line imports all functions and classes from the `delta` module, which is part of the Delta Lake library. This library provides utilities for working with Delta Lake, a storage layer that brings ACID transactions, scalability, and performance to data lakes.

By importing these libraries, you gain access to the necessary classes and functions for creating Spark sessions, reading and writing data, and interacting with Delta Lake.

In [153]:
# Import Libraries
from pyspark.sql import SparkSession
from delta import *


## Creating a Spark Session with Delta Lake

This code creates a Spark session and configures it to work with Delta Lake.

### Steps

1. **Create a Spark Session Builder**
   - `builder = SparkSession.builder.appName("DeltaLakeQuickStart")`
     - This line creates a `SparkSession` builder with the application name set to `"DeltaLakeQuickStart"`.

2. **Configure Spark Session for Delta Lake**
   - `builder.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")`
     - This line adds a configuration to the Spark session builder, enabling the Delta Lake SQL extension.
   - `builder.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")`
     - This line configures the Spark session to use the Delta Lake catalog for managing tables and databases.

3. **Create and Configure Spark Session with Delta Lake**
   - `spark = configure_spark_with_delta_pip(builder).getOrCreate()`
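     - This line passes the builder through `configure_spark_with_delta_pip`, a helper from the `delta` package that adds the Delta Lake Maven artifact matching the installed `delta-spark` version to `spark.jars.packages`, and then creates the session.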
     - The `getOrCreate()` method creates a new Spark session or returns an existing one if it has already been created.

By configuring the Spark session with the Delta Lake extensions and catalog, you enable Delta Lake capabilities within your Spark application. This includes features like ACID transactions, data versioning, and improved performance for data lake workloads.

In [154]:
# Create a Spark Session with Delta Lake
builder = SparkSession.builder.appName("DeltaLakeQuickStart") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
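
To check that the session picked up the two Delta settings, you can read them back from the runtime configuration. This is a small sketch, assuming a session object like the notebook's `spark`; the helper name `delta_settings` and the `"(not set)"` placeholder are illustrative choices.

```python
def delta_settings(spark):
    """Collect the two Delta-related settings from an active SparkSession."""
    keys = ("spark.sql.extensions", "spark.sql.catalog.spark_catalog")
    # spark.conf.get returns the supplied default when the key is unset.
    return {key: spark.conf.get(key, "(not set)") for key in keys}
```

In the notebook, `print(delta_settings(spark))` should echo the extension and catalog class names passed to the builder.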


## Creating a Delta Table

This code creates a Delta table using Apache Spark and the Delta Lake library.

### Steps

1. **Generate Sample Data**
   - `data = spark.range(0, 5)`
     - This line uses `spark.range` to create a DataFrame `data` with a single column named `id` that holds the integers 0 through 4 (the upper bound, 5, is excluded).

2. **Write Data to Delta Table**
   - `data.write.format("delta").mode("overwrite").save("/tmp/delta-table")`
     - This line writes the `data` DataFrame to a Delta table located at the `/tmp/delta-table` path.
     - The `write` method is used to initiate the writing process.
     - `format("delta")` specifies that the data should be written in the Delta Lake format.
     - `mode("overwrite")` sets the write mode to `overwrite`, which means that any existing data in the Delta table will be replaced by the new data.
     - `save("/tmp/delta-table")` saves the data to the specified path, creating a new Delta table or overwriting an existing one.

The resulting Delta table will be stored in the `/tmp/delta-table` directory, and it will contain the integer values from 0 to 4. Delta Lake tables provide additional features like ACID transactions, data versioning, and efficient data storage and retrieval compared to traditional file-based data lakes.

In [155]:
# Create a Delta Table
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")
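
A Delta table on disk is a directory of Parquet data files plus a `_delta_log` subdirectory, where each commit is stored as a zero-padded, 20-digit JSON file. The sketch below lists the commit versions found in that directory. It assumes the table path is on the local filesystem, and the helper name `list_commit_versions` is hypothetical.

```python
import re
from pathlib import Path

# Commit files are named with a 20-digit, zero-padded version number.
COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def list_commit_versions(table_path):
    """Return the sorted version numbers found in a table's _delta_log directory."""
    log_dir = Path(table_path) / "_delta_log"
    if not log_dir.is_dir():
        return []
    versions = []
    for entry in log_dir.iterdir():
        match = COMMIT_FILE.match(entry.name)
        if match:  # skip checksum and checkpoint files
            versions.append(int(match.group(1)))
    return sorted(versions)


print(list_commit_versions("/tmp/delta-table"))
```

Each later write in this guide adds another commit file, which is what makes the time travel queries further down possible.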


## Reading from the Delta Table

This code reads data from the previously created Delta table and displays the contents.

### Steps

1. **Read Data from Delta Table**
   - `df = spark.read.format("delta").load("/tmp/delta-table")`
     - This line reads data from the Delta table located at `/tmp/delta-table`.
     - The `read` method is used to initiate the reading process.
     - `format("delta")` specifies that the data should be read from a Delta Lake table.
     - `load("/tmp/delta-table")` loads the data from the specified path, which contains the Delta table.
     - The resulting data is stored in the `df` DataFrame.

2. **Display DataFrame Contents**
   - `df.show()`
     - This line prints the contents of the `df` DataFrame to the console or notebook output.

By reading data from the Delta table, you can leverage the features provided by Delta Lake, such as data versioning and efficient data retrieval. The `show()` method allows you to inspect the contents of the DataFrame, ensuring that the data was read correctly from the Delta table.

In [156]:
# Read from the Delta Table
df = spark.read.format("delta").load("/tmp/delta-table")
print("value of latest version of the delta table")
df.show()


value of latest version of the delta table
+---+
| id|
+---+
|  2|
|  3|
|  4|
|  0|
|  1|
+---+
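
The rows in the output above are not in ascending order because Spark gives no ordering guarantee unless the query sorts explicitly. If you want a stable order for inspection, one option is to sort before collecting. This is a sketch with an illustrative helper name, `sorted_ids`, intended only for small tables like this one.

```python
def sorted_ids(df):
    """Return the `id` values of a DataFrame in ascending order."""
    # orderBy makes the order deterministic; collect pulls every row to the
    # driver, so avoid this pattern on large tables.
    return [row["id"] for row in df.orderBy("id").collect()]
```

For display only, `df.orderBy("id").show()` achieves the same ordering without collecting the rows into a Python list.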



## Updating the Delta Table

This code updates the existing Delta table by overwriting it with new data.

### Steps

1. **Generate New Sample Data**
   - `data = spark.range(5, 10)`
     - This line creates a new DataFrame `data` containing a range of integers from 5 to 9 (inclusive) using the `spark.range` function.

2. **Overwrite Delta Table with New Data**
   - `data.write.format("delta").mode("overwrite").save("/tmp/delta-table")`
     - This line overwrites the existing Delta table located at `/tmp/delta-table` with the new `data` DataFrame.
     - The `write` method is used to initiate the writing process.
     - `format("delta")` specifies that the data should be written in the Delta Lake format.
     - `mode("overwrite")` sets the write mode to `overwrite`, which means that any existing data in the Delta table will be replaced by the new data.
     - `save("/tmp/delta-table")` saves the new data to the specified path, overwriting the existing Delta table.

After executing this code, the Delta table at `/tmp/delta-table` will no longer contain the initial range of integers from 0 to 4. Instead, it will be overwritten with the new range of integers from 5 to 9.

Delta Lake's support for ACID transactions ensures that the overwrite operation is atomic, consistent, isolated, and durable, maintaining data integrity during the update process.

In [157]:
# Update the Delta Table
data = spark.range(5, 10)
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")
print("updated value of delta table")
df.show()



updated value of delta table
+---+
| id|
+---+
|  7|
|  8|
|  9|
|  5|
|  6|
+---+
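
Each write to the table, including the overwrite, is recorded as a new version in the transaction log. One way to inspect those versions is Delta Lake's `DESCRIBE HISTORY` SQL command. The sketch below assumes a Delta-enabled session like the notebook's `spark`; the helper names `history_query` and `show_history` are hypothetical.

```python
def history_query(table_path):
    """Build a DESCRIBE HISTORY statement for a path-based Delta table."""
    return f"DESCRIBE HISTORY delta.`{table_path}`"


def show_history(spark, table_path):
    """Print version, timestamp, and operation for each commit of the table."""
    spark.sql(history_query(table_path)).select(
        "version", "timestamp", "operation"
    ).show(truncate=False)
```

Calling `show_history(spark, "/tmp/delta-table")` in the notebook should list one row per commit, newest first.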



## Reading the Updated Delta Table

This code reads the updated Delta table and displays its contents.

### Steps

1. **Read Data from the Updated Delta Table**
   - `df = spark.read.format("delta").load("/tmp/delta-table")`
     - This line reads data from the updated Delta table located at `/tmp/delta-table`.
     - The `read` method is used to initiate the reading process.
     - `format("delta")` specifies that the data should be read from a Delta Lake table.
     - `load("/tmp/delta-table")` loads the data from the specified path, which now contains the updated Delta table.
     - The resulting data is stored in the `df` DataFrame.

2. **Display DataFrame Contents**
   - `df.show()`
     - This line prints the contents of the `df` DataFrame to the console or notebook output.

By reading the updated Delta table, you can verify that the overwrite operation was successful, and the table now contains the new range of integers from 5 to 9.

The `show()` method displays the contents of the DataFrame, allowing you to inspect the data and ensure that the update was performed correctly. Delta Lake's transaction log and data versioning capabilities help maintain data integrity and enable time travel queries, if needed.

In [158]:
# Read the Updated Delta Table
df = spark.read.format("delta").load("/tmp/delta-table")
print("read values of most recent version of the delta table")
df.show()


read values of most recent version of the delta table
+---+
| id|
+---+
|  7|
|  8|
|  9|
|  5|
|  6|
+---+



## Conditional Update Without Overwrite

Delta Lake provides programmatic APIs to conditionally update, delete, and merge (upsert) data into tables. Here are a few examples.

## Delta Table Operations

This code demonstrates various operations that can be performed on a Delta table, such as updating, deleting, and upserting (merging) data. It also showcases the use of the Delta Table API provided by the Delta Lake library.

### Imports

- `from delta.tables import *`
  - This line imports the necessary classes and functions from the `delta.tables` module, which provides the Delta Table API.
- `from pyspark.sql.functions import *`
  - This line imports various functions from the `pyspark.sql.functions` module, which are used for performing data transformations and manipulations.

### Steps

1. **Create a DeltaTable Object**
   - `deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")`
     - This line creates a `DeltaTable` object `deltaTable` by pointing to the existing Delta table located at `/tmp/delta-table`.

2. **Update Even Values**
   - `deltaTable.update(condition = expr("id % 2 == 0"), set = { "id": expr("id + 100") })`
     - This line updates the Delta table by adding 100 to the `id` column for all rows where `id` is even.
     - The `update` method is used to perform the update operation.
     - `condition = expr("id % 2 == 0")` specifies the condition for which rows should be updated, in this case, rows where `id` is even.
     - `set = { "id": expr("id + 100") }` defines the update expression, which adds 100 to the `id` column.

3. **Delete Even Values**
   - `deltaTable.delete(condition = expr("id % 2 == 0"))`
     - This line deletes all rows from the Delta table where `id` is even.
     - The `delete` method is used to perform the deletion operation.
     - `condition = expr("id % 2 == 0")` specifies the condition for which rows should be deleted, in this case, rows where `id` is even.

4. **Upsert (Merge) New Data**
   - `newData = spark.range(0, 20)`
     - This line creates a new DataFrame `newData` containing a range of integers from 0 to 19.
   - `deltaTable.alias("oldData") ...`
     - This block of code performs an upsert (merge) operation on the Delta table, merging the `newData` DataFrame with the existing data.
     - The `alias` method is used to assign aliases to the Delta table (`"oldData"`) and the new data (`"newData"`).
     - The `merge` method initiates the merge operation, matching rows on the `id` column of both datasets. An upsert combines an update of matched rows with an insert of unmatched rows in a single atomic operation; the API has no separate `upsert` method. For more detail, see the Delta Lake documentation at https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge
     - `whenMatchedUpdate` specifies the update logic when a row in `"oldData"` matches a row in `"newData"`.
     - `whenNotMatchedInsert` specifies the insert logic when a row in `"newData"` does not match any row in `"oldData"`.
     - The `execute` method finalizes and applies the merge operation to the Delta table.

5. **Display Updated Delta Table**
   - `deltaTable.toDF().show()`
     - This line converts the `deltaTable` object to a Spark DataFrame using `toDF()` and displays its contents using the `show()` method.

By executing this code, you update, delete, and merge rows in the Delta table. The intermediate output shows the table after the update and delete steps, and the final output shows its state after the merge.

In [159]:
from delta.tables import *
from pyspark.sql.functions import *

deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")

# Update every even value by adding 100 to it
deltaTable.update(
  condition = expr("id % 2 == 0"),
  set = { "id": expr("id + 100") })

# Delete every even value
deltaTable.delete(condition = expr("id % 2 == 0"))

# Upsert (merge) new data
newData = spark.range(0, 20)
print("show dataframe newData")
newData.show()

print("show table after update")
deltaTable.toDF().show()

deltaTable.alias("oldData") \
  .merge(
    newData.alias("newData"),
    "oldData.id = newData.id") \
  .whenMatchedUpdate(set = { "id": col("newData.id") }) \
  .whenNotMatchedInsert(values = { "id": col("newData.id") }) \
  .execute()

print("show upserted, i.e., merged and inserted table as a DataFrame")
deltaTable.toDF().show()

show dataframe newData
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+

show table after update
+---+
| id|
+---+
|  7|
|  9|
|  5|
+---+

show upserted, i.e., merged and inserted table as a DataFrame
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+



You should see that some of the existing rows have been updated and new rows have been inserted.

For more information on these operations, see [Table deletes, updates, and merges](https://docs.delta.io/latest/delta-update.html).
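
To see why the intermediate table holds only 5, 7, and 9, it may help to trace the three operations on a plain Python list. This is a simulation of the row-level logic only, not of how Delta Lake executes it, and the function names are illustrative.

```python
def conditional_update(ids):
    """Add 100 to every even id, mirroring the `update` call."""
    return [i + 100 if i % 2 == 0 else i for i in ids]


def conditional_delete(ids):
    """Drop every even id, mirroring the `delete` call."""
    return [i for i in ids if i % 2 != 0]


def upsert(old_ids, new_ids):
    """Keep matched ids (their update is a no-op here) and insert unmatched ones."""
    existing = set(old_ids)
    return sorted(old_ids + [i for i in new_ids if i not in existing])


table = list(range(5, 10))                  # state after the overwrite
table = conditional_update(table)           # 6 and 8 become 106 and 108
table = conditional_delete(table)           # 106 and 108 are still even, so they go
print(table)
table = upsert(table, list(range(0, 20)))   # matched: 5, 7, 9; the rest are inserted
print(table)
```

The updated values 106 and 108 remain even, so the delete step removes them, and the merge then fills in every id from 0 to 19 that is missing.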



## Querying a Specific Version of the Delta Table

This code demonstrates how to query a specific version of a Delta table using the Delta Lake library's time travel capabilities.

### Steps

1. **Read Specific Version of Delta Table**
   - `df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")`
     - This line reads data from the Delta table located at `/tmp/delta-table`.
     - The `read` method is used to initiate the reading process.
     - `format("delta")` specifies that the data should be read from a Delta Lake table.
     - `option("versionAsOf", 0)` sets the `versionAsOf` option to `0`, which instructs Delta Lake to read the data from the initial version (version 0) of the table.
     - `load("/tmp/delta-table")` loads the data from the specified path, which contains the Delta table.
     - The resulting data is stored in the `df` DataFrame.

2. **Display DataFrame Contents**
   - `df.show()`
     - This line prints the contents of the `df` DataFrame to the console or notebook output.

By using the `versionAsOf` option when reading the Delta table, you can query a specific version of the table's data. In this case, `versionAsOf=0` retrieves the initial version of the table, which corresponds to the state of the data before any updates or modifications were made.

This time travel capability provided by Delta Lake is particularly useful for auditing, reproducing results, or reverting to a previous state of the data if needed. It leverages Delta Lake's transaction log and data versioning features to maintain a history of changes and make previous versions of the data accessible.

In [160]:
df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
print("querying version 0 of the dataframe")
df.show()

querying version 0 of the dataframe
+---+
| id|
+---+
|  2|
|  3|
|  4|
|  0|
|  1|
+---+



You should see the first set of data, from before you overwrote it. Time travel uses the Delta Lake transaction log to access data that is no longer in the current table. Removing the `versionAsOf` option returns the latest version of the table. If the table was created fresh in this notebook, specifying version 1 returns the data written by the overwrite (5 to 9), before the update, delete, and merge steps. For more information, see Query an older snapshot of a table (time travel).
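
Besides `versionAsOf`, the Delta reader also accepts a `timestampAsOf` option. The sketch below wraps both in one helper so that a snapshot can be chosen either way. The name `read_as_of` and its parameters are hypothetical, and it assumes a Delta-enabled session like the notebook's `spark`.

```python
def read_as_of(spark, table_path, version=None, timestamp=None):
    """Read a Delta table at a given version or timestamp (at most one of the two)."""
    if version is not None and timestamp is not None:
        raise ValueError("Pass either version or timestamp, not both.")
    reader = spark.read.format("delta")
    if version is not None:
        reader = reader.option("versionAsOf", version)
    elif timestamp is not None:
        reader = reader.option("timestampAsOf", timestamp)
    # With neither argument set, this reads the latest version.
    return reader.load(table_path)
```

For example, `read_as_of(spark, "/tmp/delta-table", version=1).show()` would display the snapshot written by the second commit.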


## Writing a Streaming DataFrame to a Delta Table

This code demonstrates how to write a streaming DataFrame to a Delta table using Apache Spark's structured streaming capabilities and the Delta Lake library.

### Steps

1. **Create a Streaming DataFrame**
   - `streamingDf = spark.readStream.format("rate").load()`
     - This line creates a streaming DataFrame `streamingDf` from the built-in "rate" source, which generates rows with a `timestamp` column and an incrementing `value` column, at one row per second by default.

2. **Transform Streaming DataFrame**
   - `streamingDf.selectExpr("value as id")`
     - This line applies a transformation to the `streamingDf` DataFrame by selecting the `value` column and renaming it to `id`.

3. **Write Streaming DataFrame to Delta Table**
   - `stream = streamingDf.selectExpr("value as id").writeStream.format("delta").option("checkpointLocation", "/tmp/checkpoint").start("/tmp/delta-table")`
     - This line writes the transformed streaming DataFrame to a Delta table located at `/tmp/delta-table`.
     - The `writeStream` method is used to initiate the streaming write process.
     - `format("delta")` specifies that the data should be written in the Delta Lake format.
     - `option("checkpointLocation", "/tmp/checkpoint")` sets the checkpoint location for fault-tolerant processing of the streaming query.
     - `start("/tmp/delta-table")` starts the streaming query and writes the data to the specified Delta table path.
     - The streaming query is assigned to the `stream` variable for later reference (e.g., to stop the stream).

By writing the streaming DataFrame to a Delta table, you can take advantage of Delta Lake's features, such as ACID transactions, data versioning, and efficient storage and retrieval. Additionally, Delta Lake's support for streaming ingestion ensures that the data is written in an efficient and fault-tolerant manner.

The `checkpointLocation` option is used to specify a location where Spark can save the progress and metadata of the streaming query, enabling fault-tolerance and allowing the query to recover from failures or restarts.

In [161]:
streamingDf = spark.readStream.format("rate").load()
stream = streamingDf.selectExpr("value as id").writeStream.format("delta").option("checkpointLocation", "/tmp/checkpoint").start("/tmp/delta-table")

While the stream is running, you can read the table using the earlier commands.
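
You can also inspect the query object itself while it runs. The following sketch gathers a few `StreamingQuery` attributes into a dictionary; the helper name `describe_query` is illustrative.

```python
def describe_query(query):
    """Summarize a StreamingQuery as a plain dictionary."""
    return {
        "id": str(query.id),
        "is_active": query.isActive,
        # status is a dict with the current state message and trigger flags
        "status": query.status,
        # lastProgress is None until the first micro-batch has completed
        "last_progress": query.lastProgress,
    }
```

In the notebook, `describe_query(stream)` gives a quick view of whether the query is active and what it last processed.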



## Stopping the Streaming Query

This line of code stops the streaming query that was previously started to write data to a Delta table.

### Stop Streaming Query
- `stream.stop()`
  - This line invokes the `stop()` method on the `stream` object, which represents the streaming query.
  - The `stop()` method gracefully stops the streaming query and releases any resources associated with it.

Stop a streaming query when you no longer need it or before terminating the application. A query left running keeps holding resources such as threads and network connections, and can prevent the application from exiting properly. Calling `stream.stop()` terminates the query and releases those resources.

In [162]:
stream.stop()
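
If several queries were started in the same session, or the variable holding a query was lost, the session's stream manager can be used to stop them all. This is a minimal sketch with a hypothetical helper name, `stop_all_streams`, that relies on `spark.streams.active`.

```python
def stop_all_streams(spark):
    """Stop every active streaming query in the session and return how many were stopped."""
    # Copy the list first so it is not affected as queries terminate.
    active = list(spark.streams.active)
    for query in active:
        query.stop()
    return len(active)
```

Running `stop_all_streams(spark)` at the end of a notebook is one way to avoid leaving background queries behind.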

## Reading from Delta Table and Writing to Console

This code demonstrates how to read data from a Delta table as a streaming DataFrame and write it to the console using Apache Spark's structured streaming capabilities.

### Steps

1. **Create a Streaming DataFrame by Reading from Delta Table**
   - `spark.readStream.format("delta").load("/tmp/delta-table")`
     - This line creates a streaming DataFrame by reading data from the Delta table located at `/tmp/delta-table`.
     - The `readStream` method is used to initiate the streaming read process.
     - `format("delta")` specifies that the data should be read from a Delta Lake table.
     - `load("/tmp/delta-table")` loads the data from the specified Delta table path.

2. **Write Streaming DataFrame to Console**
   - `.writeStream.format("console").start()`
     - This line writes the streaming DataFrame to the console.
     - The `writeStream` method is used to initiate the streaming write process.
     - `format("console")` specifies that the data should be written to the console output.
     - `start()` starts the streaming query and begins writing the data to the console.
     - The streaming query is assigned to the `stream2` variable for later reference (e.g., to stop the stream).

By reading from a Delta table as a streaming DataFrame, you can process the rows already in the table and then new rows as they are appended. By default, the Delta streaming source expects append-only changes; a commit that updates or deletes existing rows causes the query to fail unless you set an option such as `ignoreChanges`.

Writing the streaming DataFrame to the console allows you to inspect the data as it is being processed by the streaming query. This can be helpful for debugging purposes or for getting a quick glimpse of the data in the Delta table.

Note that writing to the console is typically used for testing and development purposes. In production scenarios, you would likely write the streaming DataFrame to another data sink, such as another Delta table, a file system, or a database.

In [163]:
stream2 = spark.readStream.format("delta").load("/tmp/delta-table").writeStream.format("console").start()
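
The notebook ends with `stream2` still running. One way to bound its lifetime is to wait for a fixed time and then stop it. The sketch below uses `awaitTermination` with a timeout in seconds; the helper name `run_for` is hypothetical.

```python
def run_for(query, seconds):
    """Let a streaming query run for up to `seconds`, then stop it if still active."""
    # awaitTermination returns True if the query ended within the timeout.
    finished = query.awaitTermination(seconds)
    if not finished:
        query.stop()
    return finished
```

For example, `run_for(stream2, 10)` would print console batches for about ten seconds and then shut the query down.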
