In Databricks with Delta Lake, deletion vectors make deletes cheaper, especially on large tables. When deletion vectors are enabled on a table, a delete does not rewrite the affected Parquet data files. Instead, Delta Lake writes a deletion vector that records which rows of each data file have been logically deleted, and readers skip those rows. The points below cover the benefit and how deletion vectors fit with Delta Lake's other features:

###Atomic Operations: 
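Deletes that use deletion vectors are atomic, like every other Delta Lake write. The guarantee comes from the transaction log rather than from the vector itself: a delete either commits as a whole or has no effect, so readers never see a partial deletion that could leave the data inconsistent.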

###Efficient Storage: 
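Without deletion vectors, deleting even a single row means rewriting the whole Parquet file that contains it. With them, Delta Lake only records the deleted rows in the deletion vector, which minimizes the data rewritten at delete time. The rows are physically removed later, when the data files are rewritten, for example by OPTIMIZE or REORG TABLE ... APPLY (PURGE).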
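The idea of marking rows instead of rewriting files can be sketched in plain Python. This is a toy model only: the DataFile class and its methods are hypothetical names invented for illustration, and real Delta Lake keeps rows in Parquet files and encodes deletion vectors as compressed bitmaps.

```python
# Toy model only: all names here are hypothetical, not Delta Lake APIs.
from dataclasses import dataclass, field


@dataclass
class DataFile:
    rows: tuple                                  # immutable contents of one data file
    deleted: set = field(default_factory=set)    # row positions marked as deleted

    def delete_where(self, predicate):
        """Mark matching rows as deleted without touching `rows`."""
        for pos, row in enumerate(self.rows):
            if predicate(row):
                self.deleted.add(pos)

    def read(self):
        """Skip the positions listed in the deletion vector at read time."""
        return [row for pos, row in enumerate(self.rows) if pos not in self.deleted]

    def compact(self):
        """Rewrite the file without the deleted rows, with an empty vector."""
        return DataFile(rows=tuple(self.read()))


data_file = DataFile(rows=((1, "John"), (2, "Jane"), (3, "Doe")))
data_file.delete_where(lambda row: row[0] == 2)

visible_rows = data_file.read()    # what a reader sees after the delete
compacted = data_file.compact()    # the later physical rewrite
print(visible_rows)
print(len(data_file.rows), len(compacted.rows))
```

The delete itself only adds one position to the set. The cost of rewriting the file is deferred to compact(), which plays the role of a later maintenance operation.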

###Transaction History: 
Delta Lake maintains a transaction log that records every operation, including deletes. The log makes changes auditable and lets you roll a table back to an earlier version if necessary.

###Time Travel: 
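Time travel lets you query the table as of an earlier version or timestamp, including before a delete, as long as the older versions are still retained. It is driven by the transaction log rather than by deletion vectors themselves: a delete that writes a deletion vector is simply one more version in the log.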
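How a log of commits gives both history and time travel can be sketched with a small replay function. This is a toy model under stated assumptions: the snapshot function and the dictionary layout of each commit are hypothetical and do not mirror Delta Lake's on-disk log format.

```python
# Toy model only: a list of commits stands in for the transaction log.
# The names and the commit layout are hypothetical, not Delta Lake's format.

def snapshot(log, version):
    """Replay commits 0..version and return the rows visible at that version."""
    rows = {}
    for commit in log[: version + 1]:
        for row_id, name in commit.get("add", []):
            rows[row_id] = name
        for row_id in commit.get("delete", []):
            rows.pop(row_id, None)
    return sorted(rows.items())


log = [
    {"operation": "WRITE", "add": [(1, "John"), (2, "Jane"), (3, "Doe")]},  # version 0
    {"operation": "DELETE", "delete": [2]},                                  # version 1
]

before_delete = snapshot(log, 0)   # time travel to the version before the delete
after_delete = snapshot(log, 1)    # current state of the table
print(before_delete)
print(after_delete)
```

Because each commit is applied as a whole during replay, a reader at any version sees either all of a delete or none of it, which is the atomicity property described earlier.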

To perform deletes from PySpark, use Delta Lake's native APIs, such as the delete method of the DeltaTable API. You do not create or manage deletion vectors yourself: when the feature is enabled on the table (table property delta.enableDeletionVectors), Delta Lake writes them as part of the delete transaction, which keeps the operation atomic and the data consistent.
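Whether a delete actually writes a deletion vector depends on the table property delta.enableDeletionVectors. The sketch below builds the statement that sets it for a path-based table once that table exists. The helper name enable_deletion_vectors_sql is hypothetical, and it assumes a runtime that supports deletion vectors; some Databricks runtimes may already enable the feature by default for new tables.

```python
def enable_deletion_vectors_sql(table_path: str) -> str:
    """Build the ALTER TABLE statement for a path-based Delta table.

    Hypothetical helper for illustration; only the table property name
    comes from Delta Lake itself.
    """
    return (
        f"ALTER TABLE delta.`{table_path}` "
        "SET TBLPROPERTIES ('delta.enableDeletionVectors' = true)"
    )


statement = enable_deletion_vectors_sql("/delta-table")
print(statement)

# In a notebook with an active session and an existing table you would then run:
# spark.sql(statement)
```

Keeping the statement in a small function makes it easy to log or review before running it, since enabling the feature upgrades the table protocol and older readers may no longer be able to read the table.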

In [0]:
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Delta Lake Deletion Vectors Example") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Example DataFrame
data = [(1, 'John'), (2, 'Jane'), (3, 'Doe')]
df = spark.createDataFrame(data, ["id", "name"])

# Write DataFrame to Delta Lake
df.write.format("delta").mode("overwrite").save("/delta-table")

# Read Delta table
delta_table = DeltaTable.forPath(spark, "/delta-table")

# Perform delete operation
condition = "id = 2"
delta_table.delete(condition)

# Read the Delta table after deletion
delta_df = spark.read.format("delta").load("/delta-table")
delta_df.display()

# Delta table time travel example - accessing data before deletion
# history() returns commit metadata (version, timestamp, operation, ...), not table rows,
# so selecting "id" and "name" from it raises an AnalysisException.
# Read the earlier snapshot with the versionAsOf option instead.
latest_version = delta_table.history(1).select("version").first()[0]  # assumes the delete is the latest commit
old_data = spark.read.format("delta").option("versionAsOf", latest_version - 1).load("/delta-table")
old_data.display()  # the pre-delete snapshot still contains id = 2


id,name
1,John
3,Doe

###In this example:

We create a Spark session and define a sample DataFrame df with some data (id, name).

The DataFrame is written to Delta Lake (/delta-table).

We access the Delta table using DeltaTable.forPath() and perform a delete operation based on a condition (id = 2).

After deletion, we read the Delta table again (delta_df) and display its contents.

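The history() function of DeltaTable returns the transaction history: one row of commit metadata (version, timestamp, operation) per commit, not the table's data. To see the data as it was before the deletion, we take the version of the delete commit from the history and read the version just before it with the versionAsOf read option.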

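This example illustrates how to delete rows and read earlier versions with the PySpark DeltaTable API. Deletion vectors themselves are handled internally by Delta Lake: when they are enabled on the table, the same delete call records the removed rows in a deletion vector instead of rewriting the data files.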