
The VACUUM command in Delta Lake deletes data files that a table no longer references, typically files left behind by updates, deletions, or overwrites. It helps control the storage footprint of Delta tables, especially for large datasets with frequent updates or deletions.


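How the VACUUM Command Works
The VACUUM command removes files that are no longer referenced by the current version of the table and are older than a retention period. The default retention period is 7 days (168 hours), but you can specify a different period if needed. The retention window protects files that may still be needed for rollbacks, time travel queries, or readers and writers that are still running. Once files have been vacuumed, time travel to versions that depended on them is no longer possible.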
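
The rule above can be sketched in plain Python. This is a simplified model, not Delta Lake's implementation: the DataFile class, its fields, and files_to_vacuum are hypothetical names, and the real command works from the transaction log and a listing of the table directory rather than from an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file in a Delta table directory."""
    path: str
    referenced: bool        # still part of the current table version?
    modified_at: datetime   # simplified stand-in for the file's age


def files_to_vacuum(files, now, retention_hours=168):
    """Return paths that a VACUUM-like cleanup would delete (simplified model)."""
    cutoff = now - timedelta(hours=retention_hours)
    # A file is eligible only if it is unreferenced AND older than the cutoff.
    return [f.path for f in files if not f.referenced and f.modified_at < cutoff]


now = datetime(2024, 1, 10, 12, 0)
files = [
    DataFile("part-0001.parquet", referenced=True, modified_at=now - timedelta(days=30)),
    DataFile("part-0002.parquet", referenced=False, modified_at=now - timedelta(days=8)),
    DataFile("part-0003.parquet", referenced=False, modified_at=now - timedelta(days=2)),
]

print(files_to_vacuum(files, now))      # ['part-0002.parquet']
print(files_to_vacuum(files, now, 24))  # ['part-0002.parquet', 'part-0003.parquet']
```

Note that the referenced file is never eligible, no matter how old it is, and that shortening the retention period only widens the set of unreferenced files that qualify.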

Example of Using VACUUM Command
Let's walk through an example of how to use the VACUUM command in Databricks for a Delta Lake table.

1. Setting Up the Environment
First, ensure you have a Delta Lake table to work with. Here’s how you might create and populate a Delta table:

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DeltaLakeVacuumExample").getOrCreate()

# Create a Delta Lake table
data = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
df = spark.createDataFrame(data, ["id", "name"])

# Write the data to a Delta table
df.write.format("delta").mode("overwrite").save("/mnt/delta-table-example")

# Read the table to confirm
df_delta = spark.read.format("delta").load("/mnt/delta-table-example")
df_delta.display()


id,name
3,Charlie
1,Alice
2,Bob


2. Performing Updates and Deletes
Next, perform some update and delete operations on the Delta table:

In [0]:
from delta.tables import DeltaTable

# Load the Delta table
deltaTable = DeltaTable.forPath(spark, "/mnt/delta-table-example")

# Update operation
deltaTable.update(
    condition = "id = 1",
    set = { "name": "'UpdatedAlice'" }
)

# Delete operation
deltaTable.delete("id = 2")


3. Using the VACUUM Command
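The update and delete above rewrote data files, leaving the previous versions unreferenced. You can use the VACUUM command to clean up such files once they are older than the retention period: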

In [0]:
# Vacuum the table to remove files older than the default 7 days
spark.sql("VACUUM '/mnt/delta-table-example'")

# Vacuum the table to remove files older than a specified retention period (e.g., 1 day)
spark.sql("VACUUM '/mnt/delta-table-example' RETAIN 24 HOURS")


---------------------------------------------------------------------------
IllegalArgumentException                  Traceback (most recent call last)
File <command-1016938059112883>:5
      2 spark.sql("VACUUM '/mnt/delta-table-example'")
      4 # Vacuum the table to remove files older than a specified retention period (e.g., 1 day)
----> 5 spark.sql("VACUUM '/mnt/delta-table-example' RETAIN 24 HOURS")

File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     46 start = time.perf_counter()
     47 try

The first call succeeds, but the second raises IllegalArgumentException because RETAIN 24 HOURS is shorter than the 168-hour (7-day) minimum enforced by the retention safety check described in the next section.

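Before deleting anything, it can help to preview what VACUUM would remove. The sketch below builds the statement as a string, including the DRY RUN clause, which lists the files that would be deleted without deleting them. build_vacuum_sql is a hypothetical helper written for this walkthrough, not part of Delta Lake or PySpark.

```python
def build_vacuum_sql(path, retain_hours=None, dry_run=False):
    """Build a VACUUM statement for a path-based Delta table (hypothetical helper)."""
    if retain_hours is not None and retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    sql = f"VACUUM '{path}'"
    if retain_hours is not None:
        sql += f" RETAIN {retain_hours} HOURS"
    if dry_run:
        # DRY RUN only reports the files that would be deleted.
        sql += " DRY RUN"
    return sql


preview = build_vacuum_sql("/mnt/delta-table-example", retain_hours=168, dry_run=True)
print(preview)  # VACUUM '/mnt/delta-table-example' RETAIN 168 HOURS DRY RUN

# In a Databricks notebook, where `spark` is available, the preview could be run as:
# spark.sql(preview).display()
```

Running the preview first, then the same statement without DRY RUN, keeps the retention period identical between the check and the actual cleanup.
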
4. Configuring Retention Period
By default, Delta Lake has a safety check that rejects a VACUUM whose retention period is shorter than 7 days (168 hours). If you need to vacuum with a shorter retention period (e.g., for testing purposes), you can disable this safety check by setting the spark.databricks.delta.retentionDurationCheck.enabled configuration to false.

Note: Be very careful with this setting. A short retention period can delete files that time travel queries or still-running readers and writers depend on, which can result in data loss.

In [0]:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Vacuum the table to remove files older than 1 hour
spark.sql("VACUUM '/mnt/delta-table-example' RETAIN 1 HOURS")

# Re-enable the retention duration check after vacuuming
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")
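
If the VACUUM call in the cell above fails, the safety check stays disabled for the rest of the session. One way to guard against that is a small context manager that restores the previous value even when an error occurs. This is a sketch: retention_check_disabled is a hypothetical helper, and FakeSpark and FakeConf exist only so the example runs outside a notebook. In Databricks you would pass the real spark session instead.

```python
from contextlib import contextmanager

RETENTION_CHECK_KEY = "spark.databricks.delta.retentionDurationCheck.enabled"


@contextmanager
def retention_check_disabled(spark):
    """Temporarily disable the retention safety check, then restore the prior value."""
    previous = spark.conf.get(RETENTION_CHECK_KEY, "true")
    spark.conf.set(RETENTION_CHECK_KEY, "false")
    try:
        yield
    finally:
        # Runs on success and on failure, so the check is never left disabled.
        spark.conf.set(RETENTION_CHECK_KEY, previous)


class FakeConf:
    """Minimal stand-in for spark.conf, used only to make this sketch runnable."""
    def __init__(self):
        self._values = {}

    def get(self, key, default=None):
        return self._values.get(key, default)

    def set(self, key, value):
        self._values[key] = value


class FakeSpark:
    """Minimal stand-in for a SparkSession exposing only `.conf`."""
    def __init__(self):
        self.conf = FakeConf()


fake = FakeSpark()
with retention_check_disabled(fake):
    inside = fake.conf.get(RETENTION_CHECK_KEY)
    # With a real session: spark.sql("VACUUM '/mnt/delta-table-example' RETAIN 1 HOURS")
after = fake.conf.get(RETENTION_CHECK_KEY)
print(inside, after)  # false true
```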


Best Practices

Regular Maintenance: Schedule regular VACUUM commands to keep the storage optimized and remove unnecessary files.

Retention Period: Carefully choose the retention period based on your data retention policies and the frequency of updates and deletes.

Monitoring: Monitor the storage usage and performance before and after vacuuming to ensure it has the desired effect.

Safety Checks: Use the spark.databricks.delta.retentionDurationCheck.enabled configuration with caution, especially in production environments, to avoid accidental data loss.
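
As a rough illustration of the Retention Period advice, the sketch below picks a retention period that covers both the time travel window you want to keep and the longest job that reads or writes the table. The function name, its parameters, and the rule itself are assumptions made for illustration, not an official Delta Lake formula.

```python
def choose_retention_hours(time_travel_days, longest_job_hours, minimum_hours=168):
    """Pick a retention period (in hours) covering time travel and long-running jobs."""
    if time_travel_days < 0 or longest_job_hours < 0:
        raise ValueError("inputs must be non-negative")
    # Never go below the 7-day default unless you have deliberately decided to.
    return max(minimum_hours, time_travel_days * 24, longest_job_hours)


print(choose_retention_hours(14, 36))  # 336
print(choose_retention_hours(3, 12))   # 168

hours = choose_retention_hours(14, 36)
statement = f"VACUUM '/mnt/delta-table-example' RETAIN {hours} HOURS"
print(statement)  # VACUUM '/mnt/delta-table-example' RETAIN 336 HOURS
```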

By using the VACUUM command with a retention period that matches your rollback and time travel needs, you can keep the storage footprint of your Delta Lake tables on Databricks under control without losing data you still depend on.