## Exploring delta structure

Under the hood, a Delta table is composed of Parquet files and a transaction log. The transaction log records every operation performed on the table, along with its metadata. Databricks uses this information to perform efficient data skipping at scale, among other things.
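
To make the data-skipping idea concrete, here is a minimal sketch in plain Python. It assumes each `add` entry in the log carries per-column min/max statistics in its `stats` field. The file names, the `age` column, and the `files_to_scan` helper are hypothetical, and this is an illustration of the principle rather than how Databricks implements it internally.

```python
import json

# Hypothetical `add` entries, shaped like those found in a Delta commit file.
# In a real log, `stats` is a JSON-encoded string, as shown here.
add_actions = [
    {"path": "part-0000.parquet",
     "stats": json.dumps({"numRecords": 100, "minValues": {"age": 18}, "maxValues": {"age": 35}})},
    {"path": "part-0001.parquet",
     "stats": json.dumps({"numRecords": 120, "minValues": {"age": 36}, "maxValues": {"age": 60}})},
    {"path": "part-0002.parquet",
     "stats": json.dumps({"numRecords": 80, "minValues": {"age": 61}, "maxValues": {"age": 90}})},
]


def files_to_scan(actions, column, value):
    """Return paths of files whose [min, max] range for `column` could contain `value`."""
    selected = []
    for action in actions:
        stats = json.loads(action["stats"])
        low = stats["minValues"][column]
        high = stats["maxValues"][column]
        if low <= value <= high:
            selected.append(action["path"])
    return selected


# A filter such as `WHERE age = 42` only needs to read the second file.
print(files_to_scan(add_actions, "age", 42))
```

The other two files can be skipped without being opened, because their statistics already rule out a match.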

In [0]:
%run ./_resources/00-setup $reset_all_data=false

In [0]:
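# Write the table as Delta files to a path in the volume.
# Delta is the default format on Databricks; it is set explicitly here for clarity.
(spark.table("user_delta")
    .write
    .format("delta")
    .mode("overwrite")
    .save(f"/Volumes/{catalog}/{schema}/{volume_name}/user_delta_table"))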

In [0]:
%sql
DESCRIBE DETAIL `delta`.`/Volumes/delta_learning/dev/delta_lake_raw_data/user_delta_table`

In [0]:
# A Delta table is composed of Parquet files:
delta_folder = spark.sql(
    "DESCRIBE DETAIL `delta`.`/Volumes/delta_learning/dev/delta_lake_raw_data/user_delta_table`"
).collect()[0]['location']
print(delta_folder)

display(dbutils.fs.ls(delta_folder))

In [0]:
# And a transactional log:
display(dbutils.fs.ls(delta_folder + "/_delta_log"))

In [0]:
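import json

# Read the first commit file (each line is one JSON action) and pretty-print its first action:
commit_log = dbutils.fs.head(delta_folder + "/_delta_log/00000000000000000000.json", 10000)
print(json.dumps(json.loads(commit_log.split('\n')[0]), indent=2))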
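
The cell above prints only the first line of the commit file, but every line in it is a separate JSON action. As a rough sketch, the helper below counts the actions in a commit by type. The `sample_commit` string is a hypothetical, heavily shortened stand-in for real file contents, and `summarize_commit` is an illustrative name, not part of any Delta API.

```python
import json
from collections import Counter

# Hypothetical, shortened stand-in for the contents of a commit file.
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    json.dumps({"metaData": {"id": "example-id"}}),
    json.dumps({"add": {"path": "part-0000.parquet", "size": 1024}}),
    json.dumps({"add": {"path": "part-0001.parquet", "size": 2048}}),
])


def summarize_commit(commit_text):
    """Count actions by type; each non-empty line is one JSON object with a single top-level key."""
    counts = Counter()
    for line in commit_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines, e.g. a trailing newline
        action = json.loads(line)
        counts.update(action.keys())
    return dict(counts)


print(summarize_commit(sample_commit))
```

In the notebook, the same helper could be applied to `commit_log`. Keep in mind that `dbutils.fs.head` truncates at the requested byte limit, so the last line of a large commit may be incomplete and fail to parse.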

## OPTIMIZE in action
Running `OPTIMIZE` compacts the many small Parquet files into fewer, larger ones. `VACUUM` then deletes the old files that the table no longer references.
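
One way to picture this is through the transaction log: a compaction commit marks the small files as removed and adds the new, larger file, without deleting anything from storage. The sketch below replays such a log in plain Python. The file names and the `live_files` helper are hypothetical, and real `remove` actions carry more fields (such as a deletion timestamp).

```python
def live_files(commits):
    """Replay commits in order and return the set of files in the latest table version."""
    current = set()
    for actions in commits:
        for action in actions:
            if "add" in action:
                current.add(action["add"]["path"])
            elif "remove" in action:
                current.discard(action["remove"]["path"])
    return current


# Hypothetical log: version 0 writes three small files, version 1 compacts them.
commits = [
    [
        {"add": {"path": "small-0.parquet"}},
        {"add": {"path": "small-1.parquet"}},
        {"add": {"path": "small-2.parquet"}},
    ],
    [
        {"remove": {"path": "small-0.parquet"}},
        {"remove": {"path": "small-1.parquet"}},
        {"remove": {"path": "small-2.parquet"}},
        {"add": {"path": "compacted-0.parquet"}},
    ],
]

print(live_files(commits))      # files in the latest version
print(live_files(commits[:1]))  # files as of version 0
```

The small files are still needed to read version 0, which is why they stay on disk until they are cleaned up explicitly.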

In [0]:
# As you can see, we have multiple small parquet files in our folder:
display(dbutils.fs.ls(delta_folder))

In [0]:
%sql
OPTIMIZE `delta`.`/Volumes/delta_learning/dev/delta_lake_raw_data/user_delta_table`;

In [0]:
%sql
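-- VACUUM with a 0-hour retention is blocked by default, so we disable the safety check.
-- Demo only: this deletes the files needed to time travel to earlier versions.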
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

In [0]:
%sql
VACUUM `delta`.`/Volumes/delta_learning/dev/delta_lake_raw_data/user_delta_table` RETAIN 0 HOURS;
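
As a simplified model of the retention setting, the sketch below selects deletion candidates: files that the current table version no longer references and that are at least as old as the retention window. The file names, the ages, and the `vacuum_candidates` helper are hypothetical, and the real command applies more rules than this.

```python
# Hypothetical storage listing: path -> age of the file in hours.
files_on_disk = {
    "small-0.parquet": 1.0,
    "small-1.parquet": 1.0,
    "compacted-0.parquet": 0.5,
}

# Files referenced by the current table version (assumed for this example).
live = {"compacted-0.parquet"}


def vacuum_candidates(files_on_disk, live, retain_hours):
    """Return unreferenced files that are at least `retain_hours` old, sorted by path."""
    return sorted(
        path
        for path, age_hours in files_on_disk.items()
        if path not in live and age_hours >= retain_hours
    )


# With a 0-hour retention, every unreferenced file qualifies immediately.
print(vacuum_candidates(files_on_disk, live, retain_hours=0))

# With the default 7-day retention (168 hours), these recent files are kept.
print(vacuum_candidates(files_on_disk, live, retain_hours=168))
```

This is why the demo needs `RETAIN 0 HOURS` to see the small files disappear right away.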

In [0]:
# After OPTIMIZE + VACUUM, only the compacted Parquet file remains (alongside the _delta_log folder):
display(dbutils.fs.ls(delta_folder))