
# Delta Lake internals
<img src="https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-logo-whitebackground.png" style="width:200px; float: right"/>

Let's take a deep dive into Delta Lake internals.

## Exploring delta structure

Under the hood, a Delta table is composed of Parquet files and a transaction log. The transaction log records every operation applied to the table, together with the table and file metadata. Databricks uses this information to, among other things, perform efficient data skipping at scale.
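
To make the data skipping idea concrete, here is a minimal sketch in plain Python. It assumes the per-file statistics layout used by `add` actions in the transaction log (a `stats` JSON string with `minValues` and `maxValues`). The file names, the `id` column, and the helper `files_to_scan` are invented for illustration; this is not how the Databricks engine is implemented.

```python
import json

# Hypothetical `add` actions, shaped like those in a Delta commit file.
# File names and statistics are invented for illustration.
add_actions = [
    {"path": "part-0000.parquet",
     "stats": json.dumps({"numRecords": 100, "minValues": {"id": 1}, "maxValues": {"id": 100}})},
    {"path": "part-0001.parquet",
     "stats": json.dumps({"numRecords": 100, "minValues": {"id": 101}, "maxValues": {"id": 200}})},
    {"path": "part-0002.parquet",
     "stats": json.dumps({"numRecords": 50, "minValues": {"id": 201}, "maxValues": {"id": 250}})},
]

def files_to_scan(actions, column, lower, upper):
    """Return paths whose [min, max] range for `column` overlaps [lower, upper]."""
    selected = []
    for action in actions:
        stats = json.loads(action["stats"])  # stats is stored as a JSON string
        file_min = stats["minValues"].get(column)
        file_max = stats["maxValues"].get(column)
        if file_min is None or file_max is None:
            # Without statistics for the column, the file cannot be skipped safely.
            selected.append(action["path"])
        elif file_min <= upper and file_max >= lower:
            selected.append(action["path"])
    return selected

# A filter such as `id BETWEEN 150 AND 180` only needs the second file.
print(files_to_scan(add_actions, "id", 150, 180))  # ['part-0001.parquet']
```

The other two files are never opened: the decision is made from metadata alone, which is why keeping statistics in the log pays off on large tables.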

<!-- [metadata={"description":"Quick introduction to Delta Lake. <br/><i>Use this content for quick Delta demo.</i>",
 "authors":["quentin.ambard@databricks.com"],
 "db_resources":{}}] -->

In [0]:
%run ./_resources/00-setup $reset_all_data=false

Let's save the `user_delta` table to a volume so we can inspect the Parquet files and the transaction log it produces:

In [0]:
%python
spark.table('user_delta').write.mode('overwrite').save(f'/Volumes/{catalog}/{schema}/{volume_name}/user_delta_table')

In [0]:

DESCRIBE DETAIL `delta`.`/Volumes/pds/dbdemos_sharing_airlinedata/delta_lake_raw_data/user_delta_table`

In [0]:
%python
delta_folder = spark.sql(f"DESCRIBE DETAIL `delta`.`/Volumes/{catalog}/{schema}/{volume_name}/user_delta_table`").collect()[0]['location']
print(delta_folder)
display(dbutils.fs.ls(delta_folder))

In [0]:
%python
display(dbutils.fs.ls(delta_folder+"/_delta_log"))

In [0]:
%python
import json
# Each line of a commit file is one JSON action; pretty-print the first one
commit_log = dbutils.fs.head(delta_folder+"/_delta_log/00000000000000000000.json", 10000)
print(json.dumps(json.loads(commit_log.split('\n')[0]), indent = 2))
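
The cell above prints only the first action of the first commit. As a rough sketch of what the rest of such a file holds, the plain-Python helpers below count actions by type and build a commit file name from a version number. The helpers `commit_file_name` and `summarize_commit` and the sample content are assumptions made for illustration, trimmed to a few fields.

```python
import json
from collections import Counter

def commit_file_name(version):
    """Commit files are named with the version zero-padded to 20 digits."""
    return f"{version:020d}.json"

def summarize_commit(commit_text):
    """Count the actions in a commit file, keyed by action type."""
    counts = Counter()
    for line in commit_text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        counts.update(action.keys())  # each line holds a single top-level key
    return dict(counts)

# Invented sample content, trimmed to the fields needed here.
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"metaData": {"format": {"provider": "parquet"}}}),
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    json.dumps({"add": {"path": "part-0000.parquet", "size": 1024, "dataChange": True}}),
    json.dumps({"add": {"path": "part-0001.parquet", "size": 2048, "dataChange": True}}),
])

print(commit_file_name(0))              # 00000000000000000000.json
print(summarize_commit(sample_commit))  # {'commitInfo': 1, 'metaData': 1, 'protocol': 1, 'add': 2}
```

In the notebook, `summarize_commit(commit_log)` should work on the string read with `dbutils.fs.head`, provided the 10000-byte limit did not cut the file in the middle of a line.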

## OPTIMIZE in action
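Running `OPTIMIZE` compacts small files into larger ones, and `VACUUM` then deletes the data files that are no longer referenced by the table and are older than the retention threshold.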

As you can see, we have multiple small parquet files in our folder:

In [0]:
%python
display(dbutils.fs.ls(delta_folder))

Let's OPTIMIZE our table to see how the engine will compact the table:

In [0]:
OPTIMIZE `delta`.`/Volumes/pds/dbdemos_sharing_airlinedata/delta_lake_raw_data/user_delta_table`;
-- as we vacuum with 0 hours, we need to remove the safety check:

-- Note: commented out as this option isn't available on serverless compute for now - see ES-1302674
-- set spark.databricks.delta.retentionDurationCheck.enabled = false;

-- VACUUM `delta`.`/Volumes/pds/dbdemos_sharing_airlinedata/delta_lake_raw_data/user_delta_table` retain 0 hours;

In [0]:
%python
display(dbutils.fs.ls(delta_folder))
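
Because `VACUUM` is commented out above, the old small files should still be listed next to the compacted ones. The sketch below illustrates why, using plain Python and invented commits: the current snapshot is obtained by replaying `add` and `remove` actions in order, and files present on disk but absent from the snapshot are the ones `VACUUM` could later delete. The helper `live_files` and all file names are hypothetical, and the sketch ignores checkpoints and the retention threshold.

```python
import json

def live_files(commits):
    """Replay commits in version order and return the files in the current snapshot."""
    active = set()
    for commit_text in commits:
        for line in commit_text.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
    return active

small_files = ["small-0.parquet", "small-1.parquet", "small-2.parquet"]

# Invented commits: version 0 writes three small files, version 1 is an
# OPTIMIZE that replaces them with a single compacted file.
commit_0 = "\n".join(json.dumps({"add": {"path": p}}) for p in small_files)
commit_1 = "\n".join(
    [json.dumps({"remove": {"path": p}}) for p in small_files]
    + [json.dumps({"add": {"path": "compacted-0.parquet"}})]
)

snapshot = live_files([commit_0, commit_1])
on_disk = set(small_files) | {"compacted-0.parquet"}  # nothing deleted yet

print(sorted(snapshot))            # ['compacted-0.parquet']
print(sorted(on_disk - snapshot))  # ['small-0.parquet', 'small-1.parquet', 'small-2.parquet']
```

Keeping the removed files on disk until `VACUUM` runs is what lets older versions of the table remain readable in the meantime.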

That's it! You know everything about Delta Lake!

As a next step, you can learn more about Delta Live Tables to simplify your ingestion pipeline: `dbdemos.install('delta-live-table')`

Go back to [00-Delta-Lake-Introduction]($./00-Delta-Lake-Introduction).

In [0]:
COPY INTO `delta`.`/Volumes/pds/dbdemos_sharing_airlinedata/delta_lake_raw_data/user_delta_table`
FROM 'https://dbc-092fbfc3-eebd.cloud.databricks.com/browse/folders/1990082293548072?o=4214571749987147'
FILES = ('https://dbc-092fbfc3-eebd.cloud.databricks.com/browse/folders/357045988982663?o=4214571749987147')
