
# Delta Lake internals
<img src="https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-logo-whitebackground.png" style="width:200px; float: right"/>

Let's take a deep dive into Delta Lake internals.

## Exploring delta structure

Under the hood, a Delta table is composed of Parquet files and a transaction log. The transaction log records every metadata operation performed on the table. Databricks leverages this information to perform efficient data skipping at scale, among other things.
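To make the data-skipping idea concrete, here is a minimal sketch in plain Python. It is not the Databricks implementation. It assumes each data file is recorded in the log with per-column min/max statistics stored as a JSON string, as `add` actions are in the Delta protocol. The file names, the `id` column and the `files_to_scan` helper are hypothetical.

```python
import json

# Hypothetical "add" actions, shaped like entries of a Delta commit file.
# File names and statistics are made up for illustration.
add_actions = [
    {"path": "part-0000.parquet",
     "stats": json.dumps({"numRecords": 500, "minValues": {"id": 1}, "maxValues": {"id": 500}})},
    {"path": "part-0001.parquet",
     "stats": json.dumps({"numRecords": 500, "minValues": {"id": 501}, "maxValues": {"id": 1000}})},
    {"path": "part-0002.parquet",
     "stats": json.dumps({"numRecords": 500, "minValues": {"id": 1001}, "maxValues": {"id": 1500}})},
]

def files_to_scan(actions, column, value):
    """Return the files whose [min, max] range for `column` may contain `value`."""
    selected = []
    for action in actions:
        stats = json.loads(action["stats"])  # stats are stored as a JSON string
        low = stats["minValues"][column]
        high = stats["maxValues"][column]
        if low <= value <= high:
            selected.append(action["path"])
    return selected

# A query such as `WHERE id = 750` only needs to open one of the three files.
print(files_to_scan(add_actions, "id", 750))
```

The point of the sketch is that the decision is made from log metadata alone, before any Parquet file is opened.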


In [0]:
%run ./_resources/00-setup $reset_all_data=false

## Configuration file

Please change your catalog and schema here to run the demo on a different catalog.





# Technical Setup notebook. Hide this cell's results
Initialize the dataset for the current user and clean up data when reset_all_data is set to true

Do not edit

USE CATALOG `main`
using catalog.database `main`.`dbdemos_delta_lake`


### Exploring delta structure

A Delta table is composed of Parquet files and a transaction log. Let's write a table to a volume and look at what lands on storage:

In [0]:
%python
spark.table('user_delta').write.mode('overwrite').save(f'/Volumes/{catalog}/{schema}/{volume_name}/user_delta_table')

In [0]:

DESCRIBE DETAIL `delta`.`/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table`

format,id,name,description,location,createdAt,lastModified,partitionColumns,clusteringColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion,tableFeatures,statistics,clusterByAuto
delta,5af6a115-5c54-46ec-9dfc-05368431297a,,,dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table,2026-02-24T12:06:45.280Z,2026-02-24T12:07:45.000Z,List(),List(),1,73071,Map(delta.enableDeletionVectors -> true),3,7,"List(appendOnly, deletionVectors, invariants)","Map(numRowsDeletedByDeletionVectors -> 0, numDeletionVectors -> 0)",False


In [0]:
%python
delta_folder = spark.sql(f"DESCRIBE DETAIL `delta`.`/Volumes/{catalog}/{schema}/{volume_name}/user_delta_table`").collect()[0]['location']
print(delta_folder)
display(dbutils.fs.ls(delta_folder))

dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table


path,name,size,modificationTime
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/_delta_log/,_delta_log/,0,1771935051944
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/part-00000-7daaa23d-d6d2-4fe8-b528-ec5ca11beca6.c000.snappy.parquet,part-00000-7daaa23d-d6d2-4fe8-b528-ec5ca11beca6.c000.snappy.parquet,73071,1771934807000
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/part-00000-b70cf92b-d0df-4f91-a66e-9b1ae9cf703e.c000.snappy.parquet,part-00000-b70cf92b-d0df-4f91-a66e-9b1ae9cf703e.c000.snappy.parquet,73071,1771934830000
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/part-00000-f64f4a50-c35e-4a1c-a2a3-de361bca1e6c.c000.snappy.parquet,part-00000-f64f4a50-c35e-4a1c-a2a3-de361bca1e6c.c000.snappy.parquet,73071,1771934865000


In [0]:
%python
display(dbutils.fs.ls(delta_folder+"/_delta_log"))

path,name,size,modificationTime
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/_delta_log/00000000000000000000.crc,00000000000000000000.crc,3709,1771934808000
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/_delta_log/00000000000000000000.json,00000000000000000000.json,2787,1771934808000
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/_delta_log/00000000000000000001.crc,00000000000000000001.crc,3709,1771934831000
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/_delta_log/00000000000000000001.json,00000000000000000001.json,2131,1771934830000
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/_delta_log/00000000000000000002.crc,00000000000000000002.crc,3709,1771934865000
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/_delta_log/00000000000000000002.json,00000000000000000002.json,2131,1771934865000
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/_delta_log/_staged_commits/,_staged_commits/,0,1771935318473
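The listing above shows one `.json` commit file per table version, named with the version number zero-padded to 20 digits. A small sketch of that naming convention follows; the helper names `commit_file_name` and `versions_in_listing` are hypothetical and only illustrate how file names map to versions.

```python
def commit_file_name(version: int) -> str:
    """Build the commit file name for a table version (20-digit zero padding)."""
    return f"{version:020d}.json"

def versions_in_listing(names):
    """Extract the sorted table versions from a list of _delta_log entry names.

    Checksum files (.crc) and sub-directories are ignored.
    """
    versions = []
    for name in names:
        stem, dot, extension = name.partition(".")
        if extension == "json" and stem.isdigit():
            versions.append(int(stem))
    return sorted(versions)

# Entry names copied from the listing above
log_entries = [
    "00000000000000000000.crc",
    "00000000000000000000.json",
    "00000000000000000001.crc",
    "00000000000000000001.json",
    "00000000000000000002.crc",
    "00000000000000000002.json",
    "_staged_commits/",
]

print(versions_in_listing(log_entries))
print(commit_file_name(2))
```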


In [0]:
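%python
import json

# Each line of a commit file is one JSON action; pretty-print the first one (commitInfo)
commit_log = dbutils.fs.head(delta_folder+"/_delta_log/00000000000000000000.json", 10000)
print(json.dumps(json.loads(commit_log.split('\n')[0]), indent = 2))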

{
  "commitInfo": {
    "timestamp": 1771934807053,
    "userId": "70771123715188",
    "userName": "rishavkumar7011@gmail.com",
    "operation": "WRITE",
    "operationParameters": {
      "mode": "Overwrite",
      "statsOnLoad": false,
      "partitionBy": "[]"
    },
    "notebook": {
      "notebookId": "3154286164871011"
    },
    "queryHistoryStatementId": "f9432258-8410-4567-83a3-a03403ed539c",
    "clusterId": "0224-104039-6zcuzeq3-v2n",
    "isolationLevel": "WriteSerializable",
    "isBlindAppend": false,
    "operationMetrics": {
      "numFiles": "1",
      "numRemovedFiles": "0",
      "numRemovedBytes": "0",
      "numDeletionVectorsRemoved": "0",
      "numOutputRows": "1001",
      "numOutputBytes": "73071"
    },
    "tags": {
      "noRowsCopied": "true",
      "restoresDeletedRows": "false"
    },
    "engineInfo": "Databricks-Runtime/18.0.x-aarch64-photon-scala2.13",
    "txnId": "1d68ed46-a0de-455d-8ac7-a7ebd82c8e60"
  }
}
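The cell above prints only the first line of the commit file. A commit file is newline-delimited JSON, with one action per line, so the remaining lines can be inspected the same way. The sketch below runs on a shortened, hypothetical commit (the `sample_commit` text and the `action_types` helper are assumptions for illustration, not the content of the real file).

```python
import json

# Hypothetical, heavily shortened commit: one JSON action per line
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"protocol": {"minReaderVersion": 3, "minWriterVersion": 7}}),
    json.dumps({"metaData": {"id": "example-table-id"}}),
    json.dumps({"add": {"path": "part-00000-example.parquet", "size": 73071, "dataChange": True}}),
])

def action_types(commit_text):
    """Return the action type (top-level key) of every line in a commit file."""
    types = []
    for line in commit_text.splitlines():
        if not line.strip():
            continue  # skip empty trailing lines
        action = json.loads(line)
        types.extend(action.keys())
    return types

print(action_types(sample_commit))
```

In the notebook, the same function could be applied to the `commit_log` string read with `dbutils.fs.head`.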


## Unpacking the transaction log
The transaction log is key to understanding Delta Lake because it is the common thread that runs through many of its most important features, including ACID transactions, scalable metadata handling, time travel, and more. The Delta Lake transaction log is an ordered record of every transaction that has ever been performed on a Delta Lake table since its inception.
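Because the log is an ordered record, the current state of a table can be rebuilt by replaying its commits in order. The sketch below illustrates this with `add` and `remove` actions only. It is a simplification under stated assumptions: the commits and file names are hypothetical, and a real reader also handles checkpoints, metadata and protocol actions.

```python
def active_files(commits):
    """Replay commits in order and return the set of files in the resulting table."""
    files = set()
    for actions in commits:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

# Hypothetical log: three overwrites, each replacing the previous file
commits = [
    [{"add": {"path": "a.parquet"}}],                                      # version 0
    [{"remove": {"path": "a.parquet"}}, {"add": {"path": "b.parquet"}}],   # version 1
    [{"remove": {"path": "b.parquet"}}, {"add": {"path": "c.parquet"}}],   # version 2
]

print(active_files(commits))      # state at the latest version
print(active_files(commits[:1]))  # state at version 0
```

Replaying only a prefix of the commits gives the file set at an earlier version, which is the idea behind time travel.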

## OPTIMIZE in action
Running `OPTIMIZE` compacts small files into larger ones, and `VACUUM` then deletes the files that the table no longer references.

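As you can see, we have multiple small Parquet files in our folder. Note that `DESCRIBE DETAIL` above reported `numFiles` as 1, so only one of them belongs to the current table version: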

In [0]:
%python
display(dbutils.fs.ls(delta_folder))

path,name,size,modificationTime
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/_delta_log/,_delta_log/,0,1771935484140
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/part-00000-7daaa23d-d6d2-4fe8-b528-ec5ca11beca6.c000.snappy.parquet,part-00000-7daaa23d-d6d2-4fe8-b528-ec5ca11beca6.c000.snappy.parquet,73071,1771934807000
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/part-00000-b70cf92b-d0df-4f91-a66e-9b1ae9cf703e.c000.snappy.parquet,part-00000-b70cf92b-d0df-4f91-a66e-9b1ae9cf703e.c000.snappy.parquet,73071,1771934830000
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/part-00000-f64f4a50-c35e-4a1c-a2a3-de361bca1e6c.c000.snappy.parquet,part-00000-f64f4a50-c35e-4a1c-a2a3-de361bca1e6c.c000.snappy.parquet,73071,1771934865000


Let's run `OPTIMIZE` on our table to see how the engine compacts it:

In [0]:
OPTIMIZE `delta`.`/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table`;
-- as we vacuum with 0 hours, we need to remove the safety check:

-- Note: commented out as this option isn't available on serverless compute for now - see ES-1302674
-- set spark.databricks.delta.retentionDurationCheck.enabled = false;

-- VACUUM `delta`.`/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table` retain 0 hours;

path,metrics
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table,"List(0, 0, List(null, null, 0.0, 0, 0), List(null, null, 0.0, 0, 0), 0, null, null, 0, 0, 1, 1, true, 0, 0, 1771935506966, 1771935507716, 8, 0, null, List(0, 0), null, 8, 8, 0, 0, null, null)"


In [0]:
%python
display(dbutils.fs.ls(delta_folder))

path,name,size,modificationTime
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/_delta_log/,_delta_log/,0,1771935534276
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/part-00000-7daaa23d-d6d2-4fe8-b528-ec5ca11beca6.c000.snappy.parquet,part-00000-7daaa23d-d6d2-4fe8-b528-ec5ca11beca6.c000.snappy.parquet,73071,1771934807000
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/part-00000-b70cf92b-d0df-4f91-a66e-9b1ae9cf703e.c000.snappy.parquet,part-00000-b70cf92b-d0df-4f91-a66e-9b1ae9cf703e.c000.snappy.parquet,73071,1771934830000
dbfs:/Volumes/main/dbdemos_delta_lake/delta_lake_raw_data/user_delta_table/part-00000-f64f4a50-c35e-4a1c-a2a3-de361bca1e6c.c000.snappy.parquet,part-00000-f64f4a50-c35e-4a1c-a2a3-de361bca1e6c.c000.snappy.parquet,73071,1771934865000
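This listing is identical to the one taken before `OPTIMIZE`, which is consistent with the `VACUUM` statement being commented out: files that are no longer referenced by the table stay on storage until `VACUUM` removes them. The sketch below shows the selection rule in simplified form. It assumes a file is eligible when it is not part of the current table version and is older than the retention window. The `vacuum_candidates` helper, file names and timestamps are hypothetical, and the real command applies further checks.

```python
def vacuum_candidates(files_on_storage, active_paths, now_ms, retention_hours):
    """Return files that are unreferenced and older than the retention window.

    files_on_storage maps a file name to its modification time in milliseconds.
    """
    cutoff_ms = now_ms - retention_hours * 3600 * 1000
    return sorted(
        name
        for name, modified_ms in files_on_storage.items()
        if name not in active_paths and modified_ms < cutoff_ms
    )

# Hypothetical storage listing: three files, only the newest is still active
files_on_storage = {"a.parquet": 1_000, "b.parquet": 2_000, "c.parquet": 3_000}
active_paths = {"c.parquet"}

# With a 0-hour retention, every unreferenced file is eligible
print(vacuum_candidates(files_on_storage, active_paths, now_ms=10_000, retention_hours=0))
# With a 7-day retention, these recently written files are all kept
print(vacuum_candidates(files_on_storage, active_paths, now_ms=10_000, retention_hours=168))
```

The second call illustrates why a retention check exists: a long retention window keeps recently replaced files available to readers and to time travel.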


That's it! You know everything about Delta Lake!

As a next step, you can learn more about Spark Declarative Pipelines to simplify your ingestion pipeline: `dbdemos.install('pipeline-bike')`

Go back to [00-Delta-Lake-Introduction]($./00-Delta-Lake-Introduction).