# Fundamentals

## Protocol

See the [Delta Transaction Log Protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#delta-transaction-log-protocol) specification.

In [3]:
protocol = {
    "protocol": {
        "minReaderVersion": 1,
        "minWriterVersion": 1
    }
}
print(protocol)

{'protocol': {'minReaderVersion': 1, 'minWriterVersion': 1}}
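
A client is expected to compare `minReaderVersion` with what it implements before reading the table. The check can be sketched as below; `SUPPORTED_READER_VERSION` and `can_read` are hypothetical names chosen for illustration, not part of any Delta library.

```python
# Hypothetical: the highest reader protocol version this client implements.
SUPPORTED_READER_VERSION = 1

def can_read(protocol_action: dict, supported: int = SUPPORTED_READER_VERSION) -> bool:
    """Return True if a client supporting `supported` may read the table."""
    return protocol_action["protocol"]["minReaderVersion"] <= supported

example_protocol = {"protocol": {"minReaderVersion": 1, "minWriterVersion": 1}}
newer_protocol = {"protocol": {"minReaderVersion": 3, "minWriterVersion": 7}}

print(can_read(example_protocol))  # True
print(can_read(newer_protocol))    # False
```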


## Tables need a schema

In [4]:
from uuid import uuid4
from pprint import pprint
import time
import json

schema = {
    "type":"struct",
    "fields":[
        {"name":"id", "type":"long", "nullable":False, "metadata":{}},
        {"name":"name", "type":"string", "nullable":False, "metadata":{}},
    ]
}

metadata = {
    "metaData": {
        "id": str(uuid4()),
        "name": None,
        "description": None,
        "format": {
            "provider": "parquet", # Theoretically other formats can be used, but the community uses Parquet everywhere.
            "options": {}
        },
        "schemaString": json.dumps(schema),
        "partitionColumns": [],
        "createdTime": int(time.time_ns() / 1000000),
        "configuration": {}
    }
}
pprint(metadata)

{'metaData': {'configuration': {},
              'createdTime': 1689735864575,
              'description': None,
              'format': {'options': {}, 'provider': 'parquet'},
              'id': '8a4e6c15-1e57-40a6-b7bc-97c9c806fc31',
              'name': None,
              'partitionColumns': [],
              'schemaString': '{"type": "struct", "fields": [{"name": "id", '
                              '"type": "long", "nullable": false, "metadata": '
                              '{}}, {"name": "name", "type": "string", '
                              '"nullable": false, "metadata": {}}]}'}}
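
Note that `schemaString` is a JSON document stored as a string inside the `metaData` action, so a reader has to decode it a second time. A minimal sketch, assuming the same two-column schema as above; `column_types` is a hypothetical helper name.

```python
import json

# Same shape as the `schema` dict above, serialised the way `schemaString` stores it.
schema_string = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {"name": "name", "type": "string", "nullable": False, "metadata": {}},
    ],
})

def column_types(schema_string: str) -> dict:
    """Map each top-level column name to its declared type."""
    fields = json.loads(schema_string)["fields"]
    return {field["name"]: field["type"] for field in fields}

print(column_types(schema_string))  # {'id': 'long', 'name': 'string'}
```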


## Helper function to write a pandas DataFrame to Parquet

In [5]:
import pandas as pd
def write_parquet_file(data, path: str):
    df = pd.DataFrame(data)
    if not path.startswith("/lakehouse/default/"):
        path = f"/lakehouse/default/{path}"
    df.to_parquet(path)

## Create an empty table

In [6]:
import uuid

root_path = "Files/delta-lake/1-fundamentals"

if mssparkutils.fs.exists(root_path):
    mssparkutils.fs.rm(root_path, recurse=True)

table_path = f"{root_path}/{uuid.uuid4()}"

protocol_str = json.dumps(protocol)
metadata_str = json.dumps(metadata)

# A commit file is newline-delimited JSON: one action per line.
content = protocol_str + "\n" + metadata_str

mssparkutils.fs.put(f"{table_path}/_delta_log/00000000000000000000.json", content)
spark.read.format("delta").load(table_path).show()

+---+----+
| id|name|
+---+----+
+---+----+
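
The commit file name used above is the table version zero-padded to 20 digits. A small sketch of that convention; the two helper names are made up for illustration.

```python
def commit_file_name(version: int) -> str:
    """Build the log file name for a version, e.g. 0 -> 00000000000000000000.json."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"{version:020d}.json"

def commit_version(file_name: str) -> int:
    """Recover the version number from a commit file name."""
    return int(file_name.split(".")[0])

print(commit_file_name(0))                    # 00000000000000000000.json
print(commit_version(commit_file_name(42)))   # 42
```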



## Add a few records to the table

In [7]:
write_parquet_file( 
    {"id": [1, 2], "name": ["first", "second"]}, 
    f"{table_path}/file1.parquet"
)

add = {
    "add": {
        "path": "file1.parquet",
        "size": mssparkutils.fs.ls(f"{table_path}/file1.parquet")[0].size,
        "partitionValues": {},
        "modificationTime": int(time.time_ns() / 1000000),
        "dataChange": True,
        "tags": None
    }
}

add_content = json.dumps(add)
mssparkutils.fs.put(f"{table_path}/_delta_log/00000000000000000001.json", add_content, overwrite=True)

display(spark.read.format("delta").load(table_path))
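
Every commit file written so far is newline-delimited JSON in which each line is an object with a single top-level key naming the action. A minimal parsing sketch under that assumption; `parse_commit` is a hypothetical helper and the sample actions are trimmed for brevity.

```python
import json

def parse_commit(content: str) -> list:
    """Split a commit file into (action_type, payload) pairs, one per non-empty line."""
    actions = []
    for line in content.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        # Unpack the single top-level key, e.g. "protocol", "metaData", "add", "remove".
        (action_type, payload), = action.items()
        actions.append((action_type, payload))
    return actions

# Trimmed sample: real actions carry more fields than shown here.
sample = "\n".join([
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 1}}),
    json.dumps({"add": {"path": "file1.parquet"}}),
])
print([kind for kind, _ in parse_commit(sample)])  # ['protocol', 'add']
```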

## Update some record(s)

In [8]:
write_parquet_file( 
    {"id": [1, 2], "name": ["first", "second updated"]}, 
    f"{table_path}/file2.parquet"

)
add = {
    "add": {
        "path": "file2.parquet",
        "size": mssparkutils.fs.ls(f"{table_path}/file2.parquet")[0].size,
        "partitionValues": {},
        "modificationTime": int(time.time_ns() / 1000000),
        "dataChange": True,
        "tags": None
    }
}

remove = {
    "remove": {
        "path": "file1.parquet",
        "deletionTimestamp": int(time.time_ns() / 1000000),
        "dataChange": True,
        "size": mssparkutils.fs.ls(f"{table_path}/file1.parquet")[0].size
    }
}

update_content = json.dumps(remove) + "\n" + json.dumps(add)
mssparkutils.fs.put(f"{table_path}/_delta_log/00000000000000000002.json", update_content, overwrite=True)

display(spark.read.format("delta").load(table_path))
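
The update above never touches `file1.parquet`; it only records a `remove` for it and an `add` for its replacement. Conceptually, a reader replays the commits in version order to find the files that are currently active. The sketch below is a simplified model of that replay, not the actual Delta implementation, and the commits are trimmed to the fields it needs.

```python
import json

def active_files(commits: list) -> set:
    """Replay commits in version order: `add` activates a path, `remove` retires it."""
    files = set()
    for content in commits:
        for line in content.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

# Trimmed-down versions of the three commits written so far (most fields omitted).
commits = [
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 1}}),
    json.dumps({"add": {"path": "file1.parquet"}}),
    json.dumps({"remove": {"path": "file1.parquet"}}) + "\n"
    + json.dumps({"add": {"path": "file2.parquet"}}),
]
print(active_files(commits))      # {'file2.parquet'}
print(active_files(commits[:2]))  # {'file1.parquet'}
```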

## Table folder contents

Check the table folder contents in OneLake explorer, in the side menu, or with `mssparkutils.fs.ls`.

![root](https://i.imgur.com/qqejI9P.png)

![log](https://i.imgur.com/uL5Aadu.png)

## Time travel

In [9]:
display(spark.read.format("delta").option("versionAsOf", "1").load(table_path))
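
Besides `versionAsOf`, Delta also accepts a `timestampAsOf` read option, which resolves to the latest version committed at or before the given time. The lookup can be sketched as a binary search; the commit times below are made up and `version_as_of` is a hypothetical helper, not a Delta API.

```python
from bisect import bisect_right

def version_as_of(commit_times_ms: list, timestamp_ms: int) -> int:
    """Return the latest version committed at or before `timestamp_ms`.

    `commit_times_ms[v]` is the commit time of version v, in ascending order.
    """
    index = bisect_right(commit_times_ms, timestamp_ms)
    if index == 0:
        raise ValueError("timestamp is before the first commit")
    return index - 1

commit_times_ms = [1_000, 2_000, 3_000]  # made-up times for versions 0, 1 and 2
print(version_as_of(commit_times_ms, 2_500))  # 1
print(version_as_of(commit_times_ms, 3_000))  # 2
```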

## Effect of dummy/interim files in the table folder

In [10]:
write_parquet_file( 
    {"id": [11, 22], "name": ["eleven", "twenty two"]}, 
    f"{table_path}/dummy-file.parquet"
)

display(spark.read.format("delta").load(table_path))
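
Readers only open the files that the log lists as active, so a Parquet file that merely sits in the table folder does not contribute rows. A sketch of how the unreferenced files could be identified, using the file names from this notebook; `unreferenced_files` is a hypothetical helper.

```python
def unreferenced_files(folder_files: set, active: set) -> set:
    """Files present in the table folder that the current snapshot does not reference."""
    return folder_files - active

folder_files = {"file1.parquet", "file2.parquet", "dummy-file.parquet"}
active = {"file2.parquet"}  # result of replaying the log up to version 2

print(sorted(unreferenced_files(folder_files, active)))
# ['dummy-file.parquet', 'file1.parquet']
```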

## Vacuum the table

In [11]:
from delta import DeltaTable
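# Delta refuses retention periods shorter than the default 7 days unless this safety check is disabled.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled","false")

# Retention is given in hours; a float allows granularity of minutes or seconds.
# 0 removes every file that the current table version does not reference.
DeltaTable.forPath(spark, table_path).vacuum(0)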

mssparkutils.fs.ls(table_path)

[FileInfo(path=abfss://339a0302-9394-4cd7-af3c-44fd076923ce@onelake.dfs.fabric.microsoft.com/ab34098a-3173-4b46-ad67-133aa4716e8b/Files/delta-lake/1-fundamentals/7ff7b980-28af-49ac-9545-fff16b7a8c76/_delta_log, name=_delta_log, size=0),
 FileInfo(path=abfss://339a0302-9394-4cd7-af3c-44fd076923ce@onelake.dfs.fabric.microsoft.com/ab34098a-3173-4b46-ad67-133aa4716e8b/Files/delta-lake/1-fundamentals/7ff7b980-28af-49ac-9545-fff16b7a8c76/file2.parquet, name=file2.parquet, size=2249)]
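
Only `file2.parquet` survived: both `file1.parquet` (removed in version 2) and `dummy-file.parquet` (never referenced) are gone. The selection can be approximated as below. This is a deliberately simplified model with made-up timestamps; the real `VACUUM` command considers more than this (for example, recent `remove` entries in the log).

```python
def vacuum_candidates(folder_files: dict, active: set, now_ms: int, retention_hours: float) -> list:
    """Pick files that are not active and were last modified before the retention cutoff.

    `folder_files` maps a file path to its last modification time in milliseconds.
    """
    cutoff_ms = now_ms - int(retention_hours * 3_600_000)
    return sorted(
        path for path, modified_ms in folder_files.items()
        if path not in active and modified_ms < cutoff_ms
    )

# Made-up modification times, in milliseconds.
folder_files = {"file1.parquet": 1_000, "file2.parquet": 2_000, "dummy-file.parquet": 3_000}
active = {"file2.parquet"}

print(vacuum_candidates(folder_files, active, now_ms=10_000, retention_hours=0))
# ['dummy-file.parquet', 'file1.parquet']
print(vacuum_candidates(folder_files, active, now_ms=10_000, retention_hours=168))
# []
```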

## Have we lost anything with vacuuming?

In [12]:
display(spark.read.format("delta").load(table_path))

In [13]:
display(spark.read.format("delta").option("versionAsOf", "1").load(table_path))

Py4JJavaError: An error occurred while calling z:com.microsoft.spark.notebook.visualization.display.getDisplayResultForIPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 107.0 failed 4 times, most recent failure: Lost task 0.3 in stage 107.0 (TID 2044) (vm-65d08095 executor 1): java.io.FileNotFoundException: 
Operation failed: "Not Found", 404, HEAD, https://onelake.dfs.fabric.microsoft.com/339a0302-9394-4cd7-af3c-44fd076923ce/ab34098a-3173-4b46-ad67-133aa4716e8b/Files/delta-lake/1-fundamentals/7ff7b980-28af-49ac-9545-fff16b7a8c76/file1.parquet?upn=false&action=getStatus&timeout=90

It is possible the underlying files have been updated. You can explicitly invalidate
the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
recreating the Dataset/DataFrame involved.
       
	at org.apache.spark.sql.errors.QueryExecutionErrors$.readCurrentFileNotFoundError(QueryExecutionErrors.scala:661)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:248)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:308)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:149)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:584)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:764)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:400)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:897)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:897)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:57)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:366)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:330)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2682)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2618)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2617)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2617)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1190)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1190)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1190)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2870)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2812)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2801)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:958)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2342)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2363)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2382)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:542)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:495)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3881)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2876)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3871)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:562)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3869)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:183)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:97)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3869)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2876)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:3097)
	at org.apache.spark.sql.GetRowsHelper$.getRowsInJsonString(GetRowsHelper.scala:51)
	at com.microsoft.spark.notebook.visualization.display$.generateTableConfig(Display.scala:403)
	at com.microsoft.spark.notebook.visualization.display$.exec(Display.scala:256)
	at com.microsoft.spark.notebook.visualization.display$.getDisplayResultInternal(Display.scala:193)
	at com.microsoft.spark.notebook.visualization.display$.getDisplayResultForIPython(Display.scala:109)
	at com.microsoft.spark.notebook.visualization.display.getDisplayResultForIPython(Display.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)


## Let's see if the history is preserved

In [14]:
display(DeltaTable.forPath(spark, table_path).history())

Quoting the [Delta Lake docs](https://docs.delta.io/latest/delta-utility.html#remove-files-no-longer-referenced-by-a-delta-table):

> vacuum deletes only data files, not log files. Log files are deleted automatically and asynchronously after checkpoint operations. The default retention period of log files is 30 days, configurable through the delta.logRetentionDuration property which you set with the ALTER TABLE SET TBLPROPERTIES SQL method.

## Accumulate lots of changes in a single table

In [None]:

from joblib import Parallel, delayed

def add_couple_records_to_table(index):
    write_parquet_file(
        {"id": [index, index + 1], "name": [f"item {index}", f"item {index + 1}"]},
        f"{table_path}/new-file-{index}.parquet"
    )

    version_str = f"{index:020d}"
    add = {
        "add": {
            "path": f"new-file-{index}.parquet",
            "size": mssparkutils.fs.ls(f"{table_path}/new-file-{index}.parquet")[0].size,
            "partitionValues": {},
            "modificationTime": int(time.time_ns() / 1000000),
            "dataChange": True,
            "tags": None
        }
    }

    add_content = json.dumps(add)
    mssparkutils.fs.put(f"{table_path}/_delta_log/{version_str}.json", add_content, overwrite=True)
  

PARALLELISM = 8
def run_parallel_jobs(func_to_run, items):
    tasks = [delayed(func_to_run)(x) for x in items]
    result = Parallel(n_jobs=PARALLELISM, prefer="threads")(tasks)
    return result


# Versions 0-2 already exist, so the new commits are versions 3 through 499.
_ = run_parallel_jobs(add_couple_records_to_table, range(3, 500))

Now check the `_delta_log` folder in OneLake explorer.

Other than the small-files problem, what is the impact on common queries against the table?

In [17]:
display(spark.read.format("delta").load(table_path).describe())

A large part of the query's run time is spent preparing the table snapshot.

![query](https://i.imgur.com/WCPYztN.png)

In [18]:
display(spark.sql(f"INSERT INTO delta.`{table_path}` VALUES (1000000, 'item 1000000')"))

The insert triggers a checkpoint: a checkpoint marker and a checkpoint Parquet file are generated in the log folder.

![checkpoint](https://i.imgur.com/zKjx0bi.png)
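
A checkpoint lets a reader skip every commit file up to the checkpoint version and replay only the ones after it. The sketch below lists the log files needed in each case; the function name and the checkpoint version used in the example are assumptions made for illustration.

```python
def log_files_to_read(latest_version: int, checkpoint_version=None) -> list:
    """List the log files needed to rebuild the snapshot at `latest_version`."""
    if checkpoint_version is None:
        # No checkpoint: every commit from version 0 onwards must be replayed.
        return [f"{v:020d}.json" for v in range(latest_version + 1)]
    # With a checkpoint: one Parquet file plus the commits written after it.
    files = [f"{checkpoint_version:020d}.checkpoint.parquet"]
    files += [f"{v:020d}.json" for v in range(checkpoint_version + 1, latest_version + 1)]
    return files

print(len(log_files_to_read(499)))                          # 500
print(len(log_files_to_read(501, checkpoint_version=500)))  # 2
```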

In [19]:
display(spark.read.format("delta").load(table_path).describe())

With the checkpoint in place, computing the table snapshot is much faster.

![snapshot-checkpoint](https://i.imgur.com/VRUXOiw.png)

## Optimise table

In [45]:
DeltaTable.forPath(spark, table_path).optimize().executeCompaction()

DataFrame[path: string, metrics: struct<numFilesAdded:bigint,numFilesRemoved:bigint,filesAdded:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,filesRemoved:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,partitionsOptimized:bigint,zOrderStats:struct<strategyName:string,inputCubeFiles:struct<num:bigint,size:bigint>,inputOtherFiles:struct<num:bigint,size:bigint>,inputNumCubes:bigint,mergedFiles:struct<num:bigint,size:bigint>,numOutputCubes:bigint,mergedNumCubes:bigint>,numBatches:bigint,totalConsideredFiles:bigint,totalFilesSkipped:bigint,preserveInsertionOrder:boolean,numFilesSkippedToReduceWriteAmplification:bigint,numBytesSkippedToReduceWriteAmplification:bigint,startTimeMs:bigint,endTimeMs:bigint,totalClusterParallelism:bigint,totalScheduledTasks:bigint,autoCompactParallelismStats:struct<maxClusterActiveParallelism:bigint,minClusterActiveParallelism:bigint,maxSessionActiveParallelism:bigint,minSessionActiveParallelism:bigint>>]

Small files are merged into larger ones; the top file below holds all of the table's contents.

![compaction](https://i.imgur.com/IpCkQsu.png)


The last commit file shows that all old active files are removed and a single new file is added.

![data](https://i.imgur.com/wdiEBfC.png)
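
The shape of such a compaction commit can be sketched by hand in the same style as the earlier cells. Because compaction only rearranges existing rows, the actions are marked with `dataChange` set to false. The helper name, file names and size below are hypothetical, and the actions are trimmed to a few fields.

```python
import json
import time

def compaction_commit(old_paths: list, new_path: str, new_size: int) -> str:
    """Build newline-delimited JSON: one `remove` per old file, then one `add`."""
    now_ms = time.time_ns() // 1_000_000
    actions = [
        {"remove": {"path": path, "deletionTimestamp": now_ms, "dataChange": False}}
        for path in old_paths
    ]
    actions.append({
        "add": {
            "path": new_path,
            "size": new_size,
            "partitionValues": {},
            "modificationTime": now_ms,
            "dataChange": False,  # rows are rearranged, not changed
        }
    })
    return "\n".join(json.dumps(action) for action in actions)

# Hypothetical file names and size, for illustration only.
commit = compaction_commit(["new-file-3.parquet", "new-file-4.parquet"], "compacted.parquet", 4096)
print(len(commit.splitlines()))  # 3
```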
