# Overview

In February 2019, Databricks [announced](https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html) the release of the time travel feature.

> With this new feature, Delta automatically versions the big data that you store in your data lake, and you can access any historical version of that data. This temporal data management simplifies your data pipeline by making it easy to audit, roll back data in case of accidental bad writes or deletes, and reproduce experiments and reports. Your organization can finally standardize on a clean, centralized,  versioned big data repository in your own cloud storage for your analytics.

In this notebook we will explore this feature as well as the problems it is looking to solve.

# 1. Challenges Being Addressed

## 1.1. Audit data changes
Auditing data changes is critical both for data compliance and for simple debugging to understand how data has changed over time. Traditionally the audit was tightly coupled with the pipelines responsible for delivering the data. This feature standardizes the audit regardless of how the data got into the data lake.

## 1.2. Reproduce experiments & reports
During model training, data scientists run various experiments with different parameters on a given set of data. When scientists revisit their experiments after a period of time to reproduce the models, typically the source data has been modified by upstream pipelines. Often they are caught unaware by such upstream data changes and hence struggle to reproduce their experiments. Some scientists and organizations work around this by creating multiple copies of the data, leading to increased storage costs. The same is true for analysts generating reports.

## 1.3. Rollbacks
Data pipelines can sometimes write bad data for downstream consumers. This can happen because of issues ranging from infrastructure instabilities to messy data to bugs in the pipeline. For pipelines that do simple appends to directories or a table, rollbacks can easily be addressed by date-based partitioning. With updates and deletes, this can become very complicated, and data engineers typically have to engineer a complex pipeline to deal with such scenarios. This is no longer the case with time travel.

# 2. Under The Hood: How does time travel work?

As you write into a Delta table or directory, every operation is automatically versioned, and Delta Lake provides you with an API for accessing data as of a specified version or timestamp. But how does this work under the hood?

## 2.1. A Brief Explanation
A great video can be found [here](https://www.youtube.com/watch?v=o-zcZvfSUyI).

In short, we can think of a delta table as the result of applying a sequence of changes. We will see that a delta table is not a file, but a directory structure. When we load a table into memory we are actually assembling the table from data files stored in the directory.

As such, when we perform operations on the delta table (e.g. store a snapshot of the state of a Spark dataframe), we see that the files in this directory are modified.

The delta table directory contains several types of files:
1. Compressed parquet files
2. Checksums to validate the parquet files
3. Transaction logs
4. Checkpoint files (cached state to improve computational efficiency)

The delta table has a log directory which stores JSON log files that record the operations being performed on the delta table. These operations describe how we transform the data from one state to another state. The operations can be categorized as follows:
- update metadata
- add parquet file
- remove parquet file
- set transaction (record idempotent transaction id)
- change transaction protocol version
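
The idea of a table as the result of replaying these actions can be sketched in plain Python. The commit contents below are hypothetical and heavily simplified (shortened file names, only `add` and `remove` actions); this illustrates the concept and is not Delta Lake's actual implementation.

```python
# Hypothetical, simplified commit contents: one list of actions per version.
commits = [
    # version 0: the first write adds one parquet file
    [{"add": {"path": "part-00000-41e1.snappy.parquet"}}],
    # version 1: an overwrite removes the old file and adds a new one
    [
        {"remove": {"path": "part-00000-41e1.snappy.parquet"}},
        {"add": {"path": "part-00000-e889.snappy.parquet"}},
    ],
]


def active_files(commits, version):
    """Return the set of data files that make up the table at `version`."""
    files = set()
    # Replay every commit from version 0 up to and including `version`.
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


print(active_files(commits, 0))
print(active_files(commits, 1))
```

Reading an older version then amounts to stopping the replay early, which is why removed parquet files must stay on disk for time travel to work.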

This log directory is extremely important because it provides the information for calculating versions of the data and their corresponding timestamps. The log directory consists of transaction files, and each transaction file pertains to a specific version of the data. File '00000000000000000000.json' corresponds to version 0, while file '00000000000000000001.json' corresponds to version 1 of the delta table. Each transaction file has a modification date assigned by the operating system. This timestamp attaches a datetime to our version.
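
The mapping between a version number and its transaction file name is just zero-padding to 20 digits. A minimal sketch (the helper names `log_file_name` and `version_from_log_file` are made up for illustration):

```python
def log_file_name(version: int) -> str:
    """Build the transaction file name for a version (20 zero-padded digits)."""
    return f"{version:020d}.json"


def version_from_log_file(name: str) -> int:
    """Recover the version number from a transaction file name."""
    return int(name.split(".")[0])


print(log_file_name(0))
print(log_file_name(1))
print(version_from_log_file("00000000000000000001.json"))
```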

Below we can see an example delta table which I have stored in the `data/time-travel-demo.delta` directory.

```
[root@pc]# tree data/time-travel-demo.delta
data/time-travel-demo.delta/
|-- _delta_log
|   |-- 00000000000000000000.json
|   `-- 00000000000000000001.json
|-- part-00000-41e1a8cd-15d5-4263-9413-f649ad1d51da-c000.snappy.parquet
`-- part-00000-e8891ae6-e9d4-449a-8531-de44d41f7669-c000.snappy.parquet
```

Listing the directory with `ls -la` also reveals the hidden `.crc` checksum files that `tree` does not show:

```
[root@pc]# ls -la data/time-travel-demo.delta/

total 3
drwxr-xr-x 1 root root   5 May 20 19:29 .
drwxr-xr-x 1 root root   2 May 20 18:42 ..
-rw-r--r-- 1 root root  16 May 20 18:42 .part-00000-41e1a8cd-15d5-4263-9413-f649ad1d51da-c000.snappy.parquet.crc
-rw-r--r-- 1 root root  16 May 20 19:29 .part-00000-e8891ae6-e9d4-449a-8531-de44d41f7669-c000.snappy.parquet.crc
drwxr-xr-x 1 root root   2 May 20 19:29 _delta_log
-rw-r--r-- 1 root root 735 May 20 18:42 part-00000-41e1a8cd-15d5-4263-9413-f649ad1d51da-c000.snappy.parquet
-rw-r--r-- 1 root root 946 May 20 19:29 part-00000-e8891ae6-e9d4-449a-8531-de44d41f7669-c000.snappy.parquet
```

We can also get the modification timestamp of each transaction file:

```
[root@pc]# stat data/time-travel-demo.delta/_delta_log/*

  File: 'data/time-travel-demo.delta/_delta_log/00000000000000000000.json'
  Size: 819       	Blocks: 2          IO Block: 262144 regular file
Device: 44h/68d	Inode: 1099511789237  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-05-20 18:42:13.100546536 +0000
Modify: 2022-05-20 18:42:13.341529695 +0000
Change: 2022-05-20 18:42:13.691505237 +0000
 Birth: -
  File: 'data/time-travel-demo.delta/_delta_log/00000000000000000001.json'
  Size: 1049      	Blocks: 3          IO Block: 262144 regular file
Device: 44h/68d	Inode: 1099511789264  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-05-20 19:29:50.672776623 +0000
Modify: 2022-05-20 19:29:51.389726484 +0000
Change: 2022-05-20 19:29:51.526716905 +0000
 Birth: -  
 ```
 
**Note**: While Delta Lake uses this modification timestamp to attach a timestamp to a version number, it does not keep the full precision. We will see that version timestamps are truncated to milliseconds, i.e. only the first three digits of the fractional seconds are kept. In this example we would see a timestamp of '2022-05-20 19:29:51.389' attached to version 1.

**Note**: If the filesystem timestamps get messed up or out of order, Delta Lake adds one millisecond to the previous version's timestamp to compute the timestamp for the next version. So version timestamps may not always line up with the filesystem timestamps.
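
The adjustment described in the note can be sketched as follows. This is a simplified illustration under the assumption that a timestamp equal to or earlier than its predecessor is replaced by the predecessor plus one millisecond; the function name is hypothetical and the real logic lives inside Delta Lake.

```python
def adjust_commit_timestamps(mtimes_ms):
    """Make per-version timestamps strictly increasing.

    `mtimes_ms` holds file modification times in epoch milliseconds,
    ordered by version number.
    """
    adjusted = []
    for ts in mtimes_ms:
        # Out-of-order (or equal) timestamp: bump to previous + 1 ms.
        if adjusted and ts <= adjusted[-1]:
            ts = adjusted[-1] + 1
        adjusted.append(ts)
    return adjusted


# Versions 2 and 3 have file times earlier than version 1.
print(adjust_commit_timestamps([1000, 2000, 1500, 1600]))
# [1000, 2000, 2001, 2002]
```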

We will see that Delta Lake provides a utility for calculating the timestamp of each version, so we don't have to worry!

## 2.1. Fire up spark
We assume you already have a working Spark installation and have already installed the delta pip package.

In [85]:
import pyspark
import delta
sparkConf = pyspark.SparkConf()
sparkConf.setAppName("delta-time-travel-demo")
sparkConf.set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
sparkConf.set("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
sparkConf.set("spark.databricks.delta.stateReconstructionValidation.enabled", "false")
sparkSessionBuilder = pyspark.sql.SparkSession.builder.config(conf=sparkConf)
sparkSession = delta.configure_spark_with_delta_pip(sparkSessionBuilder).getOrCreate()

## 2.2. Write data and observe changes

In [124]:
import pandas
pandas_df = pandas.DataFrame({
    "A": [1,2,3,4,5],
    "B": [6,7,8,9,10],
})
spark_df = sparkSession.createDataFrame(pandas_df)
spark_df.show()

+---+---+
|  A|  B|
+---+---+
|  1|  6|
|  2|  7|
|  3|  8|
|  4|  9|
|  5| 10|
+---+---+



Before writing our data, we take a look at our data directory and see that no table data has been written yet.

In [78]:
! ls -la data

total 1
drwxr-xr-x 1 root root 2 May 20 18:21 .
drwxr-xr-x 1 root root 5 May 20 18:37 ..
-rw-r--r-- 1 root root 2 May 20 18:18 .gitignore
drwxr-xr-x 1 root root 3 May 20 18:21 time-travel-demo.delta


We now write the data as a delta table

In [154]:
delta_table_path = "data/time-travel-demo.delta"

In [90]:
spark_df.write.format("delta").save(delta_table_path)

                                                                                

We again observe our directory

In [91]:
! tree data

data
`-- time-travel-demo.delta
    |-- _delta_log
    |   `-- 00000000000000000000.json
    `-- part-00000-41e1a8cd-15d5-4263-9413-f649ad1d51da-c000.snappy.parquet

2 directories, 2 files


In [171]:
import json
import os

def load_delta_log(log_file_path):
    """Parse a Delta transaction file (one JSON action per line) into a list of dicts."""
    with open(log_file_path, "r") as file:
        # Each non-empty line is a standalone JSON document
        return [json.loads(line) for line in file if line.strip()]

In [179]:
log_file_path = os.path.join(delta_table_path, "_delta_log","00000000000000000000.json") 
print(json.dumps(load_delta_log(log_file_path), indent=4))

[
    {
        "commitInfo": {
            "timestamp": 1653072133086,
            "operation": "WRITE",
            "operationParameters": {
                "mode": "ErrorIfExists",
                "partitionBy": "[]"
            },
            "isBlindAppend": true,
            "operationMetrics": {
                "numFiles": "1",
                "numOutputBytes": "735",
                "numOutputRows": "5"
            }
        }
    },
    {
        "protocol": {
            "minReaderVersion": 1,
            "minWriterVersion": 2
        }
    },
    {
        "metaData": {
            "id": "3af5efe7-073f-4692-9e7b-51d76ced0146",
            "format": {
                "provider": "parquet",
                "options": {}
            },
            "schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"A\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"B\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}",
            "partitionColumns": [],
      

It's important to understand the data written in this file. We have a list of dictionaries, one per action. The first, commitInfo, pertains to the commit. Commits are effectively our snapshots. The file also contains a section for add, which tells us that a new part was added to the table.

We now modify our data and again write it to the delta lake.

In [150]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Values for the new column C; the row with A == i receives C[i - 1]
C = [11,12,13,14,15]
spark_udf = udf(lambda i: C[i - 1], IntegerType())
new_spark_df = spark_df.withColumn("C", spark_udf('A'))
new_spark_df.show()

+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  6| 11|
|  2|  7| 12|
|  3|  8| 13|
|  4|  9| 14|
|  5| 10| 15|
+---+---+---+



In [156]:
! ls -la data/time-travel-demo.delta/

total 2
drwxr-xr-x 1 root root   3 May 20 18:42 .
drwxr-xr-x 1 root root   2 May 20 18:42 ..
-rw-r--r-- 1 root root  16 May 20 18:42 .part-00000-41e1a8cd-15d5-4263-9413-f649ad1d51da-c000.snappy.parquet.crc
drwxr-xr-x 1 root root   1 May 20 18:42 _delta_log
-rw-r--r-- 1 root root 735 May 20 18:42 part-00000-41e1a8cd-15d5-4263-9413-f649ad1d51da-c000.snappy.parquet


In [163]:
new_spark_df\
  .write\
  .format("delta")\
  .mode("overwrite")\
  .option("overwriteSchema", "true")\
  .save(delta_table_path)

                                                                                

In [168]:
! tree data/time-travel-demo.delta/

data/time-travel-demo.delta/
|-- _delta_log
|   |-- 00000000000000000000.json
|   `-- 00000000000000000001.json
|-- part-00000-41e1a8cd-15d5-4263-9413-f649ad1d51da-c000.snappy.parquet
`-- part-00000-e8891ae6-e9d4-449a-8531-de44d41f7669-c000.snappy.parquet

1 directory, 4 files


We can see that a new part file has shown up in the table directory and a new transaction file in the log directory. If we look at the new log file, we can see more entries than just an add (we see an add and a remove operation).

In [181]:
log_file_path = os.path.join(delta_table_path, "_delta_log","00000000000000000001.json") 
log = load_delta_log(log_file_path)
print(json.dumps(log, indent=4))

[
    {
        "commitInfo": {
            "timestamp": 1653074990646,
            "operation": "WRITE",
            "operationParameters": {
                "mode": "Overwrite",
                "partitionBy": "[]"
            },
            "readVersion": 0,
            "isBlindAppend": false,
            "operationMetrics": {
                "numFiles": "1",
                "numOutputBytes": "946",
                "numOutputRows": "5"
            }
        }
    },
    {
        "metaData": {
            "id": "3af5efe7-073f-4692-9e7b-51d76ced0146",
            "format": {
                "provider": "parquet",
                "options": {}
            },
            "schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"A\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"B\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"C\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}",
            "partitionColumns": [],
            "configur

Looking closely we can see that a part was removed from the table (the part we added in the first transaction file) and a new part was added.

We can also see timestamps associated with each operation (modificationTime for add and deletionTimestamp for remove operations) as well as a timestamp for the entire atomic operation (commitInfo) and a timestamp for the metadata.

## 2.3. Load timestamps for operations from log

Timestamps are somewhat difficult to interpret. We will load the timestamps into datetime objects so they are more human friendly and so we can see what is going on.
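
The extraction in this section indexes actions by position (for example `log0[3]`), which depends on the order of the entries in each file. As an alternative sketch, the timestamps can be looked up by action type instead. The `log` list below is a hypothetical, trimmed-down stand-in that uses the field names named above with illustrative values, and `action_timestamps` is a made-up helper name.

```python
import datetime

# Hypothetical, trimmed-down log contents (illustrative epoch-millisecond values).
log = [
    {"commitInfo": {"timestamp": 1653074990646}},
    {"metaData": {"createdTime": 1653072132288}},
    {"add": {"modificationTime": 1653074988956}},
    {"remove": {"deletionTimestamp": 1653074990645}},
]

# Which field carries the timestamp for each action type.
TIMESTAMP_FIELDS = {
    "commitInfo": "timestamp",
    "metaData": "createdTime",
    "add": "modificationTime",
    "remove": "deletionTimestamp",
}

EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def action_timestamps(log):
    """Return (action_type, UTC datetime) pairs for every timestamped action."""
    result = []
    for entry in log:
        for action_type, body in entry.items():
            field = TIMESTAMP_FIELDS.get(action_type)
            if field and field in body:
                # Integer millisecond arithmetic avoids float rounding.
                dt = EPOCH + datetime.timedelta(milliseconds=body[field])
                result.append((action_type, dt))
    return result


for action_type, dt in action_timestamps(log):
    print(f"{action_type}: {dt}")
```

Using timezone-aware UTC datetimes also makes the printed values independent of the timezone of the machine running the notebook.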

In [314]:
import datetime

log0 = load_delta_log(os.path.join(delta_table_path, "_delta_log","00000000000000000000.json"))

log0_commit_info = log0[0]["commitInfo"]
log0_commit_info_timestamp = log0_commit_info["timestamp"]
log0_commit_info_dt = datetime.datetime.fromtimestamp(log0_commit_info_timestamp/ 1e3)

log0_metadata_info = log0[2]["metaData"]
log0_metadata_info_timestamp = log0_metadata_info["createdTime"]
log0_metadata_info_dt = datetime.datetime.fromtimestamp(log0_metadata_info_timestamp/ 1e3)

log0_add_info = log0[3]["add"]
log0_add_info_timestamp = log0_add_info["modificationTime"]
log0_add_info_dt = datetime.datetime.fromtimestamp(log0_add_info_timestamp/ 1e3)

log1 = load_delta_log(os.path.join(delta_table_path, "_delta_log","00000000000000000001.json"))

log1_commit_info = log1[0]["commitInfo"]
log1_commit_info_timestamp = log1_commit_info["timestamp"]
log1_commit_info_dt = datetime.datetime.fromtimestamp(log1_commit_info_timestamp/ 1e3)

log1_metadata_info = log1[1]["metaData"]
log1_metadata_info_timestamp = log1_metadata_info["createdTime"]
log1_metadata_info_dt = datetime.datetime.fromtimestamp(log1_metadata_info_timestamp/ 1e3)

log1_add_info = log1[2]["add"]
log1_add_info_timestamp = log1_add_info["modificationTime"]
log1_add_info_dt = datetime.datetime.fromtimestamp(log1_add_info_timestamp/ 1e3)

log1_remove_info = log1[3]["remove"]
log1_remove_info_timestamp = log1_remove_info["deletionTimestamp"]
log1_remove_info_dt = datetime.datetime.fromtimestamp(log1_remove_info_timestamp/ 1e3)

print("v0 Info:")
print(f"Commit: {log0_commit_info_dt}")
print(f"Add: {log0_add_info_dt}")
print(f"Creation: {log0_metadata_info_dt}")
print("")
print("v1 Info:")
print(f"Commit: {log1_commit_info_dt}")
print(f"Add: {log1_add_info_dt}")
print(f"Remove: {log1_remove_info_dt}")
print(f"Creation: {log1_metadata_info_dt}")



v0 Info:
Commit: 2022-05-20 18:42:13.086000
Add: 2022-05-20 18:42:12.985000
Creation: 2022-05-20 18:42:12.288000

v1 Info:
Commit: 2022-05-20 19:29:50.646000
Add: 2022-05-20 19:29:48.956000
Remove: 2022-05-20 19:29:50.645000
Creation: 2022-05-20 18:42:12.288000


We can see that the commitInfo timestamp is the latest within each version, which makes sense: the commit is recorded after the individual file operations.

# 3. The DeltaTable Object

A caveat of inspecting the log is that the timestamps recorded in it are not the ones used for time travel: as noted earlier, version timestamps come from the modification times of the transaction files, possibly adjusted. Thankfully, Delta Lake provides utility functions that report them directly.

## 3.1. The history() function

Delta Lake provides a DeltaTable object which allows us to query metadata about the table, specifically its history.

In [400]:
from delta.tables import DeltaTable

# Create a pointer to the delta table (don't confuse it with a Spark dataframe)
deltaTable = DeltaTable.forPath(sparkSession, delta_table_path)

# Get the history of the deltatable and store it as a dataframe
dt_history_df = deltaTable.history()
dt_history_pd_df = dt_history_df.toPandas()
dt_history_pd_df

Unnamed: 0,version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata
0,1,2022-05-20 19:29:51.389,,,WRITE,"{'mode': 'Overwrite', 'partitionBy': '[]'}",,,,0.0,,False,"{'numOutputRows': '5', 'numOutputBytes': '946'...",
1,0,2022-05-20 18:42:13.341,,,WRITE,"{'mode': 'ErrorIfExists', 'partitionBy': '[]'}",,,,,,True,"{'numOutputRows': '5', 'numOutputBytes': '735'...",


In [403]:
dt_history_pd_df.dtypes

version                         int64
timestamp              datetime64[ns]
userId                         object
userName                       object
operation                      object
operationParameters            object
job                            object
notebook                       object
clusterId                      object
readVersion                   float64
isolationLevel                 object
isBlindAppend                    bool
operationMetrics               object
userMetadata                   object
dtype: object

This useful table ties a version number to a datetime. It can also provide other useful information describing the operation and the user performing the operation. In our case, we see some information is missing. That will be discussed separately.

In [419]:
v1_dt = dt_history_pd_df[dt_history_pd_df["version"] == 1]["timestamp"].iloc[0]
v1_dt

Timestamp('2022-05-20 19:29:51.389000')

In [418]:
dt_history_pd_df[dt_history_pd_df["timestamp"] < v1_dt.strftime('%Y-%m-%d %H:%M:%S.%f')]["version"].iloc[0]

0

# 3. Load Versioned Data
In this section we will explore the ways in which we can load our data.

## 3.1. Load based on version number

We saw the \<version>.json files in the _delta_log directory. This allows us to load a given version via its index.

In [183]:
sparkSession.read.format("delta").option("versionAsOf", 0).load(delta_table_path).show()

+---+---+
|  A|  B|
+---+---+
|  1|  6|
|  2|  7|
|  3|  8|
|  4|  9|
|  5| 10|
+---+---+



                                                                                

In [184]:
sparkSession.read.format("delta").option("versionAsOf", 1).load(delta_table_path).show()

+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  6| 11|
|  2|  7| 12|
|  3|  8| 13|
|  4|  9| 14|
|  5| 10| 15|
+---+---+---+



                                                                                

## 3.2. Load based on timestamp
Loading data based on a version number is not that user friendly or intuitive so instead we will look at loading data based on a timestamp.

**Note**: Data can only be requested for timestamps within a limited window, bounded by a minimum and a maximum datetime. If we request a timestamp outside this window, an error is thrown. The two cases (too early and too late) look like this:
```
Py4JJavaError: An error occurred while calling o1083.load.
: org.apache.spark.sql.delta.DeltaErrors$TimestampEarlierThanCommitRetentionException: The provided timestamp (2022-05-20 18:42:13.0) is before the earliest version available to this
table (2022-05-20 18:42:13.341). Please use a timestamp after 2022-05-20 18:42:13.
```


``` 
Py4JJavaError: An error occurred while calling o1131.load.
: org.apache.spark.sql.delta.DeltaErrors$TemporallyUnstableInputException: The provided timestamp: 2022-05-20 19:30:50.646 is after the latest commit timestamp of
2022-05-20 19:29:51.389. If you wish to query this version of the table, please either provide
the version with "VERSION AS OF 1" or use the exact timestamp
of the last commit: "TIMESTAMP AS OF '2022-05-20 19:29:51'".
```
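
The window can be illustrated with a small pure-Python sketch that mimics how a timestamp is resolved to a version, using the two version timestamps reported by `history()` in this notebook. The function `resolve_version` is hypothetical and only mirrors the behaviour observed here (including the rejection of timestamps after the latest commit); it is not Delta Lake's code.

```python
import datetime

# (version, timestamp) pairs as reported by history() in this notebook.
history = [
    (0, datetime.datetime(2022, 5, 20, 18, 42, 13, 341000)),
    (1, datetime.datetime(2022, 5, 20, 19, 29, 51, 389000)),
]


def resolve_version(history, ts):
    """Return the version that was current at `ts`, or raise if out of range."""
    earliest = history[0][1]
    latest = history[-1][1]
    if ts < earliest:
        raise ValueError(f"{ts} is before the earliest version ({earliest})")
    if ts > latest:
        raise ValueError(f"{ts} is after the latest commit ({latest})")
    # Latest version whose timestamp is at or before the requested one.
    return max(version for version, commit_ts in history if commit_ts <= ts)


print(resolve_version(history, datetime.datetime(2022, 5, 20, 18, 42, 13, 341000)))
print(resolve_version(history, datetime.datetime(2022, 5, 20, 19, 29, 50, 646000)))
print(resolve_version(history, datetime.datetime(2022, 5, 20, 19, 29, 51, 389000)))
```

Any timestamp from the first version's timestamp up to just before the second one resolves to version 0; only the exact timestamp of version 1 resolves to version 1.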

### 3.2.1. Load a hardcoded timestamp

Note: I have seen many articles claiming that we must specify the timestamp in yyyyMMddHHmmssSSS format. This is not true as we will see.

First we review the timestamps in the log

In [315]:
print("v0 Info:")
print(f"Commit: {log0_commit_info_dt}")
print(f"Add: {log0_add_info_dt}")
print(f"Creation: {log0_metadata_info_dt}")
print("")
print("v1 Info:")
print(f"Commit: {log1_commit_info_dt}")
print(f"Add: {log1_add_info_dt}")
print(f"Remove: {log1_remove_info_dt}")
print(f"Creation: {log1_metadata_info_dt}")

v0 Info:
Commit: 2022-05-20 18:42:13.086000
Add: 2022-05-20 18:42:12.985000
Creation: 2022-05-20 18:42:12.288000

v1 Info:
Commit: 2022-05-20 19:29:50.646000
Add: 2022-05-20 19:29:48.956000
Remove: 2022-05-20 19:29:50.645000
Creation: 2022-05-20 18:42:12.288000


#### 3.2.1.1. Notes on min and max

In [371]:
min_datetime_str = "2022-05-20 18:42:13.341"
max_datetime_str = "2022-05-20 19:29:51.389"

In [383]:
min_dt = datetime.datetime.strptime(min_datetime_str, '%Y-%m-%d %H:%M:%S.%f')
min_dt_increment = (min_dt - log0_commit_info_dt)
min_dt_increment

datetime.timedelta(microseconds=255000)

In [385]:
max_dt = datetime.datetime.strptime(max_datetime_str, '%Y-%m-%d %H:%M:%S.%f')
max_dt_increment = (max_dt - log1_commit_info_dt)
max_dt_increment

datetime.timedelta(microseconds=743000)

#### 3.2.1.2. Exploring when data changes

In [374]:
sparkSession.read \
  .format("delta") \
  .option("timestampAsOf", min_datetime_str) \
  .load(delta_table_path).show()

                                                                                

+---+---+
|  A|  B|
+---+---+
|  1|  6|
|  2|  7|
|  3|  8|
|  4|  9|
|  5| 10|
+---+---+



In [375]:
sparkSession.read \
  .format("delta") \
  .option("timestampAsOf", log1_commit_info_dt) \
  .load(delta_table_path).show()

                                                                                

+---+---+
|  A|  B|
+---+---+
|  1|  6|
|  2|  7|
|  3|  8|
|  4|  9|
|  5| 10|
+---+---+



In [390]:
tmp = max_dt - datetime.timedelta(microseconds=1)

sparkSession.read \
  .format("delta") \
  .option("timestampAsOf", tmp.strftime('%Y-%m-%d %H:%M:%S.%f')) \
  .load(delta_table_path).show()

                                                                                

+---+---+
|  A|  B|
+---+---+
|  1|  6|
|  2|  7|
|  3|  8|
|  4|  9|
|  5| 10|
+---+---+



                                                                                

In [391]:
sparkSession.read \
  .format("delta") \
  .option("timestampAsOf", max_datetime_str) \
  .load(delta_table_path).show()



+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  6| 11|
|  2|  7| 12|
|  3|  8| 13|
|  4|  9| 14|
|  5| 10| 15|
+---+---+---+



                                                                                

In [304]:
# Convert the datetime object into a date formatted string
log0_commit_info_ds = log0_commit_info_dt.strftime("%Y-%m-%d %H:%M:%S")
print(f"{log0_commit_info_dt} -> {log0_commit_info_ds}")

sparkSession.read \
  .format("delta") \
  .option("timestampAsOf", log0_commit_info_ds) \
  .load(delta_table_path).show()

2022-05-20 18:42:13.086000 -> 2022-05-20 18:42:13


Py4JJavaError: An error occurred while calling o1083.load.
: org.apache.spark.sql.delta.DeltaErrors$TimestampEarlierThanCommitRetentionException: The provided timestamp (2022-05-20 18:42:13.0) is before the earliest version available to this
table (2022-05-20 18:42:13.341). Please use a timestamp after 2022-05-20 18:42:13.
         
	at org.apache.spark.sql.delta.DeltaHistoryManager.getActiveCommitAtTime(DeltaHistoryManager.scala:137)
	at org.apache.spark.sql.delta.DeltaTableUtils$.resolveTimeTravelVersion(DeltaTable.scala:352)
	at org.apache.spark.sql.delta.catalog.DeltaTableV2.$anonfun$snapshot$1(DeltaTableV2.scala:92)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.delta.catalog.DeltaTableV2.snapshot$lzycompute(DeltaTableV2.scala:90)
	at org.apache.spark.sql.delta.catalog.DeltaTableV2.snapshot(DeltaTableV2.scala:89)
	at org.apache.spark.sql.delta.catalog.DeltaTableV2.toBaseRelation(DeltaTableV2.scala:145)
	at org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:177)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:306)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:266)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
	at jdk.internal.reflect.GeneratedMethodAccessor136.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:829)


In [None]:
log_file_path = os.path.join(delta_table_path, "_delta_log","00000000000000000000.json") 
print(json.dumps(load_delta_log(log_file_path), indent=4))

In [243]:
import datetime
dt = datetime.datetime.fromtimestamp(1653074988956/ 1e3)
dt

datetime.datetime(2022, 5, 20, 19, 29, 48, 956000)

In [211]:
! date

Fri May 20 19:56:57 UTC 2022


Note: This shows 7:29 PM. My desktop clock reads 2:29 PM, but the server running the notebook is set to a different timezone (UTC).
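
To make the timezone explicit rather than relying on the server's local setting, the same epoch value can be converted to a timezone-aware datetime. The fixed UTC-5 offset for the desktop is an assumption inferred from the five-hour difference mentioned above.

```python
import datetime

epoch_ms = 1653074988956  # same value as in the cell above

# Timezone-aware conversion: independent of the machine's local timezone.
epoch = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
utc_dt = epoch + datetime.timedelta(milliseconds=epoch_ms)

# Assumed desktop timezone: a fixed offset five hours behind UTC.
desktop_tz = datetime.timezone(datetime.timedelta(hours=-5))
local_dt = utc_dt.astimezone(desktop_tz)

print(utc_dt)    # 2022-05-20 19:29:48.956000+00:00
print(local_dt)  # 2022-05-20 14:29:48.956000-05:00
```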

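According to the release article, a timestamp can also be appended to the table path using the `@` syntax, in yyyyMMddHHmmssSSS format. The attempts below, which append epoch values instead, fail.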

In [228]:
date_time_str = '2022-05-20 18:42:14'
date_time_obj = datetime.datetime.strptime(date_time_str, '%Y-%m-%d %H:%M:%S')
date_time_obj

datetime.datetime(2022, 5, 20, 18, 42, 14)

In [232]:
ts = int(datetime.datetime.timestamp(date_time_obj))
ts

1653072134

In [233]:
# 20190101000000000
# 16530749906450000

sparkSession.read.format("delta").load(delta_table_path + "@1653072134").show()

AnalysisException: `data/time-travel-demo.delta@1653072134` is not a Delta table.

In [215]:
version_path = delta_table_path + "@1653074990645"
version_path

'data/time-travel-demo.delta@1653074990645'

In [216]:
sparkSession.read.format("delta").load(version_path).show()

AnalysisException: `data/time-travel-demo.delta@1653074990645` is not a Delta table.
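
Both suffixes tried above are epoch values rather than the yyyyMMddHHmmssSSS form (compare the `20190101000000000` comment in the earlier cell). A sketch of building that form from a datetime is shown below; `to_path_timestamp` is a hypothetical helper, and whether the `path@suffix` syntax is accepted by the Delta Lake version used in this notebook was not verified here.

```python
import datetime


def to_path_timestamp(dt: datetime.datetime) -> str:
    """Format `dt` as yyyyMMddHHmmssSSS (17 digits, millisecond precision)."""
    return dt.strftime("%Y%m%d%H%M%S") + f"{dt.microsecond // 1000:03d}"


print(to_path_timestamp(datetime.datetime(2019, 1, 1)))
# 20190101000000000
print(to_path_timestamp(datetime.datetime(2022, 5, 20, 19, 29, 51, 389000)))
# 20220520192951389
```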

In [205]:
dt_s = dt.strftime("%y-%m-%d %H:%M:%S.%f")
dt_s

'22-05-20 19:29:50.645000'

In [206]:

sparkSession.read.format("delta").option("timestampAsOf", dt_s).load(delta_table_path).show()

AnalysisException: The provided timestamp ('22-05-20 19:29:50.645000') cannot be converted to a valid timestamp.
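
The rejected string starts with a two-digit year, which comes from the lowercase `%y` directive. The small sketch below contrasts it with `%Y`, the directive used in the `timestampAsOf` calls that succeeded earlier in this notebook. That the two-digit year is the cause of the error is a likely explanation, not something confirmed here.

```python
import datetime

dt = datetime.datetime(2022, 5, 20, 19, 29, 50, 645000)

# %y yields a two-digit year: the form rejected above.
two_digit_year = dt.strftime("%y-%m-%d %H:%M:%S.%f")

# %Y yields a four-digit year: the form used successfully with timestampAsOf.
four_digit_year = dt.strftime("%Y-%m-%d %H:%M:%S.%f")

print(two_digit_year)   # 22-05-20 19:29:50.645000
print(four_digit_year)  # 2022-05-20 19:29:50.645000
```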

In [155]:
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(sparkSession, delta_table_path)

In [162]:
deltaTable.alias("original").merge(
    new_spark_df.alias("updates"),
    "original.A = updates.A and original.B = updates.B"
).whenMatchedUpdate(set = {
    "A": "updates.A",
    "B": "updates.B",
    "C": "updates.C"
}) \
.whenNotMatchedInsert(values = {
    "A": "updates.A",
    "B": "updates.B",
    "C": "updates.C"
}).execute()

AnalysisException: cannot resolve `C` in UPDATE clause given columns original.`A`, original.`B`

# Retention

# Restoring