<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Use Databricks Delta Time Travel to View an Older Snapshot of Data
Databricks&reg; Delta Time Travel allows you to work with older snapshots of data.

## In this lesson you:
0. Stream power plant data to a Databricks Delta table
0. Look at summary statistics of current data set
0. Rewind to an older version of the data
0. Look at summary statistics of older data set

## Audience
* Primary Audience: Data Engineers, Data Scientists
* Secondary Audience: Data Analysts

## Prerequisites
* Web browser: **Chrome**
* A cluster configured with **8 cores** and **DBR 6.2**
* Suggested Courses from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>:
  - ETL Part 1
  - Spark-SQL
  - Structured Streaming

## Datasets Used
A power plant dataset found in
`/mnt/training/power-plant/streamed.parquet`.

The schema definition is:

- AT = Atmospheric Temperature [1.81-37.11]°C
- V = Exhaust Vacuum Speed [25.36-81.56] cm Hg
- AP = Atmospheric Pressure [992.89-1033.30] millibar
- RH = Relative Humidity [0-100]%
- PE = Power Output [420.26-495.76] MW

PE is the label or target. This is the value we are trying to predict given the measurements.

*Reference [UCI Machine Learning Repository Combined Cycle Power Plant Data Set](https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant)*

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the<br/>
start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [4]:
%run "./Includes/Classroom-Setup"

## Databricks Delta Time Travel

The Databricks Delta log has a list of what files are valid for each read / write operation.

By referencing this list, a request can be made for the data at a specific point in time. 

This is similar to the concept of revision history in source control.
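
To illustrate the idea, here is a minimal plain-Python sketch of a log that records which files each commit adds and removes. It is a toy model under stated assumptions, not Databricks Delta's actual log format: the `Commit` class, the `files_at_version` helper, and the file names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """One entry in the toy log: files added and files removed by a write."""
    version: int
    added: set = field(default_factory=set)
    removed: set = field(default_factory=set)


def files_at_version(log, version):
    """Replay commits up to and including `version`; return the valid files."""
    valid = set()
    for commit in sorted(log, key=lambda c: c.version):
        if commit.version > version:
            break
        valid |= commit.added
        valid -= commit.removed
    return valid


# Hypothetical file names, for illustration only.
toy_log = [
    Commit(0, added={"part-000.parquet"}),
    Commit(1, added={"part-001.parquet"}),
    Commit(2, added={"part-002.parquet"}, removed={"part-000.parquet"}),
]

snapshot_v1 = files_at_version(toy_log, 1)  # Files valid after the 2nd commit
snapshot_v2 = files_at_version(toy_log, 2)  # Files valid after the 3rd commit
```

Because older commits stay in the log, asking for version 1 still returns `part-000.parquet` even though a later commit removed it.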

Examples of Time Travel use cases are:
* Re-creating analyses, reports, or outputs (for example, the output of a machine learning model). 
  * This could be useful for debugging or auditing, especially in regulated industries.
* Writing complex temporal queries.
* Fixing mistakes in your data.
* Providing snapshot isolation for a set of queries on fast-changing tables.

## Slow Stream of Files

Our stream source is a repository of many small files.

In [7]:
from pyspark.sql.types import StructType, StructField, DoubleType
spark.conf.set("spark.sql.shuffle.partitions", 8)

dataPath = "/mnt/training/power-plant/streamed.parquet"

dataSchema = StructType([
  StructField("AT", DoubleType(), True),
  StructField("V", DoubleType(), True),
  StructField("AP", DoubleType(), True),
  StructField("RH", DoubleType(), True),
  StructField("PE", DoubleType(), True)
])

initialDF = (spark
  .readStream                        # Returns DataStreamReader
  .option("maxFilesPerTrigger", 1)   # Force processing of only 1 file per trigger 
  .schema(dataSchema)                # Required for all streaming DataFrames
  .parquet(dataPath) 
)

## Append to a Databricks Delta Table

The stream will append to a Databricks Delta table named `powerTable`. First, define the output paths and the table name.

In [9]:
from pyspark.sql.types import TimestampType

writePath      = workingDir + "/output.parquet"    # A subdirectory for our output
checkpointPath = workingDir + "/output.checkpoint" # A subdirectory for our checkpoint & write-ahead logs

powerTable = "powerTable"

To help us manage our streams, we will use **`untilStreamIsReady()`** and **`stopAllStreams()`**, and define the stream name **`myStreamName`**:

In [11]:
myStreamName = "lesson08_ps"

## Introducing Time Travel

Databricks Delta time travel allows you to query an older snapshot of a table.

Here, we introduce a new option to Databricks Delta:

`.option("timestampAsOf", now)`

where `now` is the current timestamp, given as a string that can be cast to a Timestamp.

There is an alternate notation as well:

`.option("versionAsOf", version)`

where `version` is a table version number.

More details are described in the <a href="https://docs.databricks.com/delta/delta-batch.html#deltatimetravel" target="_blank">official documentation</a>.
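
The same two notations are available in SQL as `TIMESTAMP AS OF` and `VERSION AS OF`. Below is a minimal plain-Python sketch that builds such queries as strings. The helpers `as_timestamp_string` and `time_travel_query` are hypothetical names introduced for illustration, and the example values are assumptions.

```python
import datetime


def as_timestamp_string(moment):
    """Format a datetime as 'YYYY-MM-DD HH:MM:SS', a string castable to a Timestamp."""
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def time_travel_query(table, timestamp=None, version=None):
    """Build a SELECT that reads `table` as of a timestamp or a version (exactly one)."""
    if (timestamp is None) == (version is None):
        raise ValueError("Pass exactly one of timestamp or version")
    if timestamp is not None:
        return f"SELECT * FROM {table} TIMESTAMP AS OF '{as_timestamp_string(timestamp)}'"
    return f"SELECT * FROM {table} VERSION AS OF {int(version)}"


# Illustrative values only.
by_time = time_travel_query("powerTable", timestamp=datetime.datetime(2020, 4, 17, 21, 33, 58))
by_version = time_travel_query("powerTable", version=1)
```

Either string could then be passed to `spark.sql(...)` in a notebook where the table exists.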

In [13]:
import datetime
now = datetime.datetime.now()

streamingQuery = (initialDF                     # Start with our "streaming" DataFrame
  .writeStream                                  # Get the DataStreamWriter
  .trigger(processingTime="3 seconds")          # Configure for a 3-second micro-batch
  .queryName(myStreamName)                      # Specify Query Name
  .format("delta")                              # Specify the sink type, a Databricks Delta table
  .option("timestampAsOf", now)                 # Timestamp; must convert to a string that can be cast to a Timestamp
  .outputMode("append")                         # Write only new data to the table
  .option("checkpointLocation", checkpointPath) # Specify the location of checkpoint files & write-ahead logs
  .table(powerTable)
)

In [14]:
# Wait until the stream is ready before proceeding
untilStreamIsReady(myStreamName)

## Retention Period and Table Properties

You configure retention periods using `ALTER TABLE` syntax with the following table properties:

* `delta.logRetentionDuration = "interval interval-string"`
  * Configure how long you can go back in time. Default is `interval 30 days`.

* `delta.deletedFileRetentionDuration = "interval interval-string"`
  * Configure how long stale data files are kept around before being deleted with `VACUUM`. Default is `interval 1 week`.

* `interval-string` is in the form `30 days` or `1 week`.

For full access to 30 days of historical data, set `delta.deletedFileRetentionDuration = "interval 30 days"` on your table, so that stale data files are kept as long as the log history.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> A long retention period can significantly increase your storage costs.
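
To see how the two properties interact, here is a minimal plain-Python sketch. It assumes, as a rule of thumb, that once `VACUUM` runs regularly you can only travel back as far as the shorter of the two retention periods. The helpers `set_retention_sql` and `time_travel_window` are hypothetical names introduced for illustration.

```python
from datetime import timedelta


def set_retention_sql(table, log_days, deleted_file_days):
    """Build an ALTER TABLE statement that sets both retention properties."""
    return (
        f'ALTER TABLE {table} SET TBLPROPERTIES ('
        f'delta.logRetentionDuration="interval {log_days} days", '
        f'delta.deletedFileRetentionDuration="interval {deleted_file_days} days")'
    )


def time_travel_window(log_retention, deleted_file_retention):
    """Assumed rule of thumb: history is limited by the shorter retention period."""
    return min(log_retention, deleted_file_retention)


# With the defaults (30 days of log, 1 week of stale files), the window is 1 week.
default_window = time_travel_window(timedelta(days=30), timedelta(weeks=1))

# Raising the file retention to 30 days gives the full 30-day window.
full_window = time_travel_window(timedelta(days=30), timedelta(days=30))
statement = set_retention_sql("powerTable", 30, 30)
```

The resulting `statement` string has the same shape as the `ALTER TABLE` call used in this lesson and could be passed to `spark.sql(...)`.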

In [16]:
spark.sql(f"""ALTER TABLE {powerTable} SET TBLPROPERTIES (delta.deletedFileRetentionDuration="interval 10 days") """)
tblPropDF = spark.sql(f"SHOW TBLPROPERTIES {powerTable}")
display(tblPropDF)

key,value
delta.deletedFileRetentionDuration,interval 10 days


<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Run the next cell multiple times to see that the data is changing.

In [18]:
countDF = spark.sql(f"SELECT count(*) FROM {powerTable}")
display(countDF)

count(1)
72


In [19]:
historyDF = spark.sql(f"SELECT timestamp FROM (DESCRIBE HISTORY {powerTable}) ORDER BY timestamp")
display(historyDF)

timestamp
2020-04-17T21:33:48.000+0000
2020-04-17T21:33:58.000+0000
2020-04-17T21:34:10.000+0000
2020-04-17T21:34:26.000+0000
2020-04-17T21:34:46.000+0000
2020-04-17T21:35:09.000+0000


Let's rewind to almost the beginning, where we had just a handful of rows: say, the 2nd write.

<div><img src="https://files.training.databricks.com/images/eLearning/Delta/second-write.png" style="height: 250px"/></div><br/>

In [21]:
# List timestamps of when table writes occurred
historyDF = spark.sql(f"SELECT timestamp FROM (DESCRIBE HISTORY {powerTable}) ORDER BY timestamp")

# Pick out 2nd write
oldTimestamp = historyDF.take(2)[-1].timestamp

# Re-build the DataFrame as it was in the 2nd write
rewoundDF = spark.sql(f"SELECT * FROM {powerTable} TIMESTAMP AS OF '{oldTimestamp}'")

We had this many (few) rows back then.

In [23]:
rewoundDF.count()
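
To make the rewind concrete, here is a minimal plain-Python sketch (no Spark required) of how a requested timestamp could map to a table version: pick the latest commit at or before that moment. The commit times are copied from the `DESCRIBE HISTORY` output shown earlier in this lesson. `version_as_of` is a hypothetical helper, and treating a commit's position in the sorted list as its version number is an assumption made for illustration.

```python
from bisect import bisect_right
from datetime import datetime

# Commit timestamps from the history output (fractional seconds and UTC offset dropped).
history = [
    "2020-04-17T21:33:48",
    "2020-04-17T21:33:58",
    "2020-04-17T21:34:10",
    "2020-04-17T21:34:26",
    "2020-04-17T21:34:46",
    "2020-04-17T21:35:09",
]
commit_times = [datetime.strptime(t, "%Y-%m-%dT%H:%M:%S") for t in history]


def version_as_of(sorted_commit_times, moment):
    """Return the index of the latest commit at or before `moment`."""
    index = bisect_right(sorted_commit_times, moment) - 1
    if index < 0:
        raise ValueError("moment is earlier than the first commit")
    return index


# The 2nd write sits at index 1, matching historyDF.take(2)[-1].
second_write_version = version_as_of(commit_times, commit_times[1])

# A moment between the 2nd and 3rd writes resolves to the 2nd write as well.
between_version = version_as_of(commit_times, datetime(2020, 4, 17, 21, 34, 0))
```

A moment earlier than the first commit has no snapshot to return, so the sketch raises an error in that case.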

## Clean Up

Stop all remaining streams.

In [26]:
stopAllStreams()

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [28]:
%run "./Includes/Classroom-Cleanup"

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> All done!</h2>

Thank you for your participation!

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>