
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Batch Write to Delta Tables

**Objective:** Append files to an existing Delta table.

## Notebook Configuration

Before you run this cell, make sure to add a unique username to the file
<a href="$./includes/configuration" target="_blank">
includes/configuration</a>, for example:

```
username = "yourfirstname_yourlastname"
```

```
%run ./includes/configuration
```

```
%run ./includes/main/python/operations
```

#### Step 1: Load New Data

Within the context of our data ingestion pipeline, this is the addition of new raw files to our Single Source of Truth.

We begin by loading the data from the file `health_tracker_data_2020_2.json`, calling the reader's `.format("json")` method as before.

```python
file_path = health_tracker + "raw/health_tracker_data_2020_2.json"

health_tracker_data_2020_2_df = (
  spark.read
  .format("json")
  .load(file_path)
)
```

#### Step 2: Transform the Data
We apply the same transformations to the new data as before:
- Use the `from_unixtime` Spark SQL function to transform the Unix time into a time string
- Cast the `time` column to type `timestamp`, replacing the existing `time` column
- Cast the `time` column to type `date` to create the column `dte`
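
The helper `process_health_tracker_data` used in the next cell is not shown in this notebook; it is presumably loaded by the `%run ./includes/main/python/operations` cell. As a rough sketch only, the three steps listed above could be written with Spark SQL expressions as follows. The function name `process_sketch` is hypothetical, the sketch assumes the Unix time is stored in a column named `time`, and the real helper may do more (for example, rename or reorder columns).

```python
def process_sketch(df):
    """Apply the three transformations to a DataFrame with a `time` column.

    Hypothetical stand-in for illustration, not the course's helper.
    """
    # Keep every column except `time`, which is rebuilt below.
    other_cols = [f"`{c}`" for c in df.columns if c != "time"]
    return (
        df.selectExpr(
            *other_cols,
            # Steps 1 and 2: Unix time -> time string -> timestamp.
            "CAST(from_unixtime(`time`) AS timestamp) AS `time`",
        )
        # Step 3: derive the date column `dte` from the new timestamp.
        .selectExpr("*", "CAST(`time` AS date) AS dte")
    )
```

Using `selectExpr` keeps the sketch free of imports; the same steps could equally be chained with `withColumn` and the functions in `pyspark.sql.functions`.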

```python
processedDF = process_health_tracker_data(spark, health_tracker_data_2020_2_df)
```

#### Step 3: Append the Data to the `health_tracker_processed` Delta table
We do this using `.mode("append")`. Because the write is committed to the table's Delta transaction log, it is not necessary to perform any action on the Metastore.

```python
(processedDF.write
 .mode("append")
 .format("delta")
 .save(health_tracker + "processed"))
```

### View the Commit Using Time Travel
Delta Lake can query an earlier version of a Delta table using a feature known as time travel. Here, we query the data as of version 0, that is, the initial conversion of the table from Parquet.
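
To see which versions are available before time traveling, the table's commit history can be listed. The sketch below uses Delta Lake's `DESCRIBE HISTORY` SQL command on a path-based table. The helper name `describe_history` is hypothetical, and `spark` and `health_tracker` are assumed to be defined by the configuration cells above.

```python
def describe_history(spark, table_path):
    """Return a DataFrame with one row per commit of the Delta table at table_path."""
    # delta.`<path>` addresses a Delta table by location rather than by name.
    return spark.sql(f"DESCRIBE HISTORY delta.`{table_path}`")
```

In this notebook it could be called as `describe_history(spark, health_tracker + "processed").select("version", "timestamp", "operation").show()`. The append performed above should appear as the most recent commit.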

#### Step 1: View the table as of Version 0
This is done by setting the option `"versionAsOf"` to 0. When we time travel to Version 0, we see **only** the first month of data: measurements from five devices, 24 hours a day for 31 days.

```python
(spark.read
 .option("versionAsOf", 0)
 .format("delta")
 .load(health_tracker + "processed")
 .count())
```

#### Step 2: Count the Most Recent Version
When we query the table without specifying a version, we get the latest version of the table, which includes the newly appended records.
At the current version, we expect to see two months of data: January 2020 and February 2020.

The table should contain the following number of records:

```
5 devices * 60 days * 24 hours = 7200 records
```

Note that 2020 is a leap year, so February has 29 days. Together with the 31 days in January, this gives 60 days in total.
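
The expected totals can be checked with a little arithmetic. This sketch assumes, as described above, five devices each reporting one record per hour; the helper name `expected_records` is hypothetical.

```python
import calendar

DEVICES = 5          # number of devices, as stated above
HOURS_PER_DAY = 24   # one record per device per hour (assumption)

def expected_records(year, month):
    """Expected record count for one calendar month."""
    # monthrange returns (weekday of the first day, number of days in the month).
    days = calendar.monthrange(year, month)[1]
    return DEVICES * days * HOURS_PER_DAY

january = expected_records(2020, 1)    # 5 * 31 * 24
february = expected_records(2020, 2)   # 5 * 29 * 24 (leap year)
print(january, february, january + february)  # 3720 3480 7200
```

The January figure, 3720, is the count expected at Version 0; the sum, 7200, is the count expected after the append.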

```python
(spark.read
 .format("delta")
 .load(health_tracker + "processed")
 .count())
```

Note that the count does not match the expected 7200 records: 72 records are missing.
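
A comparison like this can be scripted so that the shortfall is reported directly. The following is a minimal sketch; the helper name `missing_records` is hypothetical, and `spark` and `health_tracker` are assumed to come from the configuration cells.

```python
def missing_records(spark, table_path, expected, version=None):
    """Return the expected row count minus the actual row count of a Delta table.

    If version is given, the count is taken at that table version.
    """
    reader = spark.read.format("delta")
    if version is not None:
        # Time travel to a specific version of the table.
        reader = reader.option("versionAsOf", version)
    actual = reader.load(table_path).count()
    return expected - actual
```

For example, `missing_records(spark, health_tracker + "processed", 7200)` should return 72 for the table as it stands at this point in the notebook.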

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>