
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Databricks Delta Batch Operations - Append

Databricks&reg; Delta allows you to read, write and query data in data lakes in an efficient manner.

## In this lesson you:
* Append new records to a Databricks Delta table

## Audience
* Primary Audience: Data Engineers 
* Secondary Audience: Data Analysts and Data Scientists

## Prerequisites
* Web browser: **Chrome**
* A cluster configured with **8 cores** and **DBR 6.2**
* Suggested Courses from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>:
  - ETL Part 1
  - Spark-SQL

## Datasets Used
We will use datasets from
* `/mnt/training/online_retail` in the demo part and
* `/mnt/training/structured-streaming/events/` in the exercises

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the<br/>
start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [4]:
%run "./Includes/Classroom-Setup"

<iframe  
src="//fast.wistia.net/embed/iframe/igyrqrnn3t?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/igyrqrnn3t?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

## Refresh Base Data Set

In [7]:
inputPath = "/mnt/training/online_retail/data-001/data.csv"
inputSchema = "InvoiceNo STRING, StockCode STRING, Description STRING, Quantity INT, InvoiceDate STRING, UnitPrice DOUBLE, CustomerID INT, Country STRING"
parquetDataPath  = workingDir + "/customer-data/"

(spark.read 
  .option("header", "true")
  .schema(inputSchema)
  .csv(inputPath) 
  .write
  .mode("overwrite")
  .format("parquet")
  .partitionBy("Country")
  .save(parquetDataPath)
)

Create a table from the base data set.

In [9]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS {}.customer_data 
    USING parquet 
    OPTIONS (path = '{}')
  """.format(databaseName, parquetDataPath))

spark.sql("MSCK REPAIR TABLE {}.customer_data".format(databaseName))

The original count of records is:

In [11]:
sqlCmd = "SELECT count(*) FROM {}.customer_data".format(databaseName)
origCount = spark.sql(sqlCmd).first()[0]

print(origCount)

## Read in Some New Data

In [13]:
inputSchema = "InvoiceNo STRING, StockCode STRING, Description STRING, Quantity INT, InvoiceDate STRING, UnitPrice DOUBLE, CustomerID INT, Country STRING"
miniDataInputPath = "/mnt/training/online_retail/outdoor-products/outdoor-products-mini.csv"

newDataDF = (spark
  .read
  .option("header", "true")
  .schema(inputSchema)
  .csv(miniDataInputPath)
)

Count the new records that will be added to the production data.

In [15]:
newDataDF.count()

## APPEND Using Non-Databricks Delta Pipeline

Append the new data to `parquetDataPath`.

In [17]:
(newDataDF
  .write
  .format("parquet")
  .partitionBy("Country")
  .mode("append")
  .save(parquetDataPath)
)

Let's count the rows in `customer_data`.

We expect to see `36` additional rows, but we do not.

Why not?

The count is unchanged because the metastore does not yet know about the new records.
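One way to picture this is a toy model in plain Python. It is not how Spark or the metastore is implemented; it only assumes that a query sees the files the catalog recorded at the last repair. The class `ToyMetastoreTable` and its methods are hypothetical, and the size of the existing `Sweden` data (100 rows) is made up. The sizes of the new data (19 and 17 rows) are the ones used in this lesson.

```python
class ToyMetastoreTable:
    """Toy stand-in for a metastore-backed Parquet table (illustration only)."""

    def __init__(self):
        self.storage = {}     # partition -> row counts per file actually on storage
        self.registered = {}  # partition -> files the catalog saw at the last repair

    def write_file(self, partition, num_rows):
        # Writing straight to storage does not touch the catalog.
        self.storage.setdefault(partition, []).append(num_rows)

    def repair(self):
        # Re-scan storage and record every partition and file found there.
        self.registered = {p: list(files) for p, files in self.storage.items()}

    def count(self):
        # Queries only see what the catalog has registered.
        return sum(sum(files) for files in self.registered.values())


table = ToyMetastoreTable()
table.write_file("Country=Sweden", 100)       # made-up size for the existing data
table.repair()
orig_count = table.count()

table.write_file("Country=Sweden", 19)        # new rows in a known partition
table.write_file("Country=Sierra-Leone", 17)  # new rows in an unknown partition
stale_count = table.count()                   # unchanged until the next repair

table.repair()
fresh_count = table.count()
print(orig_count, stale_count, fresh_count)
```

In this model the count only changes after `repair()` runs, which mirrors the behaviour observed in the next cell.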

In [19]:
sqlCmd = "SELECT count(*) FROM {}.customer_data".format(databaseName)
newCount = spark.sql(sqlCmd).first()[0]
print("The old count of records is {}".format(origCount))
print("The new count of records is {}".format(newCount))

## Schema-on-Read Problem Revisited

We've added new data the metastore doesn't know about.

* It knows there is a `Sweden` partition,
  - but it doesn't know about the 19 new records for `Sweden` that have come in.
* It does not know about the new `Sierra-Leone` partition,
  - nor the 17 new records for `Sierra-Leone` that have come in.

Here are the original table partitions:

In [21]:
sqlCmd = "SHOW PARTITIONS {}.customer_data".format(databaseName)

originalSet = spark.sql(sqlCmd).collect()

for x in originalSet: 
  print(x)

Here are the partitions the new data belong to:

In [23]:
spark.sql("DROP TABLE IF EXISTS {}.mini_customer_data".format(databaseName))
newDataDF.write.partitionBy("Country").saveAsTable("{}.mini_customer_data".format(databaseName))

sqlCmd = "SHOW PARTITIONS {}.mini_customer_data ".format(databaseName)

newSet = set(spark.sql(sqlCmd).collect())

for x in newSet: 
  print(x)

To get correct record counts, we need to make these new partitions and new data known to the metastore.

To do this, we apply `MSCK REPAIR TABLE`.
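As a rough sketch of what a repair has to do, the plain-Python example below scans a table directory for `Country=<value>` subdirectories and registers the ones a catalog has not seen. This is an illustration under simplifying assumptions, not the implementation of `MSCK REPAIR TABLE`; the helper `discover_partitions` and the `known` set are hypothetical.

```python
import os
import tempfile


def discover_partitions(table_root, column="Country"):
    """Return partition values found as '<column>=<value>' subdirectories."""
    prefix = column + "="
    return sorted(
        name[len(prefix):]
        for name in os.listdir(table_root)
        if name.startswith(prefix) and os.path.isdir(os.path.join(table_root, name))
    )


with tempfile.TemporaryDirectory() as root:
    # Directory layout similar to what partitionBy("Country") produces.
    for country in ["Sweden", "Sierra-Leone"]:
        os.makedirs(os.path.join(root, "Country=" + country))

    known = {"Sweden"}                 # what the toy catalog has registered
    found = discover_partitions(root)  # what is actually on storage
    missing = [p for p in found if p not in known]
    known.update(missing)              # the "repair" step

print(missing)
```

Only the partition that was on storage but absent from the catalog ends up in `missing`.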

In [25]:
sqlCmd = "MSCK REPAIR TABLE {}.customer_data".format(databaseName)
spark.sql(sqlCmd)

Count the number of records:
* The count should be correct now.
* That is, 65499 + 36 = 65535
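The expected total can be checked with a few lines of plain Python, using the counts quoted in this lesson (the variable names are illustrative):

```python
orig_count = 65499     # count before the append, as quoted above
new_sweden = 19        # new rows for Sweden
new_sierra_leone = 17  # new rows for Sierra-Leone

appended = new_sweden + new_sierra_leone
expected_count = orig_count + appended
print(appended, expected_count)
```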

In [27]:
sqlCmd = "SELECT count(*) FROM {}.customer_data".format(databaseName)
print(spark.sql(sqlCmd).first()[0])

## Refresh Base Data Set, Write to Databricks Delta

In [29]:
deltaDataPath  = workingDir + "/customer-data-delta/"

(spark.read 
  .option("header", "true")
  .schema(inputSchema)
  .csv(inputPath) 
  .write
  .mode("overwrite")
  .format("delta")
  .partitionBy("Country")
  .save(deltaDataPath) )

## APPEND Using Databricks Delta Pipeline

Next, repeat the process by writing to Databricks Delta format. 

In the next cell, write the new data in Databricks Delta format, appending it to `deltaDataPath`.

In [31]:
miniDataInputPath = "/mnt/training/online_retail/outdoor-products/outdoor-products-mini.csv"

(newDataDF
  .write
  .format("delta")
  .partitionBy("Country")
  .mode("append")
  .save(deltaDataPath)
)

Run a simple `count` query to verify the number of records. Notice that the count is correct without first requiring a table repair.

The count should be 36 higher than before.
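A rough intuition for why no repair is needed: a Delta table keeps a transaction log alongside the data, and readers work out the current set of files from that log. The sketch below is a toy log in plain Python, not the real Delta log format; the `_toy_log` directory, the `numRows` field, the file names, and the base size of 100 rows are all made up for illustration.

```python
import json
import os
import tempfile


def commit(table_root, version, added_files):
    """Write one commit file listing the data files added in this version."""
    log_dir = os.path.join(table_root, "_toy_log")
    os.makedirs(log_dir, exist_ok=True)
    with open(os.path.join(log_dir, "{:020d}.json".format(version)), "w") as f:
        for path, num_rows in added_files:
            f.write(json.dumps({"add": {"path": path, "numRows": num_rows}}) + "\n")


def count_rows(table_root):
    """Replay every commit in order and sum the rows of all added files."""
    log_dir = os.path.join(table_root, "_toy_log")
    total = 0
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                total += json.loads(line)["add"]["numRows"]
    return total


with tempfile.TemporaryDirectory() as root:
    commit(root, 0, [("Country=Sweden/part-0.parquet", 100)])  # made-up base data
    before = count_rows(root)
    commit(root, 1, [("Country=Sweden/part-1.parquet", 19),
                     ("Country=Sierra-Leone/part-0.parquet", 17)])
    after = count_rows(root)

print(before, after)
```

Because the append itself records the new files in the log, the reader sees them on the next query without any separate repair step.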

In [33]:
sqlCmd = "SELECT count(*) FROM delta.`{}` ".format(deltaDataPath)
print(spark.sql(sqlCmd).first()[0])

## More Options?

Additional Databricks Delta Reader and Writer options are included in the [Extra folder]($./Extra/Delta 01E - RW-Options).

# LAB

## Step 1

0. Apply the schema provided under the variable `jsonSchema`
0. Read the JSON data under `streamingEventPath` into a DataFrame
0. Add a `date` column using `to_date(from_unixtime(col("time"),"yyyy-MM-dd"))`
0. Add a `deviceId` column consisting of random numbers from 0 to 99 using this expression `expr("cast(rand(5) * 100 as int)")`
0. Use the `repartition` method to split the data into 200 partitions

Refer to the <a href="http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#" target="_blank">PySpark function documentation</a>.
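To see what the `date` column should contain, the conversion can be reproduced in plain Python. This sketch assumes timestamps are interpreted in UTC (Spark uses the session time zone, so results could differ on a cluster configured otherwise); the helper `unix_to_date` is hypothetical, and the two timestamps are taken from the sample output in this lab.

```python
from datetime import datetime, timezone


def unix_to_date(seconds):
    """Convert Unix seconds to a calendar date, assuming UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()


for ts in (1469589656, 1469539372):
    print(ts, unix_to_date(ts).isoformat())
```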

In [37]:

from pyspark.sql.functions import expr, col, from_unixtime, to_date
jsonSchema = "action string, time long"
streamingEventPath = "/mnt/training/structured-streaming/events/"

rawDataDF = (spark
  .read 
  .schema(jsonSchema)
  .json(streamingEventPath) 
  .withColumn("date", to_date(from_unixtime(col("time"),"yyyy-MM-dd")))
  .withColumn("deviceId", expr("cast(rand(5) * 100 as int)"))
  .repartition(200)
)
display(rawDataDF)

action,time,date,deviceId
Close,1469589656,2016-07-27,29
Open,1469539372,2016-07-26,20
Open,1469528994,2016-07-26,93
Close,1469661939,2016-07-27,81
Close,1469652824,2016-07-27,77
Close,1469652664,2016-07-27,17
Close,1469590827,2016-07-27,28
Open,1469531659,2016-07-26,26
Open,1469680803,2016-07-28,28
Open,1469591001,2016-07-27,83


In [38]:
# TEST - Run this cell to test your solution.
schema = str(rawDataDF.schema)
dbTest("assert-1", True, "action,StringType" in schema)
dbTest("assert-2", True, "time,LongType" in schema)
dbTest("assert-3", True, "date,DateType" in schema)
dbTest("assert-4", True, "deviceId,IntegerType" in schema)

print("Tests passed!")

## Step 2

Write out the raw data.
* Use `overwrite` mode
* Use format `delta`
* Partition by `date`
* Save to `deltaIotPath`

In [40]:
# ANSWER
deltaIotPath = workingDir + "/iot-pipeline/"

(rawDataDF
  .write
  .mode("overwrite")
  .format("delta")
  .partitionBy("date")
  .save(deltaIotPath)
)

In [41]:
# TEST - Run this cell to test your solution.
spark.sql("""
  CREATE TABLE IF NOT EXISTS {}.iot_data_delta
  USING DELTA
  LOCATION '{}' """.format(databaseName, deltaIotPath))

try:
  tableExists = (spark.table("{}.iot_data_delta".format(databaseName)).count() > 0)
except Exception:
  # Any failure to read the table counts as "table does not exist"
  tableExists = False
  
dbTest("Delta-02-backfillTableExists", True, tableExists)  

print("Tests passed!")

## Step 3

Create a new DataFrame with columns `action`, `time`, `date` and `deviceId`. The columns contain the following data:

* `action` contains the value `Open`
* `time` contains the Unix time cast into a long integer `cast(1529091520 as bigint)`
* `date` contains `cast('2018-06-01' as date)`
* `deviceId` contains a random number from 0 to 499 given by `expr("cast(rand(5) * 500 as int)")`

In [43]:
# ANSWER
from pyspark.sql.functions import expr

newDF = (spark.range(10000) 
  .repartition(200)
  .selectExpr("'Open' as action", "cast(1529091520 as bigint) as time",  "cast('2018-06-01' as date) as date") 
  .withColumn("deviceId", expr("cast(rand(5) * 500 as int)"))
)

In [44]:
# TEST - Run this cell to test your solution.
total = newDF.count()

dbTest("Delta-03-newDF-count", 10000, total)

print("Tests passed!")

## Step 4

Append new data to `deltaIotPath`

* Use `append` mode
* Use format `delta`
* Partition by `date`
* Save to `deltaIotPath`

In [46]:
# ANSWER
(newDF
  .write
  .format("delta")
  .partitionBy("date")
  .mode("append")
  .save(deltaIotPath)
)

In [47]:
# TEST - Run this cell to test your solution.
# Total number of rows (not files) in the Delta table after the append
numRows = spark.sql("SELECT count(*) as total FROM delta.`{}` ".format(deltaIotPath)).first()[0]

dbTest("Delta-03-numFiles", 110000 , numRows)

print("Tests passed!")

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [49]:
%run "./Includes/Classroom-Cleanup"

## Summary

In this lesson we:
* Encountered the schema-on-read problem when appending new data in a traditional data lake pipeline.
* Learned how to append new data to existing Databricks Delta data, which avoids the problem above.
* Showed how to look at the set of partitions in the data set.

## Review Questions
**Q:** What write mode do you use to add records to an existing Delta table?<br>
**A:** `df.write...mode("append").save("..")`

**Q:** What's the difference between `.mode("append")` and `.mode("overwrite")`?<br>
**A:** `append` atomically adds new data to an existing Databricks Delta table and `overwrite` atomically replaces all of the data in a table.
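The difference in outcome can be sketched with a plain Python list standing in for the table. This toy `write` function is hypothetical and ignores atomicity, partitioning, and storage entirely; it only shows which rows remain after each mode.

```python
def write(table_rows, new_rows, mode):
    """Return the table contents after a write in the given mode."""
    if mode == "append":
        return table_rows + new_rows
    if mode == "overwrite":
        return list(new_rows)
    raise ValueError("unsupported mode: {}".format(mode))


existing = [("Sweden", 1), ("Sweden", 2)]
incoming = [("Sierra-Leone", 3)]

appended = write(existing, incoming, "append")
overwritten = write(existing, incoming, "overwrite")
print(len(appended), len(overwritten))
```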

**Q:** I've just repaired `myTable` using `MSCK REPAIR TABLE myTable` on a non-Databricks Delta table.
How do I verify that the repair worked?<br>
**A:** Run `SELECT count(*) FROM myTable` and make sure the count is what I expected.
  
**Q:** In the lab, why did we use `.withColumn(.. cast(rand(5) ..)`, i.e. pass a seed to the `rand()` function?<br>
**A:** In order to ensure we get the SAME set of pseudo-random numbers every time, on every cluster.
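The effect of a fixed seed can be illustrated with Python's standard `random` module. Python's generator is not the same as Spark's `rand`, so the actual values will not match those in the lab; the helper `device_ids` is hypothetical and only demonstrates that the same seed yields the same sequence.

```python
import random


def device_ids(seed, n, upper=100):
    """Draw n pseudo-random integers in [0, upper) from a seeded generator."""
    rng = random.Random(seed)
    return [int(rng.random() * upper) for _ in range(n)]


first_run = device_ids(5, 10)
second_run = device_ids(5, 10)
print(first_run == second_run)
```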

## Next Steps

Start the next lesson, [Upsert]($./Delta 04 - Upsert).

## Additional Topics & Resources

* <a href="https://docs.databricks.com/delta/delta-batch.html#" target="_blank">Delta Table Batch Read and Writes</a>

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>