<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Databricks Delta Batch Operations - Create Table

Databricks&reg; Delta allows you to read, write, and query data in data lakes efficiently.

## In this lesson you:
* Work with a traditional data pipeline using online shopping data
* Identify problems with the traditional data pipeline
* Use Databricks Delta features to mitigate those problems

## Audience
* Primary Audience: Data Engineers 
* Secondary Audience: Data Analysts and Data Scientists

## Prerequisites
* Web browser: **Chrome**
* A cluster configured with **8 cores** and **DBR 6.2**
* Suggested Courses from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>:
  - ETL Part 1
  - Spark-SQL

## Datasets Used
We will use online retail datasets from `/mnt/training/online_retail`

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the<br/>
start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [4]:
%run "./Includes/Classroom-Setup"

<iframe  
src="//fast.wistia.net/embed/iframe/s8bs0vhivz?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/s8bs0vhivz?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Getting Started

You will notice that throughout this course, there is a lot of context switching between PySpark, Scala and SQL.

This is because:
* `read` and `write` operations are performed on DataFrames using PySpark or Scala
* Table creation and queries are performed directly on Databricks Delta tables using SQL

Run the following cell to configure our "classroom."

Set up relevant paths.

In [8]:
inputPath = "/mnt/training/online_retail/data-001/data.csv"

parquetDataPath  = workingDir + "/customer-data/"
deltaDataPath    = workingDir + "/customer-data-delta/"

###  READ CSV Data

Read the data into a DataFrame. We supply the schema.

Partition on `Country` because there are only a few unique countries and because we will use `Country` as a predicate in a `WHERE` clause.

More information on table partitioning is contained in the links at the bottom of this notebook.

In [10]:

inputSchema = "InvoiceNo STRING, StockCode STRING, Description STRING, Quantity INT, InvoiceDate STRING, UnitPrice DOUBLE, CustomerID INT, Country STRING"

rawDF = (spark.read 
  .option("header", "true")
  .schema(inputSchema)   # supply the schema as a DDL-formatted string instead of inferring it
  .csv(inputPath) 
)


In [11]:

display(rawDF)


InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/10 8:26,2.55,17850.0,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/10 8:26,3.39,17850.0,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/10 8:26,2.75,17850.0,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/10 8:26,3.39,17850.0,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/10 8:26,3.39,17850.0,United Kingdom
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/10 8:26,7.65,17850.0,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/10 8:26,4.25,17850.0,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/10 8:28,1.85,17850.0,United Kingdom
536366,22632,HAND WARMER RED POLKA DOT,6,12/1/10 8:28,1.85,17850.0,United Kingdom
536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32,12/1/10 8:34,1.69,13047.0,United Kingdom


<br>

###  WRITE to Parquet and Databricks Delta

Use `overwrite` mode so that re-running the cell replaces the existing data instead of failing.

In [15]:
# write using Parquet format
(rawDF.write
  .mode("overwrite")
  .format("parquet")
  .partitionBy("Country")
  .save(parquetDataPath) )


In [16]:

display(rawDF)


In [17]:
# write using Databricks Delta format
(rawDF.write
  .mode("overwrite")
  .format("delta")
  .partitionBy("Country")
  .save(deltaDataPath) )

### CREATE Statement Using Non-Databricks Delta Pipeline

Create a table called `customer_data` using `parquet` out of the above data.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Notice how we do not need to specify the schema and partition info!

In [19]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS customer_data 
    USING parquet 
    OPTIONS (path = '{}')
  """.format(parquetDataPath))

Perform a simple `count` query to verify the number of records.

In [21]:

spark.sql("select count(*) from customer_data").show()


### Why 0 records? 

It's the concept of
<b>schema on read</b>, where data is applied to a plan or schema as it is pulled out of a stored location, rather than as it goes into a stored location.

In the traditional data lake architecture (including our pre-Databricks Delta pipeline),
 * The data backing the table **`customer_data`** is located in **`parquetDataPath`** (which you can see below).
 * The metadata backing the table **`customer_data`** (the schema, partitioning info and other table properties) is stored elsewhere
  - This is called the **metastore**.

Suppose we add more data to **`parquetDataPath`**:
 * Then, we need to run a separate step for the metastore to become aware of this.
 * We use the **`MSCK REPAIR TABLE`** command. 
 * **`MSCK`** stands for "**M**eta**S**tore **C**hec**K**", modeled after Unix **`FSCK`** (**F**ile **S**ystem **C**hec**K**)

Schema on read is explained in more detail <a href="https://stackoverflow.com/a/11764519/53495#" target="_blank">in this article</a>.

In [23]:
print(parquetDataPath)

After using `MSCK REPAIR TABLE`, the count is correct.
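
The repair step described above can be sketched as follows. This is a minimal sketch, assuming the `spark` session provided by the notebook and the `customer_data` table created earlier; the helper names `repair_statements` and `repair_and_count` are hypothetical and not part of the course material.

```python
def repair_statements(table_name):
    """Return the SQL needed to sync partitions into the metastore and recount."""
    return [
        "MSCK REPAIR TABLE {}".format(table_name),
        "SELECT count(*) FROM {}".format(table_name),
    ]


def repair_and_count(spark, table_name="customer_data"):
    """Run the repair, then return the row count as a Python int."""
    repair_sql, count_sql = repair_statements(table_name)
    spark.sql(repair_sql)  # registers the partitions found on disk
    return spark.sql(count_sql).first()[0]


# In the notebook, where `spark` is already defined:
# print(repair_and_count(spark))
```

Keeping the statements in a separate function makes it easy to inspect exactly what will be sent to Spark before running it.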

### CREATE Statement Using Databricks Delta Pipeline

Create a table called `<database-name>.customer_data_delta` using `DELTA` out of `<path-to-data> = deltaDataPath`     

The notation is:
> `CREATE TABLE IF NOT EXISTS <database-name>.customer_data_delta` <br>
  `USING DELTA` <br>
  `LOCATION <path-to-data> ` <br>
  
Then, perform SQL queries on the table you just created.
> `SELECT count(*) FROM <database-name>.customer_data_delta`

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Notice how you do not have to specify a schema or partition info here:
* Databricks Delta stores schema and partition info in the `_delta_log` directory.
* It infers schema from the data sitting in `<path-to-data>`.

In [26]:
spark.sql("""
  CREATE TABLE IF NOT EXISTS customer_data_delta 
  USING DELTA 
  LOCATION '{}' 
""".format(deltaDataPath))

In [27]:

spark.sql("select * from customer_data_delta limit 10 ").show(10, False)


In [28]:

spark.sql("select count(*) from customer_data_delta").show()


The simple `count` query above verifies the number of records.

Notice that the count is correct right away; there is no need to worry about table repairs.

## A New Notation

There is also a more compact notation, one where you do not have to explicitly create a table.

Simply specify `delta.` along with the path to your Databricks Delta directory (in backticks!) directly in the SQL query.
* The dot in ```delta.`<path>` ``` means "Spark, recognize `<path>` as a Databricks Delta directory"

> ```SELECT count(*) FROM delta.`<path-to-Delta-data>` ```

We will use this notation extensively throughout the rest of the course.

In your own work, you may choose either notation:
* Sometimes, SQL queries are more readable than DataFrame queries.

Make sure you use BACKTICKS in the statement ``` delta.`<path-to-Delta-data>` ``` .

In [31]:
# Query the Delta directory directly by path; no CREATE TABLE is needed

sqlCmd = "SELECT count(*) FROM delta.`{}` ".format(deltaDataPath)

display(spark.sql(sqlCmd))


count(1)
65499


##  The Transaction Log (Metadata)
Databricks Delta stores the schema, partitioning info and other table properties in the same place as the data:
 * The schema and partition info is located in the `00000000000000000000.json` file under the `_delta_log` directory as shown below.
 * Subsequent `write` operations create additional `json` files.
 * In addition to the schema, the `json` file(s) contain information such as
   - Which files were added.
   - Which files were removed.
   - Transaction IDs.
 * Each Delta table should correspond to a unique `_delta_log` directory.

In [33]:
dbutils.fs.head(deltaDataPath + "/_delta_log/00000000000000000000.json")
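
A commit file in `_delta_log` is newline-delimited JSON, one action per line. The sketch below shows one way to pull the schema, partition columns, and file actions out of that text. It is a sketch under assumptions: the helper name `summarize_delta_log` is hypothetical, and the sample log is a hand-written, heavily shortened stand-in (real commit files carry more fields and longer file names).

```python
import json


def summarize_delta_log(log_text):
    """Collect schema, partition columns, and file actions from one commit file."""
    summary = {"schema": None, "partitionColumns": [], "added": [], "removed": []}
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            action = json.loads(line)
        except json.JSONDecodeError:
            # dbutils.fs.head returns a limited number of bytes, so the
            # final line may be cut short; skip anything unparseable.
            continue
        if "metaData" in action:
            meta = action["metaData"]
            # The schema is itself stored as a JSON-encoded string.
            summary["schema"] = json.loads(meta["schemaString"])
            summary["partitionColumns"] = meta["partitionColumns"]
        elif "add" in action:
            summary["added"].append(action["add"]["path"])
        elif "remove" in action:
            summary["removed"].append(action["remove"]["path"])
    return summary


# Illustrative sample only; not copied from a real table.
sample_schema = {"type": "struct", "fields": [
    {"name": "Country", "type": "string", "nullable": True, "metadata": {}}]}
sample_log = "\n".join([
    json.dumps({"metaData": {"schemaString": json.dumps(sample_schema),
                             "partitionColumns": ["Country"]}}),
    json.dumps({"add": {"path": "Country=Sweden/part-00000.parquet"}}),
])

summary = summarize_delta_log(sample_log)
print(summary["partitionColumns"], summary["added"])
```

In the notebook, the same function could be applied to the text returned by `dbutils.fs.head(deltaDataPath + "/_delta_log/00000000000000000000.json")`.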

Metadata is displayed through `DESCRIBE DETAIL <tableName>`.

As long as we have some data in place already for a Databricks Delta table, we can infer schema.

In [35]:
sqlCmd = "DESCRIBE DETAIL delta.`{}` ".format(deltaDataPath)
display(spark.sql(sqlCmd))

format,id,name,description,location,createdAt,lastModified,partitionColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion
delta,26092e54-df1e-4bdc-a466-92169127fbff,,,dbfs:/user/tbresee@umich.edu/delta/delta_02_create_psp/customer-data-delta,2020-04-17T17:13:15.266+0000,2020-04-17T17:13:26.000+0000,List(Country),37,636918,Map(),1,2


## Converting Parquet Workloads to Databricks Delta

A Databricks Delta workload is defined by the presence of the `_delta_log` directory containing metadata files.
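
One way to check for that directory from a notebook is sketched below. The helper name `looks_like_delta` is hypothetical; it only inspects a list of entry names, such as those returned by `dbutils.fs.ls`, where directory names end with `/`.

```python
def looks_like_delta(entry_names):
    """Return True if a directory listing contains a `_delta_log` entry."""
    return any(name.rstrip("/") == "_delta_log" for name in entry_names)


# In the notebook:
# names = [f.name for f in dbutils.fs.ls(deltaDataPath)]
# print(looks_like_delta(names))
```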

Given a generic Parquet-based data lake, converting to Databricks Delta is quite straightforward.

Suppose our Parquet-based data lake is found under `/data-pipeline`.

To convert it to Databricks Delta, simply do

> ```CONVERT TO DELTA parquet.`/data-pipeline` ``` <br>
  ```[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2, ...)] ```
  
More details in <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/convert-to-delta.html" target="_blank">Porting Existing Workloads to Delta</a>.
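
From Python, the statement above could be assembled and submitted as in the following sketch. The helper name `convert_to_delta_sql` is hypothetical, and the partition column list shown in the usage comment is an assumption about how `/data-pipeline` is laid out.

```python
def convert_to_delta_sql(parquet_path, partition_cols=None):
    """Build a CONVERT TO DELTA statement for a Parquet directory.

    partition_cols is an optional list of (column_name, column_type) pairs.
    """
    sql = "CONVERT TO DELTA parquet.`{}`".format(parquet_path)
    if partition_cols:
        cols = ", ".join("{} {}".format(name, col_type)
                         for name, col_type in partition_cols)
        sql += " PARTITIONED BY ({})".format(cols)
    return sql


# In the notebook, for a directory partitioned by Country:
# spark.sql(convert_to_delta_sql("/data-pipeline", [("Country", "STRING")]))
```

Note that the conversion happens in place, so it is worth trying on a copy of the data first.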

# LAB

## Step 1

Read in data in `outdoorSmallPath` using `inputSchema` to DataFrame `inventoryDF`.

Use appropriate options, given that this is a CSV file.

In [40]:

outdoorSmallPath = "/mnt/training/online_retail/outdoor-products/outdoor-products-small.csv"
inputSchema = "InvoiceNo STRING, StockCode STRING, Description STRING, Quantity INT, InvoiceDate STRING, UnitPrice DOUBLE, CustomerID INT, Country STRING"


inventoryDF = (spark.read 
  .option("header", "true")
  .schema(inputSchema)   
  .csv(outdoorSmallPath) 
)



In [41]:

inventoryCount = inventoryDF.count()
inventoryCount


In [43]:
# TEST - Run this cell to test your solution.
inventoryCount = inventoryDF.count()

dbTest("Delta-02-schemas", 99999, inventoryCount)

print("Tests passed!")

## Step 2

Write data to a Databricks path `inventoryDataPath = workingDir + "/inventory-data/"` 
* Make sure to set the `format` to `delta`
* Use overwrite mode 
* Partition by `Country`

In [45]:

inventoryDataPath = workingDir + "/inventory-data/"

# write using Databricks Delta format

(inventoryDF.write
  .mode("overwrite")
  .format("delta")
  .partitionBy("Country")
  .save(inventoryDataPath))



In [46]:
# TEST - Run this cell to test your solution.
try:
  tableNotEmpty = spark.sql("SELECT count(*) FROM delta.`{}` ".format(inventoryDataPath)).first()[0] > 0
except Exception:
  tableNotEmpty = False
  
dbTest("Delta-02-inventoryTableExists", True, tableNotEmpty)  

print("Tests passed!")

## Step 3

Count the number of records found under `inventoryDataPath` where the `Country` is `Sweden`.

In [48]:

spark.sql("""
  CREATE TABLE IF NOT EXISTS temp 
  USING DELTA 
  LOCATION '{}' 
""".format(inventoryDataPath))

count = spark.sql("select count(*) from temp where Country = 'Sweden'").first()[0]
count
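
The same count can also be obtained without creating a table, using the path-based notation introduced earlier in this lesson. The following is a sketch only; the helper name `count_by_country_sql` is hypothetical.

```python
def count_by_country_sql(delta_path, country):
    """Build a count query against a Delta directory, filtered on Country."""
    return "SELECT count(*) FROM delta.`{}` WHERE Country = '{}'".format(
        delta_path, country)


# In the notebook:
# count = spark.sql(count_by_country_sql(inventoryDataPath, "Sweden")).first()[0]
```

Because `Country` is the partition column, the filter lets Spark read only the matching partition directory.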


In [49]:
# TEST - Run this cell to test your solution.
dbTest("Delta-L2-inventoryDataDelta-count", 2925, count)
print("Tests passed!")

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [51]:
%run "./Includes/Classroom-Cleanup"

## Summary
In this lesson we learned:
* That Databricks Delta overcomes the schema-on-read problem 
  - where data is applied to a plan or schema as it is pulled out of a stored location, rather than as it goes into a stored location.
* About a compact notation that allows you to work with Databricks Delta data as tables (without having to explicitly create tables).
* How to convert existing Parquet-based workloads to Databricks Delta workloads.

## Review Questions

**Q:** What is the Databricks Delta command to display metadata?<br>
**A:** Metadata is displayed through `DESCRIBE DETAIL tableName`.

**Q:** Where does the schema for a Databricks Delta data set reside?<br>
**A:** The table name, path, and database info are stored in the Hive metastore; the actual schema is stored in the `_delta_log` directory.

**Q:** What is schema-on-read?<br>
**A:** It stems from Hive and roughly means: the schema for a data set is unknown until you perform a read operation.

**Q:** How does this problem manifest, assuming a `parquet`-based data lake?<br>
**A:** It shows up as missing data upon load into a table.

**Q:** How do you remedy the problem described above?<br>
**A:** To remedy, you repair the table using `MSCK REPAIR TABLE` or switch to Databricks Delta!

## Next Steps

Start the next lesson, [Append]($./Delta 03 - Append).

## Additional Topics & Resources

* <a href="https://docs.databricks.com/delta/delta-batch.html#" target="_blank">Table Batch Read and Writes</a>
* <a href="https://en.wikipedia.org/wiki/Partition_(database)#" target="_blank">Database Partitioning</a>

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>