# Databricks Delta Batch Operations - Create Table

Databricks® Delta allows you to read, write, and query data in data lakes efficiently.

## Datasets Used
We will use online retail datasets from `/mnt/training/online_retail`

### Getting Started

You will notice that throughout this course, there is a lot of context switching between PySpark/Scala and SQL.

This is because:
* `read` and `write` operations are performed on DataFrames using PySpark or Scala
* tables are created and queried directly on Databricks Delta tables using SQL

Run the following cell to configure our "classroom."

In [3]:
%run ./Includes/Classroom-Setup

Set up relevant paths.

In [5]:
inputPath = "/mnt/training/online_retail/data-001/data.csv"
genericDataPath = userhome + "/generic/customer-data/"
deltaDataPath = userhome + "/delta/customer-data/"
backfillDataPath = userhome + "/delta/backfill-data/"

###  READ CSV data then WRITE to Parquet / Databricks Delta

Read the data into a DataFrame. Since this is a CSV file, let Spark infer the column types from the data and take the column names from the first row by setting
* `inferSchema` to `true`
* `header` to `true`

Use `overwrite` mode so that re-running the cell replaces the existing data instead of failing.

Partition on `Country` because there are only a few unique countries. 

More information on the how and why of partitioning is contained in the links at the bottom of this notebook.

Then write the data to Parquet and Databricks Delta.

In [7]:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col

rawDataDF = (spark.read 
  .option("inferSchema", "true") 
  .option("header", "true")
  .csv(inputPath) 
  .withColumn("InvoiceNo", col("InvoiceNo").cast(IntegerType()))
)

# write to generic dataset
rawDataDF.write.mode("overwrite").format("parquet").partitionBy("Country").save(genericDataPath)

# write to delta dataset
rawDataDF.write.mode("overwrite").format("delta").partitionBy("Country").save(deltaDataPath)
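The effect of `partitionBy("Country")` can be pictured without Spark. The following plain-Python sketch assumes the Hive-style `column=value` directory naming that partitioned writes use; the sample rows, the base path, and the helper `partition_dir` are hypothetical and exist only for illustration.

```python
from collections import Counter
import posixpath

# Hypothetical sample rows standing in for the retail data (not real records).
rows = [
    {"InvoiceNo": 1, "Country": "Sweden"},
    {"InvoiceNo": 2, "Country": "France"},
    {"InvoiceNo": 3, "Country": "Sweden"},
]

def partition_dir(base_path, column, value):
    """Return the Hive-style directory name for one partition value."""
    return posixpath.join(base_path, "{}={}".format(column, value))

# One directory per distinct value: the column's cardinality is the directory count.
rows_per_partition = Counter(row["Country"] for row in rows)
partition_dirs = sorted(
    partition_dir("/delta/customer-data", "Country", country)
    for country in rows_per_partition
)

for directory in partition_dirs:
    print(directory)
```

Because every distinct value becomes its own directory, a column with few distinct values keeps the number of directories small, which is the reason given above for choosing `Country`.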

### CREATE Using Non-Databricks Delta Pipeline

Create a table called `customer_data` using `parquet` out of the above data.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Notice how you MUST specify a schema and partitioning info!

In [9]:
spark.sql("""
    DROP TABLE IF EXISTS customer_data
  """)
spark.sql("""
    CREATE TABLE customer_data (
      InvoiceNo INTEGER,
      StockCode STRING,
      Description STRING,
      Quantity INTEGER,
      InvoiceDate STRING,
      UnitPrice DOUBLE,
      CustomerID INTEGER,
      Country STRING)
    USING parquet 
    OPTIONS (path = '{}' )
    PARTITIONED BY (Country)
  """.format(genericDataPath))

Perform a simple `count` query to verify the number of records.

In [11]:
%sql
SELECT count(*) FROM customer_data

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Wait, no results? 

The cause is a behavior inherited from Apache Hive.

It is the concept of
<b>schema on read</b>, where data is applied to a plan or schema as it is pulled out of a stored location, rather than as it goes into a stored location.

This means that as soon as you put data into a data lake, the schema is unknown <i>until</i> you perform a read operation. Here, the partition directories already on disk are not registered with the new table, so the query sees no data.

To remedy this, repair the table using `MSCK REPAIR TABLE`, which registers the existing partitions.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Only after table repair is our count of customer data correct.

Schema on read is explained in more detail <a href="https://stackoverflow.com/a/11764519/53495#" target="_blank">in this article</a>.

In [13]:
%sql
MSCK REPAIR TABLE customer_data;

SELECT count(*) FROM customer_data
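Why the first count came back empty can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions, not how Spark or Hive is implemented: the list `registered` plays the role of the metastore's partition list, and the helpers `write_partitioned`, `count_rows`, and `repair` are hypothetical names.

```python
import os
import tempfile

def write_partitioned(base_path, rows, column):
    """Append each row to a file inside a Hive-style directory such as Country=Sweden/."""
    for row in rows:
        part_dir = os.path.join(base_path, "{}={}".format(column, row[column]))
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-0.txt"), "a") as f:
            f.write(str(row["InvoiceNo"]) + "\n")

def count_rows(base_path, registered_partitions):
    """Count rows, reading only the partitions the toy metastore knows about."""
    total = 0
    for part in registered_partitions:
        with open(os.path.join(base_path, part, "part-0.txt")) as f:
            total += sum(1 for _ in f)
    return total

def repair(base_path):
    """Toy stand-in for MSCK REPAIR TABLE: discover partition directories on disk."""
    return sorted(d for d in os.listdir(base_path) if "=" in d)

# Hypothetical sample rows (not real records).
rows = [
    {"InvoiceNo": 1, "Country": "Sweden"},
    {"InvoiceNo": 2, "Country": "France"},
    {"InvoiceNo": 3, "Country": "Sweden"},
]
base = tempfile.mkdtemp()
write_partitioned(base, rows, "Country")

registered = []                          # table just created: nothing registered yet
count_before = count_rows(base, registered)
registered = repair(base)                # partitions discovered from the directory layout
count_after = count_rows(base, registered)
print(count_before, count_after)
```

The data is on disk the whole time; only the list of known partitions changes between the two counts.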

### CREATE Using Databricks Delta Pipeline

Create a table called `customer_data_delta` using `DELTA` out of the above data.

The notation is:
> `CREATE TABLE <table-name>` <br>
  `USING DELTA` <br>
  `LOCATION <path-to-data>` <br>

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Notice how we do not have to specify partition columns.

In [15]:
spark.sql("""
  DROP TABLE IF EXISTS customer_data_delta
""")
spark.sql("""
  CREATE TABLE customer_data_delta 
  USING DELTA 
  LOCATION '{}' 
""".format(deltaDataPath))
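Since the same drop-and-create pattern recurs in the exercises, it could be wrapped in a small helper. The sketch below is an assumption for illustration: the function name `delta_table_statements` and the example path are hypothetical, and the helper only builds the SQL strings with `str.format`, as the cell above does, without any escaping.

```python
def delta_table_statements(table_name, path):
    """Build the DROP and CREATE statements for a Delta table over existing data."""
    drop_sql = "DROP TABLE IF EXISTS {}".format(table_name)
    create_sql = "CREATE TABLE {} USING DELTA LOCATION '{}'".format(table_name, path)
    return [drop_sql, create_sql]

# Hypothetical path; in the notebook this would be deltaDataPath.
statements = delta_table_statements("customer_data_delta", "/tmp/delta/customer-data/")
for statement in statements:
    print(statement)
```

In a notebook, each returned string would be passed to `spark.sql(...)`. Because the values are interpolated directly, this form is only suitable for trusted table names and paths.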

Perform a simple `count` query to verify the number of records.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Notice how the count is correct right away; no need to worry about table repairs.

In [17]:
%sql
SELECT count(*) FROM customer_data_delta

#### Metadata

Since we already have data backing `customer_data_delta` in place, 
the table in the Hive metastore automatically inherits the schema, partitioning, 
and table properties of the existing data. 

Note that we only store the table name, path, and database info in the Hive metastore;
the actual schema is stored in the `_delta_log` directory.

Metadata is displayed through `DESCRIBE DETAIL <tableName>`.

As long as some data is already in place for a Databricks Delta table, the schema can be inferred from it.

In [19]:
%sql
DESCRIBE DETAIL customer_data_delta
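To make the role of the transaction log concrete, here is a plain-Python sketch that reads a schema back from a log file. It assumes the open Delta log layout, where each commit is a JSON-lines file under `_delta_log` and a `metaData` action carries `schemaString` and `partitionColumns`. The commit file written here is hand-made and heavily simplified (real commits contain more actions and fields), and `read_table_metadata` is a hypothetical helper.

```python
import json
import os
import tempfile

# Build a tiny, hand-written stand-in for a Delta table directory.
table_path = tempfile.mkdtemp()
log_dir = os.path.join(table_path, "_delta_log")
os.makedirs(log_dir)

schema = {"type": "struct", "fields": [
    {"name": "InvoiceNo", "type": "integer", "nullable": True, "metadata": {}},
    {"name": "Country", "type": "string", "nullable": True, "metadata": {}},
]}
meta_action = {"metaData": {
    "schemaString": json.dumps(schema),   # the schema is stored as a JSON string
    "partitionColumns": ["Country"],
}}

# Commit files are named with a zero-padded version number.
with open(os.path.join(log_dir, "{:020d}.json".format(0)), "w") as f:
    f.write(json.dumps(meta_action) + "\n")

def read_table_metadata(path):
    """Return (column names, partition columns) from the last metaData action found."""
    columns, partitions = None, None
    commits_dir = os.path.join(path, "_delta_log")
    for name in sorted(os.listdir(commits_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(commits_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "metaData" in action:
                    fields = json.loads(action["metaData"]["schemaString"])["fields"]
                    columns = [field["name"] for field in fields]
                    partitions = action["metaData"]["partitionColumns"]
    return columns, partitions

columns, partition_columns = read_table_metadata(table_path)
print(columns, partition_columns)
```

Because the schema and partition columns travel with the data, the `CREATE TABLE ... USING DELTA LOCATION ...` statement does not need to repeat them.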

## Exercise 1

Read data in `outdoorSmallPath` with options:
* first row is the header
* infer the schema from the data

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Since `StockCode` looks numeric, you will need to convert `StockCode` explicitly to String. 

* Use this notation `withColumn("StockCode", col("StockCode").cast(StringType()))`

In [21]:
# ANSWER
from pyspark.sql.types import StringType
from pyspark.sql.functions import col

outdoorSmallPath = "/mnt/training/online_retail/outdoor-products/outdoor-products-small.csv"
backfillDF = (spark       
  .read                                               
  .option("inferSchema","true")                       
  .option("header","true")                            
  .csv(outdoorSmallPath)   
  .withColumn("StockCode", col("StockCode").cast(StringType()))
)

In [22]:
# TEST - Run this cell to test your solution.
from pyspark.sql.types import StructField, StructType, StringType, DoubleType, IntegerType

expectedSchema = StructType([
   StructField("InvoiceNo", IntegerType(), True),
   StructField("StockCode", StringType(), True),
   StructField("Description", StringType(), True),
   StructField("Quantity", IntegerType(), True),
   StructField("InvoiceDate", StringType(), True),
   StructField("UnitPrice", StringType(), True),
   StructField("CustomerID", IntegerType(), True),
   StructField("Country", StringType(), True),
])

dbTest("Delta-02-schemas", set(expectedSchema), set(backfillDF.schema))

print("Tests passed!")
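The test above wraps both schemas in `set(...)`, which makes the comparison independent of column order. A plain-Python sketch of that idea follows, using `(name, type)` tuples as stand-ins for `StructField` objects; the field lists are hypothetical examples.

```python
# Tuples stand in for StructField(name, type) objects; values are illustrative only.
expected = [("InvoiceNo", "int"), ("StockCode", "string"), ("Country", "string")]
actual = [("Country", "string"), ("InvoiceNo", "int"), ("StockCode", "string")]

same_fields = set(expected) == set(actual)   # ignores column order
same_order = expected == actual              # list comparison is order-sensitive

print(same_fields, same_order)
```

A set comparison also ignores duplicates, so it checks which fields are present rather than how they are arranged.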

## Exercise 2

Create a Databricks Delta table `backfill_data_delta` backed by `backfillDataPath`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** 
* Don't forget to use overwrite mode in case you re-run the cell
* Partition by `Country`

In [24]:
# ANSWER
(backfillDF
  .write
  .mode("overwrite")
  .format("delta")
  .partitionBy("Country")
  .save(backfillDataPath)
)

spark.sql("""
    DROP TABLE IF EXISTS backfill_data_delta
  """)
spark.sql("""
    CREATE TABLE backfill_data_delta 
    USING DELTA 
    LOCATION '{}' 
  """.format(backfillDataPath))
None

In [25]:
# TEST - Run this cell to test your solution.
try:
  # spark.table raises an exception if the table cannot be found
  tableExists = (spark.table("backfill_data_delta") is not None)
except Exception:
  tableExists = False
  
dbTest("Delta-02-backfillTableExists", True, tableExists)  

print("Tests passed!")

## Exercise 3

Count the number of records in `backfill_data_delta` where the `Country` is `Sweden`.

In [27]:
# ANSWER
count = spark.sql("SELECT count(*) as total FROM backfill_data_delta WHERE Country='Sweden'").collect()[0][0]

In [28]:
# TEST - Run this cell to test your solution.
dbTest("Delta-L2-backfillDataDelta-count", 2925, count)
print("Tests passed!")

## Summary
Creating tables with Databricks Delta is straightforward: when data already exists at the table location, you do not need to specify the schema or the partitioning.

## Review Questions

**Q:** What is the Databricks Delta command to display metadata?<br>
**A:** Metadata is displayed through `DESCRIBE DETAIL tableName`.

**Q:** Where does the schema for a Databricks Delta data set reside?<br>
**A:** The table name, path, and database info are stored in the Hive metastore; the actual schema is stored in the `_delta_log` directory.

**Q:** What is the general rule about partitioning and the cardinality of a set?<br>
**A:** We should partition on columns with low cardinality to avoid the penalties incurred by managing large quantities of partition metadata.

**Q:** What is schema-on-read?<br>
**A:** It stems from Hive and roughly means: the schema for a data set is unknown until you perform a read operation.

**Q:** How does this problem manifest in Databricks assuming a `parquet` based data lake?<br>
**A:** It shows up as missing data upon load into a table in Databricks.

**Q:** How do you remedy this problem in Databricks?<br>
**A:** Repair the table using `MSCK REPAIR TABLE`, or switch to Databricks Delta!

## Next Steps

Start the next lesson, [Append]($./03-Append).

## Additional Topics & Resources

* <a href="https://docs.azuredatabricks.net/delta/delta-batch.html" target="_blank">Table Batch Read and Writes</a>
* <a href="https://en.wikipedia.org/wiki/Partition_(database)#" target="_blank">Database Partitioning</a>