# Databricks Delta Optimizations and Best Practices

Databricks® Delta has nifty optimizations to speed up your queries.

## Datasets Used
* Online retail datasets from
`/mnt/training/online_retail`

### Getting Started

Run the following cell to configure our "classroom."

In [3]:
%run ./Includes/Classroom-Setup

Set up relevant paths.

In [5]:
deltaIotPath = userhome + "/delta/iot-pipeline/"
deltaDataPath = userhome + "/delta/customer-data/"

## SMALL FILE PROBLEM

Historical and new data is often written in very small files and directories. 

This data may be spread across a data center or even across the world (that is, not co-located).

The result is that a query on this data may be very slow due to
* network latency 
* volume of file metadata

The solution is to compact many small files into one larger file.
Databricks Delta has a mechanism for compacting small files.
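
To get a feel for why file count matters, here is a minimal back-of-the-envelope sketch. It assumes a fixed overhead per file (open, metadata lookup, network round trip) plus a streaming cost per byte. The function name and both default numbers are illustrative assumptions, not measured Databricks figures.

```python
def estimated_scan_seconds(num_files, total_bytes,
                           per_file_overhead_s=0.05,
                           throughput_bytes_per_s=100 * 1024 ** 2):
    """Toy cost model: a fixed cost per file plus the time to stream the bytes.

    Both defaults are made-up illustrative values, not measurements.
    """
    return num_files * per_file_overhead_s + total_bytes / throughput_bytes_per_s


total = 1024 ** 3  # the same 1 GiB of data in both layouts
many_small = estimated_scan_seconds(num_files=10_000, total_bytes=total)
one_large = estimated_scan_seconds(num_files=1, total_bytes=total)

print(f"10,000 small files: {many_small:.1f} s")
print(f"1 large file:       {one_large:.1f} s")
```

Under these assumptions the per-file overhead dominates the small-file layout, even though the amount of data read is identical.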

Use Azure Data Explorer to see many small files.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Data Explorer is available ONLY on Azure (not in Databricks)

<img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/Delta/azure-small-file.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px"/>

### OPTIMIZE
Databricks Delta supports the `OPTIMIZE` operation, which performs file compaction.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Small files are compacted together into new, larger files of up to 1 GB.
Because `OPTIMIZE` does not delete the original small files, the number of files in the directory actually increases at this point!

The 1GB size was determined by the Databricks optimization team as a trade-off between query speed and run-time performance.
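
The idea of compaction can be sketched as a simple greedy grouping of file sizes into bins capped at a target size. This is only an illustration of the concept; `plan_compaction` is a hypothetical helper, and the strategy Databricks Delta actually uses internally is not described in this lesson.

```python
TARGET_BYTES = 1024 ** 3  # the 1 GB target mentioned above


def plan_compaction(file_sizes, target_bytes=TARGET_BYTES):
    """Greedily group file sizes into bins.

    A bin is closed before adding a file would push it past target_bytes.
    A single file larger than target_bytes still gets a bin of its own.
    """
    bins, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


small_files = [1000] * 500  # 500 files of about 1 KB each, as in the listing
bins = plan_compaction(small_files)
print(len(small_files), "small files ->", len(bins), "compacted file(s)")
```

Each bin corresponds to one new, larger file; the small input files stay on disk until they are cleaned up separately.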

`OPTIMIZE` is not run automatically because you must collect many small files first.

* Run `OPTIMIZE` more often if you want better end-user query performance 
* Since `OPTIMIZE` is a time consuming step, run it less often if you want to optimize cost of compute hours
* To start with, run `OPTIMIZE` on a daily basis (preferably at night when spot prices are low), and determine the right frequency for your particular business case
* In the end, the frequency at which you run `OPTIMIZE` is a business decision

The easiest way to see what `OPTIMIZE` does is to perform a simple `count(*)` query before and after and compare the timing!
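
If you prefer plain Python over notebook magics for the before-and-after comparison, a small timing helper along these lines would do. `timed` is a hypothetical helper name, and the workload below is a stand-in so the sketch runs anywhere; in the notebook you would pass the `count(*)` query instead.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload. In the notebook this could be, for example:
#   lambda: spark.sql("SELECT count(*) FROM demo_iot_data_delta").collect()
result, seconds = timed(sum, range(1_000_000))
print(f"result={result}, elapsed={seconds:.4f} s")
```

Run it once before `OPTIMIZE` and once after, then compare the two elapsed times.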

Take a look at the `deltaIotPath + "/date=2018-06-01/"` directory.

Notice, in particular, files like `../delta/iot-pipeline/date=2018-06-01/part-xxxx.snappy.parquet`. There are hundreds of small files!

**Make sure you run exercises 2, 3 and 4 from lesson 03-Append before running the next command**

In [10]:
display(dbutils.fs.ls(deltaIotPath + "/date=2018-06-01/"))

path,name,size
dbfs:/user/lino@solliance.net/delta/iot-pipeline/date=2018-06-01/part-00000-d4e96381-bae6-45c0-b3dc-4c08f4241c85.c000.snappy.parquet,part-00000-d4e96381-bae6-45c0-b3dc-4c08f4241c85.c000.snappy.parquet,997
dbfs:/user/lino@solliance.net/delta/iot-pipeline/date=2018-06-01/part-00001-7bcaf569-19fe-4d1a-ba0f-7776df1f472f.c000.snappy.parquet,part-00001-7bcaf569-19fe-4d1a-ba0f-7776df1f472f.c000.snappy.parquet,1008
dbfs:/user/lino@solliance.net/delta/iot-pipeline/date=2018-06-01/part-00002-64cce2f9-03b8-490a-b1f2-48ecd22613b0.c000.snappy.parquet,part-00002-64cce2f9-03b8-490a-b1f2-48ecd22613b0.c000.snappy.parquet,1013
dbfs:/user/lino@solliance.net/delta/iot-pipeline/date=2018-06-01/part-00003-37999a80-f625-4821-bcfe-8d2869c9117a.c000.snappy.parquet,part-00003-37999a80-f625-4821-bcfe-8d2869c9117a.c000.snappy.parquet,1003
dbfs:/user/lino@solliance.net/delta/iot-pipeline/date=2018-06-01/part-00004-56e42edc-16bd-4bc3-8826-c811248dce85.c000.snappy.parquet,part-00004-56e42edc-16bd-4bc3-8826-c811248dce85.c000.snappy.parquet,997
dbfs:/user/lino@solliance.net/delta/iot-pipeline/date=2018-06-01/part-00005-4519d901-20c3-426e-931f-b8daaf77d87b.c000.snappy.parquet,part-00005-4519d901-20c3-426e-931f-b8daaf77d87b.c000.snappy.parquet,999
dbfs:/user/lino@solliance.net/delta/iot-pipeline/date=2018-06-01/part-00006-2c72c44d-d34f-4d2c-90d3-2ce978380a66.c000.snappy.parquet,part-00006-2c72c44d-d34f-4d2c-90d3-2ce978380a66.c000.snappy.parquet,1012
dbfs:/user/lino@solliance.net/delta/iot-pipeline/date=2018-06-01/part-00007-83f86e84-801e-4026-b94b-725b21b382aa.c000.snappy.parquet,part-00007-83f86e84-801e-4026-b94b-725b21b382aa.c000.snappy.parquet,1007
dbfs:/user/lino@solliance.net/delta/iot-pipeline/date=2018-06-01/part-00008-c2a8221a-8f71-4cbb-99c3-5be53b8dac07.c000.snappy.parquet,part-00008-c2a8221a-8f71-4cbb-99c3-5be53b8dac07.c000.snappy.parquet,1013
dbfs:/user/lino@solliance.net/delta/iot-pipeline/date=2018-06-01/part-00009-7c4adbc2-f404-41fe-a51b-7f80df6787ca.c000.snappy.parquet,part-00009-7c4adbc2-f404-41fe-a51b-7f80df6787ca.c000.snappy.parquet,1011


CAUTION: Run this query. Notice it is very slow, due to the number of small files.

In [12]:
%sql
SELECT * FROM demo_iot_data_delta where deviceId=379

action,time,date,deviceId
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379
Open,1529091520,2018-06-01,379


### Data Skipping and ZORDER

Databricks Delta uses two mechanisms to speed up queries.

<b>Data Skipping</b> is a performance optimization that aims at speeding up queries that contain filters (WHERE clauses). 

For example, we have a data set that is partitioned by `date`. 

A query using `WHERE date > '2018-06-01'` would not access data that resides in partitions that correspond to dates on or before `2018-06-01`.
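
The partition example above can be sketched in a few lines of plain Python. The partition contents and the `files_to_scan` helper are hypothetical; the point is only that whole partitions are ruled out by their `date` value before any file is opened.

```python
from datetime import date

# Hypothetical partition directories keyed by their `date` value.
partitions = {
    date(2018, 5, 30): ["part-0000.parquet", "part-0001.parquet"],
    date(2018, 6, 1): ["part-0002.parquet"],
    date(2018, 6, 2): ["part-0003.parquet", "part-0004.parquet"],
}


def files_to_scan(partitions, min_exclusive):
    """Return only the files in partitions whose date is after min_exclusive."""
    selected = []
    for partition_date, files in sorted(partitions.items()):
        if partition_date > min_exclusive:
            selected.extend(files)
    return selected


# Equivalent of WHERE date > '2018-06-01'
print(files_to_scan(partitions, date(2018, 6, 1)))
```

Only the files of the last partition are returned; the other three files are never read.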

<b>ZOrdering</b> is a technique to colocate related information in the same set of files. 

ZOrdering maps multidimensional data to one dimension while preserving locality of the data points.
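
To make "mapping multidimensional data to one dimension" concrete, here is a minimal sketch of the classic Z-order (Morton) curve for two integer coordinates, as described in the Z-Order Curve reference at the end of this lesson. It illustrates the general technique only and is not a claim about how Databricks Delta implements `ZORDER` internally; `z_value` is a hypothetical helper name.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a single Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return z


# Sort a 4x4 grid of points by their position on the curve.
points = [(x, y) for x in range(4) for y in range(4)]
for point in sorted(points, key=lambda p: z_value(*p)):
    print(point, z_value(*point))
```

Points that are close in both coordinates tend to receive nearby codes, so sorting by the code keeps related rows in the same set of files.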

#### ZORDER example
In the image below, table `Students` has 4 columns: 
* `Gender` with 2 distinct values
* `Pass_Fail` with 2 distinct values
* `Class` with 4 distinct values  
* `Student` with many distinct values 

Suppose you wish to perform the following query:

```SELECT Name FROM Students WHERE gender = 'M' AND Pass_Fail = 'P' AND Class = 'Junior'```

```ORDER BY Gender, Pass_Fail```

The most effective way of performing that search is to order the data starting with the largest set, which is `Gender` in this case. 

If you're searching for `gender = 'M'`, then you don't even have to look at students with `gender = 'F'`. 

Note that this technique only works if all `gender = 'M'` values are co-located.


<div><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/Delta/zorder.png" style="height: 300px"/></div><br/>

#### ZORDER usage

With Databricks Delta the notation is:

> `OPTIMIZE Students`<br>
`ZORDER BY Gender, Pass_Fail`

This will ensure all the data backing `Gender = 'M'` is colocated, then data associated with `Pass_Fail = 'P'` is colocated.

See References below for more details on the algorithms behind ZORDER.

Using ZORDER, you can order by multiple columns as a comma-separated list; however, the effectiveness of the locality drops with each additional column.

In streaming, where incoming events are inherently ordered (more or less) by event time, use `ZORDER` to reorder by a more meaningful index.

In [16]:
%sql
OPTIMIZE demo_iot_data_delta
ZORDER by (deviceId)

path
""


In [17]:
%sql
SELECT * FROM demo_iot_data_delta WHERE deviceId=379

## VACUUM

To save on storage costs you should occasionally clean up invalid files using the `VACUUM` command. 

Invalid files are small files compacted into a larger file with the `OPTIMIZE` command.

The syntax of the `VACUUM` command is:
>`VACUUM name-of-table RETAIN number-of HOURS;`

The `number-of` parameter is the <b>retention interval</b>, specified in hours.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Databricks does not recommend you set a retention interval shorter than seven days because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table.

The scenario here is:
1. User A starts a query against uncompacted files, then
2. User B invokes a `VACUUM` command, which deletes the uncompacted files
3. User A's query fails because the underlying files have disappeared
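
The role of the retention interval in this scenario can be sketched as follows. The file names, timestamps, and the `files_to_delete` helper are hypothetical; the default of 168 hours simply mirrors the seven-day recommendation above.

```python
from datetime import datetime, timedelta


def files_to_delete(invalidated_at, now, retain_hours=168):
    """Return the names of invalid files older than the retention interval."""
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(name for name, ts in invalidated_at.items() if ts < cutoff)


now = datetime(2018, 6, 15, 12, 0)
# Hypothetical files and the time each one stopped being part of the table.
invalidated_at = {
    "part-0000.parquet": now - timedelta(days=10),
    "part-0001.parquet": now - timedelta(hours=2),
}

print(files_to_delete(invalidated_at, now))                  # seven-day retention
print(files_to_delete(invalidated_at, now, retain_hours=0))  # unsafe override
```

With seven days of retention, the file invalidated two hours ago survives, so User A's in-flight query can still read it. With a retention of 0 hours, it is deleted immediately.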

Invalid files can also result from updates/upserts/deletions.

More details are provided here: <a href="https://docs.azuredatabricks.net/delta/optimizations.html#garbage-collection" target="_blank"> Garbage Collection</a>.

In [19]:
len(dbutils.fs.ls(deltaIotPath + "/date=2018-06-01"))

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> In the example below we set off an immediate `VACUUM` operation with an override of the retention check so that all files are cleaned up immediately.

Do not do this in production!

In [21]:
%sql

VACUUM demo_iot_data_delta RETAIN 0 HOURS;

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Notice how the directory looks vastly cleaned up!

In [23]:
len(dbutils.fs.ls(deltaIotPath + "/date=2018-06-01"))

## Exercise 1: OPTIMIZE and ZORDER

Let's apply some of these optimizations to `../delta/customer-data/`.

Our data is partitioned by `Country`.

We want to query the data for `StockCode` equal to `22301`.

We expect this query to be slow because we have to examine ALL OF `../delta/customer-data/` to find the desired `StockCode` and not just in one or two partitions.

First, let's time the above query: you will need to run it and assign the result to `preZorderQuery`.

In [25]:
# ANSWER
%timeit preZorderQuery = spark.sql("SELECT * FROM customer_data_delta WHERE StockCode=22301 ").collect()

Compact the files and re-index by `StockCode`.

In [27]:
%sql
-- ANSWER
OPTIMIZE customer_data_delta
ZORDER by (StockCode)

Let's time the above query again: you will need to run it and assign the result to `postZorderQuery`.

In [29]:
# ANSWER
%timeit postZorderQuery = spark.sql("SELECT * FROM customer_data_delta WHERE StockCode=22301").collect()

## Exercise 2: VACUUM

Count the number of files before `VACUUM` for `Country=Sweden`.

In [31]:
# ANSWER
preNumFiles = len(dbutils.fs.ls(deltaDataPath + "/Country=Sweden"))

In [32]:
# TEST - Run this cell to test your solution.
dbTest("Delta-08-numFilesSweden-pre", True, preNumFiles > 1)

print("Tests passed!")

Now, watch the number of files shrink as you perform `VACUUM`.

In [34]:
%sql
-- ANSWER
VACUUM customer_data_delta RETAIN 0 HOURS;

Count how many files there are for `Country=Sweden`.

In [36]:
# ANSWER
postNumFiles = len(dbutils.fs.ls(deltaDataPath + "/Country=Sweden"))

In [37]:
# TEST - Run this cell to test your solution.
dbTest("Delta-08-numFilesSweden-post", 1, postNumFiles)

print("Tests passed!")

## Summary
Databricks Delta offers key features that allow for query optimization and garbage collection, resulting in improved performance.

## Review Questions

**Q:** Why are many small files problematic when doing queries on data backed by these?<br>
**A:** If there are many files, some of which may not be co-located, the principal sources of slowdown are
* network latency
* (volume of) file metadata

**Q:** What do `OPTIMIZE` and `VACUUM` do?<br>
**A:** `OPTIMIZE` creates the larger file from a collection of smaller files and `VACUUM` deletes the invalid small files that were used in compaction.

**Q:** What size files does `OPTIMIZE` compact to and why that value?<br>
**A:** Small files are compacted to around 1 GB; this value was determined by the Databricks optimization team as a trade-off between query speed and run-time performance.

**Q:** What should one be careful of when using `VACUUM`?<br>
**A:** Don't set a retention interval shorter than seven days because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table.

**Q:** What does `ZORDER` do?<br>
**A:** It is a technique to colocate related information in the same set of files.

## Next Steps

Start the next lesson, [Architecture]($./07-Architecture ).

## Additional Topics & Resources

* <a href="https://docs.azuredatabricks.net/delta/optimizations.html" target="_blank">Optimizing Performance and Cost</a>
* <a href="http://parquet.apache.org/documentation/latest/" target="_blank">Parquet Metadata</a>
* <a href="https://en.wikipedia.org/wiki/Z-order_curve" target="_blank">Z-Order Curve</a>