# <img src="https://files.training.databricks.com/images/DeltaLake-logo.png" width=80px> Managed Delta Lake

Delta Lake&reg; managed and queried via the Databricks platform includes additional features and optimizations.

These include:

- **Optimize**

- **Data skipping**

- **Z-Order**

- **Caching**

<img src="https://www.evernote.com/l/AAGv1SuWeRNJM4TI4bIOyGNPm0CTHa17PLwB/image.png" width=900px>

### Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run ./Includes/Classroom-Setup

In [0]:
# If the mount created by ./Includes/Classroom-Setup is broken, unmount it here
# so that %run "./Includes/Dataset-Mounts-New" (next cell) can mount it again.
try:
    files = dbutils.fs.ls("/mnt/training")
except Exception:
    dbutils.fs.unmount("/mnt/training/")


/mnt/training/ has been unmounted.


In [0]:
%run "./Includes/Dataset-Mounts-New"

In [0]:
%run ./Includes/Delta-Optimization-Setup

Datasets are mounted


In [0]:
spark.sql("""
    DROP TABLE IF EXISTS iot_data
  """)
spark.sql("""
    CREATE TABLE iot_data
    USING DELTA
    LOCATION '{}/delta/iot-events/'
  """.format(userhome))

DataFrame[]

Set up relevant paths.

In [0]:
iotPath = userhome + "/delta/iot-events/"

## SMALL FILE PROBLEM

Historical and new data is often written in very small files and directories.

This data may be spread across a data center or even across the world (that is, not co-located).

The result is that a query on this data may be very slow due to
* network latency
* volume of file metadata

The solution is to compact many small files into fewer, larger files.
Delta Lake provides a mechanism for doing this.

### OPTIMIZE
Delta Lake supports the `OPTIMIZE` operation, which performs file compaction.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Small files are compacted together into new, larger files of up to 1 GB.
The original small files are not deleted at this point, so the total number of files on disk actually increases!

The 1 GB size was determined by the Databricks optimization team as a trade-off between query speed and run-time performance when running `OPTIMIZE`.
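
To make the file-count effect concrete, here is a toy model of compaction. It is only a sketch: the greedy grouping in `plan_compaction`, the sample sizes, and all names are illustrative assumptions, not the algorithm Databricks actually uses.

```python
# Toy model of file compaction (illustrative only; not Databricks' actual algorithm).
TARGET_BYTES = 1024 ** 3  # the 1 GB target mentioned above


def plan_compaction(sizes, target=TARGET_BYTES):
    """Greedily group file sizes (in bytes) into bins of at most `target` bytes."""
    bins, current, current_size = [], [], 0
    for size in sizes:
        # Start a new output file when adding this input would exceed the target.
        if current and current_size + size > target:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


small_files = [2_700] * 200              # 200 hypothetical files of about 2.7 KB each
compacted = plan_compaction(small_files)

files_read_by_queries = len(compacted)               # only the new, larger files
files_on_disk = len(small_files) + len(compacted)    # old files remain until cleaned up
print(files_read_by_queries, files_on_disk)
```

In this model queries only need the compacted files, while the directory temporarily holds both the old and the new files.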

`OPTIMIZE` does not run automatically, because compaction only pays off once many small files have accumulated.

* Run `OPTIMIZE` more often if you want better end-user query performance
* Since `OPTIMIZE` is a time-consuming step, run it less often if you want to reduce the cost of compute hours
* To start with, run `OPTIMIZE` on a daily basis (preferably at night when spot prices are low), and determine the right frequency for your particular business case
* In the end, the frequency at which you run `OPTIMIZE` is a business decision

The easiest way to see what `OPTIMIZE` does is to perform a simple `count(*)` query before and after and compare the timing!
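
One way to compare timings is a small helper like the one below. This is a minimal sketch: `timed` is a hypothetical helper name, and the workload used here is a stand-in so that the snippet runs outside a notebook.

```python
import time


def timed(fn):
    """Run fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Stand-in workload so the sketch runs anywhere. In the notebook you would pass
# something like: lambda: spark.sql("SELECT count(*) FROM iot_data").collect()
result, seconds = timed(lambda: sum(range(1_000_000)))
print(f"result={result}, elapsed={seconds:.3f}s")
```

Call it once before and once after `OPTIMIZE` and compare the two elapsed times. A single run is noisy, so repeating the measurement a few times gives a fairer picture.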

Take a look at the `iotPath + "date=2016-07-26"` directory.

Notice, in particular, files like `../delta/iot-events/date=2016-07-26/part-xxxx.snappy.parquet`. There are hundreds of small files!

In [0]:
display(dbutils.fs.ls(iotPath + "date=2016-07-26"))

path,name,size,modificationTime
dbfs:/user/vishal.abnave@borregaard.com/delta/iot-events/date=2016-07-26/part-00000-c6a1ccc6-3885-40da-bbd8-88690553ad02.c000.snappy.parquet,part-00000-c6a1ccc6-3885-40da-bbd8-88690553ad02.c000.snappy.parquet,2662,1684694003000
dbfs:/user/vishal.abnave@borregaard.com/delta/iot-events/date=2016-07-26/part-00001-24245c01-6937-4633-a25d-7c4890ba89c7.c000.snappy.parquet,part-00001-24245c01-6937-4633-a25d-7c4890ba89c7.c000.snappy.parquet,2645,1684694003000
dbfs:/user/vishal.abnave@borregaard.com/delta/iot-events/date=2016-07-26/part-00002-db1f5e34-9c10-405a-a9b9-755caf76d127.c000.snappy.parquet,part-00002-db1f5e34-9c10-405a-a9b9-755caf76d127.c000.snappy.parquet,2683,1684694003000
dbfs:/user/vishal.abnave@borregaard.com/delta/iot-events/date=2016-07-26/part-00003-1d3b260d-db3b-4a92-9a0a-565198bad72f.c000.snappy.parquet,part-00003-1d3b260d-db3b-4a92-9a0a-565198bad72f.c000.snappy.parquet,2655,1684694003000
dbfs:/user/vishal.abnave@borregaard.com/delta/iot-events/date=2016-07-26/part-00004-511f5833-c704-457d-acde-a564cb5f7f12.c000.snappy.parquet,part-00004-511f5833-c704-457d-acde-a564cb5f7f12.c000.snappy.parquet,2611,1684694003000
dbfs:/user/vishal.abnave@borregaard.com/delta/iot-events/date=2016-07-26/part-00005-231550ff-206c-469e-83f2-3d2ac0f034a8.c000.snappy.parquet,part-00005-231550ff-206c-469e-83f2-3d2ac0f034a8.c000.snappy.parquet,2805,1684694003000
dbfs:/user/vishal.abnave@borregaard.com/delta/iot-events/date=2016-07-26/part-00006-30501fca-f2a8-4fa3-b04b-2246b2cfc1fd.c000.snappy.parquet,part-00006-30501fca-f2a8-4fa3-b04b-2246b2cfc1fd.c000.snappy.parquet,2729,1684694003000
dbfs:/user/vishal.abnave@borregaard.com/delta/iot-events/date=2016-07-26/part-00007-21f661c5-585c-424a-8a2d-f1381c4367fe.c000.snappy.parquet,part-00007-21f661c5-585c-424a-8a2d-f1381c4367fe.c000.snappy.parquet,2759,1684694003000
dbfs:/user/vishal.abnave@borregaard.com/delta/iot-events/date=2016-07-26/part-00008-1f0c0707-88c9-429e-a7ee-83373333e401.c000.snappy.parquet,part-00008-1f0c0707-88c9-429e-a7ee-83373333e401.c000.snappy.parquet,2749,1684694003000
dbfs:/user/vishal.abnave@borregaard.com/delta/iot-events/date=2016-07-26/part-00009-49fc7090-604c-498d-a5f9-909578dd0125.c000.snappy.parquet,part-00009-49fc7090-604c-498d-a5f9-909578dd0125.c000.snappy.parquet,2713,1684694003000
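
To quantify how small these files are, a listing like the one above can be summarized in a few lines. This is a sketch under assumptions: `summarize_parquet_files` is a hypothetical helper, and the `FileInfo` tuple with its sample paths is a stand-in for the entries returned by `dbutils.fs.ls`, which expose `path`, `name`, and `size`.

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.ls (hypothetical sample data).
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])


def summarize_parquet_files(entries):
    """Return (file_count, total_bytes, mean_bytes) for the .parquet entries."""
    sizes = [e.size for e in entries if e.name.endswith(".parquet")]
    total = sum(sizes)
    mean = total / len(sizes) if sizes else 0.0
    return len(sizes), total, mean


sample = [
    FileInfo("dbfs:/example/part-00000.snappy.parquet", "part-00000.snappy.parquet", 2662),
    FileInfo("dbfs:/example/part-00001.snappy.parquet", "part-00001.snappy.parquet", 2645),
    FileInfo("dbfs:/example/part-00002.snappy.parquet", "part-00002.snappy.parquet", 2683),
]
count, total, mean = summarize_parquet_files(sample)
print(count, total, round(mean, 1))
```

In the notebook, the same helper could be applied to `dbutils.fs.ls(iotPath + "date=2016-07-26")`.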


CAUTION: Run the following query and notice that it is slow because of the large number of small files.

In [0]:
%sql
SELECT * FROM iot_data where deviceId=92

action,time,date,deviceId
Open,1469652094,2016-07-27,92
Open,1469616802,2016-07-27,92
Close,1469597377,2016-07-27,92
Open,1469619261,2016-07-27,92
Close,1469634527,2016-07-27,92
Open,1469649277,2016-07-27,92
Open,1469600928,2016-07-27,92
Close,1469607492,2016-07-27,92
Open,1469638947,2016-07-27,92
Open,1469649405,2016-07-27,92


### Data Skipping and ZORDER

Delta Lake uses two mechanisms to speed up queries.

<b>Data Skipping</b> is a performance optimization that aims at speeding up queries that contain filters (WHERE clauses).

For example, we have a data set that is partitioned by `date`.

A query using `WHERE date > '2016-07-26'` would not access data that resides in partitions for `2016-07-26` or earlier.
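
The idea can be sketched with per-file minimum and maximum values. The example below is a toy model, not Delta Lake's implementation: `FileStats`, `files_to_read`, and the file paths are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file statistics: min and max of the `date` column."""
    path: str
    min_value: str
    max_value: str


def files_to_read(files, lower_bound):
    """Keep only files that may contain rows with value > lower_bound."""
    return [f.path for f in files if f.max_value > lower_bound]


files = [
    FileStats("date=2016-07-25/part-0.parquet", "2016-07-25", "2016-07-25"),
    FileStats("date=2016-07-26/part-0.parquet", "2016-07-26", "2016-07-26"),
    FileStats("date=2016-07-27/part-0.parquet", "2016-07-27", "2016-07-27"),
]

# ISO-formatted dates compare correctly as plain strings.
selected = files_to_read(files, "2016-07-26")
print(selected)  # ['date=2016-07-27/part-0.parquet']
```

A file is skipped whenever its maximum value proves that no row in it can satisfy the filter, so the query never opens it.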

<b>ZOrdering</b> is a technique to colocate related information in the same set of files.

ZOrdering maps multidimensional data to one dimension while preserving locality of the data points.

Given a column that you want to perform ZORDER on, say `OrderColumn`, Delta
* takes existing parquet files within a partition
* maps the rows within the parquet files according to `OrderColumn` using the algorithm described <a href="https://en.wikipedia.org/wiki/Z-order_curve" target="_blank">here</a> (with only one column, this mapping becomes a linear sort)
* rewrites the sorted data into new parquet files
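
The mapping step can be illustrated with the classic bit-interleaving form of the Z-order curve. This is a minimal sketch for two small integer columns. `z_value` is a hypothetical helper, and Delta's own implementation may differ in detail.

```python
def z_value(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return code


# A 4x4 grid of points, sorted along the Z-order curve.
points = [(x, y) for y in range(4) for x in range(4)]
ordered = sorted(points, key=lambda p: z_value(*p))

# Write the sorted points into "files" of 4 points each.
files = [ordered[i:i + 4] for i in range(0, len(ordered), 4)]
print(files[0])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Each "file" ends up holding a compact 2x2 block of the grid, so a filter on either coordinate touches fewer files than it would with a plain row-by-row sort.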

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> You cannot use a partition column as a ZORDER column.

#### ZORDER Technical Overview

A brief example of how this algorithm works (refer to [this blog](https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html) for more details):

![](https://files.training.databricks.com/images/adbcore/zorder.png)

Legend:
- Gray dot = data point e.g., chessboard square coordinates
- Gray box = data file; in this example, we aim for files of 4 points each
- Yellow box = data file that’s read for the given query
- Green dot = data point that passes the query’s filter and answers the query
- Red dot = data point that’s read, but doesn’t satisfy the filter; “false positive”



#### ZORDER example
In the image below, table `Students` has 4 columns:
* `gender` with 2 distinct values
* `Pass_Fail` with 2 distinct values
* `Class` with 4 distinct values
* `Student` with many distinct values

Suppose you wish to perform the following query:

> `SELECT Name FROM Students WHERE gender = 'M' AND Pass_Fail = 'P' AND Class = 'Junior'`<br>
`ORDER BY Gender, Pass_Fail`

The most effective way of performing that search is to order the data starting with the largest set, which is `Gender` in this case.

If you're searching for `gender = 'M'`, then you don't even have to look at students with `gender = 'F'`.

Note that this technique only works if all `gender = 'M'` values are co-located.


<div><img src="https://files.training.databricks.com/images/eLearning/Delta/zorder.png" style="height: 300px"/></div><br/>

#### ZORDER usage

With Delta Lake the notation is:

> `OPTIMIZE Students`<br>
`ZORDER BY Gender, Pass_Fail`

This will ensure all the data backing `Gender = 'M'` is colocated, then data associated with `Pass_Fail = 'P'` is colocated.

See Additional Topics & Resources below for more details on the algorithms behind ZORDER.

Using ZORDER, you can order by multiple columns as a comma-separated list; however, the effectiveness of the locality drops with each additional column.

In streaming, where incoming events are inherently ordered (more or less) by event time, use `ZORDER` to sort by a different column, say `userID`.
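
If you prefer to issue the command from Python rather than a `%sql` cell, the statement can be assembled as a string first. This is a sketch only: `optimize_statement` is a hypothetical helper that does no identifier quoting or validation, so it should only be used with trusted table and column names.

```python
def optimize_statement(table, zorder_columns=()):
    """Build an OPTIMIZE statement, optionally with a ZORDER BY clause."""
    statement = f"OPTIMIZE {table}"
    if zorder_columns:
        statement += f" ZORDER BY ({', '.join(zorder_columns)})"
    return statement


print(optimize_statement("iot_data", ["deviceId"]))
# OPTIMIZE iot_data ZORDER BY (deviceId)
```

In the notebook, the resulting string could then be passed to `spark.sql(...)`.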

In [0]:
%sql
OPTIMIZE iot_data
ZORDER by (deviceId)

path,metrics
dbfs:/user/vishal.abnave@borregaard.com/delta/iot-events,"List(3, 600, List(64788, 262068, 183414.0, 3, 550242), List(1444, 3117, 2412.8483333333334, 600, 1447709), 3, List(minCubeSize(107374182400), List(0, 0), List(600, 1447709), 0, List(600, 1447709), 3, null), 1, 600, 0, false, 0, 0, 1684694033126, 1684694041405, 4, 3, null, List(0, 0), 4, 4, 1206)"


In [0]:
%sql
SELECT * FROM iot_data WHERE deviceId=92

action,time,date,deviceId
Open,1469652094,2016-07-27,92
Open,1469616802,2016-07-27,92
Close,1469597377,2016-07-27,92
Open,1469619261,2016-07-27,92
Open,1469600928,2016-07-27,92
Close,1469607492,2016-07-27,92
Open,1469638947,2016-07-27,92
Open,1469649405,2016-07-27,92
Open,1469632502,2016-07-27,92
Close,1469634527,2016-07-27,92


## VACUUM

To save on storage costs, you should occasionally clean up invalid files using the `VACUUM` command.

Invalid files are files that are no longer part of the current table version, for example the small files that `OPTIMIZE` has already compacted into a larger file.

The syntax of the `VACUUM` command is
>`VACUUM name-of-table RETAIN number-of HOURS;`

The `number-of` parameter is the <b>retention interval</b>, specified in hours.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Databricks does not recommend you set a retention interval shorter than seven days because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table.

The scenario here is:
1. User A starts a query against the uncompacted files, then
2. User B invokes a `VACUUM` command, which deletes the uncompacted files
3. User A's query fails because the underlying files have disappeared

Invalid files can also result from updates/upserts/deletions.
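
The role of the retention interval can be sketched as a simple cutoff check. This is a toy model under stated assumptions: `vacuum_candidates` and the `invalidated_at` mapping are hypothetical, and the real command works from the table's transaction log rather than from a dictionary like this one.

```python
from datetime import datetime, timedelta


def vacuum_candidates(invalidated_at, now, retain_hours=168):
    """Return paths whose invalidation time is older than the retention interval.

    `invalidated_at` maps a file path to the time it stopped being part of
    the current table version (a hypothetical bookkeeping structure).
    The default of 168 hours corresponds to seven days.
    """
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(path for path, ts in invalidated_at.items() if ts < cutoff)


now = datetime(2023, 5, 21, 18, 0)
invalidated_at = {
    "part-00000.parquet": now - timedelta(days=10),   # compacted long ago
    "part-00001.parquet": now - timedelta(hours=1),   # compacted an hour ago
}

print(vacuum_candidates(invalidated_at, now))                  # ['part-00000.parquet']
print(vacuum_candidates(invalidated_at, now, retain_hours=0))  # both files
```

With a seven-day interval, the file compacted an hour ago survives, so a query that started before the compaction can still read it. With `retain_hours=0` it is removed immediately, which is exactly the failure scenario described above.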

More details are provided here: <a href="https://docs.databricks.com/delta/optimizations.html#garbage-collection" target="_blank"> Garbage Collection</a>.

In [0]:
len(dbutils.fs.ls(iotPath + "date=2016-07-26"))

201

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> In the example below we set off an immediate `VACUUM` operation with an override of the retention check so that all files are cleaned up immediately.

Do not do this in production!

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> If using Databricks Runtime 5.1, in order to use a retention time of 0 hours, the following flag must be set.

In [0]:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", False)

In [0]:
%sql

VACUUM iot_data RETAIN 0 HOURS;

path
dbfs:/user/vishal.abnave@borregaard.com/delta/iot-events


<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Notice how much cleaner the directory looks now!

In [0]:
len(dbutils.fs.ls(iotPath + "date=2016-07-26"))

1

## Summary
Delta Lake offers key features that allow for query optimization and garbage collection, resulting in improved performance.

## Additional Topics & Resources

* <a href="https://docs.databricks.com/delta/optimizations.html#" target="_blank">Optimizing Performance and Cost</a>
* <a href="http://parquet.apache.org/documentation/latest/" target="_blank">Parquet Metadata</a>
* <a href="https://en.wikipedia.org/wiki/Z-order_curve" target="_blank">Z-Order Curve</a>