In [0]:
%run ./_resources/00-setup $reset_all_data=false

## Delta Lake Liquid Clustering

Data layout is key to query performance. Manual tuning through Hive-style partitioning is inefficient (it tends to create partitions that are too big or too small) and hard to maintain.

To solve this issue, Delta Lake released Liquid Clustering. Liquid automatically adjusts the data layout based on clustering keys, which helps avoid the over- or under-partitioning problems that can occur with Hive-style partitioning.

Liquid clustering can be specified on any column to provide fast access, including columns with high cardinality or skewed data.

* **Liquid is simple**: You set Liquid clustering keys on the columns that are most often queried - no more worrying about traditional considerations like column cardinality, partition ordering, or creating artificial columns that act as perfect partitioning keys.
* **Liquid is efficient**: It incrementally clusters new data, so you don't need to trade off improved performance against cost and write amplification.
* **Liquid is flexible**: You can quickly change which columns are clustered by Liquid without rewriting existing data.

In [0]:
-- Liquid will lay out the data to speed up queries filtering on firstname or lastname.
-- Add the CLUSTER BY clause to your standard table creation statement.
-- A clustered table can't also be partitioned.
CREATE OR REPLACE TABLE user_clustering CLUSTER BY (firstname, lastname)
  AS SELECT * FROM user_delta;

In [0]:
-- Liquid Clustering appears under "Clustering Information"
DESCRIBE TABLE user_clustering;
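
As a complementary check, `DESCRIBE DETAIL` returns table-level metadata in a single row. The sketch below assumes a recent Databricks Runtime, where the clustering keys are expected to appear in a `clusteringColumns` field; that field may be absent on older runtimes.

```sql
-- Show table-level metadata (location, number of files, size) for the clustered table.
-- Assumption: on recent runtimes the output also lists the clustering keys.
DESCRIBE DETAIL user_clustering;
```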

### How to trigger liquid clustering

Liquid clustering is incremental, meaning that data is only rewritten as necessary to accommodate data that needs to be clustered.

For best performance, Databricks recommends scheduling regular OPTIMIZE jobs to cluster data. 

For tables experiencing many updates or inserts, Databricks recommends scheduling an OPTIMIZE job every one or two hours. 

Because liquid clustering is incremental, most OPTIMIZE jobs for clustered tables run quickly. No need to specify any ZORDER columns.

*Note: Liquid clustering will automatically re-arrange your data during writes above a given threshold. As with all indexes, this will add a small write cost.*

In [0]:
-- Trigger liquid clustering:
OPTIMIZE user_clustering;

In [0]:
-- Periodically remove data files that are no longer referenced by the table and are older than the retention threshold:
VACUUM user_clustering;
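
Before deleting anything, it can help to preview what `VACUUM` would remove. This is a minimal sketch, assuming a 7-day retention window (168 hours) suits the table; adjust the value to your own retention needs.

```sql
-- List the files that would be deleted, without actually deleting them.
VACUUM user_clustering RETAIN 168 HOURS DRY RUN;
```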

In [0]:
SELECT * FROM user_delta where firstname = 'Teresa';

In [0]:
-- Our requests using firstname and lastname are now super fast!
SELECT * FROM user_clustering where firstname = 'Teresa';

### Dynamically changing your clustering columns

Liquid tables are flexible: you can change your clustering columns without having to rewrite all your data.

Let's make sure our table provides fast queries for ID:

In [0]:
ALTER TABLE user_clustering CLUSTER BY (id, firstname, lastname);
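
Changing the keys does not recluster existing files by itself; later `OPTIMIZE` runs apply the new keys incrementally. The sketch below shows both options. The `FULL` variant is left commented out because it is an assumption about your environment: it is only expected to be available on newer runtimes (Databricks Runtime 16.0 and above), and it rewrites far more data.

```sql
-- Incremental: cluster data that is not yet laid out on the current keys.
OPTIMIZE user_clustering;

-- Assumption: newer runtimes only. Forces reclustering of all records, which can be costly.
-- OPTIMIZE user_clustering FULL;
```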

In [0]:
-- Disable liquid clustering:
-- Note: this does not rewrite data that has already been clustered, but prevents future OPTIMIZE operations from using clustering keys.
ALTER TABLE user_clustering CLUSTER BY NONE;

### Compacting without Liquid Clustering

While recommended to accelerate your queries, some tables might not always have Liquid Clustering enabled.

Adding data to a table creates new files, and the table can quickly accumulate too many small files, which degrades performance over time.

This is especially true for streaming workloads, where new data is added every few seconds in near real time.

Just like for Liquid Clustering, Delta Lake solves this with the `OPTIMIZE` command, which optimizes the file layout for you, picking a proper file size based on heuristics. As no clustering keys are defined, this simply compacts the files.

In [0]:
OPTIMIZE user_delta;

-- The engine decided to compact 3 files into 1 ("numFilesAdded": 1, "numFilesRemoved": 3)
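
The compaction metrics quoted above are also recorded in the table history. A minimal sketch to inspect them after the fact:

```sql
-- Show the 5 most recent operations on the table.
-- OPTIMIZE entries report their metrics (files added and removed) in the operationMetrics column.
DESCRIBE HISTORY user_delta LIMIT 5;
```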

These maintenance operations have to be triggered frequently to keep the table properly optimized.

On Databricks, you can have your table optimized automatically, out of the box. All you have to do is set the [proper table properties](https://docs.databricks.com/optimizations/auto-optimize.html), and the engine will optimize your table when needed, without you having to run manual OPTIMIZE operations.

In [0]:
ALTER TABLE user_delta SET TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact = true
);
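
To confirm the properties were applied, you can read them back. A short sketch using `SHOW TBLPROPERTIES`, first for the whole table and then for a single key:

```sql
-- List every property set on the table.
SHOW TBLPROPERTIES user_delta;

-- Read back a single property by its key.
SHOW TBLPROPERTIES user_delta ('delta.autoOptimize.optimizeWrite');
```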

### Note: Auto Optimize with Liquid Clustering

Liquid Clustering automatically kicks off eager optimization once a write exceeds a given size, based on heuristics.

You can also turn on `delta.autoOptimize.optimizeWrite = true` on your liquid table starting from DBR 13.3 to make sure all writes will be optimized. 

While you can enable `delta.autoOptimize.autoCompact = true`, it won't have any effect for now (as of DBR 13.3, this might change in the future).

## Legacy file layout optimizations

Below are earlier Delta Lake optimizations that rely on Z-ordering and partitioning.

### ZORDER

ZORDER optimizes the file layout across multiple columns, but it is often used in addition to partitioning and is not as efficient as Liquid Clustering. It increases write amplification and won't solve small-partition issues.

### Adding indexes (ZORDER) to your table:

If you query your table using a specific predicate (e.g. username), you can speed up your queries by adding an index on these columns. We call this operation ZORDER.

You can ZORDER on any column, especially those with high cardinality (id, firstname, etc.).

*Note: We recommend staying below 4 ZORDER columns for better query performance.*

In [0]:
OPTIMIZE user_delta ZORDER BY (id, firstname);

-- Our next queries using a filter on id or firstname will be much faster:
SELECT * FROM user_delta where id = 4 or firstname = 'Quentin';

### Delta Lake Generated columns for dynamic partitions:

Partitioning a table stores rows that share the same value for a column under the same location. The engine can then read less data, which improves read performance.

Using Delta Lake, partitions can be generated based on an expression, and the engine will push down your predicate by applying the same expression, even if the query filters on the original field.

A typical use case is to partition by time (e.g. year, month, or even day).

Our user table has a `creation_date` field. We'll generate a `creation_day` field based on an expression and use it as partition for our table with `GENERATED ALWAYS`.

In addition, we'll let the engine generate an incremental ID.

*Note: Remember that partitions also create more files under the hood, so use them with care. Make sure you don't over-partition your table (aim for hundreds of partitions at most, each holding at least 1GB of data). We don't recommend partitioning tables smaller than 1TB. Use Liquid Clustering instead.*

In [0]:
CREATE TABLE IF NOT EXISTS user_delta_partition (
  id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 10000 INCREMENT BY 1), 
  firstname STRING, 
  lastname STRING, 
  email STRING, 
  address STRING, 
  gender INT, 
  age_group INT,
  creation_date timestamp, 
  creation_day DATE GENERATED ALWAYS AS (CAST(creation_date AS DATE))
  )
PARTITIONED BY (creation_day);

In [0]:
INSERT INTO user_delta_partition (firstname, lastname, email, address, 
                                  gender, age_group, creation_date) 
  SELECT firstname, lastname, email, address,
         gender, age_group, creation_date
  FROM user_delta;

In [0]:
DESCRIBE TABLE user_delta_partition;

In [0]:
SELECT * FROM user_delta_partition
WHERE creation_day = CAST(NOW() AS DATE);
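
The predicate push-down described earlier can be exercised by filtering on the original `creation_date` column instead of the generated partition column. This is a sketch only: the timestamp literal is an arbitrary placeholder, and whether partitions are actually pruned should be confirmed in the query plan or query profile.

```sql
-- Filter on the source column rather than the partition column.
-- Delta can derive a matching filter on creation_day from the
-- generation expression CAST(creation_date AS DATE).
SELECT * FROM user_delta_partition
WHERE creation_date >= TIMESTAMP '2023-01-01 00:00:00';
```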