@chenjian2664 chenjian2664 commented Mar 21, 2025

Introduces handling and pushdown of the metadata columns (`$file_modified_time`, `$path`, `$file_size`) in Delta Lake.

Description

Closes #25369

Release notes

## Delta Lake
* Add support for pushdown on `$path`, `$file_modified_time` and `$file_size` hidden columns. ({issue}`25369`)
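
To illustrate what pushdown on these hidden columns buys, here is a minimal sketch of file-level pruning. It is not the connector's code: `DataFile`, `MetadataColumnPruning`, the file paths, and the timestamps are all hypothetical, and the sketch only assumes that each data file carries a path, a size, and a modification time that can be compared against the predicate before any split is created.

```java
import java.time.Instant;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a Delta Lake data file entry; not a Trino class.
record DataFile(String path, long sizeInBytes, Instant modifiedTime) {}

final class MetadataColumnPruning
{
    private MetadataColumnPruning() {}

    // Keep only the files whose metadata satisfies the predicate,
    // so that no split needs to be created for the others.
    static List<DataFile> prune(List<DataFile> files, Predicate<DataFile> predicate)
    {
        return files.stream().filter(predicate).toList();
    }

    public static void main(String[] args)
    {
        // Made-up files for illustration only.
        List<DataFile> files = List.of(
                new DataFile("s3://bucket/table/part-0.parquet", 128, Instant.parse("2025-03-01T00:00:00Z")),
                new DataFile("s3://bucket/table/part-1.parquet", 4096, Instant.parse("2025-03-20T00:00:00Z")));

        // Models a filter such as:
        //   WHERE "$file_modified_time" > TIMESTAMP '2025-03-10 00:00:00 UTC'
        //     AND "$file_size" > 1024
        Instant cutoff = Instant.parse("2025-03-10T00:00:00Z");
        List<DataFile> selected = prune(
                files,
                file -> file.modifiedTime().isAfter(cutoff) && file.sizeInBytes() > 1024);

        // Only part-1.parquet satisfies both conditions.
        selected.forEach(file -> System.out.println(file.path()));
    }
}
```

With the sample data above, only `part-1.parquet` survives pruning, so a query with that filter would not need to read `part-0.parquet` at all.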

@cla-bot cla-bot bot added the cla-signed label Mar 21, 2025
@github-actions github-actions bot added the delta-lake Delta Lake connector label Mar 21, 2025
public void testOptimizeWithFileModifiedTimeColumn()
        throws Exception
{
    try (TestTable table = newTrinoTable("test_optimize_with_file_modified_time_", "(id INT)")) {
Contributor Author

The tests are copied from BaseIcebergConnectorTest with a few small changes:

  1. use MILLISECONDS.sleep(1);
  2. adapt to the result of getActiveFiles;
  3. use newTrinoTable to create the test table.

@github-actions github-actions bot added the docs label Mar 24, 2025
@chenjian2664 chenjian2664 changed the title from "Add support for pruning splits using file_modified_time in Delta Lake" to "Add support for pruning splits using metadata columns in Delta Lake" Mar 24, 2025
ebyhr commented Mar 24, 2025

@chenjian2664 Could you check the CI failure?

@chenjian2664 chenjian2664 force-pushed the delta_file_time branch 4 times, most recently from 3481036 to b1008c0 Compare March 25, 2025 03:55
@ebyhr ebyhr left a comment

This PR still has a correctness issue with the DELETE operation. You can reproduce it with the following steps:

CREATE TABLE test(a int);
INSERT INTO test VALUES 1;
INSERT INTO test VALUES 2;
SELECT "$path" FROM test;
DELETE FROM test WHERE "$path" = 'x'; -- removes all rows
TABLE test;
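
For reference, the expected semantics of the DELETE in these steps can be modeled in memory. This is only a sketch under stated assumptions: `StoredRow`, `DeleteByPathModel`, and the paths `file-1` and `file-2` are hypothetical, and the model says nothing about where the connector actually loses the predicate. It only shows that a `"$path"` literal matching no file must leave every row in place.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical row model: a value plus the path of the file that stores it.
record StoredRow(int a, String path) {}

final class DeleteByPathModel
{
    private DeleteByPathModel() {}

    // Returns the rows that survive: DELETE FROM test WHERE "$path" = pathLiteral
    static List<StoredRow> deleteWherePathEquals(List<StoredRow> rows, String pathLiteral)
    {
        List<StoredRow> remaining = new ArrayList<>();
        for (StoredRow row : rows) {
            // The predicate must be evaluated for every row (or every file);
            // skipping this check is equivalent to deleting everything.
            if (!row.path().equals(pathLiteral)) {
                remaining.add(row);
            }
        }
        return remaining;
    }

    public static void main(String[] args)
    {
        // Two single-row inserts produce two files (made-up paths).
        List<StoredRow> table = List.of(new StoredRow(1, "file-1"), new StoredRow(2, "file-2"));

        // No file is named 'x', so both rows must remain.
        System.out.println(deleteWherePathEquals(table, "x").size());
    }
}
```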

Extracted `partitionMatchesPredicate` from `DeltaLakeSplitManager` into a new utility class `DeltaLakeDomains`
@chenjian2664 chenjian2664 force-pushed the delta_file_time branch 2 times, most recently from 144888b to ef439e4 Compare March 27, 2025 04:42
Co-Authored-By: Marius Grama <findinpath@gmail.com>
The `$path` column is already used for pruning splits in
DeltaLakeSplitManager. This commit removes the `$path` domain
from the `remainingFilter` of the filter pushdown result.

Co-Authored-By: Marius Grama <findinpath@gmail.com>
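
The idea behind dropping the `$path` domain from `remainingFilter` can be sketched as a partition of the pushed-down constraint into a part the connector enforces and a part the engine still has to evaluate. The sketch below is an illustration, not the actual change: `PushdownSplit`, `Result`, and the string-valued predicates are hypothetical, with the plain `Map` acting as a simplified stand-in for Trino's `TupleDomain`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class PushdownSplit
{
    private PushdownSplit() {}

    // Hypothetical result holder: 'enforced' is handled by the connector,
    // 'remaining' must still be evaluated by the engine after the scan.
    record Result(Map<String, String> enforced, Map<String, String> remaining) {}

    // 'domains' maps a column name to a textual predicate. This is a simplified
    // stand-in for Trino's TupleDomain, not its real API.
    static Result split(Map<String, String> domains, Set<String> connectorEnforcedColumns)
    {
        Map<String, String> enforced = new LinkedHashMap<>();
        Map<String, String> remaining = new LinkedHashMap<>();
        domains.forEach((column, predicate) -> {
            if (connectorEnforcedColumns.contains(column)) {
                enforced.put(column, predicate);
            }
            else {
                remaining.put(column, predicate);
            }
        });
        return new Result(enforced, remaining);
    }

    public static void main(String[] args)
    {
        Map<String, String> domains = new LinkedHashMap<>();
        domains.put("$path", "= 's3://bucket/table/part-0.parquet'"); // made-up path
        domains.put("id", "> 10");

        // Assumption for the sketch: only "$path" is fully enforced by split pruning.
        Result result = split(domains, Set.of("$path"));
        System.out.println("enforced:  " + result.enforced().keySet());
        System.out.println("remaining: " + result.remaining().keySet());
    }
}
```

A column may only leave the remaining filter if the connector is guaranteed to apply its predicate on every code path that consumes the pushdown result; otherwise the predicate is silently lost.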
Co-Authored-By: Marius Grama <findinpath@gmail.com>
@ebyhr ebyhr force-pushed the delta_file_time branch from af40db1 to 0eae780 Compare March 27, 2025 22:52
ebyhr commented Mar 27, 2025

(Fixed a small typo)

@ebyhr ebyhr merged commit d64c1c8 into trinodb:master Mar 27, 2025
5 checks passed
@github-actions github-actions bot added this to the 475 milestone Mar 27, 2025
Linked issue: Delta Lake: OPTIMIZE with $file_modified_time WHERE filter does not work