Enable by default checkpoint filtering in Delta Lake #20901

@findinpath findinpath commented Mar 1, 2024

Description

Enable by default the session property checkpoint_filtering_enabled.

This change reduces the amount of metadata read for a Delta Lake query, because only the information relevant to the query is read from the checkpoint file.
Because the connector no longer caches the metadata of the active data files, its memory footprint is massively reduced.
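
As a rough illustration of the idea described above, the following Python sketch contrasts loading every checkpoint entry with keeping only the entries a query needs. It is a toy model, not the connector's code: `AddFileEntry`, `read_active_files`, and the sample partition values are hypothetical names and data introduced for this example, and the filter here runs in memory over a list rather than against a real checkpoint file.

```python
from dataclasses import dataclass


@dataclass
class AddFileEntry:
    """Hypothetical stand-in for one active data file listed in a checkpoint."""
    path: str
    partition_values: dict


def read_active_files(checkpoint_entries, partition_filter=None):
    """Yield the entries that match partition_filter.

    With partition_filter=None every entry is yielded, which models loading
    the full set of active files. With a filter, only the entries relevant
    to the query are kept.
    """
    for entry in checkpoint_entries:
        if partition_filter is None or all(
            entry.partition_values.get(key) == value
            for key, value in partition_filter.items()
        ):
            yield entry


# Made-up checkpoint content, for illustration only.
checkpoint = [
    AddFileEntry("date=2024-03-01/part-0.parquet", {"date": "2024-03-01"}),
    AddFileEntry("date=2024-03-01/part-1.parquet", {"date": "2024-03-01"}),
    AddFileEntry("date=2024-03-02/part-0.parquet", {"date": "2024-03-02"}),
]

all_files = list(read_active_files(checkpoint))
relevant = list(read_active_files(checkpoint, {"date": "2024-03-02"}))
```

In this sketch a query restricted to one `date` value keeps a single entry instead of three; the smaller the relevant fraction, the less metadata has to be held per query.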

The potential downside of this change, compared to caching the active data files, is that their metadata must be read from storage for every single query.
Potential solution to alleviate this problem: #20851

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Delta Lake
* Improve latency for queries on Delta Lake tables with checkpoints ({issue}`20901`)

@cla-bot cla-bot bot added the cla-signed label Mar 1, 2024
@github-actions github-actions bot added docs delta-lake Delta Lake connector labels Mar 1, 2024
@findinpath findinpath self-assigned this Mar 1, 2024

@jkylling jkylling left a comment

Should we wait with this until we know that #20851 is a workaround? The current caching solution optimizes for frequent queries on slowly changing tables with a medium number of files. Checkpoint filtering optimizes for less frequent queries on tables with wide schemas and a large number of files (and gives simpler code if we can drop the caching code completely).
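
The trade-off raised in this comment can be made concrete with a small counting sketch. It is only a toy model under stated assumptions: `CachingReader`, `FilteringReader`, the table size, and the query count are all hypothetical, and the counters stand in for storage reads and retained entries rather than measuring anything real.

```python
class CachingReader:
    """Toy model: read the full file list once per table version, keep it in memory."""

    def __init__(self):
        self.cache = {}
        self.storage_reads = 0

    def active_files(self, table_version, checkpoint, wanted_partition):
        if table_version not in self.cache:
            self.storage_reads += 1  # one read per table version
            self.cache[table_version] = list(checkpoint)
        return [e for e in self.cache[table_version] if e[0] == wanted_partition]

    @property
    def entries_held(self):
        # Number of entries retained in memory between queries.
        return sum(len(entries) for entries in self.cache.values())


class FilteringReader:
    """Toy model: read the checkpoint on every query, keep only matching entries."""

    def __init__(self):
        self.storage_reads = 0
        self.entries_held = 0  # nothing is retained between queries

    def active_files(self, table_version, checkpoint, wanted_partition):
        self.storage_reads += 1  # one read per query
        return [e for e in checkpoint if e[0] == wanted_partition]


# Made-up table: 1000 files spread evenly over 100 partitions.
checkpoint = [(f"p{i % 100}", f"file-{i}.parquet") for i in range(1000)]

caching = CachingReader()
filtering = FilteringReader()
for _ in range(10):  # ten queries against the same table version
    cached_result = caching.active_files(1, checkpoint, "p7")
    filtered_result = filtering.active_files(1, checkpoint, "p7")
```

Under these assumptions the caching reader touches storage once but holds all 1000 entries, while the filtering reader touches storage ten times and holds none. Which side wins depends on query frequency, how often the table changes, and how large the file list is, which is the point of the comment.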

@findinpath findinpath force-pushed the findinpath/delta-lake-enable-by-default-checkpoint-filtering branch from 8eed66b to 56b3d94 Compare March 1, 2024 11:06

hashhar commented Mar 1, 2024

> Should we wait with this until we know that #20851 is a workaround?

Not everyone is going to want to, or be able to, use the file-system cache. This change still makes sense as the default behaviour, modulo thinking about the impact of repeated reads.

@findinpath findinpath force-pushed the findinpath/delta-lake-enable-by-default-checkpoint-filtering branch from 56b3d94 to 8ea4a7e Compare March 4, 2024 07:54
@findinpath findinpath force-pushed the findinpath/delta-lake-enable-by-default-checkpoint-filtering branch from 8ea4a7e to 9d3f15b Compare March 4, 2024 09:03
@raunaqmorarka raunaqmorarka merged commit e2993a6 into trinodb:master Mar 5, 2024
28 checks passed
@github-actions github-actions bot added this to the 440 milestone Mar 5, 2024