Enable by default checkpoint filtering in Delta Lake #20901
Should we hold off on this until we know whether #20851 works as a workaround? The current caching solution optimizes for frequent queries on slowly changing tables with a moderate number of files. Checkpoint filtering optimizes for less frequent queries on tables with wide schemas and a large number of files (and gives simpler code if we can drop the caching code completely).
Not everyone is going to want to, or be able to, use the file-system cache. This change still makes sense as the default behaviour, modulo thinking about the impact of repeated reads.
Description
Enable the session property `checkpoint_filtering_enabled` by default.
This change reduces the amount of metadata read for a Delta Lake query, because only the information relevant to the query is read from the checkpoint file.
No longer caching the metadata of the active data files means a massively reduced memory footprint for the Delta Lake connector.
The potential downside of this change, compared to caching the active data files, is that their metadata needs to be read from storage for every single query.
Potential solution to alleviate this problem: #20851
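Because the property would now be on by default, a session that wants to opt out could presumably override it per catalog. A minimal sketch, assuming the Delta Lake catalog is named `delta` (the catalog name is an assumption and not part of this PR):

```sql
-- Assumption: the Delta Lake catalog is named "delta"; substitute your own catalog name.
-- Opt out of checkpoint filtering for the current session only.
SET SESSION delta.checkpoint_filtering_enabled = false;

-- Return to the default value for the rest of the session.
RESET SESSION delta.checkpoint_filtering_enabled;
```

This only affects the current session, so it is a convenient way to compare query planning times with and without filtering on a given table before changing anything catalog-wide.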
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: