-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Store NDV stats in Iceberg Puffin statistics file #15400
Store NDV stats in Iceberg Puffin statistics file #15400
Conversation
It's always same as the object field.
This is in preparation for introducing a statistics writer.
31bbd34
to
ea68ce7
Compare
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsWriter.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
Show resolved
Hide resolved
Hi @findepi , i have a question: is there a following change for computing incremental ndv sketch, and merging with previous existing sketch? thanks. |
yes, @yittg , we will make a PR with incremental sketch updates soonish |
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsWriter.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsWriter.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
Show resolved
Hide resolved
Iceberg 1.1.0 brought a standard way of storing and sharing statistics information. This commits makes use of it, deprecating statistics stored in Trino-specific table properties.
ea68ce7
to
5ab0204
Compare
Thanks for review! AC |
|
||
// TODO (https://github.com/trinodb/trino/issues/15397): remove support for Trino-specific statistics properties | ||
// Drop all stats. Empty table needs none | ||
UpdateProperties updateProperties = transaction.updateProperties(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we have two blocks to remove the old properties if they exist. Would be better to have one at the top before the empty snapshotId check. But if we're going to remove this soon anyway, probably doesn't matter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if we're going to remove this soon anyway, probably doesn't matter
exactly
{ | ||
if (!isExtendedStatisticsEnabled(session)) { | ||
return ImmutableMap.of(); | ||
} | ||
|
||
ImmutableMap.Builder<Integer, Long> ndvByColumnId = ImmutableMap.builder(); | ||
icebergTable.properties().forEach((key, value) -> { | ||
Set<Integer> remainingColumnIds = new HashSet<>(columnIds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're reading a column that's a projected dereference, will it show up here? Might need to filter those out if they don't have NDV stats, otherwise they'll always stick around in the remaining set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're reading a column that's a projected dereference, will it show up here?
i think this looks only for top level primitive fields -- at least that was the intention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, note that:
trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
Line 2317 in 0f3ac07
ImmutableSet.of(), // projectedColumns don't affect stats |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looking a few layers farther these are definitely only base columns. I don't think they're necessarily primitives though, could include struct/map/array types. Do we collect NDVs for those during analyze?
Iceberg 1.1.0 brought a standard way of storing and sharing statistics
information. This commits makes use of it, deprecating statistics stored
in Trino-specific table properties.