Cache Iceberg column types in Glue for faster access #18315
(just rebased)
CI hit #18446
That's a very good point.
Will rebase and add a test.
(just rebased)
No need to build intermediate `Map<.., Integer>`.
Outdated after project rename.
Introduce separate column class representing column serialized in file metastore.
Make `information_schema.columns` queries faster by storing the necessary information directly in Glue, so that loading Iceberg metadata from storage is not needed.

- Table comment is stored as the `comment` Glue table parameter.
- Column Trino type is stored as the Glue column type (for some cases) or as the `trino_type` column parameter. This is because the column type written to Glue follows Iceberg's own `org.apache.iceberg.aws.glue.IcebergToGlueConverter.toTypeString` and is not always accurate, so this piece of information may be lossy. In such cases the column parameter is used to store the Trino type.
  - For greater compatibility we could store the Iceberg type string, but Iceberg lacks an API to reconstruct a Type from a string (except for primitive types). This is not surprising, as that would need IDs for all fields, which are not needed to answer metadata queries.
- Column NOT NULL constraint is stored as the `trino_not_null=true` column parameter (omitted for nullable columns).

Before the above cached information is used, the following conditions are checked:

- The `trino_table_metadata_info_valid_for` table property must be set to the current metadata location. This ensures that the cached information is invalidated whenever the metadata location changes.
- At least one column must have the `trino_type` property set. This ensures the cached information is not used when column parameters were lost in transit or otherwise erased.

`iceberg.glue.cache-table-metadata` serves as a kill-switch for the new functionality (both write and read parts).
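The write side described in this commit message could be sketched roughly as follows. This is only an illustration, not the code from the PR: the class name, the method names, and the `glueTypeIsLossy` flag are hypothetical, and the exact rule for when `trino_type` is written is an assumption. Only the parameter names (`comment`, `trino_type`, `trino_not_null`, `trino_table_metadata_info_valid_for`) come from the text above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the parameters written to Glue; not the PR's actual code.
public class GlueColumnParametersSketch
{
    // Parameter names as given in the commit message
    static final String COMMENT = "comment";
    static final String TRINO_TYPE = "trino_type";
    static final String TRINO_NOT_NULL = "trino_not_null";
    static final String VALID_FOR = "trino_table_metadata_info_valid_for";

    // Table-level parameters: the optional comment plus the metadata location
    // for which the cached information is valid.
    static Map<String, String> tableParameters(Optional<String> comment, String metadataLocation)
    {
        Map<String, String> parameters = new LinkedHashMap<>();
        comment.ifPresent(value -> parameters.put(COMMENT, value));
        parameters.put(VALID_FOR, metadataLocation);
        return parameters;
    }

    // Column-level parameters. ASSUMPTION: the caller decides whether the Glue
    // type string would lose information (glueTypeIsLossy) and the Trino type
    // is written as a parameter only in that case.
    static Map<String, String> columnParameters(String trinoType, boolean glueTypeIsLossy, boolean nullable)
    {
        Map<String, String> parameters = new LinkedHashMap<>();
        if (glueTypeIsLossy) {
            parameters.put(TRINO_TYPE, trinoType);
        }
        if (!nullable) {
            // Omitted entirely for nullable columns
            parameters.put(TRINO_NOT_NULL, "true");
        }
        return parameters;
    }
}
```

Keeping the NOT NULL marker absent for nullable columns means a column with no parameters at all is read back as nullable, which matches the "omitted for nullable columns" wording.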
test added ( |
again. |
Make `information_schema.columns` queries faster by storing the necessary information directly in Glue, so that loading Iceberg metadata from storage is not needed.

- Table comment is stored as the `comment` Glue table parameter.
- Column Trino type is stored as the Glue column type (for some cases) or as the `trino_type` column parameter. This is because the column type written to Glue follows Iceberg's own `org.apache.iceberg.aws.glue.IcebergToGlueConverter.toTypeString` and is not always accurate, so this piece of information may be lossy. In such cases the column parameter is used to store the Trino type.
  - For greater compatibility we could store the Iceberg type string, but Iceberg lacks an API to reconstruct a Type from a string (except for primitive types). This is not surprising, as that would need IDs for all fields, which are not needed to answer metadata queries.
- Column NOT NULL constraint is stored as the `trino_not_null=true` column parameter (omitted for nullable columns).

Before the above cached information is used, the following conditions are checked:

- The `trino_table_metadata_info_valid_for` table property must be set to the current metadata location. This ensures that the cached information is invalidated whenever the metadata location changes.
- At least one column must have the `trino_type` property set. This ensures the cached information is not used when column parameters were lost in transit or otherwise erased.

`iceberg.glue.cache-table-metadata` serves as a kill-switch for the new functionality (both write and read parts).
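For readers skimming the description, the read-side checks could look roughly like the sketch below. It is a simplified illustration, not the code in this PR: the class and method names are hypothetical, Glue tables and columns are modeled as plain parameter maps, and the `cacheEnabled` flag stands in for the `iceberg.glue.cache-table-metadata` setting.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the validity checks; not the PR's actual code.
public class GlueCacheValiditySketch
{
    // Parameter names as given in the description
    static final String VALID_FOR = "trino_table_metadata_info_valid_for";
    static final String TRINO_TYPE = "trino_type";

    static boolean isCacheUsable(
            boolean cacheEnabled,
            Map<String, String> tableParameters,
            List<Map<String, String>> columnParameters,
            String currentMetadataLocation)
    {
        // Kill-switch: when disabled, always fall back to loading Iceberg metadata
        if (!cacheEnabled) {
            return false;
        }
        // Condition 1: cached information must be valid for the current metadata location
        if (!currentMetadataLocation.equals(tableParameters.get(VALID_FOR))) {
            return false;
        }
        // Condition 2: at least one column must still carry trino_type, which guards
        // against column parameters having been lost or erased
        return columnParameters.stream()
                .anyMatch(parameters -> parameters.containsKey(TRINO_TYPE));
    }
}
```

When the method returns `false`, the caller would load the table metadata from storage as before, so a stale or stripped Glue entry only costs performance and not correctness.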
Alternative to #18299