Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache Iceberg column types in Glue for faster access #18315

Merged
merged 12 commits into from
Aug 1, 2023

Conversation

findepi
Copy link
Member

@findepi findepi commented Jul 17, 2023

Make information_schema.columns queries faster by storing the
necessary directly in Glue so that loading Iceberg metadata from storage
is not need.

  • Table comment is stored as comment Glue table parameter.

  • Column Trino type is stored as Glue column type (for some cases) or as
    trino_type column table parameter. This is because, following
    Iceberg own's org.apache.iceberg.aws.glue.IcebergToGlueConverter.toTypeString
    the column type written to Glue is not accurate, so this piece of
    information may be lossy. In such cases the column parameter is used
    to store the Trino type.

    • For more compatibility we could store Iceberg type string, but
      Iceberg lacks API to reconstruct Type from string (except for
      primitive types). This is not surprising, as it needs IDs for all
      fields. Something that is not needed to answer metadata queries.
  • Column NOT NULL constraint is stored as trino_not_null=true column table
    parameter (omitted for nullable columns).

Before the above cached information is used, the following conditions
are checked

  • the trino_table_metadata_info_valid_for table property must be set
    to current metadata location. This ensures that the cached information
    is invalided whenever metadata location changes.

  • at least one column must have trino_type property set. This ensures
    the cached information is not used when column parameters were lost in
    transit or otherwise erased.

iceberg.glue.cache-table-metadata serves as a kill-switch for the new
functionality (both write & read parts).

Alternative to #18299

@cla-bot cla-bot bot added the cla-signed label Jul 17, 2023
@findepi findepi force-pushed the findepi/iceberg-is-columns-2 branch from da9a649 to dcc9e29 Compare July 17, 2023 15:51
@findepi
Copy link
Member Author

findepi commented Jul 20, 2023

(just rebased)

@findepi findepi force-pushed the findepi/iceberg-is-columns-2 branch 3 times, most recently from b187e54 to 0d1de42 Compare July 20, 2023 14:15
@findepi findepi force-pushed the findepi/iceberg-is-columns-2 branch from 0d1de42 to 0f4a14c Compare July 20, 2023 19:42
@findepi findepi changed the title Write Iceberg types to Glue metastore (v2) Cache Iceberg table types in Glue for faster access Jul 20, 2023
@findepi
Copy link
Member Author

findepi commented Jul 20, 2023

cc @pettyjamesm @phd3

@findepi findepi force-pushed the findepi/iceberg-is-columns-2 branch 2 times, most recently from 14c180e to 9973336 Compare July 20, 2023 20:27
@ebyhr
Copy link
Member

ebyhr commented Aug 1, 2023

CI hit #18446

@findepi
Copy link
Member Author

findepi commented Aug 1, 2023

IcebergMetadata#streamTableColumns calls TrinoCatalog#listTables. The method internally callsGetTablesResult that returns com.amazonaws.services.glue.model.Table.

That's a very good point.

@findepi
Copy link
Member Author

findepi commented Aug 1, 2023

will rebase and add a test.
last build https://github.com/trinodb/trino/actions/runs/5716561690/job/15493176390?pr=18315 is "green"

@findepi findepi force-pushed the findepi/iceberg-is-columns-2 branch from 7c32725 to aad0ca1 Compare August 1, 2023 07:31
@findepi
Copy link
Member Author

findepi commented Aug 1, 2023

(just rebased)

No need to build intermediate `Map<.., Integer>`.
Outdated after project rename.
Introduce separate column class representing column serialized in file
metastore.
Make `information_schema.columns` queries faster by storing the
necessary information directly in Glue so that loading Iceberg metadata
from storage is not need.

- Table comment is stored as `comment` Glue table parameter.

- Column Trino type is stored as Glue column type (for some cases) or as
  `trino_type` column table parameter. This is because, following
  Iceberg own's `org.apache.iceberg.aws.glue.IcebergToGlueConverter.toTypeString`
  the column type written to Glue is not accurate, so this piece of
  information may be lossy. In such cases the column parameter is used
  to store the Trino type.
  - For more compatibility we could store Iceberg type string, but
    Iceberg lacks API to reconstruct Type from string (except for
    primitive types). This is not surprising, as it needs IDs for all
    fields. Something that is not needed to answer metadata queries.

- Column NOT NULL constraint is stored as `trino_not_null=true` column table
  parameter (omitted for nullable columns).

Before the above cached information is used, the following conditions
are checked

- the `trino_table_metadata_info_valid_for` table property must be set
  to current metadata location. This ensures that the cached information
  is invalided whenever metadata location changes.

- at least one column must have `trino_type` property set. This ensures
  the cached information is not used when column parameters were lost in
  transit or otherwise erased.

`iceberg.glue.cache-table-metadata` serves as a kill-switch for the new
functionality (both write & read parts).
@findepi findepi force-pushed the findepi/iceberg-is-columns-2 branch from aad0ca1 to 63c10e5 Compare August 1, 2023 08:40
@findepi
Copy link
Member Author

findepi commented Aug 1, 2023

test added (io.trino.plugin.iceberg.catalog.glue.TestIcebergGlueCatalogAccessOperations#testInformationSchemaColumns)

@findepi
Copy link
Member Author

findepi commented Aug 1, 2023

CI hit #18446

again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector iceberg Iceberg connector
Development

Successfully merging this pull request may close these issues.

5 participants