Support DynamoDB catalog in Iceberg #12173

Closed
wants to merge 3 commits into from

ebyhr
Member

@ebyhr ebyhr commented Apr 28, 2022

Description

Support DynamoDB catalog in Iceberg
Fixes #9953

Documentation

(x) Sufficient documentation is included in this PR.

Release notes

(x) Release notes entries required with the following suggested text:

# Iceberg
* Add support for DynamoDB catalog. ({issue}`9953`)

@cla-bot cla-bot bot added the cla-signed label Apr 28, 2022
@github-actions github-actions bot added the docs label Apr 28, 2022
@ebyhr ebyhr force-pushed the ebi/iceberg-dynamodb-catalog branch 2 times, most recently from a802ffc to 8d1a22e Compare May 2, 2022 05:58
@ebyhr ebyhr force-pushed the ebi/iceberg-dynamodb-catalog branch from 8d1a22e to 0ddd7a5 Compare May 2, 2022 07:36
@findepi
Member

findepi commented May 4, 2022

Do we have / want to have some compatibility tests with Spark?

@ebyhr
Member Author

ebyhr commented May 9, 2022

I guess it's hard to test with dockerized DynamoDB at this time. I sent PR apache/iceberg#4726 to the Iceberg repository.

@ebyhr ebyhr marked this pull request as draft May 17, 2022 06:43
@peay

peay commented May 19, 2022

@ebyhr looking forward to this, thanks for the changes!

I gave this branch a try, and it might be missing software.amazon.awssdk:s3 in trino-iceberg/pom.xml, leading to

java.lang.NoClassDefFoundError: software/amazon/awssdk/services/s3/model/ObjectCannedACL
    at org.apache.iceberg.aws.AwsProperties.<init>(AwsProperties.java:289)
    at io.trino.plugin.iceberg.catalog.dynamodb.DynamoDbClientFactory.createDynamoClient(DynamoDbClientFactory.java:69)
...

With this small addition, I was able to query an Iceberg DynamoDB catalog, although I hit another issue when trying to actually query some data from a table created from Spark. This might be unrelated/off-topic so feel free to ignore. After selecting a catalog and a schema, show tables does show a test table, but select or describe indicate that it does not exist:

trino:develop> show tables;
  Table
----------
test_table

trino:develop> select * from test_table limit 5;
Query 20220519_140253_00003_r7jzn failed: line 1:15: Table 'iceberg.develop.test_table' does not exist

trino:develop> describe test_table;
Query 20220519_141137_00004_r7jzn failed: line 1:1: Table 'iceberg.develop.test_table' does not exist

The coordinator logs do indicate that BaseMetastoreCatalog.loadTable at least succeeded:

2022-05-19T14:02:53.346Z	INFO	Query-20220519_140253_00003_r7jzn-142	org.apache.iceberg.BaseMetastoreTableOperations	Refreshing table metadata from new version: s3://warehouse-bucket/develop.db/test_table/metadata/00051-4fc1d8bb-8e4b-4db7-9372-aa7cfaddb385.metadata.json
2022-05-19T14:02:55.038Z	INFO	Query-20220519_140253_00003_r7jzn-142	org.apache.iceberg.BaseMetastoreCatalog	Table loaded by catalog: iceberg.develop.test_table

This doesn't occur for tables created from Trino: DESCRIBE and SELECT work fine on them. The only difference I could spot is that my tables created from Spark have their data objects at s3://bucket/database.db/..., but tables created from Trino have their data objects at s3://bucket/database/... without the .db extension, for some reason.

edit: actually, my Spark-written tables are format-version: 1, while Trino's are format-version: 2, so this is likely unrelated to the DynamoDB catalog, sorry for the noise!

@ebyhr
Member Author

ebyhr commented May 23, 2022

@peay Thanks for letting us know. I guess the failure comes from another dependency change. I will resume work and confirm the issue after the Iceberg community releases the next version.

@etiennecl

Now that Iceberg has officially released 0.14 with support for overriding the DynamoDB endpoint, is there any plan to promote this PR to ready soon?

@peay

peay commented Aug 24, 2022

@ebyhr I gave this PR another try, and I still can't read tables created by Spark. I assumed above that it was an Iceberg format version issue but after testing out that hypothesis, that doesn't seem to be the case.

One thing I've observed is that when creating or writing to a table from Trino with the DynamoDB catalog, two JSON table metadata files are written:

  • DynamoDB item has, say, p.metadata_location=s3://bucket/table/metadata/A1.json

  • A1.json has no snapshots, but properties.metadata_location pointing to another JSON table metadata file A2.json:

    {
      "properties" : {
        "metadata_location" : "s3://bucket/table/metadata/A2.json",   <--- points to another metadata file
        "write.format.default" : "ORC"
      },
      "current-snapshot-id" : -1,
      "snapshots" : [ ],
      "snapshot-log" : [ ],

  • A2.json then has the snapshot(s), and properties.metadata_location is unset:

    {
      "properties" : {
        "write.format.default" : "ORC"
      },
      "current-snapshot-id" : 8449590332803853466,
      "snapshots" : [ {
         ...
      } ]
    }

Both metadata files are written at the same timestamp when creating or committing to the table. However, this does not occur when writing with Spark 3.2.1 + Iceberg 0.13.1 and the DynamoDB catalog: a single table metadata file is written every time, and it does not have properties.metadata_location. I suspect this difference is what prevents the Trino implementation from reading the table written by Spark, although I do not yet understand exactly what's happening and why we have two metadata files.
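
To check whether a given table has this two-file layout, the pointer chain can be followed by hand. Below is a minimal sketch in Python, assuming the metadata JSON files have already been downloaded and parsed into dicts keyed by their location. The function name `follow_metadata_chain` and the `load` callback are hypothetical names for illustration, not part of Trino or Iceberg, and the file contents are abridged from the excerpts above.

```python
from typing import Callable, List


def follow_metadata_chain(
    start: str, load: Callable[[str], dict], max_hops: int = 10
) -> List[str]:
    """Return the metadata locations visited, starting at `start`.

    `load` is a hypothetical callback that returns the parsed JSON for a
    location (for example, a wrapper around an S3 download).
    """
    chain = [start]
    current = start
    for _ in range(max_hops):
        props = load(current).get("properties", {})
        nxt = props.get("metadata_location")
        # Stop when there is no further pointer, or when a loop is detected.
        if nxt is None or nxt in chain:
            break
        chain.append(nxt)
        current = nxt
    return chain


# In-memory stand-ins for the two files described above (contents abridged).
files = {
    "s3://bucket/table/metadata/A1.json": {
        "properties": {
            "metadata_location": "s3://bucket/table/metadata/A2.json",
            "write.format.default": "ORC",
        },
        "current-snapshot-id": -1,
        "snapshots": [],
    },
    "s3://bucket/table/metadata/A2.json": {
        "properties": {"write.format.default": "ORC"},
        "current-snapshot-id": 8449590332803853466,
        "snapshots": [{}],
    },
}

chain = follow_metadata_chain("s3://bucket/table/metadata/A1.json", files.__getitem__)
print(chain)
```

A chain longer than one entry would indicate the Trino-style layout, where the file referenced from the DynamoDB item points at a second metadata file. A chain of exactly one entry would match what Spark writes.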

@ebyhr
Member Author

ebyhr commented Sep 27, 2022

Closing as I have no bandwidth to continue this PR.

@ebyhr ebyhr closed this Sep 27, 2022
@ebyhr ebyhr deleted the ebi/iceberg-dynamodb-catalog branch September 27, 2022 12:41
Development

Successfully merging this pull request may close these issues.

Feature Request: support for Iceberg Dynamodb catalogues
5 participants