Iceberg: Use table schema corresponding to snapshot in snapshot queries #12786

findinpath · 2022-06-10T09:06:57Z

Description

When performing a snapshot/time travel query, use the table schema corresponding to the snapshot.

Is this change a fix, improvement, new feature, refactoring, or other?

Bugfix

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

How would you describe this change to a non-technical end user or system administrator?

When an Iceberg table's structure evolves over time (columns are added or dropped), a snapshot/time travel query should return output whose schema matches the schema of the queried table snapshot.

Related issues, pull requests, and links

Fixes #12743

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Iceberg
* Use table schema corresponding to snapshot in snapshot queries

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java

findinpath · 2022-06-10T15:12:46Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

+    private Schema getSchema(ConnectorSession session, IcebergTableHandle table)
+    {
+        Table icebergTable = catalog.loadTable(session, table.getSchemaTableName());
+        if (table.getSnapshotId().isEmpty() || table.getSnapshotId().get() == icebergTable.currentSnapshot().snapshotId()) {


table.getSnapshotId() is always populated, even outside snapshot queries, because IcebergSplitManager and TableStatisticsMaker use it to determine whether the table has data.

Because this value is always populated, there is currently no way to tell whether a specific snapshot of the table is being queried.

This is desired behavior (explained here: #12786 (comment))

What you discovered is that the snapshot-id doesn't fully describe the table's state when there was an ADD COLUMN (and no further INSERT yet).
We just need to capture what identifies the table state. I assume this is the (snapshot-id, schema-id) pair. The table handle should therefore carry both.

I've modified the logic of the getTableHandle() method so that the correct tableSchemaJson and partitionSpecJson are carried by the IcebergTableHandle.
Thank you @findepi for pointing me in this direction.
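
To illustrate why the snapshot id alone is not enough, here is a minimal sketch in plain Java. It does not use the Iceberg or Trino APIs; `ToyTable`, `TableState`, and all method names are hypothetical stand-ins invented for this example. It only models the situation described above: an ADD COLUMN changes the schema id without producing a new snapshot.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SnapshotSchemaSketch
{
    // Hypothetical handle content: the pair that identifies table state in this sketch.
    record TableState(long snapshotId, int schemaId) {}

    // Toy table metadata (an assumption for illustration, not the Iceberg API).
    static final class ToyTable
    {
        private final Map<Integer, List<String>> schemas = new LinkedHashMap<>(); // schemaId -> column names
        private final Map<Long, Integer> snapshotSchema = new LinkedHashMap<>();  // snapshotId -> schemaId at commit time
        private int currentSchemaId = -1;
        private long currentSnapshotId = -1;

        // Schema change (for example ADD COLUMN): new schema id, no new snapshot.
        void setSchema(List<String> columns)
        {
            currentSchemaId++;
            schemas.put(currentSchemaId, List.copyOf(columns));
        }

        // Data change (for example INSERT): new snapshot that remembers the schema it was written with.
        void insert()
        {
            currentSnapshotId++;
            snapshotSchema.put(currentSnapshotId, currentSchemaId);
        }

        TableState currentState()
        {
            return new TableState(currentSnapshotId, currentSchemaId);
        }

        TableState stateAt(long snapshotId)
        {
            Integer schemaId = snapshotSchema.get(snapshotId);
            if (schemaId == null) {
                throw new IllegalArgumentException("Unknown snapshot: " + snapshotId);
            }
            return new TableState(snapshotId, schemaId);
        }

        List<String> columns(TableState state)
        {
            return schemas.get(state.schemaId());
        }
    }

    public static void main(String[] args)
    {
        ToyTable table = new ToyTable();
        table.setSchema(List.of("a"));
        table.insert();                      // snapshot 0, written with schema 0
        table.setSchema(List.of("a", "b"));  // ADD COLUMN, still snapshot 0

        System.out.println(table.columns(table.stateAt(0)));     // prints [a]
        System.out.println(table.columns(table.currentState())); // prints [a, b]
    }
}
```

Both lookups refer to snapshot 0, yet they resolve to different column lists, which is why the sketch keys the state on both ids.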

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

alexjo2144 · 2022-06-13T20:22:41Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java


-        return new ConnectorTableMetadata(table, columns.build(), getIcebergTableProperties(icebergTable), getTableComment(icebergTable));
+        return new ConnectorTableMetadata(table, columns, getIcebergTableProperties(icebergTable), getTableComment(icebergTable));


Do table comments / properties also have a history we should be looking through during time travel?

If we want to be purist about time travel, then yes.

The table metadata corresponding to a snapshot contains the properties as well.

The table properties would probably need to be carried by the IcebergTableHandle (to avoid obtaining them on the fly by going through the table metadata files and causing unnecessary I/O operations).

cc @findepi @electrum

alexjo2144 · 2022-06-13T20:25:10Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java


-        ImmutableList.Builder<ColumnMetadata> columns = ImmutableList.builder();
-        columns.addAll(getColumnMetadatas(icebergTable));
-        columns.add(pathColumnMetadata());


Nit: personally, I think it makes sense to add the path column to the list here. The other method is responsible for columns that come from the Iceberg schema.

I did this refactor to avoid code duplication in the methods where the column metadata is built.
Should I try a better name for the method?
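
The split suggested in this thread could look roughly like the sketch below. It is plain Java with column names as strings; `schemaColumns` and `tableColumns` are hypothetical names, and the hidden column name `$path` is an assumption made for illustration rather than something taken from the diff.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnListSketch
{
    // Hypothetical helper: returns only the columns that come from the (snapshot's) Iceberg schema.
    static List<String> schemaColumns(List<String> schemaFieldNames)
    {
        return List.copyOf(schemaFieldNames);
    }

    // The caller appends connector-level hidden columns, so the helper above stays schema-only.
    static List<String> tableColumns(List<String> schemaFieldNames)
    {
        List<String> columns = new ArrayList<>(schemaColumns(schemaFieldNames));
        columns.add("$path"); // assumed name of the hidden path column
        return List.copyOf(columns);
    }

    public static void main(String[] args)
    {
        System.out.println(tableColumns(List.of("id", "name")));
    }
}
```

Keeping the hidden column out of the helper avoids duplication while leaving the helper's responsibility narrow, which addresses both points raised above.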

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

ebyhr

Left initial comments.

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

findepi

"Retrieve schema and partition spec depending on the table snapshot"

findepi · 2022-07-22T13:30:58Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

+    {
+        return snapshotIds.computeIfAbsent(
+                table.name() + "@" + id,
+                ignored -> IcebergUtil.resolveSnapshotId(table, id, allowLegacySnapshotSyntax));


allowLegacySnapshotSyntax should be part of the cache key.

(This is pre-existing, but still something to fix.)

trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergUtil.java

Lines 221 to 232 in 9b7c1b9

    public static long resolveSnapshotId(Table table, long snapshotId, boolean allowLegacySnapshotSyntax)
    {
        if (!allowLegacySnapshotSyntax) {
            throw new TrinoException(
                    NOT_SUPPORTED,
                    format(
                            "Failed to access snapshot %s for table %s. This syntax for accessing Iceberg tables is not "
                                    + "supported. Use the AS OF syntax OR set the catalog session property "
                                    + "allow_legacy_snapshot_syntax=true for temporarily restoring previous behavior.",
                            snapshotId,
                            table.name()));
        }

If allowLegacySnapshotSyntax is false, we'd get a Trino exception.

I don't see the purpose of this request.

"If allowLegacySnapshotSyntax is false, we'd get a Trino exception."

Only if the snapshotIds entry doesn't exist yet.
cc @phd3
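
The concern can be reproduced with a small stand-alone sketch. It uses only `java.util.Map#computeIfAbsent`; the `resolve` method is a hypothetical stand-in for `IcebergUtil.resolveSnapshotId`, and `UnsupportedOperationException` stands in for the `TrinoException`. With a key of `table@id`, a value cached while the flag was `true` is returned later even when the flag is `false`, so the check is skipped.

```java
import java.util.HashMap;
import java.util.Map;

final class SnapshotIdCacheSketch
{
    private final Map<String, Long> cache = new HashMap<>();

    // Hypothetical stand-in for the resolver: rejects the legacy syntax unless it is allowed.
    static long resolve(long id, boolean allowLegacySnapshotSyntax)
    {
        if (!allowLegacySnapshotSyntax) {
            throw new UnsupportedOperationException("Legacy snapshot syntax is not allowed");
        }
        return id;
    }

    // Key without the flag: a cached entry bypasses the check on later calls.
    long lookupWithoutFlag(String table, long id, boolean allowLegacySnapshotSyntax)
    {
        return cache.computeIfAbsent(
                table + "@" + id,
                ignored -> resolve(id, allowLegacySnapshotSyntax));
    }

    // Key with the flag: each flag value gets its own entry, so the check always runs once per value.
    long lookupWithFlag(String table, long id, boolean allowLegacySnapshotSyntax)
    {
        return cache.computeIfAbsent(
                table + "@" + id + "@" + allowLegacySnapshotSyntax,
                ignored -> resolve(id, allowLegacySnapshotSyntax));
    }

    public static void main(String[] args)
    {
        SnapshotIdCacheSketch withoutFlag = new SnapshotIdCacheSketch();
        withoutFlag.lookupWithoutFlag("t", 1L, true);
        // No exception here, although the flag is false: the cached entry is returned.
        System.out.println(withoutFlag.lookupWithoutFlag("t", 1L, false));

        SnapshotIdCacheSketch withFlag = new SnapshotIdCacheSketch();
        withFlag.lookupWithFlag("t", 1L, true);
        try {
            withFlag.lookupWithFlag("t", 1L, false);
        }
        catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Whether the two flag values can actually occur against the same cache instance depends on the lifetime of the metadata object, which this sketch does not model.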

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

findepi · 2022-07-25T15:06:15Z

ACK the small change (https://github.com/trinodb/trino/compare/ae09364cc7eae9e7dc13a4b8e9f84e5fcbc4a6de..3270605c060f415edcdf0a2579a312323e296a18)

The CI didn't run due to a merge conflict.
Can you please rebase (just rebase) before making further changes?

findinpath · 2022-07-26T15:28:54Z

Followed up on your hint @alexjo2144 #12786 (comment) and made the partitionSpecJson field in the IcebergTableHandle optional.

alexjo2144

Looks reasonable to me. Just take a look at the conflicts.

alexjo2144 · 2022-07-27T14:58:52Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

@@ -338,7 +351,7 @@ public IcebergTableHandle getTableHandle(
                Optional.empty());
    }

-    private static long getSnapshotIdFromVersion(Table table, ConnectorTableVersion version)
+    private long getSnapshotIdFromVersion(Table table, ConnectorTableVersion version)


Is this necessary?

findepi

"Retrieve table schema depending on the table snapshot"

findepi · 2022-07-28T12:16:13Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

+    {
+        return snapshotIds.computeIfAbsent(
+                table.name() + "@" + id,
+                ignored -> IcebergUtil.resolveSnapshotId(table, id, allowLegacySnapshotSyntax));


"If allowLegacySnapshotSyntax is false, we'd get a Trino exception."

Only if the snapshotIds entry doesn't exist yet.
cc @phd3

findepi · 2022-07-28T12:17:33Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java

+            verify(table.getPartitionSpecJson().isPresent(), "The table handle must contain the partion spec definition");
+            return new IcebergPageSink(
+                    tableSchema,
+                    PartitionSpecParser.fromJson(tableSchema, table.getPartitionSpecJson().get()),


table.getPartitionSpecJson().orElseThrow(() -> new VerifyException("Partition spec missing in the table handle"))

and revert other changes here (addition of { ... })
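
For comparison, here are the two forms side by side as a self-contained sketch. `IllegalStateException` is used as a stand-in for Guava's `VerifyException` so the example needs no external dependency, and the class and method names are hypothetical.

```java
import java.util.Optional;

final class PartitionSpecAccessSketch
{
    // Two-step form, as in the diff: check presence first, then call get().
    static String verifyThenGet(Optional<String> partitionSpecJson)
    {
        if (partitionSpecJson.isEmpty()) {
            throw new IllegalStateException("The table handle must contain the partition spec definition");
        }
        return partitionSpecJson.get();
    }

    // Single-expression form, as suggested in the review.
    static String orElseThrow(Optional<String> partitionSpecJson)
    {
        return partitionSpecJson.orElseThrow(
                () -> new IllegalStateException("Partition spec missing in the table handle"));
    }

    public static void main(String[] args)
    {
        System.out.println(orElseThrow(Optional.of("{\"spec-id\":0}")));
        try {
            orElseThrow(Optional.empty());
        }
        catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The single-expression form keeps the check and the access together, so there is no separate `get()` call that could drift away from its guard.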

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergTableHandle.java

For time travel queries, the partition spec is intentionally not retrieved, because doing so would involve going through all the metadata files of the table to find the initial metadata file (containing the partition spec) that corresponds to the specified table snapshot.

When an Iceberg table's structure evolves over time (columns are added or dropped), a snapshot/time travel query should return output whose schema matches the schema of the queried table snapshot.

addressed

cla-bot bot added the cla-signed label Jun 10, 2022

findinpath commented Jun 10, 2022

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java (resolved)

findepi requested review from homar, ebyhr and alexjo2144 June 10, 2022 09:15

findinpath force-pushed the iceberg-snaphsot-query-schema branch from dccbfe2 to 3035458 Compare June 10, 2022 09:19

homar approved these changes Jun 10, 2022

findinpath commented Jun 10, 2022

electrum previously requested changes Jun 10, 2022

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java (outdated, resolved)

findinpath force-pushed the iceberg-snaphsot-query-schema branch 5 times, most recently from 4327df3 to e28fab9 Compare June 13, 2022 09:59

homar approved these changes Jun 13, 2022

alexjo2144 reviewed Jun 13, 2022

findinpath force-pushed the iceberg-snaphsot-query-schema branch 2 times, most recently from 489a3d5 to ee3a5b0 Compare June 14, 2022 12:56

findinpath force-pushed the iceberg-snaphsot-query-schema branch from ee3a5b0 to cd5782a Compare June 22, 2022 08:01

findinpath requested a review from electrum June 22, 2022 08:10

findinpath force-pushed the iceberg-snaphsot-query-schema branch 2 times, most recently from 9e778d2 to 741f4d1 Compare June 27, 2022 06:09

findinpath requested review from homar and alexjo2144 June 27, 2022 13:08

homar reviewed Jun 28, 2022

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java (outdated, resolved)

findinpath force-pushed the iceberg-snaphsot-query-schema branch from 741f4d1 to e362c8d Compare July 6, 2022 11:56

ebyhr reviewed Jul 7, 2022

findinpath force-pushed the iceberg-snaphsot-query-schema branch 2 times, most recently from 0b9e6ce to 279413c Compare July 8, 2022 05:35

findinpath force-pushed the iceberg-snaphsot-query-schema branch from 279413c to ae09364 Compare July 22, 2022 03:31

findepi requested a review from ebyhr July 22, 2022 13:17

findepi reviewed Jul 22, 2022

findinpath force-pushed the iceberg-snaphsot-query-schema branch from ae09364 to 3270605 Compare July 25, 2022 05:17

findinpath force-pushed the iceberg-snaphsot-query-schema branch 2 times, most recently from daa59bb to 641cc85 Compare July 26, 2022 15:21

alexjo2144 approved these changes Jul 27, 2022

findinpath force-pushed the iceberg-snaphsot-query-schema branch 3 times, most recently from 824b41b to 9c4ade2 Compare July 28, 2022 06:40

findepi reviewed Jul 28, 2022

findinpath force-pushed the iceberg-snaphsot-query-schema branch 2 times, most recently from 4d3fd06 to 4bb99a3 Compare July 29, 2022 11:10

findinpath requested a review from findepi July 29, 2022 11:11

findinpath added 3 commits July 30, 2022 17:01

Verify accuracy of reading from versioned table

c30d109

findinpath force-pushed the iceberg-snaphsot-query-schema branch from 4bb99a3 to c30d109 Compare July 30, 2022 15:05

findepi requested a review from alexjo2144 August 1, 2022 16:12

findepi approved these changes Aug 1, 2022

findepi requested review from electrum and removed request for electrum August 1, 2022 16:19

findepi merged commit ee3fd2f into trinodb:master Aug 2, 2022

github-actions bot added this to the 392 milestone Aug 2, 2022

findepi mentioned this pull request Aug 2, 2022

Release notes for 392 #13320

Closed

colebow mentioned this pull request Aug 2, 2022

Add Trino 392 release notes #13342

Closed

xiacongling mentioned this pull request Aug 11, 2022

Iceberg: use native implementation to obtain snapshot schema #13614

Merged

findinpath mentioned this pull request Sep 9, 2022

Use table schema from the table handle #14076

Merged
