Support branching of Iceberg tables #12844

findinpath · 2022-06-14T21:05:27Z

trino:default> insert into "test1@3692406374410249224" values (10);
INSERT: 1 row

Query 20220614_205840_00020_m6z2x, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
1.16 [0 rows, 0B] [0 rows/s, 0B/s]

trino:default> insert into test1 FOR VERSION AS OF 3692406374410249224 values (10);
Query 20220614_205904_00021_m6z2x failed: line 1:19: mismatched input 'FOR'. Expecting: '(', <query>
insert into test1 FOR VERSION AS OF 3692406374410249224 values (10)

By using the now deprecated syntax, we are able to do insertions into a versioned Iceberg table.
However, when using the syntax AS OF this functionality is not anymore supported.

This functionality will likely require the grammar SqlBase.g4 to be extended.

The text was updated successfully, but these errors were encountered:

alexjo2144 · 2022-06-14T21:15:48Z

This shouldn't work right? You should only be able to write data to the most recent version of a table.

electrum · 2022-06-14T21:17:12Z

This was never intended behavior. Does this do anything different than inserting into the base table?

homar · 2022-06-14T21:21:28Z

Inserting into an old snapshot looks strange. Instead of linear history we would get a tree.

findinpath · 2022-06-14T21:26:59Z

Instead of linear history we would get a tree.

Think of zero-copy clones.
Instead of copying GB of data with a CTAS statement and waiting until a copy of the table is done we could have several independent "branches" of a table.

findinpath · 2022-06-14T21:27:50Z

Does this do anything different than inserting into the base table?

No, it doesn't at the time of this writing.

alexjo2144 · 2022-06-14T21:35:51Z

I'd reword this as a broader issue to support branching and tagging then. I'm not sure how far that project has made it into implementation though. @rdblue would know.

rdblue · 2022-06-14T22:26:44Z

What is the actual behavior when you write to a table at some version? Does it create a snapshot with that version as the parent?

Right now, branching and tagging is about adding the metadata to the table. We haven't added the ability to commit to a branch yet. We are thinking that could look something similar to what is above, but the branch needs to be identified by name. In the Iceberg API, we'll add toBranch:

table.newAppend()
    .toBranch("test")
    .addFile(FILE_A)
    .commit();

In SQL, that could be INSERT INTO 'table@branch' ... but it could also be INSERT INTO table BRANCH branch .... It's up to engines to decide (though hopefully we can agree on similar syntax).

I don't think that we would allow doing the same thing with tags or individual commits (which are very similar to tags). Since those represent points in time, it makes no sense.

martint · 2022-06-14T23:05:36Z

By using the now deprecated syntax, we are able to do insertions into a versioned Iceberg table.

That should probably be rejected, as it’s semantically meaningless and potentially confusing based on current behavior. It’s impossible to insert into a specific version of a table without the versión fundamentally changing.

Think of zero-copy clones.
Instead of copying GB of data with a CTAS statement and waiting until a copy of the table is done we could have several independent "branches" of a table.

There’s no concept of branches in standard SQL. That brings up all sorts of complexities such as:

How would you refer to a branch?
Can you ever merge branches or are they effectively independent tables at that point?

Also, note that a CTAS doesn’t necessarily need to copy data or be “slow”. It’s an implementation detail whether the data is read and written or a more efficient approach such as creating a O(1) snapshot is used (not currently supported by connector APIs or the engine, though)

findepi · 2022-06-15T09:49:40Z

By using the now deprecated syntax, we are able to do insertions into a versioned Iceberg table.

That should probably be rejected, as it’s semantically meaningless and potentially confusing based on current behavior. It’s impossible to insert into a specific version of a table without the versión fundamentally changing.

agreed. @findinpath can you please file a separate bug issue about this?

findinpath · 2022-06-15T10:01:32Z

@findepi I've created #12860

rdblue · 2022-06-16T23:56:45Z

Can you ever merge branches or are they effectively independent tables at that point?

@martint, I am pushing against any expectation that branches are ever merged. The use cases we're targeting are:

Tags that are friendly names for certain snapshots that allow you to manage the snapshot's lifecycle differently. For example, audit_2022Q1
Branches for code testing. Run new code and test results, without needing to copy existing data. When the new code is validated, push to production and stop writing to the branch, or re-point main to the branch's state.
Branches for large atomic changes. In some cases you may want to break up a large operation into multiple writes or use a pattern where you write, audit the result, and then publish it. These patterns use a "fast-forward" pattern where the table is temporarily branched and then master is updated to the head of a branch.

Merging branches is a really bad idea because it is essentially a transaction that is open for a long time. The chances of being able to commit the transaction safely get lower and lower as the branch ages. And what's even worse is that you'd have to encode all of the transaction information (like what other tables were read and the versions that were read) into the branch. Branches for transactions are a really, really bad idea.

itaise · 2023-10-03T12:13:42Z

@rdblue Are iceberg branches fully supported in Trino?
I can't find resources which shows the relevant DDL,
thanks!

findinpath · 2023-10-04T10:48:47Z

@itaise Are iceberg branches fully supported in Trino?

Trino supports now querying by tag/branch name - see https://trino.io/docs/current/connector/iceberg.html#time-travel-queries
However, Trino does not have currently support for creating tags or branches.

nicor88 · 2023-10-12T08:31:16Z

@findinpath is there any plan to support "creating" tags or branches from Trino? This will make easy to implement WAP for example with dbt.

findinpath · 2023-10-12T08:33:41Z

@nicor88 this topic is actually currently being analyzed.
Please dive in with a more detailed usage (the more details the better) scenario of what you have in mind.

itaise · 2023-10-12T08:44:07Z

I will just say that what we would like to do with data quality is a WAP process like in this article:
https://www.dremio.com/blog/streamlining-data-quality-in-apache-iceberg-with-write-audit-publish-branching/
Write to a specific branch, audit this branch (with data quality tools) and if results are good, make this branch the main branch.

kmurra · 2023-11-03T16:32:52Z

Hello @findinpath, we have a use case for a WAP (write-audit-publish) workflow. Basically, we want our users to have a good way to test changes to their transform logic. Due to compliance requirements, we can't use production data in our test environment, but it is very difficult to get representative data in our test environment from test systems. Since our organization uses a service-oriented architecture, test environments have test data populated in isolation and so invariants are often not respected (i.e. a transaction must be assigned to an existing customer with contact information -- always true in prod, not so in test).

A write-audit-publish workflow fixes this because we could have users check in their code and deploy it to some environment between test and prod. The code would run against a branch of production data, and then verify the two branches of data. If the user thinks the changes are fine, we publish the data, and make future runs of their transform logic automatically publish the change.

Essentially, we would like a feature that is similar to Snowflake's zero-copy clone.

I can go into more detail if necessary.

lorransr · 2024-03-12T13:21:59Z

We have a similar use case for WAP:

Currently, we use DBT to transform data, and more than 40 people work concurrently on our project. We have duplicated our lake zones so that each developer can run their tests. This duplicated infrastructure is also used to run our CI, ensuring that any modification won’t break our pipeline.

To reduce data duplication, we could use WAP. Another case is testing incremental problems. For example, I could create a branch from a table and test if the DBT incremental model would work on that table. We already face challenges in ensuring the same behavior as in production, and WAP makes this possible.

Another option is to include a step in our pipeline. For instance, for some tables, we need to ensure a certain level of quality. Therefore, we only publish if we meet this quality score, so we branch them, test them, and if everything is ok we commit the changes

findinpath added enhancement New feature or request syntax-needs-review labels Jun 14, 2022

findinpath changed the title ~~Iceberg add support for inserting into versioned table with AS OF syntax~~ Support branching of Iceberg tables Jun 14, 2022

deniskuzZ mentioned this issue Aug 24, 2023

HIVE-27630: Iceberg: Fast forward branch. apache/hive#4627

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support branching of Iceberg tables #12844

Support branching of Iceberg tables #12844

findinpath commented Jun 14, 2022

alexjo2144 commented Jun 14, 2022

electrum commented Jun 14, 2022

homar commented Jun 14, 2022

findinpath commented Jun 14, 2022

findinpath commented Jun 14, 2022

alexjo2144 commented Jun 14, 2022

rdblue commented Jun 14, 2022

martint commented Jun 14, 2022

findepi commented Jun 15, 2022

findinpath commented Jun 15, 2022

rdblue commented Jun 16, 2022

itaise commented Oct 3, 2023

findinpath commented Oct 4, 2023

nicor88 commented Oct 12, 2023

findinpath commented Oct 12, 2023

itaise commented Oct 12, 2023

kmurra commented Nov 3, 2023 •

edited

lorransr commented Mar 12, 2024

Support branching of Iceberg tables #12844

Support branching of Iceberg tables #12844

Comments

findinpath commented Jun 14, 2022

alexjo2144 commented Jun 14, 2022

electrum commented Jun 14, 2022

homar commented Jun 14, 2022

findinpath commented Jun 14, 2022

findinpath commented Jun 14, 2022

alexjo2144 commented Jun 14, 2022

rdblue commented Jun 14, 2022

martint commented Jun 14, 2022

findepi commented Jun 15, 2022

findinpath commented Jun 15, 2022

rdblue commented Jun 16, 2022

itaise commented Oct 3, 2023

findinpath commented Oct 4, 2023

nicor88 commented Oct 12, 2023

findinpath commented Oct 12, 2023

itaise commented Oct 12, 2023

kmurra commented Nov 3, 2023 • edited

lorransr commented Mar 12, 2024

kmurra commented Nov 3, 2023 •

edited