Support SQL MERGE in the Trino engine and Hive and Kudu connectors #7386
Conversation
(Force-pushed from c8e9014 to 55edb9e.)
(not a review)
{
    throw new UnsupportedOperationException("This connector does not support row merge");
}
The connector may need to do a prolonged IO operation. If it does it synchronously, the query may be non-cancellable. Add CompletableFuture<?> isBlocked() and use it in MergeOperator.isBlocked.
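A minimal sketch of the suggested shape, with invented names (ExampleMergeSink, ExampleMergeOperator); the PR's actual interfaces and the engine's real operator signatures may differ:

import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the reviewer's suggestion; both type names are
// invented stand-ins for whatever the PR actually defines.
interface ExampleMergeSink
{
    // Completes when the sink can accept more merged rows, so a prolonged
    // connector IO operation stays asynchronous and the query stays cancellable.
    CompletableFuture<?> isBlocked();
}

class ExampleMergeOperator
{
    private final ExampleMergeSink sink;

    ExampleMergeOperator(ExampleMergeSink sink)
    {
        this.sink = sink;
    }

    // The operator surfaces the sink's future to the engine, which polls it
    // rather than parking a thread inside the connector.
    public CompletableFuture<?> isBlocked()
    {
        return sink.isBlocked();
    }
}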
As side conversations confirmed, this is correctly handled by the fact that AbstractRowChangeOperator and ScanFilterAndProjectOperator both talk to the pageSource.
Can you point me to where AbstractRowChangeOperator can return blocked? That is, it does not feel correct to push more updates into an updatable page source that declares "I am blocked", does it? (Also, if something required side conversations to confirm, it needs to be documented.)
(Six review threads on core/trino-spi/src/main/java/io/trino/spi/connector/ConnectorMetadata.java and two on core/trino-spi/src/main/java/io/trino/spi/connector/RowChangeParadigm.java were marked outdated and resolved.)
(Force-pushed from e8e58b8 to f637166.)
General question: how do you plan to document this in the connectors? Specifically, what do you document in the connectors where it is supported (Kudu and Hive), and what do we document in all the others? I suggest that we add a Limitations section to all other connectors and add … wdyt @electrum?

This is a repeated suggestion from @mosabua, and there is certainly a problem to be resolved: how can you tell whether a connector supports DELETE, UPDATE, or MERGE? There seem to be three choices:

I'm not sure which alternative is best, but the current situation, where you can only tell whether a connector supports an operation by trying it and seeing it fail, doesn't seem ideal.
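For illustration, one conceivable shape for an explicit capability check, loosely modeled on the capability-set pattern already present in the SPI; the enum, method, and constants below are invented for this sketch and are not the real API:

import java.util.Set;

// Invented names throughout: a connector advertises which row-change
// operations it supports, instead of failing only when the query runs.
enum ExampleRowChangeCapability
{
    DELETE,
    UPDATE,
    MERGE
}

interface ExampleConnector
{
    // Connectors that support nothing inherit the empty default, so tooling
    // and documentation generators can report support without trial and error.
    default Set<ExampleRowChangeCapability> getRowChangeCapabilities()
    {
        return Set.of();
    }
}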
(Force-pushed from f637166 to e1d32ef.)
I've only gone through a part of the execution, and took a look at the StatementAnalyzer. I had difficulties understanding the page transformations. One thing I'm concerned about is synthesizing AST in the StatementAnalyzer. That logic should probably be moved to the Planner.
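To make the concern concrete, here is a deliberately simplified sketch of the pattern being questioned; every name is invented, and the string stands in for a synthesized AST:

// All names invented. The analyzer fabricates a new query for the merge and
// re-analyzes it; the reviewer suggests the Planner should instead build the
// equivalent plan directly.
final class ExampleStatementAnalyzer
{
    record ExampleMergeStatement(String target, String source) {}

    record ExampleQuery(String text) {}

    ExampleQuery analyzeMerge(ExampleMergeStatement merge)
    {
        // Synthesized "AST" (a string stand-in): a join of target and source
        // fabricated inside analysis rather than during planning.
        return new ExampleQuery(
                "SELECT * FROM " + merge.target() + " JOIN " + merge.source() + " ON ...");
    }
}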
(Review threads on core/trino-main/src/main/java/io/trino/operator/ChangeOnlyUpdatedColumnsMergeProcessor.java, DuplicateRowFinder.java, and RowChangeProcessor.java were resolved; the latter two were marked outdated.)
    }
}

// The final version of MergeAnalysis, with the finalQuery
The finalQuery should not be created in the Analyzer. This belongs to the Planner. cc @martint
Let's get @martint's take on it.
Did you take a look at https://github.com/djsstarburst/trino/blob/david.stryker/support-sql-merge/docs/src/main/sphinx/develop/supporting-merge.rst? Can you suggest where additional documentation would make it clearer?
That's what I started out doing. What I found was that it seemed to require re-inventing all the analysis that went on in the StatementAnalysis phase. Moreover, the result seemed fragile. The implementation of … I'd like to get @martint's thoughts on this question.
(Force-pushed from e3ca571 to 1944681.)
}

@Test(groups = HIVE_TRANSACTIONAL, timeOut = TEST_TIMEOUT)
public void testMergeUnBucketedUnPartitionedFailure()
What failure is being tested here?
There is a log line at the end that implies the problem:
`log.info("Verifying MERGE on Hive fails - - and it shouldn't");`
But it needs a real comment. Here is the comment I added:
/**
* This test demonstrates a failure of Hive to verify the result of a MERGE operation,
* specifically, Hive fails to recognize the delete_delta file written by the MERGE. I
* captured the HDFS delta and delete_delta files and verified that they are correct.
* I used Wireshark to capture the traffic between Trino and the Hive metastore during
* the MERGE, and it was all as expected. I tried to vary the test to understand the
* issue, but almost any change I made to the test caused Hive to correctly verify the
* MERGE.
* TODO: Determine what is causing the Hive verification failure
*/
Thanks for pointing this out. This looks like a bug somewhere and requires an action item. Can you please make sure this gets properly addressed?
String sql = format("MERGE INTO %s t USING %s s ON (t.purchase = s.purchase)", targetTable, sourceTable) +
        " WHEN MATCHED AND s.purchase = 'limes' THEN DELETE" +
        " WHEN MATCHED THEN UPDATE SET customer = CONCAT(t.customer, '_', s.customer)" +
Do you have a test that would update the partition key (purchase here)?
Several of the tests update partition and bucket columns. For example, testMergeSimpleSelectPartitioned updates the address partition column at ccc584f#diff-ee4c5c10f12500119c780c8c1ba6b88df410f13b0282bd69e315fe32c5938ffaR1336. However, I would love to have suggestions for more tests to write. I'm always happy to write more tests, but lack imagination as to what would be good ones 😞
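As one concrete suggestion, a variant of the statement quoted above could rewrite the partition column itself; the aliases and column names follow the quoted test, but this exact case is a proposal, not code from the PR:

// Proposed variant (not from the PR): update the purchase partition column,
// forcing the engine to redistribute the changed rows to other partitions.
String sql = format("MERGE INTO %s t USING %s s ON (t.purchase = s.purchase)", targetTable, sourceTable) +
        " WHEN MATCHED AND s.purchase = 'limes' THEN UPDATE SET purchase = 'key limes'" +
        " WHEN MATCHED THEN UPDATE SET customer = CONCAT(t.customer, '_', s.customer)";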
(Force-pushed from 1944681 to 9212cdd.)
(A review thread on core/trino-main/src/main/java/io/trino/sql/analyzer/StatementAnalyzer.java was marked outdated and resolved.)
(Force-pushed from 3c58d20 to f78f9b8.)
(Force-pushed from a974359 to 777dfce.)
This commit adds support for SQL MERGE in the Trino engine. It introduces an enum RowChangeParadigm, which characterizes how a connector modifies rows. Hive and Iceberg will use the DELETE_ROW_AND_INSERT_ROW paradigm, since they represent an updated row as a deleted row and an inserted row. Kudu will use the CHANGE_ONLY_UPDATED_COLUMNS paradigm. Each paradigm corresponds to an implementation of the RowChangeProcessor interface. The intent is to retrofit SQL UPDATE to use the same RowChangeParadigm/Processor mechanism. The SQL MERGE implementation allows update of all columns, including partition or bucket columns, and the Trino engine performs redistribution to ensure that the updated rows end up on the appropriate nodes.
This commit adds SQL MERGE support in the Hive connector and a raft of MERGE tests to verify that it works.
(Force-pushed from 777dfce to ee06f4e.)
This PR is closed in favor of the improved implementation in #7933
This PR consists of three commits that add support for SQL MERGE in the Trino engine, the Hive connector and the Kudu connector. The implementation is structured so that most of the work happens in the Trino engine, so adding support in a connector is pretty simple.
The SQL MERGE implementation allows update of all columns, including partition or bucket columns, and the Trino engine performs redistribution to ensure that the updated rows end up on the appropriate nodes.
The Trino engine commit introduces an enum RowChangeParadigm, which characterizes how a connector modifies rows. Hive uses, and Iceberg will use, the DELETE_ROW_AND_INSERT_ROW paradigm, since they represent an updated row as a deleted row and an inserted row. Kudu uses the CHANGE_ONLY_UPDATED_COLUMNS paradigm. Each paradigm corresponds to an implementation of the RowChangeProcessor interface. After this PR is merged, the intent is to retrofit SQL UPDATE to use the same RowChangeParadigm/Processor mechanism. Extensive documentation on the internal MERGE architecture can be found in the developer doc supporting-merge.rst.
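A sketch of the shapes described above: the enum constants are the ones named in this PR, while the processor interface below is a guess at its general form, not the PR's exact definition:

import io.trino.spi.Page;

// Paradigm constants as named in the PR description.
enum RowChangeParadigm
{
    // Hive and Iceberg: an updated row is represented as a delete plus an insert.
    DELETE_ROW_AND_INSERT_ROW,
    // Kudu: an update carries only the columns that actually changed.
    CHANGE_ONLY_UPDATED_COLUMNS
}

// Hypothetical shape of the per-paradigm processor: it rewrites the page of
// merge results into the layout the connector's paradigm expects. The real
// RowChangeProcessor interface in the PR may differ.
interface ExampleRowChangeProcessor
{
    Page transformPage(Page inputPage);
}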
This is a big, complicated PR, and requires close review. @kasiafi, I'm hoping you can find time to do your usual excellent job of identifying what doesn't make sense and what needs to be improved.
For #7708