Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support SQL UPDATE in the Presto Engine and Hive ACID Tables #5861

Merged

Conversation

djsstarburst
Copy link
Member

This PR consists of two commits. The first commit adds support for SQL UPDATE in the Presto engine. That commit also includes a new connector developer doc delete-and-update.rst that explains in details how both DELETE and UPDATE work, and documents the APIs that the connector developer must implement.

@martint, please take a close look at the QueryPlanner changes to support UPDATE. I think I got it right, but you are the expert.

Copy link
Member

@martint martint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial quick comments.

@djsstarburst djsstarburst force-pushed the david.stryker/hive-acid-update branch 2 times, most recently from 9713f42 to 4dd2668 Compare November 6, 2020 18:04
@martint martint self-requested a review November 6, 2020 19:57
@martint martint self-requested a review November 6, 2020 20:30
@djsstarburst
Copy link
Member Author

Thanks for the great tips, @martint! I've force-pushed an update with your suggested changes.

@djsstarburst djsstarburst force-pushed the david.stryker/hive-acid-update branch 4 times, most recently from c491f42 to 6bc9c60 Compare November 10, 2020 01:12
@abhishekkhanna1
Copy link

Hi @djsstarburst I have checked it out locally and running tests manually to do some sanity testing through my local setup for different scenarios. Will try to see if I can figure out any corner cases. Also going through the code and debugging to familiarize with the flow and the code structure so that going forward I can help with the Merge changes.

Copy link
Member

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with the first commit, except for the developer documentation

presto-docs/src/main/sphinx/develop.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/connector/hive.rst Outdated Show resolved Hide resolved
@djsstarburst djsstarburst force-pushed the david.stryker/hive-acid-update branch 2 times, most recently from 0cdedf5 to 25d4987 Compare November 19, 2020 04:32
@djsstarburst
Copy link
Member Author

Thanks very much for the close and tasteful review of the first commit, @electrum! I force-pushed all your suggested changes except for validation of projected symbols in UpdateNode, and the discussion of whether HiveColumnHandle.isHidden() columns should be removed from the list of eligible UPDATE SET targets.

@djsstarburst djsstarburst force-pushed the david.stryker/hive-acid-update branch 2 times, most recently from e25c048 to 2bf6506 Compare November 19, 2020 05:48
presto-docs/src/main/sphinx/develop/delete-and-update.rst Outdated Show resolved Hide resolved
Comment on lines 179 to 178
All column lists are in table column order, as are the ``updateExpressionChannelNumbers``. The size of
``updateExpressionChannelNumbers`` is one greater than the number of updated columns - - the last element of the
list is the channel number of the block of rowIds for the page, constructed from the rowId column handle.

``beginUpdate()`` performs any orchestration needed in the connector to start processing the ``UPDATE``.
This orchestration varies from connector to connector. In the Hive ACID connector, for example, ``beginUpdate()``
starts the Hive Metastore transaction; checks that the updated table is transactional and that neither

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might not be relevant, just thinking out loud. What would be the behavior if there is a concurrent Alter table which drops or renames the columns before the beginUpdate is invoked?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good question, and an area that is poorly tested. If you could think of a way to test commit and rollback it would be a big contribution. Here is the way I think it works:

If the ALTER TABLE transaction committed before beginUpdate() was called, the UPDATE operation would fail because the StatementAnalyzer would fail to find a referenced column. If the ALTER TABLE transaction was started but not committed by the time beginUpdate() was called, I think the call to metastore.declareIntentionToWrite() will cause a rollback of the beginUpdate() transaction when the ALTER TABLE transaction commits. But I'm guessing, and don't have tests to prove that it works that way :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the metastore was a proper transactional system that implemented optimistic concurrency, the commit of the UPDATE operation would fail due to the conflict. That's how it will work for Iceberg, which properly supports transaction semantics.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense that as long as we have OCC backed metastore, one of the conflicting transaction would be rejected. But, I was thinking can there be an Alter table request which makes the conflict not detectable because we rely on the names/order of the columns? e.g the scenarios listed here https://iceberg.apache.org/evolution/#correctness

    Formats that track columns by name can inadvertently un-delete a column if a name is reused, which violates #1.
    Formats that track columns by position cannot delete columns without changing the names that are used for each column, which violates #2.

@djsstarburst
Copy link
Member Author

Thanks for the review, @abhishekkhanna1! As I mentioned, it would be great if you could figure out a way to add even very simple concurrency tests for Hive ACID operations, since we have almost none right now.

@abhishekkhanna1
Copy link

  • I will look into the next steps for the Invalid method name: 'alter_table_req' issue as being discussed here: https://prestosql.slack.com/archives/CGB0QHWSW/p1605751292043600
  • Next I will review the other commit specific to Hive ACID Update
  • I will also start looking into ways to write tests for conflicting transactions and rollbacks etc

@djsstarburst djsstarburst force-pushed the david.stryker/hive-acid-update branch 8 times, most recently from 1f87ba4 to 70fe17c Compare January 8, 2021 22:40
@djsstarburst djsstarburst mentioned this pull request Jan 30, 2021
3 tasks
@abhishekkhanna1
Copy link

Hi @djsstarburst Can we push these commits while we work on improving the limitations as part of #3325 separately?

This commit adds parsing, analysis and query planning for SQL
UPDATE statements.  A second commit on top of this commit adds Hive ACID
support for UPDATE, and product tests that show UPDATE works
as expected.

This commit includes a new developer documentation page for connector
developers, delete-and-update.rst, that explains in detail how DELETE
and UPDATE work.  This document refers to the Hive ACID implementation of
UPDATE, coming in the second commit.

One note for the reader: before this commit, method ConnectorMetadata method
getUpdateRowIdColumnHandle() was used to get the rowId column handle for DELETE operations.
This was renamed to getDeleteRowIdColumnHandle(), to make way for the new
getUpdateRowIdColumnHandle(), used to get the rowId column handle for UPDATE operations.
@djsstarburst djsstarburst force-pushed the david.stryker/hive-acid-update branch 2 times, most recently from a11dd8d to 3fbdb8f Compare February 3, 2021 17:04
This commit implements SQL UPDATE for Hive ACID Tables, and adds
product tests that demonstrate that it works as expected.  Those tests
check:
o Expected failures cases - - e.g., non-transactional table; updating
  partition or bucket columns.
o Some or all columns of the table updated.
o A spanning set of cases of overlap between dependency columns and
  updated columns.
o A spanning set of cases of overlap between dependency columns and
  non-updated columns.
o Permuted order of columns updated compared to table declaration order.
o Unpartitioned, partitioned and bucketed table updates.
o Some or all columns updated to either litereal or computed NULLs.
@electrum electrum merged commit cf68459 into trinodb:master Feb 3, 2021
@electrum
Copy link
Member

electrum commented Feb 3, 2021

Thanks for the work on this @djsstarburst!

@djsstarburst djsstarburst deleted the david.stryker/hive-acid-update branch February 3, 2021 23:05
@martint martint added this to the 352 milestone Feb 4, 2021
@electrum electrum mentioned this pull request Feb 9, 2021
10 tasks
@XiaoqiMa
Copy link

XiaoqiMa commented Mar 1, 2021

@djsstarburst Hi, David. I have seen your commit. But do you have any demo code for us to learn how to implement UPDATE and DELETE for JDBC-based connectors. Thanks

@djsstarburst
Copy link
Member Author

@djsstarburst Hi, David. I have seen your commit. But do you have any demo code for us to learn how to implement UPDATE and DELETE for JDBC-based connectors. Thanks

I don't have example code, but how UPDATE works is documented in detail in:
https://github.com/trinodb/trino/blob/master/docs/src/main/sphinx/develop/delete-and-update.rst

throw new UnsupportedOperationException("This connector does not support row-level delete");
}

default void updateRows(Page page, List<Integer> columnValueAndRowIdChannels)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some info how this list should be decomposed, perhaps as a javadoc

@djsstarburst
Copy link
Member Author

Add some info how this list should be decomposed, perhaps as a javadoc

Have you seen the developer documentation for DELETE and UPDATE?
https://github.com/trinodb/trino/blob/master/docs/src/main/sphinx/develop/delete-and-update.rst

@findepi
Copy link
Member

findepi commented Mar 23, 2021

Good point. We should either link to the full docs from the javadoc or add some excerpts from docs into a javadoc.
The good side is that you have good wording for everything already.

@divyanshidm
Copy link

Is this available in Trino? (I am using Trino CLI 376, Hive Metastore 3.1.2, Both on dataproc gcp.)
But it still complains on update/delete - "Invalid method name: 'alter_table_req'"
Can somebody help?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

7 participants