Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement table changes function for delta lake #16205

Conversation

homar
Copy link
Member

@homar homar commented Feb 21, 2023

Description

Adds table_changes function to delta lake.
It allows reading CDF entries from table for which CDF is enabled.

This PR requires #15575 to be merged first.
Commits 5th and 6th adds some additional things to TableFunction implementation that I found necessary while working on table_changes function.

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# delta
* Add table_changes function. ({issue}`issuenumber`)

@martint
Copy link
Member

martint commented Feb 21, 2023

This shouldn’t need that other table functions PR. That PR is only needed for table function that take subqueries or descriptors as input.

It should be possible to model this function in the same way as the query pass-through functions that already exist in some connectors.

@github-actions github-actions bot added the docs label Feb 21, 2023
@homar
Copy link
Member Author

homar commented Feb 21, 2023

This shouldn’t need that other table functions PR. That PR is only needed for table function that take subqueries or descriptors as input.

It should be possible to model this function in the same way as the query pass-through functions that already exist in some connectors.

Yes I know it should be possible to use query pass-through. We considered that but our conclusion was that if we went that way we could have ended up with a more complex code. Using #15575 allows to nicely encapsulate the function.

@homar homar force-pushed the homar/implement_table_changes_function_for_delta_lake branch 2 times, most recently from 5af6d0f to 63e654b Compare February 22, 2023 08:51
if (processedVersion <= endVersion) {
Path transactionLogDir = getTransactionLogDir(new Path(tableLocation));
try {
Optional<List<DeltaLakeTransactionLogEntry>> entriesFromJson = getEntriesFromJson(processedVersion, transactionLogDir, fileSystem);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in case that the transaction log entry file is missing?
This may very well happen when delta.logRetentionDuration kicks in and the transaction log files before the last checkpoints get removed.

So if we request the changes from 100 to 200, but we have transaction log entries starting from 180 (which is the last checkpoint of the table) should we get 180-200 range ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #16192 which deals with this kind of situations for retrieving the $history of the table.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we request the changes from 100 to 200, but we have transaction log entries starting from 180 (which is the last checkpoint of the table) should we get 180-200 range ?

No, it should fail

Copy link
Member

@alexjo2144 alexjo2144 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an initial skim. @homar since you have the function taking in the schema and table name I don't think you need the initial commits from #15575 can you remove them from the branch?

docs/src/main/sphinx/connector/delta-lake.rst Outdated Show resolved Hide resolved
if (processedVersion <= endVersion) {
Path transactionLogDir = getTransactionLogDir(new Path(tableLocation));
try {
Optional<List<DeltaLakeTransactionLogEntry>> entriesFromJson = getEntriesFromJson(processedVersion, transactionLogDir, fileSystem);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we request the changes from 100 to 200, but we have transaction log entries starting from 180 (which is the last checkpoint of the table) should we get 180-200 range ?

No, it should fail

@homar
Copy link
Member Author

homar commented Feb 25, 2023

Just an initial skim. @homar since you have the function taking in the schema and table name I don't think you need the initial commits from #15575 can you remove them from the branch?

Not entirely true, I still need at least some of them. For example my io.trino.plugin.deltalake.function.tablechanges.TableChangesFunctionProcessor relies on io.trino.spi.ptf.TableFunctionSplitProcessor that was introduced in 5ceef49
In general I am using this paradigm Execution by operator for table functions without sources.
Which was explicitly introduced in following commits 5ceef49 and 300eb25
But these commits relies on previous changes so I need all of them ;)

@homar homar force-pushed the homar/implement_table_changes_function_for_delta_lake branch from 63e654b to d10f71c Compare February 25, 2023 12:51
@homar homar force-pushed the homar/implement_table_changes_function_for_delta_lake branch 4 times, most recently from 18e1759 to eddba0d Compare February 27, 2023 10:47
@homar homar force-pushed the homar/implement_table_changes_function_for_delta_lake branch from eddba0d to 9f56faf Compare March 3, 2023 10:47
@github-actions github-actions bot added bigquery BigQuery connector delta-lake Delta Lake connector mongodb MongoDB connector labels Mar 3, 2023
@homar homar force-pushed the homar/implement_table_changes_function_for_delta_lake branch from 9f56faf to ab94a16 Compare March 3, 2023 13:33
@homar
Copy link
Member Author

homar commented Apr 29, 2023

just a rebase

@homar
Copy link
Member Author

homar commented Apr 29, 2023

CI: #17295

Copy link
Member

@mosabua mosabua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the replacement docs PR that is linked now. Taking off the request changes here and leaving that up to other maintainers to discuss.

@findepi findepi dismissed mosabua’s stale review May 2, 2023 12:49

Manfred supposed to dismiss this per his last comment

@findepi
Copy link
Member

findepi commented May 2, 2023

@homar please make sure there are tests for table_changes for tables having row fields exercising reading whole row columns, but also dereferencing particular fields. see conversation: https://github.com/trinodb/trino/pull/17085/files#r1176399103
cc @krvikash

@findepi
Copy link
Member

findepi commented May 4, 2023

@homar please rebase to resolve conflict
also note that delta tests aren't green

martint
martint previously requested changes May 4, 2023
Copy link
Member

@martint martint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mosabua there is no new syntax. this is a connector-level feature.

@findepi, any changes to the language or function library (syntax or semantics) qualifies.

@homar, please update per that discussion.

@homar homar force-pushed the homar/implement_table_changes_function_for_delta_lake branch from 6b1cc6c to 25e6998 Compare May 5, 2023 14:21
@homar
Copy link
Member Author

homar commented May 5, 2023

@findepi test added, see io.trino.plugin.deltalake.TestDeltaLakeConnectorTest#testReadChangesOnTableWithRowColumn
@martint label added, what else do I need to do ?

@martint
Copy link
Member

martint commented May 6, 2023

@homar, I meant, update the PR based on the discussion above to make the argument optional instead of giving -1 a special meaning.

@homar homar force-pushed the homar/implement_table_changes_function_for_delta_lake branch 2 times, most recently from 44bd539 to 7665f63 Compare May 8, 2023 09:05
@homar
Copy link
Member Author

homar commented May 8, 2023

@homar, I meant, update the PR based on the discussion above to make the argument optional instead of giving -1 a special meaning.

@martint that is done, can we now proceed with merging this ?

@martint martint dismissed their stale review May 8, 2023 17:49

Comments addressed

@homar homar force-pushed the homar/implement_table_changes_function_for_delta_lake branch from 7665f63 to 946d31c Compare May 9, 2023 14:11
.filter(column -> column.getColumnType() != SYNTHESIZED)
.collect(toImmutableList());
accessControl.checkCanSelectFromColumns(null, schemaTableName, columnHandles.stream()
.map(DeltaLakeColumnHandle::getBaseColumnName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getColumnName


ImmutableList.Builder<Descriptor.Field> outputFields = ImmutableList.builder();
columnHandles.stream()
.map(columnHandle -> new Descriptor.Field(columnHandle.getBaseColumnName(), Optional.of(columnHandle.getBaseType())))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getBaseColumnName -> getColumnName

getBaseType -> getType, new method

@JsonIgnore
public Type getType()
{
    return projectionInfo.map(DeltaLakeColumnProjectionInfo::getType)
            .orElse(baseType);
}

@findepi
Copy link
Member

findepi commented May 10, 2023

force push https://github.com/trinodb/trino/compare/7665f6332438f2f8af1891a1ee30b94c87b72d73..946d31c37fb4f46b34371eb9b339d589fca5ddf4 combining rebase and changes made it hard to see what changed, but i tried to fish out the relevant part. i could have missed some things though

table_changes allows reading cdf entries stream.
since_version is optional and exclusive.
If since_version is not provided entire history of the table will be read.
@homar homar force-pushed the homar/implement_table_changes_function_for_delta_lake branch from 946d31c to 80c52ad Compare May 11, 2023 10:08
@homar
Copy link
Member Author

homar commented May 12, 2023

CI: #17158

@findepi
Copy link
Member

findepi commented May 12, 2023

CI: #17158

and #17472 ?

@findepi findepi merged commit 4cf66ff into trinodb:master May 12, 2023
90 of 92 checks passed
@github-actions github-actions bot added this to the 418 milestone May 12, 2023
@colebow
Copy link
Member

colebow commented May 17, 2023

@homar @findepi we need docs for this - @homar, would you be willing to submit a follow-up PR for that?

@homar
Copy link
Member Author

homar commented May 17, 2023

@homar @findepi we need docs for this - @homar, would you be willing to submit a follow-up PR for that?

@colebow The PR is there #17252

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

None yet

9 participants