-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Allow time travel reads to Hudi Tables #27140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
Reviewer's GuideThis PR introduces a new session variable Sequence diagram for time travel read in Hudi connectorsequenceDiagram
actor User
participant TrinoSession
participant HudiSplitSource
participant HudiReadOptimizedDirectoryLister
participant HudiFileSystemView
User->>TrinoSession: Set hudi.time_travel_read_timestamp
TrinoSession->>HudiSplitSource: getTimeTravelReadTimestamp(session)
HudiSplitSource->>HudiReadOptimizedDirectoryLister: Pass timeTravelReadTimestamp
HudiReadOptimizedDirectoryLister->>HudiFileSystemView: getLatestBaseFilesBeforeOrOn(partitionPath, timeTravelReadTimestamp)
HudiFileSystemView-->>HudiReadOptimizedDirectoryLister: Return base files as of timestamp
HudiReadOptimizedDirectoryLister-->>HudiSplitSource: Return filtered file statuses
HudiSplitSource-->>TrinoSession: Return data as of timestamp
Class diagram for updated HudiReadOptimizedDirectoryLister and HudiSessionPropertiesclassDiagram
class HudiReadOptimizedDirectoryLister {
- HoodieTableFileSystemView fileSystemView
- List<Column> partitionColumns
- Map<String, HudiPartitionInfo> allPartitionInfoMap
- String timeTravelReadTimestamp
+ HudiReadOptimizedDirectoryLister(..., String timeTravelReadTimestamp)
+ List<HudiFileStatus> listStatus(HudiPartitionInfo partitionInfo)
- static StoragePathInfo getStoragePathInfo(HoodieBaseFile baseFile)
+ void close()
}
class HudiSessionProperties {
- static final String TIME_TRAVEL_READ_TIMESTAMP
+ static String getTimeTravelReadTimestamp(ConnectorSession session)
+ List<PropertyMetadata<?>> getSessionProperties()
}
HudiSplitSource --> HudiReadOptimizedDirectoryLister : passes timeTravelReadTimestamp
HudiSessionProperties <.. HudiSplitSource : static method getTimeTravelReadTimestamp
HudiSessionProperties <.. TrinoSession : session property
Class diagram for HudiSplitSource changesclassDiagram
class HudiSplitSource {
+ HudiSplitSource(..., String timeTravelReadTimestamp)
+ CompletableFuture<ConnectorSplitBatch> getNextBatch(int maxSize)
+ boolean isFinished()
- static HudiSplitWeightProvider createSplitWeightProvider(ConnectorSession session)
}
HudiSplitSource --> HudiReadOptimizedDirectoryLister : instantiates with timeTravelReadTimestamp
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSessionProperties.java:150` </location>
<code_context>
- {
- return sessionProperties;
+ false),
+ stringProperty(TIME_TRAVEL_READ_TIMESTAMP, "Read data as of provided timestamp - if empty Trino will read from current snapshot", "", false));
}
</code_context>
<issue_to_address>
**suggestion:** Clarify expected format for time_travel_read_timestamp property.
Consider adding the required timestamp format to the property description to help users avoid mistakes.
```suggestion
stringProperty(
TIME_TRAVEL_READ_TIMESTAMP,
"Read data as of provided timestamp in format 'yyyy-MM-dd HH:mm:ss' - if empty Trino will read from current snapshot",
"",
false));
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| { | ||
| return sessionProperties; | ||
| false), | ||
| stringProperty(TIME_TRAVEL_READ_TIMESTAMP, "Read data as of provided timestamp - if empty Trino will read from current snapshot", "", false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Clarify expected format for time_travel_read_timestamp property.
Consider adding the required timestamp format to the property description to help users avoid mistakes.
| stringProperty(TIME_TRAVEL_READ_TIMESTAMP, "Read data as of provided timestamp - if empty Trino will read from current snapshot", "", false)); | |
| stringProperty( | |
| TIME_TRAVEL_READ_TIMESTAMP, | |
| "Read data as of provided timestamp in format 'yyyy-MM-dd HH:mm:ss' - if empty Trino will read from current snapshot", | |
| "", | |
| false)); |
ef12d42 to
6746772
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
6746772 to
5102fa4
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSessionProperties.java
Outdated
Show resolved
Hide resolved
plugin/trino-hudi/src/test/java/io/trino/plugin/hudi/TestHudiMetadata.java
Outdated
Show resolved
Hide resolved
5102fa4 to
3471a5a
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
|
@ebyhr I signed my CLA but nobody has reviewed it in a week. Would it be possible to get anybody to take a look? |
plugin/trino-hudi/src/test/java/io/trino/plugin/hudi/TestHudiConnectorTest.java
Show resolved
Hide resolved
3471a5a to
63c448a
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
This commit handles simple "FOR VERSION AS OF X" syntax in the Hudi connector order to ensure that we can read previous table state. Previously, all reads to Hudi tables would occur at the latest commit timestamp. While a user could technically filter down the data using a predicate on the _hoodie_commit_time column, there is no functionality that pushes this down into the Hudi API to minimize file reads. Timestamps are provided as longs in yyyyMMddHHmmssSSS of the metastore timezone (which align with _hoodie_commit_time).
63c448a to
7c6ca6c
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
This commit handles simple "FOR VERSION AS OF X" syntax in
the Hudi connector order to ensure that we can read previous
table state.
Previously, all reads to Hudi tables would occur at the latest
commit timestamp. While a user could technically filter down
the data using a predicate on the _hoodie_commit_time column,
there is no functionality that pushes this down into the Hudi
API to minimize file reads.
Timestamps are provided as longs in yyyyMMddHHmmssSSS of
the metastore timezone (which align with _hoodie_commit_time).
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:
Added time travel reeads in Hudi connector.
Summary by Sourcery
Enable time travel reads in the Hudi connector by adding a session property that specifies the commit timestamp, updating the directory lister to fetch base files before or on that timestamp, and propagating it through split generation.
New Features:
Enhancements:
Documentation: