@jordepic jordepic commented Oct 29, 2025

This commit handles simple "FOR VERSION AS OF X" syntax in
the Hudi connector in order to ensure that we can read previous
table state.

Previously, all reads to Hudi tables would occur at the latest
commit timestamp. While a user could technically filter down
the data using a predicate on the _hoodie_commit_time column,
there is no functionality that pushes this down into the Hudi
API to minimize file reads.

Timestamps are provided as longs in yyyyMMddHHmmssSSS format, in
the metastore timezone (which aligns with _hoodie_commit_time).
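
As an illustration of the timestamp format described above, here is a minimal sketch that renders a point in time as a 17-digit yyyyMMddHHmmssSSS value. The class name `HudiInstantFormat`, the method `toInstant`, and the `tableZone` parameter are hypothetical and not part of this PR; the sketch assumes the caller already knows which timezone the table's commit times use.

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

// Hypothetical helper, not taken from the PR.
final class HudiInstantFormat {
    // yyyyMMddHHmmss followed by a fixed three-digit millisecond field.
    private static final DateTimeFormatter INSTANT_FORMAT = new DateTimeFormatterBuilder()
            .appendPattern("yyyyMMddHHmmss")
            .appendValue(ChronoField.MILLI_OF_SECOND, 3)
            .toFormatter();

    private HudiInstantFormat() {}

    // Converts the given time to the assumed table timezone, then formats it
    // as a 17-digit string; sub-millisecond precision is dropped.
    static String toInstant(ZonedDateTime time, ZoneId tableZone) {
        return INSTANT_FORMAT.format(time.withZoneSameInstant(tableZone));
    }
}
```

For example, 2025-10-29 12:30:45.123 UTC formats as 20251029123045123 when the table timezone is also UTC.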

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

Added time travel reads in Hudi connector.

## Section
* Fix some things. ({issue}`issuenumber`)

Summary by Sourcery

Enable time travel reads in the Hudi connector by adding a session property that specifies the commit timestamp, updating the directory lister to fetch base files before or on that timestamp, and propagating it through split generation.

New Features:

  • Introduce session property "hudi.time_travel_read_timestamp" to read Hudi tables at a specified commit time
  • Use HoodieFileSystemView.getLatestBaseFilesBeforeOrOn when a time travel timestamp is provided
  • Propagate the time travel timestamp through HudiSplitSource to directory listing logic

Enhancements:

  • Remove duplicate helper methods getStoragePathInfo and createSplitWeightProvider
  • Reorganize split weight provider creation and directory lister constructor signature

Documentation:

  • Document the new time_travel_read_timestamp session property and its behavior in the Hudi connector guide

cla-bot bot commented Oct 29, 2025

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

sourcery-ai bot commented Oct 29, 2025

Reviewer's Guide

This PR introduces a new session variable hudi.time_travel_read_timestamp and threads it from the connector session through HudiSplitSource into HudiReadOptimizedDirectoryLister, where listStatus conditionally uses Hudi’s time-travel API (getLatestBaseFilesBeforeOrOn) when a timestamp is supplied. It also consolidates duplicated helper methods and updates the documentation to reflect the new property.
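
The conditional listing described above can be sketched roughly as follows. This is a self-contained illustration, not the PR's code: `BaseFileView` is a hypothetical stand-in for the Hudi file system view, reduced to the two method names mentioned in this guide and returning plain file names instead of Hudi base-file objects. It assumes, as the property description states, that an empty timestamp means "read the current snapshot".

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; the real lister works with Hudi and Trino types.
final class TimeTravelListingSketch {
    // Stand-in for the file system view; only the two calls named in the guide.
    interface BaseFileView {
        Stream<String> getLatestBaseFiles(String partitionPath);

        Stream<String> getLatestBaseFilesBeforeOrOn(String partitionPath, String maxCommitTime);
    }

    private TimeTravelListingSketch() {}

    // Empty timestamp: list the latest base files.
    // Non-empty timestamp: list base files committed before or on that instant.
    static List<String> listBaseFiles(BaseFileView view, String partitionPath, String timeTravelReadTimestamp) {
        Stream<String> files = timeTravelReadTimestamp.isEmpty()
                ? view.getLatestBaseFiles(partitionPath)
                : view.getLatestBaseFilesBeforeOrOn(partitionPath, timeTravelReadTimestamp);
        return files.collect(Collectors.toList());
    }
}
```

Keeping the branch in one place means the split source only has to pass the timestamp through; it does not need to know which view method is used.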

Sequence diagram for time travel read in Hudi connector

sequenceDiagram
  actor User
  participant TrinoSession
  participant HudiSplitSource
  participant HudiReadOptimizedDirectoryLister
  participant HudiFileSystemView

  User->>TrinoSession: Set hudi.time_travel_read_timestamp
  TrinoSession->>HudiSplitSource: getTimeTravelReadTimestamp(session)
  HudiSplitSource->>HudiReadOptimizedDirectoryLister: Pass timeTravelReadTimestamp
  HudiReadOptimizedDirectoryLister->>HudiFileSystemView: getLatestBaseFilesBeforeOrOn(partitionPath, timeTravelReadTimestamp)
  HudiFileSystemView-->>HudiReadOptimizedDirectoryLister: Return base files as of timestamp
  HudiReadOptimizedDirectoryLister-->>HudiSplitSource: Return filtered file statuses
  HudiSplitSource-->>TrinoSession: Return data as of timestamp

Class diagram for updated HudiReadOptimizedDirectoryLister and HudiSessionProperties

classDiagram
  class HudiReadOptimizedDirectoryLister {
    - HoodieTableFileSystemView fileSystemView
    - List<Column> partitionColumns
    - Map<String, HudiPartitionInfo> allPartitionInfoMap
    - String timeTravelReadTimestamp
    + HudiReadOptimizedDirectoryLister(..., String timeTravelReadTimestamp)
    + List<HudiFileStatus> listStatus(HudiPartitionInfo partitionInfo)
    - static StoragePathInfo getStoragePathInfo(HoodieBaseFile baseFile)
    + void close()
  }

  class HudiSessionProperties {
    - static final String TIME_TRAVEL_READ_TIMESTAMP
    + static String getTimeTravelReadTimestamp(ConnectorSession session)
    + List<PropertyMetadata<?>> getSessionProperties()
  }

  HudiSplitSource --> HudiReadOptimizedDirectoryLister : passes timeTravelReadTimestamp
  HudiSessionProperties <.. HudiSplitSource : static method getTimeTravelReadTimestamp
  HudiSessionProperties <.. TrinoSession : session property

Class diagram for HudiSplitSource changes

classDiagram
  class HudiSplitSource {
    + HudiSplitSource(..., String timeTravelReadTimestamp)
    + CompletableFuture<ConnectorSplitBatch> getNextBatch(int maxSize)
    + boolean isFinished()
    - static HudiSplitWeightProvider createSplitWeightProvider(ConnectorSession session)
  }

  HudiSplitSource --> HudiReadOptimizedDirectoryLister : instantiates with timeTravelReadTimestamp

File-Level Changes

Change: Propagate time travel timestamp from session into the Hudi reader
Details:
  • Define TIME_TRAVEL_READ_TIMESTAMP constant and stringProperty in HudiSessionProperties
  • Implement getTimeTravelReadTimestamp accessor in HudiSessionProperties
  • Extend HudiSplitSource to fetch and pass the timestamp into the reader
  • Add a new field and constructor parameter in HudiReadOptimizedDirectoryLister
  • Conditionally call getLatestBaseFilesBeforeOrOn vs getLatestBaseFiles based on timestamp
Files:
  • plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSessionProperties.java
  • plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSplitSource.java
  • plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/query/HudiReadOptimizedDirectoryLister.java

Change: Consolidate duplicated helper methods
Details:
  • Remove duplicate getStoragePathInfo implementation and move a single copy to the top of HudiReadOptimizedDirectoryLister
  • Eliminate redundant createSplitWeightProvider method in HudiSplitSource
Files:
  • plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/query/HudiReadOptimizedDirectoryLister.java
  • plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSplitSource.java

Change: Update connector documentation with time travel property
Details:
  • Add hudi.time_travel_read_timestamp entry and description in hudi.md
Files:
  • docs/src/main/sphinx/connector/hudi.md

@github-actions github-actions bot added docs hudi Hudi connector labels Oct 29, 2025
@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSessionProperties.java:150` </location>
<code_context>
-    {
-        return sessionProperties;
+                        false),
+                stringProperty(TIME_TRAVEL_READ_TIMESTAMP, "Read data as of provided timestamp - if empty Trino will read from current snapshot", "", false));
     }

</code_context>

<issue_to_address>
**suggestion:** Clarify expected format for time_travel_read_timestamp property.

Consider adding the required timestamp format to the property description to help users avoid mistakes.

```suggestion
                stringProperty(
                    TIME_TRAVEL_READ_TIMESTAMP,
                    "Read data as of provided timestamp in format 'yyyy-MM-dd HH:mm:ss' - if empty Trino will read from current snapshot",
                    "",
                    false));
```
</issue_to_address>
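
Note that the PR description gives the timestamp format as yyyyMMddHHmmssSSS, which differs from the 'yyyy-MM-dd HH:mm:ss' format in the suggestion above. As a rough sketch of how a property value could be checked against the 17-digit format, assuming the empty string keeps its "current snapshot" meaning: the class `TimeTravelTimestampCheck` and the method `isAcceptable` are hypothetical, and the PR itself may validate differently or not at all.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;

// Hypothetical validation helper, not taken from the PR.
final class TimeTravelTimestampCheck {
    // yyyyMMddHHmmss followed by a fixed three-digit millisecond field.
    private static final DateTimeFormatter INSTANT_FORMAT = new DateTimeFormatterBuilder()
            .appendPattern("yyyyMMddHHmmss")
            .appendValue(ChronoField.MILLI_OF_SECOND, 3)
            .toFormatter();

    private TimeTravelTimestampCheck() {}

    // Accepts the empty string (no time travel) or a 17-character value that
    // parses as a date-time; out-of-range fields such as month 13 are rejected.
    static boolean isAcceptable(String value) {
        if (value.isEmpty()) {
            return true;
        }
        if (value.length() != 17) {
            return false;
        }
        try {
            LocalDateTime.parse(value, INSTANT_FORMAT);
            return true;
        }
        catch (DateTimeParseException e) {
            return false;
        }
    }
}
```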

@jordepic jordepic force-pushed the trino-hudi-snapshot branch from ef12d42 to 6746772 Compare October 30, 2025 00:54
@jordepic jordepic force-pushed the trino-hudi-snapshot branch from 6746772 to 5102fa4 Compare October 30, 2025 00:57
@jordepic jordepic force-pushed the trino-hudi-snapshot branch from 5102fa4 to 3471a5a Compare October 30, 2025 23:50
@jordepic
Author

@ebyhr I signed my CLA but nobody has reviewed it in a week. Would it be possible to get anybody to take a look?

@jordepic jordepic force-pushed the trino-hudi-snapshot branch from 3471a5a to 63c448a Compare October 31, 2025 00:59
@jordepic jordepic force-pushed the trino-hudi-snapshot branch from 63c448a to 7c6ca6c Compare October 31, 2025 14:45
Labels

docs hudi Hudi connector

2 participants