Support embedded alluxio cache in hive #20658

Merged 3 commits on Feb 13, 2024

raunaqmorarka (Member) commented Feb 12, 2024

Description

Support embedded alluxio cache in hive

Additional context and related issues

Part of #20550

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Improve performance of scans by adding the ability to cache data files on local SSDs ({issue}`20658`)

@@ -305,7 +311,7 @@ else if (maxSplitBytes * 2 >= remainingBlockBytes) {
 internalSplit.getFileModifiedTime(),
 internalSplit.getSchema(),
 internalSplit.getPartitionKeys(),
-block.getAddresses(),
+cachingHostAddressProvider.getHosts(internalSplit.getPath(), block.getAddresses()),
Contributor

Do we need to extend the interface with `defaultAddresses`? We could also do:
`block.getAddresses().isEmpty() ? cachingHostAddressProvider.getHosts(internalSplit.getPath()) : block.getAddresses(),`

Member Author

Block addresses are populated when the filesystem is HDFS. When caching is used with HDFS, we still want caching to drive split scheduling decisions rather than HDFS block locality.
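
To illustrate the point above, here is a minimal sketch of why the provider receives the default addresses instead of being consulted only when they are empty. The interface and class names (`HostAddressProvider`, `DefaultHostAddressProvider`, `PathHashingHostAddressProvider`) are hypothetical stand-ins rather than the classes in this PR, hosts are plain strings, and the simple modulo hashing is an assumption made for brevity, not necessarily the hashing scheme the actual provider uses.

```java
import java.util.List;

// Hypothetical stand-in for the provider called in the diff; the name and
// signature are assumptions made for illustration only.
interface HostAddressProvider
{
    List<String> getHosts(String splitPath, List<String> defaultAddresses);
}

// Caching disabled: keep whatever locality the file system reported,
// for example the HDFS block hosts.
final class DefaultHostAddressProvider
        implements HostAddressProvider
{
    @Override
    public List<String> getHosts(String splitPath, List<String> defaultAddresses)
    {
        return defaultAddresses;
    }
}

// Caching enabled: derive the preferred worker from the path alone, so repeated
// reads of a file land on the worker that already cached it. The HDFS block
// hosts are deliberately ignored unless no workers are known.
final class PathHashingHostAddressProvider
        implements HostAddressProvider
{
    private final List<String> workers;

    PathHashingHostAddressProvider(List<String> workers)
    {
        this.workers = List.copyOf(workers);
    }

    @Override
    public List<String> getHosts(String splitPath, List<String> defaultAddresses)
    {
        if (workers.isEmpty()) {
            return defaultAddresses;
        }
        // Simplified modulo hashing (assumption); floorMod keeps the index non-negative.
        int index = Math.floorMod(splitPath.hashCode(), workers.size());
        return List.of(workers.get(index));
    }
}
```

With this shape the call site stays the same whether or not caching is enabled: the default provider passes HDFS locality through unchanged, while the caching provider overrides it even when `block.getAddresses()` is non-empty, which the `isEmpty()` check at the call site would not allow.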

@raunaqmorarka
Member Author

Alluxio hive unpartitioned sf1k.pdf
Screenshot 2024-02-13 at 12 01 59 AM

@wendigo
Contributor

wendigo commented Feb 13, 2024

@raunaqmorarka merge it!

@raunaqmorarka raunaqmorarka merged commit c736b20 into trinodb:master Feb 13, 2024
68 checks passed
@raunaqmorarka raunaqmorarka deleted the hive-cache branch February 13, 2024 08:23
@github-actions github-actions bot added this to the 439 milestone Feb 13, 2024
Labels: cla-signed, delta-lake, docs, hive