Add Hudi connector #10228
Conversation
I want to know: is the "codope:hudi-plugin" branch running OK on hive-rt tables?
@feiwei8586 Thanks for your interest in the new connector. It is not ready for hive-rt tables yet. The plan in the first phase is to get copy-on-write tables ready.
Thanks for your answer. Our difficulty is querying the hive-rt table. Please consider supporting it.
I debugged in IDEA and got an error with etc/catalog/hudi.properties: 2021-12-10T17:55:08.296+0800 INFO main io.trino.metadata.StaticCatalogStore -- Loading catalog etc/catalog/hudi.properties --
@super-sponge Can you pull the latest changes and try again? You will have to place the hbase-common (version 1.2.3) jar in the classpath (inside cc @caneGuy
Thanks!
@findepi This PR is ready for review. Could you please approve the workflows?
I get an error when selecting from a COW Hudi table with the newest PR, but it runs well with the older version. Do you have any idea about it? 2022-02-15T17:16:43.942+0800 INFO pool-60-thread-1 org.apache.hudi.common.table.view.FileSystemViewManager Creating InMemory based view for basePath hdfs://xxxxx 2022-02-15T17:16:44.470+0800 ERROR query-execution-0 io.trino.execution.scheduler.SqlQueryScheduler Failure in distributed stage for query 20220215_091643_00002_y7wg5
@feiwei8586 We will add support for _rt tables soon. The goal is to get this PR landed and keep building incrementally. We are tracking _rt support in https://issues.apache.org/jira/browse/HUDI-2687
@mxdzs0612 Thanks for trying it out. I did query COW and MOR _ro tables before pushing the changes. The last commit was mostly refactoring rather than any logic change. I'll debug your error and get back to you by Wednesday.
Initial set of comments based on skimming a subset of the classes.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/util/HiveUtil.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiConfig.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiConfig.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/InternalHudiConnectorFactory.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSplitSource.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSplitSource.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSplitSource.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSplitSource.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSplitSource.java
@losipiuk Thanks for reviewing the PR. I'll address the comments by the end of this week, as we're code freezing Hudi this week in preparation for the next release. A few things have changed since I last rebased. Thanks for the tip in the comments.
@losipiuk I've addressed all but one comment. I need a clarification on this comment in
For this, I will have to use an
Yeah, you can follow the pattern with PTAL at
Thanks for addressing. Some more comments.
Ultimately we need a review from @electrum, I think.
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiConnector.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiErrorCode.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSplitSource.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSplitSource.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiTableHandle.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiMetadata.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiMetadata.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiMetadata.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiMetadata.java
plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiPageSource.java
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla.
@electrum @findinpath thanks for testing and going over the PR again. Regarding the reflective access issue, we plan to fix it in the next patch release of Hudi (0.12.1). It is being tracked in https://issues.apache.org/jira/browse/HUDI-4687
@findinpath Regarding #10228 (comment) and #10228 (comment), I have fixed the issue. Regarding the issue in the MOR table #10228 (comment), it is expected, as the connector does not yet support snapshot queries for MOR tables. Once compaction/cleaning kicks in, new base files will be produced and the results will show the latest records. There is already a ticket to track support for snapshot queries.
Please just update the Hudi client to not trigger the reflection exceptions, and do a patch release. I don't think we should add stuff that needs these kinds of exceptions.
private static final Splitter COMMA_SPLITTER = Splitter.on(",").omitEmptyStrings().trimResults();
private String baseFileFormat = PARQUET.name();
private List<String> columnsToHide = ImmutableList.of();
Ya, that is not how this is done in Trino. These should be implemented like the other connectors.
import java.util.List;

public abstract class HudiPageSourceCreator
This abstract class only has one implementation. Please remove it (and any other base class like this). If we need the abstraction in the future, it can be added then, tuned to the actual usage. Also, this seems to be really just an interface, and the Trino code base does not use this interface-as-an-abstract-class pattern; instead, simply add these 4 fields to any implementation, and if you need an interface, add a plain old interface.
@codope please include the code from HudiParquetPageSourceCreator directly in HudiPageSourceProvider.
tableName.getSchemaName(),
tableName.getTableName(),
table.get().getStorage().getLocation(),
HoodieTableType.COPY_ON_WRITE,
When you need this value, you can simply add the field to the Handle. It is better to wait until something is actually needed before adding it.
HiveMetastore metastore = metastoreProvider.apply(session.getIdentity(), (HiveTransactionHandle) transaction);
Table table = metastore.getTable(hudiTableHandle.getSchemaName(), hudiTableHandle.getTableName())
        .orElseThrow(() -> new TableNotFoundException(schemaTableName(hudiTableHandle.getSchemaName(), hudiTableHandle.getTableName())));
final FileSystem fs;
Don't use final on local variables.
Also please don't use abbreviations like this, instead just use filesystem
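To illustrate both conventions above, here is a minimal sketch (not code from this PR; the class and method names are invented for illustration): the local variable gets a spelled-out name and no final modifier.

```java
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;

// Illustrative sketch of the review convention: a spelled-out local name
// ("fileSystem", not "fs") and no 'final' modifier on the local variable.
public class NamingStyle
{
    static String defaultSeparator()
    {
        FileSystem fileSystem = FileSystems.getDefault();
        return fileSystem.getSeparator();
    }
}
```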
public HudiBackgroundSplitLoader(
        ConnectorSession session,
        FileSystem fs,
no abbreviations
private final int minPartitionBatchSize;
private final int maxPartitionBatchSize;
private final Deque<HudiPartitionInfo> partitionQueue;
private int currBatchSize;
No abbreviations. Can you scan through this PR and fix any places that have abbreviations like this?
@Override
public String getHivePartitionName()
{
    throw new HoodieException(
No need to wrap. Please fix all of the excessive wrapping in this class
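For illustration (hypothetical class and message, not code from this PR), the unwrapped form the reviewer asks for keeps a short throw statement on a single line:

```java
// Illustrative sketch: a short throw statement stays on one line instead of
// wrapping the exception message onto a separate line.
public class WrappingStyle
{
    static String hivePartitionName()
    {
        throw new UnsupportedOperationException("Hive partition name is not supported here");
    }
}
```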
final boolean useMetastoreForPartitions;
final HoodieTableFileSystemView fileSystemView;
final TupleDomain<String> partitionKeysFilter;
final List<Column> partitionColumns;
Mark all of these private
.filter(partitionInfo -> partitionInfo.getHivePartitionKeys().isEmpty() || partitionInfo.doesMatchPredicates())
.collect(Collectors.toList());

log.debug("Get partitions to scan in %d ms (useMetastoreForPartitions: %s): %s", timer.endTimer(), useMetastoreForPartitions, filteredPartitionInfoList);
This seems like excessive logging. If you want to collect stats, please add some @Managed stats.
        hiveMetastore);
String relativePartitionPath = partitionHiveInfo.getRelativePartitionPath();
List<String> partitionValues = partitionHiveInfo.getHivePartitionKeys().stream()
        .map(HivePartitionKey::getValue).collect(Collectors.toList());
Put collect on a separate line. Generally, if you chop down a statement like this, do it for each step. If it is a short one-liner then it is fine to keep everything on one line.
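A minimal sketch of the chopped-statement style described above (illustrative data, not code from this PR): once the chain is split, every step, including the terminal collect, gets its own line.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch: each step of a chopped-down stream chain on its own
// line, with collect on a separate line.
public class ChoppedStreamStyle
{
    static List<String> upperCased(List<String> values)
    {
        return values.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }
}
```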
package io.trino.plugin.hudi.query;

public enum HudiQueryMode {
It looks like this is not used anywhere (the only usage is hard-coded to READ_OPTIMIZED), so please remove it for now. We can add it back later if needed.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
Please remove the blank line between the license header and the package statement, in order to match the convention used in all other files in Trino.
: TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline());
allPartitionInfoList = relativePartitionPathList.stream()
        .map(relativePartitionPath ->
                buildHudiPartitionInfo(
It would be better to inline this method, since the useMetastoreForPartitions field is a constant here.
    }
}

if (isNull(allPartitionInfoList)) {
Use a normal value == null check. This method should be reserved for usage in streams, etc.
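A short sketch of the distinction (illustrative names, not code from this PR): a plain equality check in ordinary control flow, with Objects::isNull reserved for method references in streams.

```java
import java.util.Objects;
import java.util.stream.Stream;

// Illustrative sketch: value == null in ordinary code, Objects::isNull
// only as a method reference inside streams.
public class NullCheckStyle
{
    static String describe(Object value)
    {
        if (value == null) {
            return "absent";
        }
        return value.toString();
    }

    static long countNulls(Stream<?> values)
    {
        return values.filter(Objects::isNull).count();
    }
}
```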
}

@Config("hudi.skip-metastore-for-partition")
@ConfigDescription("By default, partition info is fetched from the metastore. " +
How would this custom extractor be used in Trino? We don't want users to have to add custom code.
This config property needs to be covered by integration tests, so that we test both code paths. If it's too hard to test, then let's remove it for now.
return this.useMetastoreForPartitions;
}

@Config("hudi.parquet.use-column-names")
We need an integration test covering this option.
return this;
}

@Config("hudi.metadata-enabled")
Why is this disabled by default? When would it be safe (or not be safe) to enable?
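For context, a hedged sketch of how such a flag would appear in a catalog file. The hudi.metadata-enabled name comes from the review excerpt above; the other property values (host, port) are illustrative assumptions, not taken from this PR.

```properties
# etc/catalog/hudi.properties (illustrative sketch)
connector.name=hudi
hive.metastore.uri=thrift://metastore.example.com:9083
# Opt in to reading file listings from the Hudi metadata table;
# per the discussion above, this is disabled by default.
hudi.metadata-enabled=true
```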
private void loadPartitionInfoFromHiveMetastore()
{
    Optional<Partition> partition = hiveMetastore.getPartition(table, HiveUtil.toPartitionValues(hivePartitionName));
No need for this local variable
{
    private HudiPartitionInfoFactory() {}

    public static HudiPartitionInfo buildHudiPartitionInfo(
Inline this method into the callers. The way it mixes the boolean useMetastoreForPartitions with Optional parameters is confusing.
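A hypothetical sketch of the shape such an inlining can take (all names invented for illustration, not the PR's actual classes): each call site constructs the concrete variant it needs, so no factory mixing a boolean flag with Optional parameters is required.

```java
// Hypothetical sketch: two concrete partition-info variants constructed
// directly at the call sites, replacing a boolean-plus-Optional factory.
public class PartitionInfoSketch
{
    interface PartitionInfo
    {
        String source();
    }

    static class MetastorePartitionInfo
            implements PartitionInfo
    {
        private final String hivePartitionName;

        MetastorePartitionInfo(String hivePartitionName)
        {
            this.hivePartitionName = hivePartitionName;
        }

        @Override
        public String source()
        {
            return "metastore:" + hivePartitionName;
        }
    }

    static class TimelinePartitionInfo
            implements PartitionInfo
    {
        private final String relativePartitionPath;

        TimelinePartitionInfo(String relativePartitionPath)
        {
            this.relativePartitionPath = relativePartitionPath;
        }

        @Override
        public String source()
        {
            return "timeline:" + relativePartitionPath;
        }
    }
}
```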
@Override
public String toString()
{
    StringBuilder stringBuilder = new StringBuilder();
Use ToStringHelper instead.
import static java.lang.String.format;

public final class HudiFileListerFactory
Remove this for now, since it's only used in one place and is just forwarding to a single implementation.
{
    private static final Logger log = Logger.get(HudiReadOptimizedFileLister.class);

    final HoodieMetadataConfig metadataConfig;
Make all these fields private
@electrum @dain @findinpath I've addressed your comments. There are just two open items right now:
For the first item, I would suggest going with the current workaround of
EDIT: Adding the corresponding tickets from Hudi JIRAs: HUDI-4687 and HUDI-4789
At Robinhood, we have been waiting for this Hudi connector to land for more than 6 months now. It will help us use Trino better with Hudi tables. Can we kindly prioritize landing this PR?
Hi, I am eager to use snapshot queries on MOR tables. Could you please point me to the code needed to support this feature?
@@ -72,6 +73,8 @@
public class HudiMetadata
        implements ConnectorMetadata
{
    public static final Logger log = Logger.get(HudiMetadata.class);
How is the removal of the warning related to removing the dependency towards hudi-hive-sync?
public abstract class HudiPageSourceCreator
{
    protected final HudiConfig hudiConfig;
Note that the hudiConfig is nowhere used. In any case, config classes shouldn't be saved; instead, only the required configuration values should be retrieved from them and saved.
The fixup commits look good. Can you squash them into the Add Hudi Connector commit?
I believe the last remaining question, and my biggest concern, is around integration tests against Spark. We do these as product tests for the Iceberg connector. Do you have plans to work on this as a follow up?
ConnectorPageSource dataPageSource = pageSourceBuilderMap.get(hudiFileFormat).createPageSource(configuration, session, regularColumns, split);
TrinoFileSystem fileSystem = fileSystemFactory.create(session);
TrinoInputFile inputFile = fileSystem.newInputFile(path.toString(), split.getFileSize());
ConnectorPageSource dataPageSource = pageSourceBuilderMap.get(hudiFileFormat).createPageSource(configuration, session, regularColumns, split, inputFile);
Why does this still need Configuration?
        HdfsEnvironment hdfsEnvironment,
        FileFormatDataSourceStats fileFormatDataSourceStats,
        ParquetReaderConfig parquetReaderConfig,
        HudiConfig hudiConfig)
{
    this.fileSystemFactory = requireNonNull(fileSystemFactory, "fileSystemFactory is null");
    this.hdfsEnvironment = requireNonNull(hdfsEnvironment, "hdfsEnvironment is null");
Do we still need to pass HdfsEnvironment to HudiParquetPageSourceCreator, or can that be removed now?
Both Configuration and HdfsEnvironment can be removed now.
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> Co-authored-by: Todd Gao <gaoguantao@bytedance.com>
Done. All other comments are addressed too. Squashed them into the main commit.
Absolutely! I did start on this some time back. I have a PR for the Hudi Spark3 image. I will pick up the product tests in a follow-up to this PR.
Also, the PR is not dependent on FileSystem anymore. Replaced all usages of FileSystem with
@codope any issue with bumping to 0.12 like prestodb/presto@9b36723?
@tooptoop4 We're going to release 0.12.1 very soon. So, I'll directly upgrade to 0.12.1.
The fix is in
This PR adds a new connector for Hudi (Issue #9877). For new Hudi features within the existing Hive connector, please take a look at #9641. In order to avoid making core changes in the Hive connector, we wrote a new connector as suggested in #9641. The design of the connector can be viewed here.