
Presto plugin for Hoodie #81

Closed
vinothchandar opened this issue Feb 18, 2017 · 10 comments

Comments

@vinothchandar
Member

No description provided.

@vinothchandar
Member Author

Here we hope to add support for faster point lookups (batch).

@leletan
Contributor

leletan commented Jul 4, 2018

Is this completed? It seems the Presto patch has been merged.

I was trying to follow the Hudi docs to build a Presto + minimal Hive metastore POC (https://stackoverflow.com/questions/43727964/does-presto-require-a-hive-metastore-to-read-parquet-files-from-s3) on top of S3 files written by Hudi. It seems Presto:

  • cannot match field names to field values (it shows the Hudi commit version as the value of one of the real data columns)
  • does not leverage the commits (for 2 records written in 2 commits with the same primary key, it shows both instead of only the later one)

Any guidance on what went wrong? Thanks in advance!

@vinothchandar
Member Author

Yes, the Presto patch is merged, and Presto support is via the Hive catalog.

cannot match field names to field values (it shows the Hudi commit version as the value of one of the real data columns)
Not following fully... but Hudi does add a few metadata fields to the table, including the commit version of the records.
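For illustration only (the table name "hudi_trips" and the trailing data columns are placeholders), a query against a Hudi table surfaces these metadata fields next to the data columns:

```sql
-- Illustrative sketch: "hudi_trips" and the data columns are placeholders.
-- Hudi stores its metadata fields as the leading columns of every record.
SELECT
    _hoodie_commit_time,     -- the commit "version" the record was last written in
    _hoodie_commit_seqno,
    _hoodie_record_key,
    _hoodie_partition_path,
    _hoodie_file_name,
    rider, driver, fare      -- the actual data columns
FROM hudi_trips
LIMIT 10;
```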

does not leverage the commits (for 2 records written in 2 commits with the same primary key, it shows both instead of only the later one)
This is odd. What Presto version are you running? Also, be sure to provide the hoodie-hadoop-mr jar to the Presto workers/coordinator.

For copy-on-write tables, Hoodie support works without any need for a plugin.

This ticket is more for the longer term.

@leletan
Contributor

leletan commented Jul 6, 2018

Thanks for the clarification, @vinothchandar. I guess it was due to my own local env setup.

So I fell back to the local env setup from the quickstart, with hadoop-2.6.0-cdh5.4.7, hive-1.1.0-cdh5.4.7, Presto 0.205, and a dataset generated from HoodieJavaApp with default options. Presto works fine for select count(*), but when I try select * I get the following error, no matter whether I add hoodie-hadoop-mr-*.jar to the plugin directory or not:

java.lang.UnsupportedOperationException: com.facebook.presto.spi.type.DoubleType
at com.facebook.presto.spi.type.AbstractType.writeSlice(AbstractType.java:135)
at com.facebook.presto.hive.parquet.reader.ParquetBinaryColumnReader.readValue(ParquetBinaryColumnReader.java:55)
at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.lambda$readValues$1(ParquetPrimitiveColumnReader.java:184)
at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.processValues(ParquetPrimitiveColumnReader.java:204)
at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.readValues(ParquetPrimitiveColumnReader.java:183)
at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.readPrimitive(ParquetPrimitiveColumnReader.java:171)
at com.facebook.presto.hive.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:208)
at com.facebook.presto.hive.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:258)
at com.facebook.presto.hive.parquet.reader.ParquetReader.readBlock(ParquetReader.java:241)
at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:243)
at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:221)
at com.facebook.presto.spi.block.LazyBlock.assureLoaded(LazyBlock.java:262)
at com.facebook.presto.spi.Page.assureLoaded(Page.java:244)
at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:245)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:373)
at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:282)
at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:672)
at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:973)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:477)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

@leletan
Contributor

leletan commented Jul 6, 2018

Dug a little bit more: it seems Presto is trying to decode a binary-serialized Parquet column written by Hudi as a double, which requires a DoubleType.writeSlice call. DoubleType does not implement writeSlice, so its parent class AbstractType just throws UnsupportedOperationException when it is called.

Not sure if the above holds true. If it does, wouldn't all open-source Presto users of Hudi hit the same issue? Wondering how this is resolved at Uber.

@vinothchandar
Member Author

vinothchandar commented Jul 12, 2018

So, this does not seem like a Hudi issue to me; the Parquet files generated by Hudi are standard Parquet files. Can you try copying the files themselves into another (non-Hudi) table and see if it works? Also, can you validate that the Parquet files are good via Spark SQL?
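For example (a sketch only; the path, table name, and columns below are placeholders), you could register copies of the same files as a plain Parquet external table in Hive and query it from Presto; if the same error shows up there, the problem is in the Presto/Hive Parquet path rather than in Hudi:

```sql
-- Sketch (Hive DDL): register copies of the same parquet files as a plain,
-- non-Hudi external table. Path and columns are placeholders.
CREATE EXTERNAL TABLE hudi_trips_plain (
  _hoodie_commit_time    string,
  _hoodie_commit_seqno   string,
  _hoodie_record_key     string,
  _hoodie_partition_path string,
  _hoodie_file_name      string,
  rider  string,
  driver string,
  fare   double
)
STORED AS PARQUET
LOCATION 's3://your-bucket/tmp/hudi_trips_copy/';

-- Then, from Presto:
SELECT * FROM hudi_trips_plain LIMIT 10;
```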

Also, please feel free to open a new issue for this, since this one is about the Presto plugin.

@leletan
Contributor

leletan commented Jul 12, 2018

Thanks for the info, @vinothchandar.
I figured it out: I had missed hive.parquet.use-column-names=true in the Presto Hive catalog config.
Now everything is working!
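For anyone else who hits this: the flag goes in the Hive connector's catalog properties file (e.g. etc/catalog/hive.properties) and tells Presto to resolve Parquet columns by name instead of by ordinal position, which would explain the shifted column values reported above. A rough sanity check after restarting Presto (the table name is a placeholder):

```sql
-- Metadata columns should now line up with their own values:
SELECT _hoodie_commit_time, _hoodie_record_key FROM hudi_table LIMIT 5;

-- And for a copy-on-write table read through HoodieParquetInputFormat, each record
-- key should surface only once (the latest commit wins), so this returns no rows:
SELECT _hoodie_record_key, count(*) AS versions_visible
FROM hudi_table
GROUP BY _hoodie_record_key
HAVING count(*) > 1;
```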

@vinothchandar
Member Author

Nice to hear. Let's put together a Hoodie Docker container for the future. Given so many dependencies, it can be overwhelming at times :)

@leletan
Contributor

leletan commented Jul 12, 2018

Totally agree. Any thoughts on which folder this should go into?

@vinothchandar
Member Author

We can create a hoodie-docker folder and host all the scripts and Dockerfiles there.
