Add a Spark Datasource to query Hoodie tables natively #7

Closed · 2 tasks
vinothchandar opened this issue Dec 27, 2016 · 5 comments

@vinothchandar (Member)

Would be nice to have something native that understands queries on Hoodie tables

  • Supporting queries with _hoodie_commit_time to switch to incremental mode (see the read sketch below)
  • Can cross join with other tables seamlessly
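
For illustration, a minimal sketch of what that could look like from the Spark side, assuming a hypothetical "hoodie" format name and made-up paths/tables; only the _hoodie_commit_time column comes from this issue:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    // All paths, table names and the "hoodie" format string are illustrative assumptions.
    val spark = SparkSession.builder().appName("hoodie-read-sketch").getOrCreate()

    // Normal query over a Hoodie table
    val trips = spark.read.format("hoodie").load("/data/hoodie/trips")

    // Incremental mode: only rows committed after a given instant, driven by _hoodie_commit_time
    val newRows = trips.filter(col("_hoodie_commit_time") > "20161227000000")

    // Cross joins with other tables should then work seamlessly through the DataFrame API
    val drivers = spark.read.parquet("/data/drivers")
    val enriched = newRows.join(drivers, Seq("driver_id"))
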
@hoodielib

And also

  • df.write.format('hoodie').save() should work seamlessly (a write sketch follows)
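
A hedged sketch of that write path; the option keys and the df DataFrame are illustrative assumptions, not confirmed datasource options:

    // `df` is assumed to be an existing DataFrame holding the new/updated records;
    // the option keys below are hypothetical placeholders.
    df.write
      .format("hoodie")
      .option("hoodie.table.name", "trips")
      .option("hoodie.recordkey.field", "uuid")
      .mode("append")
      .save("/data/hoodie/trips")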

@vinothchandar (Member Author) commented May 25, 2017

https://hackernoon.com/extending-our-spark-sql-query-engine-5f4a088de986 is an example of writing a datasource
https://michalsenkyr.github.io/2017/02/spark-sql_datasource
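
For orientation, a minimal skeleton of the kind of datasource those posts describe, built on Spark's public sources API (RelationProvider / CreatableRelationProvider); all Hoodie-specific pieces are placeholders:

    import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, RelationProvider}
    import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}

    // Skeleton only: the trait names and method signatures are Spark's public sources API;
    // the bodies are left as placeholders for the Hoodie-specific logic.
    class DefaultSource extends RelationProvider with CreatableRelationProvider {

      // Invoked for spark.read.format("hoodie").load(path)
      override def createRelation(sqlContext: SQLContext,
                                  parameters: Map[String, String]): BaseRelation = {
        // e.g. build a relation over the parquet files of a Hoodie dataset
        ???
      }

      // Invoked for df.write.format("hoodie").mode(...).save(path)
      override def createRelation(sqlContext: SQLContext,
                                  mode: SaveMode,
                                  parameters: Map[String, String],
                                  data: DataFrame): BaseRelation = {
        // e.g. hand the incoming rows to HoodieWriteClient as upsert/insert/bulkInsert
        ???
      }
    }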

@vinothchandar (Member Author)

Here is a brain dump on this.. Feel free to pick a first subset of this list to get started.

Goal:
Ultimately,

  • Subsume most functionality of HoodieWriteClient for saving a dataframe back into a Hoodie dataset, including both storage types (COPY_ON_WRITE, MERGE_ON_READ)
  • Subsume most functionality of HoodieReadClient for running queries on a Hoodie dataset, including all three views (ReadOptimized, RealTime, IncrementalView)

For starters,
Maybe just support COPY_ON_WRITE & ReadOptimized/IncrementalView.
A good test would be that

  • the HoodieDeltaStreamer will be able to use this to write data (all the things that are configurable, like key extractors, schema providers etc., can be provided as options to the datasource)
  • the dataset can be queried incrementally (currently done by HiveIncrementalPuller or the HoodieReadClient::readSince, HoodieReadClient::readCommit methods) or normally via the datasource, like HoodieReadClient::read(paths).

Writing

  • The main thing here is to support modes for upsert, insert, bulkInsert & decide if/how the Spark saveModes map to these: https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
  • The user should be able to specify a key-generator class to control the record key and partition path, as in DeltaStreamer
  • Internally, the payload will just be something (say HoodieRowPayload) that converts a Row to a GenericRecord, so we can probably fix this. Maybe for a first cut, all we do is blindly replace the existing value with the provided record, and we can leave this part pluggable by extending HoodieRowPayload (all of this again stems from HoodieDeltaStreamer). See the conversion sketch after this list.
  • Schema is in the dataframe, so we need not worry about that.
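
A minimal sketch of that Row -> GenericRecord conversion, assuming a flat Avro schema whose field names match the Row's columns; HoodieRowPayload itself and nested-type handling are left out:

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.spark.sql.Row
    import scala.collection.JavaConverters._

    // Sketch only: maps each Avro field to the Row column of the same name.
    // A real HoodieRowPayload would also need nested structs, arrays, maps
    // and Avro/Catalyst type coercion.
    def rowToGenericRecord(row: Row, schema: Schema): GenericRecord = {
      val record = new GenericData.Record(schema)
      for (field <- schema.getFields.asScala) {
        record.put(field.name(), row.getAs[AnyRef](field.name()))
      }
      record
    }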

Reading

  • We can simply wrap the parquet source with additional filtering, like in HoodieReadClient::read, for normal queries. (This already gives a Dataset/DataFrame.)
  • Tricky things are how we support incremental queries and how to specify the commit time for them. A powerful way is to have special columns right in the query, like
    select * from table where a = 1 // this does a normal query 
    select * from table where _mode = incremental and _commit_time_start = "20171020120102" // this is an incremental query
    

Don't know if this makes sense or is doable. Something to ponder. Of course, the easier thing is to specify the mode & commit time as options on the spark.read call, as sketched below.
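
A hedged sketch of that options-based approach; the option keys are hypothetical placeholders, not actual datasource options:

    // `spark` is an existing SparkSession; the option keys are made up for illustration.
    val incremental = spark.read
      .format("hoodie")
      .option("hoodie.datasource.query.mode", "incremental")
      .option("hoodie.datasource.begin.committime", "20171020120102")
      .load("/data/hoodie/trips")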

  • Need to find a way to support read(RDD<HoodieKey> keys), kind of like a point lookup query.. but it may be better to leave that out of the datasource and support it only via the HoodieReadClient

Once we have a basic datasource, we will integrate support for Hive sync, merge-on-read, and a structured streaming sink (in that order). Fun stuff
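
For the eventual structured streaming sink, the user-facing shape would presumably look something like the following; everything here (option keys, paths, the streamingDf source) is an assumption:

    // `streamingDf` is assumed to be a streaming DataFrame (e.g. from spark.readStream);
    // the "hoodie" sink format and the paths are illustrative placeholders.
    val query = streamingDf.writeStream
      .format("hoodie")
      .option("checkpointLocation", "/checkpoints/hoodie/trips")
      .outputMode("append")
      .start("/data/hoodie/trips")
    query.awaitTermination()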

@vinothchandar (Member Author)

@zqureshi here you go.. cc @prazanna

@prazanna prazanna modified the milestone: 0.3.9 Jun 16, 2017
@prazanna prazanna added this to 06/20/2017 in Sprint Jun 20, 2017
@prazanna prazanna moved this from 06/20/2017 to 06/26 - 07/07 in Sprint Jun 26, 2017
@vinothchandar (Member Author) commented Aug 28, 2017

@zqureshi Started something here.. Plan to push a minimal working version soon. We can iterate on that..

vinothchandar@286f38c

thanks to @alunarbeach for some row -> avro conversion code :)
