Add a Spark Datasource to query Hoodie tables natively #7

Closed · 2 tasks
vinothchandar opened this issue Dec 27, 2016 · 5 comments

@vinothchandar (Member)

Would be nice to have something native that understands queries on Hoodie tables

  • Supporting queries with _hoodie_commit_time to switch to incremental mode (see the read sketch below)
  • Can cross join with other tables seamlessly
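
For illustration, a minimal sketch of what that could look like from the Spark side, assuming a hypothetical "hoodie" format name and made-up paths/tables; only the _hoodie_commit_time column comes from this issue:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    // All paths, table names and the "hoodie" format string are illustrative assumptions.
    val spark = SparkSession.builder().appName("hoodie-read-sketch").getOrCreate()

    // Normal query over a Hoodie table
    val trips = spark.read.format("hoodie").load("/data/hoodie/trips")

    // Incremental mode: only rows committed after a given instant, driven by _hoodie_commit_time
    val newRows = trips.filter(col("_hoodie_commit_time") > "20161227000000")

    // Cross joins with other tables should then work seamlessly through the DataFrame API
    val drivers = spark.read.parquet("/data/drivers")
    val enriched = newRows.join(drivers, Seq("driver_id"))
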
@hoodielib

And also

  • df.write.format('hoodie').save() should work seamlessly (a write sketch follows)
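
A hedged sketch of that write path; the option keys and the df DataFrame are illustrative assumptions, not confirmed datasource options:

    // `df` is assumed to be an existing DataFrame holding the new/updated records;
    // the option keys below are hypothetical placeholders.
    df.write
      .format("hoodie")
      .option("hoodie.table.name", "trips")
      .option("hoodie.recordkey.field", "uuid")
      .mode("append")
      .save("/data/hoodie/trips")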

@vinothchandar (Member Author) commented May 25, 2017

https://hackernoon.com/extending-our-spark-sql-query-engine-5f4a088de986 is an example of writing a datasource
https://michalsenkyr.github.io/2017/02/spark-sql_datasource
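
For orientation, a minimal skeleton of the kind of datasource those posts describe, built on Spark's public sources API (RelationProvider / CreatableRelationProvider); all Hoodie-specific pieces are placeholders:

    import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, RelationProvider}
    import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}

    // Skeleton only: the trait names and method signatures are Spark's public sources API;
    // the bodies are left as placeholders for the Hoodie-specific logic.
    class DefaultSource extends RelationProvider with CreatableRelationProvider {

      // Invoked for spark.read.format("hoodie").load(path)
      override def createRelation(sqlContext: SQLContext,
                                  parameters: Map[String, String]): BaseRelation = {
        // e.g. build a relation over the parquet files of a Hoodie dataset
        ???
      }

      // Invoked for df.write.format("hoodie").mode(...).save(path)
      override def createRelation(sqlContext: SQLContext,
                                  mode: SaveMode,
                                  parameters: Map[String, String],
                                  data: DataFrame): BaseRelation = {
        // e.g. hand the incoming rows to HoodieWriteClient as upsert/insert/bulkInsert
        ???
      }
    }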

@vinothchandar (Member Author)

Here is a brain dump on this.. Feel free to pick a first subset of this list to get started.

Goal:
Ultimately,

  • Subsume most functionality of HoodieWriteClient for saving a dataframe back into a Hoodie dataset, including both storage types (COPY_ON_WRITE, MERGE_ON_READ)
  • Subsume most functionality of HoodieReadClient for running queries on a Hoodie dataset, including all three views (ReadOptimized, RealTime, IncrementalView)

For starters,
Maybe just support COPY_ON_WRITE & ReadOptimized/IncrementalView.
A good test would be that

  • the HoodieDeltaStreamer will be able to use this to write data (all the things that are configurable, like key extractors, schema providers etc., can be provided as options to the datasource)
  • the dataset can be queried incrementally (currently done by HiveIncrementalPuller or the HoodieReadClient::readSince, HoodieReadClient::readCommit methods) or normally via the datasource, like HoodieReadClient::read(paths).

Writing

  • The main thing here is to support modes for upsert, insert, bulkInsert & decide if/how the Spark saveModes map to these: https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
  • The user should be able to specify a key-generator class to control the record key and partition path, as in DeltaStreamer
  • Internally, the payload will just be something (say HoodieRowPayload) that converts a Row to a GenericRecord, so we can probably fix this. Maybe for a first cut, all we do is blindly replace the existing value with the provided record, and we can leave this part pluggable by extending HoodieRowPayload (all of this again stems from HoodieDeltaStreamer). See the conversion sketch after this list.
  • Schema is in the dataframe, so we need not worry about that.
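
A minimal sketch of that Row -> GenericRecord conversion, assuming a flat Avro schema whose field names match the Row's columns; HoodieRowPayload itself and nested-type handling are left out:

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.spark.sql.Row
    import scala.collection.JavaConverters._

    // Sketch only: maps each Avro field to the Row column of the same name.
    // A real HoodieRowPayload would also need nested structs, arrays, maps
    // and Avro/Catalyst type coercion.
    def rowToGenericRecord(row: Row, schema: Schema): GenericRecord = {
      val record = new GenericData.Record(schema)
      for (field <- schema.getFields.asScala) {
        record.put(field.name(), row.getAs[AnyRef](field.name()))
      }
      record
    }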

Reading

  • We can simply wrap the parquet source with additional filtering, like in HoodieReadClient::read, for normal queries. (This already gives a Dataset/DataFrame.)
  • Tricky things are how we support incremental queries and how to specify the commit time for them. A powerful way is to have special columns right in the query, like
    select * from table where a = 1 // this does a normal query 
    select * from table where _mode = incremental and _commit_time_start = "20171020120102" // this is an incremental query
    

Don't know if this makes sense or is doable. Something to ponder. Of course, the easier thing is to specify the mode & commit time as options on the spark.read call, as sketched below.
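
A hedged sketch of that options-based approach; the option keys are hypothetical placeholders, not actual datasource options:

    // `spark` is an existing SparkSession; the option keys are made up for illustration.
    val incremental = spark.read
      .format("hoodie")
      .option("hoodie.datasource.query.mode", "incremental")
      .option("hoodie.datasource.begin.committime", "20171020120102")
      .load("/data/hoodie/trips")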

  • Need to find a way to support read(RDD<HoodieKey> keys), kind of like a point lookup query.. but it may be better to leave that out of the datasource and support it only via the HoodieReadClient

Once we have a basic datasource, we will integrate support for Hive sync, merge-on-read, and a structured streaming sink (in that order). Fun stuff
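
For the eventual structured streaming sink, the user-facing shape would presumably look something like the following; everything here (option keys, paths, the streamingDf source) is an assumption:

    // `streamingDf` is assumed to be a streaming DataFrame (e.g. from spark.readStream);
    // the "hoodie" sink format and the paths are illustrative placeholders.
    val query = streamingDf.writeStream
      .format("hoodie")
      .option("checkpointLocation", "/checkpoints/hoodie/trips")
      .outputMode("append")
      .start("/data/hoodie/trips")
    query.awaitTermination()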

@vinothchandar (Member Author)

@zqureshi here you go.. cc @prazanna

@prazanna prazanna modified the milestone: 0.3.9 Jun 16, 2017
@prazanna prazanna added this to 06/20/2017 in Sprint Jun 20, 2017
@prazanna prazanna moved this from 06/20/2017 to 06/26 - 07/07 in Sprint Jun 26, 2017
@vinothchandar (Member Author) commented Aug 28, 2017

@zqureshi Started something here.. Plan to push a minimal working version soon. We can iterate on that..

vinothchandar@286f38c

thanks to @alunarbeach for some row -> avro conversion code :)
