
Implement HoodieRealTimeInputFormat #42

Closed
vinothchandar opened this issue Jan 6, 2017 · 6 comments
@vinothchandar
Member

This houses the merge-on-read record reader

@vinothchandar
Member Author

Think about supporting multiple delta files, for compaction failures
Fall back to an on-disk merge if the delta file size exceeds a certain limit

@prazanna prazanna self-assigned this Apr 2, 2017
@vinothchandar
Member Author

Agreed on the approach in https://gist.github.com/prazanna/698459049447d8898a9de11e3863e99d; wrapping is the natural way to go.

But I'm planning to approach the reading a little differently:

  1. Call next() on the underlying recordReader to get a new ArrayWritable (we need to add a projection so _hoodie_record_key always comes out)
  2. Somehow turn the original Avro delta logs into a HashMap<_hoodie_record_key, ArrayWritable> (this needs some tricky maneuvers)
  3. Then the logic is pretty simple: we return the record from either 1 or 2
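The three steps above can be sketched in simplified form. This is a hedged sketch, not the actual Hudi implementation: `String` record keys and `String[]` rows stand in for `ArrayWritable`, and the delta map is assumed to be pre-built from the Avro log files (step 2).

```java
import java.util.*;

// Hypothetical simplified merged reader. The real implementation would wrap the
// underlying Parquet RecordReader and build deltaByKey from the Avro delta logs.
class MergedReaderSketch {
    private final Iterator<String[]> baseRecords;   // step 1: the underlying recordReader
    private final Map<String, String[]> deltaByKey; // step 2: _hoodie_record_key -> updated row
    private final int keyFieldIndex;                // position of _hoodie_record_key in each row

    MergedReaderSketch(Iterator<String[]> baseRecords,
                       Map<String, String[]> deltaByKey,
                       int keyFieldIndex) {
        this.baseRecords = baseRecords;
        this.deltaByKey = deltaByKey;
        this.keyFieldIndex = keyFieldIndex;
    }

    // Step 3: for each base row, return the delta version if one exists, else the base row.
    String[] next() {
        if (!baseRecords.hasNext()) {
            return null;
        }
        String[] row = baseRecords.next();
        String key = row[keyFieldIndex];
        return deltaByKey.getOrDefault(key, row);
    }
}
```

The key design point is that the delta record always wins over the base record for the same `_hoodie_record_key`, which is the merge-on-read semantic being described.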

@prazanna
Contributor

prazanna commented Apr 4, 2017

Pasting a scrap of thoughts/impl of the record reader - https://gist.github.com/prazanna/698459049447d8898a9de11e3863e99d

@vinothchandar

@prazanna
Contributor

prazanna commented Apr 4, 2017

Think about supporting multiple delta files, for compaction failures
Fall back to an on-disk merge if the delta file size exceeds a certain limit

Just a note: I guess we are okay to assume that we can merge all the updates in memory with HoodieAvroReader. #134 tracks the disk spill if needed.

@vinothchandar
Member Author

Upon more digging, it all boils down to the following to get the merging right:

  • Projecting only the fields requested in the query off the Avro records

Reading using a sub-schema built off conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR) should do the trick. Need to test with nested records and complex data types. Then compare to Parquet projections, in terms of field order and so forth, and make sure it's all good.
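The sub-schema construction can be sketched as plain field-name filtering. This is a hypothetical simplification: `List<String>` field names stand in for an Avro `Schema`, and the comma-separated string stands in for the value read from the JobConf under `ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR`.

```java
import java.util.*;

// Hypothetical sketch: derive the projected field list from the comma-separated
// requested-columns string, preserving the writer schema's field order (which is
// what the comparison against Parquet projections needs to verify), and always
// including _hoodie_record_key so merging has the key available.
class ProjectionSketch {
    static final String RECORD_KEY_FIELD = "_hoodie_record_key";

    static List<String> projectedFields(List<String> writerSchemaFields,
                                        String readColumnNamesConf) {
        Set<String> requested = new HashSet<>(Arrays.asList(readColumnNamesConf.split(",")));
        requested.add(RECORD_KEY_FIELD); // the extra projection mentioned below
        List<String> projected = new ArrayList<>();
        for (String field : writerSchemaFields) { // keep writer-schema order
            if (requested.contains(field)) {
                projected.add(field);
            }
        }
        return projected;
    }
}
```

For example, with writer fields `[_hoodie_record_key, ts, rider, driver]` and a requested-columns string of `"rider,ts"`, the projected list comes out as `[_hoodie_record_key, ts, rider]`.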

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.avro/avro/1.7.7/org/apache/avro/generic/GenericDatumReader.java#GenericDatumReader.read%28java.lang.Object%2Corg.apache.avro.io.Decoder%29 shows how the data is read in AvroRecordReader.

  • Converting the GenericRecord to the same ArrayWritable

We need to get the field types here and turn the record into an ArrayWritable.
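The conversion reduces to a per-field type dispatch. A hedged pure-Java sketch, with `Object[]` standing in for `ArrayWritable` and string type names standing in for Avro's `Schema.Type` (the real code would emit Hadoop `Text`, `IntWritable`, `LongWritable`, `DoubleWritable`, etc.):

```java
import java.util.*;

// Hypothetical sketch of converting one Avro record's values into a row of
// Writable-style cells, dispatching on the field's Avro type.
class AvroToWritableSketch {
    static Object[] toWritableRow(List<String> fieldTypes, List<Object> avroValues) {
        Object[] row = new Object[avroValues.size()];
        for (int i = 0; i < avroValues.size(); i++) {
            Object v = avroValues.get(i);
            switch (fieldTypes.get(i)) {
                case "STRING":  row[i] = String.valueOf(v); break; // real: new Text(v.toString())
                case "INT":     row[i] = (Integer) v; break;       // real: new IntWritable(...)
                case "LONG":    row[i] = (Long) v; break;          // real: new LongWritable(...)
                case "DOUBLE":  row[i] = (Double) v; break;        // real: new DoubleWritable(...)
                case "BOOLEAN": row[i] = (Boolean) v; break;       // real: new BooleanWritable(...)
                default:
                    throw new IllegalArgumentException("unhandled type: " + fieldTypes.get(i));
            }
        }
        return row;
    }
}
```

Nested records and complex types (the cases flagged above for testing) would need recursive handling here, e.g. a nested record becoming a nested array cell.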

  • The extra projection added for _hoodie_record_key should be removed in the end

This should have zero effect on the changes above.

@prazanna prazanna modified the milestone: 0.3.8 Jun 15, 2017