
Implement HoodieRealTimeInputFormat #42

Closed
vinothchandar opened this issue Jan 6, 2017 · 6 comments
@vinothchandar
Member

This houses the merge-on-read record reader

@vinothchandar
Member Author

Think about supporting multiple delta files, for compaction failures
Fall back to an on-disk merge if the delta file size exceeds a certain limit

@prazanna prazanna self-assigned this Apr 2, 2017
@vinothchandar
Member Author

Agreed on the approach in https://gist.github.com/prazanna/698459049447d8898a9de11e3863e99d; wrapping is the natural way to go.

But I'm planning to approach the reading a little differently:

  1. Call next() on the underlying recordReader to get a new ArrayWritable (we need to add a projection so _hoodie_record_key always comes out)
  2. Somehow turn the original Avro delta logs into a HashMap<_hoodie_record_key, ArrayWritable> (this needs some tricky maneuvers)
  3. Then the logic is pretty simple: we return the record from either 1 or 2
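The three steps above can be sketched in simplified form. This is a hedged sketch, not the actual Hudi implementation: `String` record keys and `String[]` rows stand in for `ArrayWritable`, and the delta map is assumed to be pre-built from the Avro log files (step 2).

```java
import java.util.*;

// Hypothetical simplified merged reader. The real implementation would wrap the
// underlying Parquet RecordReader and build deltaByKey from the Avro delta logs.
class MergedReaderSketch {
    private final Iterator<String[]> baseRecords;   // step 1: the underlying recordReader
    private final Map<String, String[]> deltaByKey; // step 2: _hoodie_record_key -> updated row
    private final int keyFieldIndex;                // position of _hoodie_record_key in each row

    MergedReaderSketch(Iterator<String[]> baseRecords,
                       Map<String, String[]> deltaByKey,
                       int keyFieldIndex) {
        this.baseRecords = baseRecords;
        this.deltaByKey = deltaByKey;
        this.keyFieldIndex = keyFieldIndex;
    }

    // Step 3: for each base row, return the delta version if one exists, else the base row.
    String[] next() {
        if (!baseRecords.hasNext()) {
            return null;
        }
        String[] row = baseRecords.next();
        String key = row[keyFieldIndex];
        return deltaByKey.getOrDefault(key, row);
    }
}
```

The key design point is that the delta record always wins over the base record for the same `_hoodie_record_key`, which is the merge-on-read semantic being described.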

@prazanna
Contributor

prazanna commented Apr 4, 2017

Pasting a scrap of thoughts/impl of the record reader - https://gist.github.com/prazanna/698459049447d8898a9de11e3863e99d

@vinothchandar

@prazanna
Contributor

prazanna commented Apr 4, 2017

Think about supporting multiple delta files, for compaction failures
Fall back to an on-disk merge if the delta file size exceeds a certain limit

Just a note: I guess we are okay to assume that we can merge all the updates in memory with HoodieAvroReader. #134 tracks the disk spill if needed.

@vinothchandar
Member Author

Upon more digging, it all boils down to the following to get the merging right:

  • Projecting only the fields requested in the query off the Avro records

Reading using a sub-schema built off conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR) should do the trick. Need to test with nested records and complex data types. Then compare to Parquet projections, in terms of field order and so forth, and make sure it's all good.
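The sub-schema construction can be sketched as plain field-name filtering. This is a hypothetical simplification: `List<String>` field names stand in for an Avro `Schema`, and the comma-separated string stands in for the value read from the JobConf under `ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR`.

```java
import java.util.*;

// Hypothetical sketch: derive the projected field list from the comma-separated
// requested-columns string, preserving the writer schema's field order (which is
// what the comparison against Parquet projections needs to verify), and always
// including _hoodie_record_key so merging has the key available.
class ProjectionSketch {
    static final String RECORD_KEY_FIELD = "_hoodie_record_key";

    static List<String> projectedFields(List<String> writerSchemaFields,
                                        String readColumnNamesConf) {
        Set<String> requested = new HashSet<>(Arrays.asList(readColumnNamesConf.split(",")));
        requested.add(RECORD_KEY_FIELD); // the extra projection mentioned below
        List<String> projected = new ArrayList<>();
        for (String field : writerSchemaFields) { // keep writer-schema order
            if (requested.contains(field)) {
                projected.add(field);
            }
        }
        return projected;
    }
}
```

For example, with writer fields `[_hoodie_record_key, ts, rider, driver]` and a requested-columns string of `"rider,ts"`, the projected list comes out as `[_hoodie_record_key, ts, rider]`.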

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.avro/avro/1.7.7/org/apache/avro/generic/GenericDatumReader.java#GenericDatumReader.read%28java.lang.Object%2Corg.apache.avro.io.Decoder%29 shows how the data is read in AvroRecordReader.

  • Converting the GenericRecord to the same ArrayWritable

We need to get the field types here and turn the record into an ArrayWritable.
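The conversion reduces to a per-field type dispatch. A hedged pure-Java sketch, with `Object[]` standing in for `ArrayWritable` and string type names standing in for Avro's `Schema.Type` (the real code would emit Hadoop `Text`, `IntWritable`, `LongWritable`, `DoubleWritable`, etc.):

```java
import java.util.*;

// Hypothetical sketch of converting one Avro record's values into a row of
// Writable-style cells, dispatching on the field's Avro type.
class AvroToWritableSketch {
    static Object[] toWritableRow(List<String> fieldTypes, List<Object> avroValues) {
        Object[] row = new Object[avroValues.size()];
        for (int i = 0; i < avroValues.size(); i++) {
            Object v = avroValues.get(i);
            switch (fieldTypes.get(i)) {
                case "STRING":  row[i] = String.valueOf(v); break; // real: new Text(v.toString())
                case "INT":     row[i] = (Integer) v; break;       // real: new IntWritable(...)
                case "LONG":    row[i] = (Long) v; break;          // real: new LongWritable(...)
                case "DOUBLE":  row[i] = (Double) v; break;        // real: new DoubleWritable(...)
                case "BOOLEAN": row[i] = (Boolean) v; break;       // real: new BooleanWritable(...)
                default:
                    throw new IllegalArgumentException("unhandled type: " + fieldTypes.get(i));
            }
        }
        return row;
    }
}
```

Nested records and complex types (the cases flagged above for testing) would need recursive handling here, e.g. a nested record becoming a nested array cell.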

  • The extra projection added for _hoodie_record_key should be removed in the end

This should have zero effect on the changes above.

@prazanna prazanna modified the milestone: 0.3.8 Jun 15, 2017