
Support multiple file format in HDFS (CSV, Parquet... in addition to JSON) #303

Closed
frbattid opened this issue Feb 5, 2015 · 3 comments


frbattid commented Feb 5, 2015

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. (http://parquet.incubator.apache.org/)

The current cygnusagent.sinks.hdfs-sink.attr_persistence parameter could be extended to accept not only row and column but parquet as well.

Going further, the row and column modes could be renamed to json-row and json-column respectively, and cygnusagent.sinks.hdfs-sink.attr_persistence could be renamed to cygnusagent.sinks.hdfs-sink.file_format.

These changes would pave the way for additional file formats in the future.
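A minimal configuration sketch of how the renamed parameter might look (the file_format name and the value set below follow this proposal and are assumptions, not an agreed configuration):

# hypothetical renamed parameter; json-row and json-column would replace the current row and column values
cygnusagent.sinks.hdfs-sink.file_format = json-row
# other values this parameter could eventually accept: csv-row, csv-column, parquet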

frbattid (Member Author) commented:

As discussed offline with @elenatid, another file format could be CSV, in both row and column modes.

@frbattid frbattid changed the title Support Parquet file format in HDFS Support multiple file format in HDFS (CSV, Parquet... in addition to JSON) Jun 25, 2015
@frbattid frbattid added this to the release/0.9.0 milestone Jun 29, 2015
@frbattid frbattid self-assigned this Jun 29, 2015
frbattid (Member Author) commented:

Regarding CSV persistence, a problem arises when dealing with the attributes' metadata.

Metadata is a field that cannot be easily serialized as CSV because its value may be an array of objects. Thus, the CSV records may end up with different lengths, which is a problem for Hive tables.

The following solution was discussed with @elenatid: linking from the metadata field to a new HDFS file containing one record per object within the array. This link is simply the path of that file, for instance:

# content of /user/frb/def_serv/def_serv_path/room1_room/room1_room.txt:
23-05-2015T06:37:41,1230923875,room1,room,temperature,float,23.5,/user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt

# content of /user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
1230923875,timestamp_meassure,long,1230923874
1230923875,timestamp_sent,long,1230923875

The recvTimeTs value within /user/frb/def_serv/def_serv_path/room1_room/room1_room.txt is used as a key for finding the metadata associated with each record. That is, if a second update is notified, we will have:

# content of /user/frb/def_serv/def_serv_path/room1_room/room1_room.txt:
23-05-2015T06:37:41,1230923875,room1,room,temperature,float,23.9,/user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
23-05-2015T06:37:51,1230933870,room1,room,temperature,float,24.1,/user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt

# content of /user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
1230923875,timestamp_meassure,long,1230923874
1230923875,timestamp_sent,long,1230923875
1230933870,timestamp_meassure,long,1230933869
1230933870,timestamp_sent,long,1230933870

As said, this also works when the metadata length changes:

# content of /user/frb/def_serv/def_serv_path/room1_room/room1_room.txt:
23-05-2015T06:37:41,1230923875,room1,room,temperature,float,23.9,/user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
23-05-2015T06:37:51,1230933870,room1,room,temperature,float,24.1,/user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
23-05-2015T06:38:01,1230943871,room1,room,temperature,float,24.2,/user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt

# content of /user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
1230923875,timestamp_meassure,long,1230923874
1230923875,timestamp_sent,long,1230923875
1230933870,timestamp_meassure,long,1230933869
1230933870,timestamp_sent,long,1230933870
1230943871,timestamp_meassure,long,1230943870
1230943871,timestamp_sent,long,1230943871
1230943871,verified,boolean,true

For any other attribute, we would have /user/frb/def_serv/def_serv_path/room1_room_<attr_name>_md/room1_room_<attr_name>_md.txt.

The mechanism is valid for both row and column modes.
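As a rough illustration, here is a minimal Java sketch of how a notified attribute could be serialized into the CSV data record plus its linked metadata records, keyed by recvTimeTs. Class and method names are hypothetical, not the actual Cygnus code:

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the proposed CSV-with-linked-metadata persistence.
public class CsvRowSerializer {

    // Builds the CSV record for the main data file (csv-row mode):
    // recvTime,recvTimeTs,entityId,entityType,attrName,attrType,attrValue,mdFile
    public static String buildDataRecord(String recvTime, long recvTimeTs, String entityId,
            String entityType, String attrName, String attrType, String attrValue, String mdFilePath) {
        return recvTime + "," + recvTimeTs + "," + entityId + "," + entityType + ","
                + attrName + "," + attrType + "," + attrValue + "," + mdFilePath;
    }

    // Builds one CSV record per metadata object, all keyed by the same recvTimeTs:
    // recvTimeTs,mdName,mdType,mdValue
    public static List<String> buildMetadataRecords(long recvTimeTs, List<String[]> metadata) {
        List<String> records = new ArrayList<>();
        for (String[] md : metadata) {
            // md[0]=name, md[1]=type, md[2]=value
            records.add(recvTimeTs + "," + md[0] + "," + md[1] + "," + md[2]);
        }
        return records;
    }

    public static void main(String[] args) {
        String mdFile = "/user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt";
        // Data record written to /user/frb/def_serv/def_serv_path/room1_room/room1_room.txt
        System.out.println(buildDataRecord("23-05-2015T06:37:41", 1230923875L, "room1", "room",
                "temperature", "float", "23.5", mdFile));
        // Metadata records written to the linked metadata file
        List<String[]> metadata = new ArrayList<>();
        metadata.add(new String[]{"timestamp_meassure", "long", "1230923874"});
        metadata.add(new String[]{"timestamp_sent", "long", "1230923875"});
        for (String record : buildMetadataRecords(1230923875L, metadata)) {
            System.out.println(record);
        }
    }
}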

frbattid (Member Author) commented:

Implemented in PR #492
