Support multiple file formats in HDFS (CSV, Parquet... in addition to JSON) #303
As discussed offline with @elenatid, another file format could be CSV, in both row and column modes.
Regarding CSV persistence, a problem arises when dealing with the attribute's metadata. Metadata cannot be easily serialized as CSV because its value may be an array of objects; the CSV records could thus end up with different lengths, which is a problem for Hive tables. The following solution was discussed with @elenatid: link from the metadata field to a new HDFS file containing one record for each object within the array. This link would simply be the name of that file.
As said, this works even if the metadata length changes: for any other attribute, the same kind of link file would be used. The mechanism is valid for both row and column modes.
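The linking idea above can be sketched as follows. This is a minimal illustration, not Cygnus code: the record layout, the metadata file naming scheme, and the `attr_to_csv` helper are all assumptions made for the example.

```python
import csv
import io

def attr_to_csv(entity_id, attr_name, attr_value, attr_metadata):
    """Serialize one attribute as a fixed-width CSV record, spilling the
    variable-length metadata array into a separate file and storing only
    that file's name in the record (hypothetical naming, not Cygnus API)."""
    # Hypothetical per-attribute metadata file name; real Cygnus paths differ.
    md_file = f"{entity_id}_{attr_name}_md.csv"
    # One record per object in the metadata array, so the array can grow or
    # shrink without changing the width of the main CSV record.
    md_buf = io.StringIO()
    md_writer = csv.writer(md_buf)
    for md in attr_metadata:
        md_writer.writerow([md["name"], md["type"], md["value"]])
    # The main record keeps a fixed width: the metadata column is just the link.
    rec_buf = io.StringIO()
    csv.writer(rec_buf).writerow([entity_id, attr_name, attr_value, md_file])
    return rec_buf.getvalue().strip(), md_file, md_buf.getvalue().strip()

record, link, md_records = attr_to_csv(
    "Room1", "temperature", "26.5",
    [{"name": "accuracy", "type": "float", "value": "0.1"}],
)
```

Hive can then read the main file as a fixed-schema table, while the per-attribute metadata files remain queryable on their own.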
Implemented in PR #492
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. (http://parquet.incubator.apache.org/)
The current `cygnusagent.sinks.hdfs-sink.attr_persistence` parameter can be extended to accept not only `row` and `column` but `parquet` as well. The `row` and `column` modes could even be renamed to `json-row` and `json-column`, respectively, and `cygnusagent.sinks.hdfs-sink.attr_persistence` could also be renamed to `cygnusagent.sinks.hdfs-sink.file_format`. These changes would allow for further file formats in the future.
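A sketch of how the renamed parameter might look in the agent configuration file. Note that `file_format` and the value list are proposals from this issue, not options that exist in Cygnus today:

```properties
# Proposed rename: attr_persistence -> file_format
# Proposed values: json-row, json-column, csv-row, csv-column, parquet
cygnusagent.sinks.hdfs-sink.file_format = parquet
```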