Implement a parquet parser and printer
#4116
Conversation
Initial pass over the buffer output stream. Noticed style issues, header formatting issues, and a potential change to simplify the buffer output stream logic.
We reviewed this together. This is a really great piece of work, and I'm looking forward to having it soon.
modify .gitignore
Improve naming conventions and remove comments
Clean up naming
Handle errors in parquet printer instance construction
Move `contiguous_buffer_stream` into the `parquet` plugin
Fix parquet round trip errors and start test file from feather tests
modify gitignore
…omments for stream API.
Reduce filesize of feather test outputs
Modify feather tests to account for batching
Update parquet test with correct formatting
Modify parquet tests to remove feather-only python script
Modify tests to remove local file dependence
Format whitespace test.bats
Modify feather and parquet tests to not rely on local user system
We developed a plugin for Tenzir that parses and prints data in the Parquet format, analogous to existing formats such as JSON, feather, and YAML. The parser and printer make Parquet available through operators such as `write`, `to`, and `from`. Parquet requires a single schema. The parser works in a blocking fashion: it takes an upstream generator of `chunk_ptr` (raw bytes) as input and returns a generator of `table_slice` for downstream operators. Internally, the parser collects the complete Parquet input, including its metadata, converts it into record batches, and `co_yield`s them as table slices.
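The blocking control flow can be sketched in Python (the actual plugin is C++ using Arrow; `parse_parquet`, the byte-slicing stand-in for record-batch conversion, and the fixed slice size are all illustrative assumptions, not the plugin's API):

```python
from typing import Iterator

def parse_parquet(chunks: Iterator[bytes]) -> Iterator[bytes]:
    """Blocking parser shape: Parquet keeps its metadata in the file
    footer, so the whole input must be buffered before parsing."""
    buffer = bytearray()
    for chunk in chunks:  # drain the upstream generator entirely first
        buffer.extend(chunk)
    # Stand-in for converting the complete input into record batches:
    # split the buffer into fixed-size slices and yield them downstream.
    slice_size = 4
    for i in range(0, len(buffer), slice_size):
        yield bytes(buffer[i:i + slice_size])
```

Note that nothing is yielded until the input generator is exhausted, which is what distinguishes this parser from a streaming one.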
In contrast, the printer is a streaming class that takes a generator of table slices as input and outputs a generator of chunks. Table slices are converted to record batches and written as Parquet, with the Parquet footer written at the end based on the metadata accumulated over the stream. To produce valid metadata, we derived a new `arrow::io::OutputStream` that behaves like a file for metadata purposes but like a stream for memory efficiency. The incoming table slices are written to a byte buffer, which is `co_yield`ed at the end of each row group.
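The key property of that derived stream can be modeled in a few lines of Python (a hypothetical sketch, not the plugin's `contiguous_buffer_stream`; the class and method names here are assumptions): the reported position must grow like a file offset so the footer's metadata offsets stay valid, while the buffer itself is handed off and released after each yield.

```python
class ContiguousBufferStream:
    """File-for-metadata, stream-for-memory output buffer."""

    def __init__(self) -> None:
        self._buffer = bytearray()
        self._total_written = 0  # logical file offset, never reset

    def write(self, data: bytes) -> None:
        self._buffer.extend(data)
        self._total_written += len(data)

    def tell(self) -> int:
        # Reports the cumulative offset, not the current buffer size,
        # so offsets recorded in the Parquet footer remain valid.
        return self._total_written

    def take(self) -> bytes:
        # Hand the accumulated bytes downstream and release the memory.
        chunk = bytes(self._buffer)
        self._buffer.clear()
        return chunk
```

After `take()` drains the buffer, `tell()` still reports the full offset, which is the behavior the Parquet writer needs from its output stream.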
Change log: The `parquet` format now reads and writes Arrow IPC streams and Parquet files. The format is streamable for `write parquet`.