Implement a parquet parser and printer #4116

Merged
merged 19 commits into from Apr 22, 2024


balavinaithirthan

@balavinaithirthan balavinaithirthan commented Apr 12, 2024

We developed a plugin for Tenzir that parses and prints data in the Parquet format, similar to existing formats such as JSON, Feather, and YAML. The parser and printer are available through operators such as write, to, and from. Parquet requires a single schema. The parser is blocking: it receives an upstream generator of chunk_ptr (the byte representation) as input and returns a generator of table_slice for downstream operators. Within the parser, the complete Parquet input, including its metadata, is collected and converted into record batches before being co_yielded as a table_slice.
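Why the parser has to block can be illustrated with a small sketch. A Parquet file begins and ends with the magic bytes PAR1, and the 4 bytes before the trailing magic hold the little-endian length of the footer metadata, so the metadata is only reachable once the last byte has arrived. The code below is not taken from the PR: chunk_bytes, collect_chunks, has_parquet_magic, and footer_length are hypothetical names, and plain byte vectors stand in for chunk_ptr.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the bytes carried by one upstream chunk_ptr.
using chunk_bytes = std::vector<std::uint8_t>;

// Concatenates all upstream chunks into one contiguous buffer. This is the
// blocking step: nothing is parsed until the last chunk has been received.
inline std::vector<std::uint8_t>
collect_chunks(const std::vector<chunk_bytes>& chunks) {
  std::vector<std::uint8_t> file;
  for (const auto& chunk : chunks) {
    file.insert(file.end(), chunk.begin(), chunk.end());
  }
  return file;
}

// Returns true if the buffer starts and ends with the Parquet magic "PAR1".
inline bool has_parquet_magic(const std::vector<std::uint8_t>& file) {
  constexpr char magic[] = {'P', 'A', 'R', '1'};
  // Leading magic (4) + footer length field (4) + trailing magic (4).
  if (file.size() < 12) {
    return false;
  }
  return std::memcmp(file.data(), magic, 4) == 0
         && std::memcmp(file.data() + file.size() - 4, magic, 4) == 0;
}

// Reads the little-endian footer length stored right before the trailing
// magic. The footer itself occupies the bytes directly in front of it.
inline std::uint32_t footer_length(const std::vector<std::uint8_t>& file) {
  assert(has_parquet_magic(file));
  const std::uint8_t* p = file.data() + file.size() - 8;
  return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

In this sketch a caller would first run collect_chunks over everything the upstream generator produced and only then locate the footer, which mirrors the collect-then-convert order described above.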
In contrast, the printer is streaming: it takes a generator of table_slice as input and outputs a generator of chunks. Table slices are converted to record batches and written as Parquet, and the Parquet footer is written at the end based on the metadata accumulated from the stream. To create valid metadata, we derived a new arrow::io::OutputStream that acts as a file for metadata purposes and as a stream for memory efficiency. The incoming table slices are written to a byte buffer, which is co_yielded at the end of a row.
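The idea of an output stream that behaves like a file for metadata purposes but like a stream for memory can be sketched without Arrow. The class below is a hypothetical illustration, not the contiguous_buffer_stream from this PR and not an arrow::io::OutputStream subclass: position_tracking_buffer and its member names are assumptions. It keeps a running position across all writes, which is what footer offsets need, while letting the caller drain the buffered bytes at any time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: tracks the logical file position across all writes,
// but only holds the bytes written since the last purge.
class position_tracking_buffer {
public:
  // Appends bytes to the buffer and advances the logical file position.
  void write(const void* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    const auto* bytes = static_cast<const std::uint8_t*>(data);
    buffer_.insert(buffer_.end(), bytes, bytes + size);
    position_ += static_cast<std::int64_t>(size);
  }

  // Total number of bytes ever written, as a file would report it.
  std::int64_t tell() const {
    return position_;
  }

  // Hands the buffered bytes to the caller and empties the buffer.
  // The logical position is deliberately left untouched.
  std::vector<std::uint8_t> purge() {
    return std::exchange(buffer_, {});
  }

  // Number of bytes currently held in memory.
  std::size_t buffered() const {
    return buffer_.size();
  }

private:
  std::vector<std::uint8_t> buffer_;
  std::int64_t position_ = 0;
};
```

Under these assumptions, a printer could call purge after writing each batch and pass the returned bytes downstream, while offsets recorded for the footer would still be computed from tell.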

Change log: The parquet format now reads and writes Arrow IPC streams and Parquet files. Output via write parquet is streamed.


@balavinaithirthan balavinaithirthan left a comment


Initial pass over the buffer output stream. Noticed style issues, header formatting issues, and a potential change to simplify the buffer output stream logic.

Resolved review threads:
libtenzir/src/contiguous_buffer_stream.cpp (6 threads)
plugins/parquet/parquet.cpp (6 threads)
libtenzir/include/tenzir/contiguous_buffer_stream.hpp (2 threads)
@dominiklohmann dominiklohmann changed the title Topic/parquet printer and parser Implement a parquet parser and printer Apr 12, 2024
@balavinaithirthan balavinaithirthan marked this pull request as ready for review April 19, 2024 11:02
@balavinaithirthan balavinaithirthan force-pushed the topic/parquetPrinterAndParser branch 3 times, most recently from 7f1f94f to bb55300 Compare April 19, 2024 13:09
@balavinaithirthan balavinaithirthan added the labels format (Parser and printer) and improvement (An incremental enhancement of an existing feature) Apr 19, 2024
Resolved review threads:
.gitignore (1 thread)
changelog/next/changes/4116--parquet.format.md (1 thread)
plugins/parquet/integration/tests/tests.bats (1 thread)
plugins/parquet/parquet.cpp (6 threads)

@dominiklohmann dominiklohmann left a comment


We reviewed this together. This is a really great piece of work, and I'm looking forward to having it soon.

@balavinaithirthan balavinaithirthan force-pushed the topic/parquetPrinterAndParser branch 2 times, most recently from ef81bba to ddf98f9 Compare April 22, 2024 11:35
Reduce filesize of feather test outputs

Modify feather tests to account for batching

Update parquet test with correct formatting

Modify parquet tests to remove feather-only python script
Modify tests to remove local file dependence

Format whitespace test.bats

Modify feather and parquet tests to not rely on local user system
@balavinaithirthan balavinaithirthan merged commit e160e01 into main Apr 22, 2024
50 checks passed
@balavinaithirthan balavinaithirthan deleted the topic/parquetPrinterAndParser branch April 22, 2024 13:15
Labels
format (Parser and printer), improvement (An incremental enhancement of an existing feature)
2 participants