Implement a parquet parser and printer
#4116
Conversation
Initial pass over the buffer output stream. Noticed style issues, header formatting issues, and a potential change to simplify the buffer output stream logic.
We reviewed this together. This is a really great piece of work, and I'm looking forward to having it soon.
modify .gitignore
Improve naming conventions and remove comments
Clean up naming
Handle errors in parquet printer instance construction
Move `contiguous_buffer_stream` into the `parquet` plugin
Fix parquet round trip errors and start test file from feather tests
modify gitignore
…omments for stream API.
Reduce filesize of feather test outputs
Modify feather tests to account for batching
Update parquet test with correct formatting
Modify parquet tests to remove feather-only python script
Modify tests to remove local file dependence
Format whitespace test.bats
Modify feather and parquet tests to not rely on local user system
We developed a plugin for Tenzir that parses and prints data in the Parquet format, analogous to existing formats such as JSON, feather, and YAML. The parser and printer make Parquet available through operators such as `write`, `to`, and `from`. Parquet requires a single schema. The parser works in a blocking fashion: it takes an upstream generator of `chunk_ptr` (raw bytes) as input and returns a generator of `table_slice` for downstream operators. Internally, the parser collects the complete Parquet input, including its metadata, converts it into record batches, and `co_yield`s them as table slices.
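The blocking control flow can be sketched in Python (the actual plugin is C++ using Arrow; `parse_parquet`, the byte-slicing stand-in for record-batch conversion, and the fixed slice size are all illustrative assumptions, not the plugin's API):

```python
from typing import Iterator

def parse_parquet(chunks: Iterator[bytes]) -> Iterator[bytes]:
    """Blocking parser shape: Parquet keeps its metadata in the file
    footer, so the whole input must be buffered before parsing."""
    buffer = bytearray()
    for chunk in chunks:  # drain the upstream generator entirely first
        buffer.extend(chunk)
    # Stand-in for converting the complete input into record batches:
    # split the buffer into fixed-size slices and yield them downstream.
    slice_size = 4
    for i in range(0, len(buffer), slice_size):
        yield bytes(buffer[i:i + slice_size])
```

Note that nothing is yielded until the input generator is exhausted, which is what distinguishes this parser from a streaming one.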
In contrast, the printer is a streaming class that takes a generator of table slices as input and outputs a generator of chunks. Table slices are converted to record batches and written as Parquet, with the Parquet footer written at the end based on the metadata accumulated over the stream. To produce valid metadata, we derived a new `arrow::io::OutputStream` that behaves like a file for metadata purposes but like a stream for memory efficiency. The incoming table slices are written to a byte buffer, which is `co_yield`ed at the end of each row group.
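The key property of that derived stream can be modeled in a few lines of Python (a hypothetical sketch, not the plugin's `contiguous_buffer_stream`; the class and method names here are assumptions): the reported position must grow like a file offset so the footer's metadata offsets stay valid, while the buffer itself is handed off and released after each yield.

```python
class ContiguousBufferStream:
    """File-for-metadata, stream-for-memory output buffer."""

    def __init__(self) -> None:
        self._buffer = bytearray()
        self._total_written = 0  # logical file offset, never reset

    def write(self, data: bytes) -> None:
        self._buffer.extend(data)
        self._total_written += len(data)

    def tell(self) -> int:
        # Reports the cumulative offset, not the current buffer size,
        # so offsets recorded in the Parquet footer remain valid.
        return self._total_written

    def take(self) -> bytes:
        # Hand the accumulated bytes downstream and release the memory.
        chunk = bytes(self._buffer)
        self._buffer.clear()
        return chunk
```

After `take()` drains the buffer, `tell()` still reports the full offset, which is the behavior the Parquet writer needs from its output stream.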
Change log: The `parquet` format now reads and writes Arrow IPC streams and Parquet files. The format is streamable for `write parquet`.