Write support for Parquet (low-level writes) #116

sadikovi · 2018-05-09T02:39:38Z

This is an RFC for write support in this crate.

The prototype is in PR #127. The API closely resembles read support. See the design overview and implementation details in the comments below.

Sub-tasks:

Add writer properties (Add WriterProperties to configure file writer #128).
Extend and refactor IO module to include Thrift related streams/buffers and file sink (Update Thrift IO structs, add file sink #129).
Add schema type conversion to Thrift (Add to_thrift conversion for schema type, fix inconsistencies for unset properties #130).
Add metadata conversion to Thrift (Add to_thrift conversion for ColumnChunkMetaData and RowGroupMetaData #131).
Add page writer (Add page writer #133).
Add column writer (Add column writer #138).
Add row group writer and file writer (Add file writer and row group writer #149).
Update all crate documentation with regard to writes (Update documentation for write support #157).
~~Add reader properties~~ Update the batch size (Add reader properties #161).

sadikovi · 2018-05-09T02:44:36Z

@sunchao Would you mind commenting on this issue? I would like to know your opinion and any suggestions you have on the write support and overall direction, or if there is anything that I missed.

I am happy to take the ownership of this work and my original plan was exploring this in my branch and creating incremental PRs to support write functionality.

Let me know what you think and if there are any changes you want to make. Thanks!

sunchao · 2018-05-09T05:18:25Z

Thanks @sadikovi, this will be a great feature to have. I'm not very familiar with the write path so far but will take a look at the existing implementations and add my input. I do suggest we make this an umbrella issue and break the task into pieces (just discovered this feature in GitHub).

sadikovi · 2018-05-31T07:10:35Z

Sorry for the delay. I can finally start working on this; I will experiment in my branch and see what happens.

sunchao · 2018-05-31T07:27:58Z

No worries at all! I was also busy with some other stuff recently and haven't done much at all on parquet-rs... I do plan to start looking at arrow-parquet integration as well as understanding how write can be implemented for parquet-rs.

sadikovi · 2018-07-01T19:46:20Z

Initial write support is proposed in #127 and is considered WIP at the moment. The PR implements a low-level write API, which means we provide some basic traits and structs for the user to write data: FileWriter, RowGroupWriter, ColumnWriter, PageWriter. The user must take care of converting nested records into values, definition levels, and repetition levels.
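To make the "values, definition and repetition levels" requirement concrete, here is a minimal sketch of the conversion a caller would do by hand for two simple column shapes. The `Shredded` struct and both helper functions are hypothetical names invented for this illustration and are not part of the PR; the level rules follow the usual Parquet encoding for a top-level `optional` field (max definition level 1) and a top-level `repeated` field (max definition level 1, max repetition level 1).

```rust
/// Values plus levels for one leaf column (hypothetical helper type,
/// not part of the crate).
#[derive(Debug, PartialEq)]
struct Shredded {
    values: Vec<i32>,
    def_levels: Vec<i16>,
    rep_levels: Vec<i16>,
}

/// `optional int32` column: a null is recorded as definition level 0
/// and contributes no value; a present value gets definition level 1.
/// There is no repetition, so `rep_levels` stays empty.
fn shred_optional(rows: &[Option<i32>]) -> Shredded {
    let mut out = Shredded { values: vec![], def_levels: vec![], rep_levels: vec![] };
    for row in rows {
        match row {
            Some(v) => {
                out.values.push(*v);
                out.def_levels.push(1);
            }
            None => out.def_levels.push(0),
        }
    }
    out
}

/// `repeated int32` column: the first element of each record starts at
/// repetition level 0, every following element of the same record gets 1.
/// An empty record is a single (def = 0, rep = 0) entry with no value.
fn shred_repeated(rows: &[Vec<i32>]) -> Shredded {
    let mut out = Shredded { values: vec![], def_levels: vec![], rep_levels: vec![] };
    for row in rows {
        if row.is_empty() {
            out.def_levels.push(0);
            out.rep_levels.push(0);
            continue;
        }
        for (i, v) in row.iter().enumerate() {
            out.values.push(*v);
            out.def_levels.push(1);
            out.rep_levels.push(if i == 0 { 0 } else { 1 });
        }
    }
    out
}
```

Note that the level vectors can be longer than the value vector: nulls and empty records occupy a level slot but no value slot. Deeper nesting raises the maximum levels but follows the same idea.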

Unfortunately, the PR also contains other code (metadata conversion, etc.), which might complicate the review process. Below I give an overview of the design and how it all works, then list gotchas and limitations.

Overview

You can have a look at src/bin/parquet-write.rs. It contains simple code that shows the workflow; below is an explanation of what each interface does.

The user creates a FileWriter from a file, an input schema, and writer properties. We assume that the file is newly created and nothing has been written to it. Currently this is not enforced; let me know if it is required.

Each FileWriter can write 0 or more row groups. For this, the user asks for a new RowGroupWriter. Note that we return the actual struct here, not a reference. This is done to ease lifetime problems when using the API, but it creates other problems, such as tracking row groups. I added some simple code for that, but I am not sure it is enough to assume that users will follow the convention.

All writes are sequential: once the user asks for a row group, she cannot write data in parallel; it must be row group by row group. Each RowGroupWriter gives access to a certain number of ColumnWriters, which is determined by the number of leaf nodes in the schema.

The user is expected to write column by column; in fact, this is enforced in the implementation. Every time the user asks for a new column, we automatically close the previous one. The user does not need to close the column writer; everything is closed automatically under the hood (this applies to row groups as well).
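The "ask for the next column, previous one is closed for you" behaviour can be sketched with a toy model. This is not the code from the PR: `ToyRowGroup` and its methods are hypothetical, columns are plain `Vec<i32>` buffers, and handing out a `&mut` borrow is just one possible way to enforce that only one column is open at a time.

```rust
/// Toy row group that hands out one column buffer at a time
/// (hypothetical model, not the PR's RowGroupWriter).
struct ToyRowGroup {
    num_columns: usize,
    next: usize,            // index of the next column to hand out
    open: Option<Vec<i32>>, // the column currently being written, if any
    closed: Vec<Vec<i32>>,  // finished columns, in write order
}

impl ToyRowGroup {
    fn new(num_columns: usize) -> Self {
        ToyRowGroup { num_columns, next: 0, open: None, closed: Vec::new() }
    }

    /// Closes the currently open column (if any) and opens the next one.
    /// Returns `None` once every column has been handed out.
    fn next_column(&mut self) -> Option<&mut Vec<i32>> {
        if let Some(done) = self.open.take() {
            self.closed.push(done);
        }
        if self.next == self.num_columns {
            return None;
        }
        self.next += 1;
        self.open = Some(Vec::new());
        self.open.as_mut()
    }

    /// Closes the row group, flushing a still-open column automatically.
    fn close(mut self) -> Vec<Vec<i32>> {
        if let Some(done) = self.open.take() {
            self.closed.push(done);
        }
        self.closed
    }
}
```

Because each call to `next_column` mutably borrows the row group, the compiler rejects any attempt to hold two columns at once, so in this model the column-by-column order is enforced at compile time rather than by runtime checks.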

Technically, RowGroupWriter creates ColumnWriter(PageWriter). PageWriter is responsible for low-level writes of pages into the sink (Write + Position). We write CompressedPage, which mirrors the Page enum and lets us store the compressed buffer together with the uncompressed length. The page writer also maintains several metrics, as does the row group writer.
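A rough sketch of the CompressedPage idea: wrap a page whose buffer has already been compressed and remember the original length alongside it. The field names, the two-variant `Page` enum, and the `toy_rle` codec are assumptions made for this example; the toy run-length encoder only stands in for a real compression codec so the snippet needs no external crates.

```rust
/// Simplified page enum (hypothetical; the real one carries more fields).
#[derive(Debug, Clone, PartialEq)]
enum Page {
    Data { buf: Vec<u8>, num_values: u32 },
    Dictionary { buf: Vec<u8>, num_values: u32 },
}

impl Page {
    /// Borrows the page's byte buffer regardless of the variant.
    fn buf(&self) -> &[u8] {
        match self {
            Page::Data { buf, .. } | Page::Dictionary { buf, .. } => buf,
        }
    }
}

/// A page with a compressed buffer plus the length before compression.
struct CompressedPage {
    compressed_page: Page,
    uncompressed_size: usize,
}

impl CompressedPage {
    fn compressed_size(&self) -> usize {
        self.compressed_page.buf().len()
    }
}

/// Toy run-length encoder emitting (run length, byte) pairs.
/// It stands in for a real codec and caps each run at 255.
fn toy_rle(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let byte = input[i];
        let mut run = 1usize;
        while i + run < input.len() && input[i + run] == byte && run < 255 {
            run += 1;
        }
        out.push(run as u8);
        out.push(byte);
        i += run;
    }
    out
}

/// Compresses the page buffer and records the original length.
fn compress_page(page: Page) -> CompressedPage {
    let uncompressed_size = page.buf().len();
    let compressed_page = match page {
        Page::Data { buf, num_values } => Page::Data { buf: toy_rle(&buf), num_values },
        Page::Dictionary { buf, num_values } => {
            Page::Dictionary { buf: toy_rle(&buf), num_values }
        }
    };
    CompressedPage { compressed_page, uncompressed_size }
}
```

Keeping both sizes together is what lets a page writer fill in the compressed and uncompressed sizes of the page header without having to decompress anything.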

The overall API resembles the structure of read path, including ColumnWriter and ColumnWriterImpl<T>.

The main files are:

src/bin/parquet-write.rs shows the API in action by writing a simple file.
src/file/writer.rs contains writer implementations for file, row group and page writer.
src/column/writer.rs contains code for column writer.
src/column/page.rs contains page interfaces.
src/file/properties.rs contains code for writer properties. I think we should have reader options for the read path as well, so the user can set the batch size, for example.

Features

Supports values of any type, as long as they are split into values, definition levels, and repetition levels. Both write_batch and write_mini_batch are supported in the column writer.
Data pages v1 and v2 are both supported in writes, as they are in reads.
Added a special trait and struct to track the position in the file without requiring a &mut reference. This helps in PageWriter.
All encodings are supported, same as reads.
All compression codecs are supported, same as reads.
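For the position-tracking item above, here is a minimal sketch of the general technique: a sink that counts bytes as they pass through, so the current offset can be read through a shared reference instead of calling `Seek::seek`, which takes `&mut self`. The `Position` trait shape, `CountingSink`, and `write_page` are assumptions for illustration and may differ from what the PR actually defines.

```rust
use std::io::{self, Write};

/// Reports the current write offset without needing `&mut self`
/// (assumed trait shape for this sketch).
trait Position {
    fn pos(&self) -> u64;
}

/// Wraps any writer and counts the bytes written through it
/// (hypothetical name).
struct CountingSink<W: Write> {
    inner: W,
    written: u64,
}

impl<W: Write> CountingSink<W> {
    fn new(inner: W) -> Self {
        CountingSink { inner, written: 0 }
    }
}

impl<W: Write> Write for CountingSink<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Count only the bytes the inner writer actually accepted.
        let n = self.inner.write(buf)?;
        self.written += n as u64;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

impl<W: Write> Position for CountingSink<W> {
    fn pos(&self) -> u64 {
        self.written
    }
}

/// Writes one page and returns the offset at which it starts, which is
/// what column chunk metadata needs to record.
fn write_page<S: Write + Position>(sink: &mut S, page: &[u8]) -> io::Result<u64> {
    let start = sink.pos();
    sink.write_all(page)?;
    Ok(start)
}
```

The count is only correct if the sink starts at offset 0, which matches the assumption stated earlier that the file is newly created and nothing has been written to it.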

Gotchas and limitations

Current code does not support statistics. To be honest, I think we should rework statistics entirely, including the read path (especially the link between sort order and values).
Current code does not support the new logical type in format 2.5.0. We just pass None for it.
The write path in Parquet is inherently sequential, so you cannot write row groups and column chunks in parallel. Even though I tried to add constraints and checks and to design for this, there are some gaps (this is mostly a concern with how the file writer tracks row group writers; it is okay for column writers, IMHO).

sadikovi · 2018-07-01T19:47:38Z

@sunchao I opened a PR with the initial write support. Could you review when you have time? We can discuss the details and high level approaches in the PR. Thanks!

sadikovi · 2018-07-08T11:17:18Z

I updated description with the sub-tasks.

sadikovi · 2018-07-09T06:58:41Z

The prototype did not use max_row_group_size to configure the size of the row group, which was a design problem. I think I will get back to it once I start working on the row group writer. We also need to make sure that the first version of the writer uses all properties except max statistics size (statistics are not currently supported in writes).

sadikovi · 2018-07-29T09:21:33Z

@sunchao would it be okay to open a PR with just the column writer and some changes to the page writer, and add tests in a separate pull request? It could be difficult to review them in one PR.

sunchao · 2018-07-29T16:10:48Z

@sadikovi Sure. Please go ahead. Thanks.

sadikovi · 2018-09-14T17:57:48Z

@sunchao There are 2 stories left to address. Would you like to wait until we fix them, or would you like to close the write support issue and move those 2 stories into separate issues?

sunchao · 2018-09-14T22:20:46Z

It's up to you :) We can tackle them in later releases if you think they do not affect the functionality of the writer and are just improvements.

sadikovi · 2018-09-15T08:21:21Z

I moved two last tasks into separate issues, because they are merely enhancements and do not affect the core functionality of the writers.

I am going to close this issue, indicating that write support has been added and we can start writing files using column writers or building on top of it using Arrow or some other approach.

Any issues that arise as features or bugs will go as separate issues.

Thanks!

sadikovi mentioned this issue May 9, 2018

Writing? #114

Closed

sunchao added the new feature label May 9, 2018

This was referenced Sep 15, 2018

Support writing statistics #164

Open

Limit writing row groups based either on size or number of records #165

Open

sadikovi closed this as completed Sep 15, 2018

sunchao mentioned this issue Oct 7, 2018

Release parquet-rs 0.4 #168

Closed

xrl mentioned this issue Oct 12, 2018

Implement Drop for RowGroupWriter, ColumnWriter, and friends #173

Open
