Skip to content
This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Add file writer and row group writer #149

Merged
merged 13 commits into from
Aug 31, 2018
Merged

Conversation

sadikovi
Copy link
Collaborator

@sadikovi sadikovi commented Aug 16, 2018

This PR is final part of the core implementation of Parquet writes, adds:

  • FileWriter and its serialised version SerializedFileWriter.
  • RowGroupWriter and its serialised version SerializedRowGroupWriter.

I tried to maintain a fairly simple API. The idea and steps are as follows:

  1. Create file writer based on file std::fs::File, schema TypePtr and writer properties WriterPropertiesPtr that you want to write.
  2. Call file_writer.next_row_group to get a mutable reference to a row group writer to add data.
  3. Call row_group_writer.next_column to add values to a column. Keep iterating until all columns have been written - we do not support partially written row group!
  4. After that you can call close() on column writer and/or row group writer. They will be closed regardless, before either the next column is requested or the next row group is requested.
  5. Keep adding row groups until you have added enough.
  6. Call file_writer.close to write final metadata to a file.
  7. That is it, after that you should have a complete file, so you can check schema and read records from it.

I added a set of unit tests. Please review them as well, they show API in action.

@sadikovi
Copy link
Collaborator Author

@sunchao Could you review this PR when you have time. I have several high-level questions and workarounds that I would like to discuss below:

  • I had to modify API of a column writer, so I can return a mutable reference and update it within row group writer. It is mainly around the fact that close takes &mut self now instead of mut self. My goal was mainly around the fact that I can store a reference to a current column writer in row group writer and call close() multiple times (user might call close, and we need to ensure that also close column writer in case she did not). Is it okay or should I look for other ways of implementing it somehow?
  • Please, check ergonomics of the new API. I tried my best to come up with easy to use and expressive API, but I might have missed something. Is API good enough?
  • We currently do not use limit on a row group size or DEFAULT_MAX_ROW_GROUP_SIZE from WriterProperties. Should we use it?

Let me know what you think. Thanks!

@coveralls
Copy link

coveralls commented Aug 16, 2018

Pull Request Test Coverage Report for Build 612

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 36 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.1%) to 95.606%

Files with Coverage Reduction New Missed Lines %
column/writer.rs 3 95.93%
file/reader.rs 13 96.97%
file/writer.rs 20 95.91%
Totals Coverage Status
Change from base Build 610: 0.1%
Covered Lines: 12359
Relevant Lines: 12927

💛 - Coveralls

@sadikovi sadikovi requested a review from sunchao August 16, 2018 15:24
@sunchao
Copy link
Owner

sunchao commented Aug 21, 2018

@sadikovi : really sorry that I haven't got chance to look at this yet. Will take a look today.

@sadikovi
Copy link
Collaborator Author

sadikovi commented Aug 21, 2018 via email

///
/// This method finalises previous column writer, so make sure that all writes are
/// finished before calling this method.
fn next_column(&mut self) -> Result<&mut ColumnWriter>;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it better to return a Option here? this more align with a iterator API.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have to return Result<Option<&mut ColumnWriter>>, is it okay?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this will be better. We should differentiate the error case and the case where there's no next.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I changed the code to return Result<Option<Something>> so looks like iterator and it is similar to how we read pages.

use util::io::{FileSink, Position};

/// File writer interface.
pub trait FileWriter {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps document more on how FileWriter should be used. For instance, how many row group writers should one create? does user need to explicitly call close the row group writers?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will do.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the method docs to emphasise what you mentioned. There is a follow-up sub-task to update all of the documentation related to writes after we have merged the implementation.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Sounds good then.

impl SerializedFileWriter {
/// Creates new file writer.
pub fn new(mut file: File, schema: TypePtr, properties: WriterPropertiesPtr) -> Self {
Self::start_file(&mut file).expect("Start Parquet file");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the error message needs improvement.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update the new method to return Result<Self> similar to FileReader and improve the error message.


#[inline]
fn next_column(&mut self) -> Result<&mut ColumnWriter> {
if self.row_group_metadata.is_some() {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can just call has_next_column at the beginning.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to make it iterator-like, this condition will be updated.

}

#[inline]
fn close(&mut self) -> Result<RowGroupMetaDataPtr> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if user calls close() before iterating over all columns. Will that cause some error?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will raise an error, because there will be mismatch between expected columns (must be greater than 0) in the schema and actual number of columns written.

/// File writer interface.
pub trait FileWriter {
/// Creates new row group from this file writer.
fn create_row_group(&mut self) -> Result<&mut RowGroupWriter>;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The good thing with this is we cannot call close until all the returned RowGroupWriter are out of scope. The downside is the restrictiveness: now it's hard to pass the RowGroupWriter to some struct which lives beyond the current call stack. Not sure if this will be an issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I had the same thought, but it looks like this is one of the few options to keep track of row group writers ourselves. Do you have other ideas? Should we just keep a reference count of row group writers and check later when asking for the next one?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think another approach is to return a unique reference to the RowGroupWriter:

fn create_row_group(&mut self) -> Result<RowGroupWriter>;

And a separate function to explicitly close it:

fn close_row_group(&mut self, RowGroupWriter) -> Result<()>;

This will call finalise_row_group_writer to close the row group writer and set some flag in the file writer indicating that the current row group writer has been closed. We can do similar things for RowGroupWriter to make sure all column writers are handed back before the close is called.

Let me know what you think. This is just some rough thought.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I will make changes, see what it looks like, and let you know. Thanks!

@sadikovi
Copy link
Collaborator Author

@sunchao I addressed your comments, but might have missed some. Would you mind doing another review pass, including tests (I feel they are a bit overcomplicated)? And could you also give your thoughts on the questions I raised in my first comment?

Thanks!

@sadikovi
Copy link
Collaborator Author

sadikovi commented Aug 29, 2018

@sunchao I made the changes to the API as per your suggestion. Can you have a look? Thanks!

Now we have next_XYZ and close_XYZ, we will raise an error if next method is called before close method is called for the previous XYZ.

Let me know if I should add more unit tests.

@sunchao
Copy link
Owner

sunchao commented Aug 31, 2018

@sadikovi : the latest PR looks good to me in general. However, the test test_column_writer_check_metadata is failing. Can you fix that?

@sadikovi
Copy link
Collaborator Author

sadikovi commented Aug 31, 2018 via email

@sadikovi
Copy link
Collaborator Author

@sunchao I fixed the test - it failed because I added test on encodings in this branch, but in master we have updated to add levels encoding as well, so it was failing with missing encoding for levels.

Should be all good now, would appreciate if you could look over the code again. There is also an open question about limiting row groups somehow, either based on size or number of rows - currently we do not limit them in any way. We could also add this as a follow-up issue.

@sunchao
Copy link
Owner

sunchao commented Aug 31, 2018

Should be all good now, would appreciate if you could look over the code again. There is also an open question about limiting row groups somehow, either based on size or number of rows - currently we do not limit them in any way. We could also add this as a follow-up issue.

Yes we should have some config properties for this. Good for a follow-up. :)

@sunchao sunchao merged commit 9a46f6c into sunchao:master Aug 31, 2018
@sunchao
Copy link
Owner

sunchao commented Aug 31, 2018

Merged. Thanks @sadikovi !

@sadikovi
Copy link
Collaborator Author

sadikovi commented Sep 1, 2018 via email

@sadikovi sadikovi deleted the add-file-writer branch September 1, 2018 07:54
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants