-
Notifications
You must be signed in to change notification settings - Fork 20
Conversation
@sunchao Could you review this PR when you have time. I have several high-level questions and workarounds that I would like to discuss below:
Let me know what you think. Thanks! |
Pull Request Test Coverage Report for Build 612
💛 - Coveralls |
@sadikovi : really sorry that I haven't got chance to look at this yet. Will take a look today. |
No worries, take as much time as needed.
…On Tue, 21 Aug 2018 at 6:37 PM, Chao Sun ***@***.***> wrote:
@sadikovi <https://github.com/sadikovi> : really sorry that I haven't got
chance to look at this yet. Will take a look today.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#149 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AHbY3vuyRRb6CI-LUG8TIF2iTgpCybYVks5uTDdegaJpZM4V_Uzh>
.
|
src/file/writer.rs
Outdated
/// | ||
/// This method finalises previous column writer, so make sure that all writes are | ||
/// finished before calling this method. | ||
fn next_column(&mut self) -> Result<&mut ColumnWriter>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it better to return a Option
here? this more align with a iterator API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have to return Result<Option<&mut ColumnWriter>>
, is it okay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think this will be better. We should differentiate the error case and the case where there's no next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I changed the code to return Result<Option<Something>>
so looks like iterator and it is similar to how we read pages.
src/file/writer.rs
Outdated
use util::io::{FileSink, Position}; | ||
|
||
/// File writer interface. | ||
pub trait FileWriter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: perhaps document more on how FileWriter
should be used. For instance, how many row group writers should one create? does user need to explicitly call close
the row group writers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the method docs to emphasise what you mentioned. There is a follow-up sub-task to update all of the documentation related to writes after we have merged the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Sounds good then.
src/file/writer.rs
Outdated
impl SerializedFileWriter { | ||
/// Creates new file writer. | ||
pub fn new(mut file: File, schema: TypePtr, properties: WriterPropertiesPtr) -> Self { | ||
Self::start_file(&mut file).expect("Start Parquet file"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the error message needs improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will update the new
method to return Result<Self>
similar to FileReader
and improve the error message.
src/file/writer.rs
Outdated
|
||
#[inline] | ||
fn next_column(&mut self) -> Result<&mut ColumnWriter> { | ||
if self.row_group_metadata.is_some() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can just call has_next_column
at the beginning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to make it iterator-like, this condition will be updated.
src/file/writer.rs
Outdated
} | ||
|
||
#[inline] | ||
fn close(&mut self) -> Result<RowGroupMetaDataPtr> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if user calls close()
before iterating over all columns. Will that cause some error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will raise an error, because there will be mismatch between expected columns (must be greater than 0) in the schema and actual number of columns written.
src/file/writer.rs
Outdated
/// File writer interface. | ||
pub trait FileWriter { | ||
/// Creates new row group from this file writer. | ||
fn create_row_group(&mut self) -> Result<&mut RowGroupWriter>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The good thing with this is we cannot call close
until all the returned RowGroupWriter
are out of scope. The downside is the restrictiveness: now it's hard to pass the RowGroupWriter
to some struct which lives beyond the current call stack. Not sure if this will be an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I had the same thought, but it looks like this is one of the few options to keep track of row group writers ourselves. Do you have other ideas? Should we just keep a reference count of row group writers and check later when asking for the next one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think another approach is to return a unique reference to the RowGroupWriter
:
fn create_row_group(&mut self) -> Result<RowGroupWriter>;
And a separate function to explicitly close it:
fn close_row_group(&mut self, RowGroupWriter) -> Result<()>;
This will call finalise_row_group_writer
to close the row group writer and set some flag in the file writer indicating that the current row group writer has been closed. We can do similar things for RowGroupWriter
to make sure all column writers are handed back before the close
is called.
Let me know what you think. This is just some rough thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I will make changes, see what it looks like, and let you know. Thanks!
@sunchao I addressed your comments, but might have missed some. Would you mind doing another review pass, including tests (I feel they are a bit overcomplicated)? And could you also give your thoughts on the questions I raised in my first comment? Thanks! |
@sunchao I made the changes to the API as per your suggestion. Can you have a look? Thanks! Now we have Let me know if I should add more unit tests. |
@sadikovi : the latest PR looks good to me in general. However, the test |
Yes, thanks. I just checked the tests, will fix shortly, make some other
minor changes and ping you again.
…On Fri, 31 Aug 2018 at 10:46 AM, Chao Sun ***@***.***> wrote:
@sadikovi <https://github.com/sadikovi> : the latest PR looks good to me
in general. However, the test test_column_writer_check_metadata is
failing. Can you fix that?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#149 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AHbY3uJ5jEm7WWh0bZ7u1q2WaJObELlLks5uWPfXgaJpZM4V_Uzh>
.
|
@sunchao I fixed the test - it failed because I added test on encodings in this branch, but in master we have updated to add levels encoding as well, so it was failing with missing encoding for levels. Should be all good now, would appreciate if you could look over the code again. There is also an open question about limiting row groups somehow, either based on size or number of rows - currently we do not limit them in any way. We could also add this as a follow-up issue. |
Yes we should have some config properties for this. Good for a follow-up. :) |
Merged. Thanks @sadikovi ! |
Thanks!
…On Fri, 31 Aug 2018 at 8:48 PM, Chao Sun ***@***.***> wrote:
Merged. Thanks @sadikovi <https://github.com/sadikovi> !
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#149 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AHbY3s1WzMAaaSJdyY6K9K4a31reIExWks5uWYTtgaJpZM4V_Uzh>
.
|
This PR is final part of the core implementation of Parquet writes, adds:
FileWriter
and its serialised versionSerializedFileWriter
.RowGroupWriter
and its serialised versionSerializedRowGroupWriter
.I tried to maintain a fairly simple API. The idea and steps are as follows:
std::fs::File
, schemaTypePtr
and writer propertiesWriterPropertiesPtr
that you want to write.file_writer.next_row_group
to get a mutable reference to a row group writer to add data.row_group_writer.next_column
to add values to a column. Keep iterating until all columns have been written - we do not support partially written row group!close()
on column writer and/or row group writer. They will be closed regardless, before either the next column is requested or the next row group is requested.file_writer.close
to write final metadata to a file.I added a set of unit tests. Please review them as well, they show API in action.