Implement Drop for RowGroupWriter, ColumnWriter, and friends #173
Comments
Have you read the documentation? Was it clear how to use column writers and row group writers? IMHO, the better approach would be our first design, where we would return a mutable reference to the column writer (and similarly for the row group writer). That allowed us to keep track of what was being written, close row groups and column writers automatically, and enforce the write order. But we decided to go with the current implementation; you can find more details on the PR and issue. The column reader and column writer APIs are considered low-level, e.g. they do not handle def/rep levels for you, and it is up to you to follow the docs. You can compare that to our record reader API, which is idiomatic Rust. I think we just need to make this clear in our documentation. |
I am very new to parquet and I'm still working through the core concepts of row group writers (columns I understand, but I guess columns aren't necessarily contiguous? they go through pages/chunks?). I was happy to just write a file with INT32 IDs in it! Now I am moving on to more complex values and it does seem very error prone. I understand it's a low-level writer and there's room for operator error. I did see there is a record reader API, which looks very friendly to use, but I don't see any similar API for a record writer. I think you are right: a mutable reference, where you have to request a new row group writer/column writer, would be more idiomatic Rust; the compiler would enforce that there's only ever one mutable borrow of the file. The documentation didn't have many annotated examples of writing data, so I've been peeking through test cases to figure out more. As I learn more I hope to contribute those back! |
Yes, you are right, there is room for improvement in the parquet writer. I have been snowed under recently, but will try to get back to improving things in parquet, in particular working on tools to convert csv/json tables into parquet (no guarantees though!).

A parquet page is the smallest self-contained unit of data that you can read; it has all of the encoding information, compression information, byte data, and definition and repetition levels. There are different types of pages: data, dictionary, index, etc. Each column chunk is a set of pages that has to be read contiguously and represents a leaf column in the parquet schema. All of the nesting and nulls are encoded with repetition and definition levels, respectively. Normally a row group will contain one column chunk per leaf node of the parquet schema. It is a requirement that all column chunks are written sequentially, one column after another. Meanwhile, metadata is recorded and stored as part of each column chunk and row group; this allows column pruning and statistics-based filtering. Each file can contain some number of row groups, or 0 row groups (an empty file).

Users are expected to handle complex values like lists, maps, and structs themselves using the current writer API. The writer API was added recently and simply has not had much usability testing; I wrote as much documentation as I could at the time. It would be great if you could have a look, see if there are any problems or missing docs in the code, and help fix them! |
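As an editorial aside, the file layout described above (pages inside column chunks, column chunks inside row groups, row groups inside a file) can be sketched with simplified toy types. These are not the actual parquet-rs structs, just an illustration of the nesting:

```rust
// Illustrative sketch of the parquet file layout, NOT the crate's real types.

/// A page is the smallest self-contained unit: encoding, compression,
/// rep/def levels, and byte data all live here.
struct Page {
    bytes: Vec<u8>,
}

/// A column chunk is a contiguous run of pages for one leaf column.
struct ColumnChunk {
    pages: Vec<Page>,
}

/// A row group holds one column chunk per leaf node in the schema,
/// written sequentially, one column after another.
struct RowGroup {
    columns: Vec<ColumnChunk>,
}

/// A file holds zero or more row groups plus footer metadata.
struct ParquetFile {
    row_groups: Vec<RowGroup>,
}

fn main() {
    // A file with 0 row groups (an empty file) is valid.
    let file = ParquetFile { row_groups: vec![] };
    assert!(file.row_groups.is_empty());

    // One row group with one single-page column chunk.
    let rg = RowGroup {
        columns: vec![ColumnChunk {
            pages: vec![Page { bytes: vec![1, 2, 3] }],
        }],
    };
    assert_eq!(rg.columns[0].pages[0].bytes.len(), 3);
}
```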
Actually, the more I think about this, the more I tend to agree that your approach is the best and removes the necessity to close writers. We should implement Drop for all of those types, because it is quite annoying to call close every time. I will try it out and open a PR for it. |
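The idea of implementing Drop so close is called automatically can be sketched with a toy writer (the shared `closed` flag exists only so the behaviour is observable; none of this is the real parquet-rs implementation):

```rust
use std::cell::Cell;
use std::rc::Rc;

// Toy column writer: `closed` is shared so the caller can observe the drop.
struct ColumnWriter {
    closed: Rc<Cell<bool>>,
}

impl ColumnWriter {
    fn close(&mut self) {
        // In a real writer this would flush pages and record column metadata.
        self.closed.set(true);
    }
}

impl Drop for ColumnWriter {
    fn drop(&mut self) {
        // Drop cannot return a Result, so a failure here could only be
        // logged or turned into a panic.
        if !self.closed.get() {
            self.close();
        }
    }
}

fn main() {
    let flag = Rc::new(Cell::new(false));
    {
        let _writer = ColumnWriter { closed: Rc::clone(&flag) };
        // the writer goes out of scope here; Drop closes it for us
    }
    assert!(flag.get());
}
```

This removes the need for callers to remember `close`, at the cost of losing any error return from the cleanup path, which is exactly the tension discussed below.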
It'll be interesting to see what you come up with. I'm concerned with your initial, and correct, observation: you can still close out of order. Perhaps closing the individual column writers may be left up to the RowGroup?
    {
        row_group.next_column() // pushes the column onto a stack of "to be closed" references
        // do stuff with the column
    } // row_group is dropped, closes referenced columns one at a time, and panics(?) if not all columns were handled? or perhaps it just lets users write corrupt files?
|
I think we will just throw an error when this happens. I will try and see what I can do. Thanks!
|
I don’t think Drop can return an error. I think the return type is ().
|
That is not what I meant; I know it returns unit. I was thinking about simply panicking when one of the operations is not performed correctly, in order to at least indicate an error somehow. But that results in some weird behaviour: an expected error would always end in a panic. It does look like moving to &mut might be the best solution yet. Closing out of order can be relatively easily solved by a (global) id counter, so you would always know which column writer/row group writer should be closed next. Let's keep this open until someone comes up with an idea, or until it simply stops being a problem once we build a high-level API. |
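The (global) id counter idea mentioned above could be sketched like this. `Writer`, `open`, and `close` are hypothetical names; the point is only that each writer receives an increasing id at creation and must close in reverse order, so an out-of-order close is detectable:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Monotonically increasing id handed to each writer as it is created.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);
// Id of the writer that must close next. Writers close in LIFO order,
// so the most recently opened writer has to close first.
static EXPECTED_CLOSE: AtomicU64 = AtomicU64::new(0);

struct Writer {
    id: u64,
}

impl Writer {
    fn open() -> Writer {
        let id = NEXT_ID.fetch_add(1, Ordering::SeqCst);
        EXPECTED_CLOSE.store(id, Ordering::SeqCst);
        Writer { id }
    }

    fn close(self) -> Result<(), String> {
        if EXPECTED_CLOSE.load(Ordering::SeqCst) != self.id {
            return Err(format!("writer {} closed out of order", self.id));
        }
        // The parent (one id lower) becomes the next expected close.
        EXPECTED_CLOSE.store(self.id.wrapping_sub(1), Ordering::SeqCst);
        Ok(())
    }
}

fn main() {
    let row_group = Writer::open();
    let column = Writer::open();
    // Innermost writer closes first: fine.
    assert!(column.close().is_ok());
    assert!(row_group.close().is_ok());

    let row_group = Writer::open();
    let _column = Writer::open();
    // Closing the row group while its column is still open is detected.
    assert!(row_group.close().is_err());
}
```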
Thanks for the discussion. I think my original concern is that, if we return &mut, the returned row group writer can only be used within the same call stack as the file writer, which is a limitation. I agree that changing it to &mut can simplify things though. We can explore this idea. With &mut, instead of having a drop, can we just close the previous writer when calling next_row_group? I believe this can solve most of the issues @xrl encountered. We can still provide a close method for file/row group/column etc., if people need to explicitly close them. |
We already have that, in a way. I suggest we keep it open; I will keep thinking about how to solve it.
|
@sunchao can you further explain why it's useful for row group writer and file writer to be used in different call stacks? Does that mean you want to send row group writers to different threads? |
@xrl I don't have a concrete use case for now - I just think people may want the flexibility to store the writers on the heap. I'm pretty OK to change to &mut. |
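The "close the previous writer when calling next_row_group" idea can be sketched with toy types (again, these are not the crate's real `FileWriter`/`RowGroupWriter`; only the shape of the API is the point):

```rust
// Sketch of the `&mut` idea: the file writer hands out mutable borrows,
// and `next_row_group` closes whatever row group is still open.

struct RowGroupWriter {
    closed: bool,
}

struct FileWriter {
    row_groups: Vec<RowGroupWriter>,
}

impl FileWriter {
    fn new() -> FileWriter {
        FileWriter { row_groups: Vec::new() }
    }

    /// Closes the previous row group (if any) before starting a new one,
    /// so callers can no longer forget `close_row_group`.
    fn next_row_group(&mut self) -> &mut RowGroupWriter {
        if let Some(prev) = self.row_groups.last_mut() {
            prev.closed = true; // a real writer would flush metadata here
        }
        self.row_groups.push(RowGroupWriter { closed: false });
        self.row_groups.last_mut().unwrap()
    }
}

fn main() {
    let mut file = FileWriter::new();
    let _rg1 = file.next_row_group();
    // Asking for the next row group implicitly closes the first one.
    let _rg2 = file.next_row_group();
    assert!(file.row_groups[0].closed);
    assert!(!file.row_groups[1].closed);
}
```

Because `next_row_group` takes `&mut self`, the borrow checker also guarantees the previous `&mut RowGroupWriter` can no longer be used once the next one is requested.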
I think it will look more "rust-y" to have the &mut approach. If users did want to put some part of the writer on the heap, I think it'd be best to just put the whole file handle there. Otherwise the parquet library would need Rc/Arc in order to track when those writers are done. But I don't understand why a user would want just one part of the writer in the heap and not the whole file. |
Yes, it looks more rusty that way, and potentially more performant. I don't understand the second paragraph you wrote though. Can you elaborate on that? |
My comments about the heap were in response to that flexibility question. I think we can support people putting their values in Box. |
I'm learning parquet with this library and there's a lot of room for messing up when it comes to closing various resources. It would be more rust-y (idiomatic) if the library provided Drop implementations for these types.

Take this code for example: there are two ways I can mess up (and indeed did), forgetting to call close_column or forgetting to call close_row_group.

The std::ops::Drop trait could be rigged up to call those close functions, if we can keep track of the parent object somehow. The compiler will make sure this destructor is called in reverse order of value creation, so we can be sure that a column writer is closed before its row group, etc.
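For reference, `std::ops::Drop` is defined in the standard library as `pub trait Drop { fn drop(&mut self); }` (note that `drop` returns `()`, so it cannot propagate errors). The reverse-order guarantee can be demonstrated with a small tracking sketch:

```rust
use std::cell::RefCell;

// std::ops::Drop is defined as:
//
//     pub trait Drop {
//         fn drop(&mut self);
//     }
//
// This demo records the order in which local values are dropped.
thread_local! {
    static DROP_ORDER: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

struct Tracked(&'static str);

impl Drop for Tracked {
    fn drop(&mut self) {
        DROP_ORDER.with(|o| o.borrow_mut().push(self.0));
    }
}

fn run() -> Vec<&'static str> {
    DROP_ORDER.with(|o| o.borrow_mut().clear());
    {
        let _row_group = Tracked("row_group");
        let _column = Tracked("column");
        // scope ends: _column drops first, then _row_group
    }
    DROP_ORDER.with(|o| o.borrow().clone())
}

fn main() {
    // Locals drop in reverse declaration order.
    assert_eq!(run(), vec!["column", "row_group"]);
}
```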
Hopefully the code would someday turn out simpler:
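A hypothetical sketch of that "simpler" usage, with toy types standing in for the real writers (nothing here is the actual parquet-rs API; the point is only that no explicit close calls remain):

```rust
// Toy writers where Drop replaces close_column / close_row_group.

struct ColumnWriter;

impl ColumnWriter {
    /// Returns how many values were written.
    fn write_batch(&mut self, values: &[i32]) -> usize {
        values.len()
    }
}

impl Drop for ColumnWriter {
    fn drop(&mut self) {
        // a real writer would flush pages and record column metadata here
    }
}

struct RowGroupWriter;

impl RowGroupWriter {
    fn next_column(&mut self) -> ColumnWriter {
        ColumnWriter
    }
}

impl Drop for RowGroupWriter {
    fn drop(&mut self) {
        // a real writer would record row group metadata here
    }
}

fn main() {
    let mut row_group = RowGroupWriter;
    let written = {
        let mut column = row_group.next_column();
        column.write_batch(&[1, 2, 3])
        // no close_column: the column writer is dropped at end of this scope
    };
    assert_eq!(written, 3);
    // no close_row_group: row_group is dropped at end of main
}
```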