This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Implement Drop for RowGroupWriter, ColumnWriter, and friends #173

Open
xrl opened this issue Oct 12, 2018 · 16 comments

@xrl
Contributor

xrl commented Oct 12, 2018

I'm learning parquet with this library and there's a lot of room for messing up when it comes to closing the various resources. It would be more Rust-y (idiomatic) if the library provided Drop implementations for these types.

Take this code for example:

{
    let mut pfile = create_addresses_parquet_file().unwrap();

    let mut row_group = pfile.next_row_group().unwrap();

    for p in 0..page_count {
        let res = addresses.order(ad_dsl::id.desc()).limit(page_size).offset(p * page_size).load::<Address>(&conn).unwrap();
        println!("got a page {}", p);

        let mut column_writer = row_group.next_column().unwrap().unwrap();
        let addr_ids = res.iter().map(|x| x.id).collect::<Vec<i32>>();
        if let ColumnWriter::Int32ColumnWriter(ref mut typed) = column_writer {
            typed.write_batch(&addr_ids[..], None, None).unwrap();
        }
        row_group.close_column(column_writer).unwrap();
    }

    pfile.close_row_group(row_group).unwrap();
}

There are two ways I can mess up (and indeed did): forgetting to call close_column or forgetting to call close_row_group.

The std::ops::Drop trait could be rigged up to call those close functions, if we can keep track of the parent object somehow. The trait definition:

pub trait Drop {
    fn drop(&mut self);
}

and the compiler will make sure this destructor is called in reverse order of value creation, so we can be sure that a column writer is closed before a row group, etc.

Hopefully the code would someday turn out simpler:

{
    let mut pfile = create_addresses_parquet_file().unwrap();

    let mut row_group = pfile.next_row_group().unwrap();

    for p in 0..page_count {
        let res = addresses.order(ad_dsl::id.desc()).limit(page_size).offset(p * page_size).load::<Address>(&conn).unwrap();
        println!("got a page {}", p);

        let mut column_writer = row_group.next_column().unwrap().unwrap();
        let addr_ids = res.iter().map(|x| x.id).collect::<Vec<i32>>();
        if let ColumnWriter::Int32ColumnWriter(ref mut typed) = column_writer {
            typed.write_batch(&addr_ids[..], None, None).unwrap();
        }
        // std::ops::Drop::drop(column_writer) called by compiler
    }

    // std::ops::Drop::drop(row_group) called by compiler
}
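
For illustration, here is a rough sketch of how Drop could be wired up to close a writer automatically. The types and field names are hypothetical (not the real parquet API); the idea is just that the column handle keeps a mutable borrow of its parent and finishes itself on drop if close was never called:

// Hypothetical types for illustration only; not the actual parquet API.
struct RowGroupState {
    columns_closed: usize,
}

impl RowGroupState {
    fn finish_column(&mut self) {
        self.columns_closed += 1;
    }
}

struct ColumnHandle<'a> {
    parent: &'a mut RowGroupState,
    closed: bool,
}

impl<'a> ColumnHandle<'a> {
    fn close(mut self) {
        self.parent.finish_column();
        self.closed = true;
        // `self` is dropped here, but Drop sees `closed == true` and does nothing
    }
}

impl<'a> Drop for ColumnHandle<'a> {
    fn drop(&mut self) {
        if !self.closed {
            // Drop cannot return a Result, so a real implementation would have to
            // log or swallow any error from closing here.
            self.parent.finish_column();
        }
    }
}

fn main() {
    let mut rg = RowGroupState { columns_closed: 0 };
    {
        let _col = ColumnHandle { parent: &mut rg, closed: false };
        // forgot to call close() -- Drop finishes the column anyway
    }
    assert_eq!(rg.columns_closed, 1);
}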
@sadikovi
Collaborator

sadikovi commented Oct 12, 2018

Have you read the documentation? Was it clear how to use column writers and row group writers?

From the definition: "Because of this recursive dropping, you do not need to implement this trait unless your type needs its own destructor logic." Even having implemented this trait, you can still close out of order, so it does not necessarily solve the problem. The major issue is keeping track of a row group writer or a file writer, but we were trying to avoid that by design. And, if I remember correctly, we had a bunch of checks to prevent out-of-order writes.

IMHO, the better approach would be our first design, where we returned a mutable reference to a column writer (similarly for a row group writer). This allowed us to keep track of what was being written, close row groups and column writers automatically, and enforce the order - but we decided to go with the current implementation; you can find more details in the PR and issue.

The column reader and column writer APIs are considered low-level, e.g. they do not handle def/rep levels for you, and it is up to you to follow the docs. You can compare that to our record reader API, which is idiomatic Rust.

I think we just need to make it clear in our documentation.

@xrl
Contributor Author

xrl commented Oct 12, 2018

I am very new to parquet and I'm still working through the core concepts of row group writers (columns I understand, but I guess columns aren't necessarily contiguous? they go through pages/chunks?). I was happy to just write a file with INT32 IDs in it!

Now I am moving on to more complex values and it does seem very error-prone. I understand it's a low-level writer and there's room for operator error.

I did see there was a record reader API, which looks very friendly to use, but I don't see any similar API for a record writer. I think you are right: a mutable reference where you have to request a new row group writer/column writer would be more idiomatic Rust, and the compiler would enforce that there's only ever one mutable borrow of the file.

The documentation doesn't have many annotated examples of writing data, so I've been peeking through test cases to figure out more. As I learn more I hope to contribute those back!

@xrl
Contributor Author

xrl commented Oct 12, 2018

I think this is the relevant issue/pr pair:

#116
#149

@sadikovi
Collaborator

sadikovi commented Oct 12, 2018

Yes, you are right - there is room for improvement in the parquet writer. I have been snowed under recently, but I will try to get back to improving things in parquet, in particular working on tools to convert csv/json tables into parquet (no guarantees though!).

A Parquet page is the smallest self-contained unit of data that you can read; it has all of the encoding information, compression information, byte data, and definition and repetition levels. There are different types of pages: data, dictionary, index, etc.

Each column chunk is a set of pages that has to be read contiguously and represents a leaf column in the Parquet schema. All of the nesting and nulls are encoded with repetition and definition levels, respectively.

Normally a row group contains one column chunk for each leaf node of the Parquet schema. It is a requirement that all column chunks are written sequentially, one column after another. Meanwhile, metadata is recorded and stored as part of each column chunk and row group. This allows column pruning and statistics-based filtering.

Each file can contain some number of row groups, or 0 row groups (an empty file). Normally query engines parallelise files on row groups, with the smallest unit of reading being a column chunk, since they create an iterator per column.
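
To summarise the layout described above in one place, here is a very rough sketch using illustrative (not real) type definitions:

// Illustrative only; these are not the actual parquet types.
struct ParquetFile {
    row_groups: Vec<RowGroup>, // 0..N row groups; engines usually parallelise on these
}

struct RowGroup {
    // one column chunk per leaf column in the schema, written sequentially;
    // per-chunk and per-row-group metadata enables pruning and statistics-based filtering
    column_chunks: Vec<ColumnChunk>,
}

struct ColumnChunk {
    pages: Vec<Page>, // read contiguously
}

// the smallest self-contained unit: encoding, compression, byte data, def/rep levels
enum Page {
    Data,
    Dictionary,
    Index,
}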

Users are expected to handle complex values like lists, maps, and structs themselves using the current writer API. The writer API was added recently and simply has not had much usability testing; I wrote as much documentation as I could at the time.

It would be great if you could have a look, see if there are any problems or missing docs in the code, and help fix them!

@sadikovi
Collaborator

sadikovi commented Nov 3, 2018

Actually, the more I think about this, the more I tend to agree that your approach is the best and removes the necessity to close writers. We should implement Drop for all of those types, because it is quite annoying to call close every time. I will try it out and open a PR for it.

@xrl
Contributor Author

xrl commented Nov 3, 2018

It'll be interesting to see what you come up with. I'm concerned with your initial, and correct, observation: you can still close out of order. Perhaps closing the individual column writers may be left up to the RowGroup?

{
  row_group.next_column() // pushes the column onto a stack of "to be closed" references
  // do stuff with column
} // row_group is dropped, closes referenced columns in reverse order of allocation and panics(?) if all columns were not handled?
  // or perhaps it just lets users write corrupt files?
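
A minimal sketch of that idea with hypothetical types (not the real API): the row group remembers the columns it handed out and finishes any stragglers on Drop, in reverse order of allocation:

// Hypothetical types for illustration; not the actual parquet API.
struct RowGroup {
    open_columns: Vec<String>, // "to be closed" column identifiers, in allocation order
}

impl RowGroup {
    fn next_column(&mut self, name: &str) {
        self.open_columns.push(name.to_string());
    }
}

impl Drop for RowGroup {
    fn drop(&mut self) {
        // close in reverse order of allocation; a real implementation would have to
        // decide whether to panic here or quietly finish the columns
        while let Some(col) = self.open_columns.pop() {
            println!("closing column {}", col);
        }
    }
}

fn main() {
    let mut rg = RowGroup { open_columns: Vec::new() };
    rg.next_column("id");
    rg.next_column("name");
} // prints "closing column name", then "closing column id"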

@sadikovi
Collaborator

sadikovi commented Nov 3, 2018 via email

@xrl
Contributor Author

xrl commented Nov 3, 2018 via email

@sadikovi
Collaborator

sadikovi commented Nov 4, 2018

That is not what I meant; I know it returns Unit. I was thinking about simply panicking when one of the operations is not performed correctly, in order to at least indicate an error somehow. But that results in some weird behaviour, and an expected error would always result in a panic.

It does look like moving to &mut might be the best solution yet. Closing out of order can be relatively easily solved by a (global) id counter, so you would always know which column writer / row group writer should be closed next.

Let's keep this open until someone comes up with an idea, or until it simply stops being a problem once we build a high-level API.
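
For what it's worth, a rough sketch of the id-counter idea, again with hypothetical types rather than the real API: each writer gets an increasing id, and close only succeeds for the most recently opened writer that is still open:

// Hypothetical registry for illustration; not the actual parquet API.
struct WriterRegistry {
    next_id: u64,
    open: Vec<u64>, // stack of currently open writer ids
}

impl WriterRegistry {
    fn open_writer(&mut self) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.open.push(id);
        id
    }

    fn close_writer(&mut self, id: u64) -> Result<(), String> {
        match self.open.last() {
            Some(&top) if top == id => {
                self.open.pop();
                Ok(())
            }
            _ => Err(format!("writer {} closed out of order", id)),
        }
    }
}

fn main() {
    let mut reg = WriterRegistry { next_id: 0, open: Vec::new() };
    let outer = reg.open_writer(); // e.g. a row group writer
    let inner = reg.open_writer(); // e.g. a column writer
    assert!(reg.close_writer(outer).is_err()); // out of order: inner is still open
    assert!(reg.close_writer(inner).is_ok());
    assert!(reg.close_writer(outer).is_ok());
}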

@sunchao
Owner

sunchao commented Nov 7, 2018

Thanks for the discussion. I think my original concern is that, if we return &mut, the returned row group writer can only be used within the same call stack as the file writer, which is a limitation. I agree that changing it to &mut can simplify things though. We can explore this idea.

With &mut, instead of having a drop, can we just close the previous writer when calling next_row_group? I believe this can solve most of the issues @xrl encountered. We can still provide a close method for file/row group/column etc, if people need to explicitly close them.
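
A sketch of that "close the previous writer on next_row_group" idea, with hypothetical types rather than the real API:

// Hypothetical types for illustration; not the actual parquet API.
struct RowGroupWriter {
    index: usize,
}

struct FileWriter {
    current: Option<RowGroupWriter>,
    row_groups_written: usize,
}

impl FileWriter {
    // Returning &mut ties the row group writer's lifetime to the file writer's borrow,
    // which is the call-stack limitation mentioned above.
    fn next_row_group(&mut self) -> &mut RowGroupWriter {
        if let Some(prev) = self.current.take() {
            // implicitly finish the previous row group before starting a new one
            println!("closing row group {}", prev.index);
            self.row_groups_written += 1;
        }
        self.current = Some(RowGroupWriter { index: self.row_groups_written });
        self.current.as_mut().unwrap()
    }
}

fn main() {
    let mut file = FileWriter { current: None, row_groups_written: 0 };
    file.next_row_group(); // opens row group 0
    file.next_row_group(); // closes row group 0, opens row group 1
}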

@sadikovi
Collaborator

sadikovi commented Nov 7, 2018 via email

@xrl
Contributor Author

xrl commented Nov 8, 2018

@sunchao can you further explain why it's useful for the row group writer and file writer to be used in different call stacks? Does that mean you want to send row group writers to different threads?

@sunchao
Owner

sunchao commented Nov 9, 2018

@xrl I don't have a concrete use case for now - I just think people may want the flexibility to store the writers on the heap. I'm pretty OK with changing to &mut as long as it makes things easier. I haven't thought about the scenario of concurrent row group/page writers, though I think either approach can support that use case.

@xrl
Contributor Author

xrl commented Nov 11, 2018

I think it will look more "rust-y" to have the &mut and there will be less boilerplate.

If users did want to put some part of the writer on the heap, I think it'd be best to just put the whole file handle there. Otherwise the parquet library would need Rc/Arc in order to track when those writers are done. But I don't understand why a user would want just one part of the writer on the heap and not the whole file.

@sunchao
Owner

sunchao commented Nov 11, 2018

Yes, it looks more rusty that way and is potentially more performant, though I don't understand your second paragraph. Can you elaborate on that?

@xrl
Contributor Author

xrl commented Nov 15, 2018

My comments about Rc/Arc are probably nonsense :)

I think we can support people putting their values in a Box.

3 participants