This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

parquet_derive for the new RecordWriter trait #197

Closed
wants to merge 17 commits

Conversation

@xrl (Contributor) commented Dec 1, 2018

Goal

Writing many columns to a file is a chore. If you can put your values into a struct which mirrors the schema of your file, this derive(ParquetRecordWriter) will write out all the fields, in the order in which they are defined, to a row_group.

How to Use

extern crate parquet;
#[macro_use] extern crate parquet_derive;

#[derive(ParquetRecordWriter)]
struct ACompleteRecord<'a> {
  pub a_bool: bool,
  pub a_str: &'a str,
}
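
For context, here is a minimal end-to-end sketch of driving the derived writer, continuing from the snippet above. It assumes the file-writer API of this parquet-rs era (SerializedFileWriter, next_row_group/close_row_group) and that the trait lands at parquet::file::writer::RecordWriter per this PR's layout; the schema string and file path are illustrative.

use std::{fs, rc::Rc};
use parquet::file::properties::WriterProperties;
use parquet::file::writer::{FileWriter, RecordWriter, SerializedFileWriter};
use parquet::schema::parser::parse_message_type;

// The schema still has to mirror the struct's fields, in order.
let schema = Rc::new(parse_message_type(
  "message schema { required boolean a_bool; required binary a_str (UTF8); }"
).unwrap());
let props = Rc::new(WriterProperties::builder().build());
let file = fs::File::create("example.parquet").unwrap();
let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();

let records = vec![ACompleteRecord { a_bool: true, a_str: "hello" }];
let mut row_group = writer.next_row_group().unwrap();
(&records[..]).write_to_row_group(&mut row_group);
writer.close_row_group(row_group).unwrap();
writer.close().unwrap();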

RecordWriter trait

This is the new trait which parquet_derive will implement for your structs.

use super::RowGroupWriter;

pub trait RecordWriter<T> {
  fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
}

How does it work?

The parquet_derive crate adds code-generation functionality to the Rust compiler: it takes Rust syntax and emits additional syntax. This macro expansion works on Rust 1.15+ stable. It is a dynamic plugin, loaded by the machinery in cargo, so users don't have to do any special build.rs steps or anything like that; it's automatic once parquet_derive is included in their project. The parquet_derive/Cargo.toml has a section saying as much:

[lib]
proc-macro = true

The Rust struct tagged with #[derive(ParquetRecordWriter)] is provided to the parquet_record_writer function in parquet_derive/src/lib.rs. The syn crate parses the struct from a string representation into an AST (a recursive enum value). The AST contains all the values I care about when generating a RecordWriter impl (sketched just after this list):

  • the name of the struct
  • the lifetime variables of the struct
  • the fields of the struct
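
A condensed sketch of that parsing step, assembled from the fragments quoted later in this review (parse_macro_input, Data::Struct); the body is abbreviated and the trailing unimplemented!() stands in for the real generation:

use syn::{parse_macro_input, Data, DataStruct, DeriveInput, Fields};

#[proc_macro_derive(ParquetRecordWriter)]
pub fn parquet_record_writer(input: proc_macro::TokenStream) -> proc_macro::TokenStream {
  let input: DeriveInput = parse_macro_input!(input as DeriveInput);
  let struct_name = input.ident;          // the name of the struct
  let generics = input.generics;          // carries the lifetime variables
  let fields: Fields = match input.data { // the fields of the struct
    Data::Struct(DataStruct { fields, .. }) => fields,
    _ => unimplemented!("only structs are supported"),
  };
  // ...translate each field to a FieldInfo and emit the impl with quote!...
  unimplemented!()
}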

The fields of the struct are translated from AST to a flat FieldInfo struct. It has the bits I care about for writing a column: field_name, field_lifetime, field_type, is_option, column_writer_variant.
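
Roughly, that flat struct looks like this; four of the field types appear verbatim in a review hunk further down, while field_name: syn::Ident is my assumption:

struct FieldInfo {
  field_name: syn::Ident,
  field_lifetime: Option<syn::Lifetime>,
  field_type: syn::Ident,
  is_option: bool,
  column_writer_variant: proc_macro2::TokenStream,
}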

The code then does the equivalent of templating to build the RecordWriter implementation. The templating functionality is provided by the quote crate. At a high level, the template for RecordWriter looks like:

impl RecordWriter for $struct_name {
  fn write_row_group(..) {
    $({
      $column_writer_snippet
    })
  } 
}
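
A hedged sketch of how that templating might be wired up with the quote! macro; struct_name and writer_snippets (a hypothetical Vec<proc_macro2::TokenStream> holding each column's writing code) are stand-in names:

let expanded = quote::quote! {
  impl RecordWriter<#struct_name> for &[#struct_name] {
    fn write_to_row_group(
      &self,
      row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
    ) {
      let mut row_group_writer = row_group_writer;
      // Each column's snippet is spliced in as its own block:
      #(
        { #writer_snippets }
      )*
    }
  }
};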

This template is then expanded beneath the struct definition, ending up something like:

struct MyStruct {
}
impl RecordWriter for MyStruct {
  fn write_row_group(..) {
    {
       write_col_1();
    };
   {
       write_col_2();
   }
  }
}

and finally THIS is the code passed to rustc. It's just code now, fully expanded and standalone. If a user ever changes their struct MyStruct definition, the ParquetRecordWriter impl will be regenerated; there are no intermediate values to version control or worry about.

Viewing the Derived Code

To see the generated code before it's compiled, one very useful trick is to install cargo expand (more info at https://github.com/dtolnay/cargo-expand); then you can do:

cd $WORK_DIR/parquet-rs/parquet_derive_test
cargo expand --lib > ../temp.rs

then you can dump the contents:

struct DumbRecord {
    pub a_bool: bool,
    pub a2_bool: bool,
}
impl RecordWriter<DumbRecord> for &[DumbRecord] {
    fn write_to_row_group(
        &self,
        row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
    ) {
        let mut row_group_writer = row_group_writer;
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        };
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        }
    }
}

Now I need to write out all the combinations of types we support and make sure it writes out data.

Procedural Macros

The parquet_derive crate can ONLY export the derivation functionality. No traits, nothing else. The derive crate cannot host test cases. It's kind of like a "dummy" crate which is only used by the compiler, never by runtime code.

The parent crate cannot use the derivation functionality, which is important because it means test code cannot be in the parent crate. This forces us to have a third crate, parquet_derive_test.

I'm open to being wrong on any one of these finer points. I had to bang on this for a while to get it to compile!

Potentials For Better Design

  • Recursion could be limited by generating the code as "snippets" instead of one big quote! AST generator. Or so I think. It might be nicer to push generating each column's writing code into another loop.
  • It would be nicer if I didn't have to be so picky about data going into the write_batch function. Is it possible we could make a version of the function which accepts Into<DataType> or similar? This would greatly simplify this derivation code, as it would not need to enumerate all the supported types. Something like write_generic_batch(&[impl Into<DataType>]) would be neat. (not tackling in this generation of the plugin)
  • Another idea for improving column writing: could we have a write function for Iterators? I already have a Vec<DumbRecord>; if I could just write a mapping for accessing the one value, we could skip the whole intermediate vec for write_batch. That should have some significant memory advantages. (not tackling in this generation of the plugin, it's a bigger parquet-rs enhancement)
  • It might be worthwhile to derive a parquet schema directly from a struct definition. That should stamp out opportunities for type errors. (moved to Derive parquet schema from struct #203)

Status

I have successfully integrated this work with my own data exporter (takes postgres/couchdb and outputs a single parquet file).

I think this code is worth including in the project, with the caveat that it only generates simplistic RecordWriters. As people start to use it, we can add code generation for more complex, nested structs.

Closes #192

@xrl (Contributor, Author) commented Dec 1, 2018

What's really unfortunate is that cargo expand detected rustfmt and reformatted the entire project. I will need to clean that up somehow. What a pain 😄 (edit: I think I mostly fixed it!)

@xrl xrl force-pushed the parquet_derive branch 2 times, most recently from 31ac273 to 38ec590 Compare December 1, 2018 22:45
@xrl (Contributor, Author) commented Dec 3, 2018

@sadikovi have you had a chance to review this work? Is this anywhere close to what you had in mind when I opened #192? 😄

@sunchao sunchao changed the title parqet_derive for the new RecordWriter trait parquet_derive for the new RecordWriter trait Dec 3, 2018
@sadikovi (Collaborator) commented Dec 3, 2018

Your approach certainly looks interesting; I did not know that it is possible to do that with macros and parsing of structs.

As I mentioned previously, my plan was to reuse the Row enum that we have, since it maps to pretty much anything, and to use triplet iterators to write data. The tricky part is specifying the schema, but it could always be inferred from the data if needed.

Anyway, I might start writing some code later to see the amount of work required, but I would prefer not to duplicate effort if possible - I will be happy to help out with writes.

@xrl (Contributor, Author) commented Dec 3, 2018

@sadikovi can you provide some pseudocode for how the Row enum would work?

@sadikovi (Collaborator) commented Dec 3, 2018

Good question. It would be similar to what you have, except we would use getters to extract values and write them. Plus, certain types would allow recursive (maybe) calls to write maps, arrays and structs. I can't show the exact code to do so, but this is how I think it could be done. Not sure if it helps.

@xrl (Contributor, Author) commented Dec 3, 2018

That's very interesting. I think that technique could be folded into this code generation. I think I'm generating too much code, code which could be avoided with some From/Into/AsRef traits or a slightly different API. Handling all the recursive definitions is going to need more sophistication than I'm currently equipped to write!

There's a lot of value to just putting a derive(...) on a structure and letting the procedural macro emit the writer code for each field. I'm very interested to hear more from you.

What is your opinion on getting this work merged into master? What do you want to see/know?

@sadikovi (Collaborator) commented Dec 3, 2018

Well, I need to study your code. A usage example would be nice, and more documentation around the code - that will make it easier for people to follow and extend. It is great work, though it would be appreciated if you could submit the work in chunks into master, not as one large PR - easier for people to review - but not required.

You don't have to do that now; I assume there is still a lot of work to do. TL;DR: I need to review the code again :).

@sadikovi (Collaborator) commented Dec 3, 2018

I will have a look in the next couple of days.

@xrl (Contributor, Author) commented Dec 3, 2018

A usage example is in the README.md for parquet_derive.
Similarly, I would have preferred a smaller unit of work but I didn't see an alternative path. This is the whole functionality.

Would you like to set up a time for a call to walk through the code and ask questions? After you've had time to satisfy your own regular code review methods 😄

I will support one more pointer style (a_property: &'a Option<_>). That won't change much of the existing code, and it's useful for a lot of situations.

I don't see this code as being the end of RecordWriter generation, just a first step for shallow/unnested structs. It satisfies my production code needs. In the future it can be modified or rewritten to support arbitrarily nested structs; it just needs some procedural macro-fu 😄

@sunchao (Owner) commented Dec 5, 2018

Thanks for the effort @xrl ! I'll also take a look at this soon. :)

@sadikovi (Collaborator) commented Dec 5, 2018

@xrl As I am reviewing the code, can you update the PR description to mention the goal of this PR? I struggle to derive it from the code. Is it adding a record writer? Thanks.

@xrl (Contributor, Author) commented Dec 5, 2018

@sadikovi please post the code you are trying to derive and what errors you are seeing. Also, did you check out this branch and try running cargo test from the parquet_derive_test folder?

Edit: I added Goal, How to Use, and RecordWriter trait sections to the PR.

@xrl (Contributor, Author) commented Dec 5, 2018

@sadikovi I have added more explanation to the PR text above. The How does it work? section should help clarify!

@xrl (Contributor, Author) commented Dec 6, 2018

I'm happy with the current status of this crate. I am using it in my own project and it generates reliable RecordWriter implementations. I think this means it's good for a thorough tire kicking @sadikovi @sunchao :)

@sadikovi (Collaborator) left a comment

Thanks for the PR. Looks good. It is very interesting that code can write structs.

I would like to see how much work it would be to handle nested types, similar to what we do in reads. I am inclined to suggest using the existing Row API, which allows specifying those things, but it might need to be extended to provide information for decimals, etc. I am biased towards reusing what we have built in the record reader and applying similar concepts to build record writers.

These are just my two cents. If @sunchao LGTMs it, then merge it.


## Features
- [X] Support writing `String`, `&str`, `bool`, `i32`, `f32`, `f64`
Collaborator:

Byte arrays and unsigned variants?

Contributor Author:

Good call on the unsigned variants. To be clear, to satisfy that requirement I need to add u32/u64, right? Parquet does not handle 8- or 16-bit values, right?

And when it comes to "byte arrays", are you talking about fixed-size arrays like let arr: [u8; 16] = [0; 16];? Or do you mean let vec: Vec<u8> = vec![0; 16];? I haven't used either of those in parquet-rs so I need a little guidance.

Collaborator:

By "byte arrays" I meant parquet type ByteArray, not just strings.

Contributor Author:

@sadikovi how do I write unsigned variants? I don't see a column writer for that.

let field_name = self.field_name.clone();
let field_type = self.field_type.clone();
let column_writer_variant = self.column_writer_variant.clone();
let is_option = self.is_option;
Collaborator:

My main concern is that Option<i32> values [1, 2, 3, 4, 5, 6] will be encoded differently compared to i32 values [1, 2, 3, 4, 5, 6], because of the definition levels. I think we should check if there are nulls first and then decide which option to choose.
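
A hedged sketch of the distinction being raised, in the same shape as the expanded column blocks above: for a hypothetical maybe_num: Option<i32> field, the generated snippet would have to pass definition levels to write_batch and write only the non-null values, instead of a plain values-only call.

let vals: Vec<Option<i32>> = self.iter().map(|x| x.maybe_num).collect();
// One definition level per row: 1 marks a present value, 0 marks a null.
let definition_levels: Vec<i16> = vals
  .iter()
  .map(|v| if v.is_some() { 1 } else { 0 })
  .collect();
let non_null: Vec<i32> = vals.iter().filter_map(|v| *v).collect();
typed.write_batch(&non_null[..], Some(&definition_levels[..]), None).unwrap();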

Contributor Author:

Great point.

@sadikovi (Collaborator) commented Dec 9, 2018 via email

@sunchao (Owner) left a comment

Thanks @xrl ! This is a great addition. I'm still getting more familiar with procedural macros but left some initial feedback. My biggest comment right now is whether we can remove or merge those unimplemented clauses to make the code shorter and cleaner. Also, as @sadikovi pointed out, I wonder if there's a plan to support nested types at some point, and whether that's easy to achieve. Obviously we don't have to tackle that in this PR. :-)

let input: DeriveInput = parse_macro_input!(input as DeriveInput);
let fields: Fields = match input.data {
Data::Struct(DataStruct { fields, .. }) => fields,
Data::Enum(_) => unimplemented!("don't support enum"),
Owner:

Can we change this to assertions and panic? The input should always be a struct, right?

Contributor Author:

The input should always be a struct, yes.

The reason I match on all the input.data variants is so I can get a nicer message in the panic. Under the hood, unimplemented!() is actually panic!() with a little bit of extra text:

macro_rules! unimplemented {
    () => (panic!("not yet implemented"));
    ($($arg:tt)+) => (panic!("not yet implemented: {}", format_args!($($arg)*)));
}

so the error will look like not yet implemented: don't support enum. I think that's a little more user-friendly, and it still works the same as a panic. Is that OK? Now that I've explained the choice, I am happy to switch to the plain panic!() macro if you'd like.

Contributor Author:

@sunchao how do you feel about this point?

Owner:

Yes, unimplemented looks good. Can you combine all the "unimplemented" branches though? Assuming Data implements Display or Debug, you can print the type in the error message that way.

impl FieldInfo {
fn from(f: &Field) -> Self {
let (field_type, field_lifetime, is_option) = match &f.ty {
Type::Slice(_) => unimplemented!("unsupported type: Slice"),
Owner:

Can we merge all the unhandled clauses into a single one?

Contributor Author:

My concern here is handling user-reported errors. I can do a generic fall-through, but then the error can't include what generated it. Instead of this match it would be something like:

match &f.ty {
      Type::Reference(ref tr) => {
        //                unimplemented!("lifetime: {:#?}", lifetime);
        extract_type_reference_info(tr)
      },
      Type::Path(tp) => {
        let (ident, is_option, _) = extract_path_info(&tp);
        (ident, None, is_option)
      },
      _ => unimplemented!("unsupported type")
    }

so if someone tries to derive:

#[derive(ParquetRecordWriter)]
struct Foo<'a> {
  baz: String,
  bar: &'a [u8]
}

it will just panic at compile time with unsupported type, and I don't think the user will know right away which field caused the failure. This was a user-friendliness consideration.

If we're OK omitting type-failure information then I'm OK simplifying; it will remove quite a few lines.

Contributor Author:

I just got to see this error in action when I tried compiling a test that turns other fields into ByteArray:

not yet implemented: don't know Vec for Field

It's not the best error, but it lets me know the Vec is the issue.

Contributor Author:

@sunchao what do you think about this response?

Owner:

@xrl sorry I just saw your comments. In the example you gave above, maybe we can do this:

match &f.ty {
      Type::Reference(ref tr) => {
        //                unimplemented!("lifetime: {:#?}", lifetime);
        extract_type_reference_info(tr)
      },
      Type::Path(tp) => {
        let (ident, is_option, _) = extract_path_info(&tp);
        (ident, None, is_option)
      },
      t @ _ => unimplemented!("unsupported type {}", t)
    }

assuming Type implements Display or Debug. What do you think?

Contributor Author:

This is great, thanks!

))
}

fn extract_type_reference_info(
Owner:

Can we move this and extract_path_info into FieldInfo as member functions? That makes it easier to read.

Contributor Author:

OK, good idea!

field_lifetime: Option<Lifetime>,
field_type: Ident,
is_option: bool,
column_writer_variant: proc_macro2::TokenStream,
Owner:

Any reason we need to depend on both proc_macro and proc_macro2? Just curious :)

@xrl (Contributor, Author) commented Dec 21, 2018

I chose proc_macro2 to maximize our backwards compatibility; from their description:

A small shim over the proc_macro crate in the compiler intended to multiplex the stable interface as of 1.15.0 and the interface as of 1.30.0.

So I take in a compiler-defined proc_macro::TokenStream and convert it to a proc_macro2::TokenStream so I can work on it in a rustc-version-independent way. Since we only claim compatibility with nightly, I can see this being unnecessary.

I'm open to removing it; I think it would just be a matter of renaming to proc_macro::TokenStream, etc.
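
The round trip in question is just a pair of Into conversions at the proc-macro boundary; a minimal sketch, with the actual generation elided:

#[proc_macro_derive(ParquetRecordWriter)]
pub fn parquet_record_writer(input: proc_macro::TokenStream) -> proc_macro::TokenStream {
  // The compiler's tokens convert into proc_macro2's version-independent type...
  let _input: proc_macro2::TokenStream = input.into();
  // ...all parsing and generation happens on proc_macro2 types (syn and quote use them)...
  let generated: proc_macro2::TokenStream = quote::quote! { /* generated impl */ };
  // ...and the result converts back at the boundary.
  generated.into()
}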

@xrl (Contributor, Author) commented Dec 19, 2018

I will be addressing this feedback soon. I'm waiting to see what happens with the switchover to the Arrow project.

@sunchao (Owner) commented Dec 19, 2018

Thanks @xrl . Since the merge is done, could you create a JIRA in Arrow for this feature? Please also paste the description there :)

@xrl (Contributor, Author) commented Dec 19, 2018

@sunchao are we doing PRs against the arrow repo? Or is the work still here and it gets rebased into arrow?

@sunchao (Owner) commented Dec 19, 2018

@xrl From now on, we should do PRs against the arrow repo. This repo will be frozen. I can help review there.

@xrl (Contributor, Author) commented Dec 19, 2018

Ah, I thought that would be the case! Thanks for the clarity.

@xrl (Contributor, Author) commented Dec 21, 2018

My goal with nested types would be to convert over to the Row enum (see: #197 (comment)). Unfortunately I haven't used that API, and looking over it I wasn't sure how it would work in this case.

I'd like to learn more, then revisit this crate to make it more general with the Row enum or other techniques. Especially anything that makes determining definition levels easier; that's a tricky part of Parquet I don't have the strongest handle on.

This PR, as it stands, is useful in my own project. I am deriving a writer for a struct with 30+ fields and it has greatly improved my productivity. Hopefully that extends to other users and shows a little sample of what is to come!

@coveralls

Pull Request Test Coverage Report for Build 711

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at ?%

Totals Coverage Status
  • Change from base Build 690: 0.0%
  • Covered Lines:
  • Relevant Lines: 0

💛 - Coveralls


@sunchao (Owner) commented Jan 4, 2019

Sorry @xrl, I just saw your updates. Posted some replies. I do recommend creating an Arrow JIRA for this and posting the PR against that repo though, since eventually this should go there as a sub-crate.

@xrl (Contributor, Author) commented Apr 11, 2019

Closing in favor of the ARROW PR!

@xrl xrl closed this Apr 11, 2019
brainstorm added a commit to brainstorm/htsget-indexer that referenced this pull request Nov 12, 2019
…elated (ongoing) issues to get parquet write support into arrow/parquet-rs
nevi-me added a commit to apache/arrow that referenced this pull request Sep 14, 2020
A rebase and significant rewrite of sunchao/parquet-rs#197

Big improvement: I now use a more natural nested enum style; it helps break out what the patterns of data types are. The rest of the broad strokes still apply.

(The commit message repeats the PR description above, adding to its Status section:) We can convert the nested matching style to a fancier looping style. But for now, this explicit nesting is easier to debug and understand (to me at least!).

Closes #4140 from xrl/parquet_derive

Lead-authored-by: Xavier Lange <xrlange@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Bryant Biggs <bryantbiggs@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
Successfully merging this pull request may close these issues.

High Level Record Writer
4 participants