parquet_derive for the new RecordWriter trait #197
Conversation
What's really unfortunate is that …
Your approach certainly looks interesting; I did not know that it is possible to do that with macros and parsing of structs. As I mentioned previously, my plan was reusing the Row enum that we have, since it maps to pretty much anything, and using triplet iterators to write data. The tricky part is specifying the schema, but it could always be inferred from the data if needed. Anyway, I might start writing some code later to see the amount of work required, but would prefer not duplicating effort if possible - will be happy to help out with writes.
@sadikovi can you provide some pseudocode for how the Row enum would work?
Good question. It would be similar to what you have, except we would use getters to extract values and write them. Plus, certain types would allow recursive (maybe) calls to write maps, arrays and structs. I can't show the exact code to do so, but this is how I think it could be done. Not sure if it helps.
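A loose sketch of that idea (illustrative only, not @sadikovi's code): parquet-rs's `Field` enum already enumerates the value shapes a `Row` can hold, so a writer could match on it and recurse into groups. The import path may differ by version, and the "buffer" actions here are stand-ins for real typed column writes.

```rust
use parquet::record::Field;

// Hypothetical dispatch: route each value to the matching typed column
// buffer, recursing into nested groups for maps/arrays/structs.
fn write_field(field: &Field) {
    match field {
        Field::Bool(b) => println!("buffer bool {}", b),
        Field::Int(i) => println!("buffer i32 {}", i),
        Field::Str(s) => println!("buffer byte array {}", s),
        Field::Group(_row) => { /* recurse over the nested Row's fields */ }
        other => unimplemented!("other Field variants: {:?}", other),
    }
}
```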
That's very interesting. I think that technique could be folded into this code generation. I think I'm generating too much code, code which could be solved with some … There's a lot of value to just putting a … What is your opinion on getting this work merged into master? What do you want to see/know?
Well, I need to study your code. A usage example would be nice, and more documentation around the code - it will make it easy for people to follow and extend. It is great work, though it would be appreciated if you could submit the work in chunks into master, not as one large PR - easier for people to review - but not required. You don't have to do it now; I assume there is still a lot of work to do. TL;DR: I need to review the code again :).
I will have a look in the next couple of days.
A usage example is in the README.md for … Would you like to set up a time for a call to walk through the code and ask questions? After you've had time to satisfy your own regular code review methods 😄 I will satisfy one more pointer style (…). I don't see this code as being the end of …
Thanks for the effort @xrl ! I'll also take a look at this soon. :)
@xrl As I am reviewing the code, can you update the PR description to mention the goal of this PR? I struggle to derive it from the code. Is it adding a record writer? Thanks.
@sadikovi please post the code you are trying to derive and what errors you are seeing. Also, did you check out this branch and try running …? Edit: I added a …
@sadikovi I have added more explanation to the PR text above. The …
Thanks for the PR. Looks good. It is very interesting that code can write structs.
I would like to see how much work it would be to handle nested types, similar to what we do in reads. I am inclined to suggest using the existing Row API, which allows specifying those things, but it might need to be extended to provide information for decimals, etc. I am biased towards reusing what we have built in the record reader and applying similar concepts to build record writers.
These are just my two cents. If @sunchao LGTMs it, then merge it.
## Features
- [X] Support writing `String`, `&str`, `bool`, `i32`, `f32`, `f64`
Byte arrays and unsigned variants?
Good call on the unsigned variants. To be clear, to satisfy that requirement I need to add `u32`/`u64`, right? Parquet does not handle 8 or 16 bit values, right?
And when it comes to "byte arrays", are you talking fixed-size arrays like `let arr: [u8; 16] = [0; 16];`? Or do you mean `let vec: Vec<u8> = vec![0; 16];`? I haven't used either of those in parquet-rs so I need a little guidance.
By "byte arrays" I meant parquet type ByteArray, not just strings.
@sadikovi how do I write unsigned variants? I don't see a column writer for that.
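One possible answer, sketched under my own assumptions rather than from this thread: Parquet has no unsigned physical types, so unsigned integers ride in the signed `Int32`/`Int64` physical types, with a `UINT_32`/`UINT_64` logical annotation in the schema telling readers to reinterpret the bits.

```rust
// Reinterpret u32 values for an Int32ColumnWriter; the schema annotation
// (not shown) marks the column as unsigned. `typed` is hypothetical.
let vals_u32: Vec<u32> = vec![1, 2, u32::max_value()];
let vals_i32: Vec<i32> = vals_u32.iter().map(|&v| v as i32).collect();
// typed.write_batch(&vals_i32[..], None, None).unwrap();
```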
```rust
let field_name = self.field_name.clone();
let field_type = self.field_type.clone();
let column_writer_variant = self.column_writer_variant.clone();
let is_option = self.is_option;
```
My main concern is that an Option column of values [1, 2, 3, 4, 5, 6] will be encoded differently compared to i32 [1, 2, 3, 4, 5, 6], because of the definition levels. I think we should check if there are nulls first and then decide which option to choose.
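To make the concern concrete, a minimal sketch (assuming the `write_batch(values, def_levels, rep_levels)` signature from this era of parquet-rs; `typed` is a hypothetical Int32 column writer):

```rust
// Required i32 column: values only, no definition levels.
// typed.write_batch(&[1, 2, 3, 4, 5, 6], None, None).unwrap();

// Optional i32 column: only non-null values are passed, and every slot
// (null or not) gets a definition level (1 = present, 0 = null). The same
// six values written with max definition level 1 therefore encode
// differently than the required case.
let source: Vec<Option<i32>> = vec![Some(1), None, Some(3)];
let vals: Vec<i32> = source.iter().filter_map(|v| *v).collect();
let def_levels: Vec<i16> = source.iter().map(|v| if v.is_some() { 1 } else { 0 }).collect();
// typed.write_batch(&vals[..], Some(&def_levels[..]), None).unwrap();
```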
Great point.
I normally put some doc on any function, but it is up to you.
…On Sun, 9 Dec 2018 at 10:08 PM, Xavier Lange ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In parquet_derive/src/lib.rs
<#197 (comment)>:
```diff
+ "i64" => quote! { parquet::column::writer::ColumnWriter::Int64ColumnWriter },
+ "f32" => quote! { parquet::column::writer::ColumnWriter::FloatColumnWriter },
+ "f64" => quote! { parquet::column::writer::ColumnWriter::DoubleColumnWriter },
+ o => unimplemented!("don't know {} for {:#?}", o, f),
+ };
+
+ FieldInfo {
+     field_name: f.ident.clone().expect("must be a named field"),
+     field_lifetime,
+     field_type,
+     is_option,
+     column_writer_variant,
+ }
+ }
+
+ pub fn to_writer_snippet(&self) -> proc_macro2::TokenStream {
```
This function does not need to be public, that was my mistake. Do you
still want it documented?
Thanks @xrl ! This is a great addition. I'm still getting more familiar with procedural macros but left some initial feedback. My biggest comment right now is whether we can remove or merge those `unimplemented!` clauses to make the code shorter and cleaner. Also, as @sadikovi pointed out, I wonder if there's a plan to support nested types at some point, and whether that's easy to achieve. Obviously we don't have to tackle that in this PR. :-)
```rust
let input: DeriveInput = parse_macro_input!(input as DeriveInput);
let fields: Fields = match input.data {
    Data::Struct(DataStruct { fields, .. }) => fields,
    Data::Enum(_) => unimplemented!("don't support enum"),
```
Can we change this to assertions and panic? The input should always be a struct, right?
The input should always be a struct, yes.
The reason I match on all the `input.data` variants is so I can get a nicer message in the panic. Under the hood `unimplemented!()` is actually `panic!()` but with a little bit of text:

```
macro_rules! unimplemented {
    () => (panic!("not yet implemented"));
    ($($arg:tt)+) => (panic!("not yet implemented: {}", format_args!($($arg)*)));
}
```

so the error will look like `not yet implemented: don't support enum`. I think that's a little more user friendly -- and it still works the same as a panic. Is that OK? Now that I've explained the choice I am happy to switch to the plain `panic!()` macro if you'd like.
@sunchao how do you feel about this point?
Yes, `unimplemented!` looks good. Can you combine all the "unimplemented" branches though? Assuming `Data` implements `Display` or `Debug`, you can print the type in the error message in this way.
```rust
impl FieldInfo {
    fn from(f: &Field) -> Self {
        let (field_type, field_lifetime, is_option) = match &f.ty {
            Type::Slice(_) => unimplemented!("unsupported type: Slice"),
```
can we merge all the unhandled clauses into a single one?
My concern here is handling user-reported errors. I can do a generic fall-through, but then the error can't include what generated it. Instead of this match it would be something like:

```
match &f.ty {
    Type::Reference(ref tr) => {
        // unimplemented!("lifetime: {:#?}", lifetime);
        extract_type_reference_info(tr)
    },
    Type::Path(tp) => {
        let (ident, is_option, _) = extract_path_info(&tp);
        (ident, None, is_option)
    },
    _ => unimplemented!("unsupported type")
}
```

so if someone tries to derive:

```
#[derive(ParquetRecordWriter)]
struct Foo<'a> {
    baz: String,
    bar: &'a [u8],
}
```

it will just panic at compile time with `unsupported type`, and I don't think the user will know right away which field caused the failure. This was a user-friendly consideration.
If we're OK omitting type-failure information then I'm OK simplifying; it will remove quite a few lines.
I just got to see this error in action when I tried compiling to test other fields into ByteArray:

```
not yet implemented: don't know Vec for Field
```

It's not the best error but it lets me know the Vec is the issue.
@sunchao what do you think about this response?
@xrl sorry I just saw your comments. In the example you gave above, maybe we can do this:

```
match &f.ty {
    Type::Reference(ref tr) => {
        // unimplemented!("lifetime: {:#?}", lifetime);
        extract_type_reference_info(tr)
    },
    Type::Path(tp) => {
        let (ident, is_option, _) = extract_path_info(&tp);
        (ident, None, is_option)
    },
    t @ _ => unimplemented!("unsupported type {}", t)
}
```

assuming `Type` implements `Display` or `Debug`. What do you think?
This is great, thanks!
parquet_derive/src/lib.rs (Outdated)

```rust
))
}

fn extract_type_reference_info(
```
Can we move this and `extract_path_info` into `FieldInfo` as member functions? This makes it easier to read.
OK, good idea!
```rust
field_lifetime: Option<Lifetime>,
field_type: Ident,
is_option: bool,
column_writer_variant: proc_macro2::TokenStream,
```
any reason we need to depend both on proc_macro and proc_macro2? just curious :)
I chose `proc_macro2` to maximize our backwards compatibility; from their description:

> A small shim over the proc_macro crate in the compiler intended to multiplex the stable interface as of 1.15.0 and the interface as of 1.30.0.

So I take in a compiler-defined `proc_macro::TokenStream` and convert it to a `proc_macro2::TokenStream` so I can work on it in a rustc version-independent way. Since we only claim compatibility with nightly, I can see this being unnecessary.
I'm open to removing it; I think it would just be a matter of renaming to `proc_macro::TokenStream`, etc.
I will be addressing this feedback soon. I'm waiting to see what happens with the switchover to the arrow project.
Thanks @xrl . Since the merge is done, could you create a JIRA in Arrow for this feature? Please also paste the description there :)
@sunchao are we doing PRs against the arrow repo? Or is the work still here and it gets rebased into arrow?
@xrl From now on, we should do PRs against the arrow repo. This repo will be frozen. I can help review there.
Ah, I thought that would be the case! Thanks for the clarity.
My goals with nested types would be to convert over to the Row enum (see: #197 (comment)). Unfortunately I haven't used that API, and looking over it I wasn't sure how it would work in this case. I'd like to learn more and then revisit this crate to make it more general with the Row enum or any other techniques. Especially anything that makes determining definition levels easier; it's a tricky part of Parquet I don't have the strongest handle on. This PR, as it is now, is useful in my own project. I am deriving a writer for a struct with 30+ fields and it has greatly improved my productivity. Hopefully that extends to other users and it shows a little sample of what is to come!
Sorry @xrl I just saw your updates. Posted some replies. I do recommend creating an Arrow JIRA for this and posting the PR against that repo though, since eventually this should go there as a sub-crate.
Closing in favor of the ARROW PR!
Related (ongoing) issues to get parquet write support into arrow/parquet-rs
A rebase and significant rewrite of sunchao/parquet-rs#197. Big improvement: I now use a more natural nested enum style; it helps break out the patterns of data types. The rest of the broad strokes still apply.

Goal
===

Writing many columns to a file is a chore. If you can put your values into a struct which mirrors the schema of your file, this `derive(ParquetRecordWriter)` will write out all the fields, in the order in which they are defined, to a row_group.

How to Use
===

```
extern crate parquet;
#[macro_use] extern crate parquet_derive;

#[derive(ParquetRecordWriter)]
struct ACompleteRecord<'a> {
  pub a_bool: bool,
  pub a_str: &'a str,
}
```

RecordWriter trait
===

This is the new trait which `parquet_derive` will implement for your structs.

```
use super::RowGroupWriter;

pub trait RecordWriter<T> {
  fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
}
```

How does it work?
===

The `parquet_derive` crate adds code-generating functionality to the rust compiler. The code generation takes rust syntax and emits additional syntax. This macro expansion works on rust 1.15+ stable. This is a dynamic plugin, loaded by the machinery in cargo. Users don't have to do any special `build.rs` steps or anything like that; it's automatic by including `parquet_derive` in their project. The `parquet_derive/Cargo.toml` has a section saying as much:

```
[lib]
proc-macro = true
```

The rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The `syn` crate parses the struct from a string-representation to an AST (a recursive enum value). The AST contains all the values I care about when generating a `RecordWriter` impl:

- the name of the struct
- the lifetime variables of the struct
- the fields of the struct

The fields of the struct are translated from AST to a flat `FieldInfo` struct. It has the bits I care about for writing a column: `field_name`, `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.

The code then does the equivalent of templating to build the `RecordWriter` implementation. The templating functionality is provided by the `quote` crate. At a high level the template for `RecordWriter` looks like:

```
impl RecordWriter for $struct_name {
  fn write_row_group(..) {
    $(
      { $column_writer_snippet }
    )
  }
}
```

this template is then added under the struct definition, ending up something like:

```
struct MyStruct {
}
impl RecordWriter for MyStruct {
  fn write_row_group(..) {
    { write_col_1(); };
    { write_col_2(); }
  }
}
```

and finally _THIS_ is the code passed to rustc. It's just code now, fully expanded and standalone. If a user ever changes their `struct MyValue` definition, the `ParquetRecordWriter` will be regenerated. There are no intermediate values to version control or worry about.

Viewing the Derived Code
===

To see the generated code before it's compiled, one very useful bit is to install `cargo expand` [more info on gh](https://github.com/dtolnay/cargo-expand), then you can do:

```
$WORK_DIR/parquet-rs/parquet_derive_test
cargo expand --lib > ../temp.rs
```

then you can dump the contents:

```
struct DumbRecord {
    pub a_bool: bool,
    pub a2_bool: bool,
}
impl RecordWriter<DumbRecord> for &[DumbRecord] {
    fn write_to_row_group(
        &self,
        row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
    ) {
        let mut row_group_writer = row_group_writer;
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        };
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        }
    }
}
```

now I need to write out all the combinations of types we support and make sure it writes out data.

Procedural Macros
===

The `parquet_derive` crate can ONLY export the derivation functionality. No traits, nothing else. The derive crate cannot host test cases. It's kind of like a "dummy" crate which is only used by the compiler, never the code.

The parent crate cannot use the derivation functionality, which is important because it means test code cannot be in the parent crate. This forces us to have a third crate, `parquet_derive_test`.

I'm open to being wrong on any one of these finer points. I had to bang on this for a while to get it to compile!

Potentials For Better Design
===

- [x] Recursion could be limited by generating the code as "snippets" instead of one big `quote!` AST generator. Or so I think. It might be nicer to push generating each column's writing code to another loop.
- [X] ~~It would be nicer if I didn't have to be so picky about data going in to the `write_batch` function. Is it possible we could make a version of the function which accepts `Into<DataType>` or similar? This would greatly simplify this derivation code as it would not need to enumerate all the supported types. Something like `write_generic_batch(&[impl Into<DataType>])` would be neat.~~ (not tackling in this generation of the plugin)
- [X] ~~Another idea for improving writing columns: could we have a write function for `Iterator`s? I already have a `Vec<DumbRecord>`; if I could just write a mapping for accessing the one value, we could skip the whole intermediate vec for `write_batch`. Should have some significant memory advantages.~~ (not tackling in this generation of the plugin, it's a bigger parquet-rs enhancement)
- [X] ~~It might be worthwhile to derive a parquet schema directly from a struct definition. That should stamp out opportunities for type errors.~~ (moved to #203)

Status
===

I have successfully integrated this work with my own data exporter (takes postgres/couchdb and outputs a single parquet file).

I think this code is worth including in the project, with the caveat that it only generates simplistic `RecordWriter`s. As people start to use it we can add code generation for more complex, nested structs.

We can convert the nested matching style to a fancier looping style. But for now, this explicit nesting is easier to debug and understand (to me at least!).

Closes #4140 from xrl/parquet_derive

Lead-authored-by: Xavier Lange <xrlange@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Bryant Biggs <bryantbiggs@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
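As a complement to the derived code above, here is a hedged end-to-end sketch of driving the derived `RecordWriter` with this era's file-writer API. The schema string, file path, and `main` wrapper are illustrative, and the import path for the `RecordWriter` trait is an assumption; exact module paths may differ by version.

```rust
#[macro_use]
extern crate parquet_derive;
extern crate parquet;

use std::{fs, rc::Rc};

use parquet::file::properties::WriterProperties;
// RecordWriter's location is assumed here, per the trait definition above.
use parquet::file::writer::{FileWriter, RecordWriter, SerializedFileWriter};
use parquet::schema::parser::parse_message_type;

#[derive(ParquetRecordWriter)]
struct ACompleteRecord<'a> {
    pub a_bool: bool,
    pub a_str: &'a str,
}

fn main() {
    // The schema mirrors the struct's fields, in order.
    let schema = Rc::new(
        parse_message_type(
            "message a_complete_record {
                required boolean a_bool;
                required binary a_str (UTF8);
            }",
        )
        .unwrap(),
    );
    let props = Rc::new(WriterProperties::builder().build());
    let file = fs::File::create("example.parquet").unwrap();
    let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();

    let records = vec![ACompleteRecord { a_bool: true, a_str: "hi" }];
    let mut row_group = writer.next_row_group().unwrap();
    // The derived impl lives on &[ACompleteRecord].
    (&records[..]).write_to_row_group(&mut row_group);
    writer.close_row_group(row_group).unwrap();
    writer.close().unwrap();
}
```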