This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

parquet_derive for the new RecordWriter trait #197

Closed
wants to merge 17 commits

Conversation

@xrl (Contributor) commented Dec 1, 2018

Goal

Writing many columns to a file is a chore. If you can put your values into a struct which mirrors the schema of your file, this derive(ParquetRecordWriter) will write out all the fields, in the order in which they are defined, to a row_group.

How to Use

extern crate parquet;
#[macro_use] extern crate parquet_derive;

#[derive(ParquetRecordWriter)]
struct ACompleteRecord<'a> {
  pub a_bool: bool,
  pub a_str: &'a str,
}
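
For context, here is a minimal end-to-end sketch of driving the derived writer, continuing from the snippet above. It assumes the file-writer API of this parquet-rs era (SerializedFileWriter, next_row_group/close_row_group) and that the trait lands at parquet::file::writer::RecordWriter per this PR's layout; the schema string and file path are illustrative.

use std::{fs, rc::Rc};
use parquet::file::properties::WriterProperties;
use parquet::file::writer::{FileWriter, RecordWriter, SerializedFileWriter};
use parquet::schema::parser::parse_message_type;

// The schema still has to mirror the struct's fields, in order.
let schema = Rc::new(parse_message_type(
  "message schema { required boolean a_bool; required binary a_str (UTF8); }"
).unwrap());
let props = Rc::new(WriterProperties::builder().build());
let file = fs::File::create("example.parquet").unwrap();
let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();

let records = vec![ACompleteRecord { a_bool: true, a_str: "hello" }];
let mut row_group = writer.next_row_group().unwrap();
(&records[..]).write_to_row_group(&mut row_group);
writer.close_row_group(row_group).unwrap();
writer.close().unwrap();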

RecordWriter trait

This is the new trait which parquet_derive will implement for your structs.

use super::RowGroupWriter;

pub trait RecordWriter<T> {
  fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
}

How does it work?

The parquet_derive crate adds code-generation functionality to the Rust compiler: it takes Rust syntax and emits additional syntax. This macro expansion works on Rust 1.15+ stable. It is a dynamic plugin, loaded by the machinery in cargo, so users don't have to do any special build.rs steps or anything like that; it's automatic once parquet_derive is included in their project. The parquet_derive/Cargo.toml has a section saying as much:

[lib]
proc-macro = true

The Rust struct tagged with #[derive(ParquetRecordWriter)] is provided to the parquet_record_writer function in parquet_derive/src/lib.rs. The syn crate parses the struct from a string representation into an AST (a recursive enum value). The AST contains all the values I care about when generating a RecordWriter impl (sketched just after this list):

  • the name of the struct
  • the lifetime variables of the struct
  • the fields of the struct
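
A condensed sketch of that parsing step, assembled from the fragments quoted later in this review (parse_macro_input, Data::Struct); the body is abbreviated and the trailing unimplemented!() stands in for the real generation:

use syn::{parse_macro_input, Data, DataStruct, DeriveInput, Fields};

#[proc_macro_derive(ParquetRecordWriter)]
pub fn parquet_record_writer(input: proc_macro::TokenStream) -> proc_macro::TokenStream {
  let input: DeriveInput = parse_macro_input!(input as DeriveInput);
  let struct_name = input.ident;          // the name of the struct
  let generics = input.generics;          // carries the lifetime variables
  let fields: Fields = match input.data { // the fields of the struct
    Data::Struct(DataStruct { fields, .. }) => fields,
    _ => unimplemented!("only structs are supported"),
  };
  // ...translate each field to a FieldInfo and emit the impl with quote!...
  unimplemented!()
}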

The fields of the struct are translated from AST to a flat FieldInfo struct. It has the bits I care about for writing a column: field_name, field_lifetime, field_type, is_option, column_writer_variant.
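
Roughly, that flat struct looks like this; four of the field types appear verbatim in a review hunk further down, while field_name: syn::Ident is my assumption:

struct FieldInfo {
  field_name: syn::Ident,
  field_lifetime: Option<syn::Lifetime>,
  field_type: syn::Ident,
  is_option: bool,
  column_writer_variant: proc_macro2::TokenStream,
}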

The code then does the equivalent of templating to build the RecordWriter implementation. The templating functionality is provided by the quote crate. At a high level, the template for RecordWriter looks like:

impl RecordWriter for $struct_name {
  fn write_row_group(..) {
    $({
      $column_writer_snippet
    })
  } 
}
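
A hedged sketch of how that templating might be wired up with the quote! macro; struct_name and writer_snippets (a hypothetical Vec<proc_macro2::TokenStream> holding each column's writing code) are stand-in names:

let expanded = quote::quote! {
  impl RecordWriter<#struct_name> for &[#struct_name] {
    fn write_to_row_group(
      &self,
      row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
    ) {
      let mut row_group_writer = row_group_writer;
      // Each column's snippet is spliced in as its own block:
      #(
        { #writer_snippets }
      )*
    }
  }
};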

This template is then expanded beneath the struct definition, ending up something like:

struct MyStruct {
}
impl RecordWriter for MyStruct {
  fn write_row_group(..) {
    {
       write_col_1();
    };
   {
       write_col_2();
   }
  }
}

and finally THIS is the code passed to rustc. It's just code now, fully expanded and standalone. If a user ever changes their struct MyStruct definition, the ParquetRecordWriter impl will be regenerated; there are no intermediate values to version control or worry about.

Viewing the Derived Code

To see the generated code before it's compiled, one very useful trick is to install cargo expand (more info at https://github.com/dtolnay/cargo-expand); then you can do:

cd $WORK_DIR/parquet-rs/parquet_derive_test
cargo expand --lib > ../temp.rs

then you can dump the contents:

struct DumbRecord {
    pub a_bool: bool,
    pub a2_bool: bool,
}
impl RecordWriter<DumbRecord> for &[DumbRecord] {
    fn write_to_row_group(
        &self,
        row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
    ) {
        let mut row_group_writer = row_group_writer;
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        };
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        }
    }
}

Now I need to write out all the combinations of types we support and make sure it writes out data.

Procedural Macros

The parquet_derive crate can ONLY export the derivation functionality. No traits, nothing else. The derive crate cannot host test cases. It's kind of like a "dummy" crate which is only used by the compiler, never by runtime code.

The parent crate cannot use the derivation functionality, which is important because it means test code cannot be in the parent crate. This forces us to have a third crate, parquet_derive_test.

I'm open to being wrong on any one of these finer points. I had to bang on this for a while to get it to compile!

Potentials For Better Design

  • Recursion could be limited by generating the code as "snippets" instead of one big quote! AST generator. Or so I think. It might be nicer to push generating each column's writing code into another loop.
  • It would be nicer if I didn't have to be so picky about data going into the write_batch function. Is it possible we could make a version of the function which accepts Into<DataType> or similar? This would greatly simplify this derivation code, as it would not need to enumerate all the supported types. Something like write_generic_batch(&[impl Into<DataType>]) would be neat. (not tackling in this generation of the plugin)
  • Another idea for improving column writing: could we have a write function for Iterators? I already have a Vec<DumbRecord>; if I could just write a mapping for accessing the one value, we could skip the whole intermediate vec for write_batch. That should have some significant memory advantages. (not tackling in this generation of the plugin, it's a bigger parquet-rs enhancement)
  • It might be worthwhile to derive a parquet schema directly from a struct definition. That should stamp out opportunities for type errors. (moved to Derive parquet schema from struct #203)

Status

I have successfully integrated this work with my own data exporter (takes postgres/couchdb and outputs a single parquet file).

I think this code is worth including in the project, with the caveat that it only generates simplistic RecordWriters. As people start to use it, we can add code generation for more complex, nested structs.

Closes #192

@xrl (Contributor, Author) commented Dec 1, 2018

What's really unfortunate is that cargo expand detected rustfmt and reformatted the entire project. I will need to clean that up somehow. What a pain 😄 (edit: I think I mostly fixed it!)

@xrl xrl force-pushed the parquet_derive branch 2 times, most recently from 31ac273 to 38ec590 Compare December 1, 2018 22:45
@xrl (Contributor, Author) commented Dec 3, 2018

@sadikovi have you had a chance to review this work? Is this anywhere close to what you had in mind when I opened #192? 😄

@sunchao sunchao changed the title parqet_derive for the new RecordWriter trait parquet_derive for the new RecordWriter trait Dec 3, 2018
@sadikovi (Collaborator) commented Dec 3, 2018

Your approach certainly looks interesting; I did not know that it is possible to do that with macros and parsing of structs.

As I mentioned previously, my plan was to reuse the Row enum that we have, since it maps to pretty much anything, and to use triplet iterators to write data. The tricky part is specifying the schema, but it could always be inferred from the data if needed.

Anyway, I might start writing some code later to see the amount of work required, but I would prefer not to duplicate effort if possible - I will be happy to help out with writes.

@xrl (Contributor, Author) commented Dec 3, 2018

@sadikovi can you provide some pseudocode for how the Row enum would work?

@sadikovi (Collaborator) commented Dec 3, 2018

Good question. It would be similar to what you have, except we would use getters to extract values and write them. Plus, certain types would allow recursive (maybe) calls to write maps, arrays and structs. I can't show the exact code to do so, but this is how I think it could be done. Not sure if it helps.

@xrl (Contributor, Author) commented Dec 3, 2018

That's very interesting. I think that technique could be folded into this code generation. I think I'm generating too much code, code which could be avoided with some From/Into/AsRef traits or a slightly different API. Handling all the recursive definitions is going to need more sophistication than I'm currently equipped to write!

There's a lot of value to just putting a derive(...) on a structure and letting the procedural macro emit the writer code for each field. I'm very interested to hear more from you.

What is your opinion on getting this work merged into master? What do you want to see/know?

@sadikovi (Collaborator) commented Dec 3, 2018

Well, I need to study your code. A usage example would be nice, and more documentation around the code - that will make it easier for people to follow and extend. It is great work, though it would be appreciated if you could submit the work in chunks into master, not as one large PR - easier for people to review - but not required.

You don't have to do that now; I assume there is still a lot of work to do. TL;DR: I need to review the code again :).

@sadikovi (Collaborator) commented Dec 3, 2018

I will have a look in the next couple of days.

@xrl (Contributor, Author) commented Dec 3, 2018

A usage example is in the README.md for parquet_derive.
Similarly, I would have preferred a smaller unit of work but I didn't see an alternative path. This is the whole functionality.

Would you like to set up a time for a call to walk through the code and ask questions? After you've had time to satisfy your own regular code review methods 😄

I will support one more pointer style (a_property: &'a Option<_>). That won't change much of the existing code, and it's useful for a lot of situations.

I don't see this code as being the end of RecordWriter generation, just a first step for shallow/unnested structs. It satisfies my production code needs. In the future it can be modified or rewritten to support arbitrarily nested structs; it just needs some procedural macro-fu 😄

@sunchao (Owner) commented Dec 5, 2018

Thanks for the effort @xrl ! I'll also take a look at this soon. :)

@sadikovi (Collaborator) commented Dec 5, 2018

@xrl As I am reviewing the code, can you update the PR description to mention the goal of this PR? I struggle to derive it from the code. Is it adding a record writer? Thanks.

@xrl (Contributor, Author) commented Dec 5, 2018

@sadikovi please post the code you are trying to derive and what errors you are seeing. Also, did you check out this branch and try running cargo test from the parquet_derive_test folder?

Edit: I added Goal, How to Use, and RecordWriter trait sections to the PR.

@xrl (Contributor, Author) commented Dec 5, 2018

@sadikovi I have added more explanation to the PR text above. The How does it work? section should help clarify!

@xrl (Contributor, Author) commented Dec 6, 2018

I'm happy with the current status of this crate. I am using it in my own project and it generates reliable RecordWriter implementations. I think this means it's good for a thorough tire kicking @sadikovi @sunchao :)

@sadikovi (Collaborator) left a comment

Thanks for the PR. Looks good. It is very interesting that code can write structs.

I would like to see how much work it would be to handle nested types, similar to what we do in reads. I am inclined to suggest using the existing Row API, which allows specifying those things, but it might need to be extended to provide information for decimals, etc. I am biased towards reusing what we have built in the record reader and applying similar concepts to build record writers.

These are just my two cents. If @sunchao LGTMs it, then merge it.


## Features
- [X] Support writing `String`, `&str`, `bool`, `i32`, `f32`, `f64`
Collaborator:

Byte arrays and unsigned variants?

Contributor Author:

Good call on the unsigned variants. To be clear, to satisfy that requirement I need to add u32/u64, right? Parquet does not handle 8- or 16-bit values, right?

And when it comes to "byte arrays", are you talking about fixed-size arrays like let arr: [u8; 16] = [0; 16];? Or do you mean let vec: Vec<u8> = vec![0; 16];? I haven't used either of those in parquet-rs so I need a little guidance.

Collaborator:

By "byte arrays" I meant parquet type ByteArray, not just strings.

Contributor Author:

@sadikovi how do I write unsigned variants? I don't see a column writer for that.

let field_name = self.field_name.clone();
let field_type = self.field_type.clone();
let column_writer_variant = self.column_writer_variant.clone();
let is_option = self.is_option;
Collaborator:

My main concern is that Option<i32> values [1, 2, 3, 4, 5, 6] will be encoded differently compared to i32 values [1, 2, 3, 4, 5, 6], because of the definition levels. I think we should check if there are nulls first and then decide which option to choose.
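
A hedged sketch of the distinction being raised, in the same shape as the expanded column blocks above: for a hypothetical maybe_num: Option<i32> field, the generated snippet would have to pass definition levels to write_batch and write only the non-null values, instead of a plain values-only call.

let vals: Vec<Option<i32>> = self.iter().map(|x| x.maybe_num).collect();
// One definition level per row: 1 marks a present value, 0 marks a null.
let definition_levels: Vec<i16> = vals
  .iter()
  .map(|v| if v.is_some() { 1 } else { 0 })
  .collect();
let non_null: Vec<i32> = vals.iter().filter_map(|v| *v).collect();
typed.write_batch(&non_null[..], Some(&definition_levels[..]), None).unwrap();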

Contributor Author:

Great point.

@sadikovi (Collaborator) commented Dec 9, 2018 via email

@sunchao (Owner) left a comment

Thanks @xrl ! This is a great addition. I'm still getting more familiar with procedural macros but left some initial feedback. My biggest comment right now is whether we can remove or merge those unimplemented clauses to make the code shorter and cleaner. Also, as @sadikovi pointed out, I wonder if there's a plan to support nested types at some point, and whether that's easy to achieve. Obviously we don't have to tackle that in this PR. :-)

let input: DeriveInput = parse_macro_input!(input as DeriveInput);
let fields: Fields = match input.data {
Data::Struct(DataStruct { fields, .. }) => fields,
Data::Enum(_) => unimplemented!("don't support enum"),
Owner:

Can we change this to assertions and panic? The input should always be a struct, right?

Contributor Author:

The input should always be a struct, yes.

The reason I match on all the input.data variants is so I can get a nicer message in the panic. Under the hood, unimplemented!() is actually panic!() with a little bit of extra text:

macro_rules! unimplemented {
    () => (panic!("not yet implemented"));
    ($($arg:tt)+) => (panic!("not yet implemented: {}", format_args!($($arg)*)));
}

so the error will look like not yet implemented: don't support enum. I think that's a little more user-friendly, and it still works the same as a panic. Is that OK? Now that I've explained the choice, I am happy to switch to the plain panic!() macro if you'd like.

Contributor Author:

@sunchao how do you feel about this point?

Owner:

Yes, unimplemented looks good. Can you combine all the "unimplemented" branches though? Assuming Data implements Display or Debug, you can print the type in the error message that way.

impl FieldInfo {
fn from(f: &Field) -> Self {
let (field_type, field_lifetime, is_option) = match &f.ty {
Type::Slice(_) => unimplemented!("unsupported type: Slice"),
Owner:

Can we merge all the unhandled clauses into a single one?

Contributor Author:

My concern here is handling user-reported errors. I can do a generic fall-through, but then the error can't include what generated it. Instead of this match it would be something like:

match &f.ty {
      Type::Reference(ref tr) => {
        //                unimplemented!("lifetime: {:#?}", lifetime);
        extract_type_reference_info(tr)
      },
      Type::Path(tp) => {
        let (ident, is_option, _) = extract_path_info(&tp);
        (ident, None, is_option)
      },
      _ => unimplemented!("unsupported type")
    }

so if someone tries to derive:

#[derive(ParquetRecordWriter)]
struct Foo<'a> {
  baz: String,
  bar: &'a [u8]
}

it will just panic at compile time with unsupported type, and I don't think the user will know right away which field caused the failure. This was a user-friendliness consideration.

If we're OK omitting type-failure information then I'm OK simplifying; it will remove quite a few lines.

Contributor Author:

I just got to see this error in action when I tried compiling a test that turns other fields into ByteArray:

not yet implemented: don't know Vec for Field

It's not the best error, but it lets me know the Vec is the issue.

Contributor Author:

@sunchao what do you think about this response?

Owner:

@xrl sorry I just saw your comments. In the example you gave above, maybe we can do this:

match &f.ty {
      Type::Reference(ref tr) => {
        //                unimplemented!("lifetime: {:#?}", lifetime);
        extract_type_reference_info(tr)
      },
      Type::Path(tp) => {
        let (ident, is_option, _) = extract_path_info(&tp);
        (ident, None, is_option)
      },
      t @ _ => unimplemented!("unsupported type {}", t)
    }

assuming Type implements Display or Debug. What do you think?

Contributor Author:

This is great, thanks!

))
}

fn extract_type_reference_info(
Owner:

Can we move this and extract_path_info into FieldInfo as member functions? That makes it easier to read.

Contributor Author:

OK, good idea!

field_lifetime: Option<Lifetime>,
field_type: Ident,
is_option: bool,
column_writer_variant: proc_macro2::TokenStream,
Owner:

Any reason we need to depend on both proc_macro and proc_macro2? Just curious :)

@xrl (Contributor, Author) commented Dec 21, 2018

I chose proc_macro2 to maximize our backwards compatibility; from their description:

A small shim over the proc_macro crate in the compiler intended to multiplex the stable interface as of 1.15.0 and the interface as of 1.30.0.

So I take in a compiler-defined proc_macro::TokenStream and convert it to a proc_macro2::TokenStream so I can work on it in a rustc-version-independent way. Since we only claim compatibility with nightly, I can see this being unnecessary.

I'm open to removing it; I think it would just be a matter of renaming to proc_macro::TokenStream, etc.
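
The round trip in question is just a pair of Into conversions at the proc-macro boundary; a minimal sketch, with the actual generation elided:

#[proc_macro_derive(ParquetRecordWriter)]
pub fn parquet_record_writer(input: proc_macro::TokenStream) -> proc_macro::TokenStream {
  // The compiler's tokens convert into proc_macro2's version-independent type...
  let _input: proc_macro2::TokenStream = input.into();
  // ...all parsing and generation happens on proc_macro2 types (syn and quote use them)...
  let generated: proc_macro2::TokenStream = quote::quote! { /* generated impl */ };
  // ...and the result converts back at the boundary.
  generated.into()
}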

@xrl (Contributor, Author) commented Dec 19, 2018

I will be addressing this feedback soon. I'm waiting to see what happens with the switchover to the Arrow project.

@sunchao (Owner) commented Dec 19, 2018

Thanks @xrl . Since the merge is done, could you create a JIRA in Arrow for this feature? Please also paste the description there :)

@xrl (Contributor, Author) commented Dec 19, 2018

@sunchao are we doing PRs against the arrow repo? Or is the work still here and it gets rebased into arrow?

@sunchao (Owner) commented Dec 19, 2018

@xrl From now on, we should do PRs against the arrow repo. This repo will be frozen. I can help review there.

@xrl (Contributor, Author) commented Dec 19, 2018

Ah, I thought that would be the case! Thanks for the clarity.

@xrl (Contributor, Author) commented Dec 21, 2018

My goal with nested types would be to convert over to the Row enum (see: #197 (comment)). Unfortunately I haven't used that API, and looking over it I wasn't sure how it would work in this case.

I'd like to learn more, then revisit this crate to make it more general with the Row enum or other techniques. Especially anything that makes determining definition levels easier; that's a tricky part of Parquet I don't have the strongest handle on.

This PR, as it stands, is useful in my own project. I am deriving a writer for a struct with 30+ fields and it has greatly improved my productivity. Hopefully that extends to other users and shows a little sample of what is to come!

@coveralls

Pull Request Test Coverage Report for Build 711

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at ?%

Totals Coverage Status
  • Change from base Build 690: 0.0%
  • Covered Lines:
  • Relevant Lines: 0

💛 - Coveralls


@sunchao (Owner) commented Jan 4, 2019

Sorry @xrl, I just saw your updates. Posted some replies. I do recommend creating an Arrow JIRA for this and posting the PR against that repo though, since eventually this should go there as a sub-crate.

@xrl (Contributor, Author) commented Apr 11, 2019

Closing in favor of the ARROW PR!

@xrl xrl closed this Apr 11, 2019
brainstorm added a commit to brainstorm/htsget-indexer that referenced this pull request Nov 12, 2019
…elated (ongoing) issues to get parquet write support into arrow/parquet-rs
nevi-me added a commit to apache/arrow that referenced this pull request Sep 14, 2020
A rebase and significant rewrite of sunchao/parquet-rs#197

Big improvement: I now use a more natural nested enum style; it helps break out what the patterns of data types are. The rest of the broad strokes still apply.

(The commit message repeats the PR description above, adding to its Status section:) We can convert the nested matching style to a fancier looping style. But for now, this explicit nesting is easier to debug and understand (to me at least!).

Closes #4140 from xrl/parquet_derive

Lead-authored-by: Xavier Lange <xrlange@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Bryant Biggs <bryantbiggs@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
Successfully merging this pull request may close these issues.

High Level Record Writer
4 participants