
Typed Parquet Tuple #1198

Merged
merged 8 commits into twitter:develop on May 14, 2015

Conversation

JiJiTang
Contributor

  • Used as a sink to write tuples with primitive fields and also collection types (List, Set, Map)
  • Create Parquet schema generation macro

@JiJiTang JiJiTang force-pushed the develop branch 3 times, most recently from 2874ad8 to f5f5e0d on February 15, 2015 09:33
@colinmarc

Hey! I'm just a rando, but I work on very similar stuff, and I have some rando thoughts:

This is a really cool approach, because it allows you to load parquet directly into tuples/case classes. However (if I'm reading this correctly), the problem with this PR is that it doesn't actually materialize the tuples/case classes for you, because it's using ParquetTupleScheme and you already have a cascading tuple of the fields from that. It also doesn't support nested schemas (I think) for the same reason.

You could bypass this and make the code simpler if you put this macro more or less directly into a ParquetTypedScheme; rather than reading the records into cascading tuples and then materializing them into case classes/scala tuples, you use the macro to materialize the case classes/scala tuples directly off of the parquet reader. Then you'd end up with an interface like:

    case class Foo(a: String, b: Option[Double])
    ParquetTypedSource[Foo].map { f => (f.a, f.b) } // type checking against your parquet schema!
Anyway, just a suggestion!

@JiJiTang
Contributor Author

Hello @colinmarc,
Thank you for this comment; you are totally right. At the beginning I also wanted to generate the cascading fields directly from the case classes using macros (defined in the scalding-macros module), but that would create a direct dependency on that module from scalding-parquet. So I've defined the fields as method parameters and let users decide whether or not to use scalding-macros. Perhaps I will make all these parameters implicit (I will try to make a commit tonight). You are also right about why it doesn't support nested schemas: a pull request needs to be made in parquet-mr for ParquetTupleScheme to support nested schemas. As you can see, TupleWriteSupport only supports primitive types today.

| required binary e.a1.y;
| required int32 e.a2.x;
| required binary e.a2.y;
| required binary e.y;
julienledem
Contributor

I'm curious, why do you use "." to capture nesting instead of actually nesting groups?

JiJiTang
Contributor Author

@julienledem Hi Julien, I totally agree with you; it would be better to use nested groups. The reason I generate a flat schema is that TupleWriteSupport only supports primitive types right now:

      if (field.isPrimitive()) {
        writePrimitive(record, field.asPrimitiveType());
      } else {
        throw new UnsupportedOperationException("Complex type not implemented");
      }

@julienledem
Contributor

Nice approach

    import _root_.parquet.schema.MessageType

    object MacroImplicits {
      implicit def materializeCaseClassTypeDescriptor[T]: MessageType = macro SchemaProviderImpl.toParquetSchemaImp[T]
    }
Collaborator

comment

@colinmarc

@JiJiTang just to clarify - I actually meant you could replace ParquetTupleScheme with a macro, a lot like this one. Rather than just generating a schema from a case class, you could also generate a parquet RecordMaterializer implementation that reads values directly into case class instances.

Then you wouldn't be limited to primitive types, and you'd be able to support nested structs. And you wouldn't have to define setters/getters for those case classes.
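To make the suggestion concrete, here is a rough hand-written sketch of the kind of RecordMaterializer the macro could generate for the earlier Foo example. The names and structure are illustrative, not code from this PR; it assumes the pre-rename parquet.io.api package used by scalding-parquet at the time:

    import parquet.io.api.{ Binary, Converter, GroupConverter, PrimitiveConverter, RecordMaterializer }

    case class Foo(a: String, b: Option[Double])

    // Illustrative only: what a macro-generated materializer for Foo might expand to.
    class FooMaterializer extends RecordMaterializer[Foo] {
      private var a: String = _
      private var b: Option[Double] = None

      private val root = new GroupConverter {
        private val aConverter = new PrimitiveConverter {
          override def addBinary(value: Binary): Unit = a = value.toStringUsingUTF8
        }
        private val bConverter = new PrimitiveConverter {
          override def addDouble(value: Double): Unit = b = Some(value)
        }
        override def getConverter(fieldIndex: Int): Converter =
          if (fieldIndex == 0) aConverter else bConverter
        override def start(): Unit = { a = null; b = None } // reset state per record
        override def end(): Unit = ()
      }

      override def getCurrentRecord: Foo = Foo(a, b)
      override def getRootConverter: GroupConverter = root
    }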

@JiJiTang
Contributor Author

@colinmarc Good point, and thank you very much for the suggestion. It's straightforward to write a macro that generates a RecordMaterializer (as you said, this would support nested case classes). But I'm not sure we should do this in Scalding: it means bypassing the current ParquetTupleScheme implementation in parquet-mr entirely. It also means we'd close the door on users providing their own customized Parquet schema. Perhaps I've misunderstood you; let me think about it.

JiJiTang added a commit to JiJiTang/scalding that referenced this pull request Feb 23, 2015
   * Macro support nested group type
   * Add r/w support nested case classes
   * Tests + refacto
@JiJiTang
Contributor Author

Hi @julienledem @ianoc, this weekend I've added code that supports reading/writing nested case classes with nested Parquet message types, plus some macro facilities to generate the related read/write support classes. Tuple setters and converters no longer need to be defined by hand, and the tuple Materializer is now generated directly from the case class. The generated schema now looks like this:

message SampleClassC {
  required group a {
    required int32 x;
    required binary y;
  }
  required group b {
    required group a {
      required int32 x;
      required binary y;
    }
    required binary y;
  }
}
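Read back from that schema, the case classes in this example would be shaped roughly like this (inferred from the message type above; the exact test classes are in the unit tests):

    case class SampleClassA(x: Int, y: String)
    case class SampleClassB(a: SampleClassA, y: String)
    case class SampleClassC(a: SampleClassA, b: SampleClassB)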

Please check the unit tests for details.
Sorry this commit contains so many changes, and thank you guys very much for taking the time to review the code. Please let me know your thoughts.

And hi @colinmarc, I've tried to define macros that generate all the read/write support classes, but I cannot generate the WriteSupport or ReadSupport classes themselves: macros don't allow expanding a class at the top level, which means all the generated classes would be inner classes that cannot be instantiated via Java reflection in ParquetInputFormat without providing the parent class instance. I've tried to simplify the definition of the TypedParquet source as much as possible. Please check the committed unit tests for details, and thanks a lot.
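A minimal sketch of the reflection problem being described (Outer/Inner are hypothetical names, not classes from this PR):

    class Outer {
      class Inner // what a macro-expanded class nested inside another class amounts to
    }

    object InnerReflectionDemo extends App {
      // An inner class constructor takes the enclosing instance as a hidden
      // first parameter, so a no-arg newInstance (as ParquetInputFormat would
      // attempt) cannot work without an Outer instance.
      val ctor = classOf[Outer#Inner].getDeclaredConstructors.head
      println(ctor.getParameterTypes.toList) // List(class Outer)
    }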

JiJiTang added commits to JiJiTang/scalding that referenced this pull request Feb 24–25, 2015
   * Macro support nested group type
   * Add r/w support nested case classes
   * Tests + refacto
   * Macro write support improvement

    override def createValue(): Any = {
      if (fieldValues.isEmpty) null
      else classOf[$tpe].getConstructors()(0).newInstance(fieldValues.toSeq.map(_.asInstanceOf[AnyRef]): _*)
    }
Contributor

we should be able to do better than creating through reflection. After all it's a macro.

JiJiTang
Contributor Author

Fixed in the latest commit
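For context, the fix replaces runtime reflection with a constructor call spliced directly into the macro expansion. A self-contained sketch of the idea (hypothetical helper, not the PR's actual code; the macro must live in a separate compilation unit from its call sites):

    import scala.language.experimental.macros
    import scala.reflect.macros.blackbox

    object Construct {
      // Expands to `new T(args...)` at compile time; no reflection remains
      // in the generated program.
      def fromValues[T](args: Any*): T = macro impl[T]

      def impl[T: c.WeakTypeTag](c: blackbox.Context)(args: c.Tree*): c.Tree = {
        import c.universe._
        q"new ${weakTypeOf[T]}(..$args)"
      }
    }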

JiJiTang added a commit to JiJiTang/scalding that referenced this pull request Apr 12, 2015
   * Add Byte type support
JiJiTang added a commit to JiJiTang/scalding that referenced this pull request Apr 12, 2015
   * Improve tuple converter macro (delete unnecessary boxing)
@JiJiTang
Contributor Author

Hi @ianoc @julienledem, thank you guys so much for the review. Please check my latest two commits and tell me if there's anything not ready to go (the code is rebased so that it can be merged).

   * Used as sink to write tuple with primitive fields
   * Create Parquet schema generation macro
   * Refacto unit test by using platform test
   * Refacto macro code

(cherry picked from commit 40cd1eb)
   * Macro support nested group type
   * Add r/w support nested case classes
   * Tests + refacto
   * Macro write support improvement
   * Add Byte type support
    import parquet.io.api.{ Binary, Converter, GroupConverter, PrimitiveConverter }
    import scala.util.Try

    // Converter for a single Parquet tuple field; a field can be REQUIRED or OPTIONAL.
    trait TupleFieldConverter[+T] extends Converter {
      def currentValue: T   // the value read for the current record
      def hasValue: Boolean // whether a value has been read (matters for OPTIONAL fields)
    }
johnynek
Collaborator

What does this type do? It is unclear to me. Why use currentValue and hasValue rather than Option[T]?

johnynek
Collaborator

I guess I'm asking two things: 1) can you add a comment to this type? 2) why not use Option rather than two methods?

JiJiTang
Contributor Author

Hi @johnynek, thanks a lot for the review. This trait models a Parquet tuple field reading converter. A field in a Parquet tuple can be "REQUIRED" or "OPTIONAL". Not using Option[T] avoids the Option boxing overhead for fields that are required (advice from @julienledem). hasValue tells whether an optional field has a value when reading from Parquet. Optional fields are boxed into Option at the moment the parent tuple is created, since in the macro implementation we know whether or not a field is an Option type.
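A minimal sketch of that idea for a string field (hypothetical class names; the real macro-generated converters live in scalding-parquet):

    import parquet.io.api.{ Binary, PrimitiveConverter }

    // Required field: expose the raw value, no Option allocation per record.
    class RequiredStringConverter extends PrimitiveConverter {
      var currentValue: String = _
      override def addBinary(value: Binary): Unit = currentValue = value.toStringUsingUTF8
    }

    // Optional field: track presence with hasValue; the parent tuple builder
    // wraps it as `if (c.hasValue) Some(c.currentValue) else None`.
    class OptionalStringConverter extends PrimitiveConverter {
      var hasValue: Boolean = false
      var currentValue: String = _
      override def addBinary(value: Binary): Unit = {
        currentValue = value.toStringUsingUTF8
        hasValue = true
      }
    }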

JiJiTang
Contributor Author

Hi @johnynek, I've added some comments for the trait. Is that OK for you? Please let me know your thoughts.

JiJiTang added a commit to JiJiTang/scalding that referenced this pull request Apr 15, 2015
   * Macro support list collection type
   * Use two different classes for modeling required and optional field converters
   * Delete unnecessary class cast, type all the field converter classes
JiJiTang added a commit to JiJiTang/scalding that referenced this pull request Apr 21, 2015
    * Add macro support for collection types (LIST, SET, MAP)
@JiJiTang
Contributor Author

Hi guys, I have added a commit that makes the macros support collection field types (LIST, SET, MAP); please review my latest commit.
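For illustration, the kind of case class this enables as a typed Parquet source/sink (a sketch; TypedParquet is the source family this PR adds, but the exact call signature below is an assumption, not quoted from the PR):

    case class Measurement(
      name: String,                 // primitive field
      samples: List[Double],        // LIST
      tags: Set[String],            // SET
      attributes: Map[String, Long] // MAP
    )

    // Hypothetical usage; see the PR's unit tests for the real API.
    // val source = TypedParquet[Measurement](Seq("hdfs://path/to/parquet"))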

@JiJiTang
Contributor Author

JiJiTang commented May 8, 2015

Could someone please check the latest commits for handling collection types?

ianoc added a commit that referenced this pull request May 14, 2015
@ianoc ianoc merged commit 019ec80 into twitter:develop May 14, 2015
@ianoc
Collaborator

ianoc commented May 14, 2015

@JiJiTang sorry about the delay getting this merged. Those latest commits look OK to me. I think anything more to be done should happen in a follow-up PR; this one was huge. It's great work and an awesome addition. Many thanks.

@JiJiTang
Contributor Author

Hi @ianoc, thank you so much for the code review and for merging this PR. Many thanks also to @julienledem @colinmarc @johnynek for your reviews and good advice. I will follow this PR for future updates.


    fieldType match {
      case tpe if tpe =:= typeOf[String] =>
        writePrimitiveField(q"rc.addBinary(Binary.fromString($fValue))")
johnynek
Collaborator

This macro is not hygienic. We need the fully qualified path here.

JiJiTang
Contributor Author

Hi @johnynek, this is fixed in this commit: JiJiTang@60e7390, merged with PR #1303.
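For reference, the hygienic form simply roots the path inside the quasiquote so the expansion doesn't depend on imports at the call site (a sketch of the fix the comment asks for):

    writePrimitiveField(q"rc.addBinary(_root_.parquet.io.api.Binary.fromString($fValue))")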

@coveralls

Coverage Status

Changes unknown when pulling 536bd0c on JiJiTang:develop into twitter:develop.
