
Typed Parquet Tuple #1198

Merged
merged 8 commits into twitter:develop on May 14, 2015

Conversation

JiJiTang
Contributor

  • Used as a sink to write tuples with primitive fields and also collection types (List, Set, Map)
  • Create Parquet schema generation macro

@JiJiTang JiJiTang force-pushed the develop branch 3 times, most recently from 2874ad8 to f5f5e0d on February 15, 2015 09:33
@colinmarc

Hey! I'm just a rando, but I work on very similar stuff, and I have some rando thoughts:

This is a really cool approach, because it allows you to load parquet directly into tuples/case classes. However (if I'm reading this correctly), the problem with this PR is that it doesn't actually materialize the tuples/case classes for you, because it's using ParquetTupleScheme and you already have a cascading tuple of the fields from that. It also doesn't support nested schemas (I think) for the same reason.

You could bypass this and make the code simpler if you put this macro more or less directly into a ParquetTypedScheme; rather than reading the records into cascading tuples and then materializing them into case classes/scala tuples, you use the macro to materialize the case classes/scala tuples directly off of the parquet reader. Then you'd end up with an interface like:

    case class Foo(a: String, b: Option[Double])
    ParquetTypedSource[Foo].map { f => (f.a, f.b) } // type checking against your parquet schema!
Anyway, just a suggestion!

@JiJiTang
Contributor Author

Hello @colinmarc,
Thank you for this comment; you are totally right. At the beginning I also wanted to generate the cascading fields directly from the case classes using macros (defined in the scalding-macros module), but that would create a direct dependency on that module from scalding-parquet. So I've defined the fields as method parameters and let users decide whether or not to use scalding-macros. Perhaps I will make all these parameters implicit (I will try to make a commit tonight). You are also right about why it doesn't support nested schemas: a pull request needs to be made in parquet-mr for ParquetTupleScheme to support nested schemas. As you can see, TupleWriteSupport only supports primitive types today.

| required binary e.a1.y;
| required int32 e.a2.x;
| required binary e.a2.y;
| required binary e.y;
julienledem
Contributor

I'm curious, why do you use "." to capture nesting instead of actually nesting groups?

JiJiTang
Contributor Author

@julienledem Hi Julien, I totally agree with you; it would be better to use nested groups. The reason I generate a flat schema is that TupleWriteSupport only supports primitive types right now:

      if (field.isPrimitive()) {
        writePrimitive(record, field.asPrimitiveType());
      } else {
        throw new UnsupportedOperationException("Complex type not implemented");
      }

@julienledem
Contributor

Nice approach

    import _root_.parquet.schema.MessageType

    object MacroImplicits {
      implicit def materializeCaseClassTypeDescriptor[T]: MessageType = macro SchemaProviderImpl.toParquetSchemaImp[T]
    }
Collaborator

comment

@colinmarc

@JiJiTang just to clarify - I actually meant you could replace ParquetTupleScheme with a macro, a lot like this one. Rather than just generating a schema from a case class, you could also generate a parquet RecordMaterializer implementation that reads values directly into case class instances.

Then you wouldn't be limited to primitive types, and you'd be able to support nested structs. And you wouldn't have to define setters/getters for those case classes.
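To make the suggestion concrete, here is a rough hand-written sketch of the kind of RecordMaterializer the macro could generate for the earlier Foo example. The names and structure are illustrative, not code from this PR; it assumes the pre-rename parquet.io.api package used by scalding-parquet at the time:

    import parquet.io.api.{ Binary, Converter, GroupConverter, PrimitiveConverter, RecordMaterializer }

    case class Foo(a: String, b: Option[Double])

    // Illustrative only: what a macro-generated materializer for Foo might expand to.
    class FooMaterializer extends RecordMaterializer[Foo] {
      private var a: String = _
      private var b: Option[Double] = None

      private val root = new GroupConverter {
        private val aConverter = new PrimitiveConverter {
          override def addBinary(value: Binary): Unit = a = value.toStringUsingUTF8
        }
        private val bConverter = new PrimitiveConverter {
          override def addDouble(value: Double): Unit = b = Some(value)
        }
        override def getConverter(fieldIndex: Int): Converter =
          if (fieldIndex == 0) aConverter else bConverter
        override def start(): Unit = { a = null; b = None } // reset state per record
        override def end(): Unit = ()
      }

      override def getCurrentRecord: Foo = Foo(a, b)
      override def getRootConverter: GroupConverter = root
    }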

@JiJiTang
Contributor Author

@colinmarc Good point, and thank you very much for the suggestion. It's straightforward to write a macro that generates a RecordMaterializer (as you said, this would support nested case classes). But I'm not sure we should do this in Scalding: it means bypassing the current ParquetTupleScheme implementation in parquet-mr entirely. It also means we'd close the door on users providing their own customized Parquet schema. Perhaps I've misunderstood you; let me think about it.

JiJiTang added a commit to JiJiTang/scalding that referenced this pull request Feb 23, 2015
   * Macro support nested group type
   * Add r/w support nested case classes
   * Tests + refacto
@JiJiTang
Contributor Author

Hi @julienledem @ianoc, this weekend I've added code that supports reading/writing nested case classes with nested Parquet message types, plus some macro facilities to generate the related read/write support classes. Tuple setters and converters no longer need to be defined by hand, and the tuple Materializer is now generated directly from the case class. The generated schema now looks like this:

message SampleClassC {
  required group a {
    required int32 x;
    required binary y;
  }
  required group b {
    required group a {
      required int32 x;
      required binary y;
    }
    required binary y;
  }
}
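Read back from that schema, the case classes in this example would be shaped roughly like this (inferred from the message type above; the exact test classes are in the unit tests):

    case class SampleClassA(x: Int, y: String)
    case class SampleClassB(a: SampleClassA, y: String)
    case class SampleClassC(a: SampleClassA, b: SampleClassB)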

Please check the unit tests for details.
Sorry this commit contains so many changes, and thank you guys very much for taking the time to review the code. Please let me know your thoughts.

And hi @colinmarc, I've tried to define macros that generate all the read/write support classes, but I cannot generate the WriteSupport or ReadSupport classes themselves: macros don't allow expanding a class at the top level, which means all the generated classes would be inner classes that cannot be instantiated via Java reflection in ParquetInputFormat without providing the parent class instance. I've tried to simplify the definition of the TypedParquet source as much as possible. Please check the committed unit tests for details, and thanks a lot.
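A minimal sketch of the reflection problem being described (Outer/Inner are hypothetical names, not classes from this PR):

    class Outer {
      class Inner // what a macro-expanded class nested inside another class amounts to
    }

    object InnerReflectionDemo extends App {
      // An inner class constructor takes the enclosing instance as a hidden
      // first parameter, so a no-arg newInstance (as ParquetInputFormat would
      // attempt) cannot work without an Outer instance.
      val ctor = classOf[Outer#Inner].getDeclaredConstructors.head
      println(ctor.getParameterTypes.toList) // List(class Outer)
    }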

JiJiTang added commits to JiJiTang/scalding that referenced this pull request Feb 24–25, 2015
   * Macro support nested group type
   * Add r/w support nested case classes
   * Tests + refacto
   * Macro write support improvement

    override def createValue(): Any = {
      if (fieldValues.isEmpty) null
      else classOf[$tpe].getConstructors()(0).newInstance(fieldValues.toSeq.map(_.asInstanceOf[AnyRef]): _*)
    }
Contributor

we should be able to do better than creating through reflection. After all it's a macro.

JiJiTang
Contributor Author

Fixed in the latest commit
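For context, the fix replaces runtime reflection with a constructor call spliced directly into the macro expansion. A self-contained sketch of the idea (hypothetical helper, not the PR's actual code; the macro must live in a separate compilation unit from its call sites):

    import scala.language.experimental.macros
    import scala.reflect.macros.blackbox

    object Construct {
      // Expands to `new T(args...)` at compile time; no reflection remains
      // in the generated program.
      def fromValues[T](args: Any*): T = macro impl[T]

      def impl[T: c.WeakTypeTag](c: blackbox.Context)(args: c.Tree*): c.Tree = {
        import c.universe._
        q"new ${weakTypeOf[T]}(..$args)"
      }
    }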

JiJiTang added a commit to JiJiTang/scalding that referenced this pull request Apr 12, 2015
   * Add Byte type support
JiJiTang added a commit to JiJiTang/scalding that referenced this pull request Apr 12, 2015
   * Improve tuple converter macro (delete unnecessary boxing)
@JiJiTang
Contributor Author

Hi @ianoc @julienledem, thank you guys so much for the review. Please check my latest two commits and tell me if there's anything not ready to go (the code is rebased so that it can be merged).

   * Used as sink to write tuple with primitive fields
   * Create Parquet schema generation macro
   * Refacto unit test by using platform test
   * Refacto macro code

(cherry picked from commit 40cd1eb)
   * Macro support nested group type
   * Add r/w support nested case classes
   * Tests + refacto
   * Macro write support improvement
   * Add Byte type support
    import parquet.io.api.{ Binary, Converter, GroupConverter, PrimitiveConverter }
    import scala.util.Try

    // Converter for a single Parquet tuple field; a field can be REQUIRED or OPTIONAL.
    trait TupleFieldConverter[+T] extends Converter {
      def currentValue: T   // the value read for the current record
      def hasValue: Boolean // whether a value has been read (matters for OPTIONAL fields)
    }
johnynek
Collaborator

What does this type do? It is unclear to me. Why use currentValue and hasValue rather than Option[T]?

johnynek
Collaborator

I guess I'm asking two things: 1) can you add a comment to this type? 2) why not use Option rather than two methods?

JiJiTang
Contributor Author

Hi @johnynek, thanks a lot for the review. This trait models a Parquet tuple field reading converter. A field in a Parquet tuple can be "REQUIRED" or "OPTIONAL". Not using Option[T] avoids the Option boxing overhead for fields that are required (advice from @julienledem). hasValue tells whether an optional field has a value when reading from Parquet. Optional fields are boxed into Option at the moment the parent tuple is created, since in the macro implementation we know whether or not a field is an Option type.
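A minimal sketch of that idea for a string field (hypothetical class names; the real macro-generated converters live in scalding-parquet):

    import parquet.io.api.{ Binary, PrimitiveConverter }

    // Required field: expose the raw value, no Option allocation per record.
    class RequiredStringConverter extends PrimitiveConverter {
      var currentValue: String = _
      override def addBinary(value: Binary): Unit = currentValue = value.toStringUsingUTF8
    }

    // Optional field: track presence with hasValue; the parent tuple builder
    // wraps it as `if (c.hasValue) Some(c.currentValue) else None`.
    class OptionalStringConverter extends PrimitiveConverter {
      var hasValue: Boolean = false
      var currentValue: String = _
      override def addBinary(value: Binary): Unit = {
        currentValue = value.toStringUsingUTF8
        hasValue = true
      }
    }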

JiJiTang
Contributor Author

Hi @johnynek, I've added some comments for the trait. Is that OK for you? Please let me know your thoughts.

JiJiTang added a commit to JiJiTang/scalding that referenced this pull request Apr 15, 2015
   * Macro support list collection type
   * Use two different classes for modeling required and optional field converters
   * Delete unnecessary class cast, type all the field converter classes
JiJiTang added a commit to JiJiTang/scalding that referenced this pull request Apr 21, 2015
    * Add macro support for collection types (LIST, SET, MAP)
@JiJiTang
Contributor Author

Hi guys, I have added a commit that makes the macros support collection field types (LIST, SET, MAP); please review my latest commit.
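For illustration, the kind of case class this enables as a typed Parquet source/sink (a sketch; TypedParquet is the source family this PR adds, but the exact call signature below is an assumption, not quoted from the PR):

    case class Measurement(
      name: String,                 // primitive field
      samples: List[Double],        // LIST
      tags: Set[String],            // SET
      attributes: Map[String, Long] // MAP
    )

    // Hypothetical usage; see the PR's unit tests for the real API.
    // val source = TypedParquet[Measurement](Seq("hdfs://path/to/parquet"))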

@JiJiTang
Contributor Author

JiJiTang commented May 8, 2015

Could someone please check the latest commits for handling collection types?

ianoc added a commit that referenced this pull request May 14, 2015
@ianoc ianoc merged commit 019ec80 into twitter:develop May 14, 2015
@ianoc
Collaborator

ianoc commented May 14, 2015

@JiJiTang sorry about the delay getting this merged. Those latest commits look OK to me. I think anything more to be done should happen in a follow-up PR; this one was huge. It's great work and an awesome addition. Many thanks.

@JiJiTang
Contributor Author

Hi @ianoc, thank you so much for the code review and for merging this PR. Many thanks also to @julienledem @colinmarc @johnynek for your reviews and good advice. I will follow this PR for future updates.


    fieldType match {
      case tpe if tpe =:= typeOf[String] =>
        writePrimitiveField(q"rc.addBinary(Binary.fromString($fValue))")
johnynek
Collaborator

This macro is not hygienic. We need the fully qualified path here.

JiJiTang
Contributor Author

Hi @johnynek, this is fixed in this commit: JiJiTang@60e7390, merged with PR #1303.
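For reference, the hygienic form simply roots the path inside the quasiquote so the expansion doesn't depend on imports at the call site (a sketch of the fix the comment asks for):

    writePrimitiveField(q"rc.addBinary(_root_.parquet.io.api.Binary.fromString($fValue))")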

@coveralls

Coverage Status

Changes unknown when pulling 536bd0c on JiJiTang:develop into twitter:develop.
