Proposal: prototype for typesafe fields extension #212

Closed
wants to merge 17 commits into
from

Conversation

Projects
None yet
3 participants
Contributor

aaron-siegel commented Oct 25, 2012

This is just a proposal. It's a prototype for the typesafe fields extension that just implements map, mapTo, flatMap, and flatMapTo. Currently I've set the script just to generate up to arity 3 to keep things simple and clean, but this is easy to increase to 6 or 8 (or whatever seems right).

Again, this is a proposal only, to get a feel for what this API extension would look like under the covers. Before merging it, I'd want to increase the arity to something more reasonable, implement other RichPipe methods such as filter, implement GroupBuilder methods in a separate trait, and add more tests.

Feedback is welcome. Here's a description of the feature proposal that I previously sent to scalding-dev:

The typesafe API is great, but I sometimes find myself preferring the fields API, particularly in jobs with lots of fields. When I'm using the fields API, I find I crave more type safety. Scalding 0.8.0 took a step in this direction by introducing the semi-typesafe Field class. The traditional way of specifying a map operation in the fields API uses scala symbols as field identifiers, for example:

pipe.map(('x,'y) -> ('z,'w)) { tuple: (Int, Int) => val (x,y) = tuple; (x+y, x-y) }

Three major disadvantages of this API are:

(1) there are no compile-time typographical checks on field identifier names (if i misspell a field name, it won't be detected until flow planning);
(2) there are no compile-time or flow-time type-checks of field data (if 'x and 'y are valid incoming fields, but aren't both Ints, this won't be detected until the job throws a cast exception); and
(3) because the fields are untyped, there's no way to specify custom Orderings, and therefore no way to use group operations on keys whose types don't implement Comparable (even if scala knows about an implicit Ordering for them).

It's also a little annoying (but not quite a "major disadvantage") that in the mapping function, the tuple input has to be explicitly type-declared and then unwrapped.

0.8.0 introduced the concept of a Field object:

val xField = StringField[Int]("x")
val yField = StringField[Int]("y")

etc., so the map operation can now be written

pipe.map((xField, yField) -> (zField, wField)) { tuple: (Int, Int) => val (x,y) = tuple; (x+y, x-y) }

This solves (1), of course; and also (3), because the xField object (say) carries with it an implicit ordering on Int. But the type checking doesn't carry over to the mapping function, so (2) is still an issue.

My proposal for 0.9.0 adds additional type-checking by introducing various new signatures for RichPipe.map, one for each combination of input and output arity. A typical example (the "(2,2)-ary" map function) is

def map[A,B,S,T](fields: ((Field[A],Field[B]),(Field[S],Field[T])))(fn: ((A,B)) => (S,T))
  (implicit conv: TupleConverter[(A,B)], implicit setter: TupleSetter[(S,T)], implicit inArity: Arity2, implicit outArity: Arity2)

Now the example map operation can be written

pipe.map((xField, yField) -> (zField, wField)) { case (x,y) => (x+y,x-y) }

and it's fully typesafe: since the Field instances carry type information, the input and output parameters of the associated function implicitly receive their types. This solves all of the preceding issues.

The reason for the extra implicits (inArity and outArity) is that there are two other map implementations with similar signatures (the "(1,3)-ary" and "(3,1)-ary" map functions):

def map[A,S,T,U](fields: (Field[A],(Field[S],Field[T],Field[U])))(fn: (A) => (S,T,U))
  (implicit conv: TupleConverter[A], implicit setter: TupleSetter[(S,T,U)], implicit inArity: Arity1, implicit outArity: Arity3)

def map[A,B,C,S](fields: ((Field[A],Field[B],Field[C]),Field[S]))(fn: ((A,B,C)) => S)
  (implicit conv: TupleConverter[(A,B,C)], implicit setter: TupleSetter[S], implicit inArity: Arity3, implicit outArity: Arity1)

In all three cases ((2,2)-ary, (1,3)-ary, (3,1)-ary), the type erasure of the "fields" parameter is Tuple2, and the type erasure of the "fn" parameter is Function1. Likewise for "conv" and "setter". But the arity implicits have distinct type erasures, ensuring that each map implementation translates to a distinct Java signature.

This feature solves a lot of problems with the fields API. The downside (and this is the potential controversy) is that it requires adding a lot of new method signatures: n^2 of them, if we want to support up to (n,n)-ary mappings. I'm fairly sure that n=22 is not the right number here; something like n=8 might be appropriate. (I've certainly never mapped more than eight input or output fields.) But this still requires 64 new methods. Then, of course, we need another 64 for the flatMap implementations, and mapTo and flatMapTo...

The same principles can be applied to make a lot of the GroupBuilder logic typesafe. For example, a common pattern is

groupBuilder.plus[Map[Key,Value]]('myField)

but if 'myField is replaced by an object of type Field[Map[Key,Value]], then we get the benefit of some really snazzy type inference:

groupBuilder.plus(myField)

and so on.

I really like this approach, and I think it would make it a lot more comfortable to code up jobs in the fields API. I'd be curious to know what everyone thinks of the unavoidable proliferation of method signatures in RichPipe and GroupBuilder.

@azymnis azymnis and 1 other commented on an outdated diff Oct 25, 2012

...test/scala/com/twitter/scalding/TypedFieldsTest.scala
@@ -82,6 +82,24 @@ class TypedFieldsJob(args: Args) extends Job(args) {
}
+class TypedOperationsJob(args: Args) extends Job(args) {
@azymnis

azymnis Oct 25, 2012

Contributor

Do you want to also add a spec for this job?

@aaron-siegel

aaron-siegel Oct 26, 2012

Contributor

Definitely. My first goal was just to make sure the scala compiler got the type inferences right - and a spec isn't necessary for that, since we just need to make sure that it compiles.

But I certainly want more testing before considering this for a merge.

Owner

aaron-siegel replied Oct 27, 2012

I tried that (I completely agree that it's more pleasing semantically). I couldn't get it to work, because scala doesn't recognize object types as implicits. That is,

def myfn(implicit arity : Arity1)

gives a compile error if Arity1 is an object rather than a class. Perhaps there's another approach that I'm not aware of?

Really? We have implicit objects in algebird. The implicits have to be in scope at the call site, so they would have to be in scope in jobs.

My preference it to remove the implicits from being in Job by default in scalding 0.9.0, and instead require an import of Dsl._ to make things more explicit. I think as long as these implicits are register in the Dsl object, that would work:

object Dsl {
  implicit val arity1 = Arity1
  implicit val arity2 = Arity2
}
Owner

aaron-siegel replied Oct 29, 2012

Ahh, it hadn't occurred to me to define the objects first, and then separately define implicit vals that are equal to them. It's sadly redundant in a way, but makes sense in the context of Scala's semantics.

Owner

aaron-siegel replied Oct 29, 2012

This still doesn't work. scala rejects the signature

def myfn(implicit arity : Arity1)

if Arity1 is an object instead of a class.

Owner

aaron-siegel replied Oct 27, 2012

I found that every class that mixed in FieldConversions also needed Arities. Perhaps it's better to mix them in separately.

Owner

aaron-siegel replied Oct 27, 2012

In particular, this applies to RichPipe and GroupBuilder, which mix in FieldConversions, but also need Arities in order to get the implicits right for the typechecked methods. Thinking about it, the arities do serve a similar role to the field conversion methods - getting the scala compiler to recognize which underlying method to apply to a given set of arguments. So there's some justification for mixing Arities into FieldConversions. The other options are:

  1. Put the implicit arity1, arity2, ... vals directly in FieldConversions;
  2. Explicitly mix Arities into RichPipe and GroupBuilder in addition to FieldConversions ("with FieldConversions with Arities")

I don't have much of a preference here.

I like them (as I mention below) in Dsl, and require:

import Dsl._

to access these.

Why are FieldConversions mixed in here?

Can't we import the implicits we need directly by doing:

import Dsl.{putTheFunctionsHere}

We're really just doing this for implicit resolution, right? If so, I prefer the import approach in the library code that we control (I realize there were some cases before where we didn't do that).

Owner

aaron-siegel replied Oct 27, 2012

I needed access to the fieldToFields and productToFields methods. I agree that it's better to import Dsl instead. Actually, I wasn't aware this was possible until now - since the desired methods are part of a trait, not an object, they can't be imported directly from FieldConversions. Mixing FieldConversions into an object is a very clever way around this.

I think the FieldConversions mix-in was too "clever" and really was the result of expediency and a lack of understanding of implicit resolution by me (I'll assume Avi knew better, but let me make a design error).

Owner

aaron-siegel replied Oct 29, 2012

Got it. I'll make this change.

Owner

aaron-siegel replied Oct 29, 2012

Thinking about this some more, I think this might be confusing to users who are used to having all their implicits mixed in "for free," but suddenly have to import Dsl in order to get this particular one to work. I think this might be a case where it's better to stick with an existing pattern for consistency, rather than introduce a new and contradictory pattern in order to acknowledge flaws in the existing one.

certainly not an issue, but I don't know why we use ruby here except that I started that precedence. You can equally well run scala like a scripting language and it might be easier in the future to just use one language. In chill, for instance, I used scala to generate the tuple serializers.

Here's a longer example on scala scripting:
http://www.javacodegeeks.com/2012/01/first-steps-with-scala-say-goodbye-to.html

Owner

aaron-siegel replied Oct 29, 2012

Sure. I just used ruby for consistency with the other generation scripts. It's actually quite convenient for applications like this that are heavy on string substitution. But there were times when I was writing it that I longed for scala (e.g., the way ruby handles lambda expressions is far from elegant).

I understand that scala 2.10 is introducing a ruby-like string substitution feature... it's slowly becoming the universal language...

Collaborator

johnynek commented Oct 30, 2012

BTW: We need to release 0.8.1, and this work should be set for 0.9.0

Contributor

aaron-siegel commented Nov 5, 2012

This is pretty much done now, though it needs more test cases. I'm pretty happy with how it turned out - I've managed to provide type safety to almost all methods in RichPipe and GroupBuilder that have type parameters, and it's done in a way that should have minimal impact on existing jobs or jobs that don't want to use the new features.

Collaborator

johnynek commented Feb 21, 2013

Aaron,

I tend to think we should not proceed here, and instead invest in making the Typed-API do what you want.

Working with Fields with the Typed-API could be done using an approach from:
https://github.com/milessabin/shapeless#heterogenous-lists

https://github.com/milessabin/shapeless#extensible-records

I tend to think that's the way to go, rather than invest in the Fields-API for this reason. My thinking is that the Fields-API will continue to be a cascading-DSL.

If I can convince you, will you close this issue?

Contributor

aaron-siegel commented Feb 28, 2013

Hi, sorry for the slow response. I'm fine with closing this issue - I've
mostly switched over to using the typed API.

On Thu, Feb 21, 2013 at 11:22 AM, P. Oscar Boykin
notifications@github.comwrote:

Aaron,

I tend to think we should not proceed here, and instead invest in making
the Typed-API do what you want.

Working with Fields with the Typed-API could be done using an approach
from:
https://github.com/milessabin/shapeless#heterogenous-lists

https://github.com/milessabin/shapeless#extensible-records

I tend to think that's the way to go, rather than invest in the Fields-API
for this reason. My thinking is that the Fields-API will continue to be a
cascading-DSL.

If I can convince you, will you close this issue?


Reply to this email directly or view it on GitHubhttps://github.com/twitter/scalding/pull/212#issuecomment-13907387.

johnynek closed this Mar 1, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment