Conversation

@OlivierBlanvillain
Contributor

@OlivierBlanvillain OlivierBlanvillain commented Sep 15, 2017

This PR replaces #162 with a nicer solution that is both safer than the original approach (columns keep their "source dataset" phantom type) and simpler than the implicit CanAccess approach.

The trick is to abuse the Scala type system to model "a boolean column coming from either table A or B" as TypedColumn[A with B, Boolean]. The only changes that make this possible are on the TypedColumn operations, which now take an extra type parameter for the second argument of binary operations and combine the phantom types as expected. From a user's perspective nothing changes:

val condition: TypedColumn[A with B, Boolean] = (df1: A).col('a) === (df2: B).col('b)

The implicit conversions used for the widen import broke with this change. I propose to remove this mechanism altogether and require users to explicitly cast when they convert columns from one type to another. Implicit conversions do not compose and break type inference left and right; these changes are a good illustration of that.
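The idea above can be sketched in a few lines. This is a minimal, self-contained model of the technique (not the actual frameless source; all names here are illustrative): the phantom type parameter records which dataset(s) a column may reference, and binary operations intersect the phantom types of their operands.

```scala
// Toy model of the PR's phantom-type trick.
object PhantomJoinDemo extends App {
  class TypedColumn[T, U] {
    // The second operand brings its own phantom type TT; the result is
    // typed against the intersection T with TT, so it is usable by a
    // join over both datasets.
    def ===[TT](other: TypedColumn[TT, U]): TypedColumn[T with TT, Boolean] =
      new TypedColumn[T with TT, Boolean]
  }

  trait A; trait B
  val colA = new TypedColumn[A, Int]
  val colB = new TypedColumn[B, Int]

  // Compiles: the phantom types combine as expected.
  val cond: TypedColumn[A with B, Boolean] = colA === colB
  println("ok")
}
```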

@codecov-io

codecov-io commented Sep 16, 2017

Codecov Report

Merging #175 into master will decrease coverage by 0.43%.
The diff coverage is 94.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #175      +/-   ##
==========================================
- Coverage   96.99%   96.56%   -0.44%     
==========================================
  Files          52       51       -1     
  Lines         866      874       +8     
  Branches       10       11       +1     
==========================================
+ Hits          840      844       +4     
- Misses         26       30       +4
Impacted Files Coverage Δ
...la/frameless/functions/NonAggregateFunctions.scala 100% <ø> (ø) ⬆️
...ore/src/main/scala/frameless/CatalystOrdered.scala 100% <100%> (ø) ⬆️
...ataset/src/main/scala/frameless/TypedDataset.scala 100% <100%> (ø) ⬆️
dataset/src/main/scala/frameless/TypedColumn.scala 100% <100%> (ø) ⬆️
...cala/org/apache/spark/sql/FramelessInternals.scala 77.77% <77.77%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ba9abbe...90fbbf8. Read the comment docs.

@OlivierBlanvillain
Contributor Author

@kanterov Would you have time for a review?

@kanterov
Contributor

@OlivierBlanvillain looking into it now! :)

@iravid
Contributor

iravid commented Sep 17, 2017

👍🏻 on explicit casts.

@kanterov
Contributor

kanterov commented Sep 17, 2017

The trick with A with B is very neat. However, it seems that it messes up type inference outside of joins, and it doesn't seem right to trade off other use cases for having nice joins.

I tried to play with the implicit tricks we discussed in #162, and everything seems to break inference in IntelliJ. The only thing that worked properly was:

  /** Lifts typed columns in join expressions */
  trait Join[L, R]

  object Join {
    def apply[L, R](): Join[L, R] = new Join[L, R] {}
  }

  implicit class LeftSyntax[L, U](c: TypedColumn[L, U]) {
    def left[R](scope: Join[L, R]): TypedColumn[Join[L, R], U] = new TypedColumn(c.expr)(c.uencoder)
  }

  implicit class RightSyntax[R, U](c: TypedColumn[R, U]) {
    def right[L](scope: Join[L, R]): TypedColumn[Join[L, R], U] = new TypedColumn(c.expr)(c.uencoder)
  }

With TypedDataset signatures like

  def joinInner[U](other: TypedDataset[U])(condition: Join[T, U] => TypedColumn[Join[T, U], Boolean])

And user code:

leftDs
  .joinLeft(rightDs)(s => leftDs.col('a).left(s) === rightDs.col('a).right(s))

As you can see, it's super explicit lifting of everything. It doesn't even work if left and right have the same name. I'm starting to think that we are on a very fragile edge, relying on implicit inference that isn't super reliable and can easily break in corner cases.

I feel we can solve any syntactic problem by investing in two macros:

def expr(f: A => B): TypedColumn[A, B]
def joinExpr(f: (A, B) => C): TypedColumn[Join[A, B], C]

And having a macro-free way that doesn't necessarily have nice syntax, but does have nice type inference.

@kanterov
Contributor

kanterov commented Sep 17, 2017

One more thing: it's important to distinguish a column that comes from the left dataset from one that comes from the right dataset. Consider the expression:

ds.joinInner(ds)(s => ds.col('a).left(s) === ds.col('a).right(s))

vs.

ds.joinInner(ds)(s => ds.col('a).left(s) === ds.col('a).left(s))

Right now Spark fixes this only for equality (see TypedDataset#resolveSelfJoin), rewriting the left side of === against the left dataset and the right side against the right dataset. It doesn't work for asymmetric operators like >=.

In the vanilla Spark API, in cases where the joined datasets aren't exactly the same, such ambiguity can be resolved by explicitly building expressions from the left and right datasets:

val a: TypedDataset[X1[A]]
val b: TypedDataset[X1[A]]

a.joinInner(b, a.col('a) === b.col('a))

But this is very fragile to any transformations, for instance:

a.joinInner(b.filter(...), a.col('a) === b.col('a))

In this sense, being able to determine whether a column belongs to the left or the right part would allow us to resolve this problem. Unfortunately, the A with B approach doesn't capture this information if both A and B have the same type.

@OlivierBlanvillain
Contributor Author

OlivierBlanvillain commented Sep 17, 2017

@kanterov thanks for the quick feedback :)

The trick with A with B is very neat. However, it seems that it messes up type inference outside of joins, and it doesn't seem right to trade off other uses cases for having nice joins.

Do you have some example I could play with? The test suite compiled fine after these changes, I thought that was enough ;)

Unfortunately, approach with A with B doesn't capture this information if both A and B have the same type.

I didn't follow you here. Do we need this distinction at the type level? The proposed API mirrors vanilla Spark; it's possible to do a.col('a) === b.col('a) with this PR the exact same way it can be done in Spark. Self joins might be problematic, but I guess we could add .left & .right methods on columns to disambiguate only these cases, and leave the common cases untouched.

I feel I'm missing something, maybe an executable example would help me better understand the issue.

@imarios
Contributor

imarios commented Sep 21, 2017

@OlivierBlanvillain , @kanterov do you guys want to wait for this to close before cutting 0.4?

@OlivierBlanvillain
Contributor Author

ping @kanterov

@kanterov
Contributor

@OlivierBlanvillain sorry for the delay, I'm on vacation and have limited access to my machine. The problem comes from the fact that column expressions like df.col('a) + df.col('a) infer to types like TypedColumn[T with T, Int].

The Scala compiler infers that TypedColumn[T with T, Int] is equivalent to TypedColumn[T, Int], and all our code compiles, while IntelliJ just breaks on everything we have so far. I don't know the implications of this on compilation time either.

Another problem is the way Spark resolves columns, I've prepared this code snippet to illustrate different kinds of problems https://gist.github.com/kanterov/49014d462b265eaf54353f09f0ac517a.

Please let me know what you think.

@OlivierBlanvillain
Contributor Author

OlivierBlanvillain commented Sep 27, 2017

I did a quick experiment and I think the T with T issue can be worked around:

trait Col[T, U]

trait TA
trait TB

val t1: Col[TA, Boolean] = null
val t2: Col[TA, Boolean] = null
val t3: Col[TB, Boolean] = null

// That won't be free in terms of compile time, but I think it would still be
// better than a Coproduct representation.
trait With[A, B] { type Out }

trait LowPrioWith {
  implicit def identity[T]: With[T, T] { type Out = T } = null
}

object With extends LowPrioWith {
  implicit def combine[A, B]: With[A, B] { type Out = A with B } = null
}

def ++[A, B, U](a: Col[A, U], b: Col[B, U])(implicit w: With[A, B]): Col[w.Out, U] = null

scala> ++(t1, t3)
res4: Col[TA with TB,Boolean] = null

scala> ++(t1, t2)
res5: Col[TA,Boolean] = null

Thanks for the gist, I will have a closer look later!

@imarios
Contributor

imarios commented Nov 25, 2017

@OlivierBlanvillain is this PR still in progress? Can we merge parts of it?

@OlivierBlanvillain OlivierBlanvillain changed the title Joins with boolean column expressions, 2nd attempt [WIP] Joins with boolean column expressions, 2nd attempt Nov 25, 2017
@OlivierBlanvillain
Contributor Author

No I don't think it's partially mergeable...

@OlivierBlanvillain OlivierBlanvillain force-pushed the joins3 branch 2 times, most recently from e40c62c to 3acca14 on January 21, 2018 17:40
@OlivierBlanvillain
Contributor Author

OlivierBlanvillain commented Jan 21, 2018

@kanterov I think I finally addressed your review, it only took 6 months 😅

The overall diff might be getting too large, so you might want to look at commits instead. Here are the two interesting ones:

  • 52cd1cf Introduces a With type class to solve the A with A with ... with A blowup.
  • 7cdec6b Adds a self-join disambiguation mechanism.

@OlivierBlanvillain OlivierBlanvillain changed the title [WIP] Joins with boolean column expressions, 2nd attempt Joins with boolean column expressions, 2nd attempt Jan 21, 2018
@imarios
Contributor

imarios commented Jan 22, 2018

@OlivierBlanvillain This looks amazing ✨. Have been trying to break this with funky joins on options, collections, etc. Works great so far.

def =!=(other: TypedColumn[T, U]): TypedColumn[T, Boolean] = withExpr {
def =!=[TT, W](other: TypedColumn[TT, U])(implicit w: With.Aux[T, TT, W]): TypedColumn[W, Boolean] = withExpr {
Contributor

Can we go a bit over the With type class? Why do we need to do this?

Contributor Author

I answered in the main thread to not lose the message on rebase

}

/** Left hand side disambiguation of `col` for join expressions.
* To be used when writing self-joins, a noop in other circumstances.
Contributor

French : )? circonstances

Contributor Author

🇫🇷

@OlivierBlanvillain
Contributor Author

OlivierBlanvillain commented Jan 22, 2018

Can we go a bit over the With type class? Why do we need to do this?

So, suppose you want to write a funky join:

dsA.join(dsB)(dsA.col('a) =!= dsB.col('b))

Where dsA and dsB have unrelated types A and B. The signature of join should be such that it allows any (boolean) expression involving dsA and dsB (but not dsC). The way this is done is by having dsA.col('a) =!= dsB.col('b) return a TypedColumn[A with B, Boolean], which is precisely what the join expects. With is here to compute A with B.

You might wonder: why not just write A with B in the signature of =!= in the first place? The reason is that doing so would result in the compiler keeping track of unnecessary duplicates formed in with types; for example, dsA('a) + dsA('x) + dsA('y) would have type A with A with A. Exposing these large useless types everywhere might confuse users and compilers 😄
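A runnable sketch of this collapsing behavior (simplified, not the frameless source; the wiring of implicit priorities here is one way to make the identity case win and is an assumption on my part):

```scala
// With[T, T] collapses to T; distinct types still combine to A with B.
object WithDemo extends App {
  trait With[A, B] { type Out }
  // Low priority: the generic case, producing the intersection type.
  trait LowPriorityWith {
    implicit def combine[A, B]: With[A, B] { type Out = A with B } =
      new With[A, B] { type Out = A with B }
  }
  // High priority: the identity case for equal arguments, avoiding
  // the A with A blowup.
  object With extends LowPriorityWith {
    type Aux[A, B, W] = With[A, B] { type Out = W }
    implicit def identity[T]: Aux[T, T, T] = new With[T, T] { type Out = T }
  }

  class Col[T, U]
  def and[A, B, U, W](l: Col[A, U], r: Col[B, U])
                     (implicit w: With.Aux[A, B, W]): Col[W, U] = new Col[W, U]

  trait A; trait B
  val a = new Col[A, Boolean]
  val b = new Col[B, Boolean]

  val noBlowup: Col[A, Boolean]        = and(a, a) // With[A, A] = A
  val combined: Col[A with B, Boolean] = and(a, b) // With[A, B] = A with B
  println("ok")
}
```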

@imarios
Contributor

imarios commented Jan 27, 2018

Hey, @OlivierBlanvillain. While testing I found this:

scala> t.joinFull(t2)(t2('_1) === t2('_1)).show().run
org.apache.spark.sql.AnalysisException: Detected cartesian product for FULL OUTER join between logical plans
LocalRelation [_1#2, _2#3]
and
LocalRelation [_1#13, _2#14]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)

@OlivierBlanvillain
Contributor Author

@imarios This is the expected behavior. The join condition you wrote is trivially true, so it's equivalent to writing t.joinFull(t2)(lit(true)). Spark will complain about that unless you manually set conf.set("spark.sql.crossJoin.enabled", "true").

If you want to write a self join, you need to use the explicit colLeft and colRight methods:

t2.joinFull(t2)(t2.colLeft('_1) === t2.colRight('_1)).show().run

But there is not much we can do about trivial join conditions; for example, you can also explicitly write

t2.joinFull(t2)(t2.colLeft('_1) === t2.colLeft('_1)).show().run

@imarios
Contributor

imarios commented Jan 28, 2018

@OlivierBlanvillain I see that the condition doesn't make sense, but this will probably be the only Frameless op that leads to a runtime error while we call it expected behavior. Currently, anything that would cause a runtime exception gives a compilation error, which is pretty much our selling point. The only current exception that we know of is explode(), which we already classified as a bug.

I feel that any join you do that doesn't involve a cross-dataset comparison should not compile. Can't we just enforce that the join condition (Boolean column) is either of type A with B or B with A, but not A with A, B with B, or A with C, etc.?

scala> t.joinFull(t2)(t2('_1) === t2('_2).cast[String]).show().run
org.apache.spark.sql.AnalysisException: Detected cartesian product for FULL OUTER join between logical plans
LocalRelation [_1#110, _2#111]
and
LocalRelation [_1#121, _2#122]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

@imarios
Contributor

imarios commented Jan 28, 2018

I think all my above examples hit a corner case where we try to join two datasets with the same schema.

@OlivierBlanvillain
Contributor Author

Can't we just enforce that the join condition (Boolean column) is either of type A with B or B with A, but not A with A, B with B, or A with C, etc.?

It's more complicated than that; for instance, an expression like (A.a == A.a) || B.b involves both A and B but is trivially true and probably rejected by Spark. To me this is a configuration problem: joins in Spark/frameless are only sound under the following settings:

conf.set("spark.sql.selfJoinAutoResolveAmbiguity", "true")
conf.set("spark.sql.crossJoin.enabled", "true")

There is no way to enforce that at compile time, so I'm not sure what would be the best option to enforce such configuration... (we already have the first requirement, see here)

@imarios
Contributor

imarios commented Jan 28, 2018

Fair enough, I guess we can always try to improve on this in a separate PR. Do you think we can do better in terms of test coverage?

@OlivierBlanvillain
Contributor Author

Sure, I know what's missing. I'm just waiting for #153 to basically redo everything in TypedColumn.scala 😄

@imarios
Contributor

imarios commented Feb 9, 2018

Hey @OlivierBlanvillain this is the last code related PR for 0.5! I know you must be busy, so whenever you have the time. Thank you!

@OlivierBlanvillain OlivierBlanvillain force-pushed the joins3 branch 2 times, most recently from 43142cd to 3309e76 on February 11, 2018 22:16
Implicit conversions don't compose; I propose we simply cast everything explicitly.
With computes the intersection of two types while leaving less noise than A with B:

- With[A, A] = A
- With[A, B] = A with B (when A != B)

This type function is needed to prevent IDEs from inferring large types
with the shape `A with A with ... with A`. These types could be confusing for
both end users and IDEs' type checkers.
This commit introduces a disambiguation mechanism for self-joins. The idea is
to use ds.colLeft('a) and ds.colRight('a) instead of ds.col('a) when writing
join expressions, where the Left and Right suffixes are used to tag the column
as being on the left hand side or the right hand side of the join expression.
The implementation adds two new nodes to the Spark IR, DisambiguateLeft and
DisambiguateRight, which behave as dummies (they just forward to the underlying
expression) but are intercepted at runtime when building the join plans to
actually implement the disambiguation.
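The interception idea described in the commit message above can be illustrated with a self-contained toy model (this is not Spark's Expression API; all names and the resolve function are illustrative):

```scala
// Toy model of the DisambiguateLeft/DisambiguateRight mechanism.
object DisambiguationDemo extends App {
  sealed trait Expr
  case class Col(dataset: String, name: String) extends Expr
  case class DisambiguateLeft(child: Expr) extends Expr  // dummy wrapper
  case class DisambiguateRight(child: Expr) extends Expr // dummy wrapper
  case class Eq(l: Expr, r: Expr) extends Expr

  // "Join planning": rewrite wrapped columns against the correct side of
  // the join and erase the wrappers; everything else is left untouched.
  def resolve(e: Expr, left: String, right: String): Expr = e match {
    case DisambiguateLeft(Col(_, n))  => Col(left, n)
    case DisambiguateRight(Col(_, n)) => Col(right, n)
    case Eq(l, r)                     => Eq(resolve(l, left, right), resolve(r, left, right))
    case other                        => other
  }

  // A self-join condition written with colLeft/colRight-style tagging:
  // ambiguous as written (both sides reference the same dataset), but
  // resolvable thanks to the tags.
  val cond = Eq(DisambiguateLeft(Col("ds", "a")), DisambiguateRight(Col("ds", "a")))
  println(resolve(cond, left = "ds#left", right = "ds#right"))
}
```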
@OlivierBlanvillain
Contributor Author

@imarios It's rebased and should hopefully be green 😄

@imarios
Contributor

imarios commented Feb 11, 2018

Looks good @OlivierBlanvillain! Do you mind addressing the code coverage concerns from Travis?

@OlivierBlanvillain
Contributor Author

OlivierBlanvillain commented Feb 11, 2018

The two lines not covered are these two; I'm not sure how to trigger genCode, but there I simply forward to the underlying Expression, so I don't see how anything could go wrong. Note that doGenCode is only called by the original genCode (it cannot be called from the outside given that it's protected), so it's OK to have the ??? there.

@imarios imarios merged commit dfd224a into typelevel:master Feb 12, 2018