Conversation

@OlivierBlanvillain
Contributor

@OlivierBlanvillain OlivierBlanvillain commented Sep 15, 2017

This PR replaces #162 with a nicer solution that is both safer than the original approach (columns keep their "source dataset" phantom type) and simpler than the implicit CanAccess approach.

The trick is to abuse the Scala type system to model "a boolean column coming from either table A or B" as TypedColumn[A with B, Boolean]. The only changes that make this possible are on the TypedColumn operations, which now take an extra type parameter for the second argument of binary operations and combine the phantom types as expected. From a user's perspective nothing changes:

val condition: TypedColumn[A with B, Boolean] = (df1: A).col('a) === (df2: B).col('b)

The implicit conversions used for the widen import broke with this change. I propose to remove this mechanism altogether and require users to explicitly cast when they convert columns from one type to another. Implicit conversions do not compose and break type inference left and right; these changes are a good illustration of that.
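The idea above can be sketched in a few lines. This is a minimal, self-contained model of the technique (not the actual frameless source; all names here are illustrative): the phantom type parameter records which dataset(s) a column may reference, and binary operations intersect the phantom types of their operands.

```scala
// Toy model of the PR's phantom-type trick.
object PhantomJoinDemo extends App {
  class TypedColumn[T, U] {
    // The second operand brings its own phantom type TT; the result is
    // typed against the intersection T with TT, so it is usable by a
    // join over both datasets.
    def ===[TT](other: TypedColumn[TT, U]): TypedColumn[T with TT, Boolean] =
      new TypedColumn[T with TT, Boolean]
  }

  trait A; trait B
  val colA = new TypedColumn[A, Int]
  val colB = new TypedColumn[B, Int]

  // Compiles: the phantom types combine as expected.
  val cond: TypedColumn[A with B, Boolean] = colA === colB
  println("ok")
}
```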

@codecov-io

codecov-io commented Sep 16, 2017

Codecov Report

Merging #175 into master will decrease coverage by 0.43%.
The diff coverage is 94.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #175      +/-   ##
==========================================
- Coverage   96.99%   96.56%   -0.44%     
==========================================
  Files          52       51       -1     
  Lines         866      874       +8     
  Branches       10       11       +1     
==========================================
+ Hits          840      844       +4     
- Misses         26       30       +4
Impacted Files Coverage Δ
...la/frameless/functions/NonAggregateFunctions.scala 100% <ø> (ø) ⬆️
...ore/src/main/scala/frameless/CatalystOrdered.scala 100% <100%> (ø) ⬆️
...ataset/src/main/scala/frameless/TypedDataset.scala 100% <100%> (ø) ⬆️
dataset/src/main/scala/frameless/TypedColumn.scala 100% <100%> (ø) ⬆️
...cala/org/apache/spark/sql/FramelessInternals.scala 77.77% <77.77%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ba9abbe...90fbbf8. Read the comment docs.

@OlivierBlanvillain
Contributor Author

@kanterov Would you have time for a review?

@kanterov
Contributor

@OlivierBlanvillain looking into it now! :)

@iravid
Contributor

iravid commented Sep 17, 2017

👍🏻 on explicit casts.

@kanterov
Contributor

kanterov commented Sep 17, 2017

The trick with A with B is very neat. However, it seems that it messes up type inference outside of joins, and it doesn't seem right to trade off other use cases for having nice joins.

I tried to play with the implicit tricks we discussed in #162, and everything seems to break inference in IntelliJ. The only thing that worked properly was:

  /** Lifts typed columns in join expressions */
  trait Join[L, R]

  object Join {
    def apply[L, R](): Join[L, R] = new Join[L, R] {}
  }

  implicit class LeftSyntax[L, U](c: TypedColumn[L, U]) {
    def left[R](scope: Join[L, R]): TypedColumn[Join[L, R], U] = new TypedColumn(c.expr)(c.uencoder)
  }

  implicit class RightSyntax[R, U](c: TypedColumn[R, U]) {
    def right[L](scope: Join[L, R]): TypedColumn[Join[L, R], U] = new TypedColumn(c.expr)(c.uencoder)
  }

With TypedDataset signatures like

  def joinInner[U](other: TypedDataset[U])(condition: Join[T, U] => TypedColumn[Join[T, U], Boolean])

And user code:

leftDs
  .joinLeft(rightDs)(s => leftDs.col('a).left(s) === rightDs.col('a).right(s))

As you can see, it's super explicit lifting of everything. It doesn't even work if left and right have the same name. I'm starting to think that we are on a very fragile edge, relying on implicit inference that isn't super reliable and can easily break in corner cases.

I feel we can solve any syntactic problem by investing in two macros:

def expr(f: A => B): TypedColumn[A, B]
def joinExpr(f: (A, B) => C): TypedColumn[Join[A, B], C]

And having a macro-free way that doesn't necessarily have nice syntax, but does have nice type inference.

@kanterov
Contributor

kanterov commented Sep 17, 2017

One more thing: it's important to distinguish a column that comes from the left dataset from one that comes from the right dataset. Consider the expression:

ds.joinInner(ds)(s => ds.col('a).left(s) === ds.col('a).right(s))

vs.

ds.joinInner(ds)(s => ds.col('a).left(s) === ds.col('a).left(s))

Right now Spark fixes this only for equality (see TypedDataset#resolveSelfJoin), rewriting the left side of === against the left dataset and the right side against the right dataset. It doesn't work for asymmetric operators like >=.

In the vanilla Spark API, in cases where the joined datasets aren't exactly the same, such ambiguity can be resolved by explicitly building expressions from the left and right datasets:

val a: TypedDataset[X1[A]]
val b: TypedDataset[X1[A]]

a.joinInner(b, a.col('a) === b.col('a))

But this is very fragile to any transformations, for instance:

a.joinInner(b.filter(...), a.col('a) === b.col('a))

In this sense, being able to determine whether a column belongs to the left or the right part would allow us to resolve this problem. Unfortunately, the A with B approach doesn't capture this information if both A and B have the same type.

@OlivierBlanvillain
Contributor Author

OlivierBlanvillain commented Sep 17, 2017

@kanterov thanks for the quick feedback :)

The trick with A with B is very neat. However, it seems that it messes up type inference outside of joins, and it doesn't seem right to trade off other uses cases for having nice joins.

Do you have some example I could play with? The test suite compiled fine after these changes, I thought that was enough ;)

Unfortunately, approach with A with B doesn't capture this information if both A and B have the same type.

I didn't follow you here. Do we need this distinction at the type level? The proposed API mirrors vanilla Spark; it's possible to do a.col('a) === b.col('a) with this PR the exact same way it can be done in Spark. Self joins might be problematic, but I guess we could add .left & .right methods on columns to disambiguate only these cases, and leave the common cases untouched.

I feel I'm missing something, maybe an executable example would help me better understand the issue.

@imarios
Contributor

imarios commented Sep 21, 2017

@OlivierBlanvillain , @kanterov do you guys want to wait for this to close before cutting 0.4?

@OlivierBlanvillain
Contributor Author

ping @kanterov

@kanterov
Contributor

@OlivierBlanvillain sorry for the delay, I'm on vacation and have limited access to my machine. The problem comes from the fact that column expressions like df.col('a) + df.col('a) infer to types like TypedColumn[T with T, Int].

The Scala compiler infers that TypedColumn[T with T, Int] is equivalent to TypedColumn[T, Int], and all our code compiles, while IntelliJ just breaks on everything we have so far. I don't know the implications of this on compilation time either.

Another problem is the way Spark resolves columns, I've prepared this code snippet to illustrate different kinds of problems https://gist.github.com/kanterov/49014d462b265eaf54353f09f0ac517a.

Please let me know what you think.

@OlivierBlanvillain
Contributor Author

OlivierBlanvillain commented Sep 27, 2017

I did a quick experiment and I think the T with T issue can be worked around:

trait Col[T, U]

trait TA
trait TB

val t1: Col[TA, Boolean] = null
val t2: Col[TA, Boolean] = null
val t3: Col[TB, Boolean] = null

// That won't be free in terms of compile time, but I think it would still be
// better than a Coproduct representation.
trait With[A, B] { type Out }

trait LowPrioWith {
  implicit def identity[T]: With[T, T] { type Out = T } = null
}

object With extends LowPrioWith {
  implicit def combine[A, B]: With[A, B] { type Out = A with B } = null
}

def ++[A, B, U](a: Col[A, U], b: Col[B, U])(implicit w: With[A, B]): Col[w.Out, U] = null

scala> ++(t1, t3)
res4: Col[TA with TB,Boolean] = null

scala> ++(t1, t2)
res5: Col[TA,Boolean] = null

Thanks for the gist, I will have a closer look later!

@imarios
Contributor

imarios commented Nov 25, 2017

@OlivierBlanvillain is this PR still in progress? Can we merge parts of it?

@OlivierBlanvillain OlivierBlanvillain changed the title Joins with boolean column expressions, 2nd attempt [WIP] Joins with boolean column expressions, 2nd attempt Nov 25, 2017
@OlivierBlanvillain
Contributor Author

No I don't think it's partially mergeable...

@OlivierBlanvillain OlivierBlanvillain force-pushed the joins3 branch 2 times, most recently from e40c62c to 3acca14 on January 21, 2018 17:40
@OlivierBlanvillain
Contributor Author

OlivierBlanvillain commented Jan 21, 2018

@kanterov I think I finally addressed your review, it only took 6 months 😅

The overall diff might be getting too large, so you might want to look at commits instead. Here are the two interesting ones:

  • 52cd1cf Introduces a With type class to solve the A with A with ... with A blowup.
  • 7cdec6b Adds a self-join disambiguation mechanism.

@OlivierBlanvillain OlivierBlanvillain changed the title [WIP] Joins with boolean column expressions, 2nd attempt Joins with boolean column expressions, 2nd attempt Jan 21, 2018
@imarios
Contributor

imarios commented Jan 22, 2018

@OlivierBlanvillain This looks amazing ✨. Have been trying to break this with funky joins on options, collections, etc. Works great so far.

def =!=(other: TypedColumn[T, U]): TypedColumn[T, Boolean] = withExpr {
def =!=[TT, W](other: TypedColumn[TT, U])(implicit w: With.Aux[T, TT, W]): TypedColumn[W, Boolean] = withExpr {
Contributor

Can we go a bit over the With type class? Why do we need to do this?

Contributor Author

I answered in the main thread to not lose the message on rebase

}

/** Left hand side disambiguation of `col` for join expressions.
* To be used when writing self-joins, a noop in other circumstances.
Contributor

French : )? circonstances

Contributor Author

🇫🇷

@OlivierBlanvillain
Contributor Author

OlivierBlanvillain commented Jan 22, 2018

Can we go a bit over the With type class? Why do we need to do this?

So, suppose you want to write a funky join:

dsA.join(dsB)(dsA.col('a) =!= dsB.col('b))

Where dsA and dsB have unrelated types A and B. The signature of join should be such that it allows any (boolean) expression involving dsA and dsB (but not dsC). The way this is done is by having dsA.col('a) =!= dsB.col('b) return a TypedColumn[A with B, Boolean], which is precisely what the join expects. With is here to compute A with B.

You might wonder: why not just write A with B in the signature of =!= in the first place? The reason is that doing so would result in the compiler keeping track of unnecessary duplicates formed in with types; for example, dsA('a) + dsA('x) + dsA('y) would have type A with A with A. Exposing these large useless types everywhere might confuse users and compilers 😄
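A runnable sketch of this collapsing behavior (simplified, not the frameless source; the wiring of implicit priorities here is one way to make the identity case win and is an assumption on my part):

```scala
// With[T, T] collapses to T; distinct types still combine to A with B.
object WithDemo extends App {
  trait With[A, B] { type Out }
  // Low priority: the generic case, producing the intersection type.
  trait LowPriorityWith {
    implicit def combine[A, B]: With[A, B] { type Out = A with B } =
      new With[A, B] { type Out = A with B }
  }
  // High priority: the identity case for equal arguments, avoiding
  // the A with A blowup.
  object With extends LowPriorityWith {
    type Aux[A, B, W] = With[A, B] { type Out = W }
    implicit def identity[T]: Aux[T, T, T] = new With[T, T] { type Out = T }
  }

  class Col[T, U]
  def and[A, B, U, W](l: Col[A, U], r: Col[B, U])
                     (implicit w: With.Aux[A, B, W]): Col[W, U] = new Col[W, U]

  trait A; trait B
  val a = new Col[A, Boolean]
  val b = new Col[B, Boolean]

  val noBlowup: Col[A, Boolean]        = and(a, a) // With[A, A] = A
  val combined: Col[A with B, Boolean] = and(a, b) // With[A, B] = A with B
  println("ok")
}
```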

@imarios
Contributor

imarios commented Jan 27, 2018

Hey, @OlivierBlanvillain. While testing I found this:

scala> t.joinFull(t2)(t2('_1) === t2('_1)).show().run
org.apache.spark.sql.AnalysisException: Detected cartesian product for FULL OUTER join between logical plans
LocalRelation [_1#2, _2#3]
and
LocalRelation [_1#13, _2#14]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)

@OlivierBlanvillain
Contributor Author

@imarios This is the expected behavior. The join condition you wrote is trivially true, so it's equivalent to writing t.joinFull(t2)(lit(true)). Spark will complain about that unless you manually set conf.set("spark.sql.crossJoin.enabled", "true").

If you want to write a self join, you need to use the explicit colLeft and colRight methods:

t2.joinFull(t2)(t2.colLeft('_1) === t2.colRight('_1)).show().run

But there is not much we can do about trivial join conditions; for example, you can also explicitly write

t2.joinFull(t2)(t2.colLeft('_1) === t2.colLeft('_1)).show().run

@imarios
Contributor

imarios commented Jan 28, 2018

@OlivierBlanvillain I see that the condition doesn't make sense, but this will probably be the only Frameless op that leads to a runtime error while we call it expected behavior. Currently, anything that would cause a runtime exception gives a compilation error, which is pretty much our selling point. The only current exception that we know of is explode(), which we already classified as a bug.

I feel that any join you do that doesn't involve a cross-dataset comparison should not compile. Can't we just enforce that the join condition (Boolean column) is either of type A with B or B with A, but not A with A, B with B, or A with C, etc.?

scala> t.joinFull(t2)(t2('_1) === t2('_2).cast[String]).show().run
org.apache.spark.sql.AnalysisException: Detected cartesian product for FULL OUTER join between logical plans
LocalRelation [_1#110, _2#111]
and
LocalRelation [_1#121, _2#122]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

@imarios
Contributor

imarios commented Jan 28, 2018

I think all my above examples hit a corner case where we try to join two datasets with the same schema.

@OlivierBlanvillain
Contributor Author

Can't we just enforce that the join condition (Boolean column) is either of type A with B or B with A, but not A with A, B with B, or A with C, etc.?

It's more complicated than that; for instance, an expression like (A.a == A.a) || B.b involves both A and B but is trivially true and probably rejected by Spark. To me this is a configuration problem: joins in Spark/frameless are only sound under the following settings:

conf.set("spark.sql.selfJoinAutoResolveAmbiguity", "true")
conf.set("spark.sql.crossJoin.enabled", "true")

There is no way to enforce that at compile time, so I'm not sure what would be the best option to enforce such configuration... (we already have the first requirement, see here)

@imarios
Contributor

imarios commented Jan 28, 2018

Fair enough, I guess we can always try to improve on this in a separate PR. Do you think we can do better in terms of test coverage?

@OlivierBlanvillain
Contributor Author

Sure, I know what's missing. I'm just waiting for #153 to basically redo everything in TypedColumn.scala 😄

@imarios
Contributor

imarios commented Feb 9, 2018

Hey @OlivierBlanvillain this is the last code related PR for 0.5! I know you must be busy, so whenever you have the time. Thank you!

@OlivierBlanvillain OlivierBlanvillain force-pushed the joins3 branch 2 times, most recently from 43142cd to 3309e76 on February 11, 2018 22:16
Implicit conversions don't compose; I propose we simply cast everything explicitly.
With computes the intersection of two types while leaving less noise than A with B:

- With[A, A] = A
- With[A, B] = A with B (when A != B)

This type function is needed to prevent IDEs from inferring large types
with the shape `A with A with ... with A`. These types could be confusing for
both end users and IDEs' type checkers.
This commit introduces a disambiguation mechanism for self-joins. The idea is
to use ds.colLeft('a) and ds.colRight('a) instead of ds.col('a) when writing
join expressions, where the Left and Right suffixes are used to tag the column
as being on the left hand side or the right hand side of the join expression.
The implementation adds two new nodes to the Spark IR, DisambiguateLeft and
DisambiguateRight, which behave as dummies (they just forward to the underlying
expression) but are intercepted at runtime when building the join plans to
actually implement the disambiguation.
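The interception idea described in the commit message above can be illustrated with a self-contained toy model (this is not Spark's Expression API; all names and the resolve function are illustrative):

```scala
// Toy model of the DisambiguateLeft/DisambiguateRight mechanism.
object DisambiguationDemo extends App {
  sealed trait Expr
  case class Col(dataset: String, name: String) extends Expr
  case class DisambiguateLeft(child: Expr) extends Expr  // dummy wrapper
  case class DisambiguateRight(child: Expr) extends Expr // dummy wrapper
  case class Eq(l: Expr, r: Expr) extends Expr

  // "Join planning": rewrite wrapped columns against the correct side of
  // the join and erase the wrappers; everything else is left untouched.
  def resolve(e: Expr, left: String, right: String): Expr = e match {
    case DisambiguateLeft(Col(_, n))  => Col(left, n)
    case DisambiguateRight(Col(_, n)) => Col(right, n)
    case Eq(l, r)                     => Eq(resolve(l, left, right), resolve(r, left, right))
    case other                        => other
  }

  // A self-join condition written with colLeft/colRight-style tagging:
  // ambiguous as written (both sides reference the same dataset), but
  // resolvable thanks to the tags.
  val cond = Eq(DisambiguateLeft(Col("ds", "a")), DisambiguateRight(Col("ds", "a")))
  println(resolve(cond, left = "ds#left", right = "ds#right"))
}
```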
@OlivierBlanvillain
Contributor Author

@imarios It's rebased and should hopefully be green 😄

@imarios
Contributor

imarios commented Feb 11, 2018

Looks good @OlivierBlanvillain! Do you mind addressing the code coverage concerns from Travis?

@OlivierBlanvillain
Contributor Author

OlivierBlanvillain commented Feb 11, 2018

The two lines not covered are these two; I'm not sure how to trigger genCode, but there I simply forward to the underlying Expression, so I don't see how anything could go wrong. Note that doGenCode is only called by the original genCode (it cannot be called from the outside given that it's protected), so it's OK to have the ??? there.

@imarios imarios merged commit dfd224a into typelevel:master Feb 12, 2018