Joins with boolean column expressions #162

OlivierBlanvillain · 2017-08-08T09:30:23Z

This PR updates the interface for joins to support boolean column expressions. It also adds methods for each type of join implemented in Spark (Inner, Cross, Full, Right, Left, LeftSemi and LeftAnti).

As is, these changes lose some of the type safety in frameless. The first commit gets rid of TypedColumn's first type parameter which was used to keep track of the table corresponding to this columns. I see two options to recover this safety:

Switch to a "functional" interface by changing the signature of select to def select[U](c: TypedColumn[T] => TypedColumn[U]): Dataset[U] and use the same trick in joins: def join[U](other: Dataset[U])(c: (TypedColumn[T], TypedColumn[U]) => TypedColumn[Boolean]): Dataset[(T, U)].
Recover the second type parameter on TypedColumn but make it a coproduct of source tables.

IMO both options would add too much complexity compared to the small gain of keeping track of column origin. I think the type safety lost in this PR is least important safety feature provided by frameless, thus my proposal to get rid of it 😄

codecov-io · 2017-08-08T09:55:13Z

Codecov Report

Merging #162 into master will decrease coverage by 0.22%.
The diff coverage is 99.18%.

@@            Coverage Diff             @@
##           master     #162      +/-   ##
==========================================
- Coverage      94%   93.77%   -0.23%     
==========================================
  Files          31       28       -3     
  Lines         667      659       -8     
  Branches        9       13       +4     
==========================================
- Hits          627      618       -9     
- Misses         40       41       +1

Impacted Files	Coverage Δ
...t/src/main/scala/frameless/functions/package.scala	`100% <ø> (ø)`	⬆️
...ataset/src/main/scala/frameless/TypedDataset.scala	`93.6% <100%> (+0.37%)`	⬆️
...aset/src/main/scala/frameless/ops/GroupByOps.scala	`95.23% <100%> (-2.92%)`	⬇️
...set/src/main/scala/frameless/FramelessSyntax.scala	`66.66% <100%> (ø)`	⬆️
dataset/src/main/scala/frameless/implicits.scala	`100% <100%> (ø)`	⬆️
...taset/src/main/scala/frameless/functions/Udf.scala	`100% <100%> (ø)`	⬆️
.../src/main/scala/frameless/ops/AggregateTypes.scala	`100% <100%> (ø)`	⬆️
...ain/scala/frameless/functions/UnaryFunctions.scala	`100% <100%> (ø)`	⬆️
...set/src/main/scala/frameless/ops/ColumnTypes.scala	`100% <100%> (ø)`	⬆️
...scala/frameless/functions/AggregateFunctions.scala	`100% <100%> (ø)`	⬆️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4c04b3f...e08f0e1. Read the comment docs.

kanterov · 2017-08-08T14:48:14Z

Did you see https://github.com/propensive/impromptu?

What about modeling join as something like:

trait TypedDataset[A]
  def innerJoin[B](bs: TypedDataset[B])(env: Env[A with B] => TypedColumn[A with B, Boolean]): TypedDataset[(A, B)]
}

trait TypedColumn[A] {
  def in[B >: A](env: Env[B]): TypedColumn[B]
}

as.innerJoin(bs)(env => as.col('a).in(env) === bs.col('b).in(env))

kanterov · 2017-08-08T14:54:35Z

Just realized there is a slightly different, but probably better way. Don't know if we want to generalize it, but the idea is:

trait Join2[A, b]

trait TypedDataset[A]
  def innerJoin[B](bs: TypedDataset[B])(env: Join2[A, B] => TypedColumn[Join2[A, B], Boolean]): TypedDataset[(A, B)]
}

trait TypedColumn[A, U] {
  def in(join: Join2[A, B]): TypedColumn[Join2[A, B], U]
  def in(join: Join2[B, A]): TypedColumn[Join2[B, A], U]
}

as.innerJoin(bs)(env => as.col('a).in(env) === bs.col('b).in(env))

OlivierBlanvillain · 2017-08-09T08:54:05Z

We can probably reduce it to as.innerJoin(bs)(implicit env => as.col('a) === bs.col('b)). My other idea was to use a coproduct of acceptable "source" for column expression, something like this:

trait TypedDataset[A]
  def innerJoin[B](bs: TypedDataset[B])(expr: TypedColumn[A :+: B :+: CNil, Boolean]): TypedDataset[(A, B)]
}

// Here === combines a TypedColumn[A :+: CNil, _] and a TypedColumn[B :+: CNil, _]
as.innerJoin(bs)(as.col('a) === bs.col('b))

But is it worth it? Are we really getting rid of an important class of errors with these checks? I think the additional heaviness of an implicit env everywhere would be too much...

kanterov · 2017-08-09T20:28:25Z

I think to make this decision we need to evaluate cost and opportunity. One of the things that might push us to remove one type parameter is problems with type inference and craziness of compile errors. On the other side, having type parameter enables us declaring operations like:

trait TypedColumn[A, B] { 
  def >>[C](c: TypedColumn[B, C]): TypedColumn[A, C]
}

I don't see we can get enough quantitive data to evaluate how important is this class of errors.

I think one of the things to explore is approach by @jeremyrsmith, I think we can get very nice syntax, and keep type parameter.

OlivierBlanvillain · 2017-08-10T10:48:52Z

@kanterov I didn't get what you mean by >>, could you elaborate?

I did a quick experiment with the implicit env => ds('a) idea, it looks promising: OlivierBlanvillain@5139091.

With Dotty's implicit function type we could just declare select as def select[C](expr: implicit CanAccess[T] => TypedColumn[C]) and keep everything as is. It might be possible to emulate just that with a macro. I think scala.rx has something similar in this file to turn Rx(a + b) into Rx(implicit env => a + b).

OlivierBlanvillain · 2017-08-10T10:52:45Z

BTW, what's your opinion on merging this PR without addressing column access safety?

kanterov · 2017-08-10T11:48:20Z

@OlivierBlanvillain >> is composition of two columns, for instance,

case class Person(address: Address)
case class Address(street: String, apt: Int)

def _address: TypedColumn[Person, Address]
def _apt: TypedColumn[Address, Int]

val x: TypedColumn[Person, Int] = _address >> _apt

The experiment looks very promising if we can use a macro to model implicit functions. What do you think if we merge this into a separate branch, polish, build artifact and ask people for feedback?

OlivierBlanvillain · 2017-08-10T13:12:44Z

@kanterov I think it's possible to do the same thing than >> with a function + Spark's getField on top of this PR:

def _address: TypedColumn[Address] = ... // + implicit ca: CanAccess[Address]?
def _apt(a: TypedColumn[Address]): TypedColumn[Int] = a.getField('apt)
val x: TypedColumn[Int] = _apt(_address)

The experiment looks very promising if we can use a macro to model implicit functions. What do you think if we merge this into a separate branch, polish, build artifact and ask people for feedback?

By this are you referring to this PR of a polished version of 5139091? I'm happy to push the idea as far as possible with explicit implicits, but I think I will ask for help for the macro part... I already do too much compiler hacking on workdays :P

imarios · 2017-08-10T17:52:16Z

I love the idea of rethinking about these topics. This is healthy and we should do this more often :).

So, philosophically speaking, A => B is essentially Function1[A,B] in Scala. So I think it does feels natural to go from TypedColumn[A,B] to a TypedColumn[A] => TypedColumn[B] representation.

If we do that then it also becomes much more natural to do function composition for TypedColumn[_] => TypedColumn[_] for example, TypedColumn[A] => TypedColumn[B] => TypedColumn[C] is essentially TypedColumn[A] => TypedColumn[C].

I am trying to see what is the downside of this besides a slight increase in verbosity and API complexity ...

OlivierBlanvillain · 2017-08-25T12:00:16Z

Ping @kanterov (#162 (comment))

@imarios Note that the idea here a bit different from the previous suggestion where I propose to turn all TypedColumn[A,B] into TypedColumn[A] => TypedColumn[B]. Here the idea to some something like:

implicit CanAccessTable[T1 :: T2 :: T3 :: HNil] => TypedColumn[C]

imarios · 2017-09-01T04:11:38Z

@OlivierBlanvillain so now is it possible to have a TypedColumn[A] and use it at a TypedDataset that contains an element of type A (so it passes compilation) but fails because, maybe, the name of the fields being accessed are not the same?

OlivierBlanvillain · 2017-09-10T08:26:55Z

@imarios yes, that's the current state of this PR. Some safety is lost, but we get full coverage of the join API and a simpler code base. The question is whether we should merge it as is and try to recover the safety later, or if losing this bit of safety is not acceptable.

This change reduces the type saftely of column expressions. For instance, it's now possible to typecheck some nonsensical `select` as follows: ``` ds1.select(ds2('a)) // This use to be a type error! ``` This simplification is a required step to support joins, which is also source compatible. Safety can hopefuly be recovered by having `select` take a `TypedColumn[T] => TypedColumn[C]` function and removing the apply method on TypedDataset.

OlivierBlanvillain · 2017-09-10T08:35:58Z

rebased

OlivierBlanvillain · 2017-09-10T08:38:08Z

dataset/src/main/scala/frameless/ops/ColumnTypes.scala

  * }}}
  */
-trait ColumnTypes[T, U <: HList] {
+trait ColumnTypes[U <: HList] {


I think it we can replace this by a Comapped now.

kanterov · 2017-09-10T09:59:47Z

What do you think about using CanAccess approach in a join expression to lift TypedColumn[L, A] to TypedColumn[L :+: R :+: CNil, A], and leave everything as it was before?

I feel this is an important part that we don't allow to mess up with types of datasets. It would be especially relevant in cases like:

val df = TypedDataset.create[X3[Int, Long, String]]()

df.select(df.col("a"), df.col("b"))
  .select(df.col("c")) // <-- should be compile error

Also losing T from TypedColumn[T, A] doesn't allow us to utilize approach with a macro transformation of Scala expressions to frameless TypedColumn[T, A]. I did some experiments with this, it works very well with type inference in IntelliJ, and people found it very natural. I find a lot of potential in this approach once scala-meta will support using type information in macro.

OlivierBlanvillain · 2017-09-10T10:56:49Z

What do you think about using CanAccess approach in a join expression to lift TypedColumn[L, A] to TypedColumn[L :+: R :+: CNil, A], and leave everything as it was before?

That's a nice idea. I originally considered having L < Coproduct in TypedColumn[L, A], but that would mean computer L1 union L2 on every column operation, so df('a) + df('b) + df('c) would involves 3 implicit resolutions to compute df#T union df#T union df#T = df#T. Given than our test suite already takes 45 seconds to compile, I'm not sure that would be practical...

If we keep everything is it currently is and special case joins, it means joins need to have a implicit conversion is scope to do the lifting, I'm going to give this a try!

This reverts commit eb66711. Conflicts: dataset/src/main/scala/frameless/TypedDataset.scala

Inference works fine

`CanAccess[_, A with B]` indicates that in this context it is possible to access columns from both table `A` and table `B`. The first type parameter is a dummy argument used for type inference. The trick works as follows: `(df: TypedDataset[T]).col('a)` looks for a CanAccess[T, T] which is always available thanks to the `globalInstance` implicit defined above. Expression for joins (and other multi dataset operations) take an `implicit a: CanAccess[Any, U with T] =>` closure. Because the first (dummy) type parameter of `CanAccess` is contravariant, the locally defined implicit will always be preferred over `globalInstance`, which implements the desired behavior.

OlivierBlanvillain · 2017-09-15T22:57:02Z

@kanterov Here is as far as I got using this CanAccess idea. It seams to work, but in the meantime I got another better idea (separate PR incoming). If you are interested, the last commit should be the interesting one.

OlivierBlanvillain mentioned this pull request Aug 8, 2017

Missing Dataset methods #163

Open

76 tasks

OlivierBlanvillain force-pushed the joins branch 3 times, most recently from b944225 to 44adda4 Compare August 25, 2017 12:33

OlivierBlanvillain mentioned this pull request Sep 10, 2017

Experiment with new function syntax #60

Closed

OlivierBlanvillain added 3 commits September 10, 2017 10:35

Add all joins with a TypedColumn[Boolean] condition

0a1afb3

Fix tut

71e67da

OlivierBlanvillain force-pushed the joins branch from 36ba9a3 to 71e67da Compare September 10, 2017 08:35

OlivierBlanvillain commented Sep 10, 2017

View reviewed changes

OlivierBlanvillain added 4 commits September 16, 2017 00:49

Revert "Remove TypedColumn's first type parameter"

bcabdb2

This reverts commit eb66711. Conflicts: dataset/src/main/scala/frameless/TypedDataset.scala

Merge branch 'master' into joins

fa3cf97

Remove col type arguments

34b198d

Inference works fine

OlivierBlanvillain closed this Sep 15, 2017

This was referenced Sep 15, 2017

Joins with boolean column expressions, 2nd attempt #175

Merged

Add a typed col function for creating column references #187

Open

Uh oh!

Joins with boolean column expressions #162

Joins with boolean column expressions #162

Uh oh!

Conversation

OlivierBlanvillain commented Aug 8, 2017

Uh oh!

codecov-io commented Aug 8, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

kanterov commented Aug 8, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kanterov commented Aug 8, 2017

Uh oh!

OlivierBlanvillain commented Aug 9, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kanterov commented Aug 9, 2017

Uh oh!

OlivierBlanvillain commented Aug 10, 2017

Uh oh!

OlivierBlanvillain commented Aug 10, 2017

Uh oh!

kanterov commented Aug 10, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

OlivierBlanvillain commented Aug 10, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

imarios commented Aug 10, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

OlivierBlanvillain commented Aug 25, 2017

Uh oh!

imarios commented Sep 1, 2017

Uh oh!

OlivierBlanvillain commented Sep 10, 2017

Uh oh!

OlivierBlanvillain commented Sep 10, 2017

Uh oh!

OlivierBlanvillain Sep 10, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kanterov commented Sep 10, 2017

Uh oh!

OlivierBlanvillain commented Sep 10, 2017

Uh oh!

OlivierBlanvillain commented Sep 15, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-io commented Aug 8, 2017 •

edited

Loading

kanterov commented Aug 8, 2017 •

edited

Loading

OlivierBlanvillain commented Aug 9, 2017 •

edited

Loading

kanterov commented Aug 10, 2017 •

edited

Loading

OlivierBlanvillain commented Aug 10, 2017 •

edited

Loading

imarios commented Aug 10, 2017 •

edited

Loading

OlivierBlanvillain Sep 10, 2017 •

edited

Loading