Skip to content

Conversation

@OlivierBlanvillain
Copy link
Contributor

This PR updates the interface for joins to support boolean column expressions. It also adds methods for each type of join implemented in Spark (Inner, Cross, Full, Right, Left, LeftSemi and LeftAnti).

As is, these changes lose some of the type safety in frameless. The first commit gets rid of TypedColumn's first type parameter which was used to keep track of the table corresponding to this columns. I see two options to recover this safety:

  1. Switch to a "functional" interface by changing the signature of select to def select[U](c: TypedColumn[T] => TypedColumn[U]): Dataset[U] and use the same trick in joins: def join[U](other: Dataset[U])(c: (TypedColumn[T], TypedColumn[U]) => TypedColumn[Boolean]): Dataset[(T, U)].

  2. Recover the second type parameter on TypedColumn but make it a coproduct of source tables.

IMO both options would add too much complexity compared to the small gain of keeping track of column origin. I think the type safety lost in this PR is least important safety feature provided by frameless, thus my proposal to get rid of it 😄

@OlivierBlanvillain OlivierBlanvillain mentioned this pull request Aug 8, 2017
76 tasks
@codecov-io
Copy link

codecov-io commented Aug 8, 2017

Codecov Report

Merging #162 into master will decrease coverage by 0.22%.
The diff coverage is 99.18%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #162      +/-   ##
==========================================
- Coverage      94%   93.77%   -0.23%     
==========================================
  Files          31       28       -3     
  Lines         667      659       -8     
  Branches        9       13       +4     
==========================================
- Hits          627      618       -9     
- Misses         40       41       +1
Impacted Files Coverage Δ
...t/src/main/scala/frameless/functions/package.scala 100% <ø> (ø) ⬆️
...ataset/src/main/scala/frameless/TypedDataset.scala 93.6% <100%> (+0.37%) ⬆️
...aset/src/main/scala/frameless/ops/GroupByOps.scala 95.23% <100%> (-2.92%) ⬇️
...set/src/main/scala/frameless/FramelessSyntax.scala 66.66% <100%> (ø) ⬆️
dataset/src/main/scala/frameless/implicits.scala 100% <100%> (ø) ⬆️
...taset/src/main/scala/frameless/functions/Udf.scala 100% <100%> (ø) ⬆️
.../src/main/scala/frameless/ops/AggregateTypes.scala 100% <100%> (ø) ⬆️
...ain/scala/frameless/functions/UnaryFunctions.scala 100% <100%> (ø) ⬆️
...set/src/main/scala/frameless/ops/ColumnTypes.scala 100% <100%> (ø) ⬆️
...scala/frameless/functions/AggregateFunctions.scala 100% <100%> (ø) ⬆️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4c04b3f...e08f0e1. Read the comment docs.

@kanterov
Copy link
Contributor

kanterov commented Aug 8, 2017

Did you see https://github.com/propensive/impromptu?

What about modeling join as something like:

trait TypedDataset[A]
  def innerJoin[B](bs: TypedDataset[B])(env: Env[A with B] => TypedColumn[A with B, Boolean]): TypedDataset[(A, B)]
}

trait TypedColumn[A] {
  def in[B >: A](env: Env[B]): TypedColumn[B]
}

as.innerJoin(bs)(env => as.col('a).in(env) === bs.col('b).in(env))

@kanterov
Copy link
Contributor

kanterov commented Aug 8, 2017

Just realized there is a slightly different, but probably better way. Don't know if we want to generalize it, but the idea is:

trait Join2[A, b]

trait TypedDataset[A]
  def innerJoin[B](bs: TypedDataset[B])(env: Join2[A, B] => TypedColumn[Join2[A, B], Boolean]): TypedDataset[(A, B)]
}

trait TypedColumn[A, U] {
  def in(join: Join2[A, B]): TypedColumn[Join2[A, B], U]
  def in(join: Join2[B, A]): TypedColumn[Join2[B, A], U]
}

as.innerJoin(bs)(env => as.col('a).in(env) === bs.col('b).in(env))

@OlivierBlanvillain
Copy link
Contributor Author

OlivierBlanvillain commented Aug 9, 2017

We can probably reduce it to as.innerJoin(bs)(implicit env => as.col('a) === bs.col('b)). My other idea was to use a coproduct of acceptable "source" for column expression, something like this:

trait TypedDataset[A]
  def innerJoin[B](bs: TypedDataset[B])(expr: TypedColumn[A :+: B :+: CNil, Boolean]): TypedDataset[(A, B)]
}

// Here === combines a TypedColumn[A :+: CNil, _] and a TypedColumn[B :+: CNil, _]
as.innerJoin(bs)(as.col('a) === bs.col('b))

But is it worth it? Are we really getting rid of an important class of errors with these checks? I think the additional heaviness of an implicit env everywhere would be too much...

@kanterov
Copy link
Contributor

kanterov commented Aug 9, 2017

I think to make this decision we need to evaluate cost and opportunity. One of the things that might push us to remove one type parameter is problems with type inference and craziness of compile errors. On the other side, having type parameter enables us declaring operations like:

trait TypedColumn[A, B] { 
  def >>[C](c: TypedColumn[B, C]): TypedColumn[A, C]
}

I don't see we can get enough quantitive data to evaluate how important is this class of errors.

I think one of the things to explore is approach by @jeremyrsmith, I think we can get very nice syntax, and keep type parameter.

@OlivierBlanvillain
Copy link
Contributor Author

@kanterov I didn't get what you mean by >>, could you elaborate?

I did a quick experiment with the implicit env => ds('a) idea, it looks promising: OlivierBlanvillain@5139091.

With Dotty's implicit function type we could just declare select as def select[C](expr: implicit CanAccess[T] => TypedColumn[C]) and keep everything as is. It might be possible to emulate just that with a macro. I think scala.rx has something similar in this file to turn Rx(a + b) into Rx(implicit env => a + b).

@OlivierBlanvillain
Copy link
Contributor Author

BTW, what's your opinion on merging this PR without addressing column access safety?

@kanterov
Copy link
Contributor

kanterov commented Aug 10, 2017

@OlivierBlanvillain >> is composition of two columns, for instance,

case class Person(address: Address)
case class Address(street: String, apt: Int)

def _address: TypedColumn[Person, Address]
def _apt: TypedColumn[Address, Int]

val x: TypedColumn[Person, Int] = _address >> _apt

The experiment looks very promising if we can use a macro to model implicit functions. What do you think if we merge this into a separate branch, polish, build artifact and ask people for feedback?

@OlivierBlanvillain
Copy link
Contributor Author

OlivierBlanvillain commented Aug 10, 2017

@kanterov I think it's possible to do the same thing than >> with a function + Spark's getField on top of this PR:

def _address: TypedColumn[Address] = ... // + implicit ca: CanAccess[Address]?
def _apt(a: TypedColumn[Address]): TypedColumn[Int] = a.getField('apt)
val x: TypedColumn[Int] = _apt(_address)

The experiment looks very promising if we can use a macro to model implicit functions. What do you think if we merge this into a separate branch, polish, build artifact and ask people for feedback?

By this are you referring to this PR of a polished version of 5139091? I'm happy to push the idea as far as possible with explicit implicits, but I think I will ask for help for the macro part... I already do too much compiler hacking on workdays :P

@imarios
Copy link
Contributor

imarios commented Aug 10, 2017

I love the idea of rethinking about these topics. This is healthy and we should do this more often :).

So, philosophically speaking, A => B is essentially Function1[A,B] in Scala. So I think it does feels natural to go from TypedColumn[A,B] to a TypedColumn[A] => TypedColumn[B] representation.

If we do that then it also becomes much more natural to do function composition for TypedColumn[_] => TypedColumn[_] for example, TypedColumn[A] => TypedColumn[B] => TypedColumn[C] is essentially TypedColumn[A] => TypedColumn[C].

I am trying to see what is the downside of this besides a slight increase in verbosity and API complexity ...

@OlivierBlanvillain
Copy link
Contributor Author

Ping @kanterov (#162 (comment))

@imarios Note that the idea here a bit different from the previous suggestion where I propose to turn all TypedColumn[A,B] into TypedColumn[A] => TypedColumn[B]. Here the idea to some something like:

implicit CanAccessTable[T1 :: T2 :: T3 :: HNil] => TypedColumn[C]

@OlivierBlanvillain OlivierBlanvillain force-pushed the joins branch 3 times, most recently from b944225 to 44adda4 Compare August 25, 2017 12:33
@imarios
Copy link
Contributor

imarios commented Sep 1, 2017

@OlivierBlanvillain so now is it possible to have a TypedColumn[A] and use it at a TypedDataset that contains an element of type A (so it passes compilation) but fails because, maybe, the name of the fields being accessed are not the same?

@OlivierBlanvillain
Copy link
Contributor Author

@imarios yes, that's the current state of this PR. Some safety is lost, but we get full coverage of the join API and a simpler code base. The question is whether we should merge it as is and try to recover the safety later, or if losing this bit of safety is not acceptable.

This change reduces the type saftely of column expressions. For instance, it's now possible to typecheck some nonsensical `select` as follows:

```
ds1.select(ds2('a)) // This use to be a type error!
```

This simplification is a required step to support joins, which is also source compatible. Safety can hopefuly be recovered by having `select` take a `TypedColumn[T] => TypedColumn[C]` function and removing the apply method on TypedDataset.
@OlivierBlanvillain
Copy link
Contributor Author

rebased

* }}}
*/
trait ColumnTypes[T, U <: HList] {
trait ColumnTypes[U <: HList] {
Copy link
Contributor Author

@OlivierBlanvillain OlivierBlanvillain Sep 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it we can replace this by a Comapped now.

@kanterov
Copy link
Contributor

What do you think about using CanAccess approach in a join expression to lift TypedColumn[L, A] to TypedColumn[L :+: R :+: CNil, A], and leave everything as it was before?

I feel this is an important part that we don't allow to mess up with types of datasets. It would be especially relevant in cases like:

val df = TypedDataset.create[X3[Int, Long, String]]()

df.select(df.col("a"), df.col("b"))
  .select(df.col("c")) // <-- should be compile error

Also losing T from TypedColumn[T, A] doesn't allow us to utilize approach with a macro transformation of Scala expressions to frameless TypedColumn[T, A]. I did some experiments with this, it works very well with type inference in IntelliJ, and people found it very natural. I find a lot of potential in this approach once scala-meta will support using type information in macro.

@OlivierBlanvillain
Copy link
Contributor Author

What do you think about using CanAccess approach in a join expression to lift TypedColumn[L, A] to TypedColumn[L :+: R :+: CNil, A], and leave everything as it was before?

That's a nice idea. I originally considered having L < Coproduct in TypedColumn[L, A], but that would mean computer L1 union L2 on every column operation, so df('a) + df('b) + df('c) would involves 3 implicit resolutions to compute df#T union df#T union df#T = df#T. Given than our test suite already takes 45 seconds to compile, I'm not sure that would be practical...

If we keep everything is it currently is and special case joins, it means joins need to have a implicit conversion is scope to do the lifting, I'm going to give this a try!

This reverts commit eb66711.

Conflicts:
	dataset/src/main/scala/frameless/TypedDataset.scala
Inference works fine
`CanAccess[_, A with B]` indicates that in this context it is possible to
access columns from both table `A` and table `B`. The first type parameter
is a dummy argument used for type inference.

The trick works as follows: `(df: TypedDataset[T]).col('a)` looks for a
CanAccess[T, T] which is always available thanks to the `globalInstance`
implicit defined above. Expression for joins (and other multi dataset
operations) take an `implicit a: CanAccess[Any, U with T] =>` closure.
Because the first (dummy) type parameter of `CanAccess` is contravariant,
the locally defined implicit will always be preferred over
`globalInstance`, which implements the desired behavior.
@OlivierBlanvillain
Copy link
Contributor Author

@kanterov Here is as far as I got using this CanAccess idea. It seams to work, but in the meantime I got another better idea (separate PR incoming). If you are interested, the last commit should be the interesting one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants