Joins with boolean column expressions #162
Codecov Report
@@ Coverage Diff @@
## master #162 +/- ##
==========================================
- Coverage 94% 93.77% -0.23%
==========================================
Files 31 28 -3
Lines 667 659 -8
Branches 9 13 +4
==========================================
- Hits 627 618 -9
- Misses 40 41 +1
Continue to review full report at Codecov.
|
|
Did you see https://github.com/propensive/impromptu? What about modeling join as something like:
trait TypedDataset[A] {
  def innerJoin[B](bs: TypedDataset[B])(env: Env[A with B] => TypedColumn[A with B, Boolean]): TypedDataset[(A, B)]
}
trait TypedColumn[A] {
  def in[B >: A](env: Env[B]): TypedColumn[B]
}
as.innerJoin(bs)(env => as.col('a).in(env) === bs.col('b).in(env))
|
Just realized there is a slightly different, but probably better way. Don't know if we want to generalize it, but the idea is:
trait Join2[A, B]
trait TypedDataset[A] {
  def innerJoin[B](bs: TypedDataset[B])(env: Join2[A, B] => TypedColumn[Join2[A, B], Boolean]): TypedDataset[(A, B)]
}
trait TypedColumn[A, U] {
  def in[B](join: Join2[A, B]): TypedColumn[Join2[A, B], U]
  def in[B](join: Join2[B, A]): TypedColumn[Join2[B, A], U]
}
as.innerJoin(bs)(env => as.col('a).in(env) === bs.col('b).in(env))
|
We can probably reduce it to
trait TypedDataset[A] {
  def innerJoin[B](bs: TypedDataset[B])(expr: TypedColumn[A :+: B :+: CNil, Boolean]): TypedDataset[(A, B)]
}
// Here === combines a TypedColumn[A :+: CNil, _] and a TypedColumn[B :+: CNil, _]
as.innerJoin(bs)(as.col('a) === bs.col('b))
But is it worth it? Are we really getting rid of an important class of errors with these checks? I think the additional heaviness of an
|
I think to make this decision we need to evaluate cost and opportunity. One of the things that might push us to remove one type parameter is problems with type inference and the craziness of compile errors. On the other hand, having the type parameter lets us declare operations like:
trait TypedColumn[A, B] {
  def >>[C](c: TypedColumn[B, C]): TypedColumn[A, C]
}
I don't see how we can get enough quantitative data to evaluate how important this class of errors is. I think one of the things to explore is the approach by @jeremyrsmith; I think we can get very nice syntax and keep the type parameter.
|
@kanterov I didn't get what you mean by I did a quick experiment with the With Dotty's implicit function type we could just declare select as |
|
BTW, what's your opinion on merging this PR without addressing column access safety? |
|
@OlivierBlanvillain
case class Person(address: Address)
case class Address(street: String, apt: Int)
def _address: TypedColumn[Person, Address]
def _apt: TypedColumn[Address, Int]
val x: TypedColumn[Person, Int] = _address >> _apt
The experiment looks very promising if we can use a macro to model implicit functions. What do you think if we merge this into a separate branch, polish, build an artifact and ask people for feedback?
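As a rough illustration of the `>>` composition discussed above, here is a minimal self-contained sketch. It is not frameless code: `TypedColumn` is a toy case class that only records a field path, and `ComposeSketch` and `path` are names made up for this example.

```scala
object ComposeSketch {
  // Toy column: A is the source type, B the value type, path the nested field names.
  // This is an assumption for illustration, not the frameless TypedColumn.
  final case class TypedColumn[A, B](path: List[String]) {
    // Follow a column from A to B with a column from B to C.
    def >>[C](next: TypedColumn[B, C]): TypedColumn[A, C] =
      TypedColumn[A, C](path ++ next.path)
  }

  case class Address(street: String, apt: Int)
  case class Person(address: Address)

  val _address: TypedColumn[Person, Address] = TypedColumn[Person, Address](List("address"))
  val _apt: TypedColumn[Address, Int] = TypedColumn[Address, Int](List("apt"))

  // Only type checks because the value type of _address matches the source type of _apt.
  val x: TypedColumn[Person, Int] = _address >> _apt
}
```

Swapping the operands (`_apt >> _address`) should be rejected by the compiler, which is the kind of check the second type parameter makes possible.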
|
@kanterov I think it's possible to do the same thing than
def _address: TypedColumn[Address] = ... // + implicit ca: CanAccess[Address]?
def _apt(a: TypedColumn[Address]): TypedColumn[Int] = a.getField('apt)
val x: TypedColumn[Int] = _apt(_address)
By this are you referring to this PR or a polished version of 5139091? I'm happy to push the idea as far as possible with explicit implicits, but I think I will ask for help with the macro part... I already do too much compiler hacking on workdays :P
|
I love the idea of rethinking these topics. This is healthy and we should do this more often :). So, philosophically speaking, If we do that then it also becomes much more natural to do function composition for I am trying to see what the downside of this is besides a slight increase in verbosity and API complexity ...
|
Ping @kanterov (#162 (comment)) @imarios Note that the idea here is a bit different from the previous suggestion where I propose to turn all
|
b944225 to
44adda4
Compare
|
@OlivierBlanvillain so now is it possible to have a |
|
@imarios yes, that's the current state of this PR. Some safety is lost, but we get full coverage of the join API and a simpler code base. The question is whether we should merge it as is and try to recover the safety later, or if losing this bit of safety is not acceptable. |
This change reduces the type safety of column expressions. For instance, it's now possible to typecheck some nonsensical `select` as follows:
```
ds1.select(ds2('a)) // This used to be a type error!
```
This simplification is a required step to support joins, and it is also source compatible. Safety can hopefully be recovered by having `select` take a `TypedColumn[T] => TypedColumn[C]` function and removing the apply method on TypedDataset.
36ba9a3 to
71e67da
Compare
|
rebased |
   * }}}
   */
- trait ColumnTypes[T, U <: HList] {
+ trait ColumnTypes[U <: HList] {
I think we can replace this with a Comapped now.
|
What do you think about using I feel it is important that we don't allow messing up the types of datasets. It would be especially relevant in cases like:
val df = TypedDataset.create[X3[Int, Long, String]]()
df.select(df.col("a"), df.col("b"))
  .select(df.col("c")) // <-- should be a compile error
Also losing
That's a nice idea. I originally considered having If we keep everything as it currently is and special-case joins, it means joins need to have an implicit conversion in scope to do the lifting. I'm going to give this a try!
This reverts commit eb66711. Conflicts: dataset/src/main/scala/frameless/TypedDataset.scala
Inference works fine
`CanAccess[_, A with B]` indicates that in this context it is possible to
access columns from both table `A` and table `B`. The first type parameter
is a dummy argument used for type inference.
The trick works as follows: `(df: TypedDataset[T]).col('a)` looks for a
CanAccess[T, T] which is always available thanks to the `globalInstance`
implicit defined above. Expressions for joins (and other multi-dataset
operations) take an `implicit a: CanAccess[Any, U with T] =>` closure.
Because the first (dummy) type parameter of `CanAccess` is contravariant,
the locally defined implicit will always be preferred over
`globalInstance`, which implements the desired behavior.
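
A minimal, self-contained sketch of the trick described in this commit message, assuming a toy model rather than the real frameless types. `CanAccessSketch`, `Col`, `DS` and `intCol` are hypothetical names introduced for illustration; only `CanAccess` and `globalInstance` come from the text above, and their bodies here are guesses.

```scala
object CanAccessSketch {
  // The first (dummy) parameter is contravariant and only steers inference.
  final class CanAccess[-X, Y]
  object CanAccess {
    // Always available: a dataset of T can access its own columns.
    implicit def globalInstance[T]: CanAccess[T, T] = new CanAccess[T, T]
  }

  // Toy column: S tracks which tables may be referenced, V is the value type.
  final case class Col[S, V](expr: String) {
    def ===(other: Col[S, V]): Col[S, Boolean] =
      Col[S, Boolean](s"$expr = ${other.expr}")
  }

  // Toy dataset; intCol stands in for a checked `col` lookup.
  final class DS[T](alias: String) {
    def intCol[X](name: String)(implicit ca: CanAccess[T, X]): Col[X, Int] =
      Col[X, Int](s"$alias.$name")

    def select[V](c: Col[T, V]): String = s"SELECT ${c.expr}"

    // The join condition is a closure that receives a CanAccess for both tables.
    def innerJoin[U](other: DS[U])(
      cond: CanAccess[Any, T with U] => Col[T with U, Boolean]
    ): String = cond(new CanAccess[Any, T with U]).expr
  }

  trait A
  trait B
  val ds1 = new DS[A]("ds1")
  val ds2 = new DS[B]("ds2")

  // Outside a join only globalInstance applies, so the column has source A.
  val selected: String = ds1.select(ds1.intCol("a"))

  // Inside the closure the local implicit `ca` is found, so both columns
  // are typed as Col[A with B, Int] and can be compared.
  val joined: String =
    ds1.innerJoin(ds2) { implicit ca => ds1.intCol("a") === ds2.intCol("b") }

  // ds1.select(ds2.intCol("a")) is expected not to compile in this sketch:
  // there is no CanAccess[B, A] available.
}
```

The closure has to mark its parameter `implicit` by hand here; the Dotty implicit function types mentioned earlier in the thread would presumably remove that bit of ceremony.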
|
@kanterov Here is as far as I got using this |
This PR updates the interface for joins to support boolean column expressions. It also adds methods for each type of join implemented in Spark (Inner, Cross, Full, Right, Left, LeftSemi and LeftAnti).
As is, these changes lose some of the type safety in frameless. The first commit gets rid of TypedColumn's first type parameter, which was used to keep track of the table corresponding to a column. I see two options to recover this safety:
1. Switch to a "functional" interface by changing the signature of select to `def select[U](c: TypedColumn[T] => TypedColumn[U]): Dataset[U]` and use the same trick in joins: `def join[U](other: Dataset[U])(c: (TypedColumn[T], TypedColumn[U]) => TypedColumn[Boolean]): Dataset[(T, U)]`.
2. Restore the removed type parameter on TypedColumn, but make it a coproduct of source tables.
IMO both options would add too much complexity compared to the small gain of keeping track of column origin. I think the type safety lost in this PR is the least important safety feature provided by frameless, thus my proposal to get rid of it 😄
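
To make the "functional" interface option more concrete, here is a small sketch under heavy assumptions. It is a toy model, not the frameless API: `FunctionalSketch`, `field`, `expr` and `alias` are hypothetical, `field` is an unchecked lookup, and `select` and `join` return a rendered string instead of a `Dataset`.

```scala
object FunctionalSketch {
  // Toy column with a single type parameter for the value type.
  final case class TypedColumn[U](expr: String) {
    // Hypothetical, unchecked field access: the caller asserts the field type V.
    def field[V](name: String): TypedColumn[V] = TypedColumn[V](s"$expr.$name")

    def ===(other: TypedColumn[U]): TypedColumn[Boolean] =
      TypedColumn[Boolean](s"$expr = ${other.expr}")
  }

  final class Dataset[T](val alias: String) {
    // Columns are only reachable through the row column handed to the function.
    def select[U](c: TypedColumn[T] => TypedColumn[U]): String =
      "SELECT " + c(TypedColumn[T](alias)).expr

    // Same idea for joins: one row column per side.
    def join[U](other: Dataset[U])(
      c: (TypedColumn[T], TypedColumn[U]) => TypedColumn[Boolean]
    ): String =
      "JOIN ON " + c(TypedColumn[T](alias), TypedColumn[U](other.alias)).expr
  }

  trait Person
  trait Pet
  val people = new Dataset[Person]("people")
  val pets = new Dataset[Pet]("pets")

  val selected: String = people.select(p => p.field[Int]("age"))
  val joined: String =
    people.join(pets)((p, q) => p.field[Int]("id") === q.field[Int]("ownerId"))
}
```

Because a column can only be built from the argument passed to the function, a column of another dataset cannot be smuggled into `select` unless it is captured from an outer scope, which is why this option goes together with removing the apply method on TypedDataset.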