feat(spark): add Window support #307
Conversation
I've gone ahead and reviewed and merged the other PRs, so we shouldn't have any conflict issues there beyond the one that exists now. I can review this PR tomorrow.
Force-pushed from 03ac4c8 to f093ec4
    WindowSpecDefinition(_, _, SpecifiedWindowFrame(frameType, lower, upper))) =>
  (fromSpark(frameType), fromSparkPreceding(lower), fromSparkFollowing(upper))
case WindowExpression(_, WindowSpecDefinition(_, _, UnspecifiedFrame)) =>
  (WindowBoundsType.ROWS, UNBOUNDED, CURRENT_ROW)
I was curious about the default behaviour when the frame is unspecified: for Postgres it is RANGE UNBOUNDED PRECEDING, but for Spark it depends on whether the ordering is defined:
@note When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding,
unboundedFollowing) is used by default. When ordering is defined, a growing window frame
(rangeFrame, unboundedPreceding, currentRow) is used by default.
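The defaulting rule quoted in that note can be sketched as a tiny decision function. This is a hypothetical stand-in using simplified frame types, not Spark's real catalyst classes:

```scala
// Simplified stand-ins for Spark's two default frames (illustrative only):
sealed trait Frame
// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
case object RowsUnboundedBoth extends Frame
// RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
case object RangeUnboundedToCurrent extends Frame

// Spark's documented rule: the default frame depends on whether an
// ORDER BY clause is present in the window specification.
def defaultFrame(orderingDefined: Boolean): Frame =
  if (orderingDefined) RangeUnboundedToCurrent else RowsUnboundedBoth
```

This mirrors the orderSpec.isEmpty check that the converter ends up doing further down in this PR.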
Good spot, I hadn't noticed that comment.
val (frameType, lower, upper) = sparkExp match {
  case WindowExpression(_: OffsetWindowFunction, _) =>
    (WindowBoundsType.ROWS, UNBOUNDED, CURRENT_ROW)
What is an OffsetWindowFunction? Does it somehow override the frame bounds?
That's my understanding. TBH I'm not sure exactly how it should translate to Substrait, but it's ignored by Spark (it matches the lag/lead functions):
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L404
You should actually get the frame here properly as well. Otherwise the bound will be incorrect for other consumers (unbounded preceding and current row will not work for Lead, for example) and Spark will throw. You'll see that if you add logicalPlan2.show() into SubstraitPlanTestBase::assertSqlToSubstraitRoundTrip and re-run the lag/lead test. (We should also make the tests always evaluate the converted plans to ensure they are actually valid; I'll try to get to that.)
Can be fixed by just removing this case.
OK, I've removed it for now.
Blizzara
left a comment
Thanks! Overall this looks good, but there are some things in the bounds that I think aren't always correct.
    .append("sorts=")
    .append(window.getSorts)
  })
}
FWIW, in our internal fork I just deleted the whole RelToVerboseString thing. I don't see it bringing that much value over pretty-printing just the protobuf. And there's a fair amount of work to maintain this.
val WINDOW_SIGS: Seq[Sig] = Seq(
  s[RowNumber]("row_number"),
  s[Rank]("rank"),
From looking at our internal fork's window impl, the rank functions in Spark have some child column but Substrait doesn't define any. I wonder how it works here, i.e. how do you still get a match?
  s[CumeDist]("cume_dist"),
  s[NTile]("ntile"),
  s[Lead]("lead"),
  s[Lag]("lag"),
I also have a note about Lead and Lag having a NullType null as a child, which I think Substrait doesn't support. Did you run into anything like that?
I haven't seen this - do you have a test case?
Looks like you had, you just handle it here: https://github.com/substrait-io/substrait-java/pull/307/files#diff-78e1a9d4e9f4c7d8968b3ba83d3cc5222ade95d28e16b0328277c3f8c8a9d313R183 😄
Oh yes - not known for my memory! 😂
  case other => throw new UnsupportedOperationException(s"Unsupported bounds type: $other.")
}

def fromSparkPreceding(bound: Expression): WindowBound = bound match {
Substrait defines the bounds as being strictly positive integers, but IIRC Spark may have a "preceding -1 row", for example. This is what I used:
expr match {
  case UnboundedPreceding => WindowBound.UNBOUNDED
  case UnboundedFollowing => WindowBound.UNBOUNDED
  case CurrentRow => WindowBound.CURRENT_ROW
  case e: Literal =>
    e.dataType match {
      case IntegerType =>
        val offset = e.eval().asInstanceOf[Int]
        if (offset < 0) WindowBound.Preceding.of(-offset)
        else if (offset == 0) WindowBound.CURRENT_ROW
        else WindowBound.Following.of(offset)
    }
  case _ => throw new UnsupportedOperationException(s"Unexpected bound: $expr")
}
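The sign convention in that snippet can be illustrated in isolation. This is a minimal sketch with a hypothetical Bound ADT standing in for substrait-java's WindowBound; only the sign-flipping logic is the point:

```scala
// Hypothetical stand-ins for Substrait's window bounds (illustrative only):
sealed trait Bound
case class Preceding(offset: Int) extends Bound  // offset must be strictly positive
case class Following(offset: Int) extends Bound  // offset must be strictly positive
case object CurrentRowBound extends Bound

// Spark encodes a frame bound as a (possibly negative) integer literal;
// Substrait requires strictly positive offsets, so the sign selects the
// bound's direction instead.
def fromSparkOffset(offset: Int): Bound =
  if (offset < 0) Preceding(-offset)
  else if (offset == 0) CurrentRowBound
  else Following(offset)
```

So a Spark "-1" (one row preceding) becomes Preceding(1), which is why a lower bound can legitimately turn out to be a Following and vice versa.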
Thanks, I've changed it to this.
case UNBOUNDED => UnboundedPreceding
case CURRENT_ROW => CurrentRow
case p: Preceding => Literal(p.offset())
case _ => throw new UnsupportedOperationException(s"Unsupported bounds expression $bound")
I think this also doesn't work, for the same reason as above: func.lowerBound() may very well be a Following (and vice versa below), so you should handle both in both.
Our version was a bit different, yours might be nice (once fixed):
def toSparkFrame(
    boundsType: WindowBoundsType,
    lowerBound: WindowBound,
    upperBound: WindowBound): WindowFrame = {
  val frameType = boundsType match {
    case WindowBoundsType.ROWS => RowFrame
    case WindowBoundsType.RANGE => RangeFrame
    case WindowBoundsType.UNSPECIFIED => return UnspecifiedFrame
  }
  SpecifiedWindowFrame(
    frameType,
    toSparkBound(lowerBound, isLower = true),
    toSparkBound(upperBound, isLower = false))
}

private def toSparkBound(bound: WindowBound, isLower: Boolean): Expression = {
  bound.accept(new WindowBoundVisitor[Expression, Exception] {
    override def visit(preceding: WindowBound.Preceding): Expression =
      Literal(-preceding.offset().intValue())
    override def visit(following: WindowBound.Following): Expression =
      Literal(following.offset().intValue())
    override def visit(currentRow: WindowBound.CurrentRow): Expression = CurrentRow
    override def visit(unbounded: WindowBound.Unbounded): Expression =
      if (isLower) UnboundedPreceding else UnboundedFollowing
  })
}
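The reverse-direction sign mapping in that visitor can be sketched with stand-in types. SBound and toSparkOffset are illustrative names, not the real substrait-java or Spark APIs:

```scala
// Illustrative stand-ins for Substrait bounds on the consume side:
sealed trait SBound
case class SPreceding(n: Int) extends SBound
case class SFollowing(n: Int) extends SBound
case object SUnbounded extends SBound

// Preceding(n) becomes a negative Spark literal, Following(n) a positive
// one, and Unbounded resolves differently depending on which side of the
// frame it sits on (hence the isLower flag threaded through).
def toSparkOffset(b: SBound, isLower: Boolean): Either[String, Int] = b match {
  case SPreceding(n) => Right(-n)
  case SFollowing(n) => Right(n)
  case SUnbounded   => Left(if (isLower) "UnboundedPreceding" else "UnboundedFollowing")
}
```

Because both directions are handled for both bounds, a lower bound that is a Following (or an upper bound that is a Preceding) round-trips correctly.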
Yours is nicer :).
      |
      |""".stripMargin
  assertSqlSubstraitRelRoundTrip(query)
}
Maybe add a test with two different partitions in the same select? I think it should pass fine, but just in case.
  assertSqlSubstraitRelRoundTrip(query)
}

test("min") {
Nit: maybe name this "aggregate" or something, as I think that's what it's testing (the fact that it's min specifically is less relevant)?
- test("min") {
+ test("aggregate") {
Thanks @Blizzara, there's some really useful feedback here. I hadn't realised you had already been implementing this; it would be awesome if you want to collaborate :) I'd only implemented as much as was necessary to enable the windowing tests in the TPC-DS suite to pass. If you've got other tests that require additional logic in the converter, it would be great to add these, but perhaps in a future PR in the interest of moving things forward incrementally?
Force-pushed from f093ec4 to fc0f15a
Yeah, shame on me: it's been on my todo list ever since you added this into substrait-java to pull up our changes. Some of those changes are non-trivial refactorings which might be annoying, but I'll try to get to it!
I'd like to see the bounds things fixed, since currently this implementation doesn't adhere to the bounds (preceding, following) being strictly positive. That's not a problem for the roundtrip tests, but it means a plan generated here may not be valid Substrait that can be consumed by other consumers.
Starting with #311! After that I'll add structs (+ fix the name handling), and then I have some more complicated changes in how FunctionMappings works to allow for more complex mappings.
Force-pushed from 82cb729 to 73d3bc9
To support the OVER clause in SQL

Signed-off-by: Andrew Coleman <andrew_coleman@uk.ibm.com>
Force-pushed from 73d3bc9 to 86a0548
    WindowSpecDefinition(_, _, SpecifiedWindowFrame(frameType, lower, upper))) =>
  (fromSpark(frameType), fromSpark(lower), fromSpark(upper))
case WindowExpression(_, WindowSpecDefinition(_, orderSpec, UnspecifiedFrame)) =>
  if (orderSpec.isEmpty) {
Just for posterity, this comes from Spark's notes: https://github.com/apache/spark/blob/250f8affd04e4be14446dd02a1c52716e54a226d/sql/api/src/main/scala/org/apache/spark/sql/expressions/Window.scala#L36
val successfulSQL: Set[String] = Set("q1", "q3", "q4", "q5", "q7", "q8",
  "q21", "q22", "q23a", "q23b", "q24a", "q24b", "q25", "q26", "q27", "q28", "q29",
  "q30", "q31", "q32", "q33", "q37", "q38",
  "q40", "q41", "q42", "q43", "q46", "q48",
Given how this is growing, it might be worth starting to list the ones that don't work instead 😄
Exactly, I was going to do this after my next PR which will add support for more numeric functions. The number of failing tests will be relatively small then.
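The suggestion to track failures instead can be sketched as deriving the passing set from the full query list. The names and the failing set here are purely illustrative, and the generated list ignores TPC-DS variants like q23a/q23b for brevity:

```scala
// Generate the full (simplified) TPC-DS query list, then subtract the
// hypothetical failures to get the passing set.
val allQueries: Set[String] = (1 to 99).map(i => s"q$i").toSet
val failingSQL: Set[String] = Set("q2", "q6") // illustrative failures only
val successfulSQL: Set[String] = allQueries -- failingSQL
```

Once the failing set is small, this keeps the test file short and makes new regressions obvious in review.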
To support the OVER clause in SQL
This fixes some of the TPC-DS tests.