feat(spark): support for FetchRel offset field #296
Conversation
    if (fetch.getCount.isPresent) {
      val limit = Literal(fetch.getCount.getAsLong.intValue(), IntegerType)
      fetch.getOffset match {
        case 1L => GlobalLimit(limitExpr = limit, child = child)
fwiw this is just misusing the field, but I know it was there already so not a problem for this PR 😅
If I understand this correctly, you're using offset 1 or -1 as a switching field? What would it take to not do this? At the very least, could we use something like -2 to signal this, to avoid using a valid offset value as a magic number?
@Blizzara Do you know if there's anything outside of substrait-spark that's relying on this magic number, or is it something that we can change?
I'm not aware of anything else, I'd be happy to see it changed! (I had already changed it in our internal fork of substrait-spark, but the code of this PR is nicer than mine 🎉 )
      fetch.getOffset match {
        case 1L => GlobalLimit(limitExpr = limit, child = child)
        case -1L => LocalLimit(limitExpr = limit, child = child)
        case _ => visitFallback(fetch)
Can Spark not handle the case where both the offset and count are set?
> Can Spark not handle the case where both the offset and count are set?
Yes, the unit test in this PR sets both limit and offset. Spark handles limit by breaking it into two logical relations, LocalLimit and GlobalLimit (there's an explanation here).
In the case of the test query in this PR:
select l_partkey from lineitem where l_shipdate < date '1998-01-01'
order by l_shipdate asc, l_discount desc limit 100 offset 1000
the Spark logical plan is:
GlobalLimit 100
+- Offset 1000
+- LocalLimit 1100
+- Project [l_partkey#7997L]
+- Sort [l_shipdate#8006 ASC NULLS FIRST, l_discount#8002 DESC NULLS LAST], true
+- Project [l_partkey#7997L, l_shipdate#8006, l_discount#8002]
+- Filter (isnotnull(l_shipdate#8006) AND (l_shipdate#8006 < 1998-01-01))
+- Relation ...
I'm not sure how this would best translate into a Substrait plan.
The handling of the limit clause was already in the codebase as inherited from Gluten. As @Blizzara noted, this PR adds the offset. But I can understand that this whole area needs rethinking.
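To make the split concrete, here is a toy model using plain Scala collections (the helpers are illustrative stand-ins, not Spark's actual operators): LocalLimit truncates each partition independently, GlobalLimit truncates the merged result, and an offset is absorbed into the local limit before being dropped globally.

```scala
// Toy model of Spark's two-phase limit over partitioned data.
// These names and helpers are illustrative stand-ins, not Spark APIs.
object LimitSketch {
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3, 4), Seq(5, 6, 7), Seq(8, 9))

  // LocalLimit: each partition keeps at most n rows.
  def localLimit(parts: Seq[Seq[Int]], n: Int): Seq[Seq[Int]] = parts.map(_.take(n))

  // GlobalLimit: the merged stream keeps at most n rows.
  def globalLimit(parts: Seq[Seq[Int]], n: Int): Seq[Int] = parts.flatten.take(n)

  // `limit 5` => LocalLimit(5) on each partition, then GlobalLimit(5) on the merge.
  val limited: Seq[Int] = globalLimit(localLimit(partitions, 5), 5)

  // `limit 2 offset 3` => LocalLimit(2 + 3) per partition, then skip 3 and keep 2 globally,
  // mirroring the LocalLimit 1100 under GlobalLimit 100 / Offset 1000 in the plan above.
  def offsetThenLimit(parts: Seq[Seq[Int]], offset: Int, n: Int): Seq[Int] =
    localLimit(parts, n + offset).flatten.drop(offset).take(n)
}
```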
Based on what you've indicated, is converting FetchRel(count=X, offset = Y, <input>) into:
GlobalLimit X
+- Offset Y
+- LocalLimit X + Y
+- <converted input>
viable?
Generally speaking, Substrait consumers should be written in such a way that they accept all valid Substrait plans, even those not built specifically by or for Spark in this case. It's perfectly valid to set both fields of the FetchRel, even if Spark doesn't.
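A sketch of what that conversion could look like, with hypothetical case classes standing in for the Spark operators and the Substrait FetchRel (names and shapes are assumed for illustration, not taken from either codebase). The point is that branching on the actual field values removes the need for sentinel offsets like 1L / -1L:

```scala
// Hypothetical stand-ins for Spark's logical operators and Substrait's FetchRel.
sealed trait Plan
case class GlobalLimit(limit: Long, child: Plan) extends Plan
case class LocalLimit(limit: Long, child: Plan) extends Plan
case class Offset(offset: Long, child: Plan) extends Plan
case object Input extends Plan

case class FetchRel(count: Option[Long], offset: Long)

// FetchRel(count = X, offset = Y) => GlobalLimit X / Offset Y / LocalLimit X + Y.
def toSpark(fetch: FetchRel, input: Plan): Plan = fetch match {
  case FetchRel(Some(x), 0L) => GlobalLimit(x, LocalLimit(x, input))
  case FetchRel(Some(x), y)  => GlobalLimit(x, Offset(y, LocalLimit(x + y, input)))
  case FetchRel(None, y)     => Offset(y, input) // offset only, no count
}
```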
> viable?
Could be... I'll do some more testing to get a better understanding of this.
> Generally speaking, ...
Completely agree. One of the weaknesses of the current test suite is that it relies entirely on round-tripping the query plans. I need to improve on this, somehow.
From reading the Spark PRs, here is my understanding.
When converting from Substrait -> Spark:
Fetch(limit>=0, offset>0) => LocalLimit -> GlobalLimitAndOffset
Fetch(limit<0, offset>0) => Offset
Fetch(limit>=0, offset==0) => LocalLimit -> GlobalLimit
When converting from Spark -> Substrait then the following logic should work:
if current_node == LocalLimit:
    if next_node == GlobalLimitAndOffset:
        if current_node.limit < next_node.limit:
            raise Exception("this plan cannot be converted into Substrait")
        else:
            yield Fetch(limit=next_node.limit, offset=next_node.offset)
            skip(next_node)
    elif next_node == GlobalLimit:
        if current_node.limit < next_node.limit:
            raise Exception("this plan cannot be converted into Substrait")
        else:
            yield Fetch(limit=next_node.limit, offset=0)
            skip(next_node)
    else:
        raise Exception("this plan cannot be converted into Substrait")

# This cannot have been preceded by LocalLimit because we would have
# skipped it above, so it is just a global limit
if current_node == GlobalLimit:
    yield Fetch(limit=current_node.limit, offset=0)
if current_node == Offset:
    yield Fetch(limit=-1, offset=current_node.offset)
if current_node == GlobalLimitAndOffset:
    yield Fetch(limit=current_node.limit, offset=current_node.offset)
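The pseudocode above can be sketched in Scala by matching on the parent/child shape of the plan. The case-class names here are toy stand-ins (suffixed with N to make clear they are not Spark's operators), and the sketch assumes the Global* node sits directly above the LocalLimit, as in the plan printed earlier; Fetch uses limit = -1 to mean "no limit":

```scala
// Toy stand-ins for Spark's logical operators; not Spark APIs.
sealed trait Node
case class LocalLimitN(limit: Long, child: Node) extends Node
case class GlobalLimitN(limit: Long, child: Node) extends Node
case class OffsetN(offset: Long, child: Node) extends Node
case class GlobalLimitAndOffsetN(limit: Long, offset: Long, child: Node) extends Node
case object Scan extends Node

case class Fetch(limit: Long, offset: Long) // limit = -1 means "no limit"

def toFetch(node: Node): Either[String, (Fetch, Node)] = node match {
  // A Global* node over a LocalLimit with a compatible (not smaller) local limit.
  case GlobalLimitAndOffsetN(l, o, LocalLimitN(ll, child)) if ll >= l =>
    Right((Fetch(l, o), child))
  case GlobalLimitN(l, LocalLimitN(ll, child)) if ll >= l =>
    Right((Fetch(l, 0), child))
  // LocalLimit smaller than the limit above it: no Substrait equivalent.
  case GlobalLimitAndOffsetN(_, _, LocalLimitN(_, _)) | GlobalLimitN(_, LocalLimitN(_, _)) =>
    Left("this plan cannot be converted into Substrait")
  // Bare Global*/Offset nodes convert directly.
  case GlobalLimitAndOffsetN(l, o, child) => Right((Fetch(l, o), child))
  case GlobalLimitN(l, child)             => Right((Fetch(l, 0), child))
  case OffsetN(o, child)                  => Right((Fetch(-1, o), child))
  // A LocalLimit alone (or any other node) cannot be converted.
  case _ => Left("this plan cannot be converted into Substrait")
}
```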
The above logic should always round trip successfully (unless I made a mistake in my understanding)
The case where there is either a LocalLimit alone, or a LocalLimit followed by a GlobalLimit or GlobalLimitAndOffset whose limit is larger than the LocalLimit's, cannot be supported today.
This plan implements a per-group limit and the closest equivalent Substrait plan would be something like...
┌──────┐ ┌───────────────────────────┐ ┌───────┐ ┌───────────┐
│ SCAN │ │ FILTER (partition_id = 1) │ │ LIMIT │ ────┬────│ UNION ALL │
└──────┘ └───────────────────────────┘ └───────┘ │ └───────────┘
│
│
┌──────┐ ┌───────────────────────────┐ ┌───────┐ │
│ SCAN │ │ FILTER (partition_id = 2) │ │ LIMIT │ ────┼
└──────┘ └───────────────────────────┘ └───────┘ │
│
│
│
... │
│
│
│
┌──────┐ ┌───────────────────────────┐ ┌───────┐ │
│ SCAN │ │ FILTER (partition_id = N) │ │ LIMIT │ ────┘
└──────┘ └───────────────────────────┘ └───────┘
Thank you @westonpace, that's really helpful. I think I've captured that in the amended commit, and added a couple of extra tests.
Add missing support for the ‘offset’ clause in the spark module. Signed-off-by: Andrew Coleman <andrew_coleman@uk.ibm.com>
@vbarua I believe I have addressed all the comments and this is ready for re-review. Thanks!
vbarua
left a comment
This looks reasonable to me. Thanks for pushing this forward!