Port of substrait-spark module from Gluten #271

andrew-coleman · 2024-06-18T10:51:43Z

As discussed in the chain of comments here, I’m offering up this PR for your consideration.

I can take no credit for its implementation: it is a copy of the substrait-spark module that was part of the Gluten repo before it was removed, with a few minor corrections and additions. We are using this as part of a project here (IBM) and would like to see it remain available under the Apache ecosystem.

It is useful for converting Spark query plans to and from Substrait.

CLAassistant · 2024-06-18T10:51:50Z

All committers have signed the CLA.

EpsilonPrime

I suspect this will be one of the topics for the community meeting later today if you're interested.

EpsilonPrime · 2024-06-19T07:19:49Z

spark/build.gradle.kts

+        }
+        developers {
+          developer {
+            // TBD Get the list of


Hmm, we should have this somewhere.

I just copied the pom section from the other build files in the project. We could just remove the developer section entirely.

EpsilonPrime · 2024-06-19T07:24:24Z

spark/src/main/scala/io/substrait/debug/ExpressionToString.scala

It's worth noting that this will be the first instance of Scala in substrait-java. I'll let those that use this code more often decide if that's a bad thing or not.

Side question: If this code was magically available in Java would it still be usable by existing callers?

Yes, I didn't know whether this would be an issue. The other PR that I referenced was also written in Scala, and that didn't seem to be an issue in any of the comments. This code originally came from the Gluten project, which is all in Scala. Spark itself is also written in Scala.
We are using this library from Java without any issues.

Side question: If this code was magically available in Java would it still be usable by existing callers?

I don't think it would make any difference from an end user point of view.

Interested party here!

I think we have needed this PR for a long time. It won't be possible to do this in Java (AFAIU, because of the low-level Spark APIs), and creating a substrait-scala just for this doesn't make much sense. I guess it would be fine to add this code here.

cc @vbarua

To echo @andrew-coleman's comments: the only issue I've seen with calling the library from Java is that following the links in VSCode drops me into decompiled Scala code, which is not that readable in my view.

But that's it - so all together not bad at all :-)

EpsilonPrime · 2024-06-19T07:27:31Z

spark/src/test/scala/io/substrait/spark/TPCHPlan.scala

+    assertSqlSubstraitRelRoundTrip(
+      "select l_partkey from lineitem where l_shipdate < date '1998-01-01' " +
+        "order by l_shipdate asc, l_discount desc")
+//    assertSqlSubstraitRelRoundTrip(


Any idea why these are commented out?

Good question. There are three tests commented out here. When I uncommented them, the first one failed (because the offset clause has not been implemented yet) but the other two passed. So I'll push a new commit with the comments removed and the failing test in a separate ignore block.

There are a few tests that are ignored due to unimplemented features. I plan to work on these in the coming weeks.

Blizzara · 2024-06-19T09:30:33Z

FWIW, we also have an internal fork of this, and I've done some amount of work on it. If this gets included here, I'll be happy to add in my changes as relevant; otherwise I'll look at some point at opensourcing our fork.

It's worth noting that this will be the first instance of Scala in substrait-java. I'll let those that use this code more often decide if that's a bad thing or not.

There are likely some annoyances it may cause (like deciding/aligning the code formatting for both), but as long as this is a separate subproject it should not affect the Java-only parts in any way really afaik.

Side question: If this code was magically available in Java would it still be usable by existing callers?

Probably yes, but given Spark (esp. Catalyst) is written mostly in Scala, it's just much easier to traverse the Spark plans in Scala than in Java.
The current Scala implementation is directly usable from within Java consumers, so that's not an issue.

andrew-coleman · 2024-06-19T09:35:26Z

FWIW, we also have an internal fork of this, and I've done some amount of work on it. If this gets included here, I'll be happy to add in my changes as relevant; otherwise I'll look at some point at opensourcing our fork.

Awesome! Then we should combine our efforts.

mbwhite · 2024-06-19T10:16:43Z

Probably worth adding that I've got several examples of using this library with Spark. I need to get these approved before I can share them, and I'm currently writing up docs for them.

vibhatha · 2024-06-19T12:38:07Z

spark/src/main/scala/io/substrait/spark/logical/ToSubstraitRel.scala

+   * a, max(b) + 1 from table group by a</code>, We need create [[Project]] on top of [[Aggregate]]
+   * to correctly support it.
+   *
+   * TODO: support [[Rollup]] and [[GroupingSets]]


maybe we can create issues for these to be added later?
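
For readers following the doc comment quoted above: a minimal sketch of why `select a, max(b) + 1 from table group by a` needs a projection on top of the aggregation. Everything here (`Expr`, `Plan`, `split`, the `agg0` column naming) is a hypothetical toy model invented for illustration; it is not the Spark Catalyst or substrait-java API and not necessarily how ToSubstraitRel does it.

```scala
// Toy plan model. All names are hypothetical, NOT Spark or Substrait classes.
object AggregateSplitSketch {
  sealed trait Expr
  final case class Col(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Max(child: Expr) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  sealed trait Plan
  final case class Table(name: String) extends Plan
  final case class Aggregate(groupBy: Seq[Expr], measures: Seq[Expr], child: Plan) extends Plan
  final case class Project(exprs: Seq[Expr], child: Plan) extends Plan

  import scala.collection.mutable.ArrayBuffer

  // Replace each aggregate call inside `e` with a reference to an output
  // column of the Aggregate, collecting the extracted calls in `acc`.
  private def extract(e: Expr, acc: ArrayBuffer[Expr]): Expr = e match {
    case m: Max =>
      val existing = acc.indexOf(m)
      val idx =
        if (existing >= 0) existing
        else {
          acc += m
          acc.size - 1
        }
      Col(s"agg$idx")
    case Add(l, r) => Add(extract(l, acc), extract(r, acc))
    case other => other
  }

  // Split a SQL-style aggregate output list into a bare Aggregate (grouping
  // keys plus plain measures) and a Project that computes the rest on top.
  def split(groupBy: Seq[Expr], output: Seq[Expr], child: Plan): Plan = {
    val measures = ArrayBuffer.empty[Expr]
    val projected = output.map(extract(_, measures))
    Project(projected, Aggregate(groupBy, measures.toList, child))
  }

  def main(args: Array[String]): Unit = {
    // select a, max(b) + 1 from table group by a
    val plan = split(
      Seq(Col("a")),
      Seq(Col("a"), Add(Max(Col("b")), Lit(1))),
      Table("table"))
    println(plan)
  }
}
```

In this toy model the aggregate node only carries `max(b)`, and the `+ 1` is evaluated by the projection over the aggregate's output. Rollup and grouping sets, the TODO mentioned in the comment, are not modelled here.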

andrew-coleman · 2024-06-25T13:17:53Z

Pleased to see the positive discussion in the 2024-06-19 Substrait Community Sync Meeting regarding this PR. Is there anything I can do to help progress this?

vbarua

I think it makes sense for this code to live in substrait-java. I took a cursory pass at the code and it seems fine. While it could be nicer, I think it's the kind of thing that we can improve over time, and I don't want to block the initial PR.

I can potentially help with reviews of this in the future, though it's been ~8 years since I touched Scala, and that was a very different style (think cats-effect). Which is to say it could take me a bit to ramp back up.

I did have one question about the build which I left a comment on.

vbarua · 2024-06-25T14:57:15Z

spark/src/main/resources/spark.yml

+# limitations under the License.
+%YAML 1.2
+---
+scalar_functions:


meta/future: it's potentially desirable to bring extensions like this into the core spec, either as core extensions or spark specific functions. Not for today though.

spark/build.gradle.kts

vbarua · 2024-06-25T16:40:52Z

spark/src/main/scala/org/apache/spark/substrait/SparkTypeUtil.scala

+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.substrait


Another minor thing I just noticed. Both this class and ToSubstraitType are under org.apache.spark.substrait instead of io.substrait.spark like everything else. Is that intentional?

It appears that in SparkTypeUtil, the sameType method is accessing a private member of org.apache.spark.sql.types.DataType 🤮 so it has to be in the org.apache.spark namespace.
The other file though doesn't look like it needs to be there so I have moved it into io.substrait.spark.
If we're happy with this for now, I'll try and recode this bit at a later date.

That's definitely 🤮, but we can improve that in the future. Thanks for looking into it.
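
For context on the namespace point above: a minimal sketch of how Scala's package-qualified `private` forces a caller to live inside the library's own package tree. The packages `lib.types`, `lib.bridge` and `app`, and the toy `DataType` class, are hypothetical stand-ins. The sketch assumes the restriction on Spark's `sameType` is of this package-qualified kind, which the thread does not spell out.

```scala
// Hypothetical packages standing in for a library and its callers.
package lib.types {
  class DataType(val name: String) {
    // Visible only to code inside package `lib` and its sub-packages.
    private[lib] def sameType(other: DataType): Boolean = name == other.name
  }
}

package lib.bridge {
  import lib.types.DataType

  // Declared under `lib`, so the qualified-private member is accessible here.
  object TypeUtil {
    def sameType(a: DataType, b: DataType): Boolean = a.sameType(b)
  }
}

package app {
  import lib.types.DataType

  object Main {
    def main(args: Array[String]): Unit = {
      val a = new DataType("int")
      val b = new DataType("int")
      // Calling a.sameType(b) directly would not compile here,
      // because package `app` is outside `lib`.
      println(lib.bridge.TypeUtil.sameType(a, b))
    }
  }
}
```

The bridge object plays the role that a utility class placed in the library's namespace would play: callers elsewhere go through it instead of touching the restricted member.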

andrew-coleman · 2024-06-26T08:32:45Z

I've rebased this on the latest commit in main to fix the last build break (getDfsNames() was removed recently).

andrew-coleman force-pushed the substrait-spark branch from f7a5ed7 to 157276e Compare June 18, 2024 15:14

EpsilonPrime reviewed Jun 19, 2024

vibhatha reviewed Jun 19, 2024

vbarua approved these changes Jun 25, 2024

andrew-coleman force-pushed the substrait-spark branch from be74cd8 to a04d72d Compare June 25, 2024 15:52

vbarua reviewed Jun 25, 2024

andrew-coleman added 2 commits June 26, 2024 07:38

feat: port of substrait-spark module from Gluten

fe2505c

This module was part of the gluten project and subsequently removed. It is useful for converting spark query plans to and from substrait.

Signed-off-by: andrew-coleman <andrew_coleman@uk.ibm.com>

feat: address PR feedback

6394dfc

- uncomment unit tests
- move ToSubstraitType class to package io.substrait.spark

Signed-off-by: andrew-coleman <andrew_coleman@uk.ibm.com>

andrew-coleman force-pushed the substrait-spark branch from a04d72d to 6394dfc Compare June 26, 2024 08:28

EpsilonPrime approved these changes Jun 26, 2024

vbarua merged commit 8537dca into substrait-io:main Jun 26, 2024
12 checks passed

andrew-coleman deleted the substrait-spark branch June 27, 2024 07:32

nielspardon mentioned this pull request Jun 27, 2024

Add spark submodule to convert SparkSql's LogicalPlan to Substrait Rel. #90

Open
