Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port of substrait-spark module from Gluten #271

Merged
merged 2 commits into from
Jun 26, 2024

Conversation

andrew-coleman
Copy link
Contributor

As discussed in the chain of comments here, I’m offering up this PR for your consideration.

I can take no credit for its implementation, it is a copy of the substrait-spark module that was part of the gluten repo before it was removed, with a few minor corrections and additions. We are using this as part of a project here (IBM) and would like to see it remain available under the Apache ecosystem.

It is useful for converting spark query plans to and from substrait.

@CLAassistant
Copy link

CLAassistant commented Jun 18, 2024

CLA assistant check
All committers have signed the CLA.

Copy link
Member

@EpsilonPrime EpsilonPrime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this will be one of the topics for the community meeting later today if you're interested.

}
developers {
developer {
// TBD Get the list of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we should have this somewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copied the pom section from the other build files in the project. We could just remove the developer section entirely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth noting that this will be the first instance of Scala in substrait-java. I'll let those that use this code more often decide if that's a bad thing or not.

Side question: If this code was magically available in Java would it still be usable by existing callers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I didn't know whether this would be an issue. The other PR that I referenced was also written in scala, and that didn't seem to be an issue in any of the comments. This code originally came from project gluten, which is all in scala. Spark itself is also written in scala.
We are using this library from java without any issues.

Side question: If this code was magically available in Java would it still be usable by existing callers?

I don't think it would make any difference from an end user point of view.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interested party here!

I think we needed this PR for so long. And it won't be possible to do this in Java (AFAIU with low level Spark APIs), and creating a substrait-scala just for this doesn't make much sense. I guess it would be fine to add this code here.

cc @vbarua

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To echo @andrew-coleman comments - the only issue I've seen with calling the library from Java is that following the links in VSCode drops me into decompiled Scala code; not that readable in my view.

But that's it - so all together not bad at all :-)

assertSqlSubstraitRelRoundTrip(
"select l_partkey from lineitem where l_shipdate < date '1998-01-01' " +
"order by l_shipdate asc, l_discount desc")
// assertSqlSubstraitRelRoundTrip(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea why these are commented out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. There are three tests commented out here. When I uncommented them, the first one failed (because the offset clause has not been implemented yet) but the other two passed. So I'll push a new commit with the comments removed and the failing test in a separate ignore block.

There are a few tests that are ignored due to unimplemented features. I plan to work on these in the coming weeks.

@Blizzara
Copy link
Contributor

FWIW, we also have an internal fork of this, and I've done some amount of work on it. If this gets included here, I'll be happy to add in my changes as relevant; otherwise I'll look at some point at opensourcing our fork.

It's worth noting that this will be the first instance of Scala in substrait-java. I'll let those that use this code more often decide if that's a bad thing or not.

There are likely some annoyances it may cause (like deciding/aligning the code formatting for both), but as long as this is a separate subproject it should not affect the Java-only parts in any way really afaik.

Side question: If this code was magically available in Java would it still be usable by existing callers?

Probably yes, but given Spark (esp. Catalyst) is written mostly in Scala, it's just much easier to traverse the Spark plans in Scala than in Java.
The current Scala implementation is directly usable from within Java consumers, so that's not an issue.

@andrew-coleman
Copy link
Contributor Author

FWIW, we also have an internal fork of this, and I've done some amount of work on it. If this gets included here, I'll be happy to add in my changes as relevant; otherwise I'll look at some point at opensourcing our fork.

Awesome! Then we should combine our efforts.

@mbwhite
Copy link
Contributor

mbwhite commented Jun 19, 2024

Probably worth adding that I've got several examples of using this library with Spark; need to get these approved to be able to share them, currently writing up docs for this at present.

* a, max(b) + 1 from table group by a</code>, We need create [[Project]] on top of [[Aggregate]]
* to correctly support it.
*
* TODO: support [[Rollup]] and [[GroupingSets]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can create issues for these to be added later?

@andrew-coleman
Copy link
Contributor Author

Pleased to see the positive discussion in the 2024-06-19 Substrait Community Sync Meeting regarding this PR. Is there anything I can do to help progress this?

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense for this code to live in substrait-java. I took a cursory pass at the code and it seems fine. While it could be nicer, I think it's the kind of thing that we can improved with time and I don't want to block the initial PR.

I can potentially help with reviews of this in the future, though it's been ~8 years since I touched Scala, and that was a very different style (think cats-effect). Which is to say it could take me a bit to ramp back up.

I did have one question about the build which I left a comment on.

# limitations under the License.
%YAML 1.2
---
scalar_functions:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meta/future: it's potentially desirable to bring extensions like this into the core spec, either as core extensions or spark specific functions. Not for today though.

spark/build.gradle.kts Show resolved Hide resolved
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.substrait
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another minor thing I just noticed. Both this class and ToSubstraitType are under org.apache.spark.substrait instead of io.substrait.spark like everything else. Is that intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that in SparkTypeUtil, the sameType method is accessing a private member of org.apache.spark.sql.types.DataType 🤮 so it has to be in the org.apache.spark namespace.
The other file though doesn't look like it needs to be there so I have moved it into io.substrait.spark.
If we're happy with this for now, I'll try and recode this bit at a later date.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's definitely 🤮, but we can improve that in the future. Thanks for looking into it.

This module was part of the gluten project and subsequently removed.
It is useful for converting spark query plans to and from substrait.

Signed-off-by: andrew-coleman <andrew_coleman@uk.ibm.com>
- uncomment unit tests
- move ToSubtraitType class to package io.substrait.spark

Signed-off-by: andrew-coleman <andrew_coleman@uk.ibm.com>
@andrew-coleman
Copy link
Contributor Author

I've rebased this on latest commit in main to fix the last build break (getDfsNames() was removed recently)

@vbarua vbarua merged commit 8537dca into substrait-io:main Jun 26, 2024
12 checks passed
@andrew-coleman andrew-coleman deleted the substrait-spark branch June 27, 2024 07:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants