chore: generate basic spark function tests #16409
Conversation
Not sure if it makes sense to commit the script I used, so I'll paste it here for now:
Thanks @shehabgamin. I'd appreciate it if you could provide more insight into the PR details. It looks like a lot of tests are commented out, mostly from the Spark domain, and the PR rationale didn't give enough detail. 🤔
@comphead Sorry about that, I just updated the PR description!
datafusion/sqllogictest/test_files/spark/datetime/current_timezone.slt
Got it @shehabgamin. I'm seeing a lot of [...] which is not very explanatory. So the test takes recent samples from internal Spark examples and runs them, comparing the actual (Spark-provided) and expected values.
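For readers following along, each generated test is an ordinary sqllogictest block: a query, a `----` separator, and the expected (Spark-provided) output. A hypothetical example for Spark's `ascii` function, which returns the ASCII code of the first character (here `'S'` = 83):

```
query I
SELECT ascii('Spark');
----
83
```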
This is how to get Spark examples from their internal repo:
@comphead I fixed the script, nice catch!
Sail generates Gold data tests using tests extracted directly from the Spark code base. However, we cannot directly port the exact SQL queries from Sail, because DataFusion and Spark interpret SQL data types differently (Sail speaks Spark SQL). So, when creating the [...]
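For context, the overall shape of such a gold-data-to-.slt conversion can be sketched as follows. This is a minimal illustration, not the actual script from this PR; the field names (`query`, `result`, `function`) and the one-file-per-function layout are assumptions about the gold data schema:

```python
import json
from pathlib import Path

def gold_record_to_slt(record: dict) -> str:
    """Render one gold-data record as a commented-out sqllogictest block."""
    block = [
        "query T",  # the column-type line is a placeholder; real types vary
        record["query"].rstrip(";") + ";",
        "----",
        str(record["result"]),
    ]
    # Emit the test commented out, to be enabled once the function exists.
    return "\n".join("# " + line for line in block) + "\n"

def convert(gold_file: Path, out_dir: Path) -> None:
    """Convert a gold-data JSON file into one .slt file per function."""
    records = json.loads(gold_file.read_text())
    out_dir.mkdir(parents=True, exist_ok=True)
    for rec in records:
        # Append each record to the .slt file for its function (assumption).
        path = out_dir / f"{rec['function']}.slt"
        with path.open("a", encoding="utf-8") as f:
            f.write(gold_record_to_slt(rec) + "\n")
```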
I'm personally intrigued, tbh, but I'd say the DF core should be agnostic of specific data-driven architecture (like Spark), even if we do a lot of Spark integration like Sail or Comet. IMHO, the data-driven arch should live and be addressed in some bridge project which takes care of Apache Spark specifics compared to DF, including INT96, decimals, nested types, some null handling, etc. @alamb @andygrove as initiators, WDYT?
It is my opinion that both standalone .slt tests (such as those in this PR) and more substantial "run queries in both systems and compare" style integration tests (such as those in the Comet repo) are needed.

The value of standalone .slt tests is that they keep the barrier to contribution low(er) -- the hope is that we'll get the function library moved / ported over with community help, and having the tests waiting will keep the context required of new contributors low. As valuable as tests that actually start Spark, etc. (as done in Comet) are, putting them in the main DataFusion repo would make it even harder to contribute to DataFusion, due to having to set up the dependencies, understand Spark errors, etc.

@comphead if you are suggesting creating a new repository / project for running the integration tests, I think that is quite an interesting idea, and maybe we can make a separate ticket.

BTW, there is quite a bit more discussion about the Spark functions testing strategy here, for anyone else following along.

Suggestion

What I suggest we do is update the README.md file https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/spark/README.md, explaining what is going on. Maybe we can add a section something like the following
Thank you so much @shehabgamin
I think this is a great step forward -- I had a few suggestions (for more comments / readme) but we could also do that as a follow-on PR.
@comphead I hope the rationale for .slt-only tests makes sense. I agree it is not sufficient for complete testing, but I feel it is enough to get the process started.
We should not merge this PR until we get consensus.
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
I recommend adding another header to each of these files to make them more self-describing (and direct people to the tickets for help).
Perhaps something like:
# This file was originally created by <script_name> from <gold file>
# and is part of implementing the datafusion-spark function library
# please see https://github.com/apache/datafusion/issues/15914
# for more information
Ahh great idea!
## Implementation Status

Implementing the `datafusion-spark` compatible functions project is still a work in progress.
Many of the tests in this directory are commented out and are waiting for help with implementation.
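As an illustration of what "commented out" means here, a pending test in one of these files looks like the following (a hypothetical `factorial` example); removing the leading `#` markers enables it once the corresponding function is implemented:

```
# query I
# SELECT factorial(5);
# ----
# 120
```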
@shehabgamin On a separate note, if we are planning for people to help contribute to this crate, we probably need to attach documentation or a sample PR showing how to implement a function and test it with DF and Spark?
@comphead Yep! The next step is to make a PR that shows a function being ported over and enabling the relevant .slt file.
Co-authored-by: Oleks V <comphead@users.noreply.github.com>
lgtm thanks @shehabgamin
Thanks again @shehabgamin and @comphead
Which issue does this PR close?
datafusion-spark Spark Compatible Functions #15914

Rationale for this change
Following up on the discussion with @andygrove, @alamb, and @linhr (see #15914 (comment)), the goal here is to provide contributors with basic tests that can serve as a reference and starting point when implementing Spark functions.
The script I ran does not extract all the function tests from Sail's gold data, as creating a comprehensive script to generate each .slt test would have taken too long. As a result, the tests added in this PR represent only a subset of the gold data tests. Future work can improve upon the script.

What changes are included in this PR?
Test files.
Are these changes tested?
N/A.
Are there any user-facing changes?
No.