Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark 3.4.0 and DBR 12.2 support #699

Merged
merged 26 commits into from
Jun 2, 2023

Conversation

chris-twiner
Copy link
Contributor

@chris-twiner chris-twiner commented Apr 6, 2023

#698 - until 3.4.0 is released it won't compile

Closes #698
Closes #506

@chris-twiner chris-twiner changed the title Temp/3.4.0 support 3.4.0 (and DBR 12.2) support Apr 6, 2023
@chris-twiner
Copy link
Contributor Author

fyi - I've tested with #700 as well on the final cut, would you like me to roll #700 (doCodegen instead of codegen) into this PR?

@pomadchin
Copy link
Member

@chris-twiner yes, please! could you also adjust the build slightly so the build is green?

Sorry for some delays in reviews, I'll help you to make it merged once I have some time!

@pomadchin
Copy link
Member

@chris-twiner could you rebase on top of main branch? (I can do it too, but it may lead to issues on your machine)

@pomadchin
Copy link
Member

Fixed builds here 🤦 #709

@chris-twiner
Copy link
Contributor Author

the builds weren't broken per se, the generated yml was, per 186fa2d it was due to windows . Now it shows up an actual test failure on KMeansTest - will start looking at that.

@chris-twiner
Copy link
Contributor Author

@chris-twiner yes, please! could you also adjust the build slightly so the build is green?

Sorry for some delays in reviews, I'll help you to make it merged once I have some time!

re delays - np

@chris-twiner
Copy link
Contributor Author

the KMeans issue is probably due to https://issues.apache.org/jira/browse/SPARK-30661 as that's the code that is throwing. There are two issues, one is sometimes:

requirement failed: Vector should have dimension larger than zero.

the other due to centerSquaredNorms having a size of 1 but k is 2. I try setting k to 1 and it states that's invalid for this input vector.

I can attempt to identify it further but this is a bit out of my wheelhouse. I can only assume, given I find nothing else on it in the jiras, that either it's really a new 3.4 bug or the test inputs were broken and 3.4 doesn't tolerate it.

@chris-twiner
Copy link
Contributor Author

chris-twiner commented May 5, 2023

It's a "bug" in the tests, pre 3.4 was tolerant of the Arbitrary data, that is not the case in 3.4. I've managed to prove the extra code I've added to the test works in intellij for 3.4 and, when manually swapping out versions in the build, 3.3.2, 32.3 and 3.1.3. However running directly from the build does not work, neither does it on CI. The tests fail at:

[info] GroupByTests:
[info] - groupByMany('a).agg(sum('b))
[info] - agg(sum('a))
[info] - agg(sum('a), sum('b))
[info] - agg(sum('a), sum('b), sum('c))
[info] - agg(sum('a), sum('b), min('c), max('d))
[info] - groupBy('a).agg(sum('b))
[info] - groupBy('a).mapGroups('a, sum('b))
[info] - groupBy('a).agg(sum('b), sum('c)) to groupBy('a).agg(sum('a), sum('b), sum('a), sum('b), sum('a)) *** FAILED ***
[info]   IllegalStateException was thrown during property evaluation.
[info]     Message: Cannot call methods on a stopped SparkContext.
[info]   This stopped SparkContext was created at:

which is likely due to the mapGroups workaround. <-- It was not, and more likely the shutdown of the job

@codecov
Copy link

codecov bot commented May 5, 2023

Codecov Report

Merging #699 (89c5cd5) into master (0d72664) will increase coverage by 0.50%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #699      +/-   ##
==========================================
+ Coverage   95.01%   95.52%   +0.50%     
==========================================
  Files          65       67       +2     
  Lines        1184     1184              
  Branches       28       37       +9     
==========================================
+ Hits         1125     1131       +6     
+ Misses         59       53       -6     
Flag Coverage Δ
2.12.16 95.52% <100.00%> (+0.50%) ⬆️
2.13.10 96.13% <100.00%> (+0.51%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ataset/src/main/scala/frameless/TypedEncoder.scala 97.42% <ø> (+0.02%) ⬆️
...taset/src/main/scala/frameless/functions/Udf.scala 88.23% <ø> (ø)
...aset/src/main/scala/frameless/ops/GroupByOps.scala 98.21% <ø> (ø)
dataset/src/main/pre34/frameless/MapGroups.scala 100.00% <100.00%> (ø)
...taset/src/main/scala/frameless/functions/Lit.scala 100.00% <100.00%> (+5.55%) ⬆️
...cala/org/apache/spark/sql/FramelessInternals.scala 96.15% <100.00%> (+16.84%) ⬆️
dataset/src/main/spark34/frameless/MapGroups.scala 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

…arrays are derived in TypedEncoder so no need for that either
@chris-twiner
Copy link
Contributor Author

based on coverage results the scalareflection stuff in FramelessInternals is trimmed down to a bare minimum. I have also removed the column (_.expr) function as it's not used anywhere.

The test failure here I've also fixed for rlike to check for a valid generated regex.

@chris-twiner
Copy link
Contributor Author

wrt KMeansTest failure, the euclideanUpdaeInPlace ArrayIndexOutOfBoundsException now occurs in 45 out of 1000 runs it seems (i.e. out of 10,000 maximum actual checks ~0.45 -> 4.5% of the time).

Running the average tests, similarly failure didn't occur in 500(0) runs (so prob far less than 0.01% occurrence). The actual input was described in the exception as a List of strings rather than a List of Longs, which is odd, I cannot reproduce.

I don't know that it makes sense to try and resolve these tests further given the low percentage occurrence and that the "issues" are not in frameless code. As such I've added a wrapper to retry in the case of that specific issue for kmeans and avg. Alternatively the spark test approach could be used instead of scalacheck for kmeans, just consting the vectors.

@chris-twiner
Copy link
Contributor Author

as review hasn't started yet I hope this approach is ok - via imports only, there was too much code change for something that should be transparent.

Note the two lines of code shown as untested in previous patches (code gen for DisambiguiateLeft/Right) were never tested, I've forced the issue.

@chris-twiner
Copy link
Contributor Author

@pomadchin - sorry I just realised you may not have been alerted it was green - hopefully this does alert you, since there are other releases in between the mima stuff may need version bumps though.

@pomadchin
Copy link
Member

@chris-twiner yes!! Thanks for pointing it out; I’ll hopefully take a closer look until this Friday, or the beginning of the next week.

@pomadchin
Copy link
Member

It's been a holiday weekend! Will return to this one this week, sorry for a delay!

@chris-twiner
Copy link
Contributor Author

np, fyi - 3.5.0 doesn't seem to have any issues with this snapshot.

@pomadchin pomadchin force-pushed the temp/3.4.0_support branch 5 times, most recently from 71180e3 to 9228809 Compare June 1, 2023 22:37
}

check(prop)
check(prop3[Double])
tolerantRun( _.isInstanceOf[ArrayIndexOutOfBoundsException] ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without the tolerantRun it fails often enough on 3.4 to stop the build

Copy link
Member

@pomadchin pomadchin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! A nice approach. I cleaned up the build file a bit but rather, than that the change looks good. Thanks for the incredible work!

@pomadchin
Copy link
Member

Created a follow up issue #717

@pomadchin pomadchin force-pushed the temp/3.4.0_support branch 2 times, most recently from 52851e3 to 6d1c69c Compare June 1, 2023 23:22
@pomadchin pomadchin changed the title 3.4.0 (and DBR 12.2) support Spark 3.4.0 and DBR 12.2 support Jun 2, 2023
@pomadchin
Copy link
Member

Thanks for a great contribution one more time! 🔥

@pomadchin pomadchin merged commit f1fcef1 into typelevel:master Jun 2, 2023
7 checks passed
@chris-twiner
Copy link
Contributor Author

Thanks for a great contribution one more time! 🔥

As with the original 3.x you are mtw!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

spark 3.4 support - replacing dataTypeFor logic Fix compatibility issues with the Databricks runtime
2 participants