
Feat: Support Spark 4.0.0 part1 #1830


Open · wants to merge 14 commits into main

Conversation

huaxingao
Contributor

@huaxingao huaxingao commented Jun 2, 2025

Which issue does this PR close?

Closes #.

Rationale for this change

Adding shim files for Spark 4.0.0 support

What changes are included in this PR?

How are these changes tested?

@andygrove
Member

I ran into compilation issues locally when building for Spark 3.4. I think these can be resolved simply by moving the new spark-3.5 shims into the spark-3.x shim folder instead.

Also, some jobs fail with scalastyle errors, which can be fixed by running make format.

@@ -74,7 +74,7 @@ public static ColumnDescriptor convertToParquet(StructField field) {
       builder = Types.primitive(PrimitiveType.PrimitiveTypeName.INT64, repetition);
     } else if (type == DataTypes.BinaryType) {
       builder = Types.primitive(PrimitiveType.PrimitiveTypeName.BINARY, repetition);
-    } else if (type == DataTypes.StringType) {
+    } else if (type == DataTypes.StringType || type.sameType(DataTypes.StringType)) {
Member


I'm not familiar with the sameType method. Why is it needed for strings but not for the other types?

Contributor Author


This is to support string collation in Spark 4.0 (e.g. test("Check order by on table with collated string column")).

Without collation, parsing goes through https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala#L89, which uses the singleton DataTypes.StringType, so type == DataTypes.StringType holds. With collation, it goes through https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala#L94 instead, which constructs a non-singleton instance, so the reference-equality check fails. I added type.sameType(DataTypes.StringType) to let collated strings pass. Actually, I think it should be

else if (
  type == DataTypes.StringType ||
  (type.sameType(DataTypes.StringType) && isSpark40Plus())
)
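The failure mode can be illustrated without Spark. In the sketch below, StringType is a hypothetical stand-in for Spark 4.0's collation-aware string type — the real class and the real sameType semantics live in Spark; everything here is assumed purely for illustration:

```java
// Hypothetical stand-in for Spark 4.0's collation-aware StringType;
// names and semantics are illustrative, not Spark's actual classes.
final class StringType {
    // Mirrors the DataTypes.StringType singleton (default collation).
    static final StringType DEFAULT = new StringType("UTF8_BINARY");

    final String collation;

    StringType(String collation) {
        this.collation = collation;
    }

    // Collation-insensitive comparison: any two StringType instances
    // are treated as the same underlying SQL type.
    boolean sameType(StringType other) {
        return other != null;
    }
}

class CollationCheck {
    // The predicate shape from the patch: identity check first, then the
    // collation-insensitive fallback for non-singleton instances.
    static boolean isStringType(StringType type) {
        return type == StringType.DEFAULT || type.sameType(StringType.DEFAULT);
    }
}
```

A collated instance is a fresh object, so the == check alone misses it, while the sameType fallback still accepts it.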

@codecov-commenter

codecov-commenter commented Jun 5, 2025

Codecov Report

Attention: Patch coverage is 0% with 14 lines in your changes missing coverage. Please review.

Project coverage is 58.70%. Comparing base (f09f8af) to head (f13351b).
Report is 261 commits behind head on main.

Files with missing lines Patch % Lines
...c/main/java/org/apache/comet/parquet/TypeUtil.java 0.00% 5 Missing and 2 partials ⚠️
...ffle/comet/CometBoundedShuffleMemoryAllocator.java 0.00% 6 Missing ⚠️
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1830      +/-   ##
============================================
+ Coverage     56.12%   58.70%   +2.57%     
- Complexity      976     1139     +163     
============================================
  Files           119      130      +11     
  Lines         11743    12795    +1052     
  Branches       2251     2406     +155     
============================================
+ Hits           6591     7511     +920     
- Misses         4012     4060      +48     
- Partials       1140     1224      +84     


@andygrove
Member

andygrove commented Jun 5, 2025

The Spark version will also need to be updated in .github/workflows/spark_sql_test_ansi.yml. It currently has:

spark-version: [{short: '4.0', full: '4.0.0-preview1'}]
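Presumably the matrix entry would move to the GA artifact; a sketch, with the exact file layout assumed from the quoted line:

```yaml
# .github/workflows/spark_sql_test_ansi.yml (sketch)
spark-version: [{short: '4.0', full: '4.0.0'}]
```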

Contributor

@parthchandra parthchandra left a comment


Just some minor comments.

Contributor

@YanivKunda YanivKunda left a comment

A few dependency alignments needed as part of the Spark 4.0.0 release.

pom.xml Outdated
@@ -600,7 +600,7 @@ under the License.
<!-- Use Scala 2.13 by default -->
<scala.version>2.13.14</scala.version>
Contributor


Scala was upgraded in the 4.0.0 release version:

Suggested change
- <scala.version>2.13.14</scala.version>
+ <scala.version>2.13.16</scala.version>

pom.xml Outdated
@@ -600,7 +600,7 @@ under the License.
 <!-- Use Scala 2.13 by default -->
 <scala.version>2.13.14</scala.version>
 <scala.binary.version>2.13</scala.binary.version>
-<spark.version>4.0.0-preview1</spark.version>
+<spark.version>4.0.0</spark.version>
 <spark.version.short>4.0</spark.version.short>
 <parquet.version>1.13.1</parquet.version>
Contributor


Parquet was upgraded in the 4.0.0 release version:

Suggested change
- <parquet.version>1.13.1</parquet.version>
+ <parquet.version>1.15.2</parquet.version>

pom.xml Outdated
@@ -600,7 +600,7 @@ under the License.
 <!-- Use Scala 2.13 by default -->
 <scala.version>2.13.14</scala.version>
 <scala.binary.version>2.13</scala.binary.version>
-<spark.version>4.0.0-preview1</spark.version>
+<spark.version>4.0.0</spark.version>
 <spark.version.short>4.0</spark.version.short>
 <parquet.version>1.13.1</parquet.version>
 <semanticdb.version>4.9.5</semanticdb.version>
Contributor


If Scala is upgraded, this probably needs to be upgraded to:

  • at least 4.12.0: the first version to support scala 2.13.16
  • 4.12.7: the latest 4.12.x version
  • 4.13.6: the latest of the latest

Note: this might need to be also aligned in the scala-2.13 profile!
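Taking the "latest 4.12.x" option from the list above, the property bump might look like the following sketch (and, as noted, the scala-2.13 profile may define the same property and need the same change):

```xml
<!-- sketch: 4.12.7 per the options listed above -->
<semanticdb.version>4.12.7</semanticdb.version>
```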

Contributor


And also, because I can't comment on the next line -
slf4j was upgraded in the 4.0.0 release version, to version 2.0.16
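If the project exposes this as a POM property (the property name here is assumed, not taken from the repo), the alignment would be along the lines of:

```xml
<!-- sketch: property name assumed; aligns slf4j with Spark 4.0.0 -->
<slf4j.version>2.0.16</slf4j.version>
```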

Contributor

@YanivKunda YanivKunda left a comment

Found some files that might have been left over from local operations -
should probably be deleted.

Contributor


macOS local file

Contributor


macOS local file

Contributor


looks like a temporary diff patch file (I know it sounds like "salsa sauce", but it probably shouldn't be here at all)

Contributor


macOS local file

5 participants