[FEA] Add nested struct support for comparison operations #8964

revans2 · 2021-08-05T13:56:27Z

Is your feature request related to a problem? Please describe.
For Spark we are pushing to get more support for structs in a number of operators. We already have some support for sorting structs, so we should be able to come up with a way to do comparisons of nested structs too. NOTE: this does not include lists as children of the structs, just structs that contain basic types (including strings) and other structs.

The operations we would like to support include the BINARY ops EQUAL, NOT_EQUAL, LESS, GREATER, LESS_EQUAL, GREATER_EQUAL, NULL_EQUALS, and if possible NULL_MAX and NULL_MIN.
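As a rough illustration of how the null-aware operators differ from the plain ones at the top level, here is a minimal Scala sketch. It is not cudf code. `NullAwareOps` and its methods are hypothetical names, `Option` stands in for a nullable row, and the semantics are an assumption: EQUAL yields null when either input is null, NULL_EQUALS never yields null, and NULL_MAX/NULL_MIN return the non-null operand when exactly one side is null.

```scala
// Hypothetical model: None stands for a null row, Some(x) for a valid one.
object NullAwareOps {
  // EQUAL (assumed semantics): the result is null if either input is null.
  def equal[A](a: Option[A], b: Option[A]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // NULL_EQUALS (assumed semantics): never null; two nulls compare equal.
  def nullEquals[A](a: Option[A], b: Option[A]): Boolean = a == b

  // NULL_MAX (assumed semantics): max when both are valid, the valid operand
  // when exactly one is null, and null when both are null.
  def nullMax[A](a: Option[A], b: Option[A])(implicit ord: Ordering[A]): Option[A] =
    (a, b) match {
      case (Some(x), Some(y)) => Some(ord.max(x, y))
      case _                  => a.orElse(b)
    }

  // NULL_MIN (assumed semantics): mirror image of NULL_MAX.
  def nullMin[A](a: Option[A], b: Option[A])(implicit ord: Ordering[A]): Option[A] =
    (a, b) match {
      case (Some(x), Some(y)) => Some(ord.min(x, y))
      case _                  => a.orElse(b)
    }
}
```

With a two-field struct modeled as a tuple, `NullAwareOps.nullMax(Option((1, 2)), Option((1, 3)))` picks `(1, 3)` using Scala's built-in lexicographic tuple ordering, and `NullAwareOps.nullMax(Option.empty[(Int, Int)], Option((1, 3)))` falls back to the valid operand.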

This should follow the same pattern we have supported for sorting, with the order of precedence for the children in a struct going from first to last. In this case we would like nulls within the struct columns to be less than other values, but equal to each other, meaning Struct(null) is less than Struct(5) and Struct(null) == Struct(null). Nulls at the top level still depend on the operator being performed. For NULL_EQUALS, nulls are equal to each other.
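The requested ordering for children can be sketched as a recursive three-way comparison. This is only a model of the rules described here, not the cudf implementation. `Value`, `NullValue`, `IntValue`, `StructValue`, and `StructCompare` are hypothetical names, both sides are assumed to share the same schema, and top-level nulls are assumed to be handled by the caller according to the operator.

```scala
// Hypothetical model of a value inside a struct column; not a cudf or Spark type.
sealed trait Value
case object NullValue extends Value
final case class IntValue(v: Int) extends Value
final case class StructValue(children: Seq[Value]) extends Value

object StructCompare {
  // Returns a negative, zero, or positive Int, like java.util.Comparator.
  def compare(a: Value, b: Value): Int = (a, b) match {
    case (NullValue, NullValue) => 0  // nulls inside a struct are equal to each other
    case (NullValue, _)         => -1 // and less than any non-null value
    case (_, NullValue)         => 1
    case (IntValue(x), IntValue(y)) => Integer.compare(x, y)
    case (StructValue(xs), StructValue(ys)) =>
      // Children are compared first to last; the first difference decides.
      xs.zip(ys).iterator
        .map { case (x, y) => compare(x, y) }
        .find(_ != 0)
        .getOrElse(0)
    case _ =>
      throw new IllegalArgumentException("both sides must have the same schema")
  }
}
```

Under this model, `StructCompare.compare(StructValue(Seq(NullValue)), StructValue(Seq(IntValue(5))))` is negative and `StructCompare.compare(StructValue(Seq(NullValue)), StructValue(Seq(NullValue)))` is zero, matching the Struct(null) examples in the request.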

Describe the solution you'd like
It would be great if we could do this as regular binary ops, but if we need them to be separate APIs that works too. If null equality, etc. needs to be configurable for the Python APIs, a separate API is fine.

Describe alternatives you've considered
We could flatten the struct columns ourselves and do a number of different operations to combine the results back together to get the right answer. But cudf already has a flatten method behind the scenes, so why replicate that when others could benefit from it too?
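For a sense of what that manual workaround involves, here is a minimal Scala sketch that computes a row-wise LESS over flattened child columns. It is an illustration only. `FlattenedLess` is a hypothetical name, each child column is modeled as a `Vector[Int]`, and nulls are ignored entirely.

```scala
object FlattenedLess {
  // lhs and rhs each hold the flattened child columns of one struct column
  // (hypothetical layout): one Vector[Int] per child, all of the same length.
  def less(lhs: Seq[Vector[Int]], rhs: Seq[Vector[Int]]): Vector[Boolean] = {
    val rows = lhs.head.length
    // If every child is equal, the struct is not less than the other.
    val allEqual = Vector.fill(rows)(false)
    // Fold from the last child to the first so earlier children take precedence:
    // less = (l < r) || (l == r && less-of-remaining-children)
    lhs.zip(rhs).foldRight(allEqual) { case ((l, r), rest) =>
      Vector.tabulate(rows)(i => l(i) < r(i) || (l(i) == r(i) && rest(i)))
    }
  }
}
```

Each child needs one less-than pass, one equality pass, and a combine step, and the other comparison operators would each need their own variant, which is the duplication this request is trying to avoid.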


github-actions · 2021-11-15T21:04:49Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

sameerz · 2021-12-21T02:03:43Z

This is still required

bdice · 2022-06-07T02:36:13Z

This code snippet demonstrates some behavior with NaNs that I investigated with @rwlee. tl;dr Spark treats NaN the same in binary operators <, <=, ==, ... as in the comparators <, == used for sorting and equality. This follows the rules in #4760 but with elementwise comparison of structs.

Show snippet

Save as binops.scala and run with: $ spark-shell -i binops.scala

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DoubleType, StructType}
import org.apache.spark.sql.Row

val schema = new StructType()
  .add("struct1", new StructType()
    .add("x", DoubleType)
    .add("y", DoubleType))
  .add("struct2", new StructType()
    .add("x", DoubleType)
    .add("y", DoubleType))

val v1 = 1.0
val v2 = Double.NaN

val structData = Seq(
  Row(Row(v1, v1), Row(v1, v1)),
  Row(Row(v1, v1), Row(v1, v2)),
  Row(Row(v1, v1), Row(v2, v1)),
  Row(Row(v1, v1), Row(v2, v2)),
  Row(Row(v1, v2), Row(v1, v1)),
  Row(Row(v1, v2), Row(v1, v2)),
  Row(Row(v1, v2), Row(v2, v1)),
  Row(Row(v1, v2), Row(v2, v2)),
  Row(Row(v2, v1), Row(v1, v1)),
  Row(Row(v2, v1), Row(v1, v2)),
  Row(Row(v2, v1), Row(v2, v1)),
  Row(Row(v2, v1), Row(v2, v2)),
  Row(Row(v2, v2), Row(v1, v1)),
  Row(Row(v2, v2), Row(v1, v2)),
  Row(Row(v2, v2), Row(v2, v1)),
  Row(Row(v2, v2), Row(v2, v2)),
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(structData), schema)
df.printSchema()
df.show(false)

val df2 = df.selectExpr("struct1", "struct2", "struct1 < struct2", "struct1 <= struct2", "struct1 == struct2")
df2.printSchema()
df2.show(false)

Show output

This is the relevant part of the output for understanding NaN behavior.

+----------+----------+-------------------+--------------------+-------------------+
|struct1   |struct2   |(struct1 < struct2)|(struct1 <= struct2)|(struct1 = struct2)|
+----------+----------+-------------------+--------------------+-------------------+
|{1.0, 1.0}|{1.0, 1.0}|false              |true                |true               |
|{1.0, 1.0}|{1.0, NaN}|true               |true                |false              |
|{1.0, 1.0}|{NaN, 1.0}|true               |true                |false              |
|{1.0, 1.0}|{NaN, NaN}|true               |true                |false              |
|{1.0, NaN}|{1.0, 1.0}|false              |false               |false              |
|{1.0, NaN}|{1.0, NaN}|false              |true                |true               |
|{1.0, NaN}|{NaN, 1.0}|true               |true                |false              |
|{1.0, NaN}|{NaN, NaN}|true               |true                |false              |
|{NaN, 1.0}|{1.0, 1.0}|false              |false               |false              |
|{NaN, 1.0}|{1.0, NaN}|false              |false               |false              |
|{NaN, 1.0}|{NaN, 1.0}|false              |true                |true               |
|{NaN, 1.0}|{NaN, NaN}|true               |true                |false              |
|{NaN, NaN}|{1.0, 1.0}|false              |false               |false              |
|{NaN, NaN}|{1.0, NaN}|false              |false               |false              |
|{NaN, NaN}|{NaN, 1.0}|false              |false               |false              |
|{NaN, NaN}|{NaN, NaN}|false              |true                |true               |
+----------+----------+-------------------+--------------------+-------------------+
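The rows above are consistent with a comparison in which NaN equals itself and sorts above every other value. A minimal Scala sketch of that rule for a two-field struct, using `java.lang.Double.compare` (which has exactly this NaN behavior), is shown here. `NaNStructOps` and its method names are hypothetical, and this is not how Spark or cudf implement it. Note that `java.lang.Double.compare` also orders -0.0 below 0.0, and this sketch makes no claim about how Spark treats signed zeros.

```scala
object NaNStructOps {
  type Pair = (Double, Double)

  // Lexicographic three-way comparison: the first field decides unless it ties.
  // java.lang.Double.compare treats NaN as equal to NaN and greater than any
  // other double value.
  def compare(a: Pair, b: Pair): Int = {
    val first = java.lang.Double.compare(a._1, b._1)
    if (first != 0) first else java.lang.Double.compare(a._2, b._2)
  }

  def less(a: Pair, b: Pair): Boolean      = compare(a, b) < 0
  def lessEqual(a: Pair, b: Pair): Boolean = compare(a, b) <= 0
  def equal(a: Pair, b: Pair): Boolean     = compare(a, b) == 0
}
```

For example, `NaNStructOps.less((1.0, 1.0), (1.0, Double.NaN))` is true and `NaNStructOps.equal((Double.NaN, Double.NaN), (Double.NaN, Double.NaN))` is true, in line with the second and last rows of the table.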

Adds support for Spark's null aware equality binop and expands/improves Java testing for struct binops. Properly tests null structs and full operator testing coverage. Utilizes existing Spark struct binop support with JNI changes to force the full null-aware comparison.

Expands on #11153
Partial solution to #8964 -- `NULL_MAX` and `NULL_MIN` still outstanding.

Authors:
- Ryan Lee (https://github.com/rwlee)

Approvers:
- Tobias Ribizel (https://github.com/upsj)
- Vukasin Milovanovic (https://github.com/vuule)
- Jason Lowe (https://github.com/jlowe)

URL: #11520

revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Aug 5, 2021

github-actions bot added this to Needs prioritizing in Feature Planning Aug 5, 2021

harrism added this to the List and Struct data types and operations milestone Aug 11, 2021

beckernick added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Aug 25, 2021

rwlee self-assigned this Aug 25, 2021

rwlee mentioned this issue Oct 15, 2021

Nested struct binop comparison #9452

Closed

sameerz linked a pull request Oct 19, 2021 that will close this issue

Nested struct binop comparison #9452

Closed

github-actions bot added the inactive-30d label Nov 15, 2021

jrhemstad added 0 - Backlog In queue waiting for assignment and removed inactive-30d labels Jan 4, 2022

devavret mentioned this issue Feb 1, 2022

[FEA] Story - Supporting row operators on nested types #10186

Closed

rwlee mentioned this issue Feb 8, 2022

Nested struct binop comparison #10255

Closed

rwlee mentioned this issue Aug 12, 2022

Struct support for NULL_EQUALS binary operation #11520

Merged

3 tasks

GregoryKimball mentioned this issue Oct 7, 2022

[FEA] Implement full support for nested types #11844

Closed

ttnghia mentioned this issue Jul 13, 2023

[FEA] Fully support nested types in Spark SQL functions NVIDIA/spark-rapids#8550

Open
