Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Add nested struct support for comparison operations #8964

Open
revans2 opened this issue Aug 5, 2021 · 3 comments
Open

[FEA] Add nested struct support for comparison operations #8964

revans2 opened this issue Aug 5, 2021 · 3 comments
Assignees
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Aug 5, 2021

Is your feature request related to a problem? Please describe.
For Spark we are pushing to get more support for structs in a number of operators. We already have some support for sorting structs, so we should be able to come up with a way to do comparisons of nested structs too. NOTE this does not include lists as children of the structs just structs that contains basic types including strings and other structs.

The operations we would like to support include the BINARY ops EQUAL, NOT_EQUAL, LESS, GREATER, LESS_EQUAL, GREATER_EQUAL, NULL_EQUALS, and if possible NULL_MAX and NULL_MIN.

This should follow the same pattern we have supported for sorting with the order of precedence for the children in a struct go from first to last. In this case we would like nulls within the struct columns to be less than other values, but equal to each other. meaning Struct(null) is less than Struct(5) and Struct(null) == Struct(null). Nulls at the top level still depend on the operator being performed. For NULL_EQUALS nulls are equal to each other.

Describe the solution you'd like
It would be great if we could do this as regular binary ops, but if we need them to be separate APIs that works too. If null equality/etc needs to be configurable for the python APIs a separate API is fine.

Describe alternatives you've considered
We could flatten the struct columns ourselves and do a number of different operations to combine the results back together to get the right answer. But cudf already has a flatten method behind the scenes so why replicate that when others could benefit from it too.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@sameerz
Copy link
Contributor

sameerz commented Dec 21, 2021

This is still required

@bdice
Copy link
Contributor

bdice commented Jun 7, 2022

This code snippet demonstrates some behavior with NaNs that I investigated with @rwlee. tl;dr Spark treats NaN the same in binary operators <, <=, ==, ... as in the comparators <, == used for sorting and equality. This follows the rules in #4760 but with elementwise comparison of structs.

Show snippet

Save as binops.scala and run with: $ spark-shell -i binops.scala

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DoubleType, StructType}
import org.apache.spark.sql.Row

val schema = new StructType()
  .add("struct1", new StructType()
    .add("x", DoubleType)
    .add("y", DoubleType))
  .add("struct2", new StructType()
    .add("x", DoubleType)
    .add("y", DoubleType))

val v1 = 1.0
val v2 = Double.NaN

val structData = Seq(
  Row(Row(v1, v1), Row(v1, v1)),
  Row(Row(v1, v1), Row(v1, v2)),
  Row(Row(v1, v1), Row(v2, v1)),
  Row(Row(v1, v1), Row(v2, v2)),
  Row(Row(v1, v2), Row(v1, v1)),
  Row(Row(v1, v2), Row(v1, v2)),
  Row(Row(v1, v2), Row(v2, v1)),
  Row(Row(v1, v2), Row(v2, v2)),
  Row(Row(v2, v1), Row(v1, v1)),
  Row(Row(v2, v1), Row(v1, v2)),
  Row(Row(v2, v1), Row(v2, v1)),
  Row(Row(v2, v1), Row(v2, v2)),
  Row(Row(v2, v2), Row(v1, v1)),
  Row(Row(v2, v2), Row(v1, v2)),
  Row(Row(v2, v2), Row(v2, v1)),
  Row(Row(v2, v2), Row(v2, v2)),
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(structData), schema)
df.printSchema()
df.show(false)

val df2 = df.selectExpr("struct1", "struct2", "struct1 < struct2", "struct1 <= struct2", "struct1 == struct2")
df2.printSchema()
df2.show(false)
Show output

This is the relevant part of the output for understanding NaN behavior.

+----------+----------+-------------------+--------------------+-------------------+
|struct1   |struct2   |(struct1 < struct2)|(struct1 <= struct2)|(struct1 = struct2)|
+----------+----------+-------------------+--------------------+-------------------+
|{1.0, 1.0}|{1.0, 1.0}|false              |true                |true               |
|{1.0, 1.0}|{1.0, NaN}|true               |true                |false              |
|{1.0, 1.0}|{NaN, 1.0}|true               |true                |false              |
|{1.0, 1.0}|{NaN, NaN}|true               |true                |false              |
|{1.0, NaN}|{1.0, 1.0}|false              |false               |false              |
|{1.0, NaN}|{1.0, NaN}|false              |true                |true               |
|{1.0, NaN}|{NaN, 1.0}|true               |true                |false              |
|{1.0, NaN}|{NaN, NaN}|true               |true                |false              |
|{NaN, 1.0}|{1.0, 1.0}|false              |false               |false              |
|{NaN, 1.0}|{1.0, NaN}|false              |false               |false              |
|{NaN, 1.0}|{NaN, 1.0}|false              |true                |true               |
|{NaN, 1.0}|{NaN, NaN}|true               |true                |false              |
|{NaN, NaN}|{1.0, 1.0}|false              |false               |false              |
|{NaN, NaN}|{1.0, NaN}|false              |false               |false              |
|{NaN, NaN}|{NaN, 1.0}|false              |false               |false              |
|{NaN, NaN}|{NaN, NaN}|false              |true                |true               |
+----------+----------+-------------------+--------------------+-------------------+

rapids-bot bot pushed a commit that referenced this issue Aug 15, 2022
Adds support for Spark's null aware equality binop and expands/improves Java testing for struct binops. Properly tests null structs and full operator testing coverage. Utilizes existing Spark struct binop support with JNI changes to force the full null-aware comparison.

Expands on #11153

Partial solution to #8964 -- `NULL_MAX` and `NULL_MIN` still outstanding.

Authors:
  - Ryan Lee (https://github.com/rwlee)

Approvers:
  - Tobias Ribizel (https://github.com/upsj)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Jason Lowe (https://github.com/jlowe)

URL: #11520
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
No open projects
Feature Planning
Needs prioritizing
Development

Successfully merging a pull request may close this issue.

7 participants