Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] GPU get_json_object does incompatible escaping and error checking #12483

Closed
revans2 opened this issue Jan 5, 2023 · 2 comments · Fixed by #15082
Closed

[BUG] GPU get_json_object does incompatible escaping and error checking #12483

revans2 opened this issue Jan 5, 2023 · 2 comments · Fixed by #15082
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Jan 5, 2023

Describe the bug
the GPU implementation of get_json_object parses the JSON path in a way that is incompatible with Spark's and is not documented. We also throw a bunch of exceptions on invalid JSON paths that are not documented. Some of these I am fine if we just document the incompatibility. For others we might want to look into fixing them...

Steps/Code to reproduce bug

expected:

scala> val df = Seq("""{"A": 1, "A.B": 2, "'A": {"B'": 3}}""").toDF
scala> df.repartition(1).selectExpr("""get_json_object(value, "${A}") as something""").show()
+---------+
|something|
+---------+
|     null|
+---------+
scala> df.repartition(1).selectExpr("""get_json_object(value, "$.'A") as something""").show()
+---------+
|something|
+---------+
| {"B'":3}|
+---------+
scala> df.repartition(1).selectExpr("""get_json_object(value, "$.'A.B'") as something""").show()
+---------+
|something|
+---------+
|        3|
+---------+

Actual (on the GPU)

scala> val df = Seq("""{"A": 1, "A.B": 2, "'A": {"B'": 3}}""").toDF

scala> df.repartition(1).selectExpr("""get_json_object(value, "${A}") as something""").show()
...
ai.rapids.cudf.CudfException: CUDF failure at: /home/roberte/src/spark-rapids-jni/thirdparty/cudf/cpp/src/strings/strings_column_view.cpp:47: strings column has no children

scala> df.repartition(1).selectExpr("""get_json_object(value, "$.'A") as something""").show()
...
ai.rapids.cudf.CudfException: CUDF failure at:/home/roberte/src/spark-rapids-jni/thirdparty/cudf/cpp/src/strings/json/json_path.cu:654: Encountered invalid JSONPath input string

scala> df.repartition(1).selectExpr("""get_json_object(value, "$.'A.B'") as something""").show()
+---------+
|something|
+---------+
|        2|
+---------+

Expected behavior
At a minimum we need to document these differences. Ideally we catch any parsing errors from the JSON path parser and we return a null in those cases just like Spark does. We might even want to look into falling back to the CPU if we see a single quote ' in the path.

The single quote escaping, at least for spark, appears to only happen inside of [] operations, like $['A.B'], which returns 2 for both the GPU and the CPU.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify labels Jan 5, 2023
@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 6, 2023
@revans2
Copy link
Contributor Author

revans2 commented Jan 12, 2024

I would like to expand on this with a few more examples in issues in the Spark Plugin repo NVIDIA/spark-rapids#10196 NVIDIA/spark-rapids#10194

@davidwendt
Copy link
Contributor

@nvdbaranec

@GregoryKimball GregoryKimball added the Spark Functionality that helps Spark RAPIDS label Feb 13, 2024
rapids-bot bot pushed a commit that referenced this issue Feb 29, 2024
This PR addresses a parsing issue related to JSONPath by implementing distinct parsing rules for values inside and outside brackets. For instance, in `{ "A.B": 2, "'A": { "B'": 3 } }`, `$.'A.B'` differs from `$['A.B']`.  (See [Assertible JSON Path Documentation](https://assertible.com/docs/guide/json-path))

The fix ensures accurate parsing of JSONPath values containing quotes. For example in `{ "A.B": 2, "'A": { "B'": 3 } }`


| JSONPath      | Before Fix                                            | Spark        | After Fix           |
|---------------|-------------------------------------------------------|----------------------|---------------------|
| $.'A.B'       | 2                                                     | 3                    | 3                   |
| $.'A          | CUDF_FAIL("Encountered invalid JSONPath input string")| {"B'": 3}            | {"B'": 3}           |



Resolves [12483](#12483).

Authors:
  - Suraj Aralihalli (https://github.com/SurajAralihalli)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #15082
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants