[FEA] Add whitespace removal as a JSON reader preprocessing option #14865

Closed
GregoryKimball opened this issue Jan 24, 2024 · 3 comments
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@GregoryKimball
Contributor

GregoryKimball commented Jan 24, 2024

Is your feature request related to a problem? Please describe.
In cases where libcudf returns a JSON field as an unparsed string (see #14572), the string includes any unquoted whitespace present in the original JSON character data. This complicates downstream processing of mixed types, especially in the Spark-RAPIDS plugin, where we match Spark's behavior of always removing unquoted whitespace. Removing this whitespace in the reader would also give better control in the case where a user requests a hierarchical type as a string type.

@revans2 Would you please comment on other benefits of having this reader option available?

Describe the solution you'd like
We would like a JSON reader option and cuIO utility function that uses an FST instance to remove unquoted whitespace from input character data.

Describe alternatives you've considered
Document the whitespace differences as a GPU vs CPU difference and don't migrate customer queries that include mixed types and unquoted whitespace.

Additional context
We have a similar whitespace normalization challenge for the results of calls to get_json_object (see NVIDIA/spark-rapids#10218), and an FST-based whitespace normalization tool could be part of the solution for that problem as well.

@GregoryKimball GregoryKimball added feature request New feature or request 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS labels Jan 24, 2024
@revans2
Contributor

revans2 commented Jan 30, 2024

The main thing for us is to match what Spark does with whitespace so we can return bit-for-bit identical results. This matters any time we ask for a column to be returned as a string but the underlying data is not actually a quoted string.

Here are a few examples of how Spark normalizes whitespace.

+------------------------------------------------+-------------------------------------+
|jsonStr                                         |normalized                           |
+------------------------------------------------+-------------------------------------+
|{"a":50}                                        |{"a":50}                             |
| { "a" : 50 }                                   |{"a":50}                             |
|{" a ":50}                                      |{" a ":50}                           |
|{"a": [1, 2, 3, 4, 5, 6, 7, 8], "b": {"c": "d"}}|{"a":[1,2,3,4,5,6,7,8],"b":{"c":"d"}}|
|{"a":\t"b"}                                     |{"a":"b"}                            |
+------------------------------------------------+-------------------------------------+

Please note that in the above example \t represents the tab character and not the two-character sequence '\' and 't'.

Technically Spark will also strip out carriage return and line feed (\r and \n) characters, but whether we can do that depends on whether JSON Lines is enabled, because '\r' and '\n' are hard separators for us when parsing the JSON Lines format.

I tried multiple Unicode whitespace characters as well as \f and \v, and in each case Spark reported the result as invalid JSON. So I would want to be sure that we do not remove these characters, as removing them might turn invalid JSON into valid JSON.
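
To make the rules above concrete, here is a minimal CPU-side sketch in Python of the normalization being described. It is not the libcudf FST and not Spark's implementation; the function name `normalize_whitespace` is hypothetical, and the sketch assumes that single quotes have already been normalized to double quotes. It removes only unquoted spaces and tabs, and leaves \r, \n, \f, \v, and Unicode whitespace untouched.

```python
def normalize_whitespace(text: str) -> str:
    """Drop spaces and tabs that appear outside double-quoted strings.

    Hypothetical helper for illustration only; assumes double-quoted strings.
    """
    out = []
    in_string = False  # currently inside a double-quoted JSON string
    escaped = False    # previous character was a backslash inside a string
    for ch in text:
        if in_string:
            out.append(ch)  # quoted characters are always preserved
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif ch in (" ", "\t"):
            continue  # unquoted space or tab: remove
        else:
            out.append(ch)  # \r, \n, \f, \v and Unicode spaces are kept
    return "".join(out)


# Rows from the table above (the \t here is a real tab character).
print(normalize_whitespace(' { "a" : 50 } '))
print(normalize_whitespace('{" a ":50}'))
print(normalize_whitespace('{"a":\t"b"}'))
```

Tracking the backslash state matters because an escaped quote inside a string must not end the quoted region; without it, whitespace after `\"` would be stripped incorrectly.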

@shrshi
Contributor

shrshi commented Jan 30, 2024

Thank you very much for these examples @revans2!
Can the inputs be invalid JSON strings? I'm trying to understand what the normalized output would be for the following example, where the closing quote is missing:

{"a" : "b }
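
The thread does not record an answer to this question. As one possible behavior, and purely as an assumption, a normalizer that only tracks "inside quotes" versus "outside quotes" and performs no validation would treat everything after the unmatched quote as quoted and leave it unchanged. The Python sketch below illustrates that; the name `strip_unquoted_ws` is hypothetical and this is not the libcudf implementation.

```python
def strip_unquoted_ws(text: str) -> str:
    """Remove unquoted spaces/tabs without validating the JSON (illustrative only)."""
    out = []
    in_string = False  # inside a double-quoted string
    escaped = False    # previous character was a backslash inside a string
    for ch in text:
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch in (" ", "\t"):
            continue  # unquoted space or tab: dropped
        else:
            if ch == '"':
                in_string = True
            out.append(ch)
    return "".join(out)


# The unterminated string keeps its trailing space and brace.
print(strip_unquoted_ws('{"a" : "b }'))  # prints {"a":"b }
```

Under this assumption the output is still invalid JSON, so the normalizer would not turn an invalid record into a valid one.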

rapids-bot bot pushed a commit that referenced this issue Feb 8, 2024
This PR provides a proof-of-concept for using an FST to remove unquoted spaces and tabs in JSON strings. This is useful in cases where we want to cast a hierarchical JSON object to a string, and it addresses the challenge of processing mixed types with Spark. [#14865](#14865)
The FST assumes that the single quotes in the input data have already been normalized (possibly using [`normalize_single_quotes`](#14729)).

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Elias Stehle (https://github.com/elstehle)
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #14931
rapids-bot bot pushed a commit that referenced this issue Mar 4, 2024
This work is a follow-up to PR #14931, which provided a proof-of-concept for using an FST to normalize unquoted whitespace. This PR implements the pre-processing FST in cuIO and adds a JSON reader option that must be set to true to invoke the normalizer.
Addresses feature request #14865

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Robert Maynard (https://github.com/robertmaynard)
  - Bradley Dice (https://github.com/bdice)

URL: #15033
@GregoryKimball
Contributor Author

Closed by #15033

rapids-bot bot pushed a commit that referenced this issue May 20, 2024
…delimiter (#15556)

Addresses #15277
Given a JSON Lines buffer with records separated by a delimiter passed at runtime, the idea is to modify the JSON tokenization FST so that the passed delimiter, rather than the currently hard-coded newline character, generates the EOL token.
This PR does not modify the whitespace normalization FST to [strip out unquoted `\n` and `\r`](#14865 (comment)). Whitespace normalization will be handled in follow-up work.
Note that this is not a multi-object JSON reader since we are not using the offsets data in the string column, and hence there is no resetting of the start state at every row offset.
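
To illustrate how a runtime delimiter interacts with stripping `\n` and `\r`, here is a hedged Python sketch. It is not the tokenization FST or the normalization FST from these PRs; the function name `normalize_with_delimiter` and its `delimiter` parameter are hypothetical. The assumption is that unquoted spaces, tabs, `\r`, and `\n` can be removed unless the character is the record delimiter, which must be preserved so records stay separated. Like the description above, the sketch does not reset its state at record boundaries.

```python
def normalize_with_delimiter(text: str, delimiter: str = "\n") -> str:
    """Strip unquoted whitespace but keep the record delimiter (illustrative only)."""
    if len(delimiter) != 1:
        raise ValueError("delimiter must be a single character")
    # Characters that may be removed when unquoted; never remove the delimiter.
    strippable = {" ", "\t", "\r", "\n"} - {delimiter}
    out = []
    in_string = False  # inside a double-quoted string
    escaped = False    # previous character was a backslash inside a string
    for ch in text:
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch in strippable:
            continue  # unquoted, non-delimiter whitespace: dropped
        else:
            if ch == '"':
                in_string = True
            out.append(ch)
    return "".join(out)


# With the default newline delimiter, \r is stripped but \n is kept.
print(repr(normalize_with_delimiter('{"a": 1}\n{ "b" :\r2 }\n')))
# With ';' as the delimiter, unquoted \n becomes removable as well.
print(repr(normalize_with_delimiter('{"a":\n1};{"b": 2}', delimiter=";")))
```

Deriving the strippable set from the delimiter keeps the rule in one place: whichever character separates records is exempt from removal.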

Current status:
- [x] Semantic bracket/brace DFA
- [x] DFA removing excess characters after record in line
- [x] Pushdown automata generating tokens
- [x] Test passing arbitrary delimiter that does not occur in input to the reader

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Paul Mattione (https://github.com/pmattione-nvidia)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Elias Stehle (https://github.com/elstehle)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #15556
Projects
Status: Done
Archived in project

3 participants