[FEA] Add whitespace removal as a JSON reader preprocessing option #14865

GregoryKimball · 2024-01-24T17:57:11Z

Is your feature request related to a problem? Please describe.
In cases where libcudf returns a JSON field as an unparsed string (see #14572), the string includes any unquoted whitespace present in the original JSON character data. This presents a challenge for downstream processing of mixed types, especially in the Spark-RAPIDS plugin where we match Spark's behavior to always remove unquoted whitespace. It also allows better control in the case where a user requests a hierarchical type as a string type.

@revans2 Would you please comment on other benefits of having this reader option available?

Describe the solution you'd like
We would like a JSON reader option and cuIO utility function that uses an FST instance to remove unquoted whitespace from input character data.
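To make the request concrete, here is a minimal, table-driven sketch of a finite-state transducer that drops unquoted spaces and tabs. It is a CPU reference model in Python, not libcudf code: the state names, symbol groups, and the function `strip_unquoted_whitespace` are all hypothetical and chosen only for illustration. It assumes strings are delimited by double quotes and that a backslash escapes the next character inside a string.

```python
# Hypothetical reference model of a whitespace-stripping FST (not libcudf code).
OUTSIDE, IN_STRING, ESCAPED = range(3)  # transducer states


def symbol_group(ch):
    """Map a character to one of four symbol groups."""
    if ch == '"':
        return 0  # double quote
    if ch == "\\":
        return 1  # backslash
    if ch in " \t":
        return 2  # whitespace that is dropped when unquoted
    return 3      # every other character


# TRANSITIONS[state][group] -> next state
TRANSITIONS = [
    # quote     backslash  space/tab  other
    [IN_STRING, OUTSIDE,   OUTSIDE,   OUTSIDE],    # OUTSIDE
    [OUTSIDE,   ESCAPED,   IN_STRING, IN_STRING],  # IN_STRING
    [IN_STRING, IN_STRING, IN_STRING, IN_STRING],  # ESCAPED
]

# EMIT[state][group] -> True if the input character is copied to the output.
# Only space/tab seen in the OUTSIDE state is suppressed.
EMIT = [
    [True, True, False, True],  # OUTSIDE
    [True, True, True,  True],  # IN_STRING
    [True, True, True,  True],  # ESCAPED
]


def strip_unquoted_whitespace(text):
    """Run the transducer over text and return the translated output."""
    state = OUTSIDE
    out = []
    for ch in text:
        group = symbol_group(ch)
        if EMIT[state][group]:
            out.append(ch)
        state = TRANSITIONS[state][group]
    return "".join(out)
```

For example, `strip_unquoted_whitespace(' { "a" : 50 } ')` yields `{"a":50}`, while whitespace inside `" a "` is left alone. A GPU implementation would evaluate the same kind of transition and translation tables in parallel rather than in a sequential loop.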

Describe alternatives you've considered
Document the whitespace differences as a GPU vs CPU difference and don't migrate customer queries that include mixed types and unquoted whitespace.

Additional context
We have a similar whitespace normalization challenge for the results of calls to get_json_object (see NVIDIA/spark-rapids#10218), and an FST-based whitespace normalization tool could be part of the solution for that problem as well.


revans2 · 2024-01-30T17:47:18Z

The main thing for us is to try and match what Spark does with white space so we can return bit-for-bit identical results. This would happen any time that we ask for a column to be returned as a string, but the underlying data types are not actually quoted strings.

Here are a few examples of what Spark does with normalization of the white space.

+------------------------------------------------+-------------------------------------+
|jsonStr                                         |normalized                           |
+------------------------------------------------+-------------------------------------+
|{"a":50}                                        |{"a":50}                             |
| { "a" : 50 }                                   |{"a":50}                             |
|{" a ":50}                                      |{" a ":50}                           |
|{"a": [1, 2, 3, 4, 5, 6, 7, 8], "b": {"c": "d"}}|{"a":[1,2,3,4,5,6,7,8],"b":{"c":"d"}}|
|{"a":\t"b"}                                     |{"a":"b"}                            |
+------------------------------------------------+-------------------------------------+

Please note that in the above example \t represents the tab character and not the two-character sequence '\' and 't'.

Technically Spark will also strip out carriage return and line feed (\r and \n) characters, but that would have to depend on whether JSON Lines is enabled, since '\r' and '\n' are hard separators for us when parsing the JSON Lines format.

I tried multiple Unicode white space characters as well as \f and \v, which resulted in Spark saying that the result is invalid JSON. So I would want to be sure that we do not remove these from the result, as we might turn something that was invalid JSON into valid JSON.
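The rules above can be summarized in a short Python sketch. This is only an illustration of the character policy, not Spark's or libcudf's implementation; the function name `normalize_whitespace` and the `json_lines` flag are hypothetical. It assumes that only space and tab are always removable, that \r and \n are removable only when JSON Lines is not in use, and that \f, \v, and Unicode white space are never touched.

```python
def normalize_whitespace(text, json_lines=True):
    """Drop unquoted whitespace according to the policy described above.

    Hypothetical helper: with json_lines=True, '\r' and '\n' are kept
    because they separate records; otherwise they are removed as well.
    """
    removable = {" ", "\t"}
    if not json_lines:
        removable |= {"\r", "\n"}

    out = []
    in_string = False  # currently inside a double-quoted string
    escaped = False    # previous character inside the string was a backslash
    for ch in text:
        if in_string:
            out.append(ch)  # quoted content is always preserved
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif ch not in removable:
            # Includes \f, \v and Unicode spaces, which stay in the output
            # so that invalid JSON is not silently turned into valid JSON.
            out.append(ch)
    return "".join(out)
```

Running it on the `jsonStr` column of the table reproduces the `normalized` column, and an input such as `{"a":\f1}` (with a real form feed) comes back unchanged.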

shrshi · 2024-01-30T18:35:53Z

Thank you very much for these examples @revans2!
Can the inputs be invalid JSON strings? I'm trying to understand what the normalized output would be in the following example

{"a" : "b }

This PR provides a proof-of-concept for the usage of FST in removing unquoted spaces and tabs in JSON strings. This is a useful feature in the cases where we want to cast a hierarchical JSON object to a string, and overcomes the challenge of processing mixed types using Spark. [#14865](#14865)

The FST assumes that the single quotes in the input data have already been normalized (possibly using [`normalize_single_quotes`](#14729)).

Authors:
- Shruti Shivakumar (https://github.com/shrshi)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Elias Stehle (https://github.com/elstehle)
- Bradley Dice (https://github.com/bdice)
- Karthikeyan (https://github.com/karthikeyann)

URL: #14931

This work is a follow-up to PR #14931, which provided a proof-of-concept for using an FST to normalize unquoted whitespace. This PR implements the pre-processing FST in cuIO and adds a JSON reader option that needs to be set to true to invoke the normalizer.

Addresses feature request #14865

Authors:
- Shruti Shivakumar (https://github.com/shrshi)
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
- Vukasin Milovanovic (https://github.com/vuule)
- Robert Maynard (https://github.com/robertmaynard)
- Bradley Dice (https://github.com/bdice)

URL: #15033

GregoryKimball · 2024-03-13T04:03:14Z

Closed by #15033

…delimiter (#15556)

Addresses #15277

Given a JSON lines buffer with records separated by a delimiter passed at runtime, the idea is to modify the JSON tokenization FST to consider the passed delimiter to generate EOL token instead of the newline character currently hard-coded. This PR does not modify the whitespace normalization FST to [strip out unquoted `\n` and `\r`](#14865 (comment)). Whitespace normalization will be handled in follow-up works.

Note that this is not a multi-object JSON reader since we are not using the offsets data in the string column, and hence there is no resetting of the start state at every row offset.

Current status:
- [X] Semantic bracket/brace DFA
- [X] DFA removing excess characters after record in line
- [X] Pushdown automata generating tokens
- [x] Test passing arbitrary delimiter that does not occur in input to the reader

Authors:
- Shruti Shivakumar (https://github.com/shrshi)

Approvers:
- Paul Mattione (https://github.com/pmattione-nvidia)
- Vukasin Milovanovic (https://github.com/vuule)
- Elias Stehle (https://github.com/elstehle)
- Karthikeyan (https://github.com/karthikeyann)

URL: #15556

GregoryKimball added labels on Jan 24, 2024: feature request (New feature or request), 0 - Backlog (In queue waiting for assignment), libcudf (Affects libcudf (C++/CUDA) code.), cuIO (cuIO issue), Spark (Functionality that helps Spark RAPIDS)

shrshi mentioned this issue Jan 30, 2024

POC for whitespace removal in input JSON data using FST #14931

Merged

3 tasks

GregoryKimball mentioned this issue Jan 31, 2024

[FEA] JSON reader improvements for Spark-RAPIDS #13525

Open

shrshi mentioned this issue Feb 13, 2024

API for JSON unquoted whitespace normalization #15033

Merged

3 tasks

GregoryKimball closed this as completed Mar 13, 2024

This was referenced Apr 12, 2024

[WIP] POC for reading multi-line JSON in string columns #15520

Draft

Reading multi-line JSON in string columns using runtime configurable delimiter #15556

Merged
