POC for whitespace removal in input JSON data using FST #14931

Merged: 17 commits, Feb 8, 2024

shrshi
Contributor

@shrshi shrshi commented Jan 30, 2024

Description

This PR provides a proof of concept for using an FST (finite-state transducer) to remove unquoted spaces and tabs from JSON strings. This is useful when we want to cast a hierarchical JSON object to a string, and it addresses the challenge of processing mixed types using Spark (#14865).
The FST assumes that the single quotes in the input data have already been normalized (possibly using normalize_single_quotes).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Jan 30, 2024
@shrshi shrshi added feature request New feature or request non-breaking Non-breaking change 2 - In Progress Currently a work in progress and removed CMake CMake build issue labels Jan 30, 2024
@github-actions github-actions bot added the CMake CMake build issue label Jan 30, 2024
Contributor

@elstehle elstehle left a comment

Thank you so much for working on this and for putting the FST to use 🙂
I just did an early high-level review of the FST parts. Overall, this already looks good. I left a few minor comments that may help us simplify the logic a bit further.

@shrshi shrshi marked this pull request as ready for review January 30, 2024 22:16
@shrshi shrshi requested a review from a team as a code owner January 30, 2024 22:16
@vuule
Contributor

vuule commented Feb 3, 2024

@elstehle would it be feasible to create a single FST that can perform both quote normalization and whitespace removal, and that can also be configured to do only one of these preprocessing steps? I know the JSON parser FST is configurable to some extent, but I don't know how limited this approach is.

@elstehle
Contributor

elstehle commented Feb 5, 2024

I believe it should be possible to have an FST that does both in a single pass. We'd have to see whether it makes sense to integrate all three options, i.e., (1) whitespace removal, (2) quote normalization, (3) both, into a single FST instance, or whether that would overcomplicate the translation function and make it too branchy. In that case, it might be better to have three separate FST instances, one for each of the options above.

@vuule
Contributor

vuule commented Feb 5, 2024

Thanks. If you're not sure if this is feasible, it probably makes the most sense to start with separate FSTs.

@shrshi shrshi requested a review from elstehle February 6, 2024 19:41
Contributor

@elstehle elstehle left a comment

Just a few minor comments, otherwise looks good to me 👍

Contributor

@bdice bdice left a comment

A couple suggestions to improve comments. Otherwise LGTM!

* | state, whitespaces following escaped double quotes inside strings may be removed.
*
* NOTE: An important case NOT handled by this FST is that of whitespace following newline
* characters within a string. For example, `{"a":"x\n y"}` ---FST--> `{"a":"x\ny"}`
Contributor

The example makes it sound like this FST does that transformation. Maybe write:

Suggested change
* characters within a string. For example, `{"a":"x\n y"}` ---FST--> `{"a":"x\ny"}`
* characters within a string. For example, `{"a":"x\n y"}` is unchanged by this FST. It
* does not become `{"a":"x\ny"}`.

Contributor Author

With the current FST, we would get the transformation described in the comment, but that is not the expected behaviour, i.e., we should not remove whitespace characters within quotes. I think the following would make it clearer:

Suggested change
* characters within a string. For example, `{"a":"x\n y"}` ---FST--> `{"a":"x\ny"}`
* characters within a string. Consider the following example
* Input: {"a":"x\n y"}
* FST output: {"a":"x\ny"}
* Expected output: {"a":"x\n y"}

Contributor

I'm confused. We are documenting a known bug in the current implementation? Are we intending to fix this before merging?

Contributor Author

For compatibility with Spark, we don't need to consider newlines within strings as a part of the string. While reading from JSON lines with the option set to recover from invalid lines, I think newline characters present before the end of the record (like in the example {"a":"x\n y"}) will result in the parser treating it as an invalid line.
I have added the note for the sake of completeness and to clarify the scope of the FST.
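
To make the scope of this note concrete, here is a minimal host-side sketch in plain C++ (it is not the GPU FST from this PR). The function name `strip_unquoted_ws` and its two boolean flags are hypothetical. The sketch assumes that the `\n` in the example denotes a raw newline byte and that such a byte ends the current string, as the note describes.

```cpp
#include <string>

// Hypothetical CPU model of the behaviour discussed in the note above.
// Assumption: a raw newline byte inside a string ends the string (and record).
inline std::string strip_unquoted_ws(const std::string& in)
{
  std::string out;
  out.reserve(in.size());
  bool in_string = false;  // currently inside a double-quoted string
  bool escaped   = false;  // previous character was a backslash inside a string
  for (char c : in) {
    if (!in_string) {
      if (c == ' ' || c == '\t') { continue; }  // drop unquoted spaces and tabs
      if (c == '"') { in_string = true; }       // opening quote
    } else if (escaped) {
      escaped = false;  // character after a backslash is taken verbatim
    } else if (c == '\\') {
      escaped = true;
    } else if (c == '"' || c == '\n') {
      in_string = false;  // closing quote, or raw newline resetting the state
    }
    out.push_back(c);
  }
  return out;
}
```

Under these assumptions, a raw newline inside the string causes the following space to be dropped, whereas the two-character escape sequence (a backslash followed by `n`) leaves the space intact, because the backslash is consumed as an escape and the string state is kept.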

@karthikeyann
Contributor

Quote normalization is applied to the entire JSON input. Whitespace removal is required only for downstream processing of mixed types (#14865), which should cover much less data than the entire JSON input. This may be a reason to keep the FSTs separate. A per-string FST for whitespace removal could be useful (but only if it does not hurt performance).

Contributor

@karthikeyann karthikeyann left a comment

Nice and clean FST state table! Great work.

Comment on lines +80 to +83
{/* IN_STATE      "       \       \n    <SPC>   OTHER */
 /* TT_OOS */ {{TT_DQS, TT_OOS, TT_OOS, TT_OOS, TT_OOS}},
 /* TT_DQS */ {{TT_OOS, TT_DEC, TT_OOS, TT_DQS, TT_DQS}},
 /* TT_DEC */ {{TT_DQS, TT_DQS, TT_DQS, TT_DQS, TT_DQS}}}};
Contributor

Is an error state not expected to occur, given that we don't have a state for errors?

Contributor Author

Neither the quote normalization FST nor the whitespace normalization FST has an error state. For invalid JSON inputs (such as the GroundTruth_InvalidInput test case), the FST processes them anyway and leaves error handling and recovery to the subsequent parsing FST.
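
For readers who want to trace the table by hand, the following is a rough CPU model of it. The state names and table entries are copied from the snippet above; the symbol-group enum, `classify`, and `remove_unquoted_whitespace` are hypothetical names. The rule "drop spaces and tabs only while in TT_OOS, decided before taking the transition" is an assumption about the translation function, not code taken from the PR.

```cpp
#include <array>
#include <cstddef>
#include <string>

constexpr std::size_t kNumStates  = 3;
constexpr std::size_t kNumSymbols = 5;

// States, named as in the reviewed table: outside of string,
// inside a double-quoted string, and just after a backslash in a string.
enum State : std::size_t { TT_OOS = 0, TT_DQS = 1, TT_DEC = 2 };

// Symbol groups, one per table column (hypothetical names).
enum Symbol : std::size_t { SG_QUOTE = 0, SG_BACKSLASH, SG_NEWLINE, SG_WHITESPACE, SG_OTHER };

// Map a character to its symbol group.
inline Symbol classify(char c)
{
  switch (c) {
    case '"': return SG_QUOTE;
    case '\\': return SG_BACKSLASH;
    case '\n': return SG_NEWLINE;
    case ' ':
    case '\t': return SG_WHITESPACE;
    default: return SG_OTHER;
  }
}

// Transition table copied from the review snippet. There is no error state.
constexpr std::array<std::array<State, kNumSymbols>, kNumStates> transition_table{
  {/* IN_STATE      "       \       \n    <SPC>   OTHER */
   /* TT_OOS */ {{TT_DQS, TT_OOS, TT_OOS, TT_OOS, TT_OOS}},
   /* TT_DQS */ {{TT_OOS, TT_DEC, TT_OOS, TT_DQS, TT_DQS}},
   /* TT_DEC */ {{TT_DQS, TT_DQS, TT_DQS, TT_DQS, TT_DQS}}}};

// Sequential simulation: emit every character except spaces and tabs
// seen while outside of a string (assumed translation rule).
inline std::string remove_unquoted_whitespace(const std::string& input)
{
  std::string output;
  output.reserve(input.size());
  State state = TT_OOS;
  for (char c : input) {
    const Symbol sym = classify(c);
    if (!(state == TT_OOS && sym == SG_WHITESPACE)) { output.push_back(c); }
    state = transition_table[state][sym];
  }
  return output;
}
```

Running this model on the GroundTruth_InvalidInput string shows the behaviour described here: the unterminated string `"b }` is passed through unchanged, the newline returns the machine to TT_OOS, and the second record is normalized as usual, with no error reported at this stage.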

void run_test(const std::string& input, const std::string& output)
{
  // Prepare cuda stream for data transfers & kernels
  rmm::cuda_stream stream{};
Contributor

Should this be cudf::get_default_stream() for tests?

Contributor Author

Great idea! It's better to call cudf::test::get_default_stream() here instead of creating a new stream. Fixed.

TEST_F(JsonWSNormalizationTest, GroundTruth_InvalidInput)
{
  std::string input  = "{\"a\" : \"b }\n{ \"c\" :\t\"d\"}";
  std::string output = "{\"a\":\"b }\n{\"c\":\"d\"}";
Contributor

Question (not a change suggestion):
Why do some test cases use raw string literals, while others use escaped strings?

Contributor Author

With raw strings, it's hard to see the positions of spaces and tabs when they are next to each other, especially when editors map tabs to different numbers of spaces. With escaped strings, I think we have more control.

Contributor

That's a really good answer! I hadn't considered that.

Contributor

@karthikeyann karthikeyann left a comment

Looks good.

@shrshi
Contributor Author

shrshi commented Feb 8, 2024

/merge

@rapids-bot rapids-bot bot merged commit 3f8cb74 into rapidsai:branch-24.04 Feb 8, 2024
68 checks passed
rapids-bot bot pushed a commit that referenced this pull request Mar 4, 2024
This work is a follow-up to PR #14931, which provided a proof of concept for using an FST to normalize unquoted whitespace. This PR implements the pre-processing FST in cuIO and adds a JSON reader option that needs to be set to true to invoke the normalizer.
Addresses feature request #14865

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Robert Maynard (https://github.com/robertmaynard)
  - Bradley Dice (https://github.com/bdice)

URL: #15033