Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record #14279

Conversation

elstehle
Copy link
Contributor

Description

Closes #14226.

The new behvior of JSON_LINES_RECOVER will now ignore excess characters after the first valid JSON record on each JSON line.

{ "number": 1 } 
{ "number": 1 } xyz
{ "number": 1 } {}
{ "number": 1 } { "number": 4 }

Implementation details:
The JSON parser pushdown automaton was changed for JSON_LINES_RECOVER format such that when in state PD_PVL (post-value, "I have just finished parsing a value") and when the stack context is ROOT ("I'm not somewhere within a list or struct"), we just treat all characters as "white space" until encountering a newline character. post-value in stack context ROOT is exactly the condition we are in after having parsed the first valid record of a JSON line. Thanks to @karthikeyann for suggesting to use PD_PVL as the capturing state.

As the stack context is generated upfront, we have to fix up and correct the stack context to set the stack context as ROOT stack context for all these excess characters. I.e., (_ means ROOT stack context, { means within a STRUCT stack context):

in:    {"a":1}{"this is supposed to be ignored"}
stack: _{{{{{{_{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{

Needs to be fixed up to become:

in:    {"a":1}{"this is supposed to be ignored"}
stack: _{{{{{{__________________________________

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@elstehle elstehle requested a review from a team as a code owner October 13, 2023 12:43
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 13, 2023
@elstehle elstehle added bug Something isn't working 3 - Ready for Review Ready for review by team cuIO cuIO issue non-breaking Non-breaking change labels Oct 13, 2023
@@ -753,15 +753,15 @@ class TranslationOp {
RelativeOffsetT const relative_offset,
SymbolT const read_symbol) const
{
return translation_op(*this, state_id, match_id, relative_offset, read_symbol);
return translation_op(state_id, match_id, relative_offset, read_symbol);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a minor improvement that I piggy-backed into this PR. The delegate that implements translation_op has no benefit from the extra reference of *this, so we removed it from the arguments.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple small pieces of feedback. I think this makes sense. The overall complexity of the FST code is getting higher and higher, but I'm not sure if we can do much about that while continuing to add requested features.

cpp/src/io/json/nested_json_gpu.cu Outdated Show resolved Hide resolved
cpp/tests/io/json_test.cpp Show resolved Hide resolved
@elstehle
Copy link
Contributor Author

. The overall complexity of the FST code is getting higher and higher, but I'm not sure if we can do much about that while continuing to add requested features

Yes, agreed. Unfortunately, the Pushdown Transducer for a JSON parser is quite complex by nature already. For most of the remaining Spark requests on our radar we will be introducing extra pre-/post-processing steps rather than changing the Pushdown machinery.

Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! 🚀 Especially with the additional processing step without complicating the existing JSON FST table.

cpp/src/io/json/nested_json_gpu.cu Outdated Show resolved Hide resolved
cpp/src/io/json/nested_json_gpu.cu Outdated Show resolved Hide resolved
@GregoryKimball
Copy link
Contributor

@andygrove would you please evaluate this solution?

@andygrove
Copy link
Contributor

@andygrove would you please evaluate this solution?

@GregoryKimball I have confirmed that this resolves the issue. The plugin tests for this issue now pass when run against this PR.

@elstehle
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 50e2211 into rapidsai:branch-23.12 Oct 20, 2023
57 checks passed
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[FEA] Add JSON reader option to ignore all characters after a valid JSON record
6 participants