Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading multi-line JSON in string columns using runtime configurable delimiter #15556

Merged
merged 39 commits into from
May 20, 2024

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Apr 17, 2024

Description

Addresses #15277
Given a JSON lines buffer with records separated by a delimiter passed at runtime, the idea is to modify the JSON tokenization FST to consider the passed delimiter to generate EOL token instead of the newline character currently hard-coded.
This PR does not modify the whitespace normalization FST to strip out unquoted \n and \r. Whitespace normalization will be handled in follow-up works.
Note that this is not a multi-object JSON reader since we are not using the offsets data in the string column, and hence there is no resetting of the start state at every row offset.

Current status:

  • Semantic bracket/brace DFA
  • DFA removing excess characters after record in line
  • Pushdown automata generating tokens
  • Test passing arbitrary delimiter that does not occur in input to the reader

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 17, 2024
@shrshi shrshi added feature request New feature or request non-breaking Non-breaking change cuIO cuIO issue and removed libcudf Affects libcudf (C++/CUDA) code. labels Apr 17, 2024
Copy link
Contributor

@elstehle elstehle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two high-level comments. Trying to understand if we're currently doing more work than we would actually have to do.

cpp/src/io/json/nested_json_gpu.cu Outdated Show resolved Hide resolved
cpp/src/io/json/nested_json_gpu.cu Outdated Show resolved Hide resolved
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 23, 2024
Copy link
Contributor

@elstehle elstehle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be curious to hear your thoughts on leaving \n as the delimiter on the stack context array. To my understanding, that would make changes in the logical stack and - hopefully - the PDA-FST superfluous(?). Working on this you probably have better insight if that's the case.

Did you get a chance to think through whether we could just keep the original translation table and just translate delim to \n and keep feeding that to the logical stack.

cpp/src/io/json/nested_json_gpu.cu Outdated Show resolved Hide resolved
cpp/src/io/json/nested_json_gpu.cu Outdated Show resolved Hide resolved
@shrshi
Copy link
Contributor Author

shrshi commented Apr 29, 2024

If we translate delim to \n in the to_stack_op automata, I agree that we would not have to make changes to the logical stack. However if there are \n characters in the input JSON, then I think the PDA FST will need to recognize the context in which the newline occurs (from the logical stack?) and decide if it is has been translated from delim or not.
I'm not sure how involved the context detection logic would be in this case. What do you think?

I'd be curious to hear your thoughts on leaving \n as the delimiter on the stack context array. To my understanding, that would make changes in the logical stack and - hopefully - the PDA-FST superfluous(?). Working on this you probably have better insight if that's the case.

Did you get a chance to think through whether we could just keep the original translation table and just translate delim to \n and keep feeding that to the logical stack.

@elstehle
Copy link
Contributor

elstehle commented May 6, 2024

If we translate delim to \n in the to_stack_op automata, I agree that we would not have to make changes to the logical stack. However if there are \n characters in the input JSON, then I think the PDA FST will need to recognize the context in which the newline occurs (from the logical stack?) and decide if it is has been translated from delim or not. I'm not sure how involved the context detection logic would be in this case. What do you think?

I'd be curious to hear your thoughts on leaving \n as the delimiter on the stack context array. To my understanding, that would make changes in the logical stack and - hopefully - the PDA-FST superfluous(?). Working on this you probably have better insight if that's the case.

Did you get a chance to think through whether we could just keep the original translation table and just translate delim to \n and keep feeding that to the logical stack.

Sorry, missed your reply. To be clear, what I mean is to only translate the stack context (i.e., stack_symbols array in the get_token_stream function) from delim to \n and not modify the JSON input itself.

@vuule vuule self-requested a review May 17, 2024 17:13
cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
cpp/src/io/json/nested_json_gpu.cu Outdated Show resolved Hide resolved
@shrshi shrshi requested a review from vuule May 17, 2024 21:27
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you!
Last couple of small suggestions

cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
@shrshi shrshi requested a review from vuule May 17, 2024 21:53
Copy link
Contributor

@elstehle elstehle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's great, thanks for adding the validity checks on the delimiter option!

@vuule vuule added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels May 20, 2024
@shrshi
Copy link
Contributor Author

shrshi commented May 20, 2024

/merge

@rapids-bot rapids-bot bot merged commit 9ce1721 into rapidsai:branch-24.06 May 20, 2024
70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Status: Landed
Development

Successfully merging this pull request may close these issues.

None yet

5 participants