Reading multi-line JSON in string columns using runtime configurable delimiter (#15556)

Addresses #15277

Given a JSON lines buffer with records separated by a delimiter passed at runtime, the idea is to modify the JSON tokenization FST so that the passed delimiter, rather than the currently hard-coded newline character, generates the EOL token.

This PR does not modify the whitespace normalization FST to [strip out unquoted `\n` and `\r`](#14865 (comment)). Whitespace normalization will be handled in follow-up work.

Note that this is not a multi-object JSON reader: since we are not using the offsets data in the string column, the start state is not reset at every row offset.

Current status:
- [x] Semantic bracket/brace DFA
- [x] DFA removing excess characters after record in line
- [x] Pushdown automata generating tokens
- [x] Test passing arbitrary delimiter that does not occur in input to the reader

Authors:
- Shruti Shivakumar (https://github.com/shrshi)

Approvers:
- Paul Mattione (https://github.com/pmattione-nvidia)
- Vukasin Milovanovic (https://github.com/vuule)
- Elias Stehle (https://github.com/elstehle)
- Karthikeyan (https://github.com/karthikeyann)

URL: #15556
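
To illustrate the idea of a runtime delimiter replacing the hard-coded newline as the record separator, here is a minimal CPU-side sketch in Python. It is not the FST from this PR: `split_records` and `read_json_lines` are hypothetical helpers, and the choice to treat a delimiter inside a quoted string as ordinary data is an assumption of this sketch, not something the commit message states.

```python
import json


def split_records(buffer: str, delimiter: str = "\n") -> list[str]:
    """Split a JSON lines buffer into records at a single-character delimiter.

    Hypothetical helper for illustration only. Assumption: a delimiter that
    appears inside a quoted JSON string is data, not a record boundary.
    """
    if len(delimiter) != 1:
        raise ValueError("delimiter must be a single character")

    records = []
    current = []
    in_string = False  # currently inside a quoted JSON string
    escaped = False    # previous character was a backslash inside a string

    for ch in buffer:
        if in_string:
            current.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == delimiter:
            # End of record: the role the EOL token plays in the tokenizer.
            record = "".join(current).strip()
            if record:
                records.append(record)
            current = []
        else:
            if ch == '"':
                in_string = True
            current.append(ch)

    # Flush a trailing record that is not followed by a delimiter.
    tail = "".join(current).strip()
    if tail:
        records.append(tail)
    return records


def read_json_lines(buffer: str, delimiter: str = "\n") -> list:
    """Parse every delimited record with the standard library JSON parser."""
    return [json.loads(record) for record in split_records(buffer, delimiter)]
```

With `delimiter="|"`, a buffer such as `'{"a": 1,\n "b": "x|y"}|{"a": 2,\n "b": "z"}|'` splits into two records even though each record spans several lines, because the newline is no longer the separator.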
Showing 5 changed files with 476 additions and 149 deletions.