Reading multi-line JSON in string columns using runtime configurable delimiter #15556
Two high-level comments. Trying to understand if we're currently doing more work than we would actually have to do.
I'd be curious to hear your thoughts on leaving `\n` as the delimiter on the stack context array. To my understanding, that would make changes in the logical stack and - hopefully - the PDA-FST superfluous(?). Having worked on this, you probably have better insight into whether that's the case.
Did you get a chance to think through whether we could just keep the original translation table, translate `delim` to `\n`, and keep feeding that to the logical stack?
If we translate
|
Sorry, missed your reply. To be clear, what I mean is to only translate the stack context (i.e., |
Looks great, thank you!
Last couple of small suggestions
That's great, thanks for adding the validity checks on the delimiter option!
/merge
Description
Addresses #15277
Given a JSON lines buffer with records separated by a delimiter passed at runtime, the idea is to modify the JSON tokenization FST so that the passed delimiter generates the EOL token, instead of the currently hard-coded newline character.
This PR does not modify the whitespace normalization FST to strip out unquoted `\n` and `\r`. Whitespace normalization will be handled in follow-up work.
Note that this is not a multi-object JSON reader since we are not using the offsets data in the string column, and hence there is no resetting of the start state at every row offset.
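To make the idea of a runtime delimiter concrete, here is a minimal CPU-side sketch in plain Python. It is not the libcudf FST and uses no cudf API; the function name `split_records` and the single-character restriction are assumptions made for illustration. It also assumes that a delimiter appearing inside a quoted string does not end a record, which may differ from the tokenizer's actual behavior.

```python
import json
from typing import List


def split_records(buffer: str, delimiter: str = "\n") -> List[str]:
    """Split a JSON Lines style buffer on an unquoted, runtime-chosen delimiter.

    Hypothetical helper for illustration only; it mimics the idea of emitting
    an end-of-record marker at `delimiter` instead of a hard-coded newline.
    """
    if len(delimiter) != 1:
        # Assumed restriction: this sketch only handles one-character delimiters.
        raise ValueError("delimiter must be a single character")

    records = []
    start = 0
    in_string = False  # True while scanning inside a quoted JSON string
    escaped = False    # True when the previous character was a backslash

    for i, ch in enumerate(buffer):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == delimiter:
            # Unquoted delimiter: close the current record (skip empty ones).
            if i > start:
                records.append(buffer[start:i])
            start = i + 1

    # Keep a trailing record that is not followed by a delimiter.
    if start < len(buffer):
        records.append(buffer[start:])
    return records


# Records span multiple lines, so '\n' cannot serve as the record separator.
example = '{"a": 1,\n "b": "x|y"}|{"a": 2,\n "b": "z"}'
parsed = [json.loads(r) for r in split_records(example, delimiter="|")]
```

With `|` as the delimiter, the newlines inside each record are treated as ordinary whitespace by `json.loads`, which is the multi-line case a hard-coded `\n` separator cannot represent.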