New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add comma separated value parsing as an option in iterate_many #2016
Conversation
This looks very reasonable to me. @jkeiser Do you want to have a look? |
I am hoping that someone will help review this. I like it very much myself. |
Ok. Merging. |
Thanks very much @lemire |
@yongxiangng It has been released. |
@@ -290,6 +293,8 @@ inline void document_stream::next_document() noexcept { | |||
if (error) { return; } | |||
// Always set depth=1 at the start of document | |||
doc.iter._depth = 1; | |||
// consume comma if comma separated is allowed | |||
if (allow_comma_separated) { doc.iter.consume_character(','); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I absolutely love that this is so confined to a single place! I'm a little surprised we couldn't use advance(), though, switching from thinking in terms of tokens (which ,
is one of) to thinking of characters does seem like a dangerous thing to me, even if it's confined to this one place where it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkeiser I merged this because, as you have remarked, it is a very neat patch that is quite isolated.
We can change the design.
Are you saying that we should just skip over the token, no matter what it is? Ignoring its nature?
Can you elaborate on your concern?
(I am 100% open to changing this.)
iterate_many
can now parse comma separated documents.However, this mode will not support batched processing in chunks. This is because it is difficult to find the border between 2 documents efficiently when comma is used as the delimiter. Hence, the batch size is increased to be as large as the json passed in.
It is the user's responsibility to create a parser that has capacity large enough to handle the minimum batch size, failing which will return a capacity error. Because
allow_comma_separated
is a parameter afterbatch_size
and defaulted to false, users enabling comma separated parsing will have to explicitly set the batch size and thus would be conscious to check that their parser's capacity is large enough to handle the batch size.This closes #1999