Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add comma separated value parsing as an option in iterate_many #2016

Merged
merged 5 commits into from Jun 15, 2023
Merged

Add comma separated value parsing as an option in iterate_many #2016

merged 5 commits into from Jun 15, 2023

Conversation

yongxiangng
Copy link
Contributor

iterate_many can now parse comma separated documents.

However, this mode will not support batched processing in chunks. This is because it is difficult to find the border between 2 documents efficiently when comma is used as the delimiter. Hence, the batch size is increased to be as large as the json passed in.

It is the user's responsibility to create a parser that has capacity large enough to handle the minimum batch size, failing which will return a capacity error. Because allow_comma_separated is a parameter after batch_size and defaulted to false, users enabling comma separated parsing will have to explicitly set the batch size and thus would be conscious to check that their parser's capacity is large enough to handle the batch size.

This closes #1999

@lemire
Copy link
Member

lemire commented Jun 6, 2023

This looks very reasonable to me.

@jkeiser Do you want to have a look?

@lemire
Copy link
Member

lemire commented Jun 7, 2023

I am hoping that someone will help review this. I like it very much myself.

@lemire
Copy link
Member

lemire commented Jun 15, 2023

Ok. Merging.

@lemire lemire merged commit b399c01 into simdjson:master Jun 15, 2023
40 checks passed
@yongxiangng
Copy link
Contributor Author

Thanks very much @lemire

@lemire
Copy link
Member

lemire commented Jun 15, 2023

@yongxiangng It has been released.

@@ -290,6 +293,8 @@ inline void document_stream::next_document() noexcept {
if (error) { return; }
// Always set depth=1 at the start of document
doc.iter._depth = 1;
// consume comma if comma separated is allowed
if (allow_comma_separated) { doc.iter.consume_character(','); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I absolutely love that this is so confined to a single place! I'm a little surprised we couldn't use advance(), though, switching from thinking in terms of tokens (which , is one of) to thinking of characters does seem like a dangerous thing to me, even if it's confined to this one place where it works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkeiser I merged this because, as you have remarked, it is a very neat patch that is quite isolated.

We can change the design.

Are you saying that we should just skip over the token, no matter what it is? Ignoring its nature?

Can you elaborate on your concern?

(I am 100% open to changing this.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhance iterate_many to enable parsing of comma separated documents
3 participants