Add comma separated value parsing as an option in iterate_many #2016

yongxiangng · 2023-06-06T11:11:02Z

iterate_many can now parse comma-separated documents.

However, this mode does not support batched processing in chunks, because it is difficult to find the boundary between two documents efficiently when a comma is the delimiter. The batch size is therefore raised to the size of the JSON input that is passed in.
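To illustrate why a comma is an awkward delimiter, here is a minimal sketch, assuming a toy scanner that is not simdjson code (the function name `top_level_commas` is hypothetical). A comma only separates documents when it sits outside every array, object, and string, so deciding that for any given comma requires state carried from the start of the input. That is one plausible reason a batch cannot be cut at an arbitrary comma.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model (hypothetical, not simdjson code): return the byte offsets of the
// commas that separate top-level documents. Commas nested inside arrays,
// objects, or strings are skipped, which needs state tracked from offset 0.
std::vector<std::size_t> top_level_commas(const std::string &json) {
  std::vector<std::size_t> out;
  int depth = 0;           // current array/object nesting depth
  bool in_string = false;  // true while inside a JSON string
  bool escaped = false;    // true right after a backslash inside a string
  for (std::size_t i = 0; i < json.size(); ++i) {
    const char c = json[i];
    if (in_string) {
      if (escaped) { escaped = false; }
      else if (c == '\\') { escaped = true; }
      else if (c == '"') { in_string = false; }
      continue;
    }
    switch (c) {
      case '"': in_string = true; break;
      case '[': case '{': ++depth; break;
      case ']': case '}': --depth; break;
      case ',': if (depth == 0) { out.push_back(i); } break;
      default: break;
    }
  }
  return out;
}
```

In this sketch, only the commas at depth zero are reported; the ones inside `[1,2]` or inside a string value are ignored.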

It is the user's responsibility to create a parser with enough capacity to handle that minimum batch size; otherwise, a capacity error is returned. Because allow_comma_separated comes after batch_size in the parameter list and defaults to false, users who enable comma-separated parsing must set the batch size explicitly, which should prompt them to check that their parser's capacity is large enough for it.
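The rule described above can be sketched as follows. This is a toy model of the behaviour as described in this pull request, not simdjson's source: `toy_error`, `toy_default_batch_size`, `effective_batch_size`, and `toy_iterate_many` are all hypothetical names, and the default value is only a placeholder.

```cpp
#include <cstddef>

// Hypothetical names throughout; this models the description, not simdjson.
enum class toy_error { success, capacity };

// Placeholder default, chosen only for illustration.
constexpr std::size_t toy_default_batch_size = 1000000;

// Comma-separated mode cannot split the input into chunks, so the batch must
// cover the whole input; otherwise the requested batch size is kept.
std::size_t effective_batch_size(std::size_t len, std::size_t requested,
                                 bool allow_comma_separated) {
  if (allow_comma_separated && requested < len) { return len; }
  return requested;
}

// allow_comma_separated is the last parameter and defaults to false, so a
// caller who wants it must also spell out batch_size. The parser must be able
// to hold one full batch, or a capacity error results.
toy_error toy_iterate_many(std::size_t parser_capacity, std::size_t len,
                           std::size_t batch_size = toy_default_batch_size,
                           bool allow_comma_separated = false) {
  const std::size_t batch =
      effective_batch_size(len, batch_size, allow_comma_separated);
  return parser_capacity < batch ? toy_error::capacity : toy_error::success;
}
```

Under these assumptions, a 5000-byte input with a requested batch size of 1000 needs a parser capacity of at least 5000 in comma-separated mode, but only 1000 otherwise.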

This closes #1999

lemire · 2023-06-06T21:41:21Z

This looks very reasonable to me.

@jkeiser Do you want to have a look?

lemire · 2023-06-07T17:03:03Z

I am hoping that someone will help review this. I like it very much myself.

lemire · 2023-06-15T13:11:10Z

Ok. Merging.

yongxiangng · 2023-06-15T13:52:43Z

Thanks very much @lemire

lemire · 2023-06-15T16:24:05Z

@yongxiangng It has been released.

jkeiser · 2023-07-04T16:01:38Z

include/simdjson/generic/ondemand/document_stream-inl.h

@@ -290,6 +293,8 @@ inline void document_stream::next_document() noexcept {
  if (error) { return; }
  // Always set depth=1 at the start of document
  doc.iter._depth = 1;
+  // consume comma if comma separated is allowed
+  if (allow_comma_separated) { doc.iter.consume_character(','); }


I absolutely love that this is so confined to a single place! I'm a little surprised we couldn't use advance(), though. Switching from thinking in terms of tokens (of which , is one) to thinking in terms of characters does seem like a dangerous thing to me, even if it's confined to this one place where it works.

lemire · 2023-07-04

@jkeiser I merged this because, as you have remarked, it is a very neat patch that is quite isolated.

We can change the design.

Are you saying that we should just skip over the token, no matter what it is? Ignoring its nature?

Can you elaborate on your concern?

(I am 100% open to changing this.)
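The two options raised in this exchange, skipping the next token whatever it is versus skipping it only when it is a comma, can be contrasted with a toy iterator. This is a minimal sketch under stated assumptions: `toy_iter` is hypothetical and only borrows the names advance() and consume_character() from the discussion; it does not reproduce simdjson's json_iterator.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy token iterator (hypothetical, not simdjson's json_iterator): `tokens`
// holds the byte offset in `buf` at which each token starts.
struct toy_iter {
  std::string buf;
  std::vector<std::size_t> tokens;
  std::size_t pos;  // index of the next token to read

  toy_iter(std::string b, std::vector<std::size_t> t)
      : buf(std::move(b)), tokens(std::move(t)), pos(0) {}

  bool at_end() const { return pos >= tokens.size(); }

  // First byte of the next token; only valid when !at_end().
  char peek() const { return buf[tokens[pos]]; }

  // Token-oriented: skip the next token, whatever it is.
  void advance() { if (!at_end()) { ++pos; } }

  // Character-oriented: skip the next token only if it starts with `c`.
  void consume_character(char c) {
    if (!at_end() && peek() == c) { ++pos; }
  }
};
```

In this toy model, with the input `1 2` (no comma) and the iterator positioned after the first document, consume_character(',') leaves the second document in place, while an unconditional advance() would skip over it.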

Yong Xiang Ng added 5 commits June 6, 2023 13:25

Add comma separated value parsing

881aeca

Fix failing tests

5977d40

Make tests work for exceptions

9ca6ce8

Fix test

f378445

Fix try catch making test fail

a3342b2

yongxiangng mentioned this pull request Jun 6, 2023

Enhance iterate_many to enable parsing of comma separated documents #1999

Closed

lemire merged commit b399c01 into simdjson:master Jun 15, 2023
40 checks passed

FourierTransformer mentioned this pull request Jun 15, 2023

Version 3.2.0 FourierTransformer/lua-simdjson#56

Open

jkeiser reviewed Jul 4, 2023
