
Enhance iterate_many to enable parsing of comma separated documents #1999

Closed
yongxiangng opened this issue May 12, 2023 · 13 comments · Fixed by #2016

@yongxiangng
Contributor

Feature request

Currently, iterate_many is able to parse whitespace-separated JSON documents such as auto json = R"([1,2,3] {"1":1,"2":3,"4":4} [1,2,3] )"_padded;
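For reference, a minimal sketch of how such a stream is consumed with the On Demand document_stream API (error handling kept to the essentials):

```cpp
#include <iostream>
#include "simdjson.h"

using namespace simdjson;

int main() {
  // Whitespace-separated documents: what iterate_many accepts today.
  auto json = R"([1,2,3] {"1":1,"2":3,"4":4} [1,2,3] )"_padded;

  ondemand::parser parser;
  ondemand::document_stream stream;
  auto error = parser.iterate_many(json).get(stream);
  if (error) { std::cerr << error << std::endl; return 1; }

  size_t count = 0;
  for (auto doc : stream) {
    if (doc.error()) { std::cerr << doc.error() << std::endl; break; }
    ++count; // each iteration yields one top-level document
  }
  std::cout << "parsed " << count << " documents" << std::endl;
}
```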

However, it would be nice to be able to parse documents separated by commas, e.g. auto json = R"([1,2,3] , {"1":1,"2":3,"4":4} , [1,2,3] )"_padded;

Possible implementation (only a small part)

I have looked into the code base a little, and I think we could perform the logic in document_stream::next_document() and consume any trailing comma after processing a particular document.

Challenges

However, running stage 1 poses a problem. We are no longer able to quickly identify the position of the last JSON document in a given batch. This is because find_next_document_index identifies the boundary between two documents by searching for the ][ ]{ }[ }{ patterns starting from the back of the batch. If we put a comma between these patterns, we can't distinguish whether they are comma-separated documents or elements inside an array.
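As a rough illustration of why, here is a simplified, character-level stand-in for that backward scan (the real find_next_document_index works on the structural index produced by stage 1, not on raw bytes):

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Illustration only: find the last ][ ]{ }[ }{ boundary, ignoring whitespace,
// scanning from the back of the batch. With plain whitespace separation this
// pinpoints where the last document starts. Insert a comma between documents
// and "] , {" becomes indistinguishable from two elements of an enclosing
// array without scanning from the front.
std::size_t last_document_boundary(std::string_view buf) {
  char next = '\0'; // most recent non-whitespace character seen while scanning back
  for (std::size_t i = buf.size(); i-- > 0;) {
    const char c = buf[i];
    if (std::isspace(static_cast<unsigned char>(c))) continue;
    if ((c == ']' || c == '}') && (next == '[' || next == '{')) {
      return i + 1; // the last document starts at the following '[' or '{'
    }
    next = c;
  }
  return 0; // no boundary found in this batch
}
```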

Hence, I'm not sure whether such a feature is possible to implement. I am willing to write the implementation if we are agreeable to this feature and have a solution to the stage 1 problem.

@lemire
Member

lemire commented May 12, 2023

It is doable but would require some work. Note that to my knowledge, your proposed input does not follow any established standard. Typically, in a stream of JSON documents, you separate them by white space (e.g., line endings).

Possibly related: #1356

@yongxiangng
Contributor Author

yongxiangng commented May 12, 2023

I agree.

Would you have any ideas for the work needed to modify find_next_document_index? Currently it is searching for the last complete document from the end. I don't think this approach will be viable if we allow comma-separated documents. We will have to search from the beginning until the last complete document.

However, searching from the front will incur a performance penalty, because we will be searching through the entire batch to find the last complete document. Whereas, searching from the back, we might be able to terminate early without having to search through the entire batch.

What are your thoughts on this? Is this too much of a change to make (I can do the changes, but it seems from iterate_many.md that this "reading from the back" algorithm is novel and crucial)? Is there any other way around this issue?

@lemire
Member

lemire commented May 12, 2023

We will have to search from the beginning, until the last complete document.

Which is bad. You don't want to do that for the reasons that you have outlined.

If you know that you have a stream of objects (not arrays), then I can see how to fix it. But it gets complicated in the general case. I don't know how to do it right now.

@pjuhasz
Contributor

pjuhasz commented May 12, 2023

Some food for thought: Perl's JSON::XS has a mode called incremental parsing with which it is possible to parse comma separated documents, with some manual help: https://metacpan.org/pod/JSON::XS#EXAMPLES

This could be adapted to simdjson's context if iterate_many gave back a character count that pointed to the end of the document after each successful parse, and you could increment this counter manually, to indicate that it should skip a part of the buffer before starting the next document. This mechanism could support multi-JSON documents or streams with arbitrary separators, not just commas.
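A self-contained sketch of that style of caller-driven incremental handling (not simdjson code; the boundary helper below is a naive stand-in used only for illustration):

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <string_view>

// Naive helper: return the offset one past the end of the first complete
// top-level array/object in buf, or npos if it is truncated. A real parser
// would learn this as a by-product of parsing.
std::size_t end_of_first_document(std::string_view buf) {
  int depth = 0;
  bool in_string = false, escaped = false;
  for (std::size_t i = 0; i < buf.size(); ++i) {
    const char c = buf[i];
    if (in_string) {
      if (escaped) escaped = false;
      else if (c == '\\') escaped = true;
      else if (c == '"') in_string = false;
      continue;
    }
    if (c == '"') in_string = true;
    else if (c == '[' || c == '{') ++depth;
    else if ((c == ']' || c == '}') && --depth == 0) return i + 1;
  }
  return std::string_view::npos; // truncated: wait for more data
}

int main() {
  const std::string input = R"([1,2,3] , {"1":1,"2":3,"4":4} , [1,2,3] )";
  std::size_t offset = 0;
  while (offset < input.size()) {
    const std::size_t len = end_of_first_document(std::string_view(input).substr(offset));
    if (len == std::string_view::npos) break; // incomplete document at the tail
    std::cout << "document: " << input.substr(offset, len) << '\n';
    offset += len;
    // Caller-controlled skip: step over whitespace and the comma separator.
    while (offset < input.size() && (input[offset] == ' ' || input[offset] == ',')) ++offset;
  }
}
```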

@lemire
Member

lemire commented May 12, 2023

This could be adapted to simdjson's context if iterate_many gave back a character count that pointed to the end of the document after each successful parse

It does. That's not the problem. The problem is that you don't want to parse a truncated document.

The way it works underneath is that you give me 128 GB of data, I might split it into chunks of 1 MB. I index the 1 MB. Then I start iterating through it. One document. Two documents. And so forth. I stop at the last complete JSON document.

When I get toward the end of the 1 MB, I try to load and index another 1 MB, and then I glue the two indexes... so that I start the new block with a full document. And I resume...

(Actually, the next 1 MB is processed in a separate thread while you are iterating through the first 1 MB, but you get the idea.)

And on and on we go.

Maybe I have...

... {"1":1,"2":3,"4

at the end of a window and

":4} ...

at the start of the other one... I need to find the start of the last truncated document (if any).

I don't want to start parsing a partially indexed document (here {"1":1,"2":3,"4) as you would get garbage. You'd be missing part of the document.

It happens that we can do that without a problem if you follow a standard such as jsonlines or ndjson...

These are strongly self-synchronizing, meaning that if I give you a chunk of ndjson or jsonlines, as long as the chunk size is larger than any document (meaning that your chunk contains at least one full document), then you can identify the start and the end of the last complete JSON document by starting from the end (without having to process the whole stream).

As far as I can tell, if you use a comma separator, the content is not strongly self-synchronizing, meaning that you may need to scan the whole index to find out where the last complete document is.
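For the self-synchronizing case this already works today; here is a minimal sketch with a truncated ndjson tail, where truncated_bytes() reports how many bytes at the end of the input were not consumed as complete documents:

```cpp
#include <iostream>
#include "simdjson.h"

using namespace simdjson;

int main() {
  // ndjson-style input whose last document is cut off mid-way.
  auto json = R"({"a":1}
{"b":2}
{"c":3,)"_padded;

  ondemand::parser parser;
  ondemand::document_stream stream;
  auto error = parser.iterate_many(json).get(stream);
  if (error) { std::cerr << error << std::endl; return 1; }

  size_t complete = 0;
  for (auto doc : stream) {
    if (doc.error()) break; // stop at the first incomplete/garbled document
    ++complete;
  }
  std::cout << complete << " complete documents, "
            << stream.truncated_bytes() << " truncated bytes at the end" << std::endl;
}
```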

@yongxiangng
Contributor Author

Thanks for the responses on this.

The use case for this is not because we are trying to be non-standard. Rather, we are receiving large amounts of data, and the data is chunked. Most of the data is in an array and hence comma-separated.

Being able to parse comma-separated documents is just a means to an end: in our use case, we would like to be able to parse a truncated document so we can parse while the data is still being received.

Do you know any way for us to achieve this?

@lemire
Member

lemire commented Jun 1, 2023

@yongxiangng

Do you know any way for us to achieve this?

Either you switch your data source to ndjson or jsonlines, in which case it will work out of the box with simdjson. If you cannot or do not want to switch your input type, then it is not supported in simdjson. This means that you will need to implement the support, and ideally produce a pull request for us to process. It looks like you already dug into our source code and know pretty well what work it entails.

The simdjson library is a community supported project: we build features with our users.

@jkeiser
Member

jkeiser commented Jun 2, 2023

We've talked many times about supporting fully streamed parsing of a single document, but haven't gotten there.

How much data is coming in this array? Like, how big is it actually?

@lemire
Member

lemire commented Jun 2, 2023

How much data is coming in this array? Like, how big is it actually?

My suspicion is that it is not big but they want to process a truncated input.

@yongxiangng
Contributor Author

Sorry for the late reply.

I can only check with the user after the weekend. I believe I heard it was a few KB, but perhaps it is also because they want to process a truncated input.

@yongxiangng
Contributor Author

yongxiangng commented Jun 5, 2023

We would like to process the data as soon as the network packet arrives, even if it means the document might be partial. The size is ~16 KB. Is it possible to achieve this?

Otherwise, I'm willing to help make the changes if I'm able to. I think it might be quite a big change, so I'll have to look more into the code and see if the old issues are relevant.

I think we just want to parse the data as soon as it comes to reduce latency, and the overall size of the document is likely to be relatively small (so probably not related to #128).

@lemire
Member

lemire commented Jun 5, 2023

Pull request invited.

@yongxiangng
Contributor Author

I wasn't able to come up with an elegant way to do truncated parsing. Instead, I created a PR (#2016) for parsing comma-separated documents; clients that don't use this option suffer essentially no performance penalty (just one extra conditional when iterating to the next document in the document stream).
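For completeness, a rough sketch of what usage might look like, assuming the option lands as an extra boolean argument to iterate_many (see #2016 and iterate_many.md for the actual interface):

```cpp
#include <iostream>
#include "simdjson.h"

using namespace simdjson;

int main() {
  auto json = R"([1,2,3] , {"1":1,"2":3,"4":4} , [1,2,3] )"_padded;

  ondemand::parser parser;
  ondemand::document_stream stream;
  // Assumed interface: explicit batch size plus an allow-comma-separated flag.
  auto error = parser.iterate_many(json, json.size(), true).get(stream);
  if (error) { std::cerr << error << std::endl; return 1; }

  size_t count = 0;
  for (auto doc : stream) {
    if (doc.error()) { std::cerr << doc.error() << std::endl; break; }
    ++count;
  }
  std::cout << "parsed " << count << " documents" << std::endl;
}
```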
