
Enhance iterate_many to enable parsing of comma separated documents #1999

Closed
yongxiangng opened this issue May 12, 2023 · 13 comments · Fixed by #2016

@yongxiangng
Contributor

Feature request

Currently, iterate_many is able to parse whitespace-separated JSON documents such as auto json = R"([1,2,3] {"1":1,"2":3,"4":4} [1,2,3] )"_padded;
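For reference, a minimal sketch of how such a stream is consumed with the On Demand document_stream API (error handling kept to the essentials):

```cpp
#include <iostream>
#include "simdjson.h"

using namespace simdjson;

int main() {
  // Whitespace-separated documents: what iterate_many accepts today.
  auto json = R"([1,2,3] {"1":1,"2":3,"4":4} [1,2,3] )"_padded;

  ondemand::parser parser;
  ondemand::document_stream stream;
  auto error = parser.iterate_many(json).get(stream);
  if (error) { std::cerr << error << std::endl; return 1; }

  size_t count = 0;
  for (auto doc : stream) {
    if (doc.error()) { std::cerr << doc.error() << std::endl; break; }
    ++count; // each iteration yields one top-level document
  }
  std::cout << "parsed " << count << " documents" << std::endl;
}
```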

However, it would be nice to be able to parse documents separated by commas, e.g. auto json = R"([1,2,3] , {"1":1,"2":3,"4":4} , [1,2,3] )"_padded;

Possible implementation (only a small part)

I have looked into the code base a little, and I think we could perform the logic in document_stream::next_document() and consume any trailing comma after processing a particular document.

Challenges

However, running stage 1 poses a problem. We are no longer able to quickly identify the position of the last JSON document in a given batch. This is because find_next_document_index identifies the boundary between two documents by searching for the ][ ]{ }[ }{ patterns starting from the back of the batch. If we put a comma between these patterns, we can't distinguish whether they are comma-separated documents or elements inside an array.
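As a rough illustration of why, here is a simplified, character-level stand-in for that backward scan (the real find_next_document_index works on the structural index produced by stage 1, not on raw bytes):

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Illustration only: find the last ][ ]{ }[ }{ boundary, ignoring whitespace,
// scanning from the back of the batch. With plain whitespace separation this
// pinpoints where the last document starts. Insert a comma between documents
// and "] , {" becomes indistinguishable from two elements of an enclosing
// array without scanning from the front.
std::size_t last_document_boundary(std::string_view buf) {
  char next = '\0'; // most recent non-whitespace character seen while scanning back
  for (std::size_t i = buf.size(); i-- > 0;) {
    const char c = buf[i];
    if (std::isspace(static_cast<unsigned char>(c))) continue;
    if ((c == ']' || c == '}') && (next == '[' || next == '{')) {
      return i + 1; // the last document starts at the following '[' or '{'
    }
    next = c;
  }
  return 0; // no boundary found in this batch
}
```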

Hence, I'm not sure whether such a feature is possible to implement. I am willing to write the implementation if we are agreeable to this feature and have a solution to the stage 1 problem.

@lemire
Member

lemire commented May 12, 2023

It is doable but would require some work. Note that to my knowledge, your proposed input does not follow any established standard. Typically, in a stream of JSON documents, you separate them by white space (e.g., line endings).

Possibly related: #1356

@yongxiangng
Contributor Author

yongxiangng commented May 12, 2023

I agree.

Would you have any ideas for the work needed to modify find_next_document_index? Currently it is searching for the last complete document from the end. I don't think this approach will be viable if we allow comma-separated documents. We will have to search from the beginning until the last complete document.

However, searching from the front will incur a performance penalty, because we will be searching through the entire batch to find the last complete document. Whereas, searching from the back, we might be able to terminate early without having to search through the entire batch.

What are your thoughts on this? Is this too much of a change to make (I can do the changes, but it seems from iterate_many.md that this "reading from the back" algorithm is novel and crucial)? Is there any other way around this issue?

@lemire
Member

lemire commented May 12, 2023

We will have to search from the beginning, until the last complete document.

Which is bad. You don't want to do that for the reasons that you have outlined.

If you know that you have a stream of objects (not arrays), then I can see how to fix it. But it gets complicated in the general case. I don't know how to do it right now.

@pjuhasz
Contributor

pjuhasz commented May 12, 2023

Some food for thought: Perl's JSON::XS has a mode called incremental parsing with which it is possible to parse comma separated documents, with some manual help: https://metacpan.org/pod/JSON::XS#EXAMPLES

This could be adapted to simdjson's context if iterate_many gave back a character count that pointed to the end of the document after each successful parse, and you could increment this counter manually, to indicate that it should skip a part of the buffer before starting the next document. This mechanism could support multi-JSON documents or streams with arbitrary separators, not just commas.
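A self-contained sketch of that style of caller-driven incremental handling (not simdjson code; the boundary helper below is a naive stand-in used only for illustration):

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <string_view>

// Naive helper: return the offset one past the end of the first complete
// top-level array/object in buf, or npos if it is truncated. A real parser
// would learn this as a by-product of parsing.
std::size_t end_of_first_document(std::string_view buf) {
  int depth = 0;
  bool in_string = false, escaped = false;
  for (std::size_t i = 0; i < buf.size(); ++i) {
    const char c = buf[i];
    if (in_string) {
      if (escaped) escaped = false;
      else if (c == '\\') escaped = true;
      else if (c == '"') in_string = false;
      continue;
    }
    if (c == '"') in_string = true;
    else if (c == '[' || c == '{') ++depth;
    else if ((c == ']' || c == '}') && --depth == 0) return i + 1;
  }
  return std::string_view::npos; // truncated: wait for more data
}

int main() {
  const std::string input = R"([1,2,3] , {"1":1,"2":3,"4":4} , [1,2,3] )";
  std::size_t offset = 0;
  while (offset < input.size()) {
    const std::size_t len = end_of_first_document(std::string_view(input).substr(offset));
    if (len == std::string_view::npos) break; // incomplete document at the tail
    std::cout << "document: " << input.substr(offset, len) << '\n';
    offset += len;
    // Caller-controlled skip: step over whitespace and the comma separator.
    while (offset < input.size() && (input[offset] == ' ' || input[offset] == ',')) ++offset;
  }
}
```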

@lemire
Member

lemire commented May 12, 2023

This could be adapted to simdjson's context if iterate_many gave back a character count that pointed to the end of the document after each successful parse

It does. That's not the problem. The problem is that you don't want to parse a truncated document.

The way it works underneath is that you give me 128 GB of data, I might split it into chunks of 1 MB. I index the 1 MB. Then I start iterating through it. One document. Two documents. And so forth. I stop at the last complete JSON document.

When I get toward the end of the 1 MB, I try to load and index another 1 MB, and then I glue the two indexes... so that I start the new block with a full document. And I resume...

(Actually, the next 1 MB is processed in a separate thread while you are iterating through the first 1 MB, but you get the idea.)

And on and on we go.

Maybe I have...

... {"1":1,"2":3,"4

at the end of a window and

":4} ...

at the start of the other one... I need to find the start of the last truncated document (if any).

I don't want to start parsing a partially indexed document (here {"1":1,"2":3,"4) as you would get garbage. You'd be missing part of the document.

It happens that we can do that without a problem if you follow a standard such as jsonlines or ndjson...

These are strongly self-synchronizing, meaning that if I give you a chunk of ndjson or jsonlines, as long as the chunk size is larger than any document (meaning that your chunk contains at least one full document), then you can identify the start and the end of the last complete JSON document by starting from the end (without having to process the whole stream).

As far as I can tell, if you use a comma separator, the content is not strongly self-synchronizing, meaning that you may need to scan the whole index to find out where the last complete document is.
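For the self-synchronizing case this already works today; here is a minimal sketch with a truncated ndjson tail, where truncated_bytes() reports how many bytes at the end of the input were not consumed as complete documents:

```cpp
#include <iostream>
#include "simdjson.h"

using namespace simdjson;

int main() {
  // ndjson-style input whose last document is cut off mid-way.
  auto json = R"({"a":1}
{"b":2}
{"c":3,)"_padded;

  ondemand::parser parser;
  ondemand::document_stream stream;
  auto error = parser.iterate_many(json).get(stream);
  if (error) { std::cerr << error << std::endl; return 1; }

  size_t complete = 0;
  for (auto doc : stream) {
    if (doc.error()) break; // stop at the first incomplete/garbled document
    ++complete;
  }
  std::cout << complete << " complete documents, "
            << stream.truncated_bytes() << " truncated bytes at the end" << std::endl;
}
```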

@yongxiangng
Contributor Author

Thanks for the responses on this.

The use case for this is not because we are trying to be non-standard. Rather, we are receiving large amounts of data, and the data is chunked. Most of the data is in an array and hence comma-separated.

Being able to parse comma-separated documents is just a means to an end: in our use case, we would like to be able to parse a truncated document so we can parse while the data is still being received.

Do you know any way for us to achieve this?

@lemire
Member

lemire commented Jun 1, 2023

@yongxiangng

Do you know any way for us to achieve this?

Either you switch your data source to ndjson or jsonlines, in which case it will work out of the box with simdjson. If you cannot or do not want to switch your input type, then it is not supported in simdjson. This means that you will need to implement the support, and ideally produce a pull request for us to process. It looks like you already dug into our source code and know pretty well what work it entails.

The simdjson library is a community supported project: we build features with our users.

@jkeiser
Member

jkeiser commented Jun 2, 2023

We've talked many times about supporting fully streamed parsing of a single document, but haven't gotten there.

How much data is coming in this array? Like, how big is it actually?

@lemire
Member

lemire commented Jun 2, 2023

How much data is coming in this array? Like, how big is it actually?

My suspicion is that it is not big but they want to process a truncated input.

@yongxiangng
Contributor Author

Sorry for the late reply.

I can only check with the user after the weekend. I believe I heard it was a few KB, but perhaps it is also because they want to process a truncated input.

@yongxiangng
Contributor Author

yongxiangng commented Jun 5, 2023

We would like to process the data as soon as the network packet arrives, even if it means the document might be partial. The size is ~16 KB. Is it possible to achieve this?

Otherwise, I'm willing to help make the changes if I'm able to. I think it might be quite a big change, so I'll have to look more into the code and see if the old issues are relevant.

I think we just want to parse the data as soon as it comes to reduce latency, and the overall size of the document is likely to be relatively small (so probably not related to #128).

@lemire
Member

lemire commented Jun 5, 2023

Pull request invited.

@yongxiangng
Contributor Author

I wasn't able to come up with an elegant way to do truncated parsing. Instead, I created a PR (#2016) for parsing comma-separated documents; clients that don't use this option suffer essentially no performance penalty (just one extra conditional when iterating to the next document in the document stream).
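For completeness, a rough sketch of what usage might look like, assuming the option lands as an extra boolean argument to iterate_many (see #2016 and iterate_many.md for the actual interface):

```cpp
#include <iostream>
#include "simdjson.h"

using namespace simdjson;

int main() {
  auto json = R"([1,2,3] , {"1":1,"2":3,"4":4} , [1,2,3] )"_padded;

  ondemand::parser parser;
  ondemand::document_stream stream;
  // Assumed interface: explicit batch size plus an allow-comma-separated flag.
  auto error = parser.iterate_many(json, json.size(), true).get(stream);
  if (error) { std::cerr << error << std::endl; return 1; }

  size_t count = 0;
  for (auto doc : stream) {
    if (doc.error()) { std::cerr << doc.error() << std::endl; break; }
    ++count;
  }
  std::cout << "parsed " << count << " documents" << std::endl;
}
```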
