Enhance iterate_many to enable parsing of comma separated documents #1999
Comments
It is doable but would require some work. Note that to my knowledge, your proposed input does not follow any established standard. Typically, in a stream of JSON documents, you separate them by white space (e.g., line endings). Possibly related: #1356
I agree. Would you have any ideas for the work needed to modify it? However, searching from the front would incur a performance penalty, because we would be searching through the entire batch to find the last complete document, whereas, searching from the back, we might be able to terminate early without scanning the entire batch. What are your thoughts on this? Is this too much of a change to make? (I can do the changes, but it seems from the …)
Which is bad. You don't want to do that, for the reasons that you have outlined. If you know that you have a stream of objects (not arrays), then I can see how to fix it. But it gets complicated in the general case. I don't know how to do it right now.
Some food for thought: Perl's JSON::XS has a mode called incremental parsing with which it is possible to parse comma-separated documents, with some manual help: https://metacpan.org/pod/JSON::XS#EXAMPLES This could be adapted to simdjson's context if iterate_many gave back a character count that pointed to the end of the document after each successful parse, and you could increment this counter manually, to indicate that it should skip a part of the buffer before starting the next document. This mechanism could support multi-JSON documents or streams with arbitrary separators, not just commas.
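One way to picture this "parse one document, then manually skip the separator" mechanism is a small stand-alone splitter. This is not simdjson code: `split_comma_separated` is a hypothetical helper that walks a buffer once and cuts out each top-level document by bracket depth. It assumes every document is an array or object and that no bracket characters appear inside string literals; a real implementation would also have to handle strings and escapes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: extract top-level JSON documents from a buffer in
// which the documents are separated by commas and/or whitespace.
// Assumes each document is an array or object, and that string literals
// contain no bracket characters (a real parser must track strings too).
std::vector<std::string> split_comma_separated(const std::string& buf) {
    std::vector<std::string> docs;
    int depth = 0;
    std::size_t start = 0;
    for (std::size_t i = 0; i < buf.size(); ++i) {
        char c = buf[i];
        if (c == '[' || c == '{') {
            if (depth == 0) start = i;  // a new top-level document begins
            ++depth;
        } else if (c == ']' || c == '}') {
            --depth;
            if (depth == 0) {
                // Complete document; separators (commas, whitespace) between
                // documents are skipped implicitly, because nothing happens
                // until the next opening bracket at depth zero.
                docs.push_back(buf.substr(start, i - start + 1));
            }
        }
    }
    return docs;
}
```

The caller-side "skip the separator yourself" step from the JSON::XS example corresponds here to the loop simply ignoring bytes between a closing bracket at depth zero and the next opening bracket.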
It does. That's not the problem. The problem is that you don't want to parse a truncated document. The way it works underneath is that you give me 128 GB of data, and I might split it into chunks of 1 MB. I index the 1 MB. Then I start iterating through it: one document, two documents, and so forth. I stop at the last complete JSON document. When I get toward the end of the 1 MB, I try to load and index another 1 MB, and then I glue the two indexes together... so that I start the new block with a full document. And I resume. (Actually, the next 1 MB is processed in a separate thread while you are iterating through the first 1 MB, but you get the idea.) And on and on we go. Maybe I have … at the end of a window and at the start of the other one... I need to find the start of the last truncated document (if any). I don't want to start parsing a partially indexed document (here …). It happens that we can do that without a problem if you follow a standard such as jsonlines or ndjson... These are strongly self-synchronizing, meaning that if I give you a chunk of ndjson or jsonlines, as long as the chunk size is larger than any document (meaning that your chunk contains at least one full document), then you can identify the start and the end of the last complete JSON document by starting from the end (without having to process the whole stream). As far as I can tell, if you use a comma separator, the content is not strongly self-synchronizing, meaning that you may need to scan the whole index to find out where the last complete document is.
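The self-synchronizing property described here can be demonstrated in a few lines. In ndjson/jsonlines, a document never contains a raw newline, so the last complete document in a chunk ends at the last `'\n'`, which a backward scan finds without touching the rest of the buffer. This is a sketch with an illustrative function name, not simdjson's actual API:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch of why ndjson/jsonlines is self-synchronizing: documents never
// contain a raw newline, so the last complete document in a chunk ends at
// the last '\n'. A backward scan terminates as soon as it finds one,
// touching only the truncated tail of the chunk.
// Returns the offset one past the end of the last complete document,
// or 0 if the chunk contains no complete document at all.
std::size_t last_complete_ndjson_end(const std::string& chunk) {
    for (std::size_t i = chunk.size(); i > 0; --i) {
        if (chunk[i - 1] == '\n') return i;  // early exit from the back
    }
    return 0;  // chunk is smaller than one document: no newline found
}
```

With comma-separated documents there is no such byte that can only occur between documents, which is exactly why a backward scan can no longer terminate early.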
Thanks for the responses on this. The use case is not that we are trying to be non-standard. Rather, we are receiving large amounts of data, and the data is chunked. Most of the data is in an array and hence comma-separated. Being able to parse comma-separated documents is just a means to an end; we would like to be able to parse a truncated document so we can parse while the data is still being received. Do you know any way for us to achieve this?
You can switch your data source to ndjson or jsonlines, in which case it will work out of the box with simdjson. If you cannot or do not want to switch your input type, then it is not supported in simdjson. This means that you will need to implement the support, and ideally produce a pull request for us to process. It looks like you have already dug into our source code and know pretty well what work it entails. The simdjson library is a community-supported project: we build features with our users.
We've talked many times about supporting fully streamed parsing of a single document, but haven't gotten there. How much data is coming in this array? Like, how big is it actually? |
My suspicion is that it is not big but they want to process a truncated input. |
Sorry for the late reply; I can only check with the user after the weekend. I believe I heard it was a few KB, but perhaps it is also because they want to process a truncated input.
We would like to process the data as soon as the network packet arrives, even if it means the document might be partial. The size is ~16 KB. Is it possible to achieve this? Otherwise, I'm willing to help make the changes if I'm able to. I think it might be quite a big change, so I'll have to look more into the code and see whether the old issues are relevant. We just want to parse the data as soon as it comes in to reduce latency, and the overall size of the document is likely to be relatively small (so probably not related to #128).
Pull request invited. |
I wasn't able to come up with an elegant way to do truncated parsing. Instead, I created a PR (#2016) for parsing comma-separated documents; clients that don't use this option don't suffer much performance penalty (except one conditional when iterating to the next document in the document stream).
Feature request
Currently, `iterate_many` is able to parse JSON like

```cpp
auto json = R"([1,2,3] {"1":1,"2":3,"4":4} [1,2,3] )"_padded;
```

However, it would be nice to be able to parse JSON with documents separated by commas:

```cpp
auto json = R"([1,2,3] , {"1":1,"2":3,"4":4} , [1,2,3] )"_padded;
```
Possible implementation (only a small part)

I have looked into the code base a little, and I think we are able to perform the logic in the function `document_stream::next_document()` and consume any trailing commas after processing a particular document.

Challenges
However, running `stage1` poses a problem. We are no longer able to quickly identify the position of the last JSON document in a given batch. This is because `find_next_document_index` identifies the boundary between two documents by searching for the patterns

```
][  ]{  }[  }{
```

starting from the back of the batch. If we put a comma between these patterns, we can't distinguish whether they are comma-separated documents or elements in an array.

Hence, I'm not sure if such a feature is possible to implement. I am willing to write the implementation if we are agreeable with this feature and we have a solution to the `stage1` problem.
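The boundary scan described above can be sketched in isolation. This is a simplified illustration, not simdjson's actual `find_next_document_index` (which operates on the stage-1 structural index, so string contents are already excluded): scan from the back for a closing bracket followed, after optional whitespace, by an opening bracket, a pattern that cannot occur inside a single valid JSON document because array elements are always comma-separated.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Simplified sketch of a backward scan for a document boundary: the
// patterns ][  ]{  }[  }{ (with optional whitespace between the two
// brackets) can only occur *between* two top-level documents.
// Returns the index of the opening bracket that starts the last document,
// or npos if no boundary is found. Ignores string literals, which the
// real stage-1 structural index already filters out.
std::size_t last_document_boundary(const std::string& batch) {
    for (std::size_t i = batch.size(); i-- > 1; ) {
        char open = batch[i];
        if (open != '[' && open != '{') continue;
        std::size_t j = i;
        while (j > 0 && std::isspace(static_cast<unsigned char>(batch[j - 1]))) --j;
        if (j == 0) continue;
        char close = batch[j - 1];
        if (close == ']' || close == '}') return i;  // boundary found
    }
    return std::string::npos;
}
```

Adding a comma destroys this invariant: in `[1,2] , [3]` the sequence `] , [` between documents is byte-for-byte identical to the separator between two elements of a nested array such as `[[1,2] , [3]]`, so the scan can no longer decide locally, from the back, where the last document starts.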