add support for query language (e.g., a simple subset of JSONPath) #95
Comments
cc @TkTech |
This is a much bigger project than it sounds if you want comprehensive support. I'd suggest implementing support for simdjson as an extension to jq instead of implementing the query language in simdjson itself. That said, if you restrict the scope to simple filtering for navigation that is much simpler. I use a simple but inefficient approach in |
I think that is what I mean by "by re-using an existing framework (plugging simdjson into it)". I am betting that this can be done relatively well without too much work.
Right. Honestly, I was just thinking of filtering... basically "subset selection". I am not assuming that this is a small project. It could get a bit challenging, even with a restricted query language. |
If that's all you're looking for at the moment, that is something I can contribute. It would be nice to have the filtering language consistent between the various bindings/forks. |
I think that would be good. Even if it is minimalist, that would still be a huge step forward in some instances. |
I like the idea of a compatible subset. Is there anything stopping us from saying "ok, here's the bit of JSONPath we support"? I don't know whether it's feasible to combine doing things entirely from outside simdjson and to get performance that's representative of what it could be. Much of the work in stage 2 could be completely elided if we knew we only needed some of the input. What might be nice is to figure out how to define a subset of functionality for simdjson that supports outside projects without taking on all the trappings of JSONPath or some big complicated language - i.e. what additional API features would we have to expose? |
JSONPath is probably out-of-scope. You would build a JSONPath implementation on simdjson, but it's a lot to put into what should ideally be a pretty small, solid core. There are 2 main improvements that I can see as an end-user.
One is iterative parsing, which doesn't necessarily offer speed improvements except for possible early termination, but it does offer significantly better peak memory usage. It should probably work in batches, or it'll suffer from warmup penalties if the caller takes too long to continue to the next element. The query language won't help with this. (#31)
The second is to avoid validating and storing the bits we don't care about in the first place (which can also work with iterative parsing). This would certainly improve memory usage, but it risks being slower than simply parsing everything if we don't implement it properly or if we make the language too complex.
So I propose 4 simple OPs, same as pysimdjson.
A query string is decomposed into these simple operations using a trivial state machine. The parser grabs the next operation off the stack before continuing to parse the next element, so it can immediately discard the element if the types mismatch (e.g., it found the start of an object/dict but the query is looking for an array). Usage is simple in practice, and is similar to what
{
"hello": "world",
"list": [
1,
2,
3
],
"list.of.dicts": [
{"hello": "world"},
{"hello": "bob"}
]
}
Example queries for the above document:
These queries should be usable after the document has been parsed into |
@EgorBo @luizperes as folks that have built on top of simdjson already, your input would be appreciated :) |
Hi @TkTech,
grammar ::= '.' | '.' binding
binding ::= string | string '[' number* ']' | binding '.' 'binding' |
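As a rough illustration of how a grammar like the one above could be decomposed into navigation steps, here is a minimal Python sketch. It is not pysimdjson's or simdjson's implementation. The names `parse_query` and `run_query` are hypothetical, and it assumes that `string` is a bare key without dots or brackets and that each bracket holds a single non-negative index. It runs over an already-parsed document, so it shows only the type check at each step, not any saving in parsing work.

```python
import re

# Assumption: a step is '.' + bare key, optionally followed by one or more [n] indices.
_STEP = re.compile(r"\.([^.\[\]]+)((?:\[\d+\])*)")


def parse_query(query):
    """Decompose a query such as '.list[1]' into ('key', name) / ('index', n) steps."""
    if query == ".":
        return []  # a bare '.' selects the whole document
    if not query:
        raise ValueError("empty query")
    steps, pos = [], 0
    while pos < len(query):
        match = _STEP.match(query, pos)
        if match is None:
            raise ValueError(f"invalid query at offset {pos}: {query!r}")
        steps.append(("key", match.group(1)))
        for index in re.findall(r"\[(\d+)\]", match.group(2)):
            steps.append(("index", int(index)))
        pos = match.end()
    return steps


def run_query(doc, query):
    """Walk an already-parsed document, rejecting a step as soon as the types mismatch."""
    node = doc
    for kind, arg in parse_query(query):
        expected = dict if kind == "key" else list
        if not isinstance(node, expected):
            raise TypeError(f"{kind} step {arg!r} applied to {type(node).__name__}")
        node = node[arg]
    return node


doc = {
    "hello": "world",
    "list": [1, 2, 3],
    "list.of.dicts": [{"hello": "world"}, {"hello": "bob"}],
}
print(run_query(doc, ".hello"))    # world
print(run_query(doc, ".list[1]"))  # 2
```

Under these assumptions a key that itself contains dots, such as `list.of.dicts`, cannot be addressed; that would need some quoting rule the grammar above does not define.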
Wait? We already have a formal syntax? |
I'm not enthusiastic about this direction. I can potentially see all manner of helper libraries that sit outside of simdjson and help people query things. This is fine and I don't see any reason to stop this, but there's no reason that simdjson needs to change (I hope) to support this kind of use. What I'm enthusiastic about is the idea of people being able to control the simdjson parsing step with a query that selects the things that are needed and so reduces the parsing cost. This has potentially enormous performance benefits, as we can avoid materializing large quantities of the document. Even a straightforward suppression of stage 2 when not needed is a big deal, but beyond that, techniques to accurately search for keys or values (while keeping track of what level we're in, etc) have huge potential (e.g. 4-5x on our existing speeds). |
@geofflangdale This context is relevant: TkTech/pysimdjson#22 |
We're not going to get too many cracks at building a query language. I like the start that's made, here, but don't really want something that is only centered around putting a band-aid on the injury of Python object creation. I would like something that offers a bit more power and can be supported natively within simdjson. We could really make selective queries blaze along. Stage 2 is a huge millstone around our necks, and half of the reason we stopped optimizing Stage 1 after a point is that Stage 2 is so expensive. So having more opportunity to cut down on Stage 2 work would open up even more ability to go really fast. So think big! (and also small and tractable and implementable, please :-)). But I think a query language really needs some ability to search and pattern match as well (within reason). |
Maybe support for JSON Pointer would be a nice start (https://tools.ietf.org/html/rfc6901) |
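For readers unfamiliar with JSON Pointer, the following Python sketch shows how an RFC 6901 pointer is evaluated against an already-parsed document. It is a standalone illustration, not simdjson's implementation; `resolve_pointer` is a hypothetical name, and the `-` array token (which the RFC defines as referring to a nonexistent element) is simply rejected here.

```python
import re

# RFC 6901 array indices: "0", or digits without a leading zero.
_INDEX = re.compile(r"0|[1-9][0-9]*")


def resolve_pointer(doc, pointer):
    """Return the value that an RFC 6901 JSON Pointer refers to in `doc`."""
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"pointer must be empty or start with '/': {pointer!r}")
    node = doc
    for raw in pointer[1:].split("/"):
        # Unescape '~1' first, then '~0', as the RFC requires.
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(node, dict):
            node = node[token]
        elif isinstance(node, list):
            if _INDEX.fullmatch(token) is None:
                raise ValueError(f"invalid array index: {token!r}")
            node = node[int(token)]
        else:
            raise TypeError(f"cannot descend into {type(node).__name__} with {token!r}")
    return node


doc = {
    "hello": "world",
    "list": [1, 2, 3],
    "list.of.dicts": [{"hello": "world"}, {"hello": "bob"}],
}
print(resolve_pointer(doc, "/list/1"))                 # 2
print(resolve_pointer(doc, "/list.of.dicts/1/hello"))  # bob
```

Because `/` is the only separator, keys containing dots need no special treatment; only `~` and `/` inside a key have to be escaped (as `~0` and `~1`).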
@klon Cookies for you!!! Yes... yes... |
@ioioioio Can we add JSON Pointer to our 0.2 release target? (End of summer) |
Sure. It works on json_parser branch, it is just not yet as clean as I'd like it to be. |
Marked for inclusion in the next release. |
JSON Pointer support has been added by @ioioioio, it follows RFC6901. I believe that this should go a long way toward addressing part of this issue. Nevertheless, we need to leave it open because it seems like there are several different issues. Todo: we need to break this wide issue into separate components that can be addressed and closed. It is currently a bit too open-ended. |
Where can I find examples on how to navigate/query a |
JSON Pointer has been implemented and is in master, so I am going to close this issue. |
Yes, but where can I find examples on how to use it? Is there some tutorial (other than the "Navigating the parsed document" section in the readme) to learn how to get values out of a parsed json document? |
The section in question has been improved: https://github.com/lemire/simdjson#navigating-the-parsed-document Currently there is no tutorial, I will create a new issue. |
Hi. |
Currently, clients can navigate the parsed JSON, but there is no support for a query language.
This could be supported either by re-using an existing framework (plugging simdjson into it) or working from simdjson itself.