SUGGESTION + BUG: address multiple JSON input files/streams separately #64

Closed
ghost opened this issue Jan 4, 2013 · 6 comments

ghost commented Jan 4, 2013

Maybe addressing multiple JSON documents separately is outside the scope of the "stream processing" philosophy, however:

  1. it's useful to be able to combine JSON from different web-APIs, e.g. extract data from this one, combine it with that, then append to a JSON list hosted over there. This use-case is only likely to become more common over time (arguably, it's the future of programming).
  2. sed and awk both have script commands to read and write specific files... so it is at least aligned with the concept of "sed for JSON", if not "stream processing".

Maybe there's a way/idiom to do this already?

example of my present usage:

~$ h_url="http://www.bom.gov.au/fwo/IDV60901/IDV60901.94868.json"
~$ alias h='curl -s $h_url | jq -r ".observations.data[0].air_temp" '
~$ echo [] > t

~$ cat t | jq ". + [\"$(h)\"]" > t2; mv t2 t; cat t
[
  "37.5"
]

I'm using jq twice: once to extract air_temp from one JSON stream, then again to append it to another. It would be nice to combine them. I'm not sure of the syntax, but maybe something sed-like: r myfilename... (even nicer, get myurl). This would have the same syntactic role and semantic effect as object/array construction, could be assigned to variables, etc.

example of proposed usage:

~$ cat t | jq ". + [get $h_url.observations.data[0].air_temp]" > t2; mv t2 t; cat t
[
  "37.5"
]

BUG:
Actually, I see now that I could just use $(curl -s $h_url) to combine them. However, this hits what seems to be a bug (an overflow, maybe, because the constructed object is too big? The same data is fine when streamed). I'm using jq version 1.2:

~$ cat t | jq ". + [$(curl -s $h_url).observations.data[0].air_temp]" > t2; mv t2 t; cat t
jq: execute.c:251: jq_next: Assertion `jv_get_kind(objv.value) == JV_KIND_OBJECT' failed.
Aborted (core dumped)

I still like the idea of facilitating/encouraging this kind of use of jq.

BTW: in my actual code, I'm getting and putting to http://jsonblob.com/api (not file t as above).

jkleint (Contributor) commented Jan 4, 2013

+1 on the idea; I think it would be great to have multiple named input streams, so you can do joins and such. Currently I think you'd have to munge each stream into fields of an umbrella object.
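
For instance, a sketch of that umbrella-object munging via shell interpolation, using the posts.json and realnames.json files described below (and sharing the injection caveats of any interpolation):

$ echo null | jq "{posts: $(cat posts.json), names: $(cat realnames.json)}"' | .names as $n | .posts[] | {title, author: $n[.author]}'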

A clean way you could do it would be to have a built-in function args() or similar that returns an array of streams read from the command line:

$ jq 'args()[0] | {title, author: args()[1][.author]}' posts.json realnames.json

Where, borrowing from an example in the manual, args()[0] is the contents of the first parameter posts.json:

[{"title": "Frist psot", "author": "anon"}, {"title": "A well-written article", "author": "person1"}]

and args()[1] is the contents of the second parameter realnames.json:

{"anon": "Anonymous Coward", "person1": "Person McPherson"}

giving the result

{"title": "Frist psot", "author": "Anonymous Coward"}
{"title": "A well-written article", "author": "Person McPherson"}

That way you could use process substitution to get data from curl or what have you:

jq 'args()[0] + [args()[1].observations.data[0].air_temp]' temps.json <(curl $url)

And not have to include an HTTP client in jq. :)

You could keep the meaning of . as concatenation-of-all-files for convenience and compatibility.

ghost (Author) commented Jan 4, 2013

Or, use bash-style argument variables $1, $2, $3, etc. (jq already uses a $-prefix for variables). I think args() looks nicer and more JSONy, though!

jq '$1 + [$2.observations.data[0].air_temp]' temps.json <(curl $url)

BTW: can you have several arguments of the form <(curl $url), with jq seeing them as distinct arguments? (Or will the shell have already sent them all to stdin, concatenated? EDIT: I couldn't find this in the man page, but a test shows they get distinct file descriptors.) e.g.

jq '$1 + [$2.observations.data[0].air_temp]' <(curl $url1) <(curl $url2) 

I agree, it's not the unix way to include an HTTP client in jq. (One issue is that fetching JSON is a specific use of HTTP, and you often must send headers to that effect. It would be nice to hide that.)

EDIT: ah! We can already distinguish JSON documents, whether they have been concatenated or arrive in separate files/streams, because a sequence of JSON instances has no enclosing {} or [] and no commas between the instances. Therefore, jq already knows the following is three distinct JSON instances (assuming a, b and c contain one each):

cat a b c | jq .
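
A self-contained demonstration (using -c for compact output):

$ printf '1 {"x":2} [3]' | jq -c .
1
{"x":2}
[3]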

It's a question of accessing this list (which might not fit the jq architecture, since I expect it streams such instances rather than assembling them into one giant JSON value). Perhaps that would be the simplest solution: a way to create a list of the separate JSON instances. Ah, and re-reading the docs, -s/--slurp already does exactly this:

Instead of running the filter for each JSON object in the input, read the entire input stream into a large array and run the filter just once.

Therefore, we get a syntax like args()[0], only simpler: .[0], with exactly the same semantics. And it's already implemented (I tested to confirm it works):

jq --slurp '.[0] + [.[1].observations.data[0].air_temp]' temps.json <(curl -s $url)

Note that the leading . acts to root the path, so we can use it anywhere without ambiguity with other array indexing.

Can also do it with variables:

jq --slurp '.[0] as $a | .[1] as $b | $a + [$b.observations.data[0].air_temp]' temps.json <(curl -s $url)

And with args[0] too (according to the docs and my tests, jq requires zero-argument functions to be called without parentheses, so it can't be args()[0]):

jq --slurp 'def args:.; args[0] + [args[1].observations.data[0].air_temp]' temps.json <(curl -s $url)

EDIT2: sorry, I'm wrong about the "rooted path": the root changes as you filter, so you need to capture the input in a variable at the start. For this reason, a function won't work on its own (but we can keep the function syntax by having the function access that variable):

$ jq -s '. as $args | $args[0][] | {title, author: $args[1][.author]}' posts.json realnames.json 
{
  "author": "Anonymous Coward",
  "title": "Frist psot"
}
{
  "author": "Person McPherson",
  "title": "A well-written article"
}

$ jq -s '. as $args | def args:$args; args[0][] | {title, author: args[1][.author]}' posts.json realnames.json

BTW: you need the extra [] in .[0][] to get the array items one by one instead of the whole array at once.
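
To see the difference:

$ echo '[[1,2],[3,4]]' | jq -c '.[0]'
[1,2]
$ echo '[[1,2],[3,4]]' | jq -c '.[0][]'
1
2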

The above demonstrates the generality. For this specific code, I think binding one input to a variable and streaming the other is clearer:

$ jq -s '.[1] as $hash | .[0][] | {title, author: $hash[.author]}' posts.json realnames.json 

It turns out to be similar to the example code in the manual for this, just using .[1] and .[0] instead of .realnames and .posts:

.realnames as $names | .posts[] | {title, author: $names[.author]}

stedolan (Contributor) commented Jan 4, 2013

The crash is because jq uses 16-bit bytecode indexes internally and putting 5k lines of JSON into the program overflows these. That limit will probably remain in place for the next while, but it should definitely give an error message rather than just crash.

As @13ren points out, you can more or less solve your problem by cat-ing the JSON documents and using --slurp.
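
For example (the same filter as before, with the data fed in as input rather than pasted into the program text):

$ cat t <(curl -s "$h_url") | jq -s '.[0] + [.[1].observations.data[0].air_temp]'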

An HTTP client in jq would certainly be useful; a lot of my (and probably everyone else's) use cases have been curl foo | jq. It's tempting to integrate FreeBSD's libfetch at some stage; it's nice and small.

For the moment, I think I'm more likely to hack up a system function or similar, so you can shell out to other programs and use their output in jq. This would give a less horrible way of doing your ". + [$(curl ...)]" trick.

ghost (Author) commented Jan 4, 2013

@stedolan thanks, a system function would be helpful - I'm also interpolating date and xpath results in the same horrible way.

Also, I think accessing unix tools as if they emitted JSON could be profound. I've seen some research on this concept of structured-data versions of unix tools (using XML, though), and I believe Microsoft's PowerShell does something similar.

jkleint (Contributor) commented Jan 4, 2013

@13ren, as you discovered, you can do as many process substitutions as you like and they all appear as separate files. More here: https://en.wikipedia.org/wiki/Process_substitution
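
You can see the distinct descriptors directly (the fd numbers will vary):

$ echo <(true) <(true)
/dev/fd/63 /dev/fd/62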

I had originally thought of the $1 syntax as well, and then remembered what an ugly pain it is in Bash, and changed my tack.

Your --slurp solution is simple and clever. One potential issue with just concatenating JSON blobs is that they themselves can already contain multiple items, so if file0 contains "a" and "b" and file1 has "c", then .[1] is "b" from file0. Not a problem if you control the inputs and know their shape, but easy to exploit if you're trying to cause trouble.
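
A quick demonstration of the shifted indexes:

$ printf '"a" "b"' > file0; printf '"c"' > file1
$ jq -s -c . file0 file1
["a","b","c"]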

ghost (Author) commented Jan 5, 2013

@jkleint thanks, knowing the name "process substitution" enabled me to find it in the man page. (I'd searched for "<(", but while concentrating on getting the escaping right I was below the entry, and man's search doesn't wrap, so I didn't find it.) A cool feature, especially combined with tee. I wonder if a tee has a role within jq syntax. EDIT: I guess , is tee, but infix.
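
e.g. both sides of the comma see the same input:

$ echo '{"a":1,"b":2}' | jq '.a, .b'
1
2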

I agree the $1 syntax is ugly, but its string interpolation (in bash) is super convenient. Escaping is confusing, though. EDIT: one problem I ran into when playing with syntax was an example with many nested brackets - I found having one fewer [] in the args[0] syntax clearer. Not sure how often that happens. Your example is probably more typical, having only a couple of brackets - here it is in both syntaxes:

$ jq -s 'args[0][] | {title, author: args[1][.author]}' posts.json realnames.json 
$ jq -s '$1[] | {title, author: $2[.author]}' posts.json realnames.json 

re: --slurp: Thanks! I thought of one JSON value split across files, but didn't think of the other case you mention, of more than one in a file, nor that it enabled a kind of JSON injection.

A simple solution is to check that the number of slurped values equals the number of input files - but this can be subverted by adding JSON values to one file while omitting them from another.
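
e.g. a sketch of the count check for the two example files:

$ test "$(jq -s 'length' posts.json realnames.json)" -eq 2 || echo "unexpected value count" >&2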

I think a hack that is actually secure is to introduce a guard token before each file in bash (i.e. before each expected JSON value), and check that they are all equal: .[0] == .[2] == ... == .[n], where n = (number of files - 1) * 2. Token spoofing can be prevented by using a randomly generated string as the guard value.
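
A sketch of the guard idea in bash, for the two example files (assuming a jq new enough to have --arg; the random token comes from /dev/urandom):

$ guard=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')
$ cat <(echo "\"$guard\"") posts.json <(echo "\"$guard\"") realnames.json |
    jq -s --arg g "$guard" 'if .[0] == $g and .[2] == $g then {posts: .[1], names: .[3]} else "guard mismatch" end'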

Or, just run each file through jq individually in a preliminary step, using --slurp and checking for exactly one value. (Note to self: --slurp always wraps the values in a list, even when there is only one, so this check will never confuse a JSON value that happens to be a list with the top-level list of slurped values.)
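
e.g.:

$ for f in posts.json realnames.json; do
    [ "$(jq -s 'length' "$f")" = 1 ] || echo "not exactly one JSON value in $f" >&2
  done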

Of course, the simplest option would be a command-line switch in jq that enforces one JSON value per file (maybe --secure-slurp/-S?). I doubt it's necessary yet, but preventing injection attacks is the kind of feature popular web tools all seem to need eventually. There may also be further wrinkles I haven't thought of.

ghost closed this as completed May 7, 2013