Documentation/Feature request: Clarify expressions/scope #1326

Open
lylemoffitt opened this issue Jan 27, 2017 · 44 comments

@lylemoffitt

Exigent Question:

At any point in a jq script what does the filter . return? It may be easy for an experienced user, but it's not clear from the documentation. Put another way: what defines an expression? What delimits scope? The answers to these questions are implied, but not explicitly or clearly stated by the documentation. It's ironic that the dot filter is referred to as the "least interesting filter", because it is the key to understanding the transformation of data through the script.

To clarify, a script is the string passed as the command-line argument filter or loaded with --from-file. This is to disambiguate from the pragmatic units chained together within it, which are also called 'filters'.

Problems:

The man page doesn't really say a whole lot about parentheses. They pretty much only show up in function signatures and in examples. Yet, they have a fundamental relationship with the dot filter, and thus a critical role in the functioning of the script. Their usage should be clarified. It would also be helpful to clarify their relationship with the object constructors, [] and {}, as all three are used to create sub-expressions and return objects.

The easy thing to do here would be to just create a section where you define () as an expression operator or scope operator, and then stick all the missing explanation there. This might solve the immediate issue, but you could do a lot better. I'm trying to stick with one problem here, but in general the manual could be a lot clearer. I don't know if you're trying to intentionally hide that jq is a full-blown language, but it would certainly be a lot cleaner if you approached explaining the query language like it was the pure-function programming language it is.

Suggested Solutions:

  1. Define the operator () as a Value Constructor and put it in the Types and Values section. It constructs a value from the output of the contained expression. The only change needed to bring its existing functionality in line with the other constructor operators is that it must also work when the expression is empty. Analogous to [] and {}, this should be implemented to construct a null value.

    Example:

    For each serialized JSON input, the type construction operator should return the minimum viable value of the same type when the contained expression is empty, or the result of composing the expression over the input otherwise.

    echo '<json>' | jq -c '[]'              #=> []
    echo '<json>' | jq -c '[<expr>]'        #=> [result of <expr> applied to <json>]
    echo '<json>' | jq -c '{}'              #=> {}
    echo '<json>' | jq -c '{<key>:<expr>}'  #=> {<key>: result of <expr> applied to <json>}
    echo '<json>' | jq -c '()'              #=> null
    echo '<json>' | jq -c '(<expr>)'        #=> result of <expr> applied to <json>
  2. Add a section Operator Precedence and Expression Evaluation (or something to that effect) with the following:

    1. Define how filters and operators are composed into expressions and how the expressions are applied to the input JSON string to create the output JSON string. An explicitly codified type-transform like the following (written in pseudo-Haskell) would be one way to do it and be enormously helpful in terms of reasoning about a jq script.

      -- A filter is a function that accepts JSON and returns JSON
      filter      :: ( JSON ) -> JSON
      -- An operator is a function that accepts 2 filters and returns a filter
      operator    :: ( filter , filter ) -> filter
      -- A closure is a function that accepts JSON and a filter and returns JSON
      closure     :: ( JSON , filter ) -> JSON
    2. Define operator precedence. I know it's basically just left to right and parentheses first, but it's important to explicitly state these things. This is where the type-transform will come in handy again, because it helps elucidate why different sets of operators have different semantics. For example, constructors (like [] and {}), which are called operators, are actually closures. This explains why they have totally different semantics.

    3. Define scoping rules. The effect of () on scope is briefly mentioned in the Variables sub-section, but never talked about directly. The relationship between constructors and scope is never mentioned at all. The relationship between . and the concept of scope should also be discussed. Again, closures will help here.
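For reference, here is a minimal sketch of how the three forms behave today, assuming a jq 1.5 or later binary on the PATH. The input document `{"a":1}` is made up for illustration. The proposed empty `()` is not part of the language: to my knowledge current releases reject it as a syntax error, so it is left out.

```shell
# Array constructor: collects every output of the inner expression.
echo '{"a":1}' | jq -c '[.a]'      # [1]

# Object constructor: evaluates the value expression against the same input.
echo '{"a":1}' | jq -c '{b: .a}'   # {"b":1}

# Parentheses: plain grouping, the inner result passes through unchanged.
echo '{"a":1}' | jq -c '(.a)'      # 1

# The empty constructors ignore the input entirely.
echo '{"a":1}' | jq -c '[]'        # []
echo '{"a":1}' | jq -c '{}'        # {}
```

In each case `.` inside the brackets still refers to the input of the whole constructor, which is the scoping point raised above.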

@nicowilliams
Contributor

Hmm, OK. I suppose the docs do need some clarification. I do think that @stedolan was trying to go for an intuitive description of an intuitive language. However, jq is a rather powerful language with aspects that are not obvious at first glance.

. is always "the current input value". Always. You can add |.| in most places and... it changes nothing, because it means "produce the current input value (from the expression on the left of the pipe) to the expression on the right of the pipe".

Function arguments might be particularly confusing. It's best to think of functions as having ONE (and only one) value argument and zero, one, or more function arguments. E.g., def foo: . + .; has one value argument, while def foo(bar): . + bar; has one value argument and one function argument (bar), and outputs . + <bar applied to .>.
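A small sketch of the two definitions above, assuming jq 1.5 or later. The call sites (`3 | foo`, `foo(. * 2)`, `foo(1, 2)`) are illustrative and not taken from the manual.

```shell
# One value argument (the input .) and no function arguments.
jq -n 'def foo: . + .; 3 | foo'                      # 6

# bar is a function argument: it is applied to the same input as foo itself.
jq -n 'def foo(bar): . + bar; 3 | foo(. * 2)'        # 9  (3 + 3*2)

# A function argument may produce several outputs; foo then does too.
jq -n -c 'def foo(bar): . + bar; [3 | foo(1, 2)]'    # [4,5]
```

The last call shows why it helps to think of `bar` as a filter rather than a value: it is evaluated wherever it appears in the body and can yield more than one result.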

Parentheses can also be used to group expressions. E.g., (1 + 2) * 3. I think this is fairly obvious, but it's true and surprising that the manual does not mention this!

Parentheses can be important to deal with precedence issues. E.g., 1, 2 | . * 3, . * 5 could be interpreted in a number of different ways (though in only one way by jq) -- it's better to use parentheses to avoid confusion. E.g., (1, 2) | ((. * 3), (. * 5)) or 1, (2 | ((. * 3), (. * 5))).
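To make the difference concrete, the three spellings can be run side by side. This is a sketch assuming jq 1.5 or later. The outer `[...]` is only there to collect each stream onto one line.

```shell
# Unparenthesized: the pipe binds loosest, so this reads as (1, 2) | (. * 3, . * 5).
jq -n -c '[1, 2 | . * 3, . * 5]'               # [3,5,6,10]

# The same grouping written out explicitly.
jq -n -c '[(1, 2) | ((. * 3), (. * 5))]'       # [3,5,6,10]

# The other plausible reading gives a different stream.
jq -n -c '[1, (2 | ((. * 3), (. * 5)))]'       # [1,6,10]
```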

@nicowilliams
Contributor

Thanks for your input! Keep it coming. It will make jq better.

@lylemoffitt
Author

@nicowilliams Thanks for the quick response. I get what they were going for, I just felt like it kinda tripped over itself a little to get there. The language is intuitive and simple, I just wish it had been explained better. I ignored jq in favor of the less powerful, but easier to grok jo, months ago exactly because of the documentation.

Keep it coming.

I definitely have more ideas, but they are more focused around enhancing the programming language aspects. I wanted to see how receptive the community is first, before going further.

@pkoppstein
Contributor

@lylemoffitt - I'm not sure this is relevant, but since you wrote:

I definitely have more ideas, but they are more focused around enhancing the programming language aspects.

I thought I'd mention that a jq documentation effort has just started at stackoverflow.com. Maybe it could be justified by adopting a "programming language" approach?

An entry point: https://stackoverflow.com/documentation/jq/topics

@lylemoffitt
Author

@pkoppstein - Thanks for mentioning that. Wasn't aware of that feature on stackoverflow. It isn't really what I had in mind though.

Maybe it could be justified by adopting a "programming language" approach?

I think it would be better than the current one, but that's not really my call. I'm also not saying the current approach is bad either; I just don't think it's as effective as it could be. Like sed, jq is a great CLI tool with an embedded DSL. In sed's documentation (its man page), they took the approach of emphasizing the DSL over the CLI. This (IMO) is probably what led to the long-term success of sed as a tool, but it also has the downside of making it harder to approach. I myself only recently understood the deeper nature of sed beyond its sed -e 's///' usage in part because I found the documentation so dense. But, now that I'm over the hump, I wouldn't have it any other way.

TLDR - It's a tradeoff.

@pkoppstein
Contributor

pkoppstein commented Jan 28, 2017

@lylemoffitt - I have no idea how the jq documentation at stackoverflow.com will pan out, but I like the combination of brevity and accessibility that characterizes the current "manual", so in a way it would make sense for the more "programming language" orientation that you have in mind to have a home at stackoverflow.com, if there is to be additional documentation there.

(Currently, as you may know, the home for the more technical aspects and details is the jq wiki. Maybe you'd like to start a "jq for Programmers" page there? The potential downside of that is the risk that things could get confusing with an official tutorial, an official manual, another manual on the jq wiki, and still another manual of sorts on stackoverflow ...)

My orientation is heavily influenced by the documentation I worked on for a large proprietary language. There were three distinct volumes:

  1. Tutorial
  2. Manual (i.e. reference manual)
  3. User's Guide

@lylemoffitt
Author

@pkoppstein

modulo a few tweaks [...] brevity and accessibility

I'm inclined to agree with you here. I'm not 100% sure what the right approach is given that each has its own set of trade-offs.

the jq wiki

I hadn't actually seen the wiki before. Like most projects on GitHub, I had assumed it was empty or full of incomplete/outdated information. This one has some good information that is appropriately placed there. A "jq for Programmers" page there would probably be better than stackoverflow. Either way, it's always second-class to the reference material provided with a distribution.


Ideally, there should be a quick-reference that's just as accessible as the current man page, but aimed at more experienced users. Perhaps a good solution would be to have two separate man-pages? The current man jq could stay focused on the quick-n-dirty CLI usage, while man jq-lang could be focused on the language and .jq module documentation.

@nicowilliams
Contributor

@pkoppstein What's the copyright licensing associated with SO docs?

@pkoppstein
Contributor

@nicowilliams - As best I can tell, the rules are elaborated in Section 3 ("Subscriber Content") of http://stackexchange.com/legal. The key point seems to be "all Subscriber Content that You contribute to the Network is perpetually and irrevocably licensed to Stack Exchange under the Creative Commons Attribution Share Alike license."

My (somewhat cursory) reading is that the contributor retains copyright and is not expected to grant an exclusive license.

@nicowilliams
Contributor

@pkoppstein Excellent. Thanks.

nicowilliams added a commit to nicowilliams/jq that referenced this issue Jan 28, 2017
@nicowilliams
Contributor

I've pushed a partial fix for this, 6f9646a.

@fadado

fadado commented Feb 8, 2017

In relation to operators precedence, I found this table at Rosetta code:

Precedence  Operator                 Associativity  Description
lowest      |                        %right         pipe
            ,                        %left          generator
            //                       %right         specialized "or" for detecting empty streams
            = |= += -= *= /= %= //=  %nonassoc      set component
            or                       %left          short-circuit "or"
            and                      %left          short-circuit "and"
            != == < > <= >=          %nonassoc      boolean tests
            + -                      %left          polymorphic plus and minus
            * / %                    %left          polymorphic multiply, divide; mod
highest     ?                        (none)         post-fix operator for suppressing errors
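A few rows of the table can be spot-checked from the command line. This is a sketch assuming jq 1.5 or later. The expressions are chosen so that a different precedence would give a visibly different answer.

```shell
# * binds tighter than +.
jq -n '1 + 2 * 3'                  # 7

# and binds tighter than or: true or (false and false).
jq -n 'true or false and false'    # true

# // binds tighter than the comma: 1, (null // 2).
# If the comma bound tighter, (1, null) // 2 would yield only 1.
jq -n -c '[1, null // 2]'          # [1,2]
```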

@fadado

fadado commented Feb 10, 2017

@lylemoffitt

It's ironic that the dot filter is referred to as the "least interesting filter", because it is the key to understanding the transformation of data through the script.

Yes, I would replace "least interesting filter" with:

Two important predefined filters are "." (pass), the filter that does nothing, and "empty", the filter that never produces values. The main laws for those filters and the | (bind) and , (then) operators are:

. | a  ≡  a
a | .  ≡  a

empty , a    ≡  a
a , empty    ≡  a

empty | a    ≡  empty  
a | empty    ≡  empty

a , (b , c)  ≡  (a , b) , c
(a , b) | c  ≡  (a | c) , (b | c)
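These laws can be checked on small concrete streams. The sketch below assumes jq 1.5 or later and substitutes the literal generator `(1, 2)` for `a` and `. * 10` for `c`. Both substitutions are mine, and each stream is wrapped in `[...]` so it prints on one line.

```shell
# . is a left and right identity for the pipe.
jq -n -c '[. | (1, 2)]'                    # [1,2]
jq -n -c '[(1, 2) | .]'                    # [1,2]

# empty is an identity for the comma ...
jq -n -c '[empty, 1]'                      # [1]
jq -n -c '[1, empty]'                      # [1]

# ... and absorbs everything on either side of a pipe.
jq -n -c '[empty | 1]'                     # []
jq -n -c '[1 | empty]'                     # []

# The pipe distributes over the comma on its left.
jq -n -c '[(1, 2) | . * 10]'               # [10,20]
jq -n -c '[(1 | . * 10), (2 | . * 10)]'    # [10,20]
```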

By the way, for my sanity I decided to put names to all filters and operators

Filter/Op.  Name
.           pass
|           bind
,           then
[ ]         values
?           protect
//          alternative

The manual seems to deliberately avoid naming all things!

JJOR

@pkoppstein
Contributor

@fadado wrote:

The manual seems to deliberately avoid naming all things!

Yes, that's one way the manual achieves a brilliant economy of expression and avoids the "cognitive burden" that comes with naming, especially if the names are potentially misleading, as is the case with "pass" for ".".

Readers can be encouraged to pronounce the single-character punctuation operators in accordance with their preferences for pronouncing the punctuation characters themselves (e.g. "dot" for ".", "pipe" for "|", and "comma" for ",").

Please note that [] is not an "operator" in the usual sense. Fundamentally, [] is the empty JSON array. The postfix use of [] is, in my opinion, best understood as a shorthand, i.e., under certain circumstances, expr | .[] can be contracted to expr[] and/or (expr)[].
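The contraction can be seen directly. This is a sketch assuming jq 1.5 or later, and the input `{"a":[1,2]}` is invented for the example.

```shell
# Long form: pipe the array into the iterator.
echo '{"a":[1,2]}' | jq -c '[.a | .[]]'   # [1,2]

# Contracted postfix forms produce the same stream.
echo '{"a":[1,2]}' | jq -c '[.a[]]'       # [1,2]
echo '{"a":[1,2]}' | jq -c '[(.a)[]]'     # [1,2]

# On its own, [] is simply the empty array.
jq -n -c '[]'                             # []
```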

The name "alternative" for "//" is appropriate as it is a two-character operator with a meaning that is unrelated to "/".

@lylemoffitt
Author

@fadado

operator precedence

That's interesting, and helpful, thanks. I was surprised to see that the alternator was right associative. Isn't it defined to evaluate left to right?

The main laws

This. This is more of the kind of thing I was talking about. Helpful, clear, concise. Even if this is alien to a normal user, it's still worth putting in, because of how innocuous it is.

@lylemoffitt
Author

@pkoppstein

one way the manual achieves a brilliant economy of expression and avoids the "cognitive burden" that comes with naming

Generally, easing cognitive burden goes hand in hand with low expressive power. The man page may come off as an easy read, but it does so at the cost of length and verbosity. If you're set on reading it, the length may not be important, but it's certainly off-putting. Part of the trade-off of writing to a low bar is that, while it makes on-boarding easier, it dampens the long-term effectiveness. Now that I understand the language better, I would much rather have a normal function reference, but my only choice is to scroll through a lot of text trying to remember which section the function I'm looking for is under.

pronounce the single-character punctuation operators in accordance with their preferences

The problem with the "call it whatever you want" mentality is that you lack community agreement. Especially if you want people to be able to find reference materials on stack overflow, they are going to need a common name to google. Searching for "jq slash-slash" is going to end in a bad user experience. Moreover, all of this is done in the name of bowing to the fear that users will flee because you made them learn the names for things. If you structure the man page uniformly, they won't even notice the names. Once they get the formatting their eyes will just jump to the section they care about.

Please note that [] is not an "operator" in the usual sense.

I believe we are all in agreement. The man page uses the terms operator, filter, and function somewhat interchangeably. I believe the general rule it follows is that filters have word-names, functions have word-names and explicit arguments in parens, and operators are symbols.


When it comes to learning how to use a tool, none of this complexity really matters. All you really want to know is how to grep the fields out of the stupid json. But when it comes to learning how to use a language, it's all very important. As I said before, jq's problem is that it's both. I remain with my estimation that the best approach is to split the two aspects into their own pages.

@fadado

fadado commented Feb 11, 2017

The manual seems to deliberately avoid naming all things!

Yes, that's one way the manual achieves a brilliant economy of expression and avoids...

Ok, if it is a feature and not a bug I will reframe my mind, and I can say the dot operator is like an all-pass filter...

@pkoppstein
Contributor

@fadado wrote:

if it is a feature and not a bug I will reframe my mind

Thanks for the willingness to see it from another perspective.

... and I can say the dot operator is like an all-pass filter...

Yes, readers of the English-language edition of the jq documentation will have no trouble understanding references of the form "the _ operator", where _ is "dot", "comma", "pipe", or "query", and writing "the dot operator" rather than "the . operator" is undoubtedly sometimes easier on the eyes.

As for describing "." as an all-pass filter --- I am wondering whether the audience who will benefit from such a description is largely the same audience who will understand https://en.wikipedia.org/wiki/All-pass_filter ?

@fadado

fadado commented Feb 12, 2017

As for describing "." as an all-pass filter --- I am wondering whether the audience who will benefit from such a description is largely the same audience who will understand https://en.wikipedia.org/wiki/All-pass_filter ?

You are right, but while in XSLT we say ". is the current node", or in the shell we say ". is the current working directory", what should I say in jq?

The phrase ". is the null filter" will be ok, but null is also a type name and value; this will be a source of confusion. In SNOBOL the null string is a pattern that always matches, and has the same role as the dot filter. For example, in the following code the dot filter helps to emulate SNOBOL fence or Prolog cut:

label $fence
| F
| (. , break $fence)   # like SNOBOL fence or Prolog cut
| G

Can I say null filter? Or perhaps input value?
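For readers who want to run the fence pattern, here is a sketch with concrete stand-ins, assuming jq 1.5 or later (where `label` and `break` are available). `(1, 2, 3)` plays the role of F and `(., . * 10)` the role of G. Both are placeholders of my choosing.

```shell
# Without the fence, every output of F flows on.
jq -n -c '[(1, 2, 3) | .]'
# [1,2,3]

# With the fence: . passes the first output of F through, and when jq
# backtracks for a second one, break $fence abandons the whole label.
jq -n -c '[label $fence | (1, 2, 3) | (., break $fence)]'
# [1]

# G still produces all of its outputs for that first value.
jq -n -c '[label $fence | (1, 2, 3) | (., break $fence) | (., . * 10)]'
# [1,10]
```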

@ghost

ghost commented Feb 12, 2017

Have you considered "Identity filter"? It is, after all, an identity function.

@wtlangford
Contributor

wtlangford commented Feb 12, 2017 via email

@pkoppstein
Contributor

@fadado - In explaining the identity filter, ".", it would be helpful to mention that it echoes each JSON value presented as input in turn. Indeed, in my opinion, the main area for improvement of the manual is explaining the stream-oriented aspect of jq. (See https://github.com/stedolan/jq/wiki/Advanced-Topics#streams)
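A short sketch of that stream behavior, assuming any recent jq. The three whitespace-separated input values are made up for the example.

```shell
# The program . runs once per input value and echoes each one in turn.
echo '1 "two" [3]' | jq -c '.'
# 1
# "two"
# [3]

# With --slurp (-s) the inputs are first gathered into one array, so . runs once.
echo '1 "two" [3]' | jq -c -s '.'
# [1,"two",[3]]
```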

@lylemoffitt
Author

I like "all pass filter" in part because of its intuitive interpretation, but it has a lot of namespace conflict, and should someone google it they'd be given a lot of misdirection.

Using "identity function" is a good choice, on par with "current working directory", and "current node". The problem is more of which analogy you want to go with. Analogizing with the shell would be the best choice IMO here, because of the synergy with explaining the pipe operator, similarity in formatting and operation.

I agree that "null filter" would probably be a source of confusion.

Using "input value" probably works, but then you also need to explain that it's largely unnecessary to provide an input value, since it's automatically interpreted/provided for you most of the time.

@pkoppstein
Contributor

@lylemoffitt - Obviously "all-pass filter" is clear to some, but even for those with a signal-processing background, might not the bit about phase change be a potential source of confusion? More importantly, two of the primary meanings of "pass" are:

To come to an end:
To decline one's turn to bid, draw, bet, compete, or play.

(Source: https://www.ahdictionary.com/word/search.html?q=pass)

@lylemoffitt
Author

@pkoppstein -- We are in agreement. That's pretty much what I was getting at. Though, your point about "pass" is important, too. I was thinking from a more common understanding, e.g. "all things pass through it". Either way, it's probably not a good way to go.

Drawing analogies to the shell and imperative/functional languages are probably the safest bets.

@nicowilliams
Contributor

@fadado | is more like "call". It's actually how you call functions: .foo | bar calls bar with the output of .foo as its . (its input). "Bind" is more appropriate for EXP as $name | ..., since that creates a symbolic binding for the output value(s) of EXP (that is, $name refers to each value output by EXP, successively, but only one value at a time, and it is visible only to the expression to the right of the |).
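Both readings can be tried out in a few lines. This is a sketch assuming jq 1.5 or later. The input `{"foo":[1,2,3]}` and the use of the builtin `length` in place of `bar` are illustrative choices.

```shell
# Pipe as "call": length receives the output of .foo as its input.
echo '{"foo":[1,2,3]}' | jq '.foo | length'      # 3

# Binding: $x takes each output of (1, 2) in turn, one value at a time,
# and the expression to the right of the pipe runs once per value.
jq -n -c '(1, 2) as $x | [$x, $x * 10]'
# [1,10]
# [2,20]
```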

@nicowilliams
Contributor

I do like some of the suggestions here. Certainly a table of operator precedence would be nice, and some of the "laws" that @fadado proposes would be useful to include.

I too would rather not "name everything". For now anyways.

@lylemoffitt
Author

@nicowilliams

I too would rather not "name everything". For now anyways.

I'm inclined to agree with this, as there are more important issues IMO, but the push for Stack Overflow kinda necessitates that we have common pronounceable names for all the fundamental operations in jq. The rest of the discussion about what to call them should be focused on how to explain them first, and then suggest alternative operator names only by way of analogy.

Currently, all of the functions are easily searchable, and most of the operators actually have explicit names given. But, a (perhaps) surprising number are without names. Instead, they are repeatedly referred to as "the _ operator/filter", or are given no noun at all and simply referred to by their bare symbol! This latter point is really unacceptable, and places a burden on both the manual writer and the reader. Speech is the fundamental basis of reading and understanding; if you can't pronounce a thing, then you can't leverage the language processing ability in your brain towards understanding that thing, which is effectively an inhibitor since learning is all about neural activation. To wit, what do you expect people to say when they read the symbol .[] in the following quote from the manual?

Running .[] with the input [1,2,3] will produce the numbers as three separate results, rather than as a single array.

I digress...


The following are all taken from the man page. Type denotes what noun is used with a given symbol. The suggested name attempts to find something close to what people colloquially call the given operation, while also avoiding name conflicts and providing a minimum of specificity.

Type      Current Name  Suggested Name
operator  ?             try operator
filter    .             dot operator
filter    .foo          member operator
syntax    .[<string>]   index operator
syntax    .[2]          index operator
syntax    .[10:15]      slice operator
N/A       .[]           stream operator
N/A       ,             comma operator
operator  |

The unnamed .[]? and .foo? are not named here and should remain so, because they are really just common applications of the now so named "try operator".

@nicowilliams
Contributor

nicowilliams commented Feb 13, 2017

Type      Current Name     Suggested Name
operator  ?                try operator
special   .                identity operator
operator  .foo             object identifier-index operator
operator  ."foo"           object index operator
operator  .[EXP]           array or object index operator
operator  .[EXP:EXP]       array and string slice operator
operator  .[]              array and object value iterator operator
operator  ,                comma, or output concatenation operator
operator  |
syntax    EXP as $name |
syntax    [EXP]            array constructor
syntax    {EXP:EXP, ...}   object constructor

I'm not sure that classification as "syntax" vs. "operator" makes sense. It's all syntactic. Some of these things are "operators" in the mathematical sense, but maybe all of them are (except for ., which can be thought of as the identity function). Even the binding syntax can be thought of as an operator, one that establishes a symbolic binding.

@nicowilliams
Contributor

There's also ."string", and a variety of other operators. Certainly a table or two would be nice.

@lylemoffitt
Author

@nicowilliams

There's also ."string" [...]

Yup. Totally missed ."string", because it's not in the header.

[...] and a variety of other operators.

Which? I believe, all remaining operators have explicit names already provided in the manual. It's not super obvious, but it is there or in the context. Double checking, these are the exceptions:

  • The equivalence operators (defined as an "expression") == and !=, which I did overlook. (My bad.)
  • The ; symbol, which is only mentioned directly in the wiki. It could be called a "section terminator" somewhere, I suppose. But it doesn't really need to be mentioned since it's no more an operator than the : is.
  • Parens, which your PR declared as a "grouping operator".
  • The assignment operators: +=, -=, *=, /=, %=, and //=, which I felt were adequately addressed by their section name and by the fact that they are all just lexical concatenations of other named operators; e.g.
  • The recurse function "shorthand" .., is another one I genuinely missed. I blame that one on poor organization. It should obviously be called the "recurse operator".

Certainly a table or two would be nice.

A table would be nice, but I think these names should also be put in the section labels. This is clear and consistent with the other operators that are named, like Addition and Array construction.

@nicowilliams
Contributor

@lylemoffitt Well, there's also the array-collect operator ([EXP]) and the object construction operator ({<EXP>: <EXP>, ...}).

I'm going to have to learn whether the doc system supports tables...

@nicowilliams
Contributor

The ; indeed is not an operator. It is a separator/terminator of sorts, as follows

  • separates formal parameter names
  • separates function call argument expressions
  • terminates the bodies of functions
  • separates the initializer and update expressions in reduce expressions
  • separates the initializer, update, and extraction expressions in foreach expressions
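The roles listed above can each be seen in a one-liner. This is a sketch assuming jq 1.5 or later, and the helper name `add2` is hypothetical.

```shell
# Separates formal parameters and call arguments; also terminates the def body.
jq -n 'def add2(a; b): a + b; add2(1; 2)'                   # 3

# reduce: initializer ; update
jq -n 'reduce (1, 2, 3) as $x (0; . + $x)'                  # 6

# foreach: initializer ; update ; extraction
# The running sums are 1, 3, 6; the extraction multiplies each emitted value by 10
# without changing the carried state.
jq -n -c '[foreach (1, 2, 3) as $x (0; . + $x; . * 10)]'    # [10,30,60]
```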

@lylemoffitt
Author

lylemoffitt commented Feb 13, 2017

@nicowilliams

I'm going to have to learn whether the doc system supports tables...

Looks like the answer is no (see ronn-format). Maybe another form would do? You could change the entry format for building the manpage to something like:

f.puts "### #{entry['title']['symbol']} -- #{entry['title']['name']}\n"

And change the yaml to match:

entries:
  - title:
      name: "Index Operator"
      symbol: "`.[EXP]`"
    body: |
      You can also look up fields of an object using syntax like...

@lylemoffitt
Author

lylemoffitt commented Feb 13, 2017

@nicowilliams

Well, there's also the array-collect operator ([EXP]), the object construction operator ({<EXP>: <EXP>, ...}).

I thought those were already named well enough by context, but thanks for adding them.


Side note: The rules for what constitutes an acceptable EXP in each of the above are different. For example if it's 3/2, then [3/2] is fine, even though there is no such thing as a fractional index, while { a: 3/2 } will fail to compile (citing shell quoting issues of course).
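A sketch of the usual workaround, assuming jq 1.5 or later: wrapping the value expression in parentheses makes it acceptable inside the object constructor. Whether the unparenthesized `{a: 3/2}` compiles may depend on the jq release, so only the forms expected to work everywhere are shown.

```shell
# The array constructor accepts the full expression grammar.
jq -n -c '[3/2]'         # [1.5]

# Inside an object constructor, parenthesize anything beyond a simple term.
jq -n -c '{a: (3/2)}'    # {"a":1.5}
```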

@nicowilliams
Contributor

Yes, there are places where not the full range of expressions is permitted, most notably the object constructor, for the subtle reason that it's impossible to avoid ambiguities in the grammar.

Thanks for checking doc support for tables. Adding that is going to be a low priority for me for now, unless someone offers a PR.

@lylemoffitt
Author

@nicowilliams -- If you're fine with my solution to the tables (or something like it), I can certainly put in that PR for you. I don't think we're set on the content yet, though.

@nicowilliams
Contributor

@lylemoffitt Can we get a preview of what a rendered manpage would look like?

@nicowilliams
Contributor

@lylemoffitt Er, actually, ronn-format does seem to support tables, since it claims that "[a]ll markdown(7) linking features are supported."

@lylemoffitt
Author

@nicowilliams

All markdown(7) linking features are supported.

That looks like they only have support for markdown's [ link text ]( link url ) and [ link text ]( #section-link ) features.

@nicowilliams
Contributor

nicowilliams commented Feb 14, 2017

@lylemoffitt Oy, yes, I misread that. But elsewhere it says:

The ronn(1) command converts text in a simple markup to UNIX manual pages.
The syntax includes all Markdown formatting features, plus conventions for
expressing the structure and various notations present in standard UNIX manpages.

@nicowilliams
Contributor

I tried it, and... no dice, ronn does not seem to support tables. rtomayko/ronn#99

@nicowilliams
Contributor

Also, whatever is done for manpages has to work for the HTML-rendered manual as well.

@nicowilliams
Contributor

@lylemoffitt #1340 is a PR with some modest enhancements based on this issue and #1337.

davidfetter pushed a commit to davidfetter/jq that referenced this issue Oct 27, 2017