Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raw content was changed by jq '.' #1741

Closed
gdmmx opened this issue Oct 17, 2018 · 52 comments
Closed

raw content was changed by jq '.' #1741

gdmmx opened this issue Oct 17, 2018 · 52 comments

Comments

@gdmmx
Copy link

gdmmx commented Oct 17, 2018

image

Bad case as attachment
bad_case.json.txt

@wtlangford
Copy link
Contributor

This is a known behavior: see #1652 and https://github.com/stedolan/jq/wiki/FAQ#numbers

Internally, we represent json numbers as IEEE754 doubles, which aren't precise at very large or very small magnitudes.

Unfortunately, changing this behavior will impose a heavy cost on performance (bignum/bigmath libraries are slow), so we're taking a careful/slow approach to the problem, and haven't really addressed it yet.

@gdmmx
Copy link
Author

gdmmx commented Oct 19, 2018

Thanks for your fast reply.
Wait for an approach solution.

@leonid-s-usov
Copy link
Contributor

Hey, just a suggestion here.
Do you think it would be a good idea to only convert numbers to doubles when some math is about to take place, and otherwise preserve their string representation?
I'm subscribed to the repository and in most cases I can recall the problem of large integers was about some IDs being truncated. The "no math no double" approach would make it work in all of those, and in case the lossy conversion is required it may spit out a warning and proceed as it does now.

@wtlangford
Copy link
Contributor

wtlangford commented Oct 19, 2018 via email

@leonid-s-usov
Copy link
Contributor

oops, i've already pushed my version of the same thing.
#1743

But it was fun!
Now I will go and compare the implementations

@leonid-s-usov
Copy link
Contributor

Wait, what's that branch? Can't detect it by name...

  remotes/origin/autotools
  remotes/origin/bugfix/aix-issues
  remotes/origin/docs
  remotes/origin/fix-destructuring-alternation
  remotes/origin/gh-pages
  remotes/origin/haskell-version
  remotes/origin/header-cleanup
  remotes/origin/jq-1.5-branch
  remotes/origin/libjq
  remotes/origin/macos-strptime
  remotes/origin/master
  remotes/origin/qsort-stability
  remotes/origin/tco-in-compiler

@wtlangford
Copy link
Contributor

It's here: #1327

@leonid-s-usov
Copy link
Contributor

Nice... i like the subkind thing, though I'd encode it into the kind upper nibble as @nicowilliams has suggested. Having a proper subkind is wanted, will probably even make some checks faster

Nevertheless, I'd say that the pull requests are quite different. Your approach keeps jv untouched, but for that you introduce a slightly higher but still a limitation: 64 bit integers. So, decimals will not work and greater than 64 bit numbers will not work.
Also it will be quite easy to overflow the thing by using scientific notation of numbers.

I understand that integer IDs > 64 bits seem quite unlikely, but on the other hand it's hard for me to tell if using int64 is giving a dramatic increase in the speed compared to the full string literal implementation. The trick is that the only slowdown is full comparison of two literals (I don't know the stats, but assume it's a rare case) and potentially slower jv_number allocation. This could be noticeable for large arrays of large numbers.

Also, what I would definitely improve on my version, and what is already part of your code, is the initial decision to whether it's even needed to store the literal. That could basically select between the best fitting subkind, including the integer, double and raw string literal, and eventually even maybe a bignumber.

@wtlangford
Copy link
Contributor

Nevertheless, I'd say that the pull requests are quite different. Your approach keeps jv untouched, but for that you introduce a slightly higher but still a limitation: 64 bit integers. So, decimals will not work and greater than 64 bit numbers will not work.
Also it will be quite easy to overflow the thing by using scientific notation of numbers.

True, and there's a reason nico opted for that solution. We didn't want to introduce an additional jvp_* struct because we aren't consistent about calling jv_free on things we know to be jv_numbers, because there's historically been nothing to free on them. And so we were constrained by the size of the union.

Relatedly, I was going to point this out on your PR. Your linux-based tests are failing for memory leak issues (The macOS tests don't fail for memory issues because valgrind doesn't work on macOS). If you want to try and find all the places where we didn't jv_free a jv_number, I wouldn't complain. It'd be a good way to get closer to supporting bignums/etc.

Also, what I would definitely improve on my version, and what is already part of your code, is the initial decision to whether it's even needed to store the literal. That could basically select between the best fitting subkind, including the integer, double and raw string literal, and eventually even maybe a bignumber.

I might consider this a requirement to merge, though I'd have to think on it a bit.

potentially slower jv_number allocation

I'm not too worried about this, especially with the logic described above, though I'd be interested in some potential benchmarks.

@leonid-s-usov
Copy link
Contributor

I'll be happy to hear some more specific comments on the PR itself.

re. valgrind - that kind of explains why I haven't noticed any speed time increase when it was enabled in the config.

I have actually thought about the memory issues, and have changed the places I could spot by statically analyzing the code, but I didn't have much time to go over all other cases, just few hours I have dedicated to it yesterday. It seemed like going through the few functions in the Memory Management sections should do the job, namely jv_copy, jv_free and jv_get_refcnt, and I have covered those.

But what I couldn't possibly check over other parts of the code, like the execution, and just assumed, that the above functions are called on all objects, including the JV_KIND_NUMBER.

@wtlangford
Copy link
Contributor

I'll be happy to hear some more specific comments on the PR itself.

I was at work yesterday, so didn't have much time to look at the PR in detail. I'll try to get to that today.

re. valgrind - that kind of explains why I haven't noticed any speed time increase when it was enabled in the config.

Yep! Those tests are slooooow in valgrind. You'll know when it's running! I recommend making a cup of tea or something.

I have actually thought about the memory issues, and have changed the places I could spot by statically analyzing the code, but I didn't have much time to go over all other cases, just few hours I have dedicated to it yesterday. It seemed like going through the few functions in the Memory Management sections should do the job, namely jv_copy, jv_free and jv_get_refcnt, and I have covered those.

But what I couldn't possibly check over other parts of the code, like the execution, and just assumed, that the above functions are called on all objects, including the JV_KIND_NUMBER.

So the issue here is specifically that in builtins/execution/compilation/etc, if we know that a jv is JV_KIND_NUMBER, we might not call jv_free/jv_copy on it. Look at f_plus in src/builtin.c for an example of this:

else if (jv_get_kind(a) == JV_KIND_NUMBER && jv_get_kind(b) == JV_KIND_NUMBER) {
    return jv_number(jv_number_value(a) +
                     jv_number_value(b));

If a or b are your new literal number implementations, you end up leaking the allocated jvp_literal_number.

@pkoppstein
Copy link
Contributor

pkoppstein commented Oct 20, 2018

@leonid-s-usov - Thanks for getting the ball rolling again on this long-simmering and important topic.

Perhaps I should have chimed in earlier, but I am very uncomfortable with the
emphasis in your approach on "literal number semantics", so that in your revision
to the docs, you write:

"Numbers are now taken literally, so 1 != 1.0"

Oddly enough, though, in your implementation (jq-1.6rc1-23-g00c00a3-dirty):

jq -n '1.0 == 1' #=> true

So I am a bit confused about your intentions, not only with respect to 1 and 1.0
but also all the infinitely many other possible representations (1.0E0, 1.00, 1.000, ....).

In any case, I would like to argue that jq should NOT take JSON literal numbers literally
(not least because this introduces an enormous backward-incompatibility);
and furthermore, that jq should adopt a real-number semantics for JSON literal numbers, as
it now does for JSON numbers which can be accurately represented using IEEE 64 bit floats.

In fact, I'd like to propose two principles regarding "numbers" that I believe
jq should adhere to. A more formal presentation follows, but they can be stated informally as follows:

(1) `.` should preserve the semantic value of JSON literal numbers;
(2) for JSON texts, equality (jq's `==`) should conform with real-number semantics;

where it is to be understood that:

  • "semantic value" is defined in terms of (mathematical) real numbers.
  • the principles do not apply to the results of arithmetic operations.

Of course the JSON specifications of "number" are primarily syntactic,
but one reading of them is that JSON literal numbers should have the natural
interpretation as real numbers, according to which 10.0 and 10.00
would be regarded as representing the same real number.

In any case, for jq, the two principles are uncontroversial for
numbers that can be accurately represented using IEEE 64-bit floats.

Formal Presentation

Using Sem(_) to denote the "semantic value" function, backticks to denote
a valid JSON text (e.g. 10.0), and ordinary decimal/scientific notation to represent
mathematical real numbers (e.g. 10.0), we can write:

Sem( `10.0` ) => 10.0
Sem( `10.00`) => 10.0

Note that the second principle (2) only applies to JSON texts. Once
an arithmetic operation has been performed on a numeric value, (2) no longer
applies.

Thus (2) requires that the jq expression 10.0 == 10.00 should evaluate to true,
but says nothing about the jq expression 10.0 + 0 == 10.00.

The two principles can be stated more formally as follows, it being understood that
N1 and N2 are any two (but possibly the same) JSON literal numbers:

(1a) if Sem(N1) == Sem(N2), and if . emits N1 as X1 and N2 as X2, then Sem(X1) == Sem(X2);

(1b) if Sem(N1) != Sem(N2), and if . emits N1 as X1 and N2 as X2, then Sem(X1) != Sem(X2);

(2a) the jq expression N1 == N2 evaluates to the same boolean value as Sem(N1) == Sem(N2).

Since Sem(_) is defined in terms of mathematical real numbers, some
consequences of the above are that the following jq expressions should
all evaluate to true (as they do in jq 1.5):

10.0 == 10.00
10.0 == 1.0E1
10.0 == 10

Extension to jq numeric values

For jq, we will want to define equality (==) when comparing a JSON
literal number with a numeric value in jq, e.g. so that we can evaluate the jq
expression:

     10 == ("10.0"|tonumber)

Once again, I would prefer that the semantics be based on the natural mapping to real numbers.

What about jq numbers for which "isinfinite" is true?

`1.0e1000 == (1.0e1000|tonumber)'

Currently, this evaluates to true because both the left and and right are 'isinfinite', but Sem(1.0e1000) is a finite real number, and thus a case can be made for the expression to evaluate to false.

@leonid-s-usov
Copy link
Contributor

@pkoppstein thanks for the response!

It's definitely not too late. Actually on the PR itself I have mentioned this non-backward compatible change and also mentioned that it's not hard at all to consider the equality operator like any other arithmetic operator and actually compare the double representation of the number.

Just a side note about jq -n '1.0 == 1' #=> true. This is definitely a bug with regards to the intended behaviour. Need to check this, cause on the same branch

./jq -n '[1.0][0] == 1'   #=> false
./jq -n '[1.0][0] == 1.0' #=> true

The formal presentation above looks promising, but I feel there are some gaps still. I will get back to this in the evening and hope to better understand what you propose.

My main point about "literal" number comparison was that if one has a number value in their JSON which would overflow a double, and we go for preserving it given that there's no math involved, we should also support matching that number to a literally provided constant.

@leonid-s-usov
Copy link
Contributor

Just a thought... what if we simply introduce a new binary operator to compare numbers semantically? Something like ===. That kind of solves the issue in the least complex and the most compatible way:

  1. ALL binary operations on numbers and all builtin functions using numbers as parameters will use double representation of the number, yielding 100% compatible results
  2. . will maintain the literal by the means of instance sharing, i.e. as long as no mutation was made to a number it will stay the way it was parsed - basically what my PR is doing
  3. A new literal comparison operator will be provided to allow comparing numbers as their literal strings.

I'm also thinking that maybe we can provide (3) by the means of number to string conversion preserving the number literal where available, to save on expanding the syntax with a new binary operator.
On the other hand, we could make use of the literal comparison for other cases, like object comparison which honours the stream order of key value pairs.

@pkoppstein
Copy link
Contributor

pkoppstein commented Oct 21, 2018

@leonid-s-usov -

  1. Regarding your comment:

it kind of makes more sense to discuss this particular aspect here than in the issue #1741

Sure, but I would prefer to continue discussing the general requirements here (at #1741). Even the title of PR#1743 ("Save literal value of the parsed number to preserve it for the output") presupposes the pre-eminence of "literal values", and in any case, the discussion at PRs is usually about implementation details.

  1. Regarding your comment:

Just a thought... what if we simply introduce a new binary operator to compare numbers semantically? Something like ===.

If there is a need to preserve some kind of "literal" representation, then it would be far, far better simply to support a filter for converting it to a string. If tostring does not suffice, then something like toliteral would.

  1. There are several good alternatives for enhancing jq's support for JSON "number", mainly:
  • a) based on support for BigInteger in some way
  • b) based on support for BigDecimal in some way
  • c) support for real-number semantics in some way

These approaches can of course be combined in various ways, but assuming some combination of them, I see very little value in adding support for preserving each literal JSON string that is presented, whereas I do see large downsides, especially given that each real number has infinitely many valid JSON representations. Since jq advertises itself as a kind of awk for JSON, and since awk has stood the test of time very well, I think that the behavior of awks collectively regarding numbers is worth heeding. For example:

$ awk 'BEGIN {print (1 == 1.0); print 1; print 1.0}'
1
1
1

@leonid-s-usov
Copy link
Contributor

leonid-s-usov commented Oct 21, 2018

@pkoppstein
I may be expressing myself not well enough, sorry for that. Let me try to clarify

it kind of makes more sense to discuss this particular aspect here than in the issue #1741

This was taken from my comment on the PR about a very specific issue, namely the strange behaviour of the equivalence operator I wasn't expecting, and I have described my finding around that issue there. Pretty much PR-scoped topic.

Just a thought... what if we simply introduce a new binary operator to compare numbers semantically? Something like ===.

If you read a little bit lower in the same post of mine I'm actually questioning if a "tostring" conversion will be sufficient there.

Let me just repeat the portion of that message here

  1. ALL binary operations on numbers and all builtin functions using numbers as parameters will use double representation of the number, yielding 100% compatible results
  2. . will maintain the literal by the means of instance sharing, i.e. as long as no mutation was made to a number it will stay the way it was parsed - basically what my PR is doing
  3. A new literal comparison operator will be provided to allow comparing numbers as their literal strings.

I'm also thinking that maybe we can provide (3) by the means of number to string conversion preserving the number literal where available, to save on expanding the syntax with a new binary operator.

So, in essence, what I eventually suggested was to fully maintain today's number behaviour, adding just two things:

  • saving the literal of a number to preserve it for the output, given that there were no mutations to that number
  • adding a mechanism of comparing numbers literally, as opposed to the default (and unaffected) behaviour of comparing number's double representation.

To me this looks quite aligned with your guidelines. If you feel that way as well then let's try to move on.
Specifically, I'm thinking of what could be the semantics of the potential toliteral filter / builtin? Let's say my task is to select a db record by id, represented as a 64 bit uint. Should we simply do

cat records.json | jq 'select((.id | tostring) == "123456789876543212345")'

I'm not sure if there are any requirement of a tostring today which will be violated if we simply use number's literal (when available) when converted to string. Maybe a better (and also typesafe) version could be

cat records.json | jq 'select(literal(.id) == "123456789876543212345")'

WDYT?


UPDATE

I have just realised that due to how tostring is implemented, and given the literal preserving behaviour of the PR it will work like that out of the box. The only problem here is that there is no type checking. That is, if we had a builtin literal filter it could return jv_invalid with a message that literal representation of a number is not available.

@pkoppstein
Copy link
Contributor

pkoppstein commented Oct 21, 2018

@leonid-s-usov - == is of critical importance in jq, so we need to be clear about how it is defined for literal JSON values. Thus this response focuses on equality. Once we have that nailed, everything else should fall into place.

Let me emphasize at the outset that I am drawing a distinction between "JSON number literals" in general, and "the literal specification of a JSON number literal" in particular. To avoid tedium, I will use "JSON number" as a synonym for "JSON number literal". Any references to jq numeric values will be to the IEEE 64-bit float that jq currently uses.

m == n

Part (2) of my proposal requires that if m and n are JSON numbers, then m==n should evaluate to true iff m and n represent the same real number.
For example, each of the following should evaluate to true (as they do now in jq):

1 == 1.0
1.0 == 1.00
1 == 1.0E0
1 == 10E-1

That's the easy part. The hardest part is the handling of m == y
where m is a JSON number and y is a jq numeric value.

m == y

In order to formulate my proposal regarding m == y, let's suppose
that tocanonical is defined on both JSON literal numbers and numeric
strings so that it evaluates to a canonical string representation,
e.g. we can suppose that all the above representations of 1 map to "1",
which is chosen because 1|tostring evaluates to "1".

So my basic proposal can be extended as follows:

 (3) if m is a JSON literal and y a jq numeric value,
     then m == y iff
   (y|isnan|not) and (y|isfinite) and
   (m|tocanonical) ==  (y|tocanonical)

Incidentally, using tocanonical, part (2) of my proposal can be rephrased essentially as

(2) m == n iff (m|tocanonical) == (n|tocanonical)

Literal preservation of JSON numbers

As I've said, I think the PROs of having jq store JSON number
literals literally is very small, and the CONs are very great, but
if it is decided that jq should store JSON numeric literals
literally, then I would propose (and in fact request) that this be
done independently of other enhancements to jq's support for
numbers. In fact, I believe that support for the preservation of
JSON number literals should be confined to a toliteral filter
(however named) that would recover the JSON literal as a JSON
string.

@leonid-s-usov
Copy link
Contributor

Let me address your comment in three parts:

  1. The == part
  2. The tocanonical part
  3. The literal part

==

Do I understand correctly that the current (1.5) behaviour of == is fine? Because like I said before I am willing to preserve and not touch it.

My POC implementation had it changed to allow for literal matching while I was aware of the side effects. However you were very clear since the beginning that those side effects aren't wanted, and I'm fully convinced.

I have another way to achieve the wanted functionality, so it's fine.
Is it?

tocanonical

I fully agree with this generic approach to treating numbers.

Do I understand correctly the the canonisation algorithm will be able to parse and canonise number literals which would be truncated if converted to a double precision? If so then I'm all in.

As a side note, looks like there is a problem with the order of filters in (y|not|isnan). Should the whole precondition be read as (y|isnormal)?

Last but not the least, in order to actually solve the issue we are having this conversation under, the original canonical representation of the literal value should be preserved to allow for later comparison with another literal. That is, to allow the following

10000000000000.00000000000000002 != 1.000000000000000000000000000001e13

please forgive me if I have missed a zero, i hope you get the point

Which brings us to

literal

While waiting for your response I have almost finished my changes to the PR, including the tests and docs updates.

  • I am still using the original literal snapshot
  • I have returned to the unmodified behaviour of all comparisons
  • I have defined a filter literal/2 which ignored its first input. IMO the form literal(.id) makes more sense than (.id | toliteral), but that's of course subjective. I'm open to suggestions. I have defined this filter as follows:
    • literal(string) returns the string
    • literal(null) returns "null"
    • literal(false) returns "false"
    • literal(true) returns "true"
    • literal(object) and literal(array) return type error with a message that only simple types can have literal
    • literal(number) returns the number literal as string in case it's available. Otherwise it returns an error with a message that "computed numbers don't have an associated literal".

In practice, this function is a type safe version of tostring, which is described in the updated docs.

Conclusion

I believe that the proposed PR (please wait for the latest commits) is a good step towards the tocanonical functionality, alongside with any additional features like bignum support.

What I think makes this PR a candidate for approval is that given full backward compatibility it

  • solves this and other cases of such issue
  • lays foundation for further improvements around jv_number
  • doesn't introduce significant overhead on the current implementation of jq

@pkoppstein
Copy link
Contributor

@leonid-s-usov - It looks like we're making good progress, but I believe we still have some way to go.

Do I understand correctly that the current (1.5) behaviour of == is fine?

That's a tricky question, because at present one can only use == on jq values as they exist now, whereas we're talking about numbers having two possible representations in jq:

a) a "raw value" corresponding to a JSON numeric literal
b) an IEEE 64-bit value

In my previous posts, I have generally proposed that the "raw value" would be based on "real number" semantics, but now may be a good time to mention that I am not opposed to introducing a distinction between integer and non-integer numeric values. With this variation:

1|tocanonical #=> "1"
1.0|tocanonical #=> "1.0"

However whether or not this variation is supported, I would still want real-number semantics for ==.

Regarding:

y|not|isnan

Yes, of course that should have read y|isnan|not. Thanks for spotting the error. I hope it won't sow too much confusion if I simply correct it.

literal/1 vs toliteral/0

the form literal(.id) makes more sense than (.id | toliteral), but that's of course subjective.

jq already has an overwhelming bias towards having filters acting on steams of values, as illustrated most glaringly by not. Of course there are filters (such as sigma(s) defined as: reduce s as $x (null;.+$x)) which have to be defined with the stream as an argument, but jq already has a family of filters for performing transformations, such as tonumber, tostring, and tojson.

@pkoppstein
Copy link
Contributor

pkoppstein commented Oct 21, 2018

@leonid-s-usov wrote:

Do I understand correctly the the canonisation algorithm will be able to parse and canonise number literals which would be truncated if converted to a double precision? If so then I'm all in.

There is a slight ambiguity in your question, so let me emphasize that

a) the canonicalization algorithm I have in mind will preserve the mathematical real-number value of ALL JSON number literals.

b) The canonical value will (at least logically) always be used UNTIL an arithmetic operation is performed.

(Of course, as an optimization, if the IEEE 64-bit representation can be used as the canonical value, then so be it.)

In a nutshell, my proposal and hope is that:

1.0E1000 == 1.0E1000 #=> true # because these are JSON literals that both represent the same real number

but:

1.0E1000 == (0+1.0E10000) #=> false # because the RHS becomes an IEEE 64-bit float

As I also mentioned, I have no objection to the expression [1, 1.0] printing as such.

@pkoppstein
Copy link
Contributor

@leonid-s-usov - Unfortunately, github somehow completely lost my most recent post. Since I don't have time to try to reproduce it just now, let me just highlight some salient points that I hope are still relevant:

integers

Although I have proposed and hope for a "real-number semantics" for ==, I am very much open to the idea of jq preserving a distinction between integers and non-integer reals when it comes for example to printing a JSON literal. Using #=> to signify emits, I think it would be perfectly reasonable and maybe advantageous to see:

jq -n 1.00 #=> 1.0 
jq -n 1  #=> 1

jq -n '1.00 | toliteral' #=> "1.0" 
jq -n '1 | toliteral'  #=> "1"

literal/1 vs toliteral/0

jq is very strongly oriented, both in theory and in practice, to filters, especially for transformations such as tostring, tojson, tonumber, @csv, etc. Not even not/0 has an arity-1 equivalent.

Of course there are cases when there is no choice but to provide the stream of inputs as an argument, e.g.

def sigma(s): reduce s as $x (null; . + $x); # polymorphic

Even here, though, it could be argued that a better version would take the input as the initial value:

def sigma(s): reduce s as $x (.; . + $x);  # variant

The deciding factor, perhaps, is that by encapsulating the to-literal functionality as an arity-0 function, it follows necessarily that each input will be processed independently of the others, whereas that is not the case for functions that have arguments that can be streams. Thus first(s) ignores all but the first item in the stream, s.

@pkoppstein
Copy link
Contributor

pkoppstein commented Oct 22, 2018

@leonid-s-usov - github won't let me post at PR#1743 so let me try here.

Here's a test case which highlights a discrepancy between the current implementation of jq (jq-1.6rc1-25-g56f6124-dirty) and the principles I'm advocating:

./jq -n '123456789123456789123456789 == 123456789123456790000000000'
true

By the principle:

If m and n are JSON numeric literals, then m==n iff Sem(m) == Sem(n)

the result should of course be false. However, if your goal is to have a number-related PR accepted for jq 1.5.x, then the "real-number semantics" for == that I have advocated will probably be judged unacceptable. Indeed the same might be true of all the changes (as opposed to additions) that we're proposing, so it would be good to hear from @wtlangford and @nicowilliams on this point. For their benefit, here in a nutshell, are the main questions:

  1. Is it acceptable for jq 1.5.x that . reflect the real-number value of JSON numeric literals?

  2. Is it acceptable for jq 1.5.x to evaluate the following expression as false:

    123456789123456789123456789 == 123456789123456790000000000

@pkoppstein
Copy link
Contributor

@leonid-s-usov - github is currently malfunctioning rather badly. Some of my posts have simply disappeared, or rather, they disappear, reappear, and then disappear again.

For now, I'd just like to add that if your goal is to have a number-related PR accepted for jq 1.5.x then the "real-number semantics" for == that I have advocated will probably be judged unacceptable. Indeed the same might be true of all the changes (as opposed to additions) that we're proposing, so it would be good to hear from @wtlangford and @nicowilliams on this point. For their benefit, here in a nutshell, are the main questions:

  1. Is it acceptable for jq 1.5.x that . reflect the real-number value of JSON numeric literals?

  2. Is it acceptable for jq 1.5.x to evaluate the following expression as false:

123456789123456789123456789 == 123456789123456790000000000

@pkoppstein
Copy link
Contributor

pkoppstein commented Oct 22, 2018

@leonid-s-usov wrote:

literal(number) returns the number literal as string in case it's available. Otherwise it returns an error with a message that "computed numbers don't have an associated literal".

In general, jq avoids raising errors if at all possible, and in this case, I think it would be advisable to avoid raising an error, though of course there are some interesting edge cases.

Since I believe that ultimately the "to-literal" function will probably be named something like toliteral and have 0-arity, I'll assume that is the case.

First, in the ordinary case, I think it would be rather odd if 1.23 | toliteral evaluated to "1.23" but (1.23 + 0)|toliteral raised an error, the point being that 1.23+0 can be (accurately) printed as a JSON number literal. It might be a bit of a stretch, but I think that it would be reasonable to extend this by requiring toliteral to behave like tostring for all jq numeric values other than nan and infinite.

What then about nan and infinite?

Consider:

$ jq -nc '{a: nan, b: infinite}'
{"a":null,"b":1.7976931348623157e+308}
$ jq -nc '{a: nan, b:infinite} | map_values(type)'
{"a":"number","b":"number"}

Since jq effectively regards nan and infinite as numeric literals, I think it would make sense for nan|toliteral to emit "nan" and infinite|toliteral to emit "infinite", but obviously one can just as easily argue that for numbers, toliteral should only produce strings corresponding to valid JSON numeric literals.

@leonid-s-usov
Copy link
Contributor

OK, picked up the toliteral semantics. Slept over it.

Thanks for the nan and infinite cases, added those. I've pushed a change with some clarifications about this new filter.

Practically, toliteral is a type safe version of tojson.
By saying type safe I mean that the value produced with this filter is guaranteed to be fully equivalent to the original JSON value, and can be read back into the same internal representation, without any losses due to conversion or ambiguity.

This explains why nan and infinite numbers will fail just as any other calculated number since they have no literal origin in an input JSON

@wtlangford
Copy link
Contributor

I'll chime back in here with a few notes:

  1. I'm currently only interested in a solution to the ". changed my number!" problem, and that if you do anything on/with the number (even ==), it falls into a IEEE754 double. I'm willing to punt the remainder of that problem to a later improvement (bignums?).
  2. I'm not keen on exposing the internals of the literal. I think this is an implementation detail only. I see this sort of change as an extension underneath the jv type, where regardless if you got a jvp_number_literal or a double as the underlying storage, it still acts just the same! The only difference is that the jvp_number_literal doesn't get approximated if you don't do anything to it.
  3. I'm going to be releasing 1.6 this week. I've frozen the feature set at this point and am just working on a few final administrative tasks. This change won't be in it. The goal from here is to move to a more regular/typical release cycle. Since this isn't about to be part of a proper, tagged release, I'm a bit more willing to do some... smaller, incremental changes that we can all live with and play with.
  4. I'll have some comments to share on the PR shortly.

@leonid-s-usov
Copy link
Contributor

leonid-s-usov commented Oct 22, 2018 via email

@wtlangford
Copy link
Contributor

What a pity that this can’t be included.. Maybe you could review this point after checking the branch?

It's primarily a desire to let something like this live on master for a while before committing to it in a release. It's not really a comment on the branch itself. Again, the goal is to not go another 3 years before 1.7.

Re. exposing the internals… Well if there was a more generic way of dumping jv to string I’d use it and this way encapsulate the implementation details about literals. But currently the only way to control the output is to expose the needed data to the jv_print.c, so I don’t think there’s much choice here. It’s pretty much in line with jv_string_value and jv_invalid_msg

Ah, that comment was specifically about exposing the details of this to someone's jq program. The thought being: "if we later come up with a solution that makes this unnecessary and we remove the functionality, I don't want to have to maintain a toliteral/similar builtin."

@wtlangford
Copy link
Contributor

I would like to mention that I back up @pkoppstein here in a sense that although storing the literal string seems to solve the OP's problem it's a naive intermediate solution, having a number of side effects.

Part of my approach to all of this has been that yes, storing the literal string is a naïve solution, but if we implement it properly and keep it an internal detail, it can provide value to users who are only doing passthrough, and we can continue to investigate/work on/explore more complete solutions. And when we find one/more we like, we can remove this naïve approach and replace it with something better.

I have found this one: http://speleotrove.com/decimal/dnperf.html

decNumber appears to be under the GNU GPL, which we're strongly opposed to.

The performance of the thing is surely low when used as a replacement for all math. However I am suggesting some hybrid approach:

Interesting proposal, but I worry that having different semantics for addition/subtraction vs multiplication/division/complex math funcs will be even more confusing.

@leonid-s-usov
Copy link
Contributor

decNumber appears to be under the GNU GPL, which we're strongly opposed to.

It's a pity.. What is the licence you would accept?

The more I think about it the more I feel that using a decimal instead of string for the literal representation is the best thing we can and maybe even will have to do.

@leonid-s-usov
Copy link
Contributor

There is also this distribution of the library,
http://download.icu-project.org/files/decNumber/

They are having an ICU licence. Would that work? https://github.com/unicode-org/icu/blob/master/icu4c/LICENSE, see paragraph 1.

@leonid-s-usov
Copy link
Contributor

Interesting proposal, but I worry that having different semantics for addition/subtraction vs multiplication/division/complex math funcs will be even more confusing.

Well, there's some logic behind that. First of all, given proper decimal representation we can really compare literals, like for real! No reason we'd not want to do that, cause the 0 != 0.00 nonsense will not happen there.

Now, imagine that I have a long 128 bit identifier as an integer. Using jq i'd want to compare it to another id or even find a distance between them. Or match another id with an offset from this id.
with decimals - totally possible.

On the other hand, when multiplying two high precision numbers I should understand that my precision should become twice as big for the result. This is something i'd hardly expect if I knew that any other math calculations would give me IEEE double precision, - and that is a really nice precision for an everyday math.
Someone searching for a scientific calculator should get one, and jq should at least not screw up their input data when reformatting it for the input ;)

Well, we could also consider making other non-lossy operations on two literals, and fall back to doubles when functions are called or one of the operands is already a double. That could also do, but then we're in the danger of making it a long computation if one decides to reduce an array of literal numbers....

@leonid-s-usov
Copy link
Contributor

@wtlangford Please comment on whether the ICU license is acceptable

COPYRIGHT AND PERMISSION NOTICE

Copyright (c) 1995-2005 International Business Machines Corporation and others
All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, and/or sell copies of the Software, and to permit persons
to whom the Software is furnished to do so, provided that the above
copyright notice(s) and this permission notice appear in all copies of
the Software and that both the above copyright notice(s) and this
permission notice appear in supporting documentation.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT
OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL
INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING
FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Except as contained in this notice, the name of a copyright holder
shall not be used in advertising or otherwise to promote the sale, use
or other dealings in this Software without prior written authorization
of the copyright holder.

--------------------------------------------------------------------------------
All trademarks and registered trademarks mentioned herein are the property of their respective owners.

@sjackman
Copy link

sjackman commented Feb 8, 2019

Quickly scanning through this issue, I don't see that anyone has explicitly mentioned the safe integer limit of a IEEE 754 double-precision binary floating-point. The significand is 53 bits, so the maximum safely representable integer is 2^53 = 9007199254740992. You can try this experiment in GCC.

❯❯❯ gdb
(gdb) p (double)((1ULL<<53)-1)
9007199254740991
(gdb) p (double)((1ULL<<53)+0)
9007199254740992
(gdb) p (double)((1ULL<<53)+1)
9007199254740992
(gdb) p (1ULL<<53)+1
9007199254740993

A reasonable thing to do in the short term is to error and exit (perhaps optionally with a command line option) when reading an integer that exceeds 9,007,199,254,740,992.

See https://en.wikipedia.org/wiki/Double-precision_floating-point_format
and https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/MAX_SAFE_INTEGER
and https://stackoverflow.com/questions/26380364/why-is-number-max-safe-integer-9-007-199-254-740-991-and-not-9-007-199-254-740-9

@nicowilliams
Copy link
Contributor

@sjackman That might not actually be a reasonable thing to do though because there almost certainly are users who do understand the limitations of IEEE754 and are using real numbers and large integers in JSON and jq. There's a pending PR that adds a bigreal implementation, but even that might need to be an option (perhaps the default) for performance reasons.

@sjackman
Copy link

sjackman commented Feb 9, 2019

Using real numbers is okay, and reading a real number that exceeds 9e15 is also okay. Reading an integer that exceeds 2^53 however is going to result in subtly corrupting the data. In that case, it'd be better to produce an error. That could be optional behaviour, but I'd definitely prefer an error rather than corrupting the data.

@nicowilliams
Copy link
Contributor

I don't agree. jq must be able to read numbers too big to represent exactly because it does today and because users can reasonably expect it to continue to. Many JSON parsers use IEEE754 and so users are fairly used to this danger.

@sjackman
Copy link

I'd suggest an option (say --integer-overflow-error) to enable an error when losing precision on integers. Losing precision on reals is expected behaviour. Losing precision on integers is usually undesirable. Just a feature request anyway. Feel free to ignore. It sounds like you have a bigint solution in the works.

@nicowilliams
Copy link
Contributor

That would work. First we should get the bigreal contribution integrated.

@leonid-s-usov
Copy link
Contributor

Speaking of which, what's the plan here? There was some progress when @wtlangford pushed some thread local additions to the PR on January 5.
I understand that some performance checks were run but I haven't heard of any results.
The trick here is that the impact is not obvious since the conversion to the IEEE binary floating point is deferred and omitted at all if no math is performed on the numbers, which improves the parsing, as decimal floating point is much easier to parse and store.

@nicowilliams
Copy link
Contributor

@leonid-s-usov, @wtlangford told me it'd be ready very soon. I'll talk to him this week.

@peterlynch
Copy link

Another example: https://jqplay.org/s/9ZW4XcEwqX

This:

{"startTime":9223372036854775807,"nextFireTime":9223372036854775807}

Gets changed to

{
  "startTime": 9223372036854776000,
  "nextFireTime": 9223372036854776000
}

@leonid-s-usov
Copy link
Contributor

@peterlynch one may easily overflow this thread with such examples. They are literally infinite.
However, they aren't giving any new input and the solution to this problem has already been submitted. We are waiting to have it merged soon, fingers crossed

@expipiplus1
Copy link

It's quite surprising to get something like this, what happened to my minor version number!

$ echo '{"version": 0.0}' | jq .version
0

@leonid-s-usov
Copy link
Contributor

@expipiplus1

  1. The dot and the precision should be maintained by the latest implementation, check the master branch version.
  2. Even though it will work in a future version, you can't expect that. In your case, this should have been a string to preserve any semantic meaning. Once you add your build as a third component this suddenly stops being a valid fractional number, so I'd suggest you follow the common practice of storing versions as strings, or if you wish, use an object to store separate components of the version as integers.

@wtlangford
Copy link
Contributor

I'd like to further comment on this-
Both 0.0 and 0 are completely equivalent JSON encodings of the number "zero". That is, to RFC7159 (the JSON spec), both 0.0 and 0 decode to the number value 0, as JSON doesn't define "typed" numbers (it doesn't distinguish between integers and non-integers).

While the changes on the master branch should have the side effect of preserving your precision, I don't consider that to be the primary intent of those changes- those changes are intended to avoid, when possible, floating-point approximations on numbers that aren't precisely representable by IEEE754 doubles (our internal representation for numbers in jq).

@pkoppstein
Copy link
Contributor

@wtlangford - A small point, but although RFC7159 does mention that "This specification allows implementations to set limits on the range and precision of numbers accepted", I don't see anything that requires that 0 and 0.0 be mapped to the same value or be regarded as equivalent or equal. Am I missing something?

@leonid-s-usov
Copy link
Contributor

leonid-s-usov commented Oct 19, 2020

This specification allows implementations to set limits on the range
and precision of numbers accepted. Since software that implements
IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is
generally available and widely used, good interoperability can be
achieved by implementations that expect no more precision or range
than these provide, in the sense that implementations will
approximate JSON numbers within the expected precision. A JSON
number such as 1E400 or 3.141592653589793238462643383279 may indicate
potential interoperability problems, since it suggests that the
software that created it expects receiving software to have greater
capabilities for numeric magnitude and precision than is widely
available.
Note that when such software is used, numbers that are integers and
are in the range [-(2^53)+1, (2^53)-1] are interoperable in the
sense that implementations will agree exactly on their numeric
values.

The way I read the above suggests to me that

  1. it is suggested that no more precision or range is to be expected than fits into a double-precision float
  2. integers in the given range are guaranteed to mean themselves. 0.0 is not an integer

herowinb added a commit to herowinb/live-dl that referenced this issue Feb 4, 2021
Put Discord Channel ID bettwen 2 ", or it will be changed by jq. More detail: github.com/jqlang/jq/issues/1741
Add line to detect if Discord Channel ID is default
@Abbyyan
Copy link

Abbyyan commented Sep 7, 2021

Is there any suggestions about convert json stream with uinit now?
Hope for an approach solution.
Thanks a lot.

@emanuele6
Copy link
Member

jq 1.7 released with the fix. closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests