Value space of rdf:JSON datatype #65

Closed
gkellogg opened this issue Sep 20, 2023 · 21 comments · Fixed by #66
Labels
spec:substantive Issue or proposed change in the spec that changes its normative content

Comments

@gkellogg
Member

Updates for #62 delved into updating the definition of the value space of the rdf:JSON datatype to use more primitive concepts from INFRA (arrays, maps, strings, booleans, and null) as well as number from ECMAScript.

The existing value space is based on the JCS representation of the JSON literal value. The proposed update could look like the following:

The value space is a single JSON value in the form of an array, map, string, number, boolean, or null.
  • Array entries may be any of the above JSON values.
  • Map keys are strings with values, which may be any of the above JSON values.

Two JSON values are considered equal if they are the same string, number, boolean, or null; if they are both arrays with entries which are pairwise equal; or if they are both maps with equal map entries.
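The equality rule above can be sketched in Python as a recursive comparison over parsed values. This is only an illustration: `json_equal` is a hypothetical helper, Python's `dict`, `list`, `str`, `int`/`float`, `bool`, and `None` stand in for map, array, string, number, boolean, and null, and treating `1` and `1.0` as the same number is an assumption the proposed text does not settle.

```python
import json

def json_equal(a, b):
    """Structural equality over parsed JSON values (illustrative sketch)."""
    # bool is a subclass of int in Python, so booleans are handled before numbers.
    if isinstance(a, bool) or isinstance(b, bool):
        return isinstance(a, bool) and isinstance(b, bool) and a == b
    if a is None or b is None:
        return a is None and b is None
    if isinstance(a, str) or isinstance(b, str):
        return isinstance(a, str) and isinstance(b, str) and a == b
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a == b  # assumption: numbers compare by numeric value
    if isinstance(a, list) and isinstance(b, list):
        # Arrays: same length and pairwise-equal entries, in order.
        return len(a) == len(b) and all(json_equal(x, y) for x, y in zip(a, b))
    if isinstance(a, dict) and isinstance(b, dict):
        # Maps: same keys with equal values; entry order is irrelevant.
        return a.keys() == b.keys() and all(json_equal(a[k], b[k]) for k in a)
    return False

left = json.loads('{"a": [1, true], "b": null}')
right = json.loads('{ "b" : null, "a" : [1, true] }')
same = json_equal(left, right)
```

The explicit boolean branch matters in Python, where plain `==` would report `True == 1`.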

@gkellogg added the spec:substantive label Sep 20, 2023
@afs
Contributor

afs commented Sep 21, 2023

array, map, string, number, boolean, or null.

As a general principle - I'm in favour of linking to original definitions where possible rather than incorporating material or normative referencing derived works which may diverge because they are for a specific or different purpose.

For JSON, RFC 8259 - I think that is the original definitive place (it would mean "map" -> "object"). RFC 8259 is the current STD-90.

(certainly for "string" - because a JSON string is not an RDF string or xsd:string)

@pfps
Contributor

pfps commented Sep 21, 2023

But just what are the JSON values, particularly number?

@afs
Contributor

afs commented Sep 21, 2023

What are the requirements?

https://www.w3.org/TR/json-ld11/#terms-imported-from-other-specifications

where number goes to
https://tc39.es/ecma262/#sec-terms-and-definitions-number-value

but is that what the value space for a JSON fragment is for?

If it is "JSON processors treat them the same" then https://www.rfc-editor.org/rfc/rfc8259.html#section-6
and apply (some of) RFC8785 because the ultimate abstract value is not important.

@gkellogg
Member Author

array, map, string, number, boolean, or null.

As a general principle - I'm in favour of linking to original definitions where possible rather than incorporating material or normative referencing derived works which may diverge because they are for a specific or different purpose.

For JSON, RFC 8259 - I think that is the original definitive place (it would mean "map" -> "object"). RFC 8259 is the current STD-90.

(certainly for "string" - because a JSON string is not an RDF string or xsd:string)

First, we need to decide if we want to go for this decomposed notion of a JSON value for the value space. I'm fine with sourcing RFC8259, which would get out of the problem of having to go to ECMAScript for numbers.

JSON-LD (and INFRA) tend to use the term "map" rather than "object", as "object" is overly general. We can use the term "map" while still referencing the "Object" section in the RFC8259.

Regarding strings, certainly the strings referenced as JSON values (or within a JSON serialization) reference "strings" from RFC8259, and may include their own escape sequences. While "\uDEAD" may be represented (it technically can be in JSON-LD 1.1), this would be an aspect of the JSON value, rather than the lexical representation which would not allow a surrogate natively. JSON-LD 1.2 would likely be updated to exclude surrogates. It should be clear, and we may need to state it as such, that a JSON string is disjoint from an RDF string.

Of course, the other alternative is to not go with the decomposed notion of a JSON value as the value space, in which case we're dealing exclusively with RDF strings containing a JSON serialization. Note that the existing value space uses JCS/RFC8785 for the canonical form of JSON, which has similar requirements for character representation as our own, and requires implementations to terminate if a "lone surrogate" is found.
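To make the surrogate point concrete, here is a small Python sketch. It relies on the standard `json` module accepting an escaped lone surrogate and on UTF-8 encoding refusing it; `has_lone_surrogate` is a hypothetical helper, not something taken from JSON-LD or JCS.

```python
import json

def has_lone_surrogate(s):
    """True if the string holds an unpaired surrogate and so cannot be encoded as UTF-8."""
    try:
        s.encode("utf-8")
        return False
    except UnicodeEncodeError:
        return True

# The escape is legal JSON syntax, so the parser returns a string value...
lone = json.loads('"\\uDEAD"')
# ...while a proper surrogate pair is combined into a single code point.
paired = json.loads('"\\uD83D\\uDE00"')

lone_flag = has_lone_surrogate(lone)
paired_flag = has_lone_surrogate(paired)
```

The first value exists as a JSON string but has no well-formed Unicode representation, which is the case a strict canonicalizer would have to reject.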

@afs
Contributor

afs commented Sep 22, 2023

do the two styles agree on what matches for numbers? (I think JCS does because it (in effect) goes through binary)

@pfps
Contributor

pfps commented Sep 22, 2023

JCS has the decided advantage of only processing a subset of JSON. Unless rdf:JSON is limited to that subset depending on JCS may not be possible.

@afs
Contributor

afs commented Sep 22, 2023

I-JSON: RFC 7493

@gkellogg
Member Author

JCS was used by JSON-LD to create the RDF serialization of a JSON value in the Object to RDF Conversion algorithm, so it never did allow for surrogates, although JCS was not finalized at that time, so the definition of canonical lexical form may not strictly define that restriction. Any strict update to the rdf:JSON definition within JSON-LD would use JCS directly, and further limit code points similar to how we've done in RDF Concepts, and disallow surrogates explicitly. Obviously, this is what I-JSON did.

@pfps
Contributor

pfps commented Sep 22, 2023

I-JSON has a lot more restrictions than just nice strings. Is rdf:JSON supposed to have these other restrictions too? If so, these other restrictions need to be stated explicitly.

The nice strings restriction needs to be either stated or true. I think that it is not true currently.

@afs
Contributor

afs commented Sep 22, 2023

"A lot"?

It is those things that make for accurate consistent parsing.

@pfps
Contributor

pfps commented Sep 22, 2023

Number restrictions to IEEE floating point double.
No duplicate member names.

Ok, so not a lot in absolute terms. But a large part of the JSON syntax is affected.
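As a rough illustration of what the double restriction means in practice, the Python sketch below reads the same number tokens once as IEEE 754 doubles and once as exact decimals. The helper names are hypothetical, and the decimal reading is included only for contrast.

```python
import json
from decimal import Decimal

def as_double(token):
    """Read a JSON number token as an IEEE 754 double, even when it looks like an integer."""
    return json.loads(token, parse_int=float)

def as_decimal(token):
    """Read the same token as an exact decimal, for contrast."""
    return json.loads(token, parse_int=Decimal, parse_float=Decimal)

# 2**53 + 1 has no exact double representation and rounds to 2**53.
same_as_double = as_double("9007199254740993") == as_double("9007199254740992")
same_as_decimal = as_decimal("9007199254740993") == as_decimal("9007199254740992")

# Different spellings of one double collapse to the same value.
spellings_agree = as_double("1") == as_double("1.0") == as_double("1e0")
```

Under the double reading the two large integers denote one value; under the exact reading they stay distinct, so the choice of number model changes which literals are equal.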

@afs
Contributor

afs commented Sep 23, 2023

It is the areas where there are no common, stable, implemented values.
Unless @pfps has a proposal?

@gkellogg
Member Author

JSON doesn't allow duplicate keys (member names), either, although it is not typically an error condition; the last key wins. Limitations of I-JSON (and JCS) on string and number representation should not be a problem, as they're effectively already in place in JSON-LD due to the tacit correspondence to JCS.
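The "last key wins" behaviour is easy to observe in one common parser. The sketch below uses Python's standard `json` module; `reject_duplicates` is a hypothetical hook showing how an I-JSON-style check could be layered on top, not a description of what any JSON-LD processor does.

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook that raises on repeated member names (hypothetical helper)."""
    names = [name for name, _ in pairs]
    if len(names) != len(set(names)):
        raise ValueError("duplicate member name in object")
    return dict(pairs)

text = '{"a": 1, "a": 2}'

# Default behaviour: no error, the last value for the name is kept.
last_wins = json.loads(text)

# Stricter behaviour: the same text is refused.
try:
    json.loads(text, object_pairs_hook=reject_duplicates)
    strict_ok = True
except ValueError:
    strict_ok = False
```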

@pfps
Contributor

pfps commented Sep 24, 2023

For JSON numbers, I suggest xsd:decimal.
For objects, I suggest name-value pairs.

@afs
Contributor

afs commented Sep 24, 2023

Why have something that has different interpretations across different JSON implementations?

I-JSON/JCS reflects where JSON is standardised, de-facto and de-jure.

@pfps
Contributor

pfps commented Sep 24, 2023

The question is whether rdf:JSON is going to be the JSON that "does not attempt to impose ECMAScript’s internal data representations on other programming languages" and thus has objects containing "zero or more name/value pairs", strings as "sequence[s] of zero or more Unicode characters", and numbers as potentially unbounded decimal values, or the JSON that has objects as ECMAScript objects with all "properties of an object [...] uniquely identified using property keys", strings as "ordered sequences of zero or more 16-bit unsigned integer values", and numbers as "double-precision 64-bit format IEEE 754-2019 values".

If rdf:JSON is going to be the former, then all references should be to json.org and RFCs, JSON values should not be tied to ECMASCRIPT, and string ordering should be by Unicode codepoint; if rdf:JSON is going to be the latter, then all references should be to the ECMAScript 2024 Language Specification or whatever document currently defines ECMASCRIPT and JSON values and string ordering can be by UTF-16 code unit.
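The two string orderings mentioned here really do differ, but only once characters outside the Basic Multilingual Plane are involved. A minimal Python sketch, with hypothetical key functions and sample strings chosen purely for illustration:

```python
def by_code_point(s):
    """Sort key: the sequence of Unicode code points."""
    return [ord(ch) for ch in s]

def by_utf16_code_unit(s):
    """Sort key: the sequence of 16-bit code units of the UTF-16 encoding."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# U+1F600 is encoded in UTF-16 as the surrogate pair D83D DE00.
# U+FF61 is a single code unit, FF61.
strings = ["\U0001F600", "\uFF61"]

code_point_order = sorted(strings, key=by_code_point)
code_unit_order = sorted(strings, key=by_utf16_code_unit)
```

By code point, U+FF61 sorts first; by UTF-16 code unit, U+1F600 sorts first because its leading surrogate D83D is smaller than FF61.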

@domel
Contributor

domel commented Sep 24, 2023

Agreed. But referencing json.org, which can change at any time, is not a good idea.

@afs
Contributor

afs commented Sep 24, 2023

json.org has a link at the top to ECMA-404 (the link is broken (!! given the number) but ECMA-404 exists)

The JSON syntax specified by this specification and by RFC 8259 are intended to be identical.

The warning on the ECMA-404 download page is worth noting.

@gkellogg
Member Author

Suggest a PR that does the following:

The lexical space is the set of RDF strings which conform to the JSON Grammar as described in Section 2 JSON Grammar of [RFC8259] which are also I-JSON messages [RFC7493].

The value space is the set of arrays, objects, strings, numbers, and JSON literals (boolean and null) [RFC8259]. Two values are considered equal if they are the same string, number, or JSON literal; if they are both arrays with elements which are pairwise equal; or if they are both objects with equal members.

The lexical-to-value mapping maps every element of the lexical space to the result of parsing it into a JSON value.

I don't think we need to get into the relationship between JSON strings and RDF strings, or exactly what a JSON number is, other than as defined in RFC8259. Note that the lexical space is an RDF string, as any lexical value must be.

@pfps
Contributor

pfps commented Sep 27, 2023

I prefer a value space that is not tied to ECMAscript and a lexical order that is not tied to UTF-16. I suggest the following, which handles all JSON texts:

Value space:

The value space of rdf:JSON is recursively defined as the union of

  • objects - finite bags of members, which are pairs of strings (names) and rdf:JSON values (values)
  • arrays - finite sequences of elements, which are rdf:JSON values
  • numbers - the value space of xsd:decimal
  • strings - finite sequences of UNICODE code points
  • false, null, and true - constants different from any other elements of the value space

Ordering:

Objects are less than arrays, which are less than numbers, which are less than strings, which are less than false, which is less than null, which is less than true.
Object members are ordered by lexicographic ordering over their name and value.
Objects are ordered by first sorting their members from lesser to greater and then using lexicographic order over the resulting sequences.
Arrays are ordered by lexicographic ordering over their elements.
Numbers are ordered by the ordering of real numbers.
Strings are ordered by lexicographic ordering over code points.

Canonical form:

The canonical form of an object is { followed by the canonical form of its members in order from lesser to greater separated by , followed by }.
The canonical form of an array is [ followed by the canonical form of its elements in sequence order separated by , followed by ].

The canonical form of a number is its xsd:decimal canonical form.

The canonical form of a string is " followed by the string with " replaced by \", \ replaced by \\,
U+0008 replaced by \b, U+0009 replaced by \t, U+000A replaced by \n, U+000C replaced by \f, U+000D replaced by \r,
and other code points between U+0000 through U+001F, inclusive, replaced by \uhhhh where hhhh is the lower-case four-digit hexadecimal numeral for the code point followed by ".

The canonical form of false is the string false, the canonical form of null is the string null, the canonical form of true is the string true.
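The string rule in the canonical form above can be sketched as follows. This assumes the quote and backslash replacements are the usual JSON escapes `\"` and `\\`; `canonical_string` is a hypothetical name, and the sketch covers strings only, not objects, arrays, or numbers.

```python
import json

# Short escapes named in the proposal, plus the quote and backslash escapes.
SHORT_ESCAPES = {
    '"': '\\"',
    "\\": "\\\\",
    "\b": "\\b",
    "\t": "\\t",
    "\n": "\\n",
    "\f": "\\f",
    "\r": "\\r",
}

def canonical_string(s):
    """Serialize a string following the proposed canonical form (illustrative sketch)."""
    out = ['"']
    for ch in s:
        if ch in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[ch])
        elif ord(ch) <= 0x1F:
            # Remaining control characters: \uhhhh with lower-case hex digits.
            out.append("\\u%04x" % ord(ch))
        else:
            # Everything else is emitted as-is.
            out.append(ch)
    out.append('"')
    return "".join(out)

sample = 'x\\y\n\t"\x1f'
round_trip = json.loads(canonical_string(sample))
```

Parsing the result with an ordinary JSON parser should give back the original string, which is a quick sanity check on the escaping.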

@gkellogg
Member Author

I prefer a value space that is not tied to ECMAscript and a lexical order that is not tied to UTF-16.

PR #66 does not currently reference either spec directly (indirectly through RFC8259 and JCS).

I suggest the following, which handles all JSON texts:

Value space:

The value space of rdf:JSON is recursively defined as the union of

  • objects - finite bags of members, which are pairs of strings (names) and rdf:JSON values (values)
  • arrays - finite sequences of elements, which are rdf:JSON values
  • numbers - the value space of xsd:decimal
  • strings - finite sequences of UNICODE code points
  • false, null, and true - constants different from any other elements of the value space

Note that xsd:decimal is neither adequate to represent all JSON numbers nor consistent with JSON-LD. If defined in terms of XSD types, it should stick with what JSON-LD does and use either xsd:integer or xsd:double depending on the existence of a fractional part.

Ordering:

Objects are less than arrays, which are less than numbers, which are less than strings, which are less than false, which is less than null, which is less than true. Object members are ordered by lexicographic ordering over their name and value. Objects are ordered by first sorting their members from lesser to greater and then using lexicographic order over the resulting sequences. Arrays are ordered by lexicographic ordering over their elements. Numbers are ordered by the ordering of real numbers. Strings are ordered by lexicographic ordering over code points.

Ordering should be consistent with ordering the JCS representation. This implies that:

  • strings starting with " (U+0022) would come before
  • numbers (leading decimal U+0030-U+0039 or hyphen U+002D), which come before
  • array starting with [ (U+005B), which come before
  • false (f is U+0066), which come before
  • null (n is U+006E), which come before
  • true (t is U+0074), which come before
  • object ({ is U+007B).
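The type ordering listed above falls out of comparing the first character of each serialization. The Python sketch below uses `json.dumps` as a stand-in for JCS, on the assumption that both emit the same leading character for these values; `leading_code_point` is a hypothetical helper, and it shows only how the types group, not the order within a type.

```python
import json

def leading_code_point(value):
    """Code point of the first character of the value's JSON serialization."""
    return ord(json.dumps(value)[0])

values = [{}, True, None, False, [], 7, -3, "s"]

# Sorting on the leading character groups values by type in the order listed above.
ordered = sorted(values, key=leading_code_point)
serialized = [json.dumps(v) for v in ordered]
```

Strings come first and objects last, the reverse of the object-first ordering proposed in the quoted text.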

Canonical form:

The canonical form of an object is { followed by the canonical form of its members in order from lesser to greater separated by , followed by }. The canonical form of an array is [ followed by the canonical form of its elements in sequence order separated by , followed by ].

The canonical form of a number is its xsd:decimal canonical form.

The canonical form of a string is " followed by the string with " replaced by \", \ replaced by \\, U+0008 replaced by \b, U+0009 replaced by \t, U+000A replaced by \n, U+000C replaced by \f, U+000D replaced by \r, and other code points between U+0000 through U+001F, inclusive, replaced by \uhhhh where hhhh is the lower-case four-digit hexadecimal numeral for the code point followed by ".

The canonical form of false is the string false, the canonical form of null is the string null, the canonical form of true is the string true.

This really needs to be JCS due to wide implementation in JSON-LD processors already.
