Improve Unicode terminology #51
Comments
I don't think that this is correct. The Unicode Standard https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf mentions both code points and characters. Code points identify or encode characters (but the mapping is not 1-1). Of course, the sequence of bits that ends up being used to encode a string uses code points at some stage, but I don't see that using only "code point" is helpful. See also the definition of xsd:string at https://www.w3.org/TR/xmlschema11-2/#string We might want to do better than this but please don't take all the character out of the document. |
I’ll re-phrase. We do over-use the term “character”, but it is sometimes appropriate. |
See discussion in w3c/rdf-semantics#41 (comment) about using Unicode code points instead of Unicode strings, and potentially forbidding non character code points (surrogates and code points ending in FFFE and FFFF), which are currently allowed. Also consider the use of Unicode scalar value which prohibits surrogate code points. |
A similar issue recently came up for JSON in RFC 8259 Errata 7603.
Note that there's some discussion in email about whether surrogates are intended to be included, suggesting the potential use of Unicode scalar value, which is what I-JSON provides for, and is recommended for all IETF-specified protocols. However, note the provision: "But in the real world JSON strings contain any old combination of Unicode code points, as described in the report."

Note that concrete RDF grammars are typically defined using EBNF, which has a built-in concept of a Char:

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

So surrogate characters are already excluded. Surrogates include High-Surrogate Code Point:

The IRI grammar is based on ABNF, where

    ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
            / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
            / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
            / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
            / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
            / %xD0000-DFFFD / %xE1000-EFFFD

This suggests that replacing the use of "Unicode string" with "a sequence of Unicode scalar values", or similar, would not introduce a backwards-compatibility issue. (JSON-LD should clarify that strings are composed only of Unicode scalar values). |
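For anyone who wants to poke at the difference between the Char production and Unicode scalar values, here is a minimal Python sketch. The function names are made up for illustration and are not taken from any RDF or XML library; the ranges are copied from the Char production quoted above and from the Unicode definition of scalar value (all code points except U+D800..U+DFFF).

```python
def is_surrogate(cp: int) -> bool:
    """True for high- and low-surrogate code points (U+D800..U+DFFF)."""
    return 0xD800 <= cp <= 0xDFFF


def is_scalar_value(cp: int) -> bool:
    """True for any Unicode code point that is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)


def is_xml_char(cp: int) -> bool:
    """True if cp matches the Char production quoted in the comment above."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


if __name__ == "__main__":
    # A few probe values: NUL, a letter, a surrogate, and some noncharacters.
    for cp in (0x0, 0x41, 0xD800, 0xFFFE, 0x1FFFE, 0x10FFFF):
        print(f"U+{cp:04X} scalar={is_scalar_value(cp)} char={is_xml_char(cp)}")
```

Under these definitions every Char is a scalar value, but not the other way round: U+0000 and U+FFFE, for instance, are scalar values that Char rejects.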
Looking through all the RDF/SPARQL docs with grep, many mentions are of the form "Unicode codepoints U+0020" - a specific value and it uses the
so, IIUC, that is in the concrete written encoding space (UTF-8 etc), not in the abstract space of codepoints and scalar values. However. So - suggestion for discussion - say it once with proper explanation, maybe as a note, and not worry too much elsewhere. Many are "unicode string" (lowercase "u") - less formal. rdf-xml is the worst for "Unicode string" (uppercase "U") although it may have been correct at the time. Checking for perfect sequences of scalar values (vs codepoints), as the JSON errata notes, isn't universally done, which is not surprising as an extra pass over a string during e.g. parsing will cause a measurable slowdown. Support across different programming languages might be different as well. |
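To make the "extra pass" concrete, here is a rough Python sketch of such a check. `first_surrogate_index` is a hypothetical helper, not part of any parser. Python is one of the languages where this matters, because a `str` is a sequence of code points and may contain lone surrogates.

```python
def first_surrogate_index(s: str) -> int:
    """Return the index of the first surrogate code point in s, or -1.

    This is the extra linear pass discussed above: every code point
    of the string is inspected once.
    """
    for i, ch in enumerate(s):
        if 0xD800 <= ord(ch) <= 0xDFFF:
            return i
    return -1


def is_scalar_value_string(s: str) -> bool:
    """True if s consists only of Unicode scalar values."""
    return first_surrogate_index(s) == -1


if __name__ == "__main__":
    for sample in ("abc", "a\ud800b", "\U0001F600"):
        print(ascii(sample), is_scalar_value_string(sample))
```

Other languages behave differently (for example, some string types cannot hold a lone surrogate at all), so the cost and the need for this pass depend on the implementation language.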
I left some comments just now on #59. Reading the above again and what you're attempting to do with terminology makes me think that more discussion is needed. The In short, the term "Unicode string" doesn't mean anything specific, and I would suggest adopting the term DOMString directly or defining what you mean by "Unicode string" (probably to mean a sequence of code points and not excluding surrogates) |
Some alternatives from the Unicode Glossary to "Unicode string":
None of these are currently defined in the Internationalization Glossary, but presumably could be. Note that Unicode does define Unicode String as "A code unit sequence containing code units of a particular Unicode encoding form (whether well-formed or not)." This would suggest that it's appropriate only if we want to bind the lexical value to a particular encoding, which may be reasonable. |
Thanks for this. I don't think those are necessarily better options for RDF. Most of your references to a string, as you've noted elsewhere, are really just veiled references to an What I would recommend is:
(Infra defines "is"/"identical to" with this in mind) Doing it this way saves having to make a bunch of modifications globally. |
@aphillips Thanks, we're definitely converging on that description, and it's a subject for discussion in upcoming meetings. I think the best we can do is have a section "Strings in RDF" that describes what you outline, without trying to define a string term, which would be out-of-scope for RDF. We can then replace our references to "Unicode string" with just "string" and rely on guidance from that section in RDF Concepts for how strings are to be interpreted in RDF. |
There is really only one place that more-or-less arbitrary strings show up In RDF 1.1 Concepts the lexical form of literals is "a Unicode [UNICODE] string", which can be a sequence of 8-bit, 16-bit, or 32-bit integers. This So what should lexical forms be? As far as I can tell there are four viable Each alternative covers the entirety of all the built-in datatypes. (The |
Yes. The lexical form should be as general as possible; there may be use cases for strange datatypes, so the definition should be as wide as possible. Of the choices, I prefer a sequence of Unicode code points with a discouragement on high-surrogate and low-surrogate code points and "compatibility characters". I don't think that, for lexical forms, making anything illegal is helpful. FWIW xsd:string is not completely fixed. It depends on the choice of XML 1.0 vs XML 1.1.
|
If all Unicode code points are to be allowed why not go the whole way to any 32-bit integer? |
Most of the RDF syntaxes are defined for a concrete encoding of UTF-8, which has a maximum length of 4 bytes encoding 21 bits of value. I have understood a "unicode string" as being after decoding, i.e. code points - RDF Concepts is about the abstract data model. I don't see how to read it as "a sequence of 8-bit, 16-bit, or 32-bit integers" because that is about encoding forms (all aside from RDF/XML, which could be in a non-UTF form). UTF-32 isn't 32 bits of value - it has "ill-formed" exclusions. |
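A small Python illustration of why "any 32-bit integer" is not the same thing as "any code point". The helper names are invented for this sketch; `chr` and the `utf-32-be` codec are standard Python.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point; needs 21 bits


def fits_in_unicode(value: int) -> bool:
    """True if value is a code point that Python's chr() accepts."""
    try:
        chr(value)
    except (ValueError, OverflowError):
        return False
    return True


def decodes_as_utf32(value: int) -> bool:
    """True if the 32-bit big-endian integer decodes as one UTF-32 code unit."""
    raw = value.to_bytes(4, "big")
    try:
        raw.decode("utf-32-be")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print("bits needed:", MAX_CODE_POINT.bit_length())
    for value in (0x1F600, 0x10FFFF, 0x110000):
        print(hex(value), fits_in_unicode(value), decodes_as_utf32(value))
```

So a 32-bit container does not give 32 bits of usable value: anything above U+10FFFF is rejected both as a code point and as a UTF-32 code unit.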
According to the Unicode spec, all Unicode encoding forms are restricted to only encode scalar values:

"As for all of the Unicode encoding forms, UTF-32 is restricted to representation of code points in the ranges 0..D7FF₁₆ and E000₁₆..10FFFF₁₆—that is, Unicode scalar values. This guarantees interoperability with the UTF-16 and UTF-8 encoding forms." [Unicode 15.0, page 35]

"In the UTF-16 encoding form, non-surrogate code points in the range U+0000..U+FFFF are represented as a single 16-bit code unit; code points in the supplementary planes, in the range U+10000..U+10FFFF, are represented as pairs of 16-bit code units. These pairs of special code units are known as surrogate pairs." [Unicode 15.0, page 36]

Allowing surrogate code points in an RDF lexical string to be serialized as code units in the surrogate range can result in round-tripping changes that are not detectable as errors.

"Beyond the ASCII range of Unicode, many of the non-ideographic scripts are represented by two bytes per code point in UTF-8; all non-surrogate code points between U+0800 and U+FFFF are represented by three bytes; and supplementary code points above U+FFFF require four bytes." [Unicode 15.0, page 37]

If an RDF lexical value includes surrogate code points, the only way to serialize them in any RDF surface syntax that uses any UTF encoding form is via \u or \U escapes. This is the same situation as any non-code points that could occur in an RDF lexical value. It is true that UTF-32 cannot contain code units that are not code points, requiring that they be serialized using \U escapes, but this is the same situation as for surrogate code points. |
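The byte and code-unit counts quoted above, and the fact that a lone surrogate has no UTF-8 form, can be checked with a short Python sketch. The helper names are hypothetical, and `escape_surrogates` only imitates the \u escape style used in RDF surface syntaxes; it is not a real serializer.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes used for ch in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_units(ch: str) -> int:
    """Number of 16-bit code units used for ch in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2


def can_encode_utf8(s: str) -> bool:
    """False if s contains something with no UTF-8 form, e.g. a lone surrogate."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def escape_surrogates(s: str) -> str:
    """Replace surrogate code points with \\uXXXX escapes, leave the rest alone."""
    return "".join(
        f"\\u{ord(ch):04X}" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in s
    )


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X}: {utf8_len(ch)} bytes, {utf16_units(ch)} units")
    lone = "a\ud800b"
    print(can_encode_utf8(lone), escape_surrogates(lone))
```

Note that Python's strict UTF-8 codec refuses the lone surrogate outright, which is one concrete way the round-tripping concern shows up in practice.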
The PR goes further - it currently takes up the Char production from XML, which excludes U+0000 and the 2 "non-characters". The XML text advises against other characters, which we'd inherit. Unicode Scalar Value is better than XML Char. |
There are lots more "non-characters" than the two excluded by Char. |
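For reference, a quick Python count, assuming the Unicode definition of noncharacters (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes). `is_noncharacter` is a name made up for this sketch.

```python
def is_noncharacter(cp: int) -> bool:
    """True for Unicode noncharacters.

    These are U+FDD0..U+FDEF and every code point whose low 16 bits
    are FFFE or FFFF (the last two code points of each plane).
    """
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# All noncharacters across the 17 planes (U+0000..U+10FFFF).
NONCHARACTERS = [cp for cp in range(0x110000) if is_noncharacter(cp)]

if __name__ == "__main__":
    print(len(NONCHARACTERS), "noncharacters in total")
    print("first:", f"U+{NONCHARACTERS[0]:04X}", "last:", f"U+{NONCHARACTERS[-1]:04X}")
```

Of these, the Char production only rules out U+FFFE and U+FFFF; the rest fall inside the ranges it allows.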
I'm applying suggestions to the PR that remove the restriction to the Char production, but that also removes restrictions on surrogates. I would be in favor of replacing that with restricting code points to Unicode scalar values instead, if that satisfies the various requirements. Note that I did simplify the table description of |
xsd:string has its own value space (actually two) and can't be separated from that. The string definition is currently quite different. |
True - I meant that FFFE and FFFF are "non-characters" because they relate to bad practice in decoding control (legacy). |
Perhaps a place for the XML Char production is in the description of values of |
What had you in mind? I don't see the need in RDF Concepts (the abstract data model). The XML TR has advice about Char and we don't want to imply that. As with any datatype, the definition is with the datatype.
|
The previous "Character strings (but not all Unicode character strings)" was rather vague. I have a suggestion (#59 (comment)) to change it to "Character strings matching the Char production from [[XML11]]". Updates for XML11 have been made elsewhere, and it would make sense to reference the XML 1.1 Char production here, but maybe that's too restrictive? I think we need to improve the previous statement to make it clearer, though. |
Why do we need to refer to Char? Char is not the same as Unicode scalar value for XML-related reasons. |
Suggest something else. The previous line was not helpful. It seemed to me that Char matched what could be in an xsd:string, but I’m sure I don’t have a complete appreciation for the subtleties. |
I note that we appear to be painting the bikeshed pretty thoroughly 😉. Mentioning Note that XML in XML Schema means 1.1 and 1.1's |
Done. See the PR. Linking to the definition of (RDF) string from "core type" xsd:string and then mentioning Char is confusing. |
If this were a new spec, then yes. Whether a strict RDF 1.2 implementation can ingest any RDF 1.1 document, or whether a database that stores RDF 1.1 data is affected, is quite important in any claims this WG makes about compatibility. There is a certain amount of less-than-perfect data out there. 😑 |
Some uses of "characters" in the spec should be replaced with "code points" to conform with standard recommendations. For example, the phrase "compare equal, character by character" should probably be "compare equal, code point by code point".
Improve the use of Unicode terminology throughout the document. Better reference terminology from i18n-glossary. (Note, there is no convenient externally defined term for "Unicode string").
Within a string, constrain code points to be Unicode scalar values, which excludes surrogates.
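As an illustration of why "code point by code point" is the more precise wording, here is a hedged Python sketch. The helper names are invented, and the normalization step is only there to show the contrast; nothing in this issue proposes normalizing lexical forms.

```python
import unicodedata


def code_points(s: str) -> list[str]:
    """List the code points of s in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in s]


def equal_by_code_point(a: str, b: str) -> bool:
    """Compare two strings code point by code point, with no normalization."""
    return len(a) == len(b) and all(ord(x) == ord(y) for x, y in zip(a, b))


composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

if __name__ == "__main__":
    print(code_points(composed), code_points(decomposed))
    print(equal_by_code_point(composed, decomposed))
    print(equal_by_code_point(composed, unicodedata.normalize("NFC", decomposed)))
```

Both strings display as the same user-perceived character, yet they differ as code point sequences, so a reader who takes "character by character" literally could reach a different answer than the comparison the spec intends.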