Improve Unicode terminology #51

Closed
gkellogg opened this issue Jun 27, 2023 · 27 comments · Fixed by #59
Labels
discuss-f2f Proposed for discussion during the next face-to-face meeting i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. needs discussion Proposed for discussion in an upcoming meeting spec:editorial Minor issue or proposed change in the specification (markup, typo, informative text) spec:substantive Issue or proposed change in the spec that changes its normative content

Comments

@gkellogg
Member

gkellogg commented Jun 27, 2023

Some of the use of "characters" in the spec should be replaced with "code points" to conform with standard recommendations. For example the phrase "compare equal, character by character" should probably be "compare equal, code point by code point".

Improve the use of Unicode terminology throughout the document. Better reference terminology from i18n-glossary. (Note, there is no convenient externally defined term for "unicode string").

Within a string, constrain code points to be unicode scalar values, which excludes surrogates.

@gkellogg gkellogg added the spec:editorial Minor issue or proposed change in the specification (markup, typo, informative text) label Jun 27, 2023
@pfps
Contributor

pfps commented Jun 27, 2023

I don't think that this is correct. The Unicode Standard https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf mentions both code points and characters. Code points identify or encode characters (but the mapping is not 1-1). Of course, the sequence of bits that ends up being used to encode a string uses code points at some stage, but I don't see that using only "code point" is helpful.

See also the definition of xsd:string at https://www.w3.org/TR/xmlschema11-2/#string We might want to do better than this but please don't take all the character out of the document.

@gkellogg
Member Author

I’ll re-phrase. We do over-use the term “character”, but it is sometimes appropriate.

@gkellogg
Member Author

gkellogg commented Aug 7, 2023

See discussion in w3c/rdf-semantics#41 (comment) about using Unicode code points instead of Unicode strings, and potentially forbidding non character code points (surrogates and code points ending in FFFE and FFFF), which are currently allowed. Also consider the use of Unicode scalar value which prohibits surrogate code points.

@gkellogg gkellogg added needs discussion Proposed for discussion in an upcoming meeting spec:substantive Issue or proposed change in the spec that changes its normative content labels Aug 9, 2023
@gkellogg
Member Author

A similar issue recently came up for JSON RFC 8259 Errata 7603.

Section 1 says:
* A string is a sequence of zero or more Unicode characters [UNICODE].
It should say:
* A string is a sequence of zero or more Unicode code points [UNICODE].

Note that there's some discussion in email about whether surrogates are intended to be included, suggesting the potential use of Unicode scalar value, which is what I-JSON provides for, and is recommended for all IETF-specified protocols. However, note the provision: "But in the real world JSON strings contain any old combination of Unicode code points, as described in the report.".

Note that concrete RDF grammars are typically defined using EBNF which has a built-in concept of a Char:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

So surrogate characters are already excluded. Surrogates include High-Surrogate Code Point: U+D800 to U+DBFF and Low-Surrogate Code Point: U+DC00 to U+DFFF, which are explicitly outside the range of Char.
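
As a rough illustration of the Char production quoted above, the following Python sketch tests whether every code point in a string falls inside those ranges. The names `XML10_CHAR_RANGES`, `is_xml10_char` and `is_xml10_char_string` are made up for this example and do not come from any spec or library.

```python
# Ranges copied from the Char production quoted above.
XML10_CHAR_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]

def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the Char production."""
    return any(lo <= cp <= hi for lo, hi in XML10_CHAR_RANGES)

def is_xml10_char_string(s: str) -> bool:
    """True if every code point of s matches Char (hypothetical helper)."""
    return all(is_xml10_char(ord(ch)) for ch in s)
```

Under this sketch, any surrogate in U+D800..U+DFFF, as well as U+FFFE and U+FFFF, is rejected, which matches the comment in the production.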

The IRI grammar is based on ABNF, where ucschar also excludes the surrogate range:

ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
               / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
               / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
               / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
               / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
               / %xD0000-DFFFD / %xE1000-EFFFD

This suggests that replacing the use of "Unicode string" with "a sequence of Unicode scalar values", or similar, would not introduce a backwards-compatibility issue. (JSON-LD should clarify that strings are composed only of Unicode scalar values).

@afs
Contributor

afs commented Aug 15, 2023

Looking through all the RDF/SPARQL docs with grep, many mentions are of the form "Unicode codepoints U+0020" - a specific value and it uses the U+ notation which is codepoint. Those are correct (they are scalar values as well).

D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.

so, IIUC, that is in the concrete written encoding space (UTF-8 etc), not in the abstract space of codepoints and scalar values.

However.
Adding yet more terminology "scalar value" everywhere may be unhelpful. Codepoint has been in-use for a while.

So - suggestion for discussion - say it once with proper explanation, maybe as a note, and not worry too much elsewhere. Many are "unicode string" (lowercase "u") - less formal. rdf-xml is the worst for "Unicode string" (uppercase "U") although it may have been correct at the time.

Checking for perfect sequences of scalar values (vs codepoints), as the JSON errata notes, isn't universally done, which is not surprising as an extra pass over a string during e.g. parsing will cause a measurable slowdown. Support across different programming languages might be different as well.
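
To make the "extra pass" concrete, here is a minimal Python sketch. It assumes Python 3, where a `str` is a sequence of code points and may hold lone surrogates. The function names are hypothetical, chosen only for this illustration.

```python
def is_scalar_value_string(s: str) -> bool:
    """Explicit pass: True if no code point is a surrogate (U+D800..U+DFFF)."""
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)

def encodes_as_utf8(s: str) -> bool:
    """Alternative check: Python's strict UTF-8 encoder rejects surrogates."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

Both checks visit every code point, which is the cost being discussed. Other languages expose different string models, so this behaviour should not be assumed elsewhere.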

@gkellogg gkellogg added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Aug 21, 2023
@gkellogg gkellogg changed the title from "Code points vs characters" to "Improve Unicode terminology" Aug 21, 2023
@aphillips
Contributor

I left some comments just now on #59. Reading the above again and what you're attempting to do with terminology makes me think that more discussion is needed.

The D80 definition of "Unicode string" is not very helpful, since it is effectively "a byte-string in some Unicode character encoding form". This is not the same thing as a scalar value string and, as @afs notes, most implementations do not want the cost of decoding strings into Unicode scalar values except when they are doing actual manipulation of the text. String comparison (for equality) is usually not one of these operations. There it is sufficient (and much higher performance) to use a "code unit" string in a specified character encoding form of Unicode. This is what a DOMString is, for example (the DOM uses UTF-16).

In short, the term "Unicode string" doesn't mean anything specific, and I would suggest adopting the term DOMString directly or defining what you mean by "Unicode string" (probably to mean a sequence of code points and not excluding surrogates).

@gkellogg
Member Author

Some alternatives from the Unicode Glossary to "Unicode string":

  • Abstract Character Sequence – "An ordered sequence of one or more abstract characters."
  • Plain Text – "Computer-encoded text that consists only of a sequence of code points from a given standard, with no other formatting or structural information."
  • Well-Formed Code Unit Sequence – "A code unit sequence that follows the specification of a Unicode encoding form."

None of these are currently defined in the Internationalization Glossary, but presumably could be.

Note that Unicode does define Unicode String as "A code unit sequence containing code units of a particular Unicode encoding form (whether well-formed or not)." This would suggest that it's appropriate only if we want to bind the lexical value to a particular encoding, which may be reasonable.

@aphillips
Contributor

aphillips commented Aug 30, 2023

@gkellogg

Thanks for this. I don't think those are necessarily better options for RDF.

Most of your references to a string, as you've noted elsewhere, are really just veiled references to an xsd:string value. These are "Unicode Scalar Value strings", from the point of view that they aren't in any specific encoding, just sequences of XML Char. In a very few places, this definition turns out to be inconvenient for implementers because of the difference between the code unit representation (i.e. UTF-8 or UTF-16) and a scalar value string.

What I would recommend is:

  1. Adopt the term 'string' in your terminology section and define it to mean (approximately) "An ordered sequence of zero or more Unicode code points" (zero to allow the empty string). Note that this definition would allow unpaired surrogates: you could restrict this by saying "Unicode scalar values" instead.
  2. Provide normative text to allow for the efficient comparison of strings, along the lines of:

A string is identical to another string if it consists of the same sequence of code points. An implementation MAY determine string equality by comparing the code units of two strings using the same Unicode character encoding form (UTF-8 or UTF-16) without decoding the string into a scalar value sequence.

(Infra defines "is"/"identical to" with this in mind)

Doing it this way saves having to make a bunch of modifications globally.
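
The comparison rule suggested above can be sketched in Python as follows. This is an illustration under stated assumptions, not proposed spec text: both inputs are taken to be well-formed and held in the same encoding form, and the function names are invented for the example.

```python
def identical_code_units(a: bytes, b: bytes) -> bool:
    """Compare two strings held in the same encoding form, unit for unit."""
    return a == b

def identical_code_points(a: bytes, b: bytes, encoding: str = "utf-8") -> bool:
    """Reference check: decode first, then compare code point sequences."""
    return a.decode(encoding) == b.decode(encoding)

composed = "\u00e9".encode("utf-8")     # e-acute as a single code point
decomposed = "e\u0301".encode("utf-8")  # e followed by a combining acute accent

# For well-formed input the cheap check agrees with the decoded one.
for x, y in [(composed, composed), (composed, decomposed)]:
    assert identical_code_units(x, y) == identical_code_points(x, y)
```

Note that the composed and decomposed forms are not identical under either check, since no normalization is applied. That matches the "same sequence of code points" wording.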

@gkellogg
Member Author

@aphillips Thanks, we're definitely converging on that description, and it's a subject for discussion in upcoming meetings. I think the best we can do is have a section "Strings in RDF" that describes what you outline, without trying to define a string term, which would be out-of-scope for RDF. We can then replace our references to "Unicode string" with just "string" and rely on guidance from that section in RDF Concepts for how strings are to be interpreted in RDF.

@gkellogg gkellogg added the discuss-f2f Proposed for discussion during the next face-to-face meeting label Sep 5, 2023
@pfps
Contributor

pfps commented Sep 5, 2023

There is really only one place that more-or-less arbitrary strings show up in RDF Concepts - the lexical form of literals. There is xsd:string, but that is governed by the XML Schema Datatypes recommendation, and there is the first component of literal values in language-tagged strings, but this is defined in terms of lexical form. There is no requirement that lexical form for RDF literals be constrained to the lexical space of any normal RDF datatype. So the lexical form of RDF literals could be more general than the lexical space of xsd:string.

In RDF 1.1 Concepts the lexical form of literals is "a Unicode [UNICODE] string", which can be a sequence of 8-bit, 16-bit, or 32-bit integers. This isn't a good definition for RDF but could be read to say that the lexical form can contain integers that are not Unicode code points at all, such as 0x7FFF1234. RDF surface syntaxes, even N-triples, have escapes that allow these lexical forms to be encoded in UTF-8. So restricting the lexical form of literals to Unicode code points is a substantive change. Further restricting to match XML 1.1 (or 1.0) Char is another substantive change.

So what should lexical forms be? As far as I can tell there are four viable alternatives - a sequence of (32-bit) integers, a sequence of Unicode code points, a sequence of Unicode scalar values, and XML 1.1 (or 1.0) Char.

Each alternative covers the entirety of all the built-in datatypes. (The first two might require separating the "lexical space" of language-tagged strings from the full space of lexical forms.) Each alternative has something to recommend it - efficient transfer of arbitrary data (in 32-bit chunks), transfer of arbitrary data (in 16-bit chunks), use of sequences of Unicode characters, and use of XML 1.1 (1.0), respectively. In the end, I prefer Unicode scalar values as it matches most closely what I think strings should be in the current web ecosystem.
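
The four alternatives nest inside one another, which a small Python sketch can make visible. It assumes the XML 1.1 Char production (not 1.0) for the narrowest set, and `narrowest_alternative` is a hypothetical name used only here.

```python
def narrowest_alternative(n: int) -> str:
    """Name the narrowest of the four candidate sets that contains n."""
    if not 0 <= n <= 0xFFFFFFFF:
        return "none"
    if n > 0x10FFFF:
        return "32-bit integer"   # not a Unicode code point at all
    if 0xD800 <= n <= 0xDFFF:
        return "code point"       # surrogate: a code point, not a scalar value
    if n in (0x0, 0xFFFE, 0xFFFF):
        return "scalar value"     # excluded by XML 1.1 Char
    return "XML 1.1 Char"
```

For instance, the value 0x7FFF1234 mentioned above only fits the widest alternative, while a surrogate such as 0xD800 fits the code point alternative but not the scalar value one.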

@afs
Contributor

afs commented Sep 6, 2023

the lexical form of RDF literals could be more general than the lexical space of xsd:string.

Yes. The lexical form should be as general as possible; there may be use cases for strange datatypes so the definition should be as wide as possible.

Of the choices, I prefer a sequence of Unicode code points with a discouragement on high-surrogate and low-surrogate codepoints and "compatibility characters". I don't think for lexical forms, making anything illegal is helpful.

FWIW xsd:string is not completely fixed. It depends on the choice of XML 1.0 vs XML 1.1.

It is ·implementation-defined· whether an implementation of this specification supports the Char production from [XML], or that from [XML 1.0], or both.

@pfps
Contributor

pfps commented Sep 6, 2023

If all Unicode code points are to be allowed why not go the whole way to any 32-bit integer?

@afs
Contributor

afs commented Sep 6, 2023

Most of the RDF syntaxes are defined for a concrete encoding of UTF-8 which has a maximum length of 4 bytes encoding 21 bits of value.
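
The byte and bit counts mentioned here are easy to check. The following Python sketch uses a hypothetical helper, `utf8_length`, to measure the UTF-8 size of a few sample scalar values, and computes the bit width of the largest code point, U+10FFFF.

```python
def utf8_length(cp: int) -> int:
    """Number of bytes UTF-8 uses for one scalar value."""
    return len(chr(cp).encode("utf-8"))

# U+0041 (ASCII), U+00E9, U+20AC, U+1F600 and the largest code point.
lengths = [utf8_length(cp) for cp in (0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF)]

# Bits needed to represent the largest code point value.
bits_needed = (0x10FFFF).bit_length()
```

Here `lengths` comes out as `[1, 2, 3, 4, 4]` and `bits_needed` as `21`.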

I have understood a "unicode string" as being after decoding i.e. code points - RDF Concepts is about the abstract data model.

I don't see a reading "a sequence of 8-bit, 16-bit, or 32-bit integers" because that is about encoding forms (all aside from RDF/XML which could be in a non-UTF form).

UTF-32 isn't 32-bits of value - it has "ill-formed" exclusions.

@pfps
Contributor

pfps commented Sep 6, 2023

According to the Unicode spec, all Unicode encoding forms are restricted to only encode scalar values:

"As for all of the Unicode encoding forms, UTF-32 is restricted to representation of code points in the ranges 0..D7FF₁₆ and E000₁₆..10FFFF₁₆—that is, Unicode scalar values. This guarantees interoperability with the UTF-16 and UTF-8 encoding forms." [Unicode 15.0, page 35]

"In the UTF-16 encoding form, non-surrogate code points in the range U+0000..U+FFFF are represented as a single 16-bit code unit; code points in the supplementary planes, in the range U+10000..U+10FFFF, are represented as pairs of 16-bit code units. These pairs of special code units are known as surrogate pairs." [Unicode 15.0, page 36] Allowing surrogate code points in an RDF lexical string to be serialized as code units in the surrogate range can result in round-tripping changes that are not detectable as errors.

"Beyond the ASCII range of Unicode, many of the non-ideographic scripts are represented by two bytes per code point in UTF-8; all non-surrogate code points between U+0800 and U+FFFF are represented by three bytes; and supplementary code points above U+FFFF require four bytes." [Unicode 15.0, page 37]

If an RDF lexical value includes surrogate code points the only way to serialize them in any RDF surface syntax that uses any UTF encoding form is via \u or \U escapes. This is the same situation as any non-code points that could occur in an RDF lexical value.

It is true that UTF-32 cannot contain code units that are not code points, requiring that they be serialized using \U escapes, but this is the same situation as for surrogate code points.
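
The round-tripping concern can be reproduced in a few lines of Python. This sketch assumes Python 3, whose `str` may hold lone surrogates, and it uses the `surrogatepass` error handler to force surrogate code points out as raw UTF-16 code units. The variable names are illustrative only.

```python
# U+1F600 is a supplementary code point, so UTF-16 needs a surrogate pair.
pair = "\U0001F600".encode("utf-16-be")
assert pair == b"\xd8\x3d\xde\x00"

# A string holding the two surrogates as separate code points...
two_surrogates = "\ud83d\ude00"
assert len(two_surrogates) == 2

# ...serializes to the same code units when surrogates are passed through...
units = two_surrogates.encode("utf-16-be", "surrogatepass")
assert units == pair

# ...and reads back as a single code point, with no error raised.
round_tripped = units.decode("utf-16-be")
assert round_tripped == "\U0001F600"
assert round_tripped != two_surrogates
```

The two-code-point string silently becomes a one-code-point string, which is why an escape such as \uD83D is the only safe way to carry a surrogate code point through a UTF serialization.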

@afs
Contributor

afs commented Sep 6, 2023

The PR goes further - it currently takes up the Char production from XML which excludes U+0000 and the 2 "non-characters". The XML text advises against other characters which we'd inherit.

Unicode Scalar Value is better than XML Char.

@pfps
Contributor

pfps commented Sep 6, 2023

There are lots more "non-characters" than the two excluded by Char.
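
For reference, Unicode designates 66 code points as noncharacters: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. The Python sketch below enumerates them and counts how many fall inside the XML 1.0 Char ranges. The helper names are hypothetical.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char production, written out as range checks."""
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)

noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
allowed_by_char = [cp for cp in noncharacters if is_xml10_char(cp)]
```

With these definitions, `len(noncharacters)` is 66 and `len(allowed_by_char)` is 64, so only U+FFFE and U+FFFF are excluded by Char.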

@gkellogg
Member Author

gkellogg commented Sep 6, 2023

I'm applying suggestions to the PR that remove the restriction to the Char production, but that also removes restrictions on surrogates. I would be in favor of replacing that with restricting code points to Unicode scalar values instead, if that satisfies the various requirements.

Note that I did simplify the table description of xsd:string to say "Character strings (see string)" rather than "Character strings (but not all Unicode character strings)", which may be over-simplifying.

@pfps
Contributor

pfps commented Sep 6, 2023

xsd:string has its own value space (actually two) and can't be separated from that. The string definition is currently quite different.

@afs
Contributor

afs commented Sep 7, 2023

There are lots more "non-characters" than the two excluded by Char.

True - I meant that FFFE and FFFF are "non-characters" because they relate to bad practice in decoding control (legacy).

@gkellogg
Member Author

gkellogg commented Sep 7, 2023

Perhaps a place for the XML Char production is in the description of values of xsd:string.

@afs
Contributor

afs commented Sep 8, 2023

What had you in mind?

I don't see the need in RDF Concepts (the abstract data model). The XML TR has advice about Char and we don't want to imply that. As with any datatype, the definition is with the datatype.

xsd:string is already a bit messy because it is defined for "XML", not 1.0 and 1.1 and the fact it says "The string datatype represents character strings in XML."

@gkellogg
Member Author

gkellogg commented Sep 8, 2023

The previous "Character strings (but not all Unicode character strings)" was rather vague. I have a suggestion (#59 (comment)) to change it to "Character strings matching the Char production from [[XML11]],"

Updates for XML11 have been made elsewhere, and it would make sense to reference XML 1.1 Char production here, but maybe that's too restrictive? I think we need to improve the previous statement to make it more clear, though.

@afs
Contributor

afs commented Sep 9, 2023

Why do we need to refer to Char?

Char is not the same as Unicode scalar value for XML-related reasons.

@gkellogg
Member Author

gkellogg commented Sep 9, 2023

Suggest something else. The previous line was not helpful. It seemed to me that Char matched what could be in an xsd:string, but I’m sure I don’t have a complete appreciation for the subtleties.

@aphillips
Contributor

aphillips commented Sep 9, 2023

I note that we appear to be painting the bikeshed pretty thoroughly 😉. Mentioning xsd:string and saying it is a sequence of scalar values (i.e. Char) is probably sufficient (especially if the goal is to reference a definition for "string" rather than introducing new normative testing of stringhood). To recall the original thread, a key thing we want after defining a USV/codepoint/Char string is to allow implementations to do string identity matching without decoding to scalar values via e.g. Encoding, because such comparisons are identical in correctness but higher in performance/save allocations.

Note that XML in XML Schema means 1.1 and 1.1's Char production is USV except for NULL, U+FFFE, and U+FFFF (both omit surrogates).
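
The relationship stated here can be checked mechanically. The Python sketch below writes the XML 1.1 Char production as range checks (the ranges are my transcription of that production, so treat them as an assumption) and lists the scalar values it leaves out.

```python
def is_scalar_value(cp: int) -> bool:
    """Any code point except the surrogates U+D800..U+DFFF."""
    return 0 <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF

def is_xml11_char(cp: int) -> bool:
    # Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    return (0x1 <= cp <= 0xD7FF or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

only_in_usv = [cp for cp in range(0x110000)
               if is_scalar_value(cp) and not is_xml11_char(cp)]
```

Under these definitions `only_in_usv` is `[0x0, 0xFFFE, 0xFFFF]`, and no surrogate satisfies either predicate.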

@afs
Contributor

afs commented Sep 9, 2023

Suggest something else.

Done. See the PR. Linking to the definition of (RDF) string from "core type" xsd:string and then mentioning Char is confusing.

@afs
Contributor

afs commented Sep 10, 2023

we appear to be painting

If this were a new spec, then yes.

Whether a strict RDF 1.2 implementation can ingest any RDF 1.1 document, or whether a database that stores RDF 1.1 data is affected, is quite important in any claims this WG makes about compatibility. There is a certain amount of less-than-perfect data out there. 😑

4 participants