Lift character set restriction of xs:string #414

ChristianGruen · 2023-03-31T08:04:08Z

I guess that raises the question of whether it is still appropriate to restrict the character set of xs:string to that of XML 1.0. Are there any benefits in doing so?

I believe that would simplify things a lot, in particular when working with input/output functions.

michaelhkay · 2023-03-31T08:42:35Z

> that would simplify things

I fear that the opposite is true.

There are certainly some places where we need to align with the XML and XSD specs:

- Validating a node against a schema needs to match the XSD semantics.
- Serialising using the XML output method needs to produce well-formed XML.

We then have decisions to make: does casting to xs:string do something different from validation? Do we allow non-XML characters in the string value of an XDM node? What characters are allowed in an xs:anyURI or in an xs:token or in xs:untypedAtomic?

I suspect that the only clean way to do this is to introduce a new type, say xs:uString, that is a supertype of xs:string and that allows any characters. And then we have to go through all the functions and operators changing most of them to use xs:uString rather than xs:string. I think it's probably cleanest not to allow non-XML characters in XDM nodes, so validation happens at the time of tree construction rather than serialization. But there are lots more questions to be answered.

ndw · 2023-03-31T09:00:17Z

Can I steal the can opener and hide it somewhere? I really don't think we want to do this.

ChristianGruen · 2023-03-31T09:08:11Z

> I fear that the opposite is true.

Well, I guess you are right. I have merely taken a common user perspective, which is certainly difficult to satisfy from a technical perspective: I have string data (JSON, CSV, plain text, …) → I don't care about edge cases or XML 1.0 restrictions → I end up using Python, Java, etc., because it doesn't reject my input.

michaelhkay · 2023-03-31T09:11:42Z

There might be a fudge solution (we already fudge whether xs:string is constrained by XSD 1.0 or 1.1, and XSD fudges what's allowed in an xs:anyURI). We could say that processors are not required to validate what characters are contained in strings returned for example by unparsed-text() or json-doc() or codepoints-to-string(), provided that they guarantee that serialized XML output is well-formed and that validation follows XSD semantics.

That solution isn't interoperable, but then the exact set of characters allowed in xs:string is already implementation-defined.

michaelhkay · 2023-03-31T15:39:15Z

We could possibly handle this as follows:

- As far as the QT4 specifications are concerned, we say that xs:string (and xs:anyURI and xs:untypedAtomic) can hold any sequence of Unicode codepoints (not surrogates, though); functions that are capable of delivering non-XML characters (unparsed-text, codepoints-to-string, fn:char, etc.) no longer raise an error in this case.
- In serialized XML/XHTML output, invalid characters cause a serialization error.
- Creating a node whose string value contains invalid XML characters MAY result in a dynamic error. (This is to allow implementations to use a third-party tree representation that disallows invalid characters; it may also improve performance to do the validation during tree construction rather than serialization.)
- Validation of nodes against schema types uses the XSD definitions of the data types, and must reject invalid characters (assuming you can construct a node with invalid characters to pass to the validator...).
- Source XPath and XQuery expressions must use only valid XML characters (because we don't want a situation where some valid XPath expressions can't be embedded in XSLT).
- Casting to xs:string (and xs:anyURI and xs:untypedAtomic) performs no checking for invalid characters.
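
Editorial illustration: the split proposed above (permissive string values, strict XML serialization) can be sketched in Python rather than XPath. This is only a sketch under stated assumptions: `make_string`, `serialize_xml_text`, and `SerializationError` are hypothetical names, not functions from any QT4 draft, and the character test follows the XML 1.0 Char production.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


class SerializationError(ValueError):
    """Hypothetical stand-in for a serialization error."""


def make_string(codepoints):
    """Hypothetical analogue of a relaxed codepoints-to-string():
    only surrogate code points are rejected."""
    for cp in codepoints:
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate code point U+{cp:04X} is not allowed")
    return "".join(chr(cp) for cp in codepoints)


def serialize_xml_text(value: str) -> str:
    """Escape text for XML output; fail on characters XML 1.0 cannot carry."""
    escapes = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}
    out = []
    for ch in value:
        if not is_xml10_char(ord(ch)):
            raise SerializationError(
                f"U+{ord(ch):04X} cannot be serialized as XML 1.0"
            )
        out.append(escapes.get(ch, ch))
    return "".join(out)


s = make_string([0x61, 0x0, 0x62])  # accepted as a string value
print(len(s))                       # the value can be processed normally
try:
    serialize_xml_text(s)           # the check happens only here
except SerializationError as err:
    print("serialization failed:", err)
```

In this sketch the string containing U+0000 exists and can be measured or transformed; the error is raised only when it reaches the XML serializer.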

Arithmeticus · 2023-03-31T15:39:32Z

I continue to process many files and streams that include control characters. Any changes that make that use more tractable, and error free, would be very welcome.

An important use case is unknown bytes in a text file, which may have characters not allowed by XML 1.0 or 1.1. Having to use non-core XPath functions to handle such cases doesn't feel right to me. I think it would be useful to bring versions of EXPath's file:read-binary() and saxon:base64Binary-to-string() into the main specs as, e.g., unparsed-text-bytes() and bytes-to-string() (with an options map for, e.g., how to handle streams whose decoding would result in U+0000, U+FFFE, U+FFFF). The names show my personal preference to work on the level of xs:byte and not xs:base64Binary or xs:integer.
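
Editorial illustration: one way to picture the suggested bytes-to-string() with an options map, sketched in Python. The function name `bytes_to_string`, the `on_forbidden` option, and its values are assumptions made for this example only; nothing here is taken from EXPath, Saxon, or a QT4 draft.

```python
# Code points singled out in the comment above.
FORBIDDEN = {0x0000, 0xFFFE, 0xFFFF}
POLICIES = {"error", "replace", "skip", "keep"}


def bytes_to_string(data: bytes, encoding: str = "utf-8",
                    on_forbidden: str = "error",
                    replacement: str = "\uFFFD") -> str:
    """Decode bytes, then apply a caller-chosen policy to U+0000, U+FFFE
    and U+FFFF. Malformed input still raises UnicodeDecodeError."""
    if on_forbidden not in POLICIES:
        raise ValueError(f"unknown policy: {on_forbidden!r}")
    text = data.decode(encoding)
    if on_forbidden == "keep":
        return text
    out = []
    for ch in text:
        if ord(ch) in FORBIDDEN:
            if on_forbidden == "error":
                raise ValueError(f"forbidden character U+{ord(ch):04X}")
            if on_forbidden == "replace":
                out.append(replacement)
            # "skip" drops the character silently
            continue
        out.append(ch)
    return "".join(out)


raw = b"id,name\n1,A\x00B\n"  # a CSV fragment containing a NUL byte
print(repr(bytes_to_string(raw, on_forbidden="replace")))
print(repr(bytes_to_string(raw, on_forbidden="skip")))
```

The point of the design is that the failure, if any, happens inside the caller's own code at the conversion step, where the caller can choose a different policy and carry on.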

ChristianGruen · 2023-03-31T16:33:35Z

> We could possibly handle this as follows:

…sounds very good!

ndw · 2023-03-31T16:38:29Z

This is a slippery slope. If users can read documents that contain invalid characters and construct data model instances from them, they're going to wonder why they can't create them. Why isn't <xsl:sequence select="'&#0;'"/> allowed? Why can't I create this map: let $var = map { "value": "&#0;" } in XQuery?

Arithmeticus · 2023-03-31T17:24:14Z

@ndw I would say that we've already slipped to the bottom of that slope. We already can and do find ways to deal with non-standard character input, and I don't think it has inculcated the expectation that output should be the same. Relaxing strictures on xs:string or treating a file as a byte sequence IMO acknowledges the reality of messy input, provides the tools needed to deal with that mess, allows workflows to be less kludgy or error-prone, and will I think attract more developers to the language. Good chance, though, I'm naive about the full implications.

ChristianGruen · 2023-03-31T17:32:06Z

> Why can't I create this map: let $var = map { "value": "&#0;" } in XQuery?

I suppose we’d need to allow this, too (if the entities occur within a string: "&#0;"), but it cannot be serialized with the XML/XHTML serialization methods.

rhdunn · 2023-04-04T11:22:12Z

There's going to be interoperability issues anyway depending on whether or not a processor supports XML 1.0 or XML 1.1.

ndw · 2023-04-04T11:32:01Z

XML 1.1 doesn't allow "&#0;", and it's nonexistent for all practical purposes.

ChristianGruen · 2023-04-04T12:58:47Z

> XML 1.1 doesn't allow "&#0;", and it's nonexistent for all practical purposes.

Thanks for the clarification.

I share the experience that &#0; rarely appears in texts. Instead of allowing any sequence of Unicode codepoints, we could restrict the legal input to XML 1.1 Unicode codepoints.

ndw · 2023-04-04T13:44:38Z

If we're going to go down the slippery slope, I think we might as well get on a sled and enjoy the ride. The set of characters allowed by 1.1 is incompatible with the set of characters allowed by 1.0 (because there was a desire, some might have felt misplaced, that it be possible to detect incorrectly encoded texts). The differences in C0 vs C1 control characters and the fact that you can have &#1; but not &#0; is just going to look arbitrary and capricious.
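
Editorial illustration: the mismatch between the two character sets can be checked directly from the Char production of XML 1.0 and the Char and RestrictedChar productions of XML 1.1. The sketch below is in Python; the function names and the three status labels are made up for this example.

```python
def xml10_char(cp: int) -> bool:
    """XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def xml11_char(cp: int) -> bool:
    """XML 1.1 Char production (everything except U+0000, surrogates,
    U+FFFE and U+FFFF)."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def xml11_restricted(cp: int) -> bool:
    """XML 1.1 RestrictedChar: allowed only as a character reference."""
    return (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )


def classify(cp: int):
    """Return (XML 1.0 status, XML 1.1 status) for one code point."""
    v10 = "literal or reference" if xml10_char(cp) else "forbidden"
    if not xml11_char(cp):
        v11 = "forbidden"
    elif xml11_restricted(cp):
        v11 = "reference only"
    else:
        v11 = "literal or reference"
    return v10, v11


for cp in (0x0, 0x1, 0x9, 0x80, 0x85):
    print(f"U+{cp:04X}", classify(cp))
```

U+0001 is forbidden in 1.0 but permitted as a reference in 1.1, U+0080 may appear literally in 1.0 but only as a reference in 1.1, and U+0000 is forbidden in both.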

I think there are two reasonable places to stand: (point A) the set of characters allowed in an XDM are the characters that are allowed in XML (1.0 if you want to push on that), or (point B) the set of characters allowed in an XDM is unbounded and what's forbidden is attempting to serialize (or add to an XML tree?) any characters that aren't allowed in XML.

We're currently standing at point A. Moving to point B is opening a can of worms. I'm not personally inconvenienced by standing at point A and, especially with my chair's hat on, I'm reluctant to open more worm cans. But we're all inconvenienced by different things and I'm in no way going to assert that the things that inconvenience me are somehow more important than anyone else's inconveniences. I'm not going to attempt to prevent the group from doing it, as long as I can take my "I told you so" token and save it to play later :-)

An example of the sort of complexity I'm imagining is the following.

I can't create an &#0; in XSLT. So I load one from disk into a JSON array. I construct a string, $s, from that array. That's presumably allowed. But <doc><xsl:sequence select="$s"/></doc> presumably is not. Except if that's never actually serialized (for example, if it's assigned to a variable that is only conditionally output and it isn't output) is that an error? Is it required to be an error? Is it forbidden to be an error? Suppose for example, I have:

<xsl:variable name="s" select="(some expression that successfully constructs the string with `&#0;` in it)"/>
<xsl:variable name="p" as="element()"><p><xsl:sequence select="$s"/></p></xsl:variable>
<xsl:if test="f:moon-is-full()">
  <xsl:sequence select="$p"/>
</xsl:if>

If the moon is full, $p will be serialized and that must be an error. If the moon isn't full, maybe it isn't an error.

Is string($p) an error? Always or never? Is string-length($p) an error?

Etc.
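
Editorial illustration: the questions above come down to when the character check runs. The Python sketch below contrasts a check at node construction with a check deferred to serialization. The `Element` class and its `check_on_construction` flag are hypothetical and only model the two policies; they do not describe any processor's behaviour.

```python
def is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def check_text(text: str) -> None:
    """Raise ValueError on the first character XML 1.0 does not allow."""
    for ch in text:
        if not is_xml10_char(ord(ch)):
            raise ValueError(f"U+{ord(ch):04X} is not an XML 1.0 character")


class Element:
    """Toy element node holding a single text child."""

    def __init__(self, name: str, text: str, check_on_construction: bool = False):
        if check_on_construction:
            check_text(text)          # eager policy: fail when the tree is built
        self.name = name
        self.text = text

    def string_value(self) -> str:
        return self.text              # never checks under the deferred policy

    def serialize(self) -> str:
        check_text(self.text)         # deferred policy: fail only on output
        escaped = (self.text.replace("&", "&amp;")
                            .replace("<", "&lt;")
                            .replace(">", "&gt;"))
        return f"<{self.name}>{escaped}</{self.name}>"


moon_is_full = False                  # stands in for f:moon-is-full()
p = Element("p", "a\x00b")            # deferred policy: construction succeeds
print(len(p.string_value()))          # string-length($p) works
if moon_is_full:
    print(p.serialize())              # would raise ValueError
```

Under the deferred policy, string($p) and string-length($p) succeed and the error depends on whether $p is ever serialized; under the eager policy the construction of $p itself fails, whatever the moon does.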

ChristianGruen · 2023-04-04T13:59:58Z

> I can't create an &#0; in XSLT. So I load one from disk into a JSON array. I construct a string, $s, from that array. That's presumably allowed.

If we allow constructing such contents from disk, I'm convinced we need to allow it in string/text constructors as well. If that goes too far, I would vote against lifting the current restrictions.

> Moving to point B is opening a can of worms.

We could assign a low priority label to worm-eaten issues and consider them only in the end.

michaelhkay · 2023-04-04T14:07:26Z

I'm working on a proposal which I hope will be acceptable.

We do already have some muddiness, for example words that suggest reading a JSON file containing non-XML characters might work in some circumstances. I think it should work in all circumstances - except unpaired surrogates, where the RFC explicitly says that not all applications will accept it. And of course, actual practice in the field is likely to be even muddier - it's unlikely that a product that allows the input to be supplied as a DOMSource is going to check that the text in all nodes is squeaky-clean, and DOM certainly isn't going to check it for you.

Arithmeticus · 2023-04-04T14:08:19Z

To clarify, I am not arguing for allowance of the 3 no-no characters in xs:string, only that we have the means to read as a sequence of xs:bytes a file that might have them. Under my proposal, the guard rails are shifted to the attempt to cast from the byte sequence to a string. This means that the point of failure occurs within my code, and I can adjust as needed. Currently, the point of failure is outside the code, and I can't do anything within the core library to rectify then continue to process irregular input.

Arithmeticus · 2023-04-29T16:52:15Z

The new function parse-html() accepts as input (in addition to strings) xs:hexBinary and xs:base64Binary. That input implies that there is a convenient way in the core specs to grab a web page as a binary object. But there isn't.

I think this lends yet more rationale to my argument in this thread that we need an analogue for unparsed-text-available() and unparsed-text() for any byte sequence, e.g., binary-file-available() and binary-file().

Put another way, how is someone using core functions going to be able to grab a web page as a binary object to be able to feed it into parse-html()?

michaelhkay · 2023-04-30T19:22:52Z

I agree there's a good case for an unparsed-binary() analog of unparsed-text().

This doesn't solve the problem that you want to be able to read files such as CSV files and convert them, say, to JSON, without first having to check that all the characters are legal XML characters.

michaelhkay · 2023-04-30T19:31:18Z

@ndw wrote: "I can't create an &#0; in XSLT."

But if we relax the restrictions on fn:codepoints-to-string() and fn:char(), then you can. I think those mechanisms are quite adequate for the purpose.

ndw · 2023-05-01T15:57:18Z

Changing the XPath Data Model to allow an ASCII NUL character (since copying and pasting the numeric character reference seems to confuse the GitHub issue formatting) strikes me as a very significant change that we should consider carefully. It's arguable that ASCII NUL (and stupidity regarding the C1 control characters) were the fatal flaw in XML 1.1.

ndw · 2023-05-23T16:27:51Z

This was discussed at meeting 035 without coming to a final resolution. The consensus seemed to gravitate towards keeping a single definition of string (perhaps extended) rather than adding a new "Unicode string" data type, but that's not a definitive decision.

For dealing with non-XML characters, among the options considered were:

1. Add a user-defined callback function allowing a user to encode non-available characters (e.g., on fn:unparsed-text).
2. Add functions to encode non-available characters in some other way (e.g., as elements, <c cp='0'/>).
3. Extend string to include any Unicode character except U+FFFE and U+FFFF.
4. Extend string to include any Unicode character except U+0000, U+FFFE, and U+FFFF.
5. Add functions that represent such strings as sequences (or arrays) of integers, with functions for operating on them.

The chair attempted a straw poll at the end of the meeting in an effort to get a sense of where effort was likely to be fruitful. Members were permitted to vote for all the alternatives they considered worth considering.

The results (votes per option, in the order listed above):

Option 1: 7
Option 2: 1
Option 3: 6
Option 4: 1
Option 5: 2
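
Editorial illustration: the first two options in the list (a user-supplied callback, and encoding unavailable characters as elements) can be combined in one small Python sketch. The names `encode_non_xml` and `as_element` are hypothetical; the `<c cp='0'/>` form is the one mentioned in the list.

```python
from xml.sax.saxutils import escape


def is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def as_element(cp: int) -> str:
    """Default fallback: represent the character as an empty element."""
    return f"<c cp='{cp}'/>"


def encode_non_xml(text: str, fallback=as_element) -> str:
    """Return XML content in which every character that XML 1.0 cannot
    carry is replaced by whatever the fallback callback returns."""
    out = []
    for ch in text:
        cp = ord(ch)
        if is_xml10_char(cp):
            out.append(escape(ch))    # escapes &, < and >
        else:
            out.append(fallback(cp))  # the caller decides the encoding
    return "".join(out)


print(encode_non_xml("a\x00<b"))
print(encode_non_xml("a\x00b", fallback=lambda cp: f"[U+{cp:04X}]"))
```

The last option in the list, working on integers instead of strings, corresponds to something like `[ord(ch) for ch in text]` in this setting.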

ChristianGruen · 2023-05-23T16:45:51Z

Regarding 1: fn:parse-json already has a fallback option, we could possibly use the same for fn:unparsed-text.

Regarding 3, the text and json serialization methods could be allowed to return the full Unicode range.

michaelhkay · 2023-05-23T17:16:48Z

One way forward might be:

XDM section 2.7.3 has a lot of "mays" and "shoulds" about the set of characters supported by xs:string. We could add:

Implementations MAY allow xs:string atomic values, and the string values of nodes, to contain Unicode characters that are not permitted by the Char production in any version of XML. For example, they may allow such values to arise by omitting checks on the result of functions such as unparsed-text(), codepoints-to-string(), or char(); or by allowing extension functions to return such strings without checking; or by accepting as input source documents that have been constructed using third-party libraries that do not perform strict checking. However, implementations MUST ensure (a) that the output of serialization using the XML or XHTML output methods is well-formed according to the selected version of XML; and (b) that schema validation carried out using an XSD 1.0 or XSD 1.1 schema rejects any nodes containing disallowed characters as invalid.

Note that this almost certainly reflects the reality of existing products. A processor that accepts input from DOM will probably not check that all the characters in the DOM tree are valid, and the DOM library almost certainly doesn't care.

What this doesn't do is to give users an assurance that they can safely use unparsed-text() to read files containing arbitrary characters.

ChristianGruen added the XDM (an issue related to the XPath Data Model) and Enhancement (a change or improvement to an existing feature) labels Mar 31, 2023

ChristianGruen mentioned this issue Mar 31, 2023

New function: parse-csv() #413

Closed

ndw mentioned this issue Jun 12, 2023

414: Attempt to implement expanding the allowed character repertoire #546

Merged

michaelhkay added the PR Pending label (a PR has been raised to resolve this issue) Jun 14, 2023

ChristianGruen mentioned this issue Jul 21, 2023

600: fn:decode-from-uri #631

Merged

ndw closed this as completed in #546 Jul 25, 2023

ChristianGruen mentioned this issue Aug 9, 2023

fn:unparsed-text: End-of-line characters #216

Closed

ChristianGruen removed the PR Pending label (a PR has been raised to resolve this issue) Sep 13, 2023

Arithmeticus mentioned this issue Dec 12, 2023

fn:unparsed-binary: accessing and manipulating binary types #557

Open
