Lift character set restriction of xs:string #414

Closed
ChristianGruen opened this issue Mar 31, 2023 · 24 comments · Fixed by #546
Labels
Enhancement A change or improvement to an existing feature XDM An issue related to the XPath Data Model

Comments

@ChristianGruen
Contributor

Adopted from #413 (comment)

I guess that raises the question of whether it is still appropriate to restrict the character set of xs:string to that of XML 1.0. Are there any benefits in doing so?

I believe that would simplify things a lot, in particular when working with input/output functions.

@ChristianGruen ChristianGruen added XDM An issue related to the XPath Data Model Enhancement A change or improvement to an existing feature labels Mar 31, 2023
@michaelhkay
Contributor

that would simplify things

I fear that the opposite is true.

There are certainly some places where we need to align with the XML and XSD specs.

  • Validating a node against a schema needs to match the XSD semantics
  • Serialising using the XML output method needs to produce well-formed XML

We then have decisions to make: does casting to xs:string do something different from validation? Do we allow non-XML characters in the string value of an XDM node? What characters are allowed in an xs:anyURI or in an xs:token or in xs:untypedAtomic?

I suspect that the only clean way to do this is to introduce a new type, say xs:uString, that is a supertype of xs:string and that allows any characters. And then we have to go through all the functions and operators changing most of them to use xs:uString rather than xs:string. I think it's probably cleanest not to allow non-XML characters in XDM nodes, so validation happens at the time of tree construction rather than serialization. But there are lots more questions to be answered.

@ndw
Contributor

ndw commented Mar 31, 2023

Can I steal the can opener and hide it somewhere? I really don't think we want to do this.

@ChristianGruen
Contributor Author

I fear that the opposite is true.

Well, I guess you are right. I have merely taken a common user perspective, which is certainly difficult to satisfy from a technical perspective: I have string data (JSON, CSV, plain text, …) → I don't care about edge cases or XML 1.0 restrictions → I end up using Python, Java, etc., because it doesn't reject my input.

@michaelhkay
Contributor

michaelhkay commented Mar 31, 2023

There might be a fudge solution (we already fudge whether xs:string is constrained by XSD 1.0 or 1.1, and XSD fudges what's allowed in an xs:anyURI). We could say that processors are not required to validate what characters are contained in strings returned for example by unparsed-text() or json-doc() or codepoints-to-string(), provided that they guarantee that serialized XML output is well-formed and that validation follows XSD semantics.

That solution isn't interoperable, but then the exact set of characters allowed in xs:string is already implementation-defined.

@michaelhkay
Contributor

We could possibly handle this as follows:

  • As far as the QT4 specifications are concerned, we say that xs:string (and xs:anyURI and xs:untypedAtomic) can hold any sequence of Unicode codepoints (not surrogates, though); functions that are capable of delivering non-XML characters (unparsed-text, codepoints-to-string, fn:char, etc) no longer raise an error in this case.
  • In serialized XML/XHTML output, invalid characters cause a serialization error.
  • Creating a node whose string value contains invalid XML characters MAY result in a dynamic error. (This is to allow implementations to use a third-party tree representation that disallows invalid characters; it may also improve performance to do the validation during tree construction rather than serialization).
  • Validation of nodes against schema types uses the XSD definitions of the data types, and must reject invalid characters (assuming you can construct a node with invalid characters to pass to the validator...)
  • Source XPath and XQuery expressions must use only valid XML characters (because we don't want a situation where some valid XPath expressions can't be embedded in XSLT).
  • Casting to xs:string (and xs:anyURI and xs:untypedAtomic) performs no checking for invalid characters.
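As a rough illustration of the split proposed in the list above (strings hold any non-surrogate codepoints, and the check moves to XML serialization), here is a minimal Python sketch. The names `make_string`, `serialize_text_xml`, and `SerializationError` are hypothetical and do not come from any QT4 draft; the character class follows the XML 1.0 `Char` production.

```python
import re

# Negation of the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NON_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


class SerializationError(ValueError):
    """Hypothetical stand-in for a serialization error."""


def make_string(codepoints):
    """Build a string from any codepoints except surrogates (no XML check)."""
    for cp in codepoints:
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate U+{cp:04X} is not allowed in a string")
    return "".join(chr(cp) for cp in codepoints)


def serialize_text_xml(s):
    """Escape s as XML character data, rejecting non-XML 1.0 characters."""
    m = _NON_XML10.search(s)
    if m:
        raise SerializationError(
            f"U+{ord(m.group()):04X} at offset {m.start()} is not an XML 1.0 character"
        )
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
```

In this sketch a string containing U+0000 can be built and measured freely; only the attempt to write it out as XML fails.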

@Arithmeticus
Contributor

I continue to process many files and streams that include control characters. Any changes that make that use more tractable, and error free, would be very welcome.

An important use case is unknown bytes in a text file, which may have characters not allowed by XML 1.0 or 1.1. Having to use non-core XPath functions to handle such cases doesn't feel right to me. I think it would be useful to bring versions of EXPath's file:read-binary() and saxon:base64Binary-to-string() into the main specs as, e.g., unparsed-text-bytes() and bytes-to-string() (with an options map for, e.g., how to handle streams whose decoding would result in U+0000, U+FFFE, U+FFFF). The names show my personal preference to work on the level of xs:byte and not xs:base64Binary or xs:integer.
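A minimal Python sketch of the byte-level workflow described above, assuming hypothetical names (`read_file_bytes`, `bytes_to_string`, and an `on_forbidden` option) that are not part of EXPath, Saxon, or any draft spec. The guard rail sits at the byte-to-string conversion, so the caller chooses what happens to U+0000, U+FFFE, and U+FFFF.

```python
from pathlib import Path

# The three codepoints singled out in the comment above.
FORBIDDEN = {0x0000, 0xFFFE, 0xFFFF}


def read_file_bytes(path):
    """Return the file content as a list of integers in the range 0..255."""
    return list(Path(path).read_bytes())


def bytes_to_string(data, encoding="utf-8", on_forbidden="error", replacement="\uFFFD"):
    """Decode bytes, applying a caller-selected policy to forbidden codepoints.

    on_forbidden: "error" raises, "skip" drops the character,
    "replace" substitutes the replacement string.
    """
    if on_forbidden not in ("error", "skip", "replace"):
        raise ValueError(f"unknown policy: {on_forbidden!r}")
    text = bytes(data).decode(encoding)
    out = []
    for offset, ch in enumerate(text):
        if ord(ch) in FORBIDDEN:
            if on_forbidden == "error":
                raise ValueError(f"U+{ord(ch):04X} at character offset {offset}")
            if on_forbidden == "replace":
                out.append(replacement)
            continue  # "skip" and "replace" both fall through to here
        out.append(ch)
    return "".join(out)
```

Because the failure happens inside the caller's own code, the caller can catch it and retry with a different policy rather than losing the whole input.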

@ChristianGruen
Contributor Author

We could possibly handle this as follows:

…sounds very good!

@ndw
Contributor

ndw commented Mar 31, 2023

This is a slippery slope. If users can read documents that contain invalid characters and construct data model instances from them, they're going to wonder why they can't create them. Why isn't <xsl:sequence select="&#0;"/> allowed? Why can't I create this map let $var = map { "value": &#0; } in XQuery?

@Arithmeticus
Contributor

@ndw I would say that we've already slipped to the bottom of that slope. We already can and do find ways to deal with non-standard character input, and I don't think it has inculcated the expectation that output should be the same. Relaxing strictures on xs:string or treating a file as a byte sequence IMO acknowledges the reality of messy input, provides the tools needed to deal with that mess, allows workflows to be less kludgy or error-prone, and will I think attract more developers to the language. Good chance, though, I'm naive about the full implications.

@ChristianGruen
Contributor Author

Why can't I create this map let $var = map { "value": &#0; } in XQuery?

I suppose we’d need to allow this, too (if the entities occur within a string: "&#0;"), but it cannot be serialized with the XML/XHTML serialization methods.

@rhdunn
Contributor

rhdunn commented Apr 4, 2023

There are going to be interoperability issues anyway, depending on whether a processor supports XML 1.0 or XML 1.1.

@ndw
Contributor

ndw commented Apr 4, 2023

XML 1.1 doesn't allow "&#0;" and it's nonexistent for all practical purposes.

@ChristianGruen
Contributor Author

XML 1.1 doesn't allow "&#0;" and it's nonexistent for all practical purposes.

Thanks for the clarification.

I share the experience that 00 rarely appears in texts. Instead of allowing any sequence of Unicode codepoints, we could restrict the legal input to XML 1.1 Unicode codepoints.

@ndw
Contributor

ndw commented Apr 4, 2023

If we're going to go down the slippery slope, I think we might as well get on a sled and enjoy the ride. The set of characters allowed by 1.1 is incompatible with the set of characters allowed by 1.0 (because there was a desire, some might have felt misplaced, that it be possible to detect incorrectly encoded texts). The differences in C0 vs C1 control characters and the fact that you can have &#1; but not &#0; is just going to look arbitrary and capricious.
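To make the 1.0 versus 1.1 difference concrete, the following Python sketch classifies a codepoint under both versions. The function names are illustrative; the ranges are transcribed from the `Char` production of XML 1.0 and the `Char` and `RestrictedChar` productions of XML 1.1.

```python
def xml10_char(cp):
    """XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def xml11_char(cp):
    """XML 1.1 Char production (everything except U+0000, surrogates, FFFE, FFFF)."""
    return 0x1 <= cp <= 0xD7FF or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF


def xml11_restricted(cp):
    """XML 1.1 RestrictedChar: legal only when written as a character reference."""
    return (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )


def describe(cp):
    """Return (status under XML 1.0, status under XML 1.1) for a codepoint."""
    v10 = "allowed" if xml10_char(cp) else "forbidden"
    if not xml11_char(cp):
        v11 = "forbidden"
    elif xml11_restricted(cp):
        v11 = "character reference only"
    else:
        v11 = "allowed"
    return v10, v11
```

Under these productions U+0001 is forbidden in 1.0 but writable as a reference in 1.1, U+0080 may appear literally in 1.0 but only as a reference in 1.1, and U+0000 is forbidden in both.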

I think there are two reasonable places to stand: (point A) the set of characters allowed in an XDM are the characters that are allowed in XML (1.0 if you want to push on that), or (point B) the set of characters allowed in an XDM is unbounded and what's forbidden is attempting to serialize (or add to an XML tree?) any characters that aren't allowed in XML.

We're currently standing at point A. Moving to point B is opening a can of worms. I'm not personally inconvenienced by standing at point A and, especially with my chair's hat on, I'm reluctant to open more worm cans. But we're all inconvenienced by different things and I'm in no way going to assert that the things that inconvenience me are somehow more important than anyone else's inconveniences. I'm not going to attempt to prevent the group from doing it, as long as I can take my "I told you so" token and save it to play later :-)

An example of the sort of complexity I'm imagining is the following.

I can't create an &#0; in XSLT. So I load one from disk into a JSON array. I construct a string, $s, from that array. That's presumably allowed. But <doc><xsl:sequence select="$s"/></doc> presumably is not. Except if that's never actually serialized (for example, if it's assigned to a variable that is only conditionally output and it isn't output) is that an error? Is it required to be an error? Is it forbidden to be an error? Suppose for example, I have:

<xsl:variable name="s" select="(some expression that successfully constructs the string with `&#0;` in it)"/>
<xsl:variable name="p" as="element()"><p><xsl:sequence select="$s"/></p></xsl:variable>
<xsl:if test="f:moon-is-full()">
  <xsl:sequence select="$p"/>
</xsl:if>

If the moon is full, $p will be serialized and that must be an error. If the moon isn't full, maybe it isn't an error.

Is string($p) an error? Always or never? Is string-length($p) an error?

Etc.

@ChristianGruen
Contributor Author

I can't create an &#0; in XSLT. So I load one from disk into a JSON array. I construct a string, $s, from that array. That's presumably allowed.

If we allow constructing such contents from disk, I'm convinced we need to allow it in string/text constructors as well. If that goes too far, I would vote against lifting the current restrictions.

Moving to point B is opening a can of worms.

We could assign a low priority label to worm-eaten issues and consider them only in the end.

@michaelhkay
Contributor

I'm working on a proposal which I hope will be acceptable.

We do already have some muddiness, for example words that suggest reading a JSON file containing non-XML characters might work in some circumstances. I think it should work in all circumstances - except unpaired surrogates, where the RFC explicitly says that not all applications will accept it. And of course, actual practice in the field is likely to be even muddier - it's unlikely that a product that allows the input to be supplied as a DOMSource is going to check that the text in all nodes is squeaky-clean, and DOM certainly isn't going to check it for you.
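The point about third-party tree libraries not checking characters can be observed outside the XPath world too. The sketch below uses Python's standard `xml.etree.ElementTree` purely as an example of such a library (it is not a DOM implementation, and the helper name `build_and_reparse` is made up): the tree accepts a NUL in a text node and serializes it without complaint, and the problem only surfaces when the output is parsed again.

```python
import xml.etree.ElementTree as ET


def build_and_reparse(text):
    """Put text into a <p> element, serialize it, and try to parse the result.

    Returns (serialized bytes, whether the serialized form parsed as XML).
    """
    p = ET.Element("p")
    p.text = text                 # no character validation happens here
    serialized = ET.tostring(p)   # nor here: the NUL is written as-is
    try:
        ET.fromstring(serialized)
        return serialized, True
    except ET.ParseError:
        return serialized, False


serialized, reparsed_ok = build_and_reparse("a\x00b")
print(serialized, reparsed_ok)
```

A processor that builds its XDM instance from a tree like this inherits whatever characters the library let through.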

@Arithmeticus
Contributor

To clarify, I am not arguing for allowance of the 3 no-no characters in xs:string, only that we have the means to read as a sequence of xs:bytes a file that might have them. Under my proposal, the guard rails are shifted to the attempt to cast from the byte sequence to a string. This means that the point of failure occurs within my code, and I can adjust as needed. Currently, the point of failure is outside the code, and I can't do anything within the core library to rectify then continue to process irregular input.

@Arithmeticus
Contributor

The new function parse-html() accepts as input (in addition to strings) xs:hexBinary and xs:base64Binary. That input implies that there is a convenient way in the core specs to grab a web page as a binary object. But there isn't.

I think this lends yet more rationale to my argument in this thread that we need an analogue for unparsed-text-available() and unparsed-text() for any byte sequence, e.g., binary-file-available() and binary-file().

Put another way, how is someone using core functions going to be able to grab a web page as a binary object to be able to feed it into parse-html()?

@michaelhkay
Contributor

I agree there's a good case for an unparsed-binary() analog of unparsed-text().

This doesn't solve the problem that you want to be able to read files such as CSV files and convert them say to JSON, without having first to check that all the characters are legal XML characters.

@michaelhkay
Contributor

@ndw wrote: "I can't create an &#0; in XSLT."

But if we relax the restrictions on fn:codepoints-to-string() and fn:char(), then you can. I think those mechanisms are quite adequate for the purpose.

@ndw
Contributor

ndw commented May 1, 2023

Changing the XPath Data Model to allow an ASCII NUL character (since copying and pasting the numeric character reference seems to confuse the GitHub issue formatting) strikes me as a very significant change that we should consider carefully. It's arguable that ASCII NUL (and stupidity regarding the C1 control characters) were the fatal flaw in XML 1.1.

@ndw
Contributor

ndw commented May 23, 2023

This was discussed at meeting 035 without coming to a final resolution. The consensus seemed to gravitate towards keeping a single definition of string (perhaps extended) rather than adding a new "Unicode string" data type, but that's not a definitive decision.

For dealing with non-XML characters, among the options considered were:

  1. Add a user-defined callback function allowing a user to encode non-available characters (e.g., on fn:unparsed-text).
  2. Add functions to encode non-available characters in some other way (e.g., as elements, <c cp='0'/>).
  3. Extend string to include any Unicode character except U+FFFE and U+FFFF.
  4. Extend string to include any Unicode character except U+0000, U+FFFE, and U+FFFF.
  5. Add functions that represent such strings as sequences (or arrays) of integers, with functions for operating on them.
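Option 2 in the list above can be sketched roughly as follows. This is only one possible reading of "encode as elements": the wrapper element name `text` and the function names are assumptions, and the `c`/`cp` markup simply mirrors the `<c cp='0'/>` example. Characters outside the XML 1.0 `Char` production become empty elements, so the result is ordinary mixed content that any XML tool can carry.

```python
import re
import xml.etree.ElementTree as ET

# Negation of the XML 1.0 Char production.
_NON_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def encode_as_elements(text, wrapper="text"):
    """Return an element whose mixed content represents text losslessly."""
    root = ET.Element(wrapper)
    last = None   # most recently added <c> element, if any
    pos = 0
    for m in _NON_XML10.finditer(text):
        run = text[pos:m.start()]
        if last is None:
            root.text = run
        else:
            last.tail = run
        last = ET.SubElement(root, "c", cp=str(ord(m.group())))
        pos = m.end()
    rest = text[pos:]
    if last is None:
        root.text = rest
    else:
        last.tail = rest
    return root


def decode_elements(root):
    """Inverse of encode_as_elements: rebuild the original string."""
    parts = [root.text or ""]
    for c in root:
        parts.append(chr(int(c.get("cp"))))
        parts.append(c.tail or "")
    return "".join(parts)
```

The encoded form survives a serialize and parse round trip, which the raw string would not.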

The chair attempted a straw poll at the end of the meeting in an effort to get a sense of where effort was likely to be fruitful. Members were permitted to vote for all the alternatives they considered worth considering.

The results:

  1. 7
  2. 1
  3. 6
  4. 1
  5. 2

@ChristianGruen
Contributor Author

Regarding 1: fn:parse-json already has a fallback option, we could possibly use the same for fn:unparsed-text.

Regarding 3, the text and json serialization methods could be allowed to return the full Unicode range.
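A fallback callback of the kind mentioned for option 1 might look roughly like this in Python. It is loosely modelled on the idea of the `fallback` option of fn:parse-json, not on its exact signature: here the hypothetical `replace_non_xml` passes the offending character itself to the callback, which returns the replacement string.

```python
import re

# Negation of the XML 1.0 Char production.
_NON_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def replace_non_xml(text, fallback):
    """Replace each non-XML 1.0 character with whatever fallback(char) returns."""
    return _NON_XML10.sub(lambda m: fallback(m.group()), text)


def as_escape(ch):
    """Example fallback: render the character as a JSON-style \\uXXXX escape."""
    return f"\\u{ord(ch):04X}"


def as_replacement_char(ch):
    """Example fallback: substitute U+FFFD REPLACEMENT CHARACTER."""
    return "\uFFFD"
```

With this shape the user decides per call whether to drop, escape, or substitute the character, and the string that comes back is always safe to place in a node.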

@michaelhkay
Contributor

One way forward might be:

XDM section 2.7.3 has a lot of "mays" and "shoulds" about the set of characters supported by xs:string. We could add:

Implementations MAY allow xs:string atomic values, and the string values of nodes, to contain Unicode characters that are not permitted by the Char production in any version of XML. For example, they may allow such values to arise by omitting checks on the result of functions such as unparsed-text(), codepoints-to-string(), or char(); or by allowing extension functions to return such strings without checking; or by accepting as input source documents that have been constructed using third-party libraries that do not perform strict checking. However, implementations MUST ensure (a) that the output of serialization using the XML or XHTML output methods is well-formed according to the selected version of XML; and (b) that schema validation carried out using an XSD 1.0 or XSD 1.1 schema rejects any nodes containing disallowed characters as invalid.

Note that this almost certainly reflects the reality of existing products. A processor that accepts input from DOM will probably not check that all the characters in the DOM tree are valid, and the DOM library almost certainly doesn't care.

What this doesn't do is to give users an assurance that they can safely use unparsed-text() to read files containing arbitrary characters.


5 participants