Support Invisible XML #238

michaelhkay · 2022-11-12T08:30:38Z

I propose that we support Invisible XML by means of a function

fn:invisible-xml($grammar as xs:string) as (function($string) as document-node())

The function takes as input a string defining an invisible XML grammar in ixml format, and returns as output a function that can be used to parse strings conforming to that grammar, converting them into XDM document nodes.
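The shape of this proposal, a function that compiles a grammar once and returns a parsing function, can be sketched outside XPath. The Python below is only an analogy and not an ixml implementation: a regular expression with named groups stands in for the grammar, and `invisible_xml_like` and its `root` parameter are hypothetical names introduced for illustration.

```python
import re
import xml.etree.ElementTree as ET
from typing import Callable


def invisible_xml_like(grammar: str, root: str = "result") -> Callable[[str], ET.Element]:
    """Compile `grammar` once and return a function that parses strings.

    Assumption: a regex with named groups stands in for an ixml grammar.
    """
    compiled = re.compile(grammar)  # done once, captured by the closure below

    def parse(text: str) -> ET.Element:
        match = compiled.fullmatch(text)
        if match is None:
            raise ValueError(f"input does not conform to grammar: {text!r}")
        node = ET.Element(root)
        # One child element per named group, in grammar order.
        for name, value in match.groupdict().items():
            ET.SubElement(node, name).text = value
        return node

    return parse


parse_date = invisible_xml_like(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", root="date"
)
doc = parse_date("2022-11-12")
print(ET.tostring(doc, encoding="unicode"))
# <date><year>2022</year><month>11</month><day>12</day></date>
```

The returned `parse_date` can be applied to any number of strings without touching the grammar again, which is the property the proposed signature is meant to expose.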

As a "dog-food" use case, we could use this for rendering function signatures in the F&O specification. Rather than using manual markup to define the signature of each function, we could define an IXML grammar for function signatures, and use this as the basis for formatting the representation in the spec. This would be particularly beneficial as we start to introduce more complex signatures involving record types.

ndw · 2022-11-12T09:37:19Z

I'm hardly going to object, given I've got an implementation :-)

However, I think we need a little bit more flexibility in the API. It should be possible to pass an XDM node to fn:invisible-xml because it's perfectly reasonable to construct a parser from the XML serialization of Invisible XML. I also think we want to allow an options map to be passed to the function because people will want the flexibility to, for example, suppress the state information about ambiguity or prefix parsing. And even if we thought that was unnecessary because they could transform the output to remove those things, implementations may want to expose additional options. My implementation, for example, has options to allow undefined/unreachable/unproductive symbols, multiply defined symbols, and options for debugging to show additional state information.
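As a rough sketch of the more flexible API described here, the Python below accepts the grammar either as a string or as an XML node and takes an options map. It is an analogy under stated assumptions, not any implementation's API: a regular expression stands in for the grammar, and `make_parser` and the `report-state` option are hypothetical names. The `state="prefix"` marker loosely mimics the state information that the comment mentions suppressing.

```python
import re
import xml.etree.ElementTree as ET
from typing import Callable, Mapping, Optional, Union

Grammar = Union[str, ET.Element]


def make_parser(
    grammar: Grammar, options: Optional[Mapping[str, object]] = None
) -> Callable[[str], ET.Element]:
    """Build a parser from a string or node grammar plus an options map."""
    # Hypothetical option: report-state defaults to True, callers may override.
    opts = {"report-state": True, **(options or {})}
    # Assumption: a node-form grammar carries the pattern as its text content.
    source = grammar if isinstance(grammar, str) else (grammar.text or "")
    compiled = re.compile(source)

    def parse(text: str) -> ET.Element:
        match = compiled.match(text)  # prefix match: trailing input is allowed
        if match is None:
            raise ValueError(f"no parse for {text!r}")
        node = ET.Element("result")
        node.text = match.group(0)
        # Flag a prefix parse unless the caller asked to suppress state.
        if opts["report-state"] and match.end() < len(text):
            node.set("state", "prefix")
        return node

    return parse


grammar_node = ET.fromstring("<grammar>[0-9]+</grammar>")
verbose = make_parser(grammar_node)
quiet = make_parser("[0-9]+", {"report-state": False})
print(verbose("42abc").attrib)  # {'state': 'prefix'}
print(quiet("42abc").attrib)    # {}
```

Unknown keys in the options map are simply carried along here, which leaves room for implementation-specific options of the kind mentioned above.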

johnlumley · 2022-11-15T14:58:16Z

I would second Norm's request for an optional options map, both on the compiling and the runtime. My own implementation could support such a function in the XSLT API stylesheet, albeit in a non-fn namespace.

cmsmcq · 2022-11-15T23:07:51Z

Having built-in support for invisible XML appeals to me, so I am in favor.

But I am not sure about the function signature

fn:invisible-xml($grammar as xs:string) as (function($string) as document-node())

--- or, at least, I think that this is not the only function signature we are likely to want. This signature would work fine for processors which work by first compiling the input grammar into some usable form (whether an annotated grammar or a function, e.g. a recursive-descent parser) and then using that compiled form to handle the input string, and it is clearly desirable in cases where (a) grammar compilation takes a significant proportion of the necessary time and (b) the same grammar is to be used repeatedly. Where those conditions don't apply, other function signatures may be more appealing. Both Norm Tovey-Walsh's nineml processor and my Aparecium processor use the same parsing machinery to parse the input grammar and then the input string, and this interface would help in cases where the same grammar is used repeatedly. John Lumley's parser and Steven Pemberton's parser, on the other hand, use hand-tuned parsers for the input grammars, and grammar-preparation time is really not dominant.

In the design of the user-facing function interface for Aparecium (my ixml processor, intended for use in XSLT and XQuery but currently implemented only in XQuery), I included several functions, some of which may be possibilities we may wish to consider:

aparecium:parse-string($input-string as xs:string, $input-grammar as xs:string) as element()
aparecium:parse-resource($input-uri as xs:string, $grammar-uri as xs:string) as element()

I assume that an optimizing QT processor may be able to detect that the same grammar is used multiple times and avoid parsing the grammar repeatedly for repeated calls.

In Aparecium I also included functions for compiling a grammar and for parsing an input string using a compiled grammar. I won't list them here because I think MK's suggestion of returning a function item is better, but I do agree with Norm that it would be convenient to be able to supply the grammar in any of several forms:

a string conforming to the ixml specification grammar
an XML element / XDM node conforming to the ixml specification (a 'visible-XML' grammar, as we sometimes call this in the ixml community group to avoid confusion with other grammar forms)
a URI pointing to a document (text/plain or other) with an ixml grammar in invisible-XML form
a URI pointing to an XML document with a visible-XML grammar

When I started working on ixml, I also thought it might be nice to have an invisible-XML function similar in its way to doc(), which would accept a URI pointing to an input string, dereference it, use information in the HTTP header to find an appropriate grammar, fetch the grammar, and return the parse result. (Steven Pemberton's 2013 paper describes using the HTTP header to point to an ixml grammar; failing that, my idea was to get the MIME type and for the ixml implementation to have a library of grammars for often-used MIME types.) That currently seems like a bit of a reach to me, but I mention it here because I still think it would be a nice idea, and no group is better positioned than this one to make it happen.

johnlumley · 2022-11-16T09:59:22Z

To clarify Michael SMQ’s remarks, my ixml parser does, I think, benefit from a compile-once / use-many invocation, especially where the input sentences are short and the grammar is long or complex. The EBNF-BNF rewrites are done once and the compiled grammar is a tree of JS class instances ready to prime the Earley parser. Come to think of it, I could probably create a clonable Earley parser which has all the zeroth-step predictions already performed. Will need to investigate on return from holiday. John Lumley


ChristianGruen · 2022-11-16T10:18:56Z

I wonder if we should confront users with the fact that a grammar may be compiled only once. Isn’t it better and more elegant to treat this as a processor-specific optimization issue?

There are many other use cases I can think of in which query compilers and optimizers may decide to do things only once during the runtime of a query without making this explicit (XSLT stylesheets; SQL statements; regular expression patterns; documents resulting from deterministic functions such as fn:doc; etc.), and I believe it should be fairly easy for implementations to cache compiled grammars for grammar strings if they are repeatedly used.

michaelhkay · 2022-11-16T12:08:00Z

Caching always has the disadvantage that it involves guesswork; there's a substantial memory cost in caching a large grammar in the case where it isn't used again. I think that capturing the compiled grammar in a function is a much more elegant approach (which could well be used elsewhere, e.g. for fn:transform).
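The trade-off between the two positions can be sketched in Python. This is an analogy only: a regular expression stands in for the grammar, and `parse_string`, `compile_grammar`, and the cache size of 8 are hypothetical choices made for illustration. (Python's `re` module also keeps its own internal cache, so the stand-in shows the shape of the two designs rather than their real memory behaviour.)

```python
import re
from functools import lru_cache
from typing import Callable


@lru_cache(maxsize=8)
def _compile(grammar: str) -> "re.Pattern":
    # Cached compilation: entries stay until evicted, reused or not.
    return re.compile(grammar)


def parse_string(text: str, grammar: str) -> bool:
    """One-shot style: the processor's cache decides what gets recompiled."""
    return _compile(grammar).fullmatch(text) is not None


def compile_grammar(grammar: str) -> Callable[[str], bool]:
    """Function-item style: the caller holds the compiled grammar."""
    compiled = re.compile(grammar)
    return lambda text: compiled.fullmatch(text) is not None


# One-shot calls: the second call with the same grammar hits the cache.
parse_string("123", r"\d+")
parse_string("456", r"\d+")
info = _compile.cache_info()
print(info.hits, info.misses)  # 1 1

# Function-item calls: reuse is explicit in the caller's code.
is_number = compile_grammar(r"\d+")
print(is_number("123"), is_number("abc"))  # True False
```

In the one-shot style the cache has to guess which grammars are worth keeping; in the function-item style the compiled form lives exactly as long as the caller keeps the returned function.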

ChristianGruen · 2023-11-08T09:59:27Z

Accepted at meeting 052.

ChristianGruen added the XQFO (An issue related to Functions and Operators) and Feature (A change that introduces a new feature) labels Nov 14, 2022

ndw mentioned this issue Jan 22, 2023

Refactored API; updated website stylesheets nineml/coffeesacks#33

Merged

GuntherRademacher mentioned this issue Mar 27, 2023

Add support for fn:invisible-xml BaseXdb/basex#2192

Merged

ChristianGruen added the Propose for V4.0 (The WG should consider this item critical to 4.0) label Jun 20, 2023

ChristianGruen removed the Propose for V4.0 (The WG should consider this item critical to 4.0) label Nov 8, 2023

ChristianGruen closed this as completed Nov 8, 2023
