Clarification of @collection #555

Closed
eriksiegel opened this Issue Oct 11, 2018 · 27 comments

4 participants
@eriksiegel
Contributor

eriksiegel commented Oct 11, 2018

I think I misinterpreted the meaning of @collection on p:variable and p:with-option. The spec's description is currently rather sparse and IMHO needs some clarification:

All of the documents that appear on the connection for the p:variable will be available as the default collection within select expression.

I'll try to write some more prose on this if somebody can explain what is meant and/or send me a simple code example?

@eriksiegel eriksiegel self-assigned this Oct 11, 2018

@gimsieke

Contributor

gimsieke commented Oct 11, 2018

Example: <p:variable name="count" as="xs:integer" collection="true" select="count(collection())"/> ⇒ 3, if 3 documents are on the DRP.
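For illustration, here is a minimal sketch of a complete pipeline around that variable. It assumes XProc 3.0 syntax as in the current drafts and a sequence of documents arriving on the source port; the count wrapper element is made up for the example.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="3.0">
  <!-- The primary input is the default readable port (DRP) for the variable below. -->
  <p:input port="source" sequence="true"/>
  <p:output port="result"/>

  <!-- With collection="true", every document on the DRP becomes part of
       the default collection for the select expression. -->
  <p:variable name="count" as="xs:integer" collection="true"
              select="count(collection())"/>

  <!-- Report the value; the <count> element is hypothetical. -->
  <p:identity>
    <p:with-input port="source">
      <p:inline><count>{$count}</count></p:inline>
    </p:with-input>
  </p:identity>
</p:declare-step>
```

Run with three documents on the source port, this should report 3, matching the example above.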

@eriksiegel

Contributor

eriksiegel commented Oct 11, 2018

Ok. Aha. So this also means that you can access the n-th document by writing collection()[n], that's nice.

But what happens when non-XML documents are on the DRP?

@gimsieke

Contributor

gimsieke commented Oct 11, 2018

I think non-XML documents will just be an empty document node (citation needed), or a text node within a document node for text documents.

@gimsieke

Contributor

gimsieke commented Oct 11, 2018

Here’s a more elaborate definition of the context item: http://spec.xproc.org/master/head/xproc/#err.inline.D0008
JSON documents are represented by their XDM representation, that is, an array or a map?
So if collection()[2] is of content type text/json, it is not a text node wrapped in a document node, but instead an array or map?
Binary documents are implementation defined. So contrary to what I believed, they are not necessarily represented as a document node. However, document-properties(collection()[3]) should return the document property map if the third document is a binary file.
You should be able to do the following, irrespective of the representation:

<p:variable name="binary-doc" as="???" collection="true" select="collection()[3]"/>
<p:store href="myfile.bin">
  <p:with-input port="source" select="$binary-doc">
    <p:empty/>
  </p:with-input>
</p:store>

Should you be able to give a document in a select attribute? The context for p:store here is still the DRP with 3 documents on it. So could we also say, if collection were allowed on p:with-input, <p:with-input port="source" select="collection()[3]" collection="true"/>?

What do we (interoperably) specify as the as attribute value if the variable is supposed to hold a binary document?

@eriksiegel

Contributor

eriksiegel commented Oct 11, 2018

As always, complications...

May I propose the following:

  1. If the document is an XML document everything is hunky dory
  2. If the document is a text document there is only a single text node as child of the document node
  3. If the document is anything else (JSON also), there are no children underneath the document node. (since you can't represent a map or array in a node tree this must apply to JSON documents also, there should be some other means to get to its map/array representation)
@ndw

Contributor

ndw commented Oct 11, 2018

<aside>I don't think collection()[2] is going to do what you want; the order of documents in the collection may not be stable.</aside>

The question of what to do with maps is an interesting one. We want JSON to be able to flow through the pipeline. We want to represent JSON as XDM maps. XDM maps aren't nodes. So I think we've just painted ourselves into a corner that says what flows between steps are XDM instances not documents. Bah, humbug.

Non-node values can't go into collections so either we have to serialize them and make them nodes or we have to leave them out of collections. Bah, double humbug.

@xml-project

Contributor

xml-project commented Oct 11, 2018

Non-node values can't go into collections so either we have to serialize them and make them nodes or we have to leave them out of collections.

Why not? The XPath 3.1 specification says:

Default collection. This is the sequence of items that would result from calling the fn:collection function with no arguments.

So in my reading, any instance of item (document nodes, text nodes etc, and maps) can be part of the default collection. What did I miss?

So I think we've just painted ourselves into a corner that says what flows between steps are XDM instances not documents.

Yes, we actually use "document" in a double sense. This is why I introduced the term "XProc document" in my London paper in June: what flows between steps in XProc is an (XProc) document.
XProc documents are pairs of properties and representations. A representation may be an XDM document or a map.

@xml-project

Contributor

xml-project commented Oct 11, 2018

@gimsieke

What do we (interoperably) specify as the as attribute value if the variable is supposed to hold a binary document?

Answer: item()*

@ndw

Contributor

ndw commented Oct 11, 2018

Sorry. My bad. I was looking at the XPath 3.0 functions and operators spec where fn:collection() returns node()*. I see that in 3.1 it returns item()*. Ignore that bit.

@eriksiegel

Contributor

eriksiegel commented Oct 12, 2018

Ok, looks fine. So summarizing:

  1. If the document is an XML document, it's a normal document node tree
  2. If the document is a text document there is only a single text node as child of the document node
  3. If the document is JSON you get a map or array
  4. If the document is binary you get item()*, unspecified, implementation defined

Ok. I'm unsure about 4. @xml-project, is that what you meant?

@ndw We'll have to say something about the order of documents. But why wouldn't that be stable? Documents flow in a certain order, right?

@xml-project

Contributor

xml-project commented Oct 12, 2018

@eriksiegel

Ok. I'm unsure about 4. @xml-project, is that what you meant?

Yes. You will get what you get, because we define the behavior of binary documents only on the XProc level, not on the XPath level where we are now.

I think your conclusion for JSON is not quite right: For documents with content-type application/json we decided to use fn:parse-json() and I think this is also true for collection().
The function specs say:

JSON-object -> Map
JSON-array -> Array
JSON-string -> xs:string
JSON-number -> xs:double
JSON-boolean -> xs:boolean
JSON-null -> empty sequence

So IMHO the correct answer (expressed as SequenceType) for JSON is item()?.
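To check which representation each document actually gets, something like the following minimal sketch could be used. It assumes a processor that implements @collection as discussed in this thread; the variable name kinds and the kinds wrapper element are made up for the example. Atomic values (JSON strings, numbers, booleans) and whatever a processor uses for binary documents all end up as 'other'.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="3.0">
  <p:input port="source" sequence="true"/>
  <p:output port="result"/>

  <!-- Classify each item of the default collection by its XDM type.
       XML and text documents are expected to show up as document nodes. -->
  <p:variable name="kinds" as="xs:string*" collection="true"
              select="for $item in collection() return
                        if ($item instance of map(*)) then 'map'
                        else if ($item instance of array(*)) then 'array'
                        else if ($item instance of document-node()) then 'document-node'
                        else 'other'"/>

  <!-- The <kinds> element is hypothetical; it only reports the result. -->
  <p:identity>
    <p:with-input port="source">
      <p:inline><kinds>{string-join($kinds, ' ')}</kinds></p:inline>
    </p:with-input>
  </p:identity>
</p:declare-step>
```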

@gimsieke

Contributor

gimsieke commented Oct 12, 2018

About order:

In

<p:identity>
  <p:with-input port="source">
    <p:document href="doc1.xml"/>
    <p:document href="doc2.json"/>
    <p:document href="image.png"/>
  </p:with-input>
</p:identity>
<p:variable name="png" select="collection()[3]" collection="true"/>

$png is guaranteed to contain the image.png document. This is stated in the note that immediately precedes http://spec.xproc.org/master/head/xproc/#documentation.

Order would not be guaranteed if you connect to the secondary port of a p:xslt step and, for ex., expect the text document to be the first output document on this port, see xproc/1.0-specification#17

@xml-project

Contributor

xml-project commented Oct 12, 2018

@gimsieke Sorry, but I thought we were talking about the order in which the XPath-function collection() returns the documents, not about the order on an XProc port (the passage you have quoted).

I agree with @ndw that the spec of the XPath function collection() does not define an order for the sequence, so you can NOT be sure that image.png is returned by collection()[3].

I think that is why XPath has the function fn:collection(arg as xs:string?) (arg is interpreted as a URI), and the function will return the document (in the default collection) with this URI (if any).

@gimsieke

Contributor

gimsieke commented Oct 12, 2018

Implementations should be required to let collection() return the documents in the order in which they appear on the port. Is there a reason not to stipulate this?

@xml-project

Contributor

xml-project commented Oct 12, 2018

@gimsieke

Is there a reason not to stipulate this?

We are not in a position to stipulate this, because we are not the XPath next community group. collection() is an XPath function defined in their specs. How can we change their specs?

@gimsieke

Contributor

gimsieke commented Oct 12, 2018

I don’t see anything in https://www.w3.org/TR/xpath-functions-31/#func-collection that would prevent us from returning the default collection in a specific order.

@xml-project

Contributor

xml-project commented Oct 12, 2018

Sorry @gimsieke, I failed to make my point: we (which in this case means the XProc implementors) do not return anything here. We call an XPath processor to execute the XPath expression containing "fn:collection()", and the XPath processor evaluates the expression according to the XPath specs. Since the specs do not guarantee order, there might be an order or not.

I do not see what we (the XProc next community group) could do about this.

@gimsieke

Contributor

gimsieke commented Oct 12, 2018

Saxon for example has no built-in default collection. If I read this code correctly, @ndw constructs a default collection that he passes to net.sf.saxon.lib.CollectionURIResolver. This is for XSLT. For XProc 3.0 constructs that accept @collection, I assume that Norm will continue to use Saxon as the XPath processor. For these XPath expressions (outside of XSLT), you have your own XPath processor. What prevents you from defining the default collection in a specific way?

@xml-project

Contributor

xml-project commented Oct 12, 2018

  1. As far as I remember, "CollectionURIResolver" has been deprecated since 9.7. I looked at the APIs yesterday to see whether there is any information, but there is none. There is a new interface, "CollectionFinder", but it gives no hint about order (or stability) either.

  2. I do not think the Saxon API can count as an argument, because we are not building "XProc on Saxon".

  3. I do not think the problem is worth the whole discussion because you could easily use p:split-sequence to solve the problem. So IMHO there is no need to deviate from XPath standards or tie our specification to a specific XPath processor.
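A minimal sketch of the p:split-sequence approach from point 3, assuming the step keeps its XProc 1.0 signature (a test option, with matched as the primary output port). The pipeline itself is only an illustration.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source" sequence="true"/>
  <p:output port="result" sequence="true"/>

  <!-- Keep only the third document. Here position() refers to the order in
       which documents arrive on the source port, not to the order of
       collection(). The result port is bound to the primary output
       ("matched") of this last step. -->
  <p:split-sequence test="position() = 3"/>
</p:declare-step>
```

This selects by port order, which the XProc spec does define in cases such as the p:identity example above, so it does not depend on how the XPath processor orders the default collection.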

@ndw

Contributor

ndw commented Oct 12, 2018

I have no reason to believe that the collection() function returns the documents in the same order that I passed them to the collection URI resolver (or whatever the new interface is).

It's called collection not sequence because it's an unordered collection, I believe.

@gimsieke

Contributor

gimsieke commented Oct 12, 2018

I am not suggesting that we tie XProc to a specific XPath processor. I am just proposing that each implementation be required to return the default collection in the order in which the documents appear on the corresponding port. In certain circumstances, that order is already specified by the XProc spec.

@gimsieke

Contributor

gimsieke commented Oct 12, 2018

And I’m asserting that this does not deviate from the XPath spec.

@ndw

Contributor

ndw commented Oct 12, 2018

That is not within my control. I pass a bunch of documents off to Saxon to put in a collection. I don't know how Saxon keeps track of those. Maybe Michael puts them in a map and the insertion-order is lost. Maybe he doesn't. Whether or not they come back in the order I added them is at best implementation-dependent.

@xml-project

Contributor

xml-project commented Oct 12, 2018

I think @ndw's comment should be the bottom line of the "order" discussion, gentlemen.

@gimsieke

Contributor

gimsieke commented Oct 12, 2018

Ok. Returning to @eriksiegel’s comment, maybe we should add a note to the default collection. Something like: “A specific XProc processor in a specific version might return collection items in a certain order, and maybe it is the order that the items appeared on a port. However, you should not rely on accessing collection items by position (for example, collection()[3]). Use other criteria, such as base URIs and other document properties, top-level element names or namespaces, or map keys in order to select specific items from a collection.”
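A minimal sketch of what selecting by such criteria might look like, assuming XProc 3.0 syntax as in the current drafts. The element name chapter, the map key version and the report element are all hypothetical. The type checks are written as conditionals so that node tests are never applied to maps, and map functions are never applied to nodes.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:map="http://www.w3.org/2005/xpath-functions/map"
                version="3.0">
  <p:input port="source" sequence="true"/>
  <p:output port="result"/>

  <!-- Select XML documents by root element name instead of by position.
       "chapter" is a hypothetical element name. -->
  <p:variable name="chapter-count" as="xs:integer" collection="true"
              select="count(collection()[
                        if (. instance of document-node())
                        then exists(chapter)
                        else false()])"/>

  <!-- Select JSON maps by key. "version" is a hypothetical key. -->
  <p:variable name="versioned-count" as="xs:integer" collection="true"
              select="count(collection()[
                        if (. instance of map(*))
                        then map:contains(., 'version')
                        else false()])"/>

  <!-- Report both counts; the <report> element is hypothetical. -->
  <p:identity>
    <p:with-input port="source">
      <p:inline><report chapters="{$chapter-count}" versioned-maps="{$versioned-count}"/></p:inline>
    </p:with-input>
  </p:identity>
</p:declare-step>
```

Neither expression depends on the order in which collection() returns its items.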

@eriksiegel

Contributor

eriksiegel commented Oct 14, 2018

Fine with me. I'll add some more prose to this to clarify.

@ndw

Contributor

ndw commented Oct 18, 2018

@eriksiegel proposes that #565 also fixes this. I'm happy with that.
