p:archive #3

Closed
mkraetke opened this issue Jun 27, 2017 · 29 comments

@mkraetke

commented Jun 27, 2017

Here is a proposal for a p:archive step as discussed at the XProc workshop after XML London. The step is based on Calabash's pxp:zip but with some changes:

The files to be zipped can be declared in an XML manifest. The files themselves may arrive as a sequence on the source port. If a manifest entry does not match the base URI of any document on the source port, the step tries to load the file from disk or over another supported protocol, such as HTTP.

As a shorthand it should be possible to provide a list of paths to directories and files to be zipped with the paths option. Using the paths option while a zip manifest arrives at the input port should result in a dynamic error. The file-list port provides a c:archive document containing a list of the zipped files. In case of errors, the report output should provide a c:errors document. The default format is zip, but implementations may provide other archive formats, which can be selected with the format option.

  <p:archive>
    <p:input port="source"/>                  <!-- xml documents or c:document-properties -->
    <p:input port="manifest"/>        <!-- zip manifest, e.g. c:files[c:file[@name]] or c:directory[@name]/c:file[@name] -->
    <p:output port="result"/>       <!-- zip with c:document-properties -->
    <p:output port="report"/>       <!-- c:errors -->
    <p:output port="file-list"/>    <!-- c:archive -->
    <p:option name="serialization"/><!-- map(xs:string, xs:string) -->
    <p:option name="href"/>         <!-- anyURI -->
    <p:option name="command" 
              select="'update'"/>   <!-- "update" | "freshen" | "create" | "delete" -->
    <p:option name="format" 
              select="'zip'"/>      <!-- other formats are implementation-defined -->
    <p:option name="paths"/>        <!-- optional: white space separated list of files/directories -->
  </p:archive>

This is the expected input of the source port to create a zip file which has a valid OCF structure (think of EPUB files). Please note that you can set particular compression methods and levels for each file.

<c:manifest xmlns:c="http://www.w3.org/ns/xproc-step">
   <c:entry name="mimetype"
            href="file:///C:/home/kraetke/epub/mimetype"
            compression-method="stored"
            compression-level="none"/>
   <c:entry name="META-INF/container.xml"
            href="file:///C:/home/kraetke/epub/META-INF/container.xml"
            compression-method="deflate"
            compression-level="smallest"/>
   <c:entry name="OEBPS/content.xhtml"
            href="file:///C:/home/kraetke/epub/page01.xhtml"
            compression-method="deflate"
            compression-level="smallest"/>
   <c:entry name="OEBPS/content.opf"
            href="file:///C:/home/kraetke/epub/OEBPS/content.opf"
            compression-method="deflate"
            compression-level="smallest"/>
   <c:entry name="OEBPS/cover.xhtml"
            href="file:///C:/home/kraetke/epub/OEBPS/cover.xhtml"
            compression-method="deflate"
            compression-level="smallest"/>
   <c:entry name="OEBPS/styles/stylesheet.css"
            href="file:///C:/home/kraetke/epub/OEBPS/styles/stylesheet.css"
            compression-method="deflate"
            compression-level="smallest"/>
   <c:entry name="OEBPS/toc.ncx"
            href="file:///C:/home/kraetke/epub/OEBPS/toc.ncx"
            compression-method="deflate"
            compression-level="smallest"/>
   <c:entry name="OEBPS/toc.xhtml"
            href="file:///C:/home/kraetke/epub/OEBPS/toc.xhtml"
            compression-method="deflate"
            compression-level="smallest"/>
</c:manifest>
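
As a rough sketch, a call that wires up both input ports of the signature proposed above might look like this. It assumes XProc 3.0-style p:with-input bindings; the step name build-content and the file manifest.xml are hypothetical and only stand in for whatever produces the documents and the manifest.

```xml
<!-- Sketch only: port and option names follow the signature proposed above. -->
<!-- "build-content" and "manifest.xml" are hypothetical names. -->
<p:archive xmlns:p="http://www.w3.org/ns/xproc"
           href="file:///C:/home/kraetke/myEPUB.epub"
           command="create">
  <!-- documents produced earlier in the pipeline, matched to entries by base URI -->
  <p:with-input port="source">
    <p:pipe step="build-content" port="result"/>
  </p:with-input>
  <!-- a c:manifest document like the one shown above -->
  <p:with-input port="manifest">
    <p:document href="manifest.xml"/>
  </p:with-input>
</p:archive>
```

Entries whose href matches no document on the source port would then be loaded from disk or the network, as described earlier.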

As a shorthand, it should also be possible to leave the source port empty and just pass a list of files/directories to p:archive. Then the files and directories (including their subdirectories and files) should be added. In this case we expect that the user has already stored their files in the appropriate file structure.

<p:archive href="file:///C:/home/kraetke/myfile.zip" 
       paths="file:///C:/home/kraetke/zip-archive1/ file:///C:/home/kraetke/zip-archive2/" 
       command="create"/>

You can add serialization options, for example to just add a directory and direct the XProc processor to create an EPUB-conformant zip structure. The options should be provided by the XProc implementation. On the other hand, it might be helpful to standardize global keys such as compression-method and compression-level.

<p:archive href="file:///C:/home/kraetke/myEPUB.epub" 
       paths="file:///C:/home/kraetke/epub" 
       command="create"
       serialization="map{ 'conformance' : 'epub', 'password' : '123456' }"/>

The file-list output should look like this.

<c:archive xmlns:c="http://www.w3.org/ns/xproc-step"
           href="file:///C:/home/kraetke/myEPUB.epub">
   <c:file compressed-size="20"
           size="20"
           name="mimetype"
           date="2017-06-27T12:58:52.000+02:00"/>
   <c:file compressed-size="1323"
           size="3034"
           name="META-INF/container.xml"
           date="2017-06-27T12:58:52.000+02:00"/>
   <!-- (...) -->
</c:archive>

The result port provides the Zip document represented by c:document-properties.

@mkraetke mkraetke changed the title p:zip p:archive Jun 27, 2017

@eriksiegel

Contributor

commented Feb 15, 2018

Sounds like a good proposal to me. I have a few comments/additions:

  • I don't understand what you mean by the output as a c:document-properties document. Wouldn't it make sense to drop the file-list output port and output what you're proposing above on the result port? Anyway, some example to clarify?
  • What do you want to appear on the report port?
  • I would like @href to accept a pipe: protocol, like pipe:port@step. The contents for the file to add should be read from that port. If more than one document appears on that port, it's a dynamic error. There are many use-cases where you create stuff in your pipeline that must be written as part of an archive. The whole setup with "base-uri matching from the source port" is rather complex, hard to understand and even harder to program (has anybody done this? Example?)
  • Maybe we could skip the manifest port and specify that a p:archive step needs a manifest as its source input?
  • Maybe that's already possible, but can we think of something that allows you to specify on @href something in another zip/jar file?
  • We should also be able to specify the contents of the file to add/write directly in the manifest. Simply use a c:entry without an @href and with child contents.
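
To make the pipe: and inline-content suggestions above concrete, a manifest using both might look roughly like this. This is a hypothetical sketch: neither feature exists in the proposal as written, and the port name result and step name build-content are made up for illustration.

```xml
<c:manifest xmlns:c="http://www.w3.org/ns/xproc-step">
  <!-- content read from an output port, using the suggested pipe:port@step form;
       "result" and "build-content" are hypothetical names -->
  <c:entry name="OEBPS/content.xhtml" href="pipe:result@build-content"/>
  <!-- content given directly in the manifest: no @href, child content instead -->
  <c:entry name="META-INF/container.xml">
    <container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
      <rootfiles>
        <rootfile full-path="OEBPS/content.opf"
                  media-type="application/oebps-package+xml"/>
      </rootfiles>
    </container>
  </c:entry>
</c:manifest>
```
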
@mkraetke

Author

commented Feb 15, 2018

  • I think we can drop the file-list port if the report port provides a list of archived files and potential errors. The report port should provide either errors as a c:errors document or useful information about the generated zip file, e.g. whether it is encrypted, what the compression level of the files is, what the file structure looks like, etc.
  • The result port should just provide the zip file with its document properties.
  • The href option is meant to provide an address for storing the zip file. Don't know whether this would make sense in the concept of XProc 3.0? Perhaps users would expect such an option as a shorthand rather than storing the zip file with p:store later.
  • We can skip the manifest port. On the other hand I like the idea that the step receives a sequence of documents as primary input and creates one or more zips as primary output. The manifest was intended to provide some declaration for more complex zip files such as EPUBs. Perhaps we can think of the manifest as a non-required option that expects a map.
  • I would like p:archive to have the capability to split a zip file into smaller zip chunks depending on the file size. So that people could store video games like quake on 21 3½-inch floppy disks and barter them on the schoolyard ;-) In this sense, the step should provide a sequence of zips.
@gimsieke

Contributor

commented Feb 15, 2018

The c:document-properties pertains to my proposal for a placeholder document that was rejected in Aachen, so it shouldn’t appear here any more.
The source port needs to accept a sequence.
I don’t think we should introduce new protocols for @href (which denotes the archive destination as far as I understood it). Otherwise, for reading multiple files (also binary files), we have input ports and manifests. We should keep it that you can send the input files to the source port. Otherwise you’d need to store them on disk or use the proprietary pipe protocol for specifying input ports as URIs. But this would be a fundamental deviation from the current model. Processors would need to analyze the hrefs for occurrences of the pipe protocol. If we adopted it, we probably could abolish other connection mechanisms

@eriksiegel

Contributor

commented Feb 15, 2018

I think there is confusion about what I meant with @href: I mean the c:entry/@href attribute in the manifest, which points to the source of what will be stored.

But we could just as easily introduce a c:entry/@pipe attribute

@gimsieke

Contributor

commented Feb 15, 2018

Ah ok. I think currently the matching between manifest entries and input documents was done by base URI matching. I think if there’s an input document with a URI that is also referred to in the manifest, the step will not try to read it from disk or network, but use the document with the same base URI that was specified on the input port. Can’t we keep it like that? It needs to be specified more explicitly though.

@eriksiegel

Contributor

commented Feb 15, 2018

Sorry, that's exactly what I neither like nor use in the current setup. I would like to have a manifest that exactly spells out where all the content comes from (URI, pipe, inline document).

The base-uri feature thing can stay, no problem. But I would like some more enhanced capabilities in the manifest.

I will try to come up with a more complete proposal somewhere in the coming weeks (if time permits). Base it on this one so we can talk details. Ok?

@gimsieke

Contributor

commented Feb 15, 2018

I’ll wait for some more people to weigh in. The current approach works for me, it just needs better documentation/specification. I’m against introducing a new protocol.

@eriksiegel

Contributor

commented Feb 15, 2018

But it doesn't work for me. So let's see if we can make something that works for both (add stuff without changing the current workings).

No new protocol. Ok. But a c:entry/@pipe?

@eriksiegel

Contributor

commented Mar 27, 2019

There is now a skeleton archive step with PR #55. This needs to be discussed and enhanced.

@eriksiegel eriksiegel removed their assignment Mar 27, 2019

@Conal-Tuohy

commented Mar 27, 2019

The proposal above says that the source port accepts a sequence of documents to zip, but the example shows the source port receiving a manifest. I assume that's just a typo, and the example manifest should actually be input to the manifest port.

Using Calabash's zip step, I felt the need to write a wrapper step to convert an input sequence of documents into a zip. I provide my own zip-sequence step with a sequence of documents which have their own defined base URI, and it generates a manifest which simply uses those base URIs, and passes the manifest along with the sequence to Calabash's zip step. It sounds to me like the proposed archive step would allow me to skip that generation of the manifest, and just pass a sequence of documents (presumably their base URIs would be used in the zip's directory structure?).

With Calabash's zip step I was frustrated by having to store the zip file to disk, even if the file was only going to be sent over the network, but as I understand it, that wouldn't be necessary, here, since the result port will contain the zip document itself, and presumably the @href option could be omitted (for ephemeral archives).

@eriksiegel

Contributor

commented Mar 27, 2019

Given that a manifest is always necessary and documents to zip are not (they could be inlined in the manifest or on disk), we could change the behavior to:

  • The source port receives the manifest
  • We add a contents port (or whatever name) for any documents-to-zip

For how I use it (the current p:zip) this would be very appropriate and simplify my code. I never supply documents to zip through a port; they always come from disk. So I usually produce a manifest only as input for the zip/archive step...
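
A call under this variant might be sketched as follows. The port name contents is only the placeholder suggested above, and the step names build-manifest and build-content are hypothetical.

```xml
<!-- Sketch only: manifest on the primary source port, documents on "contents". -->
<p:archive xmlns:p="http://www.w3.org/ns/xproc">
  <!-- the manifest arrives on the primary port -->
  <p:with-input port="source">
    <p:pipe step="build-manifest" port="result"/>
  </p:with-input>
  <!-- optional: documents to zip that are not read from disk -->
  <p:with-input port="contents">
    <p:pipe step="build-content" port="result"/>
  </p:with-input>
</p:archive>
```

When every entry comes from disk or is inlined in the manifest, the contents binding would simply be left out.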

@xml-project

Contributor

commented Mar 27, 2019

@eriksiegel I think that is the way to go (with two ports). And I would like to see two ports on p:uncompress also, so the manifest is clearly separated from the documents flowing in or out.

With regard to the port names, I have no special preferences, but a hint: To remember that one port is named "manifest" is easy, so maybe documents on port "source"/"result" and manifest on "manifest". Just saying.

@eriksiegel

Contributor

commented Mar 27, 2019

Sure. But what is the primary port? I'm advocating that the primary should be the manifest port (on p:archive at least).

And if we call the primary port manifest, that's an exception to everywhere else...

@gimsieke

Contributor

commented Mar 27, 2019

I remember that someone recently wrote:

NOTE:
source and result are, by convention, the preferred names for primary in- and output ports. However, you can call them differently if you like (which is not recommended).

I sometimes wish that insertion were the primary port for p:insert, but source is the primary port there.

Are the reasons for making manifest primary strong enough to break with the pxp:zip precedent and with the recommendation to call the primary port source (especially if there is a source port, like in the zip case)?

@eriksiegel

Contributor

commented Mar 27, 2019

LOL
But we could still provide the manifest on (primary) source and anything else on a port called contents or whatever.

@xml-project

Contributor

commented Mar 27, 2019

Sure. But what is the primary port? I'm advocating that the primary should be the manifest port (on p:archive at least).

If you want the manifest port to be primary, you are right: we need another port name. Never thought of this preference for XProc 3.0. In 1.0 it would seem natural to me, but in XProc 3.0 I do not expect the content always to come from disk. But ok: This is just a guess.
Changing the primary port from source to manifest would certainly produce more confusion than having to learn a new port name.

So I agree: manifest on source, content on another port. content would be fine for me.

@gimsieke

Contributor

commented Mar 27, 2019

Ah, now I see what you mean. I have a slight preference for keeping the non-primary manifest input port. But yes, if source documents continue to be read from disk if they appear in the manifest but not on the source port, the source port’s “optionality” is higher than the manifest port’s.

Another thought: Can we try to make the manifest port optional? If, on the source port, we have documents with the following base URIs:

file:///C:/home/joe/foo/img/image.png
file:///C:/home/joe/foo/js/script.js
file:///C:/home/joe/foo/index.html
file:///C:/home/joe/foo/css/styles.css

and an option relative-to-base-uri="file:///C:/home/joe/foo/", then the step can create a manifest by itself:

<c:manifest xmlns:c="http://www.w3.org/ns/xproc-step">
   <c:entry name="img/image.png"
            href="file:///C:/home/joe/foo/img/image.png"/>
   <c:entry name="js/script.js"
            href="file:///C:/home/joe/foo/js/script.js"/>
   <c:entry name="index.html"
            href="file:///C:/home/joe/foo/index.html"/>
   <c:entry name="css/styles.css"
            href="file:///C:/home/joe/foo/css/styles.css"/>
</c:manifest>

If relative-to-base-uri were missing, the Zip file would start with a home directory.
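
A manifest-free call along these lines might be sketched as below. The option name relative-to-base-uri is the one floated in this comment; the step name load-site is hypothetical and stands for whatever delivers the four documents.

```xml
<!-- Sketch only: no manifest binding; entry names are derived by stripping
     the option value from each document's base URI. -->
<p:archive xmlns:p="http://www.w3.org/ns/xproc"
           relative-to-base-uri="file:///C:/home/joe/foo/">
  <p:with-input port="source">
    <p:pipe step="load-site" port="result"/>
  </p:with-input>
</p:archive>
```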

@eriksiegel

Contributor

commented Mar 27, 2019

We could do that.

But: It wouldn't be a use-case for me. In all my many XProc pipelines I always create a manifest and couldn't do without, so I would probably never use this. Having said that, it's just me, so if other people think this is useful, I will not stand in the way.

I would prefer the source port=manifest option. Anybody else an opinion about this?

@gimsieke

Contributor

commented Mar 27, 2019

You couldn’t do this in 1.0 (for non-XML documents), therefore you always had to create a manifest. I think that 3.0 can change the way that (some) people create zip files.

@xml-project

Contributor

commented Mar 27, 2019

  1. I think @gimsieke's proposal makes sense to me because it makes the creation of (simple) zips easier.

  2. @eriksiegel said: "I would prefer the source port=manifest option. Anybody else an opinion about this?" As I said: Not my first choice, but I would object to your proposal.

@gimsieke

Contributor

commented Mar 27, 2019

@xml-project: “would object” → “wouldn’t object”?

In any case, something for the next conference call.

@xml-project

Contributor

commented Mar 27, 2019

of course "would not object". Sorry.

@Conal-Tuohy

commented Mar 28, 2019

The custom zip-sequence step that I mentioned above, which I wrote as a wrapper for Calabash's zip step, does almost exactly this:

Another thought: Can we try to make the manifest port optional? If, on the source port, we have documents with the following base URIs:

file:///C:/home/joe/foo/img/image.png
file:///C:/home/joe/foo/js/script.js
file:///C:/home/joe/foo/index.html
file:///C:/home/joe/foo/css/styles.css

and an option relative-to-base-uri="file:///C:/home/joe/foo/", then the step can create a manifest by itself:

<c:manifest xmlns:c="http://www.w3.org/ns/xproc-step">
   <c:entry name="img/image.png"
            href="file:///C:/home/joe/foo/img/image.png"/>
   <c:entry name="js/script.js"
            href="file:///C:/home/joe/foo/js/script.js"/>
   <c:entry name="index.html"
            href="file:///C:/home/joe/foo/index.html"/>
   <c:entry name="css/styles.css"
            href="file:///C:/home/joe/foo/css/styles.css"/>
</c:manifest>

If relative-to-base-uri were missing, the Zip file would start with a home directory.

Effectively the manifest is defined already by the base URIs of the source document (the other complexity it encapsulates is finding a location to save the temporary zip file, since this step is intended to run in the context of a web server).

<p:declare-step type="z:zip-sequence" name="zip-sequence">
	<p:input port="source" sequence="true"/>
	<p:output port="result"/>
	<p:input port="parameters" kind="parameter"/>
	<!-- create a zip manifest  -->
	<!-- convert each document in the sequence into a c:entry of a c:zip-manifest -->
	<p:for-each>
		<p:template>
			<p:input port="template">
				<p:inline>
					<c:entry name="{substring-after(base-uri(), 'file:/')}" href="{base-uri()}"/>
				</p:inline>
			</p:input>
		</p:template>
	</p:for-each>
	<!-- wrap entries into a manifest -->
	<p:wrap-sequence wrapper="c:zip-manifest" name="manifest"/>
	<!-- get global parameters to find a safe place to write a temp file -->
	<p:parameters name="global-parameters">
		<p:input port="parameters">
			<p:pipe step="zip-sequence" port="parameters"/>
		</p:input>
	</p:parameters>
	<p:group>
		<!-- We need an absolute URI for the temporary zip file, based on the "realPath" parameter -->
		<p:variable name="zip-file-name" select="
			concat(
				'file:', 
				/c:param-set/c:param[@name='realPath'][@namespace='tag:conaltuohy.com,2015:servlet-context']/@value,
				'/zip-sequence.zip'
			)
		">
			<p:pipe step="global-parameters" port="result"/>
		</p:variable>
		<!-- zip up the sequence of documents according to the manifest and stash it in the temporary file -->
		<zip name="zip" xmlns="http://exproc.org/proposed/steps" command="create">
			<p:with-option name="href" select="$zip-file-name"/>
			<p:input port="source">
				<p:pipe step="zip-sequence" port="source"/>
			</p:input>
			<p:input port="manifest">
				<p:pipe step="manifest" port="result"/>
			</p:input>
		</zip>
		<!-- create a request document to read the temporary file back in -->
		<p:identity>
			<p:input port="source">
				<p:inline>
					<c:request method="get"/>
				</p:inline>
			</p:input>
		</p:identity>
		<p:add-attribute match="/c:request" attribute-name="href">
			<p:with-option name="attribute-value" select="$zip-file-name"/>
		</p:add-attribute>
		<!-- Read ZIP file back in. NB explicit dependency on preceding step -->
		<p:http-request cx:depends-on="zip" xmlns:cx="http://xmlcalabash.com/ns/extensions"/>
	</p:group>
</p:declare-step>
@Conal-Tuohy

commented Mar 28, 2019

I would be interested to see an example of @eriksiegel's use case, to better understand why it "always requires a manifest".

@Conal-Tuohy

commented Mar 28, 2019

My 2 cents on the port naming question:

If the step has a source port my expectation would be that it would receive the actual content of the zip, rather than a manifest; i.e. it seems to me that the archive's contents are better described by the word "source" than the manifest would be. I think if people have conflicting intuitions then perhaps it would be better to name the ports with more meaningful names like "manifest" and "content".

@gimsieke

Contributor

commented Apr 18, 2019

  1. Does anyone have an opinion on where to create p:archive, that is, in which optional step spec? I was inclined to create it in the file steps spec. On second thought, p:archive will be able to operate in main memory only, so file system access is not a constitutive part of it. Should we create a distinct optional “archive steps” spec?
  2. It would be nice if we were able to specify a content type for the result. But if p:archive is generic, we cannot use application/zip. I don’t think we can even say that the resulting archive is application/*.
  3. If we migrate the existing pxp:zip step, options such as compression-method, compression-level, and command may not be available for other archiving methods than Zip. Should the current options become items in a parameters map?
    Single-file compression tools such as gzip don’t need a manifest, so this is different, too.
  4. What to do with the href option of pxp:zip? I’m inclined to specify an archive input port (with sequence="true") where people can optionally pass an existing zip file that should be updated. (No more than one though, this needs to be checked by the processor.) They can use <p:with-input port="archive" href="{$zip-location}"/> instead of the href="{$zip-location}" option.
  5. In the light of items 2. and 3. above, do we rather say, let’s have dedicated p:archive-zip and p:archive-unzip steps and either a) skip other archivers in the XProc 3.0 standard steps or b) specify them in distinct steps, such as p:archive-tar, p:archive-bzip2, etc.?
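
Item 4 could be sketched as follows. This assumes the proposed archive input port together with the command option from the original signature (which, per item 3, might end up in a parameters map instead); the step name build-content is hypothetical, and $zip-location is the variable used in the item itself.

```xml
<!-- Sketch only: update an existing zip passed on the proposed "archive" port. -->
<p:archive xmlns:p="http://www.w3.org/ns/xproc" command="update">
  <!-- new or changed documents -->
  <p:with-input port="source">
    <p:pipe step="build-content" port="result"/>
  </p:with-input>
  <!-- the existing archive to update; at most one document is allowed here -->
  <p:with-input port="archive" href="{$zip-location}"/>
</p:archive>
```
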
@eriksiegel

Contributor

commented Apr 18, 2019

My take for question 1: I remember us deciding p:archive is a standard, not optional step...

@gimsieke

Contributor

commented Apr 18, 2019

Ah ok, apparently I missed that.

@ndw

Collaborator

commented Jun 11, 2019

Gerrit has provided new prose; we're assuming that it integrates everything from this issue. We're opening a new issue to track comments on the (now current) proposal.

@ndw ndw closed this Jun 11, 2019
