
Reading XML files from an archive with weird filenames #364

Open
xatapult opened this issue Mar 9, 2020 · 21 comments
@xatapult commented Mar 9, 2020

I ran into a use case for which we have no elegant solution in XProc 3.0 but which is quite common. @gimsieke, I think this is rather important for transpect too!

MS Office files (like .docx and .xlsx) are zip files with all kinds of XML inside. Some members have filenames ending in .rels. Since this is not a standard XML file extension, p:unarchive does not recognize them as XML and loads them as binary. But you need the XML to interpret the Office format correctly.

The root cause, of course, is that we load the contents of an archive in one go using p:unarchive. There is no mechanism to specify the content types of the individual files; you have to leave that to the processor, which does its best but is sometimes wrong.

So how can we solve this? I see several solutions:

  1. We do nothing. To convert the binary to XML, you have to write it to disk (to a temp file) and then read it in again. This works but is expensive and rather inelegant.
  2. We add some feature to (probably) p:cast-content-type that forces it to view a binary file as text, so you can convert it to XML after it's loaded.
  3. We add a mechanism to p:unarchive to help the processor guess the file types. Maybe just a content-type attribute that forces all unarchived files to this type, and then use p:unarchive to load the files one by one or in controlled sets using include-filter and exclude-filter.

Actually, my preference would be to implement both 2 and 3. Both are useful in their own way, and other use cases would probably benefit as well.
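The workaround in option 1 might look like this sketch, assuming the binary .rels document arrives on the default input port; the temp URI is made up for illustration:

```xml
<!-- Option 1 workaround sketch: round-trip the binary document through
     disk so it can be reloaded with an explicit content type.
     The temp URI is hypothetical. -->
<p:store href="file:///tmp/part.rels"/>
<p:load href="file:///tmp/part.rels" content-type="application/xml"/>
```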

@gimsieke commented Mar 9, 2020

My understanding was that processors can use any method that they deem appropriate in order to guess the correct media type. So I think they will at least support common XML archive members with unorthodox (or no) extensions.

In other functional languages one might solve this by passing a function item to the step: a function that accepts the file name and returns the content type.

Function items aren’t well supported in XProc 3.0, but maps are. So what about a map(xs:string, xs:string) with regular expressions as keys and content types as values? The semantics would be: If this map option is given and if a file name matches the regex supplied as a key, then the content type supplied as a value is returned, and the archive member is treated as if it were read with p:load with the respective content-type option.

If a given file name matches multiple keys and these keys are associated with different content types, it is undefined which content type is selected.

Otherwise (if no key matches), the processor-dependent mapping rules apply that would apply in the absence of this map option.
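A sketch of how such an option might look on p:unarchive. The option name content-type-map is hypothetical; nothing like it exists in the spec:

```xml
<!-- Hypothetical 'content-type-map' option per the proposal above:
     regex keys, content-type values. -->
<p:unarchive>
  <p:with-input href="some-file.docx"/>
  <p:with-option name="content-type-map"
                 select="map{ '\.rels$' : 'application/xml' }"/>
</p:unarchive>
```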

@xatapult commented Mar 9, 2020

I like Gerrit's solution. It doesn't sound too hard to implement. Opinions?

@xatapult commented Mar 9, 2020

Is an extension to p:cast-content-type to treat binary as text when needed something to be considered as well?

Once I ran into it, it looked like an oversight that we can't get from binary to XML.

@gimsieke commented Mar 9, 2020

I agree that it would be nice to use p:cast-content-type to attempt to parse an “other” document as XML, HTML, text, or JSON. Haven’t we discussed this already?

@ndw commented Mar 10, 2020

I don't recall if we discussed the p:cast-content-type issue or not, but yes, I think attempts to cast binary to XML, HTML, text, or JSON should succeed if the actual octets can be parsed into that format given their actual content type (which may include an encoding, I fear).

The only hesitation I have about the map idea on p:unarchive is that the problem it's solving applies more broadly than just p:unarchive. I've had to extend the JVM content-type list several times in the course of testing and demos. I would rather not have a solution that works for p:unarchive but doesn't work for p:load (for example). I wonder whether the JVM content-type solution works in p:unarchive. It seems like it should, but I haven't tested it.

See: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8039362

@xatapult commented Mar 10, 2020

@ndw The solution for p:unarchive as proposed (with the map) is IMHO necessary only for p:unarchive because p:unarchive loads a whole bunch of files in one go.

The situation for p:load is rather different: You load a single file and the step already has a content-type option to override the automatic type guessing.
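For comparison, overriding the guessed type for a single file with p:load's existing content-type option looks like this (the file name is for illustration only):

```xml
<!-- p:load already has a content-type option that overrides the
     processor's type guessing for a single document: -->
<p:load href="document.xml.rels" content-type="application/xml"/>
```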

@xatapult commented Mar 10, 2020

I created a new issue #365 for further discussion of the extensions to p:cast-content-type.

This issue is now for discussing an optional extension to p:unarchive only.

@ndw commented Mar 10, 2020

I agree that the details of p:unarchive vs p:load are different, but they're also similar. If I get URIs from some other step, it may not be straightforward to specify the override content type. In fact, if what's being loaded is some external, exploded archive, then the use case might even be exactly the same.

If we can avoid having to have two (or more) different mechanisms for things that are closely related or the same, that would be better. IMHO.

@xml-project commented Mar 15, 2020

FYI: In MorganaXProc-IIIse 0.9.1.6 I implemented the following solution to the original problem with p:unarchive:
The parameters option now accepts an entry 'mox:content-types', which has to be associated with an instance of array(array(*)). Therefore you can call p:unarchive like so:

<p:unarchive include-filter="\S+\.rels"
  parameters="map{'mox:content-types' :
    [
      ['\S+\.rels', 'application/xml']
    ]
  }" xmlns:mox="http://www.xml-project.com/morgana">
  <p:with-input href="some-file.docx"/>
</p:unarchive>

For each entry in the array, an array with cardinality 2 is expected: the first member is taken as a regular expression, the second as a content type.
The array is inspected in order: if the path of the unarchived document matches the regex, the given content type is taken to interpret the unarchived document, and no further regex is considered. If no regex matches, the processor's standard association applies.
The implementation is clearly inspired by @gimsieke's suggestion in his comment, but I went for arrays instead of a map. The basic advantage is that arrays have order, so they can be inspected in turn, while with a map I would have to inspect ALL entries and raise an error if more than one matched. Additionally, it allows users to specify lists of regexes with increasing levels of granularity.

@gimsieke commented Mar 15, 2020

Another question is whether the regexes are implicitly anchored, that is, whether you need to specify a regex that matches the file name front-to-back. This seems to be your approach (\S+\.rels), while in mine you can specify \.rels$. I think any specification or implementation should document whether matching is auto-anchored or not.

@gimsieke commented Mar 15, 2020

It probably should be interpreted in the same way as the include-filter and exclude-filter regexes, of which we also don’t say whether they are auto-anchored or not, at least when used on p:unarchive.

For p:directory-list, our examples imply that they are not auto-anchored, that is, you can also specify a regex that matches parts of the relative path.

I think we should unify all these filters on p:unarchive and p:directory-list:

  1. The regexes see the complete relative paths in the archive or below the directory given in $path, not just the base names of archive/directory members.
  2. The regexes match parts of the name, they are not tacitly anchored. If a pipeline author wants to anchor them, they need to use ^…$.
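Under these unification rules, the two steps would interpret their filters identically. A sketch with illustrative paths:

```xml
<!-- Rule 1: the regex sees the full relative path.
     Rule 2: it is not anchored, so '\.rels$' matches any member
     whose path ends in .rels, at any depth. -->
<p:unarchive include-filter="\.rels$">
  <p:with-input href="some-file.docx"/>
</p:unarchive>

<!-- The same convention on p:directory-list: '^word/' matches
     entries whose relative path starts with word/. -->
<p:directory-list path="file:///some/dir" include-filter="^word/"/>
```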
@xml-project commented Mar 15, 2020

@gimsieke Please mind that all entries in the parameters option of p:unarchive are implementation-defined. So what I stated above is information about my solution, not a proposal for anything.

@xatapult commented Mar 16, 2020

No, but we could turn it into a proposal and add it (under a different, non-Morgana-specific name) to the spec, right?

@ndw commented Mar 16, 2020

Would this be more useful globally or on a per-p:unarchive basis?

@xatapult commented Mar 18, 2020

There is a (I think) low-impact solution to this: add a content-type option to p:unarchive. If you specify it, all unarchived files are treated as this content type. Then unarchive the files one by one, or in batches that share the same content type. What you need up front is some idea of the content types of the archive's entries. See #368
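A sketch of that idea; the content-type option on p:unarchive is the proposal here, not an existing feature:

```xml
<!-- Proposed: force every member matched by the filter to one type,
     batching members that share a content type. -->
<p:unarchive include-filter="\.(xml|rels)$"
             content-type="application/xml">
  <p:with-input href="some-file.docx"/>
</p:unarchive>
```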

@xatapult commented Mar 18, 2020

We take Achim's idea and add an option to both p:unarchive and p:archive-manifest to specify content type overrides.

@xatapult xatapult self-assigned this Mar 18, 2020
@xatapult commented Mar 23, 2020

@gimsieke About not anchoring the filename matching regexps:

Can we state that the regexp matching works like the XPath matches() function (with 2 parameters)?

matches(path-to-match, regexp)
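If two-argument matches() semantics apply, the regexes are not anchored by default. For example (illustrative paths):

```xpath
matches('word/_rels/document.xml.rels', '\.rels$')   (: true: matches at the end of the path :)
matches('word/_rels/document.xml.rels', '^\.rels$')  (: false: anchored, the whole path would have to be '.rels' :)
matches('word/document.xml', '\.xml')                (: true: matches anywhere in the path :)
```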
@gimsieke commented Mar 23, 2020

I think so

@xatapult commented Mar 23, 2020

Do we need a check on the value of a content type (the second members of the inner arrays), or is it just a string, so that you can specify [ ['.rels$', ' what a nonsense &#10; '] ]?

If we need checking, what would be the definition of a content-type?

There is already an error for specifying an invalid content type (XC0130), coming from p:http-request. Maybe we can re-use this?

If so, maybe there are more places where we need to add something about this (p:cast-content-type?)

@xatapult commented Mar 25, 2020

This will be covered by an error listed in the core spec.

@xatapult commented Apr 2, 2020

Done with #379, but the debate is not finished; maybe some additional changes are necessary.

@xatapult xatapult mentioned this issue Apr 2, 2020