Feature Request: URIS for file paths #81

Closed
ross-spencer opened this issue May 24, 2016 · 23 comments

@ross-spencer
Collaborator

ross-spencer commented May 24, 2016

Hi Richard,

Having worked with SF pretty intensely for the reporting tool, I found it difficult to identify files inside archive formats - I first have to identify an archive file by its PUID and log its complete path. I then look for occurrences of that path inside other paths, flagging them as content inside that archive.

Not the most elegant solution!
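
For readers following along, the workaround described above might look roughly like this. It is only a sketch: the record layout, the helper name `files_inside_archives`, and the choice of PUIDs treated as archives are assumptions for illustration, not how the reporting tool actually does it.

```python
# Hypothetical input: (path, PUID) pairs taken from an identification report.
# The set of PUIDs treated as archives is an illustrative assumption.
ARCHIVE_PUIDS = {"x-fmt/263", "x-fmt/265"}


def files_inside_archives(records):
    """Map each archive path to the paths that start with it plus a separator."""
    archives = [path for path, puid in records if puid in ARCHIVE_PUIDS]
    contents = {archive: [] for archive in archives}
    for path, _ in records:
        for archive in archives:
            # A path "inside" an archive repeats the archive path as a prefix.
            if path.startswith((archive + "/", archive + "\\")):
                contents[archive].append(path)
    return contents


records = [
    ("data/xxx.zip", "x-fmt/263"),
    ("data/xxx.zip/xxx.doc", "fmt/40"),
    ("data/other.doc", "fmt/40"),
]
result = files_inside_archives(records)
```

The prefix test is what makes this fragile: every record has to be compared against every archive path, and a real directory that happens to be named like an archive would be flagged too.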

It's something I can achieve in DROID by just looking at the URI_SCHEME which I extract from the URI...

Do you think it's scope creep to add to SF? - or could it be a 'goer' in the YAML output?

Cheers,

Ross

@richardlehane
Owner

The code's already there for the -droid output so this should be a simple addition.

But perhaps as an option (-uri flag) rather than default?

I'm a bit wary about making it part of the standard output as it is:

a) non-utf8. File URIs percent encode non-ascii characters.
b) Java-specific (zip: and tar: etc. aren't part of the file URI scheme specs from the IETF, I don't think; they come from JAR https://docs.oracle.com/javase/7/docs/api/java/net/JarURLConnection.html and Apache Commons VFS http://commons.apache.org/proper/commons-vfs/filesystems.html#Zip_Jar_and_Tar)
c) an extra line (trying to keep the output as concise as possible).
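
To illustrate point (a), here is a small sketch of what percent-encoding does to a non-ASCII file name. It uses Python's standard `urllib.parse.quote` rather than sf's own code, and the path is made up for the example.

```python
from urllib.parse import quote, unquote

# Hypothetical path containing non-ASCII characters.
path = "/data/résumé.doc"

# Each non-ASCII character is encoded as UTF-8 and written as %XX escapes.
encoded = quote(path)
file_uri = "file://" + encoded

# Decoding restores the original name.
decoded = unquote(encoded)
```

Here `file_uri` becomes `file:///data/r%C3%A9sum%C3%A9.doc`, which is valid but much harder to read than the original name.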

@ross-spencer
Collaborator Author

Thanks Richard. Given the use case it's better if it's a feature that is there by default. I also appreciate your concerns and so it sounds like it might need some more thought.

I'll have a think of other ways it might be clearer to indicate a file is inside an archive file. I'm happy my approach is nearly working - it just needs a little more testing. Will be useful to know if it is useful for other users over time as well.

@richardlehane
Owner

Other options we could consider here:

  1. when the -z flag is used, add a new field in the file block, "container", that would name the path of the container immediately containing the file (unzipping is recursive, as you may have zips within zips - you'd need to follow the chain of parents to get the full ancestors).

  2. not adding any new fields at all, but just changing the layout of the filename field. The xxx.zip/xxx.doc paths currently provided for these aren't legal paths anyway so could probably just change the way these get constructed so they are easier for you to parse.
    E.g. [xxx.zip] xxx.doc (I quite like this one as it nests quite nicely: [xxx.tar.gz] [xx.tar] [folder\xxx.zip] anotherfolder\file.doc)
    or xxx.zip!/xxx.doc (like the file uris)

I'm kind of leaning towards one of the (2) options as this will mean keeping output as lean as possible (no additional fields).
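
As a rough idea of how the square-bracket layout could be parsed by a consumer, here is a sketch. The function name `split_bracket_path` is hypothetical, and it assumes container names never contain a `]` character, which real file names could.

```python
import re


def split_bracket_path(name):
    """Split '[a.zip] [b.tar] inner\\file.doc' into (containers, inner path)."""
    containers = []
    rest = name
    while True:
        # Each container is wrapped in brackets and followed by whitespace.
        match = re.match(r"\[([^\]]*)\]\s+", rest)
        if match is None:
            break
        containers.append(match.group(1))
        rest = rest[match.end():]
    return containers, rest


containers, inner = split_bracket_path(
    r"[xxx.tar.gz] [xx.tar] [folder\xxx.zip] anotherfolder\file.doc"
)
```

With the nested example from the comment above, `containers` lists the three enclosing archives from outermost to innermost and `inner` is the path within the last one.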

@ross-spencer
Collaborator Author

I think that option with the square brackets in (2) is pretty neat and clear. I'm keen on that one if you are!


@richardlehane
Owner

richardlehane commented May 30, 2016

This is implemented now on develop branch: 1c489a5

The new output looks like this:
[image: archive paths]

I've had specific feedback in the past about how WARC/ARC paths should look & this change may be a regression for this use case. Would be great to get some feedback esp. from web archivists on whether this change is OK? The next release won't be for a month or so, so should be ample time to consult.

@despens

despens commented May 30, 2016

There have been some discussions inside the KDE community how to write URLs in such cases, since KIO is able to access files within an archive within an archive that is accessed via the network.

AFAIR these were the two proposed solutions:

I think in the end the # solution was chosen which produced URLs like

http://archives.online/downloads/stuff.zip#funny/archive.tar#lol.txt

At least for navigating the file system this worked great with the konqueror file manager/webbrowser. I haven't used KDE in a while, but honestly found this a great feature. Maybe that would be a way for sf to display nested files?

Otherwise, having a container field or creating nested report files would also be ok.

Some inconvenience is going to happen, either archive contents have to be nested in some way (DROID exports a parent column for CSV exports IIRC, YAML and JSON would support nesting natively), or some escaping/encoding has to happen with hierarchy separators.

I feel dealing with encodings and escaping is much more error prone than using nesting features of the chosen output format.

@despens

despens commented May 30, 2016

Here is a note on how gvfs does it vs KIO: http://permalink.gmane.org/gmane.comp.kde.devel.core/61656

I think the gvfs notation is unreadable, and hints towards the nesting option.

@richardlehane
Owner

richardlehane commented May 30, 2016

Thanks Dragan - using JSON/YAML nesting would be quite a change from the current output (which is flat), and I'd prefer, if possible, to get a satisfactory result for this issue just by tweaking one of the output fields. A parent or container field is still definitely an option.

I quite like the "#" approach and it would certainly be better to adopt a notation that is already in use elsewhere, rather than creating an sf-specific one.

Here's the same example from above in the "#" format:

[image: capture]

@despens

despens commented May 30, 2016

Thanks, that looks nicer than I thought!

Maybe you chose an unusual warc record here, but how would that look for a regular html file? I'm especially wondering about the \ (backslash) after the last #, IMHO that should be <datestring>/<url> (forward-slash). Like that it could be used to get that part of the filename and access it via a warc replay mechanism that follows wayback conventions. I wouldn't want to create a parser that reliably exchanges the slashes within that string…

And—I know we had a similar discussion before 😏, but I wonder if "transparent" compression formats like gz and xz, which by themselves cannot hold any hierarchy below them like a zip or tar archive, need to be expanded to the obvious, like in this case:

IAM-20080430204825-00000-blackbook.warc.gz#IAM-20080430204825-00000-blackbook.warc#[...]

For the purpose of accessing a warc record, I don't think there is any relevant tool out there that doesn't support warc+gzip; also since every warc record needs to be a gzip chunk, I wonder if the "whole warc file" is even the right next level in this case.

To me, this looks more like adhering to the book than being very useful, just makes the string longer with info that appears redundant to me at least.

@ross-spencer
Collaborator Author

I'm just trying out the develop branch with the square brackets and that seems okay. Dragan's suggestions are good too and a hash delimiter would work well.

I'm struggling to understand the last examples though and so my expectation would still be something like this for regular tar.gz:

file.tar.gz#file.tar#this-is-a-file.txt

and similarly for WARC like the example above:

webarchive-file.warc.gz#webarchive-file.warc#this-is-a-file.txt

It's just a method to delineate between file system layers and file specific layers, and prevents us asking the question:

1. does it allow hierarchy?
2. if yes, hash
3. if no, don't hash

Which places a (slight) burden on the developer in interpreting the specification, and then another burden on the users interpreting the tool's output, both in understanding the concept of hierarchies within archive formats, and then in parsing that string further if they want to say whether a WARC is inside a GZ or not, likewise a TAR.

It also requires monitoring of the specification, in case, say, GZ enables hierarchical zipping/layering/wrapping in future...

I'd be more worried if adhering to the book was notably unhelpful or harmful.

@richardlehane
Owner

re. the example outputs - I copied and pasted two bits together, so it isn't completely real (you'd expect the warc and warc.gz output to appear before the contents of the warc). I was just trying to show how it would look for two use cases (nested zip/tar and warc.gz).

re. the forward/backslash thing - the filepath separators are OS-specific. I generated the example output on Windows; if I'd done it on Unix the backslash would have gone the other way. It is possible to always use Unix-style separators when making webarchive paths and I'm happy to do that. Not sure if any web archivists use Windows but it would be great to get their opinion on that!

re. transparent gz decoding for tar.gz and warc.gz (Ross, the proposal here is I think not to show that middle #file.tar bit at all and just do file.tar.gz#this-is-a-file.txt) - I'm open to changing this but would like the behaviour to be consistent across gzipped tars, gzipped warcs and gzipped plain files (where there is no container format). I.e. if you have a file.doc.gz - what happens? Do you just give file.doc.gz as the path or do you do file.doc.gz#file.doc?

@ross-spencer
Collaborator Author

Okay, if I understand correctly...

If I just reply based on the files I'm used to dealing with, and reiterate my expected behavior above, then I can create a file that looks like this:

outer-GZ.tar.gz
+
|
+----+ inner-tar.tar
     |
     |
     +----+ inner-file.txt

These are three distinct objects, a GZ, a TAR, and a TXT. In the 'transparent' model looking at an absolute path - I've no knowledge of the 'inner-tar.tar' file name or object.

[image]

And so I'd hope for a path that looks like:

<FSPATH>/outer-GZ.tar.gz#inner-tar.tar#inner-file.txt

or:

<FSPATH>/outer-GZ.tar.gz#/inner-tar.tar#/inner-file.txt
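
A consumer of either form could recover the layers with something like the following sketch. The function name `split_hash_path` is hypothetical, `/srv/data` stands in for `<FSPATH>`, and the sketch assumes that `#` never occurs inside a real file name, which is not guaranteed.

```python
def split_hash_path(path):
    """Split 'outer.tar.gz#inner.tar#file.txt' into its layers."""
    first, *nested = path.split("#")
    # Accept both the '#name' and the '#/name' variants.
    return [first] + [layer.lstrip("/") for layer in nested]


plain = split_hash_path("/srv/data/outer-GZ.tar.gz#inner-tar.tar#inner-file.txt")
slashed = split_hash_path("/srv/data/outer-GZ.tar.gz#/inner-tar.tar#/inner-file.txt")
```

Both calls give the same three layers: the file system path of the GZ, the TAR inside it, and the TXT inside that.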

Another set of examples here when talking about URIs:

http://stackoverflow.com/a/9678657

It looks like as a discussion theme it's quite a common one!

@despens

despens commented May 31, 2016

For the transparency discussion and URLs, here is an example of transparent gzip compression:

$ ls
inner.txt
$ tar -czf outer.tar.gz inner.txt # create a gzipped tar file with inner text inside
$ ls
inner.txt  outer.tar.gz
$ tar -ztvf outer.tar.gz # list the contents of the tar file
-rw-rw-r-- despens/despens   0 2016-05-31 07:59 inner.txt
$ mv outer.tar.gz outer-renamed.tar.gz # rename the tar file
$ gunzip outer-renamed.tar.gz # unzip the renamed tar file
$ ls
inner.txt  outer-renamed.tar

The gzip format does not store any information about what it compressed, it is just compression. Not even the name of the original "file" (more like a stream for gzip) that was compressed is available, instead, the extension .gz is removed from whatever is the name of gzipped file. Changing the name of the compressed file will change the name of the file that is the result of decompression. So in listing a hierarchy of outer.tar.gz#outer.tar#inner.txt, outer.tar is actually wrong, there is no outer.tar inside that outer.tar.gz. It is just convention that gzip will strip the .gz extension from the filename.

This will never change for any stream-based compression processes like gzip, xz, bzip, etc, so there is no need to monitor the specs. These tools are designed to work with data that can be arbitrarily streamed through them; they are not so much file formats as transport encodings, and the transport can even be as short as from my local harddisk to my local RAM.

So putting an imaginary uncompressed filename into the hierarchy is not really "harmful", but rather redundant, and it suggests that there is an option for another file name or a hierarchy when there is none.

The problem with "nested protocols" as discussed in Ross' example about apache-vfs is I think the mis-understanding that an encoding or a file format is a protocol. URLs like zip:zip:rar://outer.zip/inner.zip/innermost.rar/finally.txt are kinda crazy, this is like writing http:gif://example.com/empty.gif. In this regard I believe apache-vfs is badly designed. From a programmer's point of view it might make sense to record for every step of the way what the required routines are that need to be called, but actually this should be decided based on what renderers are available locally, on demand. In that spirit, it makes sense to have a separator like # to indicate that there is a break in the flow of access and sub-hierarchies cannot be accessed with the same methods as the hierarchies before the #. The information on how to get into that part of the hierarchy should be available in the Siegfried report for that point in the hierarchy.

IMHO! 😄 I understand the way of introducing the imaginary decompressed tar file is kind of dictated by how PRONOM is designed in the first place. That is, again IMHO, a shortcoming that tools like Siegfried could work around.

Regarding the slash in WARC files after the date string: a WARC file would not be accessed via the file system, but via a WARC replay mechanism like pywb or OpenWayback, via the http protocol. The forward slash is used in http to separate path hierarchies. The URL would look like http://mywarcreplay.com/20081224064531/http://example.com. So in web archives, no matter if they're running on Windows or a Unix-like system, the slash would be forward.
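
A small sketch of why the forward slash matters here: if the last layer of the path is written as `<datestring>/<url>`, it can be appended to a replay endpoint unchanged. The function names and the WARC file name below are hypothetical, and the hash-separated layout is only the form under discussion in this thread, not confirmed sf output.

```python
def warc_entry_path(warc_file, datestring, url):
    """Build a record path, always joining datestring and URL with '/'."""
    return f"{warc_file}#{datestring}/{url}"


def replay_url(replay_base, datestring, url):
    """Build a wayback-style replay URL for the same record."""
    return f"{replay_base.rstrip('/')}/{datestring}/{url}"


entry = warc_entry_path("crawl.warc.gz", "20081224064531", "http://example.com")
replay = replay_url("http://mywarcreplay.com", "20081224064531", "http://example.com")
```

The part of `entry` after the `#` is identical to the tail of `replay`, so no slash conversion is needed regardless of the operating system that produced the report.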

@ross-spencer
Collaborator Author

I think you're saying that compression can simply be viewed as an encoding of another file (?) but I'm not sure we're expressing the entirety of the gz specification if we think this is always the case, see:

[image]

https://tools.ietf.org/html/rfc1952

I'm not really thinking of PRONOM models here, I just know that (however we express it), in other applications like 7Zip I can retrieve an object of type tar with its own filename from a gzip and treat that discretely as a tar file, whatever I want to do with it.

I can see how your CLI example handles the tar.gz transparently, but however unlikely the use-case we can see that in other cases the combination can be extracted over two stages, so more verbosely than that.

@despens

despens commented May 31, 2016

I didn't know about that optional Latin1 encoded file name, and stand corrected!

@ross-spencer
Collaborator Author

Yep! :)

I just found it in gunzip if it helps (-N):

[image]

@richardlehane
Owner

[image: capture]

This is the bit of sf that handles gzip names. You see it uses the gzip-encoded name if it is present (not often I've found) but otherwise just tries to strip the last extension from the gzip file.
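
For anyone without the screenshot to hand, the behaviour described can be sketched as follows. This is not sf's code: it is a Python illustration that reads the optional FNAME field from the gzip header as laid out in RFC 1952, and the function names are made up for the example.

```python
import gzip
import io
import struct

FEXTRA, FNAME = 0x04, 0x08  # header flag bits defined in RFC 1952


def gzip_embedded_name(data):
    """Return the FNAME field of the first gzip member, or None if absent."""
    if len(data) < 10 or data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flags = data[3]
    pos = 10  # fixed header: ID1 ID2 CM FLG MTIME(4) XFL OS
    if flags & FEXTRA:
        # Skip the optional extra field: 2-byte little-endian length, then data.
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    if not flags & FNAME:
        return None
    end = data.index(b"\x00", pos)  # the name is zero-terminated Latin-1
    return data[pos:end].decode("latin-1")


def decompressed_name(gz_filename, data):
    """Prefer the embedded name; otherwise strip the last extension."""
    embedded = gzip_embedded_name(data)
    if embedded:
        return embedded
    stem, dot, _ = gz_filename.rpartition(".")
    return stem if dot else gz_filename


def make_gzip(payload, name):
    """Test helper: gzip a payload in memory, storing `name` as FNAME if non-empty."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode="wb", fileobj=buf) as gz:
        gz.write(payload)
    return buf.getvalue()


named = make_gzip(b"hello", "original.tar")
unnamed = make_gzip(b"hello", "")
```

For a file renamed to `renamed.tar.gz`, `decompressed_name` returns `original.tar` when the name is stored in the stream and falls back to `renamed.tar` when it is not.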

A possible, and possibly crazy, mid-way suggestion between the Ross and Dragan camps here might be to change this function so that, if the gzip file contains an encoded filename, it returns that name concatenated with the parent, e.g. file.tar.gz#file.tar. If, however, there is no name encoded within the file, then rather than doing the ugly extension trimming and concatenation, it could just return the parent name, i.e. file.tar.gz, for the contents.

This approach would exhibit Ross-preferred behaviour when a name is embedded in the gzip file and Dragan-preferred behaviour when there is no embedded name (probably the case for most warc.gz). It has the benefit of being the most truthy approach, but the risk of seeming a bit of a gzip lottery to most users!
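
In sketch form, the mid-way rule is a two-branch function. The name `content_path` and its signature are hypothetical, and this is Python for illustration rather than a proposed change to sf's code.

```python
def content_path(parent, embedded_name=None):
    """Mid-way naming: add a '#name' layer only when gzip stores a name."""
    if embedded_name:
        return f"{parent}#{embedded_name}"
    # No stored name: report the contents under the parent's own name.
    return parent


with_name = content_path("file.tar.gz", "file.tar")
without_name = content_path("file.tar.gz")
```

The two results differ only because of what happens to be in the gzip header, which is the "gzip lottery" risk mentioned above.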

p.s. happy to take as an additional action from this thread to change the warc/arc paths so they always use unix slashes

@ross-spencer
Collaborator Author

I'm stuck on which way to vote as I'd like to find the appropriate middle ground, and see the best for the community also.

On one hand I can respect the concept of 'truthy'. I also respect the concept of users being skilled at decision making and things only being as simple as they need to be not simpler.

On the other hand I'm keen to see the verbose method... technically, without an FNAME in the GZ, the GZ compression can still be unwrapped to a TAR, and so I'm interested in saying there are two objects - and therefore two (not always inclusive of each other) techniques I need to be aware of to manipulate it - a GUNZIP and an UNTAR. That's a statement I feel is important for digital preservation - even if some tools do this work for us, not all tools will.

Also, knowing that there are definitively two objects in the file path means that I don't have to remove a false positive from an incorrect filename extension, e.g. someone naming a tar 'tar.gz' or vice versa. I would have to do this by consulting at least two records in the SF output - the container(s), and the file I'm interested in inside the container.

@richardlehane
Owner

Yes, that's a concern for me too: if we say that an un-gzipped tar's file name is still file.tar.gz, then we would have many more "extension mismatch" warnings appearing in the current implementation.

We can change the filename matching routine however to mitigate this. It could accept multiple filename extensions, rather than just one as at present, and try to match on all of them (i.e. ".gz" and ".tar" could both be matched in this case). This might be useful in other contexts too as I sometimes see files like "truth.wb3.doc" where the proper extension is wb3 but someone has tried to stick another extension on it to open in word. Or "important.doc.old" etc.
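
A rough sketch of that idea, using Python's `pathlib` to collect every extension on a name. The function name `extension_matches` is hypothetical and this is not the sf matching routine.

```python
from pathlib import PurePath


def extension_matches(filename, expected_extensions):
    """True if any of the file's extensions is among the expected ones."""
    # PurePath.suffixes returns every dotted suffix, e.g. ['.tar', '.gz'].
    found = {suffix.lstrip(".").lower() for suffix in PurePath(filename).suffixes}
    return bool(found & {ext.lower() for ext in expected_extensions})


tar_ok = extension_matches("file.tar.gz", {"tar"})
wb3_ok = extension_matches("truth.wb3.doc", {"wb3"})
old_ok = extension_matches("important.doc.old", {"doc"})
zip_ok = extension_matches("file.tar.gz", {"zip"})
```

The first three checks succeed and the last one fails. One trade-off is that names with dots in the stem, such as version numbers, contribute extra pseudo-extensions, so a match becomes somewhat more permissive.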

In the middle ground approach you still would also get two objects in your output: the gzip file and the un-gzipped file. It is just that sometimes those objects would have the same name e.g. file.tar.gz twice (when there is no file name encoded within the gzip stream) and sometimes they wouldn't e.g. file.tar.gz and file.tar.gz#file.tar (when file.tar is encoded as a name within the gzip stream).

This might of course be a PITA if you are expecting that filename field within results to be unique!

For me the main issue would be the seemingly inconsistent behaviour in the output (for users that don't appreciate the nuances of gzip filenames)... i.e. the output might just seem odd if sometimes they are getting a sub-gzip filename and sometimes they aren't.

@despens

despens commented Jun 1, 2016

I think the main goal should be that sf reports are consistent across many different formats. As a web person, I only know gzip et al as transparent compression. To me, a wrl.gz is not of type gzip, but of type VRML. But I understand that there are other cases, and ignoring some information in a report is easier than not having it when you need it.

Thinking further, if sf would support for example going down into ISO 9660 files or hard disk images, that might look like a transparent format again, but in this case I personally would want to know what is the format of that disk image and what file systems it uses, what are contained partitions etc. For other users who are only interested in files (since all of their disk images will be Windows XP NTFS anyway), it might seem redundant.

So my vote now goes to verbose. (But stand by my claim that inside of WARC files, gzip compression should be handled transparently if it is documented in the Content-Encoding header.)

The problem of the same file being repeated with a compressed tar.gz and an uncompressed tar.gz is probably not that grave since it will be expressed as inner.tar.gz#inner.tar.gz, so a folder-within-a-folder type of situation. I think it would be acceptable to strip the second extension .gz from the file name since all gzip tools I laid my hands on would do that by default. This might change in the future, but I also don't expect a PRONOM report to be valid for eternity.

@richardlehane
Owner

I've implemented the hash paths on the develop branch. I will leave the rest of the gzip behaviour alone for present.

Outstanding action items are to:

  • output unix (rather than OS-specific) separators in the paths within WARC/ARC files
  • as future features, explore addition of other container formats (e.g. ISOs, .rar, 7z, .bz) for decompression with -z flag

Thanks for all the contributions on this thread, it's been informative!

@despens

despens commented Jun 2, 2016

Thanks for your great work Richard! Siegfried 4 lyfe!

@richardlehane
Owner

richardlehane commented Jun 26, 2016

hash (#) paths implemented in sf 1.6.0
