Feature Request: URIS for file paths #81
Comments
The code's already there for the -droid output so this should be a simple addition. But perhaps as an option (-uri flag) rather than default? I'm a bit wary about making it part of the standard output as it is: a) non-UTF-8. File URIs percent-encode non-ASCII characters. |
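To illustrate the percent-encoding concern, here is a minimal Python sketch. It is not sf code: the helper name `to_file_uri` and the sample path are made up for illustration, and it only assumes an absolute POSIX-style path.

```python
from urllib.parse import quote


def to_file_uri(posix_path: str) -> str:
    """Build a file URI from an absolute POSIX path (hypothetical helper)."""
    # quote() leaves "/" alone and replaces every non-ASCII character
    # with the %XX form of its UTF-8 bytes.
    return "file://" + quote(posix_path)


# Hypothetical path, not taken from real sf output.
print(to_file_uri("/archive/résumé.doc"))
```

The readable "é" becomes "%C3%A9" in the URI, which is the loss of legibility being weighed against having URIs in the default output.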
Thanks Richard. Given the use case it's better if it's a feature that is there by default. I also appreciate your concerns and so it sounds like it might need some more thought. I'll have a think of other ways it might be clearer to indicate a file is inside an archive file. I'm happy my approach is nearly working - it just needs a little more testing. Will be useful to know if it is useful for other users over time as well. |
Other options we could consider here:
I'm kind of leaning towards one of the (2) options as this will mean keeping output as lean as possible (no additional fields). |
I think that option with the square brackets in (2) is pretty neat and On Wed, May 25, 2016 at 12:30 PM, Richard Lehane notifications@github.com
|
This is implemented now on develop branch: 1c489a5 The new output looks like this: I've had specific feedback in the past about how WARC/ARC paths should look & this change may be a regression for this use case. Would be great to get some feedback esp. from web archivists on whether this change is OK? The next release won't be for a month or so, so should be ample time to consult. |
There have been some discussions inside the KDE community how to write URLs in such cases, since KIO is able to access files within an archive within an archive that is accessed via the network. AFAIR these were the two proposed solutions: I think in the end the
At least for navigating the file system this worked great with the konqueror file manager/webbrowser. I haven't used KDE in a while, but honestly found this a great feature. Maybe that would be a way for sf to display nested files? Otherwise, having a container field or creating nested report files would also be ok. Some inconvenience is going to happen, either archive contents have to be nested in some way (DROID exports a parent column for CSV exports IIRC, YAML and JSON would support nesting natively), or some escaping/encoding has to happen with hierarchy separators. I feel dealing with encodings and escaping is much more error prone than using nesting features of the chosen output format. |
Here is a note on how gvfs does it vs KIO: http://permalink.gmane.org/gmane.comp.kde.devel.core/61656 I think the gvfs notation is unreadable, and hints towards the nesting option. |
Thanks Dragan - using JSON/YAML nesting would be quite a change from the current output (which is flat), and I'd prefer, if possible, to get a satisfactory result for this issue just by tweaking one of the output fields. A parent or container field is still definitely an option. I quite like the "#" approach and it would certainly be better to adopt a notation that is already in use elsewhere, rather than creating an sf-specific one. Here's the same example from above in the "#" format: |
Thanks, that looks nicer than I thought! Maybe you chose an unusual warc record here, but how would that look for a regular html file? I'm especially wondering about the And—I know we had a similar discussion before 😏, but I wonder if "transparent" compression formats like
For the purpose of accessing a warc record, I don't think there is any relevant tool out there that doesn't support warc+gzip; also since every warc record needs to be a gzip chunk, I wonder if the "whole warc file" is even the right next level in this case. To me, this looks more like adhering to the book than being very useful, just makes the string longer with info that appears redundant to me at least. |
I'm just trying out the develop branch with the square brackets and that seems okay. Dragan's suggestions are good too and a hash delimiter would work well. I'm struggling to understand the last examples though and so my expectation would still be something like this for regular tar.gz:
and similarly for WARC like the example above:
It's just a method to delineate between file system layers and file specific layers, and prevents us asking the question:
Which places a (slight) burden on the developer in interpreting the specification, and then another burden on the users interpreting the tool's output, both in understanding the concept of hierarchies within archive formats, and then in parsing that string further if they want to say whether a WARC is inside a GZ or not, likewise a TAR. It also requires monitoring of the specification in case, say, GZ enables hierarchical zipping/layering/wrapping in future... I'd be more worried if adhering to the book was notably unhelpful or harmful. |
re. the example outputs - I copied and pasted two bits together, so it isn't completely real (you'd expect the warc and warc.gz output to appear before the contents of the warc). I was just trying to show how it would look for two use cases (nested zip/tar and warc.gz).
re. the forward/backslash thing - the filepath separators are OS-specific. I generated the example output on Windows; if I'd done it on Unix the backslash would have gone the other way. It is possible to always use Unix-style separators when making webarchive paths and I'm happy to do so. Not sure if any web archivists use Windows but it would be great to get their opinion on that!
re. transparent gz decoding for tar.gz and warc.gz (Ross, the proposal here is I think not to show that middle #file.tar bit at all and just do file.tar.gz#this-is-a-file.txt) - I'm open to changing this but would like the behaviour to be consistent across gzipped tars, gzipped warcs and gzipped plain files (where there is no container format). I.e. if you have a file.doc.gz - what happens? Do you just give file.doc.gz as the path or do you do file.doc.gz#file.doc? |
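As a rough illustration of the separator point, the sketch below normalises a Windows-style path to Unix-style slashes. The function name and sample path are hypothetical, and it assumes backslashes only ever appear as separators (on Unix a backslash can be a legitimate file name character, so this would not be safe there).

```python
def to_unix_separators(path: str) -> str:
    """Replace Windows backslashes with forward slashes (illustrative only)."""
    return path.replace("\\", "/")


# Hypothetical path as it might be generated on Windows.
print(to_unix_separators(r"crawls\2016\example.warc.gz"))
```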
Okay, if I understand correctly... If I just reply based on the files I'm used to dealing with, reiterating my expected behavior above, then I can create a file that looks like this:
These are three distinct objects, a GZ, a TAR, and a TXT. In the 'transparent' model looking at an absolute path - I've no knowledge of the 'inner-tar.tar' file name or object. And so I'd hope for a path that looks like:
or:
Another set of examples here when talking about URIs: http://stackoverflow.com/a/9678657 It looks like as a discussion theme it's quite a common one! |
For the transparency discussion and URLs, here is an example of transparent gzip compression:
$ ls
inner.txt
$ tar -czf outer.tar.gz inner.txt # create a gzipped tar file with inner text inside
$ ls
inner.txt outer.tar.gz
$ tar -ztvf outer.tar.gz # list the contents of the tar file
-rw-rw-r-- despens/despens 0 2016-05-31 07:59 inner.txt
$ mv outer.tar.gz outer-renamed.tar.gz # rename the tar file
$ gunzip outer-renamed.tar.gz # unzip the renamed tar file
$ ls
inner.txt outer-renamed.tar
The gzip format does not store any information about what it compressed, it is just compression. Not even the name of the original "file" (more like a stream for gzip) that was compressed is available, instead, the extension This will never change for any stream-based compression processes like gzip, xz, bzip, etc, so there is no need to monitor the specs. These tools are designed to work with data that can be arbitrarily streamed through them; they are not so much file formats as transport encodings, and the transport can even be as short as from my local hard disk to my local RAM. So putting an imaginary uncompressed filename into the hierarchy is not really "harmful", but rather redundant, and it suggests that there is an option for another file name or a hierarchy when there is none.
The problem with "nested protocols" as discussed in Ross' example about apache-vfs is I think the misunderstanding that an encoding or a file format is a protocol. URLs like IMHO! 😄 I understand the way of introducing the imaginary decompressed tar file is kind of dictated by how PRONOM is designed in the first place. That is, again IMHO, a shortcoming that tools like Siegfried could work around.
Regarding the slash in WARC files after the date string: a WARC file would not be accessed via the file system, but via a WARC replay mechanism like pywb or OpenWayback, via the http protocol. The forward slash is used in http to separate path hierarchies. The URL would look like |
I think you're saying that compression can simply be viewed as an encoding of another file (?) but I'm not sure we're expressing the entirety of the gz specification if we think this is always the case, see: https://tools.ietf.org/html/rfc1952 I'm not really thinking of PRONOM models here, I just know that (however we express it), in other applications like 7Zip I can retrieve an object of type tar with its own filename from a gzip and treat that discretely as a tar file, whatever I want to do with it. I can see how your CLI example handles the tar.gz transparently, but however unlikely the use-case, we can see that in other cases the combination can be extracted over two stages, so more verbosely than that. |
I didn't know about that optional Latin1 encoded file name, and stand corrected! |
This is the bit of sf that handles gzip names. You see it uses the gzip-encoded name if it is present (not often, I've found) but otherwise just tries to strip the last extension from the gzip file. A possible, and possibly crazy, mid-way suggestion between the Ross and Dragan camps here might be to change this function so that, if the gzip file contains an encoded filename, it returns that name concatenated with the parent e.g. file.tar.gz#file.tar. If, however, there is no name encoded within the file, rather than do the ugly extension trimming and concatenation, it could just return the parent name i.e. file.tar.gz for the contents. This approach would exhibit Ross-preferred behaviour when a name is embedded in the gzip file and Dragan-preferred behaviour when there is no embedded name (probably the case for most warc.gz). It has the benefit of being the most truthy approach, but the risk of seeming a bit of a gzip lottery to most users! p.s. happy to take as an additional action from this thread to change the warc/arc paths so they always use unix slashes |
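The mid-way suggestion above can be sketched in Python as follows. This is not the sf implementation (sf is not written in Python); the function names `gzip_embedded_name`, `contents_path` and `make_gzip` are hypothetical, and the header layout is taken from RFC 1952 (optional FEXTRA field, then optional zero-terminated Latin-1 FNAME field).

```python
import gzip
import io
import struct
from typing import Optional

FEXTRA, FNAME = 0x04, 0x08  # flag bits defined in RFC 1952


def gzip_embedded_name(data: bytes) -> Optional[str]:
    """Return the FNAME field of the first gzip member, or None if absent."""
    if len(data) < 10 or data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flags = data[3]
    pos = 10  # fixed header: magic, CM, FLG, MTIME, XFL, OS
    if flags & FEXTRA:
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen  # skip the extra field
    if not flags & FNAME:
        return None
    end = data.index(b"\x00", pos)  # FNAME is zero-terminated
    return data[pos:end].decode("latin-1")


def contents_path(parent: str, data: bytes) -> str:
    """Hypothetical naming rule: parent#name if a name is embedded, else parent."""
    name = gzip_embedded_name(data)
    return f"{parent}#{name}" if name else parent


def make_gzip(payload: bytes, filename: str) -> bytes:
    """Test helper: gzip a payload, embedding filename unless it is empty."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=filename, mode="wb", fileobj=buf) as gz:
        gz.write(payload)
    return buf.getvalue()


named = make_gzip(b"pretend tar bytes", "file.tar")
unnamed = make_gzip(b"pretend tar bytes", "")
print(contents_path("file.tar.gz", named))    # file.tar.gz#file.tar
print(contents_path("file.tar.gz", unnamed))  # file.tar.gz
```

The two printed paths show the "gzip lottery": the same archive name yields a different contents path depending only on whether the compressor wrote FNAME.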
I'm stuck on which way to vote as I'd like to find the appropriate middle ground, and see the best for the community also. On one hand I can respect the concept of 'truthy'. I also respect the concept of users being skilled at decision making and things only being as simple as they need to be, not simpler. On the other hand I'm keen to see the verbose method... technically without an FNAME in the GZ the GZ compression can still be unwrapped to a TAR and so I'm interested in saying there are two objects - and therefore two (not always inclusive of each other) techniques I need to be aware of to manipulate it - a GUNZIP and an UNTAR. That's a statement I feel is important for digital preservation - even if some tools do this work for us, not all tools will. Also, understanding that there are definitively two objects in the file path means that I don't have to remove a false positive from an incorrect filename extension, e.g. just naming a tar, tar.gz or vice versa. I would have to do this by consulting at least two records in the SF output - the container(s), and the file I'm interested in, inside the container. |
Yes, that's a concern for me too: if we say that an un-gzipped tar's file name is still file.tar.gz then we would have many more "extension mismatch" warnings appearing in the current implementation. We can change the filename matching routine however to mitigate this. It could accept multiple filename extensions, rather than just one as at present, and try to match on all of them (i.e. ".gz" and ".tar" could both be matched in this case). This might be useful in other contexts too as I sometimes see files like "truth.wb3.doc" where the proper extension is wb3 but someone has tried to stick another extension on it to open in Word. Or "important.doc.old" etc. In the middle ground approach you still would also get two objects in your output: the gzip file and the un-gzipped file. It is just that sometimes those objects would have the same name e.g. file.tar.gz twice (when there is no file name encoded within the gzip stream) and sometimes they wouldn't e.g. file.tar.gz and file.tar.gz#file.tar (when file.tar is encoded as a name within the gzip stream). This might of course be a PITA if you are expecting that filename field within results to be unique! For me the main issue would be the seemingly inconsistent behaviour in the output (for users that don't appreciate nuances of gzip filenames)... i.e. the output might just seem odd if sometimes they are getting a sub-gzip filename and sometimes they aren't. |
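A minimal Python sketch of the multi-extension idea, assuming the matcher is given a set of extensions expected for the identified format. The function name `extension_matches` is hypothetical and this is not how sf's matching routine is written.

```python
from pathlib import PurePosixPath
from typing import Iterable


def extension_matches(filename: str, expected: Iterable[str]) -> bool:
    """True if any dot-separated suffix of filename is an expected extension."""
    # "file.tar.gz" has suffixes [".tar", ".gz"], so both are candidates.
    suffixes = {s.lstrip(".").lower() for s in PurePosixPath(filename).suffixes}
    return bool(suffixes & {e.lower() for e in expected})


print(extension_matches("file.tar.gz", {"tar"}))       # True
print(extension_matches("truth.wb3.doc", {"wb3"}))     # True
print(extension_matches("important.doc.old", {"pdf"})) # False
```

One trade-off of this looser rule is that any dotted fragment of a name counts as a candidate extension, so it suppresses more mismatch warnings than strictly necessary.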
I think the main goal should be that sf reports are consistent across many different formats. As a web person, I only know gzip et al as transparent compression. To me, a wrl.gz is not of type gzip, but of type VRML. But I understand that there are other cases and ignoring some information in a report is easier than not having it when you need it. Thinking further, if sf supported for example going down into ISO 9660 files or hard disk images, that might look like a transparent format again, but in this case I personally would want to know what the format of that disk image is, what file systems it uses, what the contained partitions are, etc. For other users who are only interested in files (since all of their disk images will be Windows XP NTFS anyway), it might seem redundant. So my vote now goes to verbose. (But I stand by my claim that inside of WARC files, gzip compression should be handled transparently if it is documented in the The problem of the same file being repeated with a compressed tar.gz and an uncompressed tar.gz is probably not that grave since it will be expressed as |
I've implemented the hash paths on the develop branch. I will leave the rest of the gzip behaviour alone for present. Outstanding action items are to:
Thanks for all the contributions on this thread, it's been informative! |
Thanks for your great work Richard! Siegfried 4 lyfe! |
hash (#) paths implemented in sf 1.6.0 |
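For anyone post-processing the output, a rough Python sketch of how a hash path could be split into its container and its layers. The function names are hypothetical, the sample path follows the file.tar.gz#file.tar style used earlier in this thread, and it assumes "#" never appears inside an ordinary file name or an archived URL, which may not hold for real data.

```python
from typing import List, Optional


def container_of(path: str) -> Optional[str]:
    """Return the immediate container of a hash path, or None for a plain file."""
    parent, sep, _ = path.rpartition("#")
    return parent if sep else None


def layers(path: str) -> List[str]:
    """Split a hash path into its layers, outermost first."""
    return path.split("#")


nested = "file.tar.gz#file.tar#this-is-a-file.txt"
print(container_of(nested))      # file.tar.gz#file.tar
print(layers(nested))            # ['file.tar.gz', 'file.tar', 'this-is-a-file.txt']
print(container_of("file.doc"))  # None
```

This replaces the earlier workaround of searching for one record's path inside other records' paths: a single field is enough to tell whether a file sits inside an archive.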
Hi Richard,
Having worked with SF pretty intensely for the reporting tool, I've found it difficult to identify files inside of archive formats - I have to first identify an archive file format using the PUID, and log its complete path. I then look for occurrences of its path inside other paths - flagging them as content inside that archive.
Not the most elegant solution!
It's something I can achieve in DROID by just looking at the URI_SCHEME which I extract from the URI...
Do you think it's scope creep to add this to SF? - or could it be a 'goer' in the YAML output?
Cheers,
Ross