Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implementation does not follow metadata specification #1

Closed
johnscancella opened this issue Feb 28, 2017 · 29 comments
Closed

implementation does not follow metadata specification #1

johnscancella opened this issue Feb 28, 2017 · 29 comments

Comments

@johnscancella
Copy link

according to https://tools.ietf.org/html/draft-kunze-bagit-14#section-2.2.2 you are allowed to repeat metadata elements and that ordering is significant. However, looking at https://github.com/reeset/bagit_sharp/blob/master/bagit_sharp/Bag.cs#L13-L30
(not knowing C#) it seems like you don't allow certain "headers" to be repeated, nor with using a hashtable are you preserving the ordering of the "custom headers". If this is correct it goes against the bagit specification

@reeset
Copy link
Owner

reeset commented Feb 28, 2017

Interesting -- that is true. I've not preserved order nor allowed these items to be repeated; in part because I didn't see that in the python implementation (I also missed it in the spec). I'll take a look at this section.

@johnscancella
Copy link
Author

yeah, the various libraries sometimes forget small implementation details like that (in fact the previous bagit-java libraries did as well - which is how I know to check for it).

That is why the Library of Congress created the conformance suite, so that we didn't have to remember it all and just test against known issues.

Otherwise great work on the library in such a short time. Keep hacking on it and I am sure it will see lots of use in the community.

@reeset
Copy link
Owner

reeset commented Feb 28, 2017

Thanks -- the funny thing, I didn't see this in the conformance suite. The tool would have flagged an error if there any of the items elements were repeated. Might need to submit a pull request with a new element for the conformance testing suite to include this (either that, or I missed it -- because I primarily testing the 0.96 and 0.97 files)

@johnscancella
Copy link
Author

It does contain it
https://github.com/loc-rdc/bagit-conformance-suite/blob/master/v0.96/valid/duplicate-metadata-entries

To catch it you would need to read the bag and then write it back out and compare the before and after. So I guess it just depends on how you are checking against the conformance suite?

@reeset
Copy link
Owner

reeset commented Feb 28, 2017

Thanks -- question; looking at the specification, it says all fields could be repeatable. How is that possible? It seems that having two Payload-Oxum in the info file would cause significant validation problems. It would seem that only some of these fields, in practice, could be repeatable.

@johnscancella
Copy link
Author

You are correct in that it doesn't make sense for Payload-Oxum, which is why we are talking internally about moving it to the bagit.txt file since it is a kind of a special case. Otherwise the bag-info.txt is really just a list of key value pairs, what those key value pairs represent is opaque to the bag

@reeset
Copy link
Owner

reeset commented Feb 28, 2017

I would think the same would be said for Bag-Size (since it should represent the Payload-Oxum) and the Bagging-Date (since this should be generated on bag). I could see potentially wanting to keep multiple dates to show if things have been modified, but I would think having a specific flag, something like: Bag-Date-History, as a comma separated list separated from bagging date would make more sense.

@johnscancella
Copy link
Author

johnscancella commented Feb 28, 2017

The bagit specification doesn't define how to use bag-info.txt entries, that is up to the individual implementations. In the case of bagit-java, if you have multiple Payload-Oxum entries it just uses the first one and ignores the others. But this in no way affects if a bag is complete or valid, it is additional functionality outside of the bagit specification.

So in short, you can do weird things like multiple entries that may not make sense. But it is not the purpose of the bagit specification to ensure your metadata makes sense, just that it hasn't changed(which includes multiples and order).

Because you can do such weird things we recently added a linter ability in the latest bagit-java library and to make sure it conforms to a profile

@reeset
Copy link
Owner

reeset commented Feb 28, 2017

That is an interesting approach, because it potentially creates incompatible bags across systems. Say I decided the last payload file was the valid one...bags in my system would be valid and process, but that wouldn't be the case with other tools (like in bagit-java), even though technically, I'd be correctly using the spec. Given that my assumption is that this specification was created to make sharing easier, there definitely are places where prescribing rules to metadata makes sense and eliminates this kind of ambiguity.

@reeset
Copy link
Owner

reeset commented Feb 28, 2017

And thank you for the note about the profile. I had seen the message on the google group and was wondering if this was actively being used anywhere, though, as it is not part of the official spec, it's assuming a lot of good faith agreement in a space (preservation), where I'd think you'd want to be much more prescriptive.

@johnscancella
Copy link
Author

Incompatible across systems or tools? If a tool provides functionality outside of the bagit specification then it has to deal with all the cases that comes with. In the case of Payload-Oxum I suspect it was a hack that was put in place because we tend to have very large bags (1TB) and doing a full verification takes a long time.

@reeset
Copy link
Owner

reeset commented Feb 28, 2017

I'm not sure I see a difference since your system (or tool) will need to handle input and output. While the Oxum may be a hack, looking at the programming libraries available, it appears to be how most libraries (programming libraries) do top level validation. You can certainly go deeper, but I'm not sure if that is the practice if the shallow validation fails? Essentially, the tool is asked to make its own interpretation of the specification to support validation functionality. In the case of the java library, you'd mentioned that the decision was made to use the first Oxum for validation. Maybe another tool uses the last one. In both cases, the spec has been followed, but the interpretation by the tools has differed because the specification has left this too ambiguous (at least, it would seem that way to me). It also puts a lot of onus on the local institution to remember what interpretation they used when creating bags for storage. If the system is updated, or a tool set changes -- the assumptions may change as well, and bags that are perfectly well formed by the specification and one tools interpretation may suddenly all fail and require rebagging (assuming I'm understanding the process correctly)

@johnscancella
Copy link
Author

No where in the bagit specification does it mention using payload-oxum, thus it doesn't affect if a bag is valid. Quick verification is additional functionality outside of the bagit-specification and thus tool dependent - there is no getting around that unless quick verification is specifically added to the specification.

If you want it to be added to the specification send a pull request to https://github.com/jkunze/bagitspec

@reeset
Copy link
Owner

reeset commented Feb 28, 2017

That is an interesting take on the specification. I wouldn't read it that way in a million years -- especially since the spec tells you what the data is for, so if defined, it has to be valid -- otherwise, its not a valid bag according to the specification. The fact that tools build validation around that isn't a coincidence in my mind, because again, the specification tells you what is in this field, and if the data in that field isn't valid, the bag by definition, has to be invalid. This is why the potential of having multiple values is so problematic.

@johnscancella
Copy link
Author

spec tells you what the data is for, so if defined, it has to be valid

https://tools.ietf.org/html/draft-kunze-bagit-14#page-10 look at the definition of valid.
the only part that could apply is

  1. Every element present MUST comply with this specification

but nowhere in the specification does it say how to validate the bag-info.txt entries (just that the form needs to be <LABEL>: <VALUE>).

For example, how would you validate the entry External-Identifier? By definition you can't since it is external to the bagit tool.

@reeset
Copy link
Owner

reeset commented Feb 28, 2017

I disagree. I read the definition of valid, and I would argue that in some elements, like the Payload-Oxum, the specification explicitly tells you not just how the data should be formatted, but exactly what data must be in the field to be valid. While I may not be able to validate an external identifier, I can validate data that is explicitly defined, as machine readable. And the specification does make that clear:

Payload-Oxum The "octetstream sum" of the payload, namely, a two-
part number of the form "OctetCount.StreamCount", where OctetCount
is the total number of octets (8-bit bytes) across all payload
file content and StreamCount is the total number of payload files.
Payload-Oxum should be included in "bag-info.txt" if at all
possible. Compared to Bag-Size (above), Payload-Oxum is intended
for machine consumption.

If this value is present, it must conform to this description, or the data isn't valid. If it must conform to this description, then a bag isn't valid unless the value designated here can be reproduced. If I read valid strictly as you have noted here, I could make anything I want up and put it in this field, which I'd be surprised if that was the either the intention or the intent.

@acdha
Copy link

acdha commented Feb 28, 2017

Payload-Oxum is definitely an outlier: all of the other fields are intended to only be human-meaningful and so the enforcement is just the basic key:value structure. I agree that it should be an error to use Payload-Oxum with a value which does not match the specification's format but that then leaves it as the only field in bag-info.txt with a name and value defined in the specification.

Along with other spec updates we've been considering whether we should move this functionality to the bagit.txt file since that file has always been defined as machine-oriented and doesn't have the history of people editing it by hand or otherwise storing arbitrary content like bag-info.txt. The proposed update also split the value into two fields to avoid the need to define the custom format:

https://github.com/loc-rdc/bagitspec/pull/2

@reeset
Copy link
Owner

reeset commented Feb 28, 2017

John had mentioned this, and it would make a lot of sense. My big concern is that this would be a breaking change to the spec. I'm not familiar with the general history of how the bagit spec has evolved, but I'm assuming that these kinds of breaking changes would generally be avoided.

@acdha
Copy link

acdha commented Feb 28, 2017

That's something we're being sensitive to: Payload-Oxum was always optional since it's just an optimization for validation and I'm not aware of an implementation which breaks in its absence but we are looking to see if anything breaks if you add additional fields to bagit.txt (theoretically validation would check the version number but we can't assume that).

@reeset
Copy link
Owner

reeset commented Feb 28, 2017

Can I ask one more follow-up. I've been thinking about how to make the changes needed to fix this, and I just wanted to clarify -- the isn't case-sensitive; i.e., if in a baginfo file, I had the following:
Author: Terry
author: terry
AuTHor: terry

These would be three values, but all with the label AUTHOR. Is it important that these label casing stay retained if one was modifying the manifest?

@acdha
Copy link

acdha commented Feb 28, 2017

That's a very good question which the spec does not answer. The reserved metadata values are listed as case-insensitive (https://tools.ietf.org/html/draft-kunze-bagit-14#page-6) so an application would either need to special-case those or store everything insensitively.

I think we should clarify this in the next draft of the spec. I generally don't find the complexity worth it but I suspect backwards compatibility will require implementing case-insensitive-but-preserving access to this since human-edited bag info is pretty mainstream.

@reeset
Copy link
Owner

reeset commented Feb 28, 2017

My preference is definitely insensitive casing. I can preserve the casing on output if necessary, but I'm thinking that if I'm using my library -- if I asked for author -- I'd like all three variations of the tag, but if I preserved casing, I if this had special meaning to others, that would still be protected.

@johnscancella
Copy link
Author

yeah I didn't have that problem in the bagit-java library since the data structures preserve the case by default.

@acdha
Copy link

acdha commented Feb 28, 2017

The way I was thinking about implementing this in bagit-python will be to preserve the values but have a fallback path which does the case-insensitive access on a lookup error so the value written to disk would always be what you set but bag.info['foobar'] would retrieve "foobar", "FooBar", "fooBar", etc.

There's some prior art for this since RFC 2616 makes HTTP headers case-insensitive so there may be a library already available for reuse.

@reeset
Copy link
Owner

reeset commented Feb 28, 2017

So, I was thinking of taking the same approach Chris.

@acdha
Copy link

acdha commented Feb 28, 2017

I haven't looked at your code at all yet but I would definitely recommend taking the approach where every key returns a list even if it has only one element so people aren't surprised the first time they get a back with multiple values.

I'm going to add some extra edge cases to the conformance suite for this.

@reeset
Copy link
Owner

reeset commented Feb 28, 2017

So, what I've done is setup an bag_info object (much like I think you've done with the python). This will always return a list. But I also have bag specific values (oxum, bag-size, bagging-date) that correspond to this one bag -- these single values as they relate to the current bag loaded. Personally, I think that these three fields (oxum, size, and date) shouldn't be repeated. For the purposes of my code, I'm going to take the first element, and treat it as the element that represents the bag (when loading a manifest). When writing a new manifest, I won't likely allow these fields to be repeated, or, at least, will require someone to mark the bag as unsafe, because having these values duplicate just seems like a terrible idea. In validation, I'll at least warn users that duplicating these fields, while technically correct, should be avoided. Rather, I'd prefer to see bag date moved into a history note (bagging-date-history) with a comma delimited list, and have the bagging-date, the bag-size, and the payload-oxum be reserved terms.

@acdha
Copy link

acdha commented Mar 1, 2017

I agree that the mix of single and multiple values is sub-optimal. Based on the bug reports which we get, however, I would be prepared to get complaints about anything like that since people have a lot of legacy content and I wouldn't be surprised if you had to do something like set a flag so bags which arrived with dubious values can be saved but new ones either cannot or require a “I know what I'm doing” confirmation first.

reeset added a commit that referenced this issue Mar 2, 2017
…ibrary to include the following:

1) Support for Updating the Manifest file
2) Support for duplicate metadata tags
3) Protecting user written tags
4) Improved validation to include validation beyond the payload-oxum
5) Added functions to allow for bag-header modifications
6) Updated bag loading support for < 0.96 and the package-info.txt file.
7) Additional checksum option (sha384)
@reeset
Copy link
Owner

reeset commented Mar 2, 2017

Last update and release point fixes this issue.

@reeset reeset closed this as completed Mar 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants