eml_validate doesn't check all EML validity rules #244
Comments
Thanks, we should definitely do this. Right, the trivial way (which we actually had in EML 1.0) would be to just upload to your online Java validator, but local validation would be nicer. Trying to distill this list a bit to think about how to test:
After that, I get a little more fuzzy. I get the idea that we cannot repeat an object that has an <creator>
<individualName>
<givenName>M</givenName>
<surName>Jones</surName>
</individualName>
</creator> appear twice in the same EML document (e.g. as both creator of metadata and in the author list of some paper cited in methods sections, etc)? Is that correct? It's like it would be basically impossible to detect and enforce that provision though -- without and other things on the above list:
One thing you didn't mention -- isn't it necessary to make sure that an object with an
I think this raises some very interesting larger questions though too. For one, the R package is probably a liability for creating duplicates when references should be used instead (though this can be elegantly solved in
But there's also a deeper question to my mind about the use of
Lastly, I just want to note that, possibly in contrast to some XML tooling, I think JSON-LD really excels at this use of references. Duplicating objects with the same id is permitted in JSON-LD, but compacting or framing can be used to replace all but one of these with simple references, consistent with the EML rule of no duplicates (alternately, you can ask it to embed all reference objects explicitly, which can be nicer for software dev, since things like
I think your analysis is right on @cboettig. Last week we discussed clarifying these rules in the spec, so I filed NCEAS/eml#306 to cover that. However, where you say:
That is not a requirement, and elements can and do get referenced before they are defined. That is one thing that makes validating them impossible within the XSD world, which has a similar feature with key/keyref pairs. The eml-dev archives have a series of threads on why we as a community decided not to use key/keyref. So, in our case, one must accumulate a list of all IDs and all references and then compare them. In the Java parser that accumulation is done via a DOM model, which is why it is so slow on large documents, and why we want to switch to using SAX and a much lighter-weight model. There is an example doc that is slow to process attached to NCEAS/eml#1. The XSLT we created for EML and that we use in various repositories supports resolution of references, but it is not straightforward and I've seen other sites that just ignore references. It's a useful feature, but complex enough that some implementations don't deal well with it.
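To illustrate why the comparison has to wait until every ID and reference has been collected, here is a minimal Python sketch (not the Java EMLParser or the R package). It assumes a reference is written as a `<references>` child element whose text is the target id; the function name `unresolved_references` and the toy document are hypothetical.

```python
import xml.etree.ElementTree as ET


def unresolved_references(xml_text):
    """Return the sorted reference targets that match no id in the document."""
    root = ET.fromstring(xml_text)
    ids = set()
    refs = set()
    for elem in root.iter():
        if "id" in elem.attrib:
            ids.add(elem.attrib["id"])
        # Assumption: a reference is a <references> child holding the target id as text.
        if elem.tag == "references" and elem.text:
            refs.add(elem.text.strip())
    # Compare only after the whole tree has been walked, so that a reference
    # appearing before its definition still resolves.
    return sorted(refs - ids)


# Toy document: "jones" is referenced before it is defined; "smith" is never defined.
doc = """<eml>
  <contact><references>jones</references></contact>
  <creator id="jones"><individualName><surName>Jones</surName></individualName></creator>
  <associatedParty><references>smith</references></associatedParty>
</eml>"""

print(unresolved_references(doc))
```

Only `smith` is reported here: the forward reference to `jones` resolves because the sets are compared after the walk rather than during it.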
Migrating this over to
Note that the validation rules have been written up here: https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/docs/eml-validation-refs.md They will be included in the next EML release (2.2.0).
The `eml_validate` function only checks schema validity. The EML specification mandates several other validity requirements which are not covered by the XSD files, but which are enforced by the EMLParser. As a result, there are EML documents created in R which are invalid and get rejected by repositories even after passing the `eml_validate` check. To fix this, add the other validity checks to `eml_validate` so it is compliant with the specification. Details follow. @maier-m may be willing to help with these changes.
The additional rules beyond schema validation are written in section 3.3 Reusable Content in the EML spec: https://knb.ecoinformatics.org/#external//emlparser/docs/index.html#reusableContent I list them here for convenience:
`system` attribute cannot exist in a single document.
`document` scope is defined as identifiers unique only to a single instance document (if a document does not have a system attribute or if scope is set to 'document' then all IDs are defined as distinct content).
`system` scope is defined as identifiers unique to an entire data management system (if two documents share a system string, then any IDs in those two documents that are identical refer to the same object).
What would be great is if we wrote a function that could check all of these issues as an XML document is being parsed, and then call that after the `xml2::xml_validate` call is made. Both must be valid for the EML document to be considered valid.
The current EMLParser is slow in part because it tries to do these checks in memory by loading the XML document as a DOM, and then querying the DOM for matches. A better algorithm is planned to fix the Java EMLParser (Issue NCEAS/eml#1). In this approach, we would 1) use a SAX parser to parse the EML document, and 2) record all `id`, `reference`, and element details in a data structure as they are encountered, and 3) once the whole document is parsed, do the id/ref comparisons for uniqueness and for following the rules. The same approach could be implemented in R, but I'm not sure if xml2 supports SAX parsing. If not, the loaded XML document might be usable to directly query for rule checking. Let's discuss.
This request originated from a discussion by our data team and recorded in NCEAS/datamgmt#133
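The three-step SAX approach described above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the planned Java EMLParser fix or an xml2 feature. The names `IdRefCollector` and `check_ids_and_references` are hypothetical, references are assumed to be `<references>` child elements containing the target id, and the `system` and scope rules are deliberately left out.

```python
import xml.sax
from collections import Counter


class IdRefCollector(xml.sax.ContentHandler):
    """Steps 1 and 2: record every id attribute and <references> value while streaming."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()   # id value -> number of elements declaring it
        self.refs = []         # text of every <references> element
        self._in_ref = False
        self._buf = []

    def startElement(self, name, attrs):
        if "id" in attrs:
            self.ids[attrs["id"]] += 1
        if name == "references":
            self._in_ref = True
            self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer until the element closes.
        if self._in_ref:
            self._buf.append(content)

    def endElement(self, name):
        if name == "references":
            self.refs.append("".join(self._buf).strip())
            self._in_ref = False


def check_ids_and_references(xml_text):
    """Step 3: after parsing, report duplicate ids and unresolved references."""
    handler = IdRefCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    errors = []
    for value, count in sorted(handler.ids.items()):
        if count > 1:
            errors.append(f"duplicate id: {value}")
    for ref in sorted(set(handler.refs)):
        if ref not in handler.ids:
            errors.append(f"unresolved reference: {ref}")
    return errors


# Toy document with one duplicated id and one dangling reference.
sample = """<eml>
  <creator id="a"/>
  <contact id="a"><references>b</references></contact>
</eml>"""

for problem in check_ids_and_references(sample):
    print(problem)
```

Because the handler keeps only a counter and a list rather than the full document tree, memory use grows with the number of ids and references, not with document size. An empty result from a check like this would then be combined with the schema validation result to decide overall validity.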