validating EML 2.2/schema location #292

Open
scelmendorf opened this issue Dec 30, 2019 · 7 comments

@scelmendorf

I'm having trouble generating/validating 2.2. I think the problem is that the second part of the schema location is just eml.xsd rather than the full path, but unsure. Reproducible example below w/comments

library (EML)
me <- list(individualName = list(givenName = "Carl", surName = "Boettiger"))
my_eml <- list(dataset = list(
  title = "A Minimal Valid EML Dataset",
  creator = me,
  contact = me)
)

#validates in R but not in oxygen?
write_eml(my_eml, "ex.xml")
eml_validate("ex.xml")

#add in the correct schema location
my_eml$schemaLocation <- "https://eml.ecoinformatics.org/eml-2.2.0 https://nis.lternet.edu/schemas/EML/eml-2.2.0/xsd/eml.xsd"
write_eml(my_eml, "ex_2.xml")

#after this validates when read from file
eml_validate("ex_2.xml")

#but not when working on the not-yet-written-out eml object in R?
eml_validate(my_eml)
@amoeba
Collaborator

amoeba commented Dec 30, 2019

Hi @scelmendorf, thanks for filing an issue. It looks like there are a few things going on.

Re:

#validates in R but not in oxygen?

What error(s) does Oxygen report? I don't have a license over here to test with.

Re:

#add in the correct schema location

XML validation is a pitfall-laden part of working with XML. The EML package does set the xsi:schemaLocation for the eml namespace by default to a local path, which really only works under certain circumstances. My understanding of this part of the XML spec is that schemaLocation is merely a hint, and whatever's doing the validating may use it or not. So no value is "incorrect", per se. That said, others here might think we should change our default schemaLocation to something web-resolvable rather than a local path.
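To see what schemaLocation actually ends up in the written file, something like the following should work. This is only a sketch: it assumes the xml2 package is installed, and the exact attribute values you see will depend on your installed version of EML.

```r
library(EML)
library(xml2)

me <- list(individualName = list(givenName = "Carl", surName = "Boettiger"))
my_eml <- list(dataset = list(
  title = "A Minimal Valid EML Dataset",
  creator = me,
  contact = me)
)

# Write to a temporary file so nothing is left behind in the working directory
f <- tempfile(fileext = ".xml")
write_eml(my_eml, f)

# List every attribute on the root element, including the schemaLocation
# (plus packageId and system if write_eml() filled them in)
attrs <- xml_attrs(xml_root(read_xml(f)))
print(attrs)
```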

#but not when working on the not-yet-written-out eml object in R?

This looks like a bug. @cboettig it looks like eml_validate(foo) (validating an in-memory object) isn't equivalent to write_eml(foo, bar) -> eml_validate(bar) (writing to disk, then validating an on-disk doc), though I think they oughta be. What do you think? Here's an MRE (copied from @scelmendorf):

> library(EML)
> me <- list(individualName = list(givenName = "Carl", surName = "Boettiger"))
> my_eml <- list(dataset = list(
+   title = "A Minimal Valid EML Dataset",
+   creator = me,
+   contact = me)
+ )
> eml_validate(my_eml)
[1] FALSE
attr(,"errors")
[1] "Element '{https://eml.ecoinformatics.org/eml-2.2.0}eml': The attribute 'packageId' is required but missing."
[2] "Element '{https://eml.ecoinformatics.org/eml-2.2.0}eml': The attribute 'system' is required but missing."   

@scelmendorf
Author

@amoeba - Oxygen error is (on line 2):
"Cannot find the declaration of element 'eml:eml'."

@cboettig
Member

@amoeba Thanks!

That behavior was intentional, though maybe misguided.

write_eml() adds a packageId using a UUID if no packageId has been assigned (and of course system refers to the packageId). I did this intentionally so that a user could create a minimal EML file like the one above, where all of the elements are intuitive. Asking a new user to create a packageId (and the corresponding system!) is, I think, way less intuitive, so I thought being able to generate one on the fly made sense (and it matches the behavior of earlier versions of EML). But perhaps that was a mistake and users should be forced to set it manually.

Clearly the list constructor has no mechanism to automatically generate a packageId, and I think it would be weird if it did, since list constructors are designed so that you can build things up piecewise, so there's no expectation that they should be valid.

We could allow eml_validate to call write_eml first, where it would automatically add the packageId if missing, but I think that is also misguided. IMO the above eml fragment is really a fragment and it shouldn't validate.

I agree it's a bit weird that it validates when you call write_eml, I think the best fix would be that write_eml should throw a warning if no packageId is found, and explain that it is adding an ID automatically. I'd love a PR for that if we have consensus on that.
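As a rough illustration of that proposal (not the package's actual code), a user-side wrapper could look like the sketch below. The name write_eml_notify is hypothetical, and it assumes the EML object is a plain list with packageId at the top level, as in the example earlier in this thread.

```r
library(EML)

# Hypothetical wrapper, not part of the EML package: warn before write_eml()
# silently fills in a packageId on the user's behalf.
write_eml_notify <- function(eml, file, ...) {
  # [[ ]] avoids the partial name matching that $ does on lists
  if (is.null(eml[["packageId"]])) {
    warning("No packageId found; write_eml() will generate one automatically.",
            call. = FALSE)
  }
  write_eml(eml, file, ...)
}
```

Calling write_eml_notify(my_eml, "ex.xml") on the minimal example would then emit the warning before writing the file, while a list that already carries a packageId is written without comment.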

Also, I know we've discussed the schemaLocation issue before and the potential security risk of using a resolvable URL as the schemaLocation instead of the locally installed copy of the schema, but if it makes the EML we generate more compatible with other tools, perhaps we should use that as our default schemaLocation instead. IMO the security issue is more the responsibility of the user and the other external tools. Open to discussion as to whether this would mean that the R package uses the local copy or the online copy to validate (we could at least check hashes or something, though it's nice to be able to validate offline!).

@mobb

mobb commented Jan 3, 2020

There are a couple of topics coming up here, so I will try to address them individually. Note: comments based on my experience with a subset of EML-builders who work with LTER and EDI. Those users typically pre-assign the packageId, and the PASTA system also checks schema validity. Sometimes we suggest commercial XML editors like Oxygen.

Re validation within R in general, assigning packageIds:
IMO, it’s a great idea to validate the EML as you go! And as long as EML-builders understand those two errors (packageId missing, system missing), they can do that in R - as is - with eml_validate(my_eml).

EML-builders should see the error; so I agree, this is misguided:

We could allow eml_validate to call write_eml first, where it would automatically add the packageId if missing, but I think that is also misguided.

The temporary packageId and system are fine (as inserted by write_eml). However, users should be aware that this is happening, so I recommend a message to the screen if write_eml inserts these. For those who control their packageIds, the msg will help remind them to assign it.

Re schemaLocation:
Does the R-EML package always use its own internal version? I.e., never the contents of the schemaLocation attribute? (I confess to not reading all the documentation.)

IMO current behavior is fine -- the simple, local filename is a good choice for a default schemaLocation, as this can always be overwritten and appropriate for concerns already outlined by others.

Some EML-builders will want to (a) use the OxygenXML editor or (b) point schemaLocation to a URL. They may need to learn how to set the schemaLocation attribute, and to validate with several tools. However, teaching them is the responsibility of their communities (e.g., EDI, LTER); not the R code.

So bottom line(s):

  • no changes to code behavior, but consider adding a msg if a placeholder packageId was added.
  • Communities who promote certain EML/XML tools should show EML-builders how to validate (code can’t anticipate every possibility).

@amoeba
Collaborator

amoeba commented Jan 4, 2020

Thanks @cboettig, @mobb.

@scelmendorf we're discussing a number of things now that don't directly help you out so I want to address your original issues first. To avoid the errors about packageId and system you get before you write to disk, you'll need to set them ahead of time like:

my_eml <- list(packageId = "mypackage", 
               system = "mysystem",
               dataset = list( # continued...

And, as you found out, if you write_eml first, these two values are filled in automatically. The why of the errors is that those are required attributes on the root eml element, as defined in the schema. There is some info there about how to use those attributes but feel free to ask about them.

As for the second part of your issue, Oxygen not being able to validate your documents without modification: your guess is probably right, and editing the schemaLocation is probably the best fix. Some XML validation tools, Oxygen included, also support defining a catalog of local schemas, which would work too. You'd have to deal with this in other tools as well, not just Oxygen; it's a general XML problem and not something specific to EML or this package.
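Spelled out in full, the snippet above combined with the original example would look roughly like this. The values "mypackage" and "mysystem" are just placeholders; with both attributes set, the two errors reported earlier in this thread should no longer appear.

```r
library(EML)

me <- list(individualName = list(givenName = "Carl", surName = "Boettiger"))

# packageId and system are set up front; both values are placeholders
my_eml <- list(
  packageId = "mypackage",
  system = "mysystem",
  dataset = list(
    title = "A Minimal Valid EML Dataset",
    creator = me,
    contact = me
  )
)

# Validate the in-memory object directly, without writing it to disk first
result <- eml_validate(my_eml)
print(result)
```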

Does that help you out enough? It's a bit of a half fix, though it looks like we're keen to discuss some quality of life improvements that'd make what you experienced less painful.

@cboettig, @mobb, mind if I open two new issues to discuss (1) the possibility of adding warnings/messages when write_eml fills in packageId and system on behalf of the user and (2) changing the default value of schemaLocation to an HTTP-resolvable copy of the root schema?

@mobb, Re:

Does the R-EML package always use its own internal version?

Looks like it does always try a local copy stored with the package, though this package uses the xml2 package which may have other behavior.
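If you want to check against a specific schema copy yourself, xml2 can validate a document against a schema you pass explicitly. The sketch below is an assumption-heavy example: the path xsd/eml-2.2.0/eml.xsd and the idea that the schema ships inside either the EML or the emld package are guesses about the installed layout, so the code looks in both places and skips validation if neither is found.

```r
library(EML)
library(xml2)

me <- list(individualName = list(givenName = "Carl", surName = "Boettiger"))
my_eml <- list(dataset = list(
  title = "A Minimal Valid EML Dataset",
  creator = me,
  contact = me)
)

f <- tempfile(fileext = ".xml")
write_eml(my_eml, f)
doc <- read_xml(f)

# Assumed locations of the bundled 2.2.0 root schema; system.file()
# returns "" when the package or file does not exist
candidates <- c(
  system.file("xsd/eml-2.2.0/eml.xsd", package = "EML"),
  system.file("xsd/eml-2.2.0/eml.xsd", package = "emld")
)
schema_path <- candidates[nzchar(candidates)][1]

if (!is.na(schema_path)) {
  # Validate against the explicitly chosen schema file
  print(xml_validate(doc, read_xml(schema_path)))
} else {
  message("No bundled eml.xsd found at the assumed paths; skipping validation.")
}
```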

@scelmendorf
Author

Thanks @amoeba. I can add the schema location (that was my original solution - I just didn't know if that was intentional), and have been putting in the packageId; I just came across the write_eml issues when trying to make a reproducible example for the schemaLocation. My 2 cents is that it would be easier for most users if the schemaLocation default were HTTP-resolvable, and yes, warnings about filling in a default packageId and system would be useful.

@cboettig
Member

cboettig commented Jan 6, 2020

Thanks all! yes, new issues / PRs would be great for both the warning message and a remote schemaLocation default value. (or I'll get around to that sooner-or-later!)
