EML::eml_validate conflicts with knb.ecoinformatics.org parser & appears to introduce invalid xml into valid files #348

Open
RobLBaker opened this issue Jan 4, 2023 · 1 comment

RobLBaker commented Jan 4, 2023

This is fairly odd behavior and may be specific to older EML files. I suspect that eml_validate() is not backwards-compatible with EML schema 2.1.1. However, explicitly specifying schema 2.1.1 changes the errors reported without resolving the problem.

I downloaded an older data package with metadata built under EML 2.1.1. I checked the validity of the EML file using https://knb.ecoinformatics.org/emlparser and found that it passed both the XML-specific and EML-specific tests. I then read the file into R, where EML::eml_validate() reported that it contained invalid EML. When I wrote the object back to .xml, some elements of the original EML file were re-arranged. When I re-ran the parser tests at knb.ecoinformatics.org, the newly exported file failed the XML-specific tests. Is the EML package introducing invalid XML into (valid?) EML-formatted .xml files?

I downloaded the following data package: https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-and.4780.4 and ran the file, "knb-lter-and.4780.4.xml" through the EML parser at https://knb.ecoinformatics.org/emlparser/. The file passed both XML-specific and EML-specific tests.

I read the file into R using EML::read_eml(), checked whether it validated using EML::eml_validate() (with schemas 2.1.1 and 2.2.0), and then wrote it back to XML using EML::write_eml():

mymeta <- EML::read_eml("knb-lter-and.4780.4.xml", from = "xml")

The EML does not validate using schema 2.2.0. Perhaps this is not unexpected, given that it was created under 2.1.1:

EML::eml_validate(mymeta)
[1] FALSE
attr(,"errors")
[1] "Element 'boundingCoordinates': This element is not expected. Expected is one of ( geographicDescription, references )."
(and 17 additional identical errors are listed)

Switched to schema 2.1.1:

emld::eml_version("eml-2.1.1")
[1] "eml-2.1.1"
EML::eml_validate(mymeta)
[1] FALSE
attr(,"errors")
[1] "Element '{https://eml.ecoinformatics.org/eml-2.2.0}eml': The attribute 'packageId' is required but missing."
[2] "Element '{https://eml.ecoinformatics.org/eml-2.2.0}eml': The attribute 'system' is required but missing."
[3] "Element '{https://eml.ecoinformatics.org/eml-2.2.0}eml': Missing child element(s). Expected is one of ( access, dataset, citation, software, protocol )."

In this case the EML still doesn't validate, but the errors are different. Despite switching to schema 2.1.1, eml_validate() appears to still be checking against version 2.2.0 (note the 2.2.0 namespace in the error messages). It also no longer reports any problems with the geographic coverage, or is simply not getting far enough to report them.
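One way to see which schema version a file actually declares is to inspect the namespace on its root element. The following is a minimal sketch, assuming the xml2 package is installed. The inline document is a made-up stand-in for a 2.1.1 file, not the real data package, and the 2.1.1 namespace URI shown is my assumption of what such a file declares.

```r
library(xml2)

# Made-up stand-in for the root of an EML 2.1.1 file (not the real package).
# The namespace URI below is an assumption about what 2.1.1 files declare.
eml_stub <- '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
  packageId="example.1.1" system="example"><dataset/></eml:eml>'

doc <- read_xml(eml_stub)

# xml_ns() returns the prefix-to-URI mappings declared in the document;
# the "eml" prefix carries the schema version.
declared_ns <- xml_ns(doc)[["eml"]]
print(declared_ns)
```

Pointing read_xml() at a file written by write_eml() instead of the inline string should show whether the 2.2.0 namespace was stamped onto the root, which would be consistent with the '{https://eml.ecoinformatics.org/eml-2.2.0}eml' prefix in the errors above.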

In any case, I can then write the object back to .xml:

EML::write_eml(mymeta, "exportedEML.xml")

The newly exported "exportedEML.xml" file now contains the namespace conflicts described in issue #347, even though I specified the EML 2.1.1 schema before calling EML::write_eml().

When I now check the exportedEML.xml file using the EML parser at https://knb.ecoinformatics.org/emlparser/, I find that although it passes the EML-specific tests, it fails the XML-specific tests:

XML specific tests: Failed
The following errors were found:
cvc-complex-type.2.4.a: Invalid content was found starting with element 'boundingCoordinates'. One of '{geographicDescription, references}' is expected.

Has the EML package introduced invalid XML into the file?

Further comparison of the .xml files indicates that various elements within the original knb-lter-and.4780.4.xml have been re-arranged in the exportedEML.xml file. Specifically, in the original knb file, there are 18 elements listed under <spatialSamplingUnits> with the following general format:

<spatialSamplingUnits>
     <coverage>
          <geographicDescription>HJA Phenology Sites</geographicDescription>
          <boundingCoordinates>
               <westBoundingCoordinate>-122.26083000</westBoundingCoordinate>
               <eastBoundingCoordinate>-122.11159208</eastBoundingCoordinate>
               <northBoundingCoordinate>44.28199677</northBoundingCoordinate>
               <southBoundingCoordinate>44.20198189</southBoundingCoordinate>
               <boundingAltitudes>
                    <altitudeMinimum>1314</altitudeMinimum>
                    <altitudeMaximum>1314</altitudeMaximum>
                    <altitudeUnits>meter</altitudeUnits>
               </boundingAltitudes>
          </boundingCoordinates>
     </coverage>

Whereas in the exportedEML.xml file, the corresponding elements have the following arrangement:

<spatialSamplingUnits>
     <coverage>
          <boundingCoordinates>
               <westBoundingCoordinate>-122.26083000</westBoundingCoordinate>
               <eastBoundingCoordinate>-122.11159208</eastBoundingCoordinate>
               <northBoundingCoordinate>44.28199677</northBoundingCoordinate>
               <southBoundingCoordinate>44.20198189</southBoundingCoordinate>
               <boundingAltitudes>
                    <altitudeMinimum>1314</altitudeMinimum>
                    <altitudeMaximum>1314</altitudeMaximum>
                    <altitudeUnits>meter</altitudeUnits>
               </boundingAltitudes>
          </boundingCoordinates>
          <geographicDescription>HJA Phenology Sites</geographicDescription>
     </coverage>

As you can see, the children of <coverage> have been re-arranged into alphabetical order, which seems to be the default approach for EML::write_eml() when handling invalid EML. But was the EML invalid in this case? knb's EML parser says it was valid. If the original file was valid EML, then the EML package appears to be taking valid EML and turning it into an invalid document that fails the XML-specific tests (and the EML::eml_validate() test). Either way, I would not expect reading and then writing a (valid?) EML file to introduce these sorts of changes.
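To find every affected block without diffing the files by eye, the child order can be checked programmatically. This is a rough sketch, assuming the xml2 package is installed. coverage_order_ok() is a hypothetical helper written for this illustration, and the inline XML is a trimmed stand-in for one block of the exported file. The only rule it encodes is the one from the parser error above: geographicDescription is expected before boundingCoordinates.

```r
library(xml2)

# Trimmed stand-in for one <coverage> block as it appears in exportedEML.xml.
exported_stub <- '
<spatialSamplingUnits>
  <coverage>
    <boundingCoordinates>
      <westBoundingCoordinate>-122.26083000</westBoundingCoordinate>
    </boundingCoordinates>
    <geographicDescription>HJA Phenology Sites</geographicDescription>
  </coverage>
</spatialSamplingUnits>'

doc <- read_xml(exported_stub)

# Hypothetical helper: TRUE when geographicDescription comes before
# boundingCoordinates (or when either element is absent).
coverage_order_ok <- function(coverage) {
  kids <- xml_name(xml_children(coverage))
  desc <- match("geographicDescription", kids)
  box <- match("boundingCoordinates", kids)
  is.na(desc) || is.na(box) || desc < box
}

# One logical value per <coverage> element found in the document.
coverages <- xml_find_all(doc, ".//coverage")
ok <- vapply(coverages, coverage_order_ok, logical(1))
print(ok)
```

Replacing the inline string with the path to the original or the exported file should report, per <coverage> element, whether the order matches what the parser expects.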

RobLBaker changed the title from "EML::eml_validate conflicts with knb.ecoinformatics.org parser, and appears to introduce invalid xml into valid files" to "EML::eml_validate conflicts with knb.ecoinformatics.org parser & appears to introduce invalid xml into valid files" on Jan 4, 2023

jeanetteclark (Contributor) commented

Hi @RobLBaker, I tracked down the problem and wrote up an issue over in the sister repository emld. Thanks for the report.
