Consider changing default `xsi:schemaLocation` value to point to a remote copy of the eml.xsd #44

amoeba · 2020-01-06T23:22:23Z

As discussed in ropensci/EML#292, the current default behavior of emld is to set the xsi:schemaLocation to values like https:://ecoinformatics.org/eml-2.2.0/ eml.xsd which matches the examples in https://github.com/NCEAS/eml but isn't useful for validation tools that follow remote URIs to find a copy of the schema. @scelmendorf had to set a custom value to get validation to work which I'd argue isn't desirable.

We should discuss whether it makes sense to set a value like https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd instead of just eml.xsd so, by default and with no tweaking from the user, validation tools like Oxygen or any XML-aware text editors can do validation of documents EML/emld produces.

What do others think about this?

The text was updated successfully, but these errors were encountered:

cboettig · 2020-01-06T23:25:41Z

Thanks Bryce! I'm for moving to https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd. @mbjones? others?

mbjones · 2020-01-08T06:31:24Z

LGTM. As @amoeba said, it is an optional attribute, so you could also just leave it out altogether. But your proposal seems good too.

Closes ropensci#44

amoeba · 2020-01-23T02:08:23Z

PR sent. @cboettig take a look when you get a chance?

mbjones · 2020-01-28T05:57:54Z

@cboettig asked a question on this xsi:schemaLocation validation issue over in #51 (comment). I'm going to answer here to keep the thread together. Sorry this turned out to be a bit longwinded. Hopefully its useful.

I don't think we should be doing any string parsing of xsi:schemaLocation to determine which submodules should be used for validation. XMLSchema already fully handles assigning each element and attribute to a specific namespace with a well-defined (albeit complicated) set of rules. You should never have to guess what namespace somebody is asserting for a root level element via string parsing. Instead, we should 1) use proper XML parsing rules to determine the type of the root element, 2) use the formal namespace xs:import and xs:include statements to track namespace dependencies in schemas, and 3) use a local schema cache based on namespace identifier rather than ever using schemaLocation to resolve a schema. For example, the following root element is explicitly rooted at the EML 2.2.0 eml element, which is matchable via its namespace and only its namespace to its XMLSchema structure:

<?xml version="1.0"?>
<eml:eml
    packageId="doi:10.xxxx/eml.1.1" system="knb"
    xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

    <dataset id="dataset-01">
    ...
    </dataset>
</eml>

Validation of that structure explicitly starts with validation of the root eml:eml element, which resolves to the namespace https://eml.ecoinformatics.org/eml-2.2.0 with local name eml. The XML parser will tell us all of that explicitly. Even if I were to set xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://hoser.com/xsd/2.1.1/eml.xsd" on the root element, it would be a mistake to use that file to parse the document. It should be ignored as incorrect user data.

In contrast, this structure explicitly is rooted at a different root schema, namely https://eml.ecoinformatics.org/eml-dataset-2.2.0 with local element name dataset:

<?xml version="1.0"?>
<ns2:dataset id="dataset-01" 
    xmlns:ns2="https://eml.ecoinformatics.org/eml-dataset-2.2.0">
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
...
</dataset>

This is everything one needs to know to know its type, and can be resolved as long as the validator has been configured with a canonical set of schemas for each of the referenced namespaces. This is usually done via an XML Catalog file, but each parser provides multiple mechanisms to pass in namespace to schema-file hashes. If a local schema hasn't been provided for a referenced namespace, an error is rightfully generated. And you might note that the namespace of the dataset element in the first example (it's absent/null) is NOT the same as the namespace of the dataset element in the second example, due to EML's use of xs:elementFormDefault=unqualified. See https://stackoverflow.com/questions/1463138/what-does-elementformdefault-do-in-xsd

String parsing for namespaces is highly error prone, because there are a lot of rules about how to determine the namespace for a given XML element or attribute. At a minimum, you have to consider the prefix and namespace string definitions, the xs:import and xs:include statements in the xsd files themselves, and the values of the xs:elementFormDefault and xs:attributeFormDefault values in the schema, and the current value of xs:targetNamespace and the default namespace (set with e.g., xmlns=foo) as you traverse the element tree. These can all change dynamically as you traverse the element tree, inheriting from parents but potentially being overridden in each child node, and then being reset as one steps back up the element tree. And they all have a large impact on namespace interpretation. I think its essentially impossible to do this right without implementing the full XMLSchema grammer, in which case you've reinvented a formal XML Parser.

The only piece of information that can always be ignored by an XML validator is xsi:schemaLocation. So, bottom line, I advocate for not doing any string parsing or logic around xsi:schemaLocation. Let's just ignore it.

cboettig · 2020-01-28T17:23:08Z

@mbjones Thanks for this, I agree 💯 on this, which is why I wanted to flag it again before cutting the next release. I think the key issue here is where you observe:

This is everything one needs to know to know its type, and can be resolved as long as the validator has been configured with a canonical set of schemas for each of the referenced namespaces

The validator in emld has not been configured with "a canonical set of schemas for each of the referenced namespaces". Currently it is guessing these from schemaLocation, which I agree is wrong. So, action items:

should I hold off on cutting a new release until that's fixed? (I believe CRAN still rate-limits the frequency at which they want updates to established packages, though maybe with the automated system they aren't as picky now?)
Do you think you or @amoeba or someone would have a chance to send a PR to fix this?

(I do wish I had time to dive in here more -- its always been a bit frustrating that I haven't been able to secure funding or academic credit that would let me allocate more time to this! But I do really really appreciate all the excellent pull requests here and on EML package too!)

Stemming from discussion in ropensci#44, we wanted to change the behavior of `eml_validate` to no longer validate by guessing the schema to validate by parsing the `xsi:schemaLocation` value on the root. Now, `eml_validate` determines the schema file to validate with in a more XML-aware way using a combination of the full QName of the root element and any namespaces defined on the root. See eml_validate for two new helpers: `find_real_root_name` and `guess_root_schema`. These are really workarounds for a limitation of `xml2` that may be removed in the future. Basically, `xml2` can't give us the full QName of the root element so we can't know which namespace the root is from. To work around thsi, we parse the document with regular expressions (hacky but not horrible) to find the QName and go from there.

amoeba added a commit to amoeba/emld that referenced this issue Jan 23, 2020

Change default schemaLocation so it points at a URL

4ba83c9

Closes ropensci#44

amoeba mentioned this issue Jan 23, 2020

Change default schemaLocation so it points at a URL #45

Merged

cboettig mentioned this issue Jan 28, 2020

Bump version, add authors, update NEWS #51

Merged

cboettig closed this as completed in #45 Jan 31, 2020

amoeba mentioned this issue Jan 31, 2020

Remove dependency of eml_validate on schemaLocation #53

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider changing default `xsi:schemaLocation` value to point to a remote copy of the eml.xsd #44

Consider changing default `xsi:schemaLocation` value to point to a remote copy of the eml.xsd #44

amoeba commented Jan 6, 2020

cboettig commented Jan 6, 2020

mbjones commented Jan 8, 2020

amoeba commented Jan 23, 2020

mbjones commented Jan 28, 2020

cboettig commented Jan 28, 2020

Consider changing default xsi:schemaLocation value to point to a remote copy of the eml.xsd #44

Consider changing default xsi:schemaLocation value to point to a remote copy of the eml.xsd #44

Comments

amoeba commented Jan 6, 2020

cboettig commented Jan 6, 2020

mbjones commented Jan 8, 2020

amoeba commented Jan 23, 2020

mbjones commented Jan 28, 2020

cboettig commented Jan 28, 2020

Consider changing default `xsi:schemaLocation` value to point to a remote copy of the eml.xsd #44

Consider changing default `xsi:schemaLocation` value to point to a remote copy of the eml.xsd #44