Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove dependency of eml_validate on schemaLocation #53

Merged
merged 5 commits into from Feb 3, 2020

Conversation

amoeba
Copy link
Contributor

@amoeba amoeba commented Jan 31, 2020

Stemming from discussion in #44, we wanted to change the behavior of eml_validate to no longer validate by guessing the schema to validate by parsing the xsi:schemaLocation value on the root. Now, eml_validate determines the schema file to validate with in a more XML-aware way using a combination of the full QName of the root element and any namespaces defined on the root.

See eml_validate for two new helpers: find_real_root_name and guess_root_schema. These are really workarounds for a limitation of xml2 that may be removed in the future. Basically, xml2 can't give us the full QName of the root element so we can't know which namespace the root is from. To work around thsi, we parse the document with regular expressions (hacky but not horrible) to find the QName and go from there.

Closes #52

Stemming from discussion in ropensci#44, we wanted to change the behavior of `eml_validate` to no longer validate by guessing the schema to validate by parsing the `xsi:schemaLocation` value on the root. Now, `eml_validate` determines the schema file to validate with in a more XML-aware way using a combination of the full QName of the root element and any namespaces defined on the root.

See eml_validate for two new helpers: `find_real_root_name` and `guess_root_schema`. These are really workarounds for a limitation of `xml2` that may be removed in the future. Basically, `xml2` can't give us the full QName of the root element so we can't know which namespace the root is from. To work around thsi, we parse the document with regular expressions (hacky but not horrible) to find the QName and go from there.
@amoeba
Copy link
Contributor Author

amoeba commented Jan 31, 2020

As a note, I ran R CMD CHECK on emld and all is good. I also ran the tests on EML to make sure we weren't breaking anything there.

@cboettig
Copy link
Member

👏 👏 🎉

Thanks for this @amoeba. Now I recall running into the QName problem when trying to get validation to work initially. Still surprised this is missing from the xml2 bindings but this solution looks reasonable to me!

Minor thing, but can you just maybe add a test file that doesn't specify a schemaLocation? I don't think we had one yet and I think such a file would have caused a test failure prior to this patch, so it would be good to include.

It's probably a subject for a separate PR and maybe not mission critical enough, but I think that with this fix some of the logic in test-validate.R round trip checks could be simplified. Most importantly, I think there are some test files that had been skipped for wrongly throwing validation errors, and it would be good to get them un-skipped. I think the first step there is probably extending the test-validator code to confirm that the validator is correctly scoring all valid and invalid test files (without doing any of the parsing/serializing round trip stuff of test-validate.R, which is really a test of those read/write methods and not intended to be a test of validation routines itself).

@cboettig
Copy link
Member

@amoeba also can you maybe give a concise statement of the issue into NEWS.md? Of course can link the PR and related issues.

@amoeba
Copy link
Contributor Author

amoeba commented Jan 31, 2020

Minor thing, but can you just maybe add a test file that doesn't specify a schemaLocation? I don't think we had one yet and I think such a file would have caused a test failure prior to this patch, so it would be good to include.

Sure thing. Good call on this: The validator still works but the roundtrip test breaks. Should be a quick fix.

@amoeba also can you maybe give a concise statement of the issue into NEWS.md? Of course can link the PR and related issues.

Yep! I'll try to wrap this today.

Also adds a pattern in the roundtrip tests to let us validate and roundtrip docs that are supposed to be invalid (negative tests).

Ref ropensci#53
Removes the dep on schemaLocation in the test suite I missed earlier. Ref ropensci#53
@amoeba
Copy link
Contributor Author

amoeba commented Jan 31, 2020

All ready for a look. See NEWS.md in this PR for a description of the changes.

To add tests for docs without schemaLocation, I ended up also giving the roundtrip function the ability to test that invalid docs are indeed invalid. Also, do to the new behavior where emld automatically adds schemaLocation if absent in the source document, I had to remove schemaLocation from the before and after count comparison.

We'll have to have a look in the future about cleaning up the validation tests as we've been discussing here and on Slack.

@cboettig
Copy link
Member

cboettig commented Feb 1, 2020

Awesome, this looks really good to me, and glad to have these issues addressed, even given the limitations with xml2 bindings.

I'm happy to merge this and prepare a CRAN update. @mbjones let me know if you want to give this a code review first?

@amoeba
Copy link
Contributor Author

amoeba commented Feb 3, 2020

Last night I read the source for the libxml2 bindings and figured out that I can in fact get the namespace on the root so I'd like to re-factor this once more before merging.

Refactors find_real_root_name and guess_root_schema in light of me figuring out how to actually use `xml2::xml_name`. Much simpler now.

Goes with ropensci#52.

Also removes guess_ns helper from test-validate.R in favor of `find_real_root_name` which is now an internal helper.
@amoeba
Copy link
Contributor Author

amoeba commented Feb 3, 2020

A lot simpler now: c746ad5.

@mbjones
Copy link
Member

mbjones commented Feb 3, 2020

I just read through and this all sounds like an improvement. I don't have cycles to review this week, so I'd say go for it and we can always tweak it later. I think getting this out will be an improvement that everyone will benefit from. So, 👍 from me @cboettig

On a side note, I still don't see why we need to determine the QName of the root node at all to get validation to work. I'll need to look more carefully at the code, but it seems to me that we shouldn't have to tell libxml2 at all what the document type is -- it can figure it out on its own. That's what we do on the Java side. In particular, we really should just be providing an XML Catalog file listing all of the schema namespaces we know about, and then libxml2 will match namespaces using that. It seems to me the root of our problem is that xml2 only provides the method xml2::xml_validate(x, schema), and doesn't seem to support XML catalogs. Which means the application has to guess what the right schema is, when in fact that can and should be done by libxml2. And it also means that the validate function can only pass one schema document, when in reality multiple might be needed (e.g., EML and STMML schemas). The libxml2 docs say the following about XML Catalogs:

In a normal environment libxml2 will by default check the presence of a catalog in /etc/xml/catalog, and assuming it has been correctly populated, the processing is completely transparent to the document user.

The user can change the default catalog behaviour by redirecting queries to its own set of catalogs, this can be done by setting the XML_CATALOG_FILES environment variable to a list of catalogs, an empty one should deactivate loading the default /etc/xml/catalog default catalog.

So we should be able to add a catalog parameter to xml2::xml_validate that would get used to look up schema locations by redirecting via XML_CATALOG_FILES. The new signature might be xml2::xml_validate(x, schema=NA, catalog=NA) and when called with catalog='./catalog.xml' would set the appropriate environment variable before calling libxml2 routines (and would still honor schema docs that are directly passed in.

So, a bit more to discuss here, but maybe we should do that under a new issue?

@amoeba
Copy link
Contributor Author

amoeba commented Feb 3, 2020

Thanks @mbjones. Re: /etc/xml/catalog/XML_CATALOG_FILES, I did see that when researching a fix here but couldn't get them to work with XML schemas. The docs indicate they should work with DTDs but don't mention XML schemas explicitly. The libxml docs indicate that schema support is incomplete and I haven't been able to find any worked examples online (yet?) so I wonder if catalogs may not work with XML schemas. I've followed an approach based on this excellent reference but didn't have any luck.

@cboettig
Copy link
Member

cboettig commented Feb 3, 2020

@amoeba @mbjones Thanks both for the feedback! Merging this now but please do open another issue to track the above discussion. Sounds like it might be worth a bug report against xml2 R package, maybe linking back here for our reference.

Unrelated, but you can see devel is now failing due to ... being documented but not used, which last night's version of R-devel has now decided deserves a WARNING (and the resulting pls fix emails from CRAN for both emld and EML. I'll try and squash those tonight I guess and get this off.

@cboettig cboettig merged commit 1d1eace into ropensci:master Feb 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't depend on schemaLocation to find the schema file to validate with
3 participants