Workflow to ensure users write valid EML? #46

Closed
cboettig opened this issue Sep 4, 2013 · 3 comments

Comments

@cboettig
Member

cboettig commented Sep 4, 2013

In our early discussions about validation, we agreed it was really just part of the developer testing suite. For a user consuming EML, having the software complain that the file isn't valid isn't really helpful; it's best just to give it our best shot anyway. For writing EML, since the file is programmatically generated, we can ensure it is valid ... or can we?

The S4 R objects we use mimic the schema, but they don't enforce required vs optional slots. (In fact, all slots are always 'present' in the S4 objects, so an operational definition of "empty" is that the slot holds an empty S4 object (recursively) or a length-0 character/numeric/logical vector.) A user can create an S4 object and pass it into their EML file, which seems like a useful and powerful option to have, particularly for reusing elements. If the object is missing some required elements, this will create invalid EML.

We can avoid this in several ways:

  • We could write a validation check as part of each S4 method. This is rather tedious and also seems redundant with the schema validation check. On the other hand, this approach gives the advanced user a useful warning earlier.
  • We could instead write constructor functions for each object. This is also tedious, but it clearly indicates which parameters are optional and which are required, and it can be easier to use than calling new directly. This is the strategy we employ so far, but we still permit pre-built S4 nodes to be passed to some constructors to facilitate reuse (which bypasses the protection regarding required elements).
  • We could run the validator by default on calls to write_eml (this would require an internet connection or packaging the schema). If we check only by validating the final EML file, the user may have some trouble finding just what they need to change. On the other hand, it is perhaps the surest way to guarantee validity.
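
A rough sketch of how the first two options could look in plain S4, using only the base methods package. The class dataset_sketch and its slots are hypothetical stand-ins, not the package's real classes: title plays the role of a required element and abstract an optional one.

```r
library(methods)

# Hypothetical, simplified class; not one of the package's actual EML classes.
setClass("dataset_sketch",
         representation(title = "character", abstract = "character"))

# Option 1: a validity method. "Empty" means a length-0 character vector,
# following the operational definition given above.
setValidity("dataset_sketch", function(object) {
  if (length(object@title) == 0) {
    "required element 'title' is empty"
  } else {
    TRUE
  }
})

# Option 2: a constructor function. The required argument has no default,
# the optional one defaults to empty.
dataset_sketch <- function(title, abstract = character(0)) {
  new("dataset_sketch", title = title, abstract = abstract)
}

ok <- dataset_sketch(title = "My data")

# new() with no slot arguments skips the validity check, so a pre-built
# node can still be invalid until validObject() is run on it explicitly.
prebuilt <- new("dataset_sketch")
problems <- validObject(prebuilt, test = TRUE)
```

With test = TRUE, validObject returns the problem messages instead of stopping, so a pre-built node could be checked at the point where it is passed into a constructor.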
@mbjones
Member

mbjones commented Sep 5, 2013

@cboettig One potential source of validation errors that you may not have considered is the use of illegal XML characters in the user input. Before you write out the XML, all illegal XML characters need to be escaped. Does your S4 class handle this escaping automatically when moving data in and out of R data structures?

@cboettig
Member Author

cboettig commented Sep 5, 2013

Yes, it looks like the R XML library automatically escapes these characters. (I noticed this somewhat by accident in my example from the README: https://github.com/ropensci/reml/blob/70c1f8b2747515ae32b770007c84c905f1fda3d3/inst/doc/reml_example.xml )

I added an HTML-marked-up link for intellectual rights, which you will see escaped there. (How would you suggest that section be marked up properly to include a link to the relevant license? Or should I just stick in the whole license text?)
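
A small check of that escaping behaviour, assuming the XML package is installed. The element name and license link here are only illustrative and are not copied from the README example.

```r
library(XML)  # assumption: the CRAN XML package is available

# Illustrative text containing HTML markup, as in the intellectual rights case.
rights <- "<a href=\"http://creativecommons.org/publicdomain/zero/1.0/\">CC0</a>"

# Passing a character string as a child creates a text node, not markup.
node <- newXMLNode("intellectualRights", rights)
xml_text <- saveXML(node)
cat(xml_text, "\n")

# Round trip: parsing the serialized node should give back the original string.
parsed <- xmlValue(xmlRoot(xmlParse(xml_text, asText = TRUE)))
```

If the escaping works as described, the serialized text contains the angle brackets as entities and parsed matches rights.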

@cboettig
Member Author

I think we aim for a two-fold strategy: (1) mainly provide constructor functions with which it is difficult to make invalid EML, while still supporting direct construction for advanced users, and (2) wrap the validation check in the "publish" functions (with an option to turn it off), but not in the regular "write" functions. We also expose the validation function so end users can run it themselves if they wish. (Re-tagging question as "publish" instead of "write").

To Do:

  • Add validation check to publish functions
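
A minimal sketch of what that check could look like. Everything here is hypothetical: publish_sketch, its validate toggle, and the validator and upload arguments are placeholders for whatever the real publish and validation functions end up being.

```r
# Hypothetical wrapper: validates before publishing unless validate = FALSE.
# `validator` and `upload` are stand-ins passed in as functions so the sketch
# runs on its own; the defaults do nothing useful.
publish_sketch <- function(file,
                           validate = TRUE,
                           validator = function(f) TRUE,
                           upload = function(f) invisible(f)) {
  if (validate && !isTRUE(validator(file))) {
    stop("EML failed schema validation; not publishing: ", file, call. = FALSE)
  }
  upload(file)
}

# Default: validation runs first and blocks publishing on failure.
blocked <- tryCatch(
  publish_sketch("eml.xml", validator = function(f) FALSE),
  error = function(e) "blocked"
)

# Toggle off: the same file goes through without the check.
forced <- publish_sketch("eml.xml", validate = FALSE,
                         validator = function(f) FALSE)
```

Keeping the validator as a separate function fits the plan of exposing it to end users, while the "write" functions would simply never call it.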
