Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change default schemaLocation so it points at a URL #45

Merged
merged 2 commits into from Jan 31, 2020

Conversation

amoeba
Copy link
Contributor

@amoeba amoeba commented Jan 23, 2020

Closes #44

Couple of things to look at here:

  • Is it okay I just threw that helper in eml_version.R?
  • I couldn't remember the smart way to put a package level constant in a package, is that alright?

Copy link
Member

@cboettig cboettig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks much for this. A few things here:

First, I think we should discuss whether we want to do it this way, or just stop putting anything in schemaLocation at all by default. I'm kind of inclined to the latter (because all the test files had this set, I had historically assumed it was a "good idea", but now I'm not convinced).

Relatedly, given the proposal that eml_validate ignores schemaLocation anyway and always uses the local copy, just leaving that blank seems the simplest (i.e. it feels a bit weird to me that we have add a function to set the schema location while also ignoring that value in validation. Though I do agree with @mbjones that it is better we always use the local copy to validate).

Second, I think (but may be mistaken!) that we have a bit bigger nuisance in fixing #44. Currently validation is always local, but because the test suite includes a bunch of documents that are not valid complete EML, but are instead only valid subsets (e.g. dataset etc), the validator is currently parsing schemaLocation in order to decide if it should be testing against the full eml.xsd or some subset like dataset.xsd. This causes validation to fail if schemaLocation is unset. It sounds like this should be a simple fix, though I did take a quick stab at it on the plane and created some test failures which I have not yet dug into. See patch-schemaLocation branch....

R/eml_version.R Outdated

# Helper string used below to factor out the host from the generation of
# schemaLocation values
eml_host <- "https://eml.ecoinformatics.org"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, unless we feel it should be configurable as an option. But why make it a package global variable? It seems more natural to me to make eml_host an optional argument to the new eml_schema_location() function, with this as the default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a better approach.

@amoeba
Copy link
Contributor Author

amoeba commented Jan 23, 2020

Hey @cboettig:

I think I'm fine with either setting schemaLocation or not. I see adding it as a convenience: When set, my text editors automatically validates as I type and shows in-context errors. Also when set, my command line validation tools work out of the box without editing the file. These conveniences are nice when working with a document here or there but add up when doing this all the time.

Considering the other two points you bring up, I can get onboard with not setting schemaLocation by default and fixing module-level validation as you suggest to make things more consistent. My above convenience points are mitigated in a world where EML uses an https URI for the namespace which directs people to a copy of the schema so they can set schemaLocation on their own if needed.

I'm happy to close this PR if we agree.

@cboettig
Copy link
Member

You make a good case for setting it as being 'good practice' even though it's not required. I think we should set it by default then.

Let's change the PR to

  • a. use an argument instead of the global variable.
  • b. have the default value for that argument be set by option and accessed by getOption(), like we do in eml_version() function.
  • c. Then we can support the ability to optionally not set a schema by setting the eml_host to NULL

Does that make sense?

Not sure if eml_host is a good name for the option or if we should prefer something more explicit to schemaLocation? Also not sure if doing this through options is really preferable or if this deserves being exposed through user-facing function arguments (possibly just via ... ?) thoughts?

@amoeba
Copy link
Contributor Author

amoeba commented Jan 23, 2020

I wanna think through this a bit...

The user-facing interface for setting schemaLocation currently is setting it on in the list representing the document they're generating/editing. I think this is good and should be the way because the list object is the interface the user is already working with. I think the question here is what we do when it's not set.

So if it's not set, we automatically generate a value for them which would be a side-effect and should be overridable. Good so far. (Aside: Should it be overridable in EML _and) emld? EML's write_eml currently doesn't have a dedicated argument for this but it has other arguments that are similar to schemaLocation.)

This makes me want to add support for an option like schemaLocation which is just the full attribute value so a user can set it permanently via something like .REnviron and can easily include multiple mappings (not just one). I think my current eml_schema_location is too modular and too magical to be user-facing. You rightly noticed this with calling out eml_host as maybe not being the best way.

@cboettig
Copy link
Member

Yup, good point! I agree the user way to set it should be in the list construction (or also in an argument to a dedicated constructor function, e.g. the proposed set_eml() function for packageId could take an argument). I agree that doing it in options as well is kinda too magical.

I guess the only trick is that it's a bit less obvious how to have a default value that the user can override with NULL than it is to do the logic for a default NULL / omitted value that the user can override by supplying one. I guess one option is to use NA instead of NULL, NA would mean "omit the schemaLocation" while not setting one (which is the same thing as setting the list element to NULL) would mean "use the default schemaLocation". How's that sound?

@amoeba
Copy link
Contributor Author

amoeba commented Jan 24, 2020

Hrm, at the risk of discussing this to death, what about TRUE|FALSE|""?

  • TRUE -> Automatically set schemaLocation to a helpful eml.ecoinformatics.org location
  • FALSE -> Do not set schemaLocation
  • "" -> Set this schemaLocation. This is kinda like the col_names argument in read_* in readr.

This could be exposed at the EML::write_eml boundary and bubbled down into as_xml*.

If you aren't a fan, I am happy to do NULL|NA like above.

@cboettig
Copy link
Member

@amoeba thanks, yeah, I like your proposal better, I think it's more intuitive than NA vs NULL. An update with that would be great!

Also if you have a chance, maybe you want to take a whack at the validator code to remove any use of xsi:schemaLocation on that end as well? either as part of this PR or separately would be great.

As per discussion in ropensci#45, this refactors the behavior of the schemaLocation argument in as_xml to provide an API for asking emld to automatically guess a helpful, web-resolvable schemaLocation value when a value isn't set on the document before serialization.
@amoeba
Copy link
Contributor Author

amoeba commented Jan 29, 2020

I refactored the schemaLocation argument as discussed above. Please take a look. I'll move on to looking at excising schemaLocation from validation logic.

Edit: And will send a separate PR for that.

@cboettig
Copy link
Member

thanks for this, current approach looks great!

@cboettig cboettig merged commit 730f56e into ropensci:master Jan 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider changing default xsi:schemaLocation value to point to a remote copy of the eml.xsd
2 participants