Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for multiple missing value codes in set_attributes #268

Open
atn38 opened this issue Apr 16, 2019 · 17 comments
Open

support for multiple missing value codes in set_attributes #268

atn38 opened this issue Apr 16, 2019 · 17 comments

Comments

@atn38
Copy link
Contributor

atn38 commented Apr 16, 2019

(I looked through prev. issues + set_attributes.R and found no reference to this. Apologies if feature's already implemented.)

It's important for scientific interpretation that metadata documents explanations for missing values. "data not collected due to field conditions" is different from "no specimens were found". EML spec supports repeated missingValueCode elements.

If we want to implement this, how should the input to set_attributes look like? Most simple might be for the missingValueCode and missingValueCodeExplanation columns in the data.frame supplied to set_attributes to take two paired comma separated list of strings and parse them into sets of missingValueCode EML elements.

folks over at LTER would like to have this feature in a core metadata database design. How this gets done here might bear on how we do this in database schema and then in R workflow to generate EML, which uses this package under the hood.

@jeanetteclark
Copy link
Contributor

Hi An

You are correct that set_attributes does not support multiple missing value codes, but you can still construct a valid EML document with multiple missing value codes using EML.

Here is a MRE:

me <- list(individualName = list(givenName = "Jeanette", surName = "Clark"))

attributes <- data.frame(attributeName = 'length_1',
                         attributeDefinition = 'def1',
                         measurementScale = 'ratio',
                         domain = 'numericDomain',
                         unit = 'meter',
                         numberType = 'real',
                         stringsAsFactors = FALSE)

att_list <- set_attributes(attributes)

att_list$attribute[[1]]$missingValueCode <- list(eml$missingValueCode(code = "A", codeExplanation = "exp 1"),
                                                 eml$missingValueCode(code = "B", codeExplanation = "exp 2"))

doc <- list(packageId = "id", system = "system", 
               dataset = list(title = "A Mimimal Valid EML Dataset",
                              creator = me,
                              contact = me,
                              dataTable = list(entityName = "data table", attributeList = att_list))
)

eml_validate(doc)

IMO adding support for this to set_attributes would probably cause more problems than it would solve. A separate function, taking the attributeList, attributeName and a data.frame of missing value codes and definitions as arguments might be helpful. Carl or Bryce can maybe chime in on whether this would be appropriate for the EML package, or better suited for you to put in your own package.

Hope that helps!

@amoeba
Copy link
Collaborator

amoeba commented Apr 16, 2019

Hi An, thanks for filing this issue!

Re:

IMO adding support for this to set_attributes would probably cause more problems than it would solve. A separate function, taking the attributeList, attributeName and a data.frame of missing value codes and definitions as arguments might be helpful. Carl or Bryce can maybe chime in on whether this would be appropriate for the EML package, or better suited for you to put in your own package.

I agree with @jeanetteclark here. set_attributes has the limitation you point out and many others. Its limited ability is also part of its utility. There are a number of sub-elements underneath attribute that are repeatable that the helper currently does not cover and, while adding support for one or two might be fine, I don't think we'd want to add them all.

Does @jeanetteclark 's "manual" solution work for you, or would writer a helper function for your group work? I could definitely help with that if needed.

@mbjones
Copy link
Member

mbjones commented Apr 16, 2019

@jeanetteclark I think this would be a great addition to the EML package. We need more of these helpers that make it easier to construct common patterns. Your MRE illustrates how much one needs to know about the EML schema to add the metadata to the EML document. Can you explain why you think this should be a separate function rather than an addition to set_attributes? Seems like this could be syntactically handled as:

attributes <- data.frame(attributeName = 'length_1',
                         attributeDefinition = 'def1',
                         measurementScale = 'ratio',
                         domain = 'numericDomain',
                         unit = 'meter',
                         numberType = 'real',
                         stringsAsFactors = FALSE,
                         missingValueCode = list(c(code = "A", codeExplanation = "exp 1"),
                                                 c(code = "B", codeExplanation = "exp 2")))

Would that work? Probably also nice if it accepted a non-list version as well in case someone only had to add one missing value code (so just missinngValueCode = c(code = "A", codeExplanation = "exp 1")). Thoughts?

@jeanetteclark
Copy link
Contributor

jeanetteclark commented Apr 16, 2019

@mbjones the code you included above generates a somewhat garbled data.frame. I agree with @amoeba that the value in set_attributes is it's simplicity. data.frames are easy to understand and easy to manipulate - I don't think we should muck them up with list columns or some other similar structure.

I like the idea of adding a helper function (set_missingValues?)

Another option would be to add support to set_attributes where the missing values codes and definitions are passed similar to the way that the factors argument works. I don't prefer this though - adding missingValues as an argument to set_attributes to me would imply that if you want to set missing value codes you have to do it via that argument, when you really only need it if you have multiple missing value codes per attribute. This is in contrast to the factors argument which is required for valid EML if you have an enumeratedDomain. A similar set of functions to this case (high level helper and child element helper) exists in the package already - set_coverage and set_taxonomicCoverage

Happy to hear arguments for the other side and write the function :)

@mbjones
Copy link
Member

mbjones commented Apr 16, 2019

Good point. And I forgot that as.data.frame converts the list to columns, which is indeed garbled (lists can be columns, but they have to be added separately I guess). I like your proposal to be consistent with how enumeratedDomain is handled with a separate factors argument (although I wish that parameter were labeled domain instead of factors). So, a new argument missingValueCodes which takes a data.frame structured like factors with columns attributeName, code, and codeExplanation would be great and still pretty simple. So the new signature would be set_attributes(attributes, factors = NULL, col_classes = NULL, missingValueCodes = NULL). And a additional set_missingValues would be a great helper too, but of course would only work once the attributes list was already created.

@jeanetteclark
Copy link
Contributor

jeanetteclark commented Apr 16, 2019

Yeah in thinking about it more - adding the argument to set_attributes is probably the easiest way to go. It allows for easy addition of multiple missing value codes on multiple attributes, and as you said doesn't require having the attribute list already.

@atn38
Copy link
Contributor Author

atn38 commented Apr 16, 2019

@amoeba: I'm working on a workflow/package to take a set of data.frames imported from PostgreSQL views (which will come from a metadata database schema designed for LTER sites), then use rEML under the hood to insert info from these dfs into appropriate slots, then validate and write EML docs. The workflow is meant to be reused with different datasets at different sites.

@jeanetteclark thank you for the MRE. Good to know that it's possible. I'd be interested in a general solution however, since I'm not writing custom EML-generation scripts for each dataset.

I like the idea of an additional argument to set_attributes() that takes a three-column data.frame. Happy to help write it if I could be useful. We'd need a warning if there are already missingValueCode + Explanation columns in the supplied attributes data.frame.

@scelmendorf
Copy link

+1 for @mbjones suggestion above (updating set_attributes to have a missingValueCodes argument and workflow paralleling that of the "factors" argument that currently populates codes/definitions for enumeratedDomain)

@jeanetteclark
Copy link
Contributor

jeanetteclark commented Apr 16, 2019

I submitted a PR with the addition of this feature for @cboettig and @amoeba to review - thanks for the suggestion @atn38 and @scelmendorf!

@atn38
Copy link
Contributor Author

atn38 commented Apr 18, 2019

@jeanetteclark, thank you for the unbelievably quick feature add! Let me know if I got the behavior correct: (1) for a given attribute, the missingValues argument takes precedence if there are code-def pairs in both attributes and missingValues. (2) If there is one code-def pair defined in attributes but none in missingValues, that code still makes it into EML, correct?

If above is true, then it'll be easier for the LTER metabase users to transition!

Edit: did some testing, (1) is true but (2) is not.

@jeanetteclark
Copy link
Contributor

Hi @atn38 - yes you have the behavior correct.

In my testing, both of your listed cases are true. Have you set the columns in your attributes data.frame as missingValueCode and missingValueCodeExplanation? I frequently forget what the verbiage is for these.

The code below shows both of your cases:

atts <- data.frame(attributeName = c('length_1', 'length_2'),
                   attributeDefinition = c('def1', 'def2'),
                   measurementScale = c('ratio', 'ratio'),
                   domain = c('numericDomain', 'numericDomain'),
                   unit = c('meter','meter'),
                   numberType = c('real','real'),
                   missingValueCode = c("NA", NA),
                   missingValueCodeExplanation = c("def",NA),
                   stringsAsFactors = FALSE)

missing_values <- data.frame(attributeName = c("length_1", "length_2" ,"length_2"), code = c("1", "2", "3"), definition = c("A", "B", "C"))

set_attributes(atts, missingValues = missing_values)
set_attributes(atts)

I'll submit a separate issue to improve the documentation and error checking for correct column names in the attributes table.

@atn38
Copy link
Contributor Author

atn38 commented Apr 18, 2019

I should get into the habit of MREs! Usually lots of moving pieces in workflow. And yes -- more documentation for set_attributes would be very helpful for new users -- I learned via trial and error that it takes a bunch of columns that doc doesn't mention.

The second case I'm concerned with is more like this:

library(EML)

atts <- data.frame(attributeName = c('length_1', 'length_2', 'length_3'),
                   attributeDefinition = c('def1', 'def2', 'def3'),
                   measurementScale = c('ratio', 'ratio', 'ratio'),
                   domain = c('numericDomain', 'numericDomain', 'numericDomain'),
                   unit = c('meter','meter', 'kilometer'),
                   numberType = c('real','real', 'real'),
                   missingValueCode = c("NA", NA, 'code_here'),
                   missingValueCodeExplanation = c("def", NA, 'explanation_here'),
                   stringsAsFactors = FALSE)

missing_values <- data.frame(attributeName = c("length_1", "length_2" ,"length_2"), code = c("1", "2", "3"), definition = c("A", "B", "C"))

atts1 <- set_attributes(atts, missingValues = missing_values)
atts2 <- set_attributes(atts)

Interesting behavior here with atts1, despite length_3 not appearing in missingValues at all, set_attributes still returns attribute 'length_3' has missing value codes set in both the 'attributes' and 'missingValues' data.frames. Using codes from 'missingValues' data.frame.

Also note how two missingValueCode elements are returned for length_3 in atts1, neither of which is what we want (we would want ("code_here", "explanation_here") which is defined in attributes):

atts1[["attribute"]][[3]][["missingValueCode"]]
[[1]]
[[1]]$code
[1] NA

[[1]]$codeExplanation
[1] NA


[[2]]
[[2]]$code
character(0)

[[2]]$codeExplanation
character(0)

Hopefully this makes sense!

@jeanetteclark
Copy link
Contributor

Ah yes, thanks for that example. That definitely is not what we want! I'll look into it

@jeanetteclark
Copy link
Contributor

okay @atn38 I made the function smarter to handle these cases. Would you mind installing the package from my fork to see if it works as you expect now?

devtools::install_github("jeanetteclark/EML")

@atn38
Copy link
Contributor Author

atn38 commented Apr 18, 2019 via email

@jeanetteclark
Copy link
Contributor

@atn38 I just updated the documentation for the set_attributes function. If you install again and look in the Details section of the help for set_attributes you should see missingValueCode and missingValueCodeExplanation added to the "for all data" section

@atn38
Copy link
Contributor Author

atn38 commented Apr 18, 2019

Works beautifully now -- thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants