This repository has been archived by the owner on May 19, 2021. It is now read-only.

R package to store/access metadata associated with data/functions #18

Open
jonocarroll opened this issue Mar 30, 2016 · 3 comments

@jonocarroll

First off, I see that there is already ropensci/EML and the associated idea, but I'm not a fan of S4, and I'm thinking bigger.

I've brought this up in discussions elsewhere in the past, and I know that Hadley hasn't made attributes a priority in his workflows (e.g. in relation to assertr: https://twitter.com/hadleywickham/status/559183346144522241). In fact, it was only recently that attributes were preserved in dplyr pipelines, and they're certainly not preserved in plyr functions.

I'd love to be able to attach a Python-esque docstring to data and functions that can be printed without invoking the full help system (?library), which might contain the last time the object was updated (either automatically or manually stated), the source, attribution, etc. It's certainly possible to use comment() on a data.frame, but I'm thinking these could be stored similarly to .Rmd files (with full markdown capability?) in a cache and searched/loaded independently, to ensure they survive processing. This could include a checksum on the object to enforce reproducibility, and perhaps even a trigger system for when an object declared immutable is altered (overriding <- ... does one dare?). Needless to say, these would have to be transparent to existing structures, so that would need some careful consideration and balance.
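A minimal base-R sketch of the idea (the helper names `docstring<-` and `context` are hypothetical, not an existing API, and the checksum here is a crude stand-in for a real hash such as `digest::digest`):

```r
## Hypothetical helpers -- not an existing API, purely a sketch.
`docstring<-` <- function(x, value) {
  attr(x, "docstring") <- value
  x
}
context <- function(x) {
  attributes(x)[c("docstring", "last_modified", "checksum")]
}

dat <- data.frame(a = 1:3)
docstring(dat) <- "Example data; source: manual entry."
attr(dat, "last_modified") <- Sys.Date()
## crude fingerprint; a real implementation would use a proper hash
attr(dat, "checksum") <- sum(as.integer(serialize(dat, NULL)))

context(dat)$docstring
```

Because these are plain attributes, the metadata travels with the object but is invisible to code that doesn't look for it, which is one reading of "transparent to existing structures".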

Just thoughts at this stage.

@jonocarroll
Author

Roxygen would be the natural method of doing this, which should make it transparent to anything existing (#' docstring)
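For illustration, such a docstring might sit above the object as a roxygen-style comment block (a hypothetical sketch; roxygen2 only processes these blocks inside a package, so here the comments are inert):

```r
#' Daily maximum temperatures for a hypothetical station.
#' @source manual entry, 2016-03-29
temps <- c(21.3, 19.8, 22.1)
mean(temps)
```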

@ivanhanigan
Contributor

Great idea. My proposed issue title was perhaps too specific to ropensci/EML. I think the generic issue you describe is better, because it starts by not assuming the extant technology/solutions are a fait accompli, especially if there are issues with depending on S4. The bonus of EML in my eyes is the convenience of leveraging international standards and schemas, and the tools that exist to work within that standard (also see https://github.com/DataONEorg/rdataone to interface with the EML-based Metacat data repositories).

I am not sure from what you wrote whether your idea builds on existing international standards or intends to develop new ones (i.e. is this Python-esque docstring considered a 'standard'? If not, will this development generate a schema for attributes on data/functions that then becomes an internationally agreed standard, or another R-flavoured dialect?).

My suggestion was based on a pressing unmet need I face when ingesting, synthesising and disseminating data and code, especially while working at the coal-face of data analysis (i.e. generating metadata while working with data/code, rather than creating metadata before or after the analysis). The act of writing metadata at the same time as doing data munging is appealing to me, especially if it is automated to the hilt.

In terms of choosing the standard, EML seems to be the most generically applicable standard I have used across environmental, social, health and geographic data types (others I tested were ANZLIC, DDI and RIF-CS).

I also like the idea that this topic may have cross-cutting potential with #9, as the automagic ingestion of data/metadata into R will facilitate validation analyses. It may also cut across #8, where such metadata could make reproducible workflows easier, quicker and more re-usable/re-configurable. I imagine it is also relevant to #13, which has to deal with communicating uncertainty related to the underlying construction/collection of data, such as measurement error, modelling error or similarly complicated algorithmic processing of data prior to generating the uncertain results that one wishes to communicate.

Good stuff, mate. Let me know if I am off track with where you saw this thread going.

@jonocarroll
Author

Python docstrings are a standard in that they are strongly encouraged and are handled as official attributes, but I'm only using those as an example to launch from.

From a structural point of view, the EML standard would be perfect, but I was thinking more in terms of Roxygen defined attributes than an XML structure. The attributes would be retrievable as first-class objects via some method, or printable with a context.print() method. A known set of expandable attributes would be a good start. Some thought would need to go into the object structure, whether it's better to define a new OOP construct, an extension of data.frame, or some auxiliary structure.
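For instance, the printable-context idea maps naturally onto an S3 class with its own print method (a sketch; the class name `context` and the fields chosen are assumptions, not an existing API):

```r
## Sketch: wrap selected attributes in an S3 "context" class
context <- function(x) {
  structure(attributes(x)[c("owner", "last_modified")], class = "context")
}

## compact printing, without invoking the full help system
print.context <- function(x, ...) {
  for (nm in names(x)) {
    cat(nm, ": ", format(x[[nm]]), "\n", sep = "")
  }
  invisible(x)
}

dat <- data.frame(a = 1)
attr(dat, "owner") <- "jonocarroll"
attr(dat, "last_modified") <- as.Date("2016-03-30")
print(context(dat))
```

This keeps the metadata retrievable as a first-class object (the list returned by `context()`) while giving a readable one-screen summary.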

I have in mind (and remember, this is all purely brainstorming at this point) the case where you load some data from a trusted source, validate that it is indeed unchanged (validate_checksum(data)), print out the context (context(data)$owner; context(data)$last_modified), and so on; ditto for functions that do what one thinks they do (context(my_function)$assumptions). The context travels with the data/function and can be tested against, e.g.

```r
## ensure that the function is at least as up-to-date as the data
stopifnot(context(data)$last_modified <= context(my_function)$last_modified)
```

A somewhat complicated extension of this would be to overload <- when this package is loaded so that data/functions with an immutable flag can't be overwritten. Some automation could be included there to update the last_modified or owner attribute.
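Base R already offers a building block for the immutable flag: `lockBinding()` makes a name unassignable in its environment, so a full overload of `<-` may not even be necessary. A sketch:

```r
dat <- data.frame(a = 1:3)

## declare the object immutable in the current environment
lockBinding("dat", environment())

## attempts to overwrite now fail with an error
overwritten <- tryCatch({
  dat <- "clobbered"
  TRUE
}, error = function(e) FALSE)

overwritten  # FALSE: the locked binding rejected the assignment
```

The limitation is that locking is per-environment (a copy of the object in another environment is not protected), which is where the more ambitious `<-` overload or trigger system would come in.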

Some related reading: http://simplystatistics.org/2015/11/06/how-i-decide-when-to-trust-an-r-package/
