Interested in PR w/ scripts to create EML files? #25

cboettig · 2016-09-19T23:56:18Z

After reading your blog post about hosting the Portal Data on GitHub, I thought it would be a fun example to use with an undergraduate student (Anna Liu) who is working with me to test out our in-development [https://github.com/ropensci/EML] R package. Anna wrote a little R script that uses the EML package to automatically generate the EML files for each of the datasets here.
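To give a rough idea of what "automatically generating" metadata for a table involves, here is a minimal sketch in Python. It is not Anna's script and does not use the EML R package; the function name `describe_table` and the sample rows are hypothetical. The element names (`dataset`, `title`, `dataTable`, `entityName`, `attributeList`, `attribute`, `attributeName`) are borrowed from EML, but the output is a simplified fragment, not a schema-valid EML document.

```python
import csv
import io
import xml.etree.ElementTree as ET


def describe_table(title, entity_name, csv_text):
    """Build a minimal EML-style <dataset> element for one CSV table.

    Assumption: only column names are recorded; real EML also needs
    definitions, units, creators, coverage, and so on.
    """
    # Read only the header row to get the column names.
    header = next(csv.reader(io.StringIO(csv_text)))

    dataset = ET.Element("dataset")
    ET.SubElement(dataset, "title").text = title

    table = ET.SubElement(dataset, "dataTable")
    ET.SubElement(table, "entityName").text = entity_name

    # One <attribute> entry per CSV column.
    attributes = ET.SubElement(table, "attributeList")
    for column in header:
        attribute = ET.SubElement(attributes, "attribute")
        ET.SubElement(attribute, "attributeName").text = column
    return dataset


# Hypothetical sample rows, not real Portal records.
sample_csv = "species_id,genus,species\nXX,Examplegenus,examplespecies\n"
dataset = describe_table("Example dataset", "species.csv", sample_csv)
print(ET.tostring(dataset, encoding="unicode"))
```

Looping this over every CSV file in a repository is the general shape of the exercise; the descriptive fields that make the metadata useful for search still have to be supplied by someone who knows the data.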

We'd love your feedback on the exercise, and could send a pull request that would add the script and the resulting EML files if you think it would be of any use. It might be fun to bounce some further ideas off you based on this one. For instance, while it's certainly cool that this data is on GitHub, such a script could also automatically upload a new release of your data into an ecological repository like KNB / DataONE. Not only would this give each release a DOI, but the EML files would make much of the data a good deal easier to discover, for instance by searching for any of the species names or the geographic area covered by the data, since this information is indexed as such in KNB (as you may already be well aware).

Since Portal data is used in Data Carpentry and other teaching, this might also be a way to teach EML (relevant to the use of other LTER and NEON data). Your example was also particularly interesting to me as a 'streaming' data set that is still being updated, something it might be worth discussing further with people like @mbjones to work out how best to archive it.

Anyway, just wanted to get in touch and share ideas. Curious what you think.

Best,

Carl

mbjones · 2016-09-20T16:10:21Z

@cboettig Thanks for making me aware of this, and thanks @skmorgane for such an amazing resource. Like Carl, I think it would be fantastic to include data such as this in the KNB repository, which would allow you to have a DOI for each version of the data set (the KNB links versions so it is clear that one version replaces another), would make it discoverable in a broader context, and would support replication to other DataONE repositories for backup and accessibility. I really like how you are using GitHub for data management, and the use of Ecological Archives for archived snapshots. A small group of us has started discussions on how to modernize Ecological Archives and connect ESA data publications into data repository networks like DataONE (the prior ESA Data Registry is already part of DataONE), and I think your data set would be an excellent test case for the types of features we would need. Looking forward to hearing your thoughts on Carl and Anna's metadata generation experiment.

ethanwhite · 2016-09-22T13:37:46Z

@cboettig & @mbjones - quick answer since we're both in the midst of hectic parts of our teaching semesters.

We both really like the idea of having machine readable metadata here, both in the form of EML and other emerging standards like datapackage.json. A PR with the EML would be welcome.

Regarding permanent archiving, how to handle this is an ongoing topic of discussion. Assuming a modernization of Ecological Archives, I suspect this will still be @skmorgane's preference, but the current state of it is definitely a concern (the last data paper for Portal was recently published as Wiley supporting material 😱). Happy to chat more about what's optimal here at some point.

Regarding teaching - Most of the other teaching work is being done through https://github.com/weecology/portal-teachingdb. It's not streaming, which is a good thing for more basic teaching cases, but I think that adding machine readable metadata over there would also be valuable for extending the scope of things that can be taught with that data.

cboettig · 2016-10-25T17:37:12Z

Looks like PR #28 has been successfully merged; thanks @liuanna @ethanwhite @skmorgane et al!

There's still the issue Ethan & Matt discuss above about snapshotting a copy of this database in DataONE (e.g. via KNB or Ecological Archives), particularly since this could better expose the metadata in the EML files to search systems (as well as providing a better probability of persistence). Since that isn't reflected in the issue title, should I close this and open a new issue, or would you prefer to just keep this thread open for continuity (& maybe rename the thread)?

ethanwhite · 2016-10-26T03:04:11Z

@cboettig - yes, let's go ahead and close this issue. As I mentioned above, we're discussing how to handle future snapshots. If Ecological Archives gets turned around I suspect that will be the direction the Portal folks want to go, since that's what they've done historically, but I'm not convinced that's going to happen at the moment. I think the current thought is that, since they just archived a snapshot there earlier this year, we'll wait and see for a bit what becomes of Ecological Archives, and if it doesn't return to something we're all happy with then we'll dig into the conversation of what to replace it with. You're welcome to open an issue if you'd like to discuss this further, but we probably won't have time to really get into it until after the semester is over.

cboettig mentioned this issue Sep 30, 2016

Openness to integrating retrieval of external associated data ropensci/rgpdd#5

Closed

ethanwhite closed this as completed Oct 26, 2016
