Interested in PR w/ scripts to create EML files? #25

Closed
cboettig opened this issue Sep 19, 2016 · 4 comments

Comments

@cboettig

Hi Morgan / @skmorgane,

After reading your blog post about hosting the Portal Data on GitHub, I thought it would be a fun example to use with an undergraduate student (Anna Liu) who is working with me to test out our in-development [https://github.com/ropensci/EML] R package. Anna wrote a little R script that uses the EML package to automatically generate the EML files for each of the datasets here.
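For readers unfamiliar with EML, the kind of file such a script produces can be sketched roughly as follows. This is not Anna's script and does not use the ropensci/EML R package; it is a minimal Python illustration using only the standard library, and the function name, package ID, title, and surname are hypothetical placeholders. It assumes the EML 2.1.1 namespace and emits only a title, creator, and contact for the dataset.

```python
import xml.etree.ElementTree as ET

# Namespace of the EML 2.1.1 schema (assumed version for this sketch).
EML_NS = "eml://ecoinformatics.org/eml-2.1.1"


def build_minimal_eml(package_id, title, surname):
    """Return a minimal EML document as a string (hypothetical helper)."""
    ET.register_namespace("eml", EML_NS)
    # Only the root element is namespace-qualified; children are unqualified.
    root = ET.Element(
        f"{{{EML_NS}}}eml",
        {"packageId": package_id, "system": "example"},
    )
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = title
    # Reuse the same person as both creator and contact of the dataset.
    for role in ("creator", "contact"):
        party = ET.SubElement(dataset, role)
        name = ET.SubElement(party, "individualName")
        ET.SubElement(name, "surName").text = surname
    return ET.tostring(root, encoding="unicode")


# Placeholder values, not real Portal metadata.
xml_text = build_minimal_eml("example.1.1", "Hypothetical survey table", "Doe")
print(xml_text)
```

A real record would also describe each table's columns, units, and geographic and taxonomic coverage, which is the part a repository can index for search.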

We'd love your feedback on the exercise, and could send a pull request that would add the script and resulting EML files if you think it would be of any use. It might be fun to bounce some further ideas off you based on this. For instance, while it's certainly cool that this data is on GitHub, such a script could also automatically upload a new release of your data into an ecological repository like KNB / DataONE. Not only would this give each release a DOI, but the EML files would make much of the data a good deal easier to discover, for instance by searching for any of the species names or the geographic area covered by the data, since this information is indexed as such in KNB (as you may already know).

Since Portal data is used in Data Carpentry and other teaching, this might also be a way to teach EML (relevant to use of other LTER data and NEON data). Your example was also particularly interesting to me as a 'streaming' data set that is still being updated, something it might be worth discussing further with people like @mbjones to work out how best to archive it.

Anyway, just wanted to get in touch and share ideas. Curious what you think.

Best,

Carl

@mbjones

mbjones commented Sep 20, 2016

@cboettig Thanks for making me aware of this, and thanks @skmorgane for such an amazing resource. Like Carl, I think it would be fantastic to include data such as this in the KNB repository, which would allow you to have a DOI for each version of the data set (the KNB links versions so it is clear that one version replaces another), would make it discoverable in a broader context, and supports replication to other DataONE repositories for backup and accessibility. I really like how you are using GitHub for data management, and the use of Ecological Archives for archived snapshots. A small group of us have started discussions on how to modernize Ecological Archives and connect ESA data publications into data repository networks like DataONE (the prior ESA Data Registry is already part of DataONE), and I think your data set would be an excellent test case for the types of features we would need. Looking forward to hearing your thoughts on Carl and Anna's metadata generation experiment.

@ethanwhite
Member

@cboettig & @mbjones - quick answer since we're both in the midst of hectic parts of our teaching semesters.

We both really like the idea of having machine readable metadata here, both in the form of EML and other emerging standards like datapackage.json. A PR with the EML would be welcome.

Regarding permanent archiving, how to handle this is an ongoing topic of discussion. Assuming a modernization of Ecological Archives, I suspect it will still be @skmorgane's preference, but its current state is definitely a concern (the last data paper for Portal was recently published as Wiley supporting material 😱). Happy to chat more about what's optimal here at some point.

Regarding teaching - Most of the other teaching work is being done through https://github.com/weecology/portal-teachingdb. It's not streaming, which is a good thing for more basic teaching cases, but I think that adding machine readable metadata over there would also be valuable for extending the scope of things that can be taught with that data.

@cboettig
Author

Looks like PR #28 has been successfully merged; thanks @liuanna @ethanwhite @skmorgane et al!

There's still the issue Ethan & Matt discuss above about snapshotting a copy of this database in DataONE (e.g. via KNB or Ecological Archives), particularly since this could better expose the metadata in the EML files to search systems (as well as providing a better probability of persistence). Since that issue isn't reflected in the issue title, should I close this and open a new issue, or would you prefer to just keep this thread open for continuity (& maybe rename the thread)?

@ethanwhite
Member

@cboettig - yes, let's go ahead and close this issue. As I mentioned above, we're discussing how to handle future snapshots. If Ecological Archives gets turned around, I suspect that will be the direction the Portal folks want to go, since that's what they've done historically, but I'm not convinced that's going to happen at the moment. I think the current thought is that, since they just archived a snapshot there earlier this year, we'll wait and see for a bit what becomes of Ecological Archives, and if it doesn't return to something we're all happy with, then we'll dig into the conversation of what to replace it with. You're welcome to open an issue if you'd like to discuss this further, but we probably won't have time to really get into it until after the semester is over.
