where to publish DwC datasets? #94

Open
ekrimmel opened this issue Sep 4, 2017 · 1 comment

ekrimmel commented Sep 4, 2017

discussion from DwC Hour #7: Aggregators - a Darwin Core View...

Wouter Addink (Naturalis): Should data providers provide their dataset to as many aggregators as possible in order to benefit from the tools building on specific aggregators? Or do we have a huge data duplication problem then?

Deb (talking): Yes. Different aggregators focus on different data users and can provide different data quality/feedback mechanisms. Discoverability is a huge issue, so that takes precedence over a fear of duplicate data being published across aggregators.

John Wieczorek: I second that, Deb.

Andrea Hahn (GBIFS): Yep. We will have to handle duplication on all levels anyway. Try to keep record identifiers as stable as possible though.
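
One practical reading of Andrea's point about stable identifiers: mint an occurrenceID once per record and persist it, rather than regenerating identifiers on every export. A minimal Python sketch, assuming records are keyed by catalogNumber; the storage file and helper names here are hypothetical:

```python
# Sketch only: mint an occurrenceID once per record and reuse it on every export.
import json
import uuid
from pathlib import Path

ID_STORE = Path("occurrence_ids.json")  # hypothetical persistent lookup table

def load_ids():
    return json.loads(ID_STORE.read_text()) if ID_STORE.exists() else {}

def stable_occurrence_id(catalog_number, ids):
    """Return the stored occurrenceID for a record, minting one only if absent."""
    if catalog_number not in ids:
        ids[catalog_number] = f"urn:uuid:{uuid.uuid4()}"  # minted once, never regenerated
    return ids[catalog_number]

ids = load_ids()
for record in [{"catalogNumber": "ROM-12345"}, {"catalogNumber": "ROM-12346"}]:
    record["occurrenceID"] = stable_occurrence_id(record["catalogNumber"], ids)
    print(record)
ID_STORE.write_text(json.dumps(ids, indent=2))
```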

John Wieczorek: Until one aggregator can enable the community to make custom views and indexes anyway.

Alex Thompson: @wouter Don Hobern and I have talked about working on an infrastructure to solve the duplication (and duplication of effort) problem. But we haven't gotten anywhere yet.

Wouter Addink (Naturalis): Will aggregators aggregate the data from each other then?

Brad Millen, ROM, Toronto, Canada: Canadensys is essentially Inverts, Plants, etc., but they come to the ROM IPT, poll records, and publish them on their site.

Cindy Opitz, University of Iowa MNH: But don't the big aggregators pick up data from others? Do we have to submit data to multiple aggregators?

Alex Thompson: All of the aggregators are directly sourced, we don't exchange data internally.

Joanna (iDigBio): @cindy - we try very hard not to intentionally ingest duplicates.

Matthew Collins: If you make your data available (via IPT or other methods) then the relationship is more that multiple providers can get data from you.

Alex Thompson: we all build on the same infrastructures though, IPT, DWCA, etc.
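
For context on that shared infrastructure: a Darwin Core Archive (DwC-A) is just a zip file containing a meta.xml descriptor plus delimited data files, and that archive is what aggregators harvest from an IPT endpoint. Below is a minimal, standard-library-only Python sketch for inspecting one locally; the archive file name is hypothetical:

```python
# Sketch only: open a local Darwin Core Archive and list its core file and terms.
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}  # namespace used by DwC-A meta.xml

with zipfile.ZipFile("dwca-occurrence.zip") as archive:  # hypothetical file name
    meta = ET.fromstring(archive.read("meta.xml"))
    core = meta.find("dwc:core", NS)
    core_file = core.find("dwc:files/dwc:location", NS).text
    terms = [f.attrib["term"] for f in core.findall("dwc:field", NS)]
    # meta.xml stores the delimiter literally, e.g. "\t" for tab-delimited cores
    delimiter = core.attrib.get("fieldsTerminatedBy", "\\t").encode().decode("unicode_escape")

    print("core data file:", core_file)
    print("mapped terms:", terms)

    with archive.open(core_file) as fh:
        reader = csv.reader(io.TextIOWrapper(fh, encoding="utf-8"), delimiter=delimiter)
        for i, row in enumerate(reader):
            print(row)
            if i >= 2:  # show just the header and a couple of records
                break
```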

Gary (speaking): If publishing in several portals, issues identified in different portals would go back to the provider; the responsibility falls on the provider.

Deb (speaking): In some portals updates might be faster, so one portal could have a more up-to-date dataset: versions, with IDs associated. If the data is sitting in an IPT, then all aggregators can grab it from the same place. E.g., GBIF was already there and publishing when iDigBio came to be; they would not grab datasets they already have, only what's new through iDigBio. There is a need for robust identifiers to avoid duplicates.

Andrea (speaking): Aggregators are not ingesting everybody else’s data.

John Wieczorek: @cindy One way to think of it is that our data sets on IPTs are like market places. Aggregators come to get the data of interest.

Cindy Opitz, University of Iowa MNH: I'm confused. I submitted data to VertNet, and now it is also in iDigBio and GBIF.

Brad Millen, ROM, Toronto, Canada: Records from ROM are appearing from our ROM IPT and/or GBIF on many sites.

Alex Thompson: It's useful to think about VertNet as two parts: A data publishing service (ipt.vertnet.org), and an aggregator (vertnet portal).

John Wieczorek: @cindy You publish data. VertNet facilitates making sure aggregators get it.

Cindy Opitz, University of Iowa MNH: ah, thanks

Brad Millen, ROM, Toronto, Canada: Once published anywhere it spreads rapidly.

Alex Thompson: @wouter think of it less as exchanging data, and more of a global harvesting layer built on top of IPT that is then available for other people to build portals off of. Like John was talking about with custom views and indexes.

Joanna (iDigBio): https://www.idigbio.org/wiki/index.php/Data_Ingestion_Guidance#Instructions_on_changing_identifiers_.28occurrenceID.29

John Wieczorek: This identifier change issue is important and a bit complex. VertNet helps people get through this and build a ResourceRelationship extension for the data set. GBIF and iDigBio both use that to make the change.

Wouter Addink (Naturalis): A central authority is needed to create the persistent identifiers

John Wieczorek: GBIF gets them from VertNet. (speaking of ResourceRelationship extensions mapping old occurrenceIDs to new ones).
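
To make John's last point concrete: the ResourceRelationship extension uses standard Darwin Core terms (resourceID, relationshipOfResource, relatedResourceID) that can carry old-to-new occurrenceID pairs alongside a dataset. Here is a minimal Python sketch of such a mapping file; the relationship wording and example identifiers are assumptions for illustration, not the documented VertNet recipe:

```python
# Sketch only: write old-to-new occurrenceID pairs as a ResourceRelationship
# extension file. Column names are Darwin Core terms; the values are made up.
import csv

id_changes = [
    ("urn:catalog:ROM:Birds:12345", "urn:uuid:0f8fad5b-d9cb-469f-a165-70867728950e"),
    ("urn:catalog:ROM:Birds:12346", "urn:uuid:7c9e6679-7425-40de-944b-e07fc1f90ae7"),
]

with open("resourcerelationship.txt", "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerow(["resourceID", "relationshipOfResource", "relatedResourceID",
                     "relationshipRemarks"])
    for old_id, new_id in id_changes:
        # Assumed convention: the new identifier "replaces" the old one.
        writer.writerow([new_id, "replaces", old_id,
                         "occurrenceID changed during dataset migration"])
```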


ekrimmel commented Jul 2, 2018

should add this to https://github.com/tdwg/dwc-qa/wiki/Sharing-DwC-Data
