where to publish DwC datasets? #94

Open
ekrimmel opened this issue Sep 4, 2017 · 1 comment

ekrimmel commented Sep 4, 2017

discussion from DwC Hour #7: Aggregators - a Darwin Core View...

Wouter Addink (Naturalis): Should data providers provide their dataset to as many aggregators as possible in order to benefit from the tools building on specific aggregators? Or do we have a huge data duplication problem then?

Deb (talking): Yes. Different aggregators focus on different data users and can provide different data quality/feedback mechanisms. Discoverability is a huge issue, so that takes precedence over a fear of duplicate data being published across aggregators.

John Wieczorek: I second that, Deb.

Andrea Hahn (GBIFS): Yep. We will have to handle duplication on all levels anyway. Try to keep record identifiers as stable as possible though.
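
One practical reading of Andrea's point about stable identifiers: mint an occurrenceID once per record and persist it, rather than regenerating identifiers on every export. A minimal Python sketch, assuming records are keyed by catalogNumber; the storage file and helper names here are hypothetical:

```python
# Sketch only: mint an occurrenceID once per record and reuse it on every export.
import json
import uuid
from pathlib import Path

ID_STORE = Path("occurrence_ids.json")  # hypothetical persistent lookup table

def load_ids():
    return json.loads(ID_STORE.read_text()) if ID_STORE.exists() else {}

def stable_occurrence_id(catalog_number, ids):
    """Return the stored occurrenceID for a record, minting one only if absent."""
    if catalog_number not in ids:
        ids[catalog_number] = f"urn:uuid:{uuid.uuid4()}"  # minted once, never regenerated
    return ids[catalog_number]

ids = load_ids()
for record in [{"catalogNumber": "ROM-12345"}, {"catalogNumber": "ROM-12346"}]:
    record["occurrenceID"] = stable_occurrence_id(record["catalogNumber"], ids)
    print(record)
ID_STORE.write_text(json.dumps(ids, indent=2))
```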

John Wieczorek: Until one aggregator can enable the community to make custom views and indexes anyway.

Alex Thompson: @wouter Don Hobern and I have talked about working on an infrastructure to solve the duplication (and duplication of effort) problem. But we haven't gotten anywhere yet.

Wouter Addink (Naturalis): Will aggregators aggregate the data from each other then?

Brad Millen, ROM, Toronto, Canada: Canadensys is essentially Inverts, Plants, etc., but they come to the ROM IPT, poll records, and publish them on their site.

Cindy Opitz, University of Iowa MNH: But don't the big aggregators pick up data from others? Do we have to submit data to multiple aggregators?

Alex Thompson: All of the aggregators are directly sourced, we don't exchange data internally.

Joanna (iDigBio): @cindy - we try very hard not to intentionally ingest duplicates.

Matthew Collins: If you make your data available (via IPT or other methods) then the relationship is more that multiple providers can get data from you.

Alex Thompson: we all build on the same infrastructures though, IPT, DWCA, etc.
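
For context on that shared infrastructure: a Darwin Core Archive (DwC-A) is just a zip file containing a meta.xml descriptor plus delimited data files, and that archive is what aggregators harvest from an IPT endpoint. Below is a minimal, standard-library-only Python sketch for inspecting one locally; the archive file name is hypothetical:

```python
# Sketch only: open a local Darwin Core Archive and list its core file and terms.
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}  # namespace used by DwC-A meta.xml

with zipfile.ZipFile("dwca-occurrence.zip") as archive:  # hypothetical file name
    meta = ET.fromstring(archive.read("meta.xml"))
    core = meta.find("dwc:core", NS)
    core_file = core.find("dwc:files/dwc:location", NS).text
    terms = [f.attrib["term"] for f in core.findall("dwc:field", NS)]
    # meta.xml stores the delimiter literally, e.g. "\t" for tab-delimited cores
    delimiter = core.attrib.get("fieldsTerminatedBy", "\\t").encode().decode("unicode_escape")

    print("core data file:", core_file)
    print("mapped terms:", terms)

    with archive.open(core_file) as fh:
        reader = csv.reader(io.TextIOWrapper(fh, encoding="utf-8"), delimiter=delimiter)
        for i, row in enumerate(reader):
            print(row)
            if i >= 2:  # show just the header and a couple of records
                break
```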

Gary (speaking): If publishing in several portals, issues identified in different portals would go back to the provider; the responsibility falls on the provider.

Deb (speaking): In some portals updates might be faster, so one portal could have a more up-to-date dataset: versions, with IDs associated. If the data is sitting in an IPT, then all aggregators can grab it from the same place. E.g., GBIF was already there and publishing when iDigBio came to be; they would not grab datasets they already have, only what's new through iDigBio. There is a need for robust identifiers to avoid duplicates.

Andrea (speaking): Aggregators are not ingesting everybody else’s data.

John Wieczorek: @cindy One way to think of it is that our data sets on IPTs are like market places. Aggregators come to get the data of interest.

Cindy Opitz, University of Iowa MNH: I'm confused. I submitted data to VertNet, and now it is also in iDigBio and GBIF.

Brad Millen, ROM, Toronto, Canada: Records from ROM are appearing from our ROM IPT and/or GBIF on many sites.

Alex Thompson: It's useful to think about VertNet as two parts: A data publishing service (ipt.vertnet.org), and an aggregator (vertnet portal).

John Wieczorek: @cindy You publish data. VertNet facilitates making sure aggregators get it.

Cindy Opitz, University of Iowa MNH: ah, thanks

Brad Millen, ROM, Toronto, Canada: Once published anywhere it spreads rapidly.

Alex Thompson: @wouter think of it less as exchanging data, and more of a global harvesting layer built on top of IPT that is then available for other people to build portals off of. Like John was talking about with custom views and indexes.

Joanna (iDigBio): https://www.idigbio.org/wiki/index.php/Data_Ingestion_Guidance#Instructions_on_changing_identifiers_.28occurrenceID.29

John Wieczorek: This identifier change issue is important and a bit complex. VertNet helps people get through this and build a ResourceRelationship extension for the data set. GBIF and iDigBio both use that to make the change.

Wouter Addink (Naturalis): A central authority is needed to create the persistent identifiers

John Wieczorek: GBIF gets them from VertNet. (speaking of ResourceRelationship extensions mapping old occurrenceIDs to new ones).
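
To make John's last point concrete: the ResourceRelationship extension uses standard Darwin Core terms (resourceID, relationshipOfResource, relatedResourceID) that can carry old-to-new occurrenceID pairs alongside a dataset. Here is a minimal Python sketch of such a mapping file; the relationship wording and example identifiers are assumptions for illustration, not the documented VertNet recipe:

```python
# Sketch only: write old-to-new occurrenceID pairs as a ResourceRelationship
# extension file. Column names are Darwin Core terms; the values are made up.
import csv

id_changes = [
    ("urn:catalog:ROM:Birds:12345", "urn:uuid:0f8fad5b-d9cb-469f-a165-70867728950e"),
    ("urn:catalog:ROM:Birds:12346", "urn:uuid:7c9e6679-7425-40de-944b-e07fc1f90ae7"),
]

with open("resourcerelationship.txt", "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerow(["resourceID", "relationshipOfResource", "relatedResourceID",
                     "relationshipRemarks"])
    for old_id, new_id in id_changes:
        # Assumed convention: the new identifier "replaces" the old one.
        writer.writerow([new_id, "replaces", old_id,
                         "occurrenceID changed during dataset migration"])
```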


ekrimmel commented Jul 2, 2018

should add this to https://github.com/tdwg/dwc-qa/wiki/Sharing-DwC-Data
