Distinct Values - Why this data directory?

Amassing biodiversity collections data into very large aggregated datasets offers new ways to use existing data to enhance current data and improve future data. At the same time, new ways of viewing these data present opportunities to engage data providers, collectors, and downstream researchers.

Invitations were sent to biodiversity data aggregators asking them to share the distinct values present (along with a count for each value) in each of the 23 fields for which the Darwin Core (DwC) standard recommends using a controlled vocabulary. This directory contains the files sent by GBIF, iDigBio, ALA, and VertNet. Additionally, for some terms in the iDigBio folder, you can see the distinct values (with counts) both for the raw data and for the same term after data quality scripts have been run and the data indexed.
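
To make "distinct values with a value count" concrete, here is a minimal sketch in Python. It is not the script any aggregator used. The function name `distinct_value_counts` and the sample records are hypothetical, and `basisOfRecord` is picked only as an illustrative DwC term.

```python
from collections import Counter

def distinct_value_counts(records, field):
    """Count how often each distinct raw value of `field` occurs.

    Values are deliberately NOT normalized, so spelling and
    formatting variants show up as separate entries.
    """
    return Counter(rec.get(field, "") for rec in records)

# Hypothetical sample records, for illustration only.
records = [
    {"basisOfRecord": "PreservedSpecimen"},
    {"basisOfRecord": "preserved specimen"},
    {"basisOfRecord": "PreservedSpecimen"},
    {"basisOfRecord": "Fossil"},
]

counts = distinct_value_counts(records, "basisOfRecord")

# Print one tab-separated "value, count" row per distinct value.
for value, n in counts.most_common():
    print(f"{value}\t{n}")
```

Leaving the values unnormalized is intentional here: the variants themselves are the data quality signal.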

These files are just the beginning of an idea: to enable data providers, domain-specific researchers, collectors, and others to easily see and understand data issues. Once noted, these issues can be used to highlight what can and cannot be fixed by scripts. Data aggregators are well aware of the data issues seen inside these buckets, but it is not so easy for data providers, collectors, or other downstream users to visualize the data challenges seen in these files. For the issues that require human attention, we need a way to make it easy for people to take action.

Not only do we need a dynamic resource, we also need to go beyond distinct values and counts. Efforts are underway in the TDWG data quality (DQ) community to use these data to drive the development of recommended controlled vocabularies. Other efforts are underway at iDigBio to build a tool that generates these data and visualizes the values using clustering algorithms. As the worldwide community develops and implements better controlled vocabularies, the number of clusters should go down, and each cluster should contain fewer items. These potential metrics and data form part of a larger initiative to tackle biodiversity data quality in a strategic manner.
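
As a rough illustration of the clustering idea and the metrics it could yield, the sketch below groups distinct values whose normalized "fingerprint" collides. This is an assumed, deliberately simple key-collision approach, not the algorithm used by the iDigBio tool. The names `fingerprint` and `cluster_values` and the sample counts are hypothetical.

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Normalize a value: lowercase it and drop everything except a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

def cluster_values(counts):
    """Group distinct values (with their counts) by shared fingerprint."""
    clusters = defaultdict(dict)
    for value, n in counts.items():
        clusters[fingerprint(value)][value] = n
    return dict(clusters)

# Hypothetical distinct values and counts, for illustration only.
sample_counts = {
    "PreservedSpecimen": 120,
    "preserved specimen": 7,
    "Preserved_Specimen": 3,
    "Fossil": 10,
}

clusters = cluster_values(sample_counts)

# Two candidate metrics: the number of clusters, and the items per cluster.
print("clusters:", len(clusters))
for key, members in clusters.items():
    print(key, "->", len(members), "variant(s):", sorted(members))
```

Under this sketch, a cluster holding several variants marks a spot where a controlled vocabulary would help, and a cluster shrinking to a single variant over time would be one sign of improvement.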