Use only verified data for pipeline #84
Comments
This is of utmost importance! For the Walloon data, this was a large part of Max's work. He made a big effort to convince the experts to validate the datasets before publication. We also chose to validate the data from some experts ourselves, for the taxonomic groups in which they had very good expertise. That's part of the reason why it took so long. It seems that Natagora followed the same process as Natuurpunt. |
The record in question is this one: https://www.gbif.org/occurrence/2631775528 ( Both Natagora and Natuurpunt have the field Since other datasets do not have this field, the only option I see is removing records that are explicitly marked as unverified, i.e.:
|
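Let's do this properly as it could have a big effect. Can we have a quick preview here of the number of records per identificationVerificationStatus and per classis, for instance? It might be that there are other relevant categories (and the verification types were adapted along the way after discussions with admins). There is also a category "pending", for example. |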
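A preview of this kind, counting records per identificationVerificationStatus and per class, could be sketched roughly as follows. This is only a minimal illustration in plain Python, not the code used for the cube: the function name `count_by_status_and_class`, the `(not filled)` placeholder and the toy records are assumptions made for the example, and the records are assumed to be dictionaries keyed by Darwin Core term names.

```python
from collections import Counter

def count_by_status_and_class(records):
    """Count records per (identificationVerificationStatus, class) pair."""
    counts = Counter()
    for rec in records:
        # Missing or empty values are grouped under a placeholder label
        # (an assumption of this sketch, not a GBIF convention).
        status = (rec.get("identificationVerificationStatus") or "").strip() or "(not filled)"
        taxon_class = (rec.get("class") or "").strip() or "(not filled)"
        counts[(status, taxon_class)] += 1
    return counts

# Toy records, invented for illustration only.
sample = [
    {"identificationVerificationStatus": "unverified", "class": "Aves"},
    {"identificationVerificationStatus": "unverified", "class": "Aves"},
    {"identificationVerificationStatus": "pending", "class": "Insecta"},
    {"identificationVerificationStatus": "", "class": "Insecta"},
    {"class": "Magnoliopsida"},
]

# Print the combinations from most to least frequent.
for (status, taxon_class), n in count_by_status_and_class(sample).most_common():
    print(f"{status}\t{taxon_class}\t{n}")
```

Grouping empty values under their own label keeps the records without a filled field visible in the preview instead of silently dropping them.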
Currently our modelling workflow does not discriminate using
identificationVerificationStatus.
Based on the many factors we use to filter occurrence data, the data are
already greatly reduced, so I would hate to add another, especially if
filling this attribute is not widely adopted. However it seems most
relevant to pay attention to the identification status when 2 species
closely resemble each other (especially if one is alien/invasive and the
other is native).
…On Thu, Jun 4, 2020 at 6:36 PM Tim Adriaens ***@***.***> wrote:
Let's do this properly as it could have a big effect. Can we have a quick
preview here of the number of records per identificationVerificationStatus
and per classis for instance ? It might be that there are other relevant
categories (and the verification types were adapted along the way after
discussions with admins). There is also a category "pending" for example.
--
Dr. Amy J.S. Davis
Data-driven solutions to invasive species and biodiversity conservation
Terrestrial Ecology Unit
Department of Biology
Ghent University
K. L. Ledeganckstraat 35
B-9000 Ghent
Belgium
http://amyjsdavis.com
|
@timadriaens: as it seems important, I will not wait to check it until I make a new cube. I will try to find some time tomorrow or next week to tackle this. |
The GBIF download used as the starting point for the occurrence cube published on Zenodo contains 2447 distinct values of
|
Interesting! |
This was just a relatively fast check.
Stay tuned 📻 |
Ok, but I think removing records that are explicitly marked as unverified is indeed fine. |
As promised, a little more insight into the 6.652.040 unverified records in our GBIF download (date of download: 28 Jan 2020) containing occurrences in BE.
Datasets
Around 77% of the unverified data come from Waarnemingen.be - Bird occurrences in Flanders and the Brussels Capital Region, Belgium. Almost all of the datasets are Natuurpunt-related. One comes from Wallonia: Observations.be - Non-native species occurrences in Wallonia, Belgium. There is also an INBO dataset: Vlinderdatabank - Butterflies in Flanders and the Brussels Capital Region, Belgium.
Classes
Below is the distribution of unverified occurrences at class level, ordered by
Years
Distribution among years in a plot (from 1980) and in a table where years are given in descending order of number of occurrences.
I hope this first analysis gives you all more elements to discuss. I would remove these data. I think that data quality is as important as transparency in science. |
Indeed @damianooldoni, this is as expected. obs.be/wnm.be have a well-established validation flow and therefore have that field. @damianooldoni @peterdesmet @qgroom @SoVDH @amyjsdavis @DiederikStrubbe we exclude the unverified records from the occurrence-based indicators. But do we keep all records to build the cube, assuming that even an unverified record represents a survey effort? Or how do we deal with this? |
At the risk of sounding like an extremist, I'd rule them out. |
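I agree with @SoVDH for two reasons:
1. a minimum of data quality (= validation) is extremely important, no matter the goal the data are used for
2. the indicators are built upon the cube, where data are already aggregated per year, taxon and grid cell. Making a distinction between verified and unverified data means adding an extra column validation (TRUE or FALSE) to keep things tidy. It would make the occurrence cube more difficult to understand and I don't think it's worth it.
|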
I agree that we should not make a distinction between verified and
unverified data, but I am not sure that we should exclude unverified data
from the cube. But if this is what you want to do (exclude unverified data),
we need to decide quickly, because at present these data are being included
in the risk models.
…On Wed, Jun 10, 2020 at 2:01 PM Damiano Oldoni ***@***.***> wrote:
I agree with @SoVDH <https://github.com/SoVDH> for two reasons:
1. a minimum of data quality (= validation) is extremely important, no
matter the goal the data are used for
2. the indicators are built upon the cube, where data are already
aggregated per year, taxon and grid cell. Making a distinction between
verified and unverified data means adding an extra column validation (TRUE
or FALSE) to keep things tidy. It would make the occurrence cube more
difficult to understand and I don't think it's worth it.
|
ok of course, but not sure @amyjsdavis will like it? |
@DiederikStrubbe and I discussed this and now I have a better understanding. I am ok with you excluding the unverified data and I don't think this will substantially change the risk models. |
If we want to exclude records that are marked as |
@SoVDH : I have 17 out of 19 plant species SDM models for the risk assessment completed. These of course include the "unvalidated" or "unverified" label. Is it your preference that I run them again with these data excluded or do you want the maps now? It will take a few days, but it can be done. |
Yes, I think that would be better. |
@amyjsdavis: I thought you were using the cube for Europe I made for your SDM (eu_modellingtaxa_cube.csv, metadata: eu_modellingtaxa_info.csv). And in this cube there is no way to exclude unverified taxa. So, I wonder which occurrence data you are using. By the way, I will try to make a new version of the cubes before the end of June. |
@damianooldoni : These are a different set of species (the plant species on the Species selection for PRA-ing list on google drive). They are different from the modellingtaxa list, and thus there is no European cube. I realized that in the future, there will likely not be a cube for every species that is to be evaluated, so my modelling flow has the option to use the cube as an input if one already exists, or to process data directly downloaded from GBIF. As you may recall, I have to download global data for each species anyway for the models. |
@amyjsdavis it's good to keep the options open, but there is a cube for every species on the unified and in fact on every spp. |
Are your Belgian maps just crops of an EU/global risk map or how does that work? |
@timadriaens : indeed, there is a cube for every species, but only for Belgium. The risk maps for Belgium are essentially a crop of a European risk model. |
Are we planning to do anything with the European maps? There is certainly interest, cf. Crassula helmsii and Muntiacus reevesi. |
I would stop this interesting discussion here as it has nothing to do with verification anymore. I started a new one here: trias-project/occ-cube-alien#25 |
I have also seen "unvalidated" as a value for identificationVerificationStatus. Should those data also be excluded? |
@amyjsdavis That was the term we were discussing or am I missing something? |
Oh, you mean “unvalidated” in addition to “unverified”. Yes, those should ideally be removed as well. Which dataset did you find those in? |
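The rule discussed so far (drop records explicitly marked as "unverified" or "unvalidated", keep records where the field is not filled) might look roughly like this. It is a hedged sketch, not the filter actually used in the pipeline: the names `EXCLUDED_STATUSES`, `is_usable` and `drop_unverified` are hypothetical, the case-insensitive comparison is an assumption, and the real list of excluded values may be longer.

```python
# Values treated as "explicitly not verified". Both are mentioned in this
# thread; the exact list used by the pipeline is an assumption here.
EXCLUDED_STATUSES = {"unverified", "unvalidated"}

def is_usable(record, excluded=EXCLUDED_STATUSES):
    """Return True unless the record is explicitly marked with an excluded status."""
    status = record.get("identificationVerificationStatus")
    if status is None or not status.strip():
        # Field not filled: keep the record, since many datasets never use it.
        return True
    # Case-insensitive comparison is an assumption of this sketch.
    return status.strip().lower() not in excluded

def drop_unverified(records, excluded=EXCLUDED_STATUSES):
    """Keep only the records that pass is_usable."""
    return [rec for rec in records if is_usable(rec, excluded)]

# Toy records, invented for illustration only.
sample = [
    {"occurrenceID": "a", "identificationVerificationStatus": "unverified"},
    {"occurrenceID": "b", "identificationVerificationStatus": "Unvalidated"},
    {"occurrenceID": "c", "identificationVerificationStatus": ""},
    {"occurrenceID": "d"},
    {"occurrenceID": "e", "identificationVerificationStatus": "pending"},
]

print([rec["occurrenceID"] for rec in drop_unverified(sample)])
```

Whether a status such as "pending" should also be excluded is left open here; adding it to the set would be enough to drop those records too.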
Yes, I found it in my global download for the plants for the risk assessment. I just happened to notice it for Symphyotrichum lanceolatum. The dataset provider is urn:lsid:swedishlifewatch.se:DataProvider:1, the dataset name is Artportalen (Swedish Species Observation System). |
My global download dataset is here: https://doi.org/10.15468/dl.ruaasw |
This issue can be closed as we filter out unvalidated data, see https://github.com/trias-project/occ-cube/blob/master/src/2_create_db.Rmd#L252-L260 and https://github.com/trias-project/occ-cube-alien/blob/master/src/belgium/2_create_db.Rmd#L288-L296 for the names of the issue whose occurrences we filter out. |
Hi, this came up when checking an unverified, false record of a supposedly new alien species for Belgium in the wnm.be data (Vespa orientalis). Waarnemingen and observations publish all records with IdentificationVerificationStatus on GBIF (which is ok!). However, for the pipeline, the models etc. it is imperative that we only use validated occurrences. Therefore: the pipeline needs a line to subset data based on IdentificationVerificationStatus.
Perhaps we can do some sort of sensitivity analysis to see how this impacts (I'm sure there is no time)...
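One cheap form of such a sensitivity check would be to aggregate the occurrences per year, taxon and grid cell twice, once with and once without the unverified records, and list the combinations that disappear. The sketch below only illustrates that idea under stated assumptions: the keys `year`, `taxon`, `cell` and `status`, the names `build_cube` and `cells_lost`, and the toy records are all hypothetical and do not reflect the real cube schema.

```python
from collections import Counter

# Statuses to exclude; an assumption of this sketch.
EXCLUDED = {"unverified", "unvalidated"}

def build_cube(records):
    """Aggregate occurrences per (year, taxon, grid cell)."""
    return Counter((r["year"], r["taxon"], r["cell"]) for r in records)

def cells_lost(records):
    """Compare the cube built from all records with the one built without excluded statuses."""
    full = build_cube(records)
    kept = build_cube(
        r for r in records
        if (r.get("status") or "").strip().lower() not in EXCLUDED
    )
    # Combinations that exist only because of unverified records.
    lost = sorted(set(full) - set(kept))
    return len(full), len(kept), lost

# Toy records, invented for illustration only.
sample = [
    {"year": 2019, "taxon": "Vespa orientalis", "cell": "A1", "status": "unverified"},
    {"year": 2019, "taxon": "Crassula helmsii", "cell": "A1", "status": ""},
    {"year": 2019, "taxon": "Crassula helmsii", "cell": "A1", "status": "unverified"},
    {"year": 2018, "taxon": "Muntiacus reevesi", "cell": "B2", "status": ""},
]

n_full, n_kept, lost = cells_lost(sample)
print(n_full, n_kept, lost)
```

A taxon that appears only in the lost combinations, like the single Vespa orientalis record in the toy data, is exactly the kind of case that would otherwise enter the indicators as a new alien species.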
Question to @damianooldoni @peterdesmet @qgroom @SoVDH, anticipating that perhaps many datasets/records on GBIF do not even have an IdentificationVerificationStatus: what do we do if that field is not filled?