Use only verified data for pipeline #84

Closed
timadriaens opened this issue Jun 4, 2020 · 32 comments

@timadriaens
Member

Hi, this came up when checking an unverified, false record of a supposedly new alien species for Belgium in the wnm.be data (Vespa orientalis). Waarnemingen and Observations publish all records on GBIF with identificationVerificationStatus (which is OK!). However, for the pipeline, the models, etc., it is imperative that we only use validated occurrences. Therefore, the pipeline needs a line to subset the data based on identificationVerificationStatus:

  • approved on expert judgement
  • approved on photographic evidence
  • approved on knowledge rules
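
A minimal sketch of such a subsetting step, assuming the occurrences are available as dictionaries keyed by Darwin Core term names. The function name `keep_approved`, the `gbifID` values and the sample records are hypothetical; only the three status values come from the list above. It illustrates the whitelist idea and is not the pipeline's actual code.

```python
# Accepted statuses, copied from the list above.
APPROVED_STATUSES = {
    "approved on expert judgement",
    "approved on photographic evidence",
    "approved on knowledge rules",
}


def keep_approved(records):
    """Return only the records whose status is in the whitelist.

    A missing or empty status is treated as not approved.
    """
    kept = []
    for record in records:
        status = (record.get("identificationVerificationStatus") or "").strip().lower()
        if status in APPROVED_STATUSES:
            kept.append(record)
    return kept


# Made-up sample records for illustration.
sample = [
    {"gbifID": 1, "identificationVerificationStatus": "approved on expert judgement"},
    {"gbifID": 2, "identificationVerificationStatus": "unverified"},
    {"gbifID": 3, "identificationVerificationStatus": ""},
]
approved_ids = [r["gbifID"] for r in keep_approved(sample)]
print(approved_ids)
```

Note that a strict whitelist like this also drops every record from datasets that leave the field empty.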

Perhaps we can do some sort of sensitivity analysis to see how this impacts the results (I'm sure there is no time)...

Question to @damianooldoni @peterdesmet @qgroom @SoVDH, anticipating that many datasets/records on GBIF may not even have an identificationVerificationStatus: what do we do if that field is not filled?

@SoVDH

SoVDH commented Jun 4, 2020

This is of utmost importance! For the Walloon data, it has been a big part of Max's work. He made a big effort to convince the experts to validate the datasets before publication. For some taxonomic groups, we also chose to validate ourselves the data from experts with very good expertise in those groups. That's part of the reason why it took so long. It seems that Natagora followed the same process as Natuurpunt.
I confirm what Tim just said above: only validated data can be used to run the indicators, to identify emerging species, and to run the models for risk mapping. I know that we potentially 'lose' a lot of data, but here quality MUST take precedence over quantity! I also include @amyjsdavis and @DiederikStrubbe here, as this discussion is relevant for them too.

@peterdesmet
Member

The record in question is this one: https://www.gbif.org/occurrence/2631775528 (Natuurpunt:Waarnemingen:190863847).

Both Natagora and Natuurpunt have the field identificationVerificationStatus and both (see e.g. https://www.gbif.org/occurrence/2270408500) are publishing unverified records to GBIF (which is fine).

Since other datasets do not have this field, the only option I see is removing records that are explicitly marked as unverified, i.e.:

identificationVerificationStatus = "unverified"
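
The rule above, dropping only the records explicitly marked as unverified while keeping records without a status, could be sketched as follows. This is an illustration under assumptions: `drop_unverified`, its `excluded` parameter and the sample records are hypothetical names and data, and records are assumed to be dictionaries keyed by Darwin Core term names.

```python
def drop_unverified(records, excluded=("unverified",)):
    """Drop records whose status is explicitly listed in `excluded`.

    Records with an empty or missing status are kept, since many
    datasets do not fill in identificationVerificationStatus at all.
    """
    excluded_lower = {value.lower() for value in excluded}
    kept = []
    for record in records:
        status = (record.get("identificationVerificationStatus") or "").strip().lower()
        if status not in excluded_lower:
            kept.append(record)
    return kept


# Made-up sample records for illustration.
sample = [
    {"gbifID": 1, "identificationVerificationStatus": "unverified"},
    {"gbifID": 2, "identificationVerificationStatus": ""},
    {"gbifID": 3, "identificationVerificationStatus": "verified"},
    {"gbifID": 4},  # field absent in this dataset
]
kept_ids = [r["gbifID"] for r in drop_unverified(sample)]
print(kept_ids)
```

Passing a longer tuple as `excluded` extends the rule to other status values without changing the function.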

@timadriaens
Member Author

Let's do this properly as it could have a big effect. Can we have a quick preview here of the number of records per identificationVerificationStatus and per class, for instance? There might be other relevant categories (the verification types were adapted along the way after discussions with admins). There is also a category "pending", for example.

@amyjsdavis

amyjsdavis commented Jun 4, 2020 via email

@damianooldoni
Contributor

@timadriaens: as it seems important, I will not wait until I make a new cube to check it. I will try to find some time tomorrow or next week to tackle this.

@damianooldoni
Contributor

The GBIF download used as the starting point for the occurrence cube published on Zenodo contains 2447 distinct values of identificationVerificationStatus. They are shown below in descending order of number of occurrences. As you can see, there are a lot of unverified occurrences: 6.652.040, almost 19% of the data. The filtering based on issue (coordinate issues) and occurrenceStatus (absences) removes "just" 165.653 occurrences. So, even if all of those were unverified, the number of unverified occurrences would still remain very high.

identificationVerificationStatus n
"" 15901818
"unverified" 6652040
"approved on knowledge rules" 6471644
"approved on expert judgement" 3598234
"approved on photographic evidence" 1273927
"verified" 748793
"Validated on the basis of rules" 60848
"Verified Observation" 33048
"validated by PAULY A" 27731
"validated by RASMONT P" 22352
"approved on photographic evide" 18643
"validated by LECLERCQ J" 18533
"Validated without evidence (additional information provided, ...)" 15668
"validated by D'Haeseleer J." 12881
"validated without a document in support (expertise or additional informations)" 10925
"validated by REMACLE A" 8211
... ...
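
A tally like the one above can be sketched with a counter over the status column. This is an assumed, simplified version: `tally_statuses` is a hypothetical helper and the input is toy data, not the GBIF download.

```python
from collections import Counter


def tally_statuses(statuses):
    """Count occurrences per status value, most frequent first.

    Returns a list of (status, n, share) tuples, where share is the
    fraction of all occurrences carrying that status.
    """
    counts = Counter(statuses)
    total = sum(counts.values())
    return [(status, n, n / total) for status, n in counts.most_common()]


# Toy input: one status string per occurrence ("" means the field is empty).
statuses = ["", "", "", "unverified", "unverified", "verified"]
tally = tally_statuses(statuses)
for status, n, share in tally:
    print(f'"{status}" {n} ({share:.0%})')
```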

@qgroom

qgroom commented Jun 6, 2020

Interesting!
It is not often appreciated that very common species don't need verifying, because even if the identification was wrong, there is a very good chance that that species is present within a grid cell anyway.
On the other hand, for rare species the numbers of false identifications can far exceed the number of correct identifications.
Therefore, you can happily accept the "unverified" records for common species, but where do you put the cut off?
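
The asymmetry between common and rare species can be made concrete with a small base-rate calculation. The numbers below are purely illustrative assumptions (they come from no dataset), and `false_share` is a hypothetical helper: it assumes a fixed fraction of the records of a look-alike species gets mislabelled as the target species.

```python
def false_share(true_records, lookalike_records, misid_rate):
    """Fraction of records labelled as the target species that are wrong.

    Assumes `misid_rate` of the look-alike species' records are
    mislabelled as the target species.
    """
    false_records = lookalike_records * misid_rate
    return false_records / (true_records + false_records)


# Illustrative numbers only.
rare = false_share(true_records=50, lookalike_records=10_000, misid_rate=0.01)
common = false_share(true_records=10_000, lookalike_records=50, misid_rate=0.01)
print(f"rare species: {rare:.1%} of its records are false")
print(f"common species: {common:.4%} of its records are false")
```

With the same misidentification rate, most records of the rare species are wrong while the common species is barely affected. Where to put the cut-off remains a judgement call that this sketch does not answer.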

@damianooldoni
Contributor

damianooldoni commented Jun 6, 2020

This was just a relatively fast check.
I will investigate further by:

  1. searching for the datasets the unverified obs come from. Waarnemingen (Natuurpunt data) for sure, maybe other ones?
  2. grouping them by class as asked by @timadriaens
  3. grouping them by year (maybe most of them are "too" recent data from the current year? Then the impact on our analysis is very limited)
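
The three breakdowns listed above amount to counting the unverified records per dataset, per class and per year. A minimal sketch, assuming records are dictionaries keyed by Darwin Core term names; `count_by`, the `dataset-a`/`dataset-b` keys and the sample records are hypothetical.

```python
from collections import Counter


def count_by(records, field):
    """Count records per value of `field`; a missing value counts as ''."""
    return Counter(record.get(field) or "" for record in records)


# Made-up sample records for illustration.
records = [
    {"identificationVerificationStatus": "unverified", "datasetKey": "dataset-a", "class": "Aves", "year": 2018},
    {"identificationVerificationStatus": "unverified", "datasetKey": "dataset-a", "class": "Aves", "year": 2017},
    {"identificationVerificationStatus": "unverified", "datasetKey": "dataset-b", "class": "Insecta", "year": 2018},
    {"identificationVerificationStatus": "verified", "datasetKey": "dataset-b", "class": "Insecta", "year": 2018},
]

unverified = [r for r in records if r["identificationVerificationStatus"] == "unverified"]
for field in ("datasetKey", "class", "year"):
    # most_common() lists the values in descending order of count.
    print(field, count_by(unverified, field).most_common())
```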

Stay tuned 📻

@timadriaens
Member Author

Ok, but I think removing records that are explicitly marked as unverified is indeed fine.

@damianooldoni
Contributor

As promised, a little more insight into the 6.652.040 unverified records in our GBIF download (date of download: 28 Jan 2020) containing occurrences in BE.

Datasets

Around 77% of the unverified data come from Waarnemingen.be - Bird occurrences in Flanders and the Brussels Capital Region, Belgium. Almost all of the datasets are "Natuurpunt" related data. One comes from Wallonia: Observations.be - Non-native species occurrences in Wallonia, Belgium. There is also an INBO dataset: Vlinderdatabank - Butterflies in Flanders and the Brussels Capital Region, Belgium.

title n datasetKey
Waarnemingen.be - Bird occurrences in Flanders and the Brussels Capital Region, Belgium 5137863 e7cbb0ed-04c6-44ce-ac86-ebe49f4efb28
Waarnemingen.be - Plant occurrences in Flanders and the Brussels Capital Region, Belgium 442505 bfc6fe18-77c7-4ede-a555-9207d60d1d86
Waarnemingen.be - Butterfly occurrences in Flanders and the Brussels Capital Region, Belgium 328363 1f968e89-ca96-4065-91a5-4858e736b5aa
Waarnemingen.be - Non-native animal occurrences in Flanders and the Brussels Capital Region, Belgium 281091 9a0b66df-7535-4f28-9f4e-5bc11b8b096c
Waarnemingen.be - Hymenoptera occurrences in Flanders and the Brussels Capital Region, Belgium 168301 71cfd412-6327-4ec7-8035-d8b2d0509ac5
Waarnemingen.be - Orthoptera occurrences in Flanders and the Brussels Capital Region, Belgium 99233 958b1d2f-2d11-4e94-a828-c8e2d2c013ca
Waarnemingen.be - Non-native plant occurrences in Flanders and the Brussels Capital Region, Belgium 61194 7f5e4129-0717-428e-876a-464fbd5d9a47
Observations.be - Non-native species occurrences in Wallonia, Belgium 44387 629befd5-fb45-4365-95c4-d07e72479b37
Waarnemingen.be - Hemiptera occurrences in Flanders and the Brussels Capital Region, Belgium 43826 37e094f3-dcf2-469f-93a2-c4b9b5fa7275
Vlinderdatabank - Butterflies in Flanders and the Brussels Capital Region, Belgium 20478 7888f666-f59e-4534-8478-3a10a3bfee45
Waarnemingen.be - Fish occurrences in Flanders and the Brussels Capital Region, Belgium 13963 8124cd73-ac84-43d2-ab39-1d80dc346525
Waarnemingen.be - Other insect occurrences in Flanders and the Brussels Capital Region, Belgium 10836 27e9e069-2862-4183-bcec-1e1a7f74d3e7

Classes

Below is the distribution of unverified occurrences at class level, ordered by n, the number of occurrences. An empty class value means the taxon does not belong to any class.

class kingdom n
Aves Animalia 5371281
Insecta Animalia 684639
Magnoliopsida Plantae 355435
Liliopsida Plantae 136518
Mammalia Animalia 51388
Actinopterygii Animalia 17016
Polypodiopsida Plantae 9383
(empty) Plantae 4767
Pinopsida Plantae 4084
Reptilia Animalia 3711
Amphibia Animalia 2973
Gastropoda Animalia 2732
Bryopsida Plantae 2033
Bivalvia Animalia 1401
Malacostraca Animalia 1211
(empty) Animalia 1201
Elasmobranchii Animalia 626
Jungermanniopsida Plantae 530
Leotiomycetes Fungi 293
(empty) incertae sedis 173
Phaeophyceae Chromista 136
Maxillopoda Animalia 94
Tentaculata Animalia 89
Lycopodiopsida Plantae 82
Ascidiacea Animalia 58
Arachnida Animalia 47
Cephalaspidomorphi Animalia 30
Hydrozoa Animalia 20
Polychaeta Animalia 19
Ginkgoopsida Plantae 15
Florideophyceae Plantae 13
Demospongiae Animalia 10
Gymnolaemata Animalia 9
Chilopoda Animalia 6
Anthozoa Animalia 5
Agaricomycetes Fungi 4
Phylactolaemata Animalia 4
Cephalopoda Animalia 2
Clitellata Animalia 1
Leptocardii Animalia 1

Years

Distribution among years, in a plot (from 1980) and in a table where years are given in descending order of number of occurrences, n. As the GBIF download was triggered on 2020-01-28 and the waarnemingen.be datasets are updated monthly, there are no data for 2020 yet, and so no unverified occs for 2020. There are also far fewer unverified data for 2019, due to the typical publishing delay, which is longer than 28 days. Both are as expected.

[Plot: number of unverified occurrences per year, from 1980]

year n
2018 934689
2017 720012
2016 663146
2015 574543
2014 508016
2013 467497
2012 437100
2011 432699
2010 422923
2009 347533
2008 162985
2007 81114
2005 65306
2006 64999
2004 53171
2019 51730
1996 43995
2003 39278
2002 38855
1999 38227
1997 36425
2000 36116
1998 35821
1995 35511
2001 34515
1994 32038
1993 30831
1992 25246
1991 23535
1987 20743
1984 20421
1986 19869
1990 16730
1985 16152
1988 15827
1981 13706
1989 13307
1982 10714
1983 10649
1980 8561
1979 6545
1978 5408
1974 4790
1975 4563
1973 4532
1976 3675
1972 3459
1977 3453
1971 1746
1968 957
1969 728
1958 648
1959 534
1967 504
1970 482
1966 476
1964 452
1962 420
1965 387
1963 371
1961 363
1960 347
1957 345
1956 311
1955 306
1954 128
1919 123
1948 108
... < 100

I hope this first analysis gives you all more elements to discuss. I would remove these data. I think that data quality is as important as transparency in science.

@timadriaens
Member Author

Indeed @damianooldoni, this is as expected. obs.be/wnm.be have a well-established validation flow and therefore have the field identificationVerificationStatus filled. Data from the Vlinderdatabank are high-quality atlas data and not very relevant to TrIAS since they contain no non-native species (except for the cube used for survey effort correction, but I guess for that we don't need to exclude unverified records as it is all about the effort). The distribution looks like it follows the same trend as the total number of observations.

@damianooldoni @peterdesmet @qgroom @SoVDH @amyjsdavis @DiederikStrubbe we exclude the unverified records from the occurrence-based indicators. But do we keep all records to build the cube, assuming that even an unverified record represents survey effort? Or how do we deal with this?

@SoVDH

SoVDH commented Jun 10, 2020

At the risk of sounding like an extremist, I'd rule them out.

@damianooldoni
Contributor

I agree with @SoVDH for two reasons:

  1. a minimum of data quality (= validation) is extremely important, no matter the goal the data are used for
  2. the indicators are built upon the cube, where data are already aggregated per year, taxon and grid cell. Making a distinction between verified and unverified data means adding an extra column validation (TRUE or FALSE) to keep things tidy. It would make the occurrence cube more difficult to understand and I don't think it's worth it.
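
To make the second point concrete, here is a minimal sketch of the aggregation with and without a validation column. The field names (`year`, `taxonKey`, `cell`, `status`), the function `build_cube` and the sample records are assumptions for illustration, and this is not how the occurrence cube is actually built.

```python
from collections import Counter


def build_cube(records, split_by_validation=False):
    """Aggregate occurrences into counts per (year, taxon, grid cell).

    With split_by_validation=True a fourth key is added: a flag that is
    True for any record not explicitly marked "unverified". That flag is
    a simplification; an empty status is not actually validated.
    """
    cube = Counter()
    for record in records:
        key = (record["year"], record["taxonKey"], record["cell"])
        if split_by_validation:
            key += (record["status"] != "unverified",)
        cube[key] += 1
    return cube


# Made-up sample records for illustration.
records = [
    {"year": 2018, "taxonKey": 101, "cell": "cell-1", "status": "verified"},
    {"year": 2018, "taxonKey": 101, "cell": "cell-1", "status": "unverified"},
    {"year": 2018, "taxonKey": 101, "cell": "cell-2", "status": ""},
]

# Option 1: filter before aggregating, the cube keeps three key columns.
filtered_cube = build_cube(r for r in records if r["status"] != "unverified")
# Option 2: keep everything and carry the flag as a fourth key column.
flagged_cube = build_cube(records, split_by_validation=True)
print(len(filtered_cube), len(flagged_cube))
```

Filtering first keeps the cube at one row per year, taxon and cell; carrying the flag can split a cell into two rows, which every downstream user then has to handle.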

@amyjsdavis

amyjsdavis commented Jun 10, 2020 via email

@timadriaens
Member Author

timadriaens commented Jun 10, 2020

ok of course, but not sure @amyjsdavis will like it?

@amyjsdavis

amyjsdavis commented Jun 11, 2020

@DiederikStrubbe and I discussed this and now I have a better understanding. I am ok with you excluding the unverified data and I don't think this will substantially change the risk models.

@peterdesmet
Member

If we want to exclude records that are marked as unvalidated (I'm fine with that), I suggest to do that for all processing (alien cube + all cube) and all datasets. It is clearer to explain.

@amyjsdavis

amyjsdavis commented Jun 11, 2020

@SoVDH: I have completed 17 of the 19 plant species SDMs for the risk assessment. These of course include records labelled "unvalidated" or "unverified". Would you prefer that I run them again with these data excluded, or do you want the maps now? It will take a few days, but it can be done.

@timadriaens
Member Author

Yes, I think that would be better.

@damianooldoni
Contributor

@amyjsdavis: I thought you were using the cube for Europe I made for your SDM (eu_modellingtaxa_cube.csv, metadata: eu_modellingtaxa_info.csv). In this cube there is no way to exclude unverified occurrences. So, I wonder which occurrence data you are using.

By the way, I will try to make a new version of the cubes before end of June.

@amyjsdavis

@damianooldoni: These are a different set of species (the plant species on the Species selection for PRA-ing list on Google Drive). They are different from the modellingtaxa list, and thus there is no European cube. I realized that in the future there will likely not be a cube for every species that is to be evaluated, so my modelling flow has the option to use the cube as an input if it already exists, or to process data directly downloaded from GBIF. As you may recall, I have to download global data for each species anyway for the models.

@timadriaens
Member Author

@amyjsdavis it's good to keep the options open, but there is a cube for every species on the unified and in fact on every spp.

@timadriaens
Member Author

Are your Belgian maps just crops of an EU/global risk map, or how does that work?

@amyjsdavis

@timadriaens : indeed, there is a cube for every species, but only for Belgium. The risk maps for Belgium are essentially a crop of a European risk model.

@timadriaens
Member Author

Are we planning to do anything with the European maps? There is certainly interest, cf. Crassula helmsii and Muntiacus reevesi.

@damianooldoni
Contributor

I would stop this interesting discussion here as it has nothing to do with verification anymore. I started a new one here: trias-project/occ-cube-alien#25

@amyjsdavis

I have also seen "unvalidated" as a value for identificationVerificationStatus. Should those data also be excluded?

@peterdesmet
Member

@amyjsdavis That was the term we were discussing or am I missing something?

@peterdesmet
Member

Oh, you mean “unvalidated” in addition to “unverified”. Yes, those should ideally be removed as well. Which dataset did you find those in?

@amyjsdavis

amyjsdavis commented Jun 22, 2020

Yes, I found it in my global download for the plants for the risk assessment. I just happened to notice it for Symphyotrichum lanceolatum. The dataset provider is urn:lsid:swedishlifewatch.se:DataProvider:1, the dataset name is Artportalen (Swedish Species Observation System).

@amyjsdavis

My global download dataset is here: https://doi.org/10.15468/dl.ruaasw

@damianooldoni
Contributor

This issue can be closed as we filter out unvalidated data. See https://github.com/trias-project/occ-cube/blob/master/src/2_create_db.Rmd#L252-L260 and https://github.com/trias-project/occ-cube-alien/blob/master/src/belgium/2_create_db.Rmd#L288-L296 for the identificationVerificationStatus values whose occurrences we filter out.
