occurrencelist() not returning all the gbif records for a species #25

Closed
dmcglinn opened this issue Jul 27, 2013 · 17 comments

@dmcglinn
Contributor

In version 0.3.0, I noticed that the occurrencelist() function does not return all of the records that an identical query at http://data.gbif.org/ returns.

For example the query:

occurrencelist(scientificname = 'Aristolochia serpentaria', coordinatestatus = TRUE, maxresults = 1e6)

returns 179 records but the identical query at http://data.gbif.org/ returns 431 records.

I have not tried to track down the potential source of this discrepancy in the code yet. I also have not investigated if other versions of rgbif have similar issues.

@sckott
Contributor

sckott commented Jul 27, 2013

Hey @dmcglinn , answer in a second

@sckott
Contributor

sckott commented Jul 27, 2013

Note that occurrencelist and occurrencelist_many now return S3 objects, so you gotta use gbifdata to get the data (or convert it yourself I guess).

So, the problem is that the search is saying I want exactly "Aristolochia serpentaria", when what it seems like you want is that, but with variants, right?

Try this:

library(rgbif)
out <- gbifdata(occurrencelist(scientificname = 'Aristolochia serpentaria*', coordinatestatus = TRUE, maxresults = 1000))

unique(out$taxonName)

[1] Aristolochia serpentaria l. Aristolochia serpentaria   
Levels: Aristolochia serpentaria Aristolochia serpentaria l.

nrow(out)

[1] 96

Notice the asterisk after the taxon name, and that you get two names returned, one with l. , presumably for Linnaeus.

Gives 96 georeferenced records though, where GBIF gives 116 (GBIF does give 431 records as you said, but not all have lat/long data)

@sckott
Contributor

sckott commented Jul 27, 2013

Hi again @dmcglinn Here's the responsible line in the code https://github.com/ropensci/rgbif/blob/master/R/methods.r#L44

It removes rows that have NA's for both lat and long. And 20 of the 116 records have zeros for both lat and long, even on the GBIF site, see here http://data.gbif.org/ws/rest/occurrence/list?scientificname=Aristolochia%20serpentaria*&coordinatestatus=TRUE

So those zeros get converted to NA's and removed in rgbif since I assumed that people wouldn't be interested in records without lat/long data.
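To make that filtering step concrete, here is a minimal base R sketch of the behaviour described above. It is not the actual code at the linked line; the toy data and the column names decimalLatitude and decimalLongitude are assumptions for illustration only.

```r
# Toy occurrence table; values and column names are made up for illustration.
occ <- data.frame(
  taxonName        = c("A", "B", "C", "D"),
  decimalLatitude  = c(35.1, 0, 0, NA),
  decimalLongitude = c(-80.2, 0, -75.3, NA),
  stringsAsFactors = FALSE
)

# Treat a (0, 0) pair as missing coordinates and convert both values to NA.
zero_both <- !is.na(occ$decimalLatitude) & !is.na(occ$decimalLongitude) &
  occ$decimalLatitude == 0 & occ$decimalLongitude == 0
occ$decimalLatitude[zero_both]  <- NA
occ$decimalLongitude[zero_both] <- NA

# Drop rows where both latitude and longitude are NA.
both_na <- is.na(occ$decimalLatitude) & is.na(occ$decimalLongitude)
georef  <- occ[!both_na, ]
```

Row "C" survives because only its latitude is zero; rows "B" and "D" are dropped.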

What do you think?

@dmcglinn
Contributor Author

Hey @schamberlain thanks for the help and speedy replies on these issues. It does look like adding the '*' to the species name so that variants were returned was the primary issue, but also changing coordinatestatus to FALSE increased the number of returns as well. The query:

occurrencelist(scientificname = 'Aristolochia serpentaria*', coordinatestatus = FALSE, maxresults = 1e6)

returns 306 records which is identical to

http://data.gbif.org/ws/rest/occurrence/list?scientificname=Aristolochia%20serpentaria*&coordinatestatus=FALSE

but these queries do not return the full 431 items that a normal gbif species query on Aristolochia serpentaria returns. It appears that this may be due to the fact that a GBIF query returns a broader range of names. Specifically, the API query returns only these names:

library(rgbif)

out = occurrencelist(scientificname = 'Aristolochia serpentaria*', coordinatestatus = FALSE, maxresults = 1e6)

unique(out$taxonName)

[1] "aristolochia serpentaria"                                    
[2] "Aristolochia serpentaria L."                                 
[3] "Aristolochia serpentaria"                                    
[4] "Aristolochia serpentaria L. var. hastata (Nutt.) Duchartre"  
[5] "Aristolochia serpentaria L. var. hastata (Nuttall) Duchartre"
[6] "Aristolochia serpentaria var. hastata Duch."                 
[7] "Aristolochia serpentaria var. serpentaria"                   
[8] "Aristolochia serpentaria BARTON"  

whereas the GBIF query at http://data.gbif.org/ returns these names as well as synonym names such as:

Aristolochia hastata
Aristolochia nashii
Aristolochia convolvulacea Small
Aristolochia serpentaria var. hastata (Nutt.) Duchartre

@dmcglinn
Contributor Author

I checked that those additional names were indeed synonyms here: http://www.itis.gov/

@sckott
Contributor

sckott commented Jul 27, 2013

Interesting. So GBIF.org is giving back synonyms as well as actual matches of the query string, whereas their API does not do that. Let me see if there is a parameter that we could fiddle with to get exactly what they give back.

@sckott
Contributor

sckott commented Jul 27, 2013

The API docs say

scientificname - count only records where the scientific name matches that supplied - this is based on the scientific name found in the original record from the data provider and does not make use of extra knowledge of possible synonyms or of child taxa. For these functions, use taxonconceptkey.

and

taxonconceptkey - return only records which are for the taxon identified by the supplied numeric key, including any records provided under synonyms of the taxon concerned, and any records for child taxa (e.g. all genera and species within a family).

So the taxonconceptkey parameter does give back synonyms. However, from a user perspective, you would first have to get the taxonconceptkey, which is not ideal.

I'm guessing GBIF.org gets a taxonconceptkey based on your search, then looks up synonyms - but doesn't do this with the API - weird.
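For reference, a rough sketch of how the two kinds of query differ at the URL level. The base URL and the parameter names come from the links and docs quoted in this thread; build_query_url() is a hypothetical helper (not part of rgbif), and the key 12345 is a placeholder, not the real taxonconceptkey for this species.

```r
# Base URL taken from the occurrence/list links earlier in this thread.
base_url <- "http://data.gbif.org/ws/rest/occurrence/list"

# Hypothetical helper (not an rgbif function): turn named arguments
# into a URL-encoded query string appended to base_url.
build_query_url <- function(...) {
  params <- list(...)
  values <- vapply(params, function(v) URLencode(as.character(v)), character(1))
  pairs  <- paste0(names(params), "=", values)
  paste0(base_url, "?", paste(pairs, collapse = "&"))
}

# Name-based query: matches the name string as supplied by the data provider.
by_name <- build_query_url(scientificname = "Aristolochia serpentaria*",
                           coordinatestatus = FALSE)

# Key-based query: per the docs, this also covers synonyms and child taxa.
# 12345 is a placeholder key for illustration only.
by_key <- build_query_url(taxonconceptkey = 12345,
                          coordinatestatus = FALSE)
```

The by_name URL is the same one linked above; the by_key form would need the real key looked up first, which is the extra step mentioned.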

@dmcglinn
Contributor Author

Yea that's unfortunate, but I suppose one solution is the following?

library(rgbif)

gbifkey = taxonsearch(scientificname='Aristolochia serpentaria')$gbifkey
name_lkup = taxonget(key = as.numeric(as.character(gbifkey)))
sciname = as.character(subset(name_lkup, select='sciname', subset= rank == 'species' | rank == 'variety')[ ,1])
## to include variants add '*'
sciname = paste(sciname, '*', sep='')

out = occurrencelist_many(scientificname = sciname, coordinatestatus = FALSE, maxresults = 1e6)

out
$NumberFound
[1] 1845

However this now returns many more records than the original gbif.org query.

@sckott
Contributor

sckott commented Jul 27, 2013

Hmmm, was trying to get synonyms from ITIS and feed those into GBIF, but GBIF has different synonyms! Anyway, would be nice if GBIF had a synonyms API.

@dmcglinn
Contributor Author

The problem with the approach I proposed is that it does not guarantee that duplicate records are not returned. Does occurrencelist return a unique record identifier field we could filter results on to ensure lack of duplication?
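If such an identifier were available, the filtering could look something like the sketch below. The column name key and the toy data frames are assumptions for illustration; they are not what occurrencelist actually returns.

```r
# Toy results from two overlapping name queries; 'key' is a hypothetical
# unique record identifier column.
res_a <- data.frame(key = c(101, 102, 103),
                    taxonName = c("Aristolochia serpentaria",
                                  "Aristolochia serpentaria L.",
                                  "Aristolochia hastata"),
                    stringsAsFactors = FALSE)
res_b <- data.frame(key = c(103, 104),
                    taxonName = c("Aristolochia hastata",
                                  "Aristolochia nashii"),
                    stringsAsFactors = FALSE)

# Stack the results, then keep only the first row seen for each key.
combined <- rbind(res_a, res_b)
deduped  <- combined[!duplicated(combined$key), ]
```

Record 103 shows up in both queries but is kept only once.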

@sckott
Contributor

sckott commented Jul 27, 2013

going out for a bit...

@dmcglinn
Contributor Author

just posted this pull request to include unique id's with the query results: #29

@dmcglinn
Contributor Author

Once #29 is merged the following query will return the same number of results as the GBIF web portal:

library(rgbif)

gbifkey = taxonsearch(scientificname='Aristolochia serpentaria')$gbifkey
name_lkup = taxonget(key = as.numeric(as.character(gbifkey)))

sciname = unique(as.character(subset(name_lkup, select='sciname',
                   subset= rank == 'species' | rank == 'variety')[ ,1]))

sciname = paste(sciname, '*', sep='')

out = occurrencelist_many(scientificname = sciname, maxresults = 1e6)

out 
$NumberFound
[1] 431

That count of 431 matches the number of results returned when you do a simple web query for this species.

@ghost ghost assigned sckott Jul 28, 2013
@sckott
Contributor

sckott commented Jul 28, 2013

merged your pull, thanks for that!

What do you think @dmcglinn ? Should functions try to match exactly what happens in the GBIF web interface? Or not?

@dmcglinn
Contributor Author

I think you should provide the option for this with a new function, see my suggested solution in #30

The primary benefit in my mind is that if someone doesn't want to do the work of sorting out synonymy on their own and then querying each name individually, you can provide the option of using GBIF's internal synonym mapping to complete the query. There is also the added benefit of the similarity between the web interface and the R query, but that seems relatively minor (you'll probably just get fewer users complaining that something may have gone wrong). However, more functions in the package means more maintenance effort, so you may ultimately decide it's not worth it.

@sckott
Contributor

sckott commented Jul 28, 2013

Thanks for the new function!

Right, we should definitely strive to make it easier for users, which your function does.

I would like to have just one function that does everything with the occurrencelist endpoint, but I imagine that is too difficult b/c there is a lot going on there. Another thing not included is the ability to specify many values for the same parameter, discussed here #28 . Hoping that they will change that since it's a lot of waste to use named params over and over again.

@sckott
Contributor

sckott commented Sep 13, 2013

closing this for now

@sckott sckott closed this as completed Sep 13, 2013