Long species list downloads via occ_download #362

Open
jhnwllr opened this issue Jul 12, 2019 · 22 comments

@jhnwllr

commented Jul 12, 2019

Recently (see discussion here), it has become possible to request downloads for thousands of taxon keys from GBIF via an HTTP POST request. I recently made a successful test download using 9000 taxon keys.

This is possible because the 'in' predicate is much more compact than a list of 'equals' predicates, so many more keys fit within the 12,000-character limit.

 curl --include --user user:password --header "Content-Type: application/json" --data @file_9000.json http://api.gbif.org/v1/occurrence/download/request

Where file_9000.json looks like this but with MANY more taxon keys.

{
  "creator": "jwaller",
  "notification_address": [
    "jwaller@gbif.org"
  ],
  "sendNotification": true,
  "format": "SIMPLE_CSV",
  "predicate": {
    "type": "and",
    "predicates": [
      {
        "type": "in",
        "key": "TAXON_KEY",
        "values": [
          1000003,
          1000094,
          ___MANY MORE KEYS____
          1000095,
          1000096
        ]
      }
    ]
  }
}
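For anyone scripting this outside R, a request body of this shape can be generated instead of written by hand. Below is a minimal sketch in Python (standard library only). The field names are copied from the JSON above; `build_in_request` is a hypothetical helper name, and the creator and email values are placeholders.

```python
import json

def build_in_request(taxon_keys, creator, email, fmt="SIMPLE_CSV"):
    """Build a download request body with a single 'in' predicate.

    Field names mirror the file_9000.json example above. This helper is
    hypothetical; it is not part of rgbif or of the GBIF API.
    """
    return {
        "creator": creator,
        "notification_address": [email],
        "sendNotification": True,
        "format": fmt,
        "predicate": {
            "type": "and",
            "predicates": [
                {"type": "in", "key": "TAXON_KEY", "values": list(taxon_keys)}
            ],
        },
    }

# Placeholder credentials and a handful of keys taken from the example above.
body = build_in_request(
    [1000003, 1000094, 1000095, 1000096],
    creator="user",
    email="user@example.org",
)

# Serialize to the JSON text that would go into the request file.
payload = json.dumps(body, indent=2)
print(payload)
```

The resulting text can be saved to a file and sent with the curl command shown earlier.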

I was wondering whether the same thing is possible with rgbif::occ_download, since making the JSON file yourself is not ideal for most users.

I tried the following but got an error:

library(rgbif)
occ_download("taxonKey = 2480946,5229208",body=NULL,type="in",format = "SIMPLE_CSV",user=user,pwd=pwd,email=email)

Warning message: package 'rgbif' was built under R version 3.5.3
Error: Instantiation of [simple type, class org.gbif.api.model.occurrence.predicate.InPredicate] value failed: may not be null
Execution halted

Note: using type="or" works fine, but it sends verbose JSON that cannot fit anywhere near 9000 taxon keys within the character limit.

'{
  "type": "or",
  "predicates": [
    {
      "type": "equals",
      "key": "TAXON_KEY",
      "value": "2480946"
    },
    {
      "type": "equals",
      "key": "TAXON_KEY",
      "value": "5229208"
    }
  ]
}'
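To make the size difference concrete, here is a rough sketch in Python that serializes the same list of keys both ways and compares the lengths. The keys are made-up numbers used only for their length, the helper names are hypothetical, and the compact serialization is an assumption about how the body would be sent.

```python
import json

def or_of_equals(keys):
    """One 'equals' predicate per key, wrapped in an 'or' (the verbose form)."""
    return {
        "type": "or",
        "predicates": [
            {"type": "equals", "key": "TAXON_KEY", "value": str(k)} for k in keys
        ],
    }

def in_predicate(keys):
    """A single 'in' predicate holding all keys (the compact form)."""
    return {"type": "in", "key": "TAXON_KEY", "values": list(keys)}

# 1000 made-up seven-digit keys, used only to measure size.
keys = list(range(1000000, 1001000))

# Serialize without extra whitespace so only the structure is compared.
compact = (",", ":")
or_len = len(json.dumps(or_of_equals(keys), separators=compact))
in_len = len(json.dumps(in_predicate(keys), separators=compact))

print("or of equals:", or_len, "characters")
print("in predicate:", in_len, "characters")
```

With 1000 keys the 'or' form is already well past 12,000 characters, while the 'in' form stays under it.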

Is there any way to get rgbif::occ_download to send JSON that looks like file_9000.json above, using fewer characters?

@sckott

Member

commented Jul 17, 2019

thanks @jhnwllr for this - sorry about delay, was on vacation.

(this link https://www.gbif.org/occurrence/download/0010219-190621201848488 is not giving back the download info)

having a look now

sckott added a commit that referenced this issue Jul 17, 2019
we werent creating the json body correctly when type in was used
add a minimal test for occ_download_prep for this
bump version
@sckott

Member

commented Jul 17, 2019

@jhnwllr fix pushed, reinstall from master. Should work now. The API docs have a slightly different JSON example from yours. Where the API docs have

"predicate": {
"type": "in",
...

using the predicate key and the value as a hash

Whereas you have

"predicates": [
{
"type": "in",
// ... etc.

using the predicates key and the value as an array

Does that difference matter? Are both allowed with the "in" type?

@sckott

Member

commented Jul 23, 2019

any thoughts @jhnwllr ?

@sckott sckott added this to the v1.4 milestone Jul 23, 2019
@jhnwllr

Author

commented Jul 24, 2019

hi @sckott, I have not been able to get it to work, even after the update.

occ_download("taxonKey = 2480946,5229208",body=NULL,type="in",format = "SIMPLE_CSV",user=user,pwd=pwd,email=email)

Error: Instantiation of [simple type, class org.gbif.api.model.occurrence.predicate.InPredicate] value failed: may not be null
Execution halted

Apparently the 9000-taxon-key download is an edge case that only works some of the time, so the reliable maximum is probably somewhat lower.

@sckott

Member

commented Jul 24, 2019

thanks - i did have a question for you in #362 (comment) - any ideas?

@sckott

Member

commented Jul 24, 2019

hmm, so you get that error on the occ_download("taxonKey = 2480946,5229208",body=NULL,type="in",format = "SIMPLE_CSV",user=user,pwd=pwd,email=email) example ? Or an example with 9000 taxon keys? I assume the latter

@sckott

Member

commented Jul 24, 2019

trying an example with 3100 taxon keys

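library(rgbif)
# page through name_suggest() 100 species at a time (offsets 0, 100, ..., 3000)
of <- seq(0, 3000, by = 100)
out <- list()
for (i in seq_along(of)) {
  out[[i]] <- name_suggest(rank = "species", limit = 100, start = of[i])
}
# combine the pages and pull out the taxon keys
df <- dplyr::bind_rows(out)
keys <- df$key
# request a download with all keys in a single "in" predicate
zz <- occ_download(sprintf("taxonKey = %s", paste0(keys, collapse = ",")),
    type = "in", format = "SIMPLE_CSV", curlopts = list(verbose = TRUE))
occ_download_meta(zz)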
#> <<gbif download metadata>>
#>   Status: SUCCEEDED
#>   Format: SIMPLE_CSV
#>   Download key: 0016520-190621201848488
#>   Created: 2019-07-24T17:19:40.819+0000
#>   Modified: 2019-07-24T17:24:54.153+0000
#>   Download link: http://api.gbif.org/v1/occurrence/download/request/0016520-190621201848488.zip
#>   Total records: 0
#>   Request:
#>     type:  in
#>     predicates:
#>       > type: in, key: TAXON_KEY, value(s): 1000003,1000005,1000006,...

On success, the metadata says there are 0 records (see the API call https://api.gbif.org/v1/occurrence/download/0016520-190621201848488), yet there is a download to fetch, and it has ~67K rows of data. Not sure what's going on there.

@jhnwllr

Author

commented Jul 25, 2019

> thanks @jhnwllr for this - sorry about delay, was on vacation.
>
> (this link https://www.gbif.org/occurrence/download/0010219-190621201848488 is not giving back the download info)
>
> having a look now

I think this 9000-taxonkey download was deleted because it does not work consistently. Apparently the number of taxonkeys has to be somewhat less than 9000. The 8000 taxonkey download I did is still viewable:
https://www.gbif.org/occurrence/download/0010212-190621201848488

There might also be other variables that I am not aware of that affect the maximum number of taxonkeys the system can handle. In any case, 3000 or so will probably work consistently. As for the 0 total records, it is apparently just a bug with this type of download that needs to be fixed.

I will try your example later to see if I can get it working.

@MattBlissett


commented Jul 25, 2019

> I think this 9000-taxonkey download was deleted because it does not work consistently.

We currently have a bug, gbif/registry#134, causing the Registry to overflow memory when retrieving some of these very long (≥ ~9000 taxa) queries. Please don't experiment with finding the limit on gbif.org; wait, or use gbif-uat.org instead (John too!). I have had to temporarily block queries to John's example downloads.

totalRecords being 0 is recorded here: gbif/occurrence#123 — increasing the limit for downloads exceeded a limit somewhere else.

With gbif/occurrence#49 we will determine what the limits are, and return an appropriate HTTP status (413 Payload Too Large) rather than accepting the request and having the download fail. The limit might be different depending on the query, e.g. large polygons vs a huge disjunction.

@peterdesmet

Member

commented Jul 26, 2019

Also pinging @damianooldoni on this issue, since we will use this functionality for TrIAS

@sckott

Member

commented Jul 27, 2019

Sounds good @MattBlissett - thanks for the info.

@MattBlissett


commented Aug 13, 2019

The limit is now 101,000 things, where every element in an "in" predicate counts as one, plus every other comparison predicate (equals, not, greaterThan etc).

The intention is that that allows for 100,000 taxa/occurrenceIds/collectionCodes etc, and any other necessary filters for country/dataset etc.

Geometry WKTs are limited to 10,000 points.

Requests that are too large will be rejected immediately, with HTTP 413 Payload Too Large and an appropriate error message.
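A client could estimate the size of a query before submitting it. The sketch below, in Python, is one reading of the counting rule described above and not the server's implementation: each value in an "in" predicate counts as one, "and"/"or" count only through their children, "not" counts as one plus whatever it wraps, and any other predicate counts as one. The `count_things` name and the shape assumed for "not" (a nested "predicate" field) are assumptions.

```python
MAX_THINGS = 101_000  # limit quoted in the comment above

def count_things(pred):
    """Estimate how many 'things' a predicate tree contains.

    This follows one interpretation of the rule described above and may
    differ from what the GBIF server actually counts.
    """
    kind = pred["type"]
    if kind == "in":
        # every element of an "in" predicate counts as one
        return len(pred["values"])
    if kind in ("and", "or"):
        # compound predicates contribute only through their children
        return sum(count_things(p) for p in pred["predicates"])
    if kind == "not":
        # assumed shape: {"type": "not", "predicate": {...}}
        return 1 + count_things(pred["predicate"])
    # equals, greaterThan, etc. each count as one
    return 1

query = {
    "type": "and",
    "predicates": [
        {"type": "in", "key": "TAXON_KEY", "values": list(range(60000))},
        {"type": "equals", "key": "HAS_GEOSPATIAL_ISSUE", "value": "false"},
    ],
}

n = count_things(query)
print(n, n <= MAX_THINGS)  # prints: 60001 True
```

Requests over the limit would still be rejected by the server with HTTP 413, so a check like this only saves a round trip.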

@sckott

Member

commented Aug 13, 2019

thanks @MattBlissett ! saw the email as well

@jhnwllr

Author

commented Aug 29, 2019

hi @sckott

I was able to get occ_download() to work with type="in", and downloaded ~60,000 taxa using it.

If I try to use another parameter like 'hasGeospatialIssue = FALSE' it fails...

Fails

occ_download(
'taxonKey = 2977832,2977901,2977966,2977835',
'hasGeospatialIssue = FALSE',
type="in", format = "SIMPLE_CSV",user=user,pwd=pwd,email=email)

Error: Instantiation of [simple type, class org.gbif.api.model.occurrence.predicate.InPredicate] value failed: may not be null

With type="and" it works, but I will not be able to download large taxa lists with this type.

@sckott

Member

commented Aug 30, 2019

thanks for the report @jhnwllr - having a look

@sckott

Member

commented Aug 30, 2019

Hmm, I may need to rework how occ_download works. Currently, when there is more than one predicate, the user can't specify the type for an individual predicate; the same type is applied to all of them, which isn't always what is wanted. Will push a branch soon

sckott added a commit that referenced this issue Sep 4, 2019
new helper fxns pred and pred_multi to create predicates to pass in to occ_download
filled out tests for changes, and added examples of usage
@sckott

Member

commented Sep 4, 2019

@jhnwllr if you have time, reinstall with remotes::install_github("ropensci/rgbif@downloads-rework"). It's a relatively big change in how you construct predicates; see the examples at https://github.com/ropensci/rgbif/blob/downloads-rework/R/occ_download.R#L141-L215. Instead of passing in a string like "decimalLatitude > 50", which we parse internally, you now pass each of those three parts into a function: pred("decimalLatitude", 50, ">"). Use pred_multi() for composing predicates with multiple values, combined with or or with in. In pygbif these methods are called other things, but I thought these names made more sense here
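To illustrate the idea of passing the three parts separately, here is a small Python sketch of helpers that turn (key, value, operator) into the JSON predicate shapes seen earlier in this thread. It is not rgbif's or pygbif's implementation: the function names are borrowed from the comment above purely for illustration, and the operator-to-type mapping and the camelCase to UPPER_SNAKE key conversion are assumptions.

```python
import re

# Operator symbols mapped to predicate type names (assumed spellings).
OPS = {
    "=": "equals",
    ">": "greaterThan",
    ">=": "greaterThanOrEquals",
    "<": "lessThan",
    "<=": "lessThanOrEquals",
}

def to_api_key(name):
    """Convert camelCase (taxonKey) to upper snake case (TAXON_KEY)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).upper()

def pred(key, value, op="="):
    """Single comparison predicate built from its three parts."""
    return {"type": OPS[op], "key": to_api_key(key), "value": value}

def pred_multi(key, values, kind="in"):
    """Several values for one key, combined with 'in' or with 'or'."""
    if kind == "in":
        return {"type": "in", "key": to_api_key(key), "values": list(values)}
    if kind == "or":
        return {"type": "or", "predicates": [pred(key, v) for v in values]}
    raise ValueError("kind must be 'in' or 'or'")

print(pred("decimalLatitude", 50, ">"))
print(pred_multi("taxonKey", [2977832, 2977901], "in"))
```

Keeping the operator as its own argument means nothing has to be parsed out of a string, and each predicate can carry its own type.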

let me know any thoughts

@MattBlissett


commented Sep 5, 2019

I haven't tested it, but that looks like a useful improvement. Queries like this aren't very common, since they can't be made using the website, but if R supports them they probably would be more common.

In case it's helpful, you can see this Java test class, which has several examples of download predicates (it then checks the generated SQL). I've highlighted the most complicated example: Human or machine observations of animals or archaea, in the UK or Ireland, observed in or before 1989 or in 2000.

https://github.com/gbif/occurrence/blob/fdea92e0421706d756a4f91e5a4eba627842e7b7/occurrence-download/src/test/java/org/gbif/occurrence/download/query/HiveQueryVisitorTest.java#L57-L81

Avoiding the verbosity of Java, but keeping the same structure, that could be represented as

and(
  in(BASIS_OF_RECORD, [MACHINE_OBSERVATION, HUMAN_OBSERVATION]),
  in(TAXON_KEY, [1, 2]),
  in(COUNTRY, [GB, IE]),
  or(
    lessThanEquals(YEAR, 1989),
    equals(YEAR, 2000)
  )
)

(The Java example uses a DisjunctionPredicate for the countries to test that the resulting SQL is the same as an InPredicate.)
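As a sketch of how that structure could map onto the JSON request format used earlier in this thread, the Python below builds the same tree from plain dictionaries. The helper names are hypothetical, and "lessThanOrEquals" is assumed to be the JSON spelling of the lessThanEquals comparison above.

```python
import json

def p_in(key, values):
    """'in' predicate: one key, many values."""
    return {"type": "in", "key": key, "values": list(values)}

def p_compare(kind, key, value):
    """Single comparison such as equals or lessThanOrEquals."""
    return {"type": kind, "key": key, "value": value}

def p_and(*preds):
    """Conjunction of child predicates."""
    return {"type": "and", "predicates": list(preds)}

def p_or(*preds):
    """Disjunction of child predicates."""
    return {"type": "or", "predicates": list(preds)}

query = p_and(
    p_in("BASIS_OF_RECORD", ["MACHINE_OBSERVATION", "HUMAN_OBSERVATION"]),
    p_in("TAXON_KEY", [1, 2]),
    p_in("COUNTRY", ["GB", "IE"]),
    p_or(
        p_compare("lessThanOrEquals", "YEAR", 1989),  # assumed type name
        p_compare("equals", "YEAR", 2000),
    ),
)

print(json.dumps(query, indent=2))
```

The nesting in the code mirrors the nesting in the pseudocode, so each and/or/in call corresponds to one JSON object.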

Also, the definitive list of possible search terms is here: https://gbif.github.io/gbif-api/apidocs/org/gbif/api/model/occurrence/search/OccurrenceSearchParameter.html

@jhnwllr

Author

commented Sep 5, 2019

hi thanks @sckott

The examples work fine for me!

I was able to get my previous download attempt to work also:

remotes::install_github("ropensci/rgbif@downloads-rework")

library(rgbif)

occ_download(
pred_multi("taxonKey", c(2977832, 2977901, 2977966, 2977835), "in"),
pred("hasGeospatialIssue", "FALSE"),
format = "SIMPLE_CSV",
user=user,pwd=pwd,email=email)

https://www.gbif.org/occurrence/download/0009769-190813142620410

I am working on a blog post about long species list downloads:
https://data-blog.gbif.org/post/downloading-long-species-lists-on-gbif/

I will add this method to the post if you think it is stable...

@MattBlissett


commented Sep 5, 2019

> Human or machine observations of animals or archaea, in the UK or Ireland, observed in or before 1989 or in 2000.

With different ids (so as not to create unnecessary huge downloads), an example of this is here: https://www.gbif-uat.org/occurrence/download/0000222-190802075855554 / http://api.gbif-uat.org/v1/occurrence/download/0000222-190802075855554

@sckott

Member

commented Sep 5, 2019

thanks very much @MattBlissett for the pointer to the Java examples, and for the simpler eg w/o Java verbosity. I'll see about doing the same tests here. And thanks for the enum link and the example query

@sckott

Member

commented Sep 5, 2019

thanks for testing it @jhnwllr

> I will add this method to the post if you think it is stable...

Not sure when you want to get the post out, but I want to add some more tests from the Java test suite, do a bit more testing, and ask other folks about the new user interface to make sure I'm not missing something. Once I merge to master I'd consider it stable: hopefully by tomorrow, maybe next Monday
