
general purpose rate limiting across pkg #320

Closed
sckott opened this issue Aug 29, 2018 · 22 comments

@sckott
Contributor

sckott commented Aug 29, 2018

via a request from GBIF (email titled "Re: Help on server error")

Apply to all routes except the download request API - though it should be used for some download routes, e.g. for checking download status.

@sckott sckott added this to the v1.1 milestone Aug 29, 2018
@sckott
Contributor Author

sckott commented Sep 13, 2018

Can folks help me test this? On the branch rate-limit I've added rate limiting across the package, so regardless of the function used you should only be able to make 60 requests per minute. You don't have to do anything special - just use the package as usual. You can, for example, check how long things are taking with system.time(), the microbenchmark package, or similar.

Install it with remotes::install_github("ropensci/rgbif@rate-limit")
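
One quick way to check the throttling is to time a batch of small requests; with a 60-requests-per-minute limit, 10 calls should take on the order of 10 seconds. This is only a sketch of such a test (the species name and call count are arbitrary examples):

library(rgbif)

# time 10 small requests; on the rate-limit branch this should take
# roughly 10 seconds, on master well under a second per call
system.time({
  for (i in 1:10) {
    occ_search(scientificName = "Ursus americanus", limit = 1)
  }
})

# alternatively, look at the timing distribution with microbenchmark
# microbenchmark::microbenchmark(
#   occ_search(scientificName = "Ursus americanus", limit = 1),
#   times = 10
# )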

@damianooldoni @dmcglinn @MattBlissett @jkmccarthy @jwhalennds @poldham @andzandz11

Let me know if you see any potential issues with the internal helper that does the waiting between requests: https://github.com/ropensci/rgbif/blob/rate-limit/R/HttpStore.R
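
For anyone who doesn't want to read the branch code, the general idea of such a helper is roughly the sketch below - a minimal fixed-rate limiter written from scratch, not the actual HttpStore.R implementation (function and variable names here are made up):

# before each request, wait until at least `interval` seconds have
# passed since the previous request
make_limiter <- function(requests_per_minute = 60) {
  interval <- 60 / requests_per_minute
  last <- Sys.time() - interval  # allow the first request immediately
  function() {
    elapsed <- as.numeric(difftime(Sys.time(), last, units = "secs"))
    if (elapsed < interval) Sys.sleep(interval - elapsed)
    last <<- Sys.time()
    invisible(NULL)
  }
}

wait_for_slot <- make_limiter(60)
# call wait_for_slot() immediately before each HTTP request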

@sckott
Contributor Author

sckott commented Sep 17, 2018

@maelle can you give this a try and see if you find any problems?

@poldham

poldham commented Sep 17, 2018

@sckott I have installed it and will try to find something to give this a whirl with.

@maelle
Member

maelle commented Sep 17, 2018

Looks fine, but I only tested this:

days <- seq(from = Sys.Date() - 240,
            to = Sys.Date(),
            by = 1)

get_one_day <- function(day){
  date <- format(day, "%Y-%m-%d")
  result <- rgbif::occ_search(eventDate = date,
                    country = "fr",
                    limit = 1)$data
  result$time <- Sys.time()
  result
}

results <- purrr::map_df(days, get_one_day)
unique(results$time)
#>   [1] "2018-09-17 16:54:39 CEST" "2018-09-17 16:54:40 CEST"
#>   [3] "2018-09-17 16:54:41 CEST" "2018-09-17 16:54:42 CEST"
#>   [5] "2018-09-17 16:54:43 CEST" "2018-09-17 16:54:44 CEST"
#>  ... (241 unique timestamps in total, one roughly every second) ...
#> [241] "2018-09-17 16:58:39 CEST"

Created on 2018-09-17 by the reprex package (v0.2.0).

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.0 (2018-04-23)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  tz       Europe/Paris                
#>  date     2018-09-17
#> Packages -----------------------------------------------------------------
#>  package    * version    date       source                         
#>  assertthat   0.2.0      2017-04-11 CRAN (R 3.5.0)                 
#>  backports    1.1.2      2017-12-13 CRAN (R 3.5.0)                 
#>  base       * 3.5.0      2018-04-23 local                          
#>  bindr        0.1.1      2018-03-13 CRAN (R 3.5.0)                 
#>  bindrcpp     0.2.2      2018-03-29 CRAN (R 3.5.0)                 
#>  colorspace   1.4-0      2018-08-14 R-Forge (R 3.5.1)              
#>  compiler     3.5.0      2018-04-23 local                          
#>  crayon       1.3.4      2017-09-16 CRAN (R 3.5.0)                 
#>  crul         0.6.0      2018-07-10 CRAN (R 3.5.0)                 
#>  curl         3.2        2018-03-28 CRAN (R 3.5.0)                 
#>  data.table   1.11.4     2018-05-27 CRAN (R 3.5.0)                 
#>  datasets   * 3.5.0      2018-04-23 local                          
#>  devtools     1.13.6     2018-06-27 CRAN (R 3.5.1)                 
#>  digest       0.6.17     2018-09-12 CRAN (R 3.5.1)                 
#>  dplyr        0.7.6      2018-06-29 CRAN (R 3.5.1)                 
#>  evaluate     0.11       2018-07-17 CRAN (R 3.5.1)                 
#>  geoaxe       0.1.0      2016-02-19 CRAN (R 3.5.0)                 
#>  ggplot2      3.0.0      2018-07-03 CRAN (R 3.5.1)                 
#>  glue         1.3.0      2018-07-17 CRAN (R 3.5.0)                 
#>  graphics   * 3.5.0      2018-04-23 local                          
#>  grDevices  * 3.5.0      2018-04-23 local                          
#>  grid         3.5.0      2018-04-23 local                          
#>  gtable       0.2.0      2016-02-26 CRAN (R 3.5.0)                 
#>  htmltools    0.3.6      2017-04-28 CRAN (R 3.5.1)                 
#>  httpcode     0.2.0      2016-11-14 CRAN (R 3.5.0)                 
#>  httr         1.3.1      2017-08-20 CRAN (R 3.5.0)                 
#>  jsonlite     1.5        2017-06-01 CRAN (R 3.5.0)                 
#>  knitr        1.20       2018-02-20 CRAN (R 3.5.0)                 
#>  lattice      0.20-35    2017-03-25 CRAN (R 3.5.0)                 
#>  lazyeval     0.2.1      2017-10-29 CRAN (R 3.5.0)                 
#>  lubridate    1.7.4      2018-04-11 CRAN (R 3.5.0)                 
#>  magrittr     1.5        2014-11-22 CRAN (R 3.5.0)                 
#>  memoise      1.1.0      2017-04-21 CRAN (R 3.5.0)                 
#>  methods    * 3.5.0      2018-04-23 local                          
#>  munsell      0.5.0      2018-06-12 CRAN (R 3.5.0)                 
#>  oai          0.2.2      2016-11-24 CRAN (R 3.5.0)                 
#>  pillar       1.3.0      2018-07-14 CRAN (R 3.5.1)                 
#>  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.5.0)                 
#>  plyr         1.8.4      2016-06-08 CRAN (R 3.5.0)                 
#>  purrr        0.2.5      2018-05-29 CRAN (R 3.5.0)                 
#>  R6           2.2.2      2017-06-17 CRAN (R 3.5.0)                 
#>  Rcpp         0.12.18    2018-07-23 CRAN (R 3.5.0)                 
#>  rgbif        1.0.2.9421 2018-09-17 Github (ropensci/rgbif@6584a42)
#>  rgeos        0.3-28     2018-06-08 CRAN (R 3.5.1)                 
#>  rlang        0.2.2      2018-08-16 CRAN (R 3.5.1)                 
#>  rmarkdown    1.10       2018-06-11 CRAN (R 3.5.0)                 
#>  rprojroot    1.3-2      2018-01-03 CRAN (R 3.4.3)                 
#>  scales       1.0.0      2018-08-09 CRAN (R 3.5.1)                 
#>  sp           1.3-1      2018-06-05 CRAN (R 3.5.0)                 
#>  stats      * 3.5.0      2018-04-23 local                          
#>  stringi      1.2.4      2018-07-23 local                          
#>  stringr      1.3.1      2018-05-10 CRAN (R 3.5.0)                 
#>  tibble       1.4.2      2018-01-22 CRAN (R 3.5.0)                 
#>  tidyselect   0.2.4      2018-02-26 CRAN (R 3.5.0)                 
#>  tools        3.5.0      2018-04-23 local                          
#>  triebeard    0.3.0      2016-08-04 CRAN (R 3.5.0)                 
#>  urltools     1.7.1      2018-08-03 CRAN (R 3.5.1)                 
#>  utils      * 3.5.0      2018-04-23 local                          
#>  whisker      0.3-2      2013-04-28 CRAN (R 3.4.0)                 
#>  withr        2.1.2      2018-03-15 CRAN (R 3.4.4)                 
#>  xml2         1.2.0      2018-01-24 CRAN (R 3.5.0)                 
#>  yaml         2.2.0      2018-07-25 CRAN (R 3.5.1)

@sckott
Contributor Author

sckott commented Sep 17, 2018

thanks @maelle!

looks like it's working as expected

@Andreas-Bio

Andreas-Bio commented Sep 19, 2018

Here are the timings (in seconds) of 10 iterations each, before the change:

      min       lq   mean  median       uq     max neval
 4998.093 5016.831 5110.7 5108.02 5189.526 5248.55    10

The run with the rate-limited branch is still going after 20 hours, and it seems the first iteration has only just finished! Everything is slowed down from 29.72 requests per second to an average of 1.018 requests per second.
So this is definitely an unacceptable slowdown for me. If this change gets pushed through without an option to skip the limit, the package becomes unusable for me.

Can you please post the email from GBIF that requested you to make this change? Which official board was responsible for this?

@dnoesgaard @MattBlissett @mdoering

@sckott
Contributor Author

sckott commented Sep 19, 2018

thanks for testing @andzandz11!

Yes, the goal was to limit to 1 request per second, or 60 requests per minute, as requested by GBIF.

I don't have a sense of whether GBIF is flexible on this or not. Any thoughts, @MattBlissett @timrobertson100?

@timrobertson100

Thanks @andzandz11 and @sckott

The request to explore this came from me, as there have been a few instances recently where rogue scripts (e.g. infinite loops) have been issuing a lot of requests to GBIF.org. When it comes to the occurrence APIs of GBIF, it makes little sense to issue a lot of deep paging requests when a single download call can bring back any filtered result set far more efficiently, and with DOI-based citation. I asked Scott to explore options to rate limit in the client, as we also explore dynamic throttling based on IP to safeguard the services.

It would be helpful to understand what query patterns require you to hit the GBIF occurrence search services so often from a single R application. Normally we'd recommend the download service for that. Can you elaborate on your use case, please?

@damianooldoni
Collaborator

Sorry for the late reply.
Typically I need to retrieve extension information such as distribution, description and species profile for thousands of taxa in several species checklists (no occurrences involved!).
For example, retrieving distributions for more than 2600 taxa takes 40 minutes on the rate-limit branch instead of 4 minutes on the master branch.
I agree with @timrobertson100 about the correct use of asynchronous downloads for occurrences. Maybe setting a rate limit only for occurrences, and not for checklist-related functions, would be an option?
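
For context, that checklist workflow is roughly one API call per taxon, along the lines of the sketch below (assuming rgbif's name_usage() with data = "distributions" and that each result carries its records in a data element; the taxon keys are placeholder examples, the real checklists contain thousands of keys):

library(rgbif)

taxon_keys <- c(2435099, 5231190, 1340251)  # placeholder GBIF taxon keys

# one request per taxon: at 60 requests per minute, thousands of keys
# take tens of minutes instead of a few minutes
distributions <- lapply(taxon_keys, function(k) {
  name_usage(key = k, data = "distributions")$data
})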

@Andreas-Bio

Andreas-Bio commented Sep 20, 2018

I am regularly building barcode reference databases from scratch using an R script (data from GenBank). I re-build these databases from time to time to fix errors or to incorporate new sequences that have been published on GenBank. In the same script I call the GBIF backbone to get the species key, and then use this key to count occurrences in multiple countries with count_facet to score presence/absence (a rough sketch of this loop follows below). I have roughly 74,000 species in the database, and apart from downloading tens of GB of .csv files I see no other way than to loop over them in R. The script is fully automated and works really well: it is fast, always up to date, has a small memory footprint, uses very little bandwidth and leaves no trash behind (R is really bad at getting data out of RAM, and it slows my machine down considerably during the run). Just downloading nearly world-wide occurrence data for plants would make my RAM explode; I can't do that. It is also overkill, because I just need the occurrence counts per country. If somebody knows how to download a table containing all plant species vs. country occurrence counts, please let me know.
My suggestion would be to introduce the request throttling but exempt API-key users at the same time, so people who go through the effort of requesting an API key can still work undisturbed. That has the additional benefit of being able to lock individual API keys if they are misused.
Being able to spontaneously try out new scripts without having to design them around some kind of request limit is so valuable to my project (in a time = salary sense) that I would rather buy an API key with a high limit than have my development time slowed down.
I am no fan of IP throttling; if you start throttling IPs from some German universities, people will be very upset, I guess. It is also opaque and frustrating, because the first tests of your script run very fast, and then the full run mysteriously needs the whole day.
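
A rough sketch of that per-species counting loop - using name_backbone() plus occ_count() per country for illustration rather than the count_facet() helper mentioned above; the species names and country codes are just examples, the real list has tens of thousands of names:

library(rgbif)

species <- c("Abies alba", "Quercus robur")  # example names only
countries <- c("DE", "FR", "ES")             # example ISO country codes

# look up the GBIF backbone key for each name, then count occurrences
# per country; presence/absence can be scored from the counts
counts <- sapply(species, function(sp) {
  key <- name_backbone(name = sp)$usageKey
  sapply(countries, function(cc) occ_count(taxonKey = key, country = cc))
})
# `counts` is a countries x species matrix of occurrence counts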

@timrobertson100

Thank you @damianooldoni and @andzandz11 for taking the time to clarify your use cases. It is great to hear that you find the services useful, and please be assured that our objective is to ensure quality of service and not to negatively affect real usage.

Based on the feedback, I propose that this not be included, @sckott, and that GBIF consider alternatives - in particular, that we only activate defensive throttling when we observe issues (e.g. a DDoS), which is not the norm. Thank you for exploring this though - and sorry to have taken up your time.

Off topic to this thread:
@andzandz11 - we are going to be expanding the output formats from GBIF in the coming weeks/months. The first will be species lists derived from occurrence searches, which is already in testing. Would it be of any interest to have a service that allows a list of species to be POSTed, with the response being a matrix of "species, country, count", for example? If you could help specify any formats that would be immediately useful to you, please let us know (informatics@gbif.org).

@peterdesmet
Member

@timrobertson100 allowing a species list (list of species IDs) to be POSTed as a parameter for an occurrence search/download would certainly cater to our main use case for the TrIAS project!

@poldham

poldham commented Sep 20, 2018

@timrobertson100 I would also like to support the POST method. I frequently end up with a few thousand species of interest (for national reports, for example) and want to retrieve the occurrence data only for those species using their IDs. At present that would involve making individual calls (e.g. for 4,000 species) or combining them into one query, which will run for a while and then fail. My workaround has been to use bounding boxes on the website etc., but that involves too much guesswork and a lot of unnecessary data (e.g. I recently downloaded the whole of South East Asia to get at marine species with occurrences in the ASEAN region). So I think a POST method would be a great help to those of us working with species data at the level of thousands. On rate limiting, I recognise the need for it in some circumstances, but if it can be avoided that really would be much better.

@Andreas-Bio

It would also be very helpful to be able to specify which fields you want returned. For example, I have 80000 species and I just need the "country" data, but right now, using the website download function, I have to get the whole dataset, which is 99 GB and too big for R to handle properly. Even with the POST method, the world-wide dataset being returned would be too large.

@timrobertson100

timrobertson100 commented Sep 20, 2018

Thank you all - very useful.

Would it be of any interest to allow a user to post a SQL statement for an asynchronous download?

It would be for the more experienced user, would take a few minutes to return, and we'd probably need to sanitise it and offer only a subset of SQL (single table, aggregations, groupings, etc.), but we could allow e.g.:

-- species richness by 10-degree latitudinal band (pseudo SQL for example)
SELECT FLOOR(decimalLatitude / 10) AS latitudeBand,
       COUNT(DISTINCT species) AS speciesCount
FROM occurrence
WHERE genusKey = ... AND ...
GROUP BY latitudeBand

CC @MattBlissett @gbif for info as we consider options

@peterdesmet
Member

@timrobertson100 that would be really nice! Since you can request aggregated data via SQL, I assume the downloads would be of another type than the current GBIF occurrence downloads?

@sckott
Contributor Author

sckott commented Sep 20, 2018

@timrobertson100 no worries, not a waste of time - I had fun writing it

@timrobertson100

@peterdesmet

Yes, a SQL download would be a new service. I have wondered about it several times, and I have seen a few instances recently where I think it might be an enabling service.

@sckott sckott removed this from the v1.1 milestone Sep 25, 2018
@sckott
Contributor Author

sckott commented Sep 25, 2018

I'll leave this issue open and leave the work on the branch (rate-limit) in case we need to roll it in later. Thanks all!

@maelle
Member

maelle commented Sep 14, 2023

@jhnwllr fun coincidence that you're closing this now, as I just added throttling/rate limiting to another package! 😸

@jhnwllr
Collaborator

jhnwllr commented Sep 14, 2023

@maelle I closed it because I thought the issue had become somewhat out of date. I am not sure any rate limiting is needed at all - I have been abusing the GBIF API for years and it's fine.

@maelle
Member

maelle commented Sep 14, 2023

oh yeah, that makes sense! I was just reacting to the coincidence of topics, not judging the decision. 😃
