general purpose rate limiting across pkg #320
Can folks help me test this? It's on the rate-limit branch; install from that branch to test. @damianooldoni @dmcglinn @MattBlissett @jkmccarthy @jwhalennds @poldham @andzandz11 Let me know if you see any potential issues with the internal helper that does the waiting between requests: https://github.com/ropensci/rgbif/blob/rate-limit/R/HttpStore.R
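For readers curious what such a waiting helper generally looks like, here is a minimal sketch of a Sys.sleep-based throttle. This is an illustration only, not the actual code in HttpStore.R; the function name `make_throttle` is invented for this example.

```r
# Hypothetical sketch of a per-process request throttle (NOT the real
# HttpStore.R implementation). A closure remembers the time of the last
# request and sleeps just long enough so that consecutive calls are at
# least `interval` seconds apart.
make_throttle <- function(interval = 1) {
  last <- NULL
  function() {
    now <- Sys.time()
    if (!is.null(last)) {
      wait <- interval - as.numeric(now - last, units = "secs")
      if (wait > 0) Sys.sleep(wait)
    }
    last <<- Sys.time()
    invisible(NULL)
  }
}

# Usage: call the throttle before each HTTP request, e.g.
wait_turn <- make_throttle(interval = 1)
# for (q in queries) { wait_turn(); do_request(q) }
```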
@maelle can you give this a try and see if you find any problems?
@sckott I have installed it and will try to find something to give this a whirl with.
Looks fine, but I only tested this:

days <- seq(from = Sys.Date() - 240,
            to = Sys.Date(),
            by = 1)

get_one_day <- function(day) {
  date <- format(day, "%Y-%m-%d")
  result <- rgbif::occ_search(eventDate = date,
                              country = "fr",
                              limit = 1)$data
  result$time <- Sys.time()
  result
}

results <- purrr::map_df(days, get_one_day)
unique(results$time)
#> [1] "2018-09-17 16:54:39 CEST" "2018-09-17 16:54:40 CEST"
#> [3] "2018-09-17 16:54:41 CEST" "2018-09-17 16:54:42 CEST"
#> ... one timestamp per second ...
#> [241] "2018-09-17 16:58:39 CEST"
Created on 2018-09-17 by the reprex package (v0.2.0).
Session info: devtools::session_info()
#> Session info -------------------------------------------------------------
#> setting value
#> version R version 3.5.0 (2018-04-23)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> tz Europe/Paris
#> date 2018-09-17
#> Packages -----------------------------------------------------------------
#> package * version date source
#> assertthat 0.2.0 2017-04-11 CRAN (R 3.5.0)
#> backports 1.1.2 2017-12-13 CRAN (R 3.5.0)
#> base * 3.5.0 2018-04-23 local
#> bindr 0.1.1 2018-03-13 CRAN (R 3.5.0)
#> bindrcpp 0.2.2 2018-03-29 CRAN (R 3.5.0)
#> colorspace 1.4-0 2018-08-14 R-Forge (R 3.5.1)
#> compiler 3.5.0 2018-04-23 local
#> crayon 1.3.4 2017-09-16 CRAN (R 3.5.0)
#> crul 0.6.0 2018-07-10 CRAN (R 3.5.0)
#> curl 3.2 2018-03-28 CRAN (R 3.5.0)
#> data.table 1.11.4 2018-05-27 CRAN (R 3.5.0)
#> datasets * 3.5.0 2018-04-23 local
#> devtools 1.13.6 2018-06-27 CRAN (R 3.5.1)
#> digest 0.6.17 2018-09-12 CRAN (R 3.5.1)
#> dplyr 0.7.6 2018-06-29 CRAN (R 3.5.1)
#> evaluate 0.11 2018-07-17 CRAN (R 3.5.1)
#> geoaxe 0.1.0 2016-02-19 CRAN (R 3.5.0)
#> ggplot2 3.0.0 2018-07-03 CRAN (R 3.5.1)
#> glue 1.3.0 2018-07-17 CRAN (R 3.5.0)
#> graphics * 3.5.0 2018-04-23 local
#> grDevices * 3.5.0 2018-04-23 local
#> grid 3.5.0 2018-04-23 local
#> gtable 0.2.0 2016-02-26 CRAN (R 3.5.0)
#> htmltools 0.3.6 2017-04-28 CRAN (R 3.5.1)
#> httpcode 0.2.0 2016-11-14 CRAN (R 3.5.0)
#> httr 1.3.1 2017-08-20 CRAN (R 3.5.0)
#> jsonlite 1.5 2017-06-01 CRAN (R 3.5.0)
#> knitr 1.20 2018-02-20 CRAN (R 3.5.0)
#> lattice 0.20-35 2017-03-25 CRAN (R 3.5.0)
#> lazyeval 0.2.1 2017-10-29 CRAN (R 3.5.0)
#> lubridate 1.7.4 2018-04-11 CRAN (R 3.5.0)
#> magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
#> memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
#> methods * 3.5.0 2018-04-23 local
#> munsell 0.5.0 2018-06-12 CRAN (R 3.5.0)
#> oai 0.2.2 2016-11-24 CRAN (R 3.5.0)
#> pillar 1.3.0 2018-07-14 CRAN (R 3.5.1)
#> pkgconfig 2.0.1 2017-03-21 CRAN (R 3.5.0)
#> plyr 1.8.4 2016-06-08 CRAN (R 3.5.0)
#> purrr 0.2.5 2018-05-29 CRAN (R 3.5.0)
#> R6 2.2.2 2017-06-17 CRAN (R 3.5.0)
#> Rcpp 0.12.18 2018-07-23 CRAN (R 3.5.0)
#> rgbif 1.0.2.9421 2018-09-17 Github (ropensci/rgbif@6584a42)
#> rgeos 0.3-28 2018-06-08 CRAN (R 3.5.1)
#> rlang 0.2.2 2018-08-16 CRAN (R 3.5.1)
#> rmarkdown 1.10 2018-06-11 CRAN (R 3.5.0)
#> rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
#> scales 1.0.0 2018-08-09 CRAN (R 3.5.1)
#> sp 1.3-1 2018-06-05 CRAN (R 3.5.0)
#> stats * 3.5.0 2018-04-23 local
#> stringi 1.2.4 2018-07-23 local
#> stringr 1.3.1 2018-05-10 CRAN (R 3.5.0)
#> tibble 1.4.2 2018-01-22 CRAN (R 3.5.0)
#> tidyselect 0.2.4 2018-02-26 CRAN (R 3.5.0)
#> tools 3.5.0 2018-04-23 local
#> triebeard 0.3.0 2016-08-04 CRAN (R 3.5.0)
#> urltools 1.7.1 2018-08-03 CRAN (R 3.5.1)
#> utils * 3.5.0 2018-04-23 local
#> whisker 0.3-2 2013-04-28 CRAN (R 3.4.0)
#> withr 2.1.2 2018-03-15 CRAN (R 3.4.4)
#> xml2 1.2.0 2018-01-24 CRAN (R 3.5.0)
#> yaml 2.2.0 2018-07-25 CRAN (R 3.5.1)
thanks @maelle! Looks like it's working as expected.
Here are the timings of 10 iterations [each] before [sec]:
The timings afterwards are still running after 20 hours, and it seems the first iteration has just finished! Everything gets slowed down from 29.72 requests per second to an average of 1.018 requests per second. Can you please post the email from GBIF that requested you to make this change? Which official board was responsible for this?
Thanks for testing @andzandz11! Yes, the goal was to limit to 1 request per second, or 60 requests per minute, as requested by GBIF. I don't have a sense for whether GBIF is flexible on this or not. Any thoughts, @MattBlissett @timrobertson100?
Thanks @andzandz11 and @sckott. The request to explore this came from me, as there have been a few instances recently where rogue scripts (e.g. infinite loops) have been issuing a lot of requests to GBIF.org. When it comes to the occurrence APIs of GBIF, it makes little sense to issue a lot of deep-paging requests when a single download call can bring any filtered result set far more efficiently, and with DOI-based citation. I asked Scott to explore options to rate limit in the client, as we also explore dynamic throttling based on IP to safeguard the services. It would be helpful to understand what query patterns require you to hit GBIF occurrence search services so often from a single R application; normally we'd recommend the download service for that. Can you elaborate on your use case, please?
Sorry for the late reply.
I am regularly building barcode reference databases from scratch using an R script (data from GenBank). I re-build these databases from time to time to fix errors or to incorporate new sequences that have been published on GenBank. In the same script I call the GBIF backbone to get the species key, and then I use this key to count occurrences in multiple countries using count_facet to score presence/absence. I have roughly ~74,000 species in the database, and apart from downloading tens of GB of .csv files I see no other way than to loop over them in R. The script is fully automated and works really well. It is fast, always up to date, has a small memory footprint, uses very little bandwidth and leaves no trash behind (R is really bad at getting data out of RAM, and a big in-memory dataset slows my machine down considerably during the runtime). Just downloading almost worldwide occurrence data for plants would make my RAM explode; I can't do that. It is also a lot of overkill, because I just need the occurrence counts per country. If somebody knows how to download a table containing occurrence counts for all plant species by country, please let me know.
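For reference, the loop described above can be sketched roughly as follows. The species names and country codes here are placeholders; `name_backbone()` and `occ_count()` are existing rgbif functions, but the exact calls are illustrative, not the poster's actual script.

```r
# Illustrative sketch of the described workflow: look up each species in
# the GBIF backbone, then count its occurrences per country.
# Species names and country codes below are placeholders.
library(rgbif)

species <- c("Quercus robur", "Fagus sylvatica")  # in reality ~74,000 names
countries <- c("DE", "FR", "NL")

counts <- lapply(species, function(sp) {
  key <- name_backbone(name = sp)$usageKey          # GBIF backbone species key
  sapply(countries, function(cc)
    occ_count(taxonKey = key, country = cc))        # occurrence count per country
})
```

With one backbone lookup plus one count call per species/country pair, it is easy to see how this adds up to tens of thousands of small requests.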
Thank you @damianooldoni and @andzandz11 for taking the time to clarify your use cases. It is great to hear that you find the services useful, and please be assured that our objective is to ensure quality of service, not to negatively affect real usage. Based on the feedback, I propose that this not be included, @sckott, and that GBIF consider alternatives: in particular, we should only activate defensive throttling when we observe issues (e.g. a DDoS), which is not the norm. Thank you for exploring this, though, and sorry to waste your time. Off topic to this thread:
@timrobertson100 allowing a species list (a list of species IDs) to be POSTed as a parameter for an occurrence search/download would certainly cater to our main use case for the TrIAS project!
@timrobertson100 I would also like to support the POST method. I frequently end up with a few thousand species of interest (for national reports, for example) and want to retrieve the occurrence data only for those species using their IDs. At present that would involve making individual calls (e.g. for 4,000 species) or combining them into a query which will run for a while and then fail. My workaround has been to use bounding boxes on the website etc., but that involves too much guesswork and a lot of unnecessary data (e.g. I recently did the whole of South East Asia to get marine species with occurrences in the ASEAN region). So I think a POST method would be a great help to those of us working with species data at the level of thousands. On rate limiting, I can recognise the need for it in some circumstances, but if it can be avoided that really would be much better.
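As an aside for later readers: the GBIF download API does accept a POSTed query, and newer versions of rgbif expose predicate helpers that cover this use case. A hedged sketch, assuming rgbif >= 3.0 (these helpers did not exist when this thread was written) and registered GBIF credentials:

```r
# Sketch (assumes rgbif >= 3.0 predicate helpers and GBIF credentials set
# via the GBIF_USER/GBIF_PWD/GBIF_EMAIL environment variables): request a
# single asynchronous download restricted to a vector of taxon keys.
library(rgbif)

keys <- c(2877951, 5386)  # hypothetical taxon keys; in practice thousands

occ_download(
  pred_in("taxonKey", keys),  # one POSTed predicate covers the whole list
  pred("country", "FR")
)
```

This replaces thousands of individual search calls with one queued download request.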
Also very helpful would be the ability to specify which fields you want returned. For example, I have 80,000 species and I just need the "country" data, but right now, using the website download function, I have to get the whole dataset, which is 99 GB and too big to be handled properly by R. Even with the POST method, the whole worldwide dataset being returned would be too large.
Thank you all - very useful. Would it be of any interest to allow a user to post a SQL statement for an asynchronous download? It would be for the more experienced user, take a few minutes to return, and we'd probably need to sanitise it and offer only a subset of SQL (single table, aggregations, groupings etc.), but we could allow e.g.:
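The example itself did not survive in this copy of the thread, but the kind of restricted single-table aggregation being described might look something like the following. This is purely hypothetical; the table and column names are chosen for illustration (they mirror common GBIF occurrence field names) and do not describe an actual service.

```r
# Purely hypothetical illustration of the kind of restricted SQL
# (single table, simple aggregation) such a service might accept.
# Table and column names are invented for this sketch.
sql <- "
  SELECT countryCode, taxonKey, COUNT(*) AS n
  FROM occurrence
  WHERE kingdomKey = 6          -- e.g. restrict to Plantae
  GROUP BY countryCode, taxonKey
"
```

A query like this would directly serve the species-by-country count use case described above, without shipping the raw occurrence records.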
CC @MattBlissett @gbif for info as we consider options |
@timrobertson100 that would be really nice! Since you could request aggregated data via SQL, I assume the downloads would be of another type than the current GBIF occurrence downloads?
@timrobertson100 no worries, not a waste of time. had fun writing it |
Yes a SQL download would be a new service. I have wondered about it several times, but I have seen a few instances recently where I think it might be an enabling service. |
I'll leave this issue open and leave the work on the rate-limit branch.
@jhnwllr fun coincidence you're closing this now as I just added throttling/rate limiting to another package! 😸 |
@maelle I closed it because I thought the issue had sort of become out of date. I am not sure any rate limiting is needed at all. I have been abusing the GBIF API for years and it's fine. |
Oh yeah, that makes sense! I was just reacting to the coincidence of topics, not judging the decision. 😃
via request from GBIF (email title "Re: Help on server error")
except the download request API - though it should be used for some download routes, e.g. for checking status.