Description
Hello,
I am trying to scrape a website and want to learn how to slow down my requests. If I make too many requests in a minute, I get an AWS challenge from the website that httr2 can't get past. If I wait, the challenge goes away and everything works fine (which is annoying because it means the full pull will take a while, but whatever). The example below requests only one URL, but I have a couple hundred to pull (different URLs on the same website).
Any clarification, or advice, would be super helpful.
Thank you for rvest, httr, and httr2 in general. I have taught myself how to use them with the good documentation, and it has helped streamline a lot of my work!
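To make the "slow down the requests" goal concrete, here is a minimal sketch of what I think throttling would look like with `req_throttle()`. The limit of 10 requests per minute is only a guess (I do not know the site's actual threshold), and `build_request()` is a hypothetical helper name. The sketch only builds the requests; performing them is left commented out.

```r
library(httr2)

# Only the first URL appears in this post; the real vector has a couple hundred.
table_url <- c(
  "https://www.hospitalsafetygrade.org/table-details/the-queens-medical-center"
)

# Hypothetical helper: build one throttled request per URL.
# `rate` is in requests per second, so 10 / 60 is roughly 10 requests per
# minute (an assumed limit, not one published by the site).
build_request <- function(url) {
  request(url) |>
    req_throttle(rate = 10 / 60)
}

reqs <- lapply(table_url, build_request)

# The throttle pool defaults to the request's hostname, so the limit is
# shared by every request to this site.
# resps <- lapply(reqs, req_perform)
```

Because the throttle is applied when each request is performed, the loop itself needs no explicit `Sys.sleep()` calls.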
Here is the normal code working just fine (i.e. I have not tried to call it quickly):
> table_url[1] %>%
+ request() %>%
+ req_headers(
+ authority = "www.hospitalsafetygrade.org",
+ accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
+ `accept-language` = "en-US,en;q=0.9",
+ `cache-control` = "no-cache",
+ ) %>%
+ req_retry(max_seconds = 15,
+ is_transient = ~resp_status(.x) %in% c(429, 500, 503, 202),
+ after = ~.x) %>%
+ req_perform(verbosity = 1)
-> GET /table-details/the-queens-medical-center HTTP/1.1
-> Host: www.hospitalsafetygrade.org
-> User-Agent: httr2/0.2.3 r-curl/4.3.2 libcurl/7.64.1
-> Accept-Encoding: deflate, gzip
-> authority: www.hospitalsafetygrade.org
-> accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
-> accept-language: en-US,en;q=0.9
-> cache-control: no-cache
->
<- HTTP/1.1 200 OK
<- Date: Sat, 11 Nov 2023 01:02:28 GMT
<- Content-Type: text/html; charset=utf-8
<- Transfer-Encoding: chunked
<- Connection: keep-alive
<- Server: Apache/2.2.34 (Amazon)
<- X-AWC-Cache: partial
<- Set-Cookie: sid=6b3aad3f8ec0fd5382d3bde1d0dbc2d0; path=/; HttpOnly
<-
<httr2_response>
GET https://www.hospitalsafetygrade.org/table-details/the-queens-medical-center
Status: 200 OK
Content-Type: text/html
Body: In memory (80435 bytes)
Here is the code that causes the error. At the bottom it says "Error in `check_number()`: ! `seconds` must be a number", but I don't understand the error. I have been working on this for a bit (it took me a while to realize that the status 202 meant I was being sent to a challenge).
> table_url[1] %>%
+ request() %>%
+ req_headers(
+ authority = "www.hospitalsafetygrade.org",
+ accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
+ `accept-language` = "en-US,en;q=0.9",
+ `cache-control` = "no-cache",
+ ) %>%
+ req_retry(max_seconds = 15,
+ is_transient = ~resp_status(.x) %in% c(429, 500, 503, 202),
+ after = ~.x) %>%
+ req_perform(verbosity = 1)
-> GET /table-details/the-queens-medical-center HTTP/1.1
-> Host: www.hospitalsafetygrade.org
-> User-Agent: httr2/0.2.3 r-curl/4.3.2 libcurl/7.64.1
-> Accept-Encoding: deflate, gzip
-> authority: www.hospitalsafetygrade.org
-> accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
-> accept-language: en-US,en;q=0.9
-> cache-control: no-cache
->
<- HTTP/1.1 202 Accepted
<- Server: awselb/2.0
<- Date: Sat, 11 Nov 2023 00:50:46 GMT
<- Content-Length: 2411
<- Connection: keep-alive
<- x-amzn-waf-action: challenge
<- Cache-Control: no-store, max-age=0
<- Content-Type: text/html; charset=UTF-8
<-
Error in `check_number()`:
! `seconds` must be a number
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
In if (is.na(after)) { :
the condition has length > 1 and only the first element will be used
> rlang::last_trace()
<error/rlang_error>
Error in `check_number()`:
! `seconds` must be a number
---
Backtrace:
x
1. +-... %>% req_perform(verbosity = 1)
2. \-httr2::req_perform(., verbosity = 1)
3. \-httr2:::sys_sleep(delay)
4. \-httr2:::check_number(seconds, "`seconds`")
Run rlang::last_trace(drop = FALSE) to see 1 hidden frame.
> ?httr2::check_number()
Error in .helpForCall(topicExpr, parent.frame()) :
no methods for ‘check_number’ and no documentation for it as a function
> ?check_number()
Error in .helpForCall(topicExpr, parent.frame()) :
no methods for ‘check_number’ and no documentation for it as a function
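One possible reading of the backtrace, offered as a guess rather than a confirmed diagnosis: `after = ~.x` returns the response object itself, while `req_retry()` documents `after` as a function that returns a number of seconds to wait. That would explain both the `is.na(after)` warning and the `seconds` must be a number error. A minimal sketch of a variant that drops `after` and uses `backoff` instead follows. The 30-second wait and the 5 tries are assumed values, not anything the site specifies.

```r
library(httr2)

req <- request(
  "https://www.hospitalsafetygrade.org/table-details/the-queens-medical-center"
) |>
  req_retry(
    # Assumed limit on the total number of attempts.
    max_tries = 5,
    # Treat the 202 challenge response as retryable, alongside the usual codes.
    is_transient = function(resp) resp_status(resp) %in% c(202, 429, 500, 503),
    # `backoff` receives the number of failed attempts so far and must
    # return a number of seconds; 30 is an assumed fixed wait.
    backoff = function(n_failures) 30
  )

# Performing the request is left commented out here:
# resp <- req_perform(req, verbosity = 1)
```

With `after` left at its default, httr2 looks for a `Retry-After` header on the response and falls back to `backoff` when none is present.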
Here is my system information:
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RPostgreSQL_0.7-4 tmap_3.3-3 odbc_1.3.4 logger_0.2.2 DBI_1.1.3 glue_1.6.2 httr2_0.2.3 chromote_0.1.2
[9] magrittr_2.0.3 jsonlite_1.8.4 xml2_1.3.3 openxlsx_4.2.5 dbplyr_2.3.2 rvest_1.0.3 lubridate_1.9.2 forcats_1.0.0
[17] stringr_1.5.0 dplyr_1.1.1 purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2 tidyverse_2.0.0
[25] pacman_0.5.1
loaded via a namespace (and not attached):
[1] sf_1.0-12 bit64_4.0.5 RColorBrewer_1.1-3 httr_1.4.4 tools_4.1.1 utf8_1.2.2 R6_2.5.1
[8] KernSmooth_2.23-20 colorspace_2.0-3 raster_3.6-20 sp_1.5-0 withr_2.5.0 tidyselect_1.2.0 tictoc_1.2
[15] processx_3.7.0 leaflet_2.1.2 curl_4.3.2 bit_4.0.4 compiler_4.1.1 leafem_0.2.0 cli_3.6.1
[22] scales_1.2.1 classInt_0.4-7 proxy_0.4-27 rappdirs_0.3.3 digest_0.6.29 base64enc_0.1-3 dichromat_2.0-0.1
[29] pkgconfig_2.0.3 htmltools_0.5.3 sessioninfo_1.2.2 fastmap_1.1.0 htmlwidgets_1.5.4 rlang_1.1.0 readxl_1.4.2
[36] Microsoft365R_2.4.0 rstudioapi_0.14 generics_0.1.3 crosstalk_1.2.0 zip_2.2.1 AzureGraph_1.3.2 Rcpp_1.0.10
[43] munsell_0.5.0 fansi_1.0.3 abind_1.4-5 terra_1.7-23 lifecycle_1.0.3 stringi_1.7.6 leafsync_0.1.0
[50] snakecase_0.11.0 tmaptools_3.1-1 grid_4.1.1 blob_1.2.3 parallel_4.1.1 promises_1.2.0.1 lattice_0.20-45
[57] stars_0.6-1 hms_1.1.2 ps_1.7.1 pillar_1.8.1 codetools_0.2-18 XML_3.99-0.10 selectr_0.4-2
[64] png_0.1-7 vctrs_0.6.1 tzdb_0.3.0 cellranger_1.1.0 gtable_0.3.1 AzureAuth_1.3.3 janitor_2.1.0
[71] lwgeom_0.2-11 e1071_1.7-11 later_1.3.0 viridisLite_0.4.1 class_7.3-20 websocket_1.4.1 units_0.8-0