TL;DR: I am building polite web scraping tools on top of httr/curl. I would like httr to use the same default useragent strings as curl (that I can override in my package).
Below is a quick survey of default useragent strings as used by various web facing packages.
# in terminal
getOption("HTTPUserAgent")
#> [1] "R (3.6.3 x86_64-pc-linux-gnu x86_64 linux-gnu)"
# in RStudio
getOption("HTTPUserAgent")
#> [1] "RStudio Desktop (1.3.959); R (3.6.3 x86_64-pc-linux-gnu x86_64 linux-gnu)"
The same is also used by download.file() in base R:
# reproducible help example
library(magrittr)
h <- help("download.file") %>%
  utils:::.getHelpFile() %>%
  {capture.output(tools:::Rd2txt(.))}
s <- grep("headers: ", h)
h[s + 0:3]
#> [1] " headers: named character vector of HTTP headers to use in HTTP"
#> [2] " requests. It is ignored for non-HTTP URLs. The ‘User-Agent’"
#> [3] " header, coming from the ‘HTTPUserAgent’ option (see"
#> [4] " ‘options’) is used as the first header, automatically."
Can we please prepend the content of getOption("HTTPUserAgent") to the current library versions? For my system it would then read:
RStudio Desktop (1.3.959); R (3.6.3 x86_64-pc-linux-gnu x86_64 linux-gnu); libcurl/7.58.0 r-curl/4.3 httr/1.4.1
This will allow personalization of the default useragent in .Rprofile and introduce consistency across modern web-accessing packages in R.
1 Other connections in base R are a lost cause, though, because they use a mix of interfaces for different platforms, each with its own defaults. At least on Linux, they populate the useragent with the libcurl version number.
The current system of caching the default useragent might be a little problematic as well. I see why you would want to cache the (potentially expensive) lookup of version numbers, but since the useragent is always available via getOption("HTTPUserAgent"), why don't we store only the concatenated version string and combine it on the fly with the content of the HTTPUserAgent option?
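A minimal sketch of that idea in base R, not httr's actual implementation: the version part is computed once and cached, while the HTTPUserAgent option is read at call time. The names `cached_versions` and `compose_ua()` are hypothetical, and the version string is a placeholder copied from the example in this issue.

```r
# Hypothetical cache: in a real package this would be computed once
# (e.g. at load time) from the installed library versions.
cached_versions <- "libcurl/7.58.0 r-curl/4.3 httr/1.4.1"

# Hypothetical helper: combine the option with the cached versions on
# every call, so late changes to the option are always picked up.
compose_ua <- function(versions = cached_versions) {
  base <- getOption("HTTPUserAgent")  # NULL if the option is unset
  # c() drops NULL, so an unset option yields just the version string
  paste(c(base, versions), collapse = "; ")
}

compose_ua()
```

Because only the cheap `getOption()` lookup happens per call, the cost of collecting version numbers is still paid once.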
What I hope to accomplish with polite is running a session with a temporary, "exceptionally polite", human-readable user agent string. To achieve this I will need to either pass a custom header (which I may not be able to do) or temporarily swap the value of the option (as withr does). Therefore I would love for the content of the HTTPUserAgent option not to be cached, so that last-minute updates at call time are possible.
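The temporary swap could look roughly like the sketch below. It uses only base R and is similar in spirit to withr::with_options(); the helper name `with_user_agent()` and the user agent string are hypothetical.

```r
# Hypothetical helper: evaluate `code` with a temporary HTTPUserAgent
# option and restore the previous value afterwards, even on error.
with_user_agent <- function(ua, code) {
  old <- options(HTTPUserAgent = ua)  # returns the previous value(s)
  on.exit(options(old), add = TRUE)
  force(code)                         # `code` is evaluated lazily, here
}

before <- getOption("HTTPUserAgent")

# Placeholder user agent string, for illustration only
polite_ua <- "polite-bot (hypothetical; contact: me@example.com)"

# Anything that reads the option at call time sees the temporary value
seen <- with_user_agent(polite_ua, getOption("HTTPUserAgent"))
```

This only helps with packages that read the option at call time; a package that caches its user agent would not see the temporary value.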
httr has been superseded in favour of httr2, so is no longer under active development. If this problem is still important to you in httr2, I'd suggest filing an issue over there 😄. Thanks for using httr!
The CURL C library does not seem to have a "default" setting for the useragent: CURLOPT_USERAGENT is not set by default. Consequently, the default useragent in RCurl is NULL, and it can only be set in curlOptions() or directly in the handle (see footnote 1). This was the reason why httr introduced user_agent() (httr/R/user-agent.r, line 3 in 292759d).
curl uses the "HTTPUserAgent" option as the default UA unless changed by curl::handle_setopt. The same is also used by download.file() in base R, and the same convention is followed by RStudio in rstudio/r-builds#30.
Currently httr does not use the HTTPUserAgent option and instead uses the version numbers for libcurl, curl and httr pasted into a single string (httr/R/config.r, line 142 in af25ebd). Therefore, on my machine httr has this as its default UA:
libcurl/7.58.0 r-curl/4.3 httr/1.4.1