Does read_html() support using a proxy? #224

Closed
magic-lantern opened this issue Sep 12, 2018 · 5 comments

@magic-lantern

I'm doing some work behind a SOCKS5 proxy. I've set up my environment so that I can update or install packages by setting the proper environment variable (Sys.setenv(http_proxy = "socks://proxyhost:port")).

However, read_html("some url") fails, apparently because it either does not attempt to use my proxy or tries and fails.

Does rvest support use of a proxy directly?

It does appear that you can instead pass an httr result, of the form:

read_html(httr::GET("https://cran.r-project.org/src/contrib/", httr::use_proxy("socks5://proxyhost:port")))

@jimhester
Collaborator

rvest (and xml2, which is actually what read_html() is from) use httr for the connections.

You can set the global configuration options for httr (such as this) with httr::set_config().

So something like this should use the proxy configuration for all httr connections.

httr::set_config(httr::use_proxy("socks5://proxyhost:port"))

@magic-lantern
Author

Thanks for the clarification.

@1beb

1beb commented Feb 17, 2020

Hey Jim,

I can confirm that this does not work. The httr config is not passed through to read_html().

library(rvest)
library(httr)
httr::set_config(
  httr::use_proxy(
    "socks5://proxy-nl.privateinternetaccess.com",
    port = 1080, username = "USER", password = "PASSWORD"
  )
)

# works for httr
> GET("https://myip.com") %>% content(as='text') %>% read_html %>% html_nodes("li") %>% html_text()
[1] "\nYour IP address is:\n\n109.201.154.197\ncopy\n\n"               
[2] "\nHost:\n172.69.55.92\n"                                          
[3] "\nRemote Port:\n33336\n"                                          
[4] "\nISP:\n\n\nAmsterdam Residential Television and Internet, LLC \n"
# rvest ignores the proxy, returning my true IP
> read_html("https://myip.com") %>% html_nodes("li") %>% html_text()
[1] "\nYour IP address is:\n\n200.121.230.93\ncopy\n\n"  "\nHost:\n198.41.231.175\n"                         
[3] "\nRemote Port:\n33762\n"                            "\nISP:\n\n\nInternet Assigned Numbers Authority \n"

I can provide user/pass for test if you need it @jimhester

@jimhester
Collaborator

You are correct. Looking at the code again, xml2 uses curl::curl() when passed a URL. However, if you call read_html(GET("https://myip.com")), then this should work.
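
For anyone landing here later, a minimal sketch of that workaround: fetch the page with httr so the proxy setting applies, then hand the response to read_html(). The helper name read_html_via_proxy() is hypothetical (it is not part of rvest, xml2, or httr), and the proxy address in the usage line is the placeholder from this thread.

```r
library(httr)
library(xml2)

# Hypothetical helper, not part of rvest or xml2: download through httr
# (which honours use_proxy()) and parse the resulting response object.
read_html_via_proxy <- function(url, proxy_url, ...) {
  # Extra arguments (port, username, password) are forwarded to use_proxy()
  resp <- GET(url, use_proxy(proxy_url, ...))
  stop_for_status(resp)  # turn HTTP 4xx/5xx responses into R errors
  read_html(resp)        # xml2 parses the body of the httr response
}

# Placeholder proxy details; replace with real values before running.
# page <- read_html_via_proxy("https://myip.com", "socks5://proxyhost", port = 1080)
```

The request itself is wrapped rather than relying on httr::set_config() because, as noted above, read_html() does not go through httr when it is given a plain URL.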

@rcepka

rcepka commented Apr 1, 2024

Hello,
Is there any way to do this with read_html_live(), please?
Simply replacing read_html() with read_html_live() obviously does not work. In other words, I am trying to use a proxy with read_html_live(). Any advice, please?
