
Proxy connection is being refused #99

Open
m2jobe opened this issue Jan 10, 2017 · 15 comments

m2jobe commented Jan 10, 2017

The error below suggests that my proxy connection is being refused. The proxy was tested with curl and it is in fact working; it requires no credentials, which is why the username and password fields were omitted in set_proxy. What else could be the reason for this connection being refused?

--ERROR
RenderErrorInfo(type='Network', code=99, text='Proxy connection refused',
"message": "Lua error: [string "..."]:10: network99", "type": "LUA_ERROR", "error": "network99"},

--TESTING PROXY
curl --proxy 127.0.0.1:24000 "http://lumtest.com/myip.json"

--SPIDER CODE

```python
import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash)
  splash:on_request(function(request)
    request:set_proxy{
        host = "127.0.0.1",
        port = 24000,
    }
  end)

  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
    })
  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    html = splash:html(),
  }
end
"""
class TestlumSpider(scrapy.Spider):
    name = "testlum"
    allowed_domains = ["amazon.ca"]

    def start_requests(self):
        url = "https://www.amazon.ca/dp/1482703270"
        yield SplashRequest(url, self.parse, endpoint='execute',
                            args={'lua_source': script})

    def parse(self, response):
        pass
```

m2jobe (Author) commented Jan 10, 2017

I just tested using Scrapy to connect to the proxy directly instead of going through Splash, and that seems to work.

kmike (Member) commented Jan 10, 2017

@m2jobe I think the problem could be the 127.0.0.1 address: Splash usually runs in a Docker container, and localhost inside that container is not the same as localhost on the machine where the spider runs.
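
A quick way to verify this point, as a minimal sketch: the container name `splash` below is an assumption (substitute whatever `docker ps` shows), and it assumes curl is available inside the image.

```bash
# From the host, this reaches a proxy listening on the host's loopback:
curl --proxy 127.0.0.1:24000 "http://lumtest.com/myip.json"

# Inside the Splash container, 127.0.0.1 is the container's own loopback,
# so the same command is refused there unless a proxy runs inside it.
# ("splash" is an assumed container name; use the one from `docker ps`.)
docker exec -it splash curl --proxy 127.0.0.1:24000 "http://lumtest.com/myip.json"

# On Linux, the host is typically reachable from a container via the docker0
# bridge gateway (often 172.17.0.1; check with `ip addr show docker0`), but
# only if the proxy on the host listens on 0.0.0.0 rather than 127.0.0.1:
docker exec -it splash curl --proxy 172.17.0.1:24000 "http://lumtest.com/myip.json"
```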

kmike (Member) commented Jan 10, 2017

See also: scrapinghub/splash#234

m2jobe (Author) commented Jan 10, 2017

What if the proxy runs in a Docker container too?

kmike (Member) commented Jan 10, 2017

If the proxy runs in a Docker container too, you can use e.g. docker-compose to link the containers. You can have 'splash' and 'proxy' containers, link them, and the proxy will be available at the 'proxy:24000' address inside the Splash container. See, for example, how it is done in Aquarium: https://github.com/TeamHG-Memex/aquarium/blob/master/%7B%7Bcookiecutter.folder_name%7D%7D/docker-compose.yml; it sets up a Tor proxy and makes it available to all Splash containers under the 'tor' hostname.
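
As an aside, the same linking idea can be sketched without compose, using a user-defined Docker network; the container names `proxy` and `splash` here are illustrative, not from the thread:

```bash
# Containers on a user-defined bridge network resolve each other by name.
docker network create scraping
docker run -d --name proxy --network scraping luminati/luminati-proxy
docker run -d --name splash --network scraping -p 8050:8050 scrapinghub/splash

# Inside the Splash container the proxy is now reachable as proxy:24000,
# which matches host = "proxy" in the Lua set_proxy call.
```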

m2jobe (Author) commented Jan 10, 2017

OK, I'll try that. I just tried the solution of using http://10.0.2.2, but that led to a 504 error.

m2jobe (Author) commented Jan 10, 2017

Following the linked yml example, is this correct?

```lua
splash:on_request(function(request) request:set_proxy{ host = "luminati", port = 24000, } end)
```

```yaml
tor:
    image: jess/tor-proxy
    expose:
        - 9050
    logging:
        driver: "none"
    restart: always

luminati:
    image: luminati/luminati-proxy
    expose:
        - 24000
    logging:
        driver: "none"
    restart: always
```
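
One thing worth checking with a setup like this is whether the `luminati` hostname actually resolves and the port is reachable from inside the Splash container. A minimal sketch; `<splash-container>` is a placeholder for the real name or ID:

```bash
# <splash-container>: the name or ID shown by `docker ps`.
# bash's /dev/tcp pseudo-device opens a TCP connection; if "reachable" is
# printed, both DNS resolution of "luminati" and the port worked from
# inside the Splash container.
docker exec -it <splash-container> bash -c \
    'echo > /dev/tcp/luminati/24000 && echo reachable'
```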

m2jobe (Author) commented Jan 10, 2017

Now the error is:

loadFinished: RenderErrorInfo(type='Network', code=99, text='Error communicating with HTTP proxy'

Previously it was outputting "Proxy connection refused".

m2jobe (Author) commented Jan 10, 2017

I made some modifications; I think this is right: http://pastebin.com/pqiXGYN4

However, I'm still receiving the same "Error communicating with HTTP proxy".

m2jobe (Author) commented Jan 11, 2017

Finally got it working with this:

```yaml
version: '2'

services:
    haproxy:
        image: haproxy:1.6
        ports:
            # stats
            - "8036:8036"
            # splash
            - "8050:8050"
        links:
            - luminati
            - splash0
            - splash1
            - splash2
            - splash3
            - splash4
        volumes:
            - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro

    splash0:
        image: scrapinghub/splash:2.2.1
        command: --max-timeout 3600 --slots 10 --maxrss 6000 --verbosity 1
        expose:
            - 8050
        mem_limit: 8400m
        memswap_limit: 10800m
        restart: always
        links:
            - tor
            - luminati
        volumes:
            - ./proxy-profiles:/etc/splash/proxy-profiles:ro
            - ./filters:/etc/splash/filters:ro

    splash1:
        image: scrapinghub/splash:2.2.1
        command: --max-timeout 3600 --slots 10 --maxrss 6000 --verbosity 1
        expose:
            - 8050
        mem_limit: 8400m
        memswap_limit: 10800m
        restart: always
        links:
            - tor
            - luminati
        volumes:
            - ./proxy-profiles:/etc/splash/proxy-profiles:ro
            - ./filters:/etc/splash/filters:ro

    splash2:
        image: scrapinghub/splash:2.2.1
        command: --max-timeout 3600 --slots 10 --maxrss 6000 --verbosity 1
        expose:
            - 8050
        mem_limit: 8400m
        memswap_limit: 10800m
        restart: always
        links:
            - tor
            - luminati
        volumes:
            - ./proxy-profiles:/etc/splash/proxy-profiles:ro
            - ./filters:/etc/splash/filters:ro

    splash3:
        image: scrapinghub/splash:2.2.1
        command: --max-timeout 3600 --slots 10 --maxrss 6000 --verbosity 1
        expose:
            - 8050
        mem_limit: 8400m
        memswap_limit: 10800m
        restart: always
        links:
            - tor
            - luminati
        volumes:
            - ./proxy-profiles:/etc/splash/proxy-profiles:ro
            - ./filters:/etc/splash/filters:ro

    splash4:
        image: scrapinghub/splash:2.2.1
        command: --max-timeout 3600 --slots 10 --maxrss 6000 --verbosity 1
        expose:
            - 8050
        mem_limit: 8400m
        memswap_limit: 10800m
        restart: always
        links:
            - tor
            - luminati
        volumes:
            - ./proxy-profiles:/etc/splash/proxy-profiles:ro
            - ./filters:/etc/splash/filters:ro

    tor:
        image: jess/tor-proxy
        expose:
            - 9050
        logging:
            driver: "none"
        restart: always

    luminati:
        image: luminati/luminati-proxy
        ports:
            # proxy
            - "24000:24000"
            # proxy manager
            - "22999:22999"
        logging:
            driver: "none"
        restart: always
```
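
For anyone landing here, bringing this stack up and smoke-testing it might look like the following sketch; it assumes the file above is saved as docker-compose.yml in the current directory:

```bash
# Start all services in the background.
docker-compose up -d

# Splash answers through haproxy on the host's port 8050
# (render.html is a standard Splash HTTP API endpoint):
curl "http://127.0.0.1:8050/render.html?url=http://example.com"

# The Luminati proxy manager UI is published on port 22999:
curl -I "http://127.0.0.1:22999"
```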

kmike (Member) commented Jan 13, 2017

Glad you got it working! Have you figured out what the problem was?

SergeyKalutsky commented May 3, 2018

Hi, everyone.
I've been struggling with the same issue for the last couple of hours. The problem is accessing localhost from a Docker container. According to the Docker docs:

"The host has a changing IP address (or none if you have no network access). From 18.03 onwards our recommendation is to connect to the special DNS name host.docker.internal, which resolves to the internal IP address used by the host. The gateway is also reachable as gateway.docker.internal."

https://docs.docker.com/docker-for-windows/networking/#per-container-ip-addressing-is-not-possible
So the solution in this case is to change the host name from

```lua
function main(splash) splash:on_request(function(request) request:set_proxy{ host = "127.0.0.1", port = 24000, } end)
```

to

```lua
function main(splash) splash:on_request(function(request) request:set_proxy{ host = "host.docker.internal", port = 24000, } end)
```
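
One caveat worth adding here (not stated in the thread): host.docker.internal resolves out of the box on Docker Desktop for Windows and macOS, but on Linux it has to be mapped explicitly. A sketch, assuming Docker 20.10 or newer; `<container>` is a placeholder:

```bash
# On Linux, map host.docker.internal to the host gateway explicitly
# (the special host-gateway value is supported since Docker 20.10):
docker run -d -p 8050:8050 \
    --add-host host.docker.internal:host-gateway \
    scrapinghub/splash

# Verify resolution from inside the container
# (<container> is a placeholder for the name or ID from `docker ps`):
docker exec -it <container> getent hosts host.docker.internal
```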

Gallaecio (Contributor) commented:

This does not look specific to scrapy-splash. Shall we move this to https://github.com/scrapinghub/splash?

vazkir commented May 30, 2020

> So the solution in this case is to change the host name from host = "127.0.0.1" to host = "host.docker.internal" (quoting @SergeyKalutsky's solution above).

You made my day, man. I've been struggling with this as well for a few hours now, and your solution works!

chipzzz commented Dec 15, 2020

@kmike If I'm using Scraper API as a proxy, do I need Tor running as a service? What does Tor do if you're running an external proxy?
