
Using proxy through http fails (https works) #4505

Closed
teodoroanca opened this issue Apr 16, 2020 · 11 comments · Fixed by #4649

Comments

@teodoroanca

teodoroanca commented Apr 16, 2020

Description

When I scrape without a proxy, both https and http URLs work.
Through a proxy, https works just fine. The problem is http URLs: those fail with twisted.web.error.SchemeNotSupported: Unsupported scheme: b''.

As far as I can tell, most people hit this issue the other way around.

Steps to Reproduce

  1. Scrape an http URL through a proxy

Expected behavior: Get a 200 with the desired data.

Actual behavior:

ERROR: Error downloading <GET http://*********>
Traceback (most recent call last):
  File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 42, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
    result = f(*args, **kw)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 76, in download_request
    return handler.download_request(request, spider)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 82, in download_request
    return agent.download_request(request)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 361, in download_request
    d = agent.request(method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
  File "/srv/scraper/venv/lib/python3.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 262, in request
    endpoint=self._getEndpoint(self._proxyURI),
  File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/web/client.py", line 1729, in _getEndpoint
    return self._endpointFactory.endpointForURI(uri)
  File "/srv/scraper/venv/lib/python3.7/site-packages/twisted/web/client.py", line 1607, in endpointForURI
    raise SchemeNotSupported("Unsupported scheme: %r" % (uri.scheme,))
twisted.web.error.SchemeNotSupported: Unsupported scheme: b''

Reproduces how often: every time I scrape through the proxy.

Versions

Scrapy       : 2.0.1
lxml         : 4.4.1.0
libxml2      : 2.9.9
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 20.3.0
Python       : 3.7.3 (default, Apr  3 2019, 05:39:12) - [GCC 8.3.0]
pyOpenSSL    : 19.0.0 (OpenSSL 1.1.1c  28 May 2019)
cryptography : 2.7
Platform     : Linux-4.19.0-5-amd64-x86_64-with-debian-10.0

Additional context

I added some logging near the failure point to see where it breaks.
I inserted the following lines in twisted/web/client.py, just before the exception is raised:

        endpoint = HostnameEndpoint(self._reactor, host, uri.port, **kwargs)
        import logging
        logger = logging.getLogger(__name__)
        logger.error("%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%")
        logger.error(uri)
        logger.error(uri.host)
        logger.error(uri.port)
        logger.error(uri.scheme)
        logger.error(dir(uri))
        if uri.scheme == b'http':
            return endpoint
        elif uri.scheme == b'https':
            connectionCreator = self._policyForHTTPS.creatorForNetloc(uri.host,
                                                                      uri.port)
            return wrapClientTLS(connectionCreator, endpoint)
        else:
            raise SchemeNotSupported("Unsupported scheme: %r" % (uri.scheme,))

Apparently at this point the URI has no scheme. If I run the same code with an https URL, this code path is never reached. It seems that even reaching this point is wrong, and the proxy is not being used.

(edited to apply formatting)

@Gallaecio
Member

Could you share a code snippet to reproduce this issue?

@teodoroanca
Author

    def make_request(self, url, callback, meta=None, method='GET'):
        proxy_address = getattr(settings, "PROXY_ADDRESS")

        # avoid the mutable-default-argument pitfall for meta
        meta = dict(meta or {})
        meta.update({
            'proxy': proxy_address,
            'dont_redirect': self.dont_redirect,
        })
        yield Request(
            url,
            meta=meta,
            method=method,
            callback=callback,
            errback=self.handle_error,
            dont_filter=self.dont_filter
        )

@Gallaecio
Member

Do your Request.url and PROXY_ADDRESS both have a URL scheme?

@teodoroanca
Author

Apparently, this was the issue. My PROXY_ADDRESS did not have a URL scheme.
Thanks for the help!

What I want to mention is that this behavior can be confusing. I will describe the cases:

  1. PROXY_ADDRESS without a URL scheme: test-proxy.com:3333
  • if Request.url is https, it works just fine
  • if Request.url is http, it results in twisted.web.error.SchemeNotSupported: Unsupported scheme: b''
  2. PROXY_ADDRESS with a URL scheme: http://second-test-proxy.com:5555
  • both http and https work fine

In my case, the solution was to add http:// in front of my PROXY_ADDRESS. I was confused by the fact that it still worked when Request.url was https, even without a scheme on PROXY_ADDRESS. I don't know whether that is a bug or not.
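For anyone landing here with the same traceback, the workaround described above can be sketched like this (the proxy host and values are the placeholder ones from this thread, not real settings):

```python
# Workaround (spider side): give the proxy setting an explicit scheme.
# A scheme-less value like 'test-proxy.com:3333' only works for https
# targets; plain-http targets then fail with SchemeNotSupported: b''.
PROXY_ADDRESS = 'http://test-proxy.com:3333'

# Passed per request via Scrapy's standard 'proxy' meta key:
meta = {'proxy': PROXY_ADDRESS, 'dont_redirect': True}
```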

@Gallaecio
Member

I guess we can take this as an enhancement request to support scheme-less HTTP proxy URLs.

I checked, and there is no bug: the logic for handling HTTP and HTTPS proxies is different, and the HTTPS one is implemented in a way that does not need the scheme in the proxy URL.

As a reference for people wishing to work on this, it should be as simple as modifying ScrapyProxyAgent.request so that the URL parameter passed to self._getEndpoint is guaranteed to have http:// as its scheme. Parsing the URL, setting the scheme and then unparsing should do the job (https://docs.python.org/3/library/urllib.parse.html).
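A rough sketch of that suggestion (the helper name is mine; the actual change landed via #4649 and is not reproduced here). One caveat with the parse/set/unparse route: a scheme-less 'host:port' value can be misparsed, with the host taken as the scheme, so a plain string check is more robust:

```python
def add_http_scheme_if_missing(url):
    """Return `url` with http:// prepended when it carries no scheme.

    Deliberately avoids urllib.parse here: parsing 'test-proxy.com:3333'
    can yield 'test-proxy.com' as the scheme, which is not what we want.
    """
    if url.startswith('//'):    # protocol-relative form
        return 'http:' + url
    if '://' not in url:        # bare host:port
        return 'http://' + url
    return url                  # scheme already present
```

ScrapyProxyAgent.request could then run the proxy URI through a helper like this before calling self._getEndpoint.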

@liveprasad

@Gallaecio I would like to contribute. I will take this as my first open-source contribution, but I may need some help from you.

@willbeaufoy
Contributor

@liveprasad are you still working on this? If not I can take it.

@HausCloud

Hi, I'm pretty new to open source. I have something working, but I'm having trouble implementing a test case as required by the contributing docs.

@Gallaecio
Member

@HausCloud Create a pull request with the current state of your changes. Maybe we can help you with the rest.

@HausCloud

HausCloud commented May 6, 2020

@Gallaecio Will do! Thanks.

UPDATE: Done! If any adjustments are needed, I can fix them! I'd probably need a hint in the right direction for the testing, however.

@ajaymittur
Contributor

ajaymittur commented Jun 25, 2020

I noticed the pull request was closed accidentally, and the branch @HausCloud was working on seems to have been deleted. Is the issue open to work on, or is someone already on it? @Gallaecio
