Proxy + invalid domain makes Downloader stuck #5379

Open
vryazanov opened this issue Jan 26, 2022 · 0 comments
vryazanov commented Jan 26, 2022

Description

The downloader gets stuck when trying to download a URL whose domain is not valid. Without a proxy the same request works fine.

Steps to Reproduce

  1. Set proxy
  2. Try to crawl any invalid domain, for example https://text_example.scrapy.com

Expected behavior: the request leaves the downloader
Actual behavior: the request never leaves the downloader; it gets stuck
Reproduces how often: 100%
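For reference, the two steps above can be sketched as a request fragment. The proxy address and the exact wiring are placeholders, not from the report; any working HTTP proxy set via `request.meta['proxy']` should do:

```python
# Hypothetical reproduction fragment (proxy URL is a placeholder).
# Inside a Scrapy spider callback, routing the request through a
# proxy triggers the hang when the hostname contains an invalid
# character such as '_':
#
#     yield scrapy.Request(
#         "https://text_example.scrapy.com",
#         meta={"proxy": "http://127.0.0.1:8080"},  # placeholder proxy
#     )
```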

Versions

Scrapy : 2.5.0
lxml : 4.6.3.0
libxml2 : 2.9.10
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 21.2.0
Python : 3.8.9 (default, Apr 10 2021, 15:55:09) - [GCC 8.3.0]
pyOpenSSL : 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021)
cryptography : 3.4.7
Platform : Linux-5.10.15-200.fc33.x86_64-x86_64-with-glibc2.2.5

Additional context

The problem is in this part of TunnelingTCP4ClientEndpoint from scrapy.core.downloader.handlers.http11: creatorForNetloc raises an exception which is not handled by Twisted.

    def processProxyResponse(self, rcvd_bytes):
        """Processes the response from the proxy. If the tunnel is successfully
        created, notifies the client that we are ready to send requests. If not
        raises a TunnelError.
        """
        self._connectBuffer += rcvd_bytes
        # make sure that enough (all) bytes are consumed
        # and that we've got all HTTP headers (ending with a blank line)
        # from the proxy so that we don't send those bytes to the TLS layer
        #
        # see https://github.com/scrapy/scrapy/issues/2491
        if b'\r\n\r\n' not in self._connectBuffer:
            return
        self._protocol.dataReceived = self._protocolDataReceived
        respm = TunnelingTCP4ClientEndpoint._responseMatcher.match(self._connectBuffer)
        if respm and int(respm.group('status')) == 200:
            # set proper Server Name Indication extension
            sslOptions = self._contextFactory.creatorForNetloc(self._tunneledHost, self._tunneledPort)
            self._protocol.transport.startTLS(sslOptions, self._protocolFactory)
            self._tunnelReadyDeferred.callback(self._protocol)
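Because the exception escapes into the reactor, self._tunnelReadyDeferred never fires and the request hangs forever. A minimal sketch of the missing guard is below; this is not Scrapy's actual code, and FakeDeferred, finish_tunnel, and bad_creator are stand-in names for illustration. The idea is simply that an exception from creatorForNetloc should become an errback so the request fails instead of getting stuck:

```python
class FakeDeferred:
    """Minimal stand-in for twisted.internet.defer.Deferred."""

    def __init__(self):
        self.result = None
        self.failed = False

    def callback(self, result):
        self.result = result

    def errback(self, failure):
        self.result = failure
        self.failed = True


def finish_tunnel(tunnel_ready, make_ssl_options, host, port):
    # Equivalent of the tail of processProxyResponse: any exception
    # raised by creatorForNetloc is converted into an errback instead
    # of escaping into the reactor unhandled.
    try:
        ssl_options = make_ssl_options(host, port)
    except Exception as exc:
        tunnel_ready.errback(exc)
        return None
    tunnel_ready.callback(ssl_options)
    return ssl_options


def bad_creator(host, port):
    # Simulates creatorForNetloc blowing up on an invalid hostname.
    raise ValueError(f"invalid hostname: {host!r}")


d = FakeDeferred()
finish_tunnel(d, bad_creator, b"text_example.scrapy.com", 443)
print(d.failed)  # True: the deferred fails instead of never firing
```

With a guard like this, the failure propagates to the request's errback chain and the downloader slot is released.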

Traceback

Unhandled Error
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/twisted/python/log.py", line 101, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/usr/local/lib/python3.8/site-packages/twisted/python/log.py", line 85, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/local/lib/python3.8/site-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.8/site-packages/twisted/python/context.py", line 83, in callWithContext
    return func(*args, **kw)
--- <exception caught here> ---
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/posixbase.py", line 687, in _doReadOrWrite
    why = selectable.doRead()
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/tcp.py", line 246, in doRead
    return self._dataReceived(data)
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/tcp.py", line 251, in _dataReceived
    rval = self.protocol.dataReceived(data)
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/endpoints.py", line 149, in dataReceived
    return self._wrappedProtocol.dataReceived(data)
  File "/usr/local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 139, in processProxyResponse
    sslOptions = self._contextFactory.creatorForNetloc(self._tunneledHost, self._tunneledPort)
  File "/application/crawler/contextfactory.py", line 53, in creatorForNetloc
    return super().creatorForNetloc(hostname, port)
  File "/usr/local/lib/python3.8/site-packages/scrapy/core/downloader/contextfactory.py", line 67, in creatorForNetloc
    return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext(),
  File "/usr/local/lib/python3.8/site-packages/scrapy/core/downloader/tls.py", line 42, in __init__
    super().__init__(hostname, ctx)
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/_sslverify.py", line 1130, in __init__
    self._hostnameBytes = _idnaBytes(hostname)
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/_idna.py", line 31, in _idnaBytes
    return idna.encode(text)
  File "/usr/local/lib/python3.8/site-packages/idna/core.py", line 362, in encode
    s = alabel(label)
  File "/usr/local/lib/python3.8/site-packages/idna/core.py", line 270, in alabel
    ulabel(label)
  File "/usr/local/lib/python3.8/site-packages/idna/core.py", line 308, in ulabel
    check_label(label)
  File "/usr/local/lib/python3.8/site-packages/idna/core.py", line 261, in check_label
    raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+005F at position 10 of 'resilient_test' not allowed
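The rejected codepoint U+005F is the underscore: IDNA restricts DNS labels to LDH characters (letters, digits, hyphen), so `text_example` can never be encoded. A small pre-flight check along these lines (an illustrative helper, not part of Scrapy) shows which hostnames would trip the encoder:

```python
import re

# IDNA labels must be LDH ASCII: letters, digits, hyphen, 1-63 chars,
# not starting or ending with a hyphen. '_' (U+005F) is excluded.
LDH_LABEL = re.compile(r'^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$')


def hostname_is_ldh(hostname: str) -> bool:
    """Return True if every DNS label of *hostname* is plain LDH ASCII."""
    labels = hostname.rstrip('.').split('.')
    return all(LDH_LABEL.match(label) for label in labels)


print(hostname_is_ldh('example.scrapy.com'))       # True
print(hostname_is_ldh('text_example.scrapy.com'))  # False: '_' not allowed
```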
@wRAR wRAR added the bug label Jan 26, 2022