Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CachingHostnameResolver does not work with reactor.resolve() #4802

Closed
michael-lazar opened this issue Sep 21, 2020 · 4 comments · Fixed by #4803
Closed

CachingHostnameResolver does not work with reactor.resolve() #4802

michael-lazar opened this issue Sep 21, 2020 · 4 comments · Fixed by #4803

Comments

@michael-lazar
Copy link

Description

Hi. Thank you for maintaining this awesome software :)

I am working on a project using scrapy that implements a custom downloader class (link).

I want to resolve IPv6 addresses, and I found the section in the documentation about the DNS_RESOLVER setting that was added in #4227. I tried enabling the new DNS_RESOLVER = "scrapy.resolver.CachingHostnameResolver" and was immediately greeted with this exception.

Unhandled Error
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/scrapy/commands/crawl.py", line 27, in run
    self.crawler_process.start()
  File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 327, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 1283, in run
    self.mainLoop()
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 1292, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 913, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/tcp.py", line 449, in resolveAddress
    d = self.reactor.resolve(self.addr[0])
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 638, in resolve
    return self.resolver.getHostByName(name, timeout)
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/_resolver.py", line 277, in getHostByName
    self._nameResolver.resolveHostName(FirstOneWins(result), name, 0,
  File "/usr/local/lib/python3.8/site-packages/scrapy/resolver.py", line 80, in resolveHostName
    class CachingResolutionReceiver(resolutionReceiver):
builtins.TypeError: __init__() takes 2 positional arguments but 4 were given

Steps to Reproduce

This is also reproducible using the bundled FTP downloader

  1. scrapy startproject scrapy_test
  2. scrapy genspider example mozz.us
  3. Add DNS_RESOLVER = "scrapy.resolver.CachingHostnameResolver" to the settings file
  4. Change the spider start_url to ftp://mozz.us
  5. scrapy crawl scrapy_test

Versions

Scrapy       : 2.3.0
lxml         : 4.5.2.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.8.5 (default, Jul 21 2020, 10:48:26) - [Clang 11.0.3 (clang-1103.0.32.62)]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020)
cryptography : 3.0
Platform     : macOS-10.15.6-x86_64-i386-64bit

Additional context

This was a tricky one to debug because everything works as expected with the HTTP Agent downloader. This issue only appears when you implement a downloader that depends on calling reactor.resolve() directly without using twisted.internet.endpoints.HostnameEndpoint.

I discovered that in the twisted IHostnameResolver interface, the resolutionReceiver method argument is expected to be an instance of a resolution receiver class, and not a type of a resolution receiver class. So I believe the scrapy code below is incorrect:

def resolveHostName(self, resolutionReceiver, hostName, portNumber=0,
addressTypes=None, transportSemantics='TCP'):
@provider(IResolutionReceiver)
class CachingResolutionReceiver(resolutionReceiver):

The subclass here only works with the Scrapy Agent because the HostnameEndpoint does this weird thing where it defines a class with only static methods, so it can pass the class itself instead of instantiating it.

https://github.com/twisted/twisted/blob/22f949f7ce187513f0c218b73186c8a73baa00b4/src/twisted/internet/endpoints.py#L942-L958

        @provider(IResolutionReceiver)
        class EndpointReceiver:
            @staticmethod
            def resolutionBegan(resolutionInProgress):
                pass

            @staticmethod
            def addressResolved(address):
                addresses.append(address)

            @staticmethod
            def resolutionComplete():
                d.callback(addresses)

        self._nameResolver.resolveHostName(
            EndpointReceiver, self._hostText, portNumber=self._port
        )

However, there are other places in the twisted reactor where twisted does pass an object instance directly to this method.

https://github.com/twisted/twisted/blob/7e3ce790ca9f004ab386f9ecbba8f505d66cd3bd/src/twisted/internet/_resolver.py#L307

        result = Deferred()
        self._nameResolver.resolveHostName(FirstOneWins(result), name, 0, [IPv4Address])
        return result
@Gallaecio
Copy link
Member

Thanks for the detailed report!

@elacuesta
Copy link
Member

elacuesta commented Sep 21, 2020

Nice find! At the moment I'm not finding the place from where I took that inheritance pattern used in #4227. This is indeed weird, but I'm going to take the liberty of removing the "bug" label and adding "upstream issue" instead, because we are in fact receiving a class instead of an instance from Twisted (this might grant an issue in the Twisted tracker).
More extensive testing is probably required, but this PR should be a good starting point to get to a solution in our side: #4803. @michael-lazar please test it when you have a minute and let us know what you find.

@michael-lazar
Copy link
Author

@elacuesta Thanks for the quick response! I was able to test our your branch and it did fix the original exception for me. I like your general approach and it was better than what I could come up with late last night 😄

However, the next thing I ran into is that the cache doesn't exactly work in its current implementation. The first DNS request from my scraper resolves fine, but the second request to the same domain will just hang indefinitely. I think that returning the cached value from resolveHostName() doesn't help much, because the return value isn't used for anything. Instead we need to hook into the addressResolved() and resolutionComplete() methods to signal that the cached address is ready to be used.

Here's what I came up with. I haven't tested thoroughly with IPv4 / IPv4 but it seems to work and has me unblocked for now.

from twisted.internet._resolver import HostResolution

...


    def resolveHostName(
        self, resolutionReceiver, hostName, portNumber=0, addressTypes=None, transportSemantics="TCP"
    ):

        cached_addresses = dnscache.get(hostName)
        if cached_addresses:
            resolutionReceiver.resolutionBegan(HostResolution(hostName))
            for address in cached_addresses:
                resolutionReceiver.addressResolved(address)
            resolutionReceiver.resolutionComplete()
            return resolutionReceiver

        @provider(IResolutionReceiver)
        class CachingResolutionReceiver:

            def __init__(self):
                self.addresses = []

            def resolutionBegan(self, resolution):
                resolutionReceiver.resolutionBegan(resolution)

            def addressResolved(self, address):
                resolutionReceiver.addressResolved(address)
                self.addresses.append(address)

            def resolutionComplete(self):
                resolutionReceiver.resolutionComplete()
                if self.addresses:
                    dnscache[hostName] = tuple(self.addresses)

        return self.original_resolver.resolveHostName(
            CachingResolutionReceiver(),
            hostName,
            portNumber,
            addressTypes,
            transportSemantics,
        )

@elacuesta elacuesta added the bug label Oct 6, 2020
@elacuesta
Copy link
Member

Re-tagging as "bug" because of the hanging issue mentioned above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants