IPv6 support? Problem running home page example from an IPv6 network #1031

Closed
ronnix opened this issue Jan 31, 2015 · 10 comments · Fixed by #4227

Comments

@ronnix

ronnix commented Jan 31, 2015

I'm running into problems while trying to run the example on the scrapy.org home page from the FOSDEM IPv6-only Wi-Fi network. (The scraper works fine from an IPv4 network.)

If both IPv4 and IPv6 are enabled on my computer (OS X Yosemite) and IPv4 is configured via DHCP, so that it falls back to a self-assigned address (169.254.x.x), then I get timeout errors:

$ scrapy runspider myspider.py
:0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'.  Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied.  Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.
2015-01-31 12:13:32+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2015-01-31 12:13:32+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-01-31 12:13:32+0100 [scrapy] INFO: Overridden settings: {}
2015-01-31 12:13:32+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-01-31 12:13:32+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-01-31 12:13:32+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-01-31 12:13:32+0100 [scrapy] INFO: Enabled item pipelines:
2015-01-31 12:13:32+0100 [blogspider] INFO: Spider opened
2015-01-31 12:13:32+0100 [blogspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-31 12:13:32+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-01-31 12:13:32+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-01-31 12:14:32+0100 [blogspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-31 12:14:48+0100 [blogspider] DEBUG: Retrying <GET http://blog.scrapinghub.com> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2015-01-31 12:15:32+0100 [blogspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-31 12:16:03+0100 [blogspider] DEBUG: Retrying <GET http://blog.scrapinghub.com> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2015-01-31 12:16:32+0100 [blogspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-31 12:17:18+0100 [blogspider] DEBUG: Gave up retrying <GET http://blog.scrapinghub.com> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2015-01-31 12:17:18+0100 [blogspider] ERROR: Error downloading <GET http://blog.scrapinghub.com>: TCP connection timed out: 60: Operation timed out.
2015-01-31 12:17:18+0100 [blogspider] INFO: Closing spider (finished)
2015-01-31 12:17:18+0100 [blogspider] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 3,
     'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 3,
     'downloader/request_bytes': 657,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 1, 31, 11, 17, 18, 494774),
     'log_count/DEBUG': 5,
     'log_count/ERROR': 1,
     'log_count/INFO': 10,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'start_time': datetime.datetime(2015, 1, 31, 11, 13, 32, 955024)}
2015-01-31 12:17:18+0100 [blogspider] INFO: Spider closed (finished)

If I turn off IPv4 completely, then scrapy fails with "No route to host" errors:

$ scrapy runspider myspider.py
:0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'.  Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied.  Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.
2015-01-31 12:10:06+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2015-01-31 12:10:06+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-01-31 12:10:06+0100 [scrapy] INFO: Overridden settings: {}
2015-01-31 12:10:06+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-01-31 12:10:06+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-01-31 12:10:06+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-01-31 12:10:06+0100 [scrapy] INFO: Enabled item pipelines:
2015-01-31 12:10:06+0100 [blogspider] INFO: Spider opened
2015-01-31 12:10:06+0100 [blogspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-31 12:10:06+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-01-31 12:10:06+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-01-31 12:10:06+0100 [blogspider] DEBUG: Retrying <GET http://blog.scrapinghub.com> (failed 1 times): No route to host: 51: Network is unreachable.
2015-01-31 12:10:06+0100 [blogspider] DEBUG: Retrying <GET http://blog.scrapinghub.com> (failed 2 times): No route to host: 51: Network is unreachable.
2015-01-31 12:10:06+0100 [blogspider] DEBUG: Gave up retrying <GET http://blog.scrapinghub.com> (failed 3 times): No route to host: 51: Network is unreachable.
2015-01-31 12:10:06+0100 [blogspider] ERROR: Error downloading <GET http://blog.scrapinghub.com>: No route to host: 51: Network is unreachable.
2015-01-31 12:10:06+0100 [blogspider] INFO: Closing spider (finished)
2015-01-31 12:10:06+0100 [blogspider] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 3,
     'downloader/exception_type_count/twisted.internet.error.NoRouteError': 3,
     'downloader/request_bytes': 657,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 1, 31, 11, 10, 6, 482287),
     'log_count/DEBUG': 5,
     'log_count/ERROR': 1,
     'log_count/INFO': 7,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'start_time': datetime.datetime(2015, 1, 31, 11, 10, 6, 463382)}
2015-01-31 12:10:06+0100 [blogspider] INFO: Spider closed (finished)

Note that I can open the blog.scrapinghub.com site in Safari, so the target web site does support IPv6 and the problem seems to be on scrapy's side.

@kmike
Member

kmike commented Feb 2, 2015

Scrapy uses ThreadedResolver (see https://github.com/scrapy/scrapy/blob/master/scrapy/resolver.py), which calls the stdlib socket.gethostbyname, and that function doesn't support IPv6. So yes, Scrapy doesn't support IPv6 right now.
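
To illustrate the difference described above, here is a minimal stdlib-only sketch (not Scrapy's code). The helper name `resolve_all` is hypothetical; it relies only on `socket.getaddrinfo`, which can return both IPv4 and IPv6 results, whereas the Python documentation notes that `socket.gethostbyname` does not support IPv6 name resolution.

```python
import socket


def resolve_all(host, port=80):
    """Return (family_name, address) pairs for host.

    Hypothetical helper: getaddrinfo may yield AF_INET and AF_INET6
    entries, depending on what the name resolves to.
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [
        (socket.AddressFamily(family).name, sockaddr[0])
        for family, _socktype, _proto, _canonname, sockaddr in infos
    ]


if __name__ == "__main__":
    # Numeric literals are used so the example needs no DNS access.
    print(resolve_all("127.0.0.1"))
    if socket.has_ipv6:
        print(resolve_all("::1"))
```

A resolver built on `gethostbyname` can only ever hand an IPv4 address to the connection layer, which would be consistent with the timeouts and "No route to host" errors reported on an IPv6-only network.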

It looks like Twisted itself supports IPv6, and Scrapy may start supporting it by switching to some other resolver (from twisted.names?), but I haven't checked the details.

@ronnix
Author

ronnix commented Feb 2, 2015

Thanks for the explanation!

@barraponto
Contributor

barraponto commented Feb 6, 2015

@kmike
Member

kmike commented Feb 6, 2015

Yes, it looks like a ThreadedResolver subclass that uses getaddrinfo is an option.
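
As a rough stdlib-only analogue of that idea (this is not Twisted's ThreadedResolver and not the code from any Scrapy pull request), the sketch below runs the blocking `socket.getaddrinfo` call in a thread pool and returns the first address found. The class name `ThreadedGaiResolver` and its methods are hypothetical.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def first_address(name):
    """Blocking lookup: return the first address getaddrinfo reports."""
    infos = socket.getaddrinfo(name, 0, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return infos[0][4][0]


class ThreadedGaiResolver:
    """Hypothetical resolver that keeps blocking lookups off the caller's thread."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def get_host_by_name(self, name):
        """Schedule the lookup and return a concurrent.futures.Future."""
        return self._pool.submit(first_address, name)

    def close(self):
        self._pool.shutdown()


if __name__ == "__main__":
    resolver = ThreadedGaiResolver()
    print(resolver.get_host_by_name("127.0.0.1").result(timeout=10))
    resolver.close()
```

Returning only the first address is a simplification; a real implementation would need to decide how to choose between IPv4 and IPv6 results.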

@nyov
Contributor

nyov commented Mar 24, 2015

An implementation based on socket.getaddrinfo: #1104

@w495

w495 commented Jan 3, 2017

While #1104 is not in master, this gist (DNS resolver middleware) may be helpful.

@qknight

qknight commented Oct 7, 2018

Any progress on this? Apart from the address resolution discussed here, Scrapy seems to be able to make IPv6 requests:

@nyov
Contributor

nyov commented Oct 10, 2018

@qknight, is this really still an issue? Can you please provide a testcase/log?
I was under the impression that the current Twisted Agent uses getaddrinfo by default, and that ScrapyAgent in HTTP11DownloadHandler therefore uses it too.

@qknight

qknight commented Oct 10, 2018

@nyov I don't know the current implementation, but from your response it seems that it supports IPv6 now. Is there an explicit switch to force Scrapy to use IPv6?

@glyph

glyph commented Dec 12, 2018

Indeed, this is still an issue. Scrapy disables Twisted's IPv6 support by installing a non-IPv6-aware resolver. The problem is here:

reactor.installResolver(self._get_dns_resolver())

If you don't want to trust the operating system's DNS caching for some reason, you can use the more modern API to install a custom resolver: https://twistedmatrix.com/documents/18.9.0/api/twisted.internet.interfaces.IReactorPluggableNameResolver.html#installNameResolver

Rather than subclassing a resolver within Twisted (you shouldn't need the internal _GAIResolver to be made public), you can write a generalized caching layer: a twisted.internet.interfaces.IHostnameResolver that takes another IHostnameResolver as an argument and caches its results. Then simply pass the previous value of reactor.nameResolver to it.

Hope that this helps!
