
IPv6 addresses not correctly recognized #1832

Closed
nyov opened this issue Mar 1, 2016 · 6 comments
@nyov
Contributor

nyov commented Mar 1, 2016

As a follow-up to #1116: Scrapy does not recognize IPv6 addresses correctly.

IPv6 addresses in URLs must be written inside brackets, as [<ip>].
(Compare browser behavior for http://::1/ and http://[::1]/. But beware of the wrongly URL-escaped [] when copying the second link.)
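The bracket rule can be seen with Python's own stdlib URL parser (a minimal sketch, nothing Scrapy-specific):

```python
# RFC 3986: an IPv6 literal in a URL authority must be bracketed,
# because ':' is also the host/port separator.
from urllib.parse import urlsplit

parts = urlsplit("http://[::1]:8080/")
bracketed_host = parts.hostname  # '::1' -- urlsplit strips the brackets
bracketed_port = parts.port     # 8080 -- unambiguous thanks to the brackets
```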

Scrapy seems to do the exact opposite:

$ scrapy-dev shell "http://[::1]/"
2016-03-01 14:31:34 [scrapy] INFO: Scrapy 1.2.0dev2 started (bot: testbot)
2016-03-01 14:31:34 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'testbot.spiders', 'SPIDER_MODULES': ['testbot.spiders'], 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'BOT_NAME': 'testbot'}
2016-03-01 14:31:34 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-03-01 14:31:34 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-03-01 14:31:34 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-03-01 14:31:34 [scrapy] INFO: Enabled item pipelines:
[]
2016-03-01 14:31:34 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-01 14:31:34 [scrapy] INFO: Spider opened
2016-03-01 14:31:34 [scrapy] DEBUG: Retrying <GET http://%5B::1%5D/> (failed 1 times): 503 Service Unavailable
2016-03-01 14:31:34 [scrapy] DEBUG: Retrying <GET http://%5B::1%5D/> (failed 2 times): 503 Service Unavailable
2016-03-01 14:31:34 [scrapy] DEBUG: Gave up retrying <GET http://%5B::1%5D/> (failed 3 times): 503 Service Unavailable
2016-03-01 14:31:34 [scrapy] DEBUG: Crawled (503) <GET http://%5B::1%5D/> (referer: None)
2016-03-01 14:31:34 [root] DEBUG: Using default logger
2016-03-01 14:31:34 [root] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f29e720d590>
[s]   item       {}
[s]   request    <GET http://%5B::1%5D/>
[s]   response   <503 http://%5B::1%5D/>
[s]   settings   <scrapy.settings.Settings object at 0x7f29e720d610>
[s]   spider     <DefaultSpider 'default' at 0x7f29e1d0b7d0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
In [1]: 

...without the brackets it seems to work, where it shouldn't, IMO.

$ scrapy-dev shell "http://::1/"
# <snip>
In [1]: response.status
Out[1]: 200
@nyov
Contributor Author

nyov commented Mar 1, 2016

This second example result requires running a webserver on localhost and having a working IPv6 stack.

@redapple redapple added the bug label Mar 23, 2016
@nyov
Contributor Author

nyov commented Mar 26, 2016

If anyone wonders why this [] escaping is necessary, here is why.
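In short: a bare IPv6 address can itself end in something that looks like a port, so without brackets the host/port split is ambiguous. A small stdlib illustration (example address is documentation-only, RFC 3849):

```python
import ipaddress

# "2001:db8::1:80" is a complete, valid IPv6 address on its own --
# so without brackets, a trailing ":80" can never safely be read as a port.
addr = ipaddress.ip_address("2001:db8::1:80")
```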

@redapple
Contributor

redapple commented Apr 4, 2016

@nyov, FYI, I tried with #1874, contacting www.google.com over IPv6, and got the host passed correctly to Twisted.
But it seems the default Agent only handles IPv4 and sends the Host header value without the required brackets.

What Wireshark sniffed (it did contact the correct endpoint at 2a00:1450:4007:80d::2004):

GET / HTTP/1.1
Host: 2a00:1450:4007:80d::2004
Accept-Language: en
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip,deflate
User-Agent: Scrapy/1.2.0dev2 (+http://scrapy.org)

HTTP/1.1 400 Bad Request
Content-Length: 54
Content-Type: text/html; charset=UTF-8
Date: Mon, 04 Apr 2016 17:38:45 GMT
Connection: close

<html><title>Error 400 (Bad Request)!!1</title></html>

Scrapy shell showing HTTP 400 (presumably due to the bad Host header):

$ scrapy shell http://[2a00:1450:4007:80d::2004]/
2016-04-04 19:35:59 [scrapy] INFO: Scrapy 1.2.0dev2 started (bot: scrapybot)
2016-04-04 19:35:59 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-04-04 19:36:00 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats']
2016-04-04 19:36:00 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-04-04 19:36:00 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-04-04 19:36:00 [scrapy] INFO: Enabled item pipelines:
[]
2016-04-04 19:36:00 [scrapy] INFO: Spider opened
http://[2a00:1450:4007:80d::2004]/
2016-04-04 19:36:00 [scrapy] DEBUG: Crawled (400) <GET http://[2a00:1450:4007:80d::2004]/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f55c2ba5048>
[s]   item       {}
[s]   request    <GET http://[2a00:1450:4007:80d::2004]/>
[s]   response   <400 http://[2a00:1450:4007:80d::2004]/>
[s]   settings   <scrapy.settings.Settings object at 0x7f55c007fd30>
[s]   spider     <DefaultSpider 'default' at 0x7f55b8b0d6d8>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>> request.headers
{b'Accept-Encoding': [b'gzip,deflate'], b'Accept-Language': [b'en'], b'User-Agent': [b'Scrapy/1.2.0dev2 (+http://scrapy.org)'], b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8']}
>>> response.headers
{b'Date': [b'Mon, 04 Apr 2016 17:36:00 GMT'], b'Content-Type': [b'text/html; charset=UTF-8']}
>>> response.body
b'<html><title>Error 400 (Bad Request)!!1</title></html>'
>>> 
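For reference, per RFC 7230 the Host header carries the uri-host, which keeps the brackets for an IP-literal. A minimal sketch of building such a value with the stdlib (hypothetical `host_header` helper, not Scrapy's or Twisted's actual code):

```python
from urllib.parse import urlsplit

def host_header(url: str) -> str:
    """Build an RFC 7230 Host header value; IPv6 literals keep their brackets."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:                # IPv6 literal: re-add the brackets urlsplit stripped
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return host

print(host_header("http://[2a00:1450:4007:80d::2004]/"))  # [2a00:1450:4007:80d::2004]
```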

@nyov
Contributor Author

nyov commented Apr 5, 2016

Sounds like a big improvement. Thanks for the work! :)

@nyov
Contributor Author

nyov commented May 23, 2016

I was about to close this, as all the linked patches seem merged.
But while IPv6 addresses now seem to work correctly on the 1.1 branch, it's still wrong on current master:
<GET http://%5B::1%5D/> instead of <GET http://[::1]/>.

I can't follow all the issues and backports spawned from #1874 to figure out what commit might be missing on master but made it into 1.1.
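The %5B/%5D pattern is what you get when the brackets themselves are percent-escaped. This can be reproduced with plain stdlib quoting (an illustration of the failure mode only, not Scrapy's actual escaping code path):

```python
from urllib.parse import quote

# Default quoting treats '[' and ']' as unsafe characters, which is
# exactly how "http://[::1]/" can degrade into "http://%5B::1%5D/".
escaped = quote("http://[::1]/", safe=":/")
kept = quote("http://[::1]/", safe=":/[]")
print(escaped)  # http://%5B::1%5D/
print(kept)     # http://[::1]/
```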

@wRAR
Member

wRAR commented Nov 17, 2022

This currently doesn't work (failing with "ValueError: invalid hostname: :") because of scrapy/w3lib#193, but if I downgrade w3lib to 1.22.0, the URL is parsed correctly and not escaped. So no further changes in Scrapy are needed.

@wRAR wRAR closed this as completed Nov 17, 2022