
IPv6 addresses not correctly recognized #1832

Closed
nyov opened this issue Mar 1, 2016 · 6 comments
@nyov
Contributor

nyov commented Mar 1, 2016

As a follow-up to #1116: Scrapy does not recognize IPv6 addresses correctly.

IPv6 addresses in URLs must be written inside brackets, as [<ip>].
(Compare browser behavior for http://::1/ and http://[::1]/. But beware of the wrongly URL-escaped [] when copying the second link.)
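The bracket rule can be seen with Python's own stdlib URL parser (a minimal sketch, nothing Scrapy-specific):

```python
# RFC 3986: an IPv6 literal in a URL authority must be bracketed,
# because ':' is also the host/port separator.
from urllib.parse import urlsplit

parts = urlsplit("http://[::1]:8080/")
bracketed_host = parts.hostname  # '::1' -- urlsplit strips the brackets
bracketed_port = parts.port     # 8080 -- unambiguous thanks to the brackets
```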

Scrapy seems to do the exact opposite:

$ scrapy-dev shell "http://[::1]/"
2016-03-01 14:31:34 [scrapy] INFO: Scrapy 1.2.0dev2 started (bot: testbot)
2016-03-01 14:31:34 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'testbot.spiders', 'SPIDER_MODULES': ['testbot.spiders'], 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'BOT_NAME': 'testbot'}
2016-03-01 14:31:34 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-03-01 14:31:34 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-03-01 14:31:34 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-03-01 14:31:34 [scrapy] INFO: Enabled item pipelines:
[]
2016-03-01 14:31:34 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-01 14:31:34 [scrapy] INFO: Spider opened
2016-03-01 14:31:34 [scrapy] DEBUG: Retrying <GET http://%5B::1%5D/> (failed 1 times): 503 Service Unavailable
2016-03-01 14:31:34 [scrapy] DEBUG: Retrying <GET http://%5B::1%5D/> (failed 2 times): 503 Service Unavailable
2016-03-01 14:31:34 [scrapy] DEBUG: Gave up retrying <GET http://%5B::1%5D/> (failed 3 times): 503 Service Unavailable
2016-03-01 14:31:34 [scrapy] DEBUG: Crawled (503) <GET http://%5B::1%5D/> (referer: None)
2016-03-01 14:31:34 [root] DEBUG: Using default logger
2016-03-01 14:31:34 [root] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f29e720d590>
[s]   item       {}
[s]   request    <GET http://%5B::1%5D/>
[s]   response   <503 http://%5B::1%5D/>
[s]   settings   <scrapy.settings.Settings object at 0x7f29e720d610>
[s]   spider     <DefaultSpider 'default' at 0x7f29e1d0b7d0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
In [1]: 

...without the brackets it seems to work, where it shouldn't, IMO.

$ scrapy-dev shell "http://::1/"
# <snip>
In [1]: response.status
Out[1]: 200
@nyov
Contributor Author

nyov commented Mar 1, 2016

This second example result requires running a webserver on localhost and having a working IPv6 stack.

@redapple redapple added the bug label Mar 23, 2016
@nyov
Contributor Author

nyov commented Mar 26, 2016

If anyone wonders why this [] escaping is necessary, here is why.
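In short: a bare IPv6 address can itself end in something that looks like a port, so without brackets the host/port split is ambiguous. A small stdlib illustration (example address is documentation-only, RFC 3849):

```python
import ipaddress

# "2001:db8::1:80" is a complete, valid IPv6 address on its own --
# so without brackets, a trailing ":80" can never safely be read as a port.
addr = ipaddress.ip_address("2001:db8::1:80")
```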

@redapple
Contributor

redapple commented Apr 4, 2016

@nyov, FYI, I tried with #1874, contacting www.google.com over IPv6, and got the host passed correctly to Twisted.
But it seems the default Agent only handles IPv4 and sends the Host header value without the required brackets.

What Wireshark sniffed (it did contact the correct endpoint at 2a00:1450:4007:80d::2004):

GET / HTTP/1.1
Host: 2a00:1450:4007:80d::2004
Accept-Language: en
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip,deflate
User-Agent: Scrapy/1.2.0dev2 (+http://scrapy.org)

HTTP/1.1 400 Bad Request
Content-Length: 54
Content-Type: text/html; charset=UTF-8
Date: Mon, 04 Apr 2016 17:38:45 GMT
Connection: close

<html><title>Error 400 (Bad Request)!!1</title></html>

Scrapy shell showing HTTP 400 (presumably due to the bad Host header):

$ scrapy shell http://[2a00:1450:4007:80d::2004]/
2016-04-04 19:35:59 [scrapy] INFO: Scrapy 1.2.0dev2 started (bot: scrapybot)
2016-04-04 19:35:59 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-04-04 19:36:00 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats']
2016-04-04 19:36:00 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-04-04 19:36:00 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-04-04 19:36:00 [scrapy] INFO: Enabled item pipelines:
[]
2016-04-04 19:36:00 [scrapy] INFO: Spider opened
http://[2a00:1450:4007:80d::2004]/
2016-04-04 19:36:00 [scrapy] DEBUG: Crawled (400) <GET http://[2a00:1450:4007:80d::2004]/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f55c2ba5048>
[s]   item       {}
[s]   request    <GET http://[2a00:1450:4007:80d::2004]/>
[s]   response   <400 http://[2a00:1450:4007:80d::2004]/>
[s]   settings   <scrapy.settings.Settings object at 0x7f55c007fd30>
[s]   spider     <DefaultSpider 'default' at 0x7f55b8b0d6d8>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>> request.headers
{b'Accept-Encoding': [b'gzip,deflate'], b'Accept-Language': [b'en'], b'User-Agent': [b'Scrapy/1.2.0dev2 (+http://scrapy.org)'], b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8']}
>>> response.headers
{b'Date': [b'Mon, 04 Apr 2016 17:36:00 GMT'], b'Content-Type': [b'text/html; charset=UTF-8']}
>>> response.body
b'<html><title>Error 400 (Bad Request)!!1</title></html>'
>>> 
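For reference, per RFC 7230 the Host header carries the uri-host, which keeps the brackets for an IP-literal. A minimal sketch of building such a value with the stdlib (hypothetical `host_header` helper, not Scrapy's or Twisted's actual code):

```python
from urllib.parse import urlsplit

def host_header(url: str) -> str:
    """Build an RFC 7230 Host header value; IPv6 literals keep their brackets."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:                # IPv6 literal: re-add the brackets urlsplit stripped
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return host

print(host_header("http://[2a00:1450:4007:80d::2004]/"))  # [2a00:1450:4007:80d::2004]
```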

@nyov
Contributor Author

nyov commented Apr 5, 2016

Sounds like a big improvement. Thanks for the work! :)

@nyov
Contributor Author

nyov commented May 23, 2016

I was about to close this, as all the linked patches seem merged.
But while IPv6 addresses now seem to work correctly on the 1.1 branch, it's still wrong on current master:
<GET http://%5B::1%5D/> instead of <GET http://[::1]/>.

I can't follow all the issues and backports spawned from #1874 to figure out what commit might be missing on master but made it into 1.1.
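The %5B/%5D pattern is what you get when the brackets themselves are percent-escaped. This can be reproduced with plain stdlib quoting (an illustration of the failure mode only, not Scrapy's actual escaping code path):

```python
from urllib.parse import quote

# Default quoting treats '[' and ']' as unsafe characters, which is
# exactly how "http://[::1]/" can degrade into "http://%5B::1%5D/".
escaped = quote("http://[::1]/", safe=":/")
kept = quote("http://[::1]/", safe=":/[]")
print(escaped)  # http://%5B::1%5D/
print(kept)     # http://[::1]/
```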

@wRAR
Member

wRAR commented Nov 17, 2022

This currently doesn't work (failing with "ValueError: invalid hostname: :") because of scrapy/w3lib#193, but if I downgrade w3lib to 1.22.0, the URL is parsed correctly and not escaped. So no further changes in Scrapy are needed.

@wRAR wRAR closed this as completed Nov 17, 2022