urllib2 can't handle http://www.wikispaces.com #46716
Comments
Try the following code:

import urllib2
gmail = urllib2.urlopen("https://www.gmail.com").read()
wikispaces = urllib2.urlopen("http://www.wikispaces.com").read()

Getting the HTML over HTTPS from gmail.com works, but not over HTTP from wikispaces.com. Here's the traceback:
>>> wikispaces = urllib2.urlopen("http://www.wikispaces.com").read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/urllib2.py", line 121, in urlopen
return _opener.open(url, data)
File "/usr/lib/python2.5/urllib2.py", line 380, in open
response = meth(req, response)
File "/usr/lib/python2.5/urllib2.py", line 491, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.5/urllib2.py", line 412, in error
result = self._call_chain(*args)
File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
result = func(*args)
File "/usr/lib/python2.5/urllib2.py", line 575, in http_error_302
return self.parent.open(new)
File "/usr/lib/python2.5/urllib2.py", line 380, in open
response = meth(req, response)
File "/usr/lib/python2.5/urllib2.py", line 491, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.5/urllib2.py", line 412, in error
result = self._call_chain(*args)
File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
result = func(*args)
File "/usr/lib/python2.5/urllib2.py", line 575, in http_error_302
return self.parent.open(new)
File "/usr/lib/python2.5/urllib2.py", line 374, in open
response = self._open(req, data)
File "/usr/lib/python2.5/urllib2.py", line 392, in _open
'_open', req)
File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
result = func(*args)
File "/usr/lib/python2.5/urllib2.py", line 1100, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.5/urllib2.py", line 1075, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error (104, 'Connection reset by peer')>

Note the two 302 redirects. I tried accessing wikispaces.com with SSL turned off in Firefox. Why doesn't urllib2 handle the "hidden" SSL properly? Thanks!
The problem does not appear to have anything to do with SSL. The request sequence is GET -> 302 -> 302 -> 301. On the final 301, urllib2's internal state is messed up such that it requests 'http://www.wikispaces.com/\x00/?responseToken=481aec3249f429316459e01c00b7e522'. The \x00 and everything after it should not be there, and is not there when other HTTP clients make the same request.
Please take your time, because this bug isn't critical. Thanks!
Instrumenting the code and looking closer at the tcpdump, it's true. The "fix" on our end should be to handle such garbage from such broken servers.
I'm not sure what the best solution for this is. If I truncate the header line at the first null byte, it works. Verdict: wikispaces.com is broken, but urllib2 could do better; wget and Firefox deal with it properly. Here is a patch to implement either behavior of dealing with nulls where they appear in header lines.

Index: Lib/httplib.py
--- Lib/httplib.py (revision 62033)
+++ Lib/httplib.py (working copy)
@@ -291,9 +291,18 @@
                 break
             headerseen = self.isheader(line)
             if headerseen:
+                # Some bad web servers reply with headers with a \x00 null
+                # embedded in the value. Other http clients deal with
+                # this by treating it as a value terminator, ignoring the
+                # rest so we will too. http://bugs.python.org/issue2464.
+                if '\x00' in line:
+                    line = line[:line.find('\x00')]
+                # if you want to just remove nulls instead use this:
+                #line = line.replace('\x00', '')
                 # It's a legal header line, save it.
                 hlist.append(line)
-                self.addheader(headerseen, line[len(headerseen)+1:].strip())
+                value = line[len(headerseen)+1:].strip()
+                self.addheader(headerseen, value)
                 continue
             else:
                 # It's not a header line; throw it back and stop here.
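To make the intended behaviour concrete, here is a standalone illustration (not part of the patch itself; the header value is made up) of the truncation the new code performs:

# Everything from the first NUL byte onward is dropped before the header
# line is parsed and stored.
line = "Location: http://www.wikispaces.com/\x00/garbage\r\n"
if '\x00' in line:
    line = line[:line.find('\x00')]
print repr(line)   # 'Location: http://www.wikispaces.com/'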
The issue is not just with the null character. If you observe now, the redirection is 302 -> 302 -> 200 and there is no null character. However, urllib2 is still unable to handle multiple redirections properly. (IIRC, there is a portion of code to handle multiple redirections and exit on an infinite loop.)
>>> url = "http://www.wikispaces.com"
>>> opened = urllib.urlopen(url)
>>> print opened.geturl()
http://www.wikispaces.com?responseToken=344289da354a29c67d48928dbe72042a
>>> print opened.read()
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>nginx/0.6.30</center>
</body>
</html>

Needs a relook, IMO.
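As a side note on the loop-protection code mentioned above: urllib2's HTTPRedirectHandler limits how many redirects it follows via its max_redirections and max_repeats class attributes. A minimal sketch, assuming you want to raise those limits while debugging a redirect chain like this one (the class name here is made up):

import urllib2

class PatientRedirectHandler(urllib2.HTTPRedirectHandler):
    # The stock handler gives up after max_redirections hops in total and
    # after max_repeats visits to the same URL; bump both for debugging.
    max_redirections = 20
    max_repeats = 8

opener = urllib2.build_opener(PatientRedirectHandler())
# wikispaces = opener.open("http://www.wikispaces.com").read()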
Senthil: Look at the URL that the server returned in the second redirect:

http://www.wikispaces.com?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d

See that the "?" appears without a path between the host and it. Check item 3.2.2 in RFC 2616; it says that an HTTP URL should be:

http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

So we should fix the URL that the server returned. Guess what: if we insert a "/" as the path before following the redirect, the request succeeds. The patch I attach here does that. All tests pass ok. What do you think?
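For reference, here is a minimal sketch of the kind of normalization being proposed (the helper name is illustrative, not what the patch necessarily uses; per the rest of the thread, the committed fix lives in urllib2's redirect handling): if the redirect target has an empty path component, set it to "/" before re-issuing the request.

import urlparse

def normalize_redirect_url(newurl):
    # RFC 2616 3.2.2 expects an abs_path before any query, so if the server
    # sent e.g. "http://host?query", fill in "/" for the empty path.
    parts = list(urlparse.urlparse(newurl))
    if not parts[2]:              # index 2 is the path component
        parts[2] = "/"
    return urlparse.urlunparse(parts)

print normalize_redirect_url(
    "http://www.wikispaces.com?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d")
# -> http://www.wikispaces.com/?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d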
Looks good to me.
Ah, that was a simple fix. :) I very much overlooked the problem. I have some comments on the patch, Facundo.
But if we handle it in the urlparse methods, then we cover it much more generally. So, I introduced a fix_broken() method in urlparse and called it to solve this. All tests pass with bpo-2464-py26-FINAL.diff. Comments, please?
Patch for py3k, but please test this before applying.
Senthil: I don't like that. Creating a public method called "fix_broken" introduces new behaviours into urlparse. I committed the change I proposed. Maybe in the future we will have a more general fix.
I was pondering if it should go in urlparse instead. Anyway, agreed: this fixes this specific bug. Should it be backported?
Maybe we can put it in urlunparse... do you all agree with these test cases?

    def test_alwayspath(self):
        u = urlparse.urlparse("http://netloc/path;params?query#fragment")
        self.assertEqual(urlparse.urlunparse(u),
                         "http://netloc/path;params?query#fragment")
        u = urlparse.urlparse("http://netloc?query#fragment")
        self.assertEqual(urlparse.urlunparse(u),
                         "http://netloc/?query#fragment")
        u = urlparse.urlparse("http://netloc#fragment")
        self.assertEqual(urlparse.urlunparse(u), "http://netloc/#fragment")

Maybe we could backport this more general fix...
That test case looks good to me for 2.6 and 3.0. Also add a note to the documentation. I would not backport the more general urlunparse behavior change to 2.5.
Gregory... I tried to fill the path in urlunparse and other functions. As we're so close to the final releases, I'll leave this as it is right now.
That was the reason for making fix_broken in urlparse in my patch, Facundo. I am kind of +0 on the current fix in urllib2. Should we think/plan a more general fix for later?
This fix was applied in the wrong place. URI path components, and HTTP URI path components in particular, *can* be empty. Note that RFC 2616 incorrectly claims to refer to the definition of URIs in RFC 2396. No test was added with this fix, which makes it unnecessarily hard to see what behaviour changed ('http://www.wikispaces.com' + ... before, and 'http://www.wikispaces.com' + ... after the fix was applied).
I've opened bpo-4493 about the issue I raised in my previous comment.