
urllib2 can't handle http://www.wikispaces.com #46716

Closed
weijie90 mannequin opened this issue Mar 23, 2008 · 18 comments
Labels: stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

weijie90 (mannequin) commented Mar 23, 2008

BPO 2464
Nosy @facundobatista, @gpshead, @orsenthil
Files
  • issue2464-facundo.diff
  • issue2464-py26-FINAL.diff
  • issue2464-PATCH1.diff
  • issue2463-py3k.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/facundobatista'
    closed_at = <Date 2008-08-17.03:39:08.602>
    created_at = <Date 2008-03-23.14:41:09.498>
    labels = ['type-bug', 'library']
    title = "urllib2 can't handle http://www.wikispaces.com"
    updated_at = <Date 2008-12-02.20:47:14.241>
    user = 'https://bugs.python.org/weijie90'

    bugs.python.org fields:

    activity = <Date 2008-12-02.20:47:14.241>
    actor = 'jjlee'
    assignee = 'facundobatista'
    closed = True
    closed_date = <Date 2008-08-17.03:39:08.602>
    closer = 'facundobatista'
    components = ['Library (Lib)']
    creation = <Date 2008-03-23.14:41:09.498>
    creator = 'weijie90'
    dependencies = []
    files = ['11127', '11132', '11133', '11134']
    hgrepos = []
    issue_num = 2464
    keywords = ['patch']
    message_count = 18.0
    messages = ['64363', '64676', '64677', '64679', '64680', '66889', '71231', '71255', '71259', '71260', '71261', '71295', '71303', '71304', '71990', '72024', '76736', '76779']
    nosy_count = 5.0
    nosy_names = ['facundobatista', 'gregory.p.smith', 'jjlee', 'orsenthil', 'weijie90']
    pr_nums = []
    priority = 'low'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue2464'
    versions = ['Python 2.6', 'Python 2.5', 'Python 3.0']

    weijie90 (mannequin, author) commented Mar 23, 2008

    Try the following code:

    import urllib2
    gmail = urllib2.urlopen("https://www.gmail.com").read()
    wikispaces = urllib2.urlopen("http://www.wikispaces.com").read()
    Fetching the HTML over HTTPS from gmail.com works, but not over HTTP
    from wikispaces. Here's the traceback:
    >>> wikispaces = urllib2.urlopen("http://www.wikispaces.com").read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.5/urllib2.py", line 121, in urlopen
        return _opener.open(url, data)
      File "/usr/lib/python2.5/urllib2.py", line 380, in open
        response = meth(req, response)
      File "/usr/lib/python2.5/urllib2.py", line 491, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python2.5/urllib2.py", line 412, in error
        result = self._call_chain(*args)
      File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.5/urllib2.py", line 575, in http_error_302
        return self.parent.open(new)
      File "/usr/lib/python2.5/urllib2.py", line 380, in open
        response = meth(req, response)
      File "/usr/lib/python2.5/urllib2.py", line 491, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python2.5/urllib2.py", line 412, in error
        result = self._call_chain(*args)
      File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.5/urllib2.py", line 575, in http_error_302
        return self.parent.open(new)
      File "/usr/lib/python2.5/urllib2.py", line 374, in open
        response = self._open(req, data)
      File "/usr/lib/python2.5/urllib2.py", line 392, in _open
        '_open', req)
      File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.5/urllib2.py", line 1100, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "/usr/lib/python2.5/urllib2.py", line 1075, in do_open
        raise URLError(err)
    urllib2.URLError: <urlopen error (104, 'Connection reset by peer')>

    Note the two 302 redirects.
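
    A minimal sketch (Python 2, matching the traceback above) of how one
    could log the redirect chain urllib2 follows; the handler hook is
    standard urllib2 API, while LoggingRedirectHandler is an illustrative
    name:

        import urllib2

        class LoggingRedirectHandler(urllib2.HTTPRedirectHandler):
            # redirect_request() is called for each 301/302/303/307
            # response; newurl is the Location value being followed.
            def redirect_request(self, req, fp, code, msg, headers, newurl):
                print "%d redirect -> %r" % (code, newurl)
                return urllib2.HTTPRedirectHandler.redirect_request(
                    self, req, fp, code, msg, headers, newurl)

        opener = urllib2.build_opener(LoggingRedirectHandler())
        opener.open("http://www.wikispaces.com")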

    I tried accessing wikispaces.com with SSL turned off in Firefox
    2.0.0.12; that didn't work because SSL was required, perhaps somewhere
    in between the redirects that wikispaces uses.

    Why doesn't urllib2 handle the "hidden" SSL properly? (Not to be rude,
    but httplib2 works.)

    Thanks!
    WJ

    @weijie90 weijie90 mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Mar 23, 2008

    gpshead (member) commented Mar 29, 2008

    The problem does not appear to have anything to do with SSL. The
    problem is that the chain of HTTP requests goes:

    GET -> 302 -> 302 -> 301

    On the final 301, urllib2's internal state is messed up such that by
    the time it's in the http_error_302 method it believes the Location
    header contains the following:

    'http://www.wikispaces.com/\x00/?responseToken=481aec3249f429316459e01c00b7e522'

    The \x00 and everything after it should not be there, and is not there
    if you look at what is sent over the socket itself. The
    ?responseToken=xxx value is left over from the previous 302 response.
    No idea where the \x00 came from yet. I'm debugging...

    @gpshead gpshead self-assigned this Mar 29, 2008

    weijie90 (mannequin, author) commented Mar 29, 2008

    Please take your time, because this bug isn't critical. Thanks!

    gpshead (member) commented Mar 29, 2008

    Instrumenting the code and looking closer at the tcpdump, it's true:
    wikispaces.com is returning an invalid Location: header with a null
    byte in the middle of it.

    The "fix" on our end should be to handle such garbage from broken web
    servers more gracefully. Other clients seem to treat the null as an
    end-of-string or the end of that header.
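
    A minimal illustration, in plain Python, of treating the NUL as a
    terminator (and, alternatively, stripping it), using the corrupted
    value quoted in the previous comment:

        # Header value as urllib2 saw it, per the earlier comment.
        raw = 'http://www.wikispaces.com/\x00/?responseToken=481aec3249f429316459e01c00b7e522'

        truncated = raw.split('\x00', 1)[0]  # NUL as value terminator -> 'http://www.wikispaces.com/'
        stripped = raw.replace('\x00', '')   # or drop NUL bytes entirely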

    gpshead (member) commented Mar 29, 2008

    I'm not sure what the best solution for this is. If I truncate the
    header values at a \x00 character, it ends in an infinite redirect loop
    (which urllib2 detects and raises on). If I simply remove all \x00
    characters, the resulting URL is not accepted by wikispaces.com due to
    having an extra / in it.

    Verdict: wikispaces.com is broken.

    urllib2 could do better. wget and Firefox deal with it properly, but
    I'll leave deciding which patch to use up to someone who cares about
    handling broken sites.

    Patch to implement either behavior of dealing with nulls where they
    shouldn't be:

    Index: Lib/httplib.py
    ===================================================================

    --- Lib/httplib.py      (revision 62033)
    +++ Lib/httplib.py      (working copy)
    @@ -291,9 +291,18 @@
                     break
                 headerseen = self.isheader(line)
                 if headerseen:
    +                # Some bad web servers reply with headers with a \x00 null
    +                # embedded in the value.  Other http clients deal with
    +                # this by treating it as a value terminator, ignoring the
    +                # rest so we will too.  http://bugs.python.org/issue2464.
    +                if '\x00' in line:
    +                    line = line[:line.find('\x00')]
    +                    # if you want to just remove nulls instead use this:
    +                    #line = line.replace('\x00', '')
                     # It's a legal header line, save it.
                     hlist.append(line)
    -                self.addheader(headerseen, line[len(headerseen)+1:].strip())
    +                value = line[len(headerseen)+1:].strip()
    +                self.addheader(headerseen, value)
                     continue
                 else:
                     # It's not a header line; throw it back and stop here.

    @gpshead gpshead removed their assignment Mar 29, 2008

    orsenthil (member) commented:

    The issue is not just with the null character. If you observe now, the
    redirection chain is 302 -> 302 -> 200 and there is no null character.
    However, urllib2 is still unable to handle the multiple redirections
    properly (IIRC, there is a portion of code that handles multiple
    redirections and exits on an infinite loop):
    >>> url = "http://www.wikispaces.com"
    >>> opened = urllib.urlopen(url)
    >>> print opened.geturl()
    http://www.wikispaces.com?responseToken=344289da354a29c67d48928dbe72042a
    >>> print opened.read()
    <html>
    <head><title>400 Bad Request</title></head>
    <body bgcolor="white">
    <center><h1>400 Bad Request</h1></center>
    <hr><center>nginx/0.6.30</center>
    </body>
    </html>

    Needs a relook, IMO.

    @facundobatista facundobatista self-assigned this Jul 3, 2008

    facundobatista (member) commented:

    Senthil:

    Look at that URL that the server returned in the second redirect:

    http://www.wikispaces.com?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d

    See that the "?" appears without a path between the host and it.

    Check section 3.2.2 of RFC 2616; it says that an HTTP URL should be:

      http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

    So we should fix the URL that the server returned. Guess what: if we
    put in a "/" (as the RFC mandates), everything works fine.

    The patch I attach here does that. All tests pass ok.

    What do you think?
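
    A minimal sketch of the idea (illustrative only, not the committed
    patch; the helper name ensure_path is hypothetical), using the stdlib
    urlparse module:

        import urlparse

        def ensure_path(url):
            # RFC 2616 section 3.2.2: if abs_path is absent, it MUST be
            # given as "/" when the URL is used as a Request-URI.
            scheme, netloc, path, params, query, fragment = urlparse.urlparse(url)
            if not path:
                path = '/'
            return urlparse.urlunparse((scheme, netloc, path, params, query, fragment))

        print ensure_path('http://www.wikispaces.com?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d')
        # -> http://www.wikispaces.com/?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d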

    gpshead (member) commented Aug 17, 2008

    looks good to me.

    orsenthil (member) commented:

    Ah, that was a simple fix. :) I somehow overlooked the problem, even
    after being given so many hints at the web-sig.

    I have some comments on the patch, Facundo.

    1. I don't think it is a good idea to include that portion in the
      http_error_302 method. That makes the fix "very" specific to "this"
      issue only.
      Another point is that fixing broken URLs should not be done in
      urllib2; urlparse would be a better place.
      So I came up with an approach wherein urllib2 does unparse(parse) of
      the URL, and the parse methods fix the URL if it is broken. (See
      attached bpo-2464-PATCH1.diff.)

    But if we handle it in the urlparse methods, then we are much more
    susceptible to breaking RFC conformance and breaking a lot of tests,
    which is not a good idea.

    So, I introduced a fix_broken() method in urlparse and called it to
    solve the issue, using the same logic as yours
    (bpo-2464-py26-FINAL.diff). With a fix_broken() method in urlparse, we
    will have better control whenever we want to implement behavior that is
    RFC non-conforming but widely implemented by browsers and clients.
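
    A rough sketch of what such a urlparse-level helper could look like
    (the name fix_broken comes from the patch description above; the body
    here is an assumption, not the attached patch):

        import urlparse

        def fix_broken(url):
            # Assumed behavior: give http(s) URLs an explicit "/" path
            # when the path component is empty, per RFC 2616 section 3.2.2.
            parts = urlparse.urlparse(url)
            if parts[0] in ('http', 'https') and not parts[2]:
                parts = parts[:2] + ('/',) + parts[3:]
            return urlparse.urlunparse(parts)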

    All tests pass with bpo-2464-py26-FINAL.diff

    Comments, please?

    orsenthil (member) commented:

    Patch for py3k, but please test this before applying.

    facundobatista (member) commented:

    Senthil: I don't like that.

    Creating a public method called "fix_broken", introducing new
    behaviours now in beta, and actually not fixing the URL in every broken
    case (just the path when it's missing) is way too much for this fix.

    I committed the change I proposed. Maybe in the future we will have a
    "generic URL fixing" function; now is not the moment.

    gpshead (member) commented Aug 17, 2008

    I was pondering whether it should go in urlparse instead. If it did, I
    think it should be part of urlparse.urlunparse, to ensure that there is
    always a trailing slash after the host:port regardless of what the
    inputs are.

    Anyway, agreed, this fixes this specific bug. Should it be backported
    to release25-maint?

    facundobatista (member) commented:

    Maybe we can put it in urlunparse... do you all agree with these test cases?

    def test_alwayspath(self):
        u = urlparse.urlparse("http://netloc/path;params?query#fragment")
        self.assertEqual(urlparse.urlunparse(u),
                         "http://netloc/path;params?query#fragment")

        u = urlparse.urlparse("http://netloc?query#fragment")
        self.assertEqual(urlparse.urlunparse(u),
                         "http://netloc/?query#fragment")

        u = urlparse.urlparse("http://netloc#fragment")
        self.assertEqual(urlparse.urlunparse(u), "http://netloc/#fragment")

    Maybe we could backport this more general fix...

    gpshead (member) commented Aug 18, 2008

    That test case looks good to me for 2.6 and 3.0. Also add a note to the
    documentation, with a versionchanged 2.6, about urlunparse always
    ensuring there is a / between the netloc and the rest of the URL.

    I would not backport the more general urlunparse behavior change to 2.5.

    facundobatista (member) commented:

    Gregory... I tried to fill in the path in urlunparse, and other
    functions that use it started to fail.

    As we're so close to the final releases, I'll leave this as it is right
    now, which actually fixed the bug...

    orsenthil (member) commented:

    That was the reason for putting fix_broken in urlparse in my patch,
    Facundo. I had thought it should be handled in the urlparse module;
    if we make changes in urlparse.urlunparse/urlparse.urlparse themselves,
    then we are stepping into an area that will break a lot of tests.

    I am kind of +0 on the current fix in urllib2. Should we think/plan for
    something in urlparse, akin to fix_broken?

    jjlee (mannequin) commented Dec 2, 2008

    This fix was applied in the wrong place.

    URI path components, and HTTP URI path components in particular, *can*
    be empty. See RFC 3986. So the comment inserted into the code with the
    fix for this bug, which says "possibly malformed", is incorrect; it
    should instead just refer to section 3.2.2 of RFC 2616. Also, because
    3.2.2 says "If the abs_path is not present in the URL, it MUST be given
    as "/" when used as a Request-URI for a resource (section 5.1.2)", it
    seems clear that this transformation (add the slash if there's no path
    component) should always happen when retrieving URIs, regardless of
    where the URI came from -- not only for redirects.

    Note that RFC 2616 incorrectly claims to refer to the definition of
    abs_path from RFC 2396. The fact that it's incorrect is made clear in
    2616 itself, in section 3.2.3, when it says that abs_path can be empty.
    In any case, RFC 2396 is obsoleted by RFC 3986, which is clear on this
    issue, and reflects actual usage of URIs. URIs like http://python.org
    and http://python.org?spam=eggs have been in widespread use for a long
    time, and typing the latter URL into firefox (3.0.2) confirms that
    what's actually sent is "/?spam", whereas urllib2 still sends "?spam".
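
    A quick check of that point with the Python 2 urlparse module (URL
    taken from the paragraph above):

        import urlparse

        parts = urlparse.urlparse('http://python.org?spam=eggs')
        print repr(parts.path)  # '' -- an empty path is legal per RFC 3986
        # Per RFC 2616 section 5.1.2, the Request-URI sent on the wire must
        # then be '/?spam=eggs', which is what Firefox sends; urllib2
        # sends '?spam=eggs'.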

    No test was added with this fix, which makes it unnecessarily hard to
    work out what exactly the fix was supposed to fix. For the record, this
    is the sequence of redirections before the fix was applied (showing base
    URI + redirect URI reference --> redirect URI):

    'http://www.wikispaces.com' +
    'https://session.wikispaces.com/session/auth?authToken=token' -->
    'https://session.wikispaces.com/session/auth?authToken=token'
    'https://session.wikispaces.com/session/auth?authToken=token' +
    'http://www.wikispaces.com?responseToken=token' -->
    'http://www.wikispaces.com?responseToken=token'

    and after the fix was applied:

    'http://www.wikispaces.com' +
    'https://session.wikispaces.com/session/auth?authToken=token' -->
    'https://session.wikispaces.com/session/auth?authToken=token'
    'https://session.wikispaces.com/session/auth?authToken=token' +
    'http://www.wikispaces.com/?responseToken=token' -->
    'http://www.wikispaces.com/?responseToken=token'
    'http://www.wikispaces.com/?responseToken=token' +
    'http://www.wikispaces.com/' --> 'http://www.wikispaces.com/'

    jjlee (mannequin) commented Dec 2, 2008

    I've raised bpo-4493 about the issue I raised in my previous comment.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022