
urllib2 can't handle http://www.wikispaces.com #46716

Closed
weijie90 mannequin opened this issue Mar 23, 2008 · 18 comments
Labels: stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

weijie90 (mannequin) commented Mar 23, 2008

BPO 2464
Nosy @facundobatista, @gpshead, @orsenthil
Files
  • issue2464-facundo.diff
  • issue2464-py26-FINAL.diff
  • issue2464-PATCH1.diff
  • issue2463-py3k.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/facundobatista'
    closed_at = <Date 2008-08-17.03:39:08.602>
    created_at = <Date 2008-03-23.14:41:09.498>
    labels = ['type-bug', 'library']
    title = "urllib2 can't handle http://www.wikispaces.com"
    updated_at = <Date 2008-12-02.20:47:14.241>
    user = 'https://bugs.python.org/weijie90'

    bugs.python.org fields:

    activity = <Date 2008-12-02.20:47:14.241>
    actor = 'jjlee'
    assignee = 'facundobatista'
    closed = True
    closed_date = <Date 2008-08-17.03:39:08.602>
    closer = 'facundobatista'
    components = ['Library (Lib)']
    creation = <Date 2008-03-23.14:41:09.498>
    creator = 'weijie90'
    dependencies = []
    files = ['11127', '11132', '11133', '11134']
    hgrepos = []
    issue_num = 2464
    keywords = ['patch']
    message_count = 18.0
    messages = ['64363', '64676', '64677', '64679', '64680', '66889', '71231', '71255', '71259', '71260', '71261', '71295', '71303', '71304', '71990', '72024', '76736', '76779']
    nosy_count = 5.0
    nosy_names = ['facundobatista', 'gregory.p.smith', 'jjlee', 'orsenthil', 'weijie90']
    pr_nums = []
    priority = 'low'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue2464'
    versions = ['Python 2.6', 'Python 2.5', 'Python 3.0']

    weijie90 (mannequin, author) commented Mar 23, 2008

    Try the following code:

    import urllib2
    gmail = urllib2.urlopen("https://www.gmail.com").read()
    wikispaces = urllib2.urlopen("http://www.wikispaces.com").read()
    Fetching the HTML over HTTPS from gmail.com works, but not over HTTP
    from wikispaces. Here's the traceback:
    >>> wikispaces = urllib2.urlopen("http://www.wikispaces.com").read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.5/urllib2.py", line 121, in urlopen
        return _opener.open(url, data)
      File "/usr/lib/python2.5/urllib2.py", line 380, in open
        response = meth(req, response)
      File "/usr/lib/python2.5/urllib2.py", line 491, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python2.5/urllib2.py", line 412, in error
        result = self._call_chain(*args)
      File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.5/urllib2.py", line 575, in http_error_302
        return self.parent.open(new)
      File "/usr/lib/python2.5/urllib2.py", line 380, in open
        response = meth(req, response)
      File "/usr/lib/python2.5/urllib2.py", line 491, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python2.5/urllib2.py", line 412, in error
        result = self._call_chain(*args)
      File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.5/urllib2.py", line 575, in http_error_302
        return self.parent.open(new)
      File "/usr/lib/python2.5/urllib2.py", line 374, in open
        response = self._open(req, data)
      File "/usr/lib/python2.5/urllib2.py", line 392, in _open
        '_open', req)
      File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.5/urllib2.py", line 1100, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "/usr/lib/python2.5/urllib2.py", line 1075, in do_open
        raise URLError(err)
    urllib2.URLError: <urlopen error (104, 'Connection reset by peer')>

    Note the two 302 redirects.
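
    A minimal sketch (Python 2, matching the traceback above) of how one
    could log the redirect chain urllib2 follows; the handler hook is
    standard urllib2 API, while LoggingRedirectHandler is an illustrative
    name:

        import urllib2

        class LoggingRedirectHandler(urllib2.HTTPRedirectHandler):
            # redirect_request() is called for each 301/302/303/307
            # response; newurl is the Location value being followed.
            def redirect_request(self, req, fp, code, msg, headers, newurl):
                print "%d redirect -> %r" % (code, newurl)
                return urllib2.HTTPRedirectHandler.redirect_request(
                    self, req, fp, code, msg, headers, newurl)

        opener = urllib2.build_opener(LoggingRedirectHandler())
        opener.open("http://www.wikispaces.com")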

    I tried accessing wikispaces.com with SSL turned off in Firefox
    2.0.0.12; that didn't work because SSL was required, perhaps somewhere
    in between the redirects that wikispaces uses.

    Why doesn't urllib2 handle the "hidden" SSL properly? (Not to be rude,
    but httplib2 works.)

    Thanks!
    WJ

    @weijie90 weijie90 mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Mar 23, 2008

    gpshead (member) commented Mar 29, 2008

    The problem does not appear to have anything to do with SSL. The
    problem is that the chain of HTTP requests goes:

    GET -> 302 -> 302 -> 301

    On the final 301, urllib2's internal state is messed up such that by
    the time it's in the http_error_302 method it believes the Location
    header contains the following:

    'http://www.wikispaces.com/\x00/?responseToken=481aec3249f429316459e01c00b7e522'

    The \x00 and everything after it should not be there, and is not there
    if you look at what is sent over the socket itself. The
    ?responseToken=xxx value is left over from the previous 302 response.
    No idea where the \x00 came from yet. I'm debugging...

    @gpshead gpshead self-assigned this Mar 29, 2008

    weijie90 (mannequin, author) commented Mar 29, 2008

    Please take your time, because this bug isn't critical. Thanks!

    gpshead (member) commented Mar 29, 2008

    Instrumenting the code and looking closer at the tcpdump, it's true:
    wikispaces.com is returning an invalid Location: header with a null
    byte in the middle of it.

    The "fix" on our end should be to handle such garbage from broken web
    servers more gracefully. Other clients seem to treat the null as an
    end-of-string or the end of that header.
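
    A minimal illustration, in plain Python, of treating the NUL as a
    terminator (and, alternatively, stripping it), using the corrupted
    value quoted in the previous comment:

        # Header value as urllib2 saw it, per the earlier comment.
        raw = 'http://www.wikispaces.com/\x00/?responseToken=481aec3249f429316459e01c00b7e522'

        truncated = raw.split('\x00', 1)[0]  # NUL as value terminator -> 'http://www.wikispaces.com/'
        stripped = raw.replace('\x00', '')   # or drop NUL bytes entirely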

    gpshead (member) commented Mar 29, 2008

    I'm not sure what the best solution for this is. If I truncate the
    header values at a \x00 character, it ends in an infinite redirect loop
    (which urllib2 detects and raises on). If I simply remove all \x00
    characters, the resulting URL is not accepted by wikispaces.com due to
    having an extra / in it.

    Verdict: wikispaces.com is broken.

    urllib2 could do better. wget and Firefox deal with it properly, but
    I'll leave deciding which patch to use up to someone who cares about
    handling broken sites.

    Patch to implement either behavior of dealing with nulls where they
    shouldn't be:

    Index: Lib/httplib.py
    ===================================================================

    --- Lib/httplib.py      (revision 62033)
    +++ Lib/httplib.py      (working copy)
    @@ -291,9 +291,18 @@
                     break
                 headerseen = self.isheader(line)
                 if headerseen:
    +                # Some bad web servers reply with headers with a \x00 null
    +                # embedded in the value.  Other http clients deal with
    +                # this by treating it as a value terminator, ignoring the
    +                # rest so we will too.  http://bugs.python.org/issue2464.
    +                if '\x00' in line:
    +                    line = line[:line.find('\x00')]
    +                    # if you want to just remove nulls instead use this:
    +                    #line = line.replace('\x00', '')
                     # It's a legal header line, save it.
                     hlist.append(line)
    -                self.addheader(headerseen, line[len(headerseen)+1:].strip())
    +                value = line[len(headerseen)+1:].strip()
    +                self.addheader(headerseen, value)
                     continue
                 else:
                     # It's not a header line; throw it back and stop here.

    @gpshead gpshead removed their assignment Mar 29, 2008

    orsenthil (member) commented:

    The issue is not just with the null character. If you observe now, the
    redirection chain is 302 -> 302 -> 200 and there is no null character.
    However, urllib2 is still unable to handle the multiple redirections
    properly (IIRC, there is a portion of code that handles multiple
    redirections and exits on an infinite loop):
    >>> url = "http://www.wikispaces.com"
    >>> opened = urllib.urlopen(url)
    >>> print opened.geturl()
    http://www.wikispaces.com?responseToken=344289da354a29c67d48928dbe72042a
    >>> print opened.read()
    <html>
    <head><title>400 Bad Request</title></head>
    <body bgcolor="white">
    <center><h1>400 Bad Request</h1></center>
    <hr><center>nginx/0.6.30</center>
    </body>
    </html>

    Needs a relook, IMO.

    @facundobatista facundobatista self-assigned this Jul 3, 2008

    facundobatista (member) commented:

    Senthil:

    Look at that URL that the server returned in the second redirect:

    http://www.wikispaces.com?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d

    See that the "?" appears without a path between the host and it.

    Check section 3.2.2 of RFC 2616; it says that an HTTP URL should be:

      http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

    So we should fix the URL that the server returned. Guess what: if we
    put in a "/" (as the RFC mandates), everything works fine.

    The patch I attach here does that. All tests pass ok.

    What do you think?
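
    A minimal sketch of the idea (illustrative only, not the committed
    patch; the helper name ensure_path is hypothetical), using the stdlib
    urlparse module:

        import urlparse

        def ensure_path(url):
            # RFC 2616 section 3.2.2: if abs_path is absent, it MUST be
            # given as "/" when the URL is used as a Request-URI.
            scheme, netloc, path, params, query, fragment = urlparse.urlparse(url)
            if not path:
                path = '/'
            return urlparse.urlunparse((scheme, netloc, path, params, query, fragment))

        print ensure_path('http://www.wikispaces.com?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d')
        # -> http://www.wikispaces.com/?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d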

    gpshead (member) commented Aug 17, 2008

    looks good to me.

    orsenthil (member) commented:

    Ah, that was a simple fix. :) I somehow overlooked the problem, even
    after being given so many hints at the web-sig.

    I have some comments on the patch, Facundo.

    1. I don't think it is a good idea to include that portion in the
      http_error_302 method. That makes the fix "very" specific to "this"
      issue only.
      Another point is that fixing broken URLs should not be done in
      urllib2; urlparse would be a better place.
      So I came up with an approach wherein urllib2 does unparse(parse) of
      the URL, and the parse methods fix the URL if it is broken. (See
      attached bpo-2464-PATCH1.diff.)

    But if we handle it in the urlparse methods, then we are much more
    susceptible to breaking RFC conformance and breaking a lot of tests,
    which is not a good idea.

    So, I introduced a fix_broken() method in urlparse and called it to
    solve the issue, using the same logic as yours
    (bpo-2464-py26-FINAL.diff). With a fix_broken() method in urlparse, we
    will have better control whenever we want to implement behavior that is
    RFC non-conforming but widely implemented by browsers and clients.
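
    A rough sketch of what such a urlparse-level helper could look like
    (the name fix_broken comes from the patch description above; the body
    here is an assumption, not the attached patch):

        import urlparse

        def fix_broken(url):
            # Assumed behavior: give http(s) URLs an explicit "/" path
            # when the path component is empty, per RFC 2616 section 3.2.2.
            parts = urlparse.urlparse(url)
            if parts[0] in ('http', 'https') and not parts[2]:
                parts = parts[:2] + ('/',) + parts[3:]
            return urlparse.urlunparse(parts)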

    All tests pass with bpo-2464-py26-FINAL.diff

    Comments, please?

    orsenthil (member) commented:

    Patch for py3k, but please test this before applying.

    facundobatista (member) commented:

    Senthil: I don't like that.

    Creating a public method called "fix_broken", introducing new
    behaviours now in beta, and actually not fixing the URL in every broken
    case (just the path when it's missing) is way too much for this fix.

    I committed the change I proposed. Maybe in the future we will have a
    "generic URL fixing" function; now is not the moment.

    gpshead (member) commented Aug 17, 2008

    I was pondering whether it should go in urlparse instead. If it did, I
    think it should be part of urlparse.urlunparse, to ensure that there is
    always a trailing slash after the host:port regardless of what the
    inputs are.

    Anyway, agreed, this fixes this specific bug. Should it be backported
    to release25-maint?

    facundobatista (member) commented:

    Maybe we can put it in urlunparse... do you all agree with these test cases?

    def test_alwayspath(self):
        u = urlparse.urlparse("http://netloc/path;params?query#fragment")
        self.assertEqual(urlparse.urlunparse(u),
                         "http://netloc/path;params?query#fragment")

        u = urlparse.urlparse("http://netloc?query#fragment")
        self.assertEqual(urlparse.urlunparse(u),
                         "http://netloc/?query#fragment")

        u = urlparse.urlparse("http://netloc#fragment")
        self.assertEqual(urlparse.urlunparse(u), "http://netloc/#fragment")

    Maybe we could backport this more general fix...

    gpshead (member) commented Aug 18, 2008

    That test case looks good to me for 2.6 and 3.0. Also add a note to the
    documentation, with a versionchanged 2.6, about urlunparse always
    ensuring there is a / between the netloc and the rest of the URL.

    I would not backport the more general urlunparse behavior change to 2.5.

    facundobatista (member) commented:

    Gregory... I tried to fill in the path in urlunparse, and other
    functions that use it started to fail.

    As we're so close to the final releases, I'll leave this as it is right
    now, which actually fixed the bug...

    orsenthil (member) commented:

    That was the reason for putting fix_broken in urlparse in my patch,
    Facundo. I had thought it should be handled in the urlparse module;
    if we make changes in urlparse.urlunparse/urlparse.urlparse themselves,
    then we are stepping into an area that will break a lot of tests.

    I am kind of +0 on the current fix in urllib2. Should we think/plan for
    something in urlparse, akin to fix_broken?

    jjlee (mannequin) commented Dec 2, 2008

    This fix was applied in the wrong place.

    URI path components, and HTTP URI path components in particular, *can*
    be empty. See RFC 3986. So the comment inserted into the code with the
    fix for this bug, which says "possibly malformed", is incorrect; it
    should instead just refer to section 3.2.2 of RFC 2616. Also, because
    3.2.2 says "If the abs_path is not present in the URL, it MUST be given
    as "/" when used as a Request-URI for a resource (section 5.1.2)", it
    seems clear that this transformation (add the slash if there's no path
    component) should always happen when retrieving URIs, regardless of
    where the URI came from -- not only for redirects.

    Note that RFC 2616 incorrectly claims to refer to the definition of
    abs_path from RFC 2396. The fact that it's incorrect is made clear in
    2616 itself, in section 3.2.3, when it says that abs_path can be empty.
    In any case, RFC 2396 is obsoleted by RFC 3986, which is clear on this
    issue, and reflects actual usage of URIs. URIs like http://python.org
    and http://python.org?spam=eggs have been in widespread use for a long
    time, and typing the latter URL into firefox (3.0.2) confirms that
    what's actually sent is "/?spam", whereas urllib2 still sends "?spam".
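
    A quick check of that point with the Python 2 urlparse module (URL
    taken from the paragraph above):

        import urlparse

        parts = urlparse.urlparse('http://python.org?spam=eggs')
        print repr(parts.path)  # '' -- an empty path is legal per RFC 3986
        # Per RFC 2616 section 5.1.2, the Request-URI sent on the wire must
        # then be '/?spam=eggs', which is what Firefox sends; urllib2
        # sends '?spam=eggs'.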

    No test was added with this fix, which makes it unnecessarily hard to
    work out what exactly the fix was supposed to fix. For the record, this
    is the sequence of redirections before the fix was applied (showing base
    URI + redirect URI reference --> redirect URI):

    'http://www.wikispaces.com' +
    'https://session.wikispaces.com/session/auth?authToken=token' -->
    'https://session.wikispaces.com/session/auth?authToken=token'
    'https://session.wikispaces.com/session/auth?authToken=token' +
    'http://www.wikispaces.com?responseToken=token' -->
    'http://www.wikispaces.com?responseToken=token'

    and after the fix was applied:

    'http://www.wikispaces.com' +
    'https://session.wikispaces.com/session/auth?authToken=token' -->
    'https://session.wikispaces.com/session/auth?authToken=token'
    'https://session.wikispaces.com/session/auth?authToken=token' +
    'http://www.wikispaces.com/?responseToken=token' -->
    'http://www.wikispaces.com/?responseToken=token'
    'http://www.wikispaces.com/?responseToken=token' +
    'http://www.wikispaces.com/' --> 'http://www.wikispaces.com/'

    jjlee (mannequin) commented Dec 2, 2008

    I've raised bpo-4493 about the issue I raised in my previous comment.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022