HTTP request-line parsing splits on Unicode whitespace #78154

Closed
tipabu mannequin opened this issue Jun 26, 2018 · 6 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@tipabu
Mannequin

tipabu mannequin commented Jun 26, 2018

BPO 33973
Nosy @vstinner, @ezio-melotti, @tipabu
PRs
  • bpo-33973: Only split request lines on b'\x20' #7932
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2018-06-26.18:39:29.435>
    labels = ['type-bug', '3.8', '3.9', '3.7', 'library', 'expert-unicode']
    title = 'HTTP request-line parsing splits on Unicode whitespace'
    updated_at = <Date 2020-01-25.13:18:38.086>
    user = 'https://github.com/tipabu'

    bugs.python.org fields:

    activity = <Date 2020-01-25.13:18:38.086>
    actor = 'cheryl.sabella'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)', 'Unicode']
    creation = <Date 2018-06-26.18:39:29.435>
    creator = 'tburke'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 33973
    keywords = ['patch']
    message_count = 2.0
    messages = ['320507', '320529']
    nosy_count = 3.0
    nosy_names = ['vstinner', 'ezio.melotti', 'tburke']
    pr_nums = ['7932']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue33973'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @tipabu
    Mannequin Author

    tipabu mannequin commented Jun 26, 2018

    This causes (admittedly, buggy) clients that would work with a Python 2 server to stop working when the server upgrades to Python 3. To demonstrate, run python2.7 -m SimpleHTTPServer 8027 in one terminal and curl -v http://127.0.0.1:8027/你好 in another -- curl reports

    *   Trying 127.0.0.1...
    * TCP_NODELAY set
    * Connected to 127.0.0.1 (127.0.0.1) port 8027 (#0)
    > GET /你好 HTTP/1.1
    > Host: 127.0.0.1:8027
    > User-Agent: curl/7.54.0
    > Accept: */*
    >
    * HTTP 1.0, assume close after body
    < HTTP/1.0 404 File not found
    < Server: SimpleHTTP/0.6 Python/2.7.10
    < Date: Tue, 26 Jun 2018 17:23:25 GMT
    < Content-Type: text/html
    < Connection: close
    <
    <head>
    <title>Error response</title>
    </head>
    <body>
    <h1>Error response</h1>
    <p>Error code 404.
    <p>Message: File not found.
    <p>Error code explanation: 404 = Nothing matches the given URI.
    </body>
    * Closing connection 0
    

    ...while repeating the experiment with python3.6 -m http.server 8036 and curl -v http://127.0.0.1:8036/你好 gives

    *   Trying 127.0.0.1...
    * TCP_NODELAY set
    * Connected to 127.0.0.1 (127.0.0.1) port 8036 (#0)
    > GET /你好 HTTP/1.1
    > Host: 127.0.0.1:8036
    > User-Agent: curl/7.54.0
    > Accept: */*
    >
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
            "http://www.w3.org/TR/html4/strict.dtd">
    <html>
        <head>
            <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
            <title>Error response</title>
        </head>
        <body>
            <h1>Error response</h1>
            <p>Error code: 400</p>
            <p>Message: Bad request syntax ('GET /ä½\xa0å¥½ HTTP/1.1').</p>
            <p>Error code explanation: HTTPStatus.BAD_REQUEST - Bad request syntax or unsupported method.</p>
        </body>
    </html>
    * Connection #0 to host 127.0.0.1 left intact
    

    Granted, a well-behaved client would have quoted the UTF-8 '你好' as '%E4%BD%A0%E5%A5%BD' (in which case everything would have behaved as expected), but RFC 7230 is pretty clear that the request-line should be SP-delimited. While it notes that "recipients MAY instead parse on whitespace-delimited word boundaries and, aside from the CRLF terminator, treat any form of whitespace as the SP separator", it goes on to say that "such whitespace includes one or more of the following octets: SP, HTAB, VT (%x0B), FF (%x0C), or bare CR" with no mention of characters like the (ISO-8859-1 encoded) non-breaking space that caused the 400 response.
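
The failure described above can be reproduced without a server. This is a minimal sketch, assuming the server decodes the request line as ISO-8859-1 and then calls `split()` with no argument; the variable names are illustrative only.

```python
# The raw bytes curl sends for an unquoted UTF-8 path.
raw = "GET /你好 HTTP/1.1".encode("utf-8")

# Assumption: the request line is decoded as ISO-8859-1 before it is split.
line = raw.decode("iso-8859-1")

# str.split() with no argument splits on any Unicode whitespace, including
# U+00A0, which is what the third byte of UTF-8 "你" (0xA0) decodes to.
words_any_ws = line.split()

# Splitting on SP only keeps the request-target in one piece.
words_sp_only = line.split(" ")

print(len(words_any_ws), words_any_ws)
print(len(words_sp_only), words_sp_only)
```

With whitespace splitting the line has four words instead of three, which is consistent with the "Bad request syntax" response shown above.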

    FWIW, there was a similar unicode-separators-are-not-the-right-separators bug in header parsing a while back: https://bugs.python.org/issue22233

    @tipabu tipabu mannequin added 3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error labels Jun 26, 2018
    @vstinner
    Member

    isspace() is also true for another non-ASCII character: U+0085 (b'\x85').

    >>> ''.join(chr(i) for i in range(256) if chr(i).isspace())
    '\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0'
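
As a rough companion to the snippet above, the following sketch compares which values below 256 count as whitespace for `bytes` and for `str`. It only uses `bytes.isspace()` and `str.isspace()`; the set names are made up for illustration.

```python
# Byte values that bytes.isspace() treats as whitespace (ASCII only).
bytes_ws = {i for i in range(256) if bytes([i]).isspace()}

# Code points below 256 that str.isspace() treats as whitespace.
str_ws = {i for i in range(256) if chr(i).isspace()}

# Values that become whitespace only once the line is decoded to str.
extra = sorted(str_ws - bytes_ws)
print([hex(i) for i in extra])
```

Besides U+0085 and U+00A0, the difference also contains the control characters 0x1C to 0x1F, none of which appear in the RFC's list of lenient separators.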

    @csabella csabella added the 3.9 only security fixes label Jan 25, 2020
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @tipabu
    Contributor

    tipabu commented Jan 25, 2024

    Note that this is still an issue; tested anew on 3.10.12, 3.11.7, 3.12.1, and a 3.13 alpha.

    @encukou
    Member

    encukou commented Mar 12, 2024

    I'm sorry for only getting to this PR now.
    IMO, there are three issues to consider:

    • Compatibility with the specification
    • Compatibility with earlier 3.x versions
    • Compatibility with Python 2

    Since this is about illegal (unquoted) requests, the first point doesn't come into play much, except that with this PR, we lose support for the alternate ASCII whitespace (HTAB, VT (%x0B), FF (%x0C), or bare CR) that is mentioned in RFC 9112.
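
For illustration, the lenient whitespace set the RFC lists could be applied at the byte level, before any decoding. This is only a sketch of that idea, not what the PR does; `RFC_LENIENT_WS` and `split_lenient` are hypothetical names, and the line is assumed to have its CRLF terminator already removed.

```python
import re

# The octets the RFC lists for lenient parsing: SP, HTAB, VT, FF, bare CR.
RFC_LENIENT_WS = re.compile(rb"[ \t\x0b\x0c\r]+")

def split_lenient(raw_line: bytes) -> list[bytes]:
    """Hypothetical helper: split a request line on the RFC's lenient set."""
    return RFC_LENIENT_WS.split(raw_line)

print(split_lenient(b"GET\t/index.html\x0bHTTP/1.1"))
print(split_lenient("GET /你好 HTTP/1.1".encode("utf-8")))
```

Because the split happens on bytes, the 0xA0 octet inside the UTF-8 path is not a separator. Note that this sketch does not strip leading or trailing whitespace, so such input would yield empty first or last elements.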

    Regarding compatibility with earlier 3.x versions: turning split() into split(' ') means

    • runs of more than one whitespace are no longer collapsed
    • leading/trailing whitespace is no longer ignored

    Both could break software that relies on those non-standard quirks.
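
The two differences listed above are easy to see on a request line with extra spaces. A small sketch, using a made-up input:

```python
# A request line with a leading space, a doubled space, and a trailing space.
line = " GET  /index.html HTTP/1.1 "

# No argument: runs of whitespace collapse and the ends are ignored.
collapsed = line.split()

# Explicit separator: every single space is a boundary, so empty strings
# appear for the leading, doubled, and trailing spaces.
literal = line.split(" ")

print(collapsed)
print(literal)
```

A server that checks for exactly three words would accept the first result and reject the second.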

    That leaves compatibility with Python 2. However, in 2024, this point has lost its importance. (Again, sorry for getting to this late.)

    I would prefer closing this issue and PR. With Python 2 gone, this would be trading one set of non-standard semantics for another.

    Or perhaps we could heed the RFC's warning about lenient parsing, and -- with a deprecation period -- make the parsing strict: reject all unexpected characters, including all non-ASCII ones, misplaced spaces, etc.
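
To make the strict option more concrete, here is one possible shape for it. This is a sketch under stated assumptions, not a proposal for the actual stdlib change: `STRICT_REQUEST_LINE` and `parse_strict` are hypothetical names, the pattern is a simplification of the RFC grammar, and it ignores the two-word HTTP/0.9 form.

```python
import re

# Assumed simplification: a token-like method, a target made of printable
# ASCII without spaces, and an HTTP/x.y version, each separated by one SP.
STRICT_REQUEST_LINE = re.compile(
    rb"([!#$%&'*+\-.^_`|~0-9A-Za-z]+) ([\x21-\x7e]+) (HTTP/[0-9]\.[0-9])"
)

def parse_strict(raw_line: bytes):
    """Hypothetical helper: return (method, target, version) or None."""
    match = STRICT_REQUEST_LINE.fullmatch(raw_line)
    return match.groups() if match else None

# A properly quoted target is accepted.
print(parse_strict(b"GET /%E4%BD%A0%E5%A5%BD HTTP/1.1"))
# Raw non-ASCII bytes and misplaced spaces are rejected.
print(parse_strict("GET /你好 HTTP/1.1".encode("utf-8")))
print(parse_strict(b"GET  / HTTP/1.1"))
```

Matching on bytes before decoding means no Unicode whitespace rules are involved at all; anything outside the expected ASCII ranges simply fails to match.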

    @encukou
    Member

    encukou commented Apr 15, 2024

    I plan to close the issue and PR in a month if there are no objections.

    @tipabu
    Contributor

    tipabu commented Apr 26, 2024

    We can go ahead and close it. Even if I had the bandwidth to rework the PR to tighten parsing rules and follow through on a deprecate-remove cycle, the software I was working on that caused me to discover the issue still wouldn't be able to make use of the change. At some point, it probably needs to just stop relying on stdlib for HTTP parsing altogether.

    At least this bug doesn't have any (obvious) security implications; if you have a chance, I'd appreciate some thoughts around #81274 / #13788

    @encukou encukou closed this as completed May 6, 2024
    @terryjreedy terryjreedy closed this as not planned Won't fix, can't repro, duplicate, stale May 6, 2024