HTTP request-line parsing splits on Unicode whitespace #78154

tipabu · 2018-06-26T18:39:29Z

BPO	33973
Nosy	@vstinner, @ezio-melotti, @tipabu
PRs	bpo-33973: Only split request lines on b'\x20' #7932

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2018-06-26.18:39:29.435>
labels = ['type-bug', '3.8', '3.9', '3.7', 'library', 'expert-unicode']
title = 'HTTP request-line parsing splits on Unicode whitespace'
updated_at = <Date 2020-01-25.13:18:38.086>
user = 'https://github.com/tipabu'

bugs.python.org fields:

activity = <Date 2020-01-25.13:18:38.086>
actor = 'cheryl.sabella'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)', 'Unicode']
creation = <Date 2018-06-26.18:39:29.435>
creator = 'tburke'
dependencies = []
files = []
hgrepos = []
issue_num = 33973
keywords = ['patch']
message_count = 2.0
messages = ['320507', '320529']
nosy_count = 3.0
nosy_names = ['vstinner', 'ezio.melotti', 'tburke']
pr_nums = ['7932']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue33973'
versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

tipabu · 2018-06-26T18:39:29Z

This causes (admittedly, buggy) clients that would work with a Python 2 server to stop working when the server upgrades to Python 3. To demonstrate, run python2.7 -m SimpleHTTPServer 8027 in one terminal and curl -v http://127.0.0.1:8027/你好 in another -- curl reports

*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8027 (#0)
> GET /你好 HTTP/1.1
> Host: 127.0.0.1:8027
> User-Agent: curl/7.54.0
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 404 File not found
< Server: SimpleHTTP/0.6 Python/2.7.10
< Date: Tue, 26 Jun 2018 17:23:25 GMT
< Content-Type: text/html
< Connection: close
<
<head>
<title>Error response</title>
</head>
<body>
<h1>Error response</h1>
<p>Error code 404.
<p>Message: File not found.
<p>Error code explanation: 404 = Nothing matches the given URI.
</body>
* Closing connection 0

...while repeating the experiment with python3.6 -m http.server 8036 and curl -v http://127.0.0.1:8036/你好 gives

*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8036 (#0)
> GET /你好 HTTP/1.1
> Host: 127.0.0.1:8036
> User-Agent: curl/7.54.0
> Accept: */*
>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
        <title>Error response</title>
    </head>
    <body>
        <h1>Error response</h1>
        <p>Error code: 400</p>
        <p>Message: Bad request syntax ('GET /ä½\xa0å¥½ HTTP/1.1').</p>
        <p>Error code explanation: HTTPStatus.BAD_REQUEST - Bad request syntax or unsupported method.</p>
    </body>
</html>
* Connection #0 to host 127.0.0.1 left intact

Granted, a well-behaved client would have quoted the UTF-8 '你好' as '%E4%BD%A0%E5%A5%BD' (in which case everything would have behaved as expected), but RFC 7230 is pretty clear that the request-line should be SP-delimited. While it notes that "recipients MAY instead parse on whitespace-delimited word boundaries and, aside from the CRLF terminator, treat any form of whitespace as the SP separator", it goes on to say that "such whitespace includes one or more of the following octets: SP, HTAB, VT (%x0B), FF (%x0C), or bare CR" with no mention of characters like the (ISO-8859-1 encoded) non-breaking space that caused the 400 response.
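For illustration, the failure can be reproduced without a server. This is only a sketch: it assumes, as the error message above suggests, that the server decodes the request line as ISO-8859-1 and then calls split() with no arguments. The variable names are made up for this example.

```python
# UTF-8 bytes a client sends for the unquoted path "/你好".
raw = "GET /你好 HTTP/1.1".encode("utf-8")

# Assumption: the request line is decoded as ISO-8859-1, so the byte 0xA0
# (the last UTF-8 byte of "你") becomes U+00A0, NO-BREAK SPACE.
line = raw.decode("iso-8859-1")

words_unicode = line.split()    # splits on any Unicode whitespace
words_sp = line.split(" ")      # splits on SP (0x20) only

print(len(words_unicode), words_unicode)  # 4 words: not a valid request line
print(len(words_sp), words_sp)            # 3 words: method, target, version
```

With four words instead of three, the request line no longer looks like "method target version", which matches the 400 response shown above.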

FWIW, there was a similar unicode-separators-are-not-the-right-separators bug in header parsing a while back: https://bugs.python.org/issue22233

vstinner · 2018-06-27T00:42:06Z

isspace() is also true for another non-ASCII character: U+0085 (b'\x85').

>>> ''.join(chr(i) for i in range(256) if chr(i).isspace())
'\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0'
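As a rough cross-check, the set above can be compared against the whitespace octets that the RFC passage quoted earlier permits (SP, HTAB, VT, FF, bare CR). The names below are illustrative only.

```python
# Whitespace octets the RFC text quoted earlier allows as lenient separators.
rfc_whitespace = set(" \t\x0b\x0c\r")

# Everything in the Latin-1 range that str.isspace() (and so str.split())
# treats as whitespace.
str_whitespace = {chr(i) for i in range(256) if chr(i).isspace()}

extra = sorted(str_whitespace - rfc_whitespace)
print(extra)
# ['\n', '\x1c', '\x1d', '\x1e', '\x1f', '\x85', '\xa0']
```

Apart from LF, which belongs to the line terminator, the extras are the C0 separators 0x1C to 0x1F plus the two non-ASCII characters U+0085 and U+00A0.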

tipabu · 2024-01-25T21:13:58Z

Note that this is still an issue; tested anew on 3.10.12, 3.11.7, 3.12.1, and a 3.13 alpha.

encukou · 2024-03-12T14:46:36Z

I'm sorry for only getting to this PR now.
IMO, there are three issues to consider:

Compatibility with the specification
Compatibility with older versions of Python -- 2.x and 3.x

Since this is about illegal (unquoted) requests, the first point doesn't come into play much, except that with this PR we lose support for the alternate ASCII whitespace (HTAB, VT (%x0B), FF (%x0C), or bare CR) that is mentioned in RFC 9112.

Regarding compatibility with earlier 3.x versions: turning split() into split(' ') means

runs of more than one whitespace are no longer collapsed
leading/trailing whitespace is no longer ignored

Both could break software that relies on those non-standard quirks.
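A small sketch of those two differences, using a made-up request line with a doubled space and a trailing space:

```python
line = "GET  /index.html HTTP/1.1 "   # two spaces after GET, one trailing space

lenient = line.split()      # collapses runs, ignores leading/trailing whitespace
strict = line.split(" ")    # every single SP is a separator

print(lenient)  # ['GET', '/index.html', 'HTTP/1.1']
print(strict)   # ['GET', '', '/index.html', 'HTTP/1.1', '']
```

Code that checks for exactly three words would accept the first result and reject the second.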

That leaves compatibility with Python 2. However, in 2024, this point has lost its importance. (Again, sorry for getting to this late.)

I would prefer closing this issue and PR. With Python 2 gone, this would be trading one set of non-standard semantics for another.

Or perhaps we could heed the RFC's warning about lenient parsing, and -- with a deprecation period -- make the parsing strict: reject all unexpected characters, including all non-ASCII ones, misplaced spaces, etc.
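To make the strict option concrete, here is a minimal sketch of what such a check might look like. It is not the stdlib implementation and not what the PR does: parse_request_line_strict is a hypothetical helper, and the pattern for the request target (any run of visible ASCII characters) is a deliberate simplification of the RFC grammar.

```python
import re

# HTTP "token" characters for the method name.
_TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"

# method SP request-target SP "HTTP/" DIGIT "." DIGIT
# Assumption: the target is approximated as visible ASCII (0x21-0x7E).
_REQUEST_LINE = re.compile(
    rf"({_TOKEN}) ([\x21-\x7e]+) HTTP/([0-9])\.([0-9])"
)


def parse_request_line_strict(line):
    """Hypothetical strict parser: exactly one SP between the three parts."""
    match = _REQUEST_LINE.fullmatch(line)
    if match is None:
        raise ValueError(f"bad request line: {line!r}")
    method, target, major, minor = match.groups()
    return method, target, (int(major), int(minor))


print(parse_request_line_strict("GET /%E4%BD%A0%E5%A5%BD HTTP/1.1"))
# ('GET', '/%E4%BD%A0%E5%A5%BD', (1, 1))
```

Under this sketch, non-ASCII characters, doubled spaces, and leading or trailing whitespace are all rejected. Two-word HTTP/0.9-style request lines would be rejected as well, which a real change would need to consider.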

encukou · 2024-04-15T09:54:06Z

I plan to close the issue and PR in a month if there are no objections.

tipabu · 2024-04-26T18:48:25Z

We can go ahead and close it. Even if I had the bandwidth to rework the PR to tighten parsing rules and follow through on a deprecate-remove cycle, the software I was working on that caused me to discover the issue still wouldn't be able to make use of the change. At some point, it probably needs to just stop relying on stdlib for HTTP parsing altogether.

At least this bug doesn't have any (obvious) security implications; if you have a chance, I'd appreciate some thoughts around #81274 / #13788

tipabu mannequin added the 3.7 (EOL), 3.8, stdlib, topic-unicode, and type-bug labels Jun 26, 2018

csabella added the 3.9 label Jan 25, 2020

ezio-melotti transferred this issue from another repository Apr 10, 2022

encukou closed this as completed May 6, 2024

terryjreedy closed this as not planned May 6, 2024
