HTTP request-line parsing splits on Unicode whitespace #78154
This causes (admittedly, buggy) clients that would work with a Python 2 server to stop working when the server upgrades to Python 3. To demonstrate, run
...while repeating the experiment with
Granted, a well-behaved client would have quoted the UTF-8 '你好' as '%E4%BD%A0%E5%A5%BD' (in which case everything would have behaved as expected), but RFC 7230 is pretty clear that the request-line should be SP-delimited. While it notes that "recipients MAY instead parse on whitespace-delimited word boundaries and, aside from the CRLF terminator, treat any form of whitespace as the SP separator", it goes on to say that "such whitespace includes one or more of the following octets: SP, HTAB, VT (%x0B), FF (%x0C), or bare CR" with no mention of characters like the (ISO-8859-1 encoded) non-breaking space that caused the 400 response. FWIW, there was a similar unicode-separators-are-not-the-right-separators bug in header parsing a while back: https://bugs.python.org/issue22233
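For anyone who wants to see the splitting difference in isolation, here is a minimal sketch. It does not start a server; it only assumes the request line is decoded as ISO-8859-1 and then split with an argument-less `str.split()`, which is the behaviour this report describes. The variable names are illustrative only.

```python
# Raw request line as a buggy client might send it: unquoted UTF-8 in the path.
raw = "GET /你好 HTTP/1.1".encode("utf-8")

# Assumption: the server decodes the bytes as ISO-8859-1 before splitting.
# The UTF-8 byte 0xA0 (part of '你') then becomes U+00A0, a non-breaking space.
line = raw.decode("iso-8859-1")

# Argument-less str.split() treats every Unicode whitespace character as a
# separator, so the path is cut in two at the non-breaking space.
unicode_words = line.split()

# Splitting on SP only keeps the three parts the RFC describes.
sp_words = line.split(" ")

# bytes.split() without arguments only knows ASCII whitespace.
byte_words = raw.split()

print(len(unicode_words), len(sp_words), len(byte_words))
```

With four words instead of three, a parser that expects exactly method, target, and version has no valid way to interpret the line, which is consistent with the 400 response mentioned above.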
isspace() is also true for another non-ASCII character: U+0085 (b'\x85').
>>> ''.join(chr(i) for i in range(256) if chr(i).isspace())
'\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0'
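To make the mismatch explicit, the set above can be compared against the octets the RFC allows as lenient separators. This is just a sketch; `RFC_WHITESPACE` is a name made up here to hold the five octets quoted from the RFC earlier in this thread.

```python
# Octets the RFC lists as acceptable lenient separators:
# SP, HTAB, VT (%x0B), FF (%x0C), and bare CR.
RFC_WHITESPACE = {" ", "\t", "\x0b", "\x0c", "\r"}

# Everything in the ISO-8859-1 range that str.isspace() accepts.
python_ws = {chr(i) for i in range(256) if chr(i).isspace()}

# Characters str.split() would split on that the RFC does not mention.
extra = sorted(python_ws - RFC_WHITESPACE)
print([hex(ord(c)) for c in extra])
```

Besides LF (which belongs to the line terminator), the extras are the four ASCII separator controls 0x1C to 0x1F plus the two non-ASCII characters U+0085 and U+00A0.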
Note that this is still an issue; tested anew on 3.10.12, 3.11.7, 3.12.1, and a 3.13 alpha.
I'm sorry for only getting to this PR now.
Since this is about illegal (unquoted) requests, the first point doesn't come into play much, except that with this PR, we lose support for the alternate ASCII whitespace (HTAB, VT (%x0B), FF (%x0C), or bare CR) that is mentioned in RFC 9112. Regarding compatibility with earlier 3.x versions: turning
Both could break software that relies on those non-standard quirks. That leaves compatibility with Python 2. However, in 2024, this point has lost its importance. (Again, sorry for getting to this late.) I would prefer closing this issue and PR. With Python 2 gone, this would be trading one set of non-standard semantics for another. Or perhaps we could heed the RFC's warning about lenient parsing, and -- with a deprecation period -- make the parsing strict: reject all unexpected characters, including all non-ASCII ones, misplaced spaces, etc.
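As a rough idea of what the strict option could look like, here is a sketch only. It is not the stdlib implementation and not what the PR does; `parse_request_line_strict` is a hypothetical helper, and the request-target check is deliberately simplified to "one or more visible ASCII characters" rather than the full RFC grammar.

```python
import re

# Hypothetical strict pattern: method token, single SP, request-target made of
# visible ASCII only (a simplification), single SP, HTTP-version.
_REQUEST_LINE = re.compile(
    rb"([!#$%&'*+\-.^_`|~0-9A-Za-z]+) ([\x21-\x7e]+) (HTTP/[0-9]\.[0-9])"
)


def parse_request_line_strict(raw: bytes):
    """Return (method, target, version), or None if the line is rejected."""
    # Require the CRLF terminator; a bare LF or missing terminator is rejected.
    if not raw.endswith(b"\r\n"):
        return None
    match = _REQUEST_LINE.fullmatch(raw[:-2])
    if match is None:
        # Non-ASCII bytes, tabs, doubled spaces, etc. all end up here.
        return None
    return tuple(part.decode("ascii") for part in match.groups())


print(parse_request_line_strict(b"GET /%E4%BD%A0%E5%A5%BD HTTP/1.1\r\n"))
print(parse_request_line_strict("GET /你好 HTTP/1.1\r\n".encode("utf-8")))
```

Under these assumptions the properly quoted request is accepted and the unquoted one is rejected outright, rather than being split in a way that depends on which Unicode characters happen to count as whitespace.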
I plan to close the issue and PR in a month if there are no objections.
We can go ahead and close it. Even if I had the bandwidth to rework the PR to tighten parsing rules and follow through on a deprecate-remove cycle, the software I was working on that caused me to discover the issue still wouldn't be able to make use of the change. At some point, it probably needs to just stop relying on stdlib for HTTP parsing altogether. At least this bug doesn't have any (obvious) security implications; if you have a chance, I'd appreciate some thoughts around #81274 / #13788