[CVE-2021-23336] urllib.parse.parse_qsl(): Web cache poisoning - `;` as a query args separator
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
assignee = 'https://github.com/orsenthil' closed_at = <Date 2021-02-15.19:34:55.754> created_at = <Date 2021-01-19.15:06:49.941> labels = ['type-security', '3.8', '3.9', '3.10', 'release-blocker', '3.7', 'library'] title = '[CVE-2021-23336] urllib.parse.parse_qsl(): Web cache poisoning - `; ` as a query args separator' updated_at = <Date 2021-11-08.16:47:04.726> user = 'https://github.com/AdamGold'
activity = <Date 2021-11-08.16:47:04.726> actor = 'vstinner' assignee = 'orsenthil' closed = True closed_date = <Date 2021-02-15.19:34:55.754> closer = 'orsenthil' components = ['Library (Lib)'] creation = <Date 2021-01-19.15:06:49.941> creator = 'AdamGold' dependencies =  files = ['49839'] hgrepos =  issue_num = 42967 keywords = ['patch'] message_count = 57.0 messages = ['385266', '385332', '385337', '385341', '385342', '385344', '385346', '385352', '385495', '385496', '385497', '385513', '385527', '385544', '385549', '385565', '385566', '385567', '385582', '385585', '385590', '385865', '386003', '386785', '386787', '386788', '386954', '386957', '386960', '386968', '386980', '387027', '387037', '387039', '387040', '387045', '387049', '387069', '387638', '387712', '387735', '387756', '388368', '388433', '388434', '388440', '388447', '388486', '388574', '390782', '390784', '390790', '391231', '405721', '405723', '405725', '405728'] nosy_count = 15.0 nosy_names = ['lemburg', 'gregory.p.smith', 'orsenthil', 'ned.deily', 'mcepl', 'eric.araujo', 'petr.viktorin', 'lukasz.langa', 'serhiy.storchaka', 'pablogsal', 'miss-islington', 'rschiron', 'erlendaasland', 'kj', 'AdamGold'] pr_nums = ['24271', '24297', '24528', '24529', '24531', '24532', '24536', '24818', '25344', '25345'] priority = 'release blocker' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'security' url = 'https://bugs.python.org/issue42967' versions = ['Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9', 'Python 3.10']
The urlparse module treats semicolon as a separator (https://github.com/python/cpython/blob/master/Lib/urllib/parse.py#L739) - whereas most proxies today only take ampersands as separators. Link to a blog post explaining this vulnerability: https://snyk.io/blog/cache-poisoning-in-popular-open-source-packages/
When the attacker can separate query parameters using a semicolon (;), they can cause a difference in the interpretation of the request between the proxy (running with default configuration) and the server. This can result in malicious requests being cached as completely safe ones, as the proxy would usually not see the semicolon as a separator, and therefore would not include it in a cache key of an unkeyed parameter - such as
urlparse sees 3 parameters here:
A possible solution could be to allow developers to specify a separator, like werkzeug does:
Oops, I missed this issue. I just marked my bpo-42975 issue as a duplicate of this one.
urllib.parse.parse_qsl() uses "&" *and* ";" as separators:
>>> urllib.parse.parse_qsl("a=1&b=2&c=3")
[('a', '1'), ('b', '2'), ('c', '3')]
>>> urllib.parse.parse_qsl("a=1&b=2;c=3")
[('a', '1'), ('b', '2'), ('c', '3')]
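For readers on a release that already carries the fix, the second call above no longer returns three pairs. The sketch below approximates both behaviours side by side. `split_query` is a hypothetical helper written for this illustration, not the stdlib implementation, and it ignores `keep_blank_values` and `strict_parsing` handling.

```python
from urllib.parse import unquote_plus


def split_query(qs, separators):
    """Split qs on every character in `separators` and decode each pair.

    Hypothetical helper for illustration only; it is not how the
    standard library implements parse_qsl().
    """
    pieces = [qs]
    for sep in separators:
        pieces = [part for piece in pieces for part in piece.split(sep)]
    pairs = []
    for piece in pieces:
        if not piece:
            continue  # skip empty segments such as the one in "a=1&&b=2"
        name, _, value = piece.partition("=")
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs


# Approximation of the behaviour reported in this issue: "&" and ";" both split.
legacy = split_query("a=1&b=2;c=3", "&;")

# Ampersand-only splitting, as a typical proxy would do.
strict = split_query("a=1&b=2;c=3", "&")

print(legacy)  # [('a', '1'), ('b', '2'), ('c', '3')]
print(strict)  # [('a', '1'), ('b', '2;c=3')]
```

The two results disagree on how many parameters the request carries, which is the mismatch this report is about.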
But the W3C standards evolved and now suggest against considering semicolon (";") as a separator:
"This form data set encoding is in many ways an aberrant monstrosity, the result of many years of implementation accidents and compromises leading to a set of requirements necessary for interoperability, but in no way representing good design practices. In particular, readers are cautioned to pay close attention to the twisted details involving repeated (and in some cases nested) conversions between character encodings and byte sequences."
"To decode application/x-www-form-urlencoded payloads (...) Let strings be the result of strictly splitting the string payload on U+0026 AMPERSAND characters (&)."
Maybe we should even go further in Python 3.10 and only split at "&" by default, but let the caller opt in to the ";" separator as well.
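For reference, this is roughly what the opt-in looks like with the `separator` argument discussed later in this thread. The snippet assumes an interpreter that includes the fix, where `parse_qsl()` accepts a `separator` keyword and defaults to "&"; on older releases the keyword does not exist.

```python
from urllib.parse import parse_qsl

# Default behaviour on a patched release: only "&" splits parameters,
# so the ";" stays inside the value of "b".
default_pairs = parse_qsl("a=1&b=2;c=3")

# Callers that still need semicolon-separated queries opt in explicitly.
semicolon_pairs = parse_qsl("a=1;b=2;c=3", separator=";")

print(default_pairs)    # [('a', '1'), ('b', '2;c=3')]
print(semicolon_pairs)  # [('a', '1'), ('b', '2'), ('c', '3')]
```

Note that the separator is a single string, so one call splits on either "&" or ";", not on both at once.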
On 20.01.2021 12:07, STINNER Victor wrote:
Personally, I've never seen URLs encoded with ";" as query parameter
The use of ";" was recommended in the HTML4 spec, but only in an
and not in the main reference:
Browsers are also pretty relaxed about seeing non-escaped ampersands in
It looks to me that this is an issue with proxies, not Python. Python's implementation obeys contemporary standards, and they are not formally cancelled yet. If we add an option to parse_qsl() or change its default behavior, it should be considered a new feature which helps to mitigate proxies' issues.
FWIW, a surprising amount of things rely on treating ';' as a valid separator in the standard test suite.
From just a cursory look:
A change in the public API of urlparse will also require a change in cgi.py's FieldStorage, FieldStorage.read_multi, parse and parse_multipart to expose that parameter since those functions forward arguments directly to urllib.parse.parse_qs internally.
If we backport this, it seems that we will *also* need to backport all those changes to cgi's public API. Otherwise, just backporting the security fix part without allowing the user to switch would break existing code.
Just my 2 cents on the issue. I'm not too familiar with security fixes in cpython anyways ;).
Did you upstream fixes for those packages?
Asking because if this is considered a vulnerability in Python, it should be considered a vulnerability for every other tool/library that accept
Again, I feel like we are blaming the wrong piece of the stack, unless proxies are usually ignoring some arguments (e.g. utm_*) as part of the cache key, by default or in a very easy way.
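The scenario mentioned here, a proxy that leaves some arguments out of the cache key, can be sketched as follows. Everything in this snippet is hypothetical: `proxy_cache_key`, `backend_params`, the `utm_` prefix rule and the example query are assumptions made for illustration and do not describe any particular proxy or framework.

```python
def split_pairs(qs, separators):
    """Split qs on each character in `separators` into (name, value) tuples."""
    pieces = [qs]
    for sep in separators:
        pieces = [part for piece in pieces for part in piece.split(sep)]
    # partition() returns (name, "=", value); [::2] keeps name and value.
    return [tuple(piece.partition("=")[::2]) for piece in pieces if piece]


# Hypothetical proxy configuration: parameters with these prefixes are unkeyed.
UNKEYED_PREFIXES = ("utm_",)


def proxy_cache_key(path, qs):
    """Cache key of a hypothetical proxy that splits on '&' only."""
    keyed = [
        (name, value)
        for name, value in split_pairs(qs, "&")
        if not name.startswith(UNKEYED_PREFIXES)
    ]
    return path, tuple(sorted(keyed))


def backend_params(qs):
    """Hypothetical backend that splits on '&' and ';'; the last value wins."""
    return dict(split_pairs(qs, "&;"))


benign = "q=shoes"
crafted = "q=shoes&utm_content=x;q=evil"

# The proxy files both requests under the same cache entry...
same_key = proxy_cache_key("/search", benign) == proxy_cache_key("/search", crafted)

# ...while the backend answers the crafted one for a different "q".
backend_q = backend_params(crafted)["q"]

print(same_key)   # True
print(backend_q)  # evil
```

Whether a real deployment is exposed depends on the proxy actually dropping parameters from the key and on the backend letting a repeated parameter override an earlier one, which is the point being debated in this comment.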
Riccardo - FWIW I agree, the wrong part of the stack was blamed and a CVE was wrongly sought for against CPython on this one.
It's sewage under the bridge at this point. The API change has shipped in several different stable releases and thus is something virtually all Python code must now deal with.
Why was this a bad change to make? Python's parse_qsl obeyed the prevailing HTML 4 standard at the time it was written:
That turns out to have been bad advice in the standard. 15 years later the html5 standard quoted in Adam's snyk blog post links to its text on this which leaves no room for that interpretation.
In that light, the correct thing to do for this issue would be to:
After all, the existence of html5 didn't magically fix all of the html and web applications written in the two decades of web that came before it. Ask any browser author...
No, this was not intentional. The separator arg was just a choice, for compatibility, if some wanted to use