Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CVE-2021-23336] urllib.parse.parse_qsl(): Web cache poisoning - ; as a query args separator #87133

Closed
AdamGold mannequin opened this issue Jan 19, 2021 · 57 comments
Closed
Assignees
Labels
3.7 3.8 3.9 3.10 release-blocker stdlib type-security

Comments

@AdamGold
Copy link
Mannequin

@AdamGold AdamGold mannequin commented Jan 19, 2021

BPO 42967
Nosy @malemburg, @gpshead, @orsenthil, @ned-deily, @mcepl, @merwok, @encukou, @ambv, @serhiy-storchaka, @pablogsal, @miss-islington, @ret2libc, @erlend-aasland, @Fidget-Spinner, @AdamGold
PRs
  • #24271
  • #24297
  • #24528
  • #24529
  • #24531
  • #24532
  • #24536
  • #24818
  • #25344
  • #25345
  • Files
  • CVE-2021-23336-only-amp-as-query-sep.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = <Date 2021-02-15.19:34:55.754>
    created_at = <Date 2021-01-19.15:06:49.941>
    labels = ['type-security', '3.8', '3.9', '3.10', 'release-blocker', '3.7', 'library']
    title = '[CVE-2021-23336] urllib.parse.parse_qsl(): Web cache poisoning - `; ` as a query args separator'
    updated_at = <Date 2021-11-08.16:47:04.726>
    user = 'https://github.com/AdamGold'

    bugs.python.org fields:

    activity = <Date 2021-11-08.16:47:04.726>
    actor = 'vstinner'
    assignee = 'orsenthil'
    closed = True
    closed_date = <Date 2021-02-15.19:34:55.754>
    closer = 'orsenthil'
    components = ['Library (Lib)']
    creation = <Date 2021-01-19.15:06:49.941>
    creator = 'AdamGold'
    dependencies = []
    files = ['49839']
    hgrepos = []
    issue_num = 42967
    keywords = ['patch']
    message_count = 57.0
    messages = ['385266', '385332', '385337', '385341', '385342', '385344', '385346', '385352', '385495', '385496', '385497', '385513', '385527', '385544', '385549', '385565', '385566', '385567', '385582', '385585', '385590', '385865', '386003', '386785', '386787', '386788', '386954', '386957', '386960', '386968', '386980', '387027', '387037', '387039', '387040', '387045', '387049', '387069', '387638', '387712', '387735', '387756', '388368', '388433', '388434', '388440', '388447', '388486', '388574', '390782', '390784', '390790', '391231', '405721', '405723', '405725', '405728']
    nosy_count = 15.0
    nosy_names = ['lemburg', 'gregory.p.smith', 'orsenthil', 'ned.deily', 'mcepl', 'eric.araujo', 'petr.viktorin', 'lukasz.langa', 'serhiy.storchaka', 'pablogsal', 'miss-islington', 'rschiron', 'erlendaasland', 'kj', 'AdamGold']
    pr_nums = ['24271', '24297', '24528', '24529', '24531', '24532', '24536', '24818', '25344', '25345']
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'security'
    url = 'https://bugs.python.org/issue42967'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9', 'Python 3.10']

    @AdamGold
    Copy link
    Mannequin Author

    @AdamGold AdamGold mannequin commented Jan 19, 2021

    The urlparse module treats semicolon as a separator (https://github.com/python/cpython/blob/master/Lib/urllib/parse.py#L739) - whereas most proxies today only take ampersands as separators. Link to a blog post explaining this vulnerability: https://snyk.io/blog/cache-poisoning-in-popular-open-source-packages/

    When the attacker can separate query parameters using a semicolon (;), they can cause a difference in the interpretation of the request between the proxy (running with default configuration) and the server. This can result in malicious requests being cached as completely safe ones, as the proxy would usually not see the semicolon as a separator, and therefore would not include it in a cache key of an unkeyed parameter - such as utm_* parameters, which are usually unkeyed. Let’s take the following example of a malicious request:

    GET /?link=http://google.com&utm_content=1;link='><t>alert(1)</script> HTTP/1.1
    
    Host: somesite.com
    
    Upgrade-Insecure-Requests: 1		
    
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36
    
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,imag e/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9 Accept-Encoding: gzip, deflate			
    
    Accept-Language: en-US,en;q=0.9 Connection: close			
    

    urlparse sees 3 parameters here: link, utm_content and then link again. On the other hand, the proxy considers this full string: 1;link='><t>alert(1)</script> as the value of utm_content, which is why the cache key would only contain somesite.com/?link=http://google.com.

    A possible solution could be to allow developers to specify a separator, like werkzeug does:

    https://github.com/pallets/werkzeug/blob/6784c44673d25c91613c6bf2e614c84465ad135b/src/werkzeug/urls.py#L833

    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Jan 20, 2021

    Oops, I missed this issue. I just marked my bpo-42975 issue as a duplicate of this one.

    My message:

    urllib.parse.parse_qsl() uses "&" *and* ";" as separators:

    >>> urllib.parse.parse_qsl("a=1&b=2&c=3")
    [('a', '1'), ('b', '2'), ('c', '3')]
    >>> urllib.parse.parse_qsl("a=1&b=2;c=3")
    [('a', '1'), ('b', '2'), ('c', '3')]

    But the W3C standards evolved and now suggest against considering semicolon (";") as a separator:

    https://www.w3.org/TR/2014/REC-html5-20141028/forms.html#url-encoded-form-data

    "This form data set encoding is in many ways an aberrant monstrosity, the result of many years of implementation accidents and compromises leading to a set of requirements necessary for interoperability, but in no way representing good design practices. In particular, readers are cautioned to pay close attention to the twisted details involving repeated (and in some cases nested) conversions between character encodings and byte sequences."

    "To decode application/x-www-form-urlencoded payloads (...) Let strings be the result of strictly splitting the string payload on U+0026 AMPERSAND characters (&)."

    Maybe we should even go further in Python 3.10 and only split at "&" by default, but let the caller to opt-in for ";" separator as well.

    @vstinner vstinner changed the title Web cache poisoning - ; as a query args separator urllib.parse.parse_qsl(): Web cache poisoning - ; as a query args separator Jan 20, 2021
    @vstinner vstinner changed the title Web cache poisoning - ; as a query args separator urllib.parse.parse_qsl(): Web cache poisoning - ; as a query args separator Jan 20, 2021
    @vstinner vstinner changed the title urllib.parse.parse_qsl(): Web cache poisoning - ; as a query args separator [security] urllib.parse.parse_qsl(): Web cache poisoning - ; as a query args separator Jan 20, 2021
    @vstinner vstinner changed the title urllib.parse.parse_qsl(): Web cache poisoning - ; as a query args separator [security] urllib.parse.parse_qsl(): Web cache poisoning - ; as a query args separator Jan 20, 2021
    @malemburg
    Copy link
    Member

    @malemburg malemburg commented Jan 20, 2021

    On 20.01.2021 12:07, STINNER Victor wrote:

    Maybe we should even go further in Python 3.10 and only split at "&" by default, but let the caller to opt-in for ";" separator as well.

    +1.

    Personally, I've never seen URLs encoded with ";" as query parameter
    separator in practice on the server side.

    The use of ";" was recommended in the HTML4 spec, but only in an
    implementation side note:

    https://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.2.2

    and not in the main reference:

    https://www.w3.org/TR/1999/REC-html401-19991224/interact/forms.html#h-17.13.4.1

    Browsers are also pretty relaxed about seeing non-escaped ampersands in
    link URLs and do the right thing, so the suggested work-around for
    avoiding escaping is not really needed.

    @malemburg malemburg changed the title [security] urllib.parse.parse_qsl(): Web cache poisoning - ; as a query args separator Web cache poisoning - ; as a query args separator Jan 20, 2021
    @malemburg malemburg changed the title [security] urllib.parse.parse_qsl(): Web cache poisoning - ; as a query args separator Web cache poisoning - ; as a query args separator Jan 20, 2021
    @malemburg malemburg changed the title Web cache poisoning - ; as a query args separator [security] Web cache poisoning - ; as a query args separator Jan 20, 2021
    @malemburg malemburg changed the title Web cache poisoning - ; as a query args separator [security] Web cache poisoning - ; as a query args separator Jan 20, 2021
    @vstinner vstinner changed the title [security] Web cache poisoning - ; as a query args separator [security] urllib.parse.parse_qsl(): Web cache poisoning - ; as a query args separator Jan 20, 2021
    @vstinner vstinner changed the title [security] Web cache poisoning - ; as a query args separator [security] urllib.parse.parse_qsl(): Web cache poisoning - ; as a query args separator Jan 20, 2021
    @malemburg
    Copy link
    Member

    @malemburg malemburg commented Jan 20, 2021

    Sorry for the title mess: It seems that when replying to a ticket, RoundUp uses the subject line as the new header regardless of what it was set to before.

    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Jan 20, 2021

    Sorry for the title mess: It seems that when replying to a ticket, RoundUp uses the subject line as the new header regardless of what it was set to before.

    Yeah, it's annoying :-( I like to put a module name in the issue title, to help bug triage.

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Jan 20, 2021

    It looks to me, that this is an issue of proxies, not Python. Python implementation obeys contemporary standards, and they are not formally cancelled yet. If we add an option in parse_qsl() or change its default behavior, it should be considered as a new feature which helps to mitigate proxies' issues.

    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Jan 20, 2021

    Python implementation obeys contemporary standards

    The contemporary standard is HTML5 and HTML5 asks to only split at "&", no?

    @Fidget-Spinner
    Copy link
    Member

    @Fidget-Spinner Fidget-Spinner commented Jan 20, 2021

    FWIW, a surprising amount of things rely on treating ';' as a valid separator in the standard test suite.

    From just a cursory look:

    test_cgi
    test_urlparse

    A change in the public API of urlparse will also require a change in cgi.py's FieldStorage, FieldStorage.read_multi, parse and parse_multipart to expose that parameter since those functions forward arguments directly to urllib.parse.parse_qs internally.

    If we backport this, it seems that we will *also* need to backport all those changes to cgi's public API. Otherwise, just backporting the security fix part without allowing the user to switch would break existing code.

    Just my 2 cents on the issue. I'm not too familiar with security fixes in cpython anyways ;).

    @mcepl
    Copy link
    Mannequin

    @mcepl mcepl mannequin commented Feb 26, 2021

    Port of the patch to 2.7.18.

    @gpshead
    Copy link
    Member

    @gpshead gpshead commented Feb 27, 2021

    An example code snippet to detect if the API supports the new parameter at runtime for code that wants to use to use something other than the default '&'.

    if 'separator' in inspect.signature(urllib.parse.parse_qs).parameters:
        ... parse_qs(..., separator=';')
    else:
        ... parse_qs(...)
    

    calling it with the arg and catching TypeError if that fails would also work, but might not be preferred as catching things like TypeError is non-specific and could hide other problems, making it a code maintenance headache.

    @ret2libc
    Copy link
    Mannequin

    @ret2libc ret2libc mannequin commented Mar 9, 2021

    This CVE was reported against Python, however it does not seem to be Python's fault for supporting the ; separator, which was a valid separator for older standards.

    @AdamGold for this issue to become a real security problem, it seems that the proxy has to be configured to ignore certain parameters in the query. For NGINX and Varnish proxies mentioned in the article it seems that by default they use the entire request path, host included, and other things as cache key. For NGINX in particular I could find some snippets online to manipulate the query arguments and split them in arguments, so to remove the "utm_*" arguments, however this does not seem a standard(or at least default) behaviour, nor something easily supported.

    I think that if that is the case and a user has to go out of his way to configure the (wrong) splitting of arguments in the proxy, it is not fair to blame python for accepting ; as separator and assigning a CVE against it may cause confusion.

    For distributions this is problematic as they have 2 choices:

    1. "fix" python but with the risk of breaking user's programs/scripts relying on the previous API
    2. keep older version/unpatched python so that user's programs still work, but with a python version "vulnerable" to this CVE.

    None of these options is really ideal, especially if the problem is somewhere else.

    @AdamGold Could you elaborate a bit more on how common it is and how much configuration is required for proxies to make ; a problem in python?

    @encukou
    Copy link
    Member

    @encukou encukou commented Mar 10, 2021

    With the fix, parse_qs[l] doesn't handle bytes separators correctly.
    There is an explicit type check for str/bytes:

        if not separator or (not isinstance(separator, (str, bytes))):
            raise ValueError("Separator must be of type string or bytes.")

    but a bytes separator fails further down:

    >>> import urllib.parse
    >>> urllib.parse.parse_qs('a=1,b=2', separator=b',')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/pviktori/dev/cpython/Lib/urllib/parse.py", line 695, in parse_qs
        pairs = parse_qsl(qs, keep_blank_values, strict_parsing,
      File "/home/pviktori/dev/cpython/Lib/urllib/parse.py", line 748, in parse_qsl
        pairs = [s1 for s1 in qs.split(separator)]
    TypeError: must be str or None, not bytes

    @orsenthil
    Copy link
    Member

    @orsenthil orsenthil commented Mar 10, 2021

    Petr, thank you. Let's treat it as a new issue linked to this.

    @ret2libc
    Copy link
    Mannequin

    @ret2libc ret2libc mannequin commented Mar 10, 2021

    So far, we at openSUSE had to package at least SQLAlchemy, Twisted, yarl and furl. The author of the first one acknowledged use of semicolon as a bug. I don't think it was so bad.

    Did you upstream fixes for those packages?

    Asking because if this is considered a vulnerability in Python, it should be considered a vulnerability for every other tool/library that accept ; as separator. For example, Twisted seems to have a parse_qs method in web/http.py file that splits by both ; and &.

    Again, I feel like we are blaming the wrong piece of the stack, unless proxies are usually ignoring some arguments (e.g. utm_*) as part of the cache key, by default or in a very easy way.

    @gpshead
    Copy link
    Member

    @gpshead gpshead commented Mar 10, 2021

    Riccardo - FWIW I agree, the wrong part of the stack was blamed and a CVE was wrongly sought for against CPython on this one.

    It's sewage under the bridge at this point. The API change has shipped in several different stable releases and thus is something virtually Python all code must now deal with.

    Why was this a bad change to make? Python's parse_qsl obeyed the prevailing HTML 4 standard at the time it was written:

    https://www.w3.org/TR/html401/appendix/notes.html#ampersands-in-uris

    '''
    We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&"
    '''

    That turns out to have been bad advice in the standard. 15 years later the html5 standard quoted in Adam's snyk blog post links to its text on this which leaves no room for that interpretation.

    In that light, the correct thing to do for this issue would be to:

    • Make the default behavior change in 3.10 match the html5 standard [done].
    • Document that it matches the html4 standard in 3.9 and earlier without changing their default behavior [oops, too late, not done].
    • While adding the ability to allow applications to select the stricter behavior on those older versions. [only sort of done, and somewhat too late now that the strict version has already shipped as stable]

    Afterall, the existence of html5 didn't magically fix all of the html and web applications written in the two decades of web that came before it. Ask any browser author...

    @encukou
    Copy link
    Member

    @encukou encukou commented Mar 11, 2021

    There's another part of the new implementation that looks a bit fishy: the separator argument now allows multi-character strings, so you can parse 'a=1b=2' with separator=''.
    Was this intentional?

    @orsenthil
    Copy link
    Member

    @orsenthil orsenthil commented Mar 13, 2021

    Petr,

    On

    the separator argument now allows multi-character strings, so you can parse 'a=1b=2' with separator=''. Was this intentional?

    No, this was not intentional. The separator arg was just coice, for compatibility, if some wanted to use ; like the some URLs that were shared as use case. We didn't restrict about what was allowed or length of the separator.

    @orsenthil
    Copy link
    Member

    @orsenthil orsenthil commented Apr 11, 2021

    New changeset b38601d by Ken Jin in branch 'master':
    bpo-42967: coerce bytes separator to string in urllib.parse_qs(l) (bpo-24818)
    b38601d

    @miss-islington
    Copy link
    Contributor

    @miss-islington miss-islington commented Apr 11, 2021

    New changeset 6ec2fb4 by Miss Islington (bot) in branch '3.9':
    bpo-42967: coerce bytes separator to string in urllib.parse_qs(l) (GH-24818)
    6ec2fb4

    @mcepl
    Copy link
    Mannequin

    @mcepl mcepl mannequin commented Apr 11, 2021

    Did you upstream fixes for those packages?

    Of course we did. Upstream first!

    @orsenthil
    Copy link
    Member

    @orsenthil orsenthil commented Apr 16, 2021

    New changeset d5b80eb by Miss Islington (bot) in branch '3.8':
    bpo-42967: coerce bytes separator to string in urllib.parse_qs(l) (GH-24818) (bpo-25345)
    d5b80eb

    @merwok
    Copy link
    Member

    @merwok merwok commented Nov 4, 2021

    erlandaasland you’ve been editing closed issues today (got messages from at least 2). maybe submitting old browser tabs with obsolete form data?

    @erlend-aasland
    Copy link
    Contributor

    @erlend-aasland erlend-aasland commented Nov 4, 2021

    Yes, cleaning up ahmedsayeed1982 spam. I did my best to revert the nosy list, component, versions, and assigned to changes. What did I mess up?

    @erlend-aasland
    Copy link
    Contributor

    @erlend-aasland erlend-aasland commented Nov 4, 2021

    See bpo-12168 for a similar cleanup by Eryk Sun. There was approx. 20 spammed issues. Eryk fixed most of them; I did a couple.

    @merwok
    Copy link
    Member

    @merwok merwok commented Nov 4, 2021

    See the changelog entry for 2021-11-04 10:31:24 (and the other ticket where Guido just commented)

    (and thanks for cleaning spam!)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 3.8 3.9 3.10 release-blocker stdlib type-security
    Projects
    None yet
    Development

    No branches or pull requests