Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[security] CVE-2022-0391: urllib.parse should sanitize urls containing ASCII newline and tabs. #88048

Closed
orsenthil opened this issue Apr 18, 2021 · 47 comments
Assignees
Labels
3.7 3.8 3.9 3.10 3.11 stdlib type-security

Comments

@orsenthil
Copy link
Member

@orsenthil orsenthil commented Apr 18, 2021

BPO 43882
Nosy @gpshead, @orsenthil, @vstinner, @ned-deily, @OddBloke, @ambv, @mgorny, @apollo13, @mlissner, @pablogsal, @miss-islington, @tirkarthi, @felixxm, @sethmlarson
PRs
  • #25595
  • #25725
  • #25726
  • #25727
  • #25728
  • #25853
  • #25921
  • #25923
  • #25924
  • #25936
  • #26267
  • #26268
  • #26275
  • #26276
  • #26277
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = <Date 2021-06-02.01:26:09.260>
    created_at = <Date 2021-04-18.19:37:00.259>
    labels = ['type-security', '3.8', '3.9', '3.10', '3.11', '3.7', 'library']
    title = '[security] CVE-2022-0391: urllib.parse should sanitize urls containing ASCII newline and tabs.'
    updated_at = <Date 2022-02-09.11:40:17.243>
    user = 'https://github.com/orsenthil'

    bugs.python.org fields:

    activity = <Date 2022-02-09.11:40:17.243>
    actor = 'felixxm'
    assignee = 'orsenthil'
    closed = True
    closed_date = <Date 2021-06-02.01:26:09.260>
    closer = 'gregory.p.smith'
    components = ['Library (Lib)']
    creation = <Date 2021-04-18.19:37:00.259>
    creator = 'orsenthil'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 43882
    keywords = ['patch']
    message_count = 47.0
    messages = ['391343', '391352', '391426', '391859', '392334', '392338', '392611', '392781', '392808', '392835', '392873', '392926', '392944', '392971', '392995', '393009', '393025', '393030', '393033', '393034', '393039', '393049', '393107', '393108', '393136', '393139', '393142', '393144', '393146', '393149', '393150', '393198', '393203', '393205', '393207', '393211', '393997', '394056', '394057', '394058', '394062', '394112', '394113', '396628', '412688', '412705', '412821']
    nosy_count = 14.0
    nosy_names = ['gregory.p.smith', 'orsenthil', 'vstinner', 'ned.deily', 'odd_bloke', 'lukasz.langa', 'mgorny', 'apollo13', 'Mike.Lissner', 'pablogsal', 'miss-islington', 'xtreak', 'felixxm', 'sethmlarson']
    pr_nums = ['25595', '25725', '25726', '25727', '25728', '25853', '25921', '25923', '25924', '25936', '26267', '26268', '26275', '26276', '26277']
    priority = 'high'
    resolution = 'fixed'
    stage = 'commit review'
    status = 'closed'
    superseder = None
    type = 'security'
    url = 'https://bugs.python.org/issue43882'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9', 'Python 3.10', 'Python 3.11']

    @orsenthil
    Copy link
    Member Author

    @orsenthil orsenthil commented Apr 18, 2021

    A security issue was reported by Mike Lissner wherein an attacker was able to use \r\n in the url path, the urlparse method didn't sanitize and allowed those characters be present in the request.

    In [9]: from urllib.parse import urlsplit
    In [10]: urlsplit("java\nscript:alert('bad')")
    Out[10]: SplitResult(scheme='', netloc='', path="java\nscript:alert('bad')", query='', fragment='')

    Firefox and other browsers ignore newlines in the scheme. From
    the browser console:

    > new URL("java\nscript:alert(bad)")
    << URL { href: "javascript:alert(bad)", origin: "null", protocol:
    "javascript:", username: "", password: "", host: "", hostname: "", port: "", pathname: "alert(bad)", search: ""

    Mozilla Developers informed about the controlling specification for URLs is in fact defined by the "URL Spec"
    from WHATWG which updates RFC 3986 and specifies that tabs and newlines
    should be stripped from the scheme.

    See: https://url.spec.whatwg.org/#concept-basic-url-parser

    That link defines an automaton for URL parsing. From that link, steps 2 and 3 of scheme parsing read:

    If input contains any ASCII tab or newline, validation error.
    3. Remove all ASCII tab or newline from input.


    urlparse module behavior should be updated, and an ASCII tab or newline should be removed from the url (sanitized) before it is sent to the request, as WHATWG spec.

    @tirkarthi
    Copy link
    Member

    @tirkarthi tirkarthi commented Apr 19, 2021

    See also a related issue to sanitise newline on other helper functions https://bugs.python.org/issue30713

    See also discussion and compatibility on disallowing control characters : https://bugs.python.org/issue30458

    @vstinner vstinner changed the title urllib.parse should sanitize urls containing ASCII newline and tabs. [security] urllib.parse should sanitize urls containing ASCII newline and tabs. Apr 20, 2021
    @vstinner vstinner changed the title urllib.parse should sanitize urls containing ASCII newline and tabs. [security] urllib.parse should sanitize urls containing ASCII newline and tabs. Apr 20, 2021
    @vstinner vstinner added stdlib labels Apr 20, 2021
    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Apr 20, 2021

    See also bpo-43883.

    @orsenthil
    Copy link
    Member Author

    @orsenthil orsenthil commented Apr 25, 2021

    I have added a PR to remove ascii newlines and tabs from URL input. It is as per the WHATWG spec.

    However, I still like to research more and find out if this isn't introducing behavior that will break existing systems. It should also be aligned the decisions we have made with previous related bug reports.

    Please review.

    @orsenthil
    Copy link
    Member Author

    @orsenthil orsenthil commented Apr 29, 2021

    New changeset 76cd81d by Senthil Kumaran in branch 'master':
    bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (GH-25595)
    76cd81d

    @orsenthil
    Copy link
    Member Author

    @orsenthil orsenthil commented Apr 29, 2021

    New changeset 491fde0 by Miss Islington (bot) in branch '3.9':
    [3.9] bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (GH-25595) (GH-25725)
    491fde0

    @gpshead
    Copy link
    Member

    @gpshead gpshead commented May 1, 2021

    I think there's still a flaw in the fixes implemented in 3.10 and 3.9 so far. We're closer, but probably not quite good enough yet.

    why? We aren't stripping the newlines+tab early enough.

    I think we need to do the stripping *right after* the _coerce_args(url, ...) call at the start of the function.

    Otherwise we
    (1) are storing url variants with the bad characters in _parse_cache [a mere slowdown in the worst case as it'd just overflow the cache sooner]
    (2) are splitting the scheme off the URL prior to stripping. in 3.9+ there is a check for valid scheme characters, which will defer to the default scheme when found. The WHATWG basic url parsing has these characters stripped before any parts are split off though, so 'ht\rtps' - for example - would wind up as 'https' rather than our behavior so far of deferring to the default scheme.

    I noticed this when reviewing the pending 3.8 PR as it made it more obvious due to the structure of the code and would've allowed characters through into query and fragment in some cases. #25726 (review)

    @ambv
    Copy link
    Contributor

    @ambv ambv commented May 3, 2021

    Good catch, Greg. Since it's not merged already, this change will miss 3.8.10 but as a security fix will be included in 3.8.11 later in the year.

    The partial fix already landed in 3.9 will be released in 3.9.5 later today unless it's amended or reverted in a few hours.

    @orsenthil
    Copy link
    Member Author

    @orsenthil orsenthil commented May 3, 2021

    Based on Greg's review comment, I have pushed the fix for 3.9, and 3.8

    There is no need to hold off releases for these alone. If we get it merged before the release cut today, fine, otherwise, they will be in the next security fix.

    @orsenthil
    Copy link
    Member Author

    @orsenthil orsenthil commented May 3, 2021

    New changeset 8a59574 by Senthil Kumaran in branch '3.9':
    [3.9] bpo-43882 Remove the newline, and tab early. From query and fragments. (bpo-25853)
    8a59574

    @mgorny
    Copy link
    Mannequin

    @mgorny mgorny mannequin commented May 4, 2021

    I hate to be the bearer of bad news but I've already found this change to be breaking tests of botocore and django. In both cases, the test failure is apparently because upstream used to reject URLs after finding newlines in the split components, and now they're silently stripped away.

    Filed bugs:
    boto/botocore#2377
    https://code.djangoproject.com/ticket/32713

    Note that I'm not saying the change should be reverted.

    @sethmlarson
    Copy link
    Mannequin

    @sethmlarson sethmlarson mannequin commented May 4, 2021

    Leaving a thought here, I'm highlighting that we're now implementing two different standards, RFC 3986 with hints of WHATWG-URL. There are pitfalls to doing so as now a strict URL parser for RFC 3986 (like the one used by urllib3/requests) will give different results compared to Python and thus opens up the door for SSRF vulnerabilities 1.

    @mlissner
    Copy link
    Mannequin

    @mlissner mlissner mannequin commented May 4, 2021

    I haven't watched that Blackhat presentation yet, but from the slides, it seems like the fix is to get all languages parsing URLs the same as the browsers. That's what @orsenthil has been doing here and plans to do in https://bugs.python.org/issue43883.

    Should we get a bug filed with requests/urllib3 too? Seems like a good idea if it suffers from the same problems.

    @gpshead
    Copy link
    Member

    @gpshead gpshead commented May 5, 2021

    Both Django and Botocore issues appear to be in the category of: "depending on invalid data being passed through our urlsplit API so that they could look for it later" Not much sympathy. We never guaranteed we'd pass invalid data through. They're depending on an implementation detail (Hyrum's law). Invalid data causes other people who don't check for it problems. There is no valid solution on our end within the stdlib that won't frustrate somebody.

    We chose to move towards safer (undoubtedly not perfect) by default.

    Instead of the patches as you see them, we could've raised an exception. I'm sure that would also also have tripped up existing code depending on the undesirable behavior.

    If one wants to reject invalid data as an application/library/framework, they need a validator. The Python stdlib does not provide a URL validation API. I'm not convinced we would even want to (though that could be something bpo-43883 winds up providing) given how perilous that is to get right: Who's version of right? which set of standards? when and why? Conclusion: The web... such a mess.

    @gpshead gpshead reopened this May 6, 2021
    @gpshead gpshead reopened this May 6, 2021
    @mlissner
    Copy link
    Mannequin

    @mlissner mlissner mannequin commented May 6, 2021

    With the fix for this bug, urlsplit silently removes (some of) those characters before we can replace them, modifying the output of our sanitisation code

    I don't have any good solutions for 3.9.5, but going forward, this feels like another example of why we should just do parsing right (the way browsers do). That'd maintain tabs and whatnot in your output, and it'd fix the security issue by putting java\nscript into the scheme attribute instead of the path.

    One solution that presents itself to me: add a strip_insecure_characters: bool = True parameter.

    Doesn't this lose sight of what this tool is supposed to do? It's not supposed to have a good (new, correct) and a bad (old, obsolete) way of parsing. Woe unto whoever has to write the documentation for that parameter.

    Also, I should reiterate that these aren't "insecure" characters so if we did have a parameter for this, it'd be more like do_rfc_3986_parsing or maybe do_naive_parsing. The chars aren't insecure in themselves. They're fine. Python just gets tripped up on them.

    @gpshead
    Copy link
    Member

    @gpshead gpshead commented May 6, 2021

    Of note: If we had chosen to raise a ValueError (or similar) for these characters by default, the cloud-init code would also fail to behave as intended today (based on what I see in canonical/cloud-init@c478d0b)

    Recommendation for cloud-init - Do your hostname transformation early using as simple as possible logic. By virtue of accepting (and encouraging?) invalid characters and transforming them, what you have today that you call urlsplit on is more of a url template, not really a url. something like this:

    if m := re.search(r'^(?P<scheme_colon_slashes>[^/:]+://|)(?P<hostname>[^/]+)(?P<rest>/.*)', url_template):
        start, hostname, end = m.groups()
        for transformation in transformations:
            ... fixup hostname ...
        url = f'{start}{hostname}{end}'
    else:
        ... # doesn't look like a URL template
    

    yes this simplicity would allow your transformations to apply to the :port number. you could simplify further by including the scheme_colon_slashes in the part transformed. as your values are coming from user written config files, do you need to care about invalid characters in those transforming into invalid in the scheme or port number - characters in the resulting url anyways?

    after that, do you even need urlsplit at all in your _apply_hostname_transformations_to_url() function?

    @gpshead
    Copy link
    Member

    @gpshead gpshead commented May 6, 2021

    FWIW, if we were to add a parameter, I'd lean towards a name of "invalid_url_characters = None" defaulting to using what's in our private _UNSAFE_URL_BYTES_TO_REMOVE global when None but otherwise letting the user specify a sequence of characters.

    @ned-deily
    Copy link
    Member

    @ned-deily ned-deily commented May 7, 2021

    Unfortunately you already have to deal with the existence of 3.9.5 having the new behavior but not having a control.

    I have been holding off on 3.7.x and 3.6.x security releases pending resolutions of this and other open security issues. But based on the most recent discussions, my take is that it would be a disservice to anyone still using 3.7 or 3.6 to release with the most recent "partial" fix (GH-25924 / #70112) if it going to cause breakage. So, rather than delay further, unless there is a new resolution or someone has a very persuasive argument against it, I am going to revert those last two PRs from 3.7 and 3.6 for the upcoming releases pending a less intrusive fix.

    @orsenthil
    Copy link
    Member Author

    @orsenthil orsenthil commented May 7, 2021

    Hello All,

    I think, the current striping of ASCII newline and tab is a _reasonable_ solution given it was a security issue.

    It also follows the guidelines of "WHATWG" (Specifically Point 3)

    1. If input contains any ASCII tab or newline, validation error.
    2. Remove all ASCII tab or newline from input.

    And as per WHATWG, "A validation error does not mean that the parser terminates. Termination of a parser is always stated explicitly, e.g., through a return statement."

    I agree that terms used in spec vs representing it with library code may not be 1:1, but we tried to stay close and followed the existing behavior of widely used clients.

    This is a fix, per a security report, and per an evolv{ed,ing} standard recommendation. When dealing with security fixes, there could be simple or more complex migration involvements.

    My reading of the previous message was, even if we raised exception or gave as a parameter, it wont be any better for certain downstream users, as we let the security problem open, and have it only as opt-in fix.

    With respect to control

    The comment in the review - #25595 (review) was to make these characters available in module level parameters, so that if users prefer to override, they could patch it.

    so a revert may not be necessary for the reason of lack of control.

    In short, at this moment, I still feel that is reasonable fix at this moment for the problem report, and intention to move closer to whatwg spec.

    @ned-deily
    Copy link
    Member

    @ned-deily ned-deily commented May 7, 2021

    My reading of the previous message was, even if we raised exception
    or gave as a parameter, it wont be any better for certain downstream
    users, as we let the security problem open, and have it only as opt-in fix.

    Senthil, I am not sure which previous message you are referring to but, with regards to my comment about revert the recent fixes for 3.7 and 3.6 until the reported problems are resolved, I should add that, given the recent input from downstream users about the side effects, the only way we *should* proceed with the current changes is by including more information in a What's New entry and the NEWS blurb about that the implications to users are of these changes.

    @gpshead
    Copy link
    Member

    @gpshead gpshead commented May 7, 2021

    There is no less intrusive fix as far as I can see. I believe we're down to either stick with what we've done, or do nothing. It doesn't have to be the same choice in all release branches, being more conservative with changes the older the stable branch is okay. (ie: removing this from 3.6 and 3.7 seems fine even if more recent ones do otherwise)

    Based on my testing, raising an exception is more intrusive to existing tests (which we can only ever hope is representative of code) than stripping. At least as exposed by running the changes through many tens of thousands of unittest suites at work.

    ie: if we raise an exception, pandas.read_json() starts failing because that winds up using urlsplit in hopes of extracting the scheme and comparing that to known values as their method of deciding if something should be treated as a URL to data rather than data. Pandas would need to be fixed.

    That urlsplit() API use pattern is repeated in various other pieces of code: urlsplit is not expected to raise an exception. The caller then has a conditional or two testing some parts of the urlsplit result to make a guess as to if something should be considered a URL or not. Doing code inspection, pandas included, this code pretty much always then goes on to pass the original url value off to some other library, be it urllib, or requests, or ...).

    Consequences of that code inspection finding? With our existing character stripping change, new data is then allowed to pass through these urlsplit uses and be considered a URL. Which leads to some code sending the url with embedded \r\n\t chars on to other APIs - a concern expressed a couple of times above.

    Even though urlsplit isn't itself a validation API, it gets used as an early step in peoples custom identification and validation attempts. So *any* change we make to it at all in any way breaks someones expectations, even if they probably shouldn't have had those expectations and aren't doing wise validation.

    Treat this analysis as a sign that we should provide an explicit url validator because almost everyone is doing it some form of wrong. (bpo-43883)

    I did wonder if Mike's suggestion of removing the characters during processing, but leaving them in the final result in https://bugs.python.org/issue43882#msg393033 is feasible as remediation for this? My gut feeling is that it isn't. It doesn't solve the problem of preventing the bad data from going where it shouldn't. Even if we happen to parse that example differently, the unwanted characters are still retained in other places they don't belong. Fundamantelly: We require people to make a different series of API call and choices in the end user code to **explicitly not use unvalidated inputs**. Our stdlib API surface can't satisfy that today and use of unvalidated data in wrong places is a broad software security antipattern theme.

    @orsenthil
    Copy link
    Member Author

    @orsenthil orsenthil commented May 7, 2021

    Ned wrote:

    Senthil, I am not sure which previous message you are referring to but.

    I meant, the messages from other developers who raised that change broke certain test cases.

    Ned, but I got little concerned, if we planned to revert the change.

    the only way we *should* proceed with the current changes is by including more information in a What's New entry and the NEWS blurb about that the implications to users are of these changes.

    I agree with completely. I will include an additional blurb for this change for security fix versions.

    Greg wrote:

    There is no less intrusive fix as far as I can see. I believe we're down to either stick with what we've done, or do nothing.

    Exactly my feeling too.

    It doesn't have to be the same choice in all release branches, being more conservative with changes the older the stable branch is okay. (ie: removing this from 3.6 and 3.7 seems fine even if more recent ones do otherwise)

    I hadn't considered that. But it wont save much will be my opinion. The users will have to upgrade to supported versions anyway and it will break then. The problem is only pushed a little.

    So, keeping it consistent seems alright to me. It is a little additional for everyone, but we seem to be doing it.

    @ned-deily
    Copy link
    Member

    @ned-deily ned-deily commented May 20, 2021

    I will include an additional blurb for this change for security fix versions.

    Ping. This issue is still blocking 3.7 and 3.6 security releases.

    @ned-deily
    Copy link
    Member

    @ned-deily ned-deily commented May 20, 2021

    New changeset c723d51 by Senthil Kumaran in branch '3.7':
    [3.7] bpo-43882 - Mention urllib.parse changes in Whats New section for 3.7.11 (GH-26267)
    c723d51

    @ned-deily
    Copy link
    Member

    @ned-deily ned-deily commented May 20, 2021

    New changeset 6f743e7 by Senthil Kumaran in branch '3.6':
    [3.6] bpo-43882 - Mention urllib.parse changes in Whats New section for 3.6.14 (GH-26268)
    6f743e7

    @ned-deily
    Copy link
    Member

    @ned-deily ned-deily commented May 20, 2021

    Thanks, Senthil and Greg! The updates for 3.7 and 3.6 are now merged. Is there anything else that needs to be done for this issue or can it now be closed?

    @gpshead
    Copy link
    Member

    @gpshead gpshead commented May 20, 2021

    Lets get equivalent whatsnew text into the 3.8 and 3.9 and 3.10 branches before closing it.

    @orsenthil
    Copy link
    Member Author

    @orsenthil orsenthil commented May 21, 2021

    New changeset f14015a by Senthil Kumaran in branch '3.10':
    [3.10] bpo-43882 - Mention urllib.parse changes in Whats new section. (GH-26275)
    f14015a

    @orsenthil
    Copy link
    Member Author

    @orsenthil orsenthil commented May 21, 2021

    New changeset 0593ae8 by Senthil Kumaran in branch '3.9':
    [3.9] bpo-43882 - Mention urllib.parse changes in Whats new section. (GH-26276)
    0593ae8

    @gpshead gpshead closed this as completed Jun 2, 2021
    @gpshead gpshead closed this as completed Jun 2, 2021
    @ambv
    Copy link
    Contributor

    @ambv ambv commented Jun 28, 2021

    New changeset 634da2d by Senthil Kumaran in branch '3.8':
    [3.8] bpo-43882 - Mention urllib.parse changes in Whats new section. (bpo-26277)
    634da2d

    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Feb 6, 2022

    CVE-2022-0391 has been assigned to this vulnerability.

    @vstinner vstinner changed the title [security] urllib.parse should sanitize urls containing ASCII newline and tabs. [security] CVE-2022-0391: urllib.parse should sanitize urls containing ASCII newline and tabs. Feb 6, 2022
    @vstinner vstinner changed the title [security] urllib.parse should sanitize urls containing ASCII newline and tabs. [security] CVE-2022-0391: urllib.parse should sanitize urls containing ASCII newline and tabs. Feb 6, 2022
    @mlissner
    Copy link
    Mannequin

    @mlissner mlissner mannequin commented Feb 7, 2022

    Looks like that CVE isn't public yet.

    https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-0391

    Any chance I can get access (I originally reported this vuln.). My email is mike@free.law, if it's possible and my email is needed.

    Thanks!

    @vstinner
    Copy link
    Member

    @vstinner vstinner commented Feb 8, 2022

    Looks like that CVE isn't public yet.
    https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-0391
    Any chance I can get access (I originally reported this vuln.).

    Message from Gaurav Kamathe who requested the CVE:

    "We've sent a request to MITRE to get this published and it'll be available on MITRE shortly."

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 3.8 3.9 3.10 3.11 stdlib type-security
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants