
[tangentially related to CVE-2023-24329] urlparse does not correctly handle schemes that begin with ASCII digits, '+', '-', and '.' characters #99418

Closed
kenballus opened this issue Nov 12, 2022 · 11 comments
Labels: stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

Comments

@kenballus
Contributor

kenballus commented Nov 12, 2022

Background

RFC 3986 defines a scheme like this:

  • scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

RFC 2234 defines an ALPHA like this:

  • ALPHA = %x41-5A / %x61-7A
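
Read together, the two rules above say that a scheme is one ASCII letter followed by any number of ASCII letters, digits, '+', '-', or '.'. A minimal sketch of that grammar as a regular expression follows; the names `SCHEME_RE` and `is_valid_scheme` are made up for illustration and are not part of urllib.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# ALPHA is restricted to ASCII letters (%x41-5A / %x61-7A).
# SCHEME_RE and is_valid_scheme are illustrative names, not urllib APIs.
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(scheme):
    """Return True if `scheme` matches the RFC 3986 scheme grammar."""
    # fullmatch anchors both ends, so a trailing newline cannot slip through.
    return SCHEME_RE.fullmatch(scheme) is not None


if __name__ == "__main__":
    for candidate in ("http", "git+ssh", "a.b-c", ".", "+x", "1http", ""):
        print(repr(candidate), is_valid_scheme(candidate))
```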

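The WHATWG URL spec defines a scheme like this:

  • A URL-scheme string must be one ASCII alpha, followed by zero or more of ASCII alphanumeric, U+002B (+), U+002D (-), and U+002E (.).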

The bug

This is the scheme string parsing code from Lib/urllib/parse.py:462-468:

    i = url.find(':')
    if i > 0:
        for c in url[:i]:
            if c not in scheme_chars:
                break
        else:
            scheme, url = url[:i].lower(), url[i+1:]

This is the definition of scheme_chars from Lib/urllib/parse.py:77-80:

scheme_chars = ('abcdefghijklmnopqrstuvwxyz'
                'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                '0123456789'
                '+-.')

This will erroneously validate schemes that begin with any of ('.', '-', '+', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'). This behavior is in violation of both specifications.
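
One possible way to close this gap is to check the first character separately before running the existing loop. The sketch below is a stand-alone illustration under that assumption, not necessarily the patch that was merged; `split_scheme` is a hypothetical helper and does not exist in urllib.parse.

```python
# Stand-alone illustration of the scheme split with one extra guard.
# split_scheme is a hypothetical helper, not a function in urllib.parse.
scheme_chars = ('abcdefghijklmnopqrstuvwxyz'
                'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                '0123456789'
                '+-.')


def split_scheme(url):
    """Return (scheme, rest); scheme is '' when no valid scheme is found."""
    i = url.find(':')
    # Extra guard: the first character must be an ASCII letter.
    if i > 0 and url[0].isascii() and url[0].isalpha():
        for c in url[:i]:
            if c not in scheme_chars:
                break
        else:
            # Every character passed, so lowercase the scheme and strip it.
            return url[:i].lower(), url[i + 1:]
    # No colon, empty prefix, bad first character, or bad later character.
    return '', url
```

With this guard, an input such as ".://" is returned unchanged with an empty scheme, while "HTTP://example.com" still splits into "http" and "//example.com".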

This bug is reproducible with the following snippet:

>>> from urllib.parse import urlparse
>>> urlparse(".://") # Should error, but doesn't
ParseResult(scheme='.', netloc='', path='', params='', query='', fragment='')
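
To check an installed interpreter against all 13 leading characters listed above, a small probe along these lines may help. It is only a sketch: `has_invalid_scheme` and `_SCHEME_RE` are illustrative names, and the printed result depends on the Python version in use.

```python
import re
from urllib.parse import urlparse

# RFC 3986 scheme grammar; the names below are illustrative only.
_SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def has_invalid_scheme(url):
    """Return True if urlparse reports a non-empty scheme that violates RFC 3986."""
    scheme = urlparse(url).scheme
    return bool(scheme) and _SCHEME_RE.fullmatch(scheme) is None


if __name__ == "__main__":
    # '.', '-', '+', and the ten ASCII digits: the 13 prefixes from the report.
    for lead in ".-+0123456789":
        url = lead + "://"
        print(repr(url), has_invalid_scheme(url))
```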

My environment

  • CPython versions tested on:
  • Operating system and architecture:
    • Arch Linux x86_64
kenballus added the type-bug (An unexpected behavior, bug, or error) label Nov 12, 2022
hugovk added the stdlib (Python modules in the Lib dir) label Nov 12, 2022
kenballus added a commit to kenballus/cpython that referenced this issue Nov 12, 2022
…that don't begin with an alphabetical ASCII character.
gpshead pushed a commit that referenced this issue Nov 13, 2022
… with an alphabetical ASCII character. (#99421)

Prevent urllib.parse.urlparse from accepting schemes that don't begin with an alphabetical ASCII character.

RFC 3986 defines a scheme like this: `scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )`
RFC 2234 defines an ALPHA like this: `ALPHA = %x41-5A / %x61-7A`

The WHATWG URL spec defines a scheme like this:
`"A URL-scheme string must be one ASCII alpha, followed by zero or more of ASCII alphanumeric, U+002B (+), U+002D (-), and U+002E (.)."`
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Nov 13, 2022
… begin with an alphabetical ASCII character. (pythonGH-99421)

(cherry picked from commit 439b9cf)

Co-authored-by: Ben Kallus <49924171+kenballus@users.noreply.github.com>
miss-islington added a commit that referenced this issue Nov 13, 2022
… with an alphabetical ASCII character. (GH-99421)

(cherry picked from commit 439b9cf)

Co-authored-by: Ben Kallus <49924171+kenballus@users.noreply.github.com>
@hauntsaninja
Contributor

Thanks, looks like this has been fixed

@vstinner
Member

vstinner commented Apr 5, 2023

CVE-2023-24329 was assigned to this issue.

vstinner changed the title from "urlparse does not correctly handle schemes that begin with ASCII digits, '+', '-', and '.' characters" to "[CVE-2023-24329] urlparse does not correctly handle schemes that begin with ASCII digits, '+', '-', and '.' characters" Apr 5, 2023
@vstinner
Member

vstinner commented Apr 5, 2023

Python 3.7, 3.8 and 3.9 are affected by this issue and still get security fixes.

@gpshead: Should this fix be backported to Python 3.7-3.9?

@vstinner
Member

vstinner commented Apr 5, 2023

Ah, I don't see a fix for Python 3.10 either, even though the issue was reported on Python 3.10.

@vstinner
Member

vstinner commented Apr 5, 2023

I created https://python-security.readthedocs.io/vuln/urlparse-scheme.html to track fixes of this issue.

@gpshead
Member

gpshead commented Apr 8, 2023

Please see #102153.

@gpshead
Member

gpshead commented Apr 8, 2023

(ie: that python-security urlparse-scheme blog text is currently wrong: this is not fixed, and the first report was in July, not November)

@vstinner
Member

vstinner commented May 4, 2023

> (ie: that python-security urlparse-scheme blog text is currently wrong: this is not fixed, and the first report was in July, not November)

I'm maintaining this page manually and it's quite a lot of work to maintain it. I tried to automate as many things as possible. The source can be found in the YAML file: https://github.com/vstinner/python-security/blob/main/vulnerabilities.yaml#L2134

Feel free to propose a PR to fix the entry ;-)

@ngie-eign
Contributor

The fix for this CVE should really be backported if applicable.

gpshead changed the title from "[CVE-2023-24329] urlparse does not correctly handle schemes that begin with ASCII digits, '+', '-', and '.' characters" to "[tangentially related to CVE-2023-24329] urlparse does not correctly handle schemes that begin with ASCII digits, '+', '-', and '.' characters" May 20, 2023
@gpshead
Member

gpshead commented May 20, 2023

> The fix for this CVE should really be backported if applicable.

This issue does not contain the fix.

See #102153.

@ngie-eign
Contributor

> > The fix for this CVE should really be backported if applicable.
>
> This issue does not contain the fix.
>
> See #102153.

@gpshead: thank you so very much for the pointer! I'll do some poking around next week to see whether other OS distributions have addressed this. If no fixes are available, I'll try crafting an appropriate patch (or patches) and link to it from the appropriate issue.

I work on a project that uses 3.8; if it's too much work for 3.7, I'll just look into making the 3.8 patch work.

gentoo-bot pushed a commit to gentoo/cpython that referenced this issue May 21, 2024
… begin with an alphabetical ASCII character. (pythonGH-99421)

(cherry picked from commit 439b9cf)

Co-authored-by: Ben Kallus <49924171+kenballus@users.noreply.github.com>