gh-87389: avoid treating path as URI with netloc #93894

nascheme · 2022-06-16T09:02:43Z

I believe this is a more correct fix for gh-87389 and may fix security bugs in other applications using urlunsplit(). We should not be confusing the path argument as a netloc, IMHO. I fixed http.server by not using urllib.parse on the path. For http.server, that part of the HTTP request is not treated as a full URL but instead a path + optional query + optional fragment. So just parsing it "by hand" and not using urllib.parse avoids the bug.
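As a rough illustration of the "by hand" parsing described above, the request target can be split into path, query and fragment without ever consulting urllib.parse. This is only a sketch: the helper name split_request_target is hypothetical and is not the code in this PR.

```python
def split_request_target(target):
    """Split 'path?query#fragment' by hand; never infer a scheme or netloc."""
    # The fragment comes last, so peel it off first.
    rest, _, fragment = target.partition('#')
    # Whatever precedes the first '?' is the path, even if it starts with '//'.
    path, _, query = rest.partition('?')
    return path, query, fragment


print(split_request_target('//example.org/%2e%2e?a=1#top'))
```

Because nothing here looks for a scheme or an authority, a leading double slash stays part of the path instead of being read as a netloc.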

still needs test for urlunsplit() change

Lib/urllib/parse.py

Lib/http/server.py

bedevere-bot · 2022-06-16T18:30:37Z

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

nascheme · 2022-06-16T19:55:07Z

I have made the requested changes; please review again.

I added urllib.parse.pathsplit() as suggested. Not sure about the name but it seems consistent with the module. Hopefully I got the doc markup correct.

bedevere-bot · 2022-06-16T19:55:10Z

Thanks for making the requested changes!

@gpshead: please review the changes made to this pull request.

gpshead · 2022-06-16T20:40:24Z

Lib/http/server.py

+    # reported in gh-87389, a path starting with a double slash should not be
+    # treated as a relative URI.  Also, a path with a colon in the first
+    # component could also be parsed wrongly.
+    parts = urllib.parse.pathsplit(path)


Unfortunately BaseHTTPRequestHandler self.path isn't guaranteed to be only a pathname by anything.

GET http://netloc/path/to/thing HTTP/1.0

This is a valid HTTP request to many HTTP servers. Our existing http server code ignores the scheme and netloc on that and serves up /path/to/thing properly. Might this code break that? I.e., if http://netloc winds up in the returned path, then it seems it would pass through pathsplit and urlunsplit properly only by chance rather than by design, given the name and purpose of the pathsplit API versus what was just handed to it as input.

Check out my updated test in #87389.

I have no idea whether anything depends on this. It should be unusual for a client that is not pretending to talk to an HTTP proxy to make requests that way, since that isn't normal per the HTTP/1.0 and 1.1 RFCs. But as it worked before, and other servers like Apache continue to support it, I don't think we can break that, at least not in a bugfix.
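To make the concern concrete, here is a small sketch contrasting how urllib.parse.urlsplit() reads an absolute-form request target with what a path-only split would hand back. The by_hand_path expression is an illustrative stand-in, not the pathsplit() implementation from this PR.

```python
from urllib.parse import urlsplit

target = 'http://netloc/path/to/thing'

# urlsplit() treats the target as a full URL and separates out the authority.
parts = urlsplit(target)
print(parts.scheme, parts.netloc, parts.path)

# A path-only split (illustrative) never looks for a scheme or netloc,
# so the whole target ends up being treated as the "path".
by_hand_path = target.partition('#')[0].partition('?')[0]
print(by_hand_path)
```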

Based on my reading of the RFC, GET http://netloc/path/to/thing HTTP/1.0 should technically be allowed. However, http.server doesn't handle it, before or after this PR. Instead, it tries to use the second word as an absolute file path. I feel like we should not attempt to fix http.server to handle the above. For HTTP 1.0, using only a file path was required. For curl I notice that even with the --http1.1 flag, it passes a path and not the URL with the scheme+netloc.

The bug causing #87389, IMHO, is that http.server was not consistent in how it treated self.path. The translate_path() method treated it as a path and not a URL. But the redirect that adds a trailing slash (done only if a folder with that name was found) parsed it as if it could be a full URL. It needs to do it one way or the other, consistently. The _request_path_split() method I added does this. It doesn't make sense to add this to urllib.parse since what http.server is doing is not modern HTTP.
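The consistency argument can be sketched as follows: both the filesystem lookup and the trailing-slash redirect start from one shared split, so they cannot disagree about what the path is. All function names below are hypothetical, and collapsing leading slashes is an assumption made for this sketch rather than a description of the actual _request_path_split() method.

```python
def request_path_split(target):
    """Shared split used by both the lookup and the redirect (hypothetical)."""
    rest, _, fragment = target.partition('#')
    path, _, query = rest.partition('?')
    return path, query, fragment


def _collapse_leading_slashes(path):
    # Assumption for this sketch: '//x' becomes '/x' so that the result
    # can never be read as a network-path reference ('//netloc/...').
    return '/' + path.lstrip('/') if path.startswith('//') else path


def lookup_path(target):
    """Path used for the filesystem lookup; query and fragment are dropped."""
    path, _, _ = request_path_split(target)
    return _collapse_leading_slashes(path)


def redirect_location(target):
    """Location value that adds a trailing slash to the same path."""
    path, query, fragment = request_path_split(target)
    location = _collapse_leading_slashes(path) + '/'
    if query:
        location += '?' + query
    if fragment:
        location += '#' + fragment
    return location


print(lookup_path('//evil.example/dir?x=1'))
print(redirect_location('//evil.example/dir?x=1'))
```

The sketch deliberately ignores '.' and '..' segments and percent-decoding, which the real translate_path() has to deal with.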

Lib/urllib/parse.py

Misc/NEWS.d/next/Security/2022-06-16-12-13-55.gh-issue-87389.MS9wAR.rst

Lib/test/test_urlparse.py

Lib/test/test_httpservers.py

ambv · 2022-06-22T08:45:15Z

@gpshead what do you think about landing this change for 3.12 only as an enhancement?

gpshead · 2022-07-03T19:21:24Z

> @gpshead what do you think about landing this change for 3.12 only as an enhancement?

Yeah, I think landing this in 3.12 as a feature makes sense. This PR branch needs syncing now that the other PR has gone in.

nascheme · 2022-07-04T20:54:51Z

A few thoughts about this. I can rework the PR so that the changes to urllib.parse are separated from the http.server change. The fix made in 3.11 is sufficient for fixing the security bug, so the change in this PR isn't strictly required.

The change to urllib.parse.urlunsplit needs careful consideration. Not merging it into 3.11 was the right move. Thinking about it more, I'm having doubts that we can change this. Maybe we need a new API or maybe we just can't change it. It seems almost certain that some code expects the existing (IMHO insecure) behaviour of urlunsplit(). E.g. putting the full URL in the path arg, like this::

>>> urllib.parse.urlunsplit(('', '', 'http://foo.bar/baz', '', ''))
'http://foo.bar/baz'

We could try changing it in an alpha release and see what breaks. Or we could analyze publicly available code to see how urlunsplit() is used.

Ideally we should analyze all the arguments of urlunsplit() and determine if we want to do sanity checking or cleanup on them. My change adds checking of the path to ensure it doesn't get confused with scheme or netloc. I suspect there are other sanity checks that could be done. E.g. preventing special characters like ? and # in parameters. Maybe we need urlunsplit_safe().
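A minimal sketch of what such a stricter wrapper might look like is below. No urlunsplit_safe() exists in urllib.parse; the name comes from the suggestion above, and the specific checks (and the choice to raise ValueError) are assumptions made for illustration.

```python
from urllib.parse import urlunsplit


def urlunsplit_safe(components):
    """Hypothetical strict variant: refuse components that could be misread."""
    scheme, netloc, path, query, fragment = components
    # Without a netloc, a path starting with '//' would be read back as one.
    if not netloc and path.startswith('//'):
        raise ValueError('path would be confused with a netloc: %r' % (path,))
    # Without a scheme or netloc, a colon in the first segment looks like a scheme.
    if not scheme and not netloc and ':' in path.split('/', 1)[0]:
        raise ValueError('path would be confused with a scheme: %r' % (path,))
    # A '#' inside the query would start the fragment early.
    if '#' in query:
        raise ValueError('query contains a fragment delimiter: %r' % (query,))
    return urlunsplit(components)


print(urlunsplit_safe(('http', 'example.org', '/a', 'q=1', 'f')))
```

With these checks, the call shown earlier that passes 'http://foo.bar/baz' as the path argument would raise instead of returning a full URL, which is exactly the kind of behaviour change that existing callers might not tolerate.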

gpshead · 2022-07-18T01:58:02Z

> Thinking about it more, I'm having doubts that we can change this. Maybe we need a new API or maybe we just can't change it. It seems almost certain that some code expects the existing (IMHO insecure) behaviour of urlunsplit(). E.g. putting the full URL in the path arg

Agreed, it's a challenge. I'm not personally motivated to try and push for this change to the level of doing code analysis of code-at-large's urlunsplit API calls.

I like your thinking that further analysis of all the required safety checks would be good. There are probably useful things to learn from what other libraries (in any language) do when constructing URLs from their parts. Adding this is always easier as a _safe API or with a bool flag to enable strict/pedantic/safe/bikeshedcolor mode.

I suggest this get tracked as its own Enhancement request issue. This PR can presumably be tied to that and closed for now, given that more, not yet fully defined, work is needed if it is going to be done at all.

vadmium

There seem to be three behaviour changes proposed:

1. Changing urlunsplit(('', '', '//path', '', '')) to return '/path' instead of '//path'

   Another option is to prefix with double slash, representing an empty host e.g. '////path'. This is proposed in issue #78457 and pull request #113563. I think I prefer that four-slash option, because it also fixes some urlsplit → urlunsplit round-trip cases.

2. Changing urlunsplit(('', '', 'colon:path', '', '')) → './colon:path'

   This seems a reasonable change, and it is kind of suggested in RFC 3986. (Another option might be to encode the first colon, and return 'colon%3Apath'.)

3. SimpleHTTPRequestHandler’s handling of GET https://example.net/dir

   This is a legal HTTP 1.0 and 1.1 request, but is mainly for proxy servers, which is not what SimpleHTTPRequestHandler does. In this case the server looks up https:/example.net/dir as a path in its filesystem (which is not in the spirit of HTTP), and decides to redirect with a trailing slash.

   Currently it looks like the code sends Location: https://example.net/dir/. I don’t think there is anything really wrong with that.

   The proposed changes would send Location: ./https://example.net/dir/. This new redirect is a path-relative URL. The base URL is supposed to be the original target https://example.net/dir, so the redirect would resolve to https://example.net/https://example.net/dir/, which is not intended.

   If you want to fix anything in the HTTP server, I would make the server ignore the scheme and authority components, and just look up the path component. But I don’t think anyone is complaining about that, so it may not be worth fixing.

If the urllib.parse changes are too disruptive, perhaps a deprecation warning is the best way forward, and either add an opt-in way to get the new behaviour, or change the warning to an exception in the future?
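The second and third points can be checked with urllib.parse itself. The sketch below first shows how a colon in the first path segment is read with and without the './' prefix, then resolves the proposed Location value against the original request target. The exact number of slashes in the resolved URL may differ between Python versions, so no expected output is shown.

```python
from urllib.parse import urljoin, urlsplit

# Point 2: a colon in the first segment is read as a scheme unless prefixed.
bare = urlsplit('colon:path')
prefixed = urlsplit('./colon:path')
print(repr(bare.scheme), repr(bare.path))
print(repr(prefixed.scheme), repr(prefixed.path))

# Point 3: the proposed redirect is a path-relative reference, so a client
# resolves it against the original target instead of treating it as absolute.
base = 'https://example.net/dir'
location = './https://example.net/dir/'
resolved = urljoin(base, location)
print(resolved)
```

The resolved URL stays on example.net but points beneath a path segment named 'https:', not at the intended /dir/ directory.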

vadmium · 2024-04-28T00:04:17Z

Lib/urllib/parse.py

@@ -491,14 +491,32 @@ def urlunparse(components):
        url = "%s;%s" % (url, params)
    return _coerce_result(urlunsplit((scheme, netloc, url, query, fragment)))

+# Returns true if path can be confused with a scheme.  I.e. a relative path
+# without leading dot that includes a colon in the first component.
+_is_scheme_like = re.compile(r'[^/.][^/]*:').match


Why the special allowance for a leading dot? Is there a test case for it? Yes, a scheme cannot start with a dot, but a path-noscheme component of .: is no more legal than https: according to RFC 3986.
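To see what the leading-dot allowance does in practice, the pattern from the hunk above can be run against a few sample paths. The regular expression is copied from the diff; the sample inputs are chosen here for illustration and are not taken from the PR's tests.

```python
import re

# Pattern copied from the proposed change to Lib/urllib/parse.py.
_is_scheme_like = re.compile(r'[^/.][^/]*:').match

samples = [
    'colon:path',      # first segment contains a colon
    'https:',          # looks exactly like a scheme
    './colon:path',    # already prefixed with './'
    '.:',              # starts with a dot, so it is exempted
    '.hidden:x',       # also exempted by the leading-dot rule
    '/colon:path',     # absolute path
    'dir/colon:path',  # colon is not in the first segment
]
results = {s: bool(_is_scheme_like(s)) for s in samples}
for sample, matched in results.items():
    print(repr(sample), matched)
```

Only the first two samples match. Anything starting with a dot is let through, including '.:' and '.hidden:x', which is the allowance being questioned.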

vadmium · 2024-04-28T00:05:37Z

Lib/test/test_urlparse.py

+            # expected result is a relative URL without netloc and scheme
+            (('', 'a', '', '', ''), '//a'),
+            # extra leading slashes need to be stripped to avoid confusion
+            # with a relative URL


confusion with a protocol-relative URL? [as opposed to a host-relative URL]
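The distinction being asked about can be illustrated with urlsplit(): a reference starting with '//' is a network-path (protocol-relative) reference whose next component is the authority, while one starting with a single '/' is an absolute-path (host-relative) reference. The sample strings are illustrative only.

```python
from urllib.parse import urlsplit

references = ('//a', '/a', 'a')
parsed = {ref: urlsplit(ref) for ref in references}
for ref, parts in parsed.items():
    # Only the '//a' form ends up with a netloc; the others are pure paths.
    print(repr(ref), 'netloc =', repr(parts.netloc), 'path =', repr(parts.path))
```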

wip: alternative fix for pythongh-87389

4f76c44

still needs test for urlunsplit() change

nascheme added the type-security (A security issue) and 3.11 (only security fixes) labels Jun 16, 2022

nascheme assigned gpshead and nascheme Jun 16, 2022

bedevere-bot added the awaiting core review label Jun 16, 2022

nascheme mentioned this pull request Jun 16, 2022

gh-87389: Fix an open redirection vulnerability in http.server. #93879

Merged

nascheme added 2 commits June 16, 2022 10:52

Don't strip slashes if not building relative URL.

06b3879

Add unit test for urlunsplit relative URL case.

00a3a92

gpshead requested changes Jun 16, 2022

bedevere-bot removed the awaiting core review label Jun 16, 2022

bedevere-bot added the awaiting changes label Jun 16, 2022

gpshead added the needs backport to 3.7, needs backport to 3.8 (only security fixes), needs backport to 3.9 (only security fixes), needs backport to 3.10 (only security fixes), and needs backport to 3.11 (only security fixes) labels Jun 16, 2022

nascheme added 4 commits June 16, 2022 12:10

sanitize paths that can be confused as scheme

a8e1cc2

Add blurb.

83c8332

Improve name and comment for _get_redirect_url().

f99e80b

Add urllib.parse.pathsplit() function.

e542578

nascheme changed the title from "gh-87389: alternative fix, WIP" to "gh-87389: avoid treating path as URI with netloc" Jun 16, 2022

bedevere-bot added awaiting change review and removed awaiting changes labels Jun 16, 2022

bedevere-bot requested a review from gpshead June 16, 2022 19:55

nascheme added 2 commits June 16, 2022 12:58

Fix blurb formatting, add note about pathsplit().

b7b0b15

Improve comments and docs.

6915331

The pathsplit() function will work correctly on relative paths too so don't say "absolute paths". Improve comment for _get_redirect_url().

nascheme added 2 commits June 16, 2022 13:12

Add basic unit test for pathsplit().

915451c

Fix markup in blurb.

952a0f4

gpshead reviewed Jun 16, 2022

nascheme added 5 commits June 16, 2022 14:27

Use pathsplit() in translate_path().

899f512

This avoids "manual parsing" of the Request-URI part of the request and matches what _get_redirect_url() does.

Improved unit tests from gps.

a00656c

Add test for Request-URI containing scheme.

8985853

Make _get_redirect_url() into a method.

f1f94ae

Possible that someone could override this so a method is nicer.

Further cleanups, remove urllib.parse.pathsplit().

8a34cd0

Since pathsplit() doesn't seem like a generally useful public API, remove it. Instead, add a _request_path_split() method. This ensures that the redirect logic and the translate_path() method use the same path parsing.

reword news

23d4b56

gpshead removed their assignment May 19, 2023

gpshead removed the 3.12 (bugs and security fixes) label May 19, 2023

vadmium reviewed Apr 28, 2024

Merge branch 'main' into urlunsplit_relative

7a71381

bedevere-app bot mentioned this pull request May 14, 2024

[security] CVE-2021-28861: http.server: Open Redirection if the URL path starts with // #87389

Closed

Merge branch 'main' into urlunsplit_relative

d18bbd9
