Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-87389: avoid treating path as URI with netloc #93894

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

nascheme
Copy link
Member

@nascheme nascheme commented Jun 16, 2022

I believe this is a more correct fix for gh-87389 and may fix security bugs in other applications using urlunsplit(). We should not be confusing the path argument as a netloc, IMHO. I fixed http.server by not using urllib.parse on the path. For http.server, that part of the HTTP request is not treated as a full URL but instead a path + optional query + optional fragment. So just parsing it "by hand" and not using urllib.parse avoids the bug.

still needs test for urlunsplit() change
Lib/urllib/parse.py Outdated Show resolved Hide resolved
Lib/urllib/parse.py Outdated Show resolved Hide resolved
Lib/http/server.py Outdated Show resolved Hide resolved
@bedevere-bot
Copy link

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@nascheme nascheme changed the title gh-87389: alternative fix, WIP gh-87389: avoid treating path as URI with netloc Jun 16, 2022
@nascheme
Copy link
Member Author

I have made the requested changes; please review again.

I added urllib.parse.pathsplit() as suggested. Not sure about the name but it seems consistent with the module. Hopefully I got the doc markup correct.

@bedevere-bot
Copy link

Thanks for making the requested changes!

@gpshead: please review the changes made to this pull request.

The pathsplit() function will work correctly on relative paths too so
don't say "absolute paths".  Improve comment for _get_redirect_url().
# reported in gh-87389, a path starting with a double slash should not be
# treated as a relative URI. Also, a path with a colon in the first
# component could also be parsed wrongly.
parts = urllib.parse.pathsplit(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately BaseHTTPRequestHandler self.path isn't guaranteed to be only a pathname by anything.

GET http://netloc/path/to/thing HTTP/1.0

Is a valid HTTP query to many http servers. Our existing http server code ignores the scheme and netloc on that and serves up /path/to/thing properly. This code might break things doing that? ie: if http://netloc winds up in the returned path, the it seems like that'd move through pathsplit and urlunsplit properly only by chance rather than design just given the name and purpose of the pathsplit API vs what was just handed to it as input.

Check out my updated test in #87389.

I have no idea if anything depends on this as it should be unusual for any client not pretending to be talking to an HTTP proxy to make requests that way as that isn't normal per the http/1.0 and 1.1 RFCs but as it worked before and other servers like Apache continue to support it, I don't think we can break that - at least not in a bugfix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my reading of the RFC, GET http://netloc/path/to/thing HTTP/1.0 should technically be allowed. However, http.server doesn't handle it, before or after this PR. Instead, it tries to use the second word as an absolute file path. I feel like we should not attempt to fix http.server to handle the above. For HTTP 1.0, using only a file path was required. For curl I notice that even with the --http1.1 flag, it passes a path and not the URL with the scheme+netloc.

The bug causing #87389, IMHO, is that http.server was not consistent in how it was treating self.path. The translate_path() method was treating it as a path and not a URL. But, the redirect to add a trailing slash (done only if a folder with that name was found), parsed it as allowing a full URL. It needs to do it one way or the other consistently. The _request_path_split() method I added does this. It doesn't make sense to add this to urllib.parse since what http.server is doing is not modern HTTP.

Lib/urllib/parse.py Outdated Show resolved Hide resolved
Lib/test/test_urlparse.py Outdated Show resolved Hide resolved
Lib/test/test_httpservers.py Show resolved Hide resolved
This avoids "manual parsing" of the Request-URI part of the request and
matches what _get_redirect_url() does.
Possible that someone could override this so a method is nicer.
Since pathsplit() doesn't seem like a generally useful public API,
remove it.  Instead, add a _request_path_split() method.  This ensures
that the redirect logic and the translate_path() method use the same
path parsing.
@ambv
Copy link
Contributor

ambv commented Jun 22, 2022

@gpshead what do you think about landing this change for 3.12 only as an enhancement?

@gpshead gpshead added type-feature A feature request or enhancement stdlib Python modules in the Lib dir 3.12 bugs and security fixes and removed needs backport to 3.7 needs backport to 3.8 only security fixes needs backport to 3.9 only security fixes needs backport to 3.10 only security fixes 3.11 only security fixes needs backport to 3.11 only security fixes labels Jul 3, 2022
@gpshead
Copy link
Member

gpshead commented Jul 3, 2022

@gpshead what do you think about landing this change for 3.12 only as an enhancement?

Yeah I think this landing in 3.12 as a feature makes sense. this PR branch needs syncing now that the other PR has gone in.

@nascheme
Copy link
Member Author

nascheme commented Jul 4, 2022

A few thoughts about this. I can rework the PR so that changes to urlib.parse are separated from the httpserver change. The fix made is 3.11 is sufficient for fixing the security bug so the change in this PR isn't strictly required.

The change to urllib.parse.urlunsplit needs careful consideration. Not merging it into 3.11 was the right move. Thinking about it more, I'm having doubts that we can change this. Maybe we need a new API or maybe we just can't change it. It seems almost certain that some code expects the existing (IMHO insecure) behaviour of urlunsplit(). E.g. putting the full URL in the path arg, like this::

>>> urllib.parse.urlunsplit(('', '', 'http://foo.bar/baz', '', ''))
'http://foo.bar/baz'

We could try changing it an alpha and see what breaks. Or, we can try to analyze publicly available code and see how urlunsplit() is used.

Ideally we should analyze all the arguments of urlunsplit() and determine if we want to do sanity checking or cleanup on them. My change adds checking of the path to ensure it doesn't get confused with scheme or netloc. I suspect there are other sanity checks that could be done. E.g. preventing special characters like ? and # in parameters. Maybe we need urlunsplit_safe().

@gpshead
Copy link
Member

gpshead commented Jul 18, 2022

Thinking about it more, I'm having doubts that we can change this. Maybe we need a new API or maybe we just can't change it. It seems almost certain that some code expects the existing (IMHO insecure) behaviour of urlunsplit(). E.g. putting the full URL in the path arg

Agreed, it's a challenge. I'm not personally motivated to try and push for this change to the level of doing code analysis of code-at-large's urlunsplit API calls.

I like your thinking that further analysis of all the safety checks required being good. There are probably useful things to learn from what other libraries (in any language) do for URL construction from their parts libraries. Adding this is always easier as a _safe API or with a bool flag to enable strict/pedantic/safe/bikeshedcolor mode.

I suggest this get tracked as its own Enhancement request Issue. This PR can presumably be tied to that and closed for now given more not yet fully defined work is needed if it is going to be done at all.

@gpshead gpshead removed their assignment May 19, 2023
@gpshead gpshead removed the 3.12 bugs and security fixes label May 19, 2023
Copy link
Member

@vadmium vadmium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seem to be three behaviour changes proposed:

  1. Changing urlunsplit(('', '', '//path', '', '')) to return '/path' instead of '//path'

Another option is to prefix with double slash, representing an empty host e.g. '////path'. This is proposed in issue #78457 and pull request #113563. I think I prefer that four-slash option, because it also fixes some urlspliturlunsplit round-trip cases.

  1. Changing urlunsplit(('', '', 'colon:path', '', '')) → './colon:path'

This seems a reasonable change, and it is kind of suggested in RFC 3986. (Another option might be to encode the first colon, and return 'colon%3Apath'.)

  1. SimpleHTTPRequestHandler’s handling of GET https://example.net/dir

This is a legal HTTP 1.0 and 1.1 request, but is mainly for proxy servers, which is not what SimpleHTTPRequestHandler does. In this case the server looks up https:/example.net/dir as a path in its filesystem (which is not in spirit of HTTP), and decides to redirect with a trailing slash.

Currently it looks like the code sends Location: https://example.net/dir/. I don’t think there is anything really wrong with that.

The proposed changes would send Location: ./https://example.net/dir/. This new redirect is a path-relative URL. The base URL is supposed to be the original target https://example.net/dir, so the redirect would resolve to https://example.net/https://example.net/dir/, which is not intended.

If you want to fix anything in the HTTP server, I would make the server ignore the scheme and authority components, and just look up the path component. But I don’t think anyone is complaining about that, so it may not be worth fixing.

If the urllib.parse changes are too disruptive, perhaps a deprecation warning is the best way forward, and either add an opt-in way to get the new behaviour, or change the warning to an exception in the future?

@@ -491,14 +491,32 @@ def urlunparse(components):
url = "%s;%s" % (url, params)
return _coerce_result(urlunsplit((scheme, netloc, url, query, fragment)))

# Returns true if path can confused with a scheme. I.e. a relative path
# without leading dot that includes a colon in the first component.
_is_scheme_like = re.compile(r'[^/.][^/]*:').match
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the special allowance for a leading dot? Is there a test case for it? Yes, a scheme cannot start with a dot, but a path-noscheme component of .: is no more legal than https: according to RFC 3986.

# expected result is a relative URL without netloc and scheme
(('', 'a', '', '', ''), '//a'),
# extra leading slashes need to be stripped to avoid confusion
# with a relative URL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confusion with a protocol-relative URL? [as opposed to a host-relative URL]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting change review stdlib Python modules in the Lib dir type-feature A feature request or enhancement type-security A security issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants