gh-87389: avoid treating path as URI with netloc #93894

Open: wants to merge 19 commits into base: main
Changes from 11 commits
12 changes: 12 additions & 0 deletions Doc/library/urllib.parse.rst
@@ -339,6 +339,18 @@ or on combining URL components into a URL string.

 .. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser

+.. function:: pathsplit(path)
+
+   Parse a path that includes an optional query and fragment. Like
+   :func:`urlsplit`, this function returns a 5-item :term:`named tuple`::
+
+      (addressing scheme, network location, path, query, fragment identifier).
+
+   The scheme and network location components will always be empty.
+
+   .. versionadded:: 3.11
+
+
 .. function:: urlunsplit(parts)

    Combine the elements of a tuple as returned by :func:`urlsplit` into a
24 changes: 19 additions & 5 deletions Lib/http/server.py
@@ -678,13 +678,10 @@ def send_head(self):
         path = self.translate_path(self.path)
         f = None
         if os.path.isdir(path):
-            parts = urllib.parse.urlsplit(self.path)
-            if not parts.path.endswith('/'):
+            new_url = _get_redirect_url(self.path)
+            if new_url:
                 # redirect browser - doing basically what apache does
                 self.send_response(HTTPStatus.MOVED_PERMANENTLY)
-                new_parts = (parts[0], parts[1], parts[2] + '/',
-                             parts[3], parts[4])
-                new_url = urllib.parse.urlunsplit(new_parts)
                 self.send_header("Location", new_url)
                 self.send_header("Content-Length", "0")
                 self.end_headers()
@@ -881,6 +878,23 @@ def guess_type(self, path):
         return 'application/octet-stream'


+def _get_redirect_url(path):
+    """Returns URL with trailing slash on path, if required. If not required,
+    returns None.
+    """
+    # Previous versions of this module used urllib.parse.urlsplit() here.
+    # However, the 'path' is not truly a URI in that it can't have a scheme or
+    # netloc. We need to avoid parsing it incorrectly. For example, as
+    # reported in gh-87389, a path starting with a double slash should not be
+    # treated as a relative URI. Also, a path with a colon in the first
+    # component could also be parsed wrongly.
+    parts = urllib.parse.pathsplit(path)
Review comment (Member):

Unfortunately, nothing guarantees that BaseHTTPRequestHandler's self.path is only a pathname.

GET http://netloc/path/to/thing HTTP/1.0

is a valid HTTP request to many HTTP servers. Our existing HTTP server code ignores the scheme and netloc in that request and serves up /path/to/thing properly. Might this code break that? That is, if http://netloc winds up in the returned path, then it seems it would move through pathsplit and urlunsplit properly only by chance rather than by design, given the name and purpose of the pathsplit API versus what was just handed to it as input.

Check out my updated test in #87389.

I have no idea whether anything depends on this. It should be unusual for any client that is not pretending to talk to an HTTP proxy to make requests that way, as that isn't normal per the HTTP/1.0 and 1.1 RFCs. But since it worked before and other servers like Apache continue to support it, I don't think we can break that, at least not in a bugfix.
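
To make the concern concrete, here is a minimal sketch comparing how `urllib.parse.urlsplit` and a path-only split see an absolute-URI request target. `split_path_only` is a hypothetical helper written for this illustration; it only mirrors the `partition` logic shown in this diff and is not the PR's `pathsplit`. The host `netloc` is the placeholder from the request line above.

```python
from urllib.parse import urlsplit


def split_path_only(target):
    """Hypothetical helper: split off '#fragment' and '?query' and treat
    everything else as the path, never as a scheme or netloc."""
    rest, _, fragment = target.partition('#')
    path, _, query = rest.partition('?')
    return path, query, fragment


target = 'http://netloc/path/to/thing'

# urlsplit() recognises the scheme and the network location.
parts = urlsplit(target)
print(parts.scheme, parts.netloc, parts.path)

# A path-only split keeps the whole absolute URI in the path component.
print(split_path_only(target))
```

With a path-only split, the scheme and host stay inside the "path", so whatever consumes the result has to decide explicitly how to treat them.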

Reply (Member Author):

Based on my reading of the RFC, GET http://netloc/path/to/thing HTTP/1.0 should technically be allowed. However, http.server doesn't handle it, before or after this PR. Instead, it tries to use the second word of the request line as an absolute file path. I feel we should not attempt to fix http.server to handle the above. For HTTP/1.0, using only a file path was required. For curl, I notice that even with the --http1.1 flag it passes a path and not the URL with the scheme+netloc.

The bug causing #87389, IMHO, is that http.server was not consistent in how it treated self.path. The translate_path() method treated it as a path and not a URL, but the redirect that adds a trailing slash (done only if a folder with that name was found) parsed it as if a full URL were allowed. It needs to do it one way or the other, consistently. The _request_path_split() method I added does this. It doesn't make sense to add this to urllib.parse, since what http.server is doing is not modern HTTP.
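
The inconsistency described above can be reproduced with the standard library alone. The sketch below contrasts the old redirect computation (`urlsplit` followed by `urlunsplit`, as in the removed lines of `send_head`) with a path-only variant. `old_redirect_url` and `path_only_redirect_url` are names invented for this example, `evil.example` is a placeholder host, and collapsing leading slashes is just one possible way to keep the result host-relative, not necessarily what the final patch does.

```python
from urllib.parse import urlsplit, urlunsplit


def old_redirect_url(request_path):
    """Mirror the removed send_head() logic: parse as a URL, append '/'."""
    parts = urlsplit(request_path)
    if parts.path.endswith('/'):
        return None
    return urlunsplit((parts.scheme, parts.netloc, parts.path + '/',
                       parts.query, parts.fragment))


def path_only_redirect_url(request_path):
    """Illustrative alternative: never interpret a scheme or netloc."""
    rest, _, fragment = request_path.partition('#')
    path, _, query = rest.partition('?')
    if path.endswith('/'):
        return None
    # Collapse leading slashes so the result cannot look like '//host/...'.
    if path.startswith('//'):
        path = '/' + path.lstrip('/')
    url = path + '/'
    if query:
        url += '?' + query
    if fragment:
        url += '#' + fragment
    return url


attack = '//evil.example/existing_directory'
print(old_redirect_url(attack))        # //evil.example/existing_directory/
print(path_only_redirect_url(attack))  # /evil.example/existing_directory/
```

The first result starts with `//`, which a browser reads as a redirect to another host; the second stays on the serving host.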

+    if parts.path.endswith('/'):
+        return None  # already has slash, no redirect needed
+    return urllib.parse.urlunsplit(('', '', parts.path + '/', parts.query,
+                                    parts.fragment))
+
+
 # Utilities for CGIHTTPRequestHandler

 def _url_collapse_path(path):
24 changes: 22 additions & 2 deletions Lib/test/test_httpservers.py
@@ -334,7 +334,7 @@ class request_handler(NoLogRequestHandler, SimpleHTTPRequestHandler):
         pass

     def setUp(self):
-        BaseTestCase.setUp(self)
+        super().setUp()
         self.cwd = os.getcwd()
         basetempdir = tempfile.gettempdir()
         os.chdir(basetempdir)
@@ -362,7 +362,7 @@ def tearDown(self):
             except:
                 pass
         finally:
-            BaseTestCase.tearDown(self)
+            super().tearDown()

     def check_status_and_reason(self, response, status, data=None):
         def close_conn():
@@ -418,6 +418,26 @@ def test_undecodable_filename(self):
         self.check_status_and_reason(response, HTTPStatus.OK,
                                      data=os_helper.TESTFN_UNDECODABLE)

+    def test_get_dir_redirect_location_domain_injection_bug(self):
+        """Ensure //evil.co/..%2f../../X does not put //evil.co/ in Location.
+
+        //domain/ in a Location header is a redirect to a new domain name.
+        https://github.com/python/cpython/issues/87389
+
+        This checks that a path resolving to a directory on our server cannot
+        resolve into a redirect to another server telling it that the
+        directory in question exists on the Referrer server.
+        """
+        os.mkdir(os.path.join(self.tempdir, 'existing_directory'))
+        attack_url = f'//python.org/..%2f..%2f..%2f..%2f..%2f../%0a%0d/../{self.tempdir_name}/existing_directory'
+        response = self.request(attack_url)
+        self.check_status_and_reason(response, HTTPStatus.MOVED_PERMANENTLY)
+        location = response.getheader('Location')
+        self.assertFalse(location.startswith('//'), msg=location)
+        self.assertEqual(location, attack_url[1:] + '/',
+                         msg='Expected Location: to start with a single / and '
+                         'end with a / as this is a directory redirect.')
+
     def test_get(self):
         #constructs the path relative to the root directory of the HTTPServer
         response = self.request(self.base_url + '/test')
29 changes: 29 additions & 0 deletions Lib/test/test_urlparse.py
@@ -1101,6 +1101,35 @@ def test_urlsplit_normalization(self):
                         with self.assertRaises(ValueError):
                             urllib.parse.urlsplit(url)

+    def test_urlunsplit_relative(self):
+        cases = [
+            # expected result is a relative URL without netloc and scheme
+            (('', 'a', '', '', ''), '//a'),
+            # extra leading slashes need to be stripped to avoid confusion
+            # with a relative URL
Review comment (Member):

Should this say "confusion with a protocol-relative URL" (as opposed to a host-relative URL)?
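
For readers unfamiliar with the distinction drawn here, a short illustration using `urljoin`: a reference beginning with `//` is protocol-relative (it replaces the host), while one beginning with a single `/` is host-relative (it keeps the host). The base URL and the host `a` are placeholders chosen for this example.

```python
from urllib.parse import urljoin, urlsplit

base = 'http://example.com/dir/page'

# Protocol-relative reference: keeps the scheme, replaces the host.
print(urljoin(base, '//a'))    # http://a
# Host-relative reference: keeps scheme and host, replaces the path.
print(urljoin(base, '/a'))     # http://example.com/a

# urlsplit() makes the same distinction when parsing.
print(urlsplit('//a').netloc)  # a
print(urlsplit('/a').path)     # /a
```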

+            (('', '', '//a', '', ''), '/a'),
+            (('', '', '///a', '', ''), '/a'),
+            # not relative so extra leading slashes don't need stripping since
+            # they don't cause confusion
+            (('http', 'x.y', '//a', '', ''), 'http://x.y//a'),
+            # avoid confusion with path containing colon
+            (('', '', 'a:b', '', ''), './a:b'),
+        ]
+        for parts, result in cases:
+            self.assertEqual(urllib.parse.urlunsplit(parts), result)
+
+    def test_pathsplit(self):
+        cases = [
+            ('//a', ('', '', '//a', '', '')),
+            ('a:b', ('', '', 'a:b', '', '')),
+            ('/a/b?x#y', ('', '', '/a/b', 'x', 'y')),
+            ('/a/b#y', ('', '', '/a/b', '', 'y')),
+            ('/a/b?x', ('', '', '/a/b', 'x', '')),
+        ]
+        for uri, result in cases:
+            self.assertEqual(urllib.parse.pathsplit(uri), result)
+
+
 class Utility_Tests(unittest.TestCase):
     """Testcase to test the various utility functions in the urllib."""
     # In Python 2 this test class was in test_urllib.
45 changes: 43 additions & 2 deletions Lib/urllib/parse.py
@@ -36,7 +36,7 @@
 __all__ = ["urlparse", "urlunparse", "urljoin", "urldefrag",
            "urlsplit", "urlunsplit", "urlencode", "parse_qs",
            "parse_qsl", "quote", "quote_plus", "quote_from_bytes",
-           "unquote", "unquote_plus", "unquote_to_bytes",
+           "unquote", "unquote_plus", "unquote_to_bytes", "pathsplit",
            "DefragResult", "ParseResult", "SplitResult",
            "DefragResultBytes", "ParseResultBytes", "SplitResultBytes"]

@@ -480,6 +480,29 @@ def urlsplit(url, scheme='', allow_fragments=True):
     v = SplitResult(scheme, netloc, url, query, fragment)
     return _coerce_result(v)

+# typed=True avoids BytesWarnings being emitted during cache key
+# comparison since this API supports both bytes and str input.
+@functools.lru_cache(typed=True)
+def pathsplit(path):
+    """Parse a path that includes an optional query and fragment.
+    The full syntax is:
+
+    <path>?<query>#<fragment>
+
+    The result is a named 5-tuple with fields set corresponding to the above.
+    It is either a SplitResult or SplitResultBytes object, depending on the
+    type of the path parameter.
+
+    Note that % escapes are not expanded.
+    """
+    path, _coerce_result = _coerce_args(path)
+    for b in _UNSAFE_URL_BYTES_TO_REMOVE:
+        path = path.replace(b, "")
+    path, _, fragment = path.partition('#')
+    path, _, query = path.partition('?')
+    v = SplitResult('', '', path, query, fragment)
+    return _coerce_result(v)
+
 def urlunparse(components):
     """Put a parsed URL back together again. This may result in a
     slightly different, but equivalent URL, if the URL that was parsed
@@ -491,14 +514,32 @@ def urlunparse(components):
         url = "%s;%s" % (url, params)
     return _coerce_result(urlunsplit((scheme, netloc, url, query, fragment)))

+# Returns true if path can be confused with a scheme. I.e. a relative path
+# without leading dot that includes a colon in the first component.
+_is_scheme_like = re.compile(r'[^/.][^/]*:').match
Review comment (Member):

Why the special allowance for a leading dot? Is there a test case for it? Yes, a scheme cannot start with a dot, but a path-noscheme component of .: is no more legal than https: according to RFC 3986.
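
The behaviour being asked about can be checked directly. This sketch applies the same pattern as `_is_scheme_like` to a few paths; `is_scheme_like` is a local name for the illustration, and the inputs are chosen here rather than taken from the PR's tests.

```python
import re
from urllib.parse import urlsplit

# Same pattern as in the diff: the first character is neither '/' nor '.',
# and a colon appears before any '/'.
is_scheme_like = re.compile(r'[^/.][^/]*:').match

for path in ['a:b', 'https:', './a:b', '.:', '.a:b', '/a:b', 'a/b:c']:
    print(f'{path!r:10} scheme-like: {bool(is_scheme_like(path))}')

# Why it matters: urlsplit() takes the text before the first colon in
# 'a:b' as a scheme, but leaves './a:b' entirely in the path.
print(urlsplit('a:b'))
print(urlsplit('./a:b'))
```

Only `'a:b'` and `'https:'` match; every input that starts with `.` or `/`, or whose first colon comes after a `/`, does not. In particular `'.:'` is not flagged, which is the case the question above raises.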


 def urlunsplit(components):
     """Combine the elements of a tuple as returned by urlsplit() into a
     complete URL as a string. The data argument can be any five-item iterable.
     This may result in a slightly different, but equivalent URL, if the URL that
     was parsed originally had unnecessary delimiters (for example, a ? with an
     empty query; the RFC states that these are equivalent)."""
-    scheme, netloc, url, query, fragment, _coerce_result = (
+    scheme, netloc, path, query, fragment, _coerce_result = (
                                           _coerce_args(*components))
+    if not scheme and not netloc:
+        # Building a relative URI. Need to be careful that path is not
+        # confused with scheme or netloc.
+        if path.startswith('//'):
+            # gh-87389: don't treat first component of path as netloc
+            url = '/' + path.lstrip('/')
+        elif _is_scheme_like(path):
+            # first component has colon, ensure it will not be parsed as the
+            # scheme
+            url = './' + path
+        else:
+            url = path
+    else:
+        url = path
     if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
         if url and url[:1] != '/': url = '/' + url
         url = '//' + (netloc or '') + url
@@ -0,0 +1,8 @@
+:mod:`http.server`: Fix an open redirection vulnerability in the HTTP server
+when a URI path starts with ``//``. Vulnerability discovered, and initial
+fix proposed, by Hamza Avvan. Change :func:`urllib.parse.urlunsplit` to
+sanitize the ``path`` argument in order to avoid mistaking the first component
+of the path for a net location or scheme. Add :func:`urllib.parse.pathsplit`
+function.
+
+Co-authored-by: Gregory P. Smith <gps@google.com>