Excess leading path separators causes ConnectionPool.urlopen to parse URL as host & port #3352

sigmavirus24 · 2024-02-22T12:32:53Z

Subject

Originally reported in Requests (psf/requests#6643) but also applicable here, : is a valid character in a path per RFC3986. However, if a path is used in urllib3 where the path starts with // then urllib3 interprets this as a scheme-less host and attempts to parse the hostname and port but the part after the : may not be an integer.

Environment

Describe your environment.
At least, paste here the output of:

import platform
import ssl
import urllib3

print("OS", platform.platform())
print("Python", platform.python_version())
print(ssl.OPENSSL_VERSION)
print("urllib3", urllib3.__version__)

OS Linux-6.5.10-200.fc38.x86_64-x86_64-with-glibc2.38
Python 3.12.1
OpenSSL 3.1.1 30 May 2023
urllib3 2.2.1

Steps to Reproduce

import urllib3

urllib3.PoolManager().urlopen(method="GET", url="http://127.0.0.1:10000//v:h")

Expected Behavior

If you have a server running on that port, a 200 OK, otherwise a connection error.

Actual Behavior

  File "/home/sigmavirus24/sandbox/requests/.tox/py312/lib/python3.12/site-packages/urllib3/util/url.py", line 425, in parse_url
    host, port = _HOST_PORT_RE.match(host_port).groups()  # type: ignore[union-attr]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'groups'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sigmavirus24/sandbox/requests/.tox/py312/lib/python3.12/site-packages/urllib3/poolmanager.py", line 444, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sigmavirus24/sandbox/requests/.tox/py312/lib/python3.12/site-packages/urllib3/connectionpool.py", line 711, in urlopen
    parsed_url = parse_url(url)
                 ^^^^^^^^^^^^^^
  File "/home/sigmavirus24/sandbox/requests/.tox/py312/lib/python3.12/site-packages/urllib3/util/url.py", line 451, in parse_url
    raise LocationParseError(source_url) from e
urllib3.exceptions.LocationParseError: Failed to parse: //v:h

Part of this is that u.request_uri is properly parsed and returned as //v:h but ConnectionPool.urlopen seems to be doing the same thing that PoolManager.urlopen does and the thing that PoolManager.urlopen is trying to delegate to it - i.e., it:

reparses the URI (

urllib3/src/urllib3/connectionpool.py

Line 711 in d4ffa29

parsed_url = parse_url(url)

)
and in reparsing it tries to normalize the host (

urllib3/src/urllib3/util/url.py

Line 425 in d4ffa29

host, port = _HOST_PORT_RE.match(host_port).groups() # type: ignore[union-attr]

)

If, however, we can (as early as possible) normalize the URL then we'll be in a better place. For example, this doesn't happen if you do this on the ConnectionPool level directly:

>>> connpool = urllib3.HTTPConnectionPool(host='localhost', port=10000)
>>> connpool.urlopen(method='GET', url='http://localhost:10000//v:h')
...
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=10000): Max retries exceeded with url: http://localhost:10000//v:h (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcb11ad7020>: Failed to establish a new connection: [Errno 111] Connection refused'))

And so I think that in PoolManager specifically, we need to trim those leading /. There are a few ways to do this:

s = '//v:h'
print(f'/{s.lstrip("/")}')
# => '/v:h'
print(re.sub('^/+', '/', s))
# => '/v:h'

The text was updated successfully, but these errors were encountered:

sigmavirus24 · 2024-02-22T12:37:43Z

I'm also worried about how this is handled in when handling redirects in the retry logic (e.g., server returns a Location: //v:h or Location: //host:port//v:h but I haven't investigated that

crabhi · 2024-06-26T11:47:05Z

Hi! I investigated a breakage introduced after requests upgrade which led me here (see psf/requests#6711).

I think the root cause of the issue is here in urrlib3 and it can't be effectively fixed in the requests package. If I read RFC3986 correctly, empty path segments are allowed and double slash at the beginning of Path is forbidden only if the authority (host, port) is missing. Also, there are some servers that vary their behaviour depending on the count of leading slashes in the URL, most notably AWS S3.

Therefore, the correct fix IMO is not to strip the URL but rather to parse the double slash into the "Path".

sigmavirus24 · 2024-06-26T14:19:41Z

@crabhi please let's keep discussion to one place for now

crabhi · 2024-06-26T14:24:43Z

Sure. I've posted the comment above before your first reply. I agree to keep the current discussion in the other issue.

This was referenced Feb 22, 2024

Leading slash in uri followed by column fails psf/requests#6643

Closed

Trim excess leading path separators psf/requests#6644

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Excess leading path separators causes ConnectionPool.urlopen to parse URL as host & port #3352

Excess leading path separators causes ConnectionPool.urlopen to parse URL as host & port #3352

sigmavirus24 commented Feb 22, 2024

sigmavirus24 commented Feb 22, 2024

crabhi commented Jun 26, 2024

sigmavirus24 commented Jun 26, 2024

crabhi commented Jun 26, 2024

Excess leading path separators causes ConnectionPool.urlopen to parse URL as host & port #3352

Excess leading path separators causes ConnectionPool.urlopen to parse URL as host & port #3352

Comments

sigmavirus24 commented Feb 22, 2024

Subject

Environment

Steps to Reproduce

Expected Behavior

Actual Behavior

sigmavirus24 commented Feb 22, 2024

crabhi commented Jun 26, 2024

sigmavirus24 commented Jun 26, 2024

crabhi commented Jun 26, 2024