Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excess leading path separators causes ConnectionPool.urlopen to parse URL as host & port #3352

Open
sigmavirus24 opened this issue Feb 22, 2024 · 4 comments

Comments

@sigmavirus24
Copy link
Contributor

Subject

Originally reported in Requests (psf/requests#6643) but also applicable here, : is a valid character in a path per RFC3986. However, if a path is used in urllib3 where the path starts with // then urllib3 interprets this as a scheme-less host and attempts to parse the hostname and port but the part after the : may not be an integer.

Environment

Describe your environment.
At least, paste here the output of:

import platform
import ssl
import urllib3

print("OS", platform.platform())
print("Python", platform.python_version())
print(ssl.OPENSSL_VERSION)
print("urllib3", urllib3.__version__)
OS Linux-6.5.10-200.fc38.x86_64-x86_64-with-glibc2.38
Python 3.12.1
OpenSSL 3.1.1 30 May 2023
urllib3 2.2.1

Steps to Reproduce

import urllib3

urllib3.PoolManager().urlopen(method="GET", url="http://127.0.0.1:10000//v:h")

Expected Behavior

If you have a server running on that port, a 200 OK, otherwise a connection error.

Actual Behavior

  File "/home/sigmavirus24/sandbox/requests/.tox/py312/lib/python3.12/site-packages/urllib3/util/url.py", line 425, in parse_url
    host, port = _HOST_PORT_RE.match(host_port).groups()  # type: ignore[union-attr]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'groups'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sigmavirus24/sandbox/requests/.tox/py312/lib/python3.12/site-packages/urllib3/poolmanager.py", line 444, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sigmavirus24/sandbox/requests/.tox/py312/lib/python3.12/site-packages/urllib3/connectionpool.py", line 711, in urlopen
    parsed_url = parse_url(url)
                 ^^^^^^^^^^^^^^
  File "/home/sigmavirus24/sandbox/requests/.tox/py312/lib/python3.12/site-packages/urllib3/util/url.py", line 451, in parse_url
    raise LocationParseError(source_url) from e
urllib3.exceptions.LocationParseError: Failed to parse: //v:h

Part of this is that u.request_uri is properly parsed and returned as //v:h but ConnectionPool.urlopen seems to be doing the same thing that PoolManager.urlopen does and the thing that PoolManager.urlopen is trying to delegate to it - i.e., it:

If, however, we can (as early as possible) normalize the URL then we'll be in a better place. For example, this doesn't happen if you do this on the ConnectionPool level directly:

>>> connpool = urllib3.HTTPConnectionPool(host='localhost', port=10000)
>>> connpool.urlopen(method='GET', url='http://localhost:10000//v:h')
...
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=10000): Max retries exceeded with url: http://localhost:10000//v:h (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcb11ad7020>: Failed to establish a new connection: [Errno 111] Connection refused'))

And so I think that in PoolManager specifically, we need to trim those leading /. There are a few ways to do this:

s = '//v:h'
print(f'/{s.lstrip("/")}')
# => '/v:h'
print(re.sub('^/+', '/', s))
# => '/v:h'
@sigmavirus24
Copy link
Contributor Author

I'm also worried about how this is handled in when handling redirects in the retry logic (e.g., server returns a Location: //v:h or Location: //host:port//v:h but I haven't investigated that

@crabhi
Copy link

crabhi commented Jun 26, 2024

Hi! I investigated a breakage introduced after requests upgrade which led me here (see psf/requests#6711).

I think the root cause of the issue is here in urrlib3 and it can't be effectively fixed in the requests package. If I read RFC3986 correctly, empty path segments are allowed and double slash at the beginning of Path is forbidden only if the authority (host, port) is missing. Also, there are some servers that vary their behaviour depending on the count of leading slashes in the URL, most notably AWS S3.

Therefore, the correct fix IMO is not to strip the URL but rather to parse the double slash into the "Path".

@sigmavirus24
Copy link
Contributor Author

@crabhi please let's keep discussion to one place for now

@crabhi
Copy link

crabhi commented Jun 26, 2024

Sure. I've posted the comment above before your first reply. I agree to keep the current discussion in the other issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants