-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Should file URLs have opaque hostnames? #599
Comments
Node.js's url.pathToFileURL function, when run on Windows, calls domainToASCII on the provided NetBIOS machine name, which would return the empty string if the NetBIOS name has a space. That's probably not the correct behavior, though no one seems to have complained so far. Would someone be able to test the Windows UrlCreateFromPath function? I'd be curious to know how it treats This should do.#include <iostream>
#include <shlwapi.h>
#include <winerror.h>
#include <wininet.h>
int main() {
char out[INTERNET_MAX_URL_LENGTH + 1];
DWORD out_len = INTERNET_MAX_URL_LENGTH;
HRESULT res = UrlCreateFromPathA("\\\\some computer\\vol\\file", out, &out_len, NULL);
std::cout << res == S_OK << std::endl;
out[out_len] = '\0';
std::cout << out << std::endl;
} RFC 8089 provides some useful background information on file URLs, but unfortunately doesn't consider the question of weird characters in hostname or IDNs. RFC 8089 uses the grammar in RFC 3986 to describe file hosts though, and the I'd be inclined to allow some sort of opaque hostnames for file URLs as they already have a lot of exceptions in the spec, and hostname in file URLs is basically a legacy feature to support Windows anyway. |
I also found Microsoft's naming conventions for NetBIOS computer names. Unfortunately it doesn't describe exactly whether spaces are allowed (it's not an alphanumeric character which is explicitly allowed, but also doesn't appear in the list of disallowed characters). This ambiguity is also documented on Microsoft's page NetBIOS Name Syntax. However, RFC 1002 has an example for the NetBIOS name |
Interestingly, Chrome actually doesn't treat new URL('file://félicit ations.fr/test.txt').href // ⇒ file://xn--flicit%20ations-bnb.fr/test.txt
new URL('https://exam ple.com').href // ⇒ https://exa%20mple.com/ That's certainly one way around giving file URLs special handling, but no one other than Chrome seems to support this… |
So I finally had time to set up a Windows VM, and I've found that
Unfortunately, any browser or other application which conforms to this standard would fail to parse these URLs. Since NetBIOS hostnames are documented as being case-sensitive, we need to be conservative and preserve the hostname as it was given (i.e. no IDNA). It might be okay to detect and canonicalise IP addresses, but I'm not sure, and in the worst case we can just preserve it in the path and let the system deal with it. So yes, I'm quite sure that file URLs must support opaque hostnames. If we can find a reliable way to decide that it should be interpreted as an IP address or domain, that would be a nice improvement, but they at least need to support percent-encoding and uppercase characters. |
Just as an aside, the standard does ask to parse file hosts as opaque (non-special) in step 3.1. of the file host state. But there are some tests that disagree. |
@alwinb I don't think that is true - whilst it does say "Let host be the result of host parsing buffer with url is not special.", that is the same formulation used in the regular host/hostname state. "url is not special" is a link, or a computed property of When I first tried to implement the spec, it took me far longer to parse that line than I'd like to admit. It's very awkwardly worded. |
Aah now I see! That makes sense, thank you. |
I am doing some work in the Chromium project to try to bring the handling of space characters more in line with the URL parsing standard, and to complete that it would be very helpful if we had an idea of the way this issue is likely to be resolved. As @TimothyGu noted here, Chromium allows spaces in hostnames for both I've been trying to work around this by drafting a Chromium change that only disallows spaces for non- So to decide how we should proceed in bringing Chromium closer to standards compliance, it would be helpful if this issue could be moved towards resolution. Would it be possible for the URL standards experts to comment on which way this should go? |
I want to be clear I am speaking as neither a URL standard editor nor in my capacity as a Chromium engineer. So I don't have any relevant decision power for this. But maybe I am a URL standard expert... I personally find the arguments in this thread compelling that the current standard is not good. I know Chromium engineers (/cc @ricea @hayatoito) have already been reluctant to change file: URL handling (e.g. excluding it from Interop 202X efforts) and I suspect this sort of issue would not help them overcome that reluctance. It's less clear to me what the path forward is. If I understand the issue correctly, we have a few options:
I'm hearing that on balance @karwa believes that treating file: URL hostnames as opaque is the best option, even if that means we don't get canonicalization for IP addresses. Do others agree with that? @annevk, how do you feel about this issue, especially given WebKit's position as a browser that has successfully shipped the URL Standard's parsing? |
When it comes to host canonicalisation in file URLs, it's worth pointing out that this is only relevant on Windows, and while I expect it is could be useful for IPv4, the canonicalisation by the standard isn't even correct on Windows. I can give two examples:
With regards to the latter, So yes, I think it's better to just say these things are opaque from the URL standard's perspective, which allows for this kind of implementation-defined/platform-specific logic. I also think we should document how browsers and other applications are supposed to convert file URLs <-> paths on the major operating systems, since there's lots of divergence there as well. That's filed as whatwg/fetch#1338
This would not allow spaces. (The comment showed that happening in Chrome because Chrome's handling of spaces is yet to be aligned with this standard. It fails on the live viewer: Live viewer) |
I wouldn't mind changing new URL("file://host with spaces/path") from a parsing failure to returning file://host%20with%20spaces/path. |
Percent encoding spaces (and potentially other characters) in file hostnames is something that appeals to me. |
As Alex said above it might be okay to change this. My main worry is that we still have quite a few |
The standards-compliant behavior is to forbid space characters in URLs, but currently Chromium allows them. This causes Chromium to fail several tests included in the Interop2024 'HTTPS URLs for WebSocket' [1] and 'URL' [2] focus areas. A difficulty with removing support for spaces altogether is that they are used in the host part in Windows file: URLs [3]. So in this CL, forbid spaces only for non-file URLs. Since the scheme is not available during host parsing, this requires some plumbing changes, adding a new CanonMode to special-case file: URLs. In order to ensure that behavior for file: URLs is not changed, it was necessary to audit all callers of the URL canonicalization functions that could be passed a file: URL. The results of this are documented in [4]. In most cases the callers do not use file: URLs. In a few cases it was necessary to add a special check to handle file URLs. The URLPattern code is a special case in that there was previously an attempt to implement it with more standards-based URL behavior; a call to ContainsForbiddenHostnameCodePoint [5] prevents URLPatterns being created for hostnames containing space characters, regardless of URL scheme. However this check is not applied consistently, so in some scenarios URLPatterns can succeed in matching against URLs with invalid characters; see the new test cases added to urlpatterntestdata.json in [6]. That issue will be fixed for spaces when this change is enabled, and applying this new behavior even for file: URLs should be a smaller compat risk for URLPattern since URLPattern already attempts to exclude spaces for all URLs in most cases. In this CL, the changes are kept behind a flag. See [6] for the full test fallout from enabling the flag. In order to include some test coverage along with this CL, the flag is enabled in a ScopedFeatureList in url_canon_unittest.cc. [1] https://wpt.fyi/results/websockets?label=master&label=experimental&aligned&view=interop&q=label%3Ainterop-2024-websockets [2] https://wpt.fyi/results/url?label=master&label=experimental&aligned&view=interop&q=label%3Ainterop-2023-url [3] whatwg/url#599 [4] https://docs.google.com/document/d/1VwnOOCa2LZR6hsqz64m-VrJMiWmPLHl_Cbfy1XS3JaA/edit?usp=sharing [5] https://source.chromium.org/chromium/chromium/src/+/main:components/url_pattern/url_pattern_util.cc;l=142;drc=b7979b7a71fa422fd46d72f6c4a6efd6bd121863 [6] https://chromium-review.googlesource.com/c/chromium/src/+/5753305 Bug: 40256677, 40124263, 325979102 Change-Id: I0ccd3a4791b89774ec7027a1d8d5910665955c9c Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5908422 Reviewed-by: Ted Choc <tedchoc@chromium.org> Reviewed-by: Hayato Ito <hayato@chromium.org> Reviewed-by: Maks Orlovich <morlovich@chromium.org> Commit-Queue: Dan Clark <daniec@microsoft.com> Cr-Commit-Position: refs/heads/main@{#1365710}
I've been trying to port Chromium's file path <-> file URL utilities to a project conforming to the latest standard.
As far as I have been able to tell (it's a large codebase and I'm not at all familiar with it), Chromium turns Windows UNC paths in to file URLs with hostnames, and those hostnames may include percent-encoding (e.g.
\\some computer\foo\bar.txt
becomesfile://some%20computer/foo/bar.txt
).That is not allowed by this standard: the hostnames of file URLs are domains (they are even encoded with IDNA), and may not contain raw spaces or percent-encoding.
I don't think it is expected that a file URL's host must only be a domain or IP address; it might be better for them to have opaque hostnames, so they can contain percent-encoding and other characters as allowed by whatever mechanism resolves them.
The text was updated successfully, but these errors were encountered: