Should file path <-> file URL handling be standardized? #1338

Open
karwa opened this issue Jul 13, 2021 · 3 comments
Labels
addition/proposal New features or enhancements needs implementer interest Moving the issue forward requires implementers to express interest

Comments

@karwa

karwa commented Jul 13, 2021

I've been trying to work out a good algorithm for this, looking at the 3 major browsers for guidance and compatibility (I would like any file URLs I create to work when given to a browser or other application). As far as I can tell, they all seem to have slightly different behaviour, especially when it comes to quirky edge-cases such as POSIX paths with invalid UTF-8, or Windows paths with invalid UTF-16, and the test-suites for each seem to be quite bare, without coverage for many edge-cases or negative tests.

I should preface this by saying that I'm not terribly familiar with any of these codebases. I've been digging through them in an attempt to understand how they might interpret file URLs I create, or how I should interpret file URLs they create, but they are enormous, complex projects (as I'm sure most readers will know all too well); it's certainly possible that I have some of these details wrong, and I would appreciate any corrections from those who are more familiar with them. Also, I'm not trying to call anybody out or criticise any of these projects - paths are just horrible; it's not surprising that there is some divergent behaviour when it comes to handling them.

This seems to be a feature that most modern browsers have to support, and I think there is value in ensuring that they all produce the same URLs for the same file paths, the same file paths for the same URLs, and that other applications can produce and consume file URLs in a way that is compatible with whichever browser the user prefers.

Quick recap of file paths

File paths on POSIX-y systems are semi-arbitrary bytes. The ASCII forward-slash (/, 0x2F) is reserved as a path separator, the ASCII nul (0x00) typically is reserved to mark the end of the byte-string, and components of one or two ASCII periods (., 0x2E) are references to the current or parent directory, respectively (I spent a while sweating over this, because periods are not typically listed as reserved characters, but it seems the Linux kernel interprets them, so I guess that's fine?). Otherwise, there are no restrictions on what a file or directory name may contain; they're just bytes, maybe it's not even correct to interpret them as "text" at all.
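
As a rough illustration of the rules above, here is a minimal Python sketch that checks whether a byte string could be a single file or directory name. The function name `is_valid_posix_component` is made up for this example, and rejecting `.` and `..` is a choice made here because they name existing directories rather than new entries.

```python
def is_valid_posix_component(name: bytes) -> bool:
    """Return True if `name` could be one file or directory name (hypothetical helper)."""
    # Empty names, and the current/parent directory references, are not usable names.
    if name in (b"", b".", b".."):
        return False
    # The separator (0x2F) and nul (0x00) are the only reserved bytes.
    return b"/" not in name and b"\x00" not in name


# Arbitrary non-UTF-8 bytes are accepted; only "/" and nul are rejected.
print(is_valid_posix_component(b"caf\xe9"))  # Latin-1 byte, not valid UTF-8
print(is_valid_posix_component(b"a/b"))
```

Note that the check never decodes the bytes, which matches the idea that these names are not necessarily text.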

Some systems, such as Apple's Darwin-based OSes, guarantee that file and directory names are valid UTF-8 text; the system just won't allow you to create a file whose name can't be represented in UTF-8. The filesystem may perform some Unicode normalization - so if you create a file using a certain sequence of bytes and list the directory, you may not see a file with the same sequence of bytes, but a Unicode-aware comparison function will confirm that a file of the given name exists.
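
To make the normalization point concrete, the following Python sketch compares two spellings of the same name. The helper `same_file_name` is hypothetical, and NFC is used purely for illustration; which normalization form a given filesystem applies is not something this sketch claims to know.

```python
import unicodedata


def same_file_name(a: str, b: str) -> bool:
    """Unicode-aware comparison (illustrative): compare after normalizing both names."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00e9"      # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# The byte sequences differ, but a normalizing comparison treats them as equal.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))
print(same_file_name(composed, decomposed))
```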

On Windows, file paths are UTF-16. There are more reserved characters, such as backslashes (\, 0x5C), question-marks and colons (?, 0x3F, and :, 0x3A, respectively), but it's better than plain POSIX because you at least have an encoding and the system tells us it's okay to interpret file names as text. That being said, Windows does not actually enforce that file names are well-formed UTF-16, so they can contain things like unpaired surrogates, which cannot be represented in well-formed UTF-8.
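
For readers unfamiliar with unpaired surrogates, here is a small Python sketch that scans a sequence of UTF-16 code units for them. The function name `has_unpaired_surrogate` is an assumption of this example, not an API from any of the projects discussed.

```python
from typing import Sequence


def has_unpaired_surrogate(units: Sequence[int]) -> bool:
    """Return True if the UTF-16 code units contain a surrogate without its partner."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            return True
        i += 1
    return False


print(has_unpaired_surrogate([0x61, 0xD83D, 0xDE00]))  # "a" plus a well-formed pair
print(has_unpaired_surrogate([0x61, 0xD800]))          # trailing high surrogate
```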

Chromium

The main routines appear to be net/base/FilePathToFileURL and net/base/FileURLToFilePath (implemented here). There are a handful of tests for Windows- and POSIX-style paths.

There are also lots of gaps: weird paths/URLs with repeated slashes don't appear to have much test coverage, and IPv6 addresses are not transliterated for Windows UNC paths. (UNC paths apparently can't contain IPv6 addresses, so a file URL like file://[2001:db8::]/foo should be turned into the UNC path \\2001-db8--.ipv6-literal.net\foo - the platform recognizes this domain and resolves it locally.)
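
The IPv6 transliteration described above could be sketched in Python roughly as follows. Both function names are hypothetical, and the sketch covers only the colon-to-dash rewrite mentioned in the text; it does not validate the address or handle zone identifiers.

```python
def ipv6_to_unc_host(host: str) -> str:
    """Rewrite a bracketed IPv6 URL host as an ipv6-literal.net name (illustrative)."""
    address = host.strip("[]")
    # Colons are not allowed in UNC host names, so each one becomes a dash.
    return address.replace(":", "-") + ".ipv6-literal.net"


def unc_path(host: str, url_path: str) -> str:
    """Join the rewritten host and a URL path such as "/foo" into a UNC path."""
    return "\\\\" + ipv6_to_unc_host(host) + url_path.replace("/", "\\")


print(unc_path("[2001:db8::]", "/foo"))
```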

As for non-UTF8 path components, Chromium's FilePath stores its path in the platform's native format - either a std::string for POSIX-y systems or a std::wstring for Windows. It includes an AsUTF8Unsafe() method which, on those POSIX systems where UTF-8 can't be assumed, invokes the standard C multi-byte-to-wide-char conversion routines followed by a wide-char-to-utf8 conversion. This assumes the byte string is some locale-dependent text, rather than arbitrary bytes. FilePathToFileURL calls this method, so the URLs it produces always decode as valid UTF-8 file and directory names, but the transcoding is not reversed by FileURLToFilePath. That said, if the file URL does contain invalid UTF-8, FileURLToFilePath will preserve it in the POSIX path. On Windows, the system's wide-char-to-utf8 conversion is used (rather than the C version), so it's unclear what ill-formed UTF-16, such as unpaired surrogates, will produce. I don't think it's tested?

Additionally, Chromium does some basic normalization, such as collapsing repeated slashes.
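
Collapsing repeated slashes is simple to sketch. The Python below is only an illustration of the idea, not Chromium's code, and `collapse_slashes` is a name invented here. It treats a leading double slash like any other run, which may not be what every platform wants.

```python
import re


def collapse_slashes(path: str) -> str:
    """Replace every run of two or more slashes with a single slash."""
    return re.sub(r"/{2,}", "/", path)


print(collapse_slashes("/usr//local///bin"))
```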

WebKit

The main routines appear to be URL::fileURLWithFileSystemPath and URL::fileSystemPath.

Reiterating that I'm not familiar with any of these codebases: while I could find tests for other parts of the WTF API, I couldn't find any tests for this particular functionality. I'm also not able to find a low-level "file path" type in WebKit. There seems to be an assumption that all paths are strings and can be interpreted as text, and WebKit's string type appears to support both Latin-1 and UTF-16-encoded strings. I may have this wrong, so I'd appreciate any corrections.

Nonetheless, these routines clearly reject non-local file URLs, and simply percent-encode a handful of special characters (?, #, and non-ASCII bytes) into the URL's path component. The percent-encoding routine will also get the contents of the given path as UTF-8 using the default, lenient conversion mode, which encodes unpaired surrogates from ill-formed UTF-16 as UTF-8.
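
To show what a lenient conversion of this kind produces, here is a Python sketch using the `surrogatepass` error handler, which encodes a lone surrogate as the three bytes a UTF-8 encoder would emit if surrogates were allowed. This is an analogy, not WebKit's routine: the name `lenient_percent_encode` is invented, and `urllib.parse.quote` escapes a larger set of characters than the handful listed above.

```python
from urllib.parse import quote


def lenient_percent_encode(path: str) -> str:
    """Encode to bytes while letting unpaired surrogates through, then percent-encode."""
    raw = path.encode("utf-8", errors="surrogatepass")
    return quote(raw, safe="/")


# "\ud800" is an unpaired high surrogate; strict UTF-8 encoding would raise here.
print(lenient_percent_encode("/tmp/a\ud800.txt"))
```

The resulting URL contains bytes that a strict UTF-8 decoder will reject, which is why different consumers may disagree about such a URL.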

The reverse conversion process will percent-decode the path, then call fileSystemRepresentation on Windows only. This function calls the system WideCharToMultiByte, converting wide characters to the system's active ANSI code-page. This isn't really advisable, since the ANSI code-page in general cannot represent every Unicode character, so the conversion may be lossy. Additionally, as with Chromium, the transcoding to UTF-8 is not reversed, so the resulting bytes would not be equal to the original path.
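
The lossiness of a code-page conversion can be demonstrated without Windows. The Python sketch below uses the `cp1252` codec as a stand-in for an ANSI code-page; the helper name `to_ansi_lossy` and the choice of code-page are assumptions, and Python's `replace` handler substitutes `?`, which may not match exactly what WideCharToMultiByte does with its default settings.

```python
def to_ansi_lossy(name: str, code_page: str = "cp1252") -> bytes:
    """Convert text to a legacy code-page, replacing anything it cannot represent."""
    return name.encode(code_page, errors="replace")


# Characters covered by the code-page survive; everything else is lost.
print(to_ansi_lossy("caf\u00e9.txt"))
print(to_ansi_lossy("\u65e5\u672c.txt"))
```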

Firefox (assuming servo/rust-url is used?)

Rust-url has from_file_path and to_file_path methods. These methods work in terms of a native Path type which wraps an OsString - this string differs from Rust's standard string types in that it may not contain valid Unicode text. There are a handful of tests in this file, but again, they don't seem to be very comprehensive. Perhaps there are others which I can't find.

On POSIX systems, these bytes are percent-encoded directly into the URL's path segment, without prior conversion to UTF-8. Similarly, the reverse process decodes file and directory names as raw bytes into an OsString.
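
The byte-preserving round trip described above can be sketched in Python as follows. This is not rust-url's implementation: the function names are invented, hosts, queries and fragments are ignored, and the set of characters that `urllib.parse.quote` escapes is Python's rather than rust-url's.

```python
from urllib.parse import quote, unquote_to_bytes


def posix_path_to_file_url(path: bytes) -> str:
    """Percent-encode the raw path bytes without interpreting them as text."""
    if not path.startswith(b"/"):
        raise ValueError("expected an absolute path")
    return "file://" + quote(path, safe="/")


def file_url_to_posix_path(url: str) -> bytes:
    """Reverse the conversion, returning raw bytes rather than a string."""
    prefix = "file://"
    if not url.startswith(prefix + "/"):
        raise ValueError("expected a local file URL with an empty host")
    return unquote_to_bytes(url[len(prefix):])


original = b"/tmp/caf\xe9.txt"  # 0xE9 alone is not valid UTF-8
url = posix_path_to_file_url(original)
print(url)
print(file_url_to_posix_path(url) == original)
```

Because no transcoding happens in either direction, the path that comes back is byte-for-byte the path that went in.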

On Windows, if the path/OsString contains invalid Unicode text, creating a file URL will fail. Similarly, the reverse process will fail if the URL contains encoded invalid UTF-8.
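
The strict behaviour in the reverse direction can be approximated in Python like this. The helper `decode_component_strict` is hypothetical; it returns None where a strict implementation would report failure.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def decode_component_strict(component: str) -> Optional[str]:
    """Percent-decode a path component, failing if the bytes are not valid UTF-8."""
    try:
        return unquote_to_bytes(component).decode("utf-8")
    except UnicodeDecodeError:
        return None


print(decode_component_strict("caf%C3%A9"))   # valid UTF-8
print(decode_component_strict("caf%E9"))      # a lone 0xE9 byte is rejected
print(decode_component_strict("a%ED%A0%80"))  # an encoded surrogate is rejected
```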

Summary

There seem to be quite a few approaches to handling file paths. For POSIX-style paths, both Chrome and WebKit seem to destroy the original path information when it is not UTF-8, whilst rust-url will preserve these paths exactly. Windows paths are supposed to be Unicode by definition, but when invalid Unicode is encountered, browsers run the gamut of behaviour from "whatever the system wants" to allowing or rejecting. Again, my personal opinion is that rust-url's behaviour is the most reasonable, as these kinds of file names should be exceedingly rare (and hopefully one day Microsoft will patch Windows to just disallow them, as Apple's platforms disallow invalid UTF-8).

As a developer of non-browser applications, I would prefer if browsers documented and standardised their behaviour, ideally based on what rust-url does. It might be worth incorporating some basic path normalization, such as collapsing repeated slashes, so that the resulting URLs are easy to work with and display nicely.

There are other outstanding issues, such as preserving localhost, and how to represent Windows \\?\foo paths. Not only is ? not a valid hostname, and can't be percent-encoded, but these paths should apparently be sent to the OS directly without any interpretation at all - i.e. you could actually have a file or directory name called . or .. using these paths! It's unclear how to represent those in a file URL conforming to this standard, since it will always interpret . and .. path components - maybe you'd have to percent-encode the entire thing, path separators and all, as one giant path component?
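
The "one giant path component" idea can be tried out with a few lines of Python. This is only an exploration of the suggestion, not a proposal for how any implementation behaves, and `encode_verbatim_path` is a made-up name. Because the separators are escaped, the `..` in the middle is no longer a path component of its own.

```python
from urllib.parse import quote, unquote


def encode_verbatim_path(path: str) -> str:
    """Percent-encode an entire path, separators included, as one URL path component."""
    return quote(path, safe="")


verbatim = "\\\\?\\C:\\dir\\..\\file"  # the path \\?\C:\dir\..\file
encoded = encode_verbatim_path(verbatim)
print(encoded)
print(unquote(encoded) == verbatim)
```

A path consisting solely of `.` or `..` would still need separate handling, since the dots themselves are not escaped.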

Thoughts? Ideas? Corrections?

@domenic
Member

domenic commented Jul 13, 2021

Firefox (assuming servo/rust-url is used?)

I don't believe Firefox uses servo/rust-url; there's a very old bug about them switching to it, but from what I understand it isn't running in production, since I never saw the expected big jump in wpt.fyi pass rate.

The bugs are a bit confusing though and imply some Rust-based URL parser has landed somewhere in Firefox, perhaps behind a flag. (But maybe was then reverted?) See https://bugzilla.mozilla.org/show_bug.cgi?id=1318426 and its relatives.


Overall, not speaking as an editor, I think this would be cool. However, it would be a notable expansion of the spec's current scope. And I'd be surprised if it's a high priority for any browser engineers, since file: URLs aren't really part of the web.

I suspect an effort like this might have the best chance of success if it was done by a really interested community, probably server-focused instead of browser-focused. I.e. you all could come up with a spec and test suite that reflects what you think of as the very best behavior that takes into account all of these subtleties you mention, without prioritizing browser interop or compat. Then if that effort is successful it might be worth taking it back to browsers and trying to get it implemented there, or taking it back to the URL Standard and trying to include it as a section here. But you could derive most of the value from such an effort that just tries to standardize across non-browser software.

Note that this is all related to, but not quite the same as, the question of how servers should map HTTP(S) URL path components to file paths when doing static file serving. I.e. it also needs to consider percent-decoding, slash canonicalization, and the like. That question has come up on the issue tracker a few times, with similar responses of "not really in scope for now". It might be cool if whatever you come up with for file: URLs could be applied to give a similar algorithm for HTTP(S) URL paths.
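
As one possible shape for such a mapping on the server side, here is a Python sketch that combines percent-decoding and slash canonicalization. Everything in it is an assumption made for illustration: the function name, the choice to reject `..` instead of resolving it, and the choice to reject encoded slashes and nul bytes.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def url_path_to_relative_path(url_path: str) -> Optional[bytes]:
    """Map an HTTP(S) URL path to a relative POSIX path, or None if it is unsafe."""
    segments = []
    for raw in url_path.split("/"):
        segment = unquote_to_bytes(raw)
        if segment in (b"", b"."):
            continue  # collapses repeated slashes and drops "." segments
        if segment == b".." or b"/" in segment or b"\x00" in segment:
            return None  # refuse traversal, encoded separators and nul bytes
        segments.append(segment)
    return b"/".join(segments)


print(url_path_to_relative_path("/a//b/./c%20d.txt"))
print(url_path_to_relative_path("/a/%2e%2e/secret"))
```

Decoding each segment before checking it means that `%2e%2e` and `%2F` cannot be used to smuggle a traversal past the check.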

I'm definitely curious what @annevk and other community members think.

@annevk
Member

annevk commented Jul 19, 2021

I've always thought that this would be part of Fetch, if we wanted to standardize it at all. As that takes a URL (among other things that together form a request) and turns it into a response.

And yeah, a separate effort that gains traction somehow seems like a good way to approach this, though I would not be opposed to having it integrated if it ends up having wide enough buy-in.

@annevk
Member

annevk commented Oct 20, 2021

I'm going to move this issue to Fetch as per my earlier comment.


3 participants