Fix handling of Windows (WTF-16) and WASI (UTF-8) paths, etc #19005

Merged
merged 6 commits into from
Feb 25, 2024

squeek502
Collaborator

@squeek502 squeek502 commented Feb 19, 2024

Motivation

On Windows, paths/environment variables/command line arguments are arbitrary sequences of u16 (known as WTF-16), which means that unpaired surrogate codepoints (U+D800 to U+DFFF) are allowed. Unpaired surrogate codepoints cannot be encoded as valid UTF-8/UTF-16, meaning that UTF-8/UTF-16 cannot represent all possible paths/environment variables/command line arguments on Windows.

On other platforms (but not WASI), paths/environment variables/command line arguments are arbitrary sequences of u8 with no particular encoding. Therefore, invalid UTF-8 sequences are allowed, which in turn means that valid UTF-8 cannot represent all possible paths/environment variables/command line arguments.

On WASI, paths/environment variables/command line arguments are specified to be sequences of Unicode scalar values, meaning that they must be encodable as valid UTF-8/UTF-16. This means that WASI cannot handle all paths/environment variables/command line arguments regardless of the host platform.

Because Zig has cross-platform APIs that deal with slices of u8, some normalization/conversion has to be done for certain platforms. Up to this point, the status quo of Zig has been:

  • On Windows, convert WTF-16 to UTF-8 and fail if something can't be encoded as valid UTF-8 (or invoke illegal behavior in some buggy/ill-advised cases)
  • On WASI, Zig would unintentionally hit error.Unexpected if invalid UTF-8 was passed in (the underlying error is ILSEQ or invalid byte sequence)
  • On other platforms, Zig does the right thing and does not assume any particular encoding

Possible solutions

  • Continue with the status quo and have Zig's cross-platform APIs just not be able to handle all paths/environment variables/command line arguments on Windows
    • This doesn't seem aligned to the goals of Zig
  • Scrap the conversions to/from []u8 and force APIs to always deal with WTF-16 directly on Windows
    • This would really complicate writing cross-platform code in Zig when dealing with the filesystem, environment variables, and command line arguments
  • Convert to/from WTF-8 on Windows, which can losslessly encode all possible WTF-16 sequences
    • This is the strategy this PR goes with

What is WTF-8?

WTF-8 is a superset of UTF-8 that allows the codepoints U+D800 to U+DFFF (surrogate codepoints) to be encoded using the normal UTF-8 encoding algorithm. Since U+D800 to U+DFFF are the only WTF-16 code units that are normally unrepresentable in UTF-8, this alone is sufficient to be able to losslessly roundtrip from WTF-8 to WTF-16.
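
To make "the normal UTF-8 encoding algorithm" concrete, here is a minimal sketch in Python (not the Zig implementation from this PR). The function name `wtf8_encode` is a hypothetical stand-in. The only difference from a strict UTF-8 encoder is that the three-byte branch does not reject U+D800 to U+DFFF.

```python
def wtf8_encode(cp: int) -> bytes:
    """Encode one code point using the ordinary UTF-8 bit layout.

    Hypothetical illustration: unlike a strict UTF-8 encoder, surrogate
    code points (U+D800..U+DFFF) are accepted and encoded as 3 bytes.
    """
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # A strict UTF-8 encoder would raise here for 0xD800..0xDFFF.
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# An unpaired high surrogate becomes an ordinary-looking 3-byte sequence.
print(wtf8_encode(0xD83D).hex())   # eda0bd
# Non-surrogate code points encode exactly as they do in UTF-8.
print(wtf8_encode(0x1F4A9).hex())  # f09f92a9
```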

Some notes:

  • WTF-16 to WTF-8 conversion cannot fail and is always lossless
  • WTF-8 to WTF-16 conversion can fail if the WTF-8 is invalid (for example, has a sequence with an invalid start byte, or a sequence that encodes an impossibly large codepoint; in other words, the normal rules around UTF-8 with the exception of surrogate codepoints)
  • WTF-8 -> WTF-16 -> WTF-8 roundtripping relies on the WTF-8 being "well-formed", meaning encoded surrogate codepoints are always unpaired. For example, if the sequence U+D83D U+DCA9 (a high surrogate followed by a low surrogate) was encoded as WTF-8, then when converted to WTF-16 and back to WTF-8 it'd be interpreted as a surrogate pair that encodes the codepoint U+1F4A9, so the final WTF-8 would have the byte sequence for U+1F4A9 rather than U+D83D U+DCA9. As long as all surrogate codepoints in WTF-8 are unpaired, though, WTF-8 <-> WTF-16 roundtripping is guaranteed.
  • The spec says that users should avoid emitting/transmitting WTF-8 encoded bytes, and instead (lossily) convert to a valid Unicode encoding before emitting/transmitting them (more on this later)
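
The roundtripping notes above can be reproduced with a small Python sketch. It uses Python's built-in `surrogatepass` error handler as an approximation of WTF-8/WTF-16 handling; it is not the std.unicode code from this PR, and the helper names `wtf8_to_wtf16le` and `wtf16le_to_wtf8` are hypothetical.

```python
def wtf8_to_wtf16le(data: bytes) -> bytes:
    # 'surrogatepass' makes the UTF-8 codec accept 3-byte encoded surrogates.
    # Any other malformed input still raises UnicodeDecodeError.
    text = data.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")


def wtf16le_to_wtf8(data: bytes) -> bytes:
    # Expects an even number of bytes (a sequence of little-endian u16).
    # A high surrogate followed by a low surrogate decodes as one code point.
    text = data.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


# Well-formed: the encoded surrogate is unpaired, so the roundtrip is exact.
unpaired = b"a\xed\xa0\x80b"                 # 'a', U+D800, 'b'
assert wtf16le_to_wtf8(wtf8_to_wtf16le(unpaired)) == unpaired

# Not well-formed: U+D83D then U+DCA9, each encoded separately (6 bytes).
paired = b"\xed\xa0\xbd\xed\xb2\xa9"
roundtripped = wtf16le_to_wtf8(wtf8_to_wtf16le(paired))
print(roundtripped.hex())                    # f09f92a9, the bytes for U+1F4A9
assert roundtripped != paired

# Input that is invalid even by WTF-8 rules still fails to convert.
try:
    wtf8_to_wtf16le(b"\xff")
except UnicodeDecodeError:
    print("invalid start byte rejected")
```

The second case is the U+D83D U+DCA9 example from the notes: the conversion itself succeeds, but the bytes that come back are the 4-byte encoding of U+1F4A9.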

The changes

This PR was initially focused solely on handling WTF-16 via WTF-8, but now has a few interconnected changes:

  • std.unicode was refactored a bit and function names were made more consistent (e.g. lowercase le changed to the more common Le)
  • WTF-8 <-> WTF-16 conversion and related functions were added to std.unicode
  • WASI now properly handles ILSEQ errors and returns error.InvalidUtf8 (now a WASI-only error) in that case
  • Windows now does WTF-16 <-> WTF-8 conversion everywhere, and errors with error.InvalidWtf8 (a Windows-only error) if any user-supplied inputs are invalid WTF-8
  • Anything that incorrectly talked about UTF-8 was fixed (e.g. NativeUtf8ComponentIterator was previously incorrectly named [by me])
  • Some error sets were updated/narrowed/made explicit

The std.unicode changes in detail

This same information is in one of the commit messages, but:

std.unicode changes

Renamed functions for consistent Le capitalization and conventions:

  • utf16leToUtf8Alloc -> utf16LeToUtf8Alloc
  • utf16leToUtf8AllocZ -> utf16LeToUtf8AllocZ
  • utf16leToUtf8 -> utf16LeToUtf8
  • utf8ToUtf16LeWithNull -> utf8ToUtf16LeAllocZ
  • fmtUtf16le -> fmtUtf16Le

New UTF related functions:

  • utf16LeToUtf8ArrayList
  • utf8ToUtf16LeArrayList
  • utf8ToUtf16LeAlloc
  • isSurrogateCodepoint

(the ArrayList functions are mostly to allow the Alloc and AllocZ functions to share an implementation)

New WTF related functions/structs:

  • wtf8Encode
  • wtf8Decode
  • wtf8ValidateSlice
  • Wtf8View
  • Wtf8Iterator
  • wtf16LeToWtf8ArrayList
  • wtf16LeToWtf8Alloc
  • wtf16LeToWtf8AllocZ
  • wtf16LeToWtf8
  • wtf8ToWtf16LeArrayList
  • wtf8ToWtf16LeAlloc
  • wtf8ToWtf16LeAllocZ
  • wtf8ToWtf16Le
  • wtf8ToUtf8Lossy
  • wtf8ToUtf8LossyAlloc
  • wtf8ToUtf8LossyAllocZ
  • Wtf16LeIterator

Notes/concerns

  • The WTF-8/WTF-16 functions share a lot of their implementation with the UTF-8/UTF-16 functions. This is nice in some ways (reduces duplicate code), but potentially not so nice in others (changes to the UTF code always have to be mindful of how they affect the WTF code).
  • InvalidUtf8 has gone from a Windows-only error to a WASI-only error in many places. This may lead to bugs at existing callsites since it won't appear as a breaking change.
  • As mentioned before, only well-formed WTF-8 (meaning all surrogates are unpaired) roundtrips properly, but well-formedness is not enforced/validated by the std.unicode implementation. This means it is up to the user to be aware of WTF-8 well-formedness and maintain that property themselves (see the spec section on concatenation for what this means in practice) if they care about the roundtripping property. Note, however, that when converting to WTF-16, paired surrogates in WTF-8 are interpreted as a surrogate pair, so non-well-formed WTF-8 will get interpreted as if it were concatenated according to the spec in the process of being converted to WTF-16.
    • Not sure if I've done a good job explaining this. The idea is basically that since WTF-8 is only really meant to be used as a lossless u8 encoding of WTF-16, well-formedness of the WTF-8 doesn't matter too much since it has to be mapped to WTF-16 before it can be used in syscalls.
  • The spec mentions that "Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet."
    • I'm not fully convinced that emitting WTF-8 is that bad. It's literally impossible to accurately represent WTF-16 unpaired surrogates as valid UTF-8, so converting invalid sequences to � (U+FFFD) before emission, or letting whatever program consumes the output handle the invalid UTF-8 and do the � replacements itself, doesn't seem that consequential: no approach leads to the output being accurately represented as valid Unicode.
    • However, I have added std.fs.path.fmtAsUtf8Lossy and std.fs.path.fmtWtf16LeAsUtf8Lossy for any use cases where the paths being printed should definitely be represented as valid UTF-8, with unrepresentable sequences replaced by �.
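
As a rough illustration of what "replaced by �" means for a path, here is a Python sketch under one assumed policy: each encoded surrogate codepoint becomes a single U+FFFD. The helper name `wtf8_to_utf8_lossy` is a hypothetical Python stand-in and is not meant to describe how the Zig functions above are implemented.

```python
REPLACEMENT = "\ufffd"


def wtf8_to_utf8_lossy(data: bytes) -> bytes:
    """Turn WTF-8 bytes into valid UTF-8, losing surrogate code points.

    Assumed policy for this sketch: every surrogate code point is
    replaced by exactly one U+FFFD.
    """
    # 'surrogatepass' accepts encoded surrogates; other errors still raise.
    text = data.decode("utf-8", "surrogatepass")
    cleaned = "".join(
        REPLACEMENT if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )
    return cleaned.encode("utf-8")


name = b"bad\xed\xa0\x80.txt"       # file name containing an unpaired U+D800
printable = wtf8_to_utf8_lossy(name)
print(printable.decode("utf-8"))    # bad�.txt
# U+FFFD is 3 bytes in UTF-8, the same length as an encoded surrogate,
# so under this policy the byte length does not change.
assert len(printable) == len(name)
```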

Closes #18694
Closes #1774
Closes #2565

@squeek502 squeek502 changed the title Fix handling of Windows (WTF-16) and WASI (UTF-8) paths Fix handling of Windows (WTF-16) and WASI (UTF-8) paths, etc Feb 19, 2024
Ill-formed UTF-8 byte sequences are replaced by the replacement character (U+FFFD) according to "U+FFFD Substitution of Maximal Subparts" from Chapter 3 of the Unicode standard, and as specified by https://encoding.spec.whatwg.org/#utf-8-decoder
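
For readers unfamiliar with "Substitution of Maximal Subparts", the following Python sketch shows the practice using CPython's built-in UTF-8 decoder with `errors="replace"`. That decoder is only a stand-in for illustration, and the helper name `decode_lossy` is hypothetical. The idea is that a truncated but otherwise plausible prefix of a sequence becomes one U+FFFD, while bytes that cannot start or continue any valid sequence each get their own.

```python
def decode_lossy(data: bytes) -> str:
    # CPython's decoder is used here as an example of the
    # maximal-subpart replacement practice.
    return data.decode("utf-8", errors="replace")


# F0 9F 92 is a valid prefix of a 4-byte sequence that is cut short by 'b':
# the three bytes together form one maximal subpart, so one U+FFFD.
truncated = decode_lossy(b"a\xf0\x9f\x92b")
print(len(truncated))   # 3

# ED A0 80 would encode a surrogate. ED cannot be followed by A0 in
# valid UTF-8, so each of the three bytes is replaced on its own.
surrogate = decode_lossy(b"\xed\xa0\x80")
print(len(surrogate))   # 3

# C0 can never appear in valid UTF-8, and AF is a stray continuation byte.
overlong = decode_lossy(b"\xc0\xaf")
print(len(overlong))    # 2
```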
Windows paths now use WTF-16 <-> WTF-8 conversion everywhere, which is lossless. Previously, conversion of ill-formed UTF-16 paths would either fail or invoke illegal behavior.

WASI paths must be valid UTF-8, and the relevant function calls have been updated to handle the possibility of failure due to paths not being encoded/encodable as valid UTF-8.

@andrewrk
Member

Magnificent.
