path: Windows paths may contain non-utf8-representable sequences #12056
Comments
SimonSapin
May 12, 2014
Contributor
It turns out that Windows paths may contain unpaired UTF-16 surrogates
@kballard What is this based on? Trying to use CreateFileW directly to create such a file failed for me. Test case:
extern crate libc;
use std::ptr;
use std::io::IoError;

fn main() {
    for name in [&['a' as u16, 0xD83D, 0xDCA9, 'b' as u16, 0],
                 &['a' as u16, 0xD83D, 'b' as u16, 0]].iter() {
        let handle = unsafe {
            libc::CreateFileW(name.as_ptr(),
                              libc::FILE_GENERIC_WRITE,
                              0,
                              ptr::mut_null(),
                              libc::CREATE_ALWAYS,
                              libc::FILE_ATTRIBUTE_NORMAL,
                              ptr::mut_null())
        };
        let is_invalid = handle == libc::INVALID_HANDLE_VALUE as libc::HANDLE;
        println!("{} {} {}", name, is_invalid, IoError::last_error())
    }
}
Output:
[97, 55357, 56489, 98, 0] false unknown error (OS Error 0: The operation completed successfully.
)
[97, 55357, 98, 0] true unknown error (OS Error 87: The parameter is incorrect.
)
kballard
May 12, 2014
Contributor
@SimonSapin I don't know the precise details, but there exist portions of Windows in which paths are UCS2 rather than UTF-16. I ignored it because I thought it wasn't going to be an issue but at some point someone (and I wish I could remember who) showed me some output that showed that they were actually getting a UCS2 path from some Windows call and Path was unable to parse it.
SimonSapin
May 12, 2014
Contributor
Maybe CreateFileW is doing an explicit check that not all parts of the API are doing. sigh
kballard
May 12, 2014
Contributor
My recollection is the bad path came from iterating a temporary directory on this person's filesystem. But I don't remember for certain.
SimonSapin
May 12, 2014
Contributor
I’d be interested to see a test case, because this seems like it would affect every other language or library that tries to use filenames as Unicode.
kballard
May 12, 2014
Contributor
I'd love a test case too. Not being a Windows user is making it hard for me to actually test this stuff.
SimonSapin
May 12, 2014
Contributor
FWIW I used a virtual machine image from http://www.modern.ie/ and a nightly build to run the test above.
klutzy
May 13, 2014
Contributor
@kballard @SimonSapin #13338?
I think it depends on OS version and possibly even on locale. I ran your code on win7/kr and got:
[97, 55357, 56489, 98, 0] false unknown error (OS Error 0 (FormatMessageW() returned error 15100))
[97, 55357, 98, 0] false unknown error (OS Error 0 (FormatMessageW() returned error 15105))
SimonSapin
May 13, 2014
Contributor
@klutzy false for is_invalid and OS Error 0 looks like CreateFileW was successful. Do you see the two files created in the current directory?
SimonSapin
May 13, 2014
Contributor
Oh, wait, "success" here is not good news as it means you managed to create a file with a broken name. Does io::fs::readdir() trigger this failure, then?
klutzy
May 13, 2014
Contributor
> Do you see the two files created in the current directory?
> Does io::fs::readdir() trigger this failure, then?
Yes and yes. :'( I'm curious if it only occurs on recent OSes. (maybe >= 7?) Could you confirm OS name of VM you use?
Windows 7 Enterprise SP1, 32-bit, en-US locale
klutzy
May 13, 2014
Contributor
Ugh. I have no idea now.
Anyway, I'm ok to just ignore non-unicode filenames: I don't think there will be many use cases regarding it.
kballard
May 13, 2014
Contributor
Perhaps. If it really is that rare, then maybe it's ok. It certainly simplifies things. And the programmer always has the option of dropping down and doing the system calls directly if they need to handle this.
But it makes me a bit uncomfortable regardless.
However, as I'm not a Windows user, I'll defer to Windows users on this subject.
brson added the A-windows label Aug 12, 2014
vadimcn
Aug 30, 2014
Contributor
Non-UTF-16 file names are certainly rare, but when you come across one... it's very annoying when you cannot use any standard tools to delete such a file, for example.
Will I be a terrible person if I suggest that we bend rfc3629 a bit, and allow encoding unpaired UTF-16 surrogates "as themselves" in UTF-8 ?
SimonSapin
Aug 30, 2014
Contributor
> Will I be a terrible person if I suggest that we bend rfc3629 a bit, and allow encoding unpaired UTF-16 surrogates "as themselves" in UTF-8 ?
Yes, this would be terrible. Let’s not do that in the String and str types, it would not be UTF-8. We could have a NotReallyUtf8String type separately, but it would be of limited use.
retep998
Aug 30, 2014
Member
> We could have a NotReallyUtf8String type separately
Wouldn't that just be called Path?
SimonSapin
Aug 30, 2014
Contributor
Maybe it wouldn’t be so terrible if it’s an internal detail of std::path::windows::Path, with the display() method converting to real UTF-8. But then we might as well just use UCS-2 in Vec<u16> internally. (Not sure how that would interact with BytesContainer, though. Would windows::Path::new convert from UTF-8?)
kballard
Aug 30, 2014
Contributor
Using UCS-2 means none of the *_str() methods would ever work. Encoding UCS-2 as invalid UTF-8, as an internal implementation detail, is something I considered a while back and is probably the best approach.
-Kevin
retep998
Sep 6, 2014
Member
For comparison, MSVC C++ when converting an std::path to a utf-8 string will encode UCS-2 as invalid UTF-8, so there is precedent for us to follow this route.
SimonSapin
Sep 16, 2014
Contributor
I suggest calling this WTF-8: an encoding that uses the same algorithm as UTF-8, but has the "value space" of UCS-2. (That is, bigger than Unicode since unpaired surrogates are allowed.) We can decide later what the name is an acronym for.
Note: I call UCS-2 here any sequence of 16 bit units. Surrogate pairs have a special meaning, but unpaired surrogates are allowed.
To convert UCS-2 to WTF-8:
- Valid surrogate pairs are interpreted as a non-BMP code point and encoded as a 4-byte UTF-8 sequence
- Any other 16-bit code unit, including a lone surrogate, is encoded as a sequence of 1 to 3 bytes using UTF-8’s algorithm. This is invalid UTF-8 for lone surrogates, but valid WTF-8.
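The two encoding rules above can be sketched in present-day Rust roughly as follows. This is only an illustration under stated assumptions: the function names `ucs2_to_wtf8` and `push_code_point` are made up for this sketch and are not taken from any library mentioned in the thread.

```rust
/// Encode arbitrary 16-bit code units ("UCS-2" in the sense used above)
/// as WTF-8 bytes. Hypothetical helper, not an existing library API.
pub fn ucs2_to_wtf8(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(units.len());
    // decode_utf16 combines valid surrogate pairs into one code point and
    // reports every unpaired surrogate as an error holding the code unit.
    for item in std::char::decode_utf16(units.iter().copied()) {
        let code_point = match item {
            Ok(c) => c as u32,
            Err(e) => e.unpaired_surrogate() as u32,
        };
        push_code_point(&mut out, code_point);
    }
    out
}

/// UTF-8's byte layout, applied without rejecting surrogate code points.
fn push_code_point(out: &mut Vec<u8>, cp: u32) {
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x800..=0xFFFF => {
            // Lone surrogates (0xD800..=0xDFFF) end up here as 3 bytes.
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            // Non-BMP code points, i.e. former surrogate pairs.
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}
```

With the two names from the test case earlier in the thread (minus the NUL terminator), the paired name becomes ordinary UTF-8, while the lone 0xD83D becomes the 3-byte sequence ED A0 BD.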
To convert WTF-8 to UCS-2:
- 4-byte sequences are interpreted as a non-BMP code point and encoded as a surrogate pair of 16 bit units
- 1- to 3-byte sequences in UTF-8’s algorithm, including surrogates, are each encoded as one 16-bit unit
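The reverse direction could look something like the sketch below. It assumes the input is already well-formed WTF-8 (it does no validation and will panic or misdecode on truncated or malformed bytes); `wtf8_to_ucs2` and `cont` are names invented for this illustration.

```rust
/// Payload bits of a continuation byte.
fn cont(b: u8) -> u32 {
    (b & 0x3F) as u32
}

/// Decode well-formed WTF-8 into 16-bit code units.
/// Hypothetical helper; assumes `bytes` is valid WTF-8.
pub fn wtf8_to_ucs2(bytes: &[u8]) -> Vec<u16> {
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        let b0 = bytes[i];
        // Decode one sequence into a code point and its byte length.
        let (cp, len) = match b0 {
            0x00..=0x7F => (b0 as u32, 1),
            0xC0..=0xDF => ((((b0 & 0x1F) as u32) << 6) | cont(bytes[i + 1]), 2),
            0xE0..=0xEF => (
                (((b0 & 0x0F) as u32) << 12) | (cont(bytes[i + 1]) << 6) | cont(bytes[i + 2]),
                3,
            ),
            // Assumed to be a 4-byte lead byte, since the input is well-formed.
            _ => (
                (((b0 & 0x07) as u32) << 18)
                    | (cont(bytes[i + 1]) << 12)
                    | (cont(bytes[i + 2]) << 6)
                    | cont(bytes[i + 3]),
                4,
            ),
        };
        if cp >= 0x10000 {
            // 4-byte sequence: split into a surrogate pair.
            let v = cp - 0x10000;
            out.push(0xD800 + (v >> 10) as u16);
            out.push(0xDC00 + (v & 0x3FF) as u16);
        } else {
            // 1 to 3 bytes, possibly a lone surrogate: one 16-bit unit.
            out.push(cp as u16);
        }
        i += len;
    }
    out
}
```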
Consecutive 3-byte sequences for a lead surrogate followed by a trail surrogate are invalid in WTF-8. A 4-byte sequence should be used instead. (This ensures that the WTF-8 encoding of any UCS-2 string is unique.)
WTF-8 has the same "value space" as UCS-2, which is bigger than Unicode (since they include unpaired surrogates.)
Any valid UTF-8 string is also valid WTF-8, with the same byte representation. A WTF-8 string is also valid UTF-8 if and only if its UCS-2 conversion is valid UTF-16.
To convert WTF-8 to UTF-8, either:
- Strictly: return an error if the string contains a 3-byte sequence for a surrogate code point, otherwise return the string unchanged
- Lossily: replace any 3-byte sequence for a surrogate code point by 0xEF 0xBF 0xBD, the UTF-8 representation of the replacement character U+FFFD.
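Both conversions can be sketched directly on the bytes. In well-formed WTF-8 the byte 0xED only ever appears as a lead byte, and a surrogate is exactly 0xED followed by a byte in 0xA0..=0xBF, so a simple scan is enough. The function names are hypothetical, and the input is assumed to be well-formed WTF-8.

```rust
/// True if the (assumed well-formed) WTF-8 bytes contain a surrogate,
/// i.e. a sequence of the form ED A0..BF xx.
fn has_surrogate(wtf8: &[u8]) -> bool {
    wtf8.windows(2).any(|w| w[0] == 0xED && w[1] >= 0xA0)
}

/// Strict conversion: `None` if any surrogate is present, otherwise the
/// same bytes viewed as a `&str`, without copying.
pub fn wtf8_to_utf8_strict(wtf8: &[u8]) -> Option<&str> {
    if has_surrogate(wtf8) {
        None
    } else {
        std::str::from_utf8(wtf8).ok()
    }
}

/// Lossy conversion: each surrogate becomes U+FFFD (EF BF BD).
pub fn wtf8_to_utf8_lossy(wtf8: &[u8]) -> String {
    let mut out = Vec::with_capacity(wtf8.len());
    let mut i = 0;
    while i < wtf8.len() {
        if wtf8[i] == 0xED && i + 2 < wtf8.len() && wtf8[i + 1] >= 0xA0 {
            out.extend_from_slice(&[0xEF, 0xBF, 0xBD]);
            i += 3;
        } else {
            out.push(wtf8[i]);
            i += 1;
        }
    }
    // For well-formed input `out` is now valid UTF-8; from_utf8_lossy is
    // used only to avoid a panic if that assumption is violated.
    String::from_utf8_lossy(&out).into_owned()
}
```

The strict variant shows why keeping WTF-8 internally is attractive: in the common case of a name without lone surrogates, a string view needs no allocation.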
To concatenate two WTF-8 strings: if the earlier one ends with a lead surrogate and the latter one starts with a trail surrogate, both surrogates need to be removed and replaced with a single 4-byte sequence.
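A sketch of that concatenation rule, again assuming both inputs are well-formed WTF-8 and using made-up names (`wtf8_concat`, `decode_3`). Lead surrogates encode as ED A0..AF xx and trail surrogates as ED B0..BF xx, which is what the two boundary checks test.

```rust
/// Decode one 3-byte sequence into its code point.
fn decode_3(b: &[u8]) -> u32 {
    ((b[0] as u32 & 0x0F) << 12) | ((b[1] as u32 & 0x3F) << 6) | (b[2] as u32 & 0x3F)
}

/// Append `right` to `left`, merging a lead surrogate at the end of `left`
/// with a trail surrogate at the start of `right` into one 4-byte sequence.
pub fn wtf8_concat(mut left: Vec<u8>, right: &[u8]) -> Vec<u8> {
    let n = left.len();
    let ends_with_lead =
        n >= 3 && left[n - 3] == 0xED && (0xA0..=0xAF).contains(&left[n - 2]);
    let starts_with_trail =
        right.len() >= 3 && right[0] == 0xED && (0xB0..=0xBF).contains(&right[1]);

    if ends_with_lead && starts_with_trail {
        let lead = decode_3(&left[n - 3..]);
        let trail = decode_3(&right[..3]);
        // Standard UTF-16 surrogate pair arithmetic.
        let cp = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
        left.truncate(n - 3);
        left.extend_from_slice(&[
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ]);
        left.extend_from_slice(&right[3..]);
    } else {
        left.extend_from_slice(right);
    }
    left
}
```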
Note: WTF-8 is different from CESU-8.
In terms of Rust implementation, WTF-8 data should be kept in a dedicated type that wraps a private Vec<u8> field, with APIs that maintain the encoding invariants, like String does for UTF-8.
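As a rough illustration of what such a wrapper might look like (not the design that was eventually implemented; `SketchWtf8Buf` and its methods are hypothetical names), the bytes stay private and every constructor upholds the invariants:

```rust
/// Hypothetical owned WTF-8 buffer. The field is private, so callers
/// cannot introduce bytes that break the encoding invariants.
pub struct SketchWtf8Buf {
    bytes: Vec<u8>,
}

impl SketchWtf8Buf {
    /// Any valid UTF-8 string is already valid WTF-8.
    pub fn new(s: &str) -> Self {
        SketchWtf8Buf { bytes: s.as_bytes().to_vec() }
    }

    /// Build from arbitrary 16-bit units; lone surrogates are preserved.
    pub fn from_wide(units: &[u16]) -> Self {
        let mut bytes = Vec::with_capacity(units.len());
        for item in std::char::decode_utf16(units.iter().copied()) {
            match item {
                // Paired surrogates and ordinary characters: plain UTF-8.
                Ok(c) => {
                    let mut buf = [0u8; 4];
                    bytes.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
                }
                // Unpaired surrogate: 3 bytes using UTF-8's layout.
                Err(e) => {
                    let s = e.unpaired_surrogate();
                    bytes.push(0xE0 | (s >> 12) as u8);
                    bytes.push(0x80 | ((s >> 6) & 0x3F) as u8);
                    bytes.push(0x80 | (s & 0x3F) as u8);
                }
            }
        }
        SketchWtf8Buf { bytes }
    }

    /// Borrow as `&str` without allocating, if no lone surrogate is present.
    pub fn as_str(&self) -> Option<&str> {
        std::str::from_utf8(&self.bytes).ok()
    }
}
```

Because only the constructors can write to `bytes`, `as_str` can rely on the fact that the sole way for the buffer to fail UTF-8 validation is a lone surrogate. A fuller type would also need an append operation that applies the concatenation rule above.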
SimonSapin
Sep 26, 2014
Contributor
I’m gonna work on WTF-8 anyway, for Servo to interact with JavaScript. ECMAScript clearly says that strings are sequences of 16-bit integers.
For Windows however, it’s not so clear to me that this is actually needed. MSDN claims that the encoding used in Windows is UTF-16. The documentation for functions converting to and from UTF-16 says:
> Starting with Windows Vista, this function fully conforms with the Unicode 4.1 specification for UTF-8 and UTF-16. The function used on earlier operating systems encodes or decodes lone surrogate halves or mismatched surrogate pairs.
I believe (by opposition to the following sentence) that "fully conforms" here means replaces unpaired surrogates with U+FFFD. (I haven’t tested it, though.)
Now, we’ve seen that it’s possible to create a file with an invalid UTF-16 name. But it’s not easy, we’re sometimes prevented from doing it. It may be considered a bug that it was possible. The question is, and I couldn’t find an answer on MSDN, are Windows applications expected to handle such files correctly?
retep998
Sep 26, 2014
Member
According to MSDN:
> There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs.
This isn't exactly UTF-16. So while the rest of WinAPI uses UTF-16, the file API doesn't. I've tested creating filenames with invalid surrogates in their names, and I was able to create and manipulate them easily both through code and through numerous applications, including cmd.exe, Notepad, and Windows Explorer. So far Windows 8.1 hasn't tried to stop me from using invalid surrogates for files, so I definitely think Windows applications are expected to handle such files correctly.
SimonSapin
Sep 26, 2014
Contributor
"Unicode normalization" refers to something else, but "opaque sequence of WCHARs" does indeed imply that unpaired surrogates can occur. Alright then, WTF-8 for Path it is.
vadimcn
Sep 27, 2014
Contributor
So why use WTF-8 for internal representation, and not UCS-2 then? If WTFString is a distinct type from String, you might as well save on encoding/decoding them when interacting with Windows APIs.
SimonSapin
Sep 27, 2014
Contributor
As noted by @kballard above, none of the *_str methods would work with UCS-2 internally.
retep998
Sep 27, 2014
Member
Perhaps we could change the API so that all the methods return MaybeOwned instead?
kballard
Sep 28, 2014
Contributor
MaybeOwned is not intended for this sort of thing. It's intended for functions that, from a usability perspective, should return String but, in the interest of avoiding unnecessary copies, want to be able to return a &str when possible: things like string replacement, which wants to return the input string if no replacement was actually done.
It's not suitable here because it's mildly annoying to work with.
Also, any clients of Path that know they're working with human-readable strings (which is nearly all of the time) want to be able to use the _str methods without unnecessary allocation. If WindowsPath always has to allocate with a _str method then that kind of sucks.
Overall, I think WTF-8 is the best approach, and it's the one I would have implemented many months ago if I ~~actually cared enough about Windows~~ had the spare time to work on it.
SimonSapin
Oct 5, 2014
Contributor
So, WTF-8 is a thing now.
- Specification: https://simonsapin.github.io/wtf-8/
- Rust implementation, as a Cargo library: https://github.com/SimonSapin/rust-wtf8
- Library documentation: https://simonsapin.github.io/rust-wtf8/wtf8/index.html
- Same code, in libcollections (use statements have to be tweaked): https://github.com/SimonSapin/rust/compare/wtf8
I haven’t made a PR yet for the last one. There is no rush, as long as there’s no branch of std::path::windows using it.
Now, WTF-8 is not Unicode, and I really don’t want it to be used (accidentally?) for random things that should be UTF-8. So I’ve chosen not to expose the underlying bytes: Wtf8String and Wtf8Slice have no .as_bytes() method. (There is no decoding from arbitrary bytes either.)
I’m not sure how that’ll work with std::path::BytesContainer. Maybe I’ll have to compromise.
kballard commented Feb 6, 2014
It turns out that Windows paths may contain unpaired UTF-16 surrogates, which are not legally representable in UTF-8. This means that WindowsPath is not capable of representing such paths, as it uses ~str internally.
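The underlying limitation is easy to reproduce without touching the file system. The sketch below uses present-day Rust's `String::from_utf16` and `String::from_utf16_lossy` on the two names from the test case in the comments (minus the NUL terminator); the wrapper function names are made up for this example.

```rust
use std::string::FromUtf16Error;

/// Try to turn a UTF-16 file name into a `String`, which must be valid
/// UTF-8. Fails if the name contains an unpaired surrogate.
pub fn name_to_string(units: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(units)
}

/// Same, but replaces each unpaired surrogate with U+FFFD, so the
/// original name can no longer be recovered.
pub fn name_to_string_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

The name with a proper surrogate pair converts cleanly, while the name with a lone 0xD83D is either rejected or altered, so a path type backed by a UTF-8 string cannot round-trip it.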