path: Windows paths may contain non-utf8-representable sequences #12056

Closed
kballard opened this Issue Feb 6, 2014 · 29 comments

Contributor

kballard commented Feb 6, 2014

It turns out that Windows paths may contain unpaired UTF-16 surrogates, which are not legally representable in UTF-8. This means that WindowsPath is not capable of representing such paths, as it uses ~str internally.
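
As an illustration in present-day Rust (not code from the issue; the helper names `utf16_to_string` and `is_surrogate` are hypothetical), a lone surrogate is not a Unicode scalar value, so a strict UTF-16 conversion to `String` has to fail:

```rust
/// Hypothetical helper: strict UTF-16 to String conversion.
/// Returns None when the input contains an unpaired surrogate.
fn utf16_to_string(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// True if `unit` lies in the surrogate range 0xD800..=0xDFFF.
fn is_surrogate(unit: u16) -> bool {
    (0xD800..=0xDFFF).contains(&unit)
}
```

A well-formed pair such as 0xD83D 0xDCA9 converts fine; the same lead unit on its own does not, and `char::from_u32(0xD83D)` is `None` for the same reason.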

SimonSapin May 12, 2014

Contributor

It turns out that Windows paths may contain unpaired UTF-16 surrogates

@kballard What is this based on? Trying to use CreateFileW directly to create such a file failed for me. Test case:

extern crate libc;

use std::ptr;
use std::io::IoError;

fn main() {
    for name in [&['a' as u16, 0xD83D, 0xDCA9, 'b' as u16, 0],
                 &['a' as u16, 0xD83D,         'b' as u16, 0]].iter() {
        let handle = unsafe {
            libc::CreateFileW(name.as_ptr(),
                              libc::FILE_GENERIC_WRITE,
                              0,
                              ptr::mut_null(),
                              libc::CREATE_ALWAYS,
                              libc::FILE_ATTRIBUTE_NORMAL,
                              ptr::mut_null())
        };
        let is_invalid = handle == libc::INVALID_HANDLE_VALUE as libc::HANDLE;
        println!("{} {} {}", name, is_invalid, IoError::last_error())
    }
}

Output:

[97, 55357, 56489, 98, 0] false unknown error (OS Error 0: The operation completed successfully.
)
[97, 55357, 98, 0] true unknown error (OS Error 87: The parameter is incorrect.
)
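
For readers mapping the decimal output back to surrogates, here is a small sketch in present-day Rust (an added illustration, with the hypothetical helper name `unpaired_surrogates`) that lists which 16-bit units in a name are unpaired:

```rust
use std::char::decode_utf16;

/// Hypothetical helper: collect every unpaired surrogate found in `units`.
/// Well-formed surrogate pairs decode to a `char` and are not reported.
fn unpaired_surrogates(units: &[u16]) -> Vec<u16> {
    decode_utf16(units.iter().copied())
        .filter_map(|item| item.err())
        .map(|e| e.unpaired_surrogate())
        .collect()
}
```

Applied to the two names above, the first (55357 followed by 56489) reports nothing, while the second reports the lone 55357 (0xD83D).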

kballard May 12, 2014

Contributor

@SimonSapin I don't know the precise details, but there exist portions of Windows in which paths are UCS-2 rather than UTF-16. I ignored it because I thought it wasn't going to be an issue, but at some point someone (and I wish I could remember who) showed me output indicating that they were actually getting a UCS-2 path from some Windows call and Path was unable to parse it.

SimonSapin May 12, 2014

Contributor

Maybe CreateFileW is doing an explicit check that not all parts of the API are doing. sigh

kballard May 12, 2014

Contributor

My recollection is the bad path came from iterating a temporary directory on this person's filesystem. But I don't remember for certain.

SimonSapin May 12, 2014

Contributor

I’d be interested to see a test case, because this seems like it would affect every other language or library that tries to use filenames as Unicode.

kballard May 12, 2014

Contributor

I'd love a test case too. Not being a Windows user is making it hard for me to actually test this stuff.

SimonSapin May 12, 2014

Contributor

FWIW I used a virtual machine image from http://www.modern.ie/ and a nightly build to run the test above.

klutzy May 13, 2014

Contributor

@kballard @SimonSapin #13338?
I think it depends on OS version and possibly even on locale. I ran your code on win7/kr and got:

[97, 55357, 56489, 98, 0] false unknown error (OS Error 0 (FormatMessageW() returned error 15100))
[97, 55357, 98, 0] false unknown error (OS Error 0 (FormatMessageW() returned error 15105))

SimonSapin May 13, 2014

Contributor

@klutzy false for is_invalid and OS Error 0 looks like CreateFileW was successful. Do you see the two files created in the current directory?

SimonSapin May 13, 2014

Contributor

Oh, wait, "success" here is not good news as it means you managed to create a file with a broken name. Does io::fs::readdir() trigger this failure, then?

klutzy May 13, 2014

Contributor

@SimonSapin

Do you see the two files created in the current directory?
Does io::fs::readdir() trigger this failure, then?

Yes and yes. :'( I'm curious whether it only occurs on recent OSes (maybe >= 7?). Could you confirm the OS name of the VM you use?

SimonSapin May 13, 2014

Contributor

Windows 7 Enterprise SP1, 32-bit, en-US locale

klutzy May 13, 2014

Contributor

Ugh. I have no idea now.

Anyway, I'm ok to just ignore non-unicode filenames: I don't think there will be many use cases regarding it.

kballard May 13, 2014

Contributor

Perhaps. If it really is that rare, then maybe it's ok. It certainly simplifies things. And the programmer always has the option of dropping down and doing the system calls directly if they need to handle this.

But it makes me a bit uncomfortable regardless.

However, as I'm not a Windows user, I'll defer to Windows users on this subject.

vadimcn Aug 30, 2014

Contributor

Non-UTF-16 file names are certainly rare, but when you come across one... it's very annoying when you cannot use any standard tools to delete such a file, for example.
Will I be a terrible person if I suggest that we bend RFC 3629 a bit, and allow encoding unpaired UTF-16 surrogates "as themselves" in UTF-8?

SimonSapin Aug 30, 2014

Contributor

Will I be a terrible person if I suggest that we bend RFC 3629 a bit, and allow encoding unpaired UTF-16 surrogates "as themselves" in UTF-8?

Yes, this would be terrible. Let’s not do that in the String and str types, it would not be UTF-8. We could have a NotReallyUtf8String type separately, but it would be of limited use.

retep998 Aug 30, 2014

Member

We could have a NotReallyUtf8String type separately

Wouldn't that just be called Path?

SimonSapin Aug 30, 2014

Contributor

Maybe it wouldn’t be so terrible if it’s an internal detail of std::path::windows::Path, with the display() method converting to real UTF-8. But then we might as well just use UCS-2 in Vec<u16> internally. (Not sure how that would interact with BytesContainer, though. Would windows::Path::new convert from UTF-8?)

kballard Aug 30, 2014

Contributor

Using UCS-2 means none of the *_str() methods would ever work. Encoding UCS-2 as an invalid UTF-8 as an internal implementation detail is something I considered a while back and is probably the best approach.

retep998 Sep 6, 2014

Member

For comparison, MSVC C++ when converting an std::path to a utf-8 string will encode UCS-2 as invalid UTF-8, so there is precedent for us to follow this route.

SimonSapin Sep 16, 2014

Contributor

I suggest calling this WTF-8: an encoding that uses the same algorithm as UTF-8, but has the "value space" of UCS-2. (That is, bigger than Unicode since unpaired surrogates are allowed.) We can decide later what the name is an acronym for.

Note: I call UCS-2 here any sequence of 16 bit units. Surrogate pairs have a special meaning, but unpaired surrogates are allowed.

To convert UCS-2 to WTF-8:

  • Valid surrogate pairs are interpreted as a non-BMP code point and encoded as a 4-byte UTF-8 sequence
  • Any other 16-bit code unit, including a lone surrogate, is encoded as a sequence of 1 to 3 bytes using UTF-8’s algorithm. This is invalid UTF-8 for lone surrogates, but valid WTF-8.
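
The two rules above can be sketched in present-day Rust as follows. This is an added illustration under the definitions in this comment, not the implementation that was later written; `push_code_point` and `ucs2_to_wtf8` are hypothetical names.

```rust
/// Encode one code point (possibly a surrogate) with UTF-8's byte layout.
fn push_code_point(out: &mut Vec<u8>, cp: u32) {
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}

/// Hypothetical helper: convert arbitrary 16-bit units to WTF-8 bytes.
fn ucs2_to_wtf8(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(units.len());
    let mut i = 0;
    while i < units.len() {
        let unit = units[i] as u32;
        let is_lead = (0xD800..=0xDBFF).contains(&unit);
        if is_lead && i + 1 < units.len() {
            let next = units[i + 1] as u32;
            if (0xDC00..=0xDFFF).contains(&next) {
                // Valid pair: combine into one non-BMP code point (4 bytes).
                let cp = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
                push_code_point(&mut out, cp);
                i += 2;
                continue;
            }
        }
        // Any other unit, including a lone surrogate: 1 to 3 bytes.
        push_code_point(&mut out, unit);
        i += 1;
    }
    out
}
```

With this sketch, the lone lead surrogate 0xD83D becomes the three bytes ED A0 BD, which a UTF-8 validator rejects.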

To convert WTF-8 to UCS-2:

  • 4-byte sequences are interpreted as a non-BMP code point and encoded as a surrogate pair of 16 bit units
  • 1- to 3-byte sequences in UTF-8’s algorithm, including those for surrogates, are each encoded as one 16-bit unit
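
The reverse direction, again as an added sketch with hypothetical names (`wtf8_to_ucs2`, `cont`). It assumes the input is already well-formed WTF-8 and does no validation, so truncated input would panic on an out-of-bounds index:

```rust
/// Payload bits of a continuation byte.
fn cont(b: u8) -> u32 {
    (b & 0x3F) as u32
}

/// Hypothetical helper: convert well-formed WTF-8 bytes to 16-bit units.
fn wtf8_to_ucs2(bytes: &[u8]) -> Vec<u16> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let b0 = bytes[i] as u32;
        // Decode one sequence; the lead byte determines its length.
        let (cp, len) = if b0 < 0x80 {
            (b0, 1)
        } else if b0 < 0xE0 {
            (((b0 & 0x1F) << 6) | cont(bytes[i + 1]), 2)
        } else if b0 < 0xF0 {
            (
                ((b0 & 0x0F) << 12) | (cont(bytes[i + 1]) << 6) | cont(bytes[i + 2]),
                3,
            )
        } else {
            (
                ((b0 & 0x07) << 18)
                    | (cont(bytes[i + 1]) << 12)
                    | (cont(bytes[i + 2]) << 6)
                    | cont(bytes[i + 3]),
                4,
            )
        };
        if cp >= 0x10000 {
            // Non-BMP code point: emit a surrogate pair.
            let v = cp - 0x10000;
            out.push(0xD800 + (v >> 10) as u16);
            out.push(0xDC00 + (v & 0x3FF) as u16);
        } else {
            // BMP code point or lone surrogate: one unit.
            out.push(cp as u16);
        }
        i += len;
    }
    out
}
```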

Consecutive 3-byte sequences for a lead surrogate followed by a trail surrogate are invalid in WTF-8. A 4-byte sequence should be used instead. (This ensures that the WTF-8 encoding of any UCS-2 string is unique.)
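
A sketch of how that rule could be checked (an added illustration; `has_split_surrogate_pair` is a hypothetical name). It relies on the fact that in this byte layout a lead surrogate encodes as ED A0..AF xx and a trail surrogate as ED B0..BF xx, and that 0xED can only ever start a sequence:

```rust
/// Hypothetical check: true if `bytes` contains a lead surrogate's 3-byte
/// sequence immediately followed by a trail surrogate's 3-byte sequence,
/// which WTF-8 forbids (the pair must be a single 4-byte sequence instead).
fn has_split_surrogate_pair(bytes: &[u8]) -> bool {
    bytes.windows(6).any(|w| {
        w[0] == 0xED
            && (0xA0..=0xAF).contains(&w[1])
            && w[3] == 0xED
            && (0xB0..=0xBF).contains(&w[4])
    })
}
```

A trail surrogate followed by a lead surrogate is still allowed, since those two units do not form a pair.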

WTF-8 has the same "value space" as UCS-2, which is bigger than Unicode (since they include unpaired surrogates.)

Any valid UTF-8 string is also valid WTF-8, with the same byte representation. A WTF-8 string is also valid UTF-8 if and only if its UCS-2 conversion is valid UTF-16.

To convert WTF-8 to UTF-8, either:

  • Strictly: return an error if the string contains a 3-byte sequence for a surrogate code point, otherwise return the string unchanged
  • Lossily: replace any 3-byte sequence for a surrogate code point by 0xEF 0xBF 0xBD, the UTF-8 representation of the replacement character U+FFFD.
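
Both conversions, sketched in present-day Rust under the assumption that the input is well-formed WTF-8 (an added illustration; the function names are hypothetical):

```rust
/// True if a 3-byte sequence encoding a surrogate starts at index `i`.
fn starts_surrogate(bytes: &[u8], i: usize) -> bool {
    bytes[i] == 0xED && matches!(bytes.get(i + 1), Some(&b) if b >= 0xA0)
}

/// Strict: None if any surrogate is present, otherwise the same bytes as &str.
fn wtf8_to_utf8_strict(bytes: &[u8]) -> Option<&str> {
    if (0..bytes.len()).any(|i| starts_surrogate(bytes, i)) {
        None
    } else {
        std::str::from_utf8(bytes).ok()
    }
}

/// Lossy: each surrogate's 3-byte sequence becomes U+FFFD (EF BF BD).
fn wtf8_to_utf8_lossy(bytes: &[u8]) -> String {
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if starts_surrogate(bytes, i) {
            out.extend_from_slice(&[0xEF, 0xBF, 0xBD]);
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    // For well-formed WTF-8 input this is already valid UTF-8; the lossy
    // call only guards against input that violates that assumption.
    String::from_utf8_lossy(&out).into_owned()
}
```

Because a surrogate and U+FFFD both take three bytes, the lossy result has the same length as its input.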

To concatenate two WTF-8 strings: if the earlier one ends with a lead surrogate and the latter one starts with a trail surrogate, both surrogates need to be removed and replaced with a 4-byte sequence.
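
A sketch of that concatenation rule, assuming both inputs are well-formed WTF-8 (an added illustration; `surrogate_value` and `wtf8_concat` are hypothetical names):

```rust
/// If `seq` is exactly the 3-byte encoding of a surrogate, return its value.
fn surrogate_value(seq: &[u8]) -> Option<u32> {
    match seq {
        [0xED, b1 @ 0xA0..=0xBF, b2 @ 0x80..=0xBF] => {
            Some(0xD000 | (((*b1 as u32) & 0x3F) << 6) | ((*b2 as u32) & 0x3F))
        }
        _ => None,
    }
}

/// Hypothetical helper: concatenate two WTF-8 byte strings, merging a
/// trailing lead surrogate and a leading trail surrogate into 4 bytes.
fn wtf8_concat(left: &[u8], right: &[u8]) -> Vec<u8> {
    let tail = left
        .len()
        .checked_sub(3)
        .and_then(|start| surrogate_value(&left[start..]));
    let head = right.get(..3).and_then(surrogate_value);
    match (tail, head) {
        (Some(lead @ 0xD800..=0xDBFF), Some(trail @ 0xDC00..=0xDFFF)) => {
            let cp = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
            let mut out = left[..left.len() - 3].to_vec();
            out.extend_from_slice(&[
                0xF0 | (cp >> 18) as u8,
                0x80 | ((cp >> 12) & 0x3F) as u8,
                0x80 | ((cp >> 6) & 0x3F) as u8,
                0x80 | (cp & 0x3F) as u8,
            ]);
            out.extend_from_slice(&right[3..]);
            out
        }
        // No pair forms at the boundary: plain byte concatenation.
        _ => [left, right].concat(),
    }
}
```

Inspecting the last three bytes of the left string is safe because 0xED is never a continuation byte, so it always marks the start of a sequence.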

Note: WTF-8 is different from CESU-8.

In terms of Rust implementation, WTF-8 data should be kept in a dedicated type that wraps a private Vec<u8> field, with APIs that maintain the encoding invariants, like String does for UTF-8.
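
A minimal sketch of such a wrapper in present-day Rust. The type name `Wtf8Sketch` and its methods are hypothetical and only show the shape of the idea: the byte vector is private, every constructor produces well-formed WTF-8, and there is no way to build a value from arbitrary bytes.

```rust
use std::char::decode_utf16;

/// Hypothetical wrapper: the private field keeps the WTF-8 invariant
/// out of reach of callers, as String does for UTF-8.
pub struct Wtf8Sketch {
    bytes: Vec<u8>,
}

impl Wtf8Sketch {
    /// Any valid UTF-8 string is already valid WTF-8.
    pub fn from_valid_str(s: &str) -> Self {
        Wtf8Sketch { bytes: s.as_bytes().to_vec() }
    }

    /// Build from arbitrary 16-bit units, keeping lone surrogates.
    pub fn from_wide(units: &[u16]) -> Self {
        let mut bytes = Vec::with_capacity(units.len());
        for item in decode_utf16(units.iter().copied()) {
            match item {
                Ok(c) => {
                    let mut buf = [0u8; 4];
                    bytes.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
                }
                Err(e) => {
                    // Lone surrogate: 3-byte sequence, always led by 0xED.
                    let s = e.unpaired_surrogate() as u32;
                    bytes.extend_from_slice(&[
                        0xED,
                        0x80 | ((s >> 6) & 0x3F) as u8,
                        0x80 | (s & 0x3F) as u8,
                    ]);
                }
            }
        }
        Wtf8Sketch { bytes }
    }

    /// Borrow as &str without allocating, if no surrogate is present.
    pub fn as_str(&self) -> Option<&str> {
        std::str::from_utf8(&self.bytes).ok()
    }

    /// Length in bytes of the WTF-8 representation.
    pub fn len(&self) -> usize {
        self.bytes.len()
    }
}
```

Since `decode_utf16` already combines a lead unit with the trail unit that follows it, `from_wide` never emits two consecutive 3-byte surrogate sequences that should have been a single 4-byte one.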

SimonSapin Sep 26, 2014

Contributor

I’m gonna work on WTF-8 anyway, for Servo to interact with JavaScript. ECMAScript clearly says that strings are sequences of 16-bit integers.

For Windows however, it’s not so clear to me that this is actually needed. MSDN claims that the encoding used in Windows is UTF-16. The documentation for functions converting to and from UTF-16 says:

Starting with Windows Vista, this function fully conforms with the Unicode 4.1 specification for UTF-8 and UTF-16. The function used on earlier operating systems encodes or decodes lone surrogate halves or mismatched surrogate pairs.

I believe (by contrast with the sentence that follows it) that "fully conforms" here means that unpaired surrogates are replaced with U+FFFD. (I haven’t tested it, though.)

Now, we’ve seen that it’s possible to create a file with an invalid UTF-16 name. But it’s not easy; we’re sometimes prevented from doing it. It may be considered a bug that it was possible. The question is, and I couldn’t find an answer on MSDN, are Windows applications expected to handle such files correctly?

retep998 Sep 26, 2014

Member

According to MSDN

There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs.

This isn't exactly UTF-16. So while the rest of WinAPI uses UTF-16, the file API doesn't. I've tested creating files with invalid surrogates in their names, and I was able to create and manipulate them easily, both through code and through numerous applications, including cmd.exe, Notepad, and Windows Explorer. So far Windows 8.1 hasn't tried to stop me from using invalid surrogates for files, so I definitely think Windows applications are expected to handle such files correctly.

SimonSapin Sep 26, 2014

Contributor

"Unicode normalization" refers to something else, but "opaque sequence of WCHARs" does indeed imply that unpaired surrogates can occur. Alright then, WTF-8 for Path it is.

vadimcn Sep 27, 2014

Contributor

So why use WTF-8 for internal representation, and not UCS-2 then? If WTFString is a distinct type from String, you might as well save on encoding/decoding them when interacting with Windows APIs.

SimonSapin Sep 27, 2014

Contributor

As noted by @kballard above, none of the *_str methods would work with UCS-2 internally.

retep998 Sep 27, 2014

Member

Perhaps we could change the API so that all the methods return MaybeOwned instead?

kballard Sep 28, 2014

Contributor

MaybeOwned is not intended for this sort of thing. It's intended for functions that, from a usability perspective, should return String but, in the interest of avoiding unnecessary copies, want to be able to return a &str when possible: things like string replacement, which want to return the input string if no replacement was actually done.
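
For readers on present-day Rust, where MaybeOwned no longer exists, the closest counterpart is `std::borrow::Cow`. A minimal sketch of the string-replacement pattern described here, using `Cow<str>` as a stand-in (an added illustration; `replace_spaces` is a hypothetical function):

```rust
use std::borrow::Cow;

/// Hypothetical example: replace spaces with underscores, allocating only
/// when the input actually contains a space.
fn replace_spaces(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        // A replacement happened, so a new String is needed.
        Cow::Owned(input.replace(' ', "_"))
    } else {
        // Nothing to replace: hand back the borrowed input unchanged.
        Cow::Borrowed(input)
    }
}
```

Callers that only read the result can treat both cases as a `&str`; the borrowed-or-owned distinction only shows up when they need to take ownership.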

It's not suitable here because it's mildly annoying to work with.

Also, any clients of Path that know they're working with human-readable strings (which is nearly all of the time) want to be able to use the _str methods without unnecessary allocation. If WindowsPath always has to allocate with a _str method then that kind of sucks.

Overall, I think WTF-8 is the best approach, and it's the one I would have implemented many months ago if I actually cared enough about Windows and had the spare time to work on it.

SimonSapin Oct 5, 2014

Contributor

So, WTF-8 is a thing now.

I haven’t made a PR yet for the last one. There is no rush, as long as there’s no branch of std::path::windows using it.

Now, WTF-8 is not Unicode, and I really don’t want it to be used (accidentally?) for random things that should be UTF-8. So I’ve chosen not to expose the underlying bytes: Wtf8String and Wtf8Slice have no .as_bytes() method. (There is no decoding from arbitrary bytes either.)

I’m not sure how that’ll work with std::path::BytesContainer. Maybe I’ll have to compromise.

@aturon aturon referenced this issue Nov 6, 2014

Closed

std::path and path-related metabug #18720

0 of 10 tasks complete

bors added a commit that referenced this issue Feb 3, 2015

Auto merge of #21759 - aturon:new-path, r=alexcrichton
This PR implements [path reform](rust-lang/rfcs#474), and motivation and details for the change can be found there.

For convenience, the old path API is being kept as `old_path` for the time being. Updating after this PR is just a matter of changing imports to `old_path` (which is likely not needed, since the prelude entries still export the old path API).

This initial PR does not include additional normalization or platform-specific path extensions. These will be done in follow up commits or PRs.

[breaking-change]

Closes #20034
Closes #12056
Closes #11594
Closes #14028
Closes #14049
Closes #10035

alexcrichton added a commit to alexcrichton/rust that referenced this issue Feb 4, 2015

rollup merge of #21759: aturon/new-path

@bors bors closed this in #21759 Feb 4, 2015
