Amend to RFC 517: add subsection on string handling #575
Conversation
aturon
changed the title
RFC 517: add subsection on string handling
Amendment to RFC 517: add subsection on string handling
Jan 13, 2015
aturon
changed the title
Amendment to RFC 517: add subsection on string handling
Amend to RFC 517: add subsection on string handling
Jan 13, 2015
sfackler
reviewed
Jan 13, 2015
```rust
pub mod os_str {
    /// Owned OS strings
    struct OsStrBuf {
```
sfackler
Jan 13, 2015
Member
Bikeshed: Having OsString and OsStr seems more consistent with the normal String and str types than OsStrBuf.
alexcrichton
Jan 13, 2015
Member
This would be inconsistent, however, with the upcoming PathBuf and Path types.
seanmonstar
Jan 13, 2015
Contributor
It feels more inconsistent with the String types than with Paths.
aturon
Jan 13, 2015
Author
Member
The problem is that the String/str convention isn't applicable more broadly to new DST types; it's a special case. With paths we're going with PathBuf and Path, partly to introduce a more general convention.
We could potentially have a split convention here with a special case for string types. That seems unfortunate to me. In any case, it may be worth filing a separate RFC just to cover this conventions issue.
alexcrichton
Jan 16, 2015
Member
@sfackler, @nagisa, @seanmonstar, if we were to have OsString instead of OsStrBuf, how would you feel about having PathBuf/Path be the two types in that module? Would that strike you as odd, or do you think it seems different enough to not warrant a convention?
seanmonstar
Jan 16, 2015
Contributor
That seems like the better alternative to me. The PathBuf name could work, or could be something like PathBuilder, MutPath, etc.
nagisa
Jan 16, 2015
Contributor
From the API standpoint I tend to see Paths as an opaque black box with join, split, etc. methods, possibly not even backed by stringy data. OsString, on the other hand, is just another string type with possibly the same set of operations as String and str have. So having OsString be consistent with String is much more important to me than being consistent with Path.
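For concreteness, a minimal sketch of that distinction, using the names that eventually stabilized in std (`OsString` rather than `OsStrBuf`); the values are purely illustrative:

```rust
use std::ffi::OsString;
use std::path::{Path, PathBuf};

fn main() {
    // A Path is mostly driven through structural operations like `join`...
    let p: &Path = Path::new("/usr/local");
    let joined: PathBuf = p.join("bin");
    assert_eq!(joined, PathBuf::from("/usr/local/bin"));

    // ...whereas an OS string behaves like "just another string type":
    let mut var = OsString::from("LANG=");
    var.push("en_US.UTF-8");
    assert_eq!(var, OsString::from("LANG=en_US.UTF-8"));
}
```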
aturon
referenced this pull request
Jan 13, 2015
Merged
RFC: io and os reform: initial skeleton #517
nagisa
reviewed
Jan 13, 2015
```rust
#[cfg(unix)]
mod imp {
    type Buf = Vec<u8>;
    type Slice = [u8];
```
cc @nodakai, I found your previous comments about just using

(this is in reference to your

This I found to be an interesting point in general. As is now being brought up elsewhere, using WTF-8 means that we can never view a string originating from Windows without making an allocation somewhere along the line. For example an

However I'm not sure we can actually reasonably return views into OS-provided data (due to a lack of anywhere to anchor the lifetime to), and it's certainly more ergonomic to deal with owned data! I suppose what I'm getting at is that when we bring data from the OS into Rust I think we may always end up allocating (even for unix), so I'm not sure that

When handing data from Rust back to the OS, however, there are certainly some cases where we can avoid the allocation (if necessary). The only one I know of, however, is when an owned

Once I sat down to think about this I found it to not be quite true. If we assume

Whereas, if we represent

Here I'm using

Can you clarify where information is lost? We've tried to very specifically design the

That may be a bit rambly, but in general I think that I've convinced myself that WTF-8 is the right decision for the representation of

An alternative would be to choose WTF-8 for now (as I think @aturon may have already implemented most of it), but avoid exposing this detail (aka adopting this api). We would then free ourselves up to choose a different representation in the future if necessary, but it does come at the cost of
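As a rough illustration of the allocation question, a sketch using the API shape that later stabilized (`env::var_os`, `to_str`, `to_string_lossy`); the comments describe the trade-off being discussed, not guarantees of any particular representation:

```rust
use std::env;
use std::ffi::OsString;

fn main() {
    // Data crossing the OS boundary comes back owned, so an allocation happens
    // as the value enters Rust regardless of the internal encoding chosen.
    let path: OsString = env::var_os("PATH").unwrap_or_default();

    // Viewing it as &str is free when the data happens to be valid Unicode...
    if let Some(s) = path.to_str() {
        println!("PATH is {} bytes of UTF-8", s.len());
    }

    // ...while producing an owned String may copy (and, with a Vec<u16>-based
    // representation on Windows, would also have to transcode).
    println!("{}", path.to_string_lossy());
}
```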
May I suggest we use https://github.com/rust-lang/rfcs/pull/575/files?short_path=0372b19 ("Rich diff") for the "rendering" link? This clearly shows only the changed part. Same for other PRs of this kind.
Florob
reviewed
Jan 14, 2015
```rust
my_ut8_data.to_wtf_8().to_ucs2().as_u16_slice() == my_utf8_data.to_utf16().as_16_slice()
```
Florob
Jan 14, 2015
At the very least this needs a rewording. UCS-2 inherently can not represent certain USVs, while UTF-8 can represent arbitrary USVs. This means that a real to_ucs2() implementation can fail. This paragraph is clearly assuming that to_ucs2() is implemented as to_utf16() (i.e. creates surrogate pairs). This makes it quite unsurprising that the result after chaining it with a nop (to_wtf_8()) is the same.
This line also has quite a few typos, it should probably be:
my_utf8_data.to_wtf8().to_ucs2().as_u16_slice() == my_utf8_data.to_utf16().as_u16_slice()
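For reference, `to_wtf8()`/`to_ucs2()` are the RFC's hypothetical names; std only exposes the UTF-16 half of the equivalence, which is enough to sketch the intent (surrogate pairs are produced for characters outside the BMP):

```rust
fn main() {
    // U+10437 is outside the BMP, so UTF-16 needs a surrogate pair for it.
    let my_utf8_data = "a\u{10437}";
    let utf16: Vec<u16> = my_utf8_data.encode_utf16().collect();
    assert_eq!(utf16, vec![0x0061u16, 0xD801, 0xDC37]);
}
```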
retep998
Jan 14, 2015
Member
WTF-8 treats valid surrogate pairs in UCS-2 as if they were UTF-16, while invalid surrogates are encoded as themselves in WTF-8. This allows UCS-2 to represent all of Unicode and WTF-8 to represent any possible sequence of WCHARs.
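A small Windows-only sketch of that property, using the `OsStringExt`/`OsStrExt` conversions that later stabilized; the round-trip shown is exactly the guarantee being proposed here:

```rust
// Windows-only sketch: a lone surrogate (ill-formed UTF-16) still round-trips,
// because WTF-8 encodes unpaired surrogates as themselves.
#[cfg(windows)]
fn main() {
    use std::ffi::OsString;
    use std::os::windows::ffi::{OsStrExt, OsStringExt};

    let wide = vec![0x0061u16, 0xD800, 0x0062]; // 'a', lone high surrogate, 'b'
    let os = OsString::from_wide(&wide);
    assert_eq!(os.to_str(), None); // not valid Unicode, so no &str view
    assert_eq!(os.encode_wide().collect::<Vec<u16>>(), wide); // nothing lost
}

#[cfg(not(windows))]
fn main() {}
```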
nodakai
Jan 14, 2015
Because notions like surrogate pairs and supplementary planes didn't exist until Unicode 2.0 defined UTF-16 to obsolete UCS-2 (correct me if I'm wrong), it's better to avoid using the term "UCS-2" here...
SimonSapin
Jan 15, 2015
Contributor
What @aturon calls “UCS-2” here is arbitrary [u16] that is interpreted as UTF-16 when surrogates happen to be paired. http://justsolve.archiveteam.org/wiki/UCS-2 (linked from the RFC) has more background and details.
“Potentially ill-formed UTF-16” is IMO the most accurate term for it, but it’s a bit unwieldy.
SimonSapin
Jan 15, 2015
Contributor
That said, I agree avoiding “UCS-2” entirely would be better, since it means different things to different people. I’m +0 on “wide string”, proposed below, if it’s properly defined.
Florob
reviewed
Jan 14, 2015
(Text from @SimonSapin)

Rather than WTF-8, `OsStr` and `OsStrBuf` on Windows could use
potentially-ill-formed UTF-16 (a.k.a. "wide" strings), with a
Florob
Jan 14, 2015
What is called «potentially-ill-formed UTF-16 (a.k.a. "wide" strings)» here is referred to as «UCS-2» everywhere else in this document. From a correctness standpoint I'd prefer if the term «wide strings» was introduced upfront and consistently used throughout.
Florob
reviewed
Jan 14, 2015
Downsides:
* More expensive conversions between `OsStr` / `OsStrBuf` and `str` / `String`.
* These conversions have inconsistent performance characteristics between platforms. (Need to allocate on Windows, but not on Unix.)
Florob
Jan 14, 2015
Personally I'd argue that the inconsistent performance characteristics exist either way. The difference is whether they exist on OS calls, or on OsStrBuf creation. It seems to me that it is not unlikely that the same OsStr would be used to call a function multiple times, or to call multiple different functions. To me this strongly suggests putting the cost at creation time, not call time.
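A sketch of the pay-once-at-creation pattern; the `stat_twice` helper and the file name are hypothetical, and any set of OS calls taking the same name would illustrate the same point:

```rust
use std::ffi::{OsStr, OsString};
use std::fs;

// Hypothetical helper standing in for "several OS calls taking the same name".
fn stat_twice(name: &OsStr) {
    let _ = fs::metadata(name);         // first call
    let _ = fs::symlink_metadata(name); // second call reuses the same &OsStr
}

fn main() {
    // Any encoding conversion is paid once, when the owned value is created...
    let name = OsString::from("Cargo.toml");
    // ...and the borrowed &OsStr can then be handed to as many calls as needed.
    stat_twice(&name);
}
```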
nodakai
commented
Jan 14, 2015
What I meant was: when we want to perform any non-trivial manipulation of, say, filenames expressed in
In other words, my claim is that the benefit of implementing
I think for the sake of a thought experiment, it's probably worth it to at least consider this! I've been told by @SimonSapin (correct me if I'm wrong), however, that an arbitrary
So if we were to define

With this information, we are faced with the insurmountable truth that it is impossible to losslessly convert an

Let's take a second and look at your example of converting digits to roman numerals. Let's assume that a program is faced with the problem of converting a file which is not valid unicode (and hence cannot in any way be interpreted as

All in all, I think that this does largely boil down to the fact that it's unfortunate that

Now taking the "

All in all I think that the fact that we'll be dealing with two types of strings is just that, a fact. The choice of WTF-8 encoding for

By the way thanks for taking the time to talk this through, I know I certainly appreciate thinking about this critically!
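A sketch of that "two kinds of strings" situation; `rename_plan` is a hypothetical helper, and the checked conversion is shown with the `to_str` name that std eventually used (the RFC text calls it `as_str`):

```rust
use std::ffi::OsStr;

// Textual work (like the digits-to-roman-numerals renaming discussed above) is
// only possible on names that really are Unicode; everything else needs an
// explicit policy: skip, report, or fall back to the raw platform data.
fn rename_plan(name: &OsStr) -> Option<String> {
    // Purely illustrative edit; `None` means "not valid Unicode".
    name.to_str().map(|text| text.replace("IV", "4"))
}

fn main() {
    assert_eq!(
        rename_plan(OsStr::new("chapter_IV.txt")),
        Some("chapter_4.txt".to_string())
    );
}
```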
nodakai
commented
Jan 15, 2015
No, that isn't a fair comparison. It's obscure how you got "
vs
and we have one more allocation with WTF-8-based
Update: Sorry, such an optimization does NOT apply to
The comparison should be:
vs either of
or
(I'm not sure
It's perhaps a minor point at this stage, but if you are willing to let your final
If you do need to reuse the
Both of these leverage the fact that UTF-8 is valid WTF-8. That could allow
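For illustration, the UTF-8-is-valid-WTF-8 direction with the conversions that later stabilized; whether a buffer is literally reused is an implementation detail, so treat the "without re-encoding" comment as the intent rather than a guarantee:

```rust
use std::ffi::OsString;

fn main() {
    // UTF-8 is valid WTF-8 (and valid raw bytes on Unix), so an owned String
    // can become an OS string without re-encoding its contents.
    let s = String::from("already UTF-8");
    let os: OsString = OsString::from(s);
    assert_eq!(os.to_str(), Some("already UTF-8"));

    // Going back is equally cheap when the data really is UTF-8:
    let back: String = os.into_string().expect("was valid UTF-8");
    assert_eq!(back, "already UTF-8");
}
```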
Thinking about this more, I'm still not sure if there's any issue with having the null terminator in

But that brings me to another question: are the following expecting a correctly terminated string or not, and if so do they check the last character? If they do check, shouldn't they return an

```rust
#[cfg(unix)]
trait OsStrBufExt {
    fn from_vec(Vec<u8>) -> Self;
}

#[cfg(windows)]
trait OsStrBufExt {
    fn from_wide_slice(&[u16]) -> Self;
}
```

(You can say that there's a similar problem for interior nulls, but interior nulls are not really a memory safety problem because they don't cause reads past the end of the string. For a missing null terminator, the memory issue technically occurs in foreign code, but it would be Rust that failed to enforce the relevant invariant, so I'd say that this sort of bug should only be occurring in unsafe code.)
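One way this could shake out, sketched with today's `CString` on Unix; this is an assumption about where the invariant ends up living, not something the RFC mandates:

```rust
// Unix-only sketch: enforce the nul invariants at the point where a name is
// handed to a C API, instead of baking a terminator into the OS-string type.
#[cfg(unix)]
fn main() {
    use std::ffi::{CString, OsStr};
    use std::os::unix::ffi::OsStrExt;

    // CString::new appends the terminating nul and rejects interior nuls.
    assert!(CString::new(OsStr::new("ok").as_bytes()).is_ok());
    assert!(CString::new(OsStr::new("bad\0name").as_bytes()).is_err());
}

#[cfg(not(unix))]
fn main() {}
```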
A lot of WinAPI methods actually have a parameter to explicitly specify the string length rather than relying on terminating nulls, so it would be silly to impose that requirement always. Instead, there should internally be a way, during conversion from WTF-8 to UCS-2, to ensure there are no interior nulls and that there is a null terminator; for methods where that isn't needed, there can be a simpler conversion that doesn't do any null checks. The programmer won't be exposed to this issue on Windows.
nodakai
commented
Jan 15, 2015
I didn't mean we could do without

(Another random idea: we might as well extend WTF-8 even further to encode arbitrary lone bytes. I believe there's still plenty of room in the UTF-8 code space, and if not, we can use the 5-byte codes abandoned by the Unicode consortium!)

That said, I'm still not comfortable with valuing the convenience of out-of-the-box
Oh sorry yes I should have clarified. I was thinking of APIs like getting the environment (like you mentioned) or
Oh good point! I had forgotten about this. I should say though that I don't want to go too far into the weeds discussing allocations and
I agree, we're already using totally different semantic representations on unix than windows, so I think we're fine to spec this however we like. In general I think that your comments as well as @retep998's make me think that we should not make any promises about a nul terminator for now, allowing us to tweak the representation in the future pending various analyses or benchmarks.
I suspect that like
Whoa! I did not know this! Do you have some examples of APIs? I was thinking we would forbid interior nuls for
Oops sorry, hit the comment button too soon:
Indeed! I think we'll start off with inherent methods as necessary on
Could you elaborate a little more on why you're uncomfortable? (totally fine that you are!) So far the pieces that have been talked about are:
I do agree that we shouldn't be doing much that you're not expecting, but so far choosing WTF-8 over
nrc
assigned
aturon
Jan 15, 2015
Look at something simple like
aturon
added a commit
to aturon/rust
that referenced
this pull request
Jan 21, 2015
aturon
added a commit
to aturon/rust
that referenced
this pull request
Jan 22, 2015
I've pushed a small update that changes
None of the methods here that can have errors report specifics of what the error was. I think that's probably in line with our other decoding APIs, but it's a notable limitation.

As just a design for adding OS-specific types, and not a design for integrating them into other APIs, I feel pretty good about adding this as unstable and seeing how it goes. Would be nice if os-strings could have their own feature name.
Oh, also, this RFC considers Windows and Unix as 'all platforms'. I can't think of any reason the arguments here are invalidated for other platforms, but we should be careful.
The fundamental aspect of this addition to the I/O RFC has had quite broad approval for some time now. It sounds like everyone is in agreement that

There are legitimate worries about the ergonomics of having

There have also been concerns about the choice of WTF-8 on Windows and performance on Windows. External opinions seem to indicate that this may not be true in all situations, and I suspect that we will quickly learn about the performance implications on Windows. The implementation will be initially

In light of this information, I'm going to merge this RFC in order to start making progress on the I/O RFCs and start hammering out some of the more nitty-gritty implementation details. I'd like to thank everyone for their comments, it has been super helpful in shaping this aspect! Please feel free to take a look at the current PR for an implementation and leave any comments as well!
alexcrichton
merged commit 27bd6ad
into
rust-lang:master
Jan 23, 2015
aturon
added a commit
to aturon/rust
that referenced
this pull request
Jan 23, 2015
aturon
added a commit
to aturon/rust
that referenced
this pull request
Jan 23, 2015
aturon
added a commit
to aturon/rust
that referenced
this pull request
Jan 23, 2015
I believe this RFC’s design allows adding support for more platforms backward-compatibly if we need/want to.
aturon
added a commit
to aturon/rust
that referenced
this pull request
Jan 23, 2015
bors
added a commit
to rust-lang/rust
that referenced
this pull request
Jan 23, 2015
aturon
added a commit
to aturon/rust
that referenced
this pull request
Jan 23, 2015
bors
added a commit
to rust-lang/rust
that referenced
this pull request
Jan 23, 2015
aturon
added a commit
to aturon/rust
that referenced
this pull request
Jan 24, 2015
bors
added a commit
to rust-lang/rust
that referenced
this pull request
Jan 24, 2015
aturon
added a commit
to aturon/rust
that referenced
this pull request
Jan 24, 2015
aturon
added a commit
to aturon/rust
that referenced
this pull request
Jan 24, 2015
bors
added a commit
to rust-lang/rust
that referenced
this pull request
Jan 24, 2015
tbu-
reviewed
Jan 26, 2015
```rust
impl OsStr {
    pub fn from_str(value: &str) -> &OsStr;
    pub fn as_str(&self) -> Option<&str>;
```
tbu-
Jan 26, 2015
Contributor
I think this method should not be made available. All legitimate uses of this function are covered by to_string_lossy, and it encourages failing on non-UTF8 strings.
quantheory
Jan 27, 2015
Contributor
I disagree with this. It is legitimate for a function to fail or select a fallback behavior if it has good reason to expect Unicode data from a system call, and that's not what it gets. (It may not be a good idea to panic in most cases, but returning an error or special value is legitimate.)
In fact, for any situation that emphasizes correctness over robustness, I would have the opposite worry. Specifically, that to_string_lossy will end up being used when non-Unicode data should be rejected entirely, or when non-Unicode data is actually expected and needs to be handled losslessly. Admittedly, in the latter case, the user should deal with the platform-specific u8/u16 representation (or their own custom type for the expected encoding) instead of converting to str.
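To make the two policies concrete: the `expect_utf8` helper is hypothetical, and the checked conversion is shown under the `to_str` name std ended up with (the RFC calls it `as_str`):

```rust
use std::ffi::OsStr;

// The two policies under discussion: reject non-Unicode data outright versus
// accept it with replacement characters.
fn expect_utf8(value: &OsStr) -> Result<&str, &'static str> {
    value.to_str().ok_or("peer sent non-UTF-8 data")
}

fn main() {
    let v = OsStr::new("hello");
    assert_eq!(expect_utf8(v), Ok("hello"));
    // The lossy conversion never fails, but silently substitutes U+FFFD:
    assert_eq!(v.to_string_lossy(), "hello");
}
```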
tbu-
Jan 27, 2015
Contributor
@quantheory I can't think of an example of an application that would need this to_string instead of the to_string_lossy.
Florob
Jan 27, 2015
I'm under the impression that this would lock us into using WTF-8 on Windows. Is that intentional?
tbu-
Jan 28, 2015
Contributor
@quantheory My feeling about your use case is that you need to drop down to platform-specific calls anyway if you want to support filenames as you suggest ("spelling out the bytes"), because e.g. Windows paths cannot be represented as bytes in a meaningful way.
Otherwise your program would just drop out on handling non-Unicode paths which would be very unfortunate IMO.
quantheory
Jan 29, 2015
Contributor
I'm not talking about filenames. I'm talking about information that's going to end up in Rust code, XML documents, and possibly other Unicode file formats that are not always for human consumption. (Purely for debugging purposes you may still want them to be as readable as reasonably possible.) At the moment (though as likely as not this will change) the sending/calling process knows ahead of time whether or not a sequence of bytes is guaranteed to be valid Unicode, and is responsible for doing any necessary translation to ensure that the receiving process always gets Unicode, as long as it gets the message at all.
But this is really getting more specific than it needs to be. My real point is that processes send information to co-designed programs or other instances of themselves in ways that incidentally pass through the OS. You can receive a string from a system call that you know actually originates within (another process of) your own application, or a specific other application that uses a known encoding. If what you get is somehow malformed anyway, that's not a normal situation, and the receiving process has no way to continue normally without knowing what went wrong on the other end.
(Also, unless the program just uses unwrap/expect on everything it gets, a None value doesn't necessarily mean that it will "just drop out" if it doesn't get Unicode. We're not looking at a function that just panics here.)
tbu-
Jan 30, 2015
Contributor
@quantheory Note that this OsString is not used for file contents or other stuff, but just for file names.
quantheory
Jan 30, 2015
Contributor
I'm not talking about file contents, I'm talking about, for instance, the proposed std::env and std::process, as used to communicate short strings between processes. These will apparently use OsStr/OsString.
File names are Path/PathBuf, not OsStr/OsString. (They will have the same internal representation, I think, but they have different interfaces. Path will also implement as_str, though, according to RFC 474.)
tbu-
reviewed
Jan 26, 2015
```rust
pub fn from_string(String) -> OsString;
pub fn from_str(&str) -> OsString;
pub fn as_slice(&self) -> &OsStr;
pub fn into_string(Self) -> Result<String, OsString>;
```
aturon commented Jan 13, 2015
The IO reform RFC is being split into several semi-independent pieces, posted as PRs like this one.
This RFC amendment adds the string handling section.
Rendered