Amend to RFC 517: add subsection on string handling #575
Conversation
```rust
pub mod os_str {
    /// Owned OS strings
    struct OsStrBuf {
        // ...
```
Bikeshed: having `OsString` and `OsStr` seems more consistent with the normal `String` and `str` types than `OsStrBuf`.
Totally
This would be inconsistent, however, with the upcoming `PathBuf` and `Path` types.
It feels more inconsistent with the `String` types than with `Path`s.
The problem is that the `String`/`str` convention isn't applicable more broadly to new DST types; it's a special case. With paths we're going with `PathBuf` and `Path`, partly to introduce a more general convention.

We could potentially have a split convention here with a special case for string types. That seems unfortunate to me. In any case, it may be worth filing a separate RFC just to cover this conventions issue.
@sfackler, @nagisa, @seanmonstar, if we were to have `OsString` instead of `OsStrBuf`, how would you feel about having `PathBuf`/`Path` be the two types in that module? Would that strike you as odd, or do you think it seems different enough not to warrant a convention?
That seems like the better alternative to me. The `PathBuf` name could work, or could be something like `PathBuilder`, `MutPath`, etc.
From the API standpoint I tend to see `Path`s as an opaque black box with `join`, `split`, etc. methods, possibly not even backed by stringy data. `OsString`, on the other hand, is just another string type with possibly the same set of operations as `String` and `str` have. So to me, having `OsString` be consistent with `String` is much more important than being consistent with `Path`.
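For illustration, today's `Path` API (whose final shape postdates this thread) does read as an opaque value with structural methods rather than as a string type; a minimal sketch:

```rust
use std::path::Path;

fn main() {
    // Path exposes structural operations (join, parent, file_name)
    // rather than string-like indexing or slicing.
    let p = Path::new("/usr/local").join("bin");
    assert_eq!(p.file_name().unwrap(), "bin");
    assert_eq!(p.parent(), Some(Path::new("/usr/local")));
}
```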
```rust
#[cfg(unix)]
mod imp {
    type Buf = Vec<u8>;
    type Slice = [u8;
```
Missing `]`.
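For context, a sketch of what the corresponding Windows arm might look like, assuming (as discussed later in this thread) that the Windows representation is backed by WTF-8; the `Wtf8Buf`/`Wtf8` names are assumptions mirroring the WTF-8 discussion, not quoted from the diff:

```rust
// Hypothetical sketch, not quoted from the RFC diff.
#[cfg(windows)]
mod imp {
    type Buf = Wtf8Buf; // owned WTF-8 buffer
    type Slice = Wtf8;  // borrowed WTF-8 slice
}
```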
cc @nodakai, I found your previous comments about just using the platform's native representation to be an interesting point in general. As is now being brought up elsewhere, using WTF-8 means that we can never view a string originating from Windows without making an allocation somewhere along the line. However, I'm not sure we can actually reasonably return views into OS-provided data (due to the lack of anywhere to anchor the lifetime to), and it's certainly more ergonomic to deal with owned data! I suppose what I'm getting at is that when we bring data from the OS into Rust we may always end up allocating (even for unix). When handing data from Rust back to the OS, however, there are certainly some cases where we can avoid the allocation (if necessary), e.g. when an owned buffer can be handed back as-is.
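As a concrete illustration with today's std names (which postdate this thread): fetching an environment variable hands back an owned `OsString`, so the allocation, and on Windows the UTF-16-to-WTF-8 re-encode, has already happened by the time Rust code sees the value.

```rust
use std::env;
use std::ffi::OsString;

fn main() {
    // The OS hands us a buffer; std copies it into an owned OsString
    // (re-encoding to WTF-8 on Windows) rather than lending a view.
    let path: Option<OsString> = env::var_os("PATH");
    println!("PATH is set: {}", path.is_some());
}
```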
Once I sat down to think about this, though, I found it to not be quite true. If we assume …; whereas, if we represent …. (Here I'm using ….)

Can you clarify where information is lost? We've tried very specifically to design the API so that no information is lost. That may be a bit rambly, but in general I think I've convinced myself that WTF-8 is the right decision for the representation of `OsStr` on Windows.

An alternative would be to choose WTF-8 for now (as I think @aturon may have already implemented most of it) but avoid exposing this detail (aka not adopting this API). We would then free ourselves up to choose a different representation in the future if necessary, but it does come at a cost.
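A sketch of the lossless round-trip under discussion, written against the Windows extension traits as they exist in today's std (names that postdate this thread): any `[u16]`, including unpaired surrogates, survives a trip through the WTF-8-backed `OsString` unchanged.

```rust
#[cfg(windows)]
fn roundtrip() {
    use std::ffi::OsString;
    use std::os::windows::ffi::{OsStrExt, OsStringExt};

    // 'H', an unpaired high surrogate, then 'i': not valid UTF-16.
    let wide: Vec<u16> = vec![0x0048, 0xD800, 0x0069];
    let os = OsString::from_wide(&wide); // stored internally as WTF-8
    let back: Vec<u16> = os.encode_wide().collect();
    assert_eq!(back, wide); // no information was lost
}
```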
May I suggest we use https://github.com/rust-lang/rfcs/pull/575/files?short_path=0372b19 ("Rich diff") for the "rendering" link? This clearly shows only the changed part. Same for other PRs of this kind.
```rust
my_ut8_data.to_wtf_8().to_ucs2().as_u16_slice() == my_utf8_data.to_utf16().as_16_slice()
```
At the very least this needs a rewording. UCS-2 inherently cannot represent certain USVs, while UTF-8 can represent arbitrary USVs. This means that a real `to_ucs2()` implementation can fail. This paragraph is clearly assuming that `to_ucs2()` is implemented as `to_utf16()` (i.e. creates surrogate pairs), which makes it quite unsurprising that the result after chaining it with a no-op (`to_wtf_8()`) is the same.
This line also has quite a few typos; it should probably be:

```rust
my_utf8_data.to_wtf8().to_ucs2().as_u16_slice() == my_utf8_data.to_utf16().as_u16_slice()
```
WTF-8 treats valid surrogate pairs in UCS-2 as if they were UTF-16, while unpaired surrogates are encoded as themselves in WTF-8. This allows UCS-2 to represent all of Unicode, and WTF-8 to represent any possible sequence of `WCHAR`s.
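Today's `char::decode_utf16` (which postdates this thread) makes the distinction concrete: paired surrogates decode to a supplementary-plane character, while a lone surrogate is reported as an error.

```rust
fn main() {
    // A valid surrogate pair decodes to U+1F600.
    let paired = [0xD83Du16, 0xDE00];
    let c = char::decode_utf16(paired).next().unwrap().unwrap();
    assert_eq!(c, '\u{1F600}');

    // An unpaired high surrogate is an error, not a char.
    let lone = [0xD800u16];
    assert!(char::decode_utf16(lone).next().unwrap().is_err());
}
```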
Because notions like surrogate pairs and supplementary planes didn't exist until Unicode 2.0 defined UTF-16 to obsolete UCS-2 (correct me if I'm wrong), it's better to avoid using the term "UCS-2" here...
What @aturon calls “UCS-2” here is an arbitrary `[u16]` that is interpreted as UTF-16 when surrogates happen to be paired. http://justsolve.archiveteam.org/wiki/UCS-2 (linked from the RFC) has more background and details.

“Potentially ill-formed UTF-16” is IMO the most accurate term for it, but it’s a bit unwieldy.
That said, I agree avoiding “UCS-2” entirely would be better, since it means different things to different people. I’m +0 on “wide string”, proposed below, if it’s properly defined.
What I meant was: when we want to perform any non-trivial manipulation of, say, filenames expressed in ….

In other words, my claim is that the benefit of implementing ….
I think for the sake of a thought experiment it's probably worth it to at least consider this! I've been told by @SimonSapin (correct me if I'm wrong), however, that an arbitrary `[u16]` is not necessarily valid UTF-16.

So if we were to define …. With this information, we are faced with the insurmountable truth that it is impossible to losslessly convert an arbitrary `[u16]` to a `String`.

Let's take a second and look at your example of converting digits to roman numerals. Let's assume that a program is faced with the problem of converting a file which is not valid Unicode (and hence cannot in any way be interpreted as a `str`)….

All in all, I think that this does largely boil down to the fact that it's unfortunate that …. The fact that we'll be dealing with two types of strings is just that, a fact. The choice of WTF-8 encoding for ….

By the way, thanks for taking the time to talk this through; I certainly appreciate thinking about this critically!
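The impossibility being described can be shown with today's std: conversion from `[u16]` to `String` is fallible, and the lossy variant substitutes U+FFFD, discarding the original code unit.

```rust
fn main() {
    // 'a', an unpaired surrogate, 'b': an arbitrary [u16] that is
    // ill-formed as UTF-16.
    let ill_formed = [0x0061u16, 0xD800, 0x0062];
    assert!(String::from_utf16(&ill_formed).is_err());
    // The lossy conversion succeeds but loses the 0xD800 code unit.
    assert_eq!(String::from_utf16_lossy(&ill_formed), "a\u{FFFD}b");
}
```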
No, that isn't a fair comparison. It's obscure how you got "…" vs "…", and we have one more allocation with the WTF-8-based approach.

**Update**: Sorry, such an optimization does NOT apply to …. The comparison should be "…" vs either of "…" or "…". (I'm not sure ….)
It's perhaps a minor point at this stage, but if you are willing to let your final ….

If you do need to reuse the ….

Both of these leverage the fact that UTF-8 is valid WTF-8. That could allow ….
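A small illustration of that fact with today's std API: because UTF-8 is a subset of WTF-8, a `String` can become an `OsString` without re-encoding, and can be recovered intact.

```rust
use std::ffi::OsString;

fn main() {
    let s = String::from("hello");
    // UTF-8 bytes are already valid WTF-8, so this conversion does
    // not need to re-encode the buffer.
    let os: OsString = s.into();
    // Converting back is fallible in general, but succeeds here.
    assert_eq!(os.into_string().unwrap(), "hello");
}
```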
Thinking about this more, I'm still not sure if there's any issue with having the null terminator in …. But that brings me to another question: are the following expecting a correctly terminated string or not, and if so, do they check the last character? If they do check, shouldn't they return an error?

```rust
#[cfg(unix)]
trait OsStrBufExt {
    fn from_vec(Vec<u8>) -> Self;
}

#[cfg(windows)]
trait OsStrBufExt {
    fn from_wide_slice(&[u16]) -> Self;
}
```

(You can say that there's a similar problem for interior nulls, but interior nulls are not really a memory-safety problem because they don't cause reads past the end of the string. For a missing null terminator, the memory issue technically occurs in foreign code, but it would be Rust that failed to enforce the relevant invariant, so I'd say that this sort of bug should only be occurring in unsafe code.)
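For comparison, this is the invariant `std::ffi::CString` ended up enforcing for C strings: the constructor rejects interior nuls and appends the terminator itself, so only `unsafe` code can hand an unterminated string to foreign code.

```rust
use std::ffi::CString;

fn main() {
    // An interior nul is rejected up front rather than silently
    // truncating the string on the C side.
    assert!(CString::new("with\0inside").is_err());

    // A valid string gets its terminating nul appended automatically.
    let ok = CString::new("no interior nul").unwrap();
    assert_eq!(ok.as_bytes_with_nul().last(), Some(&0));
}
```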
A lot of WinAPI functions actually have a parameter to explicitly specify the string length rather than relying on terminating nulls, so it would be silly to always impose that requirement. Instead, there should internally be a way, during conversion from WTF-8 to UCS-2, to ensure there are no interior nulls and that there is a null terminator; but for functions where that isn't needed, a simpler conversion that doesn't do any null checks. The programmer won't be exposed to this issue on Windows.
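A minimal sketch of the two conversion paths being described, written against today's `OsStrExt::encode_wide` (the function names `to_wide_checked`/`to_wide_unchecked` are invented for illustration):

```rust
#[cfg(windows)]
mod wide {
    use std::ffi::OsStr;
    use std::os::windows::ffi::OsStrExt;

    /// For APIs that require a nul-terminated wide string.
    pub fn to_wide_checked(s: &OsStr) -> Option<Vec<u16>> {
        let mut v: Vec<u16> = s.encode_wide().collect();
        if v.contains(&0) {
            return None; // an interior nul would truncate the string
        }
        v.push(0); // append the terminating nul
        Some(v)
    }

    /// For APIs that take an explicit length: no nul checks needed.
    pub fn to_wide_unchecked(s: &OsStr) -> Vec<u16> {
        s.encode_wide().collect()
    }
}
```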
I didn't mean we could do without ….

(Another random idea: we might as well extend WTF-8 even further to encode arbitrary lone bytes. I believe there's still plenty of room in the UTF-8 code space, and if not, we can use the 5-byte codes abandoned by the Unicode consortium!)

That said, I'm still not comfortable with valuing the convenience of out-of-the-box ….
Oh sorry, yes, I should have clarified. I was thinking of APIs like getting the environment (like you mentioned) or ….

Oh, good point! I had forgotten about this. I should say, though, that I don't want to go too far into the weeds discussing allocations and ….

I agree: we're already using totally different semantic representations on unix than on windows, so I think we're fine to spec this however we like. In general, your comments as well as @retep998's make me think that we should not make any promises about a nul terminator for now, allowing us to tweak the representation in the future pending various analyses or benchmarks.

I suspect that, like ….

Whoa! I did not know this! Do you have some examples of APIs? I was thinking we would forbid interior nuls for ….
Oops, sorry, hit the comment button too soon:

Indeed! I think we'll start off with inherent methods as necessary on ….

Could you elaborate a little more on why you're uncomfortable? (It's totally fine that you are!) So far the pieces that have been talked about are: ….

I do agree that we shouldn't be doing much that you're not expecting, but so far choosing WTF-8 over ….
Look at something simple like ….
I've pushed a small update that changes ….
None of the methods here that can have errors report specifics of what the error was. I think that's probably in line with our other decoding APIs, but it's a notable limitation. As just a design for adding OS-specific types, and not a design for integrating them into other APIs, I feel pretty good about adding this as unstable and seeing how it goes. It would be nice if OS strings could have their own feature name.

Oh, also, this RFC considers Windows and Unix as "all platforms". I can't think of any reason the arguments here are invalidated for other platforms, but we should be careful.
The fundamental aspect of this addition to the I/O RFC has had quite broad approval for some time now. It sounds like everyone is in agreement that ….

There are legitimate worries about the ergonomics of having …. There have also been concerns about the choice of WTF-8 on Windows and about performance on Windows. External opinions seem to indicate that this may not be true in all situations, and I suspect that we will quickly learn about the performance implications on Windows. The implementation will initially be unstable.

In light of this information, I'm going to merge this RFC in order to start making progress on the I/O RFCs and start hammering out some of the more nitty-gritty implementation details. I'd like to thank everyone for their comments; they have been super helpful in shaping this aspect! Please feel free to take a look at the current PR for an implementation and leave any comments as well!
I believe this RFC’s design allows adding support for more platforms backward-compatibly if we need/want to.
Per [RFC 517](rust-lang/rfcs#575), this commit introduces platform-native strings. The API is essentially as described in the RFC. The WTF-8 implementation is adapted from @SimonSapin's [implementation](https://github.com/SimonSapin/rust-wtf8). To make this work, some encoding and decoding functionality in `libcore` is now exported in a "raw" fashion reusable for WTF-8. These exports are *not* reexported in `std`, nor are they stable.
```rust
impl OsStr {
    pub fn from_str(value: &str) -> &OsStr;
    pub fn as_str(&self) -> Option<&str>;
    // ...
```
I think this method should not be made available. All legitimate uses of this function are covered by `to_string_lossy`, and it encourages failing on non-UTF-8 strings.
I disagree with this. It is legitimate for a function to fail or select a fallback behavior if it has good reason to expect Unicode data from a system call, and that's not what it gets. (It may not be a good idea to panic in most cases, but returning an error or special value is legitimate.)
In fact, for any situation that emphasizes correctness over robustness, I would have the opposite worry: specifically, that `to_string_lossy` will end up being used when non-Unicode data should be rejected entirely, or when non-Unicode data is actually expected and needs to be handled losslessly. Admittedly, in the latter case, the user should deal with the platform-specific `u8`/`u16` representation (or their own custom type for the expected encoding) instead of converting to `str`.
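A sketch of the correctness-first usage being described, using the `Option` return to surface a real error instead of substituting replacement characters (the RFC spells this method `as_str`; today's std spells it `to_str`):

```rust
use std::ffi::OsStr;

// If the data is expected to be Unicode, a None here is a genuine
// error worth reporting, not something to paper over lossily.
fn require_unicode(s: &OsStr) -> Result<&str, String> {
    s.to_str()
        .ok_or_else(|| format!("expected UTF-8 data, got {:?}", s))
}
```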
@quantheory I can't think of an example of an application that would need this `to_string` instead of the `to_string_lossy`.
I'm under the impression that this would lock us into using WTF-8 on Windows. Is that intentional?
@Florob Why do you think so?
@quantheory My feeling about your use case is that you need to drop down to platform-specific calls anyway if you want to support filenames as you suggest ("spelling out the bytes"), because e.g. Windows paths cannot be represented as bytes in a meaningful way.
Otherwise your program would just drop out on handling non-Unicode paths, which would be very unfortunate IMO.
I'm not talking about filenames. I'm talking about information that's going to end up in Rust code, XML documents, and possibly other Unicode file formats that are not always for human consumption. (Purely for debugging purposes you may still want them to be as readable as reasonably possible.) At the moment (though as likely as not this will change) the sending/calling process knows ahead of time whether or not a sequence of bytes is guaranteed to be valid Unicode, and is responsible for doing any necessary translation to ensure that the receiving process always gets Unicode, as long as it gets the message at all.
But this is really getting more specific than it needs to be. My real point is that processes send information to co-designed programs or other instances of themselves in ways that incidentally pass through the OS. You can receive a string from a system call that you know actually originates within (another process of) your own application, or a specific other application that uses a known encoding. If what you get is somehow malformed anyway, that's not a normal situation, and the receiving process has no way to continue normally without knowing what went wrong on the other end.
(Also, unless the program just uses `unwrap`/`expect` on everything it gets, a `None` value doesn't necessarily mean that it will "just drop out" if it doesn't get Unicode. We're not looking at a function that just panics here.)
@quantheory Note that this `OsString` is not used for file contents or other stuff, but just for file names.
I'm not talking about file contents. I'm talking about, for instance, the proposed `std::env` and `std::process`, as used to communicate short strings between processes. These will apparently use `OsStr`/`OsString`.

File names are `Path`/`PathBuf`, not `OsStr`/`OsString`. (They will have the same internal representation, I think, but they have different interfaces. `Path` will also implement `as_str`, though, according to RFC 474.)
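In today's std the relationship ended up much as described: `Path` and `OsStr` share a representation and convert freely, but expose different interfaces. A small illustration:

```rust
use std::ffi::OsStr;
use std::path::Path;

fn main() {
    let p = Path::new("/tmp/file.txt");
    // Zero-cost view of the same data as an OsStr, and back again.
    let os: &OsStr = p.as_os_str();
    assert_eq!(Path::new(os), p);
    // Path also offers the checked str view mentioned above.
    assert!(p.to_str().is_some());
}
```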
So maybe this should be called `to_string_strict` instead of `to_string`?
The IO reform RFC is being split into several semi-independent pieces, posted as PRs like this one.
This RFC amendment adds the string handling section.
Rendered