
Amend to RFC 517: add subsection on string handling #575

Merged
merged 2 commits into from Jan 23, 2015

@aturon
Member

aturon commented Jan 13, 2015

The IO reform RFC is being split into several semi-independent pieces, posted as PRs like this one.

This RFC amendment adds the string handling section.

Rendered

@aturon aturon changed the title RFC 517: add subsection on string handling Amendment to RFC 517: add subsection on string handling Jan 13, 2015

@aturon aturon changed the title Amendment to RFC 517: add subsection on string handling Amend to RFC 517: add subsection on string handling Jan 13, 2015

```rust
pub mod os_str {
    /// Owned OS strings
    struct OsStrBuf {
```

@sfackler

sfackler Jan 13, 2015

Member

Bikeshed: Having OsString and OsStr seems more consistent with the normal String and str types than OsStrBuf.

@nagisa

nagisa Jan 13, 2015

Contributor

Totally

@alexcrichton

alexcrichton Jan 13, 2015

Member

This would be inconsistent, however, with the upcoming PathBuf and Path types.

@seanmonstar

seanmonstar Jan 13, 2015

Contributor

It feels more inconsistent with the String types than with Paths.

@aturon

aturon Jan 13, 2015

Author Member

The problem is that the String/str convention isn't applicable more broadly to new DST types; it's a special case. With paths we're going with PathBuf and Path, partly to introduce a more general convention.

We could potentially have a split convention here with a special case for string types. That seems unfortunate to me. In any case, it may be worth filing a separate RFC just to cover this conventions issue.

@alexcrichton

alexcrichton Jan 16, 2015

Member

@sfackler, @nagisa, @seanmonstar, if we were to have OsString instead of OsStrBuf, how would you feel about having PathBuf/Path being the two types in that module? Would that strike you as odd, or do you think it seems different enough to not warrant a convention?

@seanmonstar

seanmonstar Jan 16, 2015

Contributor

That seems like the better alternative to me. The PathBuf name could work, or could be something like PathBuilder, MutPath, etc.

@nagisa

nagisa Jan 16, 2015

Contributor

From the API standpoint I tend to see Paths as an opaque black box with join, split etc methods, possibly not even backed by stringey data. OsString, on the other hand, is just another string type with possibly the same set of operations as String and str have. So having OsString to be consistent with String is much more important than being consistent with Path to me.

#[cfg(unix)]
mod imp {
type Buf = Vec<u8>;
type Slice = [u8;

@nagisa

nagisa Jan 13, 2015

Contributor

Missing ].

@alexcrichton

Member

alexcrichton commented Jan 13, 2015

cc @nodakai, I found your previous comments about just using Vec<u16> interesting when thinking about whether to use WTF-8 or not. I wanted to clarify a few pieces though:

> If we use Vec<u16> to implement OsStr, we need only one pair of dynamic memory allocation/deallocation to run this piece of code.

(this is in reference to your set_current_dir(current_dir()) example)

This I found to be an interesting point in general. As is now being brought up elsewhere, using WTF-8 means that we can never view a string originating from Windows without making an allocation somewhere along the line. For example an args() iterator (or even perhaps an environment iterator) could in theory yield slices into static memory.

However I'm not sure we can actually reasonably return views into OS-provided data (due to the lack of anywhere to anchor the lifetime to), and it's certainly more ergonomic to deal with owned data! I suppose what I'm getting at is that when we bring data from the OS into Rust I think we may always end up allocating (even for unix), so I'm not sure that [u16] being the slice type for OsString would buy us too much.

When handing data from Rust back to the OS, however, there are certainly some cases where we can avoid the allocation (if necessary). The only one I know of, however, is when an owned OsStrBuf is passed in. Almost everything else (even &OsStr) needs to reallocate to get a terminating nul at the end. Not necessarily an argument either way, just an observation!

> Then we will end up with comparing Win32 API ⇆ Vec<u16> ⇆ String with Win32 API ⇆ Vec<u16> ⇆ Wtf8Buf ⇆ String and the former is obviously simpler.

Once I sat down to think about this, I found it to not be quite true. If we assume OsStrBuf is represented via Vec<u16>, then we have these conversions in your s/digit/roman-numeral/ example:

  • *const u16 -> Vec<u16> -> String -> String (replaced) -> Vec<u16> ~> windows

Whereas, if we represent OsStrBuf as Wtf8Buf, then we have these conversions:

  • *const u16 -> Wtf8Buf ~> &str -> String -> Wtf8Buf -> Vec<u16> ~> windows

Here I'm using -> to mean "convert with allocations" and ~> to mean "view as", and an interesting fact I found is that the number of allocations in both these cases is the same! I do agree that the latter is indeed somewhat more involved and probably still involves walking the string a good bit.


> Moreover, information can in general be lost here by definition, so the benefit of OsStr is marginal when we talk about ease of interoperability with Rust string API.

Can you clarify on where information is lost? We've tried to very specifically design the OsStrBuf type around not losing any information, so this would be quite bad!


That may be a bit rambly, but in general I think that I've convinced myself that WTF-8 is the right decision for the representation of OsStrBuf. The primary purpose for this (from what I can tell) is buying us .as_str() and perhaps into_string(), which I feel will be worth it (but sadly can't be sure).

An alternative would be to choose WTF-8 for now (as I think @aturon may have already implemented most of it), but avoid exposing this detail (aka adopting this api). We would then free ourselves up to choose a different representation in the future if necessary, but it does come at the cost of .as_str as well as the "better signature" for from_str if WTF-8 is eventually chosen.
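For readers who want to see what the `.as_str()` / `into_string()` conversions look like in practice, here is a minimal sketch. It assumes the names found in today's `std::ffi` (`OsStr::to_str`, `OsString::into_string`, `OsStr::to_string_lossy`) rather than the names in this RFC draft, and the helper functions `view`, `own` and `lossy` are hypothetical wrappers added only for illustration.

```rust
use std::borrow::Cow;
use std::ffi::{OsStr, OsString};

/// Borrow an OS string as `&str` when it is valid Unicode.
/// This is a view: no new buffer is allocated.
fn view(os: &OsStr) -> Option<&str> {
    os.to_str()
}

/// Take ownership as a `String`, or get the original `OsString`
/// back unchanged when it is not valid Unicode.
fn own(os: OsString) -> Result<String, OsString> {
    os.into_string()
}

/// Always produce text; anything that cannot be represented is
/// replaced with U+FFFD, so this direction can lose information.
fn lossy(os: &OsStr) -> Cow<'_, str> {
    os.to_string_lossy()
}
```

The fallible signatures are the point: the caller decides what to do when the OS string is not Unicode, instead of the conversion silently dropping data.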

@kennytm

Member

kennytm commented Jan 13, 2015

May I suggest we use https://github.com/rust-lang/rfcs/pull/575/files?short_path=0372b19 ("Rich diff") for the "rendering" link? This clearly shows only the changed part. Same for other PRs of this kind.


```rust
my_ut8_data.to_wtf_8().to_ucs2().as_u16_slice() == my_utf8_data.to_utf16().as_16_slice()
```

@Florob

Florob Jan 14, 2015

At the very least this needs a rewording. UCS-2 inherently can not represent certain USVs, while UTF-8 can represent arbitrary USVs. This means that a real to_ucs2() implementation can fail. This paragraph is clearly assuming that to_ucs2() is implemented as to_utf16() (i.e. creates surrogate pairs). This makes it quite unsurprising that the result after chaining it with a nop (to_wtf_8()) is the same.

This line also has quite a few typos, it should probably be:

 my_utf8_data.to_wtf8().to_ucs2().as_u16_slice() == my_utf8_data.to_utf16().as_u16_slice()

@retep998

retep998 Jan 14, 2015

Member

WTF-8 treats valid surrogate pairs in UCS-2 as if they were UTF-16, while invalid surrogates are encoded as themselves in WTF-8. This allows for the UCS-2 to represent all of Unicode and for WTF-8 to represent any possible sequence of WCHAR.
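The rule described above can be sketched in a few lines. This is an illustrative encoder, not the implementation from the RFC or from @SimonSapin's crate: the function name `wide_to_wtf8` is hypothetical, and it relies only on `std::char::decode_utf16`, which pairs up surrogates where possible and reports the unpaired ones.

```rust
use std::char::decode_utf16;

/// Encode potentially ill-formed UTF-16 as WTF-8 bytes.
/// Paired surrogates become one 4-byte sequence (exactly as in UTF-8);
/// a lone surrogate is written as a 3-byte sequence of its own.
fn wide_to_wtf8(wide: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(wide.len());
    for unit in decode_utf16(wide.iter().copied()) {
        match unit {
            Ok(c) => {
                // Well-formed scalar value: plain UTF-8.
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
            Err(e) => {
                // Lone surrogate in 0xD800..=0xDFFF: generalized 3-byte form.
                let s = e.unpaired_surrogate();
                out.push(0xE0 | (s >> 12) as u8);
                out.push(0x80 | ((s >> 6) & 0x3F) as u8);
                out.push(0x80 | (s & 0x3F) as u8);
            }
        }
    }
    out
}
```

For well-formed input the result is byte-for-byte the UTF-8 encoding, which is why a `&str` view is possible whenever no lone surrogate is present. For input such as `[0xD800, 0x0041]` the output is not valid UTF-8, so `String::from_utf8` rejects it.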

@nodakai

nodakai Jan 14, 2015

Because the notions like surrogate pairs and supplementary planes didn't exist until Unicode 2.0 defined UTF-16 to obsolete UCS-2 (correct me if I'm wrong,) it's better to avoid using the term "UCS-2" here...

@SimonSapin

SimonSapin Jan 15, 2015

Contributor

What @aturon calls “UCS-2” here is arbitrary [u16] that is interpreted as UTF-16 when surrogates happen to be paired. http://justsolve.archiveteam.org/wiki/UCS-2 (linked from the RFC) has more background and details.

“Potentially ill-formed UTF-16” is IMO the most accurate term for it, but it’s a bit unwieldy.

@SimonSapin

SimonSapin Jan 15, 2015

Contributor

That said, I agree avoiding “UCS-2” entirely would be better, since it means different things to different people. I’m +0 on “wide string”, proposed below, if it’s properly defined.

(Text from @SimonSapin)

Rather than WTF-8, `OsStr` and `OsStrBuf` on Windows could use
potentially-ill-formed UTF-16 (a.k.a. "wide" strings), with a

@Florob

Florob Jan 14, 2015

What is called «potentially-ill-formed UTF-16 (a.k.a. "wide" strings)» here is referred to as «UCS-2» everywhere else in this document. From a correctness standpoint I'd prefer if the term «wide strings» was introduced upfront and consistently used throughout.


Downsides:
* More expensive conversions between `OsStr` / `OsStrBuf` and `str` / `String`.
* These conversions have inconsistent performance characteristics between platforms. (Need to allocate on Windows, but not on Unix.)

@Florob

Florob Jan 14, 2015

Personally I'd argue that the inconsistent performance characteristics exist either way. The difference is whether they exist on OS calls, or on OsStrBuf creation. It seems to me that it is not unlikely that the same OsStr would be used to call a function multiple times, or to call multiple different functions. To me this strongly suggests putting the cost at creation time, not call time.

@nodakai


nodakai commented Jan 14, 2015

@alexcrichton

> > Moreover, information can in general be lost here by definition, so the benefit of OsStr is marginal when we talk about ease of interoperability with Rust string API.
>
> Can you clarify on where information is lost? We've tried to very specifically design the OsStrBuf type around not losing any information, so this would be quite bad!

What I meant was: when we want to perform any non-trivial manipulation of, say, filenames expressed in OsStr, we would end up converting it to String and/or getting a view as &str, because most of Rust's string APIs are designed to work with String / &str. Information can be lost here.

OsStr(Buf) may provide simple string utilities such as iterators over code points or concatenation, but you don't want to build another set of string API on top of them just to convert catch-22.txt to catch-XXII.txt.

In other words, my claim is that the benefit of implementing OsStrBuf on top of WTF-8 is marginal unless String and &str at the core of the Rust language move to WTF-8.

@alexcrichton

Member

alexcrichton commented Jan 14, 2015

> In other words, my claim is that the benefit of implementing OsStrBuf on top of WTF-8 is marginal unless String and &str at the core of the Rust language move to WTF-8.

I think for the sake of a thought experiment, it's probably worth it to at least consider this! I've been told by @SimonSapin (correct me if I'm wrong), however, that an arbitrary u8 sequence cannot be represented in WTF-8. The properties it has that I'm aware of are:

  • Any arbitrary u16 sequence can be losslessly encoded in WTF-8.
  • Any valid UTF-8 sequence is a valid WTF-8 sequence.

So if we were to define str as Wtf8, then we would be losing compatibility with various Unix operating systems (because Unix just gives us a pile of bytes). So continuing this thought, I think we'll reach the conclusion that it's untenable for str to be represented as WTF-8.

With this information, we are faced with the insurmountable truth that it is impossible to losslessly convert an OsStr to a str. This we've basically been taking as a hard design constraint when designing OS strings and such. Now we can always allow interpretation of an OsStr as a valid str (be it via Cow or transmute), but as you've said this will not be possible in all situations and if code .unwrap()s all the time it's somewhat akin to information loss.

Let's take a second and look at your example of converting digits to roman numerals. Let's assume that a program is faced with the problem of converting a file which is not valid unicode (and hence cannot in any way be interpreted as str or String). Given this, it is up to the program to decide how to interpret the string. To be correct, it must have different behavior on windows than it does on unix because the underlying filename has a completely different representation.

All in all, I think that this does largely boil down to the fact that it's unfortunate that OsStr has to exist at all. We will likely be forced to duplicate some APIs between str and OsStr, but given our design philosophy for IO we are not given the choice of avoiding OsStr altogether. We'll just have to do our best to make using OsStr ergonomic in most circumstances.


Now taking the "str as WTF-8" thought process down another road, I think that simply defining OsStr as the string in the standard library is a whole separate beast altogether. Not only do I think we're far beyond that sort of decision, but I think it likely has a large number of downsides associated with it as well.

All in all I think that the fact that we'll be dealing with two types of strings is just that, a fact. The choice of WTF-8 encoding for OsStr on windows will earn us an as_str() method as opposed to using Vec<u16> to represent it, which, while I don't have a hugely strong opinion about it, does seem quite nice for interoperation with most other Rust code.

By the way thanks for taking the time to talk this through, I know I certainly appreciate thinking about this critically!

@nodakai


nodakai commented Jan 15, 2015

@alexcrichton

> If we assume OsStrBuf is represented via Vec<u16>, then we have these conversions in your s/digit/roman-numeral/ example:
>
>   • *const u16 -> Vec<u16> -> String -> String (replaced) -> Vec<u16> ~> windows
>
> Whereas, if we represent OsStrBuf as Wtf8Buf, then we have these conversions:
>
>   • *const u16 -> Wtf8Buf ~> &str -> String -> Wtf8Buf -> Vec<u16> ~> windows
>
> Here I'm using -> to mean "convert with allocations" and ~> to mean "view as", and an interesting fact I found is that the number of allocations in both these cases is the same!

No, that isn't a fair comparison. It's obscure how you got "*const u16" in the beginning. I assume you had only GetEnvironmentStrings() in your mind which simply returns *const u16 without(?) dynamic memory allocation. But I was mainly talking about wider class of Win32 APIs which take a pointer to a buffer and fill it to "return data" such as GetCurrentDirectory(). Then the comparison should be like:

  • windows -> Vec<u16> -> String -> String (replaced) -> Vec<u16> ~> windows

vs

  • windows -> Vec<u16> -> Wtf8Buf ~> &str -> String -> Wtf8Buf -> Vec<u16> ~> windows

and we have one more allocation with WTF-8-based OsStrBuf. Well, we can add an optimization to try to use an on-stack [u16; N] first (we will fall back to Vec<u16> when we find N too small,) but it equally applies to Vec<u16>-based OsStrBuf.

Update

Sorry, such an optimization does NOT apply to Vec<u16>-based OsStrBuf. As you pointed out, the number of dynamic memory allocations is the same for the "fast cases" when it's possible to receive data from Win32 API into on-stack [u16; N].

The comparison should be:

  • windows -> Vec<u16> -> String -> String (replaced) -> Vec<u16> ~> windows

vs either of

  • windows -(wcscpy)-> [u16; N] -> Wtf8Buf ~> &str -> String -> Wtf8Buf -> Vec<u16> ~> windows ("fast case")

or

  • windows -> Vec<u16> -> Wtf8Buf ~> &str -> String -> Wtf8Buf -> Vec<u16> ~> windows ("slow case")

(I'm not sure windows -(wcscpy)-> [u16; N] was a proper notation... It's done inside kernel32.dll)
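The "try an on-stack [u16; N] first, fall back to Vec<u16>" idea can be sketched portably as follows. Everything here is an assumption made for illustration: `fake_fill` is a hypothetical stand-in for a buffer-filling OS call (it is not the exact contract of any Win32 function), `with_wide` is a made-up helper name, and N = 16 is an arbitrary demo size.

```rust
/// Hypothetical stand-in for an OS call that fills a caller-provided buffer.
/// It returns the number of `u16` units the full result needs, and only
/// writes when `buf` is large enough to hold all of them.
fn fake_fill(data: &[u16], buf: &mut [u16]) -> usize {
    if buf.len() >= data.len() {
        buf[..data.len()].copy_from_slice(data);
    }
    data.len()
}

/// Try a fixed-size stack buffer first and fall back to a heap `Vec<u16>`
/// only when the result does not fit. The wide data is handed to `consume`
/// as a slice, so the caller chooses the final representation.
fn with_wide<F, G, R>(mut fill: F, consume: G) -> R
where
    F: FnMut(&mut [u16]) -> usize,
    G: FnOnce(&[u16]) -> R,
{
    let mut stack = [0u16; 16]; // N = 16 is an arbitrary demo size
    let needed = fill(&mut stack[..]);
    if needed <= stack.len() {
        // "Fast case": no heap buffer for the wide data.
        return consume(&stack[..needed]);
    }
    // "Slow case": one extra allocation for the wide data.
    let mut heap = vec![0u16; needed];
    let written = fill(&mut heap[..]).min(needed);
    consume(&heap[..written])
}
```

With a transcoding `consume` (for example one that builds a WTF-8 or UTF-8 buffer) the fast case allocates only the final buffer, which matches the chain marked "fast case" above. If the caller wants to keep a `Vec<u16>`, the stack buffer saves nothing because the data has to be copied to the heap anyway.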

@quantheory

Contributor

quantheory commented Jan 15, 2015

@alexcrichton @nodakai

It's perhaps a minor point at this stage, but if you are willing to let your final String be destroyed, the WTF-8 case could look like this:

  • windows -> Vec<u16> -> Wtf8Buf ~> &str -> String ~> Wtf8Buf -> Vec<u16> ~> windows

If you do need to reuse the String, but don't need an owned OS string, you can use OsStr::from_str:

  • windows -> Vec<u16> -> Wtf8Buf ~> &str -> String ~> &str ~> &OsStr -> Vec<u16> ~> windows

Both of these leverage the fact that UTF-8 is valid WTF-8. That could allow String to implement IntoOsStrBuf with an allocation-free conversion (or you could just use OsStrBuf::from_string?), and also to be viewed as &OsStr. But this also assumes that the terminator is not added until later, during the conversion to Vec<u16> (where we have to allocate anyway).
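As a rough illustration of the first chain, here is the catch-22.txt example written against today's `std::ffi` names (an assumption; the RFC draft still says `OsStrBuf`). The function `romanize` and its hard-coded replacement are hypothetical.

```rust
use std::ffi::{OsStr, OsString};

/// Rename "catch-22.txt" style names to "catch-XXII.txt".
/// Returns `None` when the name is not valid Unicode, leaving that
/// decision to the caller.
fn romanize(name: &OsStr) -> Option<OsString> {
    // `~> &str`: a view, no allocation.
    let text: &str = name.to_str()?;
    // `-> String`: the replacement allocates a new buffer.
    let replaced: String = text.replace("22", "XXII");
    // `String ~> OsString`: takes over the String's buffer without copying.
    Some(OsString::from(replaced))
}
```

The only allocation in this function is the one made by `replace`. Converting the result for a Windows call would still need a fresh `Vec<u16>`, as noted above.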

@quantheory

Contributor

quantheory commented Jan 15, 2015

Thinking about this more, I'm still not sure if there's any issue with having the null terminator in OsStrBuf for Unix, but not Windows. (For that matter, it doesn't explicitly say in this RFC whether data from OsStrBuf will always be null-terminated or not, on either platform!) Having this detail be platform-specific seems mostly harmless to me, as long as the cross-platform interface still yields the same &str/String, and figures out the null termination in system calls one way or the other. It might be a bit confusing for people who are doing lower-level stuff though, especially if they have to write code for both platform-specific interfaces.

But that brings me to another question: are the following expecting a correctly terminated string or not, and if so do they check the last character? If they do check, shouldn't they return an Option (or panic!) if given bad input? If they don't check, should they be unsafe interfaces?

#[cfg(unix)]
trait OsStrBufExt {
    fn from_vec(Vec<u8>) -> Self;
}

#[cfg(windows)]
trait OsStrBufExt {
    fn from_wide_slice(&[u16]) -> Self;
}

(You can say that there's a similar problem for interior nulls, but interior nulls are not really a memory safety problem because they don't cause reads past the end of the string. For a missing null terminator, the memory issue technically occurs in foreign code, but it would be Rust that failed to enforce the relevant invariant, so I'd say that this sort of bug should only be occurring in unsafe code.)
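For comparison, a sketch using the platform extension traits as they exist in today's standard library (`std::os::unix::ffi::OsStringExt::from_vec` and `std::os::windows::ffi::OsStringExt::from_wide`). These are not the trait names from the draft above, and as far as I know neither constructor expects or checks for a terminator. The helper `non_unicode_os_string` is hypothetical.

```rust
use std::ffi::OsString;

/// Build an OS string that is *not* valid Unicode on the current platform.
#[cfg(unix)]
fn non_unicode_os_string() -> OsString {
    use std::os::unix::ffi::OsStringExt;
    // 0xFF can never appear in UTF-8.
    OsString::from_vec(vec![b'a', 0xFF])
}

/// Build an OS string that is *not* valid Unicode on the current platform.
#[cfg(windows)]
fn non_unicode_os_string() -> OsString {
    use std::os::windows::ffi::OsStringExt;
    // 0xD800 is a lone lead surrogate.
    OsString::from_wide(&[0x0061, 0xD800])
}
```

On both platforms the resulting value round-trips through the OS unchanged, while `to_str()` returns `None` and `to_string_lossy()` substitutes U+FFFD for the unrepresentable part.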

@retep998

Member

retep998 commented Jan 15, 2015

A lot of WinAPI methods actually have a parameter to explicitly specify the string length rather than relying on terminating nulls, so it would be silly to impose that requirement always. Instead there should internally be a way during conversion from WTF-8 to UCS-2 to simply ensure there are no interior nulls and that there is a null terminator, but for methods where it isn't needed, a simpler conversion that doesn't do any null checks. The programmer won't be exposed to this issue on Windows.
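A minimal sketch of the two conversions suggested here. To stay portable it starts from `&str` rather than WTF-8, which is a simplification, and both function names are hypothetical.

```rust
/// For APIs that take a pointer plus an explicit length:
/// no terminator is added and interior nuls are passed through.
fn to_wide(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// For APIs that expect a nul-terminated string: reject interior nuls
/// (they would silently truncate the string) and append the terminator.
fn to_wide_nul(s: &str) -> Option<Vec<u16>> {
    if s.contains('\0') {
        return None;
    }
    let mut wide: Vec<u16> = s.encode_utf16().collect();
    wide.push(0);
    Some(wide)
}
```

Keeping the check in the conversion rather than in the string type means length-based calls pay nothing for it.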

@nodakai


nodakai commented Jan 15, 2015

> So if we were to define str as Wtf8, then we would be losing compatibility with various Unix operating systems (because Unix just gives us a pile of bytes).

I didn't mean we could do without OsStrBuf on *nix just by making String and &str WTF-8-based. Anyways, for a thought experiment, I would rather define a unifying interface (trait StringLike?) shared by str and OsStr and change all the existing string APIs to work in terms of it. Of course external crates like regex must also be updated. Then it's no more difficult to change <U+D800>-22.txt to <U+D800>-XXII.txt (here <U+D800> denotes a lone first surrogate) than to change catch-22.txt to catch-XXII.txt. Wtf8 already impls PartialEq and a few traits. We can simply do the same for all the other traits impl-ed by str. (This is perhaps in line with what you meant by "We'll just have to do our best to make using OsStr ergonomic in most circumstances.")

(Another random idea: we might as well extend WTF-8 even further to encode arbitrary lone bytes. I believe there's still plenty of room in UTF-8 code space, and if not, we can use 5-bytes code abandoned by the Unicode consortium!)

That said, I'm still not comfortable with valuing the convenience of out-of-the-box OsStr::as_str() over something extra automatically going on under the hood even when I didn't ask to do in such a case as set_current_dir(current_dir()). I believe when to perform translations between encodings is what computer programmers working on system programming must decide by themselves.

@alexcrichton

Member

alexcrichton commented Jan 15, 2015

@nodakai

> No, that isn't a fair comparison. It's obscure how you got "*const u16" in the beginning.

Oh sorry yes I should have clarified. I was thinking of APIs like getting the environment (like you mentioned) or FindNextFile where windows basically hands us a *const u16 somehow (be it allocated or not).

> But I was mainly talking about wider class of Win32 APIs which take a pointer to a buffer and fill it to "return data" such as GetCurrentDirectory().

Oh good point! I had forgotten about this.

I should say though that I don't want to go too far into the weeds discussing allocations and OsStr. You mentioned on the original RFC "That said, when we call a system call, the cost of dynamic memory allocation is almost negligible.", which I definitely agree with! Just wanted to point out one piece which I found interesting, but I now realize I didn't fully consider the "we're handing a buffer to the OS" case, only the "the OS handed us a buffer" case.


@quantheory

> Thinking about this more, I'm still not sure if there's any issue with having the null terminator in OsStrBuf for Unix, but not Windows.

I agree, we're already using totally different semantic representations on unix than windows, so I think we're fine to spec this however we like.

In general I think that your comments as well as @retep998's make me think that we should not make any promises about a nul terminator for now, allowing us to tweak the representation in the future pending various analyses or benchmarks.

> But that brings me to another question: are the following expecting a correctly terminated string or not, and if so do they check the last character?

I suspect that like CString we'll have _unchecked constructors if necessary if they perform any checks which want to be bypassed.
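The CString precedent mentioned here looks roughly like this. The calls are the real `std::ffi::CString` API (`CString::new`, `NulError::nul_position`, `CString::from_vec_unchecked`); the wrapper names `checked` and `unchecked` are hypothetical.

```rust
use std::ffi::CString;

/// Checked constructor: scans for interior nuls and reports the
/// position of the first one instead of building a truncated string.
fn checked(bytes: &[u8]) -> Result<CString, usize> {
    CString::new(bytes).map_err(|err| err.nul_position())
}

/// Unchecked constructor: skips the scan.
///
/// # Safety
/// `bytes` must not contain any 0 byte.
unsafe fn unchecked(bytes: Vec<u8>) -> CString {
    unsafe { CString::from_vec_unchecked(bytes) }
}
```

The split keeps the invariant enforced in safe code, while callers who have already validated their input can opt out of the scan explicitly.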


@retep998

> A lot of WinAPI methods actually have a parameter to explicitly specify the string length rather than relying on terminating nulls, so it would be silly to impose that requirement always.

Whoa! I did not know this! Do you have some examples of APIs? I was thinking we would forbid interior nuls for OsStr to prevent silent data truncation by accident, but this means we should take a step like that very lightly!

@alexcrichton

Member

alexcrichton commented Jan 15, 2015

Oops sorry, hit the comment button too soon:

@nodakai

> We can simply do the same for all the other traits impl-ed by str. (This is perhaps in line with what you meant by "We'll just have to do our best to make using OsStr ergonomic in most circumstances.")

Indeed! I think we'll start off with inherent methods as necessary on OsStr, and we could add traits to be generic over str or OsStr in the future perhaps (a backwards-compatible addition).
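One way such genericity can look is a bound on `AsRef<OsStr>`, which today's standard library implements for `str`, `String`, `OsStr`, `OsString`, `Path` and `PathBuf`. This is a sketch under that assumption, not something this RFC specifies, and `has_extension` is a hypothetical helper.

```rust
use std::ffi::OsStr;
use std::path::Path;

/// Accept any string-like value that can be viewed as an OS string
/// and check its file extension without converting to `str`.
fn has_extension<S: AsRef<OsStr>>(name: S, ext: &str) -> bool {
    Path::new(name.as_ref()).extension() == Some(OsStr::new(ext))
}
```

Callers can pass `"catch-22.txt"`, a `String`, an `OsString` or a `&Path` to the same function, and names that are not valid Unicode still work because nothing is forced through `&str`.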

> I'm still not comfortable with valuing the convenience of out-of-the-box OsStr::as_str() over something extra automatically going on under the hood even when I didn't ask to do in such a case as set_current_dir(current_dir()).

Could you elaborate a little more on why you're uncomfortable? (totally fine that you are!) So far the pieces that have been talked about are:

  • Perf-wise using WTF-8 may impose some overhead as extra encodings or translations need to be done.
  • Semantically I believe the choices of WTF-8 or Vec<u16> are equivalent, so we're definitely not losing data one way or another.

I do agree that we shouldn't be doing much that you're not expecting, but so far choosing WTF-8 over Vec<u16> I believe only has performance ramifications, not functionality ramifications. Due to the context of OsStr (very frequently paired with syscalls) I would expect the performance worries to be smoothed over a bit, but I could be wrong!

@retep998

Member

retep998 commented Jan 15, 2015

> Whoa! I did not know this! Do you have some examples of APIs? I was thinking we would forbid interior nuls for OsStr to prevent silent data truncation by accident, but this means we should take a step like that very lightly!

Look at something simple like WriteConsoleW which takes a buffer of text to write and a length. Or perhaps ExtTextOutW, or GetTabbedTextExtentW. How about LCMapStringEx or even IsNormalizedString? By reading the SAL macros wrapping the parameters to functions you can fairly easily figure out whether to pass a null terminated string or a string + length.

aturon added a commit to aturon/rust that referenced this pull request Jan 21, 2015

Add ffi::OsString and OsStr
Per [RFC 517](rust-lang/rfcs#575), this commit
introduces platform-native strings. The API is essentially as described
in the RFC.

The WTF-8 implementation is adapted from @SimonSapin's
[implementation](https://github.com/SimonSapin/rust-wtf8). To make this
work, some encoding and decoding functionality in `libcore` is now
exported in a "raw" fashion reusable for WTF-8. These exports are *not*
reexported in `std`, nor are they stable.

@aturon

Member Author

aturon commented Jan 22, 2015

I've pushed a small update that changes OsStrBuf to OsString based on discussion about the Buf convention. Essentially, many feel it better to follow the String/str conventions for DST-ified string types, even if we end up with a separate convention for other types (like PathBuf/Path).
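To make the String/str parallel concrete, here is a small sketch using the names this update settles on, as they appear in today's `std::ffi`. The two helper functions are hypothetical.

```rust
use std::ffi::{OsStr, OsString};

/// Borrowed parameter, used the way `&str` is used.
fn is_hidden(name: &OsStr) -> bool {
    name.to_str().map_or(false, |s| s.starts_with('.'))
}

/// Owned result, used the way `String` is used.
fn with_suffix(name: &OsStr, suffix: &str) -> OsString {
    let mut owned = name.to_os_string(); // analogous to `str::to_string`
    owned.push(suffix); // analogous to `String::push_str`
    owned
}
```

An `&OsString` coerces to `&OsStr` at call sites, just as `&String` coerces to `&str`, so `is_hidden(&owned_name)` works without an explicit conversion.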

@brson

Contributor

brson commented Jan 23, 2015

None of the methods here that can have errors report specifics of what the error was. I think that's probably in line with our other decoding APIs but it's a notable limitation.

As just a design for adding OS-specific types, and not a design for integrating them into other APIs, I feel pretty good about adding this as unstable and seeing how it goes. Would be nice if os-strings could have their own feature name.

@brson

Contributor

brson commented Jan 23, 2015

Oh, also, this RFC considers Windows and Unix as 'all platforms'. I can't think of any reason the arguments here are invalidated for other platforms, but we should be careful.

@alexcrichton

Member

alexcrichton commented Jan 23, 2015

The fundamental aspect of this addition to the I/O RFC has had quite broad approval for some time now. It sounds like everyone is in agreement that OsString is necessary to work with APIs on Windows and Unix systems in a lossless fashion.

There are legitimate worries about the ergonomics of having OsString, essentially a second string type in the standard library. It is planned that the exact details of the final API will be hammered out over time. The current purpose will be to act as a container to quickly facilitate construction of other containers like a Path or interpretation as unicode if possible. As always we're at liberty to grow the API over time to add more convenience methods for working with various aspects of the system.

There have also been concerns about the choice of WTF-8 on Windows and performance on Windows. External opinions seem to indicate that this may not be true in all situations, and I suspect that we will quickly learn about the performance implication on Windows. The implementation will be initially #[unstable] and we will be at liberty for a bit to switch to Vec<u16> if necessary.

In light of this information, I'm going to merge this RFC in order to start making progress on the I/O RFCs and start hammering out some of the more nitty gritty implementation details. I'd like to thank everyone for their comments, it has been super helpful in shaping this aspect! Please feel free to take a look at the current PR for an implementation and leave any comments as well!

@alexcrichton alexcrichton merged commit 27bd6ad into rust-lang:master Jan 23, 2015

@SimonSapin

Contributor

SimonSapin commented Jan 23, 2015

> Oh, also, this RFC considers Windows and Unix as 'all platforms'. I can't think of any reason the arguments here are invalidated for other platforms, but we should be careful.

I believe this RFC’s design allows adding support for more platforms backward-compatibly if we need/want to.

bors added a commit to rust-lang/rust that referenced this pull request Jan 23, 2015

Auto merge of #21488 - aturon:os-str, r=alexcrichton
Per [RFC 517](rust-lang/rfcs#575), this commit introduces platform-native strings. The API is essentially as described in the RFC.

The WTF-8 implementation is adapted from @SimonSapin's [implementation](https://github.com/SimonSapin/rust-wtf8). To make this work, some encoding and decoding functionality in `libcore` is now exported in a "raw" fashion reusable for WTF-8. These exports are *not* reexported in `std`, nor are they stable.

aturon added a commit to aturon/rust that referenced this pull request Jan 24, 2015

Add ffi::OsString and OsStr
Per [RFC 517](rust-lang/rfcs#575), this commit
introduces platform-native strings. The API is essentially as described
in the RFC.

The WTF-8 implementation is adapted from @SimonSapin's
[implementation](https://github.com/SimonSapin/rust-wtf8). To make this
work, some encoding and decoding functionality in `libcore` is now
exported in a "raw" fashion reusable for WTF-8. These exports are *not*
reexported in `std`, nor are they stable.

bors added a commit to rust-lang/rust that referenced this pull request Jan 24, 2015

Auto merge of #21488 - aturon:os-str, r=alexcrichton
Per [RFC 517](rust-lang/rfcs#575), this commit introduces platform-native strings. The API is essentially as described in the RFC.

The WTF-8 implementation is adapted from @SimonSapin's [implementation](https://github.com/SimonSapin/rust-wtf8). To make this work, some encoding and decoding functionality in `libcore` is now exported in a "raw" fashion reusable for WTF-8. These exports are *not* reexported in `std`, nor are they stable.

impl OsStr {
    pub fn from_str(value: &str) -> &OsStr;
    pub fn as_str(&self) -> Option<&str>;

tbu- Jan 26, 2015

Contributor

I think this method should not be made available. All legitimate uses of this function are covered by to_string_lossy, and it encourages failing on non-UTF-8 strings.

quantheory Jan 27, 2015

Contributor

I disagree with this. It is legitimate for a function to fail or select a fallback behavior if it has good reason to expect Unicode data from a system call, and that's not what it gets. (It may not be a good idea to panic in most cases, but returning an error or special value is legitimate.)

In fact, for any situation that emphasizes correctness over robustness, I would have the opposite worry: that to_string_lossy will end up being used when non-Unicode data should be rejected entirely, or when non-Unicode data is actually expected and needs to be handled losslessly. Admittedly, in the latter case, the user should deal with the platform-specific u8/u16 representation (or their own custom type for the expected encoding) instead of converting to str.
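
A minimal sketch of the two behaviours being contrasted here. It uses the method names that `std::ffi` provides today (`OsStr::to_str` and `OsStr::to_string_lossy`) rather than the `as_str` spelling in this draft, and the helper names `strict`, `lossy`, and `non_unicode_sample` are made up for illustration:

```rust
use std::borrow::Cow;
use std::ffi::OsStr;

/// Strict conversion: non-Unicode input is reported as an error, never altered.
fn strict(value: &OsStr) -> Result<&str, String> {
    value
        .to_str()
        .ok_or_else(|| format!("not valid Unicode: {:?}", value))
}

/// Lossy conversion: always succeeds, replacing invalid sequences with U+FFFD.
fn lossy(value: &OsStr) -> Cow<'_, str> {
    value.to_string_lossy()
}

/// Builds an OS string that is not valid Unicode (Unix-only helper).
#[cfg(unix)]
fn non_unicode_sample() -> std::ffi::OsString {
    use std::os::unix::ffi::OsStringExt;
    // "fo" followed by a lone 0x80 byte, which is not valid UTF-8.
    std::ffi::OsString::from_vec(vec![b'f', b'o', 0x80])
}

fn main() {
    let good = OsStr::new("config.xml");
    println!("strict: {:?}", strict(good));
    println!("lossy:  {}", lossy(good));

    #[cfg(unix)]
    {
        let bad = non_unicode_sample();
        // The strict path surfaces the problem to the caller...
        println!("strict: {:?}", strict(&bad));
        // ...while the lossy path silently substitutes a replacement character.
        println!("lossy:  {}", lossy(&bad));
    }
}
```

With the strict variant the caller decides what a malformed string means; with the lossy variant that decision has already been made by the time the `str` is seen.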

tbu- Jan 27, 2015

Contributor

@quantheory I can't think of an example of an application that would need this to_string instead of the to_string_lossy.

Florob Jan 27, 2015

I'm under the impression that this would lock us into using WTF-8 on Windows. Is that intentional?

nagisa Jan 27, 2015

Contributor

@Florob Why do you think so?

tbu- Jan 28, 2015

Contributor

@quantheory My feeling about your use case is that you need to drop down to platform-specific calls anyway if you want to support filenames as you suggest ("spelling out the bytes"), because e.g. Windows paths cannot be represented as bytes in a meaningful way.

Otherwise your program would just drop out when handling non-Unicode paths, which would be very unfortunate IMO.

quantheory Jan 29, 2015

Contributor

I'm not talking about filenames. I'm talking about information that's going to end up in Rust code, XML documents, and possibly other Unicode file formats that are not always for human consumption. (Purely for debugging purposes you may still want them to be as readable as reasonably possible.) At the moment (though as likely as not this will change) the sending/calling process knows ahead of time whether or not a sequence of bytes is guaranteed to be valid Unicode, and is responsible for doing any necessary translation to ensure that the receiving process always gets Unicode, as long as it gets the message at all.

But this is really getting more specific than it needs to be. My real point is that processes send information to co-designed programs or other instances of themselves in ways that incidentally pass through the OS. You can receive a string from a system call that you know actually originates within (another process of) your own application, or a specific other application that uses a known encoding. If what you get is somehow malformed anyway, that's not a normal situation, and the receiving process has no way to continue normally without knowing what went wrong on the other end.

(Also, unless the program just uses unwrap/expect on everything it gets, a None value doesn't necessarily mean that it will "just drop out" if it doesn't get Unicode. We're not looking at a function that just panics here.)

tbu- Jan 30, 2015

Contributor

@quantheory Note that this OsString is not used for file contents or other stuff, but just for file names.

quantheory Jan 30, 2015

Contributor

@tbu-

I'm not talking about file contents, I'm talking about, for instance, the proposed std::env and std::process, as used to communicate short strings between processes. These will apparently use OsStr/OsString.

File names are Path/PathBuf, not OsStr/OsString. (They will have the same internal representation, I think, but they have different interfaces. Path will also implement as_str, though, according to RFC 474.)
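
To make the inter-process case concrete, here is a rough sketch of a receiver that treats a non-Unicode value as a reportable error rather than panicking or repairing it. It relies on `std::env::var_os` and `OsString::into_string` as they exist in `std` today, which may not match this draft exactly; the `Setting` enum, the `classify` function, and the `PEER_MESSAGE` variable name are hypothetical:

```rust
use std::env;
use std::ffi::OsString;

/// Outcome of reading a short string handed over by a cooperating process.
#[derive(Debug, PartialEq)]
enum Setting {
    Missing,
    Text(String),
    /// The original bytes are kept so the caller can log or inspect them.
    NotUnicode(OsString),
}

/// Converts strictly: valid Unicode becomes `Text`, anything else is kept as-is.
fn classify(raw: Option<OsString>) -> Setting {
    match raw {
        None => Setting::Missing,
        Some(value) => match value.into_string() {
            Ok(text) => Setting::Text(text),
            // `into_string` gives the untouched OsString back on failure.
            Err(original) => Setting::NotUnicode(original),
        },
    }
}

fn main() {
    // `PEER_MESSAGE` is a hypothetical variable set by the sending process.
    match classify(env::var_os("PEER_MESSAGE")) {
        Setting::Text(text) => println!("received: {}", text),
        Setting::Missing => println!("no message was sent"),
        Setting::NotUnicode(raw) => eprintln!("malformed message: {:?}", raw),
    }
}
```

Because the failure case returns the original `OsString`, nothing is lost by attempting the strict conversion first.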

tbu- Jan 30, 2015

Contributor

So maybe this should be called to_string_strict instead of to_string?

pub fn from_string(String) -> OsString;
pub fn from_str(&str) -> OsString;
pub fn as_slice(&self) -> &OsStr;
pub fn into_string(Self) -> Result<String, OsString>;

tbu- Jan 26, 2015

Contributor

Same as OsStr::as_str, shouldn't be made available.
