OS string string-like interface #1309

Closed
wants to merge 3 commits into from

6 participants
@wthrowe
wthrowe commented Oct 6, 2015

@wthrowe wthrowe changed the title Add RFC for OS string string-like interface OS string string-like interface Oct 6, 2015

@alexcrichton alexcrichton added the T-libs label Oct 6, 2015

@Stebalien

Contributor

Stebalien commented Oct 6, 2015

See rust-lang/rust#26499 (related discussion)

@wthrowe

Author

wthrowe commented Oct 6, 2015

Thanks. I'd missed that discussion.

starts_with and ends_with (and contains) could certainly be implemented for general OsStr arguments, although I don't think the straight byte comparison proposed there handles WTF-8 unpaired surrogates correctly.

That also suggests the question of whether these functions should take things like &str or things like T: AsRef<str>. I see std interfaces going both ways. Which is preferred for new interfaces?

@alexcrichton alexcrichton self-assigned this Oct 8, 2015

```rust
/// Returns true if the string starts with a valid UTF-8 sequence
/// equal to the given `&str`.
fn starts_with_str(&self, prefix: &str) -> bool;
```

@alexcrichton

alexcrichton Oct 8, 2015

Member

When adding string-like APIs to OsStr, it may be best to try to stick to original str APIs as much as possible, while also allowing all kinds of fun functionality for OsStr. Along those lines, perhaps this API could look like:

fn starts_with<S: AsRef<OsStr>>(&self, prefix: S) -> bool;

That should cover this use case as well as something like os_str.starts_with(&other_os_string).
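A minimal sketch of how such a signature could behave, written as an extension trait so it compiles outside `std`. The trait name `OsStrPrefix` and the method name `starts_with_os` are hypothetical, and the body relies on `OsStr::as_encoded_bytes`, which was added to the standard library long after this thread. A plain byte comparison like this does not treat unpaired surrogates on Windows specially, which is the concern raised earlier in the conversation.

```rust
use std::ffi::OsStr;

/// Hypothetical extension trait; not part of `std`.
trait OsStrPrefix {
    /// Accepts anything that can be viewed as an `OsStr`:
    /// `&str`, `String`, `&OsStr`, `OsString`, `&Path`, ...
    fn starts_with_os<S: AsRef<OsStr>>(&self, prefix: S) -> bool;
}

impl OsStrPrefix for OsStr {
    fn starts_with_os<S: AsRef<OsStr>>(&self, prefix: S) -> bool {
        // Straight comparison of the platform encoding, byte by byte.
        self.as_encoded_bytes()
            .starts_with(prefix.as_ref().as_encoded_bytes())
    }
}
```

With this bound, `os_str.starts_with_os("--")` and `os_str.starts_with_os(&other_os_string)` both type-check without separate `_str` and `_os` variants.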

@wthrowe

wthrowe Oct 8, 2015

Author

Reasonable. I'll make that change.

/// If the string starts with the given `&str`, returns the rest
/// of the string. Otherwise returns `None`.
fn remove_prefix_str(&self, prefix: &str) -> Option<&OsStr>;

@alexcrichton

alexcrichton Oct 8, 2015

Member

Along the lines of sticking as close to str as possible, I think that this would be best expressed as a splitn if you're working with str. Adding that API to OS strings, however, may be a bit tricky as it may involve dealing with the Pattern trait, so it may be best to perhaps hold off on this just yet? Either that or perhaps adding splitn which only takes P: AsRef<OsStr> for now.

@SimonSapin

SimonSapin Oct 13, 2015

Contributor

With splitn you have to check that the first chunk is empty, and it does more work than necessary in the negative case.

I’ve wanted str::remove_prefix as well (on "normal" UTF-8 strings). Could it be added on both types?
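To make the comparison concrete, here is a sketch on plain `str` of the `splitn` formulation next to a direct prefix removal. The function names are illustrative only. The direct version uses `str::strip_prefix`, which was added to the standard library after this discussion.

```rust
/// Prefix removal expressed through `splitn`: the first chunk must be
/// empty, otherwise the separator matched somewhere other than the start.
fn remove_prefix_splitn<'a>(s: &'a str, prefix: &str) -> Option<&'a str> {
    let mut parts = s.splitn(2, prefix);
    match (parts.next(), parts.next()) {
        (Some(""), Some(rest)) => Some(rest),
        // Either no match at all, or a match later in the string
        // (in which case the whole string was searched for nothing).
        _ => None,
    }
}

/// Direct version: only looks at the start of the string.
fn remove_prefix_direct<'a>(s: &'a str, prefix: &str) -> Option<&'a str> {
    s.strip_prefix(prefix)
}
```

The `splitn` version has to scan the string for the separator even when the string does not start with it, which is the extra work in the negative case mentioned above.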

@wthrowe

wthrowe Oct 13, 2015

Author

Yeah, I still like this functionality, but you're right that it would be nice on both types of strings. Since modifying str seems out of scope for this RFC, I think this will probably be bumped to a future RFC.

Hmm, if this took a pattern and also returned the prefix, then it could nicely replace slice_shift_char.

/// and the remainder of the `OsStr`. Returns `None` if the
/// `OsStr` does not start with a character (either because it is
/// empty or because it starts with non-UTF-8 data).
fn slice_shift_char(&self) -> Option<(char, &OsStr)>;

@alexcrichton

alexcrichton Oct 8, 2015

Member

We may want to hold off on adding this API to OS strings for now, as the API on str is still unstable and it may be a while before it's stabilized (I'm also a little dubious as to how useful it is).

@wthrowe

wthrowe Oct 8, 2015

Author

My understanding is that the str version of this is potentially non-useful because it is easy to rewrite it as

let first = string.chars().next();
if let Some(first) = first {
    let rest = &string[first.len_utf8()..];
    ...
}
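For reference, the `str` rewrite above can be packaged as a small free function. This is only a sketch: the name `slice_shift_char` is borrowed from the unstable method under discussion, and the function here is a local helper, not the `std` API.

```rust
/// Splits off the first `char` of a `str`, returning it together with
/// the remainder, or `None` if the string is empty.
fn slice_shift_char(s: &str) -> Option<(char, &str)> {
    let first = s.chars().next()?;
    // `len_utf8` gives the byte width of `first`, so the slice below
    // always starts on a character boundary.
    Some((first, &s[first.len_utf8()..]))
}
```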

But neither chars() nor indexing make sense on an OsStr. The best I can come up with for OsStr is something like

let mut split = string.splitn(3, "");
split.next().unwrap();
match (split.next(), split.next()) {
    (Some(first), Some(rest)) if first.to_str().map(|s| s.chars().count() == 1).unwrap_or(false) => {
        let first = first.to_str().unwrap().chars().next().unwrap();
        ...
    }
    _ => {}
}

but that's kind of a mouthful and is easy to screw up in subtle ways (and depends on what empty patterns end up doing).

@alexcrichton

alexcrichton Oct 9, 2015

Member

Hm, very good points! I realize now I should have read the section below a little more closely, thanks for the explanation!

/// If the `OsStr` starts with a UTF-8 section followed by
/// `boundary`, returns the sections before and after the boundary
/// character. Otherwise returns `None`.
fn split_off_str(&self, boundary: char) -> Option<(&str, &OsStr)>;

@alexcrichton

alexcrichton Oct 8, 2015

Member

Similarly to remove_prefix_str above, perhaps this would be best suited for a splitn in terms of compositions? You'd probably get two OsStr instances out of that, but could call to_str on the first to get the same API as this.

/// # Panics
///
/// Panics if the boundary character is not ASCII.
fn split<'a>(&'a self, boundary: char) -> Split<'a>;

@alexcrichton

alexcrichton Oct 8, 2015

Member

This is certainly an interesting API! Some thoughts I have here:

  • In the interest of staying aligned with str, this may want to take at least P: AsRef<OsStr> and can perhaps eventually be generalized to P: OsPattern (similar to the split function on str) to also allow char.
  • Having this API panic if a character isn't ASCII though is pretty unfortunate, and I think it'd be best to avoid that (and perhaps using OsStr would alleviate that for now?)

@wthrowe

wthrowe Oct 8, 2015

Author

As I mentioned in the Alternatives section, it is not possible to split an OsStr on an OsStr in Windows because of the details of the WTF-8 encoding.

Thinking about this a bit more, the restriction to ASCII should be completely unnecessary. We should be able to take an arbitrary &str, and possibly even a full P: Pattern (although I'll have to think about that a bit more), but notably not any generalizations of those to OsStr.

@alexcrichton

alexcrichton Oct 9, 2015

Member

Aha! Sorry I think I missed that part of the section below. Generalization to a full Pattern would be great, and even &str would be quite nice!

cc @SimonSapin, curious on your thoughts on split + the WTF-8 encoding

@SimonSapin

SimonSapin Oct 13, 2015

Contributor

I agree the ASCII restriction is not necessary on Windows. I don’t think there is an issue splitting WTF-8 on any char. It’s not clear what splitting on a non-ASCII means on Unix, though. Split on the corresponding UTF-8 sequence? That makes sense (especially if split also accepts &str arguments), but needs to be documented.
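One possible reading of "split on the corresponding UTF-8 sequence" can be sketched as follows. The helper `split_on_char` is hypothetical, it returns raw byte slices rather than `&OsStr` to stay in safe code, and it uses `OsStr::as_encoded_bytes`, a standard-library method that postdates this thread.

```rust
use std::ffi::OsStr;

/// Splits the platform encoding of `haystack` wherever the UTF-8
/// encoding of `boundary` occurs. Pieces may contain non-UTF-8 data.
fn split_on_char(haystack: &OsStr, boundary: char) -> Vec<&[u8]> {
    let bytes = haystack.as_encoded_bytes();
    let mut buf = [0u8; 4];
    let needle: &str = boundary.encode_utf8(&mut buf);
    let needle = needle.as_bytes();

    let mut pieces = Vec::new();
    let mut start = 0; // start of the piece currently being collected
    let mut i = 0; // current scan position
    while i + needle.len() <= bytes.len() {
        if &bytes[i..i + needle.len()] == needle {
            pieces.push(&bytes[start..i]);
            i += needle.len();
            start = i;
        } else {
            i += 1;
        }
    }
    pieces.push(&bytes[start..]);
    pieces
}
```

Because the needle is the complete UTF-8 encoding of a single `char`, no ASCII restriction is needed for this particular search.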

> As I mentioned in the Alternatives section, it is not possible to split an OsStr on an OsStr in Windows because of the details of the WTF-8 encoding.

Can you expand on this? Concatenation joining surrogate pairs that were previously not in a pair can be unexpected, but I don’t think it makes anything impossible.

@wthrowe

wthrowe Oct 13, 2015

Author

I'm going to move to a Pattern interface for split in the next version, although that might end up being scaled back if we can't decide what to do with a pattern that matches "".

I'll try to add a better explanation of the splitting problem. For now see my comments about remove_prefix later in these comments.

@alexcrichton

Member

alexcrichton commented Oct 8, 2015

Thanks for the RFC @wthrowe! It does seem high time that we start expanding the API for OsStr, so thanks for pushing on it!

My thoughts on this in the past have been primarily along the lines of:

  • All extensions should mirror str in one form or another wherever possible
  • Questions about encoding and such tend to work best if we can somehow sidestep the entire question (depends on the question at hand)
  • The encoding question may end up meaning that some str APIs aren't quite appropriate for OS strings, but it's certainly a space to explore!
@wthrowe

Author

wthrowe commented Oct 8, 2015

I'll look into writing up equivalents of some of the more general pattern matching str methods and replacing things those make obsolete. The only difficulty I can think of immediately is figuring out what patterns that match the empty string should do.

@alexcrichton

Member

alexcrichton commented Oct 9, 2015

Yeah I agree that str.split("") has a pretty good meaning whereas os_str.split("") may not. This may not be able to go the route of a "full Pattern trait" just yet which is totally fine, just food for thought!
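For context, this is what an empty pattern does on a plain `str` today; the open question is what the analogue should be when parts of an `OsStr` are not Unicode. The wrapper function below is only there to make the snippet self-contained.

```rust
/// Splitting on the empty pattern matches at every character boundary,
/// including the very start and the very end of the string.
fn split_on_empty(s: &str) -> Vec<&str> {
    s.split("").collect()
}
```

For `"ab"` this yields an empty piece, `"a"`, `"b"`, and another empty piece. The non-iterator methods are simpler: `"ab".contains("")` and `"ab".starts_with("")` are both true.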


This is analogous to the existing `OsStr::to_string_lossy` method, but
transfers ownership. This operation can be done without a copy if the
`OsString` contains UTF-8 data or if the platform is Windows.

@SimonSapin

SimonSapin Oct 13, 2015

Contributor

Reading between lines I think you already know that, but WTF-8 to UTF-8 conversion can be done in place: https://simonsapin.github.io/wtf-8/#converting-wtf-8-utf-8

@wthrowe

wthrowe Oct 13, 2015

Author

Yep. In fact, the internal WTF-8 implementation in libstd already has an into_string_lossy that does the right thing. It just isn't exposed at the OsString level.
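From outside `std`, an owning lossy conversion can only be approximated with public methods. The sketch below is such an approximation (the free function name mirrors the internal one but is otherwise hypothetical): it avoids a copy when the contents are already valid UTF-8, but unlike the internal WTF-8 routine it falls back to a copying conversion otherwise.

```rust
use std::ffi::OsString;

/// Converts an `OsString` into a `String`, replacing any non-Unicode
/// data with U+FFFD.
fn into_string_lossy(s: OsString) -> String {
    match s.into_string() {
        // Already valid Unicode: the buffer is reused, no copy.
        Ok(utf8) => utf8,
        // Otherwise fall back to the borrowing lossy conversion,
        // which allocates a new `String`.
        Err(original) => original.to_string_lossy().into_owned(),
    }
}
```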

let prefix: OsString = OsStringExt::from_wide(&[0xD83D]);
let suffix: OsString = OsStringExt::from_wide(&[0xDE3A]);
assert_eq!(string.remove_prefix(&prefix[..]), Some(&suffix[..]));

@SimonSapin

SimonSapin Oct 13, 2015

Contributor

It is possible to write remove_prefix for WTF-8 such that this holds. (When the prefix ends with a lead surrogate, also consider the corresponding range of four-byte sequences in self.) But this is such an edge case that it may not be worth the complexity, and it’s not even clear to me that it’s a desirable behavior.

@wthrowe

wthrowe Oct 13, 2015

Author

It is possible to write a starts_with variant that does this (and one will be included in the next version of this RFC), but for remove_prefix you also have to construct an &OsStr representing the rest of the string, which is a problem if that OsStr should start with a trail surrogate that is combined with something in the in-memory representation. (I suppose it could return a Cow<OsStr>, but I'm not planning on proposing that unless someone else thinks it's a really good idea.)
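The byte-level picture behind this can be sketched with a small encoder. The helper `wtf8_encode` is hypothetical and simply applies the generalized UTF-8 bit layout that WTF-8 uses, so that lone surrogates can be encoded too. It shows that the in-memory form of the joined pair shares no bytes with the encodings of its two halves, so there is nothing in the original buffer for a returned `&OsStr` to point at.

```rust
/// Encodes one code point with the generalized UTF-8 bit layout.
/// Unlike `char::encode_utf8`, this accepts surrogates (0xD800..=0xDFFF).
fn wtf8_encode(cp: u32) -> Vec<u8> {
    match cp {
        0..=0x7F => vec![cp as u8],
        0x80..=0x7FF => vec![0xC0 | (cp >> 6) as u8, 0x80 | (cp & 0x3F) as u8],
        0x800..=0xFFFF => vec![
            0xE0 | (cp >> 12) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
        _ => vec![
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
    }
}
```

The lead surrogate 0xD83D encodes as `ED A0 BD` and the trail surrogate 0xDE3A as `ED B8 BA`, but once concatenated they become U+1F63A, stored as the four bytes `F0 9F 98 BA`. A `starts_with` check can special-case this, whereas `remove_prefix` would need a borrowed slice equal to `ED B8 BA`, which does not exist in the haystack.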

wthrowe added some commits Oct 12, 2015

Replace str methods with patterns, add _os methods
Also adds more explanation of how OS strings are interpreted.

Switch the proposed and alternate function bounds
This seems to not prevent anything actually useful and avoids
confusion.  Also matches `str` better.

@wthrowe

Author

wthrowe commented Oct 17, 2015

New version, now with more patterns!

The previous proposed functions have mostly been replaced with a new interface matching str as closely as possible, but with a few new OsStr-specific operations.

I did decide to not propose slice_shift_char in the end, as I think I'd prefer to go the route of proposing more general functions for both str and OsStr, probably along the lines of fn split_prefix<P: Pattern<'a>>(&'a self, pat: P) -> Option<(&'a str, &'a Self)>, kind of like the current str::split_at. (slice_shift_char is then split_prefix(|_| true), except for some minor type differences.) (The idea that this would also be nice for str was pointed out by @SimonSapin in some line comments above.) It does make this proposal less useful in isolation, but I think it's a better direction in the long run.

The main uncertainty in the new version is what to do with patterns that match the empty string. The non-iterator functions (like contains) have pretty obvious interpretations, but the iterator ones (like matches) don't. I've listed a few possibilities in the text, but I think some more discussion is needed on this.

@wthrowe

Author

wthrowe commented Oct 17, 2015

One more thought on patterns: With more complicated implementors of Pattern (like regular expressions or something) the concept of "matches the empty string" is not even terribly meaningful. A pattern might match some empty strings but not others, depending on context. The most reasonable idea might be to just declare that patterns are matched against each Unicode section separately. (And maybe at the beginning and end of the string in any case, since it would feel odd for .starts_with("") to ever give false.)


```rust
/// Returns true if `needle` is a substring of `self`.
fn contains_os<S: AsRef<OsStr>>(&self, needle: S) -> bool;
```

@alexcrichton

alexcrichton Oct 19, 2015

Member

I think it may be a bit onerous to remember that contains and contains_os are both methods on an OsStr. I think we'd be in a better place (e.g. especially in mirroring str) if we could avoid extra methods like this. It may involve perhaps a new Pattern trait down below, but we may also be able to substitute AsRef<OsStr> for Pattern because str is in theory far more ubiquitous than OsStr
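As an illustration of what a single `AsRef<OsStr>`-bounded method could accept, here is a rough sketch written as a free function. It is an assumption-laden stand-in, not the RFC's implementation: it compares the platform encoding byte by byte via `OsStr::as_encoded_bytes` (a later `std` addition) and therefore ignores the surrogate edge cases on Windows discussed elsewhere in this thread.

```rust
use std::ffi::OsStr;

/// Returns true if the encoded bytes of `needle` occur in `haystack`.
fn contains_os<S: AsRef<OsStr>>(haystack: &OsStr, needle: S) -> bool {
    let haystack = haystack.as_encoded_bytes();
    let needle = needle.as_ref().as_encoded_bytes();
    // `windows(0)` would panic, and every string contains the empty string.
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}
```

Since `str`, `String`, `OsStr`, `OsString`, and `Path` all implement `AsRef<OsStr>`, one method covers both the `contains` and `contains_os` call sites for literal needles; it does not cover `char` or closure patterns.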

@wthrowe

wthrowe Oct 20, 2015

Author

Some kind of OsPattern trait implemented for P: Pattern and OsStr could probably work. (Implementing for AsRef<OsStr> as well would be forbidden by coherence, I believe.) There might be some trickiness due to starts_with and contains having different bounds. Have to think about it.

Edit: I think that can be dealt with by bounds on the trait methods.

No reason such a thing couldn't be used for replace as well, except that for some reason str doesn't accept patterns there.

@alexcrichton

alexcrichton Oct 20, 2015

Member

Yeah I forget the exact reason that replace on strings doesn't take a pattern, but I think there may be a good reason? (cc @Kimundi)

Otherwise yeah having an OsPattern trait seems not-too-bad here perhaps

@wthrowe

wthrowe Oct 30, 2015

Author

Here's a possible rough design:

Trait OsPattern<'a> is basically identical to Pattern<'a> except that the strs become OsStrs, the Searcher is bound on OsSearcher<'a>, and is_contained_in has an extra bound (more on that later).

Safe trait OsSearcher<'a> only has methods haystack (same as Searcher<'a>) and is_prefix_of.

Safe trait ReverseOsSearcher<'a> requires OsSearcher<'a> and adds is_suffix_of. OsStr::ends_with requires this bound.

Safe trait FullOsSearcher<'a> requires OsSearcher<'a> and adds is_contained_in. OsStr::contains requires this bound.

If we additionally want to support replace with patterns, then we also have:
Unsafe trait IndexedOsSearcher<'a> requires FullOsSearcher<'a> and adds the remainder of the Searcher<'a> methods, except that all the usize returns are changed to OsIndex<'a>s. OsStr::replace requires this bound.

Enum OsIndex<'a> has variants Unicode(&'a str, usize) and NonUnicode(&'a OsStr, NonUnicodeIndex). The first field in each variant is expected to be a substring of the haystack, and the second is an index into that substring. NonUnicodeIndex is a struct with basically no public interface. (OS-specific interfaces may be added later.)

If we decide to add methods like the split_prefix mentioned above we will likely need to add more variants on the OsSearcher trait. Ideally that would be figured out before any of this is stabilized.
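The rough design above might translate into declarations along these lines. This is only a guess at its shape: the comment gives trait and type names but no signatures, so every method signature here is an assumption, `OsPattern` and the unsafe indexed searcher are omitted, and `StrNeedle` is a hypothetical implementor added so the sketch has something to exercise.

```rust
use std::ffi::OsStr;

/// Opaque position inside a non-Unicode section; no public interface.
pub struct NonUnicodeIndex {
    _private: (),
}

/// A match position: a substring of the haystack plus an index into it.
pub enum OsIndex<'a> {
    Unicode(&'a str, usize),
    NonUnicode(&'a OsStr, NonUnicodeIndex),
}

pub trait OsSearcher<'a> {
    fn haystack(&self) -> &'a OsStr;
    fn is_prefix_of(&self) -> bool;
}

/// Required by a pattern-taking `ends_with`.
pub trait ReverseOsSearcher<'a>: OsSearcher<'a> {
    fn is_suffix_of(&self) -> bool;
}

/// Required by a pattern-taking `contains`.
pub trait FullOsSearcher<'a>: OsSearcher<'a> {
    fn is_contained_in(&self) -> bool;
}

/// Hypothetical searcher for a plain `&str` needle.
pub struct StrNeedle<'a> {
    haystack: &'a OsStr,
    needle: &'a str,
}

impl<'a> OsSearcher<'a> for StrNeedle<'a> {
    fn haystack(&self) -> &'a OsStr {
        self.haystack
    }
    fn is_prefix_of(&self) -> bool {
        // A valid UTF-8 needle can be compared byte-wise.
        self.haystack
            .as_encoded_bytes()
            .starts_with(self.needle.as_bytes())
    }
}

impl<'a> ReverseOsSearcher<'a> for StrNeedle<'a> {
    fn is_suffix_of(&self) -> bool {
        self.haystack
            .as_encoded_bytes()
            .ends_with(self.needle.as_bytes())
    }
}
```

Splitting the capabilities into separate traits lets a pattern type implement only what it can support, so `starts_with`, `ends_with`, and `contains` can each require a different bound.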

@alexcrichton

alexcrichton Nov 2, 2015

Member

Yeah in the limit I think this may end up duplicating the API surface area of the string pattern traits, but there's also a question as to whether this needs to be so full-blown just yet. In theory OS strings are used much more rarely than regular strings (especially for various text manipulation routines), so we may be able to get by with a much smaller API surface area.

I wonder if perhaps this could expose a relatively straightforward, if not as generic, API today which leaves room for this sort of expansion in the future but doesn't take the leap quite just yet?

@wthrowe

wthrowe Nov 8, 2015

Author

Reflecting on this, I think at the very least IndexedOsSearcher is silly, because I can't think of any case where one would implement FullOsSearcher but not it, so they can be merged.

The only reasonable simplification of the API that I can think of that can be generalized to something like this is to just bound on the Pattern family of traits. Then there are no new traits needed, and everything works except for starts_with, ends_with, contains, and replace with an OsStr. That's probably not too bad as a restriction, and I believe changing to OsPattern at some later time would be fully backwards compatible.

@aturon aturon assigned Kimundi and unassigned alexcrichton Mar 2, 2016

@alexcrichton

Member

alexcrichton commented Mar 3, 2016

I've unfortunately not been able to find much time to allocate to this RFC, so @Kimundi (who originally developed the pattern API for strings) is gonna take over this.

@Kimundi

Member

Kimundi commented Mar 15, 2016

Hi, just wanted to say that I haven't forgotten about this RFC, but I'm currently investigating what a fully general pattern API (for arbitrary slice types) would look like, and whether there might be incompatibilities or method name clashes with this RFC.

@Diggsey

Contributor

Diggsey commented Apr 6, 2016

@Kimundi Thanks for the update - could really do with this functionality!

@Kimundi

Member

Kimundi commented Apr 24, 2016

Update:

I have an incomplete, undocumented prototype at https://github.com/Kimundi/rust_pattern_api_v2 that provides the same Pattern API + iterators for both & and &mut of str, [T], and OsStr.

I need to find time to do a proper writeup for my findings, but in regard to this RFC the cliff notes are:

  • There are two separate Pattern API's for OsStr:
    • one for searching sub-str slices which is identical to the str API: Patterns str, char, char predicates.
    • one for searching sub-OsStr slices with Patterns OsStr, str, char, char predicates.
      • searching for OsStr patterns has the edge case of needing to reject patterns on Windows that start/end with the wrong kind of leading/trailing surrogate, since the Pattern API would not always be able to return slices to the haystack string that contains them. The alternative to a panic/err result would be to make the Pattern impl return Cow<OsStr> values.
  • If both are provided in some capacity, like in this RFC, they should probably be provided consistently and uniformly.
    • E.g., there should be symmetrical sets of Pattern-using foo and foo_os methods like contains() and contains_os(), instead of the proposed mix of, e.g., contains() that works similarly to taking an OsStr Pattern vs. starts_with() taking a str Pattern.
  • The "unicode embedded in OsStr" behavior seems to be implementable without going through the split_unicode layer and requiring cloneable Patterns.

@kamalmarhubi kamalmarhubi referenced this pull request May 3, 2016

Closed

Add mkstemp(3) #365

@alexcrichton

Member

alexcrichton commented Jul 19, 2016

The libs team discussed this RFC recently. The conclusion was that, while we'd like to implement something along these lines, the RFC would need a substantial revamp at this point, and we unfortunately haven't been able to reach the author, so we're going to close it. We're of course quite willing to entertain RFCs in this area, though!
