OS string string-like interface #1309
Conversation
wthrowe
changed the title
Add RFC for OS string string-like interface
OS string string-like interface
Oct 6, 2015
alexcrichton
added
the
T-libs
label
Oct 6, 2015
See rust-lang/rust#26499 (related discussion)
Thanks. I'd missed that discussion.
That also suggests the question of whether these functions should take things like |
alexcrichton
self-assigned this
Oct 8, 2015
alexcrichton
reviewed
Oct 8, 2015
```rust
/// Returns true if the string starts with a valid UTF-8 sequence
/// equal to the given `&str`.
fn starts_with_str(&self, prefix: &str) -> bool;
```
alexcrichton
Oct 8, 2015
Member
When adding string-like APIs to OsStr, it may be best to try to stick to original str APIs as much as possible, while also allowing all kinds of fun functionality for OsStr. Along those lines, perhaps this API could look like:
fn starts_with<S: AsRef<OsStr>>(&self, prefix: S) -> bool;
That should cover this use case as well as something like os_str.starts_with(&other_os_string).
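A rough sketch of how the suggested generic signature might behave, written as a free function because `OsStr` cannot be extended from outside libstd. The name `os_starts_with` is hypothetical, and the byte comparison assumes a Unix target (it uses `OsStrExt::as_bytes`); it does not address the WTF-8 questions raised for Windows.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: exposes the raw bytes of an OsStr

/// Hypothetical helper mirroring the suggested `OsStr::starts_with`.
/// Accepts anything viewable as an `OsStr` (`&str`, `&OsString`, `&Path`, ...).
pub fn os_starts_with<S: AsRef<OsStr>>(haystack: &OsStr, prefix: S) -> bool {
    // On Unix an OsStr is an arbitrary byte string, so a byte-wise prefix
    // comparison is enough for this sketch.
    haystack.as_bytes().starts_with(prefix.as_ref().as_bytes())
}
```

The single `AsRef<OsStr>` bound is what lets one method serve both the `&str` case and the `&other_os_string` case.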
alexcrichton
reviewed
Oct 8, 2015
/// If the string starts with the given `&str`, returns the rest
/// of the string. Otherwise returns `None`.
fn remove_prefix_str(&self, prefix: &str) -> Option<&OsStr>;
alexcrichton
Oct 8, 2015
Member
Along the lines of sticking as close to str as possible, I think that this would be best expressed as a splitn if you're working with str. Adding that API to OS strings, however, may be a bit tricky as it may involve dealing with the Pattern trait, so it may be best to perhaps hold off on this just yet? Either that or perhaps adding splitn which only takes P: AsRef<OsStr> for now.
SimonSapin
Oct 13, 2015
Contributor
With splitn you have to check that the first chunk is empty, and it does more work than necessary in the negative case.
I’ve wanted str::remove_prefix as well (on "normal" UTF-8 strings). Could it be added on both types?
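To make the comparison concrete for ordinary `str`, here is a small sketch of both routes. The function names are hypothetical; the direct route uses `str::strip_prefix`, which std provides today and which postdates this discussion.

```rust
/// The `splitn` route: split once on the prefix and accept the result only
/// if the piece before the first match is empty.
pub fn remove_prefix_via_splitn<'a>(s: &'a str, prefix: &str) -> Option<&'a str> {
    let mut parts = s.splitn(2, prefix);
    match (parts.next(), parts.next()) {
        (Some(""), Some(rest)) => Some(rest),
        _ => None, // prefix absent, or found somewhere other than the start
    }
}

/// The direct route: a single comparison at the start of the string.
pub fn remove_prefix_direct<'a>(s: &'a str, prefix: &str) -> Option<&'a str> {
    s.strip_prefix(prefix)
}
```

In the negative case the `splitn` version may scan the whole string looking for a match, while the direct version only inspects the first `prefix.len()` bytes.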
wthrowe
Oct 13, 2015
Author
Yeah, I still like this functionality, but you're right that it would be nice on both types of strings. Since modifying str seems out of scope for this RFC, I think this will probably be bumped to a future RFC.
Hmm, if this took a pattern and also returned the prefix, then it could nicely replace slice_shift_char.
alexcrichton
reviewed
Oct 8, 2015
/// and the remainder of the `OsStr`. Returns `None` if the
/// `OsStr` does not start with a character (either because it is
/// empty or because it starts with non-UTF-8 data).
fn slice_shift_char(&self) -> Option<(char, &OsStr)>;
alexcrichton
Oct 8, 2015
Member
We may want to hold off on adding this API to OS strings for now as the API on str is still unstable and it may be a while before it's stabilized (also a little dubious as to how useful it is).
wthrowe
Oct 8, 2015
Author
My understanding is that the str version of this is potentially non-useful because it is easy to rewrite it as
let first = string.chars().next();
if let Some(first) = first {
    let rest = &string[first.len_utf8()..];
    ...
}
But neither chars() nor indexing make sense on an OsStr. The best I can come up with for OsStr is something like
let mut split = string.splitn(3, "");
split.next().unwrap();
match (split.next(), split.next()) {
    (Some(first), Some(rest)) if first.to_str().map(|s| s.chars().count() == 1).unwrap_or(false) => {
        let first = first.to_str().unwrap().chars().next().unwrap();
        ...
    }
    _ => {}
}
but that's kind of a mouthful and is easy to screw up in subtle ways (and depends on what empty patterns end up doing).
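For comparison, a minimal sketch of what a dedicated `slice_shift_char` could do, assuming a Unix target where the raw bytes are reachable through `OsStrExt`. The free function `os_slice_shift_char` is hypothetical and is not the RFC's prototype implementation.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only byte access

/// Hypothetical stand-in for the proposed `OsStr::slice_shift_char`.
pub fn os_slice_shift_char(s: &OsStr) -> Option<(char, &OsStr)> {
    let bytes = s.as_bytes();
    // A UTF-8 encoded char is at most four bytes long.
    let head = &bytes[..bytes.len().min(4)];
    let valid = match std::str::from_utf8(head) {
        Ok(text) => text,
        // Keep only the part that decoded cleanly (possibly empty).
        Err(e) => std::str::from_utf8(&head[..e.valid_up_to()]).ok()?,
    };
    // `None` here means the string is empty or starts with non-UTF-8 data.
    let first = valid.chars().next()?;
    Some((first, OsStr::from_bytes(&bytes[first.len_utf8()..])))
}
```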
alexcrichton
Oct 9, 2015
Member
Hm, very good points! I realize now I should have read the section below a little more closely, thanks for the explanation!
alexcrichton
reviewed
Oct 8, 2015
/// If the `OsStr` starts with a UTF-8 section followed by
/// `boundary`, returns the sections before and after the boundary
/// character. Otherwise returns `None`.
fn split_off_str(&self, boundary: char) -> Option<(&str, &OsStr)>;
alexcrichton
Oct 8, 2015
Member
Similarly to remove_prefix_str above, perhaps this would be best suited for a splitn in terms of compositions? You'd probably get two OsStr instances out of that, but could call to_str on the first to get the same API as this.
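As an illustration of the behaviour under discussion, here is a sketch of `split_off_str` as a standalone function. It assumes a Unix target and searches for the boundary's UTF-8 bytes directly; the name `os_split_off_str` is hypothetical.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only byte access

/// Hypothetical stand-in for the proposed `OsStr::split_off_str`.
pub fn os_split_off_str(s: &OsStr, boundary: char) -> Option<(&str, &OsStr)> {
    let bytes = s.as_bytes();
    let mut buf = [0u8; 4];
    let encoded: &str = boundary.encode_utf8(&mut buf);
    let pat = encoded.as_bytes();
    // Position of the first occurrence of the boundary, if any.
    let at = bytes.windows(pat.len()).position(|w| w == pat)?;
    // Everything before the boundary must be valid UTF-8.
    let head = std::str::from_utf8(&bytes[..at]).ok()?;
    Some((head, OsStr::from_bytes(&bytes[at + pat.len()..])))
}
```

This is the kind of thing one would write for `--key=value` style arguments, where the key is expected to be text but the value may be arbitrary OS data.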
alexcrichton
reviewed
Oct 8, 2015
/// # Panics
///
/// Panics if the boundary character is not ASCII.
fn split<'a>(&'a self, boundary: char) -> Split<'a>;
alexcrichton
Oct 8, 2015
Member
This is certainly an interesting API! Some thoughts I have here:
- In the interest of staying aligned with `str`, this may want to take at least `P: AsRef<OsStr>` and can perhaps eventually be generalized to `P: OsPattern` (similar to the `split` function on `str`) to also allow `char`.
- Having this API panic if a character isn't ASCII though is pretty unfortunate, and I think it'd be best to avoid that (and perhaps using `OsStr` would alleviate that for now?)
wthrowe
Oct 8, 2015
Author
As I mentioned in the Alternatives section, it is not possible to split an OsStr on an OsStr in Windows because of the details of the WTF-8 encoding.
Thinking about this a bit more, the restriction to ASCII should be completely unnecessary. We should be able to take an arbitrary &str, and possibly even a full P: Pattern (although I'll have to think about that a bit more), but notably not any generalizations of those to OsStr.
alexcrichton
Oct 9, 2015
Member
Aha! Sorry I think I missed that part of the section below. Generalization to a full Pattern would be great, and even &str would be quite nice!
cc @SimonSapin, curious on your thoughts on split + the WTF-8 encoding
SimonSapin
Oct 13, 2015
Contributor
I agree the ASCII restriction is not necessary on Windows. I don’t think there is an issue splitting WTF-8 on any char. It’s not clear what splitting on a non-ASCII means on Unix, though. Split on the corresponding UTF-8 sequence? That makes sense (especially if split also accepts &str arguments), but needs to be documented.
> As I mentioned in the Alternatives section, it is not possible to split an OsStr on an OsStr in Windows because of the details of the WTF-8 encoding.
Can you expand on this? Concatenation joining surrogate pairs that were previously not in a pair can be unexpected, but I don’t think it makes anything impossible.
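The Unix reading floated above, splitting on the UTF-8 byte sequence of a `&str`, can be sketched as follows. This models only the Unix case, returns a `Vec` instead of a lazy `Split` iterator for brevity, and uses a hypothetical function name; empty needles are rejected because their semantics are still undecided in this thread.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only byte access

/// Hypothetical eager version of `OsStr::split` taking a `&str` needle.
pub fn os_split_on_str<'a>(haystack: &'a OsStr, needle: &str) -> Vec<&'a OsStr> {
    assert!(!needle.is_empty(), "empty needles are not handled in this sketch");
    let bytes = haystack.as_bytes();
    let pat = needle.as_bytes();
    let mut pieces = Vec::new();
    let (mut start, mut i) = (0, 0);
    while i + pat.len() <= bytes.len() {
        if &bytes[i..i + pat.len()] == pat {
            // Emit the piece before the match and continue after it.
            pieces.push(OsStr::from_bytes(&bytes[start..i]));
            i += pat.len();
            start = i;
        } else {
            i += 1;
        }
    }
    pieces.push(OsStr::from_bytes(&bytes[start..]));
    pieces
}
```

Because the needle is valid UTF-8, a match can never begin or end in the middle of a valid encoded character of the haystack, so non-ASCII needles need no special treatment here.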
wthrowe
Oct 13, 2015
Author
I'm going to move to a Pattern interface for split in the next version, although that might end up being scaled back if we can't decide what to do with a pattern that matches "".
I'll try to add a better explanation of the splitting problem. For now see my comments about remove_prefix later in these comments.
Thanks for the RFC @wthrowe! It does seem high time that we start expanding the API for My thoughts on this in the past have been primarily along the lines of:
I'll look into writing up equivalents of some of the more general pattern matching |
Yeah I agree that |
SimonSapin
reviewed
Oct 13, 2015
This is analogous to the existing `OsStr::to_string_lossy` method, but
transfers ownership. This operation can be done without a copy if the
`OsString` contains UTF-8 data or if the platform is Windows.
SimonSapin
Oct 13, 2015
Contributor
Reading between the lines I think you already know that, but WTF-8 to UTF-8 conversion can be done in place: https://simonsapin.github.io/wtf-8/#converting-wtf-8-utf-8
wthrowe
Oct 13, 2015
Author
Yep. In fact, the internal WTF-8 implementation in libstd already has an into_string_lossy that does the right thing. It just isn't exposed at the OsString level.
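Using only the public API, an owned lossy conversion can be approximated as below. This is a sketch with a hypothetical function name: it avoids a copy when the data is already valid Unicode, but its fallback always allocates, whereas the in-place WTF-8 conversion mentioned above would not.

```rust
use std::ffi::OsString;

/// Hypothetical stand-in for the proposed `OsString::into_string_lossy`.
pub fn os_into_string_lossy(s: OsString) -> String {
    match s.into_string() {
        // Already valid Unicode: the buffer is reused, no copy is made.
        Ok(text) => text,
        // Otherwise fall back to the borrowing lossy conversion, which
        // replaces invalid sequences with U+FFFD and allocates a new String.
        Err(original) => original.to_string_lossy().into_owned(),
    }
}
```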
SimonSapin
reviewed
Oct 13, 2015
let prefix: OsString = OsStringExt::from_wide(&[0xD83D]);
let suffix: OsString = OsStringExt::from_wide(&[0xDE3A]);
assert_eq!(string.remove_prefix(&prefix[..]), Some(&suffix[..]));
SimonSapin
Oct 13, 2015
Contributor
It is possible to write remove_prefix for WTF-8 such that this holds. (When the prefix ends with a lead surrogate, also consider the corresponding range of four-byte sequences in self.) But this is such an edge case that it may not be worth the complexity, and it’s not even clear to me that it’s a desirable behavior.
wthrowe
Oct 13, 2015
Author
It is possible to write a starts_with variant that does this (and one will be included in the next version of this RFC), but for remove_prefix you also have to construct an &OsStr representing the rest of the string, which is a problem if that OsStr should start with a trail surrogate that is combined with something in the in-memory representation. (I suppose it could return a Cow<OsStr>, but I'm not planning on proposing that unless someone else thinks it's a really good idea.)
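The byte-level reason can be shown without any Windows API. The sketch below encodes the lone lead surrogate `0xD83D` the way WTF-8 stores unpaired surrogates (a three-byte generalized UTF-8 sequence) and compares it with the four-byte sequence that the pair `0xD83D 0xDE3A` becomes. The helper names are hypothetical.

```rust
/// Hypothetical helper: encodes a code unit in 0x0800..=0xFFFF as a
/// three-byte generalized UTF-8 sequence, which is how WTF-8 stores an
/// unpaired surrogate.
pub fn encode_three_bytes(unit: u16) -> [u8; 3] {
    assert!(unit >= 0x0800, "smaller values use one or two bytes");
    [
        0xE0 | (unit >> 12) as u8,
        0x80 | ((unit >> 6) & 0x3F) as u8,
        0x80 | (unit & 0x3F) as u8,
    ]
}

/// Returns the bytes of the lone lead surrogate and of the combined pair.
pub fn surrogate_demo() -> ([u8; 3], Vec<u8>) {
    let lead = encode_three_bytes(0xD83D);
    // A well-formed pair is stored as the ordinary four-byte UTF-8 sequence.
    let pair = String::from_utf16(&[0xD83D, 0xDE3A]).unwrap().into_bytes();
    (lead, pair)
}
```

The lead surrogate alone is `ED A0 BD`, while the pair is `F0 9F 98 BA`, so the prefix is not a byte prefix of the combined string, and no sub-slice of those four bytes represents the trail surrogate on its own. That is why the remainder cannot be handed back as a borrowed `&OsStr`.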
wthrowe
added some commits
Oct 12, 2015
New version, now with more patterns! The previous proposed functions have mostly been replaced with a new interface matching I did decide to not propose The main uncertainty in the new version is what to do with patterns that match the empty string. The non-iterator functions (like |
One more thought on patterns: With more complicated implementors of |
alexcrichton
reviewed
Oct 19, 2015
```rust
/// Returns true if `needle` is a substring of `self`.
fn contains_os<S: AsRef<OsStr>>(&self, needle: S) -> bool;
```
alexcrichton
Oct 19, 2015
Member
I think it may be a bit onerous to remember that contains and contains_os are both methods on an OsStr. I think we'd be in a better place (e.g. especially in mirroring str) if we could avoid extra methods like this. It may involve perhaps a new Pattern trait down below, but we may also be able to substitute AsRef<OsStr> for Pattern because str is in theory far more ubiquitous than OsStr
wthrowe
Oct 20, 2015
Author
Some kind of OsPattern trait implemented for P: Pattern and OsStr could probably work. (Implementing for AsRef<OsStr> as well would be forbidden by coherence, I believe.) There might be some trickiness due to starts_with and contains having different bounds. Have to think about it.
Edit: I think that can be dealt with by bounds on the trait methods.
No reason such a thing couldn't be used for replace as well, except that for some reason str doesn't accept patterns there.
alexcrichton
Oct 20, 2015
Member
Yeah I forget the exact reason that replace on strings doesn't take a pattern, but I think there may be a good reason? (cc @Kimundi)
Otherwise yeah having an OsPattern trait seems not-too-bad here perhaps
wthrowe
Oct 30, 2015
Author
Here's a possible rough design:
Trait OsPattern<'a> is basically identical to Pattern<'a> except that the strs become OsStrs, the Searcher is bound on OsSearcher<'a>, and is_contained_in has an extra bound (more on that later).
Safe trait OsSearcher<'a> only has methods haystack (same as Searcher<'a>) and is_prefix_of.
Safe trait ReverseOsSearcher<'a> requires OsSearcher<'a> and adds is_suffix_of. OsStr::ends_with requires this bound.
Safe trait FullOsSearcher<'a> requires OsSearcher<'a> and adds is_contained_in. OsStr::contains requires this bound.
If we additionally want to support replace with patterns, then we also have:
Unsafe trait IndexedOsSearcher<'a> requires FullOsSearcher<'a> and adds the remainder of the Searcher<'a> methods, except that all the usize returns are changed to OsIndex<'a>s. OsStr::replace requires this bound.
Enum OsIndex<'a> has variants Unicode(&'a str, usize) and NonUnicode(&'a OsStr, NonUnicodeIndex). The first field in each variant is expected to be a substring of the haystack, and the second is an index into that substring. NonUnicodeIndex is a struct with basically no public interface. (OS-specific interfaces may be added later.)
If we decide to add methods like the split_prefix mentioned above we will likely need to add more variants on the OsSearcher trait. Ideally that would be figured out before any of this is stabilized.
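One possible reading of the first four items of this design, written out as compilable Rust. Every signature here is a guess made for illustration; the `StrSearcher` type, the free functions, and the byte-wise matching (Unix-only) are assumptions, not part of the proposal. `IndexedOsSearcher` and `OsIndex` are omitted.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only, used by the example impl

/// Counterpart of `Pattern<'a>` with `OsStr` haystacks.
pub trait OsPattern<'a>: Sized {
    type Searcher: OsSearcher<'a>;
    fn into_searcher(self, haystack: &'a OsStr) -> Self::Searcher;
}

pub trait OsSearcher<'a> {
    fn haystack(&self) -> &'a OsStr;
    fn is_prefix_of(&self) -> bool;
}

pub trait ReverseOsSearcher<'a>: OsSearcher<'a> {
    fn is_suffix_of(&self) -> bool;
}

pub trait FullOsSearcher<'a>: OsSearcher<'a> {
    fn is_contained_in(&self) -> bool;
}

/// Example searcher for a `&str` needle, matching on raw bytes.
pub struct StrSearcher<'a, 'b> {
    haystack: &'a OsStr,
    needle: &'b str,
}

impl<'a, 'b> OsPattern<'a> for &'b str {
    type Searcher = StrSearcher<'a, 'b>;
    fn into_searcher(self, haystack: &'a OsStr) -> Self::Searcher {
        StrSearcher { haystack, needle: self }
    }
}

impl<'a, 'b> OsSearcher<'a> for StrSearcher<'a, 'b> {
    fn haystack(&self) -> &'a OsStr {
        self.haystack
    }
    fn is_prefix_of(&self) -> bool {
        self.haystack.as_bytes().starts_with(self.needle.as_bytes())
    }
}

impl<'a, 'b> ReverseOsSearcher<'a> for StrSearcher<'a, 'b> {
    fn is_suffix_of(&self) -> bool {
        self.haystack.as_bytes().ends_with(self.needle.as_bytes())
    }
}

impl<'a, 'b> FullOsSearcher<'a> for StrSearcher<'a, 'b> {
    fn is_contained_in(&self) -> bool {
        let (h, n) = (self.haystack.as_bytes(), self.needle.as_bytes());
        n.is_empty() || h.windows(n.len()).any(|w| w == n)
    }
}

/// Stand-ins for the `OsStr` methods, each with its own searcher bound.
pub fn starts_with<'a, P: OsPattern<'a>>(haystack: &'a OsStr, pat: P) -> bool {
    pat.into_searcher(haystack).is_prefix_of()
}

pub fn ends_with<'a, P>(haystack: &'a OsStr, pat: P) -> bool
where
    P: OsPattern<'a>,
    P::Searcher: ReverseOsSearcher<'a>,
{
    pat.into_searcher(haystack).is_suffix_of()
}

pub fn contains<'a, P>(haystack: &'a OsStr, pat: P) -> bool
where
    P: OsPattern<'a>,
    P::Searcher: FullOsSearcher<'a>,
{
    pat.into_searcher(haystack).is_contained_in()
}
```

Putting the extra requirement on `P::Searcher` for each method is one way to let a pattern type support `starts_with` without being forced to support `contains`.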
alexcrichton
Nov 2, 2015
Member
Yeah in the limit I think this may end up duplicating the API surface area of the string pattern traits, but there's also a question as to whether this needs to be so full-blown just yet. In theory OS strings are used much more rarely than regular strings (especially for various text manipulation routines), so we may be able to get by with a much smaller API surface area.
I wonder if perhaps this could expose a relatively straightforward, if not as generic, API today which leaves room for this sort of expansion in the future but doesn't take the leap quite just yet?
wthrowe
Nov 8, 2015
Author
Reflecting on this, I think at the very least IndexedOsSearcher is silly, because I can't think of any case where one would implement FullOsSearcher but not it, so they can be merged.
The only reasonable simplification of the API that I can think of that can be generalized to something like this is to just bound on the Pattern family of traits. Then there are no new traits needed, and everything works except for starts_with, ends_with, contains, and replace with an OsStr. That's probably not too bad as a restriction, and I believe changing to OsPattern at some later time would be fully backwards compatible.
aturon
assigned
Kimundi
and unassigned
alexcrichton
Mar 2, 2016
I've unfortunately not been able to find much time to allocate to this RFC, so @Kimundi (who originally developed the pattern API for strings) is gonna take over this.
Hi, just wanted to say that I haven't forgotten about this RFC, but I'm currently investigating what a fully general pattern API (for arbitrary slice types) would look like, and whether there might be incompatibilities or method name clashes with this RFC.
@Kimundi Thanks for the update - could really do with this functionality!
Update: I have an incomplete, undocumented prototype at https://github.com/Kimundi/rust_pattern_api_v2 that provides the same Pattern API + iterators for both I need to find time to do a proper writeup for my findings, but in regard to this RFC the cliff notes are:
The libs team discussed this RFC recently and the conclusion was that while we'd like to implement something along these lines the RFC will need much of a revamp now and we unfortunately haven't been able to reach the author, so we're going to close. We're of course quite willing to entertain RFCs in this area though!
wthrowe commented Oct 6, 2015
Rendered
Prototype implementation (also covers #1307)
Cc: #900, #1307