RFC: Extend Pattern API to OsStr #2295
Centril added the T-libs label Jan 16, 2018
Probably completely out of place but – |
Thank you thank you thank you! @SimonSapin @Kimundi can one or both of you please provide a review of this RFC? |
aturon assigned Kimundi Feb 7, 2018
cc @rust-lang/libs this has been a very longstanding issue. Can we consider landing this API as unstable? |
Kimundi reviewed Feb 21, 2018
This seems like a good (if somewhat hacky) way to solve the surrogate point issue in doing pattern searches in Short summary: This RFC essentially proposes two things:
Both are prerequisites to extending the This simplifies the work needed for a follow-up RFC to actually extend the However, it is not clear to me why the proposed semantic for slicing an
It seems the main advantage of the first scheme is that there is exactly one slice position that indicates the middle of a surrogate pair. But it is not clear to me if this property is actually needed, since the low level implementations of I guess it amounts to the question of whether you can take the index returned by, e.g., Besides that, it is also not clear which slice positions should be legal in general across Windows and Unix. Again I can imagine two schemes:
Lastly, I think we should also be sure that changing the equality semantic of |
@Kimundi Thanks for the review

Yes

    let s = OsStr::new("\u{10000}");
    assert_eq!(s.len(), 4);
    let index = s.find('\u{dc00}').unwrap();
    assert_eq!(index, 1); // if we "Expose the proposed scheme"
    let right = &s[index..]; // this will be [90 80 80], fine
    let left = &s[..index]; // but this will be [f0] ??????

This also relates to the next question "which slice positions should be legal". We could
The second choice is no better than what this RFC proposed. There would also be the question in the meaning of The first choice means if you want the My preference about legal slice positions would be:
Because we support |
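For readers following the byte values quoted above, here is a minimal sketch of where they come from. It only uses standard `str` methods plus one hypothetical helper, `encode_surrogate`, which is not a std API; it applies the ordinary three-byte UTF-8 bit layout to a lone surrogate, which is how WTF-8 stores one. It does not reproduce the RFC's proposed split representation.

```rust
/// UTF-8 bytes of U+10000, the character used in the example above.
fn utf8_bytes() -> Vec<u8> {
    "\u{10000}".as_bytes().to_vec()
}

/// UTF-16 code units of U+10000: a high and a low surrogate.
fn utf16_units() -> Vec<u16> {
    "\u{10000}".encode_utf16().collect()
}

/// Hypothetical helper (not part of std): encodes one UTF-16 surrogate
/// with the three-byte UTF-8 bit layout, as WTF-8 does for lone surrogates.
fn encode_surrogate(unit: u16) -> [u8; 3] {
    assert!((0xD800..=0xDFFF).contains(&unit), "not a surrogate");
    [
        0xE0 | (unit >> 12) as u8,          // leading byte: top 4 bits
        0x80 | ((unit >> 6) & 0x3F) as u8,  // continuation: middle 6 bits
        0x80 | (unit & 0x3F) as u8,         // continuation: low 6 bits
    ]
}
```

The four-byte sequence `f0 90 80 80` is a single code point in UTF-8 but two code units in UTF-16, which is why a search for the low surrogate alone has to land somewhere inside those four bytes.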
Makes sense! So, would you agree that we could define the slicing semantic as:
We can then either expose this definition as a stable guarantee for users, or just use it as an internal one with the guarantee that you can slice at all indices that the Bad slicing positions should then just panic, just as with
@Kimundi Agreed. Since |
Great! Would you mind adding that as clarification to the RFC text? |
kennytm force-pushed the kennytm:os-str-pattern branch from 1758742 to 840e821 Feb 25, 2018
kennytm force-pushed the kennytm:os-str-pattern branch from 840e821 to 8b1171c Feb 25, 2018
Done! Added 8b1171c. |
Kimundi approved these changes Feb 27, 2018
Alright, the RFC looks good now in my opinion. I would add two further advantages of the "use real index" alternative:
But those are minor nits. |
And since there are no other comments, and we really should start making some progress on this, I'll be moving this to FCP asap. @SimonSapin: I don't think what is proposed here is too objectionable in regard to WTF-8, since the RFC doesn't expose its encoding directly. @rfcbot fcp merge |
hm... @rfcbot fcp merge |
rfcbot commented Feb 27, 2018
Team member @alexcrichton has proposed to merge this. The next step is review by the rest of the tagged teams: No concerns currently listed. Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! See this document for info about what commands tagged team members can give me. |
rfcbot added the proposed-final-comment-period label Feb 27, 2018
SimonSapin approved these changes Mar 12, 2018
Strong +1 on the changes proposed. (And some remarks on the RFC itself.) The high-level goal of WTF-8 is to provide an alternative representation (“encoding”) for I was mildly concerned by the escalation of complexity in the encoding, but this is mitigated by being an internal implementation detail whose details end users (hopefully) don’t need to care about. Maybe we would have been better off deciding pre-1.0 to make The term “WTF-8” has been around with a precise specification for a while, so rather than modifying WTF-8 it makes sense to define a new encoding (with a new name) that is a superset. Kudos to @kennytm for coming up with OMG-WTF-8’s design and a well-written and well-illustrated specification. |
    /// let path = OsStr::new("/usr/bin/bash");
    /// let range = path.rfind_range("/b");
    /// assert_eq!(range, Some(8..10));
    /// assert_eq!(path[range.unwrap()], OsStr::new("/bin"));
SimonSapin (Contributor) Mar 12, 2018
This and the similar example in find_range look wrong. Shouldn’t slicing result in OsStr::new("/b")?
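The arithmetic behind this remark can be checked with plain `&str`, whose `find` and `rfind` are stable. The sketch below is an assumption-laden stand-in: `find_range` and `rfind_range` here are hypothetical free functions on `&str`, not the `OsStr` methods the RFC proposes.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the proposed `OsStr::find_range`, on `&str`.
fn find_range(haystack: &str, needle: &str) -> Option<Range<usize>> {
    haystack.find(needle).map(|start| start..start + needle.len())
}

/// Hypothetical stand-in for the proposed `OsStr::rfind_range`, on `&str`.
fn rfind_range(haystack: &str, needle: &str) -> Option<Range<usize>> {
    haystack.rfind(needle).map(|start| start..start + needle.len())
}

/// Slices the haystack with the range of the last match, if any.
fn slice_last_match<'a>(haystack: &'a str, needle: &str) -> Option<&'a str> {
    rfind_range(haystack, needle).map(|range| &haystack[range])
}
```

With the haystack `"/usr/bin/bash"` and the needle `"/b"`, the last match is at `8..10`, and slicing with that range yields the needle itself, not `"/bin"`.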
It is trivial to apply the pattern API to `OsStr` on platforms where it is just an `[u8]`. The main
difficulty is on Windows where it is an `[u16]` encoded as WTF-8. This RFC thus focuses on Windows.

We will generalize the encoding of `OsStr` to specify these two capabilities:
SimonSapin (Contributor) Mar 12, 2018
https://rust-lang.github.io/rfcs/0517-io-os-reform.html#string-handling does mention WTF-8 by name, so this RFC should probably mention the OMG-WTF-8 encoding by name (as a definition of the encoding, not just a sample implementation) even if the encoding remains an implementation detail.
The implementation PR should also make sure to update internal documentation to say that the representation of OsStr on Windows becomes OMG-WTF-8, with a link to the encoding’s full specification.
## Pattern API

This RFC assumes a generalized pattern API which supports more than strings. If the pattern API is
SimonSapin (Contributor) Mar 12, 2018
I don’t quite understand what it means for an RFC to assume something. I’d prefer this RFC to propose one concrete API for Pattern and related traits (and maybe list alternatives in the Alternatives section).
kennytm (Author, Member) Mar 12, 2018
I've expanded this section and linked to the previous drafts. However, I prefer not to formally propose the pattern API here in this RFC as that is kinda out-of-scope.
kennytm added some commits Mar 12, 2018
kennytm force-pushed the kennytm:os-str-pattern branch from 7c33b8b to 3d48148 Mar 12, 2018
Overall, this RFC looks great, but I think there might be an important question here that isn't covered. In particular, how are folks outside of What I do today (on Windows of course, the Unix case is easy) is lossily convert the |
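As a rough illustration of the lossy workaround mentioned here, the sketch below searches an `OsStr` by first converting it with the stable `OsStr::to_string_lossy`. The function name `contains_lossy` is hypothetical, and this is only one plausible shape of such a workaround, not necessarily what the commenter's code does.

```rust
use std::ffi::OsStr;

/// Hypothetical helper: substring search on an `OsStr` via lossy conversion.
/// Any data that is not valid Unicode is replaced with U+FFFD before the
/// search, so a match is only reliable for the well-formed parts.
fn contains_lossy(haystack: &OsStr, needle: &str) -> bool {
    haystack.to_string_lossy().contains(needle)
}
```

The conversion may allocate and discards information whenever the input is not valid Unicode, which is the cost a pattern API on `OsStr` itself would avoid.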
Since we do not want to expose the WTF-8 buffer, this is probably not easy to
Right, but that still requires me to assume WTF-8 in the first place. To be clear, I do think that if this use case becomes important for performance, then we should recognize the people can and will depend on internal implementation details in the name of performance. Does that change how we approach this? If enough people do it, then what was an internal implementation detail will become de facto stabilized.
Hmm could you unpack this? (I understand the Unix side of things. For the purposes of this discussion, Unix is a non-issue.)
How does this work if I can't assume WTF-8? |
I see what you mean now. I think the situation now is not much different from https://internals.rust-lang.org/t/make-std-os-unix-ffi-osstrext-cross-platform/6277/5. Although this RFC focuses on WTF-8, the new methods do not assume (OMG-)WTF-8, even if it is very hard to find another encoding that satisfies all the requirements:
That does not mean an alternative encoding is impossible though. For instance we could slightly tweak WTF-8 to store the lone surrogate codepoints as Therefore, unless the libs team thinks otherwise, I suppose we will maintain the status quo i.e. there is no safe way to access the underlying buffer. There's no way we can prevent the de facto WTF-8 assumption, but at least we could get away by claiming "THIS IS UNSAFE!!!1"? This also means there won't be a stable non-std
On Windows, an |
@kennytm Oh hmm, let me pop up a level. I'm less coming at this as "a libs team member who is strenuously objecting" and more as "let's make sure we walk into this with our eyes wide open." Of course, I agree that this isn't making the status quo worse, and more to the point, this new pattern API is a wonderful improvement over the status quo. I think my only real aim here is to get us to carefully consider that what is today an internal implementation detail, may tomorrow be a de facto public API, regardless of how insistent we are about it. I don't have the depth of understanding to know exactly what the ramifications of that are though! More concretely, if the only thing that comes of my bleating is a couple sentences as a documentation somewhere, then I'm happy. :-) |
Another reason I would be reluctant to have a public API that expose WTF-8 bytes, more than losing the option to change the encoding later, is the risk that people might not realize this is not UTF-8 and send those bytes over the network or write them to a file. Before we know it, WTF-8 might accidentally become a de-facto requirement for interoperably implementing some protocol or format. |
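To make the interoperability concern concrete, the following sketch checks two byte sequences against Rust's strict UTF-8 validator. The bytes `ED A0 80` are the three-byte form of the lone surrogate U+D800, which WTF-8 permits and UTF-8 forbids; `is_valid_utf8` is a hypothetical wrapper around the stable `std::str::from_utf8`.

```rust
/// Hypothetical wrapper: true if `bytes` is well-formed UTF-8.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// WTF-8 style encoding of the lone surrogate U+D800 (not valid UTF-8).
const LONE_SURROGATE: [u8; 3] = [0xED, 0xA0, 0x80];

/// UTF-8 encoding of U+10000 (valid in both UTF-8 and WTF-8).
const SUPPLEMENTARY: [u8; 4] = [0xF0, 0x90, 0x80, 0x80];
```

A receiver that validates input as UTF-8 would reject the first sequence, so bytes that leak out of an `OsStr` on Windows cannot be assumed to round-trip through UTF-8 tooling.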
vitiral commented Mar 27, 2018
This is a tangential point to this RFC but I just wanted to mention that I developed STFU8 specifically for the use case of serializing/deserializing I'm a huge |
sfackler reviewed Mar 30, 2018
    unsafe fn range_to_self(hs: Self, start: Self::StartCursor, end: Self::EndCursor) -> Self;
    // Since a StartCursor and EndCursor may not be comparable, we also need this method
    fn is_range_empty(start: Self::StartCursor, end: Self::EndCursor) -> bool;
sfackler (Member) Mar 30, 2018
We could add something like type StartCursor: Copy + PartialOrd<Self::EndCursor> to handle this possibly?
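A minimal sketch of how that bound could look, under stated assumptions: `CursorHaystack` is a hypothetical trait name used only for illustration, it keeps just the two cursor types from the excerpt above, and it turns `is_range_empty` into a default method built on the suggested `PartialOrd<Self::EndCursor>` bound. It is not the trait from the RFC.

```rust
/// Hypothetical trait, for illustration only: cursor types where a start
/// cursor can be compared against an end cursor.
trait CursorHaystack {
    type StartCursor: Copy + PartialOrd<Self::EndCursor>;
    type EndCursor: Copy;

    /// With the comparison bound in place, emptiness needs no per-type
    /// implementation: the range is empty once start is not before end.
    fn is_range_empty(start: Self::StartCursor, end: Self::EndCursor) -> bool {
        start >= end
    }
}

/// For `&str`, both cursors can simply be byte offsets.
impl<'a> CursorHaystack for &'a str {
    type StartCursor = usize;
    type EndCursor = usize;
}
```

One design consequence worth noting: cursors that are incomparable under `PartialOrd` make `start >= end` false, so an implementation with such cursors would still need to override the default.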
rfcbot commented Mar 30, 2018
rfcbot added final-comment-period and removed proposed-final-comment-period labels Mar 30, 2018
rfcbot commented Apr 9, 2018
The final comment period is now complete. |
Centril referenced this pull request Apr 9, 2018: Tracking issue for RFC 2295, "Extend Pattern API to OsStr" #49802 (Open)
Centril merged commit 9c1609f into rust-lang:master Apr 9, 2018
Huzzah! Tracking issue: rust-lang/rust#49802 |
kennytm commented Jan 16, 2018 (edited by Centril)
Supersedes the "Pattern API" part of RFC #1309.
omgwtf8
cc #900.
cc @Kimundi (#528 "Pattern API 1.0")
cc @SimonSapin (WTF-8)