Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: introduce `split_at(mid: usize)` on `str` #1123

Merged
merged 2 commits into from Jun 9, 2015

Conversation

Projects
None yet
10 participants
@bluss
Copy link

commented May 17, 2015

Introduce the method split_at(&self, mid: usize) -> (&str, &str) on str,
to divide a slice into two, just like we can with [T].

Rendered version

Adding split_at is a measure to provide a method from [T] in a version that
makes sense for str.

Once used to [T], users might even expect that split_at is present on str.

/// Divide one string slice into two at an index.
///
/// The index `mid` is a byte offset from the start of the string
/// that must be on a character boundary.
///
/// Return slices `&self[..mid]` and `&self[mid..]`.
///
/// # Panics
///
/// Panics if `mid` is beyond the last character of the string,
/// or if it is not on a character boundary.
///
/// # Examples
/// ```
/// let s = "Löwe 老虎 Léopard";
/// let first_space = s.find(' ').unwrap_or(s.len());
/// let (a, b) = s.split_at(first_space);
///
/// assert_eq!(a, "Löwe");
/// assert_eq!(b, " 老虎 Léopard");
/// ```
fn split_at(&self, mid: usize) -> (&str, &str) {
    (&self[..mid], &self[mid..])
}
@bluss

This comment has been minimized.

Copy link
Author

commented May 17, 2015

cc @Manishearth you requested something like this in rust-lang/rust#18063.

@Manishearth

This comment has been minimized.

Copy link
Member

commented May 17, 2015

👍 😄

[Edit: I'm wrong here]
The current implementation could be improved to only have a single pass through the string (this iterates twice because each indexing operation is one loop)

@rkruppe

This comment has been minimized.

Copy link
Member

commented May 17, 2015

@Manishearth It doesn't loop at all, it uses byte indices and therefore each slice operation is constant time (just check that the index is in bounds and the byte is the start of a UTF-8 code point, then adjust the start/length of the slice).

@Manishearth

This comment has been minimized.

Copy link
Member

commented May 17, 2015

[Edit: I'm wrong here]
Strings are utf8, byte indexing would make this unsafe.

String indexing indexes by chars IIRC, and for that you need to loop through because chars have variable length.

@rkruppe

This comment has been minimized.

Copy link
Member

commented May 17, 2015

String indexing is by bytes and checks that the index lies on a codepoint boundary (which is cheap, just check that the byte is less than ox80). This is consistent with almost all other methods, including char_at.

@Manishearth

This comment has been minimized.

Copy link
Member

commented May 17, 2015

Ah, okay.

@bluss

This comment has been minimized.

Copy link
Author

commented May 17, 2015

The method has only a single index that needs to be bounds checked and checked for character boundary (mid), so if the redundant checks are not elided already, split_at can be optimized to fix that. The implementation in the RFC is of course only for illustration.

@alexcrichton

This comment has been minimized.

Copy link
Member

commented May 18, 2015

This sounds like a great idea to me, thanks for writing this up @bluss!

@alexcrichton alexcrichton added the T-libs label May 18, 2015

@retep998

This comment has been minimized.

Copy link
Member

commented May 19, 2015

Since this doesn't return Option, that means this code would panic! if you tried to use an index in the middle of a utf-8 sequence. Make sure you add that to the drawbacks.

@Gankro

This comment has been minimized.

Copy link
Contributor

commented May 20, 2015

I am of the belief that now that the trains are running, this kind of change no longer merits requires any kind of RFC. Am I mistaken?

@seanmonstar

This comment has been minimized.

Copy link
Contributor

commented May 20, 2015

I would expect adding any new public API to require an RFC. Twiddling the internals is just implementation that's hidden, but something that will eventually be marked #[stable] needs an RFC.

@bluss

This comment has been minimized.

Copy link
Author

commented May 20, 2015

@retep998 Done!

@seanmonstar This was what I thought was the rule, and I agree with you.

@pythonesque

This comment has been minimized.

Copy link
Contributor

commented May 20, 2015

Is there any chance split_at for strings could be modified to return an Option instead? It's actually pretty common for me to want a non-panicking version of this for slices and I definitely have a use case for it in parsers. I realize it would be inconsistent with the [T] version, but I've never really understood why that one doesn't return an Option.

@bluss

This comment has been minimized.

Copy link
Author

commented May 20, 2015

The panicking logic is the same as for slicing. You should only ever use it with a byte offset you got from somewhere, so you know it's always correct (for example from find like in the example). So we panic because it's a programmer's error (contract violation) to specify an illegal offset.

I absolutely don't want to introduce an inconsistency. A part of the motivation here is that you expect .split_at to be available as it is on slices. If we go the option route, it needs a new name.

Contract violations @ Error conventions RFC

@pythonesque

This comment has been minimized.

Copy link
Contributor

commented May 20, 2015

You should only ever use it with a byte offset you got from somewhere, so you know it's always correct (for example from find like in the example).

I don't think it's necessarily a contract violation (because strings and vectors can grow and shrink) but yeah, maybe I just want an additional checked version (but I thought our new policy on this was that we shouldn't have both a panicking and non-panicking version of an API?).

@bluss

This comment has been minimized.

Copy link
Author

commented May 20, 2015

@pythonesque

Since we don't have any optional-value slicing, I think we should add .get_slice<R: RangeArgument>(range: R) -> Option<&str> first.

Additionally or alternatively, we should stabilize is_char_boundary so that it is possible to check indices for slicing and split_at up front. I think that's in line with our guidelines. Maybe the function could be renamed to something like is_valid_index ?

bluss pushed a commit to bluss/rust that referenced this pull request May 27, 2015

Ulrik Sverdrup
Add str::split_at
Implement RFC rust-lang/rfcs#1123

Add str method str::split_at(mid: usize) -> (&str, &str).

bluss pushed a commit to bluss/rust that referenced this pull request May 27, 2015

Ulrik Sverdrup
Add str::split_at
Implement RFC rust-lang/rfcs#1123

Add str method str::split_at(mid: usize) -> (&str, &str).
@alexcrichton

This comment has been minimized.

Copy link
Member

commented Jun 2, 2015

This RFC is now entering the final week-long final comment period.

The library subteam may not require an RFC for a change such as this in the future, but we have yet to concretely decided on the guidelines for what needs an RFC (for discussion see this thread). While we develop this policy, however, we're going to go ahead with this RFC.

@Gankro

This comment has been minimized.

Copy link
Contributor

commented Jun 2, 2015

This seems fine to add. No strong feelings, though.

@huonw

This comment has been minimized.

Copy link
Member

commented Jun 3, 2015

My only comment is that I question the unicode hygiene of this function (i.e. naive calls to this probably won't be unicode-correct?). However, this is consistent with [T], no worse than things we already expose, definitely clearer than double-..-slicing and anyway we generally don't want to stop people doing things just because there's the risk of misuse. On balance, I'm happy to see this land.

@rkruppe

This comment has been minimized.

Copy link
Member

commented Jun 3, 2015

What do you mean by unicode-correct? Like all other slicing and indexing, it respects code point boundaries, i.e., it will rather panic than cutting a code point in half. It won't try to make you deal with grapheme clusters or anything, but none of the existing methods do that (except for graphemes and grapheme_indices, obviously).

@huonw

This comment has been minimized.

Copy link
Member

commented Jun 3, 2015

I mean that it is very easy to split a string to give nonsense output, e.g. at the most basic level, splitting the two-codepoint version of ä between a and the combining character, and, it's not immediately obvious to me if performing this operation is useful/can be correct for arbitrary unicode strings. (Don't get me wrong, it can definitely be correct and useful for restricted subsets, e.g. ASCII-only strings, but an arbitrary chunk of text can be very arbitrary.)

@rkruppe

This comment has been minimized.

Copy link
Member

commented Jun 3, 2015

But all other string manipulation1 already has that problem. It's a long-standing policy to not try and prevent that sort of error. In addition, it's not like this API can't be used "properly" --- you just have to select the index at which you split properly. Which, again, is exactly the same situations as with the rest of the string API.

1With the aforementioned exception, which BTW moved out of libstd now.

@huonw

This comment has been minimized.

Copy link
Member

commented Jun 3, 2015

Yes, I covered those aspects in my comment. It's why I'm not against this landing. :)

@Gankro Gankro self-assigned this Jun 5, 2015

bluss pushed a commit to bluss/rust that referenced this pull request Jun 9, 2015

Ulrik Sverdrup
Add str::split_at
Implement RFC rust-lang/rfcs#1123

Add str method str::split_at(mid: usize) -> (&str, &str).
@alexcrichton

This comment has been minimized.

Copy link
Member

commented Jun 9, 2015

This consensus of the libs subteam is that we should merge this RFC, so I'm going to merge it. Thanks again for the RFC @bluss!

@alexcrichton alexcrichton merged commit 8f8f7ea into rust-lang:master Jun 9, 2015

@bluss bluss deleted the bluss:str-split-at branch Jun 10, 2015

bors added a commit to rust-lang/rust that referenced this pull request Jun 10, 2015

Auto merge of #25839 - bluss:str-split-at-impl, r=alexcrichton
Implement RFC rust-lang/rfcs#1123

Add str method str::split_at(mid: usize) -> (&str, &str).

Also a minor cleanup in the collections::str module. Remove redundant slicing of self.

bluss pushed a commit to bluss/rust that referenced this pull request Jun 10, 2015

Ulrik Sverdrup
Add str::split_at
Implement RFC rust-lang/rfcs#1123

Add str method str::split_at(mid: usize) -> (&str, &str).

bors added a commit to rust-lang/rust that referenced this pull request Jun 11, 2015

Auto merge of #25839 - bluss:str-split-at-impl, r=alexcrichton
Implement RFC rust-lang/rfcs#1123

Add str method str::split_at(mid: usize) -> (&str, &str).

Also a minor cleanup in the collections::str module. Remove redundant slicing of self.

thepowersgang added a commit to thepowersgang/rust that referenced this pull request Jul 25, 2015

Add str::split_at
Implement RFC rust-lang/rfcs#1123

Add str method str::split_at(mid: usize) -> (&str, &str).

@chriskrycho chriskrycho referenced this pull request Feb 8, 2017

Closed

Document all features in the reference #38643

0 of 17 tasks complete

@chriskrycho chriskrycho referenced this pull request Mar 11, 2017

Closed

Document all features #9

18 of 48 tasks complete

@Centril Centril added the A-slice label Nov 23, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.