Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

collections: Stabilize String #17438

Merged
merged 3 commits into from Sep 24, 2014

Conversation

@alexcrichton
Copy link
Member

commented Sep 22, 2014

Rationale

When dealing with strings, many functions deal with either a char (unicode
codepoint) or a byte (utf-8 encoding related). There is often an inconsistent
way in which methods are referred to as to whether they contain "byte", "char",
or nothing in their name. There are also issues open to rename all methods to
reflect that they operate on utf8 encodings or bytes (e.g. utf8_len() or
byte_len()).

The current state of String seems to largely be what is desired, so this PR
proposes the following rationale for methods dealing with bytes or characters:

When constructing a string, the input encoding must be mentioned (e.g.
from_utf8). This makes it clear what exactly the input type is expected to be
in terms of encoding.

When a method operates on anything related to an index within the string
such as length, capacity, position, etc, the method implicitly operates on
bytes. It is an understood fact that String is a utf-8 encoded string, and
burdening all methods with "bytes" would be redundant.

When a method operates on the contents of a string, such as push() or pop(),
then "char" is the default type. A String can loosely be thought of as being a
collection of unicode codepoints, but not all collection-related operations
make sense because some can be woefully inefficient.

Method stabilization

The following methods have been marked #[stable]

  • The String type itself
  • String::new
  • String::with_capacity
  • String::from_utf16_lossy
  • String::into_bytes
  • String::as_bytes
  • String::len
  • String::clear
  • String::as_slice

The following methods have been marked #[unstable]

  • String::from_utf8 - The error type in the returned Result may change to
    provide a nicer message when it's unwrap()'d
  • String::from_utf8_lossy - The returned MaybeOwned type still needs
    stabilization
  • String::from_utf16 - The return type may change to become a Result which
    includes more contextual information like where the error
    occurred.
  • String::from_chars - This is equivalent to iter().collect(), but currently not
    as ergonomic.
  • String::from_char - This method is the equivalent of Vec::from_elem, and has
    been marked #[unstable] becuase it can be seen as a
    duplicate of iterator-based functionality as well as
    possibly being renamed.
  • String::push_str - This can be emulated with .extend(foo.chars()), but is
    less efficient because of decoding/encoding. Due to the
    desire to minimize API surface this may be able to be
    removed in the future for something possibly generic with
    no loss in performance.
  • String::grow - This is a duplicate of iterator-based functionality, which may
    become more ergonomic in the future.
  • String::capacity - This function was just added.
  • String::push - This function was just added.
  • String::pop - This function was just added.
  • String::truncate - The failure conventions around String methods and byte
    indices isn't totally clear at this time, so the failure
    semantics and return value of this method are subject to
    change.
  • String::as_mut_vec - the naming of this method may change.
  • string::raw::* - these functions are all waiting on an RFC

The following method have been marked #[experimental]

  • String::from_str - This function only exists as it's more efficient than
    to_string(), but having a less ergonomic function for
    performance reasons isn't the greatest reason to keep it
    around. Like Vec::push_all, this has been marked
    experimental for now.

The following methods have been #[deprecated]

  • String::append - This method has been deprecated to remain consistent with the
    deprecation of Vec::append. While convenient, it is one of
    the only functional-style apis on String, and requires more
    though as to whether it belongs as a first-class method or
    now (and how it relates to other collections).
  • String::from_byte - This is fairly rare functionality and can be emulated with
    str::from_utf8 plus an assert plus a call to to_string().
    Additionally, String::from_char could possibly be used.
  • String::byte_capacity - Renamed to String::capacity due to the rationale
    above.
  • String::push_char - Renamed to String::push due to the rationale above.
  • String::pop_char - Renamed to String::pop due to the rationale above.
  • String::push_bytes - There are a number of unsafe functions on the String
    type which allow bypassing utf-8 checks. These have all
    been deprecated in favor of calling .as_mut_vec() and
    then operating directly on the vector returned. These
    methods were deprecated because naming them with relation
    to other methods was difficult to rationalize and it's
    arguably more composable to call .as_mut_vec().
  • String::as_mut_bytes - See push_bytes
  • String::push_byte - See push_bytes
  • String::pop_byte - See push_bytes
  • String::shift_byte - See push_bytes

Reservation methods

This commit does not yet touch the methods for reserving bytes. The methods on
Vec have also not yet been modified. These methods are discussed in the upcoming
Collections reform RFC

# Rationale

When dealing with strings, many functions deal with either a `char` (unicode
codepoint) or a byte (utf-8 encoding related). There is often an inconsistent
way in which methods are referred to as to whether they contain "byte", "char",
or nothing in their name.  There are also issues open to rename *all* methods to
reflect that they operate on utf8 encodings or bytes (e.g. utf8_len() or
byte_len()).

The current state of String seems to largely be what is desired, so this PR
proposes the following rationale for methods dealing with bytes or characters:

> When constructing a string, the input encoding *must* be mentioned (e.g.
> from_utf8). This makes it clear what exactly the input type is expected to be
> in terms of encoding.
>
> When a method operates on anything related to an *index* within the string
> such as length, capacity, position, etc, the method *implicitly* operates on
> bytes. It is an understood fact that String is a utf-8 encoded string, and
> burdening all methods with "bytes" would be redundant.
>
> When a method operates on the *contents* of a string, such as push() or pop(),
> then "char" is the default type. A String can loosely be thought of as being a
> collection of unicode codepoints, but not all collection-related operations
> make sense because some can be woefully inefficient.

# Method stabilization

The following methods have been marked #[stable]

* The String type itself
* String::new
* String::with_capacity
* String::from_utf16_lossy
* String::into_bytes
* String::as_bytes
* String::len
* String::clear
* String::as_slice

The following methods have been marked #[unstable]

* String::from_utf8 - The error type in the returned `Result` may change to
                      provide a nicer message when it's `unwrap()`'d
* String::from_utf8_lossy - The returned `MaybeOwned` type still needs
                            stabilization
* String::from_utf16 - The return type may change to become a `Result` which
                       includes more contextual information like where the error
                       occurred.
* String::from_chars - This is equivalent to iter().collect(), but currently not
                       as ergonomic.
* String::from_char - This method is the equivalent of Vec::from_elem, and has
                      been marked #[unstable] becuase it can be seen as a
                      duplicate of iterator-based functionality as well as
                      possibly being renamed.
* String::push_str - This *can* be emulated with .extend(foo.chars()), but is
                     less efficient because of decoding/encoding. Due to the
                     desire to minimize API surface this may be able to be
                     removed in the future for something possibly generic with
                     no loss in performance.
* String::grow - This is a duplicate of iterator-based functionality, which may
                 become more ergonomic in the future.
* String::capacity - This function was just added.
* String::push - This function was just added.
* String::pop - This function was just added.
* String::truncate - The failure conventions around String methods and byte
                     indices isn't totally clear at this time, so the failure
                     semantics and return value of this method are subject to
                     change.
* String::as_mut_vec - the naming of this method may change.
* string::raw::* - these functions are all waiting on [an RFC][2]

[2]: rust-lang/rfcs#240

The following method have been marked #[experimental]

* String::from_str - This function only exists as it's more efficient than
                     to_string(), but having a less ergonomic function for
                     performance reasons isn't the greatest reason to keep it
                     around. Like Vec::push_all, this has been marked
                     experimental for now.

The following methods have been #[deprecated]

* String::append - This method has been deprecated to remain consistent with the
                   deprecation of Vec::append. While convenient, it is one of
                   the only functional-style apis on String, and requires more
                   though as to whether it belongs as a first-class method or
                   now (and how it relates to other collections).
* String::from_byte - This is fairly rare functionality and can be emulated with
                      str::from_utf8 plus an assert plus a call to to_string().
                      Additionally, String::from_char could possibly be used.
* String::byte_capacity - Renamed to String::capacity due to the rationale
                          above.
* String::push_char - Renamed to String::push due to the rationale above.
* String::pop_char - Renamed to String::pop due to the rationale above.
* String::push_bytes - There are a number of `unsafe` functions on the `String`
                       type which allow bypassing utf-8 checks. These have all
                       been deprecated in favor of calling `.as_mut_vec()` and
                       then operating directly on the vector returned. These
                       methods were deprecated because naming them with relation
                       to other methods was difficult to rationalize and it's
                       arguably more composable to call .as_mut_vec().
* String::as_mut_bytes - See push_bytes
* String::push_byte - See push_bytes
* String::pop_byte - See push_bytes
* String::shift_byte - See push_bytes

# Reservation methods

This commit does not yet touch the methods for reserving bytes. The methods on
Vec have also not yet been modified. These methods are discussed in the upcoming
[Collections reform RFC][1]

[1]: https://github.com/aturon/rfcs/blob/collections-conventions/active/0000-collections-conventions.md#implicit-growth
This commit deprecates the String::shift_char() function in favor of the
addition of an insert()/remove() pair of functions. This aligns the API with Vec
in that characters can be inserted at arbitrary positions.  Additionaly, there
is no `_char` suffix due to the rationaled laid out in the previous commit.

These functions are both introduced as unstable as their failure semantics,
while in line with slices/vectors, are uncertain about whether they should
remain the same.
@rust-highfive

This comment has been minimized.

Copy link
Collaborator

commented Sep 22, 2014

warning Warning warning

  • These commits modify unsafe code. Please review it carefully!
@alexcrichton alexcrichton force-pushed the alexcrichton:string-stable branch from ecfc954 to f1f302d Sep 22, 2014
@aturon

This comment has been minimized.

Copy link

commented on src/libcollections/string.rs in 79b4ce0 Sep 22, 2014

"call .as_mut_vec().as_mut_slice() instead"

@aturon

This comment has been minimized.

Copy link

commented on src/libcollections/string.rs in 79b4ce0 Sep 22, 2014

"call .as_mut_vec().remove(0)"

@aturon

This comment has been minimized.

Copy link
Member

commented Sep 22, 2014

r=me, modulo a couple of minor nits on the diff.

Also: thanks for the extensive writeup of the rationale, etc! This is a good standard to set for these PRs going forward. I do wonder if the text about the overall design rationale for String should be kept somewhere more permanent -- perhaps in the guidelines? Let me know if you have any thoughts.

@alexcrichton

This comment has been minimized.

Copy link
Member Author

commented Sep 23, 2014

We may not necessarily need to lay out the guidelines for string-based api design in the documentation of String itself, but I would definitely think that it would belong in the guidelines. That said, we should certainly explain very clearly that a string is a utf-8 encoded sequence of bytes, no matter what. The current documentation is a little... sparse. (cc @steveklabnik, maybe a good module/struct to beef up the doc string for?)

@alexcrichton alexcrichton force-pushed the alexcrichton:string-stable branch from f1f302d to 56ea769 Sep 23, 2014
@alexcrichton

This comment has been minimized.

Copy link
Member Author

commented Sep 23, 2014

For now I'm going to hold off on documenting the String struct itself as that may warrant its own PR itself, and I'm going to try to land this in the meantime.

bors added a commit that referenced this pull request Sep 23, 2014
# Rationale

When dealing with strings, many functions deal with either a `char` (unicode
codepoint) or a byte (utf-8 encoding related). There is often an inconsistent
way in which methods are referred to as to whether they contain "byte", "char",
or nothing in their name.  There are also issues open to rename *all* methods to
reflect that they operate on utf8 encodings or bytes (e.g. utf8_len() or
byte_len()).

The current state of String seems to largely be what is desired, so this PR
proposes the following rationale for methods dealing with bytes or characters:

> When constructing a string, the input encoding *must* be mentioned (e.g.
> from_utf8). This makes it clear what exactly the input type is expected to be
> in terms of encoding.
>
> When a method operates on anything related to an *index* within the string
> such as length, capacity, position, etc, the method *implicitly* operates on
> bytes. It is an understood fact that String is a utf-8 encoded string, and
> burdening all methods with "bytes" would be redundant.
>
> When a method operates on the *contents* of a string, such as push() or pop(),
> then "char" is the default type. A String can loosely be thought of as being a
> collection of unicode codepoints, but not all collection-related operations
> make sense because some can be woefully inefficient.

# Method stabilization

The following methods have been marked #[stable]

* The String type itself
* String::new
* String::with_capacity
* String::from_utf16_lossy
* String::into_bytes
* String::as_bytes
* String::len
* String::clear
* String::as_slice

The following methods have been marked #[unstable]

* String::from_utf8 - The error type in the returned `Result` may change to
                      provide a nicer message when it's `unwrap()`'d
* String::from_utf8_lossy - The returned `MaybeOwned` type still needs
                            stabilization
* String::from_utf16 - The return type may change to become a `Result` which
                       includes more contextual information like where the error
                       occurred.
* String::from_chars - This is equivalent to iter().collect(), but currently not
                       as ergonomic.
* String::from_char - This method is the equivalent of Vec::from_elem, and has
                      been marked #[unstable] becuase it can be seen as a
                      duplicate of iterator-based functionality as well as
                      possibly being renamed.
* String::push_str - This *can* be emulated with .extend(foo.chars()), but is
                     less efficient because of decoding/encoding. Due to the
                     desire to minimize API surface this may be able to be
                     removed in the future for something possibly generic with
                     no loss in performance.
* String::grow - This is a duplicate of iterator-based functionality, which may
                 become more ergonomic in the future.
* String::capacity - This function was just added.
* String::push - This function was just added.
* String::pop - This function was just added.
* String::truncate - The failure conventions around String methods and byte
                     indices isn't totally clear at this time, so the failure
                     semantics and return value of this method are subject to
                     change.
* String::as_mut_vec - the naming of this method may change.
* string::raw::* - these functions are all waiting on [an RFC][2]

[2]: rust-lang/rfcs#240

The following method have been marked #[experimental]

* String::from_str - This function only exists as it's more efficient than
                     to_string(), but having a less ergonomic function for
                     performance reasons isn't the greatest reason to keep it
                     around. Like Vec::push_all, this has been marked
                     experimental for now.

The following methods have been #[deprecated]

* String::append - This method has been deprecated to remain consistent with the
                   deprecation of Vec::append. While convenient, it is one of
                   the only functional-style apis on String, and requires more
                   though as to whether it belongs as a first-class method or
                   now (and how it relates to other collections).
* String::from_byte - This is fairly rare functionality and can be emulated with
                      str::from_utf8 plus an assert plus a call to to_string().
                      Additionally, String::from_char could possibly be used.
* String::byte_capacity - Renamed to String::capacity due to the rationale
                          above.
* String::push_char - Renamed to String::push due to the rationale above.
* String::pop_char - Renamed to String::pop due to the rationale above.
* String::push_bytes - There are a number of `unsafe` functions on the `String`
                       type which allow bypassing utf-8 checks. These have all
                       been deprecated in favor of calling `.as_mut_vec()` and
                       then operating directly on the vector returned. These
                       methods were deprecated because naming them with relation
                       to other methods was difficult to rationalize and it's
                       arguably more composable to call .as_mut_vec().
* String::as_mut_bytes - See push_bytes
* String::push_byte - See push_bytes
* String::pop_byte - See push_bytes
* String::shift_byte - See push_bytes

# Reservation methods

This commit does not yet touch the methods for reserving bytes. The methods on
Vec have also not yet been modified. These methods are discussed in the upcoming
[Collections reform RFC][1]

[1]: https://github.com/aturon/rfcs/blob/collections-conventions/active/0000-collections-conventions.md#implicit-growth
@alexcrichton alexcrichton force-pushed the alexcrichton:string-stable branch from 56ea769 to d9b0bc0 Sep 23, 2014
bors added a commit that referenced this pull request Sep 23, 2014
# Rationale

When dealing with strings, many functions deal with either a `char` (unicode
codepoint) or a byte (utf-8 encoding related). There is often an inconsistent
way in which methods are referred to as to whether they contain "byte", "char",
or nothing in their name.  There are also issues open to rename *all* methods to
reflect that they operate on utf8 encodings or bytes (e.g. utf8_len() or
byte_len()).

The current state of String seems to largely be what is desired, so this PR
proposes the following rationale for methods dealing with bytes or characters:

> When constructing a string, the input encoding *must* be mentioned (e.g.
> from_utf8). This makes it clear what exactly the input type is expected to be
> in terms of encoding.
>
> When a method operates on anything related to an *index* within the string
> such as length, capacity, position, etc, the method *implicitly* operates on
> bytes. It is an understood fact that String is a utf-8 encoded string, and
> burdening all methods with "bytes" would be redundant.
>
> When a method operates on the *contents* of a string, such as push() or pop(),
> then "char" is the default type. A String can loosely be thought of as being a
> collection of unicode codepoints, but not all collection-related operations
> make sense because some can be woefully inefficient.

# Method stabilization

The following methods have been marked #[stable]

* The String type itself
* String::new
* String::with_capacity
* String::from_utf16_lossy
* String::into_bytes
* String::as_bytes
* String::len
* String::clear
* String::as_slice

The following methods have been marked #[unstable]

* String::from_utf8 - The error type in the returned `Result` may change to
                      provide a nicer message when it's `unwrap()`'d
* String::from_utf8_lossy - The returned `MaybeOwned` type still needs
                            stabilization
* String::from_utf16 - The return type may change to become a `Result` which
                       includes more contextual information like where the error
                       occurred.
* String::from_chars - This is equivalent to iter().collect(), but currently not
                       as ergonomic.
* String::from_char - This method is the equivalent of Vec::from_elem, and has
                      been marked #[unstable] becuase it can be seen as a
                      duplicate of iterator-based functionality as well as
                      possibly being renamed.
* String::push_str - This *can* be emulated with .extend(foo.chars()), but is
                     less efficient because of decoding/encoding. Due to the
                     desire to minimize API surface this may be able to be
                     removed in the future for something possibly generic with
                     no loss in performance.
* String::grow - This is a duplicate of iterator-based functionality, which may
                 become more ergonomic in the future.
* String::capacity - This function was just added.
* String::push - This function was just added.
* String::pop - This function was just added.
* String::truncate - The failure conventions around String methods and byte
                     indices isn't totally clear at this time, so the failure
                     semantics and return value of this method are subject to
                     change.
* String::as_mut_vec - the naming of this method may change.
* string::raw::* - these functions are all waiting on [an RFC][2]

[2]: rust-lang/rfcs#240

The following method have been marked #[experimental]

* String::from_str - This function only exists as it's more efficient than
                     to_string(), but having a less ergonomic function for
                     performance reasons isn't the greatest reason to keep it
                     around. Like Vec::push_all, this has been marked
                     experimental for now.

The following methods have been #[deprecated]

* String::append - This method has been deprecated to remain consistent with the
                   deprecation of Vec::append. While convenient, it is one of
                   the only functional-style apis on String, and requires more
                   though as to whether it belongs as a first-class method or
                   now (and how it relates to other collections).
* String::from_byte - This is fairly rare functionality and can be emulated with
                      str::from_utf8 plus an assert plus a call to to_string().
                      Additionally, String::from_char could possibly be used.
* String::byte_capacity - Renamed to String::capacity due to the rationale
                          above.
* String::push_char - Renamed to String::push due to the rationale above.
* String::pop_char - Renamed to String::pop due to the rationale above.
* String::push_bytes - There are a number of `unsafe` functions on the `String`
                       type which allow bypassing utf-8 checks. These have all
                       been deprecated in favor of calling `.as_mut_vec()` and
                       then operating directly on the vector returned. These
                       methods were deprecated because naming them with relation
                       to other methods was difficult to rationalize and it's
                       arguably more composable to call .as_mut_vec().
* String::as_mut_bytes - See push_bytes
* String::push_byte - See push_bytes
* String::pop_byte - See push_bytes
* String::shift_byte - See push_bytes

# Reservation methods

This commit does not yet touch the methods for reserving bytes. The methods on
Vec have also not yet been modified. These methods are discussed in the upcoming
[Collections reform RFC][1]

[1]: https://github.com/aturon/rfcs/blob/collections-conventions/active/0000-collections-conventions.md#implicit-growth
@alexcrichton alexcrichton force-pushed the alexcrichton:string-stable branch from d9b0bc0 to 08fe149 Sep 23, 2014
bors added a commit that referenced this pull request Sep 24, 2014
# Rationale

When dealing with strings, many functions deal with either a `char` (unicode
codepoint) or a byte (utf-8 encoding related). There is often an inconsistent
way in which methods are referred to as to whether they contain "byte", "char",
or nothing in their name.  There are also issues open to rename *all* methods to
reflect that they operate on utf8 encodings or bytes (e.g. utf8_len() or
byte_len()).

The current state of String seems to largely be what is desired, so this PR
proposes the following rationale for methods dealing with bytes or characters:

> When constructing a string, the input encoding *must* be mentioned (e.g.
> from_utf8). This makes it clear what exactly the input type is expected to be
> in terms of encoding.
>
> When a method operates on anything related to an *index* within the string
> such as length, capacity, position, etc, the method *implicitly* operates on
> bytes. It is an understood fact that String is a utf-8 encoded string, and
> burdening all methods with "bytes" would be redundant.
>
> When a method operates on the *contents* of a string, such as push() or pop(),
> then "char" is the default type. A String can loosely be thought of as being a
> collection of unicode codepoints, but not all collection-related operations
> make sense because some can be woefully inefficient.

# Method stabilization

The following methods have been marked #[stable]

* The String type itself
* String::new
* String::with_capacity
* String::from_utf16_lossy
* String::into_bytes
* String::as_bytes
* String::len
* String::clear
* String::as_slice

The following methods have been marked #[unstable]

* String::from_utf8 - The error type in the returned `Result` may change to
                      provide a nicer message when it's `unwrap()`'d
* String::from_utf8_lossy - The returned `MaybeOwned` type still needs
                            stabilization
* String::from_utf16 - The return type may change to become a `Result` which
                       includes more contextual information like where the error
                       occurred.
* String::from_chars - This is equivalent to iter().collect(), but currently not
                       as ergonomic.
* String::from_char - This method is the equivalent of Vec::from_elem, and has
                      been marked #[unstable] becuase it can be seen as a
                      duplicate of iterator-based functionality as well as
                      possibly being renamed.
* String::push_str - This *can* be emulated with .extend(foo.chars()), but is
                     less efficient because of decoding/encoding. Due to the
                     desire to minimize API surface this may be able to be
                     removed in the future for something possibly generic with
                     no loss in performance.
* String::grow - This is a duplicate of iterator-based functionality, which may
                 become more ergonomic in the future.
* String::capacity - This function was just added.
* String::push - This function was just added.
* String::pop - This function was just added.
* String::truncate - The failure conventions around String methods and byte
                     indices isn't totally clear at this time, so the failure
                     semantics and return value of this method are subject to
                     change.
* String::as_mut_vec - the naming of this method may change.
* string::raw::* - these functions are all waiting on [an RFC][2]

[2]: rust-lang/rfcs#240

The following method have been marked #[experimental]

* String::from_str - This function only exists as it's more efficient than
                     to_string(), but having a less ergonomic function for
                     performance reasons isn't the greatest reason to keep it
                     around. Like Vec::push_all, this has been marked
                     experimental for now.

The following methods have been #[deprecated]

* String::append - This method has been deprecated to remain consistent with the
                   deprecation of Vec::append. While convenient, it is one of
                   the only functional-style apis on String, and requires more
                   though as to whether it belongs as a first-class method or
                   now (and how it relates to other collections).
* String::from_byte - This is fairly rare functionality and can be emulated with
                      str::from_utf8 plus an assert plus a call to to_string().
                      Additionally, String::from_char could possibly be used.
* String::byte_capacity - Renamed to String::capacity due to the rationale
                          above.
* String::push_char - Renamed to String::push due to the rationale above.
* String::pop_char - Renamed to String::pop due to the rationale above.
* String::push_bytes - There are a number of `unsafe` functions on the `String`
                       type which allow bypassing utf-8 checks. These have all
                       been deprecated in favor of calling `.as_mut_vec()` and
                       then operating directly on the vector returned. These
                       methods were deprecated because naming them with relation
                       to other methods was difficult to rationalize and it's
                       arguably more composable to call .as_mut_vec().
* String::as_mut_bytes - See push_bytes
* String::push_byte - See push_bytes
* String::pop_byte - See push_bytes
* String::shift_byte - See push_bytes

# Reservation methods

This commit does not yet touch the methods for reserving bytes. The methods on
Vec have also not yet been modified. These methods are discussed in the upcoming
[Collections reform RFC][1]

[1]: https://github.com/aturon/rfcs/blob/collections-conventions/active/0000-collections-conventions.md#implicit-growth
@alexcrichton alexcrichton force-pushed the alexcrichton:string-stable branch from 08fe149 to 5037513 Sep 24, 2014
@alexcrichton

This comment has been minimized.

Copy link
Owner Author

commented on 5037513 Sep 24, 2014

r=aturon

This comment has been minimized.

Copy link
Owner Author

replied Sep 24, 2014

@bors: retry

@bors

This comment has been minimized.

Copy link
Contributor

commented on 5037513 Sep 24, 2014

saw approval from aturon
at alexcrichton@5037513

This comment has been minimized.

Copy link
Contributor

replied Sep 24, 2014

merging alexcrichton/rust/string-stable = 5037513 into auto

This comment has been minimized.

Copy link
Contributor

replied Sep 24, 2014

alexcrichton/rust/string-stable = 5037513 merged ok, testing candidate = 3a21b31

This comment has been minimized.

Copy link
Contributor

replied Sep 24, 2014

saw approval from aturon
at alexcrichton@5037513

This comment has been minimized.

Copy link
Contributor

replied Sep 24, 2014

merging alexcrichton/rust/string-stable = 5037513 into auto

This comment has been minimized.

Copy link
Contributor

replied Sep 24, 2014

alexcrichton/rust/string-stable = 5037513 merged ok, testing candidate = 5366cfe

This comment has been minimized.

Copy link
Contributor

replied Sep 24, 2014

fast-forwarding master to auto = 5366cfe

bors added a commit that referenced this pull request Sep 24, 2014
# Rationale

When dealing with strings, many functions deal with either a `char` (unicode
codepoint) or a byte (utf-8 encoding related). There is often an inconsistent
way in which methods are referred to as to whether they contain "byte", "char",
or nothing in their name.  There are also issues open to rename *all* methods to
reflect that they operate on utf8 encodings or bytes (e.g. utf8_len() or
byte_len()).

The current state of String seems to largely be what is desired, so this PR
proposes the following rationale for methods dealing with bytes or characters:

> When constructing a string, the input encoding *must* be mentioned (e.g.
> from_utf8). This makes it clear what exactly the input type is expected to be
> in terms of encoding.
>
> When a method operates on anything related to an *index* within the string
> such as length, capacity, position, etc, the method *implicitly* operates on
> bytes. It is an understood fact that String is a utf-8 encoded string, and
> burdening all methods with "bytes" would be redundant.
>
> When a method operates on the *contents* of a string, such as push() or pop(),
> then "char" is the default type. A String can loosely be thought of as being a
> collection of unicode codepoints, but not all collection-related operations
> make sense because some can be woefully inefficient.

# Method stabilization

The following methods have been marked #[stable]

* The String type itself
* String::new
* String::with_capacity
* String::from_utf16_lossy
* String::into_bytes
* String::as_bytes
* String::len
* String::clear
* String::as_slice

The following methods have been marked #[unstable]

* String::from_utf8 - The error type in the returned `Result` may change to
                      provide a nicer message when it's `unwrap()`'d
* String::from_utf8_lossy - The returned `MaybeOwned` type still needs
                            stabilization
* String::from_utf16 - The return type may change to become a `Result` which
                       includes more contextual information like where the error
                       occurred.
* String::from_chars - This is equivalent to iter().collect(), but currently not
                       as ergonomic.
* String::from_char - This method is the equivalent of Vec::from_elem, and has
                      been marked #[unstable] becuase it can be seen as a
                      duplicate of iterator-based functionality as well as
                      possibly being renamed.
* String::push_str - This *can* be emulated with .extend(foo.chars()), but is
                     less efficient because of decoding/encoding. Due to the
                     desire to minimize API surface this may be able to be
                     removed in the future for something possibly generic with
                     no loss in performance.
* String::grow - This is a duplicate of iterator-based functionality, which may
                 become more ergonomic in the future.
* String::capacity - This function was just added.
* String::push - This function was just added.
* String::pop - This function was just added.
* String::truncate - The failure conventions around String methods and byte
                     indices isn't totally clear at this time, so the failure
                     semantics and return value of this method are subject to
                     change.
* String::as_mut_vec - the naming of this method may change.
* string::raw::* - these functions are all waiting on [an RFC][2]

[2]: rust-lang/rfcs#240

The following method have been marked #[experimental]

* String::from_str - This function only exists as it's more efficient than
                     to_string(), but having a less ergonomic function for
                     performance reasons isn't the greatest reason to keep it
                     around. Like Vec::push_all, this has been marked
                     experimental for now.

The following methods have been #[deprecated]

* String::append - This method has been deprecated to remain consistent with the
                   deprecation of Vec::append. While convenient, it is one of
                   the only functional-style apis on String, and requires more
                   though as to whether it belongs as a first-class method or
                   now (and how it relates to other collections).
* String::from_byte - This is fairly rare functionality and can be emulated with
                      str::from_utf8 plus an assert plus a call to to_string().
                      Additionally, String::from_char could possibly be used.
* String::byte_capacity - Renamed to String::capacity due to the rationale
                          above.
* String::push_char - Renamed to String::push due to the rationale above.
* String::pop_char - Renamed to String::pop due to the rationale above.
* String::push_bytes - There are a number of `unsafe` functions on the `String`
                       type which allow bypassing utf-8 checks. These have all
                       been deprecated in favor of calling `.as_mut_vec()` and
                       then operating directly on the vector returned. These
                       methods were deprecated because naming them with relation
                       to other methods was difficult to rationalize and it's
                       arguably more composable to call .as_mut_vec().
* String::as_mut_bytes - See push_bytes
* String::push_byte - See push_bytes
* String::pop_byte - See push_bytes
* String::shift_byte - See push_bytes

# Reservation methods

This commit does not yet touch the methods for reserving bytes. The methods on
Vec have also not yet been modified. These methods are discussed in the upcoming
[Collections reform RFC][1]

[1]: https://github.com/aturon/rfcs/blob/collections-conventions/active/0000-collections-conventions.md#implicit-growth
bors added a commit that referenced this pull request Sep 24, 2014
# Rationale

When dealing with strings, many functions deal with either a `char` (unicode
codepoint) or a byte (utf-8 encoding related). There is often an inconsistent
way in which methods are referred to as to whether they contain "byte", "char",
or nothing in their name.  There are also issues open to rename *all* methods to
reflect that they operate on utf8 encodings or bytes (e.g. utf8_len() or
byte_len()).

The current state of String seems to largely be what is desired, so this PR
proposes the following rationale for methods dealing with bytes or characters:

> When constructing a string, the input encoding *must* be mentioned (e.g.
> from_utf8). This makes it clear what exactly the input type is expected to be
> in terms of encoding.
>
> When a method operates on anything related to an *index* within the string
> such as length, capacity, position, etc, the method *implicitly* operates on
> bytes. It is an understood fact that String is a utf-8 encoded string, and
> burdening all methods with "bytes" would be redundant.
>
> When a method operates on the *contents* of a string, such as push() or pop(),
> then "char" is the default type. A String can loosely be thought of as being a
> collection of unicode codepoints, but not all collection-related operations
> make sense because some can be woefully inefficient.

# Method stabilization

The following methods have been marked #[stable]

* The String type itself
* String::new
* String::with_capacity
* String::from_utf16_lossy
* String::into_bytes
* String::as_bytes
* String::len
* String::clear
* String::as_slice

The following methods have been marked #[unstable]

* String::from_utf8 - The error type in the returned `Result` may change to
                      provide a nicer message when it's `unwrap()`'d
* String::from_utf8_lossy - The returned `MaybeOwned` type still needs
                            stabilization
* String::from_utf16 - The return type may change to become a `Result` which
                       includes more contextual information like where the error
                       occurred.
* String::from_chars - This is equivalent to iter().collect(), but currently not
                       as ergonomic.
* String::from_char - This method is the equivalent of Vec::from_elem, and has
                      been marked #[unstable] becuase it can be seen as a
                      duplicate of iterator-based functionality as well as
                      possibly being renamed.
* String::push_str - This *can* be emulated with .extend(foo.chars()), but is
                     less efficient because of decoding/encoding. Due to the
                     desire to minimize API surface this may be able to be
                     removed in the future for something possibly generic with
                     no loss in performance.
* String::grow - This is a duplicate of iterator-based functionality, which may
                 become more ergonomic in the future.
* String::capacity - This function was just added.
* String::push - This function was just added.
* String::pop - This function was just added.
* String::truncate - The failure conventions around String methods and byte
                     indices isn't totally clear at this time, so the failure
                     semantics and return value of this method are subject to
                     change.
* String::as_mut_vec - the naming of this method may change.
* string::raw::* - these functions are all waiting on [an RFC][2]

[2]: rust-lang/rfcs#240

The following method have been marked #[experimental]

* String::from_str - This function only exists as it's more efficient than
                     to_string(), but having a less ergonomic function for
                     performance reasons isn't the greatest reason to keep it
                     around. Like Vec::push_all, this has been marked
                     experimental for now.

The following methods have been #[deprecated]

* String::append - This method has been deprecated to remain consistent with the
                   deprecation of Vec::append. While convenient, it is one of
                   the only functional-style apis on String, and requires more
                   though as to whether it belongs as a first-class method or
                   now (and how it relates to other collections).
* String::from_byte - This is fairly rare functionality and can be emulated with
                      str::from_utf8 plus an assert plus a call to to_string().
                      Additionally, String::from_char could possibly be used.
* String::byte_capacity - Renamed to String::capacity due to the rationale
                          above.
* String::push_char - Renamed to String::push due to the rationale above.
* String::pop_char - Renamed to String::pop due to the rationale above.
* String::push_bytes - There are a number of `unsafe` functions on the `String`
                       type which allow bypassing utf-8 checks. These have all
                       been deprecated in favor of calling `.as_mut_vec()` and
                       then operating directly on the vector returned. These
                       methods were deprecated because naming them with relation
                       to other methods was difficult to rationalize and it's
                       arguably more composable to call .as_mut_vec().
* String::as_mut_bytes - See push_bytes
* String::push_byte - See push_bytes
* String::pop_byte - See push_bytes
* String::shift_byte - See push_bytes

# Reservation methods

This commit does not yet touch the methods for reserving bytes. The methods on
Vec have also not yet been modified. These methods are discussed in the upcoming
[Collections reform RFC][1]

[1]: https://github.com/aturon/rfcs/blob/collections-conventions/active/0000-collections-conventions.md#implicit-growth
@bors bors closed this Sep 24, 2014
@bors bors merged commit 5037513 into rust-lang:master Sep 24, 2014
1 of 2 checks passed
1 of 2 checks passed
continuous-integration/travis-ci The Travis CI build failed
Details
default all tests passed
@alexcrichton alexcrichton deleted the alexcrichton:string-stable branch Oct 11, 2014
@aturon aturon referenced this pull request Dec 16, 2014
140 of 142 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.