Improved docs for CStr, CString, OsStr, OsString #44855
federicomenaquintero added some commits Sep 22, 2017
rust-highfive assigned aturon Sep 26, 2017
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @aturon (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information.
Oops, long lines... will fix. I'll also clarify that CStr/CString are bags of zero-terminated bytes, and UTF-8 only happens when making a string out of them.
arielb1 added the S-waiting-on-author label Sep 26, 2017
clarfon reviewed Sep 27, 2017
- /// This type serves the primary purpose of being able to safely generate a
- /// C-compatible string from a Rust byte slice or vector. An instance of this
+ /// This type serves the purpose of being able to safely generate a
+ /// C-compatible UTF-8 string from a Rust byte slice or vector. An instance of this
clarfon reviewed Sep 27, 2017
@@ -8,7 +8,145 @@
  // option. This file may not be copied, modified, or distributed
  // except according to those terms.

- //! Utilities related to FFI bindings.
+ //! This module provides utilities to handle C-like strings. It is
clarfon (Contributor) Sep 27, 2017
I think that this is a bit misleading because OsString isn't a C string on Windows.
clarfon (Contributor) Sep 27, 2017
A better way to describe it might be "to handle data across non-Rust interfaces, like other programming languages and the underlying operating system"
clarfon reviewed Sep 27, 2017
//! borrowed slices of strings with the [`str`] primitive. Both are
//! always in UTF-8 encoding, and may contain nul bytes in the middle,
//! i.e. if you look at the bytes that make up the string, there may
//! be a `0` among them. Both `String` and `str` know their length;
clarfon (Contributor) Sep 27, 2017
nit: the '0' here makes it look like you're referring to a zero digit, not a literal zero. Perhaps use '\0' instead?
clarfon (Contributor) Sep 27, 2017
another nit: I'd word "know their length" as "store their length explicitly" because technically we "know" the length of a C-string, but it's not computed in O(1) time.
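A minimal sketch of the two nits above, assuming only the standard library; the helper name `inspect` is hypothetical and not part of the PR. It shows that a Rust string can hold a `\0` byte in the middle and that its length is stored rather than found by scanning.

```rust
/// Hypothetical helper: returns the stored byte length of `s` and
/// whether any of its bytes is a nul (`\0`) byte.
fn inspect(s: &str) -> (usize, bool) {
    // `len()` reads the length kept alongside the pointer; no scan for a
    // terminator is involved.
    let len = s.len();
    // An interior `\0` is an ordinary byte (U+0000) in a Rust string.
    let has_nul = s.as_bytes().contains(&0);
    (len, has_nul)
}
```

Calling `inspect("ab\0cd")` counts the nul byte as part of the string's five bytes.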
clarfon reviewed Sep 27, 2017
//!
//! C strings are different from Rust strings:
//!
//! * **Encodings** - C strings may have different encodings. If
clarfon (Contributor) Sep 27, 2017
I think that "encoding" here is a bit inaccessible to people who are unfamiliar with how string encoding works. I'd say introduce it with "Rust strings are UTF-8, but C strings may use other encodings. If you're using a string from C, you may have to check its encoding explicitly, rather than just assuming that it's UTF-8 like you can in Rust."
clarfon reviewed Sep 27, 2017
//! you are bringing in strings from C APIs, you should check what
//! encoding you are getting. Rust strings are always UTF-8.
//!
//! * **Character width** - C strings may use "normal" or "wide"
clarfon (Contributor) Sep 27, 2017
"Width" here may be what C uses, but it's again misleading because Unicode has its own specific definition of width. I'd say "size" instead. Instead of using "normal" and "wide," I'd just say directly that C uses two types, char (clarifying that this is different from Rust's type) and wchar_t, which are different sizes.
clarfon (Contributor) Sep 27, 2017
You can also clarify that wchar_t is referred to by "wide character" but that this doesn't actually reflect the Unicode width, but the size of the character in bytes.
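To make the size difference concrete, here is a small sketch using `std::os::raw::c_char`; the function names are hypothetical. `wchar_t` is left out because the standard library does not expose it (the `libc` crate does), and its size varies by platform.

```rust
use std::mem::size_of;
use std::os::raw::c_char;

/// Size in bytes of C's `char` as seen from Rust (an alias for `i8` or `u8`).
fn c_char_size() -> usize {
    size_of::<c_char>()
}

/// Size in bytes of Rust's `char`, which holds a Unicode scalar value.
fn rust_char_size() -> usize {
    size_of::<char>()
}
```

C's `char` is one byte, while Rust's `char` is four, which is why the two should not be confused in the docs.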
clarfon reviewed Sep 27, 2017
//! '[Unicode code point]'.
//!
//! * **Nul terminators and implicit string lengths** - Often, C
//! strings are nul-terminated, i.e. they have a `0` character at the
clarfon reviewed Sep 27, 2017
//!
//! * **Nul terminators and implicit string lengths** - Often, C
//! strings are nul-terminated, i.e. they have a `0` character at the
//! end. The length of a string buffer is not known *a priori*;
clarfon (Contributor) Sep 27, 2017
No need to use Latin; just say that it isn't stored, but has to be calculated. IMHO we should keep language simple if possible to be more accessible to non-native speakers.
clarfon reviewed Sep 27, 2017
//! `wcslen()` for `wchar_t`-based ones. Those functions return the
//! number of characters in the string excluding the nul terminator,
//! so the buffer length is really `len+1` characters. Rust strings
//! don't have a nul terminator, and they always know their length.
clarfon (Contributor) Sep 27, 2017
I'd also note in here somewhere that Rust's way of doing it means that you can easily access a string's length, whereas there's an implicit cost to it in C. This also may carry over to CStr if its implementation changes.
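The `len+1` relationship from the quoted docs can be sketched with `CStr`; the helper name `c_lengths` is hypothetical. The sketch makes no claim about how cheaply `CStr` obtains its length, since, as noted above, that depends on its implementation.

```rust
use std::ffi::CStr;

/// Hypothetical helper: for a nul-terminated byte slice, returns
/// (length without the terminator, length of the whole buffer).
/// Returns `None` if the slice is not a valid C string.
fn c_lengths(bytes_with_nul: &[u8]) -> Option<(usize, usize)> {
    let c = CStr::from_bytes_with_nul(bytes_with_nul).ok()?;
    // `to_bytes` excludes the terminator, like `strlen()` in C;
    // `to_bytes_with_nul` includes it, so it is one byte longer.
    Some((c.to_bytes().len(), c.to_bytes_with_nul().len()))
}
```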
clarfon reviewed Sep 27, 2017
//! so the buffer length is really `len+1` characters. Rust strings
//! don't have a nul terminator, and they always know their length.
//!
//! * **No nul characters in the middle of the string** - When C
clarfon reviewed Sep 27, 2017
//! strings have a nul terminator character, this usually means that
//! they cannot have nul characters in the middle — a nul character
//! would essentially truncate the string. Rust strings *can* have
//! nul characters in the middle, since they don't use nul
clarfon (Contributor) Sep 27, 2017
Rather than "don't use nul terminators," it's clearer to say "because NUL doesn't have to mark the end of the string in Rust"
clarfon reviewed Sep 27, 2017
//! # Representations of non-Rust strings
//!
//! [`CString`] and [`CStr`] are useful when you need to transfer
//! UTF-8 strings to and from C, respectively:
clarfon (Contributor) Sep 27, 2017
I'd expand this to languages with a C ABI like Python, etc. People should know that a CStr might be necessary when interacting with other languages too.
clarfon reviewed Sep 27, 2017
//! UTF-8 strings to and from C, respectively:
//!
//! * **From Rust to C:** [`CString`] represents an owned, C-friendly
//! UTF-8 string: it is valid UTF-8, it is nul-terminated, and has no
clarfon reviewed Sep 27, 2017
//!
//! * **From C to Rust:** [`CStr`] represents a borrowed C string; it
//! is what you would use to wrap a raw `*const u8` that you got from
//! a C function. A `CStr` is just guaranteed to be a nul-terminated
clarfon reviewed Sep 27, 2017
//! * **From C to Rust:** [`CStr`] represents a borrowed C string; it
//! is what you would use to wrap a raw `*const u8` that you got from
//! a C function. A `CStr` is just guaranteed to be a nul-terminated
//! array of bytes; the UTF-8 validation step only happens when you
clarfon (Contributor) Sep 27, 2017
"the UTF-8 validation step" is only just mentioned here so I'd just make a separate sentence describing how that works instead, along the lines of "once you have a CStr, you can convert it to a Rust str if it's valid UTF-8, or lossily convert it by adding replacement characters"
clarfon reviewed Sep 27, 2017
//! request to convert it to a `&str`.
//!
//! [`OsString`] and [`OsStr`] are useful when you need to transfer
//! strings to and from operating system calls. If you need Rust
clarfon (Contributor) Sep 27, 2017
A lot of programmers may not know what system calls are; I'd probably word this as "the operating system itself."
It may also make sense to include examples where this happens, like in opening files and running external commands.
clarfon (Contributor) Sep 27, 2017
I feel like the "If you need Rust strings out of them [...]" section is kind of redundant and wordy. I'd probably just say that conversions between OsStr and str work very similarly to CStr and leave it at that.
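As an illustration of how similar the two sets of conversions are, the sketch below puts them side by side. The function names are hypothetical; the methods they call (`to_str` and `to_string_lossy` on both `CStr` and `OsStr`) are from the standard library.

```rust
use std::ffi::{CStr, OsStr};

/// Strict conversion: succeeds only if the C string's bytes are valid UTF-8.
fn c_to_str(bytes_with_nul: &[u8]) -> Option<&str> {
    CStr::from_bytes_with_nul(bytes_with_nul).ok()?.to_str().ok()
}

/// Lossy conversion: invalid sequences become U+FFFD replacement characters.
fn c_to_string_lossy(bytes_with_nul: &[u8]) -> Option<String> {
    let c = CStr::from_bytes_with_nul(bytes_with_nul).ok()?;
    Some(c.to_string_lossy().into_owned())
}

/// The same strict conversion on an `OsStr`, e.g. a file name.
fn os_to_str(name: &OsStr) -> Option<&str> {
    name.to_str()
}

/// The same lossy conversion on an `OsStr`.
fn os_to_string_lossy(name: &OsStr) -> String {
    name.to_string_lossy().into_owned()
}
```

In both cases the strict form reports failure instead of guessing, and the lossy form always produces a `String`.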
Great work! I've interacted a lot with
federicomenaquintero added some commits Oct 2, 2017
I've integrated the changes per your comments. How's it look now? :)
Looks good to me! Again, great work! :)
Thank you!
shepmaster added S-waiting-on-review and removed S-waiting-on-author labels Oct 6, 2017
Poke @aturon — this is now ready for your masterful reviewing skills!
Actually, @aturon wasn't available last week and is on PTO this week, so let's try....
rust-highfive assigned steveklabnik and unassigned aturon Oct 9, 2017
steveklabnik requested changes Oct 11, 2017
This is fantastic, thank you so much! I have a few little formatting nits, but after that, let's get this merged!
@@ -149,8 +209,13 @@ pub struct CStr {
  }

  /// An error returned from [`CString::new`] to indicate that a nul byte was found
- /// in the vector provided.
+ /// in the vector provided. While Rust strings may contain nul bytes in the middle,
+ /// C strings can't, as that byte would effectively truncate the string.
steveklabnik (Member) Oct 11, 2017
Could we change this up a bit? We try to have a summary sentence first, then the rest of it. This one has a long summary, and repeats itself since you added the information below. How about:
/// An error indicating that an interior nul byte was found.
///
/// While Rust strings may contain nul bytes in the middle, C strings can't, as that byte would effectively
/// truncate the string.
///
/// This `struct`....
with the correct wrapping, I just guessed here. What do you think?
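For context, a minimal sketch of when this error appears, using `CString::new` and `NulError::nul_position` from the standard library; the helper name `interior_nul` is hypothetical.

```rust
use std::ffi::CString;

/// Hypothetical helper: returns the byte position of the first interior
/// nul byte that makes `s` unusable as a C string, or `None` if it is fine.
fn interior_nul(s: &str) -> Option<usize> {
    match CString::new(s) {
        // No interior nul: `CString::new` accepted the data.
        Ok(_) => None,
        // A `NulError` records where the offending byte was found.
        Err(e) => Some(e.nul_position()),
    }
}
```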
/// that a nul byte was found too early in the slice provided, or one
/// wasn't found at all for the nul terminator. The slice used to
/// create a `CStr` must have one and only one nul byte at the end of
/// the slice.
steveklabnik (Member) Oct 11, 2017
Same thing here; don't repeat where it came from, make sure to have a short summary, some space, and then a longer description.
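The two failure cases described in the quoted doc comment can be sketched with `CStr::from_bytes_with_nul`; the helper name `is_valid_c_slice` is hypothetical.

```rust
use std::ffi::CStr;

/// Hypothetical helper: true if `bytes` has exactly one nul byte and it is
/// the last byte, which is what `CStr::from_bytes_with_nul` requires.
fn is_valid_c_slice(bytes: &[u8]) -> bool {
    CStr::from_bytes_with_nul(bytes).is_ok()
}
```

A slice with no terminator and a slice with a nul byte before the end are both rejected.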
/// UTF-8 error was encountered during the conversion. `CString` is
/// just a wrapper over a buffer of bytes with a nul terminator;
/// [`into_string`][`CString::into_string`] performs UTF-8 validation
/// and may return this error.
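A short sketch of the behaviour this doc comment describes, assuming only the standard library; the helper name `to_rust_string` is hypothetical. On failure it hands back the original bytes, recovered through `IntoStringError::into_cstring`.

```rust
use std::ffi::CString;

/// Hypothetical helper: builds a `CString` from `bytes`, then tries to turn
/// it into a Rust `String`. On any failure the original bytes are returned.
fn to_rust_string(bytes: Vec<u8>) -> Result<String, Vec<u8>> {
    // Fails with `NulError` if `bytes` contains a nul byte.
    let c = CString::new(bytes).map_err(|e| e.into_vec())?;
    // UTF-8 validation happens only here; the `CString` itself is just bytes.
    c.into_string().map_err(|e| e.into_cstring().into_bytes())
}
```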
/// underlying bytes to construct a new string, ensuring that
/// there is a trailing 0 byte. This trailing 0 byte will be
/// appended by this method; the provided data should *not*
/// contain any 0 bytes in it.
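The appended terminator is easy to observe through `CString::as_bytes_with_nul`. This is only a sketch; the helper name `with_terminator` is hypothetical.

```rust
use std::ffi::CString;

/// Hypothetical helper: returns the bytes of `s` as a C string would store
/// them, including the trailing 0 byte that `CString::new` appends.
/// Returns `None` if `s` already contains a 0 byte.
fn with_terminator(s: &str) -> Option<Vec<u8>> {
    let c = CString::new(s).ok()?;
    Some(c.as_bytes_with_nul().to_vec())
}
```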
@@ -8,7 +8,156 @@
  // option. This file may not be copied, modified, or distributed
  // except according to those terms.

- //! Utilities related to FFI bindings.
+ //! This module provides utilities to handle data across non-Rust
steveklabnik (Member) Oct 11, 2017
I'd keep this short summary, but follow it with a blank line so it stands alone as the summary. That is:
//! Utilities related to FFI bindings.
//!
//! This module provides utilities....
//! C strings are different from Rust strings:
//!
//! * **Encodings** - Rust strings are UTF-8, but C strings may use
//! other encodings. If you are using a string from C, you should
//! characters; please **note** that C's `char` is different from Rust's.
//! The C standard leaves the actual sizes of those types open to
//! interpretation, but defines different APIs for strings made up of
//! each character type. Rust strings are always UTF-8, so different
federicomenaquintero added some commits Oct 11, 2017
steveklabnik approved these changes Oct 12, 2017
Thanks! @bors: r+ rollup
federicomenaquintero commented Sep 26, 2017
This expands the documentation for those structs and their corresponding traits, per #29354.