Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: CStr, the dereferenced complement to CString #592

Merged
merged 28 commits into from
Feb 18, 2015

Conversation

mzabaluev
Copy link
Contributor

Change the Deref target of CString to a token type CStr.
Add helper functions and macros to produce CStr references out of static string data.

Rendered

@mzabaluev
Copy link
Contributor Author

Cc @alexcrichton

@dgrunwald
Copy link
Contributor

👍 to the general idea: some kind of CStr is necessary to allow writing efficient safe bindings around C libraries.

I think this could be added as a motivating usecase:

// C library:
const char* f1(); // return value has static lifetime
void f2(const char*);
/// Safe rust bindings:
fn f1() -> &'static CStr { ... }
fn f2(s: &CStr) { ... }

// Caller using only safe code:
f2(f1())

If CString was used instead of CStr, f1 would have to copy the string into the rust heap.
If [u8] was used instead of CStr, f2 would have to unnecessarily validate that the string is null terminated.

I also like the 'irrelevantly-sized type': my approach using a DST CStr would involve an unnecessary strlen call in the binding for f1.


However, this proposal seems to be missing some details around the type:

  • static_c_str_from_bytes and static_c_str_from_str: I don't think there's any reason these need to be restricted to 'static lifetime, is there?
    You should also mention that these functions would check the input slice and panic if it's not 0-terminated or contains non-terminating zeroes (there should probably be an unsafe unchecked version of these functions, too).
  • There should be a way to convert from *const libc::c_char to &CStr without going through &[u8].
  • Borrow/ToOwned implementations should be provided to connect CStr with CString.
  • Should there be functions like as_bytes, as_bytes_with_nul as on CString? (these would have to call strlen)
  • In general, try to support all operations that are valid on CString and also make sense for non-owned strings.

@mzabaluev
Copy link
Contributor Author

@dgrunwald, thanks for your comments.

static_c_str_from_bytes and static_c_str_from_str: I don't think there's any reason these need to be restricted to 'static lifetime, is there?

These address a special case: they are non-copying ways to get &CStr out of static values, which have to be null-terminated (unlike dynamic slices, which must not have NUL anywhere including the tail position). These functions can skip the O(N) check for interior NULs, because in most cases their parameter will be a literal where any interior '\0' characters/bytes would be evident. I will add this explanation to the RFC.

There should be a way to convert from *const libc::c_char to &CStr without going through &[u8].

In this proposal, I view CStr only as a way to pass data from Rust to C. For Rust code, the only useful thing to do with a returned C string pointer is to parse it into a slice, which can already be done with c_str_to_bytes/c_str_to_bytes_with_nul (Besides, these functions need some
lifetime love). I do share your concern about being able to feed a non-owned C string obtained from FFI to another FFI-wrapping interface without unnecessary copying, but there is uncertainty as to whether this will be needed often.
Perhaps there should be a raw builder for &CStr, like in my crate:

unsafe fn c_str_from_raw_ptr<'a, T: ?Sized>(ptr: *const libc::c_char,
                                            life_anchor: &'a T)
                                            -> &'a CStr

BorrowFrom/ToOwned implementations should be provided to connect CStr with CString.

Yes, and trivially. This, however, is not essential to the RFC and can be added later.

Should there be functions like as_bytes, as_bytes_with_nul as on CString? (these would have to call strlen)

I'm hesitant about these, and I don't quite see the purpose of as_bytes on CString to begin with. My expectation is that CString is constructed at the use site rather than being passed around in interfaces, so its user should have access to the original slice. In any case, as_bytes is not a good method name for what it takes to get a slice from a C string pointer, because methods with this name have negligible cost elsewhere in the core libraries. In my crate, similar functions are named parse_as_bytes to remind about the scanning cost. These methods will be also invokable on CString via auto-deref, so their peculiarity should be apparent.

An alternative is to make CStr a DST newtype of [libc::c_char]. This will preclude any thin wrappers that do not call strlen up front, like the RAII helper OwnedCString in my crate.

@Diggsey
Copy link
Contributor

Diggsey commented Jan 20, 2015

How do you prevent instances of CStr being created on the stack without making it unsized?

@mzabaluev
Copy link
Contributor Author

@Diggsey, you can't safely move a borrowed non-Copy value, and CStr is only ever obtainable borrowed.

@Diggsey
Copy link
Contributor

Diggsey commented Jan 20, 2015

It still seems like there's potential for problems: for example, if DerefMut is ever implemented for CString (not unlikely, as it doesn't enforce UTF8 afaict) then you'll have to special-case "swap" to prevent it being used with CStr. You're allowing the programmer to bypass any "Sized" constraints on type parameters: for example, "sizeof" would give incorrect results when applied to CStr.

What's really needed is an unsized type without fat references: maybe there should be an "Unsized" marker class, or a way to negatively implement the Sized trait for a type, which doesn't affect its in-memory representation.

@mzabaluev
Copy link
Contributor Author

@Diggsey:

if DerefMut is ever implemented for CString (not unlikely, as it doesn't enforce UTF8 afaict)

FWIW CString enforces the no-NULs-except-last invariant. To maintain it with mutability, CString and CStr would have to grow some interesting infrastructure. I don't think, however, that facilitating mutation of C strings in Rust will be worthwhile. You'd likely need size changes, and for that going through Vec or String is a better option, I think. There is actually a case for into_vec method on CString, which I have on its workalike CStrBuf in my c_string.

You're allowing the programmer to bypass any "Sized" constraints on type parameters: for example, "sizeof" would give incorrect results when applied to CStr.

Are there any examples of safe code demonstrating how this could be unintentionally misused? My understanding is that, as long as you can't safely initialize, mutate, copy, or move a value, its size should not be important.

I also use this technique when given a public view into an object that has a private part allocated alongside:

#[repr(C)]
pub struct Object {
    raw: ffi::RawObject,  // we know it's not all there is
    no_copy: std::marker::NoCopy
}

impl Object {
    // an example with a static ref, just for the sake of brevity
    pub fn get_static() -> &'static Object {
        unsafe {
            let ptr = ffi::raw_object_get_static();
            std::mem::copy_lifetime("", &*(ptr as *const Object))
        }
    }
}

Being able to opt out of Sized without implying DST would be nice, indeed. @nikomatsakis ?

@Kimundi
Copy link
Member

Kimundi commented Jan 20, 2015

Example with safe code:

let a: &'a mut CStr = foo.as_c_str();
let b: &'b mut CStr = bar.as_c_str();

mem::swap(a, b);

This swaps the actual CStr values, which will incidentally result in swapping of the first bytes of each C string, which will lead to memory unsafety if one C string was empty and had its terminating null byte at first position.

@quantheory
Copy link
Contributor

If we don't allow any method to create a mutable CStr (which I think is actually for the best), the mem::swap issue goes away. Additionally, I think some of the unsized weirdness could be solved by instead having a representation like:

#[repr(C)]
pub struct CStringRef<'a> {
    raw: core::nonzero::NonZero<*const libc::c_char>,
    marker: std::marker::ContravariantLifetime<'a>,
}

That is, instead of trying to use an &'a CStr that's a normal reference, the CStringRef<'a> itself becomes a sized "reference-like" type into un-owned immutable data. The only issue is that CStringRef looks different from the various slice types, e.g. &str, &[u8], and &Path. Personally, I think that this lack of syntactic sugar is OK, but otherwise it does seem like CStr would have to be made unsized.

If CStr was also primitive, the above CStringRef could be stripped of its lifetime parameter and used in the Repr impl for CStr, which would be a close analogy to how slices and str work, but making a new primitive type is probably going too far.

@Diggsey
Copy link
Contributor

Diggsey commented Jan 21, 2015

CStr is not the only reason to want unsized types with thin references: the PIMPL idiom in C++ requires a similar ability. You can get away with just not having a dereferenced type (like quantheory described), but it makes for some weird semantics: CStr and CString are both reference types, and yet CString dereferences to a CStr? Either that or you have to duplicate CStr functionality on a CString.

@mzabaluev
Copy link
Contributor Author

instead of trying to use an &'a CStr that's a normal reference, the CStringRef<'a> itself becomes a sized "reference-like" type into un-owned immutable data. The only issue is that CStringRef looks different from the various slice types, e.g. &str, &[u8], and &Path.

That's not great for Deref, where the dereferenced value has to be returned by reference.

otherwise it does seem like CStr would have to be made unsized.

Aside from the size_of issue which by itself is not very important, can any problems occur due to CStr having a bogus size? The same would apply for public views into larger private objects, as described above. When the ABI stabilizes and dynamic crates become the norm, I expect such implementation-hiding techniques to become more common between Rust crates as well, not only in interfacing with C.

@quantheory
Copy link
Contributor

CStr and CString are both reference types, and yet CString dereferences to a CStr?

I don't think it's that strange, any more than CString deref-ing to &CStr.

That will also work poorly with Deref, where the dereferenced value has to be returned by reference.

However, I think that this pretty much kills the idea. I don't know why I didn't think of that problem, since I've had the exact same issue with the Index trait. Not only can Deref not return a type with the right lifetime, it can't even be redefined to do so without higher-kinded types.

@mzabaluev
Copy link
Contributor Author

@Ericson2314
Copy link
Contributor

Unsized types definitely the way to go for this. Though I think dynamic sized [lib::char] wrapper enforcing C invariant could be useful too.

Your deallocation ideas should be subsumed by an allocator API, whenever we get it.

This would need the allocator API to get resolved, as
pointed out by John Ericson:
rust-lang#592 (comment)
@aturon
Copy link
Member

aturon commented Feb 11, 2015

@mzabaluev

I'm late to this party, but this is a great RFC! I love the idea of introducing a slice type for null-terminated data. I think the various motivations are well-covered in the RFC and this thread, and I buy them. This will really tighten up the CString API.

That said, I agree with the original stance that @mzabaluev took that CStr really "should" be unsized, and at some point I'd like to pursue the "truly unsized types" RFC or something like it to get there. There are a number of places in the libraries and language where unsized types get somewhat special treatment, and it'd be good to be able to plug into that. (I can elaborate if people are curious what I mean.)

So that said, I'm interested in some means of exposing this type as unsized from the get-go. (Whereas the main proposal here exposes that CStr is sized.) Personally, I think making this a newtype wrapper for [libc::c_char] is the way to go: it gives us the correct semantics at a temporary runtime cost. Eventually, if/when we add "truly" unsized types or some other support, we can transition the representation from fat to thin pointer without breaking code. Alternatively, if we decide not to land such support, we should be able to transition to the main proposal here, again without breakage.

@mzabaluev
Copy link
Contributor Author

@aturon, if CStr is temporarily made a DST, could there be any usages of &CStr that will break when that becomes a thin reference?
Should CStr be left unstable until the question of unsized types gets resolved?

@aturon
Copy link
Member

aturon commented Feb 11, 2015

@mzabaluev

if CStr is temporarily made a DST, could there be any usages of &CStr that will break when that becomes a thin reference?

I think only for unsafe code that depended on the exact representation. I will try to get some second opinions on this though, it's an important question.

Should CStr be left unstable until the question of unsized types gets resolved?

I don't think so -- we want to stabilize CString soon, and I would really like to use the overall approach you're suggesting.

@alexcrichton
Copy link
Member

I think in general using Rust types in FFI is pretty hazardous and I don't think either way that we should start defining FFI functions like:

extern {
    fn foo() -> &'static CStr;
}

I would be ok changing the runtime representation of CStr at a future date to become a thin pointer instead of a fat pointer and mandating that the only stable method to get the raw representation is .as_ptr()

@mzabaluev
Copy link
Contributor Author

I think only for unsafe code that depended on the exact representation.

That's not an issue for me; I don't think any code that would e.g. transmute the reference into a slice could expect any guarantees on backwards compatibility.

What's the general stance on binary compatibility in Rust 1.0 timeframe? Does anyone care?

@aturon
Copy link
Member

aturon commented Feb 11, 2015

@mzabaluev

What's the general stance on binary compatibility in Rust 1.0 timeframe? Does anyone care?

Completely off the table. ABI stability is going to be a very long road.

@mzabaluev
Copy link
Contributor Author

@alexcrichton I don't suggest &CStr to be used in the FFI signatures themselves. It's only meant for safe wrapper functions.

.as_ptr() could be provided by trait AsRaw from the unsized types RFC (#813), if that part of the RFC would land in Rust 1.0 in anticipation of the eventual support for !Sized.

@alexcrichton
Copy link
Member

@mzabaluev would you be ok with the [libc::c_char] internal representation that @aturon mentioned? We'd keep the same API that's already described as it would allow us to move to a thin pointer in the future.

@mzabaluev
Copy link
Contributor Author

@alexcrichton Yes, I'm convinced. I will update the RFC soon.

@mzabaluev
Copy link
Contributor Author

I've got another opportunistic change to suggest. Having the conversion functions panic on inner NULs is a little... severe. How about CString::from_slice return Result<CString, NulError>, and CString::from_vec return a Result with a move-friendly error around NulError, similar to how the conversions from UTF-8 fail with Utf8Error and IntoUtf8Error? I have prototyped this in cargo crate c_string.

@aturon
Copy link
Member

aturon commented Feb 11, 2015

@mzabaluev I would be in favor of such a change. We'll probably still panic in common cases when using CString within std, but this way other users of CString can make their own choice.

@alexcrichton
Copy link
Member

I personally see the panic on inner nul as a contract of the CString API. I vaguely remember in the past that I tried to move to a Result but it just got so painful so quickly that I abandoned the efforts. I would vote for continuing to panic instead of returning a Result, but it's also not the end of the world because I can always just write my own function :)

@mzabaluev
Copy link
Contributor Author

@alexcrichton A potential problem is, it is very rare that a real-life string has NUL characters, and at the same time it is perfectly valid in the Rust world and in Unicode text in general. So developers may be blissfully forgetful of the panic until it hits their software in the field. DoS possibilities galore.

@alexcrichton
Copy link
Member

I personally see the fact that it's so rare as a justification for panicking as it's otherwise that much more difficult to deal with for everyone using CString. I do agree though that almost no one will consider the fact that &str can have interior nul bytes (I certainly don't consciously think about it).

@aturon
Copy link
Member

aturon commented Feb 12, 2015

I posted a discuss thread on the topic of panicking here.

@mzabaluev
Copy link
Contributor Author

I have updated the proposal to @aturon's suggestions, adding a disclaimer that the inner slice is not guaranteed to have any particular size. This way we can still avoid the strlen() on construction and discourage transmute-happy code.

mzabaluev added a commit to mzabaluev/rust-c-str that referenced this pull request Feb 13, 2015
As decided in process of developing the CStr RFC [1], this is the way to go
with CStr as an unsized type.

The length of the internal slice is bogus; it only provides safe access to
the first byte for the implementation of .is_empty().

[1] rust-lang/rfcs#592
@aturon
Copy link
Member

aturon commented Feb 18, 2015

@mzabaluev Thanks again for this RFC! After discussion on this thread, on the discuss forum, and amongst the team, it seems that almost everyone is in favor of this change. We have decided to move forward, and try to land these revisions ahead of alpha2 if possible.

Tracking issue

mzabaluev added a commit to mzabaluev/rust-c-str that referenced this pull request Feb 21, 2015
Harmonization with the changes in std::ffi landed as the result of
rust-lang/rfcs#592
@Centril Centril added A-ffi FFI related proposals. A-dst Proposals re. DSTs labels Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-dst Proposals re. DSTs A-ffi FFI related proposals.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants