Remove the trailing null from strings #7235

erickt · 2013-06-19T18:15:31Z

It is unsafe to convert every &[u8] into a &str, because &strs need to have a valid byte after the end of the characters. This means that this pattern:

let v = &['a' as u8, 'b' as u8, 'c' as u8];
let s = str::from_bytes_slice(v);
str::as_c_str(s, |ptr| call_c_function(ptr));

is unsafe because str::as_c_str dereferences the next byte past the end of the slice. This is safe when working with a ~str as a &str, because it has the trailing null. To do this safely, you need to copy the bytes:

let v = &['a' as u8, 'b' as u8, 'c' as u8];
let s = str::from_bytes(v);
str::as_c_str(s, |ptr| call_c_function(ptr));

While we could work around this by marking from_bytes_slice as unsafe and documenting this unsafe behavior, it would be simplest to just remove the trailing null from strings and allow the end user to add the trailing null if they want it. .as_c_str() can still take the fast path if the string is null terminated, or allocate a temporary string otherwise.

Some comments from the IRC discussion:

@graydon suggests renaming .as_c_str() to .as_null_terminated_c_str()

@cmr suggested .as_null_terminated_str()

@bstrie suggested .null_terminate()

@Kimundi mentioned a .as_c_str() that takes a &mut ~str, adds a NULL, calls the closure, then removes the NULL before exiting.

cc: #6869

The text was updated successfully, but these errors were encountered:

graydon · 2013-06-19T18:17:14Z

nominating for backwards compatibility

lilyball · 2013-06-19T18:18:20Z

CStrings, by definition, are null-terminated, so I don't see any reason to change the name away from .as_c_str().

emberian · 2013-06-19T18:19:45Z

@kballard because rust strings, as valid utf-8, can contain NUL. Thus, not all rust strings from as_c_str are the C string you would expect. Naming it away from as_c_str emphasizes this.

thestinger · 2013-06-19T18:20:09Z

I don't think we should have a shortcut for temporarily copying to a nul-terminated string at all. Anything using \0 as a string terminator doesn't support UTF-8, because they are valid characters in a Unicode string.

The most common usage of nul-termination is likely to be paths, but those can't be represented with our string type because they aren't always valid Unicode so it's not something the string type can deal with.

In special cases where you really do want to abandon all hope of correctness, you can just push \0 to the end yourself or make a copy and push it. This is potentially a security issue, because string comparisons or other operations stopping at the \0 will be incorrect. Is there actually a use case for handling only a subset of possible strings correctly?

lilyball · 2013-06-19T18:22:18Z

@cmr: Naming it something like .as_null_terminated_c_str() doesn't change the fact that it could contain NUL.

Kimundi · 2013-06-19T19:08:37Z

I wrote this under issue #6869, but seeing how there is more discussion here I'm moving it here:

Summary of the current situation

Every string allocation in rust is null terminated (~str, &'static str etc).
Just like a &[u8], a &str is a tuple (*u8, len) that points at the start of a memory range of len bytes.
However, while a &str points at len bytes, only len-1 bytes of that are considered part of the utf-8 data.
The last valid address a &str points at is the 'hidden' byte.
If you take a slice (start, end) of a string, you actually get a &[u8] slice of (start, end+1). This is safe because there is always a byte after the end for a string allocation.
For example, if you take the slice of a whole allocated string, the hidden byte points at the null terminator.
And if you take a subslice of a that string, the hidden byte points at whatever byte of the string allocation comes next after the end you specified.

All this is done just for one reason: So that as_c_str() can take a &str, safely look at its hidden byte, and only if it's not NULL make a copy of the string for increased safety (The c side will find a null terminator in valid memory). This design has a few issues though:

Converting between &str and &[u8] is a pain. While &str -> &[u8] is perfectly fine, A &[u8] of size N can only be turned into a &str of length N-1, making a lot of low level conversions a pita/impossible without some sort of copy.
The difference is hard to explain and library devs stumble over it all the time.
Because as_c_str() only does a copy 'sometimes', it's hard to gauge its performance.
Valid utf8 can contain interior nulls, so even with a guaranteed null terminator you might have to do additional transformations on your string to get the behavior you want of as_c_str()

Making `&str` the same as `&[u8]`

Because of this issues, &strs one-off-ness really introduces more problem than it solves, and should be changed. This means changing all functions in the library and compiler that handle a &str to not tread it as (*u8, len) with a len-1 long utf8 part and a hidden byte, but instead just as a len memory slice that points at valid utf8.

Changes to the C interface

The current as_c_str() should be replaced by a few specialized functions depending on your use case:

(&str).as_terminated_c_str() should assert that the last codepoint in it is '\x00', and only then provide a *u8 and len-1 pair to the string data. The docs also need to mention that the data might contain interior nulls, so the C side might see a shorter string than intended.
(&str).as_unterminated_c_str() should provide a *u8 and len pair to the string data. The docs need to be explicit that there is NO null termination guaranteed, but that the string might contain interior nulls, including at its end.
(&mut ~str).null_terminate() can be a convenience method to push a '\x00' at the end if it's not there already.
(&str).is_null_terminated() returns true if the last byte is null.
(&str).is_c_null_term_compatible() returns true if the last byte is null and all other aren't.
(&str).as_terminated_c_str_auto_copy() should take a &str, see if it ends with a '\x00', and if not make a copy of it, push a null, and then call as_terminated_c_str() on it (Like today's as_c_str()). For short copies it should include a smallvector buffer or similar anti-heap-allocation strategies.
(~str).into_terminated_c_str() -> ~str should take ownership of a ~str, push a null, call as_terminated_c_str on it, pop the null and return the ~str again, leveraging overallocation to prevent a copy.

thestinger · 2013-06-19T19:13:20Z

I really don't think it deserves even one method because it's completely incorrect to treat UTF-8 this way. We aren't using modified UTF-8, so we have to accept that it's not usable with C string APIs.

Can someone point out a situation or use case where this would actually be correct/useful?

Kimundi · 2013-06-19T19:18:32Z

@thestinger good point, maybe the whole api should be based around &[Ascii] or &[u8].
However, most of the issues I listed above still apply at the boundaries where you convert to/from this type.

lilyball · 2013-06-19T19:19:12Z

While UTF-8 allows NUL, most UTF-8 strings won't have interior NULs. And the fact of the matter is, it's awfully convenient to be able to construct a NUL-terminated C String when working with FFI functions. It would be nice not to have to copy the string to a ~[u8] every time you want to work with an FFI function.

For reference, in Cocoa, NSString is a fully unicode-aware string class, which provides a convenience method -UTF8String which returns a NUL-terminated char*, and there's been no issues with this.

lilyball · 2013-06-19T19:24:14Z

Perhaps we should just implement .as_c_str() on ~str alone, and require ~str to be NUL-terminated, but not provide this functionality at all on &str. Hopefully people will be using as_buff() anyway (which should be turned into a method).

thestinger · 2013-06-19T19:30:40Z

@kballard: so what's an example FFI API, where handling only a subset of strings correctly is still going to be okay? This kind of flaw can easily become a security vulnerability and I don't see why every string should be a byte longer just to make it easier to write broken code.

Encryption, compression, I/O and file paths are all byte based so the string representation having a \0 at the end isn't helpful/relevant because the bindings will use [u8].

Nothing stops you from pushing a \0 to the end of a ~str if you're determined to write code that only supports a subset of Unicode.

lilyball · 2013-06-19T19:41:09Z

My general feeling here is that, once you'e dealing with FFI, you're already in unsafe land, so instead of trying to be opinionated about what is safe and what is not, we should just provide the tools to do what's useful.

That said, the argument that making every string a byte longer when most of them are never going to touch FFI code (not that this is precisely what you said) is a good one. Although I wonder how much of an impact this will have in practice.

thestinger · 2013-06-19T19:42:26Z

I don't think it's useful though. I can't think of any case where it would be okay/useful to use strings for stuff that's not actually Unicode. I was also against having any ASCII-only methods (like the now removed to_lower and to_upper) on strings for the same reason - it should be separate.

If we ever make a small string class, the impact will be that strings under 10 bytes are on the stack instead of strings under 11 bytes on 32-bit, and 22 instead of 23 on 64-bit. It can also push them into the next allocator size class.

graydon · 2013-06-19T20:42:40Z

@thestinger suggests that we look at haskell's Foreign.C.String :: CString type (http://www.haskell.org/ghc/docs/latest/html/libraries/base/Foreign-C-String.html) and structure the library such that &str can be converted to such a CString which will ensure that it matches current C-string assumptions:

Encoded by current locale encoding
No interior nulls
One trailing null

IMO this relates closely to having to figure out APIs for handling non-utf8 encodings. See #6164, #4837, etc. (there should be a metabug for overhauling unicode / absorbing libICU / something; but there isn't yet)

…=flip1995 Fix another manual_unwrap_or deref FP changelog: none (since this just piggybacks on rust-lang#7233) Fixes rust-lang#6960

Rollup of 3 pull requests Successful merges: - rust-lang#7235 (Fix another manual_unwrap_or deref FP) - rust-lang#7237 (Add the command to add upstream remote) - rust-lang#7239 (CI: update rustup before installing the toolchain on windows) Failed merges: r? `@ghost` `@rustbot` modify labels: rollup changelog: rollup

Kimundi mentioned this issue Jun 19, 2013

Reconsider as_c_str function for working with strings over FFI #6869

Closed

graydon mentioned this issue Jun 20, 2013

Make str::as_buf less error prone. #3371

Closed

ghost assigned erickt Jun 25, 2013

erickt mentioned this issue Jun 25, 2013

Str from bytes API cleanup, attempt 2 #7172

Closed

graydon mentioned this issue Jul 23, 2013

Cleanup strs #7996

Closed

erickt mentioned this issue Aug 5, 2013

Remove str trailing nulls #8296

Closed

bors closed this as completed in 60f5011 Aug 10, 2013

erickt removed their assignment Jun 16, 2014

bluss mentioned this issue Nov 19, 2015

CString::from_vec_unchecked may not need to be unsafe #29918

Closed

steffahn mentioned this issue Jun 21, 2021

Add owner/name prefix to PR number in “Auto merge of …”/“Rollup merge of …” message rust-lang/homu#145

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove the trailing null from strings #7235

Remove the trailing null from strings #7235

erickt commented Jun 19, 2013

graydon commented Jun 19, 2013

lilyball commented Jun 19, 2013

emberian commented Jun 19, 2013

thestinger commented Jun 19, 2013

lilyball commented Jun 19, 2013

Kimundi commented Jun 19, 2013

thestinger commented Jun 19, 2013

Kimundi commented Jun 19, 2013

lilyball commented Jun 19, 2013

lilyball commented Jun 19, 2013

thestinger commented Jun 19, 2013

lilyball commented Jun 19, 2013

thestinger commented Jun 19, 2013

graydon commented Jun 19, 2013

Remove the trailing null from strings #7235

Remove the trailing null from strings #7235

Comments

erickt commented Jun 19, 2013

graydon commented Jun 19, 2013

lilyball commented Jun 19, 2013

emberian commented Jun 19, 2013

thestinger commented Jun 19, 2013

lilyball commented Jun 19, 2013

Kimundi commented Jun 19, 2013

Summary of the current situation

Making &str the same as &[u8]

Changes to the C interface

thestinger commented Jun 19, 2013

Kimundi commented Jun 19, 2013

lilyball commented Jun 19, 2013

lilyball commented Jun 19, 2013

thestinger commented Jun 19, 2013

lilyball commented Jun 19, 2013

thestinger commented Jun 19, 2013

graydon commented Jun 19, 2013

Making `&str` the same as `&[u8]`