Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validity of strings #78

Open
RalfJung opened this issue Jan 10, 2019 · 9 comments
Open

Validity of strings #78

RalfJung opened this issue Jan 10, 2019 · 9 comments
Labels
A-validity Topic: Related to validity invariants S-writeup-needed Status: Ready for a writeup and no one is assigned

Comments

@RalfJung
Copy link
Member

Discussing the validity invariant of strings: Do they just have to be like byte slices, or do we require in the validity invariant that they be valid UTF-8?

Certainly, their safety invariant requires UTF-8, but that is not what we are discussing here.

@RalfJung RalfJung added active discussion topic A-validity Topic: Related to validity invariants labels Jan 10, 2019
@nikomatsakis
Copy link
Contributor

nikomatsakis commented Jan 31, 2019

I think they should just be byte-slices, analogously with how char's validity invariant is independent from unicode.

EDIT: That statement was incorrect

@RalfJung
Copy link
Member Author

Note that the Nomicon calls non-UTF-8 str a form of insta-UB, making it a validity invariant: https://doc.rust-lang.org/nightly/nomicon/what-unsafe-does.html.

@Lokathor
Copy link
Contributor

I think that in this case the Nomicon simply isn't quite correct. The Validity Invariant of a type, if any, is known to the compiler. In the case of str, the utf8 invariant is known only to actual Rust code. As far as I know, LLVM has absolutely no concept that there's a byte encoding involved.

@hanna-kruppe
Copy link

I don't think the current rustc-LLVM compilation pipeline exploits knowledge of str being UTF-8 anywhere, but since str is a built-in type, the compiler absolutely could know about this invariant if we defined it so, and of course rustc is under no obligation to actually exploit that knowledge. So what the Nomicon says is perfectly compatible with the current state of the implementation.

That said, I don't think there's anything useful the compiler can do with strs being well-formed UTF-8, since you usually won't store or copy them by value. Even if reference validity includes validity of the pointee, that still will be hard to make use of -- niches behind references are useless for layout optimizations, and the invariant is rather complicated (since the format is variable-length) so I doubt it could be used to automatically optimize code that processes UTF-8 on a byte by byte basis. For these reasons, I don't object to downgrading this to the safety invariant.

@RalfJung
Copy link
Member Author

I opened a PR against the reference to decide that strings do not have UTF-8 in their validity invariant: rust-lang/reference#792.

@scottmcm
Copy link
Member

I saw this comment above:

I think they should just be byte-slices, analogously with how char's validity invariant is independent from unicode.

Is that true? It seems in opposition to the following statement from rust-lang/rust#71033 (comment)

And anyway Rust also limits char to U+10FFFF max.

Trying out matching on char demonstrates that, at least as implemented today, even in unsafe code you can't look for things beyond the unicode-defined limit: https://rust.godbolt.org/z/gLBW08

So I don't think that "analogously [...] independent from unicode" is good justification here.

(Which isn't a statement that removing the invariant would bad, just that I'd like to strike that particular piece of evidence.)

@nikomatsakis
Copy link
Contributor

I agree I was incorrect.

@mbrubeck
Copy link

mbrubeck commented Jul 15, 2020

The lang team and libs team decided in rust-lang/rust#71033 to change this from a validity invariant to a safety invariant. The FCP is complete, and the reference and nomicon have been updated. I believe this issue can be closed now.

@RalfJung RalfJung added the S-writeup-needed Status: Ready for a writeup and no one is assigned label Jul 16, 2020
@RalfJung
Copy link
Member Author

Ah good point, discussion here is done. Moving to state "writeup needed".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-validity Topic: Related to validity invariants S-writeup-needed Status: Ready for a writeup and no one is assigned
Projects
None yet
Development

No branches or pull requests

6 participants