Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request: Lossy way to move Vec<u8> into a String #64727

Open
Lokathor opened this issue Sep 24, 2019 · 11 comments
Open

Request: Lossy way to move Vec<u8> into a String #64727

Lokathor opened this issue Sep 24, 2019 · 11 comments
Labels
C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@Lokathor
Copy link
Contributor

If you have a Vec<u8> of utf8 data there's three ways to get a String:

  • from_utf8(vec: Vec<u8>) -> Result<String, FromUtf8Error>
  • from_utf8_lossy(v: &'a [u8]) -> Cow<'a, str>
  • unsafe fn from_utf8_unchecked(bytes: Vec<u8>) -> String

There doesn't appear to be any way to hand over the existing Vec<u8> and have any necessary character changes into be done in place. The std::string::FromUtf8Error type doesn't, for example, have a way to just do a lossy in-place conversion.

@mtnmts
Copy link

mtnmts commented Sep 24, 2019

@Lokathor - I would like to implement this (I have no contributions yet in the project).

May i please claim the request?

@Lokathor
Copy link
Contributor Author

I think that normally the Libs team is given a chance to comment on the issue (it's only been 3 hours), but the worst that can happen is that they reject your PR.

All I can say for sure is that I will not be implementing this particular patch myself.

@mtnmts
Copy link

mtnmts commented Sep 24, 2019

Thanks for letting me know, I will wait until someone from the lib team will respond.

@jonas-schievink jonas-schievink added C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Sep 24, 2019
@Mark-Simulacrum
Copy link
Member

Seems like the right API here is actually fn from_utf_8_lossy_in_place(bytes: &mut [u8]) -> &mut str, since I believe there is never a need to allocate more bytes. Unfortunately I guess we'd need two APIs for this to be perfect, one on String as well, since there's no safe way to get from running this function to a String of the bytes, right? The only safe way would be to do String::from_utf8 which is guaranteed to succeed but is costly...

@Lokathor
Copy link
Contributor Author

  1. I don't want slice to slice. I definitely want owned to owned. I'm not sure why you'd think that slice to slice is somehow more correct. My goal is that whenever possible no additional allocations will immediately happen during the conversion (obviously this isn't always possible).
  2. You definitely can have byte sequences where the lossy conversion into utf8 makes the sequence longer and thus could hit the capacity of the allocation and thus could trigger a reallocation. Trivial example: 0xFF 0x00 becomes 0xEF 0xBF 0xBD 0x00, which is longer.

@Mark-Simulacrum
Copy link
Member

Hm, I had thought that the lossy conversion always replaced a single byte with a single byte, if that's not the case then the slice case is indeed not all that helpful probably. If this was true then the slice case would be good to have as a more general form, though.

@Mark-Simulacrum
Copy link
Member

So we want this signature: fn from_utf8_lossy_owned(v: Vec<u8>) -> String, right? That seems like it can be added via a PR relatively easily.

@Lokathor
Copy link
Contributor Author

That is one option.

Another would be to add a method to FromUtf8 so that you can consume that error into the lossy String.

The value here is not having to start over on the utf8 parsing.

@dylni
Copy link
Contributor

dylni commented Oct 5, 2020

There was a post with possible designs on internals:
https://internals.rust-lang.org/t/too-many-words-on-a-from-utf8-lossy-variant-that-takes-a-vec-u8/13005

@dishmaker
Copy link

I hope from_utf8_vec_lossy gets added soon :)

@ryanavella
Copy link

Another would be to add a method to FromUtf8 so that you can consume that error into the lossy String.

I prefer this approach for a couple of reasons.

It signals whether or not the conversion was lossless. Whereas if we went with the signature fn(Vec<u8>) -> String, some pairs of inputs collide, like the pair b"" and b"\xff", or the pair b"\xef\xbf\xbd" and b"\xff\xff\xff".

Also, it will deter people from checking for the presence of '�' to tell if something went wrong. This is a footgun I've seen in the wild.

And even for users who "just want a string" and don't care about the error, you can still reasonably get that in single readable line:

String::from_utf8(v).unwrap_or_else(|e| e.into_string_lossy())

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants