RFC: std::string::String could provide options about UTF BOM #2428

Open
hanyuwei70 opened this issue May 3, 2018 · 6 comments
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.

Comments

@hanyuwei70

hanyuwei70 commented May 3, 2018

std::string::String should:

  • correctly accept a byte stream with or without a UTF BOM,
  • when converting to &str or [u8], include or omit the UTF BOM as the caller chooses.

P.S. I am not a native English speaker, so what I wrote may differ from what I meant.

@Centril Centril added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label May 3, 2018
@Centril
Contributor

Centril commented May 3, 2018

Hey there. Thank you for your interest in designing Rust!

I think you will find more interest in this issue if you post it over at https://internals.rust-lang.org/ :)

@H2CO3

H2CO3 commented May 3, 2018

String and &str are explicitly designed to store and refer to UTF-8 data, so parsing/storing/emitting a BOM doesn't quite make sense. Its use in UTF-8 text is even explicitly discouraged by the Unicode standard itself, as it's essentially useless and it also breaks ASCII-compatibility.
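
To illustrate the point, here is a small sketch (not part of the original comment) of what String::from_utf8 does with BOM-prefixed input today. The three BOM bytes are valid UTF-8, so they are accepted and kept as the character U+FEFF, and text that would otherwise be pure ASCII stops being ASCII.

```rust
fn main() {
    // "hi" preceded by the UTF-8 encoding of U+FEFF (EF BB BF).
    let bytes = vec![0xEF, 0xBB, 0xBF, b'h', b'i'];

    // The BOM bytes are valid UTF-8, so the conversion succeeds.
    let s = String::from_utf8(bytes).expect("BOM-prefixed input is valid UTF-8");

    // The BOM is kept as an ordinary leading character.
    assert_eq!(s.chars().next(), Some('\u{FEFF}'));
    assert_eq!(s.chars().count(), 3); // U+FEFF, 'h', 'i'
    assert_eq!(s.len(), 5); // 3 BOM bytes + 2 ASCII bytes

    // The text is no longer ASCII once the BOM is present.
    assert!(!s.is_ascii());
    assert!("hi".is_ascii());
}
```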

@SimonSapin
Contributor

SimonSapin commented May 3, 2018

Copy of my comment on rust-lang/rust#50386 (comment), which was about "stripping" the BOM.

It’s not clear what is being proposed. Which APIs exactly should strip?

And more importantly, why? I tend to think of these standard library APIs as low-level primitives, and feel that BOM removal belongs more in a higher-level library that might, for example, also support multiple encodings and detect the presence of a BOM to help pick one. And even then, maybe not always. https://docs.rs/encoding_rs/0.7.2/encoding_rs/struct.Decoder.html#impl has different methods for different use cases, only some of which remove a BOM.
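
As a rough illustration of what "detect the presence of a BOM to help pick one" could look like in such a higher-level library, here is a std-only sketch. The names BomEncoding and sniff_bom are hypothetical and are not taken from encoding_rs or std. The sketch only checks the UTF-8 and UTF-16 BOMs and ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16LE one.

```rust
/// Encodings this sketch can recognise from a leading BOM (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum BomEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Returns the encoding signalled by a leading BOM together with the BOM's
/// length in bytes, or `None` if the input starts with none of the BOMs
/// checked here. UTF-32 is deliberately not handled.
fn sniff_bom(bytes: &[u8]) -> Option<(BomEncoding, usize)> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((BomEncoding::Utf8, 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some((BomEncoding::Utf16Le, 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some((BomEncoding::Utf16Be, 2))
    } else {
        None
    }
}

fn main() {
    let input = b"\xEF\xBB\xBFhello";
    match sniff_bom(input) {
        // Skip the BOM and hand the rest to the decoder for that encoding.
        Some((encoding, bom_len)) => {
            println!("{:?}, payload = {:?}", encoding, &input[bom_len..]);
        }
        None => println!("no BOM found"),
    }
}
```

Keeping this logic outside String means the caller decides whether a BOM is a signal, noise, or an error.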

@hanyuwei70
Author

@H2CO3 @SimonSapin
UTF-8 with a BOM should be accepted. All I want is for the BOM to be dropped so that it does not disturb other components.
It is also rather difficult to detect a BOM, because it does not show up in println! output.
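
For what it's worth, a leading BOM can be detected with existing std APIs even though it is invisible when printed. The sketch below uses a hypothetical helper named has_bom. It assumes the input has already been decoded into a &str, in which case the BOM appears as the single character U+FEFF.

```rust
/// Returns true if `s` begins with U+FEFF, the character a UTF-8 BOM decodes to.
/// (Hypothetical helper, not a std API.)
fn has_bom(s: &str) -> bool {
    s.starts_with('\u{FEFF}')
}

fn main() {
    let s = "\u{FEFF}hello";

    // Printed normally, the leading U+FEFF is typically rendered as nothing.
    println!("{}", s);

    // Escaping every character makes it visible.
    println!("{}", s.escape_unicode());

    // The byte length also gives it away: 3 BOM bytes + 5 ASCII bytes.
    println!("{} bytes, has_bom = {}", s.len(), has_bom(s));
}
```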

@joshtriplett
Member

String shouldn't have magic treatment for any characters. It makes sense for encoding/decoding methods to have options for handling the BOM, such as those @SimonSapin mentioned, but none of that should happen internally to String.
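
One way to read this suggestion: the choice of "with or without BOM" is made at the encoding boundary, while String itself stays unaware of it. A minimal sketch follows, assuming a hypothetical free function encode_utf8 with a with_bom flag. Neither the function nor the flag exists in std.

```rust
/// The UTF-8 encoding of U+FEFF.
const UTF8_BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];

/// Hypothetical helper: serialises `s` as UTF-8 bytes, optionally prefixed
/// with a BOM. The `String`/`&str` contents are never modified.
fn encode_utf8(s: &str, with_bom: bool) -> Vec<u8> {
    let mut out = Vec::with_capacity(s.len() + UTF8_BOM.len());
    if with_bom {
        out.extend_from_slice(&UTF8_BOM);
    }
    out.extend_from_slice(s.as_bytes());
    out
}

fn main() {
    let text = String::from("hi");
    let plain = encode_utf8(&text, false);
    let marked = encode_utf8(&text, true);
    println!("{:?}", plain);
    println!("{:?}", marked);
}
```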

@WiSaGaN
Contributor

WiSaGaN commented May 25, 2018

We may consider adding a method like

pub fn from_utf8_with_bom(vec: Vec<u8>) -> Result<String, FromUtf8Error>

or even

pub fn from_utf8_with_optional_bom(vec: Vec<u8>) -> Result<String, FromUtf8Error>

But as the other comments say, this shouldn't affect String's internals. It could also just live in an external crate rather than std.
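
A sketch of how the second signature could be written outside std, as a free function built on the existing String::from_utf8. This is only one possible reading of the proposal. The strip_bom helper for already-decoded text is an additional assumption and was not proposed in the thread.

```rust
use std::string::FromUtf8Error;

/// The UTF-8 encoding of U+FEFF.
const UTF8_BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];

/// Sketch of the proposed function: drops a leading UTF-8 BOM if one is
/// present, then validates the remaining bytes as UTF-8.
pub fn from_utf8_with_optional_bom(mut vec: Vec<u8>) -> Result<String, FromUtf8Error> {
    if vec.starts_with(&UTF8_BOM) {
        // Remove the first three bytes in place.
        vec.drain(..UTF8_BOM.len());
    }
    String::from_utf8(vec)
}

/// Hypothetical companion for text that is already a `&str`:
/// returns the input without a leading U+FEFF, if there was one.
pub fn strip_bom(s: &str) -> &str {
    s.strip_prefix('\u{FEFF}').unwrap_or(s)
}

fn main() {
    let with_bom = vec![0xEF, 0xBB, 0xBF, b'h', b'i'];
    let without_bom = b"hi".to_vec();

    // Both inputs decode to the same string.
    assert_eq!(from_utf8_with_optional_bom(with_bom).unwrap(), "hi");
    assert_eq!(from_utf8_with_optional_bom(without_bom).unwrap(), "hi");
    assert_eq!(strip_bom("\u{FEFF}hi"), "hi");
}
```

A variant that requires the BOM, as in the first signature, would need its own error type for the "BOM missing" case, because FromUtf8Error cannot be constructed outside std.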
