Condition for handling malformed UTF-8; also an interface to iconv #4837

Closed
lifthrasiir opened this issue Feb 8, 2013 · 3 comments
Labels
A-unicode Area: Unicode C-enhancement Category: An issue proposing an enhancement or a PR with one. E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue.

Comments

@lifthrasiir
Contributor

Currently even this simple cat program:

use io::ReaderUtil;
fn main() {
    for io::stdin().each_line |line| { io::println(line); }
}

...fails on broken or invalid UTF-8 strings (or on input in another character encoding, as this example illustrates):

$ echo 깨진 글자 | iconv -f utf-8 -t cp949 | ./test
rust: task failed at 'Assertion is_utf8(vv) failed', [...]/rust/src/libcore/str.rs:50
rust: domain main @0x7fcf32815e10 root task failed

...because the byte sequence is assumed to be UTF-8 (which it is not). But there is currently no standard way to fix broken UTF-8 strings by replacing the offending substrings with some other valid UTF-8, so it is hard to fix this kind of bug.
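For comparison, here is a minimal sketch of the replace-on-error behaviour in present-day Rust, which postdates this issue. It assumes a toolchain where `BufRead::read_until` and `String::from_utf8_lossy` are available. The function name `cat_lines_lossy` is hypothetical and is not part of any proposal in this thread.

```rust
use std::io::{self, BufRead, Write};

/// Copies `input` to `output` line by line, replacing each malformed
/// UTF-8 sequence with U+FFFD instead of failing.
/// (Hypothetical helper name, chosen for illustration only.)
fn cat_lines_lossy<R: BufRead, W: Write>(mut input: R, mut output: W) -> io::Result<()> {
    let mut buf = Vec::new();
    loop {
        buf.clear();
        // read_until works on raw bytes, so it never rejects invalid UTF-8.
        if input.read_until(b'\n', &mut buf)? == 0 {
            return Ok(()); // end of input
        }
        // Borrowed if `buf` is valid UTF-8, otherwise an owned, repaired copy.
        let line = String::from_utf8_lossy(&buf);
        output.write_all(line.as_bytes())?;
    }
}
```

Calling it as `cat_lines_lossy(io::stdin().lock(), io::stdout().lock())` from `main` would give a line-oriented cat that survives the CP949 input above, at the cost of rewriting the bytes it cannot decode.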

This issue is ultimately linked to general character encoding handling (a libiconv binding, perhaps?) and to a strict distinction between byte sequences and Unicode (UTF-8) strings. I found Python's approach reasonable (bytes and str are separate types, converted to each other via the encode and decode methods; a normal file open reads bytes, while codecs.open with an encoding converts them to str), but I'm really not sure about the actual interface.

@graydon
Contributor

graydon commented Feb 8, 2013

Rust does already make a strict distinction between bytes and strings. A string (types named str) is UTF-8; an array of bytes (types named [u8]) places no restriction on its contents. A cat program generally does not deal with text (in any encoding), but with plain bytes. So for that, one should generally use [u8] rather than str.
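A byte-oriented cat along these lines can be sketched as follows. This uses present-day `std::io` rather than the 2013 `io` module quoted in this thread, and `cat_bytes` is a hypothetical name used only for illustration.

```rust
use std::io::{self, Read, Write};

/// Copies all bytes from `input` to `output` without interpreting them as text.
/// (Hypothetical helper name, chosen for illustration only.)
fn cat_bytes<R: Read, W: Write>(mut input: R, mut output: W) -> io::Result<u64> {
    // io::copy moves raw bytes; no UTF-8 validation happens anywhere.
    io::copy(&mut input, &mut output)
}
```

Invoked as `cat_bytes(io::stdin().lock(), io::stdout().lock())`, this passes input in any encoding through unchanged, because no `str` is ever constructed.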

But I generally agree that:

  • Rust's string-based IO routines should provide a condition for hooking in different behavior when presented with malformed UTF-8. The current behavior predates conditions, and should be upgraded.
  • Rust's standard library should contain an interface to iconv, so that we can accept input in multiple character sets from outside a Rust program. Currently we only really support UTF-8 and UTF-16 (for Windows compatibility).

I'll rename this bug to match these concerns.

@lifthrasiir
Contributor Author

Agreed on the blurred subject(s). I know [u8] and str are separate, but practically speaking [u8] is not used much compared to str, even when it should be, due to the lack of useful methods on [u8]. Personally, I have hit this issue many times while porting a program from C to Rust.

In the short term, I'm thinking of adding two functions to core::str (and possibly impl &str):

pure fn fix_utf8_range(s: &str, begin: uint, end: uint, handler: &fn(&[u8]) -> ~str) -> ~str;
pure fn fix_utf8(s: &str, handler: &fn(&[u8]) -> ~str) -> ~str;

...so that one can write str::fix_utf8(line_just_read, |_| ~"\ufffd"). Maybe that is for another issue though.
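A rough sketch of how such a `fix_utf8` could look in present-day Rust is below. This is an assumption-laden illustration, not the proposed 2013 signature: it takes `&[u8]` instead of `&str` (a modern `&str` is always valid UTF-8), returns `String` instead of `~str`, and relies on `Utf8Error::valid_up_to` and `Utf8Error::error_len` to locate each malformed run. The ranged variant `fix_utf8_range` is omitted.

```rust
use std::str;

/// Decodes `bytes` as UTF-8, calling `handler` on each malformed byte run
/// and splicing its return value into the result.
/// (Illustrative adaptation; the signature differs from the one proposed above.)
fn fix_utf8<F>(mut bytes: &[u8], mut handler: F) -> String
where
    F: FnMut(&[u8]) -> String,
{
    let mut fixed = String::with_capacity(bytes.len());
    loop {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                fixed.push_str(valid);
                return fixed;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // The prefix was just validated, so this cannot fail.
                fixed.push_str(str::from_utf8(valid).unwrap());
                // error_len() is None when the input ends mid-sequence;
                // treat the whole remainder as malformed in that case.
                let bad_len = err.error_len().unwrap_or(rest.len());
                let (bad, tail) = rest.split_at(bad_len);
                fixed.push_str(&handler(bad));
                bytes = tail;
            }
        }
    }
}
```

With this sketch, the call from the comment above would read `fix_utf8(line_just_read, |_| "\u{fffd}".to_string())`, where `line_just_read` is a byte slice. Passing the offending bytes to the handler also lets a caller attempt a fallback decoding instead of always substituting U+FFFD.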

@huonw
Member

huonw commented Sep 10, 2013

Triage: There's currently a not_utf8 condition (added by #5399), but it's under discussion and likely to change (#8968). In any case, I think there's been a pile of rumblings/discussion recently about how to handle encodings properly, and there's a bug specifically about this (#6164).

I'm closing this in favour of the two bugs above; reopen if you disagree.

@huonw huonw closed this as completed Sep 10, 2013