Condition for handling malformed UTF-8; also an interface to iconv #4837

lifthrasiir · 2013-02-08T02:00:05Z

Currently even this simple cat program:

use io::ReaderUtil;
fn main() {
    for io::stdin().each_line |line| { io::println(line); }
}

...fails on broken or invalid UTF-8 input (or on input in another character encoding, as this example illustrates):

$ echo 깨진 글자 | iconv -f utf-8 -t cp949 | ./test
rust: task failed at 'Assertion is_utf8(vv) failed', [...]/rust/src/libcore/str.rs:50
rust: domain main @0x7fcf32815e10 root task failed

...because the byte sequence is assumed to be UTF-8 (which it is not). There is currently no standard way to repair a broken UTF-8 string by replacing the offending substrings with some other valid UTF-8, so this kind of bug is hard to fix.
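For readers on present-day Rust, the failure above can be avoided by reading each line as raw bytes and decoding it lossily. The following is a minimal sketch, not the API that existed when this issue was filed: `lossy_lines` is a hypothetical helper name, and it relies on the current standard library calls `BufRead::read_until` and `String::from_utf8_lossy`, which replace each malformed sequence with U+FFFD.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: read newline-terminated lines as raw bytes and
/// decode each one lossily, so malformed UTF-8 never aborts the program.
fn lossy_lines<R: BufRead>(mut reader: R) -> io::Result<Vec<String>> {
    let mut lines = Vec::new();
    let mut buf = Vec::new();
    loop {
        buf.clear();
        // read_until works on bytes, so it does not validate UTF-8.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break; // end of input
        }
        // Drop the trailing newline, if the line had one.
        if buf.last() == Some(&b'\n') {
            buf.pop();
        }
        // Malformed sequences become U+FFFD instead of causing a failure.
        lines.push(String::from_utf8_lossy(&buf).into_owned());
    }
    Ok(lines)
}
```

Passing `std::io::stdin().lock()` as the reader and printing each returned line would give a cat-like program that tolerates input in other encodings, at the cost of losing the original bytes.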

This issue is ultimately linked to general character encoding handling (a libiconv binding, perhaps?) and to a strict distinction between byte sequences and Unicode (UTF-8) strings. I find Python's approach reasonable (bytes and str are separate types, converted to each other via the encode and decode methods; a normal file open reads bytes, while codecs.open with an encoding converts them to str), but I'm really not sure about the actual interface.

graydon · 2013-02-08T02:14:46Z

Rust already makes a strict distinction between bytes and strings. A string (types named str) is UTF-8; an array of bytes (types named [u8]) places no restriction on its contents. A cat program generally does not deal with text (in any encoding) but with plain bytes, so for that one should generally use [u8] rather than str.
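As an illustration of the byte-oriented approach, here is a minimal sketch in present-day Rust (the 2013 io module had a different interface). `cat_bytes` is a hypothetical name; it uses only `std::io::copy`, which moves bytes from a reader to a writer without ever treating them as text.

```rust
use std::io::{self, Read, Write};

/// Hypothetical helper: copy every byte from `input` to `output`
/// without interpreting the data as text in any encoding.
fn cat_bytes<R: Read, W: Write>(mut input: R, mut output: W) -> io::Result<u64> {
    // io::copy streams raw bytes and returns how many were copied.
    let copied = io::copy(&mut input, &mut output)?;
    output.flush()?;
    Ok(copied)
}
```

Calling `cat_bytes(std::io::stdin().lock(), std::io::stdout().lock())` from `main` would behave like cat for any input, including the cp949 example above, because no UTF-8 validation takes place.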

But I generally agree that:

Rust's string-based IO routines should provide a condition for hooking in different behavior when presented with malformed UTF-8. The current behavior predates conditions, and should be upgraded.
Rust's standard library should contain an interface to iconv, so that we can accept input in multiple character sets from outside a Rust program. Currently we only really support UTF-8 and UTF-16 (for Windows compatibility).

I'll rename this bug to match these concerns.

lifthrasiir · 2013-02-08T02:36:46Z

Agreed on the blurred subject(s). I know [u8] and str are separate, but practically speaking [u8] is used much less than str, even where it should be used, because [u8] lacks useful methods. Personally, I have hit this issue many times while porting C programs to Rust.

In the short term, I'm thinking of adding two functions to core::str (and possibly impl &str):

pure fn fix_utf8_range(s: &str, begin: uint, end: uint, handler: &fn(&[u8]) -> ~str) -> ~str;
pure fn fix_utf8(s: &str, handler: &fn(&[u8]) -> ~str) -> ~str;

...so that one can write str::fix_utf8(line_just_read, |_| ~"\ufffd"). Maybe that is for another issue though.
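To make the proposal concrete, here is a rough sketch of how a handler-based `fix_utf8` could look in present-day Rust. It is an illustration under assumptions, not the proposed signature: modern `&str` is always valid UTF-8, so this version takes `&[u8]`, and `pure fn`, `~str` and `uint` no longer exist. It relies on `std::str::from_utf8` together with `Utf8Error::valid_up_to` and `Utf8Error::error_len` to locate each malformed run.

```rust
use std::str;

/// Hypothetical modern counterpart of the proposed `fix_utf8`:
/// valid stretches are copied through, and each malformed byte run
/// is replaced by whatever string `handler` returns for it.
fn fix_utf8<F>(mut bytes: &[u8], mut handler: F) -> String
where
    F: FnMut(&[u8]) -> String,
{
    let mut out = String::with_capacity(bytes.len());
    loop {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                // Everything that remains is well-formed.
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // The prefix was just validated, so this cannot fail.
                out.push_str(str::from_utf8(valid).expect("validated prefix"));
                // error_len is None when the input ends mid-sequence;
                // in that case the whole remainder is the malformed run.
                let bad_len = err.error_len().unwrap_or(rest.len());
                let (bad, tail) = rest.split_at(bad_len);
                out.push_str(&handler(bad));
                bytes = tail;
            }
        }
    }
}
```

With this sketch, the call from the comment above would read `fix_utf8(line_just_read, |_| "\u{fffd}".to_string())`, where `line_just_read` is a byte slice. A range-limited variant like `fix_utf8_range` could be obtained by slicing the input before the call.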

huonw · 2013-09-10T14:10:03Z

Triage: There's currently a not_utf8 condition (added by #5399), but it's under discussion and likely to change (#8968). In any case, I think there's been a pile of rumblings/discussion recently about how to handle encodings properly, and there's a bug specifically about this: #6164.

I'm closing this in favour of the two bugs above; reopen if you disagree.

sonwow mentioned this issue Mar 15, 2013

Provide a condition in string-based IO routines #5399

Closed

graydon mentioned this issue Jun 19, 2013

Remove the trailing null from strings #7235

Closed

pnkfelix mentioned this issue Sep 10, 2013

str::not_utf8 condition is a sign that the from_bytes API needs massaging #8968

Closed

huonw closed this as completed Sep 10, 2013
