
Deserializer doesn't support unicode identifiers #321

Closed · Tracked by #397
sephiron99 opened this issue Oct 21, 2021 · 6 comments · Fixed by #488
Comments

@sephiron99

use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize, PartialEq)]
struct A {
    한글: String,
}

#[derive(Debug, Serialize, Deserialize, PartialEq)]
struct B {
    eng: String,
}

fn main() {
    let a = A {
        한글: "스트링".to_string(),
    };
    let b = B {
        eng: "string".to_string(),
    };
    // Serialization succeeds for both structs...
    let a_ser = ron::ser::to_string(&a).unwrap();
    let b_ser = ron::ser::to_string(&b).unwrap();
    // ...but deserializing the unicode field name fails.
    let a_de = ron::de::from_str::<A>(&a_ser);
    let b_de = ron::de::from_str::<B>(&b_ser);
    println!("a_ser:\n{:#?}", a_ser);
    println!("a_de:\n{:#?}", a_de);
    println!("b_ser:\n{:#?}", b_ser);
    println!("b_de:\n{:#?}", b_de);
}
a_ser:
"(한글:\"스트링\")"
a_de:
Err(
    Error {
        code: ExpectedIdentifier,
        position: Position {
            line: 1,
            col: 2,
        },
    },
)
b_ser:
"(eng:\"string\")"
b_de:
Ok(
    B {
        eng: "string",
    },
)

Is this a bug in ron?

@torkleyy
Contributor

torkleyy commented Oct 21, 2021

The bug is not that it cannot deserialize unicode identifiers, but that it doesn't use raw identifiers when serializing identifiers.

EDIT: So far we don't support unicode in identifiers. I'll change this issue to be specifically about unicode support; I've created a separate issue for the incorrect serialization of raw idents: #322
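As a rough illustration of what raw-identifier escaping means here (a sketch assuming RON's r#-style identifier escape, which mirrors Rust's; #322 has the details):

#[derive(Debug, serde::Serialize, serde::Deserialize)]
struct C {
    // A Rust keyword used as a field name must be written as a raw
    // identifier; serde then sees the plain name "type".
    r#type: String,
}
// A serializer that escapes identifiers would emit this field with the
// escape applied (e.g. r#type) so that its output can be parsed back.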

@torkleyy torkleyy added the bug label Oct 21, 2021
@torkleyy torkleyy changed the title from "Failure to deserialize unicode identifiers (field)" to "Serializer doesn't escape identifiers and thus produces output that can't be deserialized" Oct 21, 2021
@torkleyy torkleyy changed the title from "Serializer doesn't escape identifiers and thus produces output that can't be deserialized" to "Serializer doesn't escape identifiers" Oct 21, 2021
@torkleyy torkleyy changed the title from "Serializer doesn't escape identifiers" to "Deserializer doesn't support unicode identifiers" Oct 21, 2021
@torkleyy torkleyy added enhancement and removed bug labels Oct 24, 2021
@sephiron99 sephiron99 reopened this Oct 24, 2021
@juntyr juntyr mentioned this issue Aug 14, 2022
@ModProg
Contributor

ModProg commented Mar 22, 2023

All parsing currently operates on u8; to perform xid_start/xid_continue checks, we would need to work with char.

I see 3 possible implementations:

  1. Make Bytes::new validate that the input is valid UTF-8. This would allow us to use a bit of unsafe to still work with bytes but also be able to get the next unicode char, like so:

fn check_ident_other_char(&self, index: usize) -> bool {
    // Safe because `index` always points at a valid char boundary
    unsafe { from_utf8_unchecked(&self.bytes[index..]) }
        .chars()
        .next()
        .map_or(false, is_xid_continue)
}

  2. Create our own method (or pull in a crate that can do it) to validate only the next character in the byte array, something like (see the sketch after this list):

get_next_char(&self.bytes[index..]).map_or(false, is_xid_continue)

  3. We could also just use from_utf8, but that would mean that for every parsed ident the whole trailing input would be validated as well.
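A minimal sketch of what option 2's helper could look like without a new dependency (get_next_char is a hypothetical name; it leans on std's UTF-8 error reporting):

fn get_next_char(bytes: &[u8]) -> Option<char> {
    // A UTF-8 scalar value is at most 4 bytes long, so validating a
    // 4-byte prefix is enough to recover the first character.
    let prefix = &bytes[..bytes.len().min(4)];
    match std::str::from_utf8(prefix) {
        // The whole prefix is valid; take its first char.
        Ok(s) => s.chars().next(),
        // The prefix may cut a *later* character in half; from_utf8
        // still reports how many leading bytes were valid.
        Err(e) if e.valid_up_to() > 0 => {
            std::str::from_utf8(&prefix[..e.valid_up_to()])
                .ok()?
                .chars()
                .next()
        }
        Err(_) => None,
    }
}

This bounds the validated region to at most four bytes per check instead of re-validating the whole trailing input.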

@juntyr
Member

juntyr commented Mar 23, 2023

@ModProg Thank you for sharing your thoughts! I've been thinking about the UTF-8 support for a while too but haven't had the time to draft an implementation so far.

When we serialise ron to a string, we already assert that the document must be valid UTF-8. While we can currently parse just about any bytes, I'm not fully convinced that sticking to that is really necessary: arbitrary byte strings, as suggested in #438, can always use escapes. So I think option (1) is a very valid approach.

Option 2 has the appeal of still giving us the flexibility of non-UTF-8 documents, but at the cost of extra maintenance on our side. I think I'd still prefer option (1) here unless there is a very simple implementation or a small, well-supported crate that we could depend on.

I have also thought about option (3), but the potential O(n²) complexity blowup does make me wary: every parsed identifier would re-validate the entire trailing input, so a document with many identifiers pays a quadratic cost.

@ModProg
Contributor

ModProg commented Mar 23, 2023

> While we can currently parse just about any bytes

Well, we actually cannot: every byte we parse must either be inside a string, and therefore valid UTF-8, or outside a string, and therefore ASCII (a subset of UTF-8).

Introducing a UTF-8 restriction for the whole file therefore does not change anything today; it would only close off the possibility of introducing raw byte data later. But if we ever really want that, we can still change our approach then.
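For concreteness, the whole-file restriction would sit in the parser's constructor; a rough sketch with simplified types, not ron's actual signatures:

struct Bytes<'a> {
    bytes: &'a [u8],
}

impl<'a> Bytes<'a> {
    fn new(bytes: &'a [u8]) -> Result<Self, std::str::Utf8Error> {
        // Validate the document once, up front: a single O(n) pass.
        // After this, using from_utf8_unchecked at char boundaries is
        // sound, and identifier checks never re-validate the tail.
        std::str::from_utf8(bytes)?;
        Ok(Bytes { bytes })
    }
}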

@ModProg
Contributor

ModProg commented Mar 25, 2023

Do we have any benchmarks to measure performance degradation?

@ModProg
Contributor

ModProg commented Mar 25, 2023

> Do we have any benchmarks to measure performance degradation?

Oh, ignore me; I found them immediately after commenting: https://github.com/ron-rs/ron-bench
