
Deserializer doesn't support unicode identifiers #321

Closed · Tracked by #397
sephiron99 opened this issue Oct 21, 2021 · 6 comments · Fixed by #488
Comments

@sephiron99

use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize, PartialEq)]
struct A {
    한글: String,
}

#[derive(Debug, Serialize, Deserialize, PartialEq)]
struct B {
    eng: String,
}

fn main() {
    let a = A {
        한글: "스트링".to_string(),
    };
    let b = B {
        eng: "string".to_string(),
    };
    // Serialization succeeds for both structs...
    let a_ser = ron::ser::to_string(&a).unwrap();
    let b_ser = ron::ser::to_string(&b).unwrap();
    // ...but deserializing the unicode field name fails.
    let a_de = ron::de::from_str::<A>(&a_ser);
    let b_de = ron::de::from_str::<B>(&b_ser);
    println!("a_ser:\n{:#?}", a_ser);
    println!("a_de:\n{:#?}", a_de);
    println!("b_ser:\n{:#?}", b_ser);
    println!("b_de:\n{:#?}", b_de);
}
a_ser:
"(한글:\"스트링\")"
a_de:
Err(
    Error {
        code: ExpectedIdentifier,
        position: Position {
            line: 1,
            col: 2,
        },
    },
)
b_ser:
"(eng:\"string\")"
b_de:
Ok(
    B {
        eng: "string",
    },
)

Is this a bug in ron?

@torkleyy
Contributor

torkleyy commented Oct 21, 2021

The bug is not that it cannot deserialize unicode identifiers, but that it doesn't use raw identifiers when serializing identifiers.

EDIT: So far we don't support unicode in identifiers. I'll change this issue to be specifically about unicode support; I've created a separate issue for the incorrect serialization of raw idents: #322
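As a rough illustration of what raw-identifier escaping means here (a sketch assuming RON's r#-style identifier escape, which mirrors Rust's; #322 has the details):

#[derive(Debug, serde::Serialize, serde::Deserialize)]
struct C {
    // A Rust keyword used as a field name must be written as a raw
    // identifier; serde then sees the plain name "type".
    r#type: String,
}
// A serializer that escapes identifiers would emit this field with the
// escape applied (e.g. r#type) so that its output can be parsed back.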

@torkleyy torkleyy added the bug label Oct 21, 2021
@torkleyy torkleyy changed the title from "Failure to deserialize unicode identifiers (field)" to "Serializer doesn't escape identifiers and thus produces output that can't be deserialized" Oct 21, 2021
@torkleyy torkleyy changed the title from "Serializer doesn't escape identifiers and thus produces output that can't be deserialized" to "Serializer doesn't escape identifiers" Oct 21, 2021
@torkleyy torkleyy changed the title from "Serializer doesn't escape identifiers" to "Deserializer doesn't support unicode identifiers" Oct 21, 2021
@torkleyy torkleyy added enhancement and removed bug labels Oct 24, 2021
@sephiron99 sephiron99 reopened this Oct 24, 2021
@juntyr juntyr mentioned this issue Aug 14, 2022
@ModProg
Contributor

ModProg commented Mar 22, 2023

All parsing currently operates on u8; to perform xid_start/xid_continue checks, we would need to work with char.

I see 3 possible implementations:

  1. Make Bytes::new validate that the input is valid UTF-8. This would allow us to use a bit of unsafe to still work with bytes but also be able to get the next unicode char, like so:

fn check_ident_other_char(&self, index: usize) -> bool {
    // Safe because `index` always points at a valid char boundary
    unsafe { from_utf8_unchecked(&self.bytes[index..]) }
        .chars()
        .next()
        .map_or(false, is_xid_continue)
}

  2. Create our own method (or pull in a crate that can do it) to validate only the next character in the byte array, something like (see the sketch after this list):

get_next_char(&self.bytes[index..]).map_or(false, is_xid_continue)

  3. We could also just use from_utf8, but that would mean that for every parsed ident the whole trailing input would be validated as well.
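A minimal sketch of what option 2's helper could look like without a new dependency (get_next_char is a hypothetical name; it leans on std's UTF-8 error reporting):

fn get_next_char(bytes: &[u8]) -> Option<char> {
    // A UTF-8 scalar value is at most 4 bytes long, so validating a
    // 4-byte prefix is enough to recover the first character.
    let prefix = &bytes[..bytes.len().min(4)];
    match std::str::from_utf8(prefix) {
        // The whole prefix is valid; take its first char.
        Ok(s) => s.chars().next(),
        // The prefix may cut a *later* character in half; from_utf8
        // still reports how many leading bytes were valid.
        Err(e) if e.valid_up_to() > 0 => {
            std::str::from_utf8(&prefix[..e.valid_up_to()])
                .ok()?
                .chars()
                .next()
        }
        Err(_) => None,
    }
}

This bounds the validated region to at most four bytes per check instead of re-validating the whole trailing input.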

@juntyr
Member

juntyr commented Mar 23, 2023

@ModProg Thank you for sharing your thoughts! I've been thinking about the UTF-8 support for a while too but haven't had the time to draft an implementation so far.

When we serialise ron to a string, we already assert that the document must be valid UTF-8. While we can currently parse just about any bytes, I'm not fully convinced that sticking to that is really necessary: arbitrary byte strings, as suggested in #438, can always use escapes. So I think option (1) is a very valid approach.

Option 2 has the appeal of still giving us the flexibility of non-UTF-8 documents, but at the cost of extra maintenance on our side. I think I'd still prefer option (1) here unless there is a very simple implementation or a small, well-supported crate that we could depend on.

I have also thought about option (3), but the potential O(n²) complexity blowup does make me wary: every parsed identifier would re-validate the entire trailing input, so a document with many identifiers pays a quadratic cost.

@ModProg
Contributor

ModProg commented Mar 23, 2023

> While we can currently parse just about any bytes

Well, we actually cannot: every byte we parse must either be inside a string, and therefore valid UTF-8, or outside a string, and therefore ASCII (a subset of UTF-8).

Introducing a UTF-8 restriction for the whole file therefore does not change anything today; it would only close off the possibility of introducing raw byte data later. But if we ever really want that, we can still change our approach then.
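For concreteness, the whole-file restriction would sit in the parser's constructor; a rough sketch with simplified types, not ron's actual signatures:

struct Bytes<'a> {
    bytes: &'a [u8],
}

impl<'a> Bytes<'a> {
    fn new(bytes: &'a [u8]) -> Result<Self, std::str::Utf8Error> {
        // Validate the document once, up front: a single O(n) pass.
        // After this, using from_utf8_unchecked at char boundaries is
        // sound, and identifier checks never re-validate the tail.
        std::str::from_utf8(bytes)?;
        Ok(Bytes { bytes })
    }
}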

@ModProg
Contributor

ModProg commented Mar 25, 2023

Do we have any benchmarks to measure performance degradation?

@ModProg
Contributor

ModProg commented Mar 25, 2023

> Do we have any benchmarks to measure performance degradation?

Oh, ignore me; I found them immediately after commenting: https://github.com/ron-rs/ron-bench
