Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deserialising unicode escape gives non-UTF8 String #228

Closed
5225225 opened this issue Jun 9, 2022 · 5 comments · Fixed by #229
Closed

Deserialising unicode escape gives non-UTF8 String #228

5225225 opened this issue Jun 9, 2022 · 5 comments · Fixed by #229

Comments

@5225225
Copy link
Contributor

5225225 commented Jun 9, 2022

fn main() {
    let mut data = *br#""\uDE71""#; // [34, 92, 117, 68, 69, 55, 49, 34];

    let x: String = simd_json::from_slice(&mut data).unwrap();
    String::from_utf8(x.into_bytes()).unwrap();
}

Here, the second unwrap here fails, which means the library gave me a String with invalid UTF-8.

@Licenser
Copy link
Member

Licenser commented Jun 9, 2022

Hi,

good catch! let me investigate.

@Licenser
Copy link
Member

Licenser commented Jun 9, 2022

@hkratz I think this tracks back to simdutf8,

I added the following test case

    test_invalid_after_prefixes(br#""\uDE71""#, 0, None, Some(5));

and it doesn't throw an error.

@hkratz
Copy link
Contributor

hkratz commented Jun 28, 2022

Simdutf8 usage in simd-json only confirms that the input byte sequence without any escape processing is valid UTF-8. Here the input byte sequence [34, 92, 117, 68, 69, 55, 49, 34] is valid UTF-8. AFAICS unicode escape processing is done by simd-json itself here:

/// handle a unicode codepoint
/// write appropriate values into dest
/// src will advance 6 bytes or 12 bytes
/// dest will advance a variable amount (return via pointer)
/// return true if the unicode codepoint was valid
/// We work in little-endian then swap at write time
#[cfg_attr(not(feature = "no-inline"), inline(always))]
pub(crate) fn handle_unicode_codepoint(
mut src_ptr: &[u8],
dst_ptr: &mut [u8],
) -> Result<(usize, usize), ErrorType> {
// hex_to_u32_nocheck fills high 16 bits of the return value with 1s if the
// conversion isn't valid; we defer the check for this to inside the
// multilingual plane check
let mut code_point: u32 = hex_to_u32_nocheck(unsafe { src_ptr.get_unchecked(2..) });
src_ptr = unsafe { src_ptr.get_unchecked(6..) };
let mut src_offset = 6;
// check for low surrogate for characters outside the Basic
// Multilingual Plane.
if (0xd800..0xdc00).contains(&code_point) {
if (unsafe { *src_ptr.get_unchecked(0) } != b'\\')
|| unsafe { *src_ptr.get_unchecked(1) } != b'u'
{
return Ok((0, src_offset));
}
let code_point_2: u32 = hex_to_u32_nocheck(unsafe { src_ptr.get_unchecked(2..) });
// if the first code point is invalid we will get here, as we will go past
// the check for being outside the Basic Multilingual plane. If we don't
// find a \u immediately afterwards we fail out anyhow, but if we do,
// this check catches both the case of the first code point being invalid
// or the second code point being invalid.
if ((code_point | code_point_2) >> 16) != 0 {
return Ok((0, src_offset));
}
let c1 = if let Some(c) = code_point.checked_sub(0xd800) {
c
} else {
return Err(ErrorType::InvalidUtf8);
};
let c2 = if let Some(c) = code_point_2.checked_sub(0xdc00) {
c
} else {
return Err(ErrorType::InvalidUtf8);
};
code_point = ((c1 << 10) | c2) + 0x10000;
src_offset += 6;
}
let offset: usize = codepoint_to_utf8(code_point, dst_ptr);
Ok((offset, src_offset))
}

This does not seem to handle a lone low surrogates such as \uDE71 at all (which should result in an error).

@Licenser
Copy link
Member

Heya, sorry about that, too many distractions these days, you're completely right, I missed that there is a \u in there that is not escaped during the test itself 🤦

@Licenser
Copy link
Member

Heya @5225225 sorry this took so long, I was a bit too all over the place and totally missed the real issue here ... on the bright side, it should be fixed now and released as 0.5.1 :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants