Deserialising unicode escape gives non-UTF8 String #228

5225225 · 2022-06-09T12:38:56Z

fn main() {
    let mut data = *br#""\uDE71""#; // [34, 92, 117, 68, 69, 55, 49, 34];

    let x: String = simd_json::from_slice(&mut data).unwrap();
    String::from_utf8(x.into_bytes()).unwrap();
}

Here, the second unwrap here fails, which means the library gave me a String with invalid UTF-8.

Licenser · 2022-06-09T12:59:09Z

Hi,

good catch! let me investigate.

Licenser · 2022-06-09T13:03:19Z

@hkratz I think this tracks back to simdutf8,

I added the following test case

    test_invalid_after_prefixes(br#""\uDE71""#, 0, None, Some(5));

and it doesn't throw an error.

hkratz · 2022-06-28T14:50:05Z

Simdutf8 usage in simd-json only confirms that the input byte sequence without any escape processing is valid UTF-8. Here the input byte sequence [34, 92, 117, 68, 69, 55, 49, 34] is valid UTF-8. AFAICS unicode escape processing is done by simd-json itself here:

simd-json/src/stringparse.rs

Lines 22 to 73 in 6af8b2a

    
           /// handle a unicode codepoint 
        
           /// write appropriate values into dest 
        
           /// src will advance 6 bytes or 12 bytes 
        
           /// dest will advance a variable amount (return via pointer) 
        
           /// return true if the unicode codepoint was valid 
        
           /// We work in little-endian then swap at write time 
        
           #[cfg_attr(not(feature = "no-inline"), inline(always))] 
        
           pub(crate) fn handle_unicode_codepoint( 
        
               mut src_ptr: &[u8], 
        
               dst_ptr: &mut [u8], 
        
           ) -> Result<(usize, usize), ErrorType> { 
        
               // hex_to_u32_nocheck fills high 16 bits of the return value with 1s if the 
        
               // conversion isn't valid; we defer the check for this to inside the 
        
               // multilingual plane check 
        
               let mut code_point: u32 = hex_to_u32_nocheck(unsafe { src_ptr.get_unchecked(2..) }); 
        
               src_ptr = unsafe { src_ptr.get_unchecked(6..) }; 
        
               let mut src_offset = 6; 
        
               // check for low surrogate for characters outside the Basic 
        
               // Multilingual Plane. 
        
               if (0xd800..0xdc00).contains(&code_point) { 
        
                   if (unsafe { *src_ptr.get_unchecked(0) } != b'\\') 
        
                       || unsafe { *src_ptr.get_unchecked(1) } != b'u' 
        
                   { 
        
                       return Ok((0, src_offset)); 
        
                   } 
        
                   let code_point_2: u32 = hex_to_u32_nocheck(unsafe { src_ptr.get_unchecked(2..) }); 
        
                   // if the first code point is invalid we will get here, as we will go past 
        
                   // the check for being outside the Basic Multilingual plane. If we don't 
        
                   // find a \u immediately afterwards we fail out anyhow, but if we do, 
        
                   // this check catches both the case of the first code point being invalid 
        
                   // or the second code point being invalid. 
        
                   if ((code_point | code_point_2) >> 16) != 0 { 
        
                       return Ok((0, src_offset)); 
        
                   } 
        
                   let c1 = if let Some(c) = code_point.checked_sub(0xd800) { 
        
                       c 
        
                   } else { 
        
                       return Err(ErrorType::InvalidUtf8); 
        
                   }; 
        
                   let c2 = if let Some(c) = code_point_2.checked_sub(0xdc00) { 
        
                       c 
        
                   } else { 
        
                       return Err(ErrorType::InvalidUtf8); 
        
                   }; 
        
                   code_point = ((c1 << 10) | c2) + 0x10000; 
        
                   src_offset += 6; 
        
               } 
        
               let offset: usize = codepoint_to_utf8(code_point, dst_ptr); 
        
               Ok((offset, src_offset)) 
        
           }

This does not seem to handle a lone low surrogates such as \uDE71 at all (which should result in an error).

Licenser · 2022-06-29T08:17:35Z

Heya, sorry about that, too many distractions these days, you're completely right, I missed that there is a \u in there that is not escaped during the test itself 🤦

Licenser · 2022-06-30T16:05:50Z

Heya @5225225 sorry this took so long, I was a bit too all over the place and totally missed the real issue here ... on the bright side, it should be fixed now and released as 0.5.1 :)

Licenser mentioned this issue Jun 28, 2022

Deserialising unicode escape gives non-UTF8 String rusticstuff/simdutf8#67

Closed

Licenser mentioned this issue Jun 30, 2022

Fix bug that allowed invalid UTF8 to be paresed #229

Merged

Licenser closed this as completed in #229 Jun 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deserialising unicode escape gives non-UTF8 String #228

Deserialising unicode escape gives non-UTF8 String #228

5225225 commented Jun 9, 2022

Licenser commented Jun 9, 2022

Licenser commented Jun 9, 2022

hkratz commented Jun 28, 2022

Licenser commented Jun 29, 2022

Licenser commented Jun 30, 2022

Deserialising unicode escape gives non-UTF8 String #228

Deserialising unicode escape gives non-UTF8 String #228

Comments

5225225 commented Jun 9, 2022

Licenser commented Jun 9, 2022

Licenser commented Jun 9, 2022

hkratz commented Jun 28, 2022

Licenser commented Jun 29, 2022

Licenser commented Jun 30, 2022