sxd-document cannot parse document containing a UTF-8 BOM #39

Open
therealprof opened this issue Feb 17, 2017 · 4 comments

@therealprof

Disclaimer: I've been working with XML and UTF-8 for a long time and this is the first time I ran into such a problem so I had to do a bit of research to figure out what's going on...

What I'm trying to do is take a fairly naive approach to writing an application that reads an XML document. The problem is also reproducible using sxd-xpath/evaluate, so I'll use that for easier access. To demonstrate the problem I'll use https://www.broadband-forum.org/cwmp/tr-069-biblio.xml.

This file starts with a UTF-8 BOM, which read_to_string() gladly includes in the resulting String. That makes the parser fail, because it expects the document to begin literally with <?xml:

# cargo run -- --xpath / tr-069-biblio.xml
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `target/debug/evaluate --xpath / tr-069-biblio.xml`
Unable to parse input XML
 -> Expected("<?xml")
 -> ExpectedElement
 -> ExpectedWhitespace
 -> ExpectedComment
 -> ExpectedProcessingInstruction
thread 'main' panicked at 'At:
<?xml version=', src/main.rs:52

I'm not sure what the expected behaviour is supposed to be, but I see a few approaches to address this particular problem:

  1. Have std automatically strip irrelevant magic from file content turned into Strings
  2. Have std provide a normalising read function
  3. Have each application (including sxd-document) specifically deal with this variant
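As a rough illustration of the second approach, here is a minimal sketch of what a normalising read function could look like. The name `read_to_string_without_bom` is hypothetical (it is not part of std or sxd-document), and the sketch assumes the input is already valid UTF-8:

```rust
use std::io::{self, Read};

/// Hypothetical helper (not a std API): reads all of `reader` into a String
/// and drops a single leading U+FEFF, the decoded form of the UTF-8 BOM.
fn read_to_string_without_bom<R: Read>(mut reader: R) -> io::Result<String> {
    let mut data = String::new();
    reader.read_to_string(&mut data)?;
    if data.starts_with('\u{feff}') {
        // U+FEFF occupies three bytes in UTF-8 (EF BB BF).
        data.replace_range(..'\u{feff}'.len_utf8(), "");
    }
    Ok(data)
}

fn main() -> io::Result<()> {
    // Stand-in for a file that begins with a UTF-8 BOM.
    let bytes: &[u8] = b"\xEF\xBB\xBF<?xml version=\"1.0\"?><root/>";
    let xml = read_to_string_without_bom(bytes)?;
    assert!(xml.starts_with("<?xml"));
    println!("{}", xml);
    Ok(())
}
```

Because the helper accepts any `Read`, the same function would work for a `File` as well as for the in-memory slice used here.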
@shepmaster
Owner

Hmm. The UTF-8 BOM should never have existed, and now it's causing problems.

fn main() {
    let b = [0xEF, 0xBB, 0xBF, 104, 101, 108, 108, 111];
    let s = std::str::from_utf8(&b).unwrap();
    println!("->{}<-", s);
    println!("{}", s.len());
    for c in s.chars() {
        println!("c: [{}]", c);
    }
}
->hello<-
8
c: []
c: [h]
c: [e]
c: [l]
c: [l]
c: [o]

I'd guess that the most likely solution would be to normalize the text in some fashion. Would you be able to give a crate like unicode-normalization a quick shot to see if it does anything with the BOM?

@shepmaster shepmaster added the bug label Feb 17, 2017
@therealprof
Author

I'd guess that the most likely solution would be to normalize the text in some fashion. Would you be able to give a crate like unicode-normalization a quick shot to see if it does anything with the BOM?

So I tried that, and it doesn't do a thing to the String; it happily keeps the BOM.

This little hack works though:

/* If String starts with a BOM (U+FEFF), strip it. */
/* starts_with is also safe on an empty String, where indexing byte 0 would panic. */
if data.starts_with('\u{feff}') {
    data.remove(0);
}

@amckinlay

A BOM should not occur in a string representation of Unicode text in any programming language according to the Unicode spec. The spec says that a BOM is not part of the "Unicode text", and hence should not be present in a programming language implementation of a Unicode string. This makes sense, because the byte order (and encoding) of a string is known implicitly within the programming language (there is no need for in-band signaling).

The standard states a BOM is only valid within the context of a "Unicode encoding scheme," which defines the physical bit representation of a "Unicode encoding form." A BOM is not meant to have meaning within the context of "Unicode text". When a "BOM" is encountered at the abstraction level of Unicode text, it is interpreted as a zero width non-breaking space, not a BOM, no matter where it is in the text.

Having a BOM remain in a string changes the meaning of the Unicode text, because now you technically have a zero width non-breaking space at the beginning of your string that wasn't present in the original encoded form.

What Rust is doing, I don't know. But this has security consequences for string operations like concatenation, and for any text processing libraries that do not expect to encounter a BOM (and they shouldn't have to).
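To make the concatenation concern concrete, here is a small sketch. The helper `bom_positions` is hypothetical and exists only for illustration. It shows that a string carrying a leading U+FEFF is not equal to its visible text, and that concatenation leaves the character stranded mid-string:

```rust
/// Hypothetical helper: character indices at which U+FEFF occurs in `s`.
fn bom_positions(s: &str) -> Vec<usize> {
    s.chars()
        .enumerate()
        .filter(|&(_, c)| c == '\u{feff}')
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // What you get after reading a BOM-prefixed file into a String.
    let with_bom = "\u{feff}hello";
    let plain = "hello";

    // The two strings render the same but compare as different.
    assert_ne!(with_bom, plain);
    assert_eq!(with_bom.len(), plain.len() + 3);

    // After concatenation the U+FEFF sits in the middle of the text.
    let joined = format!("{}{}", plain, with_bom);
    assert_eq!(bom_positions(&joined), vec![5]);
    println!("{:?}", bom_positions(&joined));
}
```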

But then again, that's just, like, my opinion, man. The Unicode standard section on conformance is enlightening. Sorry I got side-tracked; this really isn't an sxd-document issue. This is a defect in Rust if Unicode strings really include a BOM. Even if Rust strings are considered to take on a Unicode encoding form, like UTF-8, they should not carry a BOM. The use of a Unicode encoding scheme is really meant to be reserved for encoding Unicode text in a file or within a network protocol, when it's defined as bits without any higher-level abstraction.

If this is incoherent, just read the conformance clauses in the Unicode spec; they are much clearer. BTW, UTF-8 names both an encoding form and an encoding scheme, the former being a sequence of code unit values, and the latter being the byte encoding of those values along with a possible BOM (the encoding form of UTF-8 is trivially equivalent to the encoding scheme without the BOM, since UTF-8 is obviously insensitive to byte order). I'm gonna puke.
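One way to read the form/scheme distinction in code, as a sketch only: treat the bytes from a file as the encoding scheme, drop the optional signature at that level, and hand only the remaining code units to `str::from_utf8`. The function name `decode_utf8_scheme` is hypothetical, and the sketch assumes a reasonably recent Rust toolchain that provides `strip_prefix` on slices:

```rust
use std::str::{self, Utf8Error};

/// The UTF-8 signature (BOM) as it appears at the byte level.
const UTF8_SIGNATURE: &[u8] = &[0xEF, 0xBB, 0xBF];

/// Hypothetical helper: interprets `bytes` as the UTF-8 encoding scheme,
/// skipping an optional leading signature before validating the rest.
fn decode_utf8_scheme(bytes: &[u8]) -> Result<&str, Utf8Error> {
    let payload = bytes.strip_prefix(UTF8_SIGNATURE).unwrap_or(bytes);
    str::from_utf8(payload)
}

fn main() {
    // Same bytes as in the from_utf8 example earlier in the thread.
    let b: [u8; 8] = [0xEF, 0xBB, 0xBF, 104, 101, 108, 108, 111];
    let s = decode_utf8_scheme(&b).unwrap();
    assert_eq!(s.len(), 5);
    println!("->{}<-", s);
}
```

Under this reading the resulting `&str` never contains the signature, so a later U+FEFF in the text is an ordinary character rather than a BOM.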

@shepmaster
Owner

@amckinlay thanks for the illuminating response! :-)

It rather sounds like you should open an issue on the Rust repo. I don't know if such a change would be acceptable or not, given Rust's stability guarantees, but it seems like it's worth a shot!
