sxd-document cannot parse document containing a UTF-8 BOM #39
Comments
Hmm. The UTF-8 BOM should never have existed, and now it's causing problems.
    fn main() {
        let b = [0xEF, 0xBB, 0xBF, 104, 101, 108, 108, 111];
        let s = std::str::from_utf8(&b).unwrap();
        println!("->{}<-", s);
        println!("{}", s.len());
        for c in s.chars() {
            println!("c: [{}]", c);
        }
    }
I'd guess that the most likely solution would be to normalize the text in some fashion. Would you be able to give a crate like unicode-normalization a quick shot to see if it does anything with the BOM?
So I tried that and it doesn't do a thing to the String, which happily keeps the BOM. This little hack works though:
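The snippet posted with this comment is not preserved in this copy of the thread. As a stand-in, here is a minimal sketch of one way such a workaround might look; it is not necessarily the hack the commenter used. The helper name `strip_bom` is hypothetical, and the sketch assumes a Rust version where `str::strip_prefix` is available.

```rust
/// Hypothetical helper (not from the thread): returns `s` without a single
/// leading U+FEFF, or `s` unchanged if it does not start with one.
fn strip_bom(s: &str) -> &str {
    s.strip_prefix('\u{feff}').unwrap_or(s)
}

fn main() {
    // Same bytes as in the earlier example: UTF-8 BOM followed by "hello".
    let bytes = [0xEF, 0xBB, 0xBF, b'h', b'e', b'l', b'l', b'o'];
    let with_bom = std::str::from_utf8(&bytes).unwrap();
    let without_bom = strip_bom(with_bom);

    // The BOM occupies three bytes in UTF-8, so the length drops from 8 to 5.
    println!("{} bytes before, {} bytes after", with_bom.len(), without_bom.len());
}
```

Only one leading U+FEFF is removed, so a string that legitimately starts with a second one keeps it.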
A BOM should not occur in a string representation of Unicode text in any programming language, according to the Unicode spec. The spec says that a BOM is not part of the "Unicode text", and hence should not be present in a programming language implementation of a Unicode string. This makes sense, because the byte order (and encoding) of a string is known implicitly within the programming language (there is no need for in-band signaling).
The standard states a BOM is only valid within the context of a "Unicode encoding scheme," which defines the physical bit representation of a "Unicode encoding form." A BOM is not meant to have meaning within the context of "Unicode text". When a "BOM" is encountered at the abstraction level of Unicode text, it is interpreted as a ZERO WIDTH NO-BREAK SPACE (U+FEFF). Having a BOM remain in a string changes the meaning of the Unicode text, because now you technically have a ZERO WIDTH NO-BREAK SPACE character in the text.
What Rust is doing, I don't know. But this has security consequences for string operations like concatenation, and for any text processing libraries that do not expect to encounter a BOM (and they shouldn't have to). But then again, that's just, like, my opinion, man. The Unicode standard section on conformance is enlightening.
Sorry I got side-tracked, this really isn't an sxd-document issue. This is a defect in Rust if Unicode strings really include a BOM. Even if Rust strings are considered to take on a Unicode encoding form, like UTF-8, they should not carry a BOM. The use of a Unicode encoding scheme is really meant to be reserved for encoding Unicode text in a file or within a network protocol, when it's defined as bits without any higher-level abstraction. If this is incoherent, just read the conformance clauses in the Unicode spec; they are much clearer.
BTW, UTF-8 names both an encoding form and an encoding scheme, the former being a sequence of code unit values, and the latter being the byte encoding of those values along with a possible BOM (the encoding form of UTF-8 is trivially equivalent to the encoding scheme without the BOM, since UTF-8 is obviously insensitive to byte order). I'm gonna puke.
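To make the concatenation concern above concrete, here is a small sketch of what happens when two byte sequences that each start with a UTF-8 BOM are decoded and joined. The helper `bom_offsets` is a hypothetical name introduced only for this illustration.

```rust
/// Hypothetical helper: byte offsets of every U+FEFF in `s`.
fn bom_offsets(s: &str) -> Vec<usize> {
    s.match_indices('\u{feff}').map(|(i, _)| i).collect()
}

fn main() {
    // Two fragments, each decoded from bytes that start with a UTF-8 BOM.
    let a = String::from_utf8(vec![0xEF, 0xBB, 0xBF, b'a', b'b']).unwrap();
    let b = String::from_utf8(vec![0xEF, 0xBB, 0xBF, b'c', b'd']).unwrap();

    let joined = format!("{}{}", a, b);

    // The joined text contains the letters a, b, c, d, but does not compare
    // equal to "abcd" because both U+FEFF characters are still in it.
    println!("equal to \"abcd\": {}", joined == "abcd");

    // The second U+FEFF now sits in the middle of the text.
    println!("U+FEFF at byte offsets: {:?}", bom_offsets(&joined));
}
```

Since U+FEFF is invisible in most renderings, a comparison or search that fails for this reason can be hard to spot by looking at printed output.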
@amckinlay thanks for the illuminating response! :-) It rather sounds like you should open an issue on the Rust repo. I don't know if such a change would be acceptable or not, given Rust's stability guarantees, but it seems like it's worth a shot!
Disclaimer: I've been working with XML and UTF-8 for a long time and this is the first time I ran into such a problem so I had to do a bit of research to figure out what's going on...
So what I'm trying to do is a sort of naive approach to writing an application that reads an XML document. The problem is also reproducible using sxd-xpath/evaluate, so I'll use that for the sake of easier access, and to demonstrate the problem I'll use https://www.broadband-forum.org/cwmp/tr-069-biblio.xml.
This file uses a UTF-8 BOM, which read_to_string() gladly integrates into the resulting String, which fails the parser because it expects the beginning of the document to literally be <xml:
I'm not sure what the expected behaviour is supposed to be and do see a couple of approaches to address this particular problem:
- std automatically strip irrelevant magic from file content turned into Strings
- std provide a normalising read function
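The second approach listed above, a normalising read function, could look roughly like the sketch below. It is not an existing std or sxd-document API: the name `read_to_string_without_bom` is hypothetical, and the sketch only handles the UTF-8 BOM (bytes EF BB BF), stripping it before the data is turned into a String.

```rust
use std::io::{self, Read};

/// The three bytes that encode U+FEFF in UTF-8.
const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];

/// Hypothetical helper: reads all of `reader`, drops a leading UTF-8 BOM if
/// present, and returns the rest as a String. Invalid UTF-8 is reported as
/// an `InvalidData` I/O error.
fn read_to_string_without_bom<R: Read>(mut reader: R) -> io::Result<String> {
    let mut bytes = Vec::new();
    reader.read_to_end(&mut bytes)?;

    // Strip the BOM at the byte level, before decoding.
    let body = if bytes.starts_with(UTF8_BOM) {
        &bytes[UTF8_BOM.len()..]
    } else {
        &bytes[..]
    };

    std::str::from_utf8(body)
        .map(str::to_owned)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

fn main() -> io::Result<()> {
    // A byte slice implements `Read`, so it can stand in for a file here.
    let input: &[u8] = b"\xEF\xBB\xBF<?xml version=\"1.0\"?><root/>";
    let text = read_to_string_without_bom(input)?;
    println!("{}", text);
    Ok(())
}
```

With a real file, the same function could be called with a `std::fs::File`, and the returned String passed on to the parser.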