Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Declared entity is not recognized #258

Open
pchampin opened this issue Feb 5, 2021 · 10 comments
Open

Declared entity is not recognized #258

pchampin opened this issue Feb 5, 2021 · 10 comments
Labels
enhancement help wanted serde Issues related to mapping from Rust types to XML

Comments

@pchampin
Copy link
Contributor

pchampin commented Feb 5, 2021

When a file defines new entities into its DOCTYPE, quick-xml does not recognize these entities in attribute values or text nodes. This causes an EscapeError(UnrecognizedSymbol(...)) error.

I built a minimal example demonstrating this problem...

use quick_xml::Reader;
use quick_xml::events::Event;

const DATA: &str = r#"

    <?xml version="1.0"?>
    <!DOCTYPE test [
    <!ENTITY msg "hello world" >
    ]>
    <test label="&msg;">&msg;</test>

"#;

fn main() {
    
    let mut reader = Reader::from_str(DATA);
    reader.trim_text(true);
    
    let mut buf = Vec::new();
    
    loop {
        match reader.read_event(&mut buf) {
            Ok(Event::Start(ref e)) => {
                match e.name() {
                    b"test" => println!("attributes values: {:?}",
                                        e.attributes().map(|a| a.unwrap().unescape_and_decode_value(&reader).unwrap())
                                        .collect::<Vec<_>>()),
                    _ => (),
                }
            },
            Ok(Event::Text(ref e)) => {
                println!("text value: {}", e.unescape_and_decode(&reader).unwrap());
            },
            Ok(Event::Eof) => break,
            Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
            _ => (),
        }
    }
}

I wasn't sure if that was a dublicate of #124 because that issue is about external entities...

@pchampin
Copy link
Contributor Author

pchampin commented Feb 5, 2021

If I understand correctly, quick-xml does not intend to parse the content of the DOCTYPE, which is fair enough.

I would be happy with a (hopefully simpler) feature to add new entities to the parser. That way, I could:

  • subscribe to the DOCTYPE event
  • make my custom parsing of its content to extract only local entities (that's really just what I need)
  • inform the Reader about those entities so that it does not choke on their use in the rest of the file.

@tafia
Copy link
Owner

tafia commented Feb 8, 2021

Indeed quick-xml doesn't support parsing DOCTYPE for the moment. I'd be happy to review a PR but I believe this is not a trivial change.

One way you could work around it is to:

  • get the DocType text event and parse the entities into a hashmap
  • when getting the attributes, don't try to unescape values, instead search in your map first then only unescape (using Attribute::value field and the unsecape function))
    Alternatively if you feel like you could improve quick-xml I'd accept any change, eventually behing a feature flag if needed.

@pchampin
Copy link
Contributor Author

pchampin commented Feb 8, 2021

Great minds think alike: during the weekend, I came up with a similar solution (see #261).
The main difference is that my PR extends the API with unescape_..._with_custom_entities methods, where the user can pass the map built from parsing the DOCTYPE.

@qwandor
Copy link

qwandor commented Feb 10, 2022

I'm running into the same issue, is there any plan to fix this? Or a workaround that works when using serde deserialisation rather than the event API directly?

@pchampin
Copy link
Contributor Author

@qwandor my PR above was merged, so the workaround described above can be used.

For Serde, however, I don't know...

@Mingun Mingun added enhancement help wanted serde Issues related to mapping from Rust types to XML labels May 21, 2022
@dralley
Copy link
Collaborator

dralley commented Sep 26, 2022

For some complex situations _with_custom_entities won't be enough.

Example: as per the specificiation, entity replacement in text contents can impact XML parsing:

If the DTD contains the declaration

<!ENTITY example "<p>An ampersand (&#38;) may be escaped numerically (&#38;#38;) or with a general entity (&amp;).</p>" >

then the XML processor will recognize the character references when it parses the entity declaration, and resolve them before storing the following string as the value of the entity " example ":

<p>An ampersand (&) may be escaped numerically (&#38;) or with a general entity (&amp;).</p>

A reference in the document to " &example; " will cause the text to be reparsed, at which time the start- and end-tags of the p element will be recognized and the three references will be recognized and expanded, resulting in a p element with the following content (all data, no delimiters or markup):

An ampersand (&) may be escaped numerically (&) or with a general entity (&).

Architecturally that would be rather difficult to implement as things currently stand.

However: this same "feature" is responsible for a lot of denial of service attacks and security issues, so maybe we don't want to ever implement that.

https://en.wikipedia.org/wiki/Billion_laughs_attack
https://knowledge-base.secureflag.com/vulnerabilities/xml_injection/xml_entity_expansion_vulnerability.html

@pchampin
Copy link
Contributor Author

@dralley _with_custom_entities is not meant as a general solution to the problem, but as a workaround for some common practices in the RDF/XML (see pchampin/sophia_rs#98).

so maybe we don't want to ever implement that.

Indeed, I believe that it is one of the reasons why quick-xml does not parse the DOCTYPE in the first place.

@Mingun
Copy link
Collaborator

Mingun commented Jun 11, 2023

Probably, we want to do more robust parsing of DTD, because currently we fail to parse this (because we simply count < and > inside DOCTYPE definition to know where DTD is ended):

<!DOCTYPE e [
    <!ELEMENT e ANY>
    <!ATTLIST a>
    <!ENTITY ent '>'>
    <!NOTATION n SYSTEM '>'>
    <?pi >?>
    <!-->-->
]>
<e/>

According to the https://www.truugo.com/xml_validator/, this is valid XML document. I also created a pegjs grammar by adapting rules from https://www.w3.org/TR/xml11, and it accepts this DTD.

PEGjs DTD grammar
// DTD parser from https://www.w3.org/TR/xml11
// rule S renamed to _, S? replaced by __
doctype = '<!DOCTYPE' _ Name (_ @ExternalID)? __ ('[' @intSubset ']' __)? '>';
ExternalID
  = 'SYSTEM' _ SystemLiteral
  / 'PUBLIC' _ PubidLiteral _ SystemLiteral;
SystemLiteral
  = '"' @$[^"]* '"'
  / "'" @$[^']* "'"
  ;
PubidLiteral
  = '"' @$PubidChar* '"'
  / "'" @$(!"'" PubidChar)* "'"
PubidChar = [- \r\na-zA-Z0-9'()+,./:=?;!*#@$_%];

intSubset = (markupdecl / DeclSep)*;
markupdecl
  = elementdecl
  / AttlistDecl
  / EntityDecl
  / NotationDecl
  / PI
  / Comment
  ;
elementdecl = '<!ELEMENT' _ Name _ contentspec __ '>';
contentspec = 'EMPTY' / 'ANY' / Mixed / children;
Mixed
  = '(' __ '#PCDATA' (__ '|' __ @Name)* __ ')*'
  / '(' __ '#PCDATA' __ ')'
  ;
children = (choice / seq) ('?' / '*' / '+')?;
choice = '(' __ cp (__ '|' __ @cp)+ __ ')';
seq    = '(' __ cp (__ ',' __ @cp)* __ ')';
cp = (Name / choice / seq) ('?' / '*' / '+')?;

AttlistDecl = '<!ATTLIST' _ Name AttDef* __ '>';
AttDef = _ Name _ AttType _ DefaultDecl;
AttType = StringType / TokenizedType / EnumeratedType;
StringType = 'CDATA'
TokenizedType
  = 'IDREFS'
  / 'IDREF'
  / 'ID'
  / 'ENTITY'
  / 'ENTITIES'
  / 'NMTOKENS'
  / 'NMTOKEN'
  ;
EnumeratedType = NotationType / Enumeration;
NotationType = 'NOTATION' _ '(' __ Name (__ '|' __ @Name)* __ ')';
Enumeration = '(' __ Nmtoken (__ '|' __ @Nmtoken)* __ ')';
DefaultDecl
  = '#REQUIRED'
  / '#IMPLIED'
  / ('#FIXED' _)? AttValue
  ;
AttValue
  = '"' @([^<&"] / Reference)* '"'
  / "'" @([^<&'] / Reference)* "'"
  ;
Reference = EntityRef / CharRef;
EntityRef = '&' Name ';'
CharRef
  = '&#' $[0-9]+ ';'
  / '&#x' $[0-9a-fA-F]+ ';'
  ;

EntityDecl = GEDecl / PEDecl;
GEDecl = '<!ENTITY' _ Name _ EntityDef __ '>';
PEDecl = '<!ENTITY' _ '%' _ Name _ PEDef __ '>';
EntityDef = EntityValue / (ExternalID NDataDecl?);
PEDef = EntityValue / ExternalID;
EntityValue
  = '"' @([^%&"] / PEReference / Reference)* '"'
  / "'" @([^%&'] / PEReference / Reference)* "'"
  ;
NDataDecl = _ 'NDATA' _ Name;

NotationDecl = '<!NOTATION' _ Name _ (ExternalID / PublicID) __ '>';
PublicID = 'PUBLIC' _ PubidLiteral;

PI = '<?' PITarget (_ @$(!'?>' Char)*)? '?>';
PITarget = !'xml'i @Name;

Comment = '<!--' $((!'-' Char) / ('-' (!'-' Char)))* '-->';

DeclSep = PEReference / _;
PEReference = '%' Name ';';

_ = $[ \t\r\n]+;
__ = $[ \t\r\n]*;

Name = $(NameStartChar NameChar*);
Nmtoken = $NameChar+;

Char = [\u0001-\uD7FF\uE000-\uFFFD] // / [\u10000-\u10FFFF];
NameStartChar = [:A-Za-z_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD] // / [\u10000-\uEFFFF]
NameChar = NameStartChar / [-.0-9] / '\xB7' / [\u0300-\u036F] / [\u203F-\u2040];

@melissaboiko
Copy link

Related to dealing with files that break parsing because of this issue:

Is there a way to decode a BytesText into an str without unescaping at all?

@Mingun
Copy link
Collaborator

Mingun commented May 3, 2024

Yes, BytesText implements Deref<Target = [u8]>, so it can be used in contexts where &[u8] is expected, specifically you can decode this text. Alternatively, you can extract inner Cow<[u8]> using into_inner method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement help wanted serde Issues related to mapping from Rust types to XML
Projects
None yet
Development

No branches or pull requests

6 participants