Declared entity is not recognized #258

pchampin · 2021-02-05T09:28:17Z

When a file defines new entities into its DOCTYPE, quick-xml does not recognize these entities in attribute values or text nodes. This causes an EscapeError(UnrecognizedSymbol(...)) error.

I built a minimal example demonstrating this problem...

use quick_xml::Reader;
use quick_xml::events::Event;

const DATA: &str = r#"

    <?xml version="1.0"?>
    <!DOCTYPE test [
    <!ENTITY msg "hello world" >
    ]>
    <test label="&msg;">&msg;</test>

"#;

fn main() {
    
    let mut reader = Reader::from_str(DATA);
    reader.trim_text(true);
    
    let mut buf = Vec::new();
    
    loop {
        match reader.read_event(&mut buf) {
            Ok(Event::Start(ref e)) => {
                match e.name() {
                    b"test" => println!("attributes values: {:?}",
                                        e.attributes().map(|a| a.unwrap().unescape_and_decode_value(&reader).unwrap())
                                        .collect::<Vec<_>>()),
                    _ => (),
                }
            },
            Ok(Event::Text(ref e)) => {
                println!("text value: {}", e.unescape_and_decode(&reader).unwrap());
            },
            Ok(Event::Eof) => break,
            Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
            _ => (),
        }
    }
}

I wasn't sure if that was a dublicate of #124 because that issue is about external entities...

The text was updated successfully, but these errors were encountered:

pchampin · 2021-02-05T14:44:21Z

If I understand correctly, quick-xml does not intend to parse the content of the DOCTYPE, which is fair enough.

I would be happy with a (hopefully simpler) feature to add new entities to the parser. That way, I could:

subscribe to the DOCTYPE event
make my custom parsing of its content to extract only local entities (that's really just what I need)
inform the Reader about those entities so that it does not choke on their use in the rest of the file.

tafia · 2021-02-08T04:55:12Z

Indeed quick-xml doesn't support parsing DOCTYPE for the moment. I'd be happy to review a PR but I believe this is not a trivial change.

One way you could work around it is to:

get the DocType text event and parse the entities into a hashmap
when getting the attributes, don't try to unescape values, instead search in your map first then only unescape (using Attribute::value field and the unsecape function))
Alternatively if you feel like you could improve quick-xml I'd accept any change, eventually behing a feature flag if needed.

pchampin · 2021-02-08T09:17:03Z

Great minds think alike: during the weekend, I came up with a similar solution (see #261).
The main difference is that my PR extends the API with unescape_..._with_custom_entities methods, where the user can pass the map built from parsing the DOCTYPE.

qwandor · 2022-02-10T22:42:06Z

I'm running into the same issue, is there any plan to fix this? Or a workaround that works when using serde deserialisation rather than the event API directly?

pchampin · 2022-02-11T13:48:19Z

@qwandor my PR above was merged, so the workaround described above can be used.

For Serde, however, I don't know...

dralley · 2022-09-26T02:42:08Z

For some complex situations _with_custom_entities won't be enough.

Example: as per the specificiation, entity replacement in text contents can impact XML parsing:

If the DTD contains the declaration

<!ENTITY example "<p>An ampersand (&) may be escaped numerically (&#38;) or with a general entity (&).</p>" >

then the XML processor will recognize the character references when it parses the entity declaration, and resolve them before storing the following string as the value of the entity " example ":

<p>An ampersand (&) may be escaped numerically (&) or with a general entity (&).</p>

A reference in the document to " &example; " will cause the text to be reparsed, at which time the start- and end-tags of the p element will be recognized and the three references will be recognized and expanded, resulting in a p element with the following content (all data, no delimiters or markup):

An ampersand (&) may be escaped numerically (&) or with a general entity (&).

Architecturally that would be rather difficult to implement as things currently stand.

However: this same "feature" is responsible for a lot of denial of service attacks and security issues, so maybe we don't want to ever implement that.

https://en.wikipedia.org/wiki/Billion_laughs_attack
https://knowledge-base.secureflag.com/vulnerabilities/xml_injection/xml_entity_expansion_vulnerability.html

pchampin · 2022-09-26T12:33:26Z

@dralley _with_custom_entities is not meant as a general solution to the problem, but as a workaround for some common practices in the RDF/XML (see pchampin/sophia_rs#98).

so maybe we don't want to ever implement that.

Indeed, I believe that it is one of the reasons why quick-xml does not parse the DOCTYPE in the first place.

Mingun · 2023-06-11T20:08:35Z

Probably, we want to do more robust parsing of DTD, because currently we fail to parse this (because we simply count < and > inside DOCTYPE definition to know where DTD is ended):

<!DOCTYPE e [
    <!ELEMENT e ANY>
    <!ATTLIST a>
    <!ENTITY ent '>'>
    <!NOTATION n SYSTEM '>'>
    <?pi >?>
    <!-->-->
]>
<e/>

According to the https://www.truugo.com/xml_validator/, this is valid XML document. I also created a pegjs grammar by adapting rules from https://www.w3.org/TR/xml11, and it accepts this DTD.

PEGjs DTD grammar

// DTD parser from https://www.w3.org/TR/xml11
// rule S renamed to _, S? replaced by __
doctype = '<!DOCTYPE' _ Name (_ @ExternalID)? __ ('[' @intSubset ']' __)? '>';
ExternalID
  = 'SYSTEM' _ SystemLiteral
  / 'PUBLIC' _ PubidLiteral _ SystemLiteral;
SystemLiteral
  = '"' @$[^"]* '"'
  / "'" @$[^']* "'"
  ;
PubidLiteral
  = '"' @$PubidChar* '"'
  / "'" @$(!"'" PubidChar)* "'"
PubidChar = [- \r\na-zA-Z0-9'()+,./:=?;!*#@$_%];

intSubset = (markupdecl / DeclSep)*;
markupdecl
  = elementdecl
  / AttlistDecl
  / EntityDecl
  / NotationDecl
  / PI
  / Comment
  ;
elementdecl = '<!ELEMENT' _ Name _ contentspec __ '>';
contentspec = 'EMPTY' / 'ANY' / Mixed / children;
Mixed
  = '(' __ '#PCDATA' (__ '|' __ @Name)* __ ')*'
  / '(' __ '#PCDATA' __ ')'
  ;
children = (choice / seq) ('?' / '*' / '+')?;
choice = '(' __ cp (__ '|' __ @cp)+ __ ')';
seq    = '(' __ cp (__ ',' __ @cp)* __ ')';
cp = (Name / choice / seq) ('?' / '*' / '+')?;

AttlistDecl = '<!ATTLIST' _ Name AttDef* __ '>';
AttDef = _ Name _ AttType _ DefaultDecl;
AttType = StringType / TokenizedType / EnumeratedType;
StringType = 'CDATA'
TokenizedType
  = 'IDREFS'
  / 'IDREF'
  / 'ID'
  / 'ENTITY'
  / 'ENTITIES'
  / 'NMTOKENS'
  / 'NMTOKEN'
  ;
EnumeratedType = NotationType / Enumeration;
NotationType = 'NOTATION' _ '(' __ Name (__ '|' __ @Name)* __ ')';
Enumeration = '(' __ Nmtoken (__ '|' __ @Nmtoken)* __ ')';
DefaultDecl
  = '#REQUIRED'
  / '#IMPLIED'
  / ('#FIXED' _)? AttValue
  ;
AttValue
  = '"' @([^<&"] / Reference)* '"'
  / "'" @([^<&'] / Reference)* "'"
  ;
Reference = EntityRef / CharRef;
EntityRef = '&' Name ';'
CharRef
  = '&#' $[0-9]+ ';'
  / '&#x' $[0-9a-fA-F]+ ';'
  ;

EntityDecl = GEDecl / PEDecl;
GEDecl = '<!ENTITY' _ Name _ EntityDef __ '>';
PEDecl = '<!ENTITY' _ '%' _ Name _ PEDef __ '>';
EntityDef = EntityValue / (ExternalID NDataDecl?);
PEDef = EntityValue / ExternalID;
EntityValue
  = '"' @([^%&"] / PEReference / Reference)* '"'
  / "'" @([^%&'] / PEReference / Reference)* "'"
  ;
NDataDecl = _ 'NDATA' _ Name;

NotationDecl = '<!NOTATION' _ Name _ (ExternalID / PublicID) __ '>';
PublicID = 'PUBLIC' _ PubidLiteral;

PI = '<?' PITarget (_ @$(!'?>' Char)*)? '?>';
PITarget = !'xml'i @Name;

Comment = '<!--' $((!'-' Char) / ('-' (!'-' Char)))* '-->';

DeclSep = PEReference / _;
PEReference = '%' Name ';';

_ = $[ \t\r\n]+;
__ = $[ \t\r\n]*;

Name = $(NameStartChar NameChar*);
Nmtoken = $NameChar+;

Char = [\u0001-\uD7FF\uE000-\uFFFD] // / [\u10000-\u10FFFF];
NameStartChar = [:A-Za-z_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD] // / [\u10000-\uEFFFF]
NameChar = NameStartChar / [-.0-9] / '\xB7' / [\u0300-\u036F] / [\u203F-\u2040];

melissaboiko · 2024-05-02T16:30:12Z

Related to dealing with files that break parsing because of this issue:

Is there a way to decode a BytesText into an str without unescaping at all?

Mingun · 2024-05-03T14:10:59Z

Yes, BytesText implements Deref<Target = [u8]>, so it can be used in contexts where &[u8] is expected, specifically you can decode this text. Alternatively, you can extract inner Cow<[u8]> using into_inner method.

pchampin mentioned this issue Feb 5, 2021

RDFXML parser fails on xsd entities pchampin/sophia_rs#98

Closed

Mingun added enhancement help wanted serde Issues related to mapping from Rust types to XML labels May 21, 2022

Mingun mentioned this issue Jun 26, 2023

Fails on empty comments without 4 -" #618

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Declared entity is not recognized #258

Declared entity is not recognized #258

pchampin commented Feb 5, 2021

pchampin commented Feb 5, 2021 •

edited

Loading

tafia commented Feb 8, 2021

pchampin commented Feb 8, 2021

qwandor commented Feb 10, 2022

pchampin commented Feb 11, 2022

dralley commented Sep 26, 2022

pchampin commented Sep 26, 2022

Mingun commented Jun 11, 2023 •

edited

Loading

melissaboiko commented May 2, 2024

Mingun commented May 3, 2024

Declared entity is not recognized #258

Declared entity is not recognized #258

Comments

pchampin commented Feb 5, 2021

pchampin commented Feb 5, 2021 • edited Loading

tafia commented Feb 8, 2021

pchampin commented Feb 8, 2021

qwandor commented Feb 10, 2022

pchampin commented Feb 11, 2022

dralley commented Sep 26, 2022

pchampin commented Sep 26, 2022

Mingun commented Jun 11, 2023 • edited Loading

melissaboiko commented May 2, 2024

Mingun commented May 3, 2024

pchampin commented Feb 5, 2021 •

edited

Loading

Mingun commented Jun 11, 2023 •

edited

Loading