Declared entity is not recognized #258
Comments
If I understand correctly, quick-xml does not intend to parse the content of the DOCTYPE, which is fair enough. I would be happy with a (hopefully simpler) feature to add new entities to the parser. That way, I could:
|
Indeed quick-xml doesn't support parsing DOCTYPE for the moment. I'd be happy to review a PR but I believe this is not a trivial change. One way you could work around it is to:
|
Great minds think alike: during the weekend, I came up with a similar solution (see #261). |
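To make the "add new entities to the parser" idea from the comments above concrete, here is a minimal standalone sketch. It does not use quick-xml's API and is not what #261 implements; `unescape_with_custom` and the entity table are hypothetical names, and character references (`&#...;`) are deliberately left out.

```rust
use std::collections::HashMap;

/// Hypothetical helper: replaces `&name;` references using the five predefined
/// XML entities plus a caller-supplied table. Numeric character references
/// are not handled in this sketch.
fn unescape_with_custom(raw: &str, custom: &HashMap<&str, &str>) -> Result<String, String> {
    let mut out = String::with_capacity(raw.len());
    let mut rest = raw;
    while let Some(amp) = rest.find('&') {
        // Copy everything before the reference unchanged.
        out.push_str(&rest[..amp]);
        let after = &rest[amp + 1..];
        let semi = after
            .find(';')
            .ok_or_else(|| format!("unterminated reference at: {}", after))?;
        let name = &after[..semi];
        let value = match name {
            "lt" => "<",
            "gt" => ">",
            "amp" => "&",
            "apos" => "'",
            "quot" => "\"",
            // Anything else must have been registered by the caller.
            other => custom
                .get(other)
                .copied()
                .ok_or_else(|| format!("unrecognized entity: {}", other))?,
        };
        out.push_str(value);
        rest = &after[semi + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

The caller would fill the table from whatever it extracted from the DOCTYPE (or from a hard-coded list) and pass it in; an entity missing from the table is still reported as an error rather than silently passed through.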
I'm running into the same issue. Is there any plan to fix this, or a workaround that works when using serde deserialisation rather than the event API directly? |
For some complex situations. Example: as per the specification, entity replacement in text contents can impact XML parsing:
Architecturally that would be rather difficult to implement as things currently stand. However: this same "feature" is responsible for a lot of denial of service attacks and security issues, so maybe we don't want to ever implement that. https://en.wikipedia.org/wiki/Billion_laughs_attack |
@dralley
Indeed, I believe that it is one of the reasons why quick-xml does not parse the DOCTYPE in the first place. |
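For anyone wondering what a guard against the Billion laughs problem could look like if entity expansion were ever supported: the sketch below computes the expanded size of an entity and gives up once a cap is exceeded. It is a standalone illustration, not quick-xml code; `expanded_len`, the `limit` and `depth` parameters, and the entity table are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical helper: returns how many bytes `name` would expand to if every
/// nested `&ref;` were replaced recursively. Fails once the running total
/// exceeds `limit`, or when `depth` levels of nesting are used up (which also
/// catches self-referencing entities).
fn expanded_len(
    name: &str,
    entities: &HashMap<&str, &str>,
    limit: usize,
    depth: usize,
) -> Result<usize, String> {
    if depth == 0 {
        return Err(format!("nesting too deep at entity `{}`", name));
    }
    let body = entities
        .get(name)
        .ok_or_else(|| format!("undefined entity `{}`", name))?;
    let mut total = 0usize;
    let mut rest = *body;
    while let Some(amp) = rest.find('&') {
        total += amp; // literal bytes before the reference
        let after = &rest[amp + 1..];
        let semi = after
            .find(';')
            .ok_or_else(|| "unterminated reference".to_string())?;
        total += expanded_len(&after[..semi], entities, limit, depth - 1)?;
        if total > limit {
            return Err(format!("entity `{}` expands past {} bytes", name, limit));
        }
        rest = &after[semi + 1..];
    }
    total += rest.len();
    if total > limit {
        return Err(format!("entity `{}` expands past {} bytes", name, limit));
    }
    Ok(total)
}
```

This only bounds the size of the output; a real implementation would probably also want to cache per-entity sizes so that repeated references are not recomputed.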
Probably, we want to do more robust parsing of DTD, because currently we fail to parse this (because we simply count < and >):
<!DOCTYPE e [
<!ELEMENT e ANY>
<!ATTLIST a>
<!ENTITY ent '>'>
<!NOTATION n SYSTEM '>'>
<?pi >?>
<!-->-->
]>
<e/>
According to https://www.truugo.com/xml_validator/, this is a valid XML document. I also created a PEG.js grammar by adapting rules from https://www.w3.org/TR/xml11, and it accepts this DTD.
PEGjs DTD grammar
// DTD parser from https://www.w3.org/TR/xml11
// rule S renamed to _, S? replaced by __
doctype = '<!DOCTYPE' _ Name (_ @ExternalID)? __ ('[' @intSubset ']' __)? '>';
ExternalID
= 'SYSTEM' _ SystemLiteral
/ 'PUBLIC' _ PubidLiteral _ SystemLiteral;
SystemLiteral
= '"' @$[^"]* '"'
/ "'" @$[^']* "'"
;
PubidLiteral
= '"' @$PubidChar* '"'
/ "'" @$(!"'" PubidChar)* "'"
PubidChar = [- \r\na-zA-Z0-9'()+,./:=?;!*#@$_%];
intSubset = (markupdecl / DeclSep)*;
markupdecl
= elementdecl
/ AttlistDecl
/ EntityDecl
/ NotationDecl
/ PI
/ Comment
;
elementdecl = '<!ELEMENT' _ Name _ contentspec __ '>';
contentspec = 'EMPTY' / 'ANY' / Mixed / children;
Mixed
= '(' __ '#PCDATA' (__ '|' __ @Name)* __ ')*'
/ '(' __ '#PCDATA' __ ')'
;
children = (choice / seq) ('?' / '*' / '+')?;
choice = '(' __ cp (__ '|' __ @cp)+ __ ')';
seq = '(' __ cp (__ ',' __ @cp)* __ ')';
cp = (Name / choice / seq) ('?' / '*' / '+')?;
AttlistDecl = '<!ATTLIST' _ Name AttDef* __ '>';
AttDef = _ Name _ AttType _ DefaultDecl;
AttType = StringType / TokenizedType / EnumeratedType;
StringType = 'CDATA'
TokenizedType
= 'IDREFS'
/ 'IDREF'
/ 'ID'
/ 'ENTITY'
/ 'ENTITIES'
/ 'NMTOKENS'
/ 'NMTOKEN'
;
EnumeratedType = NotationType / Enumeration;
NotationType = 'NOTATION' _ '(' __ Name (__ '|' __ @Name)* __ ')';
Enumeration = '(' __ Nmtoken (__ '|' __ @Nmtoken)* __ ')';
DefaultDecl
= '#REQUIRED'
/ '#IMPLIED'
/ ('#FIXED' _)? AttValue
;
AttValue
= '"' @([^<&"] / Reference)* '"'
/ "'" @([^<&'] / Reference)* "'"
;
Reference = EntityRef / CharRef;
EntityRef = '&' Name ';'
CharRef
= '&#' $[0-9]+ ';'
/ '&#x' $[0-9a-fA-F]+ ';'
;
EntityDecl = GEDecl / PEDecl;
GEDecl = '<!ENTITY' _ Name _ EntityDef __ '>';
PEDecl = '<!ENTITY' _ '%' _ Name _ PEDef __ '>';
EntityDef = EntityValue / (ExternalID NDataDecl?);
PEDef = EntityValue / ExternalID;
EntityValue
= '"' @([^%&"] / PEReference / Reference)* '"'
/ "'" @([^%&'] / PEReference / Reference)* "'"
;
NDataDecl = _ 'NDATA' _ Name;
NotationDecl = '<!NOTATION' _ Name _ (ExternalID / PublicID) __ '>';
PublicID = 'PUBLIC' _ PubidLiteral;
PI = '<?' PITarget (_ @$(!'?>' Char)*)? '?>';
PITarget = !'xml'i @Name;
Comment = '<!--' $((!'-' Char) / ('-' (!'-' Char)))* '-->';
DeclSep = PEReference / _;
PEReference = '%' Name ';';
_ = $[ \t\r\n]+;
__ = $[ \t\r\n]*;
Name = $(NameStartChar NameChar*);
Nmtoken = $NameChar+;
Char = [\u0001-\uD7FF\uE000-\uFFFD] // / [\u10000-\u10FFFF];
NameStartChar = [:A-Za-z_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD] // / [\u10000-\uEFFFF]
NameChar = NameStartChar / [-.0-9] / '\xB7' / [\u0300-\u036F] / [\u203F-\u2040];
|
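A much smaller step than a full DTD grammar would be to keep counting angle brackets but skip over the places where a stray `>` is legal: quoted literals, comments, and processing instructions. The sketch below does that for the DOCTYPE example earlier in this thread. It is a standalone illustration under that assumption, not quick-xml's implementation, and `doctype_end` and `find` are hypothetical names.

```rust
/// Returns the offset of the first occurrence of `needle` in `hay`.
/// `needle` must not be empty.
fn find(hay: &[u8], needle: &[u8]) -> Option<usize> {
    hay.windows(needle.len()).position(|w| w == needle)
}

/// Hypothetical helper: given input that starts at `<!DOCTYPE`, returns the
/// byte index just past the `>` that closes the declaration, or `None` if the
/// input ends before the declaration is closed.
fn doctype_end(input: &str) -> Option<usize> {
    let bytes = input.as_bytes();
    let mut depth = 0usize; // number of currently open `<...` declarations
    let mut i = 0;
    while i < bytes.len() {
        let rest = &bytes[i..];
        if rest.starts_with(b"<!--") {
            // Comment: jump past the closing `-->`.
            i += 4 + find(&rest[4..], b"-->")? + 3;
        } else if rest.starts_with(b"<?") {
            // Processing instruction: jump past the closing `?>`.
            i += 2 + find(&rest[2..], b"?>")? + 2;
        } else {
            match bytes[i] {
                // Quoted literal: move to the matching closing quote.
                q @ (b'"' | b'\'') => {
                    i += 1 + rest[1..].iter().position(|&b| b == q)?;
                }
                b'<' => depth += 1,
                b'>' => {
                    depth = depth.checked_sub(1)?;
                    if depth == 0 {
                        return Some(i + 1);
                    }
                }
                _ => {}
            }
            i += 1;
        }
    }
    None
}
```

This does not validate the DTD at all; it only locates where the DOCTYPE ends so that the rest of the document can be parsed.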
Related to dealing with files that break parsing because of this issue: is there a way to decode a BytesText into a str without unescaping at all? |
Yes, |
When a file defines new entities in its DOCTYPE, quick-xml does not recognize these entities in attribute values or text nodes. This causes an EscapeError(UnrecognizedSymbol(...)) error.
I built a minimal example demonstrating this problem...
I wasn't sure if that was a duplicate of #124 because that issue is about external entities...