Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse internal entity declarations in internal DTD #367

Closed
wants to merge 9 commits into from
24 changes: 21 additions & 3 deletions lib/dom-parser.js
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,12 @@ DOMParser.prototype.parseFromString = function(source,mimeType){
var locator = options.locator;
var defaultNSMap = options.xmlns||{};
var isHTML = /\/x?html?$/.test(mimeType);//mimeType.toLowerCase().indexOf('html') > -1;
var entityMap = isHTML ? entities.HTML_ENTITIES : entities.XML_ENTITIES;
var entityMap = isHTML ? entities.HTML_ENTITIES : entities.XML_ENTITIES;
// Using prototypical inheritance to avoid copying nedlessly
function EntityMap() {}
EntityMap.prototype = entityMap;
entityMap = new EntityMap();
domBuilder.entityMap = entityMap;
if(locator){
domBuilder.setDocumentLocator(locator)
}
Expand Down Expand Up @@ -243,6 +248,20 @@ DOMHandler.prototype = {
this.doc.doctype = dt;
}
},

/**
* If the same entity is declared more than once, the first declaration
* encountered is binding, and a warning is issued.
* @link https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-entity-decl
*/
internalEntityDecl:function(name, value) {
if (name in this.entityMap) {
this.warning("entity '"+ name +"' has already been declared");
return;
}
this.entityMap[name] = value;
Copy link
Member

@karfau karfau Jan 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking out loud without checking any spec:

  • Is it reasonable to allow overriding existing declarations?

The Name identifies the entity in an entity reference or, in the case of an unparsed entity, in the value of an ENTITY or ENTITIES attribute. If the same entity is declared more than once, the first declaration encountered is binding; at user option, an XML processor may issue a warning if entities are declared multiple times.

https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-entity-decl

  • What about an entity with the name prototype?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question and spec reference! Since the default entityMaps defined in lib/entities.js are frozen (using Object.freeze from lib/conventions.js), this kind of works right now; but only the predefined entities, not if repeating a declaration.

So I added an explicit check and redeclaration warning in 311ffc3.

The prototype is defined on the type, not in the prototype, so an undefined 'prototype' key returns undefined (and a &prototype; entity can be defined in an entity declaration).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the prototypical inheritance is an old manoeuvre which ought to be efficient. But at least moving it out to a utility could be wise. I did consider using Object.assign here, but AFAIK that is an ES6 feature. (If a future overhaul decides to use e.g. hasOwnProperty tests, or replace the entityMap with an ES6 Map here, this would of course have to be reworked.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the default entityMaps defined in lib/entities.js are frozen (using Object.freeze from lib/conventions.js)

Please be aware it will not be frozen in a runtime where the Object.freeze method is not available. So thank you for adding that check.

I think it would be good to add a test for that, and also the prototype entity, just to be sure it doesn't break in the future.

Copy link
Member

@karfau karfau Jan 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the prototypical inheritance is an old manoeuvre which ought to be efficient. But at least moving it out to a utility could be wise. I did consider using Object.assign here, but AFAIK that is an ES6 feature. (If a future overhaul decides to use e.g. hasOwnProperty tests, or replace the entityMap with an ES6 Map here, this would of course have to be reworked.)

I just played with this approach a bit and after checking how the entityMap is currently used to replace entities filed the bug #370.
So once that bug is fixed your current solution no longer works.

I assume we will not get around copying the entities. (I wouldn't want to touch the sax parser more then absolutely required right now, since we have plans to get rid of it: #55.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mentioned bug was now fixed and the fix was released as 0.8.1.

Please update your branch (I can also do that if you prefer), after which I assume some tests would fail. The solution should be to copy the entitymap.
Since we can not just rely on Object.assign I will provide something similar as part of lib/conventions.js in the next days (in a separate PR).

Copy link
Member

@karfau karfau Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has been done as part of #379

},

/**
* @see org.xml.sax.ErrorHandler
* @link http://www.saxproject.org/apidoc/org/xml/sax/ErrorHandler.html
Expand Down Expand Up @@ -293,7 +312,6 @@ function _toString(chars,start,length){
* #attributeDecl(eName, aName, type, mode, value)
* #elementDecl(name, model)
* #externalEntityDecl(name, publicId, systemId)
* #internalEntityDecl(name, value)
* @link http://www.saxproject.org/apidoc/org/xml/sax/ext/EntityResolver2.html
* IGNORED method of org.xml.sax.EntityResolver2
* #resolveEntity(String name,String publicId,String baseURI,String systemId)
Expand All @@ -304,7 +322,7 @@ function _toString(chars,start,length){
* #notationDecl(name, publicId, systemId) {};
* #unparsedEntityDecl(name, publicId, systemId, notationName) {};
*/
"endDTD,startEntity,endEntity,attributeDecl,elementDecl,externalEntityDecl,internalEntityDecl,resolveEntity,getExternalSubset,notationDecl,unparsedEntityDecl".replace(/\w+/g,function(key){
"endDTD,startEntity,endEntity,attributeDecl,elementDecl,externalEntityDecl,resolveEntity,getExternalSubset,notationDecl,unparsedEntityDecl".replace(/\w+/g,function(key){
DOMHandler.prototype[key] = function(){return null}
})

Expand Down
50 changes: 46 additions & 4 deletions lib/sax.js
Original file line number Diff line number Diff line change
Expand Up @@ -78,7 +78,7 @@ function parse(source,defaultNSMapCopy,entityMap,domBuilder,errorHandler){
if(end>start){
var xt = source.substring(start,end).replace(/&#?\w+;/g,entityReplacer);
locator&&position(start);
domBuilder.characters(xt,0,end-start);
domBuilder.characters(xt,0,xt.length);
start = end
}
}
Expand Down Expand Up @@ -554,6 +554,38 @@ function parseDCC(source,start,domBuilder,errorHandler){//sure start with '<!'
domBuilder.endCDATA()
return end+3;
}
if(source.substr(start+2,6) == 'ENTITY'){
var end = source.indexOf('>', start+8);
var chunk = source.substring(start+8, end);
var match = chunk.match(/^\s+(?:(%)\s+)?(\S+)\s+(["'])/);
if (!match) {
// NOTE: Ignoring unhandled forms of entity declarations
return -1;
}
var declStart = start+8+match[0].length;
var name = match[2];
var delim = match[3];
var delimEnd = source.indexOf(delim, declStart);
if (delimEnd > end) {
end = source.indexOf('>', delimEnd);
}
if (match[1] === '%') {
// NOTE: Ignoring the PEDef form of entity declaration after forwarding
// to the declaratino end.
return end;
}
Comment on lines +560 to +576
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For landing this PR there are two things I would like to have added regarding this:

  • add errorHandler.warn for the unsupported cases
  • unit tests for "all" the supported and unsupported "type of cases" we can think of.
    I assume we should at least cover all the different cases that are present in node_modules/xmltest/xmltest.zip, I can support you in collecting/unifying them if required.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm short on time at the moment alas, so I'd appreciate some pointers for collecting those cases, and/or suggestions for where to add new tests (even if that is just a link to some contribution guideline I may have missed).

I also agree that warnings are prudent for the range of unsupported entity forms. I just want to know how to balance that with the current "lenient/silent on unsupported" approach in other parts of this code. "Leave things in a better state than when entered" is perfectly viable here, I mostly want to know where to set the bar. (There is the risk of needing to handle the nested DOCTYPE state just to emit warnings. I think we can dodge that, to avoid increasing the SAX parser complexity if it is slated for a bigger overhaul or replacement.)

Copy link
Member

@karfau karfau Feb 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started looking into the (number of) cases we need to test vs support.
Even though I'm not done yet, I start to have a feeling of what we are talking about.
I will dig more into "expanding" the top level of intSubset, it's so crazy what is allowed in there...

Note:

Regarding the !DOCTYPE tag:

/// already supported
<!DOCTYPE `Name``S?`>
<!DOCTYPE `Name` SYSTEM "`[^"]*`"`S?`>
<!DOCTYPE `Name` SYSTEM '`[^']*`'`S?`>
<!DOCTYPE `Name` PUBLIC "`PubidChar*`" "`[^"]*`"`S?`>
<!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' "`[^"]*`"`S?`>
<!DOCTYPE `Name` PUBLIC "`PubidChar*`" '`[^']*`'`S?`>
<!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' '`[^']*`'`S?`>
/// newly supported
<!DOCTYPE `Name``S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` SYSTEM "`[^"]*`"`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` SYSTEM '`[^']*`'`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` PUBLIC "`PubidChar*`" "`[^"]*`"`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' "`[^"]*`"`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` PUBLIC "`PubidChar*`" '`[^']*`'`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' '`[^']*`'`S?`[`intSubset`]`S?`>

regarding the !ENTITY declarations:

/// supported 
<!ENTITY `Name` "`([^%&"] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`"`S?`>
<!ENTITY `Name` '`([^%&'] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`'`S?`>
  // with every option just once:
  <!ENTITY `Name` ""`S?`> || <!ENTITY `Name` ''`S?`>
  <!ENTITY `Name` "`[^%&"]*`"`S?`> || <!ENTITY `Name` '`[^%&"]*`'`S?`>
  <!ENTITY `Name` "%`Name`;"`S?`> || <!ENTITY `Name` '%`Name`;'`S?`>
  <!ENTITY `Name` "&`Name`;"`S?`> || <!ENTITY `Name` '&`Name`;'`S?`>
  <!ENTITY `Name` "&#`[0-9]+`;"`S?`> || <!ENTITY `Name` '&#`[0-9]+`;'`S?`>
  <!ENTITY `Name` "&#x`[0-9a-fA-F]+`;"`S?`> || <!ENTITY `Name` '&#x`[0-9a-fA-F]+`;'`S?`>

/// unsupported
<!ENTITY `Name` SYSTEM "`[^"]*`"`S?`>
<!ENTITY `Name` SYSTEM ''`[^']*`'`S?`>
<!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^"]*`"`S?`>
<!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^"]*`"`S?`>
<!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^']*`'`S?`>
<!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^']*`'`S?`>
<!ENTITY `Name` SYSTEM "`[^"]*`" NDATA `Name``S?`>
<!ENTITY `Name` SYSTEM "`[^']*`' NDATA `Name``S?`>
<!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^"]*`" NDATA `Name``S?`>
<!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^"]*`" NDATA `Name``S?`>
<!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^']*`' NDATA `Name``S?`>
<!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^']*`' NDATA `Name``S?`>
/// aka Parameter entities
<!ENTITY % `Name` "`([^%&"] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`"`S?`>
<!ENTITY % `Name` '`([^%&'] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`'`S?`>
  // with every option just once:
  <!ENTITY % `Name` ""`S?`> || <!ENTITY % `Name` ''`S?`>
  <!ENTITY % `Name` "`[^%&"]*`"`S?`> || <!ENTITY % `Name` '`[^%&"]*`'`S?`>
  <!ENTITY % `Name` "%`Name`;"`S?`> || <!ENTITY % `Name` '%`Name`;'`S?`>
  <!ENTITY % `Name` "&`Name`;"`S?`> || <!ENTITY % `Name` '&`Name`;'`S?`>
  <!ENTITY % `Name` "&#`[0-9]+`;"`S?`> || <!ENTITY % `Name` '&#`[0-9]+`;'`S?`>
  <!ENTITY % `Name` "&#x`[0-9a-fA-F]+`;"`S?`> || <!ENTITY % `Name` '&#x`[0-9a-fA-F]+`;'`S?`>
<!ENTITY % `Name` SYSTEM "`[^"]*`"`S?`>
<!ENTITY % `Name` SYSTEM '`[^']*`'`S?`>
<!ENTITY % `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^"]*`"`S?`>
<!ENTITY % `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^"]*`"`S?`>
<!ENTITY % `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^']*`'`S?`>
<!ENTITY % `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^']*`'`S?`>

Copy link
Member

@karfau karfau Feb 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think that https://github.com/xmldom/xmldom/blob/master/test/parse/doctype.test.js is a perfect place for those additional tests.

I assume that I will soon be able to provide some samples of what I have in mind.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just invested at least another hour into doing the same for intSubset, this is sooo crazy I don't want to go any deeper. I'm now quite confident that we have to pick the following strategy for anything unsupported (like the whole [intSubset] was before: xmldom should continue to fail hard when anything unsupported / unexpected appears.

I also think that the current implementation doesn't really support everything I marked as supported in the ENTITY part above, but rather the following is supported:

<!ENTITY `Name` "`([^%&"] | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`"`S?`>
<!ENTITY `Name` '`([^%&'] | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`'`S?`>
  // with every option just once:
  <!ENTITY `Name` ""`S?`> || <!ENTITY `Name` ''`S?`>
  <!ENTITY `Name` "`[^%&"]*`"`S?`> || <!ENTITY `Name` '`[^%&"]*`'`S?`>
  <!ENTITY `Name` "&#`[0-9]+`;"`S?`> || <!ENTITY `Name` '&#`[0-9]+`;'`S?`>
  <!ENTITY `Name` "&#x`[0-9a-fA-F]+`;"`S?`> || <!ENTITY `Name` '&#x`[0-9a-fA-F]+`;'`S?`>

or maybe even just

<!ENTITY `Name` "`([^%&"])*`"`S?`>
<!ENTITY `Name` '`([^%&'])*`'`S?`>
  // with every option just once:
  <!ENTITY `Name` ""`S?`> || <!ENTITY `Name` ''`S?`>
  <!ENTITY `Name` "`[^%&"]*`"`S?`> || <!ENTITY `Name` '`[^%&"]*`'`S?`>

and everything else fails as before.

And if we don't find a reliable way to know when the intSubset/ DOCTYPE ends when doing that, I would argue this could even be a fatalError / ParseError which throws by default.

What do you think?

var value = source.substring(declStart, delimEnd);
// NOTE: This value is not further processed, but treated as PCDATA.
// If recursive processing is implemented, ENSURE to avoid an XML Bomb
// attack vector!
domBuilder.internalEntityDecl(name, value);
var nextTagStart = source.indexOf('<', end);
var dtdEnd = source.indexOf(']>', end);
if (dtdEnd < nextTagStart) {
domBuilder.endDTD();
}
Comment on lines +577 to +586
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found the following section of the spec:
https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-predefined-ent

All XML processors must recognize these entities whether they are declared or not. For interoperability, valid XML documents should declare these entities, like any others, before using them. If the entities lt or amp are declared, they must be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is required for these entities so that references to them produce a well-formed result. If the entities gt, apos, or quot are declared, they must be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is optional but harmless). For example:

<!ENTITY lt     "&#38;#60;">
<!ENTITY gt     "&#62;">
<!ENTITY amp    "&#38;#38;">
<!ENTITY apos   "&#39;">
<!ENTITY quot   "&#34;">

Does the current implementation support this? (We should add a test if it's not already there.)
I didn't look into the tests yet, but I wonder if it could make sense to at least do a single replacement run as part of domBuilder.internalEntityDecl since it has access to the existing entityMap. Or would that already open up the "can of worms"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, no, this specific recommended re-declaration isn't technically supported (and there is apparently no test for it). Of course these are predefined so they are already available (and I cannot recall seeing this re-declaration in practise, at least not internally in documents).

As I read this, it seems that while re-declarations are allowed to issue warnings, these specific characters (being the XML_ENTITIES) are allowed to be re-declared by these exact values. That could be hard-coded, I guess, now or later on, to avoid the then improper warning if this is done.

And yes, I believe doing a replacement run on the entity values could open up that can. The Billion laughs attack doesn't rely upon recursion, but upon exponential growth, which this would enable. It should then be combined with e.g. monitoring of value growth, such as by carefully counting expansions, as libxml2 does with its entity reference loop detection (and corresponding --huge option to xmllint).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for elaborating on this topic, very much appreciated.
After reading the Wikipedia article it became obvious.

I fully agree that there is no need for implementing this "exception" of the "first definition is binding" rule for XML entities, since the only allowed option is to redefine them to the same value, so in the worst case there is a warning about this not being supported, but the values still being correct.

I just started to think about (X)HTML and entities.

  • XHTML both get's the html entities and is otherwise working like XML, so that shouldn't be a problem.
  • Since the living HTML spec clearly defines what the doctype can contain:
    https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
    we should check what the browser does if it is provided a doctype with entities and the html mime type, and act accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my confusion, currently xmldom basically only knows about XHTML and treats HTML in the "same way".
So this will be part of #338

return end+2;
}
//<!DOCTYPE
//startDTD(java.lang.String name, java.lang.String publicId, java.lang.String systemId)
var matchs = split(source,start);
Expand All @@ -570,11 +602,21 @@ function parseDCC(source,start,domBuilder,errorHandler){//sure start with '<!'
sysid = matchs[3][0];
}
}
var lastMatch = matchs[len-1]
domBuilder.startDTD(name, pubid, sysid);
domBuilder.endDTD();

var hasInternalDTD = matchs[2][0] === '[';
// NOTE: Currently only handles entity declarations and not the full
// [internal DTD subset](https://www.w3.org/TR/xml/#NT-doctypedecl).
if (hasInternalDTD) {
// NOTE: endDTD must be called by the last internal subset item.
return matchs[2].index;
} else {
domBuilder.endDTD();
}

return lastMatch.index+lastMatch[0].length
var lastMatch = matchs[len-1];
var end = lastMatch.index+lastMatch[0].length;
return end;
}
}
return -1;
Expand Down
8 changes: 2 additions & 6 deletions test/xmltest/__snapshots__/not-wf.test.js.snap
Original file line number Diff line number Diff line change
Expand Up @@ -439,7 +439,7 @@ Object {

exports[`xmltest/not-wellformed standalone should match 055.xml with snapshot 1`] = `
Object {
"actual": "<!DOCTYPE doc>",
"actual": "<!DOCTYPE doc><doc/>",
}
`;

Expand Down Expand Up @@ -1213,11 +1213,7 @@ Object {

exports[`xmltest/not-wellformed standalone should match 159.xml with snapshot 1`] = `
Object {
"actual": "<!DOCTYPE doc><doc>&amp;e;</doc>",
"error": Array [
"[xmldom error] entity not found:&e;
@#[line:5,col:1]",
],
"actual": "<!DOCTYPE doc><doc>&lt;![CDATA[Tim &amp; Michael]]&gt;</doc>",
}
`;

Expand Down