-
-
Notifications
You must be signed in to change notification settings - Fork 9.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DomCrawler] Support classes from the new DOM extension #54383
base: 7.2
Are you sure you want to change the base?
Conversation
a326cbc
to
414de14
Compare
*/ | ||
interface CrawlerInterface extends \Countable, \IteratorAggregate | ||
{ | ||
public function add(\DOMNodeList|\DOMNode|array|string|null $node): void; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this an issue for compat with the new extension ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand where the problem is. But I'm not sure to know how to solve this. We still want to have this interface to avoid self-deprecations notices, right? Not sure how to deal with it in a better way 🤔
throw new \InvalidArgumentException('You cannot use both external and native parsers at the same time.'); | ||
} | ||
|
||
if ($this->options & self::CRAWLER_ENABLE_HTML5_PARSING && !($this->options & self::CRAWLER_USE_EXTERNAL_PARSER) && !($this->options & self::CRAWLER_USE_NATIVE_PARSER)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the parsing behavior for HTML different between the native parser of PHP 8.4 and masterminds/html5 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the new native parser follows specifications strictly.
This is why there is a "separate test suite" for it, because many dataset of the current tests are not "valid strict HTML": unclosed tags, unneeded slashes to close some tags... many cases where the new parser throws at parsing. This is why I opted for an opt-in update. Devs will have to fully test their app with this new parser before using it extensively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The masterminds/html5 parser is also supposed to follow the spec.
And currently, the opt-in we use to switch between the libxml parser and the HTML5 parser is the HTML5 doctype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a snippet comparing ext-dom and masterminds/html5:
require __DIR__.'/vendor/autoload.php';
$html = '<!DOCTYPE html><html><base href="https://base.com/"><form method="post" action="link/"><button type="submit" name="submit" /></form></html>';
$parser = new \Masterminds\HTML5();
$oldDom = $parser->loadHTML($html);
dump($oldDom->childNodes->item(1)->childNodes->item(1)->tagName);
$newDom = \DOM\HTMLDocument::createFromString($html);
dump($newDom->childNodes->item(1)->childNodes->item(1)->tagName);
The output is:
"form"
PHP Warning: DOM\HTMLDocument::createFromString(): tree error unexpected-element-in-open-elements-stack in Entity, line: 1, column: 128-131 in /Users/alex/PhpstormProjects/symfony/test.php on line 11
Warning: DOM\HTMLDocument::createFromString(): tree error unexpected-element-in-open-elements-stack in Entity, line: 1, column: 128-131 in /Users/alex/PhpstormProjects/symfony/test.php on line 11
"BODY"
Note that ext-dom also throws a warning because of the button
tag being closed with />
instead of </button>
. Because of these differences, I think it is safer to have the new parser opt-in. Things are likely to break when upgrading to ext-dom in existing code bases.
src/Symfony/Component/DomCrawler/Test/Constraint/CrawlerSelectorExists.php
Outdated
Show resolved
Hide resolved
36b8982
to
cc6945b
Compare
(Commented on the GBK test case too, but I know that GH notifications can be weird so just pointing it out so the comment isn't lost). The intention with <root>
<child1 xmlns:a="urn:a">
<a:nestedchild1/>
</child1>
<child1 xmlns:a="urn:not_a">
<a:nestedchild2/>
</child1>
</root> That's why the default Anyway, to answer the question of "why doesn't it work?", I wrote about this in the following two threads: veewee/xml#74 (comment) and nielsdos/php-src#93 (comment) |
cc6945b
to
c7408a9
Compare
Thanks @nielsdos for this complete explanation! 🚀 I think waiting for nielsdos/php-src#93 would help a lot in what's missing now in this PR. In the meantime, I'll have a look at what's possible with the current API. |
c7408a9
to
54b3814
Compare
54b3814
to
8a5f8b7
Compare
https://wiki.php.net/rfc/dom_additions_84 is under discussion which contains the features you need to finish this, as well as some extras such as css selector support. I'd like to point out one thing that may bite later about this PR: you're using the NO_DEFAULT_NS compatibility flag. This flag exists to ease migration and won't put elements in their proper namespace, it will put them in no namespace to be exact. Presumably you do this because your CSS selector translation does not take into account namespaces? Note that not having the proper namespace set on elements has quite a few subtle consequences. Depending on if something is inside the html namespace the behaviour of some methods alter (as defined by spec). For example automatic lowercasing of attribute names will not happen with the compatibility flag on because the element holding the attribute is no longer in the html namespace. (See https://dom.spec.whatwg.org/#concept-element-attributes-get-by-name step 1 for example) |
Amazing, thank you for keeping me updated! I'll have a look at the |
8a5f8b7
to
f27f754
Compare
f27f754
to
ae36c49
Compare
A few minutes ago I merged the final changes to the DOM extension. This includes the methods to get the in scope and descendant namespaces. Unfortunately though this barely missed alpha 1. |
A little bit of explanation on some choices that I made during development (feedback welcome!):
DomCrawler
constructor. Indeed, the new parser strictly follows the specification. This comes with behavior changes and I think using the new parser should be opt-in.FormTest::testGetValues
andFormTest::testGetFiles
: thetemplate
tag insideturbo-stream
messes with the new parser and fields are not found. They are when thetemplate
tag is removed. I can't figure out why right now, help appreciated 🙂\Symfony\Component\DomCrawler\Crawler::discoverNamespace
work with the new XPath, making the three tests about namespace fail. Here also, help much appreciated! 🙏I also added a few comments in the code to explain some choices I made.