Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DomCrawler] Support classes from the new DOM extension #54383

Open
wants to merge 1 commit into
base: 7.2
Choose a base branch
from

Conversation

alexandre-daubois
Copy link
Contributor

@alexandre-daubois alexandre-daubois commented Mar 23, 2024

Q A
Branch? 7.2
Bug fix? no
New feature? yes
Deprecations? yes
Issues Fix #53666
License MIT

A little bit of explanation on some choices that I made during development (feedback welcome!):

  • I went for the flag solution in DomCrawler constructor. Indeed, the new parser strictly follows the specification. This comes with behavior changes and I think using the new parser should be opt-in.
  • FormTest::testGetValues and FormTest::testGetFiles: the template tag inside turbo-stream messes with the new parser and fields are not found. They are when the template tag is removed. I can't figure out why right now, help appreciated 🙂
  • I was not able to make \Symfony\Component\DomCrawler\Crawler::discoverNamespace work with the new XPath, making the three tests about namespace fail. Here also, help much appreciated! 🙏

I also added a few comments in the code to explain some choices I made.

@alexandre-daubois alexandre-daubois force-pushed the dom-crawler-84 branch 3 times, most recently from a326cbc to 414de14 Compare March 29, 2024 09:41
src/Symfony/Component/DomCrawler/AbstractUriElement.php Outdated Show resolved Hide resolved
*/
interface CrawlerInterface extends \Countable, \IteratorAggregate
{
public function add(\DOMNodeList|\DOMNode|array|string|null $node): void;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this an issue for compat with the new extension ?

Copy link
Contributor Author

@alexandre-daubois alexandre-daubois Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand where the problem is. But I'm not sure to know how to solve this. We still want to have this interface to avoid self-deprecations notices, right? Not sure how to deal with it in a better way 🤔

src/Symfony/Component/DomCrawler/DomCrawler.php Outdated Show resolved Hide resolved
throw new \InvalidArgumentException('You cannot use both external and native parsers at the same time.');
}

if ($this->options & self::CRAWLER_ENABLE_HTML5_PARSING && !($this->options & self::CRAWLER_USE_EXTERNAL_PARSER) && !($this->options & self::CRAWLER_USE_NATIVE_PARSER)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the parsing behavior for HTML different between the native parser of PHP 8.4 and masterminds/html5 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the new native parser follows specifications strictly.

This is why there is a "separate test suite" for it, because many dataset of the current tests are not "valid strict HTML": unclosed tags, unneeded slashes to close some tags... many cases where the new parser throws at parsing. This is why I opted for an opt-in update. Devs will have to fully test their app with this new parser before using it extensively.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The masterminds/html5 parser is also supposed to follow the spec.
And currently, the opt-in we use to switch between the libxml parser and the HTML5 parser is the HTML5 doctype.

Copy link
Contributor Author

@alexandre-daubois alexandre-daubois Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a snippet comparing ext-dom and masterminds/html5:

require __DIR__.'/vendor/autoload.php';

$html = '<!DOCTYPE html><html><base href="https://base.com/"><form method="post" action="link/"><button type="submit" name="submit" /></form></html>';

$parser = new \Masterminds\HTML5();
$oldDom = $parser->loadHTML($html);
dump($oldDom->childNodes->item(1)->childNodes->item(1)->tagName);

$newDom = \DOM\HTMLDocument::createFromString($html);
dump($newDom->childNodes->item(1)->childNodes->item(1)->tagName);

The output is:

"form"
PHP Warning:  DOM\HTMLDocument::createFromString(): tree error unexpected-element-in-open-elements-stack in Entity, line: 1, column: 128-131 in /Users/alex/PhpstormProjects/symfony/test.php on line 11

Warning: DOM\HTMLDocument::createFromString(): tree error unexpected-element-in-open-elements-stack in Entity, line: 1, column: 128-131 in /Users/alex/PhpstormProjects/symfony/test.php on line 11
"BODY"

Note that ext-dom also throws a warning because of the button tag being closed with /> instead of </button>. Because of these differences, I think it is safer to have the new parser opt-in. Things are likely to break when upgrading to ext-dom in existing code bases.

src/Symfony/Component/DomCrawler/DomCrawler.php Outdated Show resolved Hide resolved
src/Symfony/Component/DomCrawler/Crawler.php Show resolved Hide resolved
src/Symfony/Component/DomCrawler/Tests/ImageTest.php Outdated Show resolved Hide resolved
@javiereguiluz javiereguiluz added the ❄️ Feature Freeze Important Pull Requests to finish before the next Symfony "feature freeze" label Apr 2, 2024
@alexandre-daubois alexandre-daubois force-pushed the dom-crawler-84 branch 6 times, most recently from 36b8982 to cc6945b Compare April 4, 2024 08:13
@nielsdos
Copy link

nielsdos commented Apr 4, 2024

(Commented on the GBK test case too, but I know that GH notifications can be weird so just pointing it out so the comment isn't lost).

The intention with discoverNamespace is to check what namespace URI corresponds to which namespace prefix, correct?
In that case, the current code (even for the "old DOM") doesn't make much sense to me to be honest.
Namespaces are scoped, but you seem to want to list all namespaces in the document, and only keep the last one of that list. So that means on a document like this, only nestedchild2 can be returned because the last namespace uri of prefix a is urn:not_a:

<root>
  <child1 xmlns:a="urn:a">
    <a:nestedchild1/>
  </child1>
  <child1 xmlns:a="urn:not_a">
    <a:nestedchild2/>
  </child1>
</root>

That's why the default registerNodeNamespaces property of (DOM\|DOM)XPath is scoped to the context node.
If it were scoped to a context node, you could use the builtin lookupNamespaceURI function of DOM\Element to get the uri from a prefix.

Anyway, to answer the question of "why doesn't it work?", I wrote about this in the following two threads: veewee/xml#74 (comment) and nielsdos/php-src#93 (comment)
The TL;DR is that it's not defined by spec what should be returned for the namespace axis query. Modern implementations such as browsers just return the empty list, so I did too. Reintroducing a return value of DomNameSpaceNode is wrong imo because it's not a node, and has caused many problems with confusion for users and confusion for static analysis before. There are potential replacement functions on the way in a final (yet-to-be-written) DOM RFC for 8.4.

@alexandre-daubois
Copy link
Contributor Author

Thanks @nielsdos for this complete explanation! 🚀

I think waiting for nielsdos/php-src#93 would help a lot in what's missing now in this PR. In the meantime, I'll have a look at what's possible with the current API.

@fabpot fabpot removed the ❄️ Feature Freeze Important Pull Requests to finish before the next Symfony "feature freeze" label May 2, 2024
@fabpot fabpot modified the milestones: 7.1, 7.2 May 2, 2024
@nielsdos
Copy link

nielsdos commented Jun 4, 2024

https://wiki.php.net/rfc/dom_additions_84 is under discussion which contains the features you need to finish this, as well as some extras such as css selector support.

I'd like to point out one thing that may bite later about this PR: you're using the NO_DEFAULT_NS compatibility flag. This flag exists to ease migration and won't put elements in their proper namespace, it will put them in no namespace to be exact. Presumably you do this because your CSS selector translation does not take into account namespaces? Note that not having the proper namespace set on elements has quite a few subtle consequences. Depending on if something is inside the html namespace the behaviour of some methods alter (as defined by spec). For example automatic lowercasing of attribute names will not happen with the compatibility flag on because the element holding the attribute is no longer in the html namespace. (See https://dom.spec.whatwg.org/#concept-element-attributes-get-by-name step 1 for example)

@alexandre-daubois
Copy link
Contributor Author

Amazing, thank you for keeping me updated! I'll have a look at the NO_DEFAULT_NS thing. I remember adding it to be consistent with the old behavior, but as the new parser would be opt-in, maybe this type of break could be introduced. I'll have a closer look at it. Good luck with your RFC, no doubt it will go smoothly!

@nielsdos
Copy link

nielsdos commented Jul 4, 2024

A few minutes ago I merged the final changes to the DOM extension. This includes the methods to get the in scope and descendant namespaces. Unfortunately though this barely missed alpha 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Leverage PHP 8.4's native parsing of HTML5
6 participants