[DomCrawler] Optionally use html5-php to parse HTML #29306

Merged
merged 1 commit into from Apr 3, 2019

@tgalopin
Member

commented Nov 24, 2018

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | yes
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | WIP
| Fixed tickets | #29280, #28596
| License       | MIT
| Doc PR        | symfony/symfony-docs#10700

This PR introduces the possibility to parse HTML content in the Crawler using the html5-php library (https://github.com/Masterminds/html5-php). This allows for better support of HTML5 and fixes many unexpected behaviors and inconsistencies of the native DOM extension.
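
For illustration only, here is a minimal sketch of parsing with html5-php directly. It is not the Crawler integration from this PR. It assumes `masterminds/html5` is installed through Composer and that the autoloader lives at `vendor/autoload.php`; the sample markup is made up.

```php
<?php
// Assumption: `composer require masterminds/html5` was run in this directory.
require __DIR__.'/vendor/autoload.php';

use Masterminds\HTML5;

// Made-up sample document that uses HTML5-only elements.
$html = '<!DOCTYPE html><html><body><main><article>Hello</article></main></body></html>';

// 'disable_html_ns' is the option that appears in this PR's diff. With it,
// elements are created without the XHTML namespace, so un-prefixed XPath
// expressions such as //article can match them.
$html5 = new HTML5(['disable_html_ns' => true]);
$dom = $html5->loadHTML($html); // a regular \DOMDocument

$xpath = new DOMXPath($dom);
$articles = $xpath->query('//article');

echo $articles->length, PHP_EOL;
```

Because the result is a plain `\DOMDocument`, the rest of a DOM-based pipeline (XPath, node traversal) can stay unchanged; only the parsing step differs.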

@nicolas-grekas nicolas-grekas added this to the next milestone Nov 24, 2018

@stof

Member

commented Nov 24, 2018

As the native implementation uses validateOnParse, I think your alternative implementation needs to check $html5->hasErrors() and throw based on $html5->getErrors() too. Otherwise, parse errors might go unnoticed.
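
A minimal sketch of the check suggested above, assuming `masterminds/html5` is installed through Composer. The helper name `parseHtml5Strict` and the choice of `InvalidArgumentException` are hypothetical and only serve the illustration; they are not taken from the PR.

```php
<?php
// Assumption: `composer require masterminds/html5` was run in this directory.
require __DIR__.'/vendor/autoload.php';

use Masterminds\HTML5;

/**
 * Hypothetical helper: parses with html5-php and rejects input that
 * produced parse errors instead of silently accepting it.
 */
function parseHtml5Strict(string $html): DOMDocument
{
    $html5 = new HTML5(['disable_html_ns' => true]);
    $dom = $html5->loadHTML($html);

    if ($html5->hasErrors()) {
        // getErrors() returns the messages collected during the last parse.
        throw new InvalidArgumentException(implode("\n", $html5->getErrors()));
    }

    return $dom;
}
```

Whether throwing is the right reaction is a design decision: HTML5 parsing is lenient by design, so a caller may prefer to log the messages rather than abort.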

@tgalopin tgalopin force-pushed the tgalopin:html5-parser branch from 5e439d7 to d0420c3 Dec 8, 2018

@fabpot

Member

commented Feb 21, 2019

@tgalopin What's the status of this PR?

@tgalopin

Member Author

commented Feb 21, 2019

Waiting for Masterminds/html5-php#163 to be merged to pass tests here.

@fabpot

Member

commented Mar 4, 2019

@tgalopin Upstream PR merged :)

@stof

Member

commented Mar 28, 2019

Due to Masterminds/html5-php#139, shouldn't we use the saveHTML of the HTML5 library instead of the native one when we use the HTML5 parser to parse the DOM? That means we also need to remember whether the DOM was created by the HTML5 parser.
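
To make the two serializers concrete, here is a minimal sketch that assumes `masterminds/html5` is installed through Composer. The sample markup and variable names are made up, and the sketch does not claim that the two outputs differ for this particular input.

```php
<?php
// Assumption: `composer require masterminds/html5` was run in this directory.
require __DIR__.'/vendor/autoload.php';

use Masterminds\HTML5;

$html5 = new HTML5(['disable_html_ns' => true]);
$dom = $html5->loadHTML('<!DOCTYPE html><html><body><p>Hi</p></body></html>');
$body = $dom->getElementsByTagName('body')->item(0);

// Serializer of the library that built the tree.
$fromHtml5 = $html5->saveHTML($body);

// Native libxml serializer, applied to the same node.
$fromNative = $dom->saveHTML($body);

echo $fromHtml5, PHP_EOL, $fromNative, PHP_EOL;
```

Keeping a reference to the `HTML5` instance that parsed the document is one way to "remember" which serializer to use later.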

@fabpot

Member

commented Mar 31, 2019

@tgalopin friendly ping

@tgalopin tgalopin force-pushed the tgalopin:html5-parser branch 3 times, most recently from e21e17a to 14a454d Mar 31, 2019

@tgalopin tgalopin changed the title [DomCrawler][WIP] Optionally use html5-php to parse HTML [DomCrawler] Optionally use html5-php to parse HTML Mar 31, 2019

@tgalopin

Member Author

commented Apr 3, 2019

Tests are failing for an unrelated reason. I think this is ready to review.

@tgalopin

Member Author

commented Apr 3, 2019

Updated

@tgalopin tgalopin force-pushed the tgalopin:html5-parser branch 2 times, most recently from 3e61e24 to e0ca69a Apr 3, 2019

}
/**
* Convert charset to HTML-entities to ensure valid parsing.

@fabpot

fabpot Apr 3, 2019

Member

Converts

@fabpot

fabpot approved these changes Apr 3, 2019

@fabpot fabpot force-pushed the tgalopin:html5-parser branch from e0ca69a to 4050ec4 Apr 3, 2019

@fabpot

Member

commented Apr 3, 2019

Thank you @tgalopin.

@fabpot fabpot merged commit 4050ec4 into symfony:master Apr 3, 2019

1 of 3 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
fabbot.io Your code looks good.
Details

fabpot added a commit that referenced this pull request Apr 3, 2019

feature #29306 [DomCrawler] Optionally use html5-php to parse HTML (t…
…galopin)

This PR was squashed before being merged into the 4.3-dev branch (closes #29306).

Discussion
----------

[DomCrawler] Optionally use html5-php to parse HTML

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | yes
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | WIP
| Fixed tickets | #29280, #28596
| License       | MIT
| Doc PR        | symfony/symfony-docs#10700

This PR introduces the possibility to parse HTML content in the Crawler using the html5-php library (https://github.com/Masterminds/html5-php). This allows for better support of HTML5 and fixes many unexpected behaviors and inconsistencies of the native DOM extension.

Commits
-------

4050ec4 [DomCrawler] Optionally use html5-php to parse HTML

@tgalopin tgalopin deleted the tgalopin:html5-parser branch Apr 3, 2019

"masterminds/html5": "^2.6"
},
"conflict": {
"masterminds/html5": "<2.6"

@stof

stof Apr 3, 2019

Member

We should also conflict with > 3 then

@@ -608,6 +601,15 @@ public function html(/* $default = null */)
throw new \InvalidArgumentException('The current node list is empty.');
}
if (null !== $this->html5Parser) {

@stof

stof Apr 3, 2019

Member

There is an issue here. You instantiate the HTML5 parser in the constructor even when the content added is not HTML5 but XML or existing DOM elements (coming from elsewhere than a parent crawler using HTML5). This means you might be saving with the HTML5 parser when it was not used for parsing.

@tgalopin

tgalopin Apr 3, 2019

Author Member

How do you propose to improve this?

@stof

stof Apr 3, 2019

Member

well, we need to distinguish 3 cases:

  • we are parsing some HTML5
  • we are parsing some older HTML
  • we are not parsing HTML at all

The boolean argument in the constructor allows us to decide between the first 2 cases at the time we instantiate. But whether the content is HTML or not is not something the constructor knows (as content can be added later).

The solution might be to store the boolean property. Then, based on that, we would decide which parsing strategy to use if we load HTML and instantiate the HTML5 parser if needed.
Then, here, we can keep saying "if I used an HTML5 parser, I also use it for saving".

And for subcrawlers, we copy the content of the private property.
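
The proposal above can be sketched roughly as follows. `MiniCrawler` and all of its methods are hypothetical and written only for this illustration; this is not Symfony's `Crawler`. The sketch assumes `masterminds/html5` is installed through Composer.

```php
<?php
// Assumption: `composer require masterminds/html5` was run in this directory.
require __DIR__.'/vendor/autoload.php';

use Masterminds\HTML5;

// Hypothetical class for illustration only.
final class MiniCrawler
{
    /** @var bool preference given at construction time */
    private $useHtml5Parser;
    /** @var HTML5|null set only once HTML was really parsed with html5-php */
    private $html5Parser;
    /** @var DOMDocument|null */
    private $document;

    public function __construct(bool $useHtml5Parser)
    {
        $this->useHtml5Parser = $useHtml5Parser;
    }

    public function addHtml(string $html): void
    {
        if ($this->useHtml5Parser) {
            // Case 1: HTML5 content, parser instantiated only when needed.
            $this->html5Parser = new HTML5(['disable_html_ns' => true]);
            $this->document = $this->html5Parser->loadHTML($html);

            return;
        }

        // Case 2: older HTML, parsed by the native DOM extension.
        $this->html5Parser = null;
        $this->document = new DOMDocument();
        $previous = libxml_use_internal_errors(true);
        $this->document->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }

    public function addXml(string $xml): void
    {
        // Case 3: not HTML at all, so the HTML5 parser is never involved.
        $this->html5Parser = null;
        $this->document = new DOMDocument();
        $this->document->loadXML($xml);
    }

    public function usesHtml5Serializer(): bool
    {
        return null !== $this->html5Parser;
    }

    public function html(): string
    {
        $node = $this->document->documentElement;

        // "If I used an HTML5 parser, I also use it for saving."
        return null !== $this->html5Parser
            ? $this->html5Parser->saveHTML($node)
            : $this->document->saveHTML($node);
    }

    public function createSubCrawler(): self
    {
        // Copy the private state instead of guessing again.
        $sub = new self($this->useHtml5Parser);
        $sub->html5Parser = $this->html5Parser;
        $sub->document = $this->document;

        return $sub;
    }
}
```

The point of the design is that the serializer choice follows what actually happened at parse time, not what was requested in the constructor.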

}
if ($useHtml5Parser ?? class_exists(HTML5::class)) {
$this->html5Parser = new HTML5(['disable_html_ns' => true]);

@stof

stof Apr 3, 2019

Member

When creating a child crawler, you should not rely on guessing but pass the existing value used for the parsing (or even better, assign the actual parser instead of instantiating a new one).

@tgalopin

tgalopin Apr 3, 2019

Author Member

You mean in the createSubCrawler method?

@stof

stof Apr 3, 2019

Member

yes

@stof

Member

commented Apr 3, 2019

Using a constructor argument has a big drawback (but the previous implementation using a setter that must be called before loading the content has the same drawback): most people in Symfony don't instantiate a Crawler themselves. They use BrowserKit which manages this instantiation. This means they don't have direct access to anything happening before adding content.

javiereguiluz added a commit to symfony/symfony-docs that referenced this pull request Apr 5, 2019

minor #10700 [DomCrawler][WIP] Add note about the HTML5 parser librar…
…y (tgalopin)

This PR was merged into the master branch.

Discussion
----------

[DomCrawler][WIP] Add note about the HTML5 parser library

Documentation for the PR symfony/symfony#29306.

Commits
-------

6e2f04a [DomCrawler] Add note about the HTML5 parser library

fabpot added a commit that referenced this pull request Apr 6, 2019

feature #30892 [DomCrawler] Improve Crawler HTML5 parser need detecti…
…on (tgalopin)

This PR was merged into the 4.3-dev branch.

Discussion
----------

[DomCrawler] Improve Crawler HTML5 parser need detection

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | kind of
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | -
| License       | MIT
| Doc PR        | -

Live from #eu-fossa

Follow up of #29306

This PR introduces a better detection mechanism to choose whether to parse using the HTML5 parser, and fixes a subcrawler parsing issue as well.

@stof I'd be super interested by your review :) !

Commits
-------

9bbdab6 [DomCrawler] Improve Crawler HTML5 parser need detection
"masterminds/html5": "^2.6"
},
"conflict": {
"masterminds/html5": "<2.6"
},
"suggest": {
"symfony/css-selector": ""

@apfelbox

apfelbox Apr 8, 2019

Contributor

Shouldn't there be an entry here, that describes that you can load masterminds/html5?

@stof

stof Apr 8, 2019

Member

indeed, that would make sense.

@fabpot

fabpot Apr 8, 2019

Member

I'm against adding things under suggest, nobody reads them anyway. I would even go as far as removing the existing entries :)

@phoenixgao

phoenixgao May 1, 2019

I always read them!

@nicolas-grekas nicolas-grekas modified the milestones: next, 4.3 Apr 30, 2019

@fabpot fabpot referenced this pull request May 9, 2019

Merged

Release v4.3.0-BETA1 #31435
