New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally use html5-php to parse HTML in DomCrawler #29280

Open
stof opened this Issue Nov 22, 2018 · 10 comments

Comments

Projects
None yet
6 participants
@stof
Member

stof commented Nov 22, 2018

We got lots of complains in the past about the fact that the Crawler does not properly parse HTML5 documents, when the HTML5 shortcut features are being used.
The root cause is that DomDocument::loadHtml does not rely on an HTML5 parser, and so does not understand all these specificities of the spec.

A solution could be to add an optional dependency on https://github.com/Masterminds/html5-php, which is a userland implementation of an HTML5 parser, which still returns a DomDocument object after parsing (and so everything else would work fine after that).

I'm not sure whether the usage of html5-php should be turned on just by the presence of that library, or should require an explicit opt-in:

  • the explicit opt-in might be painful for users of the BrowserKit component, because BrowserKit handles the creation of the Crawler (and so it might require to have an opt-in in BrowserKit as well to control the DomCrawler opt-in)
  • automatically using html5-php might be bad for people not needing the HTML5 features for the page they load, as a userland implementation is probably slower than DomDocument::loadHtml which is implemented in C (that's only a guess; I haven't done any benchmarking)

@stof stof added this to the next milestone Nov 22, 2018

@fabpot

This comment has been minimized.

Member

fabpot commented Nov 22, 2018

That's a great idea. I would enable it automatically if the library is available (performance is quite good and was improved a lot thanks to @tgalopin's recent work).

@tgalopin

This comment has been minimized.

Contributor

tgalopin commented Nov 22, 2018

I would be happy to do this PR. The library is already used by Drupal, so I think it's safe to say it well tested.

@goetas

This comment has been minimized.

Contributor

goetas commented Nov 22, 2018

a userland implementation is probably slower than DomDocument::loadHtml which is implemented in C (that's only a guess; I haven't done any benchmarking)

Masterminds/html5-php v2.4.0 is 8x slower than the C extension (files used for the benchmark https://github.com/Masterminds/html5-php/tree/cadcfaaa13153e0e8eb92c49a53e140cf1a85dea/test/benchmark ~400kb html file used)

@nicolas-grekas

This comment has been minimized.

Member

nicolas-grekas commented Nov 22, 2018

Would a process like: try {native} catch(something) {html5-parser} work?

@tgalopin

This comment has been minimized.

Contributor

tgalopin commented Nov 22, 2018

IIRC, the native DOM extension does emit warnings when things are not parsed correctly. However, detecting when this is due to HTML5 and not to a real problem may be tricky. I like this try/catch idea though, it's just that errors may come from different sources for the same code.

@stof

This comment has been minimized.

Member

stof commented Nov 22, 2018

@tgalopin or @goetas do you have a benchmarch comparing html5-php to DomDocument::loadHtml on an input compatible with DomDocument::loadHtml ? I'm wondering what the ratio is.
Edit: should have reloaded the page before commenting... the answer is above.

But anyway, I'm also in favor of automatically enabling it as it makes things much simpler.

@Pierstoval

This comment has been minimized.

Contributor

Pierstoval commented Nov 22, 2018

Won't there be a slight BC break if the current parsing implementation changes when the new lib is enabled?

Of course the DomCrawler doesn't really implement HTML5 properly, but what about parsed strings that are not HTML5 but might still be expected by the user?

Shouldn't it be opt-in instead?

@tgalopin

This comment has been minimized.

Contributor

tgalopin commented Nov 22, 2018

Not being able to parss HTML5 tags is more a bug than something to rely on IMO. Fixing a bug is not a BC break, if people relied on the bug in their code, I think that's their responsibility.

@stof

This comment has been minimized.

Member

stof commented Nov 22, 2018

@Pierstoval a solution might be to opt-in for html5-php only if we have the HTML5 Doctype in the HTML, and keep using the native implementation otherwise. But is it worth it ?

@stof

This comment has been minimized.

Member

stof commented Nov 24, 2018

Note that the current state of performance (8x slower than the C loading) can still be improved for the HTML5 library (for instance, I just opened a PR with a 9% improvement). So I think we're good on using the userland parsing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment