Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Incorrect HTML parsing, omitted closing tags #482

Closed
jankuca opened this Issue Aug 16, 2012 · 10 comments

Comments

Projects
None yet
5 participants

jankuca commented Aug 16, 2012

The parser should take some HTML principles into account.

There are several situations in which you can omit closing tags of some elements.

<p>First paragraph.
<p>Second paragraph.

This is valid HTML5 code and should be parsed as

<p>First paragraph.</p>
<p>Second paragraph.</p>

The parser instead nests the second paragraph into the former one as

<p>First paragraph.
  <p>Second paragraph.</p>
</p>

This is a major bug in my opinion.

jankuca commented Aug 16, 2012

I see that this is actually rather a problem of a dependency of this project.

I fixed the bug and requested a pull to the dependency (node-htmlparser): tautologistics/node-htmlparser#54

Collaborator

domenic commented Oct 10, 2012

I've let @tautologistics know I'm happy to help maintain node-htmlparser; hopefully we can get this pulled in.

@tautologistics seems to have gone offline, wouldn't it be possible to set up your own up-to-date fork and change the package dependency to that, at least for the time being?

Collaborator

domenic commented Dec 26, 2012

We're also considering moving to a different html5 parser, either the one from dom.js per #354, or @aredridel's html5. You can currently run the tests with a different parser, so if anyone gets one working that passes at least as many tests as current, plus fixes a bug or two, I'm ready to switch.

Contributor

aredridel commented Dec 27, 2012

Yeah, the html5 parser should handle this better. node-htmlparser is raw speed but not careful at all.

Is it possible to use the HTML5 parser with jsdom.env? I'm making calls with {config:{parser:require('html5')}}, but I'm not seeing any difference from the normal parser (specifically, <p> tags aren't closing the previous unclosed <p> tag).

Collaborator

domenic commented Dec 28, 2012

@stuartpb try this? Not sure why it'd be necessary, but this is what the tests use:

var browser = require("../lib/jsdom/browser/index");
browser.setDefaultParser(require('html5'));

@godmar may know more about why simply using config doesn't work.

Contributor

aredridel commented Dec 28, 2012

I'd love to see jsdom's setup routines simplified to fix things like this ;-)

Collaborator

domenic commented Dec 28, 2012

Pull requests welcome from people who know more about what they want and how jsdom setup works than I do :)

Contributor

godmar commented Dec 28, 2012

FWIW, it took some trickery to get the tests to work with html5. From what I recall, one problematic issue is that lib/jsdom.js passes an export from jsdom/browser/index onto its own exports:

https://github.com/tmpvar/jsdom-oscon/blob/master/node_modules/jsdom/lib/jsdom.js#L10-11

If jsdom/browser/index is only partially initialized when lib/jsdom.js is sourced, this will pick up undefineds.

Note that my patch that added setDefaultParser is mainly intended to run the tests; providing html5 as a parser for individual calls has so far worked for me.

@domenic domenic closed this in e4dbc59 Apr 15, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment