Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect HTML parsing, omitted closing tags #482

Closed
jankuca opened this issue Aug 16, 2012 · 10 comments
Closed

Incorrect HTML parsing, omitted closing tags #482

jankuca opened this issue Aug 16, 2012 · 10 comments
Labels

Comments

@jankuca
Copy link

jankuca commented Aug 16, 2012

The parser should take some HTML principles into account.

There are several situations in which you can omit closing tags of some elements.

<p>First paragraph.
<p>Second paragraph.

This is valid HTML5 code and should be parsed as

<p>First paragraph.</p>
<p>Second paragraph.</p>

The parser instead nests the second paragraph into the former one as

<p>First paragraph.
  <p>Second paragraph.</p>
</p>

This is a major bug in my opinion.

@jankuca
Copy link
Author

jankuca commented Aug 16, 2012

I see that this is actually rather a problem of a dependency of this project.

I fixed the bug and requested a pull to the dependency (node-htmlparser): tautologistics/node-htmlparser#54

@domenic
Copy link
Member

domenic commented Oct 10, 2012

I've let @tautologistics know I'm happy to help maintain node-htmlparser; hopefully we can get this pulled in.

@stuartpb
Copy link

@tautologistics seems to have gone offline, wouldn't it be possible to set up your own up-to-date fork and change the package dependency to that, at least for the time being?

@domenic
Copy link
Member

domenic commented Dec 26, 2012

We're also considering moving to a different html5 parser, either the one from dom.js per #354, or @aredridel's html5. You can currently run the tests with a different parser, so if anyone gets one working that passes at least as many tests as current, plus fixes a bug or two, I'm ready to switch.

@aredridel
Copy link
Contributor

Yeah, the html5 parser should handle this better. node-htmlparser is raw speed but not careful at all.

@stuartpb
Copy link

Is it possible to use the HTML5 parser with jsdom.env? I'm making calls with {config:{parser:require('html5')}}, but I'm not seeing any difference from the normal parser (specifically, <p> tags aren't closing the previous unclosed <p> tag).

@domenic
Copy link
Member

domenic commented Dec 28, 2012

@stuartpb try this? Not sure why it'd be necessary, but this is what the tests use:

var browser = require("../lib/jsdom/browser/index");
browser.setDefaultParser(require('html5'));

@godmar may know more about why simply using config doesn't work.

@aredridel
Copy link
Contributor

I'd love to see jsdom's setup routines simplified to fix things like this ;-)

@domenic
Copy link
Member

domenic commented Dec 28, 2012

Pull requests welcome from people who know more about what they want and how jsdom setup works than I do :)

@godmar
Copy link
Contributor

godmar commented Dec 28, 2012

FWIW, it took some trickery to get the tests to work with html5. From what I recall, one problematic issue is that lib/jsdom.js passes an export from jsdom/browser/index onto its own exports:

https://github.com/tmpvar/jsdom-oscon/blob/master/node_modules/jsdom/lib/jsdom.js#L10-11

If jsdom/browser/index is only partially initialized when lib/jsdom.js is sourced, this will pick up undefineds.

Note that my patch that added setDefaultParser is mainly intended to run the tests; providing html5 as a parser for individual calls has so far worked for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants