Restructured, cleaned code; improved speed #31

Open
wants to merge 552 commits into
from

Conversation

Projects
None yet
@fb55

fb55 commented Oct 12, 2011

The code was a mess. I spent some time restructuring it, so that it could be understood easily. I dropped support for browsers, they already have HTML parsers.

@fb55

This comment has been minimized.

Show comment Hide comment
@fb55

fb55 Nov 12, 2011

To clarify: I don't expect this to be merged, there are probably too many changes. But anyone who's interested may use my code.

fb55 commented Nov 12, 2011

To clarify: I don't expect this to be merged, there are probably too many changes. But anyone who's interested may use my code.

@matthewmueller

This comment has been minimized.

Show comment Hide comment
@matthewmueller

matthewmueller Nov 20, 2011

This is awesome - I'm getting 2x speed on this patch with no modifications to my existing code other than changing the requires to "htmlparser2"

Also the directory structure definitely needed a facelift. Thanks!

Is there any way you can open up issues on your forked page, or does everything go back to tautologistics repo?

This is awesome - I'm getting 2x speed on this patch with no modifications to my existing code other than changing the requires to "htmlparser2"

Also the directory structure definitely needed a facelift. Thanks!

Is there any way you can open up issues on your forked page, or does everything go back to tautologistics repo?

@fb55

This comment has been minimized.

Show comment Hide comment
@fb55

fb55 Nov 20, 2011

The issues tab is opened now. Thank you for the feedback :D

fb55 commented Nov 20, 2011

The issues tab is opened now. Thank you for the feedback :D

@deanmao

This comment has been minimized.

Show comment Hide comment
@deanmao

deanmao May 2, 2012

Contributor

hmm, it's too bad this won't run on the browser -- that's the only reason I use htmlparser

browsers may already have html parsers, but there are other reasons why this package is used there.

Contributor

deanmao commented May 2, 2012

hmm, it's too bad this won't run on the browser -- that's the only reason I use htmlparser

browsers may already have html parsers, but there are other reasons why this package is used there.

@matthewmueller

This comment has been minimized.

Show comment Hide comment
@matthewmueller

matthewmueller May 2, 2012

@deanmao I'm curious, what's your use case? The only instance I can think of is running server-side tests in the browser instead the command line.

Also, this has become a hilarious pull request:

+ 3,536 additions 
- 3,273 deletions

...maybe tautologistics will merge it ;-)

@deanmao I'm curious, what's your use case? The only instance I can think of is running server-side tests in the browser instead the command line.

Also, this has become a hilarious pull request:

+ 3,536 additions 
- 3,273 deletions

...maybe tautologistics will merge it ;-)

@deanmao

This comment has been minimized.

Show comment Hide comment
@deanmao

deanmao May 2, 2012

Contributor

There's lots of use cases for in browser... for example, you're manipulating html in the browser, but you'd like to do it the same way as on the server side so that the same set of js can be in both places.

For me, I have a pseudo html that doesn't really map to proper html tags. The browser is too smart for it's own good and would convert unknown tags to something that suits itself. For example, if I created the <x> tag, it might convert it into a <div> for me if I start appending children to it.

Contributor

deanmao commented May 2, 2012

There's lots of use cases for in browser... for example, you're manipulating html in the browser, but you'd like to do it the same way as on the server side so that the same set of js can be in both places.

For me, I have a pseudo html that doesn't really map to proper html tags. The browser is too smart for it's own good and would convert unknown tags to something that suits itself. For example, if I created the <x> tag, it might convert it into a <div> for me if I start appending children to it.

@deanmao

This comment has been minimized.

Show comment Hide comment
@deanmao

deanmao May 2, 2012

Contributor

Also, why bother making it a pull request? Why not just make this one a different npm module that was derived from htmlparser? I'm sure others would still find it useful even though it no longer resembles the original node-htmlparser.

You could just attribute the original npm module in package.json and give it a separate name. I'm sure there will be users who will appreciate this module for what it is.

EDIT: I see that it is already under htmlparser2... makes sense. I also see that it can be run in the browser as well, it doesn't make use of node specific apis.

Contributor

deanmao commented May 2, 2012

Also, why bother making it a pull request? Why not just make this one a different npm module that was derived from htmlparser? I'm sure others would still find it useful even though it no longer resembles the original node-htmlparser.

You could just attribute the original npm module in package.json and give it a separate name. I'm sure there will be users who will appreciate this module for what it is.

EDIT: I see that it is already under htmlparser2... makes sense. I also see that it can be run in the browser as well, it doesn't make use of node specific apis.

@matthewmueller

This comment has been minimized.

Show comment Hide comment
@matthewmueller

matthewmueller May 2, 2012

I'd use JSDOM for something like that... it's slow but it's a closer representation to the original DOM.

As far as the pull request, check the original pull request date - it was 7 months ago. You can install this version with npm install htmlparser2. @fb55 has done a great job improving this library. I'm using his fork in my library, cheerio (http://github.com/MatthewMueller/cheerio).

I'd use JSDOM for something like that... it's slow but it's a closer representation to the original DOM.

As far as the pull request, check the original pull request date - it was 7 months ago. You can install this version with npm install htmlparser2. @fb55 has done a great job improving this library. I'm using his fork in my library, cheerio (http://github.com/MatthewMueller/cheerio).

@deanmao

This comment has been minimized.

Show comment Hide comment
@deanmao

deanmao May 2, 2012

Contributor

JSDOM doesn't actually include a parser if you've read their source. JSDOM actually just uses htmlparser 1.x behind the scenes, so it's not really that much gain over from using htmlparser directly. You'll probably notice it fails on the same sets of bad html.

Contributor

deanmao commented May 2, 2012

JSDOM doesn't actually include a parser if you've read their source. JSDOM actually just uses htmlparser 1.x behind the scenes, so it's not really that much gain over from using htmlparser directly. You'll probably notice it fails on the same sets of bad html.

@matthewmueller

This comment has been minimized.

Show comment Hide comment
@matthewmueller

matthewmueller May 2, 2012

Ohhh thats right.. so you're looking for something that exactly matches the browser's parser? (ie. <x> => <div>)

Ohhh thats right.. so you're looking for something that exactly matches the browser's parser? (ie. <x> => <div>)

@deanmao

This comment has been minimized.

Show comment Hide comment
@deanmao

deanmao May 2, 2012

Contributor

well, not really... i'm specifically looking for something that doesn't match the browser's parser at all :-)

Contributor

deanmao commented May 2, 2012

well, not really... i'm specifically looking for something that doesn't match the browser's parser at all :-)

myndzi referenced this pull request in myndzi/node-htmlparser Feb 16, 2013

Merge pull request #31 from jugglinmike/text-after-cdata
Recognize closing CDATA tags as end of "special"

fb55 and others added some commits Mar 31, 2013

added a WritableStream interface again
this time, it's implementing stream.Writable
3.0.0 (finally!)
the 3.x releases before were crappy, and I will deny to have published
them
burl
[tokenizer] fix for script tags causing following nodes to be interpr…
…eted as TEXT

* this._special reverted to 0 after "closetag" event
[02-template.json] added <p>...</p> around script tag to ensure that closing </p> is seen as a tag rather than text node
added CollectingHandler
collects all events and passes them through to another handler

can simulate a reset for the underlying handler using the `restart()`
method
[readme] updated benchmarks
also use the more readable unit ms/el

fb55 and others added some commits Mar 15, 2014

tokenizer: use entity maps of `entities`
I don't want to maintain them a second time, plus, the overhead isn't
too much.
Merge pull request #81 from jugglinmike/domutils-1.5
Update to latest version of domutils
use flat icons
& only display test status of master
Handle partial comment endings as normal content
If a comment ending is interrupted by an unexpected character, treat the
parsed characters as comment content and revert to the "in comment"
state.
Merge pull request #87 from jugglinmike/partial-comment-end
Handle partial comment endings as normal content
Merge pull request #98 from devongovett/patch-1
Ignore readable-stream dependency in browserify
Merge pull request #99 from chbrown/master
h1, h2, ... h6 should auto-close p elements
add support for onparserinit event callback
So that handler can obtain reference to the Parser
(and then access Parser.startIndex)
Refs fb55/domhandler#7
Merge pull request #106 from cvrebert/onparserinit
add support for onparserinit event callback
Merge pull request #107 from teambition/atom-content
use entry:content if entry:summary is not exist
Merge pull request #110 from cvrebert/master
bump domhandler dependency to 2.3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment