Restructured, cleaned code; improved speed #31

wants to merge 552 commits into

The code was a mess, so I spent some time restructuring it to make it easier to understand. I dropped support for browsers, since they already have HTML parsers.


To clarify: I don't expect this to be merged; there are probably too many changes. But anyone who's interested may use my code.


This is awesome - I'm getting a 2x speedup with this patch, with no modifications to my existing code other than changing the requires to "htmlparser2".

Also the directory structure definitely needed a facelift. Thanks!

Is there any way you can open up issues on your forked page, or does everything go back to tautologistics repo?


The issues tab is opened now. Thank you for the feedback :D


hmm, it's too bad this won't run on the browser -- that's the only reason I use htmlparser

browsers may already have html parsers, but there are other reasons why this package is used there.


@deanmao I'm curious, what's your use case? The only instance I can think of is running server-side tests in the browser instead of the command line.

Also, this has become a hilarious pull request:

+ 3,536 additions 
- 3,273 deletions

...maybe tautologistics will merge it ;-)


There are lots of use cases for running in the browser... for example, you're manipulating html in the browser, but you'd like to do it the same way as on the server side so that the same set of js can be in both places.

For me, I have a pseudo html that doesn't really map to proper html tags. The browser is too smart for its own good and would convert unknown tags to something that suits itself. For example, if I created the <x> tag, it might convert it into a <div> for me if I start appending children to it.
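To make the pseudo-html use case concrete, here is a minimal sketch of reading tag names exactly as written, with no DOM normalisation in between. It is a toy regex scanner, not htmlparser2's tokenizer; `listTags` and the sample markup are made up for illustration.

```javascript
// Toy sketch only: a regex-based scanner, not htmlparser2's tokenizer.
// It reports tag names exactly as they appear in the input, so made-up
// tags such as <x> pass through unchanged.
function listTags(markup) {
  const events = [];
  const tagPattern = /<(\/?)([A-Za-z][^\s\/>]*)[^>]*>/g;
  let match;
  while ((match = tagPattern.exec(markup)) !== null) {
    events.push({ type: match[1] ? "close" : "open", name: match[2] });
  }
  return events;
}

const pseudoHtml = "<x><item>hello</item></x>";
console.log(listTags(pseudoHtml).map((e) => e.type + ":" + e.name).join(" "));
// prints "open:x open:item close:item close:x"
```

A real parser also has to deal with comments, attributes containing `>`, and script content, which this sketch ignores.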


Also, why bother making it a pull request? Why not just make this one a different npm module that was derived from htmlparser? I'm sure others would still find it useful even though it no longer resembles the original node-htmlparser.

You could just attribute the original npm module in package.json and give it a separate name. I'm sure there will be users who will appreciate this module for what it is.

EDIT: I see that it is already under htmlparser2... makes sense. I also see that it can be run in the browser as well, it doesn't make use of node specific apis.


I'd use JSDOM for something like that... it's slow but it's a closer representation to the original DOM.

As far as the pull request, check the original pull request date - it was 7 months ago. You can install this version with npm install htmlparser2. @FB55 has done a great job improving this library. I'm using his fork in my library, cheerio.


JSDOM doesn't actually include a parser, if you read their source. It just uses htmlparser 1.x behind the scenes, so there's not much to gain over using htmlparser directly. You'll probably notice it fails on the same sets of bad html.


Ohhh that's right.. so you're looking for something that exactly matches the browser's parser? (i.e. <x> => <div>)


well, not really... i'm specifically looking for something that doesn't match the browser's parser at all :-)

fb55 and others added some commits Mar 31, 2013
@fb55 fb55 [tests/events] concat text events 09b8833
@fb55 fb55 [tests/events] fixed order of attribute/opentag events, merged text e…
@fb55 fb55 [tokenizer] use strings instead of buffers
has a huge impact on speed
@fb55 fb55 [parser] don't implement stream.Writable, use new tokenizer interface b837b95
@fb55 fb55 [tests/stream] fixed order of events db95f00
@fb55 fb55 [tokenizer] simplified logic e4982e1
@fb55 fb55 [parser] fixed handling of implied closing and empty tags 1905dd3
@fb55 fb55 [tests/events] accidentally removed part of the document 70c6865
@fb55 fb55 added a WritableStream interface again
this time, it's implementing stream.Writable
@fb55 fb55 3.0.0 (finally!)
the 3.x releases before were crappy, and I will deny to have published
@fb55 fb55 [tokenizer] changed internal name to `Tokenizer` 1db8148
burl [tokenizer] fix for script tags causing following nodes to be interpreted as TEXT

* this._special reverted to 0 after "closetag" event
[02-template.json] added <p>...</p> around script tag to ensure that closing </p> is seen as a tag rather than text node
@fb55 fb55 [proxyhandler] don't use getters/setters 9898b9a
@fb55 fb55 added CollectingHandler
collects all events and passes them through to another handler

can simulate a reset for the underlying handler using the `restart()`
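A rough sketch of the idea behind a collecting handler: record every event, forward it to the wrapped handler, and replay the recording on `restart()`. The class name `CollectingSketch`, the `emit` method and the event shapes are assumptions for illustration, not the code from this commit.

```javascript
// Hypothetical sketch; names and structure are illustrative only.
class CollectingSketch {
  constructor(target) {
    this.target = target; // handler that receives every forwarded event
    this.events = [];     // recorded [name, args] pairs
  }

  // Record an event, then pass it through to the wrapped handler.
  emit(name, ...args) {
    this.events.push([name, args]);
    if (typeof this.target[name] === "function") this.target[name](...args);
  }

  // Simulate a reset: ask the target to reset, then replay the recording.
  restart() {
    if (typeof this.target.onreset === "function") this.target.onreset();
    for (const [name, args] of this.events) {
      if (typeof this.target[name] === "function") this.target[name](...args);
    }
  }
}

const seen = [];
const target = {
  onreset() { seen.length = 0; },
  onopentag(name) { seen.push("open:" + name); },
  ontext(text) { seen.push("text:" + text); },
};

const collector = new CollectingSketch(target);
collector.emit("onopentag", "p");
collector.emit("ontext", "hi");
collector.restart();
console.log(seen.join(" ")); // prints "open:p text:hi"
```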
@fb55 fb55 [tests] use the new CollectingHandler 01d8adf
@fb55 fb55 [tests] removed unused `f` var f2542db
@fb55 fb55 3.0.1 fcb35f0
@fb55 fb55 Merge pull request #38 from burl/master
fix for script tags
@fb55 fb55 3.0.2 779e608
@fb55 fb55 [bench] use setImmediate instead of process.nextTick c848d69
@fb55 fb55 [bench] try to test all available modules 1384620
@fb55 fb55 [bench] removed unused functions, improved output 9f465ca
@fb55 fb55 [readme] updated benchmarks
also use the more readable unit ms/el
@fb55 fb55 [doc] call `end`, use single quotes bc00862
@fb55 fb55 [doc] updated section about node-htmlparser 6935c0d
@fb55 fb55 renamed repository, 3.0.3 8a91aac
@fb55 fb55 use DomUtils.getText in fetch, split getElements e7ad785
@fb55 fb55 [tokenizer] name states consistently 6b995ab
@fb55 fb55 [feedhandler] recursively walk the tree 0b88170
@fb55 fb55 [readme] small updates
tests pass now thanks to updates to the domhandler module
@fb55 fb55 [tokenizer] don't emit an "onopentagend" event for self-closing tags e6f0199
@fb55 fb55 [parser] fixed handling of self-closing tags a3a9954
@fb55 fb55 [tests] stream tests are run again 9d478ea
@fb55 fb55 [tests/feeds] run rdf test again e612238
@fb55 fb55 [tests/stream] enabled xmlMode for RSS test 3b821dc
@fb55 fb55 [tests/stream] create a new handler for the second run 1bb92f7
@fb55 fb55 [tests/stream] added tests for the files in tests/Documents ae58e56
@fb55 fb55 3.0.4 83c75dc
@fb55 fb55 [parser] lowercase instruction names if lowerCaseTags option is set
for backwards compat
@fb55 fb55 3.0.5 61c5a80
@fb55 fb55 [tests/events] added test case for jsdom#368 d79b1b3
@fb55 fb55 changed behavior for non-xml mode
• lowercase tag and attribute names by default
• CDATA is now emitted as a comment (fixes tmpvar/jsdom#618)
@fb55 fb55 [tests/events] updated tests to reflect latest changes 357a825
@fb55 fb55 3.1.0 96c41b1
@papandreou papandreou Added missing void elements. 75fb1cf
@fb55 fb55 Merge pull request #46 from One-com/missing_void_elements
Added missing void elements.
@AndreasMadsen AndreasMadsen [tokenizer] text in special tags there looks like a tag ending 7ca6d22
@fb55 fb55 Merge pull request #48 from AndreasMadsen/script-in-script
[tokenizer] text in special tags there looks like a tag ending
@fb55 fb55 [tokenizer] consume token again
after switching from BEFORE_CLOSING_TAG_NAME to TEXT state (inside a special tag)
@fb55 fb55 [parser] still recognize other options in non-xml-mode
using the easiest solution (applying DeMorgan).
@fb55 fb55 3.1.1 231a746
@AndreasMadsen AndreasMadsen [tokenizer] don't reset comment state in case of long endings 7ef5de8
@fb55 fb55 Merge pull request #49 from AndreasMadsen/long-comment
[tokenizer] don't reset comment state in case of long endings
@AndreasMadsen AndreasMadsen [Tokenizer] don't reset CDATA state in case of long endings e8dc84a
@fb55 fb55 Merge pull request #50 from AndreasMadsen/long-cdata-ending
[Tokenizer] don't reset CDATA state in case of long endings
@fb55 fb55 readme: added version badge a768e88
@fb55 fb55 [readme] added yet another badge (dependency versions) 40a2339
@fb55 fb55 [bench] added the hubbub & html-parser modules
todo: update readme
@fb55 fb55 3.1.2 dda8df2
@AndreasMadsen AndreasMadsen [Parser] open tags before close if never opened 7fd58aa
@AndreasMadsen AndreasMadsen [Parser] implicit open only p and br tags 694dea7
@abarre abarre Fix perf regression in the Tokenizer : avoid a concatenation
Version 2.3.1 :
-> % node bench2.js
htmlparser2:  01.86 ms/el

Version 3.1.2 without the fix :
-> % node tests/bench.js 
htmlparser2:  04.50 ms/el

Version 3.1.2 with the fix :
-> % node tests/bench.js
htmlparser2:  01.75 ms/el
@fb55 fb55 Merge pull request #54 from abarre/master
[tokenizer] fix perf regression
@fb55 fb55 Merge pull request #52 from AndreasMadsen/implicit-open
[parser] implicit open only p and br tags
@fb55 fb55 3.1.3 eade820
@fb55 fb55 [parser] renamed emptyTags to voidElements, sorted them 0ca2c1e
@fb55 fb55 [parser] improved consistency & simplified 26117ef
@fb55 fb55 [tokenizer] simplified `end` logic 7932367
@fb55 fb55 [tokenizer] removed noop blocks in AFTER_{COMMENT,CDATA}_2 45d9067
@fb55 fb55 [tokenizer] use `continue` instead of decreasing the index 87c6f2b
@fb55 fb55 [bench] removed unnecessary noop functions 7608c11
@fb55 fb55 [tokenizer] improved handling of remaining data d00b391
@fb55 fb55 [readme] it~~'~~s 863183a
@ForbesLindesay ForbesLindesay Add parseDOM and parseFeed helper methods 77bf0ae
@fb55 fb55 Merge pull request #55 from ForbesLindesay/patch-1
Add parseDOM and parseFeed helper methods
@ForbesLindesay ForbesLindesay Add link to live demo 16aef00
@fb55 fb55 Merge pull request #56 from ForbesLindesay/patch-1
Add link to live demo
@fb55 fb55 [parser] default options & cbs to empty objects
fixes #57
@fb55 fb55 3.1.4 529f727
Zach Smith [tokenizer] fix case where `<` followed by whitespace doesn't parse c…
@fb55 fb55 Merge pull request #58 from xcoderzach/master
[tokenizer] fix case where `<` followed by whitespace doesn't parse
@fb55 fb55 3.1.5 830c157
@fb55 fb55 [parser] don't overwrite attribute values on second occurence
as described in #42
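The rule in this commit (the first occurrence of an attribute wins) can be sketched as follows. `collectAttributes` and its input format are hypothetical and only illustrate the behaviour, not the parser's internals.

```javascript
// Sketch: build an attribute map that keeps the first value for each name.
function collectAttributes(pairs) {
  const attribs = Object.create(null); // no inherited keys to collide with
  for (const [name, value] of pairs) {
    if (!(name in attribs)) {
      attribs[name] = value; // later duplicates are ignored
    }
  }
  return attribs;
}

// Roughly what <a class="a" id="x" class="b"> would produce.
const attribs = collectAttributes([
  ["class", "a"],
  ["id", "x"],
  ["class", "b"],
]);
console.log(attribs.class); // prints "a"
```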
@fb55 fb55 [readme] behavior of example changed due to #58 4d56157
@ForbesLindesay ForbesLindesay Add .gitignore ca311d4
@ForbesLindesay ForbesLindesay Add .gitattributes so tests still work on windows 909a3f1
@ForbesLindesay ForbesLindesay Normalize line endings f6f93ef
@fb55 fb55 [tokenizer] recognize the form field (U+0C), drop the carriage return (U+0D)

to be inline with the HTML5 spec

(recognized in cheeriojs/cheerio#242)
fb55 and others added some commits Aug 26, 2013
@fb55 fb55 Delete .DS_Store 41ad914
@fb55 fb55 [tokenizer] emit `onattribdata` in `_handleTrailingData`
fixes #66
@fb55 fb55 [tests] simplifications 336af9b
@fb55 fb55 3.2.4 fc0918c
@fb55 fb55 [readme] updated performance characteristics 7b1e4c9
@fb55 fb55 [tokenizer] handle `<<` correctly 76643d3
@fb55 fb55 3.2.5 2f24491
@fb55 fb55 [tests] added test case for cheeriojs/cheerio#247 834d6d2
@fb55 fb55 update to DomHandler@2.1, updated FeedHandler accordingly, bump 994cfda
@fb55 fb55 [tests] write only single characters for testing chunked data
failed previously (only for FeedHandler tests), fixed now due to
DomHandler upgrade (which removed the `ignoreWhitespace` option)
@fb55 fb55 [package] require domutils@1.2
as requested in fb55/css-select#11
@fb55 fb55 package: update readable-stream e6418c2
@fb55 fb55 package: use simple `license` field 0e5775c
@fb55 fb55 replace non-breaking space with regular space
as requested in #70
@fb55 fb55 index: pass `options` argument to constructors c9d4abe
@fb55 fb55 tests: remove unused `cb` argument 298546c
@fb55 fb55 feedhandler: wrap assignments f9bc72f
@fb55 fb55 tests: changed indentation to tabs 5f244df
@fb55 fb55 package: updated dom module versions, 3.4.0 7153b27
@patrick-steele-idem patrick-steele-idem #73 Added support for recognizing self-closing tags and CDATA in non-XML mode
@patrick-steele-idem patrick-steele-idem Fix option to disable lower case tags and attars in non-XML mode 6c173b8
@patrick-steele-idem patrick-steele-idem Added this._lowerCaseTagNames and this._lowerCaseAttributeNames 357be1d
@patrick-steele-idem patrick-steele-idem Handle case where options is null and allow truthy values bdb1273
@superdweebie superdweebie Add self-closeing svg tags ea8b652
@fb55 fb55 Merge pull request #75 from superdweebie/master
Add self-closeing svg tags
@patrick-steele-idem patrick-steele-idem Switched to using "in" operator for options adfaafb
@patrick-steele-idem patrick-steele-idem Merged options initialization into a single line 54f33ad
@fb55 fb55 Merge pull request #74 from patrick-steele-idem/master
Fix existing and add new options
@fb55 fb55 3.4.1 ad22179
@fb55 fb55 parser: adjusted whitespace, fixed _updatePosition 40b9cb1
@fb55 fb55 3.5.0
3.4.1 was a mistake & was unpublished. The changes require at least a
minor version update.
@fb55 fb55 Delete .DS_Store 3ba7059
@fb55 fb55 tokenizer: Fixed handling of text containing `&` when decoding entities f4091b2
@fb55 fb55 3.5.1 8006c5b
@fb55 fb55 readme: use badges
look much better on retina displays ('cause svg)
@fb55 fb55 improved style edde16b
@fb55 fb55 test: load FeedHandler from index.js
as RssHandler, so that’s tested, too
@fb55 fb55 use jshint 01c567a
@fb55 fb55 deleted .gitignore 0196598
@fb55 fb55 tests: added several test cases
with more to come
@fb55 fb55 tokenizer: removed unavoidable branch 62a17bc
@fb55 fb55 index: removed unnecessary `parser` variable 56a79e5
@fb55 fb55 parser: moved some shared logic to _getInstructionName 76d00e5
@fb55 fb55 tokenizer: fixed bug in attribute values without tags 3434286
@fb55 fb55 tests: added, extended test cases dd658ba
@fb55 fb55 tests: added file for general API tests 7002a0f
@fb55 fb55 tokenizer: added specialized characterState function b7ac8f5
@fb55 fb55 tokenizer: reconsume characters in ifElseState()
fixes handling of <![CD> and friends
@fb55 fb55 tokenizer: fixed boundaries of legacy entities 645a6ef
@fb55 fb55 test: added/extended test cases dcb1d89
@fb55 fb55 tokenizer: reconsume last token when not in CDATA 35c1dbe
@fb55 fb55 index: fixed typo 0faca5d
@fb55 fb55 moved _ended logic from parser to tokenizer 39bea1d
@fb55 fb55 test: extended test cases 24d6936
@fb55 fb55 implement .pause/.resume in parser, fixed the implementation, added test 100d86e
@fb55 fb55 tokenizer: ignore unfinished tags in _handleTrailingData 391dd0a
@fb55 fb55 tokenizer: fixed handling of empty numeric entities
eg. &#x;
@fb55 fb55 test: added test for .resume() without any data written d634ab3
@fb55 fb55 3.6.0 d34cfe9
@fb55 fb55 track coverage on coveralls 9a910f9
@fb55 fb55 readme: moved testing-related badges to new line 75f602e
@fb55 fb55 tokenizer: use entity maps of `entities`
I don't want to maintain them a second time, plus, the overhead isn't
too much.
@fb55 fb55 3.7.0 8ff1a55
@fb55 fb55 package: use domutils@1.4 270c2cd
@fb55 fb55 3.7.1 629dabb
@fb55 fb55 readme: use travis' svg badge 3ceb39f
@jugglinmike jugglinmike Update to latest version of domutils 235a124
@fb55 fb55 Merge pull request #81 from jugglinmike/domutils-1.5
Update to latest version of domutils
@fb55 fb55 3.7.2 e84da5b
@fb55 fb55 travis: don't run tests on 0.8 1ca87a6
@fb55 fb55 use flat icons
& only display test status of master
@jugglinmike jugglinmike Handle partial comment endings as normal content
If a comment ending is interrupted by an unexpected character, treat the
parsed characters as comment content and revert to the "in comment"
@fb55 fb55 Merge pull request #87 from jugglinmike/partial-comment-end
Handle partial comment endings as normal content
@fb55 fb55 3.7.3 3c8707b
@duncanbeevers duncanbeevers Use jscs to enforce code style 2e6c0e0
@duncanbeevers duncanbeevers Run jscs as lint task 78c20a2
@fb55 fb55 Merge pull request #93 from duncanbeevers/jscs
Enforce code style with jscs
@devongovett devongovett Ignore readable-stream dependency in browserify b22d6d1
@fb55 fb55 Merge pull request #98 from devongovett/patch-1
Ignore readable-stream dependency in browserify
@chbrown chbrown Paragraph elements cannot contain headings f2e8ade
@chbrown chbrown Add h1--h6 to Parser's openImpliesClose tagset, to close paragraphs 0b4831c
@fb55 fb55 Merge pull request #99 from chbrown/master
h1, h2, ... h6 should auto-close p elements
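A small sketch of how an "open implies close" table can make headings close an open paragraph. The table name `openImpliesCloseSketch` and the helper `applyOpenTag` are made up for this example; only the h1 to h6 closing p rule comes from the commits above.

```javascript
// Illustrative table: opening a key tag closes any open tag listed in its set.
const headings = ["h1", "h2", "h3", "h4", "h5", "h6"];
const openImpliesCloseSketch = new Map(
  headings.map((tag) => [tag, new Set(["p"])])
);

// Push `name` onto the open-element stack, first emitting implied closes.
function applyOpenTag(stack, name, events) {
  const closes = openImpliesCloseSketch.get(name);
  while (closes && stack.length > 0 && closes.has(stack[stack.length - 1])) {
    events.push("close:" + stack.pop());
  }
  stack.push(name);
  events.push("open:" + name);
}

const stack = [];
const events = [];
applyOpenTag(stack, "p", events);
applyOpenTag(stack, "h2", events); // <p><h2> closes the paragraph first
console.log(events.join(" ")); // prints "open:p close:p open:h2"
```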
@Ackar Ackar Add missing void elements 08adc50
@fb55 fb55 Merge pull request #100 from Ackar/master
Add missing void elements
@cvrebert cvrebert add support for onparserinit event callback
So that handler can obtain reference to the Parser
(and then access Parser.startIndex)
Refs fb55/domhandler#7
@fb55 fb55 Merge pull request #106 from cvrebert/onparserinit
add support for onparserinit event callback
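The point of `onparserinit` is that the handler receives a reference to the parser and can read `startIndex` later. Below is a minimal sketch of that wiring with a stand-in parser. `MiniParser` and the `ontagstart` callback are hypothetical; only `onparserinit` and `startIndex` are named in the commit.

```javascript
// Hypothetical mini parser; only the callback wiring is the point here.
class MiniParser {
  constructor(callbacks) {
    this.callbacks = callbacks || {};
    this.startIndex = 0;
    // Hand the parser instance to the handler as soon as it exists.
    if (typeof this.callbacks.onparserinit === "function") {
      this.callbacks.onparserinit(this);
    }
  }

  // Single-chunk toy: record the position of each "<" in startIndex.
  write(chunk) {
    for (let i = 0; i < chunk.length; i++) {
      if (chunk[i] === "<") {
        this.startIndex = i;
        if (typeof this.callbacks.ontagstart === "function") {
          this.callbacks.ontagstart(); // made-up event for this sketch
        }
      }
    }
  }
}

const handler = {
  parser: null,
  positions: [],
  onparserinit(parser) { this.parser = parser; },
  ontagstart() { this.positions.push(this.parser.startIndex); },
};

new MiniParser(handler).write("ab<p>cd</p>");
console.log(handler.positions.join(",")); // prints "2,7"
```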
@fb55 fb55 3.8.0 2b6949e
@sailxjx sailxjx use entry:content if entry:summary is not exist 9c01e06
@fb55 fb55 parser: call `onparserinit` when resetting f0b2a83
@fb55 fb55 parser: remove tab in empty line 81b1a29
@fb55 fb55 Merge pull request #107 from teambition/atom-content
use entry:content if entry:summary is not exist
@fb55 fb55 indexing now supports streams
fixes #108
@fb55 fb55 3.8.1 6e59ad5
@cvrebert cvrebert bump domhandler dependency to 2.3 e19d8b9
@fb55 fb55 Merge pull request #110 from cvrebert/master
bump domhandler dependency to 2.3
@fb55 fb55 3.8.2 748d3da
@fb55 fb55 readme: pass decodeEntities to constructor
in the example
@fb55 fb55 travis: run tests inside a container
following fb55/css-what#8
@fb55 fb55 readme: support npm's markdown parser, fixes 41c84b8
@fb55 fb55 readme: fixes 9e770fc