Restructured, cleaned code; improved speed #31

wants to merge 518 commits into from
Felix Böhm

The code was a mess. I spent some time restructuring it, so that it could be understood easily. I dropped support for browsers, they already have HTML parsers.

Felix Böhm

To clarify: I don't expect this to be merged, there are probably too many changes. But anyone who's interested may use my code.

Matthew Mueller

This is awesome - I'm getting 2x speed on this patch with no modifications to my existing code other than changing the requires to "htmlparser2"

Also the directory structure definitely needed a facelift. Thanks!

Is there any way you can open up issues on your forked page, or does everything go back to tautologistics repo?

Felix Böhm

The issues tab is opened now. Thank you for the feedback :D

Dean Mao
deanmao commented May 01, 2012

hmm, it's too bad this won't run on the browser -- that's the only reason I use htmlparser

browsers may already have html parsers, but there are other reasons why this package is used there.

Matthew Mueller

@deanmao I'm curious, what's your use case? The only instance I can think of is running server-side tests in the browser instead of the command line.

Also, this has become a hilarious pull request:

+ 3,536 additions 
- 3,273 deletions

...maybe tautologistics will merge it ;-)

Dean Mao
deanmao commented May 01, 2012

There are lots of use cases for the browser... for example, you're manipulating HTML in the browser, but you'd like to do it the same way as on the server side, so that the same set of JS can be used in both places.

For me, I have a pseudo-HTML that doesn't really map to proper HTML tags. The browser is too smart for its own good and would convert unknown tags to something that suits itself. For example, if I created the <x> tag, it might convert it into a <div> for me if I start appending children to it.

Dean Mao
deanmao commented May 01, 2012

Also, why bother making it a pull request? Why not just make this one a different npm module that was derived from htmlparser? I'm sure others would still find it useful even though it no longer resembles the original node-htmlparser.

You could just attribute the original npm module in package.json and give it a separate name. I'm sure there will be users who will appreciate this module for what it is.

EDIT: I see that it is already under htmlparser2... makes sense. I also see that it can be run in the browser as well, it doesn't make use of node specific apis.

Matthew Mueller

I'd use JSDOM for something like that... it's slow but it's a closer representation of the original DOM.

As far as the pull request, check the original pull request date - it was 7 months ago. You can install this version with npm install htmlparser2. @FB55 has done a great job improving this library. I'm using his fork in my library, cheerio (

Dean Mao
deanmao commented May 01, 2012

JSDOM doesn't actually include a parser, if you've read their source. It just uses htmlparser 1.x behind the scenes, so there's not much gain over using htmlparser directly. You'll probably notice it fails on the same sets of bad HTML.

Matthew Mueller

Ohhh that's right... so you're looking for something that exactly matches the browser's parser? (i.e. <x> => <div>)

Dean Mao
deanmao commented May 01, 2012

well, not really... i'm specifically looking for something that doesn't match the browser's parser at all :-)

and others added some commits February 16, 2013
Felix Böhm Update
The example checked the `language` attribute. Changed it to `type`.
jugglinmike Do not parse CDATA-like text inside special tags
Special nodes (e.g. script tags, style tags, comment nodes, etc.) can
contain only text nodes.
Felix Böhm Merge pull request #32 from jugglinmike/cdata-inside-special
Do not parse CDATA-like text inside special tags
Felix Böhm 2.6.0 8756001
Felix Böhm landed first version of FSM based tokenizer
fsm style taken from creationix/jsonparse

support for special tags (<script> & <style>) is missing
Add a new test for issue #36
Only finds first attribute when there is no whitespace between

- Added a html example
- Added a test
Felix Böhm Merge pull request #37 from eonlepapillon/Add-test-for-Issue-#36
Add a new test for issue #36
Felix Böhm added logic for special tags d90e7a3
Felix Böhm [tokenizer] don't fail on `< >` and `< / >`
they are now emitted as text
Felix Böhm [tokenizer] fixed ordering in cleanup 1bc6568
Felix Böhm [tokenizer] overwrite WritableStream#end, emit everything that's left 400bf43
Felix Böhm [tokenizer] take care of this._index in cleanup, emit all text 550b42e
Felix Böhm [tokenizer] set _sectionStart to 0 when text was emitted dabe165
Felix Böhm [tokenizer] call WritableStream#end after emitting the remaining data b9d568a
Felix Böhm [tokenizer] call .write instead of ._write 1144e42
Felix Böhm [parser] use the tokenizer c3d4025
Felix Böhm removed WritableStream.js and ElementType.js
both aren't needed anymore
Felix Böhm [parser] made Parser#reset work again
absolutely aweful.
Felix Böhm fall back to the readable-stream module 5c155ca
Felix Böhm [travis] removed 0.6 & 0.9, added 0.10 and 0.11 5a28547
Felix Böhm minor changes c445375
Felix Böhm [index.js] removed redundant code 1ab593a
Felix Böhm [stream] use a named function
fixes export
Felix Böhm 3.0.0
also updated domutils version & specified main-field
Felix Böhm [tokenizer] always call WritableStream#end b48adc2
Felix Böhm [parser] call Tokenizer#end, clear the stack 17b7ebe
Felix Böhm [index.js] added `createDomStream()` convenience method 654c4d4
Felix Böhm [tokenizer] added `opentagend` event 628b99e
Felix Böhm [parser] use `opentagend` event f70f545
Felix Böhm 3.0.1 b7cc1aa
Felix Böhm [tokenizer] emit opentagend on selfclosing tags, fixed handling of < …
…in special tags
Felix Böhm [index.js] added tokenizer 94e794f
Felix Böhm [tests] text events now contain more data 9793593
Felix Böhm [tokenizer] don't inherit from stream.Writable, fixed several bugs ab8b653
Felix Böhm [tests/events] concat text events 09b8833
Felix Böhm [tests/events] fixed order of attribute/opentag events, merged text e…
Felix Böhm [tokenizer] use strings instead of buffers
has a huge impact on speed
Felix Böhm [parser] don't implement stream.Writable, use new tokenizer interface b837b95
Felix Böhm [tests/stream] fixed order of events db95f00
Felix Böhm [tokenizer] simplified logic e4982e1
Felix Böhm [parser] fixed handling of implied closing and empty tags 1905dd3
Felix Böhm [tests/events] accidentally removed part of the document 70c6865
Felix Böhm added a WritableStream interface again
this time, it's implementing stream.Writable
Felix Böhm 3.0.0 (finally!)
the 3.x releases before were crappy, and I will deny to have published
Felix Böhm [tokenizer] changed internal name to `Tokenizer` 1db8148
[tokenizer] fix for script tags causing following nodes to be interpr…
…eted as TEXT

* this._special reverted to 0 after "closetag" event
[02-template.json] added <p>...</p> around script tag to ensure that closing </p> is seen as a tag rather than text node
Felix Böhm [proxyhandler] don't use getters/setters 9898b9a
Felix Böhm added CollectingHandler
collects all events and passes them through to another handler

can simulate a reset for the underlying handler using the `restart()`
Felix Böhm [tests] use the new CollectingHandler 01d8adf
Felix Böhm [tests] removed unused `f` var f2542db
Felix Böhm 3.0.1 fcb35f0
Felix Böhm Merge pull request #38 from burl/master
fix for script tags
Felix Böhm 3.0.2 779e608
Felix Böhm [bench] use setImmediate instead of process.nextTick c848d69
Felix Böhm [bench] try to test all available modules 1384620
Felix Böhm [bench] removed unused functions, improved output 9f465ca
Felix Böhm [readme] updated benchmarks
also use the more readable unit ms/el
Felix Böhm [doc] call `end`, use single quotes bc00862
Felix Böhm [doc] updated section about node-htmlparser 6935c0d
Felix Böhm renamed repository, 3.0.3 8a91aac
Felix Böhm use DomUtils.getText in fetch, split getElements e7ad785
Felix Böhm [tokenizer] name states consistently 6b995ab
Felix Böhm [feedhandler] recursively walk the tree 0b88170
Felix Böhm [readme] small updates
tests pass now thanks to updates to the domhandler module
Felix Böhm [tokenizer] don't emit an "onopentagend" event for self-closing tags e6f0199
Felix Böhm [parser] fixed handling of self-closing tags a3a9954
Felix Böhm [tests] stream tests are run again 9d478ea
Felix Böhm [tests/feeds] run rdf test again e612238
Felix Böhm [tests/stream] enabled xmlMode for RSS test 3b821dc
Felix Böhm [tests/stream] create a new handler for the second run 1bb92f7
Felix Böhm [tests/stream] added tests for the files in tests/Documents ae58e56
Felix Böhm 3.0.4 83c75dc
Felix Böhm [parser] lowercase instruction names if lowerCaseTags option is set
for backwards compat
Felix Böhm 3.0.5 61c5a80
Felix Böhm [tests/events] added test case for jsdom#368 d79b1b3
Felix Böhm changed behavior for non-xml mode
• lowercase tag and attribute names by default
• CDATA is now emitted as a comment (fixes tmpvar/jsdom#618)
Felix Böhm [tests/events] updated tests to reflect latest changes 357a825
Felix Böhm 3.1.0 96c41b1
Andreas Lind Petersen Added missing void elements. 75fb1cf
Felix Böhm Merge pull request #46 from One-com/missing_void_elements
Added missing void elements.
Andreas Madsen [tokenizer] text in special tags there looks like a tag ending 7ca6d22
Felix Böhm Merge pull request #48 from AndreasMadsen/script-in-script
[tokenizer] text in special tags there looks like a tag ending
Felix Böhm [tokenizer] consume token again
after switching from BEFORE_CLOSING_TAG_NAME to TEXT state (inside a special tag)
Felix Böhm [parser] still recognize other options in non-xml-mode
using the easiest solution (applying DeMorgan).
Felix Böhm 3.1.1 231a746
Andreas Madsen [tokenizer] don't reset comment state in case of long endings 7ef5de8
Felix Böhm Merge pull request #49 from AndreasMadsen/long-comment
[tokenizer] don't reset comment state in case of long endings
Andreas Madsen [Tokenizer] don't reset CDATA state in case of long endings e8dc84a
Felix Böhm Merge pull request #50 from AndreasMadsen/long-cdata-ending
[Tokenizer] don't reset CDATA state in case of long endings
Felix Böhm readme: added version badge a768e88
Felix Böhm [readme] added yet another badge (dependency versions) 40a2339
Felix Böhm [bench] added the hubbub & html-parser modules
todo: update readme
Felix Böhm 3.1.2 dda8df2
Andreas Madsen [Parser] open tags before close if never opened 7fd58aa
Andreas Madsen

@fb55 Request a publish

Also I assume you know that

while(pos--) this._cbs.onclosetag(this._stack.pop());

doesn't follow the ever so complicated HTML5 parsing rules.

update: I found the issue when analysing

Quoting from the HTML 5 spec (section 12.2.5):

If the stack of open elements does not have an element in scope with the same tag name as that of the token, then this is a parse error; ignore the token.

There is a special case for p tags, though:

If the stack of open elements does not have an element in button scope with the same tag name as that of the token, then this is a parse error; act as if a start tag with the tag name "p" had been seen, then reprocess the current token.

Plus another special case:

An end tag whose tag name is "sarcasm":
Take a deep breath, then act as described in the "any other end tag" entry below.
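
To make the two quoted rules concrete, here is a minimal sketch of end-tag handling on a stack of open tag names. It is not htmlparser2's actual code: `closeTag`, `open` and `close` are hypothetical names, and the spec's notion of "scope" is simplified to "anywhere on the stack".

```javascript
// Hypothetical helper illustrating the quoted rules; not the library's code.
// `stack` holds the names of the currently open elements, innermost last.
function closeTag(stack, name, onopentag, onclosetag) {
  const pos = stack.lastIndexOf(name);
  if (pos !== -1) {
    // A matching element is open: close everything above it, then the element itself.
    while (stack.length > pos) onclosetag(stack.pop());
  } else if (name === "p") {
    // Special case: act as if a <p> start tag had been seen, then close it.
    onopentag("p");
    onclosetag("p");
  }
  // Any other unmatched end tag is a parse error and the token is ignored.
}

const events = [];
const stack = ["div", "span"];
const open = (n) => events.push("open:" + n);
const close = (n) => events.push("close:" + n);

closeTag(stack, "div", open, close); // closes span, then div
closeTag(stack, "x", open, close);   // never opened: ignored
closeTag(stack, "p", open, close);   // never opened: implied <p></p>
```

The difference from the `while(pos--)` loop above is that nothing is popped unless a matching open element was actually found.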


Okay I will take another look tomorrow.

Felix Böhm

Do you mean "that"? And you usually say "on the stack" :)

I'm quite sure I mean "there", but I could be wrong.

"there" refers to a location, but you're talking about the tags (which are at a location). The correct name would be something like "Close tags that are not on the stack".

Using there, you would need to say something like "Close tags when there wasn't an opening one on the stack.".

Okay, learned something new then, thanks

and others added some commits June 11, 2013
Andreas Madsen [Parser] implicit open only p and br tags 694dea7
Anthony BARRE Fix perf regression in the Tokenizer : avoid a concatenation
Version 2.3.1 :
-> % node bench2.js
htmlparser2:  01.86 ms/el

Version 3.1.2 without the fix :
-> % node tests/bench.js 
htmlparser2:  04.50 ms/el

Version 3.1.2 with the fix :
-> % node tests/bench.js
htmlparser2:  01.75 ms/el
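
The commit message does not say which concatenation was removed, so the following is only an illustration of the general idea, under the assumption that the cost came from building text one character at a time. Both functions are hypothetical; `sectionStart` mirrors the `_sectionStart` name from the earlier tokenizer commits.

```javascript
// Variant A: grows a string character by character (one concatenation per char).
function textRunsConcat(input) {
  const runs = [];
  let current = "";
  for (let i = 0; i < input.length; i++) {
    const c = input.charAt(i);
    if (c === "<" || c === ">") {
      if (current !== "") runs.push(current);
      current = "";
    } else {
      current += c;
    }
  }
  if (current !== "") runs.push(current);
  return runs;
}

// Variant B: remembers where the current section started and slices once per run.
function textRunsSlice(input) {
  const runs = [];
  let sectionStart = 0;
  for (let i = 0; i < input.length; i++) {
    const c = input.charAt(i);
    if (c === "<" || c === ">") {
      if (i > sectionStart) runs.push(input.substring(sectionStart, i));
      sectionStart = i + 1;
    }
  }
  if (input.length > sectionStart) runs.push(input.substring(sectionStart));
  return runs;
}

const sample = "<p>hi</p>";
const viaConcat = textRunsConcat(sample);
const viaSlice = textRunsSlice(sample);
```

Both variants return the same runs; the second touches each character once and creates one string per run rather than one per character.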
Felix Böhm Merge pull request #54 from abarre/master
[tokenizer] fix perf regression
Felix Böhm Merge pull request #52 from AndreasMadsen/implicit-open
[parser] implicit open only p and br tags
Felix Böhm 3.1.3 eade820
Felix Böhm [parser] renamed emptyTags to voidElements, sorted them 0ca2c1e
Felix Böhm [parser] improved consistency & simplified 26117ef
Felix Böhm [tokenizer] simplified `end` logic 7932367
Felix Böhm [tokenizer] removed noop blocks in AFTER_{COMMENT,CDATA}_2 45d9067
Felix Böhm [tokenizer] use `continue` instead of decreasing the index 87c6f2b
Felix Böhm [bench] removed unnecessary noop functions 7608c11
Felix Böhm [tokenizer] improved handling of remaining data d00b391
Felix Böhm [readme] it~~'~~s 863183a
Forbes Lindesay Add parseDOM and parseFeed helper methods 77bf0ae
Felix Böhm Merge pull request #55 from ForbesLindesay/patch-1
Add parseDOM and parseFeed helper methods
Forbes Lindesay Add link to live demo 16aef00
Felix Böhm Merge pull request #56 from ForbesLindesay/patch-1
Add link to live demo
Felix Böhm [parser] default options & cbs to empty objects
fixes #57
Felix Böhm 3.1.4 529f727
[tokenizer] fix case where `<` followed by whitespace doesn't parse c…
Felix Böhm Merge pull request #58 from xcoderzach/master
[tokenizer] fix case where `<` followed by whitespace doesn't parse
Felix Böhm 3.1.5 830c157
Felix Böhm [parser] don't overwrite attribute values on second occurence
as described in #42
Felix Böhm [readme] behavior of example changed due to #58 4d56157
Forbes Lindesay Add .gitignore ca311d4
Forbes Lindesay Add .gitattributes so tests still work on windows 909a3f1
Forbes Lindesay Normalize line endings f6f93ef
Felix Böhm [tokenizer] recognize the form field (U+0C), drop the carriage return…
… (U+0D)

to be inline with the HTML5 spec

(recognized in cheeriojs/cheerio#242)
Andreas Madsen [Tokenizer] move if context to methods allowing .write to be optimized f8ddbe6
Felix Böhm Merge pull request #61 from AndreasMadsen/optimize
[Tokenizer] move if context to methods allowing .write to be optimized

fixes #60
Felix Böhm [tokenizer] don't save the options object 0219e3a
Felix Böhm [tokenizer] use ternary expressions for simple states 2aae96f
Felix Böhm [tokenizer] added variables for states of _special f6e21dd
Felix Böhm [tokenizer] fixed whitespace f3fb8d7
Felix Böhm [tokenizer] more ternaries bf0eaa4
Felix Böhm [tokenizer] simplified _cleanup a bit 57eb985
Felix Böhm [tokenizer] united some branches 917ecf0
Felix Böhm [tokenizer] get rid of _reconsume
use _index-- instead
Felix Böhm [tokenizer] even more ternaries 4bc1ec4
Andreas Madsen

This commit breaks the tests, I'm quite sure its because SPECIAL_SCRIPT is not a valid _state value.

Thanks, I already saw that but had no time for fixing it. I just had, so the error should be gone :)

and others added some commits August 02, 2013
Felix Böhm [tokenizer] added abstractions for common state types, fixed previous…
… regression
Felix Böhm [tokenizer] added _getSection, completely inlined _emitIfToken, partl…
…y inlined _emitToken
Felix Böhm [tokenizer] simplified _stateInTagName 607c81a
Felix Böhm [tokenizer] simplified _stateInAttributeValueNoQuotes, reordered _sta…
Felix Böhm 3.1.6 bd63b0b
Felix Böhm [tests] added test for second occurance of same attribute
fixes #42
Felix Böhm [tokenizer] started adding support for HTML entities
TODO: so far, only numeric entities are decoded
Felix Böhm [tokenizer] corrected decoding of numeric entities fac2449
Felix Böhm [tokenizer] numeric entities are now decoded
TODO: attribute values aren't handled yet
Felix Böhm [tests] added test case for numeric entities a6fb99e
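
For readers unfamiliar with numeric entities, a rough sketch of what decoding them means. This is a standalone, regex-based illustration, not the tokenizer's state-machine implementation; `decodeNumericEntities` is a hypothetical name, and the HTML5 rules for invalid code points are deliberately left out.

```javascript
// Hypothetical helper: decodes &#NNN; (decimal) and &#xHHH; (hex) references.
function decodeNumericEntities(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Out-of-range references are left untouched in this sketch.
    if (codePoint > 0x10ffff) return match;
    return String.fromCodePoint(codePoint);
  });
}

const decoded = decodeNumericEntities("&#65;&#x42; &#x;");
```

Here `&#65;` and `&#x42;` are decoded, while the empty reference `&#x;` does not match the pattern and stays as plain text.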
Forbes Lindesay Update link to demo bcd00ed
David Rousselie Add startIndex and endIndex positional attributes to the parser c2db3df
Felix Böhm Merge pull request #63 from fasterize/parser_positions
Add startIndex and endIndex positional attributes to the parser
Felix Böhm [tokenizer] renamed the self-closing tags state, moved it to its own …
Felix Böhm [tokenizer] commented out support for entities in attributes
requires adding a new event to make this work, so delayed for now
Felix Böhm [readme] updated benchmark results
switched the results to @AndreasMadsen's htmlparser-benchmark
Felix Böhm [bench] removed internal benchmarks
in favor of htmlparser-benchmark
Felix Böhm [parser] fixed whitespace bc193a6
Felix Böhm [parser] moved common logic to _updatePosition function 2221630
Felix Böhm [tokenizer] renamed IN_ATTRIBUTE_NAME_* states, improved formatting d26e087
Felix Böhm [tokenizer] re-added the carriage return as whitespace
fixes #62

apparently Google's gumbo-parser does behave this way:
Felix Böhm [tokenizer] fixed handling of unparsed data in end(), added support f…
…or several states
Felix Böhm [entities] added maps for normal & legacy entities 3a92796
Felix Böhm [tokenizer] added support for decoding HTML entities in `ontext` events
There is still a number of TODOs:
• support decoding entities in attributes
• when in XML mode, only decode XML entities (skip legacy entities)
• move the decodeMap to a JSON file
Felix Böhm [tests] added test cases for decoding legacy & named entities
both containing one of the longest available entities, to ensure they
are propperly decoded (especially relevant for legacy entities)
Felix Böhm [entities] added map for XML entities 927a9e9
Felix Böhm [tokenizer] added support for XML entities
also moved handling of trailing data to _handleTrailingData() (as it
has to be called recursively now)
Felix Böhm [tests] also test trailing data support in the numeric entity test b60cf04
Felix Böhm [tokenizer] fixed handling non-existent entities e45e4ec
Felix Böhm [tests] added test case for XML entities 12edc94
Felix Böhm [tokenizer] added _emitEntity
as a preparation for supporting decoding entities in attribute values
Felix Böhm 3.2.0 076fcf7
Felix Böhm [tokenizer] moved decodeMap to entities/decode.json f46765d
Felix Böhm [tokenizer] renamed _emitEntity to _emitPartial 389102d
Felix Böhm [index] statically export Parser, Tokenizer and DomHandler 6ca87ff
Felix Böhm [parser] use String#search and String#substr instead of String#split
vastly improves performance of onprocessinginstruction and ondeclaration
Felix Böhm [parser] added onattribdata and onattribend events, dropped onattribv…
Felix Böhm [tokenizer] enable support for decoding entities in attributes, added…
… onattribend and onattribdata events, removed onattribvalue
Felix Böhm [tests] added test case for entities in attributes feafd9d
Felix Böhm 3.2.1 311e48e
Felix Böhm [tokenizer] don't decode entities in special tags e2fa485
Felix Böhm 3.2.2 36ee76e
Felix Böhm [tokenizer] reintroduced _special, removed IN_SCRIPT and IN_STYLE
also fixed some semantics
Felix Böhm 3.2.3 effc3a9
Felix Böhm only respect self-closing tags in XML mode e4fb613
Felix Böhm [parser] properly removed self-closing tag support
also replaced call to `Array#slice` with setting the stack's `length`
Felix Böhm [tests] read files in the tests file, improved os interoperability of…
… stream test
Felix Böhm [tests] added helper.getCallback method be0dafa
Felix Böhm [tests] converted tests to mocha b948e86
Felix Böhm [tests] renamed tests dir to `test`
as required by mocha
Felix Böhm [package] run mocha as the test script 96a00fb
Felix Böhm Delete .DS_Store 41ad914
Felix Böhm [tokenizer] emit `onattribdata` in `_handleTrailingData`
fixes #66
Felix Böhm [tests] simplifications 336af9b
Felix Böhm 3.2.4 fc0918c
Felix Böhm [readme] updated performance characteristics 7b1e4c9
Felix Böhm [tokenizer] handle `<<` correctly 76643d3
Felix Böhm 3.2.5 2f24491
Felix Böhm [tests] added test case for cheeriojs/cheerio#247 834d6d2
Felix Böhm update to DomHandler@2.1, updated FeedHandler accordingly, bump 994cfda
Felix Böhm [tests] write only single characters for testing chunked data
failed previously (only for FeedHandler tests), fixed now due to
DomHandler upgrade (which removed the `ignoreWhitespace` option)
Felix Böhm [package] require domutils@1.2
as requested in fb55/CSSselect#11
Felix Böhm package: update readable-stream e6418c2
Felix Böhm package: use simple `license` field 0e5775c
Felix Böhm replace non-breaking space with regular space
as requested in #70
Felix Böhm index: pass `options` argument to constructors c9d4abe
Felix Böhm tests: remove unused `cb` argument 298546c
Felix Böhm feedhandler: wrap assignments f9bc72f
Felix Böhm tests: changed indentation to tabs 5f244df
Felix Böhm package: updated dom module versions, 3.4.0 7153b27
Patrick Steele-Idem #73 Added support for recognizing self-closing tags and CDATA in non-…
…XML mode
Patrick Steele-Idem Fix option to disable lower case tags and attars in non-XML mode 6c173b8
Patrick Steele-Idem Added this._lowerCaseTagNames and this._lowerCaseAttributeNames 357be1d
Patrick Steele-Idem Handle case where options is null and allow truthy values bdb1273
Tim Roediger Add self-closeing svg tags ea8b652
Felix Böhm Merge pull request #75 from superdweebie/master
Add self-closeing svg tags
Patrick Steele-Idem Switched to using "in" operator for options adfaafb
Patrick Steele-Idem Merged options initialization into a single line 54f33ad
Felix Böhm Merge pull request #74 from patrick-steele-idem/master
Fix existing and add new options
Felix Böhm 3.4.1 ad22179
Felix Böhm parser: adjusted whitespace, fixed _updatePosition 40b9cb1
Felix Böhm 3.5.0
3.4.1 was a mistake & was unpublished. The changes require at least a
minor version update.
Felix Böhm Delete .DS_Store 3ba7059
Felix Böhm tokenizer: Fixed handling of text containing `&` when decoding entities f4091b2
Felix Böhm 3.5.1 8006c5b
Felix Böhm readme: use badges
look much better on retina displays ('cause svg)
Felix Böhm improved style edde16b
Felix Böhm test: load FeedHandler from index.js
as RssHandler, so that’s tested, too
Felix Böhm use jshint 01c567a
Felix Böhm deleted .gitignore 0196598
Felix Böhm tests: added several test cases
with more to come
Felix Böhm tokenizer: removed unavoidable branch 62a17bc
Felix Böhm index: removed unnecessary `parser` variable 56a79e5
Felix Böhm parser: moved some shared logic to _getInstructionName 76d00e5
Felix Böhm tokenizer: fixed bug in attribute values without tags 3434286
Felix Böhm tests: added, extended test cases dd658ba
Felix Böhm tests: added file for general API tests 7002a0f
Felix Böhm tokenizer: added specialized characterState function b7ac8f5
Felix Böhm tokenizer: reconsume characters in ifElseState()
fixes handling of <![CD> and friends
Felix Böhm tokenizer: fixed boundaries of legacy entities 645a6ef
Felix Böhm test: added/extended test cases dcb1d89
Felix Böhm tokenizer: reconsume last token when not in CDATA 35c1dbe
Felix Böhm index: fixed typo 0faca5d
Felix Böhm moved _ended logic from parser to tokenizer 39bea1d
Felix Böhm test: extended test cases 24d6936
Felix Böhm implement .pause/.resume in parser, fixed the implementation, added test 100d86e
Felix Böhm tokenizer: ignore unfinished tags in _handleTrailingData 391dd0a
Felix Böhm tokenizer: fixed handling of empty numeric entities
eg. &#x;
Felix Böhm test: added test for .resume() without any data written d634ab3
Felix Böhm 3.6.0 d34cfe9
Felix Böhm track coverage on coveralls 9a910f9
Felix Böhm readme: moved testing-related badges to new line 75f602e
Felix Böhm tokenizer: use entity maps of `entities`
I don't want to maintain them a second time, plus, the overhead isn't
too much.
Felix Böhm 3.7.0 8ff1a55
Felix Böhm package: use domutils@1.4 270c2cd
Felix Böhm 3.7.1 629dabb
Felix Böhm readme: use travis' svg badge 3ceb39f