HtmlHandler, for normalizing tag cases #24

wants to merge 471 commits into

10 participants


As I thought through #20 and #22, I realized that the problem was not the parser itself, but the results the parser created. Rather than hacking on the parser and breaking things like RSS/XML support, I decided a better approach would be to create another handler, called HtmlHandler. It embraces the case-insensitive nature of HTML tags and calls toUpperCase() on all tag names to respect the standard. When reserializing, the printHtml method (provided by tomdz) now calls toLowerCase() on all tags, because it's printing HTML, not XML/RSS.
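
To make the idea concrete, here is a minimal sketch of that normalization. It is not the code from this PR: `SketchHtmlHandler` and `printHtmlSketch` are hypothetical names, the node shape is an assumption, and the events are fed in by hand rather than by the parser. It only shows tag names being stored in upper case and printed back in lower case.

```javascript
// Hypothetical sketch, NOT the PR's actual HtmlHandler or printHtml.
// It illustrates one idea only: store tag names upper-cased, print them lower-cased.
function SketchHtmlHandler() {
  this.dom = [];     // top-level nodes
  this._stack = [];  // currently open elements
}

// Append a node to the innermost open element, or to the top level.
SketchHtmlHandler.prototype._add = function (node) {
  var parent = this._stack[this._stack.length - 1];
  (parent ? parent.children : this.dom).push(node);
};

// Opening tag: normalize the name to upper case before storing it.
SketchHtmlHandler.prototype.onopentag = function (name, attribs) {
  var element = {
    type: "tag",
    name: name.toUpperCase(),
    attribs: attribs || {},
    children: []
  };
  this._add(element);
  this._stack.push(element);
};

SketchHtmlHandler.prototype.ontext = function (data) {
  this._add({ type: "text", data: data });
};

// Closing tag: compare against the normalized name, so </SPAN> closes <sPaN>.
SketchHtmlHandler.prototype.onclosetag = function (name) {
  var top = this._stack[this._stack.length - 1];
  if (top && top.name === name.toUpperCase()) this._stack.pop();
};

// Serializer: lower-cases tag names because the output is HTML, not XML/RSS.
// Attribute values are written as-is (no escaping) to keep the sketch short.
function printHtmlSketch(nodes) {
  return nodes.map(function (node) {
    if (node.type === "text") return node.data;
    var tag = node.name.toLowerCase();
    var attrs = Object.keys(node.attribs).map(function (key) {
      return " " + key + '="' + node.attribs[key] + '"';
    }).join("");
    return "<" + tag + attrs + ">" + printHtmlSketch(node.children) + "</" + tag + ">";
  }).join("");
}

// Events a parser might emit for `<DiV><sPaN>hi</SPAN></div>`.
var handler = new SketchHtmlHandler();
handler.onopentag("DiV", {});
handler.onopentag("sPaN", {});
handler.ontext("hi");
handler.onclosetag("SPAN");
handler.onclosetag("div");

console.log(handler.dom[0].name);          // DIV
console.log(printHtmlSketch(handler.dom)); // <div><span>hi</span></div>
```

Keeping the normalization in the handler, as the description proposes, means the parser still reports tag names exactly as written, so XML/RSS consumers that depend on case are unaffected.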

I've updated all tests and added a few that cover tags with mixed case. This fork is currently in production on

Please let me know any thoughts, as I'm more than willing to hear alternate opinions!

fb55 and others added some commits Jun 2, 2012
@fb55 fb55 removed switch in Stream.js 7750ec1
@fb55 fb55 fixed whitespace 04476a0
@fb55 fb55 quick fix for #19 18d3f37
@lahmatiy lahmatiy Fix getOuterHTML for directives 69c9f0f
@fb55 fb55 Merge pull request #21 from lahmatiy/master
fix of htmlparser.DomUtils.getOuterHTML for directives
@fb55 fb55 added lowerCaseAttributeNames option
yep, it's insanely short
@fb55 fb55 2.3.0 e0d359e
@fb55 fb55 Added a `onopentagend` event
to get a signal when there won't be any more attributes coming
@fb55 fb55 moved DomHandler & DomUtils to their own module
they are now available as `domhandler`
@fb55 fb55 Updated readme c0b7eda
@fb55 fb55 2.3.1 a928109
@fb55 fb55 publish the element types from DomHandler b90c1e6
@fb55 fb55 use numeric element types
'cause numbers are faster to compare

NOT breaking due to last commit
@fb55 fb55 don't expose HandlerModule 401cc09
@fb55 fb55 fixed travis badge f5925c9
@fb55 fb55 stylistic changes 181c31b
@fb55 fb55 use the new dom modules, 2.5.0
Attention: The DOM changes slightly.
@myndzi myndzi Made the attribute regular expression more correct with regards to un…
…quoted attribute values.

Require self-closing tags to be void
@myndzi myndzi I didn't understand how RegExps worked in this way, and was desynchin…
…g the attributes count. Here's a different way to accomplish the same thing.
@fb55 fb55 Revert "stylistic changes"
This reverts commit 181c31b.
@fb55 fb55 Revert "Revert "stylistic changes""
This reverts commit f7b6d54.
@fb55 fb55 added missing comma in benchmark script 6730fde
@fb55 fb55 domelementtype must be version 1.x (not 1.0) 840291e
@fb55 fb55 2.5.1 46cd546
Kris Reeves Merge branch 'master' of a68f329
Kris Reeves Better handling of implied close tags. A list is given of tags whose …
…close is implied by other tags being opened, and these are closed when those tags are opened. This helps correctly parse things like lists and tables with unterminated LI or TD tags.
Kris Reeves spaces -> tabs, thought the merge would update my local files to the …
…correct spacing (and tried to match that)
Kris Reeves Derp. a126b18
@fb55 fb55 added missing comma in benchmark script 5a72c28
@fb55 fb55 domelementtype must be version 1.x (not 1.0) eca12d8
@fb55 fb55 2.5.1 7f0389f
@jugglinmike jugglinmike Recognize closing CDATA tags as end of "special"
This allows for correct parsing of text that directly follows CDATA tags
@fb55 fb55 Merge pull request #31 from jugglinmike/text-after-cdata
Recognize closing CDATA tags as end of "special"
@fb55 fb55 test on node 0.6, 0.8 & 0.9 d21706b
@fb55 fb55 FeedHandler should return an error when nothing's found 4dc73a5
@fb55 fb55 added missing semicolon in test-helper.js e976099
@fb55 fb55 improved how tests are run 36650b8
@fb55 fb55 don't run 03-rdf.js test
it currently fails, requires investigation
@fb55 fb55 renamed tests 0746690
@fb55 fb55 added semicolons & use EE#on in 02-stream.js d1d9cae
@fb55 fb55 changed how the end of all tests is shown 7c77a1f
@fb55 fb55 allow `>` at the beginning of a document
fixes #25

also allows `>`s to be at the beginning of text or after a `>`.
@fb55 fb55 2.5.2 f707bd7
Kris Reeves Merge remote-tracking branch 'upstream/master'
Kris Reeves Tests for changes. 05a99ef
Kris Reeves Fixes discussed in fe6b8d6
@fb55 fb55 Merge pull request #28 from myndzi/master
basic support for implied close tags, bugfix for attribute values containing a slash at the end being recognized as self-closing tags.
@fb55 fb55 Update
The example checked the `language` attribute. Changed it to `type`.
@jugglinmike jugglinmike Do not parse CDATA-like text inside special tags
Special nodes (e.g. script tags, style tags, comment nodes, etc.) can
contain only text nodes.
@fb55 fb55 Merge pull request #32 from jugglinmike/cdata-inside-special
Do not parse CDATA-like text inside special tags
@fb55 fb55 2.6.0 8756001
@fb55 fb55 landed first version of FSM based tokenizer
fsm style taken from creationix/jsonparse

support for special tags (<script> & <style>) is missing
eonlepapillon Add a new test for issue #36
Only finds first attribute when there is no whitespace between

- Added a html example
- Added a test
@fb55 fb55 Merge pull request #37 from eonlepapillon/Add-test-for-Issue-#36
Add a new test for issue #36
@fb55 fb55 added logic for special tags d90e7a3
@fb55 fb55 [tokenizer] don't fail on `< >` and `< / >`
they are now emitted as text
@fb55 fb55 [tokenizer] fixed ordering in cleanup 1bc6568
@fb55 fb55 [tokenizer] overwrite WritableStream#end, emit everything that's left 400bf43
@fb55 fb55 [tokenizer] take care of this._index in cleanup, emit all text 550b42e
@fb55 fb55 [tokenizer] set _sectionStart to 0 when text was emitted dabe165
@fb55 fb55 [tokenizer] call WritableStream#end after emitting the remaining data b9d568a
@fb55 fb55 [tokenizer] call .write instead of ._write 1144e42
@fb55 fb55 [parser] use the tokenizer c3d4025
@fb55 fb55 removed WritableStream.js and ElementType.js
both aren't needed anymore
@fb55 fb55 [parser] made Parser#reset work again
absolutely awful.
@fb55 fb55 fall back to the readable-stream module 5c155ca
@fb55 fb55 [travis] removed 0.6 & 0.9, added 0.10 and 0.11 5a28547
@fb55 fb55 minor changes c445375
@fb55 fb55 [index.js] removed redundant code 1ab593a
@fb55 fb55 [stream] use a named function
fixes export
@fb55 fb55 3.0.0
also updated domutils version & specified main-field
@fb55 fb55 [tokenizer] always call WritableStream#end b48adc2
@fb55 fb55 [parser] call Tokenizer#end, clear the stack 17b7ebe
@fb55 fb55 [index.js] added `createDomStream()` convenience method 654c4d4
@fb55 fb55 [tokenizer] added `opentagend` event 628b99e
@fb55 fb55 [parser] use `opentagend` event f70f545
@fb55 fb55 3.0.1 b7cc1aa
@fb55 fb55 [tokenizer] emit opentagend on selfclosing tags, fixed handling of < …
…in special tags
@fb55 fb55 [index.js] added tokenizer 94e794f
@fb55 fb55 [tests] text events now contain more data 9793593
@fb55 fb55 [tokenizer] don't inherit from stream.Writable, fixed several bugs ab8b653
@fb55 fb55 [tests/events] concat text events 09b8833
@fb55 fb55 [tests/events] fixed order of attribute/opentag events, merged text e…
@fb55 fb55 [tokenizer] use strings instead of buffers
has a huge impact on speed
@fb55 fb55 [parser] don't implement stream.Writable, use new tokenizer interface b837b95
@fb55 fb55 [tests/stream] fixed order of events db95f00
@fb55 fb55 [tokenizer] simplified logic e4982e1
@fb55 fb55 [parser] fixed handling of implied closing and empty tags 1905dd3
@fb55 fb55 [tests/events] accidentally removed part of the document 70c6865
@fb55 fb55 added a WritableStream interface again
this time, it's implementing stream.Writable
@fb55 fb55 3.0.0 (finally!)
the 3.x releases before were crappy, and I will deny to have published
@fb55 fb55 [tokenizer] changed internal name to `Tokenizer` 1db8148
burl [tokenizer] fix for script tags causing following nodes to be interpr…
…eted as TEXT

* this._special reverted to 0 after "closetag" event
[02-template.json] added <p>...</p> around script tag to ensure that closing </p> is seen as a tag rather than text node
@fb55 fb55 [proxyhandler] don't use getters/setters 9898b9a
@fb55 fb55 added CollectingHandler
collects all events and passes them through to another handler

can simulate a reset for the underlying handler using the `restart()`
@fb55 fb55 [tests] use the new CollectingHandler 01d8adf
@fb55 fb55 [tests] removed unused `f` var f2542db
@fb55 fb55 3.0.1 fcb35f0
@fb55 fb55 Merge pull request #38 from burl/master
fix for script tags
@fb55 fb55 3.0.2 779e608
fb55 and others added some commits Jun 14, 2013
@fb55 fb55 [tokenizer] use `continue` instead of decreasing the index 87c6f2b
@fb55 fb55 [bench] removed unnecessary noop functions 7608c11
@fb55 fb55 [tokenizer] improved handling of remaining data d00b391
@fb55 fb55 [readme] it~~'~~s 863183a
@ForbesLindesay ForbesLindesay Add parseDOM and parseFeed helper methods 77bf0ae
@fb55 fb55 Merge pull request #55 from ForbesLindesay/patch-1
Add parseDOM and parseFeed helper methods
@ForbesLindesay ForbesLindesay Add link to live demo 16aef00
@fb55 fb55 Merge pull request #56 from ForbesLindesay/patch-1
Add link to live demo
@fb55 fb55 [parser] default options & cbs to empty objects
fixes #57
@fb55 fb55 3.1.4 529f727
Zach Smith [tokenizer] fix case where `<` followed by whitespace doesn't parse c…
@fb55 fb55 Merge pull request #58 from xcoderzach/master
[tokenizer] fix case where `<` followed by whitespace doesn't parse
@fb55 fb55 3.1.5 830c157
@fb55 fb55 [parser] don't overwrite attribute values on second occurrence
as described in #42
@fb55 fb55 [readme] behavior of example changed due to #58 4d56157
@ForbesLindesay ForbesLindesay Add .gitignore ca311d4
@ForbesLindesay ForbesLindesay Add .gitattributes so tests still work on windows 909a3f1
@ForbesLindesay ForbesLindesay Normalize line endings f6f93ef
@fb55 fb55 [tokenizer] recognize the form feed (U+0C), drop the carriage return…
… (U+0D)

to be in line with the HTML5 spec

(recognized in cheeriojs/cheerio#242)
@AndreasMadsen AndreasMadsen [Tokenizer] move if context to methods allowing .write to be optimized f8ddbe6
@fb55 fb55 Merge pull request #61 from AndreasMadsen/optimize
[Tokenizer] move if context to methods allowing .write to be optimized

fixes #60
@fb55 fb55 [tokenizer] don't save the options object 0219e3a
@fb55 fb55 [tokenizer] use ternary expressions for simple states 2aae96f
@fb55 fb55 [tokenizer] added variables for states of _special f6e21dd
@fb55 fb55 [tokenizer] fixed whitespace f3fb8d7
@fb55 fb55 [tokenizer] more ternaries bf0eaa4
@fb55 fb55 [tokenizer] simplified _cleanup a bit 57eb985
@fb55 fb55 [tokenizer] united some branches 917ecf0
@fb55 fb55 [tokenizer] get rid of _reconsume
use _index-- instead
@fb55 fb55 [tokenizer] even more ternaries 4bc1ec4
@fb55 fb55 [tokenizer] added abstractions for common state types, fixed previous…
… regression
@fb55 fb55 [tokenizer] added _getSection, completely inlined _emitIfToken, partl…
…y inlined _emitToken
@fb55 fb55 [tokenizer] simplified _stateInTagName 607c81a
@fb55 fb55 [tokenizer] simplified _stateInAttributeValueNoQuotes, reordered _sta…
@fb55 fb55 3.1.6 bd63b0b
@fb55 fb55 [tests] added test for second occurrence of same attribute
fixes #42
@fb55 fb55 [tokenizer] started adding support for HTML entities
TODO: so far, only numeric entities are decoded
@fb55 fb55 [tokenizer] corrected decoding of numeric entities fac2449
@fb55 fb55 [tokenizer] numeric entities are now decoded
TODO: attribute values aren't handled yet
@fb55 fb55 [tests] added test case for numeric entities a6fb99e
@ForbesLindesay ForbesLindesay Update link to demo bcd00ed
@dax dax Add startIndex and endIndex positional attributes to the parser c2db3df
@fb55 fb55 Merge pull request #63 from fasterize/parser_positions
Add startIndex and endIndex positional attributes to the parser
@fb55 fb55 [tokenizer] renamed the self-closing tags state, moved it to its own …
@fb55 fb55 [tokenizer] commented out support for entities in attributes
requires adding a new event to make this work, so delayed for now
@fb55 fb55 [readme] updated benchmark results
switched the results to @AndreasMadsen's htmlparser-benchmark
@fb55 fb55 [bench] removed internal benchmarks
in favor of htmlparser-benchmark
@fb55 fb55 [parser] fixed whitespace bc193a6
@fb55 fb55 [parser] moved common logic to _updatePosition function 2221630
@fb55 fb55 [tokenizer] renamed IN_ATTRIBUTE_NAME_* states, improved formatting d26e087
@fb55 fb55 [tokenizer] re-added the carriage return as whitespace
fixes #62

apparently Google's gumbo-parser does behave this way:
@fb55 fb55 [tokenizer] fixed handling of unparsed data in end(), added support f…
…or several states
@fb55 fb55 [entities] added maps for normal & legacy entities 3a92796
@fb55 fb55 [tokenizer] added support for decoding HTML entities in `ontext` events
There are still a number of TODOs:
• support decoding entities in attributes
• when in XML mode, only decode XML entities (skip legacy entities)
• move the decodeMap to a JSON file
@fb55 fb55 [tests] added test cases for decoding legacy & named entities
both containing one of the longest available entities, to ensure they
are properly decoded (especially relevant for legacy entities)
@fb55 fb55 [entities] added map for XML entities 927a9e9
@fb55 fb55 [tokenizer] added support for XML entities
also moved handling of trailing data to _handleTrailingData() (as it
has to be called recursively now)
@fb55 fb55 [tests] also test trailing data support in the numeric entity test b60cf04
@fb55 fb55 [tokenizer] fixed handling non-existent entities e45e4ec
@fb55 fb55 [tests] added test case for XML entities 12edc94
@fb55 fb55 [tokenizer] added _emitEntity
as a preparation for supporting decoding entities in attribute values
@fb55 fb55 3.2.0 076fcf7
@fb55 fb55 [tokenizer] moved decodeMap to entities/decode.json f46765d
@fb55 fb55 [tokenizer] renamed _emitEntity to _emitPartial 389102d
@fb55 fb55 [index] statically export Parser, Tokenizer and DomHandler 6ca87ff
@fb55 fb55 [parser] use String#search and String#substr instead of String#split
vastly improves performance of onprocessinginstruction and ondeclaration
@fb55 fb55 [parser] added onattribdata and onattribend events, dropped onattribv…
@fb55 fb55 [tokenizer] enable support for decoding entities in attributes, added…
… onattribend and onattribdata events, removed onattribvalue
@fb55 fb55 [tests] added test case for entities in attributes feafd9d
@fb55 fb55 3.2.1 311e48e
@fb55 fb55 [tokenizer] don't decode entities in special tags e2fa485
@fb55 fb55 3.2.2 36ee76e
@fb55 fb55 [tokenizer] reintroduced _special, removed IN_SCRIPT and IN_STYLE
also fixed some semantics
@fb55 fb55 3.2.3 effc3a9
@fb55 fb55 only respect self-closing tags in XML mode e4fb613
@fb55 fb55 [parser] properly removed self-closing tag support
also replaced call to `Array#slice` with setting the stack's `length`
@fb55 fb55 [tests] read files in the tests file, improved os interoperability of…
… stream test
@fb55 fb55 [tests] added helper.getCallback method be0dafa
@fb55 fb55 [tests] converted tests to mocha b948e86
@fb55 fb55 [tests] renamed tests dir to `test`
as required by mocha
@fb55 fb55 [package] run mocha as the test script 96a00fb
@fb55 fb55 Delete .DS_Store 41ad914
@fb55 fb55 [tokenizer] emit `onattribdata` in `_handleTrailingData`
fixes #66
@fb55 fb55 [tests] simplifications 336af9b
@fb55 fb55 3.2.4 fc0918c
@fb55 fb55 [readme] updated performance characteristics 7b1e4c9
@fb55 fb55 [tokenizer] handle `<<` correctly 76643d3
@fb55 fb55 3.2.5 2f24491
@fb55 fb55 [tests] added test case for cheeriojs/cheerio#247 834d6d2
@fb55 fb55 update to DomHandler@2.1, updated FeedHandler accordingly, bump 994cfda
@fb55 fb55 [tests] write only single characters for testing chunked data
failed previously (only for FeedHandler tests), fixed now due to
DomHandler upgrade (which removed the `ignoreWhitespace` option)
@fb55 fb55 [package] require domutils@1.2
as requested in fb55/css-select#11
@fb55 fb55 package: update readable-stream e6418c2
@fb55 fb55 package: use simple `license` field 0e5775c
@fb55 fb55 replace non-breaking space with regular space
as requested in #70
@fb55 fb55 index: pass `options` argument to constructors c9d4abe
@fb55 fb55 tests: remove unused `cb` argument 298546c
@fb55 fb55 feedhandler: wrap assignments f9bc72f
@fb55 fb55 tests: changed indentation to tabs 5f244df
@fb55 fb55 package: updated dom module versions, 3.4.0 7153b27