
HtmlHandler, for normalizing tag cases #24

Open · 471 commits · 10 participants

@kirbysayshi

As I thought through #20 and #22, I realized that the problem was not with the parser itself, but with the results the parser produced. Rather than hacking on the parser and breaking things like RSS/XML support, I decided a better approach would be to create another handler, called HtmlHandler. It embraces the case-insensitive nature of HTML tags and converts all tag names to upper case with toUpperCase() to respect the standard. When reserializing, the printHtml method (provided by tomdz) now converts all tag names to lower case with toLowerCase(), because it prints HTML, not XML/RSS.

I've updated all tests and added a few that cover tags with mixed case. This fork is currently in production on https://citational.com.
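To make the idea concrete, here is a minimal sketch of the two halves described above: normalizing tag names to upper case when the tree is built, and lower-casing them again on output. It is not the HtmlHandler code from this pull request. The node shape (`type`, `name`, `attribs`, `children`, `data`) and the function `normalizeTagNames` are assumptions made for illustration, and this `printHtml` is a simplified stand-in that does no escaping.

```javascript
// Hypothetical sketch, not the actual HtmlHandler from this pull request.
// Assumed node shapes: { type: "tag", name, attribs, children } or { type: "text", data }.

// Return a copy of the tree with every tag name converted to upper case.
function normalizeTagNames(nodes) {
  return nodes.map(function (node) {
    if (node.type !== "tag") return node;
    return {
      type: "tag",
      name: node.name.toUpperCase(),
      attribs: node.attribs || {},
      children: normalizeTagNames(node.children || [])
    };
  });
}

// Serialize the tree back to HTML, lower-casing tag names on output.
// Simplified: attribute values and text are not escaped.
function printHtml(nodes) {
  return nodes.map(function (node) {
    if (node.type === "text") return node.data;
    var name = node.name.toLowerCase();
    var attribs = node.attribs || {};
    var attribString = Object.keys(attribs).map(function (key) {
      return " " + key + '="' + attribs[key] + '"';
    }).join("");
    return "<" + name + attribString + ">" +
      printHtml(node.children || []) +
      "</" + name + ">";
  }).join("");
}

// Mixed-case input, as a parser might report it when case is preserved.
var tree = normalizeTagNames([
  { type: "tag", name: "DiV", attribs: { id: "a" }, children: [
    { type: "tag", name: "sPaN", children: [{ type: "text", data: "hi" }] }
  ] }
]);

console.log(tree[0].name);    // DIV
console.log(printHtml(tree)); // <div id="a"><span>hi</span></div>
```

Because the normalization lives in the handler rather than the parser, XML and RSS input that relies on case-sensitive names can keep using a different handler unchanged.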

Please let me know any thoughts, as I'm more than willing to hear alternate opinions!

fb55 and others added some commits
@fb55 fb55 removed switch in Stream.js 7750ec1
@fb55 fb55 fixed whitespace 04476a0
@fb55 fb55 quick fix for #19 18d3f37
@lahmatiy lahmatiy Fix getOuterHTML for directives 69c9f0f
@fb55 fb55 Merge pull request #21 from lahmatiy/master
fix of htmlparser.DomUtils.getOuterHTML for directives
f8e6aad
@fb55 fb55 added lowerCaseAttributeNames option
yep, it's insanely short
82455a9
@fb55 fb55 2.3.0 e0d359e
@fb55 fb55 Added a `onopentagend` event
to get a signal when there won't be any more attributes coming
a8c13c8
@fb55 fb55 moved DomHandler & DomUtils to their own module
they are now available as `domhandler`
c1dfdda
@fb55 fb55 Updated readme c0b7eda
@fb55 fb55 2.3.1 a928109
@fb55 fb55 publish the element types from DomHandler b90c1e6
@fb55 fb55 use numeric element types
because numbers are faster to compare

NOT breaking due to last commit
b6c4a73
@fb55 fb55 don't expose HandlerModule 401cc09
@fb55 fb55 fixed travis badge f5925c9
@fb55 fb55 stylistic changes 181c31b
@fb55 fb55 use the new dom modules, 2.5.0
Attention: The DOM changes slightly.
84012d6
@myndzi myndzi Made the attribute regular expression more correct with regards to unquoted attribute values.

Require self-closing tags to be void
b3bc413
@myndzi myndzi I didn't understand how RegExps worked in this way, and was desynching the attributes count. Here's a different way to accomplish the same thing.
0f71a49
@fb55 fb55 Revert "stylistic changes"
This reverts commit 181c31b.
f7b6d54
@fb55 fb55 Revert "Revert "stylistic changes""
This reverts commit f7b6d54.
c75da20
@fb55 fb55 added missing comma in benchmark script 6730fde
@fb55 fb55 domelementtype must be version 1.x (not 1.0) 840291e
@fb55 fb55 2.5.1 46cd546
Kris Reeves Merge branch 'master' of https://github.com/fb55/node-htmlparser a68f329
Kris Reeves Better handling of implied close tags. A list is given of tags whose close is implied by other tags being opened, and these are closed when those tags are opened. This helps correctly parse things like lists and tables with unterminated LI or TD tags.
a83c708
Kris Reeves spaces -> tabs, thought the merge would update my local files to the correct spacing (and tried to match that)
a1777a9
Kris Reeves Derp. a126b18
@fb55 fb55 added missing comma in benchmark script 5a72c28
@fb55 fb55 domelementtype must be version 1.x (not 1.0) eca12d8
@fb55 fb55 2.5.1 7f0389f
@jugglinmike jugglinmike Recognize closing CDATA tags as end of "special"
This allows for correct parsing of text that directly follows CDATA tags
8df87ab
@fb55 fb55 Merge pull request #31 from jugglinmike/text-after-cdata
Recognize closing CDATA tags as end of "special"
ef8b078
@fb55 fb55 test on node 0.6, 0.8 & 0.9 d21706b
@fb55 fb55 FeedHandler should return an error when nothing's found 4dc73a5
@fb55 fb55 added missing semicolon in test-helper.js e976099
@fb55 fb55 improved how tests are run 36650b8
@fb55 fb55 don't run 03-rdf.js test
it currently fails, requires investigation
610da2c
@fb55 fb55 renamed tests 0746690
@fb55 fb55 added semicolons & use EE#on in 02-stream.js d1d9cae
@fb55 fb55 changed how the end of all tests is shown 7c77a1f
@fb55 fb55 allow `>` at the beginning of a document
fixes #25

also allows `>`s to be at the beginning of text or after a `>`.
0494e90
@fb55 fb55 2.5.2 f707bd7
Kris Reeves Merge remote-tracking branch 'upstream/master'
Conflicts:
	package.json
2fc40c5
Kris Reeves Tests for changes. 05a99ef
Kris Reeves Fixes discussed in https://github.com/fb55/node-htmlparser/pull/28 fe6b8d6
@fb55 fb55 Merge pull request #28 from myndzi/master
basic support for implied close tags, bugfix for attribute values containing a slash at the end being recognized as self-closing tags.
33d55cd
@fb55 fb55 Update README.md
The example checked the `language` attribute. Changed it to `type`.
f162767
@jugglinmike jugglinmike Do not parse CDATA-like text inside special tags
Special nodes (e.g. script tags, style tags, comment nodes, etc.) can
contain only text nodes.
c0bd69c
@fb55 fb55 Merge pull request #32 from jugglinmike/cdata-inside-special
Do not parse CDATA-like text inside special tags
5b096bf
@fb55 fb55 2.6.0 8756001
@fb55 fb55 landed first version of FSM based tokenizer
fsm style taken from creationix/jsonparse

support for special tags (<script> & <style>) is missing
5e6fcb3
eonlepapillon Add a new test for issue #36
Only finds first attribute when there is no whitespace between
attributes.

- Added a html example
- Added a test
7be1360
@fb55 fb55 Merge pull request #37 from eonlepapillon/Add-test-for-Issue-#36
Add a new test for issue #36
833432b
@fb55 fb55 added logic for special tags d90e7a3
@fb55 fb55 [tokenizer] don't fail on `< >` and `< / >`
they are now emitted as text
aa19a0b
@fb55 fb55 [tokenizer] fixed ordering in cleanup 1bc6568
@fb55 fb55 [tokenizer] overwrite WritableStream#end, emit everything that's left 400bf43
@fb55 fb55 [tokenizer] take care of this._index in cleanup, emit all text 550b42e
@fb55 fb55 [tokenizer] set _sectionStart to 0 when text was emitted dabe165
@fb55 fb55 [tokenizer] call WritableStream#end after emitting the remaining data b9d568a
@fb55 fb55 [tokenizer] call .write instead of ._write 1144e42
@fb55 fb55 [parser] use the tokenizer c3d4025
@fb55 fb55 removed WritableStream.js and ElementType.js
both aren't needed anymore
627a38b
@fb55 fb55 [parser] made Parser#reset work again
absolutely awful.
358944e
@fb55 fb55 fall back to the readable-stream module 5c155ca
@fb55 fb55 [travis] removed 0.6 & 0.9, added 0.10 and 0.11 5a28547
@fb55 fb55 minor changes c445375
@fb55 fb55 [index.js] removed redundant code 1ab593a
@fb55 fb55 [stream] use a named function
fixes export
f78d1ed
@fb55 fb55 3.0.0
also updated domutils version & specified main-field
1b6a264
@fb55 fb55 [tokenizer] always call WritableStream#end b48adc2
@fb55 fb55 [parser] call Tokenizer#end, clear the stack 17b7ebe
@fb55 fb55 [index.js] added `createDomStream()` convenience method 654c4d4
@fb55 fb55 [tokenizer] added `opentagend` event 628b99e
@fb55 fb55 [parser] use `opentagend` event f70f545
@fb55 fb55 3.0.1 b7cc1aa
@fb55 fb55 [tokenizer] emit opentagend on selfclosing tags, fixed handling of < in special tags
acc0d05
@fb55 fb55 [index.js] added tokenizer 94e794f
@fb55 fb55 [tests] text events now contain more data 9793593
@fb55 fb55 [tokenizer] don't inherit from stream.Writable, fixed several bugs ab8b653
@fb55 fb55 [tests/events] concat text events 09b8833
@fb55 fb55 [tests/events] fixed order of attribute/opentag events, merged text events
00d63cf
@fb55 fb55 [tokenizer] use strings instead of buffers
has a huge impact on speed
643a7f0
@fb55 fb55 [parser] don't implement stream.Writable, use new tokenizer interface b837b95
@fb55 fb55 [tests/stream] fixed order of events db95f00
@fb55 fb55 [tokenizer] simplified logic e4982e1
@fb55 fb55 [parser] fixed handling of implied closing and empty tags 1905dd3
@fb55 fb55 [tests/events] accidentally removed part of the document 70c6865
@fb55 fb55 added a WritableStream interface again
this time, it's implementing stream.Writable
4a7eb12
@fb55 fb55 3.0.0 (finally!)
the earlier 3.x releases were crappy, and I will deny having published
them
a23d7a6
@fb55 fb55 [tokenizer] changed internal name to `Tokenizer` 1db8148
burl [tokenizer] fix for script tags causing following nodes to be interpreted as TEXT

* this._special reverted to 0 after "closetag" event
[02-template.json] added <p>...</p> around script tag to ensure that closing </p> is seen as a tag rather than text node
b7f6df5
@fb55 fb55 [proxyhandler] don't use getters/setters 9898b9a
@fb55 fb55 added CollectingHandler
collects all events and passes them through to another handler

can simulate a reset for the underlying handler using the `restart()`
method
84815a3
@fb55 fb55 [tests] use the new CollectingHandler 01d8adf
@fb55 fb55 [tests] removed unused `f` var f2542db
@fb55 fb55 3.0.1 fcb35f0
@fb55 fb55 Merge pull request #38 from burl/master
fix for script tags
605aa6c
@fb55 fb55 3.0.2 779e608
@fb55 fb55 [bench] use setImmediate instead of process.nextTick c848d69
@fb55 fb55 [bench] try to test all available modules 1384620
@fb55 fb55 [bench] removed unused functions, improved output 9f465ca
@fb55 fb55 [readme] updated benchmarks
also use the more readable unit ms/el
2f38140
@fb55 fb55 [doc] call `end`, use single quotes bc00862
@fb55 fb55 [doc] updated section about node-htmlparser 6935c0d
@fb55 fb55 renamed repository, 3.0.3 8a91aac
@fb55 fb55 use DomUtils.getText in fetch, split getElements e7ad785
@fb55 fb55 [tokenizer] name states consistently 6b995ab
@fb55 fb55 [feedhandler] recursively walk the tree 0b88170
@fb55 fb55 [readme] small updates
tests pass now thanks to updates to the domhandler module
b06cb29
@fb55 fb55 [tokenizer] don't emit an "onopentagend" event for self-closing tags e6f0199
@fb55 fb55 [parser] fixed handling of self-closing tags a3a9954
@fb55 fb55 [tests] stream tests are run again 9d478ea
@fb55 fb55 [tests/feeds] run rdf test again e612238
@fb55 fb55 [tests/stream] enabled xmlMode for RSS test 3b821dc
@fb55 fb55 [tests/stream] create a new handler for the second run 1bb92f7
@fb55 fb55 [tests/stream] added tests for the files in tests/Documents ae58e56
@fb55 fb55 3.0.4 83c75dc
@fb55 fb55 [parser] lowercase instruction names if lowerCaseTags option is set
for backwards compat
e36f3d0
@fb55 fb55 3.0.5 61c5a80
@fb55 fb55 [tests/events] added test case for jsdom#368 d79b1b3
@fb55 fb55 changed behavior for non-xml mode
• lowercase tag and attribute names by default
• CDATA is now emitted as a comment (fixes tmpvar/jsdom#618)
1123da8
@fb55 fb55 [tests/events] updated tests to reflect latest changes 357a825
@fb55 fb55 3.1.0 96c41b1
@papandreou papandreou Added missing void elements. 75fb1cf
@fb55 fb55 Merge pull request #46 from One-com/missing_void_elements
Added missing void elements.
f58c1d3
@AndreasMadsen AndreasMadsen [tokenizer] text in special tags that looks like a tag ending 7ca6d22
@fb55 fb55 Merge pull request #48 from AndreasMadsen/script-in-script
[tokenizer] text in special tags that looks like a tag ending
46d3b21
@fb55 fb55 [tokenizer] consume token again
after switching from BEFORE_CLOSING_TAG_NAME to TEXT state (inside a special tag)
02f12e2
@fb55 fb55 [parser] still recognize other options in non-xml-mode
using the easiest solution (applying DeMorgan).
6e1669f
@fb55 fb55 3.1.1 231a746
@AndreasMadsen AndreasMadsen [tokenizer] don't reset comment state in case of long endings 7ef5de8
@fb55 fb55 Merge pull request #49 from AndreasMadsen/long-comment
[tokenizer] don't reset comment state in case of long endings
623cd89
@AndreasMadsen AndreasMadsen [Tokenizer] don't reset CDATA state in case of long endings e8dc84a
@fb55 fb55 Merge pull request #50 from AndreasMadsen/long-cdata-ending
[Tokenizer] don't reset CDATA state in case of long endings
c88dd9a
@fb55 fb55 readme: added version badge a768e88
@fb55 fb55 [readme] added yet another badge (dependency versions) 40a2339
@fb55 fb55 [bench] added the hubbub & html-parser modules
todo: update readme
8b390bd
@fb55 fb55 3.1.2 dda8df2
@AndreasMadsen AndreasMadsen [Parser] open tags before close if never opened 7fd58aa
@AndreasMadsen AndreasMadsen [Parser] implicit open only p and br tags 694dea7
@abarre abarre Fix perf regression in the Tokenizer : avoid a concatenation
Version 2.3.1 :
-> % node bench2.js
htmlparser2:  01.86 ms/el

Version 3.1.2 without the fix :
-> % node tests/bench.js 
htmlparser2:  04.50 ms/el

Version 3.1.2 with the fix :
-> % node tests/bench.js
htmlparser2:  01.75 ms/el
d64986c
@fb55 fb55 Merge pull request #54 from abarre/master
[tokenizer] fix perf regression
a842129
@fb55 fb55 Merge pull request #52 from AndreasMadsen/implicit-open
[parser] implicit open only p and br tags
0e320fc
@fb55 fb55 3.1.3 eade820
@fb55 fb55 [parser] renamed emptyTags to voidElements, sorted them 0ca2c1e
@fb55 fb55 [parser] improved consistency & simplified 26117ef
@fb55 fb55 [tokenizer] simplified `end` logic 7932367
@fb55 fb55 [tokenizer] removed noop blocks in AFTER_{COMMENT,CDATA}_2 45d9067
@fb55 fb55 [tokenizer] use `continue` instead of decreasing the index 87c6f2b
@fb55 fb55 [bench] removed unnecessary noop functions 7608c11
@fb55 fb55 [tokenizer] improved handling of remaining data d00b391
@fb55 fb55 [readme] it~~'~~s 863183a
@ForbesLindesay ForbesLindesay Add parseDOM and parseFeed helper methods 77bf0ae
@fb55 fb55 Merge pull request #55 from ForbesLindesay/patch-1
Add parseDOM and parseFeed helper methods
740bbe9
@ForbesLindesay ForbesLindesay Add link to live demo 16aef00
@fb55 fb55 Merge pull request #56 from ForbesLindesay/patch-1
Add link to live demo
288bb93
@fb55 fb55 [parser] default options & cbs to empty objects
fixes #57
b00177f
@fb55 fb55 3.1.4 529f727
Zach Smith [tokenizer] fix case where `<` followed by whitespace doesn't parse correctly
9f54942
@fb55 fb55 Merge pull request #58 from xcoderzach/master
[tokenizer] fix case where `<` followed by whitespace doesn't parse
d3c1fcd
@fb55 fb55 3.1.5 830c157
@fb55 fb55 [parser] don't overwrite attribute values on second occurrence
as described in #42
a6b6865
@fb55 fb55 [readme] behavior of example changed due to #58 4d56157
@ForbesLindesay ForbesLindesay Add .gitignore ca311d4
@ForbesLindesay ForbesLindesay Add .gitattributes so tests still work on windows 909a3f1
@ForbesLindesay ForbesLindesay Normalize line endings f6f93ef
@fb55 fb55 [tokenizer] recognize the form feed (U+0C), drop the carriage return (U+0D)

to be in line with the HTML5 spec

(recognized in cheeriojs/cheerio#242)
263775f
@AndreasMadsen AndreasMadsen [Tokenizer] move if context to methods allowing .write to be optimized f8ddbe6
@fb55 fb55 Merge pull request #61 from AndreasMadsen/optimize
[Tokenizer] move if context to methods allowing .write to be optimized

fixes #60
9ab0b0e
@fb55 fb55 [tokenizer] don't save the options object 0219e3a
@fb55 fb55 [tokenizer] use ternary expressions for simple states 2aae96f
@fb55 fb55 [tokenizer] added variables for states of _special f6e21dd
@fb55 fb55 [tokenizer] fixed whitespace f3fb8d7
@fb55 fb55 [tokenizer] more ternaries bf0eaa4
@fb55 fb55 [tokenizer] simplified _cleanup a bit 57eb985
@fb55 fb55 [tokenizer] united some branches 917ecf0
@fb55 fb55 [tokenizer] get rid of _reconsume
use _index-- instead
7f9082c
@fb55 fb55 [tokenizer] even more ternaries 4bc1ec4
@fb55 fb55 [tokenizer] added abstractions for common state types, fixed previous regression
24bbf86
@fb55 fb55 [tokenizer] added _getSection, completely inlined _emitIfToken, partly inlined _emitToken
ce87df1
@fb55 fb55 [tokenizer] simplified _stateInTagName 607c81a
@fb55 fb55 [tokenizer] simplified _stateInAttributeValueNoQuotes, reordered _stateBeforeAttributeName
5b8955a
@fb55 fb55 3.1.6 bd63b0b
@fb55 fb55 [tests] added test for second occurrence of same attribute
fixes #42
4589ecd
@fb55 fb55 [tokenizer] started adding support for HTML entities
TODO: so far, only numeric entities are decoded
9eea898
@fb55 fb55 [tokenizer] corrected decoding of numeric entities fac2449
@fb55 fb55 [tokenizer] numeric entities are now decoded
TODO: attribute values aren't handled yet
e485fb2
@fb55 fb55 [tests] added test case for numeric entities a6fb99e
@ForbesLindesay ForbesLindesay Update link to demo bcd00ed
@dax dax Add startIndex and endIndex positional attributes to the parser c2db3df
@fb55 fb55 Merge pull request #63 from fasterize/parser_positions
Add startIndex and endIndex positional attributes to the parser
6330226
@fb55 fb55 [tokenizer] renamed the self-closing tags state, moved it to its own function
b70b28d
@fb55 fb55 [tokenizer] commented out support for entities in attributes
requires adding a new event to make this work, so delayed for now
ad1d8f0
@fb55 fb55 [readme] updated benchmark results
switched the results to @AndreasMadsen's htmlparser-benchmark
ab8926e
@fb55 fb55 [bench] removed internal benchmarks
in favor of htmlparser-benchmark
e5197b3
@fb55 fb55 [parser] fixed whitespace bc193a6
@fb55 fb55 [parser] moved common logic to _updatePosition function 2221630
@fb55 fb55 [tokenizer] renamed IN_ATTRIBUTE_NAME_* states, improved formatting d26e087
@fb55 fb55 [tokenizer] re-added the carriage return as whitespace
fixes #62

apparently Google's gumbo-parser does behave this way:
https://github.com/google/gumbo-parser/blob/101726c50e172e45be6002c51b85
e45f27f0c2c6/src/tokenizer.c#L322
163a4ce
@fb55 fb55 [tokenizer] fixed handling of unparsed data in end(), added support for several states
ea26f0e
@fb55 fb55 [entities] added maps for normal & legacy entities 3a92796
@fb55 fb55 [tokenizer] added support for decoding HTML entities in `ontext` events
There are still a number of TODOs:
• support decoding entities in attributes
• when in XML mode, only decode XML entities (skip legacy entities)
• move the decodeMap to a JSON file
ba3c1c7
@fb55 fb55 [tests] added test cases for decoding legacy & named entities
both containing one of the longest available entities, to ensure they
are properly decoded (especially relevant for legacy entities)
e9a8496
@fb55 fb55 [entities] added map for XML entities 927a9e9
@fb55 fb55 [tokenizer] added support for XML entities
also moved handling of trailing data to _handleTrailingData() (as it
has to be called recursively now)
7adb053
@fb55 fb55 [tests] also test trailing data support in the numeric entity test b60cf04
@fb55 fb55 [tokenizer] fixed handling non-existent entities e45e4ec
@fb55 fb55 [tests] added test case for XML entities 12edc94
@fb55 fb55 [tokenizer] added _emitEntity
as a preparation for supporting decoding entities in attribute values
271dee2
@fb55 fb55 3.2.0 076fcf7
@fb55 fb55 [tokenizer] moved decodeMap to entities/decode.json f46765d
@fb55 fb55 [tokenizer] renamed _emitEntity to _emitPartial 389102d
@fb55 fb55 [index] statically export Parser, Tokenizer and DomHandler 6ca87ff
@fb55 fb55 [parser] use String#search and String#substr instead of String#split
vastly improves performance of onprocessinginstruction and ondeclaration
1c8600b
@fb55 fb55 [parser] added onattribdata and onattribend events, dropped onattribvalue
e3a75dd
@fb55 fb55 [tokenizer] enable support for decoding entities in attributes, added onattribend and onattribdata events, removed onattribvalue
8494b03
@fb55 fb55 [tests] added test case for entities in attributes feafd9d
@fb55 fb55 3.2.1 311e48e
@fb55 fb55 [tokenizer] don't decode entities in special tags e2fa485
@fb55 fb55 3.2.2 36ee76e
@fb55 fb55 [tokenizer] reintroduced _special, removed IN_SCRIPT and IN_STYLE
also fixed some semantics
cce466c
@fb55 fb55 3.2.3 effc3a9
@fb55 fb55 only respect self-closing tags in XML mode e4fb613
@fb55 fb55 [parser] properly removed self-closing tag support
also replaced call to `Array#slice` with setting the stack's `length`
property
80a1ecb
@fb55 fb55 [tests] read files in the tests file, improved os interoperability of stream test
0347cd7
@fb55 fb55 [tests] added helper.getCallback method be0dafa
@fb55 fb55 [tests] converted tests to mocha b948e86
@fb55 fb55 [tests] renamed tests dir to `test`
as required by mocha
8737bf1
@fb55 fb55 [package] run mocha as the test script 96a00fb
@fb55 fb55 Delete .DS_Store 41ad914
@fb55 fb55 [tokenizer] emit `onattribdata` in `_handleTrailingData`
fixes #66
fc22b7d
@fb55 fb55 [tests] simplifications 336af9b
@fb55 fb55 3.2.4 fc0918c
@fb55 fb55 [readme] updated performance characteristics 7b1e4c9
@fb55 fb55 [tokenizer] handle `<<` correctly 76643d3
@fb55 fb55 3.2.5 2f24491
@fb55 fb55 [tests] added test case for cheeriojs/cheerio#247 834d6d2
@fb55 fb55 update to DomHandler@2.1, updated FeedHandler accordingly, bump 994cfda
@fb55 fb55 [tests] write only single characters for testing chunked data
failed previously (only for FeedHandler tests), fixed now due to
DomHandler upgrade (which removed the `ignoreWhitespace` option)
11eba28
@fb55 fb55 [package] require domutils@1.2
as requested in fb55/css-select#11
029c565
@fb55 fb55 package: update readable-stream e6418c2
@fb55 fb55 package: use simple `license` field 0e5775c
@fb55 fb55 replace non-breaking space with regular space
as requested in #70
2c568d3
@fb55 fb55 index: pass `options` argument to constructors c9d4abe
@fb55 fb55 tests: remove unused `cb` argument 298546c
@fb55 fb55 feedhandler: wrap assignments f9bc72f
@fb55 fb55 tests: changed indentation to tabs 5f244df
@fb55 fb55 package: updated dom module versions, 3.4.0 7153b27