#NodeHtmlParser
A forgiving HTML/XML/RSS parser written in JS for both the browser and NodeJS (yes, despite the name it works just fine in any modern browser). The parser can handle streams (chunked data) and supports custom handlers for writing custom DOMs/output.

##Installing

	npm install htmlparser

##Running Tests

###Run tests under node:
	node runtests.js

###Run tests in browser:
View runtests.html in any browser

##Usage In Node

```javascript
var htmlparser = require("htmlparser");
var util = require("util");
var rawHtml = "Xyz <script language= javascript>var foo = '<<bar>>';< / script><!--<!-- Waah! -- -->";
var handler = new htmlparser.DefaultHandler(function (error, dom) {
	if (error) {
		// [...do something for errors...]
	} else {
		// [...parsing done, do something...]
	}
});
var parser = new htmlparser.Parser(handler);
parser.parseComplete(rawHtml);
// Print the full DOM (depth "null" means no nesting limit)
console.log(util.inspect(handler.dom, false, null));
```

##Usage In Browser

```javascript
var handler = new Tautologistics.NodeHtmlParser.DefaultHandler(function (error, dom) {
	if (error) {
		// [...do something for errors...]
	} else {
		// [...parsing done, do something...]
	}
});
var parser = new Tautologistics.NodeHtmlParser.Parser(handler);
parser.parseComplete(document.body.innerHTML);
alert(JSON.stringify(handler.dom, null, 2));
```

##Example output

```javascript
[ { raw: 'Xyz ', data: 'Xyz ', type: 'text' }
, { raw: 'script language= javascript'
  , data: 'script language= javascript'
  , type: 'script'
  , name: 'script'
  , attribs: { language: 'javascript' }
  , children:
     [ { raw: 'var foo = \'<<bar>>\';<'
       , data: 'var foo = \'<<bar>>\';<'
       , type: 'text'
       }
     ]
  }
, { raw: '<!-- Waah! -- '
  , data: '<!-- Waah! -- '
  , type: 'comment'
  }
]
```

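Because the handler's DOM is a plain array of nested objects, it can be walked with ordinary recursion. The following is a minimal sketch rather than part of the library: `collectText` is a hypothetical helper, and the `dom` literal is a hand-written structure shaped like the example output above.

```javascript
// Hypothetical helper (not part of htmlparser): gathers the "data" of every
// text node in a DOM array, descending into "children" where present.
function collectText(nodes) {
	var out = [];
	nodes.forEach(function (node) {
		if (node.type === "text") {
			out.push(node.data);
		}
		if (node.children) {
			out = out.concat(collectText(node.children));
		}
	});
	return out;
}

// Hand-written sample in the same shape as the parser output
var dom = [
	{ data: "Xyz ", type: "text" },
	{ type: "script", name: "script", children: [
		{ data: "var foo = 1;", type: "text" }
	] },
	{ data: " a comment ", type: "comment" }
];

console.log(collectText(dom));
```

Comment nodes are skipped because only nodes with `type: 'text'` are collected.
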
##Streaming To Parser

```javascript
while (...) {
	...
	parser.parseChunk(chunk);
}
parser.done();
```

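As a concrete illustration of the loop above, the sketch below slices a string into fixed-size chunks and feeds them to anything that exposes `parseChunk()` and `done()`. Both `feedInChunks` and the `recorder` object are hypothetical; `recorder` is only a stand-in so the example runs without htmlparser installed.

```javascript
// Hypothetical helper: feeds "text" to the parser in pieces of "chunkSize"
// characters, then signals the end of input.
function feedInChunks(parser, text, chunkSize) {
	for (var i = 0; i < text.length; i += chunkSize) {
		parser.parseChunk(text.slice(i, i + chunkSize));
	}
	parser.done();
}

// Stand-in for a real parser; it only records what it receives.
var recorder = {
	chunks: [],
	finished: false,
	parseChunk: function (chunk) { this.chunks.push(chunk); },
	done: function () { this.finished = true; }
};

feedInChunks(recorder, "<b>bold</b> text", 5);
console.log(recorder.chunks, recorder.finished);
```

With the real library, a `Parser` instance created as in the usage sections would presumably take the place of `recorder`.
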
##Streaming To Parser in Node

```javascript
var fs = require("fs");
fs.createReadStream('./path_to_file.html').pipe(parser);
```

##Parsing RSS/Atom Feeds

```javascript
new htmlparser.RssHandler(function (error, dom) {
	...
});
```

##DefaultHandler Options

###Usage

```javascript
var handler = new htmlparser.DefaultHandler(
	  function (error) { ... }
	, { verbose: false, ignoreWhitespace: true }
	);
```

###Option: ignoreWhitespace
Indicates whether the DOM should exclude text nodes that consist solely of whitespace. The default value is "false".

####Example: true

The following HTML:

```html
<font>
	<br>this is the text
<font>
```

becomes:

```javascript
[ { raw: 'font'
  , data: 'font'
  , type: 'tag'
  , name: 'font'
  , children:
     [ { raw: 'br', data: 'br', type: 'tag', name: 'br' }
     , { raw: 'this is the text\n'
       , data: 'this is the text\n'
       , type: 'text'
       }
     , { raw: 'font', data: 'font', type: 'tag', name: 'font' }
     ]
  }
]
```

####Example: false

The following HTML:

```html
<font>
	<br>this is the text
<font>
```

becomes:

```javascript
[ { raw: 'font'
  , data: 'font'
  , type: 'tag'
  , name: 'font'
  , children:
     [ { raw: '\n\t', data: '\n\t', type: 'text' }
     , { raw: 'br', data: 'br', type: 'tag', name: 'br' }
     , { raw: 'this is the text\n'
       , data: 'this is the text\n'
       , type: 'text'
       }
     , { raw: 'font', data: 'font', type: 'tag', name: 'font' }
     ]
  }
]
```

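The difference between the two outputs can be reproduced by filtering a DOM after the fact. This is only a post-processing sketch to show which nodes the option affects, not how the library implements it; `dropWhitespaceText` is a hypothetical helper and the sample DOM is abbreviated by hand.

```javascript
// Hypothetical helper: returns a copy of a DOM array without text nodes that
// consist solely of whitespace. The input is left untouched.
function dropWhitespaceText(nodes) {
	return nodes
		.filter(function (node) {
			return !(node.type === "text" && /^\s*$/.test(node.data));
		})
		.map(function (node) {
			var copy = {};
			Object.keys(node).forEach(function (key) { copy[key] = node[key]; });
			if (node.children) {
				copy.children = dropWhitespaceText(node.children);
			}
			return copy;
		});
}

// Abbreviated form of the "false" output (raw fields omitted for brevity)
var dom = [ { type: "tag", name: "font", children: [
	{ data: "\n\t", type: "text" },
	{ type: "tag", name: "br" },
	{ data: "this is the text\n", type: "text" },
	{ type: "tag", name: "font" }
] } ];

console.log(JSON.stringify(dropWhitespaceText(dom), null, 2));
```

Text that merely ends in whitespace, such as `'this is the text\n'`, is kept; only nodes made up entirely of whitespace are dropped.
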
###Option: verbose
Indicates whether to include extra information on each node in the DOM. This information consists of the "raw" attribute (original, unparsed text found between "<" and ">") and the "data" attribute on "tag", "script", and "comment" nodes. The default value is "true".

####Example: true
The following HTML:

```html
<a href="test.html">xxx</a>
```

becomes:

```javascript
[ { raw: 'a href="test.html"'
  , data: 'a href="test.html"'
  , type: 'tag'
  , name: 'a'
  , attribs: { href: 'test.html' }
  , children: [ { raw: 'xxx', data: 'xxx', type: 'text' } ]
  }
]
```

####Example: false
The following HTML:

```html
<a href="test.html">xxx</a>
```

becomes:

```javascript
[ { type: 'tag'
  , name: 'a'
  , attribs: { href: 'test.html' }
  , children: [ { data: 'xxx', type: 'text' } ]
  }
]
```

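To make the relationship between the two outputs explicit, the sketch below derives the non-verbose form from the verbose one. It takes the option description at face value (drop "raw" everywhere, drop "data" on "tag", "script", and "comment" nodes); `stripVerbose` is a hypothetical helper, and the library itself may behave differently in cases the examples do not cover.

```javascript
// Hypothetical helper: returns a copy of a DOM array with the extra fields
// described for the "verbose" option removed.
function stripVerbose(nodes) {
	var dataFreeTypes = ["tag", "script", "comment"];
	return nodes.map(function (node) {
		var copy = {};
		Object.keys(node).forEach(function (key) {
			var isExtra = key === "raw" ||
				(key === "data" && dataFreeTypes.indexOf(node.type) !== -1);
			if (!isExtra) {
				copy[key] = node[key];
			}
		});
		if (node.children) {
			copy.children = stripVerbose(node.children);
		}
		return copy;
	});
}

// The verbose output for <a href="test.html">xxx</a>
var verboseDom = [ {
	raw: 'a href="test.html"',
	data: 'a href="test.html"',
	type: "tag",
	name: "a",
	attribs: { href: "test.html" },
	children: [ { raw: "xxx", data: "xxx", type: "text" } ]
} ];

console.log(JSON.stringify(stripVerbose(verboseDom), null, 2));
```
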
###Option: enforceEmptyTags
Indicates whether the DOM should prevent children on tags marked as empty in the HTML spec. Typically this should be set to "true" for HTML parsing and "false" for XML parsing. The default value is "true".

####Example: true
The following HTML:

```html
<link>text</link>
```

becomes:

```javascript
[ { raw: 'link', data: 'link', type: 'tag', name: 'link' }
, { raw: 'text', data: 'text', type: 'text' }
]
```

####Example: false
The following HTML:

```html
<link>text</link>
```

becomes:

```javascript
[ { raw: 'link'
  , data: 'link'
  , type: 'tag'
  , name: 'link'
  , children: [ { raw: 'text', data: 'text', type: 'text' } ]
  }
]
```

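The effect of the option can be pictured as moving the children of an empty tag up to become its following siblings. The sketch below does that as a post-processing step; it is not the library's implementation. `hoistEmptyTagChildren` is a hypothetical helper, and `EMPTY_TAGS` is an assumed, deliberately short subset of the empty elements in the HTML spec, not the list htmlparser uses.

```javascript
// ASSUMPTION: a small illustrative subset of empty (void) HTML tags;
// the library's own list may differ.
var EMPTY_TAGS = ["br", "hr", "img", "input", "link", "meta"];

// Hypothetical helper: returns a copy of a DOM array in which the children of
// empty tags are placed after the tag as siblings.
function hoistEmptyTagChildren(nodes) {
	var out = [];
	nodes.forEach(function (node) {
		var copy = {};
		Object.keys(node).forEach(function (key) {
			if (key !== "children") {
				copy[key] = node[key];
			}
		});
		var children = node.children ? hoistEmptyTagChildren(node.children) : [];
		out.push(copy);
		if (node.type === "tag" && EMPTY_TAGS.indexOf(node.name) !== -1) {
			out = out.concat(children);
		} else if (node.children) {
			copy.children = children;
		}
	});
	return out;
}

// The "false" output for <link>text</link>
var xmlStyleDom = [ {
	raw: "link", data: "link", type: "tag", name: "link",
	children: [ { raw: "text", data: "text", type: "text" } ]
} ];

console.log(JSON.stringify(hoistEmptyTagChildren(xmlStyleDom), null, 2));
```
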
##DomUtils

###TBD (see utils_example.js for now)

##Related Projects

Looking for CSS selectors to search the DOM? Try Node-SoupSelect, a port of SoupSelect to NodeJS: http://github.com/harryf/node-soupselect

There's also a port of hpricot to NodeJS that uses HtmlParser for HTML parsing: http://github.com/silentrob/Apricot