= Sanitize

Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable
elements and attributes, Sanitize will remove all unacceptable HTML from a
string.

Using a simple configuration syntax, you can tell Sanitize to allow certain
elements, certain attributes within those elements, and even certain URL
protocols within attributes that contain URLs. Any HTML elements or attributes
that you don't explicitly allow will be removed.

Because it's based on Nokogiri, a full-fledged HTML parser, rather than a bunch
of fragile regular expressions, Sanitize has no trouble dealing with malformed
or maliciously-formed HTML, and will always output valid HTML or XHTML.

*Author*::    Ryan Grove (mailto:ryan@wonko.com)
*Version*::   2.0.0 (git)
*Copyright*:: Copyright (c) 2011 Ryan Grove. All rights reserved.
*License*::   MIT License (http://opensource.org/licenses/mit-license.php)
*Website*::   http://github.com/rgrove/sanitize

== Requires

* Nokogiri ~> 1.4.4
* libxml2 >= 2.7.2

== Installation

Latest stable release:

  gem install sanitize

Latest development version:

  gem install sanitize --pre

== Usage

If you don't specify any configuration options, Sanitize will use its strictest
settings by default, which means it will strip all HTML and leave only text
behind.

  require 'rubygems'
  require 'sanitize'

  html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'

  Sanitize.clean(html) # => 'foo'

== Configuration

In addition to the ultra-safe default settings, Sanitize comes with three other
built-in modes.

=== Sanitize::Config::RESTRICTED

Allows only very simple inline formatting markup. No links, images, or block
elements.

  Sanitize.clean(html, Sanitize::Config::RESTRICTED) # => '<b>foo</b>'

=== Sanitize::Config::BASIC

Allows a variety of markup including formatting tags, links, and lists. Images
and tables are not allowed, links are limited to FTP, HTTP, HTTPS, and mailto
protocols, and a <code>rel="nofollow"</code> attribute is added to all links to
mitigate SEO spam.

  Sanitize.clean(html, Sanitize::Config::BASIC)
  # => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'

=== Sanitize::Config::RELAXED

Allows an even wider variety of markup than BASIC, including images and tables.
Links are still limited to FTP, HTTP, HTTPS, and mailto protocols, while images
are limited to HTTP and HTTPS. In this mode, <code>rel="nofollow"</code> is not
added to links.

  Sanitize.clean(html, Sanitize::Config::RELAXED)
  # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'

=== Custom Configuration

If the built-in modes don't meet your needs, you can easily specify a custom
configuration:

  Sanitize.clean(html, :elements => ['a', 'span'],
      :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
      :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})

==== :add_attributes (Hash)

Attributes to add to specific elements. If the attribute already exists, it will
be replaced with the value specified here. Specify all element names and
attributes in lowercase.

  :add_attributes => {
    'a' => {'rel' => 'nofollow'}
  }

==== :attributes (Hash)

Attributes to allow for specific elements. Specify all element names and
attributes in lowercase.

  :attributes => {
    'a'          => ['href', 'title'],
    'blockquote' => ['cite'],
    'img'        => ['alt', 'src', 'title']
  }

If you'd like to allow certain attributes on all elements, use the symbol
<code>:all</code> instead of an element name.

  :attributes => {
    :all => ['class'],
    'a'  => ['href', 'title']
  }

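One way to picture how <code>:all</code> combines with per-element entries is
as a union of the two lists. The sketch below only illustrates that reading of
the config; <code>allowed_attributes_for</code> is a hypothetical helper and is
not part of Sanitize.

```ruby
# Hypothetical helper, not part of Sanitize: lists the attributes a config
# would allow on one element, assuming :all entries apply to every element.
def allowed_attributes_for(attributes, element_name)
  (attributes[:all] || []) | (attributes[element_name] || [])
end

attributes = {
  :all => ['class'],
  'a'  => ['href', 'title']
}

allowed_attributes_for(attributes, 'a')    # => ["class", "href", "title"]
allowed_attributes_for(attributes, 'span') # => ["class"]
```
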
==== :allow_comments (boolean)

Whether or not to allow HTML comments. Allowing comments is strongly
discouraged, since Internet Explorer allows script execution within conditional
comments. The default value is <code>false</code>.

==== :elements (Array)

Array of element names to allow. Specify all names in lowercase.

  :elements => %w[
    a abbr b blockquote br cite code dd dfn dl dt em i kbd li mark ol p pre
    q s samp small strike strong sub sup time u ul var
  ]

==== :output (Symbol)

Output format. Supported formats are <code>:html</code> and <code>:xhtml</code>,
defaulting to <code>:html</code>.

==== :output_encoding (String)

Character encoding to use for HTML output. Default is <code>utf-8</code>.

==== :protocols (Hash)

URL protocols to allow in specific attributes. If an attribute is listed here
and contains a protocol other than those specified (or if it contains no
protocol at all), it will be removed.

  :protocols => {
    'a'   => {'href' => ['ftp', 'http', 'https', 'mailto']},
    'img' => {'src'  => ['http', 'https']}
  }

If you'd like to allow the use of relative URLs which don't have a protocol,
include the symbol <code>:relative</code> in the protocol array:

  :protocols => {
    'a' => {'href' => ['http', 'https', :relative]}
  }

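The rule described above can be approximated in a few lines of plain Ruby.
This is a rough sketch for illustration only: <code>protocol_allowed?</code> is
a hypothetical helper, and Sanitize's own check is not shown here and may
differ.

```ruby
# Illustrative only: approximates the documented rule. A value with a protocol
# must use a listed protocol; a value without one needs :relative in the list.
def protocol_allowed?(value, protocols)
  if value =~ /\A\s*([a-z][a-z0-9+.\-]*):/i
    protocols.include?($1.downcase)
  else
    protocols.include?(:relative)
  end
end

href_protocols = ['http', 'https', :relative]

protocol_allowed?('http://foo.com/', href_protocols)     # => true
protocol_allowed?('javascript:alert(1)', href_protocols) # => false
protocol_allowed?('/about', href_protocols)              # => true
protocol_allowed?('/about', ['http', 'https'])           # => false
```
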
==== :remove_contents (boolean or Array)

If set to +true+, Sanitize will remove the contents of any non-whitelisted
elements in addition to the elements themselves. By default, Sanitize leaves the
safe parts of an element's contents behind when the element is removed.

If set to an array of element names, then only the contents of the specified
elements (when filtered) will be removed, and the contents of all other filtered
elements will be left behind.

The default value is <code>false</code>.

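As a sketch of the two forms this setting accepts, the Hashes below show a
boolean and an Array value side by side. The element names are arbitrary
choices made for illustration.

```ruby
# Remove the contents of every element that gets filtered out.
strip_everything = {
  :elements        => ['b', 'em'],
  :remove_contents => true
}

# Remove the contents of filtered <script> and <style> elements only; other
# filtered elements still leave their safe contents behind.
strip_some = {
  :elements        => ['b', 'em'],
  :remove_contents => ['script', 'style']
}

# Either Hash can then be passed as the second argument, for example:
#   Sanitize.clean(html, strip_some)
```
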
==== :transformers

Custom transformer or array of custom transformers to run using depth-first
traversal. See the Transformers section below for details.

==== :transformers_breadth

Custom transformer or array of custom transformers to run using breadth-first
traversal. See the Transformers section below for details.

==== :whitespace_elements (Array)

Array of lowercase element names that should be replaced with whitespace when
removed in order to preserve readability. For example,
<code>foo<div>bar</div>baz</code> will become
<code>foo bar baz</code> when the <code><div></code> is removed.

By default, the following elements are included in the
<code>:whitespace_elements</code> array:

  address article aside blockquote br dd div dl dt footer h1 h2 h3 h4 h5
  h6 header hgroup hr li nav ol p pre section ul

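A custom list might look like the fragment below. It assumes that an Array
supplied here replaces the default list rather than extending it; the element
names are arbitrary.

```ruby
# Only <br>, <div>, and <p> are replaced with whitespace when removed.
# Assumption: this Array takes the place of the default list shown above.
config = {
  :elements            => ['b'],
  :whitespace_elements => %w[br div p]
}

# The Hash can then be passed as the second argument, for example:
#   Sanitize.clean(html, config)
```
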
=== Transformers

Transformers allow you to filter and modify nodes using your own custom logic,
on top of (or instead of) Sanitize's core filter. A transformer is any object
that responds to <code>call()</code> (such as a lambda or proc).

To use one or more transformers, pass them to the <code>:transformers</code>
config setting. You may pass a single transformer or an array of transformers.

  Sanitize.clean(html, :transformers => [transformer_one, transformer_two])

==== Input

Each registered transformer's <code>call()</code> method will be called once for
each node in the HTML (including elements, text nodes, comments, etc.), and will
receive as an argument an environment Hash that contains the following items:

[<code>:config</code>]
  The current Sanitize configuration Hash.

[<code>:is_whitelisted</code>]
  <code>true</code> if the current node has been whitelisted by a previous
  transformer, <code>false</code> otherwise. It's generally bad form to remove a
  node that a previous transformer has whitelisted.

[<code>:node</code>]
  A Nokogiri::XML::Node object representing an HTML node. The node may be an
  element, a text node, a comment, a CDATA node, or a document fragment. Use
  Nokogiri's inspection methods (<code>element?</code>, <code>text?</code>,
  etc.) to selectively ignore node types you aren't interested in.

[<code>:node_name</code>]
  The name of the current HTML node, always lowercase (e.g. "div" or "span").
  For non-element nodes, the name will be something like "text", "comment",
  "#cdata-section", "#document-fragment", etc.

[<code>:node_whitelist</code>]
  Set of Nokogiri::XML::Node objects in the current document that have been
  whitelisted by previous transformers, if any. It's generally bad form to
  remove a node that a previous transformer has whitelisted.

[<code>:traversal_mode</code>]
  Current node traversal mode, either <code>:depth</code> for depth-first (the
  default mode) or <code>:breadth</code> for breadth-first.

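To show how a transformer might read these items, here is a minimal sketch
that runs without Sanitize or Nokogiri. <code>StubNode</code> is a hypothetical
stand-in for Nokogiri::XML::Node, and the environment Hashes are built by hand
rather than supplied by Sanitize.

```ruby
# Hypothetical stand-in for Nokogiri::XML::Node so this sketch runs on its
# own. Only the one method used below is provided.
StubNode = Struct.new(:element) do
  def element?
    element
  end
end

seen_elements = []

# Records the name of every element node that has not already been
# whitelisted by an earlier transformer. Returns nil, which is ignored.
element_logger = lambda do |env|
  return unless env[:node].element?
  return if env[:is_whitelisted]

  seen_elements << env[:node_name]
  nil
end

# Hand-built environment Hashes, mimicking what a transformer receives.
element_logger.call({:node => StubNode.new(true),  :node_name => 'div',  :is_whitelisted => false})
element_logger.call({:node => StubNode.new(false), :node_name => 'text', :is_whitelisted => false})

seen_elements # => ["div"]
```
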
==== Output

A transformer doesn't have to return anything, but may optionally return a Hash,
which may contain the following items:

[<code>:node_whitelist</code>]
  Array or Set of specific Nokogiri::XML::Node objects to add to the document's
  whitelist, bypassing the current Sanitize config. These specific nodes and all
  their attributes will be whitelisted, but their children will not be.

If a transformer returns anything other than a Hash, the return value will be
ignored.

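The sketch below shows the shape of such a return value. It is an illustration
under stated assumptions: <code>StubElement</code> is a hypothetical stand-in
for a Nokogiri node, and the choice of <code><wbr></code> is arbitrary. Because
whitelisting a node also whitelists all of its attributes, the transformer only
whitelists nodes that have none.

```ruby
# Hypothetical stand-in for Nokogiri::XML::Node; only #attributes is provided.
StubElement = Struct.new(:attributes)

# Whitelists <wbr> elements, but only when they carry no attributes, since a
# whitelisted node keeps every attribute it has.
wbr_transformer = lambda do |env|
  node = env[:node]
  return unless env[:node_name] == 'wbr'
  return unless node.attributes.empty?

  {:node_whitelist => [node]}
end

plain  = StubElement.new({})
styled = StubElement.new({'onclick' => 'alert(1)'})

accepted = wbr_transformer.call({:node => plain,  :node_name => 'wbr'})
rejected = wbr_transformer.call({:node => styled, :node_name => 'wbr'})

accepted # => {:node_whitelist => [plain]}
rejected # => nil
```
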
==== Processing

Each transformer has full access to the Nokogiri::XML::Node that's passed into
it and to the rest of the document via the node's <code>document()</code>
method. Any changes made to the current node or to the document will be
reflected instantly in the document and passed on to subsequently-called
transformers and to Sanitize itself. A transformer may even call Sanitize
internally to perform custom sanitization if needed.

Nodes are passed into transformers in the order in which they're traversed. By
default, depth-first traversal is used, meaning that markup is traversed from
the deepest node upward (not from the first node to the last node):

  html        = '<div><span>foo</span></div>'
  transformer = lambda{|env| puts env[:node_name] }

  # Prints "text", "span", "div", "#document-fragment".
  Sanitize.clean(html, :transformers => transformer)

You may use the <code>:transformers_breadth</code> config to specify one or more
transformers that should traverse nodes in breadth-first mode:

  html        = '<div><span>foo</span></div>'
  transformer = lambda{|env| puts env[:node_name] }

  # Prints "#document-fragment", "div", "span", "text".
  Sanitize.clean(html, :transformers_breadth => transformer)

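The two visiting orders can be reproduced on a toy tree in plain Ruby. This
sketch is not Sanitize's traversal code; <code>TreeNode</code>,
<code>depth_first</code>, and <code>breadth_first</code> are hypothetical names
used only to illustrate the orders described above.

```ruby
# Toy tree node; a stand-in for Nokogiri nodes so the sketch is self-contained.
TreeNode = Struct.new(:name, :children)

# <div><span>foo</span></div> parsed as a fragment.
tree = TreeNode.new('#document-fragment', [
  TreeNode.new('div', [
    TreeNode.new('span', [
      TreeNode.new('text', [])
    ])
  ])
])

# Depth-first as described above: children are visited before their parent.
def depth_first(node, visited = [])
  node.children.each { |child| depth_first(child, visited) }
  visited << node.name
end

# Breadth-first: each node is visited before its children, level by level.
def breadth_first(root)
  visited = []
  queue   = [root]
  until queue.empty?
    node = queue.shift
    visited << node.name
    queue.concat(node.children)
  end
  visited
end

depth_first(tree)   # => ["text", "span", "div", "#document-fragment"]
breadth_first(tree) # => ["#document-fragment", "div", "span", "text"]
```
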
Transformers have a tremendous amount of power, including the power to
completely bypass Sanitize's built-in filtering. Be careful! Your safety is in
your own hands.

==== Example: Transformer to whitelist YouTube video embeds

The following example demonstrates how to create a depth-first Sanitize
transformer that will safely whitelist valid YouTube video embeds without having
to blindly allow other kinds of embedded content, which would be the case if you
tried to do this by just whitelisting all <code><object></code>,
<code><embed></code>, and <code><param></code> elements:

  lambda do |env|
    node      = env[:node]
    node_name = env[:node_name]

    # Don't continue if this node is already whitelisted or is not an element.
    return if env[:is_whitelisted] || !node.element?

    parent = node.parent

    # Since the transformer receives the deepest nodes first, we look for a
    # <param> element or an <embed> element whose parent is an <object>.
    return unless (node_name == 'param' || node_name == 'embed') &&
        parent.name.to_s.downcase == 'object'

    if node_name == 'param'
      # Quick XPath search to find the <param> node that contains the video URL.
      return unless movie_node = parent.search('param[@name="movie"]')[0]
      url = movie_node['value']
    else
      # Since this is an <embed>, the video URL is in the "src" attribute. No
      # extra work needed.
      url = node['src']
    end

    # Verify that the video URL is actually a valid YouTube video URL. \A
    # anchors the match at the start of the string; ^ would also match at the
    # start of any later line, letting a multi-line value slip through.
    return unless url =~ /\Ahttp:\/\/(?:www\.)?youtube\.com\/v\//

    # We're now certain that this is a YouTube embed, but we still need to run
    # it through a special Sanitize step to ensure that no unwanted elements or
    # attributes that don't belong in a YouTube embed can sneak in.
    Sanitize.clean_node!(parent, {
      :elements => %w[embed object param],

      :attributes => {
        'embed'  => %w[allowfullscreen allowscriptaccess height src type width],
        'object' => %w[height width],
        'param'  => %w[name value]
      }
    })

    # Now that we're sure that this is a valid YouTube embed and that there are
    # no unwanted elements or attributes hidden inside it, we can tell Sanitize
    # to whitelist the current node (<param> or <embed>) and its parent
    # (<object>).
    {:node_whitelist => [node, parent]}
  end

== Contributors

Sanitize was created and is maintained by Ryan Grove (ryan@wonko.com).

The following lovely people have also contributed to Sanitize:

* Wilson Bilkovich (wilson@supremetyrant.com)
* Peter Cooper (git@peterc.org)
* Gabe da Silveira (gabe@websaviour.com)
* Nicholas Evans (owlmanatt@gmail.com)
* Adam Hooper (adam@adamhooper.com)
* Mutwin Kraus (mutle@blogage.de)
* Dev Purkayastha (dev.purkayastha@gmail.com)
* David Reese (work@whatcould.com)
* Ardie Saeidi (ardalan.saeidi@gmail.com)
* Rafael Souza (me@rafaelss.com)
* Ben Wanicur (bwanicur@verticalresponse.com)

== License

Copyright (c) 2011 Ryan Grove (ryan@wonko.com)

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the 'Software'), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.