= Sanitize

Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable
elements and attributes, Sanitize will remove all unacceptable HTML from a
string.

Using a simple configuration syntax, you can tell Sanitize to allow certain
elements, certain attributes within those elements, and even certain URL
protocols within attributes that contain URLs. Any HTML elements or attributes
that you don't explicitly allow will be removed.

Because it's based on Nokogiri, a full-fledged HTML parser, rather than a bunch
of fragile regular expressions, Sanitize has no trouble dealing with malformed
or maliciously formed HTML, and will always output valid HTML or XHTML.

*Author*:: Ryan Grove (mailto:ryan@wonko.com)
*Version*:: 2.0.0 (git)
*Copyright*:: Copyright (c) 2010 Ryan Grove. All rights reserved.
*License*:: MIT License (http://opensource.org/licenses/mit-license.php)
*Website*:: http://github.com/rgrove/sanitize

== Requires

* Nokogiri ~> 1.4.4
* libxml2 >= 2.7.2

== Installation

Latest stable release:

  gem install sanitize

Latest development version:

  gem install sanitize --pre

== Usage

If you don't specify any configuration options, Sanitize will use its strictest
settings by default, which means it will strip all HTML and leave only text
behind.

  require 'rubygems'
  require 'sanitize'

  html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'

  Sanitize.clean(html) # => 'foo'

== Configuration

In addition to the ultra-safe default settings, Sanitize comes with three other
built-in modes.

=== Sanitize::Config::RESTRICTED

Allows only very simple inline formatting markup. No links, images, or block
elements.

  Sanitize.clean(html, Sanitize::Config::RESTRICTED) # => '<b>foo</b>'

=== Sanitize::Config::BASIC

Allows a variety of markup including formatting tags, links, and lists. Images
and tables are not allowed, links are limited to FTP, HTTP, HTTPS, and mailto
protocols, and a <code>rel="nofollow"</code> attribute is added to all links to
mitigate SEO spam.

  Sanitize.clean(html, Sanitize::Config::BASIC)
  # => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'

=== Sanitize::Config::RELAXED

Allows an even wider variety of markup than BASIC, including images and tables.
Links are still limited to FTP, HTTP, HTTPS, and mailto protocols, while images
are limited to HTTP and HTTPS. In this mode, <code>rel="nofollow"</code> is not
added to links.

  Sanitize.clean(html, Sanitize::Config::RELAXED)
  # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'

=== Custom Configuration

If the built-in modes don't meet your needs, you can easily specify a custom
configuration:

  Sanitize.clean(html, :elements => ['a', 'span'],
      :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
      :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})

==== :add_attributes (Hash)

Attributes to add to specific elements. If the attribute already exists, it will
be replaced with the value specified here. Specify all element names and
attributes in lowercase.

  :add_attributes => {
    'a' => {'rel' => 'nofollow'}
  }

==== :attributes (Hash)

Attributes to allow for specific elements. Specify all element names and
attributes in lowercase.

  :attributes => {
    'a' => ['href', 'title'],
    'blockquote' => ['cite'],
    'img' => ['alt', 'src', 'title']
  }

If you'd like to allow certain attributes on all elements, use the symbol
<code>:all</code> instead of an element name.

  :attributes => {
    :all => ['class'],
    'a' => ['href', 'title']
  }

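As a rough illustration of how an <code>:all</code> entry combines with
per-element entries, the sketch below computes the effective attribute list for
an element in plain Ruby. The <code>allowed_attributes</code> helper is
hypothetical and is not part of Sanitize's API; it only mirrors the behavior
described above and assumes that the two lists are simply combined.

```ruby
# Hypothetical helper, not part of Sanitize's API: returns the attribute names
# that a config like the one above would permit on a given element.
def allowed_attributes(attributes_config, element_name)
  for_all     = attributes_config[:all] || []
  for_element = attributes_config[element_name.downcase] || []
  (for_all + for_element).uniq
end

attributes = {
  :all => ['class'],
  'a'  => ['href', 'title']
}

allowed_attributes(attributes, 'a')    # => ["class", "href", "title"]
allowed_attributes(attributes, 'span') # => ["class"]
```
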
==== :allow_comments (boolean)

Whether or not to allow HTML comments. Allowing comments is strongly
discouraged, since IE allows script execution within conditional comments. The
default value is <code>false</code>.

==== :elements (Array)

Array of element names to allow. Specify all names in lowercase.

  :elements => %w[
    a abbr b blockquote br cite code dd dfn dl dt em i kbd li mark ol p pre
    q s samp small strike strong sub sup time u ul var
  ]

==== :output (Symbol)

Output format. Supported formats are <code>:html</code> and <code>:xhtml</code>,
defaulting to <code>:html</code>.

==== :output_encoding (String)

Character encoding to use for HTML output. Default is <code>utf-8</code>.

==== :protocols (Hash)

URL protocols to allow in specific attributes. If an attribute is listed here
and contains a protocol other than those specified (or if it contains no
protocol at all), it will be removed.

  :protocols => {
    'a' => {'href' => ['ftp', 'http', 'https', 'mailto']},
    'img' => {'src' => ['http', 'https']}
  }

If you'd like to allow the use of relative URLs which don't have a protocol,
include the symbol <code>:relative</code> in the protocol array:

  :protocols => {
    'a' => {'href' => ['http', 'https', :relative]}
  }

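To make the protocol rules above concrete, here is a minimal sketch of a
protocol check written in plain Ruby. The <code>protocol_allowed?</code> helper
and its regular expression are assumptions made for illustration; this is not
Sanitize's implementation, and a real check has to cope with more edge cases
(encoded characters, control characters, and so on).

```ruby
# Hypothetical helper, not Sanitize's implementation: decides whether a URL's
# protocol is acceptable given a list like the :protocols arrays above.
def protocol_allowed?(url, allowed)
  # A protocol is a leading run of letters, digits, "+", "-" or "." followed
  # by a colon (for example "http:" or "mailto:").
  if url =~ /\A\s*([a-z][a-z0-9+.\-]*):/i
    allowed.include?($1.downcase)
  else
    # No protocol at all, so this is treated as a relative URL.
    allowed.include?(:relative)
  end
end

allowed = ['http', 'https', :relative]

protocol_allowed?('http://example.com/', allowed) # => true
protocol_allowed?('/docs/index.html', allowed)    # => true
protocol_allowed?('javascript:alert(1)', allowed) # => false
protocol_allowed?('/docs/index.html', ['http'])   # => false
```
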
==== :remove_contents (boolean or Array)

If set to +true+, Sanitize will remove the contents of any non-whitelisted
elements in addition to the elements themselves. By default, Sanitize leaves the
safe parts of an element's contents behind when the element is removed.

If set to an array of element names, then only the contents of the specified
elements (when filtered) will be removed, and the contents of all other filtered
elements will be left behind.

The default value is <code>false</code>.

==== :transformers

Custom transformer or array of custom transformers. See the Transformers section
below for details.

==== :whitespace_elements (Array)

Array of lowercase element names that should be replaced with whitespace when
removed in order to preserve readability. For example,
<code>foo<div>bar</div>baz</code> will become
<code>foo bar baz</code> when the <code><div></code> is removed.

By default, the following elements are included in the
<code>:whitespace_elements</code> array:

  address article aside blockquote br dd div dl dt footer h1 h2 h3 h4 h5
  h6 header hgroup hr li nav ol p pre section ul

=== Transformers

Transformers allow you to filter and modify nodes using your own custom logic,
on top of (or instead of) Sanitize's core filter. A transformer is any object
that responds to <code>call()</code> (such as a lambda or proc).

To use one or more transformers, pass them to the <code>:transformers</code>
config setting. You may pass a single transformer or an array of transformers.

  Sanitize.clean(html, :transformers => [transformer_one, transformer_two])

==== Input

Each registered transformer's <code>call()</code> method will be called once for
each node in the HTML (including elements, text nodes, comments, etc.), and will
receive as an argument an environment Hash that contains the following items:

[<code>:config</code>]
  The current Sanitize configuration Hash.

[<code>:is_whitelisted</code>]
  <code>true</code> if the current node has been whitelisted by a previous
  transformer, <code>false</code> otherwise. It's generally bad form to remove a
  node that a previous transformer has whitelisted.

[<code>:node</code>]
  A Nokogiri::XML::Node object representing an HTML node. The node may be an
  element, a text node, a comment, a CDATA node, or a document fragment. Use
  Nokogiri's inspection methods (<code>element?</code>, <code>text?</code>,
  etc.) to selectively ignore node types you aren't interested in.

[<code>:node_name</code>]
  The name of the current HTML node, always lowercase (e.g. "div" or "span").
  For non-element nodes, the name will be something like "text", "comment",
  "#cdata-section", "#document-fragment", etc.

[<code>:node_whitelist</code>]
  Set of Nokogiri::XML::Node objects in the current document that have been
  whitelisted by previous transformers, if any. It's generally bad form to
  remove a node that a previous transformer has whitelisted.

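The following sketch shows what a transformer that reads the environment Hash
might look like. So that it runs without Sanitize or Nokogiri installed, it
calls the lambda by hand with hand-built Hashes, and <code>FakeNode</code> is a
hypothetical stand-in for Nokogiri::XML::Node that only implements
<code>element?</code>.

```ruby
# FakeNode is a stand-in for Nokogiri::XML::Node so that this sketch runs
# without Nokogiri; it is not part of Sanitize or Nokogiri.
FakeNode = Struct.new(:name, :is_element) do
  def element?
    is_element
  end
end

seen_elements = []

# A transformer is just a callable that receives the environment Hash.
element_logger = lambda do |env|
  # Skip non-element nodes and anything a previous transformer whitelisted.
  return if env[:is_whitelisted] || !env[:node].element?

  seen_elements << env[:node_name]
  nil
end

# Hand-built environment Hashes using the keys described above.
[
  {:config => {}, :is_whitelisted => false, :node => FakeNode.new('text', false), :node_name => 'text'},
  {:config => {}, :is_whitelisted => false, :node => FakeNode.new('span', true),  :node_name => 'span'},
  {:config => {}, :is_whitelisted => true,  :node => FakeNode.new('div', true),   :node_name => 'div'}
].each { |env| element_logger.call(env) }

seen_elements # => ["span"]
```

In real use the same lambda would be passed via <code>:transformers</code> and
Sanitize would supply the environment Hash.
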
==== Output

A transformer doesn't have to return anything, but may optionally return a Hash,
which may contain the following items:

[<code>:node_whitelist</code>]
  Array or Set of specific Nokogiri::XML::Node objects to add to the document's
  whitelist, bypassing the current Sanitize config. These specific nodes and all
  their attributes will be whitelisted, but their children will not be.

If a transformer returns anything other than a Hash, the return value will be
ignored.

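The return-value rules above can be sketched in plain Ruby as follows. The
<code>merge_transformer_result</code> helper is hypothetical and is not taken
from Sanitize's internals, and the symbols stand in for Nokogiri nodes; the
sketch only illustrates that non-Hash values are ignored and that
<code>:node_whitelist</code> may be an Array or a Set.

```ruby
require 'set'

# Hypothetical helper, not Sanitize's internals: folds one transformer's
# return value into the running whitelist, following the rules above.
def merge_transformer_result(whitelist, result)
  # Anything other than a Hash is ignored.
  return whitelist unless result.is_a?(Hash)

  # :node_whitelist may be an Array or a Set; a missing key adds nothing.
  whitelist.merge(result[:node_whitelist] || [])
end

whitelist = Set.new

merge_transformer_result(whitelist, nil)           # ignored
merge_transformer_result(whitelist, 'not a hash')  # ignored
merge_transformer_result(whitelist, {:node_whitelist => [:embed_node]})
merge_transformer_result(whitelist, {:node_whitelist => Set[:object_node]})

whitelist.to_a # => [:embed_node, :object_node]
```
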
==== Processing

Each transformer has full access to the Nokogiri::XML::Node that's passed into
it and to the rest of the document via the node's <code>document()</code>
method. Any changes made to the current node or to the document will be
reflected instantly in the document and passed on to subsequently called
transformers and to Sanitize itself. A transformer may even call Sanitize
internally to perform custom sanitization if needed.

Nodes are passed into transformers in the order in which they're traversed. It's
important to note that Nokogiri traverses markup from the deepest node upward,
not from the first node to the last node:

  html = '<div><span>foo</span></div>'
  transformer = lambda{|env| puts env[:node_name] }

  # Prints "text", "span", "div", "#document-fragment".
  Sanitize.clean(html, :transformers => transformer)

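The deepest-first order described above corresponds to a post-order tree walk.
The sketch below reproduces that order on a hand-built tree; the
<code>TreeNode</code> struct and the <code>each_deepest_first</code> helper are
hypothetical and only illustrate the ordering, not how Nokogiri or Sanitize
implement traversal.

```ruby
# TreeNode is a stand-in for a parsed HTML node; it is not a Nokogiri class.
TreeNode = Struct.new(:name, :children)

# Visits every child before its parent (a post-order walk), so the deepest
# nodes are yielded first.
def each_deepest_first(node, &block)
  node.children.each { |child| each_deepest_first(child, &block) }
  block.call(node)
end

# Equivalent of '<div><span>foo</span></div>' wrapped in a document fragment.
text     = TreeNode.new('text', [])
span     = TreeNode.new('span', [text])
div      = TreeNode.new('div', [span])
fragment = TreeNode.new('#document-fragment', [div])

order = []
each_deepest_first(fragment) { |node| order << node.name }

order # => ["text", "span", "div", "#document-fragment"]
```

A practical consequence is that a transformer sees an element's children before
it sees the element itself.
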
Transformers have a tremendous amount of power, including the power to
completely bypass Sanitize's built-in filtering. Be careful! Your safety is in
your own hands.

==== Example: Transformer to whitelist YouTube video embeds

The following example demonstrates how to create a Sanitize transformer that
will safely whitelist valid YouTube video embeds without having to blindly allow
other kinds of embedded content, which would be the case if you tried to do this
by just whitelisting all <code><object></code>, <code><embed></code>, and
<code><param></code> elements:

  lambda do |env|
    node = env[:node]
    node_name = env[:node_name]

    # Don't continue if this node is already whitelisted or is not an element.
    return if env[:is_whitelisted] || !node.element?

    parent = node.parent

    # Since the transformer receives the deepest nodes first, we look for a
    # <param> element or an <embed> element whose parent is an <object>.
    return unless (node_name == 'param' || node_name == 'embed') &&
        parent.name.to_s.downcase == 'object'

    if node_name == 'param'
      # Quick XPath search to find the <param> node that contains the video URL.
      return unless movie_node = parent.search('param[@name="movie"]')[0]
      url = movie_node['value']
    else
      # Since this is an <embed>, the video URL is in the "src" attribute. No
      # extra work needed.
      url = node['src']
    end

    # Verify that the video URL is actually a valid YouTube video URL. The
    # pattern is anchored with \A rather than ^, because ^ also matches at the
    # start of any later line in a multi-line value.
    return unless url =~ /\Ahttp:\/\/(?:www\.)?youtube\.com\/v\//

    # We're now certain that this is a YouTube embed, but we still need to run
    # it through a special Sanitize step to ensure that no unwanted elements or
    # attributes that don't belong in a YouTube embed can sneak in.
    Sanitize.clean_node!(parent, {
      :elements => %w[embed object param],

      :attributes => {
        'embed' => %w[allowfullscreen allowscriptaccess height src type width],
        'object' => %w[height width],
        'param' => %w[name value]
      }
    })

    # Now that we're sure that this is a valid YouTube embed and that there are
    # no unwanted elements or attributes hidden inside it, we can tell Sanitize
    # to whitelist the current node (<param> or <embed>) and its parent
    # (<object>).
    {:node_whitelist => [node, parent]}
  end

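When a transformer validates a URL with a regular expression, the choice of
anchor matters. In Ruby, <code>^</code> matches at the start of any line, while
<code>\A</code> matches only at the start of the string. The sketch below
compares the two on a hypothetical hostile attribute value; the constant names
and sample URLs are made up for illustration.

```ruby
# Both patterns check for a YouTube video URL; only the anchor differs.
LINE_ANCHORED   = /^http:\/\/(?:www\.)?youtube\.com\/v\//
STRING_ANCHORED = /\Ahttp:\/\/(?:www\.)?youtube\.com\/v\//

good_url = 'http://www.youtube.com/v/abc123'
# Hypothetical hostile value: the YouTube URL only appears on a second line.
sneaky_url = "javascript:alert(1)\nhttp://www.youtube.com/v/abc123"

# For each URL, record [matches with ^, matches with \A].
results = [good_url, sneaky_url].map do |url|
  [!!(url =~ LINE_ANCHORED), !!(url =~ STRING_ANCHORED)]
end

results # => [[true, true], [true, false]]
```

The line-anchored pattern accepts the hostile value because the match starts
after the embedded newline, so <code>\A</code> is the safer anchor for this
kind of check.
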
== Contributors

Sanitize was created and is currently maintained by Ryan Grove (ryan@wonko.com).

The following lovely people have also contributed to Sanitize:

* Wilson Bilkovich (wilson@supremetyrant.com)
* Peter Cooper (git@peterc.org)
* Gabe da Silveira (gabe@websaviour.com)
* Nicholas Evans (owlmanatt@gmail.com)
* Adam Hooper (adam@adamhooper.com)
* Mutwin Kraus (mutle@blogage.de)
* Dev Purkayastha (dev.purkayastha@gmail.com)
* David Reese (work@whatcould.com)
* Ardie Saeidi (ardalan.saeidi@gmail.com)
* Rafael Souza (me@rafaelss.com)
* Ben Wanicur (bwanicur@verticalresponse.com)

== License

Copyright (c) 2010 Ryan Grove (ryan@wonko.com)

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the 'Software'), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.