= Sanitize

Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable
elements and attributes, Sanitize will remove all unacceptable HTML from a
string.

Using a simple configuration syntax, you can tell Sanitize to allow certain
elements, certain attributes within those elements, and even certain URL
protocols within attributes that contain URLs. Any HTML elements or attributes
that you don't explicitly allow will be removed.

Because it's based on Nokogiri, a full-fledged HTML parser, rather than a bunch
of fragile regular expressions, Sanitize has no trouble dealing with malformed
or maliciously-formed HTML. When in doubt, Sanitize always errs on the side of
caution.

*Author*:: Ryan Grove (mailto:ryan@wonko.com)
*Version*:: 1.2.0.dev (git)
*Copyright*:: Copyright (c) 2010 Ryan Grove. All rights reserved.
*License*:: MIT License (http://opensource.org/licenses/mit-license.php)
*Website*:: http://github.com/rgrove/sanitize

== Requires

* Nokogiri ~> 1.4.1
* libxml2 >= 2.7.2

== Installation

Latest stable release:

  gem install sanitize

Latest development version:

  gem install sanitize -s http://gemcutter.org --prerelease

== Usage

If you don't specify any configuration options, Sanitize will use its strictest
settings by default, which means it will strip all HTML and leave only text
behind.

  require 'rubygems'
  require 'sanitize'

  html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

  Sanitize.clean(html) # => 'foo'

== Configuration

In addition to the ultra-safe default settings, Sanitize comes with three other
built-in modes.

=== Sanitize::Config::RESTRICTED

Allows only very simple inline formatting markup. No links, images, or block
elements.

  Sanitize.clean(html, Sanitize::Config::RESTRICTED) # => '<b>foo</b>'

=== Sanitize::Config::BASIC

Allows a variety of markup including formatting tags, links, and lists. Images
and tables are not allowed, links are limited to FTP, HTTP, HTTPS, and mailto
protocols, and a <code>rel="nofollow"</code> attribute is added to all links to
mitigate SEO spam.

  Sanitize.clean(html, Sanitize::Config::BASIC)
  # => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'

=== Sanitize::Config::RELAXED

Allows an even wider variety of markup than BASIC, including images and tables.
Links are still limited to FTP, HTTP, HTTPS, and mailto protocols, while images
are limited to HTTP and HTTPS. In this mode, <code>rel="nofollow"</code> is not
added to links.

  Sanitize.clean(html, Sanitize::Config::RELAXED)
  # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

=== Custom Configuration

If the built-in modes don't meet your needs, you can easily specify a custom
configuration:

  Sanitize.clean(html, :elements => ['a', 'span'],
      :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
      :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})

==== :elements

Array of element names to allow. Specify all names in lowercase.

  :elements => [
    'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
    'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
    'sup', 'u', 'ul'
  ]

==== :attributes

Attributes to allow for specific elements. Specify all element names and
attributes in lowercase.

  :attributes => {
    'a' => ['href', 'title'],
    'blockquote' => ['cite'],
    'img' => ['alt', 'src', 'title']
  }

If you'd like to allow certain attributes on all elements, use the symbol
<code>:all</code> instead of an element name.

  :attributes => {
    :all => ['class'],
    'a' => ['href', 'title']
  }

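As a rough illustration of how <code>:all</code> combines with
element-specific entries, here is a minimal sketch in plain Ruby. The helper
name +allowed_attributes+ is hypothetical and is not part of Sanitize; it
only shows one plausible way to read the configuration Hash described above.

```ruby
# Hypothetical helper (not part of Sanitize): list the attributes a config
# allows on one element by combining the :all entry with the element's own.
def allowed_attributes(config, element_name)
  attributes = config[:attributes] || {}
  shared     = attributes[:all] || []
  specific   = attributes[element_name.downcase] || []
  (shared + specific).uniq
end

config = {
  :attributes => {
    :all => ['class'],
    'a'  => ['href', 'title']
  }
}

p allowed_attributes(config, 'a')    # prints ["class", "href", "title"]
p allowed_attributes(config, 'span') # prints ["class"]
```

Under this reading, an element with no entry of its own still receives
whatever <code>:all</code> permits.
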
==== :add_attributes

Attributes to add to specific elements. If the attribute already exists, it will
be replaced with the value specified here. Specify all element names and
attributes in lowercase.

  :add_attributes => {
    'a' => {'rel' => 'nofollow'}
  }

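The "replace if it already exists" behavior can be pictured as a Hash merge.
The sketch below is plain Ruby and uses a hypothetical helper,
+apply_added_attributes+, that operates on an ordinary attribute Hash rather
than on a real node; it is an illustration of the rule, not Sanitize's code.

```ruby
# Hypothetical helper (not part of Sanitize): merge the configured additions
# for an element into its existing attributes. Hash#merge lets the values
# from :add_attributes win when a key is present on both sides.
def apply_added_attributes(config, element_name, existing)
  additions = (config[:add_attributes] || {})[element_name] || {}
  existing.merge(additions)
end

config = {:add_attributes => {'a' => {'rel' => 'nofollow'}}}
link   = {'href' => 'http://foo.com/', 'rel' => 'friend'}

result = apply_added_attributes(config, 'a', link)
puts result['rel']  # prints nofollow
puts result['href'] # prints http://foo.com/
```

The existing <code>rel</code> value is overwritten, while attributes that
are not named in <code>:add_attributes</code> pass through unchanged.
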
==== :protocols

URL protocols to allow in specific attributes. If an attribute is listed here
and contains a protocol other than those specified (or if it contains no
protocol at all), it will be removed.

  :protocols => {
    'a' => {'href' => ['ftp', 'http', 'https', 'mailto']},
    'img' => {'src' => ['http', 'https']}
  }

If you'd like to allow the use of relative URLs which don't have a protocol,
include the symbol <code>:relative</code> in the protocol array:

  :protocols => {
    'a' => {'href' => ['http', 'https', :relative]}
  }

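To make the rule concrete, the following plain-Ruby sketch checks a single
attribute value against a protocol list. The helper +allowed_url?+ is
hypothetical and is not Sanitize's implementation; it assumes that anything
before the first colon (when that colon comes before any <code>/</code>,
<code>?</code>, or <code>#</code>) counts as the protocol, and it rejects
anything it does not recognize.

```ruby
# Hypothetical helper (not part of Sanitize): decide whether an attribute
# value is acceptable for the given protocol list.
def allowed_url?(value, protocols)
  url = value.to_s.strip

  # Only the part before the first '/', '?', or '#' can hold a protocol.
  head = url[/\A[^\/?#]*/]

  if head.include?(':')
    # A protocol is present; it must match an allowed name exactly. Odd
    # input such as "java\nscript:" fails this check and is rejected.
    protocols.include?(head.split(':', 2).first.downcase)
  else
    # No protocol at all: acceptable only when :relative is listed.
    protocols.include?(:relative)
  end
end

p allowed_url?('http://foo.com/', ['http', 'https'])                # prints true
p allowed_url?('javascript:alert(1)', ['http', 'https', :relative]) # prints false
p allowed_url?('/bar.html', ['http', 'https'])                      # prints false
p allowed_url?('/bar.html', ['http', 'https', :relative])           # prints true
```

Treating unrecognized input as disallowed matches the "err on the side of
caution" approach described at the top of this document.
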
=== Transformers

Transformers allow you to filter and alter nodes using your own custom logic, on
top of (or instead of) Sanitize's core filter. A transformer is any object that
responds to <code>call()</code> (such as a lambda or proc) and returns either
<code>nil</code> or a Hash containing certain optional response values.

To use one or more transformers, pass them to the <code>:transformers</code>
config setting:

  Sanitize.clean(html, :transformers => [transformer_one, transformer_two])

==== Input

Each registered transformer's <code>call()</code> method will be called once for
each element node in the HTML, and will receive as an argument an environment
Hash that contains the following items:

[<code>:config</code>]
  The current Sanitize configuration Hash.

[<code>:node</code>]
  A Nokogiri::XML::Node object representing an HTML element.

==== Processing

Each transformer has full access to the Nokogiri::XML::Node that's passed into
it and to the rest of the document via the node's <code>document()</code>
method. Any changes will be reflected instantly in the document and passed on to
subsequently-called transformers and to Sanitize itself. A transformer may even
call Sanitize internally to perform custom sanitization if needed.

Nodes are passed into transformers in the order in which they're traversed. It's
important to note that Nokogiri traverses markup from the deepest node upward,
not from the first node to the last node:

  html = '<div><span>foo</span></div>'
  transformer = lambda{|env| puts env[:node].name }

  # Prints "span", then "div".
  Sanitize.clean(html, :transformers => transformer)

Transformers have a tremendous amount of power, including the power to
completely bypass Sanitize's built-in filtering. Be careful!

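The deepest-first ordering can be pictured with a small plain-Ruby sketch. It
uses a hypothetical +Node+ Struct as a stand-in for a parsed element and a
hypothetical +each_deepest_first+ method that visits children before their
parent; it is only meant to illustrate the order described above, not how
Nokogiri or Sanitize walk a document internally.

```ruby
# Hypothetical stand-in for a parsed element: a name plus child nodes.
Node = Struct.new(:name, :children)

# Visit every child (and its descendants) before yielding the node itself,
# so the deepest nodes are seen first.
def each_deepest_first(node, &block)
  node.children.each { |child| each_deepest_first(child, &block) }
  block.call(node)
end

# Equivalent of '<div><span>foo</span></div>'.
tree = Node.new('div', [Node.new('span', [])])

order = []
each_deepest_first(tree) { |node| order << node.name }
p order # prints ["span", "div"]
```

One practical consequence is that a transformer sees an element's children
before it sees the element itself.
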
==== Output

A transformer may return either +nil+ or a Hash. A return value of +nil+
indicates that the transformer does not wish to act on the current node in any
way. A returned Hash may contain the following items, all of which are optional:

[<code>:attr_whitelist</code>]
  Array of attribute names to add to the whitelist for the current node, in
  addition to any whitelisted attributes already defined in the current config.

[<code>:node</code>]
  A Nokogiri::XML::Node object that should replace the current node. All
  subsequent transformers and Sanitize itself will receive this new node.

[<code>:whitelist</code>]
  If _true_, the current node (and only the current node) will be whitelisted,
  regardless of the current Sanitize config.

[<code>:whitelist_nodes</code>]
  Array of specific Nokogiri::XML::Node objects to whitelist, anywhere in the
  document, regardless of the current Sanitize config.

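As a small illustration of the input and output contract, the sketch below
defines a transformer that returns +nil+ for most nodes and an
<code>:attr_whitelist</code> Hash for <code><span></code> elements. The
names +ALLOW_SPAN_CLASS+ and +FakeNode+ are hypothetical; +FakeNode+ is a
plain Struct standing in for Nokogiri::XML::Node so that the lambda can be
exercised without Sanitize or Nokogiri loaded.

```ruby
# Hypothetical stand-in for Nokogiri::XML::Node; only #name is needed here.
FakeNode = Struct.new(:name)

# A transformer is any object that responds to call(). This one asks for the
# "class" attribute to be allowed on <span> elements and ignores the rest.
ALLOW_SPAN_CLASS = lambda do |env|
  node = env[:node]

  # Returning nil means "do nothing with this node".
  return nil unless node.name.to_s.downcase == 'span'

  {:attr_whitelist => ['class']}
end

span_result = ALLOW_SPAN_CLASS.call({:config => {}, :node => FakeNode.new('span')})
div_result  = ALLOW_SPAN_CLASS.call({:config => {}, :node => FakeNode.new('div')})

p span_result[:attr_whitelist] # prints ["class"]
p div_result                   # prints nil
```

In real use the lambda would be passed in the <code>:transformers</code>
config setting and would receive actual Nokogiri nodes.
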
==== Example: Transformer to whitelist YouTube video embeds

The following example demonstrates how to create a Sanitize transformer that
will safely whitelist valid YouTube video embeds without having to blindly allow
other kinds of embedded content, which would be the case if you tried to do this
by just whitelisting the <code><object></code>, <code><embed></code>, and
<code><param></code> elements:

  lambda do |env|
    node      = env[:node]
    node_name = node.name.to_s.downcase
    parent    = node.parent

    # Since the transformer receives the deepest nodes first, we look for a
    # <param> element or an <embed> element whose parent is an <object>. The
    # parentheses matter: && binds more tightly than ||, so without them a
    # <param> would pass this check no matter what its parent is.
    return nil unless (node_name == 'param' || node_name == 'embed') &&
        parent.name.to_s.downcase == 'object'

    if node_name == 'param'
      # Quick XPath search to find the <param> node that contains the video URL.
      return nil unless movie_node = parent.search('param[@name="movie"]')[0]
      url = movie_node['value']
    else
      # Since this is an <embed>, the video URL is in the "src" attribute. No
      # extra work needed.
      url = node['src']
    end

    # Verify that the video URL is actually a valid YouTube video URL. \A
    # anchors at the start of the string; ^ would also match after a newline.
    return nil unless url =~ /\Ahttp:\/\/(?:www\.)?youtube\.com\/v\//

    # We're now certain that this is a YouTube embed, but we still need to run
    # it through a special Sanitize step to ensure that no unwanted elements or
    # attributes that don't belong in a YouTube embed can sneak in.
    Sanitize.clean_node!(parent, {
      :elements   => ['embed', 'object', 'param'],
      :attributes => {
        'embed'  => ['allowfullscreen', 'allowscriptaccess', 'height', 'src', 'type', 'width'],
        'object' => ['height', 'width'],
        'param'  => ['name', 'value']
      }
    })

    # Now that we're sure that this is a valid YouTube embed and that there are
    # no unwanted elements or attributes hidden inside it, we can tell Sanitize
    # to whitelist the current node (<param> or <embed>) and its parent
    # (<object>).
    {:whitelist_nodes => [node, parent]}
  end

== Contributors

The following lovely people have contributed to Sanitize in the form of patches
or ideas that later became code:

* Peter Cooper <git@peterc.org>
* Gabe da Silveira <gabe@websaviour.com>
* Ryan Grove <ryan@wonko.com>
* Adam Hooper <adam@adamhooper.com>
* Mutwin Kraus <mutle@blogage.de>
* Dev Purkayastha <dev.purkayastha@gmail.com>
* David Reese <work@whatcould.com>
* Ben Wanicur <bwanicur@verticalresponse.com>

== License

Copyright (c) 2010 Ryan Grove <ryan@wonko.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the 'Software'), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.