Refactor transformers, and move core filtering implementation into a set of default transformers.

Existing transformers from pre-2.0.0 Sanitize will need to be updated.
See the README for details.
Commit e5de3ad4e9b5c252263ad3a388a09d40c45764c7 (1 parent: 7eb0b17), committed by @rgrove on Dec 11, 2010
17 HISTORY
@@ -1,7 +1,13 @@
Sanitize History
================================================================================
-Version 1.3.0 (git)
+Version 2.0.0 (git)
+ * The environment data passed into transformers and the return values expected
+ from transformers have changed. Old transformers will need to be updated.
+ See the README for details.
+ * Transformers now receive nodes of all types, not just element nodes.
+ * Sanitize's own core filtering logic is now implemented as a set of always-on
+ transformers.
* The default value for the :output config is now :html. Previously it was
:xhtml.
* Added a :whitespace_elements config, which specifies elements (such as <br>
@@ -15,15 +21,6 @@ Version 1.3.0 (git)
`ruby`, and `wbr` elements to the whitelist for `Sanitize::Config::RELAXED`.
* The `dir`, `lang`, and `title` attributes are now whitelisted for all
elements in `Sanitize::Config::RELAXED`.
- * The environment hash passed into transformers now includes an
- :allowed_elements Hash to facilitate faster lookups when attempting to
- determine whether an element is in the whitelist. [Suggested by Nicholas
- Evans]
- * The environment hash passed into transformers now includes a
- :whitelist_nodes Array, so transformers now have insight into what nodes
- have been whitelisted by other transformers. [Suggested by Nicholas Evans]
- * Added a :process_text_nodes config setting. If set to true, Sanitize will
- pass text nodes to transformers. The default is false. [Ardie Saeidi]
* Bumped minimum Nokogiri version to 1.4.4 to avoid a bug in 1.4.2+ (issue
#315) that caused "</body></html>" to be appended to the CDATA inside
unterminated script and style elements.
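As a rough illustration of the HISTORY entries above, the sketch below ports a hypothetical pre-2.0.0 transformer to the new conventions. It uses a stand-in `FakeNode` struct instead of Nokogiri so that it runs on its own; `FakeNode`, `env_for`, `old_style`, and `new_style` are names invented for this example, and the env keys are the ones documented in this commit's README changes.

```ruby
require 'set'

# Stand-in for Nokogiri::XML::Node with just enough behavior for this sketch
# (hypothetical, not part of Sanitize or Nokogiri).
FakeNode = Struct.new(:name, :kind) do
  def element?
    kind == :element
  end
end

# Pre-2.0.0 style: only ever received elements, and returned nil or a Hash
# using the old :whitelist_nodes key.
old_style = lambda do |env|
  return nil unless env[:node_name] == 'video'
  {:whitelist_nodes => [env[:node]]}
end

# 2.0.0 style: receives every node type, so it guards on element? and on
# :is_whitelisted, and returns the renamed :node_whitelist key.
new_style = lambda do |env|
  node = env[:node]
  return if env[:is_whitelisted] || !node.element?
  return unless env[:node_name] == 'video'
  {:node_whitelist => [node]}
end

# Builds an env Hash shaped like the one described in the README changes.
def env_for(node, whitelist)
  {
    :config         => {},
    :is_whitelisted => whitelist.include?(node),
    :node           => node,
    :node_name      => node.name.downcase,
    :node_whitelist => whitelist
  }
end

video = FakeNode.new('VIDEO', :element)
text  = FakeNode.new('text', :text)

old_result   = old_style.call(env_for(video, Set.new))
video_result = new_style.call(env_for(video, Set.new))  # Hash with :node_whitelist
text_result  = new_style.call(env_for(text, Set.new))   # nil: not an element
skip_result  = new_style.call(env_for(video, Set[video])) # nil: already whitelisted
```

The main changes when porting are the two guards at the top of the transformer and the renamed return key; a bare `return` is enough, since non-Hash return values are ignored.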
165 README.rdoc
@@ -14,7 +14,7 @@ of fragile regular expressions, Sanitize has no trouble dealing with malformed
or maliciously-formed HTML, and will always output valid HTML or XHTML.
*Author*:: Ryan Grove (mailto:ryan@wonko.com)
-*Version*:: 1.3.0 (git)
+*Version*:: 2.0.0 (git)
*Copyright*:: Copyright (c) 2010 Ryan Grove. All rights reserved.
*License*:: MIT License (http://opensource.org/licenses/mit-license.php)
*Website*:: http://github.com/rgrove/sanitize
@@ -43,7 +43,7 @@ behind.
require 'rubygems'
require 'sanitize'
- html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
+ html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'
Sanitize.clean(html) # => 'foo'
@@ -77,7 +77,7 @@ are limited to HTTP and HTTPS. In this mode, <code>rel="nofollow"</code> is not
added to links.
Sanitize.clean(html, Sanitize::Config::RELAXED)
- # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
+ # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'
=== Custom Configuration
@@ -127,10 +127,9 @@ default value is <code>false</code>.
Array of element names to allow. Specify all names in lowercase.
- :elements => [
- 'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
- 'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
- 'sup', 'u', 'ul'
+ :elements => %w[
+ a abbr b blockquote br cite code dd dfn dl dt em i kbd li mark ol p pre
+ q s samp small strike strong sub sup time u ul var
]
==== :output (Symbol)
@@ -140,12 +139,7 @@ defaulting to <code>:html</code>.
==== :output_encoding (String)
-Character encoding to use for HTML output. Default is <code>'utf-8'</code>.
-
-==== :process_text_nodes (Boolean)
-
-Whether or not to process text nodes. Enabling this will allow text nodes to be
-processed by transformers. The default is <code>false</code>.
+Character encoding to use for HTML output. Default is <code>utf-8</code>.
==== :protocols (Hash)
@@ -171,15 +165,16 @@ If set to +true+, Sanitize will remove the contents of any non-whitelisted
elements in addition to the elements themselves. By default, Sanitize leaves the
safe parts of an element's contents behind when the element is removed.
-If set to an Array of element names, then only the contents of the specified
+If set to an array of element names, then only the contents of the specified
elements (when filtered) will be removed, and the contents of all other filtered
elements will be left behind.
The default value is <code>false</code>.
==== :transformers
-See below.
+Custom transformer or array of custom transformers. See the Transformers section
+below for details.
==== :whitespace_elements (Array)
@@ -196,81 +191,80 @@ By default, the following elements are included in the
=== Transformers
-Transformers allow you to filter and alter nodes using your own custom logic, on
-top of (or instead of) Sanitize's core filter. A transformer is any object that
-responds to <code>call()</code> (such as a lambda or proc) and returns either
-<code>nil</code> or a Hash containing certain optional response values.
+Transformers allow you to filter and modify nodes using your own custom logic,
+on top of (or instead of) Sanitize's core filter. A transformer is any object
+that responds to <code>call()</code> (such as a lambda or proc).
To use one or more transformers, pass them to the <code>:transformers</code>
-config setting:
+config setting. You may pass a single transformer or an array of transformers.
Sanitize.clean(html, :transformers => [transformer_one, transformer_two])
==== Input
Each registered transformer's <code>call()</code> method will be called once for
-each element node in the HTML, and will receive as an argument an environment
-Hash that contains the following items:
-
-[<code>:allowed_elements</code>]
- Hash with whitelisted element names as keys, to facilitate fast lookups of
- whitelisted elements.
+each node in the HTML (including elements, text nodes, comments, etc.), and will
+receive as an argument an environment Hash that contains the following items:
[<code>:config</code>]
The current Sanitize configuration Hash.
+[<code>:is_whitelisted</code>]
+ <code>true</code> if the current node has been whitelisted by a previous
+ transformer, <code>false</code> otherwise. It's generally bad form to remove a
+ node that a previous transformer has whitelisted.
+
[<code>:node</code>]
- A Nokogiri::XML::Node object representing an HTML element.
+ A Nokogiri::XML::Node object representing an HTML node. The node may be an
+ element, a text node, a comment, a CDATA node, or a document fragment. Use
+ Nokogiri's inspection methods (<code>element?</code>, <code>text?</code>,
+ etc.) to selectively ignore node types you aren't interested in.
[<code>:node_name</code>]
The name of the current HTML node, always lowercase (e.g. "div" or "span").
+ For non-element nodes, the name will be something like "text", "comment",
+ "#cdata-section", "#document-fragment", etc.
+
+[<code>:node_whitelist</code>]
+ Set of Nokogiri::XML::Node objects in the current document that have been
+ whitelisted by previous transformers, if any. It's generally bad form to
+ remove a node that a previous transformer has whitelisted.
-[<code>:whitelist_nodes</code>]
- Array of Nokogiri::XML::Node instances that have already been whitelisted by
- previous transformers, if any.
+==== Output
+
+A transformer doesn't have to return anything, but may optionally return a Hash,
+which may contain the following items:
+
+[<code>:node_whitelist</code>]
+ Array or Set of specific Nokogiri::XML::Node objects to add to the document's
+ whitelist, bypassing the current Sanitize config. These specific nodes and all
+ their attributes will be whitelisted, but their children will not be.
+
+If a transformer returns anything other than a Hash, the return value will be
+ignored.
==== Processing
Each transformer has full access to the Nokogiri::XML::Node that's passed into
it and to the rest of the document via the node's <code>document()</code>
-method. Any changes will be reflected instantly in the document and passed on to
-subsequently-called transformers and to Sanitize itself. A transformer may even
-call Sanitize internally to perform custom sanitization if needed.
+method. Any changes made to the current node or to the document will be
+reflected instantly in the document and passed on to subsequently-called
+transformers and to Sanitize itself. A transformer may even call Sanitize
+internally to perform custom sanitization if needed.
Nodes are passed into transformers in the order in which they're traversed. It's
important to note that Nokogiri traverses markup from the deepest node upward,
not from the first node to the last node:
html = '<div><span>foo</span></div>'
- transformer = lambda{|env| puts env[:node].name }
+ transformer = lambda{|env| puts env[:node_name] }
- # Prints "span", then "div".
+ # Prints "text", "span", "div", "#document-fragment".
Sanitize.clean(html, :transformers => transformer)
Transformers have a tremendous amount of power, including the power to
-completely bypass Sanitize's built-in filtering. Be careful!
-
-==== Output
-
-A transformer may return either +nil+ or a Hash. A return value of +nil+
-indicates that the transformer does not wish to act on the current node in any
-way. A returned Hash may contain the following items, all of which are optional:
-
-[<code>:attr_whitelist</code>]
- Array of attribute names to add to the whitelist for the current node, in
- addition to any whitelisted attributes already defined in the current config.
-
-[<code>:node</code>]
- A Nokogiri::XML::Node object that should replace the current node. All
- subsequent transformers and Sanitize itself will receive this new node.
-
-[<code>:whitelist</code>]
- If _true_, the current node (and only the current node) will be whitelisted,
- regardless of the current Sanitize config.
-
-[<code>:whitelist_nodes</code>]
- Array of specific Nokogiri::XML::Node objects to whitelist, anywhere in the
- document, regardless of the current Sanitize config.
+completely bypass Sanitize's built-in filtering. Be careful! Your safety is in
+your own hands.
==== Example: Transformer to whitelist YouTube video embeds
@@ -283,16 +277,20 @@ by just whitelisting all <code><object></code>, <code><embed></code>, and
lambda do |env|
node = env[:node]
node_name = env[:node_name]
- parent = node.parent
+
+ # Don't continue if this node is already whitelisted or is not an element.
+ return if env[:is_whitelisted] || !node.element?
+
+ parent = node.parent
# Since the transformer receives the deepest nodes first, we look for a
# <param> element or an <embed> element whose parent is an <object>.
- return nil unless (node_name == 'param' || node_name == 'embed') &&
+ return unless (node_name == 'param' || node_name == 'embed') &&
parent.name.to_s.downcase == 'object'
if node_name == 'param'
# Quick XPath search to find the <param> node that contains the video URL.
- return nil unless movie_node = parent.search('param[@name="movie"]')[0]
+ return unless movie_node = parent.search('param[@name="movie"]')[0]
url = movie_node['value']
else
# Since this is an <embed>, the video URL is in the "src" attribute. No
@@ -301,48 +299,49 @@ by just whitelisting all <code><object></code>, <code><embed></code>, and
end
# Verify that the video URL is actually a valid YouTube video URL.
- return nil unless url =~ /^http:\/\/(?:www\.)?youtube\.com\/v\//
+ return unless url =~ /^http:\/\/(?:www\.)?youtube\.com\/v\//
# We're now certain that this is a YouTube embed, but we still need to run
# it through a special Sanitize step to ensure that no unwanted elements or
# attributes that don't belong in a YouTube embed can sneak in.
Sanitize.clean_node!(parent, {
- :elements => ['embed', 'object', 'param'],
+ :elements => %w[embed object param],
+
:attributes => {
- 'embed' => ['allowfullscreen', 'allowscriptaccess', 'height', 'src', 'type', 'width'],
- 'object' => ['height', 'width'],
- 'param' => ['name', 'value']
+ 'embed' => %w[allowfullscreen allowscriptaccess height src type width],
+ 'object' => %w[height width],
+ 'param' => %w[name value]
}
})
# Now that we're sure that this is a valid YouTube embed and that there are
# no unwanted elements or attributes hidden inside it, we can tell Sanitize
# to whitelist the current node (<param> or <embed>) and its parent
# (<object>).
- {:whitelist_nodes => [node, parent]}
+ {:node_whitelist => [node, parent]}
end
== Contributors
-The following lovely people have contributed to Sanitize in the form of patches
-or ideas that later became code:
-
-* Ryan Grove <ryan@wonko.com>
-* Wilson Bilkovich <wilson@supremetyrant.com>
-* Peter Cooper <git@peterc.org>
-* Gabe da Silveira <gabe@websaviour.com>
-* Nicholas Evans <owlmanatt@gmail.com>
-* Adam Hooper <adam@adamhooper.com>
-* Mutwin Kraus <mutle@blogage.de>
-* Dev Purkayastha <dev.purkayastha@gmail.com>
-* David Reese <work@whatcould.com>
-* Ardie Saeidi <ardalan.saeidi@gmail.com>
-* Rafael Souza <me@rafaelss.com>
-* Ben Wanicur <bwanicur@verticalresponse.com>
+Sanitize was created and is currently maintained by Ryan Grove (ryan@wonko.com).
+
+The following lovely people have also contributed to Sanitize:
+
+* Wilson Bilkovich (wilson@supremetyrant.com)
+* Peter Cooper (git@peterc.org)
+* Gabe da Silveira (gabe@websaviour.com)
+* Nicholas Evans (owlmanatt@gmail.com)
+* Adam Hooper (adam@adamhooper.com)
+* Mutwin Kraus (mutle@blogage.de)
+* Dev Purkayastha (dev.purkayastha@gmail.com)
+* David Reese (work@whatcould.com)
+* Ardie Saeidi (ardalan.saeidi@gmail.com)
+* Rafael Souza (me@rafaelss.com)
+* Ben Wanicur (bwanicur@verticalresponse.com)
== License
-Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
+Copyright (c) 2010 Ryan Grove (ryan@wonko.com)
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the 'Software'), to deal in
14 benchmark/benchmark.rb
@@ -49,13 +49,13 @@ def bench(html, times, is_fragment)
Loofah.scrub_fragment(html, :prune).to_s
end
else
- measure('Loofah :strip', times) do
- Loofah.scrub_document(html, :strip).to_s
- end
-
- measure('Loofah :prune', times) do
- Loofah.scrub_document(html, :prune).to_s
- end
+ # measure('Loofah :strip', times) do
+ # Loofah.scrub_document(html, :strip).to_s
+ # end
+ #
+ # measure('Loofah :prune', times) do
+ # Loofah.scrub_document(html, :prune).to_s
+ # end
end
measure('Sanitize.clean (strip)', times) do
75 lib/sanitize.rb
@@ -29,6 +29,8 @@
require 'sanitize/config/restricted'
require 'sanitize/config/basic'
require 'sanitize/config/relaxed'
+require 'sanitize/transformers/clean_cdata'
+require 'sanitize/transformers/clean_comment'
require 'sanitize/transformers/clean_element'
class Sanitize
@@ -68,17 +70,15 @@ def self.clean_node!(node, config = {})
# Returns a new Sanitize object initialized with the settings in _config_.
def initialize(config = {})
- # Sanitize configuration.
- @config = Config::DEFAULT.merge(config)
+ @config = Config::DEFAULT.merge(config)
@transformers = Array(@config[:transformers].dup)
- # Default transformers.
- @transformers << Transformers::CleanElement.new(@config)
-
- # Specific nodes to whitelist (along with all their attributes). This array
- # is generated at runtime by transformers, and is cleared before and after
- # a fragment is cleaned (so it applies only to a specific fragment).
- @whitelist_nodes = []
+ # Default transformers. These always run at the end of the transformer
+ # chain, after any custom transformers.
+ @transformers <<
+ Transformers::CleanComment <<
+ Transformers::CleanCDATA <<
+ Transformers::CleanElement.new(@config)
end
# Returns a sanitized copy of _html_.
@@ -115,69 +115,30 @@ def clean!(html)
def clean_node!(node)
raise ArgumentError unless node.is_a?(Nokogiri::XML::Node)
- node.traverse do |child|
- # traverse(node) do |child|
- if child.element? || (child.text? && @config[:process_text_nodes])
- clean_element!(child)
- elsif child.comment?
- child.unlink unless @config[:allow_comments]
- elsif child.cdata?
- child.replace(Nokogiri::XML::Text.new(child.text, child.document))
- end
- end
+ node_whitelist = Set.new
+ node.traverse {|child| transform_node!(child, node_whitelist) }
node
end
private
- # def traverse(node, &block)
- # block.call(node)
- # node.children.each {|child| traverse(child, &block)} if node
- # end
-
- def clean_element!(node)
- # Run this node through all configured transformers.
- transform = transform_element!(node)
-
- # # If this node is in the dynamic whitelist array (built at runtime by
- # # transformers), let it live with all of its attributes intact.
- # return if @whitelist_nodes.include?(node)
-
- transform
- end
-
- def transform_element!(node)
- document = node.document
-
- attr_whitelist = Set.new
- node_whitelist = Set.new
-
- # TODO: node_whitelist needs to be a global whitelist, persistent during the
- # current clean operation (not just the current node transform).
- #
- # But we also need a way of adding the current node to the local whitelist,
- # as if it were in :allowed_elements.
- #
- # Or maybe we should only ever allow local whitelisting and never global
- # persistent whitelisting. Hmm.
-
+ def transform_node!(node, node_whitelist)
@transformers.each do |transformer|
result = transformer.call({
- :attr_whitelist => attr_whitelist,
:config => @config,
+ :is_whitelisted => node_whitelist.include?(node),
:node => node,
:node_name => node.name.downcase,
:node_whitelist => node_whitelist
})
- # If the node has been destroyed or removed from the document, there's no
- # point running subsequent transformers.
- break unless node && node.document == document
+ # If the node has been unlinked, there's no point running subsequent
+ # transformers.
+ break if node.parent.nil? && !node.fragment?
- if result.is_a?(Hash)
- attr_whitelist.merge(result[:attr_whitelist]) if result[:attr_whitelist].respond_to?(:each)
- node_whitelist.merge(result[:node_whitelist]) if result[:node_whitelist].respond_to?(:each)
+ if result.is_a?(Hash) && result[:node_whitelist].respond_to?(:each)
+ node_whitelist.merge(result[:node_whitelist])
end
end
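To make the control flow of the new `transform_node!` easier to follow, here is a simplified standalone model of the loop. It is a sketch, not the library's code: `FakeNode` and `run_chain` are hypothetical, and a `linked?` flag stands in for the `node.parent.nil? && !node.fragment?` check.

```ruby
require 'set'

# Hypothetical stand-in for a Nokogiri node. `linked?` plays the role of
# "still attached to the document".
class FakeNode
  attr_reader :name

  def initialize(name)
    @name   = name
    @linked = true
  end

  def unlink
    @linked = false
  end

  def linked?
    @linked
  end
end

# Runs each transformer against the node, merging any returned
# :node_whitelist into the shared Set and stopping once the node is unlinked.
def run_chain(node, transformers, node_whitelist)
  transformers.each do |transformer|
    result = transformer.call({
      :is_whitelisted => node_whitelist.include?(node),
      :node           => node,
      :node_name      => node.name.downcase,
      :node_whitelist => node_whitelist
    })

    break unless node.linked?

    if result.is_a?(Hash) && result[:node_whitelist].respond_to?(:each)
      node_whitelist.merge(result[:node_whitelist])
    end
  end
end

seen = []

# Whitelists <div> nodes.
whitelister = lambda do |env|
  next nil unless env[:node_name] == 'div'
  {:node_whitelist => [env[:node]]}
end

# Records what later transformers in the chain get to see.
recorder = lambda do |env|
  seen << [env[:node_name], env[:is_whitelisted]]
  nil
end

# Unlinks <span> nodes, which should end the chain for that node.
remover = lambda do |env|
  env[:node].unlink if env[:node_name] == 'span'
  nil
end

whitelist = Set.new
div  = FakeNode.new('DIV')
span = FakeNode.new('span')

run_chain(div,  [whitelister, recorder], whitelist)
run_chain(span, [remover, recorder], whitelist)
```

In this model the recorder sees the `div` as whitelisted because the whitelist is updated between transformers, and it never sees the `span` because the chain stops as soon as the node is unlinked.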
4 lib/sanitize/config.rb
@@ -47,10 +47,6 @@ module Config
# Character encoding to use for HTML output. Default is 'utf-8'.
:output_encoding => 'utf-8',
- # Whether or not to process text nodes. Enabling this will allow text
- # nodes to be processed by transformers.
- :process_text_nodes => false,
-
# URL handling protocols to allow in specific attributes. By default, no
# protocols are allowed. Use :relative in place of a protocol if you want
# to allow relative URLs sans protocol.
13 lib/sanitize/transformers/clean_cdata.rb
@@ -0,0 +1,13 @@
+class Sanitize; module Transformers
+
+ CleanCDATA = lambda do |env|
+ return if env[:is_whitelisted]
+
+ node = env[:node]
+
+ if node.cdata?
+ node.replace(Nokogiri::XML::Text.new(node.text, node.document))
+ end
+ end
+
+end; end
10 lib/sanitize/transformers/clean_comment.rb
@@ -0,0 +1,10 @@
+class Sanitize; module Transformers
+
+ CleanComment = lambda do |env|
+ return if env[:is_whitelisted]
+
+ node = env[:node]
+ node.unlink if node.comment? && !env[:config][:allow_comments]
+ end
+
+end; end
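The CleanCDATA and CleanComment transformers above are lambdas and use a bare `return` to bail out early. That works because `return` inside a lambda exits only the lambda, whereas `return` inside a plain proc returns from the enclosing method. A small standalone demonstration follows; the method names are made up for this example.

```ruby
# `return` inside a lambda only leaves the lambda, so the method continues.
def call_lambda_with_return
  transformer = lambda do |env|
    return :skipped if env[:is_whitelisted]
    :ran
  end

  result = transformer.call({:is_whitelisted => true})
  [:method_finished, result]
end

# `return` inside a proc returns from the enclosing method instead, so the
# last line of the method is never reached.
def call_proc_with_return
  transformer = proc do |env|
    return :returned_from_method if env[:is_whitelisted]
    :ran
  end

  transformer.call({:is_whitelisted => true})
  :method_finished
end

lambda_outcome = call_lambda_with_return
proc_outcome   = call_proc_with_return
```

For this reason a custom transformer that wants early returns is probably safest written as a lambda or as an object with a `call` method. A proc-based transformer can fall through to the end of its block instead, as the procs in the test suite do.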
21 lib/sanitize/transformers/clean_element.rb
@@ -27,9 +27,10 @@ def call(env)
name = env[:node_name]
node = env[:node]
- # Delete any element that isn't in the whitelist.
- # TODO: support transformer-whitelisted nodes
- unless @allowed_elements[name] || env[:node_whitelist].include?(node)
+ return if env[:is_whitelisted] || !node.element?
+
+ # Delete any element that isn't in the config whitelist.
+ unless @allowed_elements[name]
# Elements like br, div, p, etc. need to be replaced with whitespace in
# order to preserve readability.
if @whitespace_elements[name]
@@ -38,21 +39,19 @@ def call(env)
end
unless @remove_all_contents || @remove_element_contents[name]
- node.children.each { |n| node.add_previous_sibling(n) }
+ node.children.each {|n| node.add_previous_sibling(n) }
end
node.unlink
-
- return nil
+ return
end
- # TODO: transformers need attr_whitelist in the env?
- attr_whitelist = Set.new(#env[:attr_whitelist] +
- (@attributes[name] || []) + (@attributes[:all] || []))
+ attr_whitelist = Set.new((@attributes[name] || []) +
+ (@attributes[:all] || []))
if attr_whitelist.empty?
# Delete all attributes from elements with no whitelisted attributes.
- node.attribute_nodes.each {|attr| attr.remove }
+ node.attribute_nodes.each {|attr| attr.unlink }
else
# Delete any attribute that isn't in the whitelist for this element.
node.attribute_nodes.each do |attr|
@@ -82,8 +81,6 @@ def call(env)
if @add_attributes.has_key?(name)
@add_attributes[name].each {|key, val| node[key] = val }
end
-
- nil
end
end
2 lib/sanitize/version.rb
@@ -1,3 +1,3 @@
class Sanitize
- VERSION = '1.3.0.dev.20101210'
+ VERSION = '2.0.0.dev.20101211'
end
8 sanitize.gemspec
@@ -2,16 +2,16 @@
Gem::Specification.new do |s|
s.name = %q{sanitize}
- s.version = "1.3.0.dev.20101210"
+ s.version = "2.0.0.dev.20101211"
s.required_rubygems_version = Gem::Requirement.new("> 1.3.1") if s.respond_to? :required_rubygems_version=
s.authors = ["Ryan Grove"]
- s.date = %q{2010-12-10}
+ s.date = %q{2010-12-11}
s.email = %q{ryan@wonko.com}
- s.files = ["HISTORY", "LICENSE", "README.rdoc", "lib/sanitize/config/basic.rb", "lib/sanitize/config/relaxed.rb", "lib/sanitize/config/restricted.rb", "lib/sanitize/config.rb", "lib/sanitize/version.rb", "lib/sanitize.rb"]
+ s.files = ["HISTORY", "LICENSE", "README.rdoc", "lib/sanitize/config/basic.rb", "lib/sanitize/config/relaxed.rb", "lib/sanitize/config/restricted.rb", "lib/sanitize/config.rb", "lib/sanitize/transformers/clean_cdata.rb", "lib/sanitize/transformers/clean_comment.rb", "lib/sanitize/transformers/clean_element.rb", "lib/sanitize/version.rb", "lib/sanitize.rb"]
s.homepage = %q{https://github.com/rgrove/sanitize/}
s.require_paths = ["lib"]
- s.required_ruby_version = Gem::Requirement.new(">= 1.8.6")
+ s.required_ruby_version = Gem::Requirement.new(">= 1.8.7")
s.rubyforge_project = %q{riposte}
s.rubygems_version = %q{1.3.7}
s.summary = %q{Whitelist-based HTML sanitizer.}
131 test/test_sanitize.rb
@@ -334,16 +334,20 @@
youtube = lambda do |env|
node = env[:node]
node_name = env[:node_name]
- parent = node.parent
+
+ # Don't continue if this node is already whitelisted or is not an element.
+ return if env[:is_whitelisted] || !node.element?
+
+ parent = node.parent
# Since the transformer receives the deepest nodes first, we look for a
# <param> element or an <embed> element whose parent is an <object>.
- return nil unless (node_name == 'param' || node_name == 'embed') &&
+ return unless (node_name == 'param' || node_name == 'embed') &&
parent.name.to_s.downcase == 'object'
if node_name == 'param'
# Quick XPath search to find the <param> node that contains the video URL.
- return nil unless movie_node = parent.search('param[@name="movie"]')[0]
+ return unless movie_node = parent.search('param[@name="movie"]')[0]
url = movie_node['value']
else
# Since this is an <embed>, the video URL is in the "src" attribute. No
@@ -352,82 +356,102 @@
end
# Verify that the video URL is actually a valid YouTube video URL.
- return nil unless url =~ /^http:\/\/(?:www\.)?youtube\.com\/v\//
+ return unless url =~ /^http:\/\/(?:www\.)?youtube\.com\/v\//
# We're now certain that this is a YouTube embed, but we still need to run
# it through a special Sanitize step to ensure that no unwanted elements or
# attributes that don't belong in a YouTube embed can sneak in.
Sanitize.clean_node!(parent, {
- :elements => ['embed', 'object', 'param'],
+ :elements => %w[embed object param],
+
:attributes => {
- 'embed' => ['allowfullscreen', 'allowscriptaccess', 'height', 'src', 'type', 'width'],
- 'object' => ['height', 'width'],
- 'param' => ['name', 'value']
+ 'embed' => %w[allowfullscreen allowscriptaccess height src type width],
+ 'object' => %w[height width],
+ 'param' => %w[name value]
}
})
# Now that we're sure that this is a valid YouTube embed and that there are
# no unwanted elements or attributes hidden inside it, we can tell Sanitize
# to whitelist the current node (<param> or <embed>) and its parent
# (<object>).
- {:whitelist_nodes => [node, parent]}
+ {:node_whitelist => [node, parent]}
end
- # Text transform.
- # Example of transforming text nodes.
- text_transform = lambda do |env|
- node = env[:node]
- node_name = env[:node_name]
- parent = node.parent
-
- return nil unless node_name == "text" && parent.name == "#document-fragment"
-
- # we can modify the text nodes content or completely replace it
- node.replace(Nokogiri::HTML.fragment("<p>#{node.text}</p>"))
-
- {:whitelist_nodes => [node]}
- end
-
- it 'should receive the Sanitize config, current node, and node name as input' do
+ it 'should receive a complete env Hash as input' do
Sanitize.clean!('<SPAN>foo</SPAN>', :foo => :bar, :transformers => lambda {|env|
+ return unless env[:node].element?
+
env[:config][:foo].must_equal(:bar)
+ env[:is_whitelisted].must_equal(false)
env[:node].must_be_kind_of(Nokogiri::XML::Node)
env[:node_name].must_equal('span')
- nil
+ env[:node_whitelist].must_be_kind_of(Set)
+ env[:node_whitelist].must_be_empty
})
end
- it 'should receive allowed_elements and whitelist_nodes as input' do
- Sanitize.clean!('<span>foo</span>', :elements => ['span'], :transformers => lambda {|env|
- env[:allowed_elements].must_be_instance_of(Hash)
- env[:allowed_elements]['span'].must_equal(true)
- env[:whitelist_nodes].must_be_instance_of(Array)
- env[:whitelist_nodes].must_be_empty
- nil
+ it 'should traverse all node types, including the fragment itself' do
+ nodes = []
+
+ Sanitize.clean!('<div>foo</div><!--bar--><script>cdata!</script>', :transformers => proc {|env|
+ nodes << env[:node_name]
})
+
+ nodes.must_equal(%w[
+ text div comment #cdata-section script #document-fragment
+ ])
end
it 'should traverse from the deepest node outward' do
nodes = []
- Sanitize.clean!('<div><span>foo</span></div><p>bar</p>', :transformers => lambda {|env|
- nodes << env[:node_name]
- nil
+ Sanitize.clean!('<div><span>foo</span></div><p>bar</p>', :transformers => proc {|env|
+ nodes << env[:node_name] if env[:node].element?
})
nodes.must_equal(['span', 'div', 'p'])
end
- it 'should whitelist the current node when :whitelist => true' do
- Sanitize.clean!('<div class="foo">foo</div><span>bar</span>', :transformers => lambda {|env|
- {:whitelist => true} if env[:node_name] == 'div'
- }).must_equal('<div>foo</div>bar')
+ it 'should whitelist nodes in the node whitelist' do
+ Sanitize.clean!('<div class="foo">foo</div><span>bar</span>', :transformers => [
+ proc {|env|
+ {:node_whitelist => [env[:node]]} if env[:node_name] == 'div'
+ },
+
+ proc {|env|
+ env[:is_whitelisted].must_equal(false) unless env[:node_name] == 'div'
+ env[:is_whitelisted].must_equal(true) if env[:node_name] == 'div'
+ env[:node_whitelist].must_include(env[:node]) if env[:node_name] == 'div'
+ }
+ ]).must_equal('<div class="foo">foo</div>bar')
end
- it 'should whitelist attributes specified in :attr_whitelist' do
- Sanitize.clean!('<div class="foo" id="bar" width="50">foo</div><span>bar</span>', :transformers => lambda {|env|
- {:whitelist => true, :attr_whitelist => ['id', 'class']} if env[:node_name] == 'div'
- }).must_equal('<div class="foo" id="bar">foo</div>bar')
+ it 'should clear the node whitelist after each fragment' do
+ called = false
+
+ Sanitize.clean!('<div>foo</div>', :transformers => proc {|env|
+ {:node_whitelist => [env[:node]]}
+ })
+
+ Sanitize.clean!('<div>foo</div>', :transformers => proc {|env|
+ called = true
+ env[:is_whitelisted].must_equal(false)
+ env[:node_whitelist].must_be_empty
+ })
+
+ called.must_equal(true)
+ end
+
+ it 'should stop running transformers if the node is destroyed' do
+ called = false
+
+ Sanitize.clean!('<div>foo</div>', :transformers => [
+ proc {|env| env[:node].unlink if env[:node_name] == 'div' },
+ proc {|env| called = true if env[:node_name] == 'div' }
+ ])
+
+ called.must_equal(false)
end
it 'should allow youtube video embeds via the youtube transformer' do
@@ -443,25 +467,6 @@
Sanitize.clean!(input, :transformers => youtube).must_equal(output)
end
-
- it 'should raise Sanitize::Error when a transformer returns something silly' do
- proc {
- Sanitize.clean!('<b>foo</b>', :transformers => lambda {|env| 'hello' })
- }.must_raise(Sanitize::Error)
- end
-
- it 'should processing text nodes when :process_text_nodes is true' do
- input = "foo"
- output = "<p>foo</p>"
-
- Sanitize.clean(input, :process_text_nodes => true, :transformers => text_transform).must_equal(output)
- end
-
- it 'should not process text nodes by default' do
- input = "foo"
-
- Sanitize.clean(input, :transformers => text_transform).must_equal(input)
- end
end
describe 'bugs' do
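The traversal-order tests in this file rely on nodes being visited children first, with the document fragment itself last. The sketch below models that order with a hypothetical `TreeNode` class rather than Nokogiri, so the tree is built by hand and the node names are assumptions that mirror the names used in the tests.

```ruby
# Hypothetical stand-in for a Nokogiri node with children.
class TreeNode
  attr_reader :name, :children

  def initialize(name, children = [])
    @name     = name
    @children = children
  end

  # Depth-first traversal that yields every child before the node itself.
  def traverse(&block)
    children.each {|child| child.traverse(&block) }
    block.call(self)
  end
end

# Hand-built model of '<div><span>foo</span></div><p>bar</p>' as a fragment.
fragment = TreeNode.new('#document-fragment', [
  TreeNode.new('div', [
    TreeNode.new('span', [TreeNode.new('text')])
  ]),
  TreeNode.new('p', [TreeNode.new('text')])
])

visited = []
fragment.traverse {|node| visited << node.name }
# visited is now:
#   ["text", "span", "div", "text", "p", "#document-fragment"]
```

Filtering that list down to elements gives `span`, `div`, `p`, which matches the expectation in the "deepest node outward" test.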
