add optional fallback encoding to HTML::Document.parse w/ autodetect #660

Closed
wants to merge 1 commit into from

3 participants

@riffraff

Add an escape hatch for passing a default encoding to HTML::Document.parse even when we want Nokogiri's
autodetection to run.

Sending this pull request as discussed on the mailing list.
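
For readers skimming the thread, the argument handling proposed here can be sketched in plain Ruby. This is only an illustration of the branch added in the diff: `normalize_encoding_arg` is a hypothetical helper, not part of Nokogiri, and the returned hash shape is an assumption made for readability.

```ruby
# Hypothetical helper (not a Nokogiri API): mirrors how the patched
# HTML::Document.parse interprets its +encoding+ argument.
def normalize_encoding_arg(encoding)
  if encoding.is_a?(Hash)
    # Option hash: keep autodetection on, remember the fallback encoding.
    { autodetect: true, encoding: encoding[:autodetect_fallback] }
  elsif !encoding
    # Nothing passed: autodetect, with no fallback.
    { autodetect: true, encoding: nil }
  else
    # Plain encoding name: force it and skip autodetection.
    { autodetect: false, encoding: encoding }
  end
end

p normalize_encoding_arg(nil)
p normalize_encoding_arg('utf-8')
p normalize_encoding_arg(autodetect_fallback: 'utf-8')
```

The point of the hash form is that the existing positional argument keeps its meaning, so callers that already pass a string or nil would be unaffected.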

@riffraff riffraff add an escape hatch to pass a default encoding
to HTML::Document.parse even when we want Nokogiri's
autodetect to happen.
09811dd
@flavorjones
Owner

Thanks, will take a look.

@flavorjones
Owner

Hi there!

Because we're very likely going to heavily refactor EncodingReader soon, I want to make sure we've got complete test coverage for Nokogiri's behavior in all scenarios.

Based on your nokogiri-talk thread, I think there should probably be clear test coverage for the following cases where .parse is passed an IO object:

  • HTML declares charset X, encoding Y is passed to .parse -> what happens?
  • HTML declares charset X, no encoding passed to .parse -> parsed as X
  • HTML declares charset X, fallback encoding Y is passed to .parse -> parsed as X
  • HTML with encoding X does not declare charset, encoding Y is passed to .parse -> what happens?
  • HTML with encoding X does not declare charset, no encoding is passed to .parse -> what happens?
  • HTML with encoding X does not declare charset, fallback encoding Y is passed to .parse -> what happens?

And then repeat all these tests for a String object.

I don't see explicit coverage for all these cases ... would you mind making sure they're covered?
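
One way to keep track of the cases above is to enumerate them mechanically. The sketch below is plain Ruby and does not touch Nokogiri; the symbol names and the generated test names are made up for illustration, and the expected outcome of each case is deliberately left out because several of them are still open questions in this thread.

```ruby
# Illustrative only: enumerate the scenarios requested above.
INPUT_KINDS   = [:io, :string].freeze
DECLARATIONS  = [:declares_charset, :no_charset].freeze
ENCODING_ARGS = [:explicit_encoding, :no_encoding, :fallback_encoding].freeze

# Every combination of input kind, charset declaration and encoding argument.
SCENARIOS = INPUT_KINDS.product(DECLARATIONS, ENCODING_ARGS).map do |input, decl, arg|
  { input: input, declaration: decl, encoding_arg: arg }
end

SCENARIOS.each do |s|
  puts "test_#{s[:input]}_#{s[:declaration]}_#{s[:encoding_arg]}"
end
```

That gives six cases per input kind, twelve in total, which could serve as a checklist when comparing against the existing tests.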

@riffraff
@leejarvis
Owner

@riffraff Closing this because of how much time has passed and because I'm cleaning up old pull requests. Happy to discuss a merge should it be revisited.

@leejarvis leejarvis closed this
Commits on Apr 22, 2012
  1. @riffraff

    add an escape hatch to pass a default encoding
    to HTML::Document.parse even when we want Nokogiri's
    autodetect to happen.

    riffraff authored
14 lib/nokogiri/html/document.rb
@@ -79,6 +79,14 @@ class << self
# is a number that sets options in the parser, such as
# Nokogiri::XML::ParseOptions::RECOVER. See the constants in
# Nokogiri::XML::ParseOptions.
+ #
+ # If encoding is +nil+ Nokogiri will try to autodetect an encoding from document meta tags,
+ # and then fall back to letting libxml guess it.
+ # If you want to provide a forced fallback encoding you can pass an option hash, with key
+ # +autodetect_fallback+ e.g.:
+ #
+ # +{:autodetect_fallback => 'utf-8'}+
+ #
def parse string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML
options = Nokogiri::XML::ParseOptions.new(options) if Fixnum === options
@@ -93,7 +101,11 @@ def parse string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::
if string_or_io.respond_to?(:read)
url ||= string_or_io.respond_to?(:path) ? string_or_io.path : nil
- if !encoding
+ if !encoding || encoding.is_a?(Hash)
+ if encoding
+ fallback_encoding = encoding[:autodetect_fallback]
+ encoding = fallback_encoding
+ end
# Libxml2's parser has poor support for encoding
# detection. First, it does not recognize the HTML5
# style meta charset declaration. Secondly, even if it
3  test/files/noencoding_utf8.html
@@ -0,0 +1,3 @@
+<!doctype html>
+<title>doc</title>
+<p>We Don’t have an encoding</p>
1  test/helper.rb
@@ -22,6 +22,7 @@ class TestCase < MiniTest::Spec
METACHARSET_FILE = File.join(ASSETS_DIR, 'metacharset.html')
NICH_FILE = File.join(ASSETS_DIR, '2ch.html')
NOENCODING_FILE = File.join(ASSETS_DIR, 'noencoding.html')
+ NOENCODING_UTF8_FILE= File.join(ASSETS_DIR, 'noencoding_utf8.html')
PO_SCHEMA_FILE = File.join(ASSETS_DIR, 'po.xsd')
PO_XML_FILE = File.join(ASSETS_DIR, 'po.xml')
SHIFT_JIS_HTML = File.join(ASSETS_DIR, 'shift_jis.html')
17 test/html/test_document_encoding.rb
@@ -133,6 +133,23 @@ def test_document_xhtml_enc
assert_equal(evil, ary_from_file)
}
end
+
+ def test_document_wrongly_detected_as_iso88591
+ if Nokogiri.jruby?
+ mangled_string = "We Don\342\200\231t have an encoding"
+ else
+ mangled_string = "We Donâ\u0080\u0099t have an encoding"
+ end
+ doc = Nokogiri::HTML.parse(binopen(NOENCODING_UTF8_FILE))
+ assert_equal(mangled_string, doc.at('p').text)
+
+ correct_string = 'We Don’t have an encoding'
+ doc = Nokogiri::HTML.parse(binopen(NOENCODING_UTF8_FILE), nil, 'utf-8')
+ assert_equal(correct_string, doc.at('p').text)
+
+ doc = Nokogiri::HTML.parse(binopen(NOENCODING_UTF8_FILE), nil, {:autodetect_fallback =>'utf-8'})
+ assert_equal(correct_string, doc.at('p').text)
+ end
end
end
end
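
The mangled string in the test above is what the UTF-8 bytes of the fixture look like when they are read as ISO-8859-1. A minimal sketch in plain Ruby, without Nokogiri, reproduces it; the variable names are illustrative.

```ruby
# The fixture text, with U+2019 (right single quotation mark) in "Don’t".
original = "We Don\u2019t have an encoding"

# Raw bytes as they sit in the file; U+2019 is E2 80 99 in UTF-8.
raw_bytes = original.b

# Misreading those bytes as ISO-8859-1 turns each byte into one character.
mangled = raw_bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

# Reading the same bytes as UTF-8 recovers the original text.
recovered = raw_bytes.dup.force_encoding(Encoding::UTF_8)

p mangled
p recovered == original
```

This is why forcing 'utf-8', or supplying it as the fallback, yields the correct string in the test, while the unassisted guess does not.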