ISO 8859-1 Symbols not handled properly by Nokogiri::HTML(html) #482

Open
pboling opened this Issue Jul 1, 2011 · 30 comments


pboling commented Jul 1, 2011
irb(main):013:0> Nokogiri::HTML(' ').text
=> " "
irb(main):014:0> Nokogiri::HTML('  ').text
=> "  "
irb(main):015:0> Nokogiri::HTML('    ').text
=> "  "
irb(main):016:0> Nokogiri::HTML.parse('  ', nil, 'utf-8').text
=> "  "
irb(main):017:0> Nokogiri::HTML.fragment('  ', 'utf-8').text
=> "  "
irb(main):019:0> Nokogiri::HTML.fragment(' Very Lovely Bug this one! ', 'utf-8')
=> #<Nokogiri::HTML::DocumentFragment:0xf1fa28 name="#document-fragment" children=[#<Nokogiri::XML::Text:0xf1f3bc " Very Lovely Bug this one! ">]>

This was tested with Nokogiri 1.4.4, 1.4.5, and 1.4.6, and all exhibit the same behavior. Tested on Windows XP with msysGit (Ruby 1.9.2p180) and on RHEL 5.5 (Ruby 1.9.2p0).

For comparison, &apos; works fine:

 irb(main):020:0> Nokogiri::HTML.fragment('&apos;Very Lovely Bug this one!&apos;', 'utf-8').text
 => "'Very Lovely Bug this one!'"

It appears that all of the ASCII printable characters, as well as the reserved characters in HTML (&quot;, &apos;, &amp;, &lt;, &gt;), work when referenced by entity name or number.

It appears that none of the ISO 8859-1 Symbols work by entity name or number.

http://www.w3schools.com/tags/ref_entities.asp

Here are a few more examples of the non-working set:

irb(main):017:0> Nokogiri::HTML.fragment('&#160;', 'utf-8').text
=> " "
irb(main):018:0> Nokogiri::HTML.fragment('&#174;', 'utf-8').text
=> "®"
irb(main):019:0> Nokogiri::HTML.fragment('&#182;', 'utf-8').text
=> "┬╢"
irb(main):020:0> Nokogiri::HTML.fragment('&#187;', 'utf-8').text
=> "┬╗"
irb(main):021:0> Nokogiri::HTML.fragment('&#169;', 'utf-8').text
=> "©"
irb(main):022:0> Nokogiri::HTML.fragment('&#247;', 'utf-8').text
=> "├╖"

Examples that work:

irb(main):023:0> Nokogiri::HTML.fragment('&#62;', 'utf-8').text
=> ">"
irb(main):024:0> Nokogiri::HTML.fragment('&#32;', 'utf-8').text
=> " "
irb(main):025:0> Nokogiri::HTML.fragment('&#35;', 'utf-8').text
=> "#"
Owner

Do you mind posting the output of nokogiri -v?

pboling commented Jul 1, 2011

On Windows (after I downgraded back through 1.4.5 to 1.4.4.1 to make sure this wasn't introduced by a Nokogiri upgrade):

$ nokogiri -v
---
warnings: []

nokogiri: 1.4.4.1
ruby:
  version: 1.9.2
  platform: i386-mingw32
  engine: ruby
libxml:
  binding: extension
  compiled: 2.7.7
  loaded: 2.7.7

The ruby version here is:

$ ruby -v
ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
pboling commented Jul 1, 2011

On Redhat (after upgrading back to 1.4.6):

$ nokogiri -v
--- 
warnings: []

nokogiri: 1.4.6
ruby: 
  version: 1.9.2
  platform: x86_64-linux
  description: ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
  engine: ruby
libxml: 
  binding: extension
  compiled: 2.6.26
  loaded: 2.6.26

Oops, looks like I have 1.9.2p0 on the Redhat machine...

pboling commented Jul 1, 2011

@tenderlove: I know there is a good chance this is a problem in libxml2, so I am hoping you may have some insight on a temporary workaround.

Owner

@pboling does this happen on both windows and linux? Does the output change if you specify the encoding is iso-8859-1 rather than utf-8?

pboling commented Jul 1, 2011

Example on Redhat:

>> Nokogiri::HTML.fragment('&nbsp;', 'utf-8').text
=> " "
>> Nokogiri::HTML.fragment('&nbsp;', 'iso-8859-1').text
=> " "

Example on Windows:

irb(main):002:0> Nokogiri::HTML.fragment('&nbsp;', 'utf-8').text
=> " "
irb(main):001:0> Nokogiri::HTML.fragment('&nbsp;', 'iso-8859-1').text
=> " "

So the output is not identical, but it is a similar problem. Switching to iso-8859-1 did not change anything.

Owner

Can you try against libxml2 version 2.7.8 on redhat?

On windows, does it exhibit the same behavior if you just parse a normal html document? Like:

Nokogiri.HTML('&nbsp;').at_css('body').content
pboling commented Jul 1, 2011

The same failing examples from OP, but run on Redhat:

>> Nokogiri::HTML.fragment('&#160;', 'iso-8859-1').text
=> " "
>> Nokogiri::HTML.fragment('&#174;', 'iso-8859-1').text
=> "®"
>> Nokogiri::HTML.fragment('&#182;', 'iso-8859-1').text
=> "¶"
>> Nokogiri::HTML.fragment('&#187;', 'iso-8859-1').text
=> "»"
>> Nokogiri::HTML.fragment('&#169;', 'iso-8859-1').text
=> "©"
>> Nokogiri::HTML.fragment('&#247;', 'iso-8859-1').text
=> "÷"

So on Redhat they are mostly correct (the last one should be a division sign, ÷), but each one is prepended by a strange Ã.
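
A quick way to check whether the returned strings are actually correct UTF-8 and only the terminal display is off (a sketch):

s = Nokogiri::HTML.fragment('&#174;', 'iso-8859-1').text
s.encoding      # => #<Encoding:UTF-8>
s.bytes.to_a    # [194, 174] would be the correct UTF-8 bytes for U+00AE (®)
s == "\u00AE"   # => true if the entity was decoded correctly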

The same working examples from OP, but run on Redhat:

>> Nokogiri::HTML.fragment('&#62;', 'iso-8859-1').text
=> ">"
>> Nokogiri::HTML.fragment('&#32;', 'iso-8859-1').text
=> " "
>> Nokogiri::HTML.fragment('&#35;', 'iso-8859-1').text
=> "#"
pboling commented Jul 1, 2011

Yes, the full HTML document behaves the same way on Windows (and Redhat):

irb(main):003:0> Nokogiri.HTML('&nbsp;').at_css('body').content
=> " "

I will check on trying with a newer libxml on redhat.

pboling commented Jul 1, 2011

Did some tests on my Mac running 10.6.8, ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-darwin10.7.0]

There are similar problems with the same character set.

Non-working examples using entity names and entity numbers:

>> Nokogiri::HTML('&nbsp;').text
=> "\u00A0"
>> Nokogiri::HTML('&nbsp;', 'utf-8').text
=> "\u00A0"
>> Nokogiri::HTML('&#160;', 'utf-8').text
=> "\u00A0"
>> Nokogiri::HTML('&#174;', 'utf-8').text
=> "\u00AE"
>> Nokogiri::HTML.fragment('&#174;', 'utf-8').text
=> "\u00AE"
>> Nokogiri::HTML('&#160;', 'iso-18859-1').text
=> "\u00A0"
>> Nokogiri::HTML.fragment('&#160;', 'iso-18859-1').text
=> "\u00A0"

Working examples:

>> Nokogiri::HTML('&apos;', 'utf-8').text
=> "'"
>> Nokogiri::HTML('&#32;', 'utf-8').text
=> " "
>> Nokogiri::HTML('&#35;', 'utf-8').text
=> "#"

Here is Nokogiri version info:

$ nokogiri -v
--- 
warnings: []

nokogiri: 1.4.4
ruby: 
  version: 1.9.2
  platform: x86_64-darwin10.7.0
  engine: ruby
libxml: 
  binding: extension
  compiled: 2.7.6
  loaded: 2.7.6

Still working on trying the newer libxml on RHEL.

pboling commented Jul 1, 2011

Just installed latest Nokogiri 1.5.0 on my Mac.

Non-working examples using entity names and entity numbers:

>> require 'nokogiri'
=> true
>> Nokogiri::VERSION
=> "1.5.0"
>> Nokogiri::HTML('&nbsp;', 'utf-8').text
=> "\u00A0"
>> Nokogiri.HTML('&nbsp;').at_css('body').content
=> "\u00A0"
>> Nokogiri.HTML('&#160;').at_css('body').content
=> "\u00A0"
>> Nokogiri.HTML('&#174;').at_css('body').content
=> "\u00AE"

Working examples:

>> Nokogiri.HTML('&#35;').at_css('body').content
=> "#"
>> Nokogiri.HTML('&#35;').text
=> "#"
>> Nokogiri.HTML('&apos;').text
=> "'"

Here is Nokogiri version info:

$ nokogiri -v
# Nokogiri (1.5.0)
    --- 
    warnings: []

    nokogiri: 1.5.0
    ruby: 
      version: 1.9.2
      platform: x86_64-darwin10.7.0
      description: ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-darwin10.7.0]
      engine: ruby
    libxml: 
      binding: extension
      compiled: 2.7.6
      loaded: 2.7.6
pboling commented Jul 5, 2011

In Windows XP:

I think I have figured something out. In a brand-new vanilla irb session (no Rails) I try an experiment:

irb(main):001:0> require 'nokogiri'
=> true
irb(main):023:0> Nokogiri::HTML('&nbsp;').text
=> "\u00A0"
irb(main):024:0> Nokogiri::HTML('&nbsp;').text.encoding
=> #<Encoding:UTF-8>


irb(main):025:0> require 'rails'
=> true
irb(main):026:0> Nokogiri::HTML('&nbsp;').text.encoding
=> #<Encoding:UTF-8>
irb(main):027:0> Nokogiri::HTML('&nbsp;').text
=> " "

It is rails that is FUBARing the string... somehow.

Owner

What does it say when you do '&nbsp;'.encoding?

pboling commented Jul 5, 2011

In Windows XP:

 irb(main):028:0> '&nbsp;'.encoding
 => #<Encoding:IBM437>

However...

irb(main):029:0> a = '&nbsp;'.force_encoding('UTF-8')
=> "&nbsp;"
irb(main):030:0> a.encoding
=> #<Encoding:UTF-8>
irb(main):031:0> Nokogiri::HTML('&nbsp;', 'UTF-8').text
=> " "
irb(main):034:0> Nokogiri::HTML(a, 'UTF-8').text
=> " "
Owner

Please don't use force_encoding. force_encoding maintains the same bytes and does no transcoding. Your string is still encoded with IBM437 but is now erroneously tagged as UTF-8.
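
For example, roughly (a sketch, assuming Ruby 1.9):

latin1 = 160.chr(Encoding::ISO_8859_1)        # one byte: a Latin-1 non-breaking space
latin1.dup.force_encoding('UTF-8').bytes.to_a # => [160]      same byte, just relabeled (and now invalid UTF-8)
latin1.encode('UTF-8').bytes.to_a             # => [194, 160] actually transcoded to the UTF-8 byte sequence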

Can you try this:

> Nokogiri.HTML('&nbsp;', nil, 'IBM437')

Also compare it to this:

> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8')

Thanks!

pboling commented Jul 5, 2011

In Windows XP:

In irb after requiring rails & nokogiri:

irb(main):040:0> Nokogiri.HTML('&nbsp;', nil, 'IBM437').text
=> " "
irb(main):041:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> " "

A new session of irb with only nokogiri loaded:

irb(main):003:0> Nokogiri.HTML('&nbsp;', nil, 'IBM437').text
=> "\u00A0"
irb(main):004:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
pboling commented Jul 5, 2011

Results on my Mac are interesting (Nokogiri 1.5.0, rails 3.0.9, ruby 1.9.2@136):

In a fresh vanilla irb session everything is great (I assume that "\u00A0" is what I should see, as it is the UTF-8 representation of a non-breaking space):

>> require 'nokogiri'
=> true
>> '&nbsp;'.encoding.name
=> "US-ASCII"
>> Nokogiri.HTML('&nbsp;', nil, 'UTF-8').text
=> "\u00A0"
>> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"

Now I require rails and things go crazy:

>> require 'rails'
=> true
>> '&nbsp;'.encoding.name
=> "US-ASCII"
>> Nokogiri.HTML('&nbsp;', nil, 'UTF-8').text
=> " "
>> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> " "
Owner

Interesting. Yes, 00A0 is correct for UTF-8.

So this is definitely rails doing something.

After you load rails and nokogiri, can you run puts Nokogiri::VersionInfo.instance.to_markdown and make sure it matches the output from nokogiri -v? If they are the same, can you open a new irb session and check Encoding.default_internal and Encoding.default_external before and after loading rails?

Also, it seems your terminal is not set to UTF-8 encoding. Did you manually set it to the IBM encoding? This is starting to seem like a rails bug.
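
You can check what Ruby thinks the terminal/locale encoding is with something like:

Encoding.locale_charmap     # the charmap derived from your locale, e.g. "UTF-8" on most Linux/Mac terminals
Encoding.find('locale')     # the corresponding Encoding object
Encoding.default_external   # what IO (and irb input) strings get tagged with by default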

pboling commented Jul 6, 2011

Back on Windows XP, Ruby 1.9.2p180 i386-mingw32, Nokogiri 1.5.0, libxml 2.7.7...

I have narrowed it down. I began by requiring the gems in the same order they are bundled by bundler into my rails app and stopped when the encoding broke:

C:\home>irb
irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):003:0> require 'rake'
=> true
irb(main):004:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):005:0> require 'abstract'
=> true
irb(main):006:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):008:0> require 'active_support/railtie'
=> true
irb(main):009:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> " "

It appears that the activesupport gem (or a dependency - though it doesn't seem to have any dependencies) may be the culprit. But then I went further along my list of gems from bundler, and found that requiring activemodel has the same effect.

irb(main):002:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):003:0> require 'builder'
=> true
irb(main):004:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):005:0> require 'i18n'
=> true
irb(main):006:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):007:0> require 'active_model/railtie'
=> true
irb(main):008:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> " "

It appears that the activerecord gem has the same effect.

irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):003:0> require 'active_record/railtie'
=> true
irb(main):004:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> " "
pboling commented Jul 6, 2011

In response to your request:

C:\home>irb
irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> require 'rails'
=> true
irb(main):003:0> puts Nokogiri::VersionInfo.instance.to_markdown
# Nokogiri (1.5.0)
    ---
    warnings: []
    nokogiri: 1.5.0
    ruby:
      version: 1.9.2
      platform: i386-mingw32
      description: ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
      engine: ruby
    libxml:
      binding: extension
      compiled: 2.7.7
      loaded: 2.7.7
=> nil
irb(main):004:0> exit

C:\home>nokogiri -v
# Nokogiri (1.5.0)
    ---
    warnings: []
    nokogiri: 1.5.0
    ruby:
      version: 1.9.2
      platform: i386-mingw32
      description: ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
      engine: ruby
    libxml:
      binding: extension
      compiled: 2.7.7
      loaded: 2.7.7

C:\home>irb
irb(main):001:0> Encoding.default_internal
=> nil
irb(main):002:0> Encoding.default_external
=> #<Encoding:IBM437>
irb(main):003:0> require 'rails'
=> true
irb(main):004:0> Encoding.default_internal
=> #<Encoding:UTF-8>
irb(main):005:0> Encoding.default_external
=> #<Encoding:UTF-8>
pboling commented Jul 6, 2011

Expanding on that... the problem is in the railtie gem.

Lines 16-27 of lib/rails/rails.rb in the railtie gem:

# For Ruby 1.8, this initialization sets $KCODE to 'u' to enable the
# multibyte safe operations. Plugin authors supporting other encodings
# should override this behaviour and set the relevant +default_charset+
# on ActionController::Base.
#
# For Ruby 1.9, UTF-8 is the default internal and external encoding.
if RUBY_VERSION < '1.9'
  $KCODE='u'
else
  Encoding.default_external = Encoding::UTF_8
  Encoding.default_internal = Encoding::UTF_8
end

The two Encoding lines in the else branch are executed in all of the cases I've reported (all Ruby 1.9.2). I've run them separately in a vanilla irb session with only nokogiri loaded and watched the magical unicorn bug appear. It occurs after setting either one.

C:\home>irb
irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):003:0> Encoding.default_external = Encoding::UTF_8
=> #<Encoding:UTF-8>
irb(main):004:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> " "


C:\home>irb
irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> Encoding.default_internal = Encoding::UTF_8
=> #<Encoding:UTF-8>
irb(main):003:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> " "
pboling commented Jul 6, 2011

A little more digging:

irb(main):011:0> a = '&nbsp;'
=> "&nbsp;"
irb(main):012:0> a.encoding
=> #<Encoding:IBM437>
irb(main):013:0> a.bytes {|c| print c, ' '}
38 110 98 115 112 59 => "&nbsp;"
irb(main):014:0> b = '&nbsp'.encode('UTF-8')
=> "&nbsp"
irb(main):015:0> b.bytes {|c| print c, ' '}
38 110 98 115 112 => "&nbsp"
irb(main):016:0> b.encoding
=> #<Encoding:UTF-8>
irb(main):018:0> require 'nokogiri'
=> true

irb(main):019:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text.bytes {|c| print c, ' '}
194 160 => "\u00A0"

irb(main):021:0> Encoding.default_external
=> #<Encoding:IBM437>
irb(main):022:0> Encoding.default_internal
=> nil
irb(main):023:0> Encoding.default_internal = Encoding::UTF_8
=> #<Encoding:UTF-8>
irb(main):024:0> Encoding.default_external
=> #<Encoding:IBM437>
irb(main):025:0> Encoding.default_internal
=> #<Encoding:UTF-8>

irb(main):026:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text.bytes {|c| print c, ' '}
194 160 => " "

irb(main):027:0> Encoding.default_external = Encoding::UTF_8
=> #<Encoding:UTF-8>

irb(main):028:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text.bytes {|c| print c, ' '}
194 160 => " "

It looks like the bytes are correct, 194 160 in each case.
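
So the parsed string itself appears to be identical either way; only irb's inspect output changes. A way to double-check without relying on inspect (a sketch):

s = Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
s.encoding          # => #<Encoding:UTF-8> in both cases
s.codepoints.to_a   # => [160], i.e. U+00A0, regardless of the default encoding settings
s == "\u00A0"       # => true either way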

pboling commented Jul 12, 2011

To summarize:

Setting either Encoding.default_external = Encoding::UTF_8 or Encoding.default_internal = Encoding::UTF_8 causes Nokogiri.HTML('&nbsp;') to return " " instead of "\u00A0" in irb, even though the actual bytes are the same (194 160) in both cases. The problem occurs when requiring rails because the railtie gem (lines 16-27 of lib/rails/rails.rb) sets the encodings in exactly this way.

Where do you think the blame for this bug lies? I'm guessing it is not in Nokogiri.

Owner

Yes, this is definitely not a bug in nokogiri. I think the blame lies with Rails, but I need to figure out the "right" thing for rails to do. I'm guessing that Rails should consult your terminal's encoding before messing with the Encoding settings, but that's really a guess at this point. I need to talk to @wycats about this.

pboling commented Jul 13, 2011

FYI, we experience this issue when running our test suite as well, so not just in console.

Were you ever able to replicate the issue?

Owner

@pboling I haven't been able to reproduce this issue, but I haven't had a chance to run my terminal as something besides UTF-8 (which I think is the crux of the problem).

andyfo commented Feb 1, 2012

I just ran into this error while trying to build an HTML document with the parser and builder. Any updates or workarounds?

Nokogiri::HTML("<p>&#197;</p>").to_html
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>Ã
</p></body></html>\n"
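
If it helps narrow things down: the string itself may be correct UTF-8 with only the terminal rendering mangled. A quick check (a sketch):

doc = Nokogiri::HTML("<p>&#197;</p>")
doc.at_css('p').text.bytes.to_a   # [195, 133] would be the correct UTF-8 bytes for U+00C5 (Å)
doc.to_html.encoding              # shows what encoding the serialized string is tagged with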

Hi, I'm trying to make some changes to an HTML page encoded with charset=iso-8859-1.

doc = Nokogiri::HTML(open(html_file))

puts doc.to_html messes up all the accents in the page. So if I save it back it looks broken in the browser as well.

I guess I'm having the same issue as the rest. I'm still on Rails 3.0.6...

I hope there is a cure for this problem. Any hints?

Here's one of the pages suffering from that for example: http://www.elmundo.es/accesible/elmundo/2012/03/07/solidaridad/1331108705.html

UPDATE 1 (24 March 2012)

I'm not really sure my problem is the same as this issue. I managed to partially solve it, and I believe it has nothing to do with Nokogiri, because just opening and saving the file is enough to mess up the accents.

The closest I got to a fix is doing this:

thefile = File.open(html_file, "r")
text = thefile.read
doc = Nokogiri::HTML(text)
# ... do any stuff with nokogiri
File.open(html_file, 'w') {|f| f.write(doc.to_html) }

So Nokogiri does not open the file itself; it just gets the HTML that was read from the file.

The original file came as iso-8859-1 and the saved one goes out as utf-8, and it mostly looks OK. Accents are in place, except in some spots :-P where I get question marks, like in Econom�a, where there should be an í (i with an acute accent).

Getting closer, I think. If someone has a hint that covers the capital letters as well, it might be almost done.
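
One thing that might help (a rough sketch, assuming the page really is declared as iso-8859-1): tell Nokogiri the input encoding explicitly instead of letting it guess, and request the same encoding back when serializing:

html = File.open(html_file, 'rb') { |f| f.read }   # raw bytes, no transcoding on read
doc  = Nokogiri::HTML(html, nil, 'ISO-8859-1')     # declare the real input encoding
# ... do any stuff with nokogiri
File.open(html_file, 'wb') { |f| f.write(doc.to_html(encoding: 'ISO-8859-1')) }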

Cheers.

I'm having a similar issue with my project three years later. I'm trying to leave HTML entities untouched, or at least convert them back appropriately once I'm done parsing with Nokogiri.

[4] pry(main)> test = 'àèìòùáéíóú è &#233;'
=> "àèìòùáéíóú è &#233;"
[5] pry(main)> doc = Nokogiri::HTML.parse(test)
=> #(Document:0x3ff4e39c6b44 {
  name = "document",
  children = [
    #(DTD:0x3ff4e39ca03c { name = "html" }),
    #(Element:0x3ff4e40ec39c {
      name = "html",
      children = [
        #(Element:0x3ff4e40f5b2c {
          name = "body",
          children = [ #(Element:0x3ff4e40f89e4 { name = "p", children = [ #(Text "àèìòùáéíóú è é")] })]
          })]
      })]
  })
[9] pry(main)> doc.to_html(encoding: Encoding::ISO_8859_1.name)
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>\xE0\xE8\xEC\xF2\xF9\xE1\xE9\xED\xF3\xFA \xE8 \xE9</p></body></html>\n"

As you can see, the entity is converted to its literal character when parsed with Nokogiri. But when I later try to convert the document back to ISO-8859-1, I get what looks like a different encoding entirely, which causes � when I send the modified string back to the server I'm working with.

Any help or insight is appreciated. Thanks in advance!

Note: I know my source strings aren't valid HTML; I just needed a proof of concept.
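
For what it's worth, the \xE0\xE8... escapes in that to_html output look like the correct ISO-8859-1 bytes for à, è, and so on; Ruby just displays them escaped because the string is no longer UTF-8. A quick round-trip check (a sketch):

out = doc.to_html(encoding: Encoding::ISO_8859_1.name)
# 0xE0 is "à" in ISO-8859-1; relabel the bytes (they really are Latin-1 here) and transcode to verify:
out.dup.force_encoding('ISO-8859-1').encode('UTF-8')   # the accented characters come back readable
"\xE0".force_encoding('ISO-8859-1').encode('UTF-8')    # => "à"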
