ISO 8859-1 Symbols not handled properly by Nokogiri::HTML(html) #482

Open
pboling opened this Issue · 29 comments

4 participants

@pboling
irb(main):013:0> Nokogiri::HTML(' ').text
=> " "
irb(main):014:0> Nokogiri::HTML('  ').text
=> "  "
irb(main):015:0> Nokogiri::HTML('    ').text
=> "  "
irb(main):016:0> Nokogiri::HTML.parse('  ', nil, 'utf-8').text
=> "  "
irb(main):017:0> Nokogiri::HTML.fragment('  ', 'utf-8').text
=> "  "
irb(main):019:0> Nokogiri::HTML.fragment(' Very Lovely Bug this one! ', 'utf-8')
=> #<Nokogiri::HTML::DocumentFragment:0xf1fa28 name="#document-fragment" children=[#<Nokogiri::XML::Text:0xf1f3bc " Very Lovely Bug this one! ">]>

This was tested in Nokogiri 1.4.4, 1.4.5, and 1.4.6, and all exhibit the same behavior. Tested on Windows XP with msysgit (Ruby 1.9.2p180) and on RHEL 5.5 (Ruby 1.9.2p0).

For comparison, &apos; works fine:

 irb(main):020:0> Nokogiri::HTML.fragment('&apos;Very Lovely Bug this one!&apos;', 'utf-8').text
 => "'Very Lovely Bug this one!'"

It appears that all the ASCII Printable Characters, and ISO-8859-1 Reserved Characters in HTML, &quot;, &apos;, &amp;, &lt;, &gt; work when referenced by entity name or number.

It appears that none of the ISO 8859-1 Symbols work by entity name or number.

http://www.w3schools.com/tags/ref_entities.asp

Here are a few more examples of the non-working set:

irb(main):017:0> Nokogiri::HTML.fragment('&#160;', 'utf-8').text
=> " "
irb(main):018:0> Nokogiri::HTML.fragment('&#174;', 'utf-8').text
=> "®"
irb(main):019:0> Nokogiri::HTML.fragment('&#182;', 'utf-8').text
=> "┬╢"
irb(main):020:0> Nokogiri::HTML.fragment('&#187;', 'utf-8').text
=> "┬╗"
irb(main):021:0> Nokogiri::HTML.fragment('&#169;', 'utf-8').text
=> "©"
irb(main):022:0> Nokogiri::HTML.fragment('&#247;', 'utf-8').text
=> "├╖"

Examples that work:

irb(main):023:0> Nokogiri::HTML.fragment('&#62;', 'utf-8').text
=> ">"
irb(main):024:0> Nokogiri::HTML.fragment('&#32;', 'utf-8').text
=> " "
irb(main):025:0> Nokogiri::HTML.fragment('&#35;', 'utf-8').text
=> "#"
@tenderlove
Owner

Do you mind posting the output of nokogiri -v?

@pboling

On Windows (after I downgraded back through 1.4.5 to 1.4.4.1 to make sure this wasn't introduced by a Nokogiri upgrade):

$ nokogiri -v
---
warnings: []

nokogiri: 1.4.4.1
ruby:
  version: 1.9.2
  platform: i386-mingw32
  engine: ruby
libxml:
  binding: extension
  compiled: 2.7.7
  loaded: 2.7.7

The ruby version here is:

$ ruby -v
ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
@pboling

On Redhat (after upgrading back to 1.4.6):

$ nokogiri -v
--- 
warnings: []

nokogiri: 1.4.6
ruby: 
  version: 1.9.2
  platform: x86_64-linux
  description: ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
  engine: ruby
libxml: 
  binding: extension
  compiled: 2.6.26
  loaded: 2.6.26

Oops, looks like I have 1.9.2p0 on the Redhat machine...

@pboling

@tenderlove: I know there is a good chance this is a problem in libxml2, so I am hoping you may have some insight on a temporary workaround.

@tenderlove
Owner

@pboling does this happen on both windows and linux? Does the output change if you specify the encoding is iso-8859-1 rather than utf-8?

@pboling

Example on Redhat:

>> Nokogiri::HTML.fragment('&nbsp;', 'utf-8').text
=> " "
>> Nokogiri::HTML.fragment('&nbsp;', 'iso-8859-1').text
=> " "

Example on Windows:

irb(main):002:0> Nokogiri::HTML.fragment('&nbsp;', 'utf-8').text
=> " "
irb(main):001:0> Nokogiri::HTML.fragment('&nbsp;', 'iso-8859-1').text
=> " "

So not identical output, but similar problem. Switching to iso-8859-1 did not change anything.

@tenderlove
Owner

Can you try against libxml2 version 2.7.8 on redhat?

On windows, does it exhibit the same behavior if you just parse a normal html document? Like:

Nokogiri.HTML('&nbsp;').at_css('body').content
@pboling

The same failing examples from OP, but run on Redhat:

>> Nokogiri::HTML.fragment('&#160;', 'iso-8859-1').text
=> " "
>> Nokogiri::HTML.fragment('&#174;', 'iso-8859-1').text
=> "®"
>> Nokogiri::HTML.fragment('&#182;', 'iso-8859-1').text
=> "¶"
>> Nokogiri::HTML.fragment('&#187;', 'iso-8859-1').text
=> "»"
>> Nokogiri::HTML.fragment('&#169;', 'iso-8859-1').text
=> "©"
>> Nokogiri::HTML.fragment('&#247;', 'iso-8859-1').text
=> "÷"

So on RedHat they are mostly correct (last one should be a division symbol ÷), but are prepended by a strange Ã

The same working examples from OP, but run on Redhat:

>> Nokogiri::HTML.fragment('&#62;', 'iso-8859-1').text
=> ">"
>> Nokogiri::HTML.fragment('&#32;', 'iso-8859-1').text
=> " "
>> Nokogiri::HTML.fragment('&#35;', 'iso-8859-1').text
=> "#"
@pboling

Yes, the full HTML document behaves the same way on Windows (and Redhat):

irb(main):003:0> Nokogiri.HTML('&nbsp;').at_css('body').content
=> " "

I will check on trying with a newer libxml on redhat.

@pboling

Did some tests on my Mac running 10.6.8, ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-darwin10.7.0]

There are similar problems with the same character set.

Non-working examples using entity names and entity numbers:

>> Nokogiri::HTML('&nbsp;').text
=> "\u00A0"
>> Nokogiri::HTML('&nbsp;', 'utf-8').text
=> "\u00A0"
>> Nokogiri::HTML('&#160;', 'utf-8').text
=> "\u00A0"
>> Nokogiri::HTML('&#174;', 'utf-8').text
=> "\u00AE"
>> Nokogiri::HTML.fragment('&#174;', 'utf-8').text
=> "\u00AE"
>> Nokogiri::HTML('&#160;', 'iso-18859-1').text
=> "\u00A0"
>> Nokogiri::HTML.fragment('&#160;', 'iso-18859-1').text
=> "\u00A0"

Working examples:

>> Nokogiri::HTML('&apos;', 'utf-8').text
=> "'"
>> Nokogiri::HTML('&#32;', 'utf-8').text
=> " "
>> Nokogiri::HTML('&#35;', 'utf-8').text
=> "#"

Here is Nokogiri version info:

$nokogiri -v
--- 
warnings: []

nokogiri: 1.4.4
ruby: 
  version: 1.9.2
  platform: x86_64-darwin10.7.0
  engine: ruby
libxml: 
  binding: extension
  compiled: 2.7.6
  loaded: 2.7.6

Still working on trying the newer libxml on RHEL.

@pboling

Just installed latest Nokogiri 1.5.0 on my Mac.

Non-working examples using entity names and entity numbers:

>> require 'nokogiri'
=> true
>> Nokogiri::VERSION
=> "1.5.0"
>> Nokogiri::HTML('&nbsp;', 'utf-8').text
=> "\u00A0"
>> Nokogiri.HTML('&nbsp;').at_css('body').content
=> "\u00A0"
>> Nokogiri.HTML('&#160;').at_css('body').content
=> "\u00A0"
>> Nokogiri.HTML('&#174;').at_css('body').content
=> "\u00AE"

Working examples:

>> Nokogiri.HTML('&#35;').at_css('body').content
=> "#"
>> Nokogiri.HTML('&#35;').text
=> "#"
>> Nokogiri.HTML('&apos;').text
=> "'"

Here is Nokogiri version info:

$nokogiri -v
# Nokogiri (1.5.0)
    --- 
    warnings: []

    nokogiri: 1.5.0
    ruby: 
      version: 1.9.2
      platform: x86_64-darwin10.7.0
      description: ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-darwin10.7.0]
      engine: ruby
    libxml: 
      binding: extension
      compiled: 2.7.6
      loaded: 2.7.6
@pboling

In Windows XP:

I think I have figured something out. In a brand new vanilla irb session (no rails) I tried an experiment:

irb(main):001:0> require 'nokogiri'
=> true
irb(main):023:0> Nokogiri::HTML('&nbsp;').text
=> "\u00A0"
irb(main):024:0> Nokogiri::HTML('&nbsp;').text.encoding
=> #<Encoding:UTF-8>


irb(main):025:0> require 'rails'
=> true
irb(main):026:0> Nokogiri::HTML('&nbsp;').text.encoding
=> #<Encoding:UTF-8>
irb(main):027:0> Nokogiri::HTML('&nbsp;').text
=> " "

It is rails that is FUBARing the string... somehow.

@tenderlove
Owner

What does it say when you do '&nbsp;'.encoding ?

@pboling

In Windows XP:

 irb(main):028:0> '&nbsp;'.encoding
 => #<Encoding:IBM437>

However...

irb(main):029:0> a = '&nbsp;'.force_encoding('UTF-8')
=> "&nbsp;"
irb(main):030:0> a.encoding
=> #<Encoding:UTF-8>
irb(main):031:0> Nokogiri::HTML('&nbsp;', 'UTF-8').text
=> " "
irb(main):034:0> Nokogiri::HTML(a, 'UTF-8').text
=> " "
@tenderlove
Owner

Please don't use force_encoding. force_encoding maintains the same bytes and does no transcoding. Your string is still encoded with IBM437 but is now erroneously tagged as UTF-8.
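To make the difference concrete, here is a minimal sketch using only core String methods. The sample byte is an assumption chosen for illustration (0xFF is the non-breaking space in IBM437); it is not taken from the session above.

```ruby
# Hypothetical sample: a single byte 0xFF, tagged as IBM437, where it
# stands for the non-breaking space.
raw = [0xFF].pack('C').force_encoding('IBM437')

# force_encoding only changes the label; the byte stays exactly the same.
relabeled = raw.dup.force_encoding('UTF-8')

# encode transcodes: the bytes are rewritten so they mean the same
# character in the target encoding.
transcoded = raw.encode('UTF-8')

puts "relabeled bytes:  #{relabeled.bytes.inspect}"
puts "relabeled valid?  #{relabeled.valid_encoding?}"
puts "transcoded bytes: #{transcoded.bytes.inspect}"
puts "transcoded is U+00A0? #{transcoded == "\u00A0"}"
```

The relabeled string keeps its original byte and is no longer valid UTF-8, while the transcoded one carries the two-byte UTF-8 form of the same character.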

Can you try this:

> Nokogiri.HTML('&nbsp;', nil, 'IBM437')

Also compare it to this:

> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8')

Thanks!

@pboling

In Windows XP:

In irb after requiring rails & nokogiri:

irb(main):040:0> Nokogiri.HTML('&nbsp;', nil, 'IBM437').text
=> " "
irb(main):041:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> " "

A new session of irb with only nokogiri loaded:

irb(main):003:0> Nokogiri.HTML('&nbsp;', nil, 'IBM437').text
=> "\u00A0"
irb(main):004:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
@pboling

Results on my Mac are interesting (Nokogiri 1.5.0, Rails 3.0.9, Ruby 1.9.2p136):

In a fresh vanilla irb session everything is great (I assume that "\u00A0" is what I should see as it is the UTF-8 string for non breaking space):

>> require 'nokogiri'
=> true
>> '&nbsp;'.encoding.name
=> "US-ASCII"
>> Nokogiri.HTML('&nbsp;', nil, 'UTF-8').text
=> "\u00A0"
>> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"

Now I require rails and things go crazy:

>> require 'rails'
=> true
>> '&nbsp;'.encoding.name
=> "US-ASCII"
>> Nokogiri.HTML('&nbsp;', nil, 'UTF-8').text
=> " "
>> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> " "
@tenderlove
Owner

Interesting. Yes, 00A0 is correct for UTF-8.

So this is definitely rails doing something.

After you load rails and nokogiri, can you run puts Nokogiri::VersionInfo.instance.to_markdown and make sure it matches the output from nokogiri -v. If they are the same, can you open a new irb session and check Encoding.default_internal and Encoding.default_external before and after loading rails.

Also, it seems your terminal is not set to UTF-8 encoding. Did you manually set it to the IBM encoding? This is starting to seem like a rails bug.

@pboling

Back on Windows XP, Ruby 1.9.2p180 i386-mingw32, Nokogiri 1.5.0, libxml 2.7.7...

I have narrowed it down. I began by requiring the gems in the same order they are bundled by bundler into my rails app and stopped when the encoding broke:

C:\home>irb
irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):003:0> require 'rake'
=> true
irb(main):004:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):005:0> require 'abstract'
=> true
irb(main):006:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):008:0> require 'active_support/railtie'
=> true
irb(main):009:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> " "

It appears that the activesupport gem (or a dependency - though it doesn't seem to have any dependencies) may be the culprit. But then I went further along my list of gems from bundler, and found that requiring activemodel has the same effect.

irb(main):002:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):003:0> require 'builder'
=> true
irb(main):004:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):005:0> require 'i18n'
=> true
irb(main):006:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):007:0> require 'active_model/railtie'
=> true
irb(main):008:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> " "

It appears that the activerecord gem has the same effect.

irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):003:0> require 'active_record/railtie'
=> true
irb(main):004:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> " "
@pboling

In response to your request:

C:\home>irb
irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> require 'rails'
=> true
irb(main):003:0> puts Nokogiri::VersionInfo.instance.to_markdown
# Nokogiri (1.5.0)
    ---
    warnings: []
    nokogiri: 1.5.0
    ruby:
      version: 1.9.2
      platform: i386-mingw32
      description: ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
      engine: ruby
    libxml:
      binding: extension
      compiled: 2.7.7
      loaded: 2.7.7
=> nil
irb(main):004:0> exit

C:\home>nokogiri -v
# Nokogiri (1.5.0)
    ---
    warnings: []
    nokogiri: 1.5.0
    ruby:
      version: 1.9.2
      platform: i386-mingw32
      description: ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
      engine: ruby
    libxml:
      binding: extension
      compiled: 2.7.7
      loaded: 2.7.7

C:\home>irb
irb(main):001:0> Encoding.default_internal
=> nil
irb(main):002:0> Encoding.default_external
=> #<Encoding:IBM437>
irb(main):003:0> require 'rails'
=> true
irb(main):004:0> Encoding.default_internal
=> #<Encoding:UTF-8>
irb(main):005:0> Encoding.default_external
=> #<Encoding:UTF-8>
@pboling

Expanding on that... the problem is in the railtie gem.

Lines 16-27 of lib/rails/rails.rb in the railtie gem:

# For Ruby 1.8, this initialization sets $KCODE to 'u' to enable the
# multibyte safe operations. Plugin authors supporting other encodings
# should override this behaviour and set the relevant +default_charset+
# on ActionController::Base.
#
# For Ruby 1.9, UTF-8 is the default internal and external encoding.
if RUBY_VERSION < '1.9'
  $KCODE='u'
else
  Encoding.default_external = Encoding::UTF_8
  Encoding.default_internal = Encoding::UTF_8
end

The two Encoding assignments in the else branch are executed in every case I've reported (all on Ruby 1.9.2). I ran those two lines separately in a vanilla irb session with only nokogiri loaded and watched the magical unicorn bug appear. It occurs after setting either one of them.

C:\home>irb
irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> "\u00A0"
irb(main):003:0> Encoding.default_external = Encoding::UTF_8
=> #<Encoding:UTF-8>
irb(main):004:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> " "


C:\home>irb
irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> Encoding.default_internal = Encoding::UTF_8
=> #<Encoding:UTF-8>
irb(main):003:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text
=> " "
@pboling

A little more digging:

irb(main):011:0> a = '&nbsp;'
=> "&nbsp;"
irb(main):012:0> a.encoding
=> #<Encoding:IBM437>
irb(main):013:0> a.bytes {|c| print c, ' '}
38 110 98 115 112 59 => "&nbsp;"
irb(main):014:0> b = '&nbsp'.encode('UTF-8')
=> "&nbsp"
irb(main):015:0> b.bytes {|c| print c, ' '}
38 110 98 115 112 => "&nbsp"
irb(main):016:0> b.encoding
=> #<Encoding:UTF-8>
irb(main):018:0> require 'nokogiri'
=> true

irb(main):019:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text.bytes {|c| print c, ' '}
194 160 => "\u00A0"

irb(main):021:0> Encoding.default_external
=> #<Encoding:IBM437>
irb(main):022:0> Encoding.default_internal
=> nil
irb(main):023:0> Encoding.default_internal = Encoding::UTF_8
=> #<Encoding:UTF-8>
irb(main):024:0> Encoding.default_external
=> #<Encoding:IBM437>
irb(main):025:0> Encoding.default_internal
=> #<Encoding:UTF-8>

irb(main):026:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text.bytes {|c| print c, ' '}
194 160 => " "

irb(main):027:0> Encoding.default_external = Encoding::UTF_8
=> #<Encoding:UTF-8>

irb(main):028:0> Nokogiri.HTML('&nbsp;'.encode('UTF-8'), nil, 'UTF-8').text.bytes {|c| print c, ' '}
194 160 => " "

It looks like the bytes are correct, 194 160 in each case.

@pboling

To summarize:

Setting either Encoding.default_external = Encoding::UTF_8 or Encoding.default_internal = Encoding::UTF_8 causes Nokogiri.HTML('&nbsp;') to return " " instead of "\u00A0", despite the actual bytes being the same (194 160) for both strings. This problem occurs when requiring rails because on lines 16-27 of lib/rails/rails.rb in the railtie gem the encodings are set in this manner.

Where do you think the blame for this bug lies? I'm guessing it is not in Nokogiri.

@tenderlove
Owner

Yes, this is definitely not a bug in nokogiri. I think the blame lies with Rails, but I need to figure out the "right" thing for rails to do. I'm guessing that Rails should consult your terminal's encoding before messing with the Encoding settings, but that's really a guess at this point. I need to talk to @wycats about this.

@pboling

FYI, we experience this issue when running our test suite as well, so not just in console.

Were you ever able to replicate the issue?

@tenderlove
Owner

@pboling I haven't been able to reproduce this issue, but I haven't had a chance to run my terminal as something besides UTF-8 (which I think is the crux of the problem).
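For anyone without a non-UTF-8 terminal, the garbled output can be approximated in plain Ruby. This is a sketch under the assumption that the console simply interprets the raw UTF-8 bytes as IBM437; the helper name as_cp437_console is made up for illustration and is not part of Ruby or Nokogiri.

```ruby
# Hypothetical helper: show what a code page 437 console would display if
# the UTF-8 bytes of +str+ were written to it unchanged.
def as_cp437_console(str)
  str.encode('UTF-8')        # make sure we start from UTF-8 bytes
     .b                      # binary copy of those bytes
     .force_encoding('IBM437') # pretend the bytes are IBM437, as the console would
     .encode('UTF-8')        # transcode so the result prints on a UTF-8 terminal
end

# Pilcrow, right guillemet, division sign: three of the entities reported above.
["\u00B6", "\u00BB", "\u00F7"].each do |char|
  puts "#{char.bytes.inspect} displays as #{as_cp437_console(char)}"
end
```

Under that assumption the three characters come out as the same box-drawing pairs shown in the Windows output at the top of this issue, which fits the idea that the bytes are correct and only their display is wrong.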

@zabojnik

I just ran into this error while trying to build an HTML document with the parser and builder. Any updates or workarounds?

Nokogiri::HTML("<p>&#197;</p>").to_html
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>Ã
</p></body></html>\n"
@josegrad

Hi, I'm trying to make some changes to an html page encoded with charset=iso-8859-1

doc = Nokogiri::HTML(open(html_file))

puts doc.to_html messes up all the accents in the page. So if I save it back it looks broken in the browser as well.

I guess I'm having the same issue as the rest. I'm still on Rails 3.0.6...

I hope there is a cure for this problem. Any hints?

Here's one of the pages suffering from that for example: http://www.elmundo.es/accesible/elmundo/2012/03/07/solidaridad/1331108705.html

UPDATE 1 24 March 2012

I'm not really sure my problem is the same as this issue. I managed to partially solve it. I believe this has nothing to do with Nokogiri, because I only need to open and save the file to get the accents messed up.

The closest to a fix I got is doing this:

thefile = File.open(html_file, "r")
text = thefile.read
doc = Nokogiri::HTML(text)
... do any stuff with nokogiri
File.open(html_file, 'w') {|f| f.write(doc.to_html) }

So, Nokogiri does not open the file but gets the html taken from the file.

The original file is in iso-8859-1 and the saved one comes out in utf-8, and it mostly looks OK: the accents are in place, except in some places :-P where I get question marks, as in Econom�a, where there should be an í (i with an accent).

Getting closer I think. If someone has a hint to cover the capital letters as well it might be almost done.

Cheers.
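One thing that may be worth trying for the file case is telling Ruby the file's encoding when reading it, so the text is transcoded to UTF-8 before anything else touches it. The sketch below uses only the standard library; the helper name read_latin1_as_utf8 and the sample bytes are assumptions for illustration, and whether this resolves the remaining question marks on that particular page is untested.

```ruby
require 'tempfile'

# Hypothetical helper: read an ISO-8859-1 file and return a UTF-8 string.
# The "external:internal" form names the file's encoding and the encoding
# Ruby should transcode to while reading.
def read_latin1_as_utf8(path)
  File.read(path, encoding: 'ISO-8859-1:UTF-8')
end

# Stand-in for the real page: "Econom", then byte 0xED (i with acute accent
# in ISO-8859-1), then "a".
Tempfile.create('latin1_page') do |file|
  file.binmode
  file.write([0x45, 0x63, 0x6F, 0x6E, 0x6F, 0x6D, 0xED, 0x61].pack('C*'))
  file.flush

  text = read_latin1_as_utf8(file.path)
  puts text
  puts text.encoding
  puts text.valid_encoding?
end
```

The resulting string could then be handed to Nokogiri with an explicit encoding, in the same form used earlier in this thread: Nokogiri::HTML(text, nil, 'UTF-8').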
