Unable to output the HTML::Document in the original encoding #215

Closed
romanbsd opened this issue Jan 27, 2010 · 8 comments

@romanbsd

romanbsd commented Jan 27, 2010

Consider this scenario:

require 'open-uri'
require 'nokogiri'
puts Nokogiri::VERSION_INFO
puts Nokogiri::LIBXML_ICONV_ENABLED
f = open('http://www.hometheater.co.il/').read; true
puts f.encoding.to_s
p = Nokogiri::HTML(f); true
puts p.encoding
puts p.meta_encoding
p.to_html

Result:

{"warnings"=>[], "nokogiri"=>"1.4.1", "libxml"=>{"binding"=>"extension", "compiled"=>"2.7.6", "loaded"=>"2.7.6"}}
true
ASCII-8BIT
windows-1255
windows-1255
encoding error : output conversion failed due to conv error, bytes 0xEE 0xD7 0x92 0xD7
I/O error : encoder error

This effectively prevents me from outputting the page in its original "windows-1255" encoding.

@tenderlove
Member

Huh. It works for me. What version of iconv are you using?

@tenderlove
Member

iconv --version

Also, iconv -l might help

@romanbsd
Author

Happens both on FC12 Linux:
iconv (GNU libc) 2.11.1
and on my Mac:
iconv (GNU libiconv 1.13)
I verified by otool -L that libxml2 which nokogiri uses is in fact linked to this dylib.
I'm using
ruby 1.9.1p378 (2010-01-10 revision 26272) [i686-linux]
and
ruby 1.9.1p378 (2010-01-10 revision 26272) [i386-darwin10.2.0]
respectively.
iconv -l is rather long, how would you like me to provide it?

@tenderlove
Member

Just put the iconv output in a gist if you don't mind.

I tried this with 1.8.7, let me try with 1.9.1p378. I'm starting to suspect it's the encoding stuff in Ruby rather than libxml2.

@romanbsd
Author

Yes, I suspect that too.
http://gist.github.com/289660

@tenderlove
Member

Alright, so I can confirm this is a bug in libxml2. It's happening because there are "strange" characters occurring before the meta tag, and libxml2 isn't using the correct encoding for the characters in the title tag.
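
One plausible reading of the bytes in the error message (0xEE 0xD7 0x92 0xD7) fits this explanation: 0xEE looks like a single windows-1255 byte that was never transcoded, followed by bytes that are already UTF-8. The stdlib-only sketch below checks that reading. It is an illustration of the byte values only, not a trace of what libxml2 does internally, and the helper names `valid_utf8?` and `from_windows_1255` are made up for this example.

```ruby
# Reinterpret raw bytes as UTF-8 and report whether they are well-formed.
def valid_utf8?(bytes)
  bytes.b.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Decode raw bytes as windows-1255 and transcode them to UTF-8.
def from_windows_1255(bytes)
  bytes.b.force_encoding(Encoding::Windows_1255).encode(Encoding::UTF_8)
end

# The byte run reported in the error message.
error_bytes = "\xEE\xD7\x92\xD7".b

# As UTF-8 the run is malformed: 0xEE opens a three-byte sequence,
# but 0xD7 is not a continuation byte.
puts valid_utf8?(error_bytes)

# On its own, 0xEE is a valid windows-1255 byte (a Hebrew letter).
puts from_windows_1255("\xEE".b)

# 0xD7 0x92 is already a well-formed UTF-8 sequence.
puts valid_utf8?("\xD7\x92".b)
```

If that reading is right, the document tree holds a mix of untranscoded and transcoded bytes, which would explain why converting it back to windows-1255 fails.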

We can demonstrate that the problem is in libxml2 by explicitly specifying the encoding when parsing the document:

require 'open-uri'
require 'nokogiri'

f = open('http://www.hometheater.co.il/').read

doc = Nokogiri::HTML(f, nil, 'windows-1255')
puts doc.encoding
puts doc.meta_encoding
p doc.to_html # yay! it works!

or by removing the title tag before emitting the document:

require 'open-uri'
require 'nokogiri'

f = open('http://www.hometheater.co.il/').read

doc = Nokogiri::HTML(f)
puts doc.encoding
puts doc.meta_encoding
doc.at('title').unlink
p doc.to_html # yay! it works!

I believe the bug is related to this ticket I filed with libxml2 a while back:

https://bugzilla.gnome.org/show_bug.cgi?id=579317

http://mail.gnome.org/archives/xml/2009-April/msg00035.html

But unfortunately it seems that fix didn't deal with this case. I will put together a C program to reproduce the problem and submit a test case to the libxml2 people. Until this is fixed in libxml2, the best workaround is to explicitly specify the encoding when parsing the document.
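
When the encoding is not known ahead of time, one way to apply this workaround is to look for the charset declaration in the raw bytes first and pass the result to the parser. The helper below is a hypothetical, stdlib-only sketch (`sniff_meta_charset` and its 4096-byte limit are assumptions for illustration, not part of Nokogiri); it uses a simple regex and will miss unusual markup.

```ruby
# Hypothetical helper: look for a charset declaration near the start of an
# HTML document without decoding it. Returns nil when none is found.
def sniff_meta_charset(html_bytes, limit = 4096)
  # Work on a binary copy so bytes that are invalid in the string's
  # current encoding cannot cause a match error.
  head = html_bytes.b[0, limit]
  match = head.match(/<meta[^>]*charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)/i)
  match && match[1]
end

# A small stand-in for the page: non-ASCII title bytes appear before the meta tag.
sample = "<html><head><title>\xEE\xE5</title>" \
         "<meta http-equiv=\"Content-Type\" " \
         "content=\"text/html; charset=windows-1255\"></head></html>".b

puts sniff_meta_charset(sample)
```

The result could then be passed as the third argument, for example `Nokogiri::HTML(f, nil, sniff_meta_charset(f))`, matching the explicit-encoding call shown earlier.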

Outputting the document as UTF-8 also seems to work, but I'm not sure that the data in the title tag is correct:

require 'open-uri'
require 'nokogiri'

f = open('http://www.hometheater.co.il/').read

doc = Nokogiri::HTML(f)
puts doc.encoding
puts doc.meta_encoding
p doc.to_html(:encoding => 'UTF-8')
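
One way to check whether the UTF-8 output is actually well-formed is to scan it for byte runs that are not valid UTF-8. The sketch below uses only the Ruby standard library; `invalid_utf8_runs` is a hypothetical helper name, and the sample strings stand in for real `to_html` output rather than reproducing it.

```ruby
# Hypothetical helper: return the byte runs (as hex strings) that are not
# valid UTF-8 in the given text. An empty array means the text is well-formed.
def invalid_utf8_runs(text)
  runs = []
  # String#scrub yields each invalid byte run to the block; the block's
  # return value is the replacement, which is discarded here.
  text.b.force_encoding(Encoding::UTF_8).scrub do |bytes|
    runs << bytes.unpack1('H*')
    '?'
  end
  runs
end

# Well-formed UTF-8 title: nothing is reported.
puts invalid_utf8_runs("<title>\xD7\x92</title>".b).inspect

# A stray windows-1255 byte (0xEE) in front of UTF-8 bytes is reported.
puts invalid_utf8_runs("<title>\xEE\xD7\x92</title>".b).inspect
```

An empty result only shows that the bytes are valid UTF-8; it does not prove that the characters are the ones the page author intended.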

@darkhelmet

I'm running into this bug too. Time to specify me some encoding.

@flavorjones
Member

This appears to be fine now, with iconv 2.23 and

# Nokogiri (1.7.0.1)
    ---
    warnings: []
    nokogiri: 1.7.0.1
    ruby:
      version: 2.4.0
      platform: x86_64-linux
      description: ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/home/flavorjones/.rvm/gems/ruby-2.4.0/gems/nokogiri-1.7.0.1/ports/x86_64-pc-linux-gnu/libxml2/2.9.4"
      libxslt_path: "/home/flavorjones/.rvm/gems/ruby-2.4.0/gems/nokogiri-1.7.0.1/ports/x86_64-pc-linux-gnu/libxslt/1.1.29"
      libxml2_patches: []
      libxslt_patches: []
      compiled: 2.9.4
      loaded: 2.9.4

If this is still an issue for you, please comment and we'll reopen.
