Unable to output the HTML::Document in the original encoding #215
Comments
Huh. It works for me. What version of iconv are you using?
Also,
Happens both on FC12 Linux:
Just put the iconv output in a gist if you don't mind. I tried this with 1.8.7; let me try with 1.9.1p378. I'm starting to suspect it's the encoding handling in Ruby rather than libxml2.
Yes, I suspect that too.
Alright, so I can confirm this is a bug in libxml2. It happens because "strange" characters occur before the meta tag, so libxml2 doesn't use the correct encoding for the characters in the title tag. We can demonstrate the problem in libxml2 by explicitly specifying an encoding when parsing the document:
or by removing the title tag before emitting the document:
I believe the bug is related to this ticket I filed with libxml2 a while back: https://bugzilla.gnome.org/show_bug.cgi?id=579317 (mailing list thread: http://mail.gnome.org/archives/xml/2009-April/msg00035.html). Unfortunately, it seems that fix didn't deal with this case. I will put together a C program to reproduce the problem and submit a test case to the libxml2 people. Until this is fixed in libxml2, the best workaround is to explicitly specify the encoding when parsing the document. Outputting the document as UTF-8 also seems to work, but I'm not sure that the data in the title tag is correct:
I'm running into this bug too. Time to specify me some encoding.
This appears to be fine now, with iconv 2.23 and
If this is still an issue for you, please comment and we'll reopen.
Consider this scenario:
Result:
So it effectively prevents me from outputting this page in its original "windows-1255" encoding.