decoding failure for Ç #3

Closed
wksmall opened this issue Dec 2, 2010 · 9 comments

@wksmall

wksmall commented Dec 2, 2010

I seem to be having a problem with Ç. For example: 'FRANÇOIS' is being decoded as 'FRANÇOIS'. However, 'François' is correctly handled as 'François'. I thought that it might be a case of the input string being latin1, but I'm pretty sure that's not the case and your documentation seems to imply that it won't decode things that it doesn't understand.

@threedaymonk
Owner

I'm unable to reproduce this. In irb:

>> $KCODE = 'u'
=> "u"
>> require 'htmlentities'
=> true
>> coder = HTMLEntities.new
=> #<HTMLEntities:0x7f58e53a71f0 @flavor="xhtml1">
>> coder.decode('FRAN&Ccedil;OIS')
=> "FRANÇOIS"
>> coder.decode('Fran&ccedil;ois')
=> "François"

If you can supply a minimal test case, I'll investigate.

@wksmall
Author

wksmall commented Dec 2, 2010

Do you know of a way for me to be absolutely certain that I am feeding UTF-8 to the decoder? I have a unit test on that section of my app that also passes; however, when I run the app in full, I get this error. I don't for an instant discount a problem further down the line, but my investigations so far led me to suspect the decoder or what I'm feeding it.
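
One way to answer the question above is to check the raw bytes for UTF-8 validity before handing them to the decoder. The following is a minimal sketch, assuming Ruby 2.0 or later (the Ruby 1.8.7 mentioned in this thread has no `String#force_encoding`); the helper name `utf8_bytes?` is hypothetical and not part of htmlentities.

```ruby
# Hypothetical helper (not part of htmlentities):
# returns true if the raw bytes of +str+ form valid UTF-8.
def utf8_bytes?(str)
  str.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# "Ç" as the single Latin-1 byte 0xC7, kept as raw binary data.
latin1 = "FRAN\xC7OIS".b
# "Ç" as the two-byte UTF-8 sequence 0xC3 0x87, kept as raw binary data.
utf8 = "FRAN\xC3\x87OIS".b

puts utf8_bytes?(latin1) # false: a lone 0xC7 byte is not valid UTF-8
puts utf8_bytes?(utf8)   # true
```

A check like this only says that the bytes are well-formed UTF-8; it cannot prove that the text was intended as UTF-8.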

@wksmall
Author

wksmall commented Dec 2, 2010

UPDATE: In script/console, I get this:
>> $KCODE='u'
=> "u"
>> coder = HTMLEntities.new
=> #<HTMLEntities:0x2aaaac8274f0 @flavor="xhtml1">
>> coder.decode('FRANÇOIS')
=> "FRANÃOIS"
>> coder.decode('François')
=> "François"

In irb, the require 'htmlentities' didn't work. I probably need the full path.

We're using htmlentities 4.2.0 under Rails 2.3.8 and Ruby 1.8.7

@threedaymonk
Owner

It looks like your terminal is not UTF-8. That's a separate problem.
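
As an illustration of how a display-side mismatch could produce output like the console transcript above, the sketch below takes correct UTF-8 and misreads its bytes as ISO-8859-1. It assumes Ruby 2.0 or later and a UTF-8 source file; it is one plausible explanation of the symptom, not a diagnosis of the reporter's setup.

```ruby
# Correct UTF-8 text: "Ç" (U+00C7) is stored as the two bytes 0xC3 0x87.
utf8 = "FRAN\u00C7OIS"

# Reinterpret the same bytes as ISO-8859-1, then transcode back to UTF-8
# so the misreading can be inspected.
misread = utf8.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

# 0xC3 becomes "Ã" (U+00C3); 0x87 becomes the invisible control character U+0087.
p misread           # inspect form shows the hidden control character
puts utf8.length    # 8
puts misread.length # 9
```

The extra character is a non-printing control code, which would explain why only "Ã" appears where "Ç" was expected.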

@wksmall
Author

wksmall commented Dec 2, 2010

That's what I thought too but locale seems to think otherwise.

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

@wksmall
Author

wksmall commented Dec 2, 2010

UPDATE: I found that further down the line in our code, I am trying to decode hex entities created by Nokogiri. It is the hex entity for Ç that is failing to decode properly. I'm getting Ã instead of Ç.

@threedaymonk
Owner

locale tells you what your programs are using, but it doesn't tell you what the terminal emulator is doing. (Which one are you using, by the way?)

However, let's try a different tack. Regardless of what your terminal is doing, the bytes should be the same. In irb:

>> coder.decode("&ccedil;").unpack("C*")
=> [195, 167]
>> coder.decode("&Ccedil;").unpack("C*")
=> [195, 135]
>> coder.decode("&#xC7;").unpack("C*")
=> [195, 135]

What results do you get?
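
The expected byte values above can be reproduced without htmlentities, since they are simply the UTF-8 encodings of U+00C7 and U+00E7. This is a small sketch using only core `Array#pack` and `String#unpack`; the variable names are illustrative.

```ruby
# Encode the code points as UTF-8 ("U" directive), then list the bytes ("C*").
upper_bytes = [0xC7].pack("U").unpack("C*") # Ç, U+00C7
lower_bytes = [0xE7].pack("U").unpack("C*") # ç, U+00E7

# For comparison: the same character as a single Latin-1 byte.
latin1_bytes = [0xC7].pack("C").unpack("C*")

p upper_bytes  # [195, 135]
p lower_bytes  # [195, 167]
p latin1_bytes # [199]
```

If the decoder returns `[195, 135]` for Ç, its output is correct UTF-8, and any garbling happens before the input reaches it or after the output leaves it.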

@wksmall
Author

wksmall commented Dec 3, 2010

I'm using PuTTY in xterm mode.

I ran your example and got the same results. I've looked into this further and have determined that my tests are inadequate and misleading. It looks like my problem is that I have been feeding Latin-1 encoded input to your decoder, and this is the source of my difficulties. If I make sure that it is UTF-8, I get the proper decoding.

Thank you for your time. Sorry to have wasted it.
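
For readers who hit the same problem, the fix described above (making sure the input is UTF-8 before decoding) might look like the following sketch. It assumes Ruby 2.0 or later and that the input is known to be ISO-8859-1; on the Ruby 1.8.7 used in this thread, a library such as Iconv would be needed instead. The helper name `latin1_to_utf8` is hypothetical.

```ruby
# Hypothetical helper: treat the raw bytes as ISO-8859-1 and transcode to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

raw = "FRAN\xC7OIS".b       # Latin-1 input: "Ç" is the single byte 0xC7
fixed = latin1_to_utf8(raw) # now "Ç" is the two bytes 0xC3 0x87

puts fixed.valid_encoding?  # true
p fixed.unpack("C*")        # byte values of the converted string
```

The converted string can then be passed to the decoder like any other UTF-8 input.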

@threedaymonk
Owner

Ah, yes, doing UTF-8 on Windows is usually harder than it should be. Glad you've got a bit closer to the problem.

This issue was closed.