decoding failure for &Ccedil; #3

wksmall · 2010-12-02T20:02:39Z

I seem to be having a problem with &Ccedil;. For example, 'FRAN&Ccedil;OIS' is being decoded as 'FRANÃ‡OIS'. However, 'Fran&ccedil;ois' is correctly decoded as 'François'. I thought the input string might be latin1, but I'm pretty sure that's not the case, and your documentation seems to imply that it won't decode things it doesn't understand.

threedaymonk · 2010-12-02T20:06:06Z

I'm unable to reproduce this. In irb:

>> $KCODE = 'u'
=> "u"
>> require 'htmlentities'
=> true
>> coder = HTMLEntities.new
=> #<HTMLEntities:0x7f58e53a71f0 @flavor="xhtml1">
>> coder.decode('FRAN&Ccedil;OIS')
=> "FRANÇOIS"
>> coder.decode('Fran&ccedil;ois')
=> "François"

If you can supply a minimal test case, I'll investigate.

wksmall · 2010-12-02T20:22:41Z

Do you know of a way for me to be absolutely certain that I am feeding UTF-8 to the decoder? I have a unit test on that section of my app that also passes; however, when I run the app in full, I get this error. I don't for an instant discount a problem further down the line, but my investigations so far have led me to suspect the decoder or what I'm feeding it.
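One terminal-independent way to check this is to test the raw bytes. The following is a minimal sketch, not part of htmlentities: `utf8_bytes?` is a hypothetical helper name, and it relies on `String#unpack` with the `U*` directive raising `ArgumentError` when it meets a malformed UTF-8 sequence.

```ruby
# Hypothetical helper (not part of htmlentities): reports whether the
# bytes of str form well-formed UTF-8. unpack("U*") raises
# ArgumentError when it meets a malformed sequence.
def utf8_bytes?(str)
  str.unpack("U*")
  true
rescue ArgumentError
  false
end

# "FRANÇOIS" built from explicit bytes, so the result does not depend
# on the source file or terminal encoding.
utf8   = [70, 82, 65, 78, 195, 135, 79, 73, 83].pack("C*") # UTF-8 bytes
latin1 = [70, 82, 65, 78, 199, 79, 73, 83].pack("C*")      # Latin-1 bytes

puts utf8_bytes?(utf8)
puts utf8_bytes?(latin1)
```

A passing check only shows that the bytes are well-formed UTF-8. Pure ASCII input passes too, so this cannot prove where the data came from.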

wksmall · 2010-12-02T20:57:29Z

UPDATE: In script/console, I get this:
>> $KCODE='u'
=> "u"
>> coder = HTMLEntities.new
=> #<HTMLEntities:0x2aaaac8274f0 @flavor="xhtml1">
>> coder.decode('FRAN&Ccedil;OIS')
=> "FRANÃOIS"
>> coder.decode('Fran&ccedil;ois')
=> "FranÃ§ois"

In irb, the require 'htmlentities' didn't work; I probably need the full path.

We're using htmlentities 4.2.0 under Rails 2.3.8 and Ruby 1.8.7.

threedaymonk · 2010-12-02T20:58:51Z

It looks like your terminal is not UTF-8. That's a separate problem.

wksmall · 2010-12-02T21:07:08Z

That's what I thought too but locale seems to think otherwise.

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

wksmall · 2010-12-02T22:01:22Z

UPDATE: I found that further down the line in our code, I am trying to decode hex entities created by Nokogiri. It is &#xC7; that is failing to decode properly. I'm getting Ã instead of Ç.

threedaymonk · 2010-12-02T22:12:03Z

locale tells you what your programs are using, but it doesn't tell you what the terminal emulator is doing. (Which one are you using, by the way?)

However, let's try a different tack. Regardless of what your terminal is doing, the bytes should be the same. In irb:

>> coder.decode("&ccedil;").unpack("C*")
=> [195, 167]
>> coder.decode("&Ccedil;").unpack("C*")
=> [195, 135]
>> coder.decode("&#xC7;").unpack("C*")
=> [195, 135]

What results do you get?
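To illustrate why the byte values matter, here is a small sketch that uses only `Array#pack` and `String#unpack`. It shows the UTF-8 bytes for Ç, and what happens if each of those bytes is misread as a separate Latin-1 character and then encoded as UTF-8 again. The variable names are illustrative only.

```ruby
# U+00C7 (Ç) encoded as UTF-8 is the two bytes 195, 135.
utf8_bytes = [0xC7].pack("U").unpack("C*")

# A Latin-1 display maps each byte to one character: 195 is "Ã" and
# 135 is an unprintable control code, which matches "FRANÃOIS" on
# screen. Treating the two bytes as code points and encoding them as
# UTF-8 again shows what double encoding would look like.
misread_bytes = utf8_bytes.pack("U*").unpack("C*")

p utf8_bytes
p misread_bytes
```

If the decoder's output unpacks to [195, 135], the decoding itself is correct and any Ã seen on screen comes from a later step that interprets those bytes as Latin-1.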

wksmall · 2010-12-03T14:40:34Z

I'm using PuTTY in xterm mode.

I ran your example and got the same results. I've looked into this further and have determined that my tests are inadequate and misleading. It looks like my problem is that I have been feeding latin1-encoded input to your decoder, and this is the source of my difficulties. If I make sure that it is UTF-8, I get the proper decoding.

Thank you for your time. Sorry to have wasted it.
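For input that is known to be Latin-1, one possible conversion to UTF-8 before decoding is sketched below. `latin1_to_utf8` is a hypothetical helper, not something htmlentities provides. It relies on the fact that Latin-1 byte values are identical to the first 256 Unicode code points, so it needs only `unpack` and `pack`.

```ruby
# Hypothetical helper: treats every input byte as a Latin-1 character.
# Latin-1 byte values equal Unicode code points, so packing them with
# "U*" produces the UTF-8 encoding. Only apply this to input known to
# be Latin-1; input that is already UTF-8 would end up double-encoded.
def latin1_to_utf8(str)
  str.unpack("C*").pack("U*")
end

latin1 = [70, 82, 65, 78, 199, 79, 73, 83].pack("C*") # FRANÇOIS as Latin-1
p latin1_to_utf8(latin1).unpack("C*")
```

In the converted output the single byte 199 becomes the pair 195, 135, which is the same byte sequence the decoder produces for &Ccedil;.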

threedaymonk · 2010-12-04T19:54:40Z

Ah, yes, doing UTF-8 on Windows is usually harder than it should be. Glad you've got a bit closer to the problem.

This issue was closed.
