StringField get wrong encoding #120

localshred · 2013-09-17T04:42:54Z

I've got a StringField whose value contains unicode characters.

It seems that when the message is serialized to binary bytes and then deserialized to message, the unicode characters in the StringField is replaced with empty char "".

I found that in Protobuf::Field::StringFiled.encode(value) function, value's encoding is "ASCII-8BIT", but it should be UTF-8 actually. After calling value.encode!, because char in UTF-8 is not valid in ASCII-8BIT, these chars are replaced with char "".

I didn't find out when and where the value encoding is changed from UTF-8 to ASCII-8BIT.

I plan to comment out the encode! line to get rid of this temporarily.

Hope this will be resolved.

thanks!

localshred · 2013-09-16T15:55:17Z

@foxban I'll check this out. Could you please provide a failing gist example? Also which ruby version you are running is important as JRuby and MRI behave differently.

localshred · 2013-09-16T16:33:48Z

@foxban here is my rudimentary test to verify this issue. I've run the test on ruby 2.0.0-p195 and jruby-1.7.4 and cannot reproduce your issue.

  context 'when decoding an encoded string with unicode characters' do
    it 'decodes to the same string' do
      source_string = "¢"
      encoded = ::Test::Resource.encode(:name => source_string)
      decoded = ::Test::Resource.decode(encoded)
      decoded.name.should eq source_string
    end
  end

Can you provide a failing example?

foxban · 2013-09-17T01:41:49Z

@localshred , I figured it out finally.

Calling top level message's to_s method twice is the key point to reproduce it.

In my production environment, message is written into log with logger.error() method, and then sent to the client, so to_s is called twice implicitly.

I think that's why I encountered the issue.

Is it a bug? Or I MUST NOT call to_s for more than one time?

protobuf (2.8.5)

thanks.

localshred · 2013-09-17T04:03:20Z

@foxban So just to be clear, you have a proto like:

proto = MyProto.new(:foo => 'bar')
proto.to_s
proto.to_s # <- this raises the encoding error?

It's difficult to reproduce without code to look at and test.

localshred · 2013-09-17T04:08:08Z

OK, I was able to reproduce the issue calling encode on a proto instance twice. The first time the encoding is correct, the second time it fails.

resource = Test::Resource.new(:name => "\u20ac")
# => {:name=>"€"}
>> resource.encode
# => "\n\x03\xE2\x82\xAC"
>> resource.encode
# => "\n\u0000"

Now to figure out why.

We should always dup the string or bytes value before encoding because we are `force_encoding` to BINARY encoding (ascii-8bit) which is actually changing the value of the string if it contains unicode characters. Before this change, if a string or bytes field contained a unicode character, that field was effectively corrupted from its source value after the proto was encoded. ```ruby resource = Test::Resource.new(:name => "\u20ac") # => {:name=>"€"} resource.encode resource.name # => "\xE2\x82\xAC" resource.encode resource.name # => "" ``` It should be noted this behavior is not present when the string contains characters that fall inside the normal ascii range. Fixes #120.

localshred · 2013-09-17T04:43:39Z

Any feedback @abrandoned or @devin-c?

Fix string/bytes encoding when unicode characters present

localshred · 2013-09-17T15:38:47Z

Released the fix for this issue under v2.8.6 and v2.7.12.

ghost assigned abrandoned Sep 16, 2013

localshred mentioned this pull request Sep 17, 2013

Backport string field encode fixes from 2.8.x #122

Merged

localshred added a commit that referenced this pull request Sep 17, 2013

Merge pull request #120 from localshred/bj/120_fix_string_encodings

61fc0dc

Fix string/bytes encoding when unicode characters present

localshred merged commit 61fc0dc into 2-8-stable Sep 17, 2013

localshred deleted the bj/120_fix_string_encodings branch September 17, 2013 15:28

foxban unassigned abrandoned May 5, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

StringField get wrong encoding #120

StringField get wrong encoding #120

Uh oh!

localshred commented Sep 17, 2013

Uh oh!

localshred commented Sep 16, 2013

Uh oh!

localshred commented Sep 16, 2013

Uh oh!

foxban commented Sep 17, 2013

Uh oh!

localshred commented Sep 17, 2013

Uh oh!

localshred commented Sep 17, 2013

Uh oh!

localshred commented Sep 17, 2013

Uh oh!

localshred commented Sep 17, 2013

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

StringField get wrong encoding #120

StringField get wrong encoding #120

Uh oh!

Conversation

localshred commented Sep 17, 2013

Uh oh!

localshred commented Sep 16, 2013

Uh oh!

localshred commented Sep 16, 2013

Uh oh!

foxban commented Sep 17, 2013

Uh oh!

localshred commented Sep 17, 2013

Uh oh!

localshred commented Sep 17, 2013

Uh oh!

localshred commented Sep 17, 2013

Uh oh!

localshred commented Sep 17, 2013

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants