Skip to content

Conversation

@localshred
Copy link
Contributor

I've got a StringField whose value contains unicode characters.

It seems that when the message is serialized to binary bytes and then deserialized to message, the unicode characters in the StringField is replaced with empty char "".

I found that in Protobuf::Field::StringFiled.encode(value) function, value's encoding is "ASCII-8BIT", but it should be UTF-8 actually. After calling value.encode!, because char in UTF-8 is not valid in ASCII-8BIT, these chars are replaced with char "".

I didn't find out when and where the value encoding is changed from UTF-8 to ASCII-8BIT.

I plan to comment out the encode! line to get rid of this temporarily.

Hope this will be resolved.

thanks!

@localshred
Copy link
Contributor

@foxban I'll check this out. Could you please provide a failing gist example? Also which ruby version you are running is important as JRuby and MRI behave differently.

@ghost ghost assigned abrandoned Sep 16, 2013
@localshred
Copy link
Contributor

@foxban here is my rudimentary test to verify this issue. I've run the test on ruby 2.0.0-p195 and jruby-1.7.4 and cannot reproduce your issue.

  context 'when decoding an encoded string with unicode characters' do
    it 'decodes to the same string' do
      source_string = "¢"
      encoded = ::Test::Resource.encode(:name => source_string)
      decoded = ::Test::Resource.decode(encoded)
      decoded.name.should eq source_string
    end
  end

Can you provide a failing example?

@foxban
Copy link
Author

foxban commented Sep 17, 2013

@localshred , I figured it out finally.

Calling top level message's to_s method twice is the key point to reproduce it.

In my production environment, message is written into log with logger.error() method, and then sent to the client, so to_s is called twice implicitly.

I think that's why I encountered the issue.

Is it a bug? Or I MUST NOT call to_s for more than one time?

protobuf (2.8.5)

thanks.

@localshred
Copy link
Contributor

@foxban So just to be clear, you have a proto like:

proto = MyProto.new(:foo => 'bar')
proto.to_s
proto.to_s # <- this raises the encoding error?

It's difficult to reproduce without code to look at and test.

@localshred
Copy link
Contributor

OK, I was able to reproduce the issue calling encode on a proto instance twice. The first time the encoding is correct, the second time it fails.

resource = Test::Resource.new(:name => "\u20ac")
# => {:name=>"€"}
>> resource.encode
# => "\n\x03\xE2\x82\xAC"
>> resource.encode
# => "\n\u0000"

Now to figure out why.

We should always dup the string or bytes value before encoding because
we are `force_encoding` to BINARY encoding (ascii-8bit) which is
actually changing the value of the string if it contains unicode
characters.

Before this change, if a string or bytes field contained a unicode
character, that field was effectively corrupted from its source value
after the proto was encoded.

```ruby
resource = Test::Resource.new(:name => "\u20ac") # => {:name=>"€"}
resource.encode
resource.name # => "\xE2\x82\xAC"
resource.encode
resource.name # => ""
```

It should be noted this behavior is not present when the string
contains characters that fall inside the normal ascii range.

Fixes #120.
@localshred
Copy link
Contributor

Any feedback @abrandoned or @devin-c?

localshred added a commit that referenced this pull request Sep 17, 2013
Fix string/bytes encoding when unicode characters present
@localshred localshred merged commit 61fc0dc into 2-8-stable Sep 17, 2013
@localshred localshred deleted the bj/120_fix_string_encodings branch September 17, 2013 15:28
@localshred
Copy link
Contributor

Released the fix for this issue under v2.8.6 and v2.7.12.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants