StringField get wrong encoding #120
@foxban I'll check this out. Could you please provide a failing gist example? Also, which Ruby version are you running? It matters because JRuby and MRI behave differently.
@foxban here is my rudimentary test to verify this issue. I've run the test on ruby 2.0.0-p195 and jruby-1.7.4 and cannot reproduce your issue.

    context 'when decoding an encoded string with unicode characters' do
      it 'decodes to the same string' do
        source_string = "¢"
        encoded = ::Test::Resource.encode(:name => source_string)
        decoded = ::Test::Resource.decode(encoded)
        decoded.name.should eq source_string
      end
    end

Can you provide a failing example?
@localshred, I figured it out finally. Calling top level message's In my production environment, the message is written to the log with the logger.error() method and then sent to the client, so I think that's why I encountered the issue. Is it a bug? Or I MUST NOT call protobuf (2.8.5) thanks.
@foxban So just to be clear, you have a proto like:

    proto = MyProto.new(:foo => 'bar')
    proto.to_s
    proto.to_s # <- this raises the encoding error?

It's difficult to reproduce without code to look at and test.
OK, I was able to reproduce the issue calling

    resource = Test::Resource.new(:name => "\u20ac")
    # => {:name=>"€"}
    >> resource.encode
    # => "\n\x03\xE2\x82\xAC"
    >> resource.encode
    # => "\n\u0000"

Now to figure out why.
We should always dup the string or bytes value before encoding because
we are `force_encoding` to BINARY encoding (ascii-8bit) which is
actually changing the value of the string if it contains unicode
characters.
Before this change, if a string or bytes field contained a unicode
character, that field was effectively corrupted from its source value
after the proto was encoded.
```ruby
resource = Test::Resource.new(:name => "\u20ac") # => {:name=>"€"}
resource.encode
resource.name # => "\xE2\x82\xAC"
resource.encode
resource.name # => ""
```
It should be noted this behavior is not present when the string
contains characters that fall inside the normal ascii range.
Fixes #120.
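
To illustrate why the `dup` matters, here is a minimal sketch in plain Ruby. The helper names `to_binary_in_place` and `to_binary_copy` are hypothetical and are not part of the protobuf gem; they only stand in for "retag the value as BINARY before writing it". `String#force_encoding` modifies its receiver, so without a copy the caller's string ends up tagged as BINARY even though its bytes are unchanged.

```ruby
# Hypothetical helpers for illustration only; not the gem's implementation.

# Retags the caller's own string object as BINARY (ASCII-8BIT).
def to_binary_in_place(value)
  value.force_encoding(Encoding::BINARY)
end

# Retags a copy, leaving the caller's string untouched.
def to_binary_copy(value)
  value.dup.force_encoding(Encoding::BINARY)
end

original = "\u20ac".dup              # UTF-8 string holding the euro sign
bytes_before = original.bytes

copy = to_binary_copy(original)
encoding_after_copy = original.encoding      # still UTF-8

to_binary_in_place(original)
encoding_after_in_place = original.encoding  # now BINARY
bytes_after = original.bytes                 # same three bytes as before
```

The bytes never change in either case; only the encoding tag does. The damage shows up later, when something transcodes the retagged string on a second pass.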
Any feedback @abrandoned or @devin-c?
Fix string/bytes encoding when unicode characters present
I've got a StringField whose value contains unicode characters.
It seems that when the message is serialized to binary bytes and then deserialized back into a message, the unicode characters in the StringField are replaced with the empty string "".
I found that in the Protobuf::Field::StringField.encode(value) method, the value's encoding is "ASCII-8BIT", but it should actually be UTF-8. After calling value.encode!, because the UTF-8 characters are not valid in ASCII-8BIT, they are replaced with the empty string "".
I couldn't find when or where the value's encoding is changed from UTF-8 to ASCII-8BIT.
I plan to comment out the encode! line to get rid of this temporarily.
Hope this will be resolved.
thanks!
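
The replacement described above can be reproduced in plain Ruby without the gem. This is only a sketch: the helper `transcode_to_utf8` is hypothetical, and the replace options passed to `String#encode` are an assumption about what the field's `encode!` call does, not a quote of the library's code. It shows that a string tagged ASCII-8BIT loses every non-ASCII byte when transcoded to UTF-8 with an empty replacement, while ASCII-only content survives.

```ruby
# Hypothetical helper; the options below are assumed, not taken from the gem.
def transcode_to_utf8(value)
  value.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "")
end

# Strings whose bytes are valid UTF-8 but which are tagged as BINARY.
euro_binary  = "\u20ac".dup.force_encoding(Encoding::BINARY)
ascii_binary = "abc".dup.force_encoding(Encoding::BINARY)
mixed_binary = "a\u20acb".dup.force_encoding(Encoding::BINARY)

euro_result  = transcode_to_utf8(euro_binary)   # non-ASCII bytes are dropped
ascii_result = transcode_to_utf8(ascii_binary)  # ASCII bytes pass through
mixed_result = transcode_to_utf8(mixed_binary)  # only the ASCII bytes remain
```

Bytes above 0x7F have no defined conversion from ASCII-8BIT to UTF-8, so each one is swapped for the empty replacement string.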