Skip to content
This repository has been archived by the owner on Jan 14, 2023. It is now read-only.

Update PrimitiveFieldType$STRING.java for UTF-8 #19

Merged
merged 8 commits into from Feb 28, 2014

Conversation

Theosakamg
Copy link
Contributor

Add support of UTF-8!
For ~i18n (eg. TTS).

Docs

  1. "A string must always contain UTF-8 encoded or 7-bit ASCII text."
  2. "unicode strings are currently not supported as a ROS data type. utf-8 should be used to be compatible with ROS string serialization. "

Quote

It works.

  • testing with python node.
  • for bug test, send string msg with "éè", on receive no Multibyte-char managed.
  • check for best implement, just a fix!

Add support of UTF-8!
For i18n (eg. TTS).

"A string must always contain UTF-8 encoded or 7-bit ASCII text." https://developers.google.com/protocol-buffers/docs/proto#scalar

"unicode strings are currently not supported as a ROS data type. utf-8 should be used to be compatible with ROS string serialization. " http://wiki.ros.org/msg

but it works. (testing with python)
@stonier
Copy link
Contributor

stonier commented Feb 18, 2014

@damonkohler will this have any unforeseen side effects?

@damonkohler
Copy link
Member

LGTM but please also fix the serialize method.

@@ -581,7 +581,7 @@ public int getSerializedSize() {
public String deserialize(ChannelBuffer buffer) {
int length = buffer.readInt();
ByteBuffer stringBuffer = buffer.readSlice(length).toByteBuffer();
return Charset.forName("US-ASCII").decode(stringBuffer).toString();
return Charset.forName("UTF-8").decode(stringBuffer).toString();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also need to specify "UTF8" to getBytes() on line 574. See http://docs.oracle.com/javase/tutorial/i18n/text/string.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my apologies...
I have not tested the serializer in every context.
I use ROS on Android, which by default is UTF-8

add symmetric encoding transformation (serializer/deserializer)
@@ -679,6 +680,7 @@ public String getJavaTypeName() {
};

private static final ImmutableSet<String> TYPE_NAMES;
private static final Charset DEFAULT_CHARSET = Charset.forName("utf-8");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use "UTF-8"

@damonkohler
Copy link
Member

Last nit and then LG.

Change parameter of constant
@Theosakamg
Copy link
Contributor Author

@stonier : I testing the change with common case. Do you have other ideas for test cases ?

in org.ros.internal.message.MessageSerializationTest

  @Test
  public void testStringUTF8() {
    std_msgs.String message = defaultMessageFactory.newFromType(std_msgs.String._TYPE);
    message.setData("éêè €àáßëœ 文字化け");
    checkSerializeAndDeserialize(message);

    // i18n test case
    // base on http://www.inter-locale.com/whitepaper/learn/learn-to-test.html

    // Combining Marks and Accents test
    message.setData("àéîōũ");
    checkSerializeAndDeserialize(message);

    // DOS 860 test
    message.setData("você nós mãe avô irmã criança");
    checkSerializeAndDeserialize(message);

    // Windows-1252 test
    message.setData("€ŒœŠš™©‰ƒ");
    checkSerializeAndDeserialize(message);

    // Turkish test
    message.setData("ışık bir İyi Günler");
    checkSerializeAndDeserialize(message);

    // Dakuten and handakuten marks test
    message.setData("がざばだぱか゛さ゛た゛は");
    checkSerializeAndDeserialize(message);

    // Combining Grapheme Joiner character
    message.setData("אִ͏ַ");
    checkSerializeAndDeserialize(message);

    // Bidi with Latin test
    message.setData("abcאבגדabc ");
    checkSerializeAndDeserialize(message);

    message.setData("אבגדabcאבגד");
    checkSerializeAndDeserialize(message);

    message.setData("אבגד012אבגד");
    checkSerializeAndDeserialize(message);

    message.setData("אבגד 012 012");
    checkSerializeAndDeserialize(message);

    // Complex Scripts test
    message.setData("สวัสดี");
    checkSerializeAndDeserialize(message);

    message.setData("டாஹ்கோ");
    checkSerializeAndDeserialize(message);

    message.setData("بِسْمِ اللّهِ الرَّحْمـَنِ الرَّحِيمِ");
    checkSerializeAndDeserialize(message);

    // Numeric Shaping test
    message.setData("عدد مارس ١٩٩٨");
    checkSerializeAndDeserialize(message);

    // Common Scripts and Encodings test
    message.setData("Слава Жанна Ювеналий Ярополк");
    checkSerializeAndDeserialize(message);
  }

@damonkohler
Copy link
Member

Can you please add those tests to this change?

Theosakamg added a commit to Theosakamg-Archive/rosjava_core that referenced this pull request Feb 19, 2014
@Theosakamg
Copy link
Contributor Author

I am not sure of the location. (I use the groovy branch.)

@damonkohler
Copy link
Member

Sorry, I don't understand the problem. Can you please add the message serialization tests you pasted above to this patch?

Add the message serialization tests for UTF-8.
@Theosakamg
Copy link
Contributor Author

The problem was : I use groovy stack and the test is available in https://github.com/rosjava/rosjava_core/blob/master/rosjava/src/test/java/org/ros/internal/message/MessageSerializationTest.java (always available in master)
But with the news stack is in https://github.com/rosjava/rosjava_bootstrap/blob/master/message_generation/src/test/java/org/ros/internal/message/RawMessageSerializationTest.java
(If I understand)

It's done in rosjava_core and rosjava_bootstrap repository. It's ok for you?

@Theosakamg
Copy link
Contributor Author

Apologies I just see the difference (Raw vs Std_msg)

@@ -679,6 +680,7 @@ public String getJavaTypeName() {
};

private static final ImmutableSet<String> TYPE_NAMES;
private static final Charset DEFAULT_CHARSET = Charset.forName("UTF-8");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this to the top of the class (line 34).

@damonkohler
Copy link
Member

Looks good, thanks! Just one last nit. The constant DEFAULT_CHARSET belongs at the top of the class.

@Theosakamg
Copy link
Contributor Author

Done.

damonkohler added a commit that referenced this pull request Feb 28, 2014
Update PrimitiveFieldType$STRING.java for UTF-8
@damonkohler damonkohler merged commit 61079f6 into rosjava:master Feb 28, 2014
@stonier
Copy link
Contributor

stonier commented Mar 16, 2014

@mh0rst is this the error you were also seeing?

http://pastebin.com/LV31s8g0

This was coming from a ros answers post: http://answers.ros.org/question/140371/compilation-problem-in-rosjava-master-branch/

@mh0rst
Copy link

mh0rst commented Mar 16, 2014

@stonier yes, exactly. The fix is trivial, I created a pull request if that helps.

@stonier
Copy link
Contributor

stonier commented Mar 19, 2014

Definitely. @damonkohler will come undoubtedly across it when he's free.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants