Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Java (char *STRING, size_t LENGTH) #2597

Closed
erezgeva opened this issue May 18, 2023 · 28 comments
Closed

Java (char *STRING, size_t LENGTH) #2597

erezgeva opened this issue May 18, 2023 · 28 comments
Labels

Comments

@erezgeva
Copy link
Contributor

erezgeva commented May 18, 2023

Hi,
@wsfulton

Looking on java.swg

And on their 2 tests.
char_binary.i
director_binary_string.i

The typemap is called 'STRNG' NOT bytes .

We can add another typemap for byte memory block.
Named (void *BYTES, size_t LENGTH)

And fix java to use string for string typemap.

I want to explain a bit
C can have

  • Null terminated string, which usually used with char*
  • Sized string with chat* and size_t length
  • bytes with void* or byte* or ubytes* or uint8_t* and a length

Many of SWIG target languages, including Java.
Use a string type that comes with length.
So it is a good practice to support the STRING type map with length, in cases we do not wish or can not depends on null termination of strings.
For example, we may use a string like:
"first sentence\0second sentence\0last sentence\0\0"

On the other hand bytes make sense too.
But with proper name!

@ojwb
Copy link
Member

ojwb commented May 18, 2023

Unfortunately the JNI API seems to disagree with some of your points as there's NewStringUTF(JNIEnv *, const char *) but no equivalent taking a length. If you want a typemap for Java which takes a pointer and length, either you need to copy and nul-terminate (in which case embedded nuls won't work) or you need to do the conversion from UTF-8 to wide characters for yourself. I ran into this recently when adding support for std::string_view() (since data() doesn't return a nul-terminated string).

I'm not sure a "bytes" string typemap makes a lot of sense for Java, as Java strings are wide character. Probably it'd make more sense to convert to a Java container of some sort.

@ojwb ojwb added the Java label May 18, 2023
@erezgeva
Copy link
Contributor Author

Unfortunately the JNI API seems to disagree with some of your points as there's NewStringUTF(JNIEnv *, const char *) but no equivalent taking a length. If you want a typemap for Java which takes a pointer and length, either you need to copy and nul-terminate (in which case embedded nuls won't work) or you need to do the conversion from UTF-8 to wide characters for yourself. I ran into this recently when adding support for std::string_view() (since data() doesn't return a nul-terminated string).

I'm not sure a "bytes" string typemap makes a lot of sense for Java, as Java strings are wide character. Probably it'd make more sense to convert to a Java container of some sort.

Java JNI support UTF8 with 'GetStringUTFChars' .
I wrote 'argc' 'argv' with it.
java/argcargv.i
Confusing 'STRING' with bytes, breaks other languages and mislead users.
For the moment, only 2 languages uses this typemap.
And go uses 'string' properly.
There is a reason why only Java implement director_binary_string.i.
The term binary string is a contradiction.
Yes, you can convert string to binary and via versa.
But as most languages including C++ with std::string , support string properly, I think we should avoid the confusion and the broken code result of it.

I did some tests in D language and got code where the wrapper tries to convent void* to string, which yield broken code.
D may not be so popular, but I guess go would break too with director_binary_string.i.

@erezgeva
Copy link
Contributor Author

erezgeva commented May 19, 2023

JNI uses a modified UTF8
https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/types.html#modified_utf_8_strings
https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni

The character 0, is been escaped.
So we can use a '0' termination instead of length.

Any how, JNI challenges should not break basic logic.
Lead to confusing tests, or misleading term as "binary strings".

@ojwb
Copy link
Member

ojwb commented May 19, 2023

Java JNI support UTF8 with 'GetStringUTFChars' .

Yes, but that's for converting data from Java to UTF8. A (char *STRING, size_t LENGTH) typemap needs to convert in the other direction.

JNI uses a modified UTF8

JNI does, but C/C++ generally doesn't so embedded zero bytes ideally should be handled. Currently that seems to be largely ignored by SWIG (the same issue affects SWIG/Tcl too as Tcl also encodes nul as an overlong UTF-8 sequence).

Any how, JNI challenges should not break basic logic.
Lead to confusing tests, or misleading term as "binary strings".

I think you're misunderstanding the point I was trying to make there, which was really just that it's not as easy as you'd hope to handle converting a string specified by pointer and length to pass to Java.

@erezgeva
Copy link
Contributor Author

ink you're misunderstanding the point I was trying to make there, which was really just that it's not as easy as you'd hope to handle converting a string specified by pointer and length to pass to Java.

Perhaps so, but Java is not alone.
I think if a specific language have issues, it should not change other Languages.

I think the modify UTF8 is workable. We had a more challenging tasks.
We can create a shared swig file with the C functions that handle the conversions for Java and TCL.
The way we now handle this conversion in Java is broken in addition of being confusing.
As user might think that perhaps Java uses byte*, but we can simply "convert" strings from C to Java this way.

@erezgeva
Copy link
Contributor Author

Surprisingly.
Strings in Java are standard UTF8.
Only JNI use modified UTF8.
Perhaps Java need to convert the strings from byye arrays, in the Java and solve the problem.

@erezgeva
Copy link
Contributor Author

erezgeva commented May 19, 2023

From https://stackoverflow.com/questions/57419723/how-to-convert-the-java-modified-utf-8-to-the-regular-utf-8-and-back

"For most cases, I would recommend going directly from UTF-8 to String objects, and getting Java to do most of that work. Simple tools Java provides for that include the constructor String(byte[], String), which initializes a String with data whose encoding you specify, and String.getBytes(String), which gives you the string's character data in the encoding of your choice. Both of these are limited to encodings known to the JVM, but UTF-8 is guaranteed to be among those. You can use those directly from your JNI code, or provide suitable for-purpose wrapper methods for your JNI code to invoke."

So we do the conversion in Java and pass byte array to C where is converted to char * with length.
But there should not be "void*" in director_binary_string.i.
As the conversion is internal. Java public function should use string and char * in C.

@erezgeva
Copy link
Contributor Author

@ojwb
I can try change Java if you please.
Else I will close the issue.

@wsfulton
Copy link
Member

@erezgeva, I'm not sure what the proposal here is to be honest. By all means, please put together a merge request with explanations.

@ojwb
Copy link
Member

ojwb commented May 20, 2023

Strings in Java are standard UTF8.

https://docs.oracle.com/javase/8/docs/api/java/lang/String.html (a far more definitive source that stackoverflow) says they're UTF16:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.

@erezgeva
Copy link
Contributor Author

erezgeva commented May 21, 2023

@erezgeva, I'm not sure what the proposal here is to be honest. By all means, please put together a merge request with explanations.

@wsfulton

Take for example: Examples/test-suite/char_binary.i
You have size_t strlen(const char *str, size_t len)
Which is under %apply (char *STRING, size_t LENGTH)
You would expect that the Java wrapper function would use Java string type. As in go for example, and now in D.

But instead it is: public long Test::strlen(byte[] str)
Which is NOT a string but bytes (array).
Now Java convert strings to byte array.
But if I apply a "STRING" typemap, I would except the function to use Java string.

As for why does it matter.
It is the base of the wrong testing in Examples/test-suite/director_binary_string.i
Which only work in Java.

I will have a go, if you, @ojwb and @wsfulton think it is worth fixing.
I myself did not try to use this STRING map myself, and only 3 languages support it today: Java, go, D.

Erez

@erezgeva
Copy link
Contributor Author

erezgeva commented May 21, 2023

Strings in Java are standard UTF8.

https://docs.oracle.com/javase/8/docs/api/java/lang/String.html (a far more definitive source that stackoverflow) says they're UTF16:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.

Java support several Unicode standards.
In addition to UTF-8 and UTF-16.
What I meant is that although JNI uses modified UTF-8.
When fetching a UTF-8 string in Java itself, it is a standard UTF-8 and NOT modified UTF-8, sorry if I misleads.
Nor does this Q&A in stackoverflow suggest it.

I do not use stackoverflow as official resource, but more of ideas/hacks and how to pull more complicated task, sometime for examples.

P.S. I found Lib/lua/argcargv.i in stackoverflow.

The stackoverflow refer to:
https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#String-byte:A-java.nio.charset.Charset-
https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html

"The UTF-8 charset is specified by RFC 2279; the transformation format upon which it is based is specified in Amendment 2 of ISO 10646-1 and is also described in the Unicode Standard."

And
https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#getBytes-java.nio.charset.Charset-

So you can convert real UTF-8 from Java String to byte[] and via versa.
Only we need to convert in Java and use byte[] in the JNI.
The JNI part will look similar to current Java (char *STRING, size_t LENGTH) with a conversion in the Java side of the swig wrapper.

Another annoying point.
There are 2 UTF-16 representation: big endian and little endian.
I assume Java String uses big endian UTF-16, as it is a big endian virtual machine. But really we care is that Java uses Unicode, and can support most of earth languages.

@ojwb
Copy link
Member

ojwb commented May 22, 2023

Does converting in Java also avoid the invalid surrogate-pairs-encoded-as-two-codepoints-in-UTF-8 that the JNI GetStringUTFChars() function gives?

If we can convert to/from real UTF-8 by just doing it Java code that seems much better than the current situation which seems like it works correctly until you encounter some data containing U+0000 or anything >= U+10000.

You should probably talk to @wsfulton though as he's the SWIG/Java maintainer.

@erezgeva
Copy link
Contributor Author

onvert to/from real UTF-8 by just doing it Java code that seems much better than the current situation which seems like it works correctly until you encounter some data containing U+0000 or anything >= U+10000.

According to oracle documentation, yes.
RFC 2279 use the Unicode standard and specifically forbid translate 0 to C0 80.
From https://datatracker.ietf.org/doc/html/rfc2279#section-3
NOTE -- actual implementations of the decoding algorithm above should protect against decoding invalid sequences. For instance, a naive implementation may (wrongly) decode the invalid UTF-8 sequence C0 80 into the character U+0000, ...

I think @wsfulton wants to see a PR, before he decide.
I only ask if I am in the right direction.
Or I wast my time on the matter.
From your answers, I think I understand why Java uses byte[].
And how we can improve here.

Only: d, go, guile, java, php, ocaml
Languages have a specific typemap with directive.

2 tests: char_binary.i, director_binary_string.i

The director_binary_string.i is used by Java only, this test I think is based on Java using byte[]. So I think we need to fix/improve the test.

And char_binary.i is implemented in:
javascript, perl5, d, java, python, ocaml, go

I see a generic implementation in typemaps/strings.swg.
Which all other probably should follow. SWIG_AsCharPtrAndSize.
But as most languages are dynamically typed.
Only static typed languages as C#, D, go, Java need a more specific typemap with defined types.
Which probably need to be a string typed.

I do not see any reference of using UTF-8,
which means we always use octet size and not string size.
The C function strlen name is a leftovers from times before Unicode and when characters were only using a single octet.

@erezgeva
Copy link
Contributor Author

@erezgeva, I'm not sure what the proposal here is to be honest. By all means, please put together a merge request with explanations.

@wsfulton See #2609

@erezgeva
Copy link
Contributor Author

Microsoft reference on C# String type and how does it map to C/C++
https://learn.microsoft.com/en-us/dotnet/api/system.string#strings-and-embedded-null-characters

Also refer to null character.

@wsfulton
Copy link
Member

Hi @erezgeva. Am just trying to digest all this. I think you are right that the (char *STRING, size_t LENGTH) family of typemaps should marshall to and from a target language string, so the Java implementation marshalling to and from a Java byte array seems to be a mistake.

First observation is that the change is not consistent for the APIs provided by cdata.i, documented at http://swig.org/Doc4.1/Library.html#Library_nn7. With your changes, some of the functions are using a byte array and others a Java String. I think a similar change is needed here:

%typemap(jstype) SWIGCDATA "byte[]"

@wsfulton
Copy link
Member

The text in the Java docs need updating too. Perhaps you'd like to tweak in each of these locations:

~/swig/github/swig (master =)$ grep -w STRING -r Doc/ 
Doc/Manual/Library.html:%apply (char *STRING, size_t LENGTH) { (char *str, size_t len) };
Doc/Manual/Library.html:The <tt>(char *STRING, int LENGTH)</tt> multi-argument typemap is also available in addition to <tt>(char *STRING, size_t LENGTH)</tt>.
Doc/Manual/Python.html:%apply (char *STRING, int LENGTH) { (char *data, int size) };
Doc/Manual/Java.html:%apply (char *STRING, size_t LENGTH) { (const char data[], size_t len) }

Given this is an incompatible change for Java, the original typemaps will need to be available in some other form. Perhaps as you imply with (char *BYTES, size_t LENGTH) and const version.

Unfortunately the JNI API seems to disagree with some of your points as there's NewStringUTF(JNIEnv *, const char *) but no equivalent taking a length.

@ojwb, it just occurred to me that you could create a Java string in JNI by calling one of the following String constructors from JNI:

String(byte[] bytes)
String(char[] bytes)
String(byte[] bytes, Charset charset)
String(byte[] bytes, int offset, int length)

The length parameter can then be supplied to the array constructor. These constructors keep any 0/null elements that might be in the array.

@ojwb
Copy link
Member

ojwb commented Sep 21, 2023

it just occurred to me that you could create a Java string in JNI by calling one of the following String constructors from JNI

@wsfulton Ooh, that sounds like a good approach - if I'm following it'd be like @erezgeva suggestion of "The JNI part will look similar to current Java (char *STRING, size_t LENGTH) with a conversion in the Java side of the swig wrapper" except we can do it all on via JNI on the C/C++ side which seems likely to be simpler.

Looks like we'd use CallObjectMethod or some variant - do you happen to know what the right way to call a constructor is?

I'm wondering if a constructor is effectively a static method for these purposes (since it doesn't require the context of an existing object...)

@erezgeva
Copy link
Contributor Author

Hi @erezgeva. Am just trying to digest all this. I think you are right that the (char *STRING, size_t LENGTH) family of typemaps should marshall to and from a target language string, so the Java implementation marshalling to and from a Java byte array seems to be a mistake.

That was my point :-)

First observation is that the change is not consistent for the APIs provided by cdata.i, documented at http://swig.org/Doc4.1/Library.html#Library_nn7. With your changes, some of the functions are using a byte array and others a Java String. I think a similar change is needed here:

%typemap(jstype) SWIGCDATA "byte[]"

I do mean (char *STRING, size_t LENGTH) should be used with strings.
But in this case, when using C void *,
it actually make sense to use bytes with a (void *BYTES. size_t LENGTH) typemap.

The first patch shows Java can use a string-length typemap with a string type in Java.
My purposing is to add a bytes-length typemap and use it when it is proper.

Languages with dynamic types would use both typemaps in a similar ways.
But for languages with static types. It does matter.

With your approving, i am ready for a second phase.
Using both string-length typemap and bytes-length typemap.
And try to properly use them in testings.

Let me summarise.
I think we should have:

  1. Typemap for string using a null terminated
  2. Typemap for string-length
  3. Typemap for bytes-length

In basic I ignore charset of strings.
It only make different in languages using static types with a special string type.
In these cases we can translate to UTF8/ASCII, as I did for Java.

I am aware of the wide char mappings.
But I do not think they they are relevant for these changes.

@wsfulton
Copy link
Member

Looks like we'd use CallObjectMethod or some variant - do you happen to know what the right way to call a constructor is?

Call GetMethod id using "<init>", then NewObject, see https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html#wp4517. There is plenty of JNI in the Lib/java directory to crib from.

@wsfulton
Copy link
Member

@erezgeva, yes agreed with your posting. I presume you have in mind for implementing as follows:

Description Typemap
1. Typemap for string using a null terminated char *
2. Typemap for string-length (char *STRING, size_t LENGTH)
3. Typemap for bytes-length (char *BYTES, size_t LENGTH)

Where the typemaps for 1. remains unchanged.

@erezgeva
Copy link
Contributor Author

@erezgeva, yes agreed with your posting. I presume you have in mind for implementing as follows:

Description Typemap

  1. Typemap for string using a null terminated char *
  2. Typemap for string-length (char *STRING, size_t LENGTH)
  3. Typemap for bytes-length (char *BYTES, size_t LENGTH)
    Where the typemaps for 1. remains unchanged.

Exactly!

@wsfulton
Copy link
Member

First observation is that the change is not consistent for the APIs provided by cdata.i, documented at http://swig.org/Doc4.1/Library.html#Library_nn7. With your changes, some of the functions are using a byte array and others a Java String. I think a similar change is needed here:

%typemap(jstype) SWIGCDATA "byte[]"

I do mean (char *STRING, size_t LENGTH) should be used with strings. But in this case, when using C void *, it actually make sense to use bytes with a (void *BYTES. size_t LENGTH) typemap.

The first patch shows Java can use a string-length typemap with a string type in Java. My purposing is to add a bytes-length typemap and use it when it is proper.

The documentation for cdata, explicitly states it is for strings though, see cdata.i docs, so it should be changed to a String from byte[].

A (void *BYTES, size_t LENGTH) is not made available in any other languages, but could be added for Java in

swig/Lib/java/various.i

Lines 130 to 131 in 2230378

* char *BYTE typemaps.
* These are input typemaps for mapping a Java byte[] array to a C char array.

that is, a slight variation on the char *BYTE typemap already in there.

Languages with dynamic types would use both typemaps in a similar ways. But for languages with static types. It does matter.

With your approving, i am ready for a second phase. Using both string-length typemap and bytes-length typemap. And try to properly use them in testings.

Let me summarise. I think we should have:

1. Typemap for string using a null terminated

2. Typemap for string-length

3. Typemap for bytes-length

Please let me know if you are going to do this in the next couple of weeks. I'm tidying up now prior to the swig-4.2 release, so this is a last chance to make API breaking changes.

@wsfulton
Copy link
Member

Please apply this patch to your branch:
bytes.patch.txt

@erezgeva
Copy link
Contributor Author

Please apply this patch to your branch: bytes.patch.txt

OK,
I will also add null to the other cdata testing.

@erezgeva
Copy link
Contributor Author

erezgeva commented Nov 23, 2023

First observation is that the change is not consistent for the APIs provided by cdata.i, documented at http://swig.org/Doc4.1/Library.html#Library_nn7. With your changes, some of the functions are using a byte array and others a Java String. I think a similar change is needed here:

%typemap(jstype) SWIGCDATA "byte[]"

I do mean (char *STRING, size_t LENGTH) should be used with strings. But in this case, when using C void *, it actually make sense to use bytes with a (void *BYTES. size_t LENGTH) typemap.
The first patch shows Java can use a string-length typemap with a string type in Java. My purposing is to add a bytes-length typemap and use it when it is proper.

The documentation for cdata, explicitly states it is for strings though, see cdata.i docs, so it should be changed to a String from byte[].

A (void *BYTES, size_t LENGTH) is not made available in any other languages, but could be added for Java in

swig/Lib/java/various.i

Lines 130 to 131 in 2230378

* char *BYTE typemaps.
* These are input typemaps for mapping a Java byte[] array to a C char array.

that is, a slight variation on the char *BYTE typemap already in there.

Languages with dynamic types would use both typemaps in a similar ways. But for languages with static types. It does matter.
With your approving, i am ready for a second phase. Using both string-length typemap and bytes-length typemap. And try to properly use them in testings.
Let me summarise. I think we should have:

1. Typemap for string using a null terminated

2. Typemap for string-length

3. Typemap for bytes-length

Please let me know if you are going to do this in the next couple of weeks. I'm tidying up now prior to the swig-4.2 release, so this is a last chance to make API breaking changes.

I have some time until Christmas. I'll try to push it.

As I mention the 2 typemap behave the same in all languages that do NOT use static types: Java, C#, D and go.

C#, D and go already uses string.
So only Java is breaking.

This is way I propose to fix Java the STRING typemap (as I already did).
Add BYTES typemap to Java, C#, D and go.
And duplicate STRING as BYTES to all other non-static languages :-)

@erezgeva
Copy link
Contributor Author

Fix #2609 is ready for merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants