Java (char *STRING, size_t LENGTH) #2597
Comments
Unfortunately the JNI API seems to disagree with some of your points as there's I'm not sure a "bytes" string typemap makes a lot of sense for Java, as Java strings are wide character. Probably it'd make more sense to convert to a Java container of some sort. |
Java JNI supports UTF-8 with 'GetStringUTFChars'. I did some tests in the D language and got code where the wrapper tries to convert
JNI uses a modified UTF-8 The character Anyhow, JNI challenges should not break basic logic.
Yes, but that's for converting data from Java to UTF8. A
JNI does, but C/C++ generally doesn't so embedded zero bytes ideally should be handled. Currently that seems to be largely ignored by SWIG (the same issue affects SWIG/Tcl too as Tcl also encodes nul as an overlong UTF-8 sequence).
I think you're misunderstanding the point I was trying to make there, which was really just that it's not as easy as you'd hope to handle converting a string specified by pointer and length to pass to Java. |
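The nul-encoding difference mentioned above can be seen from plain Java, without JNI. This is a minimal sketch under one assumption: `DataOutputStream.writeUTF` is used as a stand-in for the modified UTF-8 that the JNI string functions produce (the JNI functions themselves are not called here). The class and method names are hypothetical and not part of SWIG.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of SWIG.
final class ModifiedUtf8Nul {
    // writeUTF emits a 2-byte length prefix followed by modified UTF-8;
    // the prefix is stripped so only the encoded characters remain.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Standard UTF-8 as produced by the Java charset API.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "a\0b";
        // Modified UTF-8 encodes U+0000 as the overlong pair 0xC0 0x80.
        System.out.println(Arrays.toString(modifiedUtf8(s))); // [97, -64, -128, 98]
        // Standard UTF-8 keeps a real zero byte.
        System.out.println(Arrays.toString(standardUtf8(s))); // [97, 0, 98]
    }
}
```

C code that receives the first form never sees a zero byte inside the buffer, while C code that receives the second form needs the explicit length to know where the data ends.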
Perhaps so, but Java is not alone. I think the modified UTF-8 is workable. We have had more challenging tasks.
Surprisingly. |
"For most cases, I would recommend going directly from UTF-8 to String objects, and getting Java to do most of that work. Simple tools Java provides for that include the constructor So we do the conversion in Java and pass a byte array to C, where it is converted to
@ojwb |
@erezgeva, I'm not sure what the proposal here is to be honest. By all means, please put together a merge request with explanations. |
https://docs.oracle.com/javase/8/docs/api/java/lang/String.html (a far more definitive source than Stack Overflow) says they're UTF-16:
Take for example: But instead it is: As for why does it matter. I will have a go, if you, @ojwb and @wsfulton think it is worth fixing. Erez |
Java supports several Unicode standards. I do not use Stack Overflow as an official resource, but more as a source of ideas/hacks and of how to pull off more complicated tasks, sometimes for examples. P.S. I found The Stack Overflow answer refers to: "The UTF-8 charset is specified by RFC 2279; the transformation format upon which it is based is specified in Amendment 2 of ISO 10646-1 and is also described in the Unicode Standard." And So you can convert real UTF-8 from Java Another annoying point.
Does converting in Java also avoid the invalid surrogate-pairs-encoded-as-two-codepoints-in-UTF-8 that the JNI If we can convert to/from real UTF-8 by just doing it in Java code, that seems much better than the current situation, which seems like it works correctly until you encounter some data containing U+0000 or anything >= U+10000. You should probably talk to @wsfulton though, as he's the SWIG/Java maintainer.
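For the code points >= U+10000 case, a small sketch comparing the two encodings on the Java side. As an assumption, `DataOutputStream.writeUTF` again stands in for JNI's modified UTF-8. The class name is hypothetical, and U+1F600 is just an arbitrary supplementary code point.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of SWIG.
final class SupplementaryUtf8 {
    // Modified UTF-8 via writeUTF, with the 2-byte length prefix removed.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Round trip through real UTF-8 using only the Java charset API.
    static String roundTripUtf8(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F600));
        // Each surrogate is encoded separately: two 3-byte sequences.
        System.out.println(modifiedUtf8(s).length);                       // 6
        // Real UTF-8 uses a single 4-byte sequence.
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 4
        System.out.println(roundTripUtf8(s).equals(s));                   // true
    }
}
```

On this evidence, doing the conversion with the Java charset API yields standard UTF-8 for supplementary characters, which is what a C/C++ library would normally expect.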
According to the Oracle documentation, yes. I think @wsfulton wants to see a PR before he decides. Only: d, go, guile, java, php, ocaml 2 tests: The And I see a generic implementation in typemaps/strings.swg. I do not see any reference to using UTF-8,
Microsoft reference on C# String type and how does it map to C/C++ Also refer to null character. |
Hi @erezgeva. Am just trying to digest all this. I think you are right that the First observation is that the change is not consistent for the APIs provided by cdata.i, documented at http://swig.org/Doc4.1/Library.html#Library_nn7. With your changes, some of the functions are using a byte array and others a Java String. I think a similar change is needed here: Line 38 in 4b6b711
The text in the Java docs needs updating too. Perhaps you'd like to tweak in each of these locations:
Given this is an incompatible change for Java, the original typemaps will need to be available in some other form. Perhaps as you imply with
@ojwb, it just occurred to me that you could create a Java string in JNI by calling one of the following String constructors from JNI:
String(byte[] bytes)
String(char[] bytes)
String(byte[] bytes, Charset charset)
String(byte[] bytes, int offset, int length)
The length parameter can then be supplied to the array constructor. These constructors keep any 0/null elements that might be in the array.
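A minimal Java-side sketch of the claim that these constructors keep embedded zero elements. It uses the `String(byte[], int, int, Charset)` overload with an explicit UTF-8 charset, which is a choice made here to avoid depending on the platform default. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of SWIG.
final class StringFromBytes {
    // Build a String from a (buffer, offset, length) triple, the Java-side
    // counterpart of a C (char *STRING, size_t LENGTH) pair.
    static String fromBuffer(byte[] buf, int offset, int length) {
        return new String(buf, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two embedded zero bytes, plus one trailing byte outside the length.
        byte[] data = {'h', 'i', 0, 'y', 'o', 0, 'x'};
        String s = fromBuffer(data, 0, 6);
        System.out.println(s.length());          // 6
        System.out.println((int) s.charAt(2));   // 0
    }
}
```

The resulting String has the full requested length, so nothing is cut off at the first zero byte.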
@wsfulton Ooh, that sounds like a good approach - if I'm following it'd be like @erezgeva's suggestion of "The JNI part will look similar to current Java (char *STRING, size_t LENGTH) with a conversion in the Java side of the swig wrapper" except we can do it all via JNI on the C/C++ side, which seems likely to be simpler. Looks like we'd use I'm wondering if a constructor is effectively a static method for these purposes (since it doesn't require the context of an existing object...)
That was my point :-)
I do mean The first patch shows Java can use a string-length typemap with a string type in Java. Languages with dynamic types would use both typemaps in similar ways. With your approval, I am ready for a second phase. Let me summarise.
Basically, I ignore the charset of strings. I am aware of the wide char mappings.
Call GetMethodID using
@erezgeva, yes, agreed with your posting. I presume you have in mind implementing it as follows:
Where the typemaps for 1. remain unchanged.
Exactly! |
The documentation for cdata explicitly states it is for strings though, see cdata.i docs, so it should be changed to a String from byte[]. A (void *BYTES, size_t LENGTH) is not made available in any other language, but could be added for Java in Lines 130 to 131 in 2230378
that is, a slight variation on the char *BYTE typemap already in there.
Please let me know if you are going to do this in the next couple of weeks. I'm tidying up now prior to the swig-4.2 release, so this is a last chance to make API breaking changes. |
Please apply this patch to your branch: |
OK, |
I have some time until Christmas. I'll try to push it. As I mentioned, the 2 typemaps behave the same in all languages that do NOT use static types: Java, C#, D and Go. C#, D and Go already use string. This is why I propose to fix the Java STRING typemap (as I already did).
Fix #2609 is ready for merge. |
Hi @wsfulton,
Looking at java.swg and at its 2 tests, char_binary.i and director_binary_string.i:
The typemap is called 'STRING', NOT bytes.
We can add another typemap for a byte memory block, named (void *BYTES, size_t LENGTH), and fix Java to use a string for the string typemap.
I want to explain a bit.
C can have char* and size_t length, or void* (or byte*, or ubytes*, or uint8_t*) and a length.
Many of SWIG's target languages, including Java, use a string type that comes with a length.
So it is good practice to support the STRING typemap with a length, in cases where we do not wish to, or cannot, depend on null termination of strings.
For example, we may use a string like:
"first sentence\0second sentence\0last sentence\0\0"
On the other hand, bytes make sense too, but with a proper name!
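To illustrate why such a string needs an explicit length, here is a small Java sketch that treats the example as a nul-separated list ending with an empty entry. The class and method names are hypothetical. The splitting convention (an empty entry ends the list) is an assumption about how such data would be interpreted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of SWIG.
final class NulSeparatedList {
    // Split a nul-separated, double-nul-terminated string into its entries.
    static List<String> split(String packed) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < packed.length()) {
            int end = packed.indexOf('\0', start);
            if (end < 0) {
                end = packed.length();
            }
            if (end == start) {
                break; // empty entry marks the end of the list
            }
            parts.add(packed.substring(start, end));
            start = end + 1;
        }
        return parts;
    }

    public static void main(String[] args) {
        String packed = "first sentence\0second sentence\0last sentence\0\0";
        // A nul-terminated reader would stop at index 14, after "first sentence".
        System.out.println(packed.indexOf('\0')); // 14
        // With the full length available, all three entries are recovered.
        System.out.println(split(packed));        // [first sentence, second sentence, last sentence]
    }
}
```

A wrapper that relies on null termination would pass only the first entry to the other side, whereas a (pointer, length) pair carries all 46 characters.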