Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data is corrupted when transferred as byte arrays from Java #91

Closed
agronholm opened this issue Oct 31, 2011 · 16 comments
Closed

Data is corrupted when transferred as byte arrays from Java #91

agronholm opened this issue Oct 31, 2011 · 16 comments
Assignees
Labels
Milestone

Comments

@agronholm
Copy link

I was baffled when my PDFs from JasperReports showed up blank. After a lot of head scratching, I found that if I wrote the PDF data to disk on the Java side, it was fine. But if I transferred the data as byte[] over the network link, it would get "wrinkles" on the way, rendering it useless. Wrinkles here meaning that certain bytes (0xD8-0xDF range) would instead be replaced with nulls (0x00). It took me one long sleepless night to track down this problem, but I finally nailed it. It happens on the Java side of things, when the response object is being written to the network socket. The UTF-8 encoder does not like certain characters which happen to be on the reserved range for UTF-16 surrogates which should always come in pairs. This causes the encoder to silently replace some characters with question marks, which in turn either silently corrupts the data or causes an exception on the Python side on decoding. I concluded that this was a protocol design flaw.

More information at: http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Surrogates

@bartdag
Copy link
Collaborator

bartdag commented Oct 31, 2011

Thanks Alex for this bug report. You are right that on the Python side, this works just fine (I can send a bytearray containing 0xD8 to Java without any problem), but on the Java side, something goes wrong with the UTF-8 conversion for these bytes. I'm not sure whether it's Java that is trying to do too much or if it's Python that is not doing enough.

I made some small tests on Java and every time, I could convert 0xD8 using Py4J's protocol without problem, but when put together with the Python side on a socket, something went awfully wrong.

I'll boost the priority on this one byte[] support is obviously broken :-(

@ghost ghost assigned bartdag Oct 31, 2011
@agronholm
Copy link
Author

I have verified that the problem lies entirely on the Java side -- I dove deep into the sun.* classes with the debugger and found that StreamEncoder is doing character replacing due to failed UTF-8 encoding of some characters.

@bartdag
Copy link
Collaborator

bartdag commented Oct 31, 2011

Yes, you are right. I found more discussion on StackOverflow and there is not way the current solution will work (http://stackoverflow.com/questions/965838/handling-unicode-surrogate-values-in-java-strings).

This is quite a bummer because there was a lot of work put behind the automated conversion of byte array :-( If I want to pass the byte array by value, I'll have to use a more basic approach (length of array followed by the raw bytes).

Not sure when I'll have the time to work on this one, but as soon as I can dedicate some time to Py4J, i'll work on this bug.

fdintino added a commit to theatlantic/py4j that referenced this issue Dec 20, 2011
bartdag added a commit that referenced this issue Jan 2, 2012
Switching Protocol.encodeBytes to use Base64. refs #91
@agronholm
Copy link
Author

Wasn't this fixed in 0.7? Why is it still open?

@bartdag
Copy link
Collaborator

bartdag commented Jun 24, 2012

Although I merged the pull request (post 0.7), the tests still fail for python 3.x so I there is still some work to do.

@agronholm
Copy link
Author

Nice timing :)
I was JUST writing a bytearray-using application on Python 3.2 (chart generation with JFreeChart).

@bartdag
Copy link
Collaborator

bartdag commented Jun 25, 2012

Hi Alex, I finally fixed the remaining issues with byte array.

I had to expand the compatibility layer to make it work with Python 3 and I added the conversion of byte array between from Python to Java. The changes in the Python 3 libraries are subtle and it's not always clear when a function accepts/returns a character string or a byte string :-(

I'll have to do some cleanup, but all the tests are passing now so you can use the code in the master branch.

@agronholm
Copy link
Author

Too bad py4j isn't installable directly from git due to no setup.py being present on the top level.

@agronholm
Copy link
Author

I can probably test this again when you've reverted the listening behavior to what it was in 0.7. Or would say that listening only on the public ip of the machine is correct...?

@bartdag
Copy link
Collaborator

bartdag commented Jul 21, 2012

How do you construct your JavaGateway and GatewayServer instances? The default is to listen to localhost.

Specifically, in Python, the default address used by py4j is "localhost". In Java, it is obtained through InetAddress.getLocalHost() unless you specify an address. I had a problem with a ubuntu machine that selected 127.0.1.1 and changed my host file. I did not have that problem on mac or archlinux.

@agronholm
Copy link
Author

With all due respect, I think you've misunderstood what InetAddress.getLocalHost() does.
Quote from the JDK docs:

Returns the address of the local host. This is achieved by retrieving the name of the host from the system, then resolving that name into an InetAddress.

The "localhost" address is supposed to resolve to 127.0.0.1 or ::1. Instead, that method resolves the external address of the host. That it returns "127.0.0.1" on any system is just lucky. Why did you feel you had to make this change that breaks py4j on every single Ubuntu machine (and possibly more)? What is the benefit?

@agronholm
Copy link
Author

I just tried it on Windows, it breaks there too. Returns the external IP address of my machine. Perhaps you wanted InetAddress.getLoopbackAddress() instead? That would make perfect sense.

@bartdag
Copy link
Collaborator

bartdag commented Jul 21, 2012

Hi Alex,

This change was the result of a discussion #87

I agree that loopback was the intended goal and that local host means different things to different os (again, it works on windows on my side, on mac, and on non-debian linux where localhost and loopback are defaulted to 127.0.0.1). I am trying to find a way to reliably get the loopback ip from python. I might have to hardcode it on both java and python to make sure that it's the same (it can always be specified when constructing a gateway server or javagateway instance).

I'm not sure if this is a cultural thing, but your comments sometimes come across as rude. You have reported many important issues with ideas on how to solve them and I am grateful for that, but I would also appreciate if your comments were nicer.

@bartdag
Copy link
Collaborator

bartdag commented Jul 21, 2012

Ok, getLoopbackAddress was introduced in java 1.7 and it seems hardcoded to return 127.0.0.1.

I guess i'll just hardcode it on both sides.

@agronholm
Copy link
Author

I'm not sure if this is a cultural thing, but your comments sometimes come across as rude. You have reported many important issues with ideas on how to solve them and I am grateful for that, but I would also appreciate if your comments were nicer.

Sorry about that. I'll try to be nicer in the future. Having a responsive and cooperative author is a valuable asset for any F/OSS project, and I don't want to jeopardize that.

@bartdag
Copy link
Collaborator

bartdag commented Jun 14, 2013

Closing since all related issues seem to have been adressed (the localhost address is now hardcoded to 127.0.0.1).

@bartdag bartdag closed this as completed Jun 14, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants