Use binary protocol for better performance #159

Open
davies opened this issue Apr 6, 2015 · 17 comments
davies commented Apr 6, 2015

Currently, the text protocol is very slow for large objects. We could add an optional binary protocol for better performance.

bartdag (Collaborator) commented Apr 6, 2015

Hi, could you be more specific and clarify what you mean by large objects (Py4J works with references except for primitives and byte arrays)?

I totally agree that for calling methods with large integers or byte arrays as parameters, the protocol is suboptimal because there is a non-trivial encoding/decoding that takes place. Another context where a binary protocol would help is if the client and the server are on high-latency/low-bandwidth network.

But for most of the use cases I encountered (this might not match your experience), the client and the server are on the same machine or the same LAN, and most operations do not involve numerical arguments. In this context, the tests I did a few years ago showed that a binary protocol does not help, because most arguments (e.g., class name, method name) need to be converted to strings anyway.

davies (Author) commented Apr 6, 2015

In Spark MLlib, it's possible to have a vector with several million doubles in it, and we also want to fetch it into Python (it will be serialized into a byte array first) so we can do the calculation in Python.

In the past, we created a special channel (local file or socket) to transfer large objects in PySpark (broadcast and collect), but it's not easy to do that here, because this sits in a general code path.

davies (Author) commented Apr 6, 2015

@bartdag If you think it's reasonable to do, I could try to send out a patch for this when I have some cycles.

bartdag (Collaborator) commented Apr 6, 2015

Sending byte arrays as bytes (rather than through the text protocol) is doable and probably preferable to trying to create a totally new binary protocol. I agree that transferring vectors with several million doubles in them will be slow with the current protocol, and using a side channel is for now the best way to go. I'll see what I can do, but you are welcome to provide a patch too!

davies (Author) commented Apr 6, 2015

We are already sending them back as bytes, but the overhead of string concatenation and encoding/decoding is still too high.

bartdag (Collaborator) commented Apr 6, 2015

Yup, I understood. Bytes are converted with base64 so it is really expensive for large arrays.
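
As a rough illustration of that cost (a standalone Python sketch, not Py4J code; the sizes are made up for the example):

```python
import base64

# Base64 maps every 3 raw bytes to 4 ASCII characters, so a large byte
# array grows by ~33% and must be fully copied on each side.
raw = bytes(3 * 1024 * 1024)         # a 3 MiB byte array of zeros
encoded = base64.b64encode(raw)      # 4 MiB of ASCII text
assert len(encoded) == 4 * 1024 * 1024

decoded = base64.b64decode(encoded)  # another full copy to recover the bytes
assert decoded == raw
```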

eperzhand commented
Could you please profile it? There are several possible reasons for the performance drawbacks:

  1. encoding/decoding bytes as b64
  2. passing larger datum through the network
  3. copying large objects in memory
  4. others?

If 2) is the reason, it is possible to compress the b64 payload before sending it (zip).

@bartdag, can you give a clue why b64 was chosen?
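
To illustrate the zip idea, a standalone Python sketch (zlib stands in here for whatever compressor would actually be used; the win depends entirely on how compressible the data is):

```python
import base64
import zlib

# Compressing the raw bytes before base64-encoding them shrinks the
# payload for repetitive data; for random data it would not help.
raw = b"\x00\x01\x02\x03" * (512 * 1024)       # 2 MiB, highly repetitive
payload = base64.b64encode(zlib.compress(raw))
assert len(payload) < len(raw)                 # far smaller than the original

restored = zlib.decompress(base64.b64decode(payload))
assert restored == raw
```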

bartdag (Collaborator) commented Oct 30, 2015

Py4J uses a text protocol and base 64 is a common strategy to transfer non-textual data in a text protocol. The text protocol made sense because Py4J was created to interact with the JVM: methods and class names need to be converted at some point to strings so even protobuf (at the time, maybe not now) was slower than just transferring text.

Py4J was not created for the use case of transferring large byte arrays between the JVM and a Python process, and I don't think there is a solution that will be the best for all scenarios. I've met with several Py4J users who transferred large byte arrays, but some were doing it over the network (zip might help, dedicated channel for byte transfer also) and others were running all their processes on the same host (memory mapped file might help? unix sockets instead of tcp? dumping the bytes to a file and reading it again?).
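
One of the same-host options above (dumping the bytes to a file and reading them back, here via mmap) can be sketched in standalone Python; the file-name handoff is assumed to happen over the existing Py4J channel:

```python
import mmap
import os
import tempfile

# Producer side dumps raw bytes to a file; consumer side maps it,
# so the payload never passes through the base64 text protocol.
data = os.urandom(1024 * 1024)  # 1 MiB of bytes to transfer

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(data)                 # "JVM side": write the raw bytes
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        received = bytes(mm)          # "Python side": read without encoding
        mm.close()
    assert received == data
finally:
    os.remove(path)
```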

I can make this particular case configurable (e.g., you could decide which transfer strategy you want to use for large byte arrays), but I'm not even sure which strategy is worth implementing at this point, so profiling would definitely help.

As a side note, base 64 encoding was contributed by developers from The Atlantic because I tried first a silly solution that utterly failed even if it was very fast :-)

eperzhand commented
What do you mean by The Atlantic? Doesn't it use java.util.Base64?
I think the most useful thing to implement is a "receiveRawDatum"/"send.." method for the gateway, which could be used to transfer anything.
A developer could then manually implement ZIP (on the Java side) and UNZIP (on the Python side, or vice versa), raw byte array transfer, or serialized object transfer.
I.e., for most cases they would use the nice Py4J interface, but for performance-critical sections, this "raw channel".
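
A hypothetical sketch of such a raw channel in Python (the framing and function names are invented here, not part of Py4J's actual API): length-prefixed binary frames on a dedicated socket, bypassing the text protocol entirely.

```python
import socket
import struct

def send_raw(sock: socket.socket, payload: bytes) -> None:
    # 8-byte big-endian length prefix, then the raw bytes: no base64 step.
    sock.sendall(struct.pack(">q", len(payload)))
    sock.sendall(payload)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-frame")
        buf.extend(chunk)
    return bytes(buf)

def recv_raw(sock: socket.socket) -> bytes:
    (length,) = struct.unpack(">q", _recv_exact(sock, 8))
    return _recv_exact(sock, length)

# Demo over a local socket pair (payload kept small so the demo
# fits in the socket buffer without a concurrent reader).
a, b = socket.socketpair()
send_raw(a, b"x" * 4096)
assert recv_raw(b) == b"x" * 4096
a.close()
b.close()
```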

bartdag (Collaborator) commented Oct 30, 2015

Re The Atlantic: #95

eperzhand commented
I see - MiGBase64.

rapgro commented Oct 30, 2015

Performance comparison of Base64 implementations:
http://java-performance.info/base64-encoding-and-decoding-performance/

We don't use MiGBase64 in the Fedora package, partly because bundling that file is not allowed. There are patches to use the (internal) javax.xml.bind.DatatypeConverter on Java 7 and java.util.Base64 on Java 8.

http://pkgs.fedoraproject.org/cgit/py4j.git/tree/py4j-Base64-java7.patch
http://pkgs.fedoraproject.org/cgit/py4j.git/tree/py4j-Base64-java8.patch

bartdag (Collaborator) commented Oct 30, 2015

If I understand correctly, DatatypeConverter is available on all JVMs and is the fastest implementation. That might be a way to slightly improve performance and simplify fedora packaging.

kaytwo commented May 2, 2016

#201 is the likely culprit for this performance degradation, or at least a large chunk of it. @davies, can you see whether py4j built from master improves the performance you were seeing? The OOM crashes are a different story, however.

davies (Author) commented May 2, 2016

@kaytwo The slowness mostly happened on the Python side, so improvements on the JVM side will not actually help much.

kaytwo commented May 2, 2016

This is a change on the Python side (by "built from master" I meant "zip up the library from master" - sorry for being unclear).

yupbank commented Apr 19, 2018

Are there any updates on this?
