Use binary protocol for better performance #159
Comments
Hi, could you be more specific and clarify what you mean by large objects (Py4J works with references except for primitives and byte arrays)? I totally agree that for calling methods with large integers or byte arrays as parameters, the protocol is suboptimal because there is non-trivial encoding/decoding taking place. Another context where a binary protocol would help is if the client and the server are on a high-latency/low-bandwidth network. But for most of the use cases I have encountered (this might not match your experience), the client and the server are on the same machine or the same LAN, and most operations do not involve numerical arguments. In this context, the tests I did a few years ago showed that a binary protocol does not help, because most arguments (e.g., class name, method name) need to be converted to strings anyway.
In Spark MLlib, it's possible to have vectors with several million doubles in them, which we also want to fetch into Python (serialized into a byte array first) so we can do the calculation in Python. In the past, we have created a special channel (local file or socket) to transfer large objects in PySpark (broadcast and collect), but for this case it's not easy to do it that way, because it sits in a general code path.
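For context, packing a large numeric vector into a raw byte array on the Python side might look like the sketch below. This is illustrative only and does not reproduce PySpark's actual serialization code:

```python
import array

# Pack a vector of doubles into raw bytes (8 bytes per double,
# native machine format), as a PySpark-style serializer might do
# before handing the payload to the transfer layer.
vector = [float(i) * 0.5 for i in range(1_000_000)]
packed = array.array("d", vector).tobytes()

# Unpacking on the receiving side restores the original values.
unpacked = array.array("d")
unpacked.frombytes(packed)
assert list(unpacked) == vector

print(len(packed))  # 8000000 bytes for one million doubles
```

A one-million-element vector is already 8 MB of raw data, so any per-byte overhead in the transfer protocol multiplies quickly.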
@bartdag If you think it's reasonable to do, I could try to send out a patch for this when I have some cycles.
Sending byte arrays as bytes (vs. using the text protocol) is doable and probably preferable to trying to create a totally new binary protocol. I agree that transferring vectors with several million doubles in them will be slow with the current protocol, and using a side channel is, for now, the best way to go. I'll see what I can do, but you are welcome to provide a patch too!
We are already sending them back as bytes, but the overhead of string concatenation and encoding/decoding is still too high.
Yup, I understood. Bytes are converted with base64 so it is really expensive for large arrays. |
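The base64 cost mentioned above is easy to see in isolation. A minimal sketch (the ~10 MB payload size here is arbitrary):

```python
import base64

payload = bytes(range(256)) * 40_000  # ~10 MB of binary data

# Base64 inflates the payload by roughly a third and requires a
# full encode pass on the sender plus a decode pass on the receiver,
# which is where the cost shows up for large byte arrays.
encoded = base64.b64encode(payload)
decoded = base64.b64decode(encoded)

assert decoded == payload
print(f"{len(encoded) / len(payload):.2f}")  # ≈ 1.33 size overhead
```

For a multi-megabyte vector, both the 33% size inflation and the extra full-buffer passes add measurable latency compared with sending the raw bytes.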
Could you please profile it, because there are several possible reasons for the performance drawbacks.
If 2) is the reason, it is possible to compress the base64 output before sending (zip). @bartdag, can you give a clue why base64 was chosen?
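The compress-before-sending idea can be sketched as below. Note the caveat (my assumption, not claimed in the thread): compression only wins for compressible payloads, and dense numeric vectors often barely shrink:

```python
import base64
import zlib

# Highly repetitive data compresses extremely well, so zip before
# base64 is a big win here; dense random doubles would not shrink.
repetitive = b"spam" * 1_000_000
encoded = base64.b64encode(zlib.compress(repetitive))

print(len(encoded) < len(repetitive))  # True for this payload

# Round trip back to the original bytes on the receiving side.
restored = zlib.decompress(base64.b64decode(encoded))
assert restored == repetitive
```

So whether zip helps depends on the data being transferred, which is another argument for profiling real workloads first.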
Py4J uses a text protocol, and base 64 is a common strategy to transfer non-textual data in a text protocol. The text protocol made sense because Py4J was created to interact with the JVM: method and class names need to be converted to strings at some point, so even protobuf (at the time, maybe not now) was slower than just transferring text. Py4J was not created for the use case of transferring large byte arrays between the JVM and a Python process, and I don't think there is a solution that will be the best for all scenarios.

I've met several Py4J users who transferred large byte arrays, but some were doing it over the network (zip might help; a dedicated channel for byte transfer also) and others were running all their processes on the same host (a memory-mapped file might help? Unix sockets instead of TCP? dumping the bytes to a file and reading them again?). I can make this particular case configurable (e.g., you could decide which transfer strategy you want to use for large byte arrays), but I'm not even sure which strategy is worth implementing at this point, so profiling would definitely help.

As a side note, base 64 encoding was contributed by developers from The Atlantic, because I first tried a silly solution that utterly failed even though it was very fast :-)
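One of the same-host side channels mentioned above (dumping the bytes to a file and reading them back) is straightforward to sketch. This is a hypothetical illustration, not part of Py4J's API; the temp-file handoff stands in for whatever coordination the two processes would actually use:

```python
import mmap
import os
import tempfile

payload = os.urandom(1_000_000)  # stand-in for a large byte array

# Writer side: dump the raw bytes to a file on the shared host,
# skipping base64 and the text protocol entirely.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload)

# Reader side: memory-map the file so the OS pages the data in
# instead of copying it through a text channel.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        received = mm[:]

os.unlink(path)
assert received == payload
```

Unix domain sockets would look similar but avoid touching the filesystem; which option wins is exactly the kind of question profiling would settle.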
What do you mean by The Atlantic? Doesn't it use java.util.Base64? |
Re The Atlantic: #95 |
I see - MiGBase64. |
Performance comparison of Base64 implementations: we don't use MiGBase64 in the Fedora package, partly because bundling that file is not allowed. There are patches to use the (internal) javax.xml.bind.DatatypeConverter on Java 7, and java.util.Base64 on Java 8. http://pkgs.fedoraproject.org/cgit/py4j.git/tree/py4j-Base64-java7.patch
If I understand correctly, DatatypeConverter is available on all JVMs and is the fastest implementation. That might be a way to slightly improve performance and simplify fedora packaging. |
@kaytwo The slowness mostly happens on the Python side, so improvements on the JVM side will not actually help much.
This is a change on the Python side (by "built from master" I meant "zip up the library from master" - sorry for being unclear). |
Are there any updates on this?
Currently, the text protocol is very slow for large objects; we could add an optional binary protocol for better performance.