Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement autotranslation with arrow #7577

Open
scottdraves opened this issue Jun 26, 2018 · 3 comments
Open

implement autotranslation with arrow #7577

scottdraves opened this issue Jun 26, 2018 · 3 comments

Comments

@scottdraves
Copy link
Contributor

followup to #5039

use arrow and shared memory.

@dclong
Copy link

dclong commented Jun 26, 2018

Looking forward to it.

@MurrayAdth
Copy link

Hello, my name is Murray Davis. I am a developer with AdTheorent (http://adtheorent.com/). I have just made a POC that creates a Pandas DataFrame in a Jupyter Notebook, converts it to an ArrowRecordBatch with PyArrow, and stores the ArrowRecordBatch in a Plasma Object Store, i.e. shared memory. Then, it invokes through Py4J a Scala/Java application to fetch the ArrowRecordBatch from the Plasma Object Store with the Java Plasma Client, convert it into our own DataFrame format, and learn the data with our own real-time Java machine learner.

We successfully tested it with multiple Python threads after synchronizing the code segment that uses the Python Plasma Client, since it is not thread-safe. (We are using Arrow 0.9.0.)

Our motivation was to improve the performance of a Python Jupyter Notebook when integrated with our Java machine learner. We had found that transmitting serialized Pandas DataFrames through Py4J is a significant bottleneck.

Our POC has been successful in significantly improving the speed of our integration.

We have very recently become aware of BeakerX, and I have only started to try it out this week. We can see that BeakerX may have the potential to host our Python/Java integration with Arrow Plasma.

Does it sound like our work could contribute to the implementation of autotranslation with Arrow?

@SemanticBeeng
Copy link

SemanticBeeng commented Nov 9, 2018

"found that transmitting serialized Pandas DataFrames through Py4J is a significant bottleneck."

(context outside beakerx) I used jep https://github.com/ninia/jep NDArray to share zero copy java NIO buffers between JVM (off-heap memory) getting data from spark and calling Python which is doing no IO.

So, again, used in-memory integration and no remoting or copying large data as with with py4j.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants