Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Java serialization should use Apache Arrow serialization format. #2167

Closed
robertnishihara opened this issue May 30, 2018 · 7 comments
Closed
Labels

Comments

@robertnishihara
Copy link
Collaborator

No description provided.

@pcmoritz
Copy link
Contributor

pcmoritz commented May 31, 2018

One day this will hopefully allow interop between Java and Python, although here is going to be some more work involved to actually make that happen (in particular, writing something like pyarrow.serialize for Java).

Note that this is currently blocked on https://issues.apache.org/jira/browse/ARROW-1692, any help with that would be appreciated @eric-jj @songqing @imzhenyu @salah-man

@eric-jj
Copy link
Contributor

eric-jj commented Jun 6, 2018

Someone in my team will take the work item.

@surehb
Copy link
Contributor

surehb commented Jun 19, 2018

Hi @robertnishihara , I have some questions about enabling Arrow serialization for Java worker and hope you can kindly help to input here, thanks in advance!
Based on my rough understanding about Arrow, ideally we should only have an array of byte in memory to keep the data, which is managed by and also accessed through Arrow java interface. That means in the application layer, we may not be able to write java code in a nature way, for example call classInstance.fieldA, and the code will also be complicated. So my opinion is:

  1. Introduce a helper class, within which we implement serialization/deserialization for simple types (int, string, bool, etc.) and sets (list, map, etc.) of simple types.
  2. Introduce an interface that all the customized defined classes need to derive from. A customized classed need to implement the serialize/deserialize by itself or convert the data into sets of simple types. But in this way we may need to create new java class instances during deserialization.
    Does it sound good to you?

@robertnishihara
Copy link
Collaborator Author

Point 1 is similar to what we did to implement pyarrow.serialize and pyarrow.deserialize. See https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_python.cc and https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/python_to_arrow.cc.

With Python, we don't have an interface for custom classes to extend and instead try to do it automatically (though this doesn't always work).

Basically, for a custom class in Python like

class Foo:
    def __init__(self):
        self.a = 1
        self.b = 2

we first convert it to a dictionary like {'_pytype_': 'Foo', 'data': {'a': 1, 'b': 2}} and then serialize that using Arrow. Then to deserialize it, we deserialize it back to a dictionary using Arrow, then we use the string 'Foo' to look up the class definition (which has already been broadcast everywhere by a different mechanism), and then instantiate a Foo object from the data field (see https://github.com/apache/arrow/blob/a82a0273a0b9f8583de005d96475ab8685963ed8/python/pyarrow/serialization.pxi#L128-L186, those methods are called from https://github.com/apache/arrow/blob/a82a0273a0b9f8583de005d96475ab8685963ed8/cpp/src/arrow/python/python_to_arrow.cc#L383).

Do you think an approach like this could work in Java? It's possible that the natural approach in Java differs here, but something like this might work.

For primitive types and maybe some simple objects like arrays/lists/tuples, it may make sense to use the same format that we use in the Python to Arrow code (so that some object serialized from Python may be possible to deserialize from Java).

cc @pcmoritz

@robertnishihara
Copy link
Collaborator Author

cc @wesm @jacques-n

@surehb
Copy link
Contributor

surehb commented Jun 20, 2018

Hi @robertnishihara , nice to hear your response!
Yes, it makes sense to call Arrow in a similar way from both Java and Python, which will make the caller code in Ray (and potential others) more elegant. But to me it sounds like an interface refinement that better to be put in Arrow. From our side we may want to prioritize Arrow adoption in Ray at this point. Actually the current Java serialization interfaces (Serializer.encode/Serializer.decode) are similar to pyarrow.serialize and pyarrow.deserialize and I will follow the style here. For the interface I mentioned in #2, it will be used as the callback functions like in python. We need a custom way for each class to implement their own logic, and an interface class should be a reasonable option. For those primitive types, yes they should have the same format as Python, if Arrow does so. :)

@edoakes
Copy link
Contributor

edoakes commented Mar 19, 2020

Automatically closing stale issue. Please re-open if still relevant.

@edoakes edoakes closed this as completed Mar 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants