Conversation

@yongtang
Member

This PR adds a way to pull data from a gRPC server, as long as the server implements the protocol in endpoint.proto.

The main purpose is to allow reading partial data from a numpy array in memory.

In the from_numpy method, the following is done:

  • Create a gRPC server that exposes the ReadRecord endpoint.
  • Start the gRPC server on a random port on localhost.
  • Pass the endpoint (localhost:port) to GRPCDataset.
  • GRPCDataset pulls the data from the gRPC server.

Note that from_numpy is just a convenience method to set up a gRPC server. In theory, a gRPC server could be created by any other process, in any language. Only the endpoint information is needed to create a GRPCDataset.
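The flow above can be sketched without real gRPC. This is a plain-Python simulation of the pull protocol; `InMemoryRecordServer` and `pull_all` are hypothetical names standing in for the gRPC server and GRPCDataset, not the actual API:

```python
import numpy as np

class InMemoryRecordServer:
    """Stands in for the gRPC server that from_numpy would start:
    it holds a numpy array and serves records by [offset, offset+length)."""
    def __init__(self, data):
        self._data = data

    def read_record(self, offset, length):
        # Return at most `length` records starting at `offset`.
        return self._data[offset:offset + length]

def pull_all(server, total, chunk):
    """Client side: pull the array one chunk at a time, like GRPCDataset."""
    chunks = []
    offset = 0
    while offset < total:
        chunks.append(server.read_record(offset, chunk))
        offset += chunk
    return np.concatenate(chunks)

server = InMemoryRecordServer(np.arange(10))
result = pull_all(server, 10, 3)
print(result.tolist())
```

The client never holds more than one chunk plus the accumulated output at a time, which is the point of pulling partial data rather than passing the whole array into a tensor.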

In case the numpy array is huge, this method could be helpful, as it avoids saving the array to a file and loading it back through tf.data, or passing the whole numpy array into a tensor.

This could open doors for other languages as well. For example, in the case of R, we could set up a gRPC server in R, pass R data frames in memory, and allow GRPCDataset to pull a chunk of the data at a time.

Note that we create the gRPC server locally, but it is also possible to expose a gRPC server remotely so that the workload could be distributed.

Signed-off-by: Yong Tang yong.tang.github@outlook.com

Member

@BryanCutler BryanCutler left a comment

@yongtang I only had time for a quick skim right now, but it looks really cool! Do you think I could reuse some of these parts to feed data from Python memory to the ArrowDataset?

I had a small concern about using a void* as one of the method arguments; maybe you can elaborate on this?

rootpath,
os.path.relpath(os.path.join(rootname, filename), datapath))
print("setup.py - copy {} to {}".format(src, dst))
shutil.copyfile(src, dst)
Member

Just wondering why this is being added now; was it grabbing some incorrect files?

Also, could you reuse the similar loop above and just nest it under a list of extensions? Like:

for file_pattern in ["*.so", "*.py"]:
  for filename in fnmatch.filter(filenames, file_pattern):
    ...
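For reference, a runnable version of that nested loop; the file list here is made up for illustration:

```python
import fnmatch

# Hypothetical file listing, as os.walk might yield from bazel-bin.
filenames = ["core_ops.so", "grpc_dataset_ops.py", "endpoint.proto", "README.md"]

matched = []
for file_pattern in ["*.so", "*.py"]:
    for filename in fnmatch.filter(filenames, file_pattern):
        matched.append(filename)

print(matched)
```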

Member Author

Thanks @BryanCutler. The PR has been updated. Previously, we picked up .py files from the source tree and .so files from bazel-bin. However, gRPC comes with generated Python code (.py) based on the endpoint.proto file. Those files are generated as part of the build, so bazel places them in bazel-bin. setup.py therefore has to be updated to capture the .py files in bazel-bin as well.
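The copy step being described could be sketched like this; the directory layout and file names below are hypothetical stand-ins for bazel-bin and the package tree:

```python
import fnmatch
import os
import shutil
import tempfile

# Sketch of the setup.py change: walk a build tree (standing in for
# bazel-bin) and copy both .so files and generated .py files into the
# package directory, preserving the relative layout.
datapath = tempfile.mkdtemp()   # stands in for bazel-bin
rootpath = tempfile.mkdtemp()   # stands in for the package directory

srcdir = os.path.join(datapath, "tensorflow_io", "grpc", "python", "ops")
os.makedirs(srcdir)
for name in ["endpoint_pb2.py", "endpoint_pb2_grpc.py", "libops.so", "notes.txt"]:
    open(os.path.join(srcdir, name), "w").close()

copied = []
for rootname, _, filenames in os.walk(datapath):
    for file_pattern in ["*.so", "*.py"]:
        for filename in fnmatch.filter(filenames, file_pattern):
            src = os.path.join(rootname, filename)
            dst = os.path.join(rootpath, os.path.relpath(src, datapath))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copyfile(src, dst)
            copied.append(os.path.relpath(dst, rootpath))

print(sorted(copied))
```

Only the .so and generated .py files are picked up; other build outputs (like the .txt here) are left behind.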

virtual ~DataInput() {}
virtual Status FromStream(io::InputStreamInterface& s) = 0;
virtual Status ReadRecord(io::InputStreamInterface& s, IteratorContext* ctx,
                          std::unique_ptr<T>& state, int64 record_to_read,
                          int64* record_read,
                          std::vector<Tensor>* out_tensors) const = 0;
virtual Status ReadReferenceRecord(void* s, IteratorContext* ctx,
                                   std::unique_ptr<T>& state,
                                   int64 record_to_read, int64* record_read,
                                   std::vector<Tensor>* out_tensors) const = 0;
Member

Would you mind explaining the use of void * here, and the change from references to pointers?

Member Author

@BryanCutler Previously, references worked well since we were mostly dealing with file streams (io::InputStreamInterface& s). However, gRPC is not a file stream and does not return raw bytes (no ReadNBytes), so InputStreamInterface is irrelevant there. I changed to pointers so that nullptr can be passed for gRPC.

The main purpose is to reuse the batching logic, where we need to piece together chunks of a tensor.

Member

Ok, thanks for the explanation. Having void* arguments on public APIs makes me a little nervous though. Do you think it makes sense to put this behind a protected member?

Member Author

@BryanCutler Thanks for the review. The PR has been updated. Now FileInput (and its declared subclasses) only needs to implement:

ReadRecord(io::InputStreamInterface* s, IteratorContext* ctx, ...)

and StreamInput only needs to implement:

Status ReadRecord(IteratorContext* ctx, ...

So the void * is hidden and will not be touched by any ops.
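A rough Python analogue of this split may help. The real code is C++, and the class and method names below are illustrative only: the point is that each base class fixes the argument list its subclasses see, so the void* plumbing stays internal to the framework.

```python
import io
from abc import ABC, abstractmethod

class FileInput(ABC):
    @abstractmethod
    def read_record(self, stream, ctx, count):
        """File-backed inputs read `count` records from a stream."""

class StreamInput(ABC):
    @abstractmethod
    def read_record(self, ctx, count):
        """Non-file inputs (like gRPC) read records with no stream argument."""

class LineFileInput(FileInput):
    # Example subclass: reads newline-delimited records from a stream.
    def read_record(self, stream, ctx, count):
        return [stream.readline().strip() for _ in range(count)]

class CounterStreamInput(StreamInput):
    # Example subclass: produces records from state in the context alone.
    def read_record(self, ctx, count):
        start = ctx.get("next", 0)
        ctx["next"] = start + count
        return list(range(start, start + count))

records = LineFileInput().read_record(io.StringIO("a\nb\nc\n"), {}, 2)
more = CounterStreamInput().read_record({"next": 5}, 3)
print(records, more)
```

Neither subclass ever sees an untyped handle; only the base machinery would.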

Please take a look.

Member

Sounds good thanks!

@yongtang yongtang changed the title Add GRPCDataset to allow pulling data from a gRPC server Add GRPCDataset to allow pulling data from a gRPC server such as numpy array Apr 27, 2019
@yongtang
Member Author

Add a reference to tensorflow/tensorflow#13530, which has a need for pandas input. Could be a follow-up PR, I think.

"python/ops/endpoint_pb2_grpc.py",
],
cmd = "python -m grpc_tools.protoc -Itensorflow_io/grpc --python_out=$(BINDIR)/tensorflow_io/grpc/python/ops/ --grpc_python_out=$(BINDIR)/tensorflow_io/grpc/python/ops/ $< ; touch $(BINDIR)/tensorflow_io/grpc/python/ops/__init__.py",
output_to_bindir = True,
Member

The flatbuffers build for Arrow had an option to put the generated files under a prefix directory, e.g. out_prefix = "cpp/src/arrow/ipc/". Is it possible to do that here, so the files end up in the right place initially?

Member Author

@BryanCutler Bazel uses a mount point to hide the source directory on purpose, so that generated code cannot end up in the source directory. So we have to copy the Python files from where they are exposed in bazel-bin (not the source directory). The flatbuffers case is different: there the generated files are used for compilation (which bazel can see), not for final exposure (as in the Python case we encounter here).

Member

@BryanCutler BryanCutler May 3, 2019

Ok, I see. Thanks for the explanation.

@yongtang
Member Author

yongtang commented May 2, 2019

@BryanCutler The PR has been updated. Please take a look.

yongtang added 2 commits May 2, 2019 21:38
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Member

@BryanCutler BryanCutler left a comment

LGTM, I had just one more minor comment

endpoint = grpc_server.endpoint()
dtype = a.dtype
shape = list(a.shape)
batch = batch
Member

Is this maybe a typo?

Member Author

@BryanCutler Thanks! Just updated.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
@yongtang
Member Author

yongtang commented May 4, 2019

Let's merge this one. I saw some discussion about Java, which this PR should help with.

@yongtang yongtang merged commit 8242d26 into tensorflow:master May 4, 2019
@yongtang yongtang deleted the grpc branch May 4, 2019 09:35
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
Add GRPCDataset to allow pulling data from a gRPC server such as numpy array