Add ability to start TensorFlow server from Java API #23022
Conversation
Thanks for the PR! A few comments
tensorflow/c/c_api.h (outdated)

// Blocks until the server has shut down (currently blocks forever).
TF_CAPI_EXPORT extern void TF_JoinServer(TF_Server* server, TF_Status* status);

// Destroy a server, frees memory.
I'd clarify here whether the server is expected to have been stopped/joined before calling this.
Done.
}

@Override
public void close() {
Having both stop() and close() seems unnecessary. How about just having close(), which invokes TF_ServerStop? And join() can zero out the nativeHandle after returning.
I would prefer to keep these two methods separate. The reason is simple: Server is AutoCloseable, which means the common use case will look like this:

try (Server server = new Server(...)) {
    server.start();
    server.join(); // or server.stop();
}

As you can see, in this case close() will be called automatically after join(), so if join() freed the resources itself it would lead to an IllegalStateException.
Besides that, it would be great to have the API as similar as possible across all languages.
}

/** Blocks until the server has shut down (currently blocks forever). */
public synchronized void join() {
Given that join() blocks, are you sure we want to mark this method as synchronized?
It appears to me that if one thread were to invoke join() first, then no other thread will be able to invoke stop(), and as a result both threads will remain blocked forever. Am I missing something?
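The hazard described here can be demonstrated with a minimal, self-contained sketch (the class below is hypothetical and uses a latch in place of the real native server; it is not the PR's actual code). Because join() holds the object monitor while blocking, a synchronized stop() can never enter:

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for Server: both methods synchronized, as in the
// version the comment above is questioning.
class BlockingServer {
    private final CountDownLatch shutdown = new CountDownLatch(1);

    public synchronized void join() throws InterruptedException {
        shutdown.await(); // blocks while holding this object's monitor
    }

    public synchronized void stop() {
        shutdown.countDown(); // needs the same monitor, so it can't run during join()
    }
}

public class DeadlockDemo {
    public static void main(String[] args) throws Exception {
        BlockingServer server = new BlockingServer();
        Thread joiner = new Thread(() -> {
            try { server.join(); } catch (InterruptedException ignored) {}
        });
        joiner.start();
        Thread.sleep(100); // let joiner acquire the monitor and block in await()

        Thread stopper = new Thread(server::stop);
        stopper.start();
        stopper.join(500); // stop() is stuck waiting for the monitor held by join()

        System.out.println("stopper finished: " + !stopper.isAlive());
        // Expected: stopper finished: false  (both threads are stuck)

        joiner.interrupt(); // unblock join() so the demo process can exit
        joiner.join();
        stopper.join();
    }
}
```

Without the interrupt at the end, the two threads would indeed remain blocked forever, which is exactly the concern raised above.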
I've checked the code, and it looks like the Server methods start, stop, and join are synchronized in the underlying layer (I didn't know that). So we only need to prohibit parallel calls of delete and the other functions. To do that I added a read-write lock, so please have a look.
Using this lock, the user will be able to call the stop method from another thread while the current thread is blocked in the join method.
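A hedged sketch of that read-write-lock idea (class and field names here are hypothetical, not the actual PR code): join() and stop() take the shared read lock, so they may run concurrently, while delete() takes the exclusive write lock, so it waits until no other call is in flight:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for Server, with a latch in place of the native server.
class LockedServer {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final CountDownLatch shutdown = new CountDownLatch(1);
    private long nativeHandle = 1L; // stand-in for the JNI handle

    public void join() throws InterruptedException {
        lock.readLock().lock();
        try {
            shutdown.await(); // blocks, but only holds the shared read lock
        } finally {
            lock.readLock().unlock();
        }
    }

    public void stop() {
        lock.readLock().lock(); // read lock: does NOT exclude a blocked join()
        try {
            shutdown.countDown();
        } finally {
            lock.readLock().unlock();
        }
    }

    public void delete() {
        lock.writeLock().lock(); // exclusive: waits until join()/stop() finish
        try {
            nativeHandle = 0L;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public long handle() { return nativeHandle; }
}

public class RwLockDemo {
    public static void main(String[] args) throws Exception {
        LockedServer server = new LockedServer();
        Thread joiner = new Thread(() -> {
            try { server.join(); } catch (InterruptedException ignored) {}
        });
        joiner.start();
        Thread.sleep(100); // let joiner block inside join()

        server.stop();   // succeeds even while joiner blocks in join()
        joiner.join();   // join() returns once stop() has counted down
        server.delete(); // write lock acquired only after all readers are gone

        System.out.println("handle after delete: " + server.handle());
    }
}
```

The key design point is that a blocked join() only holds the read lock, so stop() from another thread is never excluded, while delete() still cannot race with either of them.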
By the way, I checked calling stop while the server is blocked in join, and it looks like it doesn't work anyway (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc#L412).
I investigated a bit more and found out that close() just can't work correctly. The Server destructor calls stop() wrapped in TF_CHECK_OK, but stop() returns an unimplemented error, and as a result TF_CHECK_OK crashes the process with signal 6 (SIGABRT).
It looks like this should be fixed, but in another PR (it's not about the Java API at all).
I prepared PR that fixes this issue: #23190.
Hi @asimshankar, thank you for the review. I fixed all your comments, so please have a look.
Hi @asimshankar. Any updates?
Hi @asimshankar, @ymodak. I don't clearly understand the current state. Do you have any concerns about this PR, or is it ready to be merged?
@dmitrievanthony - apologies for the delay, I'll aim to take a look at the updated PR soon.
Thanks for the changes, some additional comments.
Also, I think with your C API change we should be able to use this in Python as well and reduce the custom wrapping (probably there for historical reasons) in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/server_lib.i
If you're up for that in a follow up PR, that would be awesome. Otherwise, I can take a stab at it once this PR is in.
Hi @asimshankar. Thanks for the comments. I updated the code, you can have a look. The only thing I didn't fix is the concurrency issues in … Speaking about the Python API, I'd be glad to propagate this change into it as well, but I think it makes sense to do that in another PR (there are some internal tasks of ours that depend on this PR, so it would be great to merge it as soon as possible). BTW, I'm also trying to fix the gRPC server …
Hi @asimshankar. I think I fixed all your comments, could you please have a look?
A few more leaks, otherwise looks good.
Thanks, I eliminated the leaks. Please have a look.
Thanks @dmitrievanthony: I'll run the tests once we have a fix for the licenses failure. Some background: when we package the JNI library in a tarball, we include the LICENSE files of all the libraries that the target depends on. You can probably run … The LICENSE files are explicitly listed in the … Since we're adding a dependency on another third-party library (grpc), we need to include the license for that as well, as listed in the logs of the failing test (which means we need the LICENSE file for grpc as well as its external dependencies, address_sorting and nanopb). So I think simply adding these to the dependencies of the … Let me know if you can do that (and run …). Thanks!
Thanks for the very detailed explanation, @asimshankar! I fixed the problem and checked …
Hi @asimshankar, it looks like something not related to my code failed; some allocation problems, as far as I can see.
@dmitrievanthony: Yeah, that seems unrelated. I'll take it from here :). Thanks for the contribution, will hopefully figure that out and get it merged in the next day or two. Will ping back if I discover any issues.
(Some more comments based on the failing tests)
Hi @asimshankar. Do I understand correctly that the PR will be merged soon, by you or someone else? Do I need to make any additional changes?
…-api PiperOrigin-RevId: 221113400
@dmitrievanthony: Yup, and done! Thanks for the contribution.
And use the C API for servers introduced in #23022 instead. PiperOrigin-RevId: 221207458
Hi @asimshankar, I just checked my libtensorflow v1.12 installation and see no additional C APIs introduced in this PR. @dmitrievanthony, have you successfully built the Java part of this PR (possibly in tf-io) without a source dependency on TF core? EDIT: sorry, I did not notice that r1.12 was cut before this PR merged. I will try master instead.
@byronyi : the merge commit isn't part of the 1.12 release branch since it landed after that branch was cut. So, these features will be part of 1.13 or when built from source. |
Standalone Client Mode makes it easy to train a model utilizing distributed resources. The only thing we need to do is start a TensorFlow server on every cluster node we'd like to participate in the training. Because of that, it's very important to be able to start a TensorFlow server in different ways and from any environment. I prepared this pull request as a result of a discussion on the Development List.
The following example demonstrates how to use TensorFlow server in Java:
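(The original example did not survive extraction; below is a minimal sketch of what such usage might look like with the Server class this PR adds, following the try-with-resources pattern discussed in the review. The serverDefBytes value and the loadServerDef helper are assumptions for illustration, and running this requires the TensorFlow native library on the classpath.)

```java
import org.tensorflow.Server;

public class ServerExample {
    public static void main(String[] args) throws Exception {
        // serverDefBytes: a serialized tensorflow.ServerDef proto describing
        // the cluster, the job name, and this task's index (assumed to be
        // built elsewhere, e.g. from the proto-generated classes).
        byte[] serverDefBytes = loadServerDef();

        try (Server server = new Server(serverDefBytes)) {
            server.start(); // start serving as a cluster member
            server.join();  // block until the server shuts down
        } // close() releases the native resources automatically
    }

    // Hypothetical helper: build or load a ServerDef for your cluster.
    private static byte[] loadServerDef() {
        throw new UnsupportedOperationException("build a ServerDef for your cluster");
    }
}
```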
Please, feel free to comment and suggest changes.