Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to start TensorFlow server from Java API #23022

Merged
merged 10 commits into from Nov 12, 2018
Merged

Add ability to start TensorFlow server from Java API #23022

merged 10 commits into from Nov 12, 2018

Conversation

dmitrievanthony
Copy link
Contributor

Standalone Client Mode allows to easily train model utilizing distributed resources. Only one thing we need to do that is to start TensorFlow server on every cluster node we'd like to participate in training. Because of that it's very important to have an ability to start TensorFlow server in different ways and from any environment. I prepared this request as result of discussion on Development List.

The following example demonstrates how to use TensorFlow server in Java:

ClusterDef clusterDef = ClusterDef.newBuilder()
  .addJob(JobDef.newBuilder()
  .setName("worker")
  .putTasks(0, "localhost:4321")
  .build()
).build();

ServerDef serverDef = ServerDef.newBuilder()
  .setCluster(clusterDef)
  .setJobName("worker")
  .setTaskIndex(0)
  .setProtocol("grpc")
  .build();

Server server = new Server(serverDef.toByteArray());
server.start();
server.join();

Please, feel free to comment and suggest changes.

@dmitrievanthony dmitrievanthony changed the title Add ability to start TensorFlow server from Java API. Add ability to start TensorFlow server from Java API Oct 16, 2018
@ymodak ymodak self-assigned this Oct 16, 2018
@ymodak ymodak added the awaiting review Pull request awaiting review label Oct 16, 2018
Copy link
Contributor

@asimshankar asimshankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! A few comments

tensorflow/c/c_api.h Outdated Show resolved Hide resolved
tensorflow/c/c_api.h Show resolved Hide resolved
tensorflow/c/c_api.h Outdated Show resolved Hide resolved
// Blocks until the server has shut down (currently blocks forever).
TF_CAPI_EXPORT extern void TF_JoinServer(TF_Server* server, TF_Status* status);

// Destroy a server, frees memory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd clarify here whether the server is expected to have been stopped/joined before calling this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

tensorflow/c/c_api_internal.h Show resolved Hide resolved
tensorflow/java/src/main/native/server_jni.cc Show resolved Hide resolved
tensorflow/java/src/main/native/server_jni.cc Outdated Show resolved Hide resolved
}

@Override
public void close() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having both stop() and close() seems unnecessary. How about just having close(), which invokes TF_ServerStop? And join() can zero out the nativeHandle after returning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to keep these two methods separately. The reason for that is simple, Server is AutoCloseable and that means that common use case will look like this:

try (Server server = new Server(...)) {
  server.start();
  server.join(); // or server.stop();
}

As you can see in this case method close() will be called automatically after join, so it join frees resources itself it will lead to illegal state exception.

Besides that, it would be great to have as similar API in all languages as it possible.

}

/** Blocks until the server has shut down (currently blocks forever). */
public synchronized void join() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that join() blocks, are you sure we want to mark this method as synchronized?
It appears to me, that if one thread were to invoke join() first, then no other thread will be able to invoke stop() and as a result both threads will remain blocked forever. Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked code and looks like Server methods start, stop and join are synchronized on underlying layer (I didn't know that). So, we only need to prohibit parallel calls of delete and other functions. Do do that I added rw-lock, so please have a look.

Using this lock, user will be able to call stop method from other thread when current thread is locked by join method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I checked calling stop when server is locked by join and looks like it doesn't work anyway (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc#L412)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I investigated a bit more and found out that close() just can't work correctly. Server destructor calls stop() wrapped into TF_CHECK_OK, but stop() returns unimplemented exception, as result TF_CHECK_OK crushes the process using sig 6.

Looks like it should be fixed, but in other PR (it's not about Java API at all).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prepared PR that fixes this issue: #23190.

tensorflow/java/src/main/native/BUILD Outdated Show resolved Hide resolved
@asimshankar asimshankar added stat:awaiting response Status - Awaiting response from author and removed awaiting review Pull request awaiting review labels Oct 17, 2018
@dmitrievanthony
Copy link
Contributor Author

Hi @asimshankar, thank you for review. I fixed all your comments, so please have a look.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Oct 18, 2018
@ymodak ymodak added the awaiting review Pull request awaiting review label Oct 19, 2018
@dmitrievanthony
Copy link
Contributor Author

Hi @asimshankar. Any updates?

@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Oct 24, 2018
@dmitrievanthony
Copy link
Contributor Author

Hi, @asimshankar, @ymodak. I don't clearly understand current state. Do you have some concerns about this PR or it's ready to be merged?

@asimshankar
Copy link
Contributor

@dmitrievanthony - apologies for the delay, I'll aim to take a look at the updated PR soon

Copy link
Contributor

@asimshankar asimshankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, some additional comments.

Also, I think with your C API change we should be able to use this in Python as well and reduce the custom wrapping (probably there for historical reasons) in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/server_lib.i

If you're up for that in a follow up PR, that would be awesome. Otherwise, I can take a stab at it once this PR is in.

tensorflow/c/c_api.h Show resolved Hide resolved
tensorflow/c/c_api.h Outdated Show resolved Hide resolved
tensorflow/c/c_api.h Outdated Show resolved Hide resolved
tensorflow/java/src/main/java/org/tensorflow/Server.java Outdated Show resolved Hide resolved
tensorflow/java/src/main/java/org/tensorflow/Server.java Outdated Show resolved Hide resolved
tensorflow/java/src/main/java/org/tensorflow/Server.java Outdated Show resolved Hide resolved
tensorflow/java/src/main/native/server_jni.cc Outdated Show resolved Hide resolved
tensorflow/java/src/main/native/server_jni.cc Outdated Show resolved Hide resolved
@dmitrievanthony
Copy link
Contributor Author

dmitrievanthony commented Oct 29, 2018

Hi, @asimshankar. Thanks for comments. I updated the code, you can have a look. I didn't fixed only concurrency things in Server.java because from my perspective your approach is much more complicated and error prone. Well, anyway, if you prefer it, I can use it.

Speaking about Python API, I'd be glad to propagate this change into it also, but I think it makes sense to do it in other PR (there are some our internal tasks that depends on this PR, so would be great to merge it as soon as it possible).

BTW, I'm also trying to fix GRPC server stop issue in PR #23190, but it's unclear so far how to do it correctly. If you could have a look and give an advice it would be great.

@dmitrievanthony
Copy link
Contributor Author

Hi, @asimshankar. I think I fixed all your comments, could you please have a look?

Copy link
Contributor

@asimshankar asimshankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more leaks, otherwise looks good.

tensorflow/c/c_api.cc Show resolved Hide resolved
tensorflow/java/src/main/native/server_jni.cc Show resolved Hide resolved
tensorflow/java/src/main/native/server_jni.cc Show resolved Hide resolved
tensorflow/java/src/main/native/server_jni.cc Show resolved Hide resolved
@dmitrievanthony
Copy link
Contributor Author

Thanks, I eliminated leaks. Please have a look.

asimshankar
asimshankar previously approved these changes Nov 2, 2018
@asimshankar asimshankar added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Nov 2, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Nov 2, 2018
@ymodak ymodak added the kokoro:force-run Tests on submitted change label Nov 6, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Nov 6, 2018
@asimshankar
Copy link
Contributor

Thanks @dmitrievanthony : I'll run the tests once we have a fix for the licenses failure.

Some background: When we package the JNI library in a tarball, we include the LICENSE files of all libraries that the target depends on. You can probably run ci_sanity.sh locally to reproduce the failure seen in the "Ubuntu Sanity" build.

The LICENSE files are explicitly listed in the //tensorflow/tools/lib_package:jnilicenses_generate BUILD target and the sanity check script aims to ensure that this list is in sync with the actual dependencies of //tensorflow/java:libtensorflow_jni.so target.

Since we're adding a dependency to another third party library (grpc) we need to include the license for that as well - listed in the logs of the failing test:

FAIL: mismatch in packaged licenses and external dependencies
Missing the licenses for the following external dependencies:
@grpc//
@grpc//third_party/address_sorting
@grpc//third_party/nanopb

(Which means we need the LICENSE file for grpc as well as its external dependencies on address_sorting and nanopb).

So I think simply adding these to the dependencies of the //tensorflow/tools/lib_package:jnilicenses_generate target should do.

Let me know if you can do that (and run ci_sanity.sh locally to iterate faster). If not, and you need some help, I can try to patch that in before merging this PR.

Thanks!

@dmitrievanthony
Copy link
Contributor Author

Thanks for very details explanation, @asimshankar! I fixed the problem and checked ci_sanity.sh locally. Please have a look and rerun tests.

@asimshankar asimshankar added the kokoro:force-run Tests on submitted change label Nov 7, 2018
@kokoro-team kokoro-team removed kokoro:force-run Tests on submitted change labels Nov 7, 2018
@dmitrievanthony
Copy link
Contributor Author

Hi @asimshankar, looks like something not related to my code failed. Some allocation problems as far as I see.

@asimshankar
Copy link
Contributor

@dmitrievanthony : Yeah, that seems unrelated. I'll take it from here :). Thanks for the contribution, will hopefully figure that out and get it merged in the next day or two. Will ping back if I discover any issues.

@asimshankar asimshankar added the kokoro:force-run Tests on submitted change label Nov 8, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Nov 8, 2018
tensorflow/c/c_api.cc Outdated Show resolved Hide resolved
tensorflow/c/c_api.cc Outdated Show resolved Hide resolved
tensorflow/c/c_api.cc Outdated Show resolved Hide resolved
@asimshankar
Copy link
Contributor

(Some more comments based on the failing tests)

@asimshankar asimshankar added the kokoro:force-run Tests on submitted change label Nov 8, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Nov 8, 2018
@dmitrievanthony
Copy link
Contributor Author

Hi, @asimshankar. Do I understand correctly that the PR will be merged soon by you or someone else? Do I need to make some additional changes?

@tensorflow-copybara tensorflow-copybara merged commit 41311db into tensorflow:master Nov 12, 2018
tensorflow-copybara pushed a commit that referenced this pull request Nov 12, 2018
@asimshankar
Copy link
Contributor

@dmitrievanthony : Yup, and done! Thanks for the contribution.

tensorflow-copybara pushed a commit that referenced this pull request Nov 13, 2018
And use the C API for servers introduced in
#23022 instead.

PiperOrigin-RevId: 221207458
@byronyi
Copy link
Contributor

byronyi commented Dec 27, 2018

Hi @asimshankar, I just checked my libtensorflow v1.12 installation and see no additional C APIs introduced in this PR. @dmitrievanthony Have you been successfully building the Java part of this PR (possibly in tf-io) without source dependency on TF core?

EDIT: sorry, did not notice that r1.12 was cut before this PR merged. I will try master instead.

@asimshankar
Copy link
Contributor

@byronyi : the merge commit isn't part of the 1.12 release branch since it landed after that branch was cut. So, these features will be part of 1.13 or when built from source.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes ready to pull PR ready for merge process
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants