Add ability to start TensorFlow server from Java API #23022
Conversation
Thanks for the PR! A few comments
tensorflow/c/c_api.h (outdated)

// Blocks until the server has shut down (currently blocks forever).
TF_CAPI_EXPORT extern void TF_JoinServer(TF_Server* server, TF_Status* status);

// Destroy a server, frees memory.
I'd clarify here whether the server is expected to have been stopped/joined before calling this.
Done.
}

@Override
public void close() {
Having both stop() and close() seems unnecessary. How about just having close(), which invokes TF_ServerStop? And join() can zero out the nativeHandle after returning.
I would prefer to keep these two methods separate. The reason is simple: Server is AutoCloseable, which means the common use case will look like this:

try (Server server = new Server(...)) {
    server.start();
    server.join(); // or server.stop();
}

As you can see, in this case close() will be called automatically after join(), so if join() freed the resources itself it would lead to an IllegalStateException.
Besides that, it would be great to have the API as similar as possible across all languages.
}

/** Blocks until the server has shut down (currently blocks forever). */
public synchronized void join() {
Given that join() blocks, are you sure we want to mark this method as synchronized?
It appears to me that if one thread were to invoke join() first, then no other thread will be able to invoke stop(), and as a result both threads will remain blocked forever. Am I missing something?
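The hazard described here can be demonstrated with a minimal, self-contained sketch (the class below is hypothetical and uses a latch in place of the real native server; it is not the PR's actual code). Because join() holds the object monitor while blocking, a synchronized stop() can never enter:

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for Server: both methods synchronized, as in the
// version the comment above is questioning.
class BlockingServer {
    private final CountDownLatch shutdown = new CountDownLatch(1);

    public synchronized void join() throws InterruptedException {
        shutdown.await(); // blocks while holding this object's monitor
    }

    public synchronized void stop() {
        shutdown.countDown(); // needs the same monitor, so it can't run during join()
    }
}

public class DeadlockDemo {
    public static void main(String[] args) throws Exception {
        BlockingServer server = new BlockingServer();
        Thread joiner = new Thread(() -> {
            try { server.join(); } catch (InterruptedException ignored) {}
        });
        joiner.start();
        Thread.sleep(100); // let joiner acquire the monitor and block in await()

        Thread stopper = new Thread(server::stop);
        stopper.start();
        stopper.join(500); // stop() is stuck waiting for the monitor held by join()

        System.out.println("stopper finished: " + !stopper.isAlive());
        // Expected: stopper finished: false  (both threads are stuck)

        joiner.interrupt(); // unblock join() so the demo process can exit
        joiner.join();
        stopper.join();
    }
}
```

Without the interrupt at the end, the two threads would indeed remain blocked forever, which is exactly the concern raised above.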
I've checked the code, and it looks like the Server methods start, stop, and join are synchronized in the underlying layer (I didn't know that). So we only need to prohibit parallel calls of delete and the other functions. To do that I added a read-write lock, so please have a look.
Using this lock, the user will be able to call the stop method from another thread while the current thread is blocked in the join method.
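A hedged sketch of that read-write-lock idea (class and field names here are hypothetical, not the actual PR code): join() and stop() take the shared read lock, so they may run concurrently, while delete() takes the exclusive write lock, so it waits until no other call is in flight:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for Server, with a latch in place of the native server.
class LockedServer {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final CountDownLatch shutdown = new CountDownLatch(1);
    private long nativeHandle = 1L; // stand-in for the JNI handle

    public void join() throws InterruptedException {
        lock.readLock().lock();
        try {
            shutdown.await(); // blocks, but only holds the shared read lock
        } finally {
            lock.readLock().unlock();
        }
    }

    public void stop() {
        lock.readLock().lock(); // read lock: does NOT exclude a blocked join()
        try {
            shutdown.countDown();
        } finally {
            lock.readLock().unlock();
        }
    }

    public void delete() {
        lock.writeLock().lock(); // exclusive: waits until join()/stop() finish
        try {
            nativeHandle = 0L;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public long handle() { return nativeHandle; }
}

public class RwLockDemo {
    public static void main(String[] args) throws Exception {
        LockedServer server = new LockedServer();
        Thread joiner = new Thread(() -> {
            try { server.join(); } catch (InterruptedException ignored) {}
        });
        joiner.start();
        Thread.sleep(100); // let joiner block inside join()

        server.stop();   // succeeds even while joiner blocks in join()
        joiner.join();   // join() returns once stop() has counted down
        server.delete(); // write lock acquired only after all readers are gone

        System.out.println("handle after delete: " + server.handle());
    }
}
```

The key design point is that a blocked join() only holds the read lock, so stop() from another thread is never excluded, while delete() still cannot race with either of them.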
By the way, I checked calling stop while the server is blocked in join, and it looks like it doesn't work anyway (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc#L412).
I investigated a bit more and found out that close() just can't work correctly. The Server destructor calls stop() wrapped in TF_CHECK_OK, but stop() returns an unimplemented error, and as a result TF_CHECK_OK crashes the process with signal 6 (SIGABRT).
It looks like this should be fixed, but in another PR (it's not about the Java API at all).
I prepared PR that fixes this issue: #23190.
Hi @asimshankar, thank you for the review. I fixed all your comments, so please have a look.
Hi @asimshankar. Any updates?
Hi @asimshankar, @ymodak. I don't clearly understand the current state. Do you have any concerns about this PR, or is it ready to be merged?
@dmitrievanthony - apologies for the delay, I'll aim to take a look at the updated PR soon.
Thanks for the changes, some additional comments.
Also, I think with your C API change we should be able to use this in Python as well and reduce the custom wrapping (probably there for historical reasons) in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/server_lib.i
If you're up for that in a follow up PR, that would be awesome. Otherwise, I can take a stab at it once this PR is in.
Hi @asimshankar. Thanks for the comments. I updated the code, you can have a look. The only thing I didn't fix is the concurrency issues in … Speaking about the Python API, I'd be glad to propagate this change into it as well, but I think it makes sense to do that in another PR (there are some internal tasks of ours that depend on this PR, so it would be great to merge it as soon as possible). BTW, I'm also trying to fix the gRPC server …
Hi @asimshankar. I think I fixed all your comments, could you please have a look?
A few more leaks, otherwise looks good.
Thanks, I eliminated the leaks. Please have a look.
Thanks @dmitrievanthony: I'll run the tests once we have a fix for the licenses failure. Some background: when we package the JNI library in a tarball, we include the LICENSE files of all the libraries that the target depends on. You can probably run … The LICENSE files are explicitly listed in the … Since we're adding a dependency on another third-party library (grpc), we need to include the license for that as well, as listed in the logs of the failing test (which means we need the LICENSE file for grpc as well as its external dependencies, address_sorting and nanopb). So I think simply adding these to the dependencies of the … Let me know if you can do that (and run …). Thanks!
Thanks for the very detailed explanation, @asimshankar! I fixed the problem and checked …
Hi @asimshankar, it looks like something not related to my code failed; some allocation problems, as far as I can see.
@dmitrievanthony: Yeah, that seems unrelated. I'll take it from here :). Thanks for the contribution, will hopefully figure that out and get it merged in the next day or two. Will ping back if I discover any issues.
(Some more comments based on the failing tests)
Hi @asimshankar. Do I understand correctly that the PR will be merged soon, by you or someone else? Do I need to make any additional changes?
…-api PiperOrigin-RevId: 221113400
@dmitrievanthony: Yup, and done! Thanks for the contribution.
And use the C API for servers introduced in #23022 instead. PiperOrigin-RevId: 221207458
Hi @asimshankar, I just checked my libtensorflow v1.12 installation and see no additional C APIs introduced in this PR. @dmitrievanthony, have you successfully built the Java part of this PR (possibly in tf-io) without a source dependency on TF core? EDIT: sorry, I did not notice that r1.12 was cut before this PR merged. I will try master instead.
@byronyi : the merge commit isn't part of the 1.12 release branch since it landed after that branch was cut. So, these features will be part of 1.13 or when built from source. |
Standalone Client Mode makes it easy to train a model utilizing distributed resources. The only thing we need to do is start a TensorFlow server on every cluster node we'd like to participate in the training. Because of that, it's very important to be able to start a TensorFlow server in different ways and from any environment. I prepared this pull request as a result of a discussion on the Development List.
The following example demonstrates how to use TensorFlow server in Java:
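(The original example did not survive extraction; below is a minimal sketch of what such usage might look like with the Server class this PR adds, following the try-with-resources pattern discussed in the review. The serverDefBytes value and the loadServerDef helper are assumptions for illustration, and running this requires the TensorFlow native library on the classpath.)

```java
import org.tensorflow.Server;

public class ServerExample {
    public static void main(String[] args) throws Exception {
        // serverDefBytes: a serialized tensorflow.ServerDef proto describing
        // the cluster, the job name, and this task's index (assumed to be
        // built elsewhere, e.g. from the proto-generated classes).
        byte[] serverDefBytes = loadServerDef();

        try (Server server = new Server(serverDefBytes)) {
            server.start(); // start serving as a cluster member
            server.join();  // block until the server shuts down
        } // close() releases the native resources automatically
    }

    // Hypothetical helper: build or load a ServerDef for your cluster.
    private static byte[] loadServerDef() {
        throw new UnsupportedOperationException("build a ServerDef for your cluster");
    }
}
```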
Please, feel free to comment and suggest changes.