-
Notifications
You must be signed in to change notification settings - Fork 559
Not creating a coordinator service for the single processing job. #6023
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Not creating a coordinator service for the single processing job. #6023
Conversation
|
@vanbasten23 Is the crash only in the case of single-process workloads? I wonder if it's a more general issue that we should handle. Agree that we don't need a coordinator for distributed kv store in single-process though. |
269554a to
4c856b3
Compare
It fails in the persistent cache test https://gist.github.com/vanbasten23/2cb90b2f72a40ef965b965bc12bc5ded. I fixed your comment and let me rerun. |
@vanbasten23 I've merged the persistent cache change, if you want to test the single device test with this fix you can do a rebase on master and modify the single-device test to run on GPU here |
| global_process_rank, global_world_size, master_addr, port); | ||
| std::shared_ptr<xla::DistributedRuntimeClient> distributed_client = | ||
| coordinator_->GetClient(); | ||
| if (distributed_client != nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should be doing an XLA_CHECK(distributed_client != nullptr) here - is there a case where we want the ComputationClient creation to succeed without a DistributedRuntimeClient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this case already checked in GetClient?
5235c90 to
8b0f9f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
|
Do we need this in 2.2? |
We don't need this for 2.2. This pr is more of optimization. |
Currently we create a coordinator service even if we run single processing on CUDA. I observed that when the single process runs for a long time (>1h), there is a chance that the coordinator service would crash. But single processing really doesn't need the coordinator service.
Test plan: